In [5]:
import pandas as pd
import numpy as np
import re

1. Import libraries and load the data

In [6]:
df = pd.read_csv("../collecting-data/wiki-data.csv")
df.head()

,Unnamed: 0,FlightNo.,Dateandtime(UTC),VersionBooster,Launchsite,Payload,Payloadmass,Orbit,Customer,Launchoutcome,Boosterlanding
0,0,1,4 June 2010,F9 v1.0B0003[3],"CCSFS,SLC-40",Dragon Spacecraft Qualification Unit,N,LEO,SpaceX,Success,Failure[5]
1,1,2,8 December 2010,F9 v1.0B0004[3],"CCSFS,SLC-40",SpaceX COTS Demo Flight 1,U,LEO(ISS),NASA(COTS)various others,Success,Failure[10]
2,2,3,22 May 2012,F9 v1.0B0005[3],"CCSFS,SLC-40",SpaceX COTS Demo Flight 2,525 kg,LEO(ISS),NASA(COTS),Success,No attempt
3,3,4,8 October 2012,F9 v1.0B0006[3],"CCSFS,SLC-40",SpaceX CRS-1,"4,700 kg",LEO(ISS),NASA(CRS),Success,No attempt
4,4,5,1 March 2013,F9 v1.0B0007[3],"CCSFS,SLC-40",SpaceX CRS-2,"4,877 kg",LEO(ISS),NASA(CRS),Success,No attempt


2. Drop the unnamed and Payload columns, set FlightNo. as the index, and drop the Falcon Heavy flights

In [7]:
df.drop(['Unnamed: 0', 'Payload'], axis=1, inplace=True)
df.set_index('FlightNo.', inplace=True)
df.drop(['FH 1', 'FH 2', 'FH 3', 'FH 4', 'FH 5'], inplace = True)

3. Handle Date column

In [8]:
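# One entry has the launch time and a citation marker fused onto the date; fix it before parsing
df['Dateandtime(UTC)'] = df['Dateandtime(UTC)'].replace("12 February 202305:10[479]", "12 February 2023")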
df['Dateandtime(UTC)'] = pd.to_datetime(df['Dateandtime(UTC)'])
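
The hard-coded replacement above fixes the one malformed value in this dataset. A more general approach is sketched below, assuming every entry starts with a `day month year` date and that anything after the four-digit year (a fused time of day, a citation marker) can be discarded. The helper name `extract_date` is hypothetical and not part of the original notebook.

```python
import re
from datetime import datetime

# Matches e.g. "12 February 2023" at the start of the string; whatever follows
# the four-digit year (time of day, "[479]"-style citation) is ignored.
DATE_PATTERN = re.compile(r"^\s*(\d{1,2}) ([A-Za-z]+) (\d{4})")

def extract_date(text):
    """Return the leading 'day month year' part of text as a datetime."""
    match = DATE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"no date found in {text!r}")
    # %B expects an English month name under the default locale (assumption)
    return datetime.strptime(" ".join(match.groups()), "%d %B %Y")

print(extract_date("12 February 202305:10[479]"))
print(extract_date("4 June 2010"))
```

In the notebook this could be applied with `df['Dateandtime(UTC)'].map(extract_date)`, so new malformed rows would not need their own `replace` call.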

4. Handle VersionBooster

* Split the column into 2 columns: booster version & flight serial

* Drop the VersionBooster

* Change all F9 to Falcon 9

* For flights with no serial, set it to B0000

In [12]:
df[['BoosterVersion','Serial']] = df['VersionBooster'].str.split(expand=True)
df.drop(['VersionBooster'], axis=1, inplace=True)
df.head()

FlightNo.,Dateandtime(UTC),Launchsite,Payloadmass,Orbit,Customer,Launchoutcome,Boosterlanding,BoosterVersion,Serial
1,2010-06-04,"CCSFS,SLC-40",N,LEO,SpaceX,Success,Failure[5],F9,v1.0B0003[3]
2,2010-12-08,"CCSFS,SLC-40",U,LEO(ISS),NASA(COTS)various others,Success,Failure[10],F9,v1.0B0004[3]
3,2012-05-22,"CCSFS,SLC-40",525 kg,LEO(ISS),NASA(COTS),Success,No attempt,F9,v1.0B0005[3]
4,2012-10-08,"CCSFS,SLC-40","4,700 kg",LEO(ISS),NASA(CRS),Success,No attempt,F9,v1.0B0006[3]
5,2013-03-01,"CCSFS,SLC-40","4,877 kg",LEO(ISS),NASA(CRS),Success,No attempt,F9,v1.0B0007[3]


In [13]:
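# Expand the rocket abbreviation; rows whose serial field is just 'v1.1' get the placeholder B0000
df['BoosterVersion'] = df['BoosterVersion'].replace('F9', 'Falcon 9')
df['Serial'] = df['Serial'].replace('v1.1', 'B0000')
# Keep only the booster serial (e.g. B0003), dropping the version prefix and citation marker
df['Serial'] = df['Serial'].map(lambda x: re.search(r"B\d{4}", str(x)).group())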
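
The three steps listed for this column (split, rename F9, default serial) can also be sketched as a single helper. This is a minimal sketch, not the notebook's method: the name `parse_version_booster` is hypothetical, and it assumes the rocket abbreviation is always the first whitespace-separated token and that a serial always looks like `B` followed by four digits. Unlike calling `.group()` directly on the search result, it does not fail when no serial is present.

```python
import re

SERIAL_PATTERN = re.compile(r"B\d{4}")

def parse_version_booster(text):
    """Split a raw value such as 'F9 v1.0B0003[3]' into (rocket, serial)."""
    rocket = text.split()[0]
    if rocket == "F9":
        rocket = "Falcon 9"
    match = SERIAL_PATTERN.search(text)
    # Assumption: rows without a serial get the placeholder 'B0000',
    # mirroring the rule stated in this step.
    serial = match.group() if match else "B0000"
    return rocket, serial

print(parse_version_booster("F9 v1.0B0003[3]"))
print(parse_version_booster("F9 v1.1"))
```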

5. Handle Launchsite. The data has 3 main launch sites and 3 launch pads:

* **CCSFS**: Cape Canaveral Space Force Station

* **VSFB**: Vandenberg Space Force Base, previously Vandenberg Air Force Base (VAFB)

* **KSC**: John F. Kennedy Space Center

* **SLC-40**: Space Launch Complex 40, previously Launch Complex 40 (LC-40), launch pad for rockets, located at CCSFS

* **SLC-4E**: Space Launch Complex 4 East, one of the 2 launch pads of Space Launch Complex 4, located at VSFB

* **LC-39A**: Launch Complex 39A, one of the 3 launch pads of Launch Complex 39, located at NASA's KSC

In [19]:
df['Launchsite'].unique()

array(['CCSFS,SLC-40', 'VSFB,SLC-4E', 'KSC,LC-39A', 'CCSFS,SLC-40[121]',
       'CCSFS,SLC-40[486]'], dtype=object)

In [24]:
df['Launchsite'] = df['Launchsite'].map(lambda x: x.replace(',', ' '))
df['Launchsite'] = df['Launchsite'].map(lambda x: x.replace('[121]', '').replace('[486]', ''))
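
The cell above removes the two citation markers that happen to appear in this scrape. As a sketch under the assumption that every citation marker has the form `[digits]`, a regular expression removes them without listing each one. The helper name `clean_launch_site` is hypothetical.

```python
import re

# Any "[123]"-style citation marker left over from the Wikipedia table
CITATION = re.compile(r"\[\d+\]")

def clean_launch_site(text):
    """Remove citation markers and replace the comma between site and pad with a space."""
    return CITATION.sub("", text).replace(",", " ").strip()

for raw in ["CCSFS,SLC-40", "VSFB,SLC-4E", "KSC,LC-39A", "CCSFS,SLC-40[121]"]:
    print(clean_launch_site(raw))
```

It could be applied as `df['Launchsite'].map(clean_launch_site)`, which keeps working if a later scrape introduces a different citation number.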

6. Handle Payloadmass

* Convert to float and fill missing values with the mean payload mass

In [25]:
df['Payloadmass'].head()

FlightNo.
1           N
2           U
3      525 kg
4    4,700 kg
5    4,877 kg
Name: Payloadmass, dtype: object

In [26]:
# Treat the placeholder codes as zero so the number-extracting regex below still matches
df['Payloadmass'] = df['Payloadmass'].replace(['N', 'U', 'C'], '0 kg')
df['Payloadmass'] = df['Payloadmass'].map(lambda x: re.search(r'[\d.,]+', x).group().replace(',', ''))
df['Payloadmass'].head()

FlightNo.
1       0
2       0
3     525
4    4700
5    4877
Name: Payloadmass, dtype: object

In [27]:
# Mark the zero placeholders as missing, convert to float, then impute with the column mean
df['Payloadmass'] = df['Payloadmass'].replace('0', np.nan)
df['Payloadmass'] = df['Payloadmass'].astype('float64')
df['Payloadmass'] = df['Payloadmass'].fillna(df['Payloadmass'].mean())
df['Payloadmass'].head()

FlightNo.
1    8876.417526
2    8876.417526
3     525.000000
4    4700.000000
5    4877.000000
Name: Payloadmass, dtype: float64
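
The conversion and mean imputation can be sketched without the intermediate `'0 kg'` placeholder: parse each string to a float, return NaN when no number is present, and then fill the gaps with the mean of the known values. The helper name `parse_payload_mass` is hypothetical, and the list below holds only the five sample rows shown earlier, so its mean differs from the full-dataset value in the output above.

```python
import math
import re

# A run of digits that may contain thousands separators or a decimal point
NUMBER = re.compile(r"\d[\d,.]*")

def parse_payload_mass(text):
    """Return the mass in kg as a float, or NaN when text holds no number."""
    match = NUMBER.search(text)
    if match is None:
        return math.nan
    return float(match.group().replace(",", ""))

raw = ["N", "U", "525 kg", "4,700 kg", "4,877 kg"]
masses = [parse_payload_mass(value) for value in raw]

# Mean imputation: average the known masses, then substitute it for each NaN
known = [m for m in masses if not math.isnan(m)]
mean_mass = sum(known) / len(known)
filled = [mean_mass if math.isnan(m) else m for m in masses]
print(filled)
```

On the DataFrame, the same idea would be `df['Payloadmass'].map(parse_payload_mass)` followed by `fillna` with the column mean.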

7. Handle Orbit

* **LEO**: Low Earth Orbit, an orbit around Earth with a period of 128 minutes or less, at an altitude of 2,000 km or less. Most artificial objects in outer space are in LEO. The International Space Station (ISS) is the largest space station in LEO

* **MEO**: Medium Earth Orbit, an altitude above a low Earth orbit (LEO) and below a high Earth orbit (HEO) – between 2,000 and 35,786 km

* **GTO**: Geostationary Transfer Orbit, a highly elliptical orbit used to carry a satellite up to geosynchronous orbit (GEO). GEO is a high Earth orbit that allows satellites to match Earth's rotation. Located at 22,236 miles (35,786 kilometers) above Earth's equator, it is a valuable spot for weather monitoring, communications and surveillance.

* **HEO**: High Earth Orbit, a geocentric orbit with an altitude entirely above that of a geosynchronous orbit (35,786 kilometres)

* **Heliocentric**: an orbit around the barycenter of the Solar System, which is usually located within or very near the surface of the Sun

* **PO**: Polar Orbit, an orbit in which a satellite passes above or nearly above both poles of the body being orbited

* **SSO**: Sun-synchronous Orbit (also called a heliosynchronous orbit), a nearly polar orbit around a planet, in which the satellite passes over any given point of the planet's surface at the same local mean solar time

* **BLT**: Ballistic Lunar Transfer, a low-energy trajectory to the Moon that relies on ballistic capture, in which a spacecraft achieves orbit around a distant planet or moon with no fuel required to go into orbit

* **Sub-orbital**: a spaceflight in which the spacecraft reaches outer space, but its trajectory intersects the atmosphere or surface of the gravitating body from which it was launched, so that it will not complete one orbital revolution (it does not become an artificial satellite)

* **Sun-Earth-L1**: a trajectory that takes the satellite toward the Sun-Earth Lagrange point L1

In [28]:
df['Orbit'].value_counts()

LEO                               84
GTO                               40
LEO(ISS)                          36
SSO                               19
PolarLEO                           9
MEO                                6
Ballistic lunar transfer (BLT)     2
GTO[398]                           1
Heliocentric                       1
Sub-orbital[18]                    1
HEOforP/2 orbit                    1
GTO[356]                           1
GTO[338]                           1
LEO[172]                           1
Sun–Earth L1insertion              1
Polar orbitLEO                     1
RetrogradeLEO                      1
Name: Orbit, dtype: int64

In [29]:
# Collapse label variants and citation-tagged values into one short code per orbit type
df['Orbit'] = df['Orbit'].replace(['LEO(ISS)', 'LEO[172]', 'RetrogradeLEO'], 'LEO')
df['Orbit'] = df['Orbit'].replace(['PolarLEO','Polar orbitLEO'], 'PO')
df['Orbit'] = df['Orbit'].replace('Ballistic lunar transfer (BLT)', 'BLT')
df['Orbit'] = df['Orbit'].replace(['GTO[398]', 'GTO[356]', 'GTO[338]'], 'GTO')
df['Orbit'] = df['Orbit'].replace('Sub-orbital[18]', 'Sub-orbital')
df['Orbit'] = df['Orbit'].replace('HEOforP/2 orbit', 'HEO')
df['Orbit'] = df['Orbit'].replace('Sun–Earth L1insertion', 'Sun–Earth-L1')

In [30]:
df['Orbit'].value_counts()

LEO             122
GTO              43
SSO              19
PO               10
MEO               6
BLT               2
Sun–Earth-L1      1
HEO               1
Sub-orbital       1
Heliocentric      1
Name: Orbit, dtype: int64
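
The same normalization can be sketched as a lookup table plus a generic citation strip, so that values such as `GTO[398]` need no entry of their own. The mapping below is taken from the replacements in this step; the names `ORBIT_MAP` and `normalize_orbit` are hypothetical.

```python
import re

# Any "[123]"-style citation marker
CITATION = re.compile(r"\[\d+\]")

# Raw label (after citation removal) -> short orbit code
ORBIT_MAP = {
    "LEO(ISS)": "LEO",
    "RetrogradeLEO": "LEO",
    "PolarLEO": "PO",
    "Polar orbitLEO": "PO",
    "Ballistic lunar transfer (BLT)": "BLT",
    "HEOforP/2 orbit": "HEO",
    "Sun–Earth L1insertion": "Sun–Earth-L1",
}

def normalize_orbit(text):
    """Strip citation markers, then map known variants to their short code."""
    label = CITATION.sub("", text).strip()
    # Labels without an entry (e.g. 'SSO', 'MEO') are returned unchanged
    return ORBIT_MAP.get(label, label)

for raw in ["GTO[398]", "LEO(ISS)", "Sub-orbital[18]", "SSO"]:
    print(raw, "->", normalize_orbit(raw))
```

Keeping the variants in one dictionary makes it easier to review which raw labels are merged into which category.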

8. Handle customers

In [31]:
df['Customer'].unique()

array(['SpaceX', 'NASA(COTS)various others', 'NASA(COTS)', 'NASA(CRS)',
       'MDA', 'SES', 'Thaicom', 'Orbcomm', 'AsiaSat', 'USAFNASANOAA',
       'ABSEutelsat', 'Turkmenistan NationalSpace Agency',
       'NASA(LSP)NOAACNES', 'SKY Perfect JSAT Group',
       'Iridium Communications', 'EchoStar', 'NRO', 'Inmarsat',
       'Bulsatcom', 'Intelsat', 'NSPO', 'USAF', 'SES S.A.EchoStar',
       'KT Corporation', 'Northrop Grumman', 'HisdesatexactEarthSpaceX',
       'HispasatNovaWurks', 'NASA(LSP)', 'Thales-Alenia/BTRC',
       'Iridium CommunicationsGFZ•NASA', 'Telesat', 'Telkom Indonesia',
       'CONAE', "Es'hailSat", 'Spaceflight Industries',
       'PSNSpaceIL/IAIAir Force Research', 'NASA(CCD)',
       'Canadian Space Agency(CSA)', 'Spacecom',
       'Sky Perfect JSATKacific 1', 'NASA(CTS)', 'NASA(CCDev)',
       'SpaceXPlanet Labs', 'U.S. Space Force', 'Republic of Korea Army',
       'SpaceXSpaceflight Industries(BlackSky)', 'CONAEPlanetIQTyvak',
       'USSF', 'NASA(CCP)', 'NASA/N

In [32]:
to_NASA = ['NASA(COTS)various others', 
            'NASA(COTS)', 
            'NASA(CRS)', 
            'NASA(LSP)', 
            'NASA(CCD)', 
            'NASA(CTS)', 
            'NASA(CCDev)', 
            'NASA(CCP)']
# All NASA programme variants become plain 'NASA'
df['Customer'] = df['Customer'].replace(to_NASA, 'NASA')
# Customer names that were fused together during scraping are separated with '/'
df['Customer'] = df['Customer'].replace('USAFNASANOAA', 'USAF/NASA/NOAA')
df['Customer'] = df['Customer'].replace('ABSEutelsat', 'ABS/Eutelsat')
df['Customer'] = df['Customer'].replace('NASA(LSP)NOAACNES', 'NASA/NOAA/CNES')
df['Customer'] = df['Customer'].replace('SES S.A.EchoStar', 'SES/EchoStar')
df['Customer'] = df['Customer'].replace('HisdesatexactEarthSpaceX', 'Hisdesat/exactEarth/SpaceX')
df['Customer'] = df['Customer'].replace('HispasatNovaWurks', 'Hispasat/NovaWurks')
df['Customer'] = df['Customer'].replace('Iridium CommunicationsGFZ•NASA', 'Iridium Communications/GFZ/NASA')
df['Customer'] = df['Customer'].replace('PSNSpaceIL/IAIAir Force Research', 'PSN/SpaceIL/IAI/Air Force Research')
df['Customer'] = df['Customer'].replace('Sky Perfect JSATKacific 1', 'SKY Perfect JSAT Group/Kacific 1')
df['Customer'] = df['Customer'].replace('SpaceXPlanet Labs', 'SpaceX/Planet Labs')
df['Customer'] = df['Customer'].replace(['SpaceXSpaceflight Industries(BlackSky)', 
                                         'SpaceXSpaceflight, Inc.(BlackSky Global)', 
                                         'SpaceXSpaceflight Industries'], 'SpaceX/Spaceflight')
df['Customer'] = df['Customer'].replace('CONAEPlanetIQTyvak', 'CONAE/PlanetIQ/Tyvak')
df['Customer'] = df['Customer'].replace('SpaceXCapella SpaceandTyvak', 'SpaceX/Capella Space/Tyvak')
df['Customer'] = df['Customer'].replace('Jared Isaacman[225][226]', 'Jared Isaacman')
df['Customer'] = df['Customer'].replace('GlobalstarUnknown US Government Agency', 'Globalstar/US Government Agency')
df['Customer'] = df['Customer'].replace('SpaceXAST SpaceMobile', 'SpaceX/AST SpaceMobile')
df['Customer'] = df['Customer'].replace('ispaceMBRSCJAXANASA', 'ispace/MBRSC/JAXA/NASA')
df['Customer'] = df['Customer'].replace('SpaceXD-Orbit', 'SpaceX/D-Orbit')
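
Part of this cleanup follows a pattern: a value of the form `NASA(<programme>)` becomes `NASA`, and citation markers are dropped. A minimal sketch of that rule-based part is below. Fused names such as `USAFNASANOAA` have no reliable boundary to split on, so they still need a hand-written table; the `MANUAL` dictionary shows only three of the entries from the cell above, and all names in the sketch are hypothetical.

```python
import re

CITATION = re.compile(r"\[\d+\]")
# A customer that is exactly "NASA(" + programme letters + ")"
NASA_PROGRAM = re.compile(r"^NASA\([A-Za-z]+\)$")

# Hand-written fixes for fused names (subset, for illustration only)
MANUAL = {
    "NASA(COTS)various others": "NASA",
    "USAFNASANOAA": "USAF/NASA/NOAA",
    "ABSEutelsat": "ABS/Eutelsat",
}

def normalize_customer(text):
    """Apply manual fixes first, then the generic NASA-programme rule."""
    label = CITATION.sub("", text).strip()
    if label in MANUAL:
        return MANUAL[label]
    if NASA_PROGRAM.match(label):
        return "NASA"
    return label

for raw in ["NASA(CRS)", "NASA(CCDev)", "Jared Isaacman[225][226]", "USAFNASANOAA"]:
    print(raw, "->", normalize_customer(raw))
```

Values that match neither rule, such as `NASA(LSP)NOAACNES`, pass through unchanged and would still need their own manual entry.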

9. Launch Outcome

In [33]:
df['Launchoutcome'] = df['Launchoutcome'].replace('Failure(in flight)', 'Failure')

10. Booster landing

In [34]:
df['Boosterlanding'].value_counts()

Success             163
No attempt           24
Failure               9
Controlled[42]        2
Failure[5]            1
Failure[10]           1
Uncontrolled          1
Uncontrolled[60]      1
Controlled            1
Precluded             1
Controlled[229]       1
Controlled[244]       1
Name: Boosterlanding, dtype: int64

In [35]:
# Remove "[42]"-style citation markers; the raw string avoids invalid-escape warnings
df['Boosterlanding'] = df['Boosterlanding'].map(lambda x: re.sub(r'\[\d+\]', '', x))
df['Boosterlanding'].value_counts()

Success         163
No attempt       24
Failure          11
Controlled        5
Uncontrolled      2
Precluded         1
Name: Boosterlanding, dtype: int64

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 206 entries, 1 to 206
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Dateandtime(UTC)  206 non-null    datetime64[ns]
 1   Launchsite        206 non-null    object        
 2   Payloadmass       206 non-null    float64       
 3   Orbit             206 non-null    object        
 4   Customer          206 non-null    object        
 5   Launchoutcome     206 non-null    object        
 6   Boosterlanding    206 non-null    object        
 7   BoosterVersion    206 non-null    object        
 8   Serial            206 non-null    object        
dtypes: datetime64[ns](1), float64(1), object(7)
memory usage: 24.2+ KB


In [37]:
df.head()

FlightNo.,Dateandtime(UTC),Launchsite,Payloadmass,Orbit,Customer,Launchoutcome,Boosterlanding,BoosterVersion,Serial
1,2010-06-04,CCSFS SLC-40,8876.417526,LEO,SpaceX,Success,Failure,Falcon 9,B0003
2,2010-12-08,CCSFS SLC-40,8876.417526,LEO,NASA,Success,Failure,Falcon 9,B0004
3,2012-05-22,CCSFS SLC-40,525.0,LEO,NASA,Success,No attempt,Falcon 9,B0005
4,2012-10-08,CCSFS SLC-40,4700.0,LEO,NASA,Success,No attempt,Falcon 9,B0006
5,2013-03-01,CCSFS SLC-40,4877.0,LEO,NASA,Success,No attempt,Falcon 9,B0007


In [39]:
df.to_csv('../cleaning-data/web-cleaned-data.csv')