## Objectives: remove & fill in missing values, fix data types

In [223]:
import pandas as pd
import numpy as np

1. Overview: payload mass, orbit, landing type, outcome, landing pad and block all have missing values

In [224]:
df = pd.read_csv("api-data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       187 non-null    int64  
 1   date             187 non-null    object 
 2   flight no        187 non-null    int64  
 3   booster version  187 non-null    object 
 4   payload mass     162 non-null    float64
 5   orbit            186 non-null    object 
 6   launch site      187 non-null    object 
 7   landing type     158 non-null    object 
 8   outcome          156 non-null    object 
 9   flights          187 non-null    int64  
 10  gridfins         187 non-null    bool   
 11  reused           187 non-null    bool   
 12  legs             187 non-null    bool   
 13  landing pad      151 non-null    object 
 14  block            182 non-null    float64
 15  reused count     187 non-null    int64  
 16  serial           187 non-null    object 
 17  longitude       

2. Drop unnamed column

In [225]:
df.drop(['Unnamed: 0'], inplace=True, axis=1)
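
The Unnamed: 0 column typically appears when a CSV has been saved together with its index. As a sketch of an alternative, read_csv can treat the first column as the index through index_col=0, so nothing needs dropping afterwards. The CSV text below is made up for illustration and is not taken from api-data.csv.

```python
import io

import pandas as pd

# Made-up CSV text that mimics a file saved with its index column.
csv_text = ",date,flight no\n0,2022-09-05,183\n1,2022-09-11,184\n"

# Default read: the unnamed index column becomes a regular 'Unnamed: 0' column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```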

3. Convert the date column from text to a proper date type

In [226]:
df['date'] = pd.to_datetime(df["date"]).dt.date
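
One detail worth knowing: .dt.date yields plain Python date objects, so the column ends up with the generic object dtype, whereas pd.to_datetime on its own keeps a datetime64 dtype. A small sketch on made-up dates (the variable names are only illustrative) showing the difference:

```python
import pandas as pd

# Made-up date strings in the same format as the date column.
dates = pd.Series(["2022-09-05", "2022-10-05"])

# Parsed timestamps keep a datetime64 dtype.
as_timestamp = pd.to_datetime(dates)

# .dt.date converts each value to a datetime.date, stored as object dtype.
as_date = as_timestamp.dt.date

print(as_timestamp.dtype)
print(as_date.dtype)
```

Keeping the datetime64 version can be convenient if later steps need to resample or extract the year or month.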


4. Handle the payload mass
* There are 3 types of Falcon, each with a different average payload mass, so missing values are filled in with the mean of the matching booster version
* Most of the missing values belong to Falcon 9; Falcon 1 is missing 2 values

In [227]:
mass_bv = df.groupby("booster version").describe()["payload mass"]
mass_bv

booster version,count,mean,std,min,25%,50%,75%,max
Falcon 1,3.0,128.333333,95.437589,20.0,92.5,165.0,182.5,200.0
Falcon 9,156.0,8117.574038,5545.558195,330.0,2956.5,6630.5,13260.0,15600.0
Falcon Heavy,3.0,2650.0,2925.320495,600.0,975.0,1350.0,3675.0,6000.0


In [228]:
# Fill rows 1 and 2 (the two Falcon 1 launches without a payload mass) with the Falcon 1 mean.
# A single .loc assignment writes to df directly and avoids SettingWithCopyWarning.
df.loc[[1, 2], 'payload mass'] = mass_bv.loc['Falcon 1', 'mean']
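
Chained indexing such as df['payload mass'][1] = value assigns through an intermediate Series that pandas may treat as a copy, so the write is not guaranteed to reach df. A minimal sketch, on made-up values, of the single-step .loc form:

```python
import numpy as np
import pandas as pd

# Made-up payload masses with two gaps at row labels 1 and 2.
toy = pd.DataFrame({"payload mass": [20.0, np.nan, np.nan, 165.0]})

# mean() skips NaN, so this is the mean of 20.0 and 165.0.
fill_value = toy["payload mass"].mean()

# One .loc call selects rows and column together, so the assignment
# is applied to toy itself rather than to a temporary object.
toy.loc[[1, 2], "payload mass"] = fill_value

print(toy)
```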


In [229]:
# The remaining gaps are all Falcon 9 launches, so fill them with the Falcon 9 mean.
df['payload mass'] = df['payload mass'].fillna(mass_bv.loc['Falcon 9', 'mean'])
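
Filling each gap with the mean of its own booster version can also be done in one pass. The sketch below is an alternative to the row-by-row approach in this step, not the method used here, and it runs on made-up data: groupby(...).transform("mean") returns a per-row group mean that fillna can consume directly.

```python
import numpy as np
import pandas as pd

# Made-up launches; one missing payload mass per booster version.
toy = pd.DataFrame({
    "booster version": ["Falcon 1", "Falcon 1", "Falcon 1",
                        "Falcon 9", "Falcon 9", "Falcon 9"],
    "payload mass": [20.0, np.nan, 200.0, 500.0, 1500.0, np.nan],
})

# For every row, the mean payload mass of that row's booster version (NaN ignored).
group_mean = toy.groupby("booster version")["payload mass"].transform("mean")

# Only the missing entries are replaced; existing values are left untouched.
toy["payload mass"] = toy["payload mass"].fillna(group_mean)

print(toy)
```

This avoids hard-coding row labels, which matters if the input file is ever regenerated with a different row order.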

5. Handle orbit
* Orbit is missing only 1 value, for flight no 112. An article on Tesla North about this flight reports that the orbit was Low Earth Orbit (LEO)
* Link to the article: <a href="https://teslanorth.com/2022/02/03/spacex-falcon-9-rocket-makes-history-112-successful-flights-in-a-row/"> Space X Falcon 9 rocket makes history 112 successful</a>

In [230]:
df['orbit'] = df['orbit'].fillna('LEO')
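
To see which flight a missing orbit belongs to before filling it, a boolean mask from isna() can be combined with .loc. A minimal sketch on made-up rows (the flight numbers and orbits are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up rows; only the middle flight lacks an orbit.
toy = pd.DataFrame({
    "flight no": [111, 112, 113],
    "orbit": ["GTO", np.nan, "VLEO"],
})

# Flight numbers of the rows whose orbit is missing.
missing_flights = toy.loc[toy["orbit"].isna(), "flight no"].tolist()
print(missing_flights)

# Fill the gap once the correct value has been looked up.
toy["orbit"] = toy["orbit"].fillna("LEO")
```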

6. Handle outcome
* Most of the missing outcomes belong to the early launches, which had a high failure rate, so the missing values are set to False

In [231]:
df['outcome'] = df['outcome'].fillna(False)
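
One way to check whether missing outcomes really cluster in the early launches is to compare the share of missing values in the first and second half of the flights. The sketch below uses made-up data, and splitting at the median flight number is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up outcomes; NaN marks a launch without a recorded outcome.
toy = pd.DataFrame({
    "flight no": [1, 2, 3, 4, 5, 6],
    "outcome": [np.nan, np.nan, False, np.nan, True, True],
})

missing = toy["outcome"].isna()

# Label each flight as early or late relative to the median flight number.
period = np.where(toy["flight no"] <= toy["flight no"].median(), "early", "late")

# Mean of a boolean mask per group = fraction of missing outcomes in that group.
share_missing = missing.groupby(period).mean()
print(share_missing)
```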

7. Handle landing pad & landing type
* Falcon 1 has a lot of missing values, and Falcon Heavy and Falcon 1 have only a few launches (3 & 5), so both are removed from the dataframe
* Missing values in landing pad & landing type are replaced with a new category called "no data"

In [232]:
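# .copy() makes the filtered frame independent, so later assignments do not raise SettingWithCopyWarning.
df = df[df['booster version'] == 'Falcon 9'].copy()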

In [233]:
df['landing pad'] = df['landing pad'].fillna('no data')

In [234]:
df['landing type'] = df['landing type'].fillna('no data')
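
The filter and the two fills in this step can be combined into a compact form. A sketch on made-up rows, assuming the same column names; the pad id is a placeholder, not a real value from the data:

```python
import numpy as np
import pandas as pd

# Made-up launches covering all three booster versions.
toy = pd.DataFrame({
    "booster version": ["Falcon 1", "Falcon 9", "Falcon 9", "Falcon Heavy"],
    "landing pad": [np.nan, "pad-a", np.nan, "pad-a"],
    "landing type": [np.nan, "ASDS", np.nan, "RTLS"],
})

# Keep only Falcon 9 rows; .copy() gives an independent frame to modify.
falcon9 = toy[toy["booster version"] == "Falcon 9"].copy()

# Fill both columns at once with the "no data" category.
landing_cols = ["landing pad", "landing type"]
falcon9[landing_cols] = falcon9[landing_cols].fillna("no data")

print(falcon9)
```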

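8. Handle structure of customers & manufacterers
* Both columns store lists as text such as "['SpaceX']", so the brackets and quotes are stripped; an empty manufacterers list is filled with SpaceX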

In [235]:
df["customers"] = df["customers"].map(lambda x: x[1:-1].replace('\'', ""))
df['manufacterers'] = df['manufacterers'].replace('[]', "['SpaceX']")
df['manufacterers'] = df['manufacterers'].map(lambda x: x[1:-1].replace('\'', ""))
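
Slicing off the brackets works as long as no name contains a quote character. As a sketch of a more defensive alternative (not the approach used in this step), ast.literal_eval can parse the text back into a real list before joining it. The helper name list_string_to_text and its default argument are hypothetical, and the sample values are made up.

```python
import ast

import pandas as pd


def list_string_to_text(text, default="SpaceX"):
    """Turn a stringified list such as "['A', 'B']" into "A, B"."""
    items = ast.literal_eval(text)  # safely parses Python literals only
    if not items:
        # Empty list: fall back to the default name.
        items = [default]
    return ", ".join(items)


# Made-up values in the same shape as the customers / manufacterers columns.
toy = pd.Series(["['NASA (CCtCap)']", "[]", "['SES', 'SpaceX']"])
cleaned = toy.map(list_string_to_text)

print(cleaned.tolist())
```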

9. Check the final data & save it

In [236]:
df.tail()

,date,flight no,booster version,payload mass,orbit,launch site,landing type,outcome,flights,gridfins,...,legs,landing pad,block,reused count,serial,longitude,latitude,cost per launch,customers,manufacterers
182,2022-09-05,183,Falcon 9,13260.0,VLEO,Cape Canaveral Space Force Station Space Launc...,ASDS,True,7,True,...,True,5e9e3033383ecbb9e534e7cc,5.0,6,B1052,-80.577366,28.561857,50000000,SpaceX,SpaceX
183,2022-09-11,184,Falcon 9,13260.0,VLEO,Kennedy Space Center Historic Launch Complex 39A,ASDS,True,14,True,...,True,5e9e3033383ecb075134e7cd,5.0,13,B1058,-80.603956,28.608058,50000000,SpaceX,SpaceX
184,2022-09-17,185,Falcon 9,13260.0,VLEO,Cape Canaveral Space Force Station Space Launc...,ASDS,True,6,True,...,True,5e9e3033383ecbb9e534e7cc,5.0,5,B1067,-80.577366,28.561857,50000000,SpaceX,SpaceX
185,2022-09-24,186,Falcon 9,13260.0,VLEO,Cape Canaveral Space Force Station Space Launc...,ASDS,True,4,True,...,True,5e9e3033383ecbb9e534e7cc,5.0,0,B1072,-80.577366,28.561857,50000000,SpaceX,SpaceX
186,2022-10-05,187,Falcon 9,8117.574038,ISS,Kennedy Space Center Historic Launch Complex 39A,ASDS,True,1,True,...,True,5e9e3033383ecbb9e534e7cc,5.0,0,B1077,-80.603956,28.608058,50000000,NASA (CCtCap),SpaceX


In [237]:
df.isna().sum()

date               0
flight no          0
booster version    0
payload mass       0
orbit              0
launch site        0
landing type       0
outcome            0
flights            0
gridfins           0
reused             0
legs               0
landing pad        0
block              0
reused count       0
serial             0
longitude          0
latitude           0
cost per launch    0
customers          0
manufacterers      0
dtype: int64

In [238]:
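# index=False keeps the row index out of the file, so reloading it does not create another 'Unnamed: 0' column.
df.to_csv('../collecting-data/api-data-after-clean.csv', index=False)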