### Data Cleaning

> **Import** basic libraries

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

> **Load** the Extracted Dataset and **Analyse** the Null Data

In [17]:
data = pd.read_csv("Collected_data_falcon_9.csv")

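# Note: data.count() counts only non-null rows, so this is the number of nulls
# relative to non-null rows; data.isnull().mean() * 100 gives the share of all rows.
data.isnull().sum()/data.count() * 100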





FlightNumber       0.000000
Date               0.000000
BoosterVersion     0.000000
PayloadMass       15.068493
Orbit              0.598802
LaunchSite         0.000000
Outcome            0.000000
Flights            0.000000
GridFins           0.000000
Reused             0.000000
Legs               0.000000
LandingPad        18.309859
Block              0.000000
ReusedCount        0.000000
Serial             0.000000
Longitude          0.000000
Latitude           0.000000
dtype: float64
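
> The two ways of expressing "percentage of nulls" give different numbers, so it helps to see them side by side. The sketch below uses a made-up toy frame (not the Falcon 9 data) and hypothetical variable names to compare nulls relative to non-null rows with nulls as a share of all rows.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: 4 rows, 1 null in column "a".
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1, 2, 3, 4]})

# Nulls relative to non-null rows (what sum() / count() computes): 1 / 3.
ratio_to_non_null = toy.isnull().sum() / toy.count() * 100

# Nulls as a share of all rows: 1 / 4.
share_of_all_rows = toy.isnull().mean() * 100

print(ratio_to_non_null.round(2))
print(share_of_all_rows.round(2))
```

> For small null counts the two figures are close; they drift apart as the share of missing values grows.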

> **Fill** Missing data using Suitable Operation:

| Feature     | Null Values (%) | Data Type       | Type        | Remarks                                          |
|-------------|-----------------|-----------------|-------------|--------------------------------------------------|
| PayloadMass | 15%             | float64         | Continuous  | Fill with the mean 'PayloadMass' of the same Orbit |
| Orbit       | 0.6%            | object (string) | Categorical | Too few rows to affect the data; drop the rows   |
| LandingPad  | 18%             | object (pad ID) | Categorical | No data available; drop the rows                 |
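
> The per-orbit mean fill described in the table can also be sketched without an explicit loop, using `groupby(...).transform("mean")`. This is a minimal sketch on made-up toy values, not the notebook's own method; falling back to 0 for an orbit with no known payload is an assumption chosen to mirror the remark above.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) standing in for the real dataset.
toy = pd.DataFrame({
    "Orbit": ["LEO", "LEO", "LEO", "GTO", "GTO", "SO"],
    "PayloadMass": [1000.0, 3000.0, np.nan, 5000.0, np.nan, np.nan],
})

# Mean payload of each row's orbit, aligned to the original index.
orbit_mean = toy.groupby("Orbit")["PayloadMass"].transform("mean")

# Fill gaps with the rounded orbit mean. An orbit with no known payload
# (here "SO") has a NaN mean, so fall back to 0 (an assumption).
toy["PayloadMass"] = toy["PayloadMass"].fillna(orbit_mean.round()).fillna(0)

print(toy)
```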


In [18]:
orbits_values = list(data['Orbit'].value_counts().keys())

# Mean payload per orbit: key -> orbit name (str), value -> rounded mean payload
orbit_mean_payload = {}
for orb in orbits_values:
    temp_data = data[data['Orbit'] == orb]
    # Mean payload over the launches to this orbit (NaN if every payload is missing)
    mean_payload = float(temp_data['PayloadMass'].mean())
    if mean_payload >= 0:
        orbit_mean_payload[orb] = round(mean_payload)
    else:
        # NaN >= 0 is False, so an orbit with no known payload falls back to 0
        orbit_mean_payload[orb] = 0

# Fill the missing payloads of each orbit with that orbit's mean
for key, value in orbit_mean_payload.items():
    print(f"Filling Value for {key}  by --> {value} \n")
    data.loc[(data['Orbit'] == key) & (data['PayloadMass'].isna()), 'PayloadMass'] = value

data.isnull().sum()




Filling Value for VLEO  by --> 14386 

Filling Value for ISS  by --> 3654 

Filling Value for GTO  by --> 5043 

Filling Value for LEO  by --> 3308 

Filling Value for PO  by --> 8601 

Filling Value for SSO  by --> 2106 

Filling Value for MEO  by --> 3828 

Filling Value for GEO  by --> 3938 

Filling Value for TLI  by --> 674 

Filling Value for ES-L1  by --> 570 

Filling Value for HEO  by --> 350 

Filling Value for SO  by --> 0 



FlightNumber       0
Date               0
BoosterVersion     0
PayloadMass        1
Orbit              1
LaunchSite         0
Outcome            0
Flights            0
GridFins           0
Reused             0
Legs               0
LandingPad        26
Block              0
ReusedCount        0
Serial             0
Longitude          0
Latitude           0
dtype: int64

In [19]:
data.shape[0]

168

In [20]:
print(f"Before Dropping Null Rows: No.of Rows-> {data.shape[0]}")
data.dropna(inplace=True, axis=0)
print(f"After Dropping Null Rows: No.of Rows-> {data.shape[0]}")


Before Dropping Null Rows: No.of Rows-> 168
After Dropping Null Rows: No.of Rows-> 141


> **Analyse** Unique values and find Insights for Data Manipulations

In [21]:
for col in data.columns:
    print("\n\n--------------")
    print(data[col].value_counts())



--------------
FlightNumber
12     1
14     1
16     1
17     1
18     1
      ..
164    1
165    1
166    1
167    1
168    1
Name: count, Length: 141, dtype: int64


--------------
Date
2015-01-10    1
2015-04-14    1
2015-06-28    1
2015-12-22    1
2016-01-17    1
             ..
2022-08-28    1
2022-08-31    1
2022-09-17    1
2022-09-24    1
2022-10-05    1
Name: count, Length: 141, dtype: int64


--------------
BoosterVersion
Falcon 9    141
Name: count, dtype: int64


--------------
PayloadMass
13260.0    30
15600.0    27
3654.0      8
3308.0      6
9600.0      5
           ..
330.0       1
2240.0      1
2200.0      1
700.0       1
678.0       1
Name: count, Length: 62, dtype: int64


--------------
Orbit
VLEO    54
ISS     27
GTO     19
LEO     11
SSO     11
PO      10
MEO      4
GEO      2
TLI      2
HEO      1
Name: count, dtype: int64


--------------
LaunchSite
CCSFS SLC 40    74
KSC LC 39A      44
VAFB SLC 4E     23
Name: count, dtype: int64


--------------
Outcome
True 

> Based on the output above, **convert the data** into a model-readable form. The data requires these manipulation techniques:

| Feature        | Data Manipulation Technique                       |
|----------------|---------------------------------------------------|
| Flight number  | Drop                                              |
| Date           | Drop                                              |
| Booster        | Drop                                              |
| Payload mass   | Convert to float                                  |
| Orbit          | One-hot encode                                    |
| Launch site    | One-hot encode                                    |
| Outcome        | Encode as 0 (failure) or 1 (success), as float    |
| GridFins       | Only one value (True); drop                       |
| Reused         | Convert to float                                  |
| Legs           | Convert to float                                  |
| Landing pad    | One-hot encode                                    |
| Block          | One-hot encode                                    |
| Reused_count   | Convert to float                                  |
| Serial         | One-hot encode                                    |
| Longitude      | Drop                                              |
| Latitude       | Drop                                              |
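
> Single-valued columns such as the ones flagged in the table can be found programmatically instead of by reading the `value_counts()` output. The following is a minimal sketch on a made-up toy frame; `constant_cols` is a hypothetical name, and `nunique()` ignores nulls by default.

```python
import pandas as pd

# Toy frame with made-up rows: two constant columns and one varying column.
toy = pd.DataFrame({
    "BoosterVersion": ["Falcon 9", "Falcon 9", "Falcon 9"],
    "GridFins": [True, True, True],
    "Orbit": ["ISS", "GTO", "ISS"],
})

# Columns whose non-null values are all identical carry no signal for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

print(constant_cols)
```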


> **Create** <code>Class</code> and assign it 0 or 1 based on <code>Outcome</code>

In [22]:
bad_outcome = ['False ASDS', 'False RTLS', 'None ASDS']
data['Class'] = data['Outcome'].apply(lambda x: 0 if x in bad_outcome else 1)
print(data['Class'].value_counts())
print(data['Outcome'].value_counts())

Class
1    131
0     10
Name: count, dtype: int64
Outcome
True ASDS     109
True RTLS      22
False ASDS      7
None ASDS       2
False RTLS      1
Name: count, dtype: int64
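
> The same labelling can be sketched in vectorized form with `isin`, which avoids a Python-level lambda, and `value_counts(normalize=True)` shows how unbalanced the classes are. The toy outcomes below are made up for illustration; only the `bad_outcome` list is taken from the cell above.

```python
import pandas as pd

bad_outcome = ["False ASDS", "False RTLS", "None ASDS"]

# Toy outcomes (made-up rows) standing in for the real column.
toy = pd.DataFrame(
    {"Outcome": ["True ASDS", "False ASDS", "True RTLS", "None ASDS"]}
)

# 1 for a successful landing, 0 for any outcome listed in bad_outcome.
toy["Class"] = (~toy["Outcome"].isin(bad_outcome)).astype(int)

# Share of each class, useful for spotting imbalance before modelling.
class_share = toy["Class"].value_counts(normalize=True)

print(toy)
print(class_share)
```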


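> **Drop** Unnecessary Columns (those with only one value, the other columns marked Drop in the table above, and <code>Outcome</code>, which <code>Class</code> now replaces)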

In [23]:
data = data.drop(columns=['FlightNumber' , 'Date', 'BoosterVersion', 'GridFins', 'Longitude', 'Latitude', 'Outcome'])
data.head(2)


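|    | PayloadMass | Orbit | LaunchSite   | Flights | Reused | Legs | LandingPad               | Block | ReusedCount | Serial | Class |
|----|-------------|-------|--------------|---------|--------|------|--------------------------|-------|-------------|--------|-------|
| 11 | 2395.0      | ISS   | CCSFS SLC 40 | 1       | False  | True | 5e9e3032383ecb761634e7cb | 1.0   | 0           | B1012  | 0     |
| 13 | 1898.0      | ISS   | CCSFS SLC 40 | 1       | False  | True | 5e9e3032383ecb761634e7cb | 1.0   | 0           | B1015  | 0     |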


> Do **One Hot Encoding** to respective Columns: <code>['Orbit', 'LaunchSite' , 'LandingPad' , 'Block', 'Serial']</code>

In [24]:
data = pd.get_dummies(data, columns = ['Orbit', 'LaunchSite' , 'LandingPad' , 'Block', 'Serial'])
data.head(2)

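|    | PayloadMass | Flights | Reused | Legs | ReusedCount | Class | Orbit_GEO | Orbit_GTO | Orbit_HEO | Orbit_ISS | ... | Serial_B1060 | Serial_B1061 | Serial_B1062 | Serial_B1063 | Serial_B1067 | Serial_B1069 | Serial_B1071 | Serial_B1072 | Serial_B1073 | Serial_B1077 |
|----|-------------|---------|--------|------|-------------|-------|-----------|-----------|-----------|-----------|-----|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| 11 | 2395.0 | 1 | False | True | 0 | 0 | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 13 | 1898.0 | 1 | False | True | 0 | 0 | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |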
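
> The indicator columns above come out as booleans. As a minimal sketch on made-up toy rows, `pd.get_dummies` also accepts a `dtype` argument, so the indicators can be created as floats directly; the remaining boolean feature columns still need an explicit cast. The name `encoded` is hypothetical.

```python
import pandas as pd

# Toy rows (made-up values) with one numeric, one boolean and one categorical column.
toy = pd.DataFrame({
    "PayloadMass": [2395.0, 13260.0],
    "Reused": [False, True],
    "Orbit": ["ISS", "VLEO"],
})

# dtype=float makes the indicator columns 0.0/1.0 instead of True/False.
encoded = pd.get_dummies(toy, columns=["Orbit"], dtype=float)

# Boolean feature columns such as "Reused" are untouched by get_dummies,
# so cast the whole frame and keep the result.
encoded = encoded.astype(float)

print(encoded.dtypes)
```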


In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 141 entries, 11 to 167
Data columns (total 74 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   PayloadMass                          141 non-null    float64
 1   Flights                              141 non-null    int64  
 2   Reused                               141 non-null    bool   
 3   Legs                                 141 non-null    bool   
 4   ReusedCount                          141 non-null    int64  
 5   Class                                141 non-null    int64  
 6   Orbit_GEO                            141 non-null    bool   
 7   Orbit_GTO                            141 non-null    bool   
 8   Orbit_HEO                            141 non-null    bool   
 9   Orbit_ISS                            141 non-null    bool   
 10  Orbit_LEO                            141 non-null    bool   
 11  Orbit_MEO                           



> **Convert** Data to Float Values

In [27]:
# astype returns a new DataFrame, so assign it back before saving
data = data.astype(float)
data

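|     | PayloadMass | Flights | Reused | Legs | ReusedCount | Class | Orbit_GEO | Orbit_GTO | Orbit_HEO | Orbit_ISS | ... | Serial_B1060 | Serial_B1061 | Serial_B1062 | Serial_B1063 | Serial_B1067 | Serial_B1069 | Serial_B1071 | Serial_B1072 | Serial_B1073 | Serial_B1077 |
|-----|-------------|---------|--------|------|-------------|-------|-----------|-----------|-----------|-----------|-----|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| 11  | 2395.0  | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 13  | 1898.0  | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 15  | 2477.0  | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 16  | 2034.0  | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 17  | 553.0   | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ...     | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 163 | 13260.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 164 | 13260.0 | 7.0 | 1.0 | 1.0 | 6.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 165 | 13260.0 | 6.0 | 1.0 | 1.0 | 5.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 166 | 13260.0 | 4.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |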


In [28]:
data.to_csv("Wrangled_data.csv", index=False)