# Data pre-processing in SpaceX

## Exploratory Data Analysis

### Load the dataset and import necessary libraries:

We import the pandas library and use it to read the CSV file into a data frame.

In [1]:
import pandas as pd
df = pd.read_csv("dataset_falcon9.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   FlightNumber    90 non-null     int64  
 1   Date            90 non-null     object 
 2   BoosterVersion  90 non-null     object 
 3   PayloadMass     90 non-null     float64
 4   Orbit           90 non-null     object 
 5   LaunchSite      90 non-null     object 
 6   Outcome         90 non-null     object 
 7   Flights         90 non-null     int64  
 8   GridFins        90 non-null     bool   
 9   Reused          90 non-null     bool   
 10  Legs            90 non-null     bool   
 11  LandingPad      64 non-null     object 
 12  Block           90 non-null     float64
 13  ReusedCount     90 non-null     int64  
 14  Serial          90 non-null     object 
 15  Longitude       90 non-null     float64
 16  Latitude        90 non-null     float64
 17  Class           90 non-null     int64
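
The `info()` listing mixes numeric, boolean and object columns, and only the numeric ones can be used as they are. As a minimal sketch, the snippet below summarises dtypes on a small hypothetical frame (the values in `sample` are invented for illustration, not taken from `dataset_falcon9.csv`) and lists the columns that still need converting.

```python
import pandas as pd

# Hypothetical miniature frame with the same kinds of columns as the dataset
sample = pd.DataFrame({
    "PayloadMass": [500.0, 3000.0, 5500.0],
    "Orbit": ["LEO", "GTO", "ISS"],
    "GridFins": [False, True, True],
    "Class": [0, 1, 1],
})

# Count how many columns share each dtype
dtype_counts = sample.dtypes.value_counts()
print(dtype_counts)

# Columns that are not numeric yet (text and boolean columns)
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Running the same two calls on `df` should point at the columns that are converted later in this notebook.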

We can also inspect a single column, for instance:

In [2]:
df["FlightNumber"]
df["FlightNumber"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 90 entries, 0 to 89
Series name: FlightNumber
Non-Null Count  Dtype
--------------  -----
90 non-null     int64
dtypes: int64(1)
memory usage: 848.0 bytes


In [3]:
df["BoosterVersion"]
df["BoosterVersion"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 90 entries, 0 to 89
Series name: BoosterVersion
Non-Null Count  Dtype 
--------------  ----- 
90 non-null     object
dtypes: object(1)
memory usage: 848.0+ bytes


We can get the count, mean, standard deviation, minimum, maximum and quartiles of each numeric column with the describe method:

In [4]:
df.describe()

| | FlightNumber | PayloadMass | Flights | Block | ReusedCount | Longitude | Latitude | Class |
|---|---|---|---|---|---|---|---|---|
| count | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 | 90.0 |
| mean | 45.5 | 6104.959412 | 1.788889 | 3.5 | 1.655556 | -86.366477 | 29.449963 | 0.666667 |
| std | 26.124701 | 4694.67172 | 1.213172 | 1.595288 | 1.710254 | 14.149518 | 2.141306 | 0.474045 |
| min | 1.0 | 350.0 | 1.0 | 1.0 | 0.0 | -120.610829 | 28.561857 | 0.0 |
| 25% | 23.25 | 2510.75 | 1.0 | 2.0 | 0.0 | -80.603956 | 28.561857 | 0.0 |
| 50% | 45.5 | 4701.5 | 1.0 | 4.0 | 1.0 | -80.577366 | 28.561857 | 1.0 |
| 75% | 67.75 | 8912.75 | 2.0 | 5.0 | 3.0 | -80.577366 | 28.608058 | 1.0 |
| max | 90.0 | 15600.0 | 6.0 | 5.0 | 5.0 | -80.577366 | 34.632093 | 1.0 |
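
By default `describe()` covers only the numeric columns, so text columns such as `Orbit` are absent from the table. A minimal sketch of the same call on text columns follows; the frame is hypothetical and the site names `Site A` and `Site B` are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical sample; the launch site names are placeholders
sample = pd.DataFrame({
    "Orbit": ["LEO", "LEO", "GTO", "LEO"],
    "LaunchSite": ["Site A", "Site B", "Site A", "Site A"],
})

# On text columns describe() reports count, unique, top (most frequent) and freq
summary = sample[["Orbit", "LaunchSite"]].describe()
print(summary)
```

For text columns the summary answers different questions: how many distinct values exist and which one dominates.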


## Data Preparation
### Handle Missing Values

We check for missing values and decide how to handle them in the data frame:

In [5]:
#detecting missing values
df.columns[df.isna().any()].tolist()

['LandingPad']

In [6]:
#total missing values
df["LandingPad"].isnull().sum()

26
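
A raw count is easier to judge as a share of the rows: 26 missing values out of 90 is roughly 29% of `LandingPad`. The sketch below builds such a report on a hypothetical frame (the pad identifiers `pad_a` and `pad_b` are invented placeholders).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing landing pads out of four rows
sample = pd.DataFrame({
    "LandingPad": ["pad_a", np.nan, "pad_b", np.nan],
    "Orbit": ["LEO", "LEO", "GTO", "ISS"],
})

# isna().sum() counts missing cells; isna().mean() gives the fraction per column
report = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share": sample.isna().mean(),
})

# Keep only the columns that actually contain missing values
print(report[report["missing"] > 0])
```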

In [7]:
# Count the values in LandingPad (the column with missing values); the next cell compares it with the related Orbit column
df['LandingPad'].value_counts()

5e9e3032383ecb6bb234e7ca    35
5e9e3032383ecb267a34e7c7    13
5e9e3033383ecbb9e534e7cc    12
5e9e3032383ecb761634e7cb     2
5e9e3032383ecb554034e7c9     2
Name: LandingPad, dtype: int64
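
The counts above add up to 64 because `value_counts()` skips missing entries by default. A small sketch on a hypothetical series (placeholder pad names) shows how `dropna=False` lists the missing entries as a category of their own.

```python
import numpy as np
import pandas as pd

# Hypothetical series with placeholder pad identifiers and two missing entries
pads = pd.Series(["pad_a", np.nan, "pad_a", np.nan, "pad_b"], name="LandingPad")

# dropna=False keeps NaN in the tally instead of silently skipping it
counts = pads.value_counts(dropna=False)
print(counts)
```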

In [8]:
df[['LandingPad', 'Orbit']]

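| | LandingPad | Orbit |
|---|---|---|
| 0 | NaN | LEO |
| 1 | NaN | LEO |
| 2 | NaN | ISS |
| 3 | NaN | PO |
| 4 | NaN | GTO |
| ... | ... | ... |
| 85 | 5e9e3032383ecb6bb234e7ca | VLEO |
| 86 | 5e9e3032383ecb6bb234e7ca | VLEO |
| 87 | 5e9e3032383ecb6bb234e7ca | VLEO |
| 88 | 5e9e3033383ecbb9e534e7cc | VLEO |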
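
Scanning the two columns side by side is hard with 90 rows. One way to make the comparison explicit, sketched here on a hypothetical frame with placeholder pad names, is to count the missing landing pads per orbit.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; orbit labels follow the dataset, pad names are placeholders
sample = pd.DataFrame({
    "LandingPad": [np.nan, np.nan, "pad_a", "pad_a", np.nan, "pad_b"],
    "Orbit": ["LEO", "LEO", "ISS", "VLEO", "GTO", "VLEO"],
})

# For each orbit: how many launches lack a landing pad, and how many launches exist
missing_by_orbit = (
    sample["LandingPad"].isna()
    .groupby(sample["Orbit"])
    .agg(["sum", "count"])
    .rename(columns={"sum": "missing", "count": "launches"})
)
print(missing_by_orbit)
```

If the missing values cluster in a few orbits, that suggests they mean something (for example, no landing pad was used) and are not random gaps.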


### Remove additional features

After looking more closely at the features, we remove the ones that do not contribute to the analysis:

In [9]:
df = df.drop(['FlightNumber','Date','BoosterVersion','Longitude','Latitude'],axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PayloadMass  90 non-null     float64
 1   Orbit        90 non-null     object 
 2   LaunchSite   90 non-null     object 
 3   Outcome      90 non-null     object 
 4   Flights      90 non-null     int64  
 5   GridFins     90 non-null     bool   
 6   Reused       90 non-null     bool   
 7   Legs         90 non-null     bool   
 8   LandingPad   64 non-null     object 
 9   Block        90 non-null     float64
 10  ReusedCount  90 non-null     int64  
 11  Serial       90 non-null     object 
 12  Class        90 non-null     int64  
dtypes: bool(3), float64(2), int64(3), object(5)
memory usage: 7.4+ KB
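
Re-running a `drop` cell in a notebook raises a `KeyError` once the columns are gone. A minimal sketch of a re-runnable variant follows, using the `columns=` keyword with `errors="ignore"` on a hypothetical frame.

```python
import pandas as pd

# Hypothetical sample holding only some of the columns to be removed
sample = pd.DataFrame({
    "FlightNumber": [1, 2],
    "Date": ["2020-01-01", "2020-02-01"],
    "PayloadMass": [500.0, 3000.0],
    "Class": [0, 1],
})

# BoosterVersion is absent on purpose to show that errors="ignore" tolerates it
to_drop = ["FlightNumber", "Date", "BoosterVersion"]

reduced = sample.drop(columns=to_drop, errors="ignore")
# Dropping a second time is a no-op, not an error
reduced_again = reduced.drop(columns=to_drop, errors="ignore")
print(reduced_again.columns.tolist())
```

The trade-off is that `errors="ignore"` also hides typos in column names, so it is best kept for cells that are expected to run more than once.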


### Feature Engineering: Convert object and boolean columns to number

All features must be numeric for the analysis, so we convert the object columns to numeric indicator (dummy) columns with the get_dummies method:

In [10]:
df_dummy = pd.get_dummies(df[['Orbit', 'LaunchSite', 'Outcome', 'LandingPad', 'Serial']])
df_dummy

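| | Orbit_ES-L1 | Orbit_GEO | Orbit_GTO | Orbit_HEO | Orbit_ISS | Orbit_LEO | Orbit_MEO | Orbit_PO | Orbit_SO | Orbit_SSO | ... | Serial_B1048 | Serial_B1049 | Serial_B1050 | Serial_B1051 | Serial_B1054 | Serial_B1056 | Serial_B1058 | Serial_B1059 | Serial_B1060 | Serial_B1062 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 87 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 88 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |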
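
Two details of `get_dummies` matter here. First, with its default settings a missing value gets no indicator column, so a row with a missing `LandingPad` ends up with zeros in every `LandingPad_*` column; this appears to be how the 26 missing values are absorbed, though the notebook does not say so. Second, the 0/1 output shown above comes from an older pandas; from pandas 2.0 the default dummy dtype is `bool`, and passing `dtype=int` keeps the output numeric. The sketch below shows both on a hypothetical frame with placeholder pad names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the first row has no landing pad
sample = pd.DataFrame({
    "Orbit": ["LEO", "GTO", "LEO"],
    "LandingPad": [np.nan, "pad_a", "pad_b"],
})

# Default behaviour: NaN gets no column, so row 0 is all zeros for LandingPad_*
dummies = pd.get_dummies(sample, columns=["Orbit", "LandingPad"], dtype=int)
print(dummies)

# dummy_na=True adds an explicit indicator column for missing values
with_na = pd.get_dummies(
    sample, columns=["Orbit", "LandingPad"], dummy_na=True, dtype=int
)
print(with_na.filter(like="LandingPad_"))
```

Whether an explicit missing-value indicator helps depends on the model; the default is the simpler choice and matches the column count reported in this notebook.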


We remove the original object columns from the dataset and concatenate the new numeric dummy columns in their place:

In [11]:
df = df.drop(['Orbit', 'LaunchSite', 'Outcome', 'LandingPad', 'Serial'], axis=1)
df = pd.concat([df, df_dummy], axis = 1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 88 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   PayloadMass                          90 non-null     float64
 1   Flights                              90 non-null     int64  
 2   GridFins                             90 non-null     bool   
 3   Reused                               90 non-null     bool   
 4   Legs                                 90 non-null     bool   
 5   Block                                90 non-null     float64
 6   ReusedCount                          90 non-null     int64  
 7   Class                                90 non-null     int64  
 8   Orbit_ES-L1                          90 non-null     uint8  
 9   Orbit_GEO                            90 non-null     uint8  
 10  Orbit_GTO                            90 non-null     uint8  
 11  Orbit_HEO                         

The boolean columns must be numeric as well, so we convert each of them to integers with the astype method:

In [12]:
df["GridFins"] = df["GridFins"].astype(int)
df["GridFins"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 90 entries, 0 to 89
Series name: GridFins
Non-Null Count  Dtype
--------------  -----
90 non-null     int32
dtypes: int32(1)
memory usage: 488.0 bytes


In [13]:
df["Reused"] = df["Reused"].astype(int)
df["Reused"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 90 entries, 0 to 89
Series name: Reused
Non-Null Count  Dtype
--------------  -----
90 non-null     int32
dtypes: int32(1)
memory usage: 488.0 bytes


In [14]:
df["Legs"] = df["Legs"].astype(int)
df["Legs"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 90 entries, 0 to 89
Series name: Legs
Non-Null Count  Dtype
--------------  -----
90 non-null     int32
dtypes: int32(1)
memory usage: 488.0 bytes


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 88 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   PayloadMass                          90 non-null     float64
 1   Flights                              90 non-null     int64  
 2   GridFins                             90 non-null     int32  
 3   Reused                               90 non-null     int32  
 4   Legs                                 90 non-null     int32  
 5   Block                                90 non-null     float64
 6   ReusedCount                          90 non-null     int64  
 7   Class                                90 non-null     int64  
 8   Orbit_ES-L1                          90 non-null     uint8  
 9   Orbit_GEO                            90 non-null     uint8  
 10  Orbit_GTO                            90 non-null     uint8  
 11  Orbit_HEO                         
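
The three conversions above can also be done in one step. The sketch below, on a hypothetical frame, selects every boolean column and casts it in one call. It asks for `"int64"` explicitly because `astype(int)` can yield `int32` on some platforms (as in the output above), which is harmless but inconsistent with the other integer columns.

```python
import pandas as pd

# Hypothetical sample with the three boolean columns and one numeric column
sample = pd.DataFrame({
    "GridFins": [False, True],
    "Reused": [False, True],
    "Legs": [True, True],
    "PayloadMass": [500.0, 3000.0],
})

# Find all boolean columns so none has to be listed by hand
bool_cols = sample.select_dtypes(include="bool").columns

# astype accepts a {column: dtype} mapping and leaves other columns untouched
sample = sample.astype({col: "int64" for col in bool_cols})
print(sample.dtypes)
```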

### Save the preprocessed dataset 

In [16]:
# Save the dataset in the current working directory:
df.to_csv("New_Preprocessed_SpaceX.csv")

In [17]:
# Save the dataset to a different location on disk:
df.to_csv("G:\\Data science bootcamp\\New_Preprocessed_SpaceX.csv")
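
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. A minimal sketch of saving without the index and checking the round trip follows; it uses a hypothetical frame and a temporary directory in place of the paths above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample standing in for the preprocessed data frame
sample = pd.DataFrame({"PayloadMass": [500.0, 3000.0], "Class": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "New_Preprocessed_SpaceX.csv"
    # index=False leaves the RangeIndex out of the file
    sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded frame has the same columns, with no extra index column
print(reloaded.columns.tolist())
```

Because the index here is a plain `RangeIndex`, nothing is lost by omitting it.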

#### Done! The dataset is ready for model building, training and testing.