# Space missions feature engineering and predictions

Who doesn't love space?
This dataset was scraped from https://nextspaceflight.com/launches/past/?page=1 and includes all space missions since the beginning of the Space Race (1957).

![Image](https://i1.wp.com/matmatch.com/blog/wp-content/uploads/2019/03/AdobeStock_80273384-compressor.jpeg?resize=2000%2C1125&ssl=1)

### Importing Libraries

In [2]:
import pandas as pd

### Importing the dataset

In [3]:
df=pd.read_csv('../input/space-missions-cleaned/Space_Missions_Cleaned.csv')

## Feature Engineering

When we build a model, we can't pass null values to it. We first need to fill in (impute) those values before feeding the data to the model.

In [4]:
df.isnull().sum()  # count the missing values in each column

Company Name         0
Location             0
Datum                0
Detail               0
Status Rocket        0
Rocket            3360
Status Mission       0
Country              0
DateTime             0
Year                 0
Launch_Site          0
Count                0
Month                0
dtype: int64

So 3360 of the 4324 rows have no value in the Rocket column.

In [5]:
df['Rocket'] = df['Rocket'].fillna(df['Rocket'].mean())
# fillna() replaces the missing values with the value we pass in
# here that value is the mean of the non-missing Rocket entries (mean() skips NaN)
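To make the imputation step concrete, here is a minimal sketch on a made-up five-row frame (the values and the `Rocket_missing` column are illustrative assumptions, not part of the notebook). It shows that `mean()` is computed from the known values only, and that a flag column can record which rows were filled in, which may be worth keeping when most of a column is imputed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Rocket' column; the numbers are made up.
toy = pd.DataFrame({"Rocket": [50.0, np.nan, 30.0, np.nan, 70.0]})

# Series.mean() skips NaN by default, so this is the mean of the known values.
known_mean = toy["Rocket"].mean()

# Hypothetical extra column: remember which rows were missing before filling.
toy["Rocket_missing"] = toy["Rocket"].isnull()
toy["Rocket"] = toy["Rocket"].fillna(known_mean)

print(known_mean)  # 50.0
print(toy)
```

Every imputed row ends up with the same value, which is why 153.792199 appears so often in the Rocket column in the outputs of this notebook.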

In [6]:
df.isnull().sum()

Company Name      0
Location          0
Datum             0
Detail            0
Status Rocket     0
Rocket            0
Status Mission    0
Country           0
DateTime          0
Year              0
Launch_Site       0
Count             0
Month             0
dtype: int64

There are no more null values.

In [7]:
df.head()

Unnamed: 0,Company Name,Location,Datum,Detail,Status Rocket,Rocket,Status Mission,Country,DateTime,Year,Launch_Site,Count,Month
0,SpaceX,"LC-39A, Kennedy Space Center, Florida, USA","Fri Aug 07, 2020 05:12 UTC",Falcon 9 Block 5 | Starlink V1 L9 & BlackSky,StatusActive,50.0,Success,USA,2020-08-07 05:12:00+00:00,2020,"LC-39A, Kennedy Space Center, Florida",1,Aug
1,CASC,"Site 9401 (SLS-2), Jiuquan Satellite Launch Ce...","Thu Aug 06, 2020 04:01 UTC",Long March 2D | Gaofen-9 04 & Q-SAT,StatusActive,29.75,Success,China,2020-08-06 04:01:00+00:00,2020,"Site 9401 (SLS-2), Jiuquan Satellite Launch Ce...",1,Aug
2,SpaceX,"Pad A, Boca Chica, Texas, USA","Tue Aug 04, 2020 23:57 UTC",Starship Prototype | 150 Meter Hop,StatusActive,153.792199,Success,USA,2020-08-04 23:57:00+00:00,2020,"Pad A, Boca Chica, Texas",1,Aug
3,Roscosmos,"Site 200/39, Baikonur Cosmodrome, Kazakhstan","Thu Jul 30, 2020 21:25 UTC",Proton-M/Briz-M | Ekspress-80 & Ekspress-103,StatusActive,65.0,Success,Kazakhstan,2020-07-30 21:25:00+00:00,2020,"Site 200/39, Baikonur Cosmodrome",1,Jul
4,ULA,"SLC-41, Cape Canaveral AFS, Florida, USA","Thu Jul 30, 2020 11:50 UTC",Atlas V 541 | Perseverance,StatusActive,145.0,Success,USA,2020-07-30 11:50:00+00:00,2020,"SLC-41, Cape Canaveral AFS, Florida",1,Jul


Next we need to decide which columns are useful for training a model. For example, Detail and Datum do not need to be included in the training data.

In [8]:
df=df.drop(['Location','Datum','Detail','DateTime','Launch_Site','Month','Count'],axis=1)
# Dropping unnecessary columns
# axis=1 means we are dropping columns, 0 would be for dropping rows

In [9]:
df.head()

Unnamed: 0,Company Name,Status Rocket,Rocket,Status Mission,Country,Year
0,SpaceX,StatusActive,50.0,Success,USA,2020
1,CASC,StatusActive,29.75,Success,China,2020
2,SpaceX,StatusActive,153.792199,Success,USA,2020
3,Roscosmos,StatusActive,65.0,Success,Kazakhstan,2020
4,ULA,StatusActive,145.0,Success,USA,2020


Another important point is that we can't pass string values to a model for training. We have to convert them to a numerical form that the model can work with.

In [10]:
df['Status Mission'].value_counts()  # count of each unique value in the Status Mission column

Success              3879
Failure               339
Partial Failure       102
Prelaunch Failure       4
Name: Status Mission, dtype: int64

The thing we intend to predict here is whether the mission will fail or not. So we have to reduce four unique values into two unique values.

In [11]:
df['Status Mission'] =df['Status Mission'].apply(lambda x: x if x == 'Success' else 'Failure')
# converting four unique values namely Success, Failure, Partial Failure and Prelaunch Failure 
# into just two values namely Success and Failure
df['Status Mission'].value_counts()

Success    3879
Failure     445
Name: Status Mission, dtype: int64

Now we have to convert those values into numerical form. The simplest way is to map Success to 1 and Failure to 0. LabelEncoder does just that, because it assigns integer codes to the labels in sorted order (Failure first, then Success).

In [12]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()  # create a LabelEncoder object
# fit_transform() learns the distinct labels and replaces each one with an integer code
# (here 0 and 1); it does not scale the data
df['Status Mission'] = encoder.fit_transform(df['Status Mission'])

In [13]:
df[:10]

Unnamed: 0,Company Name,Status Rocket,Rocket,Status Mission,Country,Year
0,SpaceX,StatusActive,50.0,1,USA,2020
1,CASC,StatusActive,29.75,1,China,2020
2,SpaceX,StatusActive,153.792199,1,USA,2020
3,Roscosmos,StatusActive,65.0,1,Kazakhstan,2020
4,ULA,StatusActive,145.0,1,USA,2020
5,CASC,StatusActive,64.68,1,China,2020
6,Roscosmos,StatusActive,48.5,1,Kazakhstan,2020
7,CASC,StatusActive,153.792199,1,China,2020
8,SpaceX,StatusActive,50.0,1,USA,2020
9,JAXA,StatusActive,90.0,1,Japan,2020


In [14]:
df['Status Mission'].value_counts()

1    3879
0     445
Name: Status Mission, dtype: int64
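To check which label ended up as which number, the fitted encoder can be inspected. The sketch below uses a small made-up list of labels rather than the real column; `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for the 'Status Mission' column.
labels = ["Success", "Failure", "Success", "Failure"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ is sorted, and a label's position in it is its integer code.
print(enc.classes_.tolist())                    # ['Failure', 'Success']
print(codes.tolist())                           # [1, 0, 1, 0]
print(enc.inverse_transform([0, 1]).tolist())   # ['Failure', 'Success']
```

So in this notebook 1 means Success and 0 means Failure, which matches the counts above (3879 and 445).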

Similarly, we convert Status Rocket into numerical form.

In [15]:
encoder = LabelEncoder()
df['Status Rocket']=encoder.fit_transform(df['Status Rocket'])

In [16]:
df.head()

Unnamed: 0,Company Name,Status Rocket,Rocket,Status Mission,Country,Year
0,SpaceX,0,50.0,1,USA,2020
1,CASC,0,29.75,1,China,2020
2,SpaceX,0,153.792199,1,USA,2020
3,Roscosmos,0,65.0,1,Kazakhstan,2020
4,ULA,0,145.0,1,USA,2020


In [17]:
df['Status Rocket'].value_counts()

1    3534
0     790
Name: Status Rocket, dtype: int64

We could use both the Company Name and Country columns as features, but I decided to drop the Country column.

In [18]:
df=df.drop(['Country'],axis=1)

In [19]:
df.head()

Unnamed: 0,Company Name,Status Rocket,Rocket,Status Mission,Year
0,SpaceX,0,50.0,1,2020
1,CASC,0,29.75,1,2020
2,SpaceX,0,153.792199,1,2020
3,Roscosmos,0,65.0,1,2020
4,ULA,0,145.0,1,2020


### One hot encoding the Company Name Column

In [20]:
def onehot_encode(data, column):
    dummies = pd.get_dummies(data[column])
    data = pd.concat([data, dummies], axis=1)
    data.drop(column, axis=1, inplace=True)
    return data

In [21]:
df=onehot_encode(df,'Company Name')
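As an illustration of what one-hot encoding produces, here is a sketch on a three-row toy frame. The helper name `onehot_encode_copy` and the toy values are assumptions for this example. Passing `dtype=int` is an optional choice: newer pandas versions return boolean indicator columns by default, and `dtype=int` gives the 0/1 values shown in this notebook's outputs.

```python
import pandas as pd

def onehot_encode_copy(data, column):
    """Return a new frame with `column` replaced by 0/1 indicator columns."""
    dummies = pd.get_dummies(data[column], dtype=int)
    return pd.concat([data.drop(columns=column), dummies], axis=1)

# Toy frame; the rows are made up.
toy = pd.DataFrame({
    "Company Name": ["SpaceX", "CASC", "SpaceX"],
    "Year": [2020, 2020, 2019],
})

encoded = onehot_encode_copy(toy, "Company Name")
print(encoded.columns.tolist())   # ['Year', 'CASC', 'SpaceX']
print(encoded["SpaceX"].tolist()) # [1, 0, 1]
```

Each distinct company becomes its own column, so a column with many distinct values produces a wide, mostly-zero table like the one in the next output.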

Next we separate the X and y values: given the columns in X, we have to predict y. So y holds only the target column (Status Mission), and X must not contain that column.

In [22]:
df.head()

Unnamed: 0,Status Rocket,Rocket,Status Mission,Year,AEB,AMBA,ASI,Arianespace,Armée de l'Air,Blue Origin,...,SpaceX,Starsem,ULA,US Air Force,US Navy,UT,VKS RF,Virgin Orbit,Yuzhmash,i-Space
0,0,50.0,1,2020,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,0,29.75,1,2020,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,153.792199,1,2020,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,65.0,1,2020,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,145.0,1,2020,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [23]:
X=df.drop('Status Mission',axis=1)
y=df['Status Mission']

In [24]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: Status Mission, dtype: int64

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,random_state=101)
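Because only about one mission in ten is a failure, a plain random split can leave the train and test sets with slightly different failure rates. One option, which the notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below runs on synthetic labels with a similar imbalance; the arrays are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 100 failures (0) and 900 successes (1).
y = np.array([0] * 100 + [1] * 900)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, random_state=101, stratify=y
)

print((y_tr == 0).sum(), (y_te == 0).sum())  # 70 30
```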

In [26]:
X_train.head()

Unnamed: 0,Status Rocket,Rocket,Year,AEB,AMBA,ASI,Arianespace,Armée de l'Air,Blue Origin,Boeing,...,SpaceX,Starsem,ULA,US Air Force,US Navy,UT,VKS RF,Virgin Orbit,Yuzhmash,i-Space
3677,1,59.0,1968,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1683,1,153.792199,1992,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1193,1,153.792199,2000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2266,1,153.792199,1983,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2968,1,153.792199,1975,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Using StandardScaler to scale the data. The scaler is fitted on the training set only and then applied unchanged to the test set.

In [27]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaled_X_train=scaler.fit_transform(X_train)
scaled_X_test=scaler.transform(X_test)

In [28]:
scaled_X_train

array([[ 0.47026172, -0.62229982, -1.07893777, ..., -0.01818182,
        -0.02571722, -0.01818182],
       [ 0.47026172, -0.01259235,  0.25227945, ..., -0.01818182,
        -0.02571722, -0.01818182],
       [ 0.47026172, -0.01259235,  0.69601852, ..., -0.01818182,
        -0.02571722, -0.01818182],
       ...,
       [-2.12647544,  0.28461819,  1.41709452, ..., -0.01818182,
        -0.02571722, -0.01818182],
       [ 0.47026172, -0.01259235,  0.52961637, ..., -0.01818182,
        -0.02571722, -0.01818182],
       [ 0.47026172, -0.01259235,  0.36321422, ..., -0.01818182,
        -0.02571722, -0.01818182]])
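The scaling above follows the usual pattern of fitting on the training data and reusing those statistics for the test data. A small sketch with made-up numbers shows what `transform` computes: each value minus the training mean, divided by the training standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler().fit(train)  # learns mean_ and scale_ from train only

# Manual version of the same formula, using the training statistics.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaler.mean_, scaler.scale_)
print(scaler.transform(test), manual)
```

The test value is scaled with the training mean (2.0), not its own, so no information from the test set leaks into preprocessing.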

Using a Logistic Regression model (LogisticRegressionCV) for prediction

In [29]:
from sklearn.linear_model import LogisticRegressionCV
log_model=LogisticRegressionCV()
log_model.fit(scaled_X_train,y_train)

LogisticRegressionCV()

In [30]:
y_pred=log_model.predict(scaled_X_test)
y_pred

array([1, 1, 1, ..., 1, 1, 1])

In [31]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
accuracy_score(y_test,y_pred)

0.9090909090909091

In [32]:
confusion_matrix(y_test,y_pred)

array([[   9,  114],
       [   4, 1171]])

In [33]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.69      0.07      0.13       123
           1       0.91      1.00      0.95      1175

    accuracy                           0.91      1298
   macro avg       0.80      0.53      0.54      1298
weighted avg       0.89      0.91      0.87      1298



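We see that the accuracy is about 91%, but the confusion matrix shows that the model misclassified 118 test samples, and 114 of them are actual failures that were predicted as successes. Only 9 of the 123 failures were caught (recall 0.07), so this model is not good at predicting failures. Next we try a random forest classifier.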
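The numbers in the classification report can be rederived from the confusion matrix by hand. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes) and also computes the accuracy of a trivial baseline that always predicts success; the variable names are just for this example.

```python
import numpy as np

# Confusion matrix from the logistic regression output above.
cm = np.array([[9, 114],
               [4, 1171]])

failure_recall = cm[0, 0] / cm[0].sum()        # caught failures / actual failures
failure_precision = cm[0, 0] / cm[:, 0].sum()  # caught failures / predicted failures
accuracy = np.trace(cm) / cm.sum()             # correct predictions / all predictions
baseline = cm[1].sum() / cm.sum()              # accuracy of always predicting class 1

print(round(failure_recall, 2))     # 0.07
print(round(failure_precision, 2))  # 0.69
print(round(accuracy, 4))           # 0.9091
print(round(baseline, 3))           # 0.905
```

The model's accuracy is less than half a percentage point above the always-success baseline, which is why accuracy alone is misleading on data this imbalanced.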

In [34]:
from sklearn.ensemble import RandomForestClassifier
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
rfc=RandomForestClassifier(n_estimators=200,max_features='sqrt',random_state=101)
rfc.fit(scaled_X_train,y_train)

RandomForestClassifier(n_estimators=200, random_state=101)

In [35]:
rfc_pred=rfc.predict(scaled_X_test)

In [36]:
accuracy_score(y_test,rfc_pred)  # argument order is (y_true, y_pred)

0.8936825885978429

In [37]:
print(confusion_matrix(y_test,rfc_pred))

[[  21  102]
 [  36 1139]]


In [38]:
print(classification_report(y_test,rfc_pred))

              precision    recall  f1-score   support

           0       0.37      0.17      0.23       123
           1       0.92      0.97      0.94      1175

    accuracy                           0.89      1298
   macro avg       0.64      0.57      0.59      1298
weighted avg       0.87      0.89      0.88      1298



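The random forest catches more failures than the logistic regression model (recall for class 0 rises from 0.07 to 0.17), but its precision for that class drops from 0.69 to 0.37 and its accuracy is slightly lower (0.89). Recall and f1-score for failures are still very low, which means this is also a poor model.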
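One common thing to try when the minority class is poorly recalled is class weighting. The sketch below is not part of the original analysis and runs on synthetic data from `make_classification`, so it says nothing about how much, if at all, weighting would help on the space-missions data. It only shows how the `class_weight` parameter of `RandomForestClassifier` is used and how recall for class 0 could be compared.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 10% in class 0); NOT the space-missions data.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.1, 0.9], random_state=101
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=101
)

recalls = {}
for weight in (None, "balanced"):
    # class_weight='balanced' reweights samples inversely to class frequency.
    model = RandomForestClassifier(
        n_estimators=200, class_weight=weight, random_state=101
    )
    model.fit(X_tr, y_tr)
    recalls[weight] = recall_score(y_te, model.predict(X_te), pos_label=0)

print(recalls)
```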