# <center> **The Mystery of Space Titanic**

Things can't get much fancier than imagining a spaceship Titanic in the future that meets the same fate as its counterpart on Earth, 1000 years later. For the first time we are solving a data science problem from the future, and it is a test of our ability to work with data. Let's examine the context of this challenge.

## The Context

Setting up the context is important for any data science problem: it sets expectations and clarifies what exactly we are trying to achieve.

The problem is a continuation of the [Titanic](https://www.kaggle.com/c/titanic/overview) challenge, launched on Kaggle 10 years ago, which has been the starting point for most Kagglers (including me). The goal there was to predict whether a person survived the disaster, given features such as age, gender, and class.

This one looks a bit different: we are in space, so things can get more complex. Let's see what the data looks like and what meaningful insights we can derive from it.

![](https://cdn.pixabay.com/photo/2020/10/28/17/32/spaceship-5694112__340.jpg)

In [None]:
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import warnings
warnings.filterwarnings('ignore')

import plotly.express as px
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF
from plotly.subplots import make_subplots

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

## Exploring the Data

In [None]:
#Loading the CSV files from future!!
df=pd.read_csv('../input/spaceship-titanic/train.csv')
test_df = pd.read_csv('../input/spaceship-titanic/test.csv')

In [None]:
df.head()

At first glance, the features look completely different from those of the original challenge, and we have some fancy space terms. It is therefore helpful to go through the features and what each one stands for, especially if you are not a space enthusiast!

## Transported: The Target Variable

Let's start the analysis from the target, which will give us an idea what exactly we are trying to predict.

As per the documentation, 'Transported' suggests whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

Let's check the class balance.

In [None]:
df['Transported'].value_counts()

Fortunately, class imbalance is not an issue here: roughly as many passengers were transported as were not.
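To put a number on the balance rather than eyeballing raw counts, one option is to look at class shares. This is a minimal sketch on made-up values standing in for `df['Transported']`; the name `imbalance_gap` is only illustrative.

```python
import pandas as pd

# Toy stand-in for df['Transported']; the values are made up for illustration
transported = pd.Series([True, False, True, True, False, False, True, False])

# normalize=True returns class shares instead of raw counts
balance = transported.value_counts(normalize=True)
print(balance)

# Gap between the largest and smallest class share; 0.0 means perfectly balanced
imbalance_gap = balance.max() - balance.min()
print(imbalance_gap)
```

On the real data the same two lines can be run on `df['Transported']` directly.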

In [None]:
colors = ['#45d96d', '#6376f2']
fig = go.Figure([go.Bar(y=df['Transported'].value_counts().values, x=df['Transported'].value_counts().index,
                 text=df['Transported'].value_counts().values,
                 marker=dict(color=colors,line=dict(color='#000000', 
                          width=2))
                            )])
fig.update_layout(title_text='Target Distribution')
fig.show()

## 1. Home Planet

The planet the passenger departed from, typically their planet of permanent residence.

This doesn't need any further explanation. It is interesting to note, though, that people live on different planets by 2912, and I'd be curious to know whether all the passengers are human beings!

Let's see how many of them live on Earth and how many come from other planets.

In [None]:

colors = ['#F31840', '#50BD30','#8918F3']
fig = go.Figure(data=[go.Pie(
    labels=df['HomePlanet'].value_counts().index,
    values=df['HomePlanet'].value_counts().values, pull=[0,0,0.1],
    marker=dict(colors=colors, 
                line=dict(color='#000000', 
                          width=2)),
)])
fig.update_layout(title_text='Home Planet Distribution')
fig.show()

## 2. Cryo Sleep

Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

If you are wondering what cryosleep is and would like to know more, you can read about it [here](https://medium.com/predict/the-truth-about-cryosleep-7d114ec22eb5).

![](https://miro.medium.com/max/1400/0*paIuqZ-aZW0sPLPl.jpg)

In [None]:
df.head()

In [None]:
colors = ['#F31840', '#50BD30']
fig = px.histogram(df,x="CryoSleep", color="Transported")
fig.show()

In [None]:
df.head()

## 3. Destination

The planet the passenger will be debarking to.

Well, I'm also curious to know which planets we are exploring next, so let's look closely.

In [None]:
df['Destination'].value_counts()

In [None]:
plt.figure(figsize=(8,4),dpi=150)
sns.countplot(data=df, x='Destination',hue="Transported",palette='mako')

Comparatively, more of the passengers bound for TRAPPIST-1e were not transported, PSO J318.5-22 shows an almost equal split, and for 55 Cancri e more passengers were transported than not.
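The comparison read off the count plot can also be expressed as a transported rate per destination. The sketch below uses a small made-up frame (the rows are not from the competition data) just to show the `groupby` pattern.

```python
import pandas as pd

# Made-up rows for illustration only; the real frame is df
toy = pd.DataFrame({
    "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e",
                    "55 Cancri e", "55 Cancri e",
                    "PSO J318.5-22", "PSO J318.5-22"],
    "Transported": [True, False, False, False,
                    True, True,
                    True, False],
})

# The mean of a boolean column is the share of True values in each group
rate = toy.groupby("Destination")["Transported"].mean()
print(rate)
```

Running the same `groupby` on `df` would give the actual rates behind the plot.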

## 4. Age

The age of the passenger.

Age and gender had a lot of influence in the original Titanic challenge; let's see if the pattern repeats here:

In [None]:
plt.figure(figsize=(6,3),dpi=150)
#fill replaces the deprecated shade argument; a single color is ignored once hue is set
sns.kdeplot(data=df, x='Age', hue='Transported', fill=True)

The distribution is right-skewed, with a peak between ages 20 and 25. There is an irregularity between ages 0 and 10, where more passengers were transported than not, and the trend mostly continues: younger passengers were transported more often than the other age groups.

In [None]:
df.head()

## 5. VIP

Whether the passenger has paid for special VIP service during the voyage.

This is again a feature that makes me curious. Let's see if the spaceship was able to take care of the people who paid extra for the VIP services.

In [None]:
fig=make_subplots(rows=1,cols=2,specs=[[{"type": "pie"},{"type": "pie"}]],
                 subplot_titles=("VIP distribution", "Transported distribution of VIP passengers"))

fig.add_trace(go.Pie(
    labels=df['VIP'].value_counts().index,
    values=df['VIP'].value_counts().values, pull=[0,0.1],
    marker=dict(colors=colors, 
                line=dict(color='#000000', 
                          width=2))
    ),row=1,col=1)
fig.add_trace(go.Pie(
    labels=df[df['VIP']==True]['Transported'].value_counts().index,
    values=df[df['VIP']==True]['Transported'].value_counts().values, pull=[0,0.1],
    marker=dict(colors=['#34e8eb','#ebe534'], 
                line=dict(color='#000000', 
                          width=2))
    ),row=1,col=2)  

fig.update_layout(showlegend=False)
fig.show()

Only 2.34% of the passengers paid for the VIP services, and the second pie chart shows that only 38% of the VIP passengers were transported.

## 6. Total Spending

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

For the analysis, we combine RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck into a single feature, 'Total Spending', and see if there is something interesting there.

In [None]:
df.head()

In [None]:
df['Total_Spending']=df['RoomService']+df['FoodCourt']+df['ShoppingMall']+df['Spa']+df['VRDeck']

In [None]:
plt.figure(figsize=(6,3),dpi=150)
#fill replaces the deprecated shade argument; a single color is ignored once hue is set
sns.kdeplot(data=df, x='Total_Spending', hue='Transported', fill=True)
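One detail worth keeping in mind when summing the amenity columns: the `+` operator propagates missing values, so a passenger with a single missing bill ends up with a missing total. The sketch below, on made-up numbers with only two of the five columns, contrasts that with `DataFrame.sum(axis=1)`, which skips missing values by default. Which behaviour is preferable is a judgment call.

```python
import numpy as np
import pandas as pd

# Made-up bills; NaN marks a missing record
toy = pd.DataFrame({
    "RoomService": [10.0, np.nan, np.nan],
    "Spa": [5.0, 20.0, np.nan],
})

# '+' returns NaN as soon as one operand is missing
plus_total = toy["RoomService"] + toy["Spa"]

# sum(axis=1) skips NaN, so an all-missing row becomes 0.0
sum_total = toy.sum(axis=1)

# min_count=1 keeps the total missing when every bill in the row is missing
strict_total = toy.sum(axis=1, min_count=1)

print(pd.DataFrame({"plus": plus_total, "sum": sum_total, "strict": strict_total}))
```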

## The Missing Values

At first glance the data seems to have a lot of missing values, which is expected because the records were recovered from the spaceship’s damaged computer system.

In [None]:
fig=go.Figure(data=[go.Bar(y=df.isna().sum().sort_values(ascending=False).index[1:], 
                     x=df.isna().sum().sort_values(ascending=False).values[1:],
                     orientation="h",
                    marker=dict(color=[n for n in range(14)], 
                                line_color='rgb(0,0,0)', 
                                line_width = 2,
                                coloraxis="coloraxis")
                    )
                    ])
fig.update_layout(showlegend=False, title_text="Missing values distribution", title_x=0.5)
fig.show()

Well, there are not that many missing values after all. We will treat them separately for the categorical and the numerical features.

In [None]:
cat_features=['HomePlanet','CryoSleep','Cabin','Destination','VIP','Name','Transported']
numerical_feat= ['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']

Each categorical feature has one clearly dominant value, so it is reasonably safe to fill the missing entries with the mode of that feature.

In [None]:
df['HomePlanet'] = df['HomePlanet'].fillna(df['HomePlanet'].mode()[0])
df['CryoSleep'] = df['CryoSleep'].fillna(df['CryoSleep'].mode()[0])
df['Cabin']=df['Cabin'].fillna(df['Cabin'].mode()[0])
df['Destination'] = df['Destination'].fillna(df['Destination'].mode()[0])
df['VIP'] = df['VIP'].fillna(df['VIP'].mode()[0])
df['Name'] = df['Name'].fillna("Name None")

In [None]:
test_df['HomePlanet'] = test_df['HomePlanet'].fillna(test_df['HomePlanet'].mode()[0])
test_df['CryoSleep'] = test_df['CryoSleep'].fillna(test_df['CryoSleep'].mode()[0])
test_df['Cabin']=test_df['Cabin'].fillna(test_df['Cabin'].mode()[0])
test_df['Destination'] = test_df['Destination'].fillna(test_df['Destination'].mode()[0])
test_df['VIP'] = test_df['VIP'].fillna(test_df['VIP'].mode()[0])
test_df['Name'] = test_df['Name'].fillna("Name None")
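The column-by-column fills above follow one pattern, so they could also be written as a loop. This is only a sketch: `fill_with_mode` is a hypothetical helper, not something the notebook defines, and the frame is made up.

```python
import numpy as np
import pandas as pd

def fill_with_mode(frame, columns):
    """Return a copy of frame with NaNs in `columns` replaced by each column's mode."""
    out = frame.copy()
    for col in columns:
        # mode() returns a Series because ties are possible; take the first entry
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Made-up rows for illustration
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", np.nan, "Mars"],
    "Destination": ["TRAPPIST-1e", np.nan, "TRAPPIST-1e", "55 Cancri e"],
})

filled = fill_with_mode(toy, ["HomePlanet", "Destination"])
print(filled)
```

The helper returns a copy, so the original frame stays unchanged until the result is assigned back.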

For the numerical features, we fill the missing values with scikit-learn's iterative imputer, which estimates each feature's missing entries from the other features.

In [None]:
#Handling the numerical feature missing values
imputer = IterativeImputer(max_iter=10,verbose=1)
df_imp = imputer.fit_transform(df[numerical_feat])
test_imp = imputer.transform(test_df[numerical_feat])
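To get a feel for what the iterative imputer does, here is a toy sketch on a made-up two-column array where the second column is roughly twice the first. It assumes the default estimator and adds `random_state=0` only to make the run repeatable. Observed entries are left as they are, and only the gaps are estimated.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Made-up data: column 1 is about 2 * column 0, with two gaps
train = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
train_filled = imputer.fit_transform(train)   # learn the relationship and fill the train gaps

# New rows are filled with the relationship learned on the train data
new_rows = np.array([[6.0, np.nan]])
new_filled = imputer.transform(new_rows)

print(train_filled)
print(new_filled)
```

This mirrors the notebook's pattern of `fit_transform` on the train set followed by `transform` on the test set.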

In [None]:
#Creating dataframes of the imputed values and concatenating them to the categorical columns
df_imp_new = pd.DataFrame(df_imp,columns=numerical_feat)
train_df=pd.concat([df[cat_features],df_imp_new],axis=1)

test_imp_new = pd.DataFrame(test_imp,columns=numerical_feat)
test_df = pd.concat([test_df[cat_features[:-1]],test_imp_new],axis=1)

## Feature Engineering

We are not quite ready for modelling yet, as some features can be broken down into simpler ones. We will split the 'Name' feature and keep only the last name, since the first name is unlikely to carry much information.

Apart from that, we will also break down the Cabin feature. It takes the form deck/num/side, where side can be either P for Port or S for Starboard. We'll split it into Deck, Num, and Side.
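For a single cabin string, the deck/num/side split works like this. The value below is illustrative, and `side_name` is just a name chosen for this sketch.

```python
# Illustrative cabin value in deck/num/side form
cabin = "B/0/P"

# Split on '/' into the three parts
deck, num, side = cabin.split("/")
num = int(num)  # the cabin number is numeric

# P stands for Port, S for Starboard
side_name = {"P": "Port", "S": "Starboard"}[side]
print(deck, num, side_name)
```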

#### 1. Extracting the Last Name

In [None]:
train_df['last_name'] = train_df['Name'].apply(lambda s:str(s).split()[1])
test_df['last_name']=test_df['Name'].apply(lambda s:str(s).split()[1])

In [None]:
train_df.head()

In [None]:
print("There are",train_df['last_name'].nunique(),"unique last names in train set")
print("There are",test_df['last_name'].nunique(),"unique last names in test set")

#### 2. Splitting the cabin feature

In [None]:
#Splitting the cabin feature (deck/num/side) into three columns
train_df[['Deck','Num','Side']] = train_df['Cabin'].str.split('/', expand=True)
test_df[['Deck','Num','Side']] = test_df['Cabin'].str.split('/', expand=True)

#### 3. Adding 'Total Spending' Feature

In [None]:
#Creating total spending feature
train_df['Total_Spending']=train_df['RoomService']+train_df['FoodCourt']+train_df['ShoppingMall']+train_df['Spa']+train_df['VRDeck']
test_df['Total_Spending']=test_df['RoomService']+test_df['FoodCourt']+test_df['ShoppingMall']+test_df['Spa']+test_df['VRDeck']

In [None]:
#Dropping the remaining columns
train_df.drop(['Cabin','Name'],axis=1,inplace=True)
test_df.drop(['Cabin','Name'],axis=1,inplace=True)

In [None]:
train_df.head()

The data looks good. We now move on to the final preprocessing step, label encoding, and then proceed to model creation and prediction.

In [None]:
#Creating the new categorical features for label encoding
cat_features_new = ['HomePlanet','CryoSleep','Destination','VIP','last_name','Deck','Side','Transported']
train_df['Num']=train_df['Num'].astype(int)
test_df['Num']=test_df['Num'].astype(int)

## Encoding the Categorical Variables

In [None]:
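#Label encoding: fit each encoder on the train and test values together,
#so that the same category gets the same integer code in both sets
for feature in cat_features_new[:-1]:
    le = LabelEncoder()
    le.fit(pd.concat([train_df[feature], test_df[feature]]))
    train_df[feature] = le.transform(train_df[feature])
    test_df[feature] = le.transform(test_df[feature])

#The target ('Transported') exists only in the train set
train_df['Transported'] = LabelEncoder().fit_transform(train_df['Transported'])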
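One thing worth checking when label encoding a train and a test set is whether the same category receives the same integer in both. The sketch below uses made-up values: fitting a separate encoder per set can assign different codes to the same category, while a single encoder fitted on both sets keeps them consistent.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up columns; the test set happens to lack 'Earth'
train_col = pd.Series(["Earth", "Europa", "Mars"])
test_col = pd.Series(["Europa", "Mars"])

# Separate encoders: codes follow each set's own sorted categories
sep_train = LabelEncoder().fit_transform(train_col)
sep_test = LabelEncoder().fit_transform(test_col)

# Shared encoder: fitted once on all values, then applied to both sets
shared = LabelEncoder().fit(pd.concat([train_col, test_col]))
shared_train = shared.transform(train_col)
shared_test = shared.transform(test_col)

print("separate:", list(sep_train), list(sep_test))
print("shared:  ", list(shared_train), list(shared_test))
```

With separate encoders 'Mars' is encoded differently in the two sets, which matters for high-cardinality features such as the last name.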

In [None]:
train_df.head()

Now the data is ready for modelling.

## XGBoost - Default Parameters

As a start, let's build a model using the popular XGBoost library with its default parameters and see how the predictions fare. If it does well, we will tune the hyperparameters and try to improve the predictions.

In [None]:
X = train_df.drop('Transported',axis=1)
y = train_df['Transported']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
#Model using XGboost with default parameters
model_xgb = XGBClassifier()
model_xgb.fit(X_train,y_train)

In [None]:
#Checking the prediction accuracy on the validation set
pred_xgb = model_xgb.predict(X_test)
accuracy_score(y_test,pred_xgb)
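As a reminder of what the accuracy figure means, here is a small sketch on made-up labels. It also computes a majority-class baseline, which is a handy reference point even when the classes are nearly balanced. The names `manual` and `baseline` are only illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0])

# Accuracy is the share of predictions that match the true labels
acc = accuracy_score(y_true, y_pred)
manual = (y_true == y_pred).mean()

# Baseline: always predict the more frequent class
positive_share = y_true.mean()
baseline = max(positive_share, 1 - positive_share)

print(acc, manual, baseline)
```

A model is only useful if its validation accuracy clearly exceeds this baseline.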

In [None]:
#Plotting the feature importances
feature_important = model_xgb.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

#The booster keys are the column names, so the index already labels each bar correctly;
#overwriting the tick labels with X_train.columns would mislabel the sorted bars
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score")
ax = data.plot(kind='barh', figsize=(20,10))
ax.set(xlabel="F-Score", ylabel="Feature")
ax.set_title('Feature Importance')

As per the plot above, the most important features are "Total Spending", "Cabin", and, to everyone's surprise, the last name!

In [None]:
#Calculating the test set predictions
prediction = model_xgb.predict(test_df)

In [None]:
test1_df = pd.read_csv('../input/spaceship-titanic/test.csv')
sub_file = pd.DataFrame({'PassengerId': test1_df['PassengerId'], 'Transported' : prediction})

#Mapping the predicted values back to 'True','False'
sub_file['Transported']=sub_file['Transported'].map({0:'False',1:'True'})

In [None]:
sub_file.head()

In [None]:
sub_file.to_csv("submission.csv",index=False)

The XGBoost model gives a decent prediction accuracy, so we will be doing hyperparameter tuning on this model soon.

## Work in Progress !!

![](https://thumbs.dreamstime.com/b/sketchy-loading-sign-isolated-white-vector-123220895.jpg)