### Spaceship Titanic

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df=pd.read_csv('train.csv')
t=pd.read_csv('test.csv')

In [3]:
df.shape

(8693, 14)

In [4]:
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [6]:
df.describe(include='all')

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
count,8693,8492,8476,8494,8511,8514.0,8490,8512.0,8510.0,8485.0,8510.0,8505.0,8493,8693
unique,8693,3,2,6560,3,,2,,,,,,8473,2
top,0001_01,Earth,False,G/734/S,TRAPPIST-1e,,False,,,,,,Gollux Reedall,True
freq,1,4602,5439,8,5915,,8291,,,,,,2,4378
mean,,,,,,28.82793,,224.687617,458.077203,173.729169,311.138778,304.854791,,
std,,,,,,14.489021,,666.717663,1611.48924,604.696458,1136.705535,1145.717189,,
min,,,,,,0.0,,0.0,0.0,0.0,0.0,0.0,,
25%,,,,,,19.0,,0.0,0.0,0.0,0.0,0.0,,
50%,,,,,,27.0,,0.0,0.0,0.0,0.0,0.0,,
75%,,,,,,38.0,,47.0,76.0,27.0,59.0,46.0,,


### Feature Engineering

In [7]:
num=df._get_numeric_data().columns

In [8]:
cat=df.select_dtypes(exclude='number').columns

In [9]:
num

Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported'],
      dtype='object')

In [10]:
from sklearn.impute import SimpleImputer
si=SimpleImputer()
for i in num:
    a=df[i].values
    df[i]=si.fit_transform(a.reshape(-1,1))

In [11]:
si1=SimpleImputer(strategy='most_frequent')
for j in cat:
    b=df[j].values
    df[j]=si1.fit_transform(b.reshape(-1,1))

In [12]:
df['group']=df['PassengerId'].str[:4]

In [13]:
df['nwg']=df['PassengerId'].str[5:]

In [14]:
df.drop('PassengerId',axis=1,inplace=True)

In [15]:
df['deck']=df['Cabin'].str[:1]

In [16]:
df['Cabin']=df['Cabin'].str.split('/')

In [17]:
df[['deck','num','side']] = pd.DataFrame(df.Cabin.tolist(), index= df.index)

In [18]:
df.drop('Cabin', axis=1, inplace=True)

In [19]:
df['Name']=df['Name'].str.split(' ')

In [20]:
df[['FN','LN']]=pd.DataFrame(df.Name.tolist(),index=df.index)

In [21]:
df.drop('Name',axis=1, inplace=True)

In [22]:
df.head()

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,group,nwg,deck,num,side,FN,LN
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,0.0,1,1,B,0,P,Maham,Ofracculy
1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,1.0,2,1,F,0,S,Juanna,Vines
2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,0.0,3,1,A,0,S,Altark,Susent
3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,0.0,3,2,A,0,S,Solam,Susent
4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,1.0,4,1,F,1,S,Willy,Santantines


In [23]:
df.describe(include='all')

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,group,nwg,deck,num,side,FN,LN
count,8693,8693,8693,8693.0,8693,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693.0,8693,8693.0,8693,8693,8693
unique,3,2,3,,2,,,,,,,6217.0,8.0,8,1817.0,2,2706,2217
top,Earth,False,TRAPPIST-1e,,False,,,,,,,4498.0,1.0,F,734.0,S,Alraium,Disivering
freq,4803,5656,6097,,8494,,,,,,,8.0,6217.0,2794,208.0,4487,203,207
mean,,,,28.82793,,224.687617,458.077203,173.729169,311.138778,304.854791,0.503624,,,,,,,
std,,,,14.339054,,659.739364,1594.434978,597.41744,1124.675871,1133.259049,0.500016,,,,,,,
min,,,,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,
25%,,,,20.0,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,
50%,,,,27.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,,
75%,,,,37.0,,78.0,118.0,45.0,89.0,71.0,1.0,,,,,,,


### Feature like HomePlanet, CryoSleep, Destination, VIP, side could be encoded with OHE
### The remaining categorical features are to be encoded with Target Encoder

In [24]:
df.select_dtypes(exclude='number').columns

Index(['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'group', 'nwg', 'deck',
       'num', 'side', 'FN', 'LN'],
      dtype='object')

In [25]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['Destination']=le.fit_transform(df['Transported'])

In [26]:
cat_te=['group', 'nwg', 'deck','num','FN', 'LN']
cat_ohe=['HomePlanet', 'CryoSleep','Destination', 'VIP','side']

In [27]:
for i in cat_ohe:
    a=pd.get_dummies(df[i])
    df=df.merge(a,right_index=True,left_index=True)

In [28]:
from category_encoders import TargetEncoder
te=TargetEncoder()
for j in cat_te:
    df[j]=te.fit_transform(df[j],df['Destination'])



In [29]:
df.drop(cat_ohe,axis=1, inplace=True)

In [30]:
df['Transported']=df['Transported'].map({1.0:1,0.0:0})

In [31]:
df.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,group,nwg,deck,...,Europa,Mars,False_x,True_x,0_y,1_y,False,True,P,S
0,39.0,0.0,0.0,0.0,0.0,0.0,0,0.503624,0.475953,0.734275,...,1,0,1,0,1,0,1,0,1,0
1,24.0,109.0,9.0,25.0,549.0,44.0,1,0.503624,0.475953,0.439871,...,0,0,1,0,0,1,1,0,0,1
2,58.0,43.0,3576.0,0.0,6715.0,49.0,0,0.135445,0.475953,0.496094,...,1,0,1,0,1,0,0,1,0,1
3,33.0,0.0,1283.0,371.0,3329.0,193.0,0,0.135445,0.558782,0.496094,...,1,0,1,0,1,0,1,0,0,1
4,16.0,303.0,70.0,151.0,565.0,2.0,1,0.503624,0.475953,0.439871,...,0,0,1,0,0,1,1,0,0,1


In [32]:
cols=['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']

In [33]:
def outliers(data,feature):
  q1=data[feature].quantile(0.05)
  q3=data[feature].quantile(0.95)
  iqr=q3-q1
  ul=q3+1.5*iqr
  ll=q1-1.5*iqr
  return ul,ll


In [34]:
df.shape

(8693, 24)

In [35]:
for i in cols:
  ul,ll=outliers(df,i)
  df=df[(df[i]<ul) & (df[i]>ll)]

In [36]:
X=df.drop('Transported',axis=1)
y=df['Transported'].values

In [37]:
from imblearn.over_sampling import SMOTE
sm=SMOTE()
X_sm,y_sm=sm.fit_resample(X,y)



In [38]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_sm=sc.fit_transform(X_sm)



In [39]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_sm,y_sm,test_size=0.2)

In [40]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score,confusion_matrix,recall_score,precision_score,accuracy_score
rf=RandomForestClassifier(random_state=42)
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)
f1_score(y_test,y_pred)

1.0

In [41]:
confusion_matrix(y_test,y_pred)

array([[813,   0],
       [  0, 876]])

In [42]:
recall_score(y_test,y_pred)

1.0

In [43]:
precision_score(y_test,y_pred)

1.0

In [44]:
accuracy_score(y_test,y_pred)

1.0

In [45]:
from sklearn.ensemble import ExtraTreesClassifier
xt=ExtraTreesClassifier()
cross_val_score(xt,X_train,y_train,cv=3,scoring='f1')

array([1., 1., 1.])

In [46]:
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
cross_val_score(dt,X_train,y_train,cv=3,scoring='f1')

array([1., 1., 1.])

In [54]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(X_train,y_train)
cross_val_score(lr,X_train,y_train,cv=3,scoring='f1')

array([1., 1., 1.])

In [48]:
q=rf.feature_importances_

In [49]:
w=X.columns

In [50]:
pd.DataFrame(q,w)

Unnamed: 0,0
Age,0.001932
RoomService,0.020588
FoodCourt,0.002671
ShoppingMall,0.002342
Spa,0.006398
VRDeck,0.004336
group,0.021577
nwg,0.00025
deck,0.001302
num,0.015011


In [65]:
e=lr.coef_.reshape(-1,1)

In [70]:
e=abs(e)

In [71]:
r=pd.DataFrame(e,w)

In [72]:
r.sort_values(by=0,ascending=False)

Unnamed: 0,0
1_y,3.433294
0_y,3.433294
FN,0.504158
LN,0.387415
num,0.361441
group,0.326999
Spa,0.267323
VRDeck,0.240154
RoomService,0.233025
True_x,0.17994
