# Spaceship Titanic Transported Prediction

This notebook predicts whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. The predictions are based on a set of personal records recovered from the ship's damaged computer system.

## File and Data Field Descriptions
train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

- PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp, where gggg indicates the group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
- CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- Destination - The planet the passenger will be debarking to.
- Age - The age of the passenger.
- VIP - Whether the passenger has paid for special VIP service during the voyage.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- Name - The first and last names of the passenger.
- Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

sample_submission.csv - A submission file in the correct format.
- PassengerId - Id for each passenger in the test set.
- Transported - The target. For each passenger, predict either True or False.

## Importing Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from pandas_profiling import ProfileReport

## Get the Data

In [2]:
df = pd.read_csv("train.csv")

## EDA - Exploratory Data Analysis

In [3]:
# Checking the head
df.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [4]:
# To-do notes for preprocessing:
# - PassengerId: split out the group (feature engineering)
# - Cabin: feature engineering
# - HomePlanet: visualize, fill missing values with the most frequent value, then get_dummies
# - VIP: check its correlation, fill in missing values, convert to 0 and 1
# - Destination: visualize and fill missing values with the most frequent value
# - CryoSleep: convert to 0 and 1
# - Transported: convert to 0 and 1

In [5]:
# info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [6]:
df.describe()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8514.0 | 8512.0 | 8510.0 | 8485.0 | 8510.0 | 8505.0 |
| mean | 28.82793 | 224.687617 | 458.077203 | 173.729169 | 311.138778 | 304.854791 |
| std | 14.489021 | 666.717663 | 1611.48924 | 604.696458 | 1136.705535 | 1145.717189 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 38.0 | 47.0 | 76.0 | 27.0 | 59.0 | 46.0 |
| max | 79.0 | 14327.0 | 29813.0 | 23492.0 | 22408.0 | 24133.0 |


## Checking for Missing Data

In [7]:
df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [8]:
df.isna().sum()/len(df)*100

PassengerId     0.000000
HomePlanet      2.312205
CryoSleep       2.496261
Cabin           2.289198
Destination     2.093639
Age             2.059128
VIP             2.335212
RoomService     2.082135
FoodCourt       2.105142
ShoppingMall    2.392730
Spa             2.105142
VRDeck          2.162660
Name            2.300702
Transported     0.000000
dtype: float64
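
The percentage above can also be computed as the mean of the boolean mask, since the mean of `isna()` is the fraction of missing entries per column. The following is a minimal sketch on a made-up toy frame (the `toy` data is an assumption, not rows from train.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "Transported": [False, True, False, True],
})

# isna() yields booleans, so their column-wise mean is the missing fraction.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting the result puts the columns with the most gaps first, which can make the imputation order easier to plan.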

In [9]:
df.corr()['Transported'].sort_values()

RoomService    -0.244611
Spa            -0.221131
VRDeck         -0.207075
Age            -0.075026
ShoppingMall    0.010141
FoodCourt       0.046566
Transported     1.000000
Name: Transported, dtype: float64
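
The output above only lists numeric columns. Since pandas 2.0, `df.corr()` no longer drops object columns silently and raises an error instead, so on a newer install the call may need an explicit numeric selection. A minimal sketch, assuming a made-up toy frame:

```python
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth"],
    "RoomService": [0.0, 109.0, 43.0, 303.0],
    "Spa": [0.0, 549.0, 6715.0, 565.0],
    "Transported": [False, True, False, True],
})

# Keep numeric and boolean columns only, then correlate with the target.
numeric = toy.select_dtypes(include=["number", "bool"])
corr_with_target = numeric.corr()["Transported"].sort_values()
print(corr_with_target)
```

On pandas 1.5 or later, `toy.corr(numeric_only=True)` should give the same selection in one call.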

In [None]:
profile = ProfileReport(df,title="Report",explorative=True)

In [None]:
profile.to_widgets()

In [None]:
profile.to_file("Analysis Report.html")

## Filling in Missing Data

In [11]:
# Room Service
df['RoomService'].fillna(df['RoomService'].mean(),inplace=True)
# FoodCourt
df['FoodCourt'].fillna(df['FoodCourt'].mean(),inplace=True)
# ShoppingMall
df['ShoppingMall'].fillna(df['ShoppingMall'].mean(),inplace=True)
# Spa
df['Spa'].fillna(df['Spa'].mean(),inplace=True)
# VRDeck (filled with the mean as well, not the median)
df['VRDeck'].fillna(df['VRDeck'].mean(),inplace=True)
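
The five spending columns are all filled the same way, so the step can be written once. The sketch below uses a made-up toy frame and a hypothetical `spend_cols` list; it is an alternative phrasing, not the notebook's own code. Note that `df.describe()` reports a median of 0 for every spending column, so a median fill would behave quite differently from the mean fill used here.

```python
import numpy as np
import pandas as pd

# Hypothetical helper list naming the amenity columns.
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "RoomService": [0.0, 109.0, np.nan, 303.0],
    "FoodCourt": [0.0, 9.0, 3576.0, np.nan],
    "ShoppingMall": [np.nan, 25.0, 0.0, 151.0],
    "Spa": [0.0, 549.0, 6715.0, 565.0],
    "VRDeck": [0.0, np.nan, 49.0, 2.0],
})

# Passing a Series to fillna fills each column with its own mean.
filled = toy[spend_cols].fillna(toy[spend_cols].mean())
print(filled)
```

Assigning the result back (`toy[spend_cols] = filled`) avoids `inplace=True` on a column selection, which newer pandas versions warn about.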


In [12]:
df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Name            200
Transported       0
dtype: int64

In [13]:
# Dropping the Name
df.drop("Name",axis=1,inplace=True)

In [14]:
print(df['HomePlanet'].dropna().sort_values().unique())

['Earth' 'Europa' 'Mars']


In [15]:
df["HomePlanet"].value_counts()

Earth     4602
Europa    2131
Mars      1759
Name: HomePlanet, dtype: int64

In [16]:
df["HomePlanet"].fillna("Earth",inplace=True)

In [17]:
df["VIP"].value_counts()

False    8291
True      199
Name: VIP, dtype: int64

In [18]:
# Caution: "False" here is a string, not the boolean False, so VIP stays object dtype
df["VIP"].fillna("False",inplace=True)

In [19]:
df['Age'].fillna(df['Age'].median(),inplace=True)

In [20]:
df['Destination'].value_counts()

TRAPPIST-1e      5915
55 Cancri e      1800
PSO J318.5-22     796
Name: Destination, dtype: int64

In [21]:
df['Destination'].fillna("TRAPPIST-1e",inplace=True)

In [22]:
df["CryoSleep"].value_counts().idxmax()

False

In [23]:
df['CryoSleep'].fillna(df["CryoSleep"].value_counts().idxmax(),inplace=True)
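
HomePlanet, Destination and CryoSleep are each filled with their most frequent value. A minimal sketch of the same idea as a loop, assuming a made-up toy frame, so the fill value is looked up rather than typed by hand:

```python
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", None, "Mars"],
    "Destination": ["TRAPPIST-1e", None, "TRAPPIST-1e", "55 Cancri e"],
})

for col in ["HomePlanet", "Destination"]:
    # mode() ignores NaN by default; take the first value if there is a tie.
    most_common = toy[col].mode().iloc[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
```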

In [24]:
df["Cabin"].value_counts().sort_values()

F/1433/P    1
G/71/P      1
G/64/S      1
E/26/S      1
F/83/S      1
           ..
C/21/P      7
F/1411/P    7
B/11/S      7
F/1194/P    7
G/734/S     8
Name: Cabin, Length: 6560, dtype: int64

In [25]:
# Only Cabin still has missing values, so this drops the 199 rows without a cabin
df = df.dropna()

In [26]:
df.isnull().sum()

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
dtype: int64

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8494 entries, 0 to 8692
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8494 non-null   object 
 1   HomePlanet    8494 non-null   object 
 2   CryoSleep     8494 non-null   bool   
 3   Cabin         8494 non-null   object 
 4   Destination   8494 non-null   object 
 5   Age           8494 non-null   float64
 6   VIP           8494 non-null   object 
 7   RoomService   8494 non-null   float64
 8   FoodCourt     8494 non-null   float64
 9   ShoppingMall  8494 non-null   float64
 10  Spa           8494 non-null   float64
 11  VRDeck        8494 non-null   float64
 12  Transported   8494 non-null   bool   
dtypes: bool(2), float64(6), object(5)
memory usage: 812.9+ KB


In [28]:
# Map booleans to 0 and 1
df['CryoSleep'].replace({True:1,False:0},inplace=True)
# HomePlanet and Destination are one-hot encoded with pd.get_dummies below

In [29]:
df['Transported'].replace({True:1,False:0},inplace=True)

In [31]:
df['VIP'].replace({True:1,False:0},inplace=True)
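
In the `df.info()` output VIP is still listed as `object` after this mapping, which is what one would expect if some entries are the string `"False"` rather than a boolean. The sketch below, on a made-up toy Series, shows the difference between the two fill values; it is an illustration under that assumption, not a re-run of the notebook.

```python
import numpy as np
import pandas as pd

# Toy VIP column with one missing entry; values are invented for illustration.
vip = pd.Series([False, True, np.nan, False], dtype="object", name="VIP")

# Filling with the string "False" leaves a mix of bool and str values.
mixed = vip.fillna("False")
value_types = {type(v) for v in mixed}

# Filling with the boolean False lets the whole column be cast to 0/1.
vip_encoded = vip.fillna(False).astype(int)
print(value_types, vip_encoded.tolist())
```

With the boolean fill the column becomes numeric, so it would also show up in `df.corr()`.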

In [32]:
df2 = pd.get_dummies(df['HomePlanet'],drop_first=True)

In [33]:
df.drop("HomePlanet",axis=1,inplace=True)

In [34]:
# Append the HomePlanet dummy columns to the main frame
df = pd.concat([df, df2], axis="columns")

In [35]:
# One-hot encode Destination, then swap the original column for the dummies
df3 = pd.get_dummies(df['Destination'],drop_first=True)
df.drop("Destination",axis=1,inplace=True)
df = pd.concat([df, df3], axis="columns")
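
The encode, drop and concat steps can also be done in one `pd.get_dummies` call by passing the frame and a `columns` list. A minimal sketch on a made-up toy frame; note that the resulting column names carry a prefix (for example `HomePlanet_Europa`), unlike the unprefixed names produced above.

```python
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "Destination": ["TRAPPIST-1e", "55 Cancri e", "PSO J318.5-22"],
    "Age": [39.0, 24.0, 58.0],
})

# Encodes both columns, drops the originals, and keeps Age untouched.
encoded = pd.get_dummies(
    toy, columns=["HomePlanet", "Destination"], drop_first=True, dtype=int
)
print(encoded)
```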

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8494 entries, 0 to 8692
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PassengerId    8494 non-null   object 
 1   CryoSleep      8494 non-null   int64  
 2   Cabin          8494 non-null   object 
 3   Age            8494 non-null   float64
 4   VIP            8494 non-null   object 
 5   RoomService    8494 non-null   float64
 6   FoodCourt      8494 non-null   float64
 7   ShoppingMall   8494 non-null   float64
 8   Spa            8494 non-null   float64
 9   VRDeck         8494 non-null   float64
 10  Transported    8494 non-null   int64  
 11  Europa         8494 non-null   uint8  
 12  Mars           8494 non-null   uint8  
 13  PSO J318.5-22  8494 non-null   uint8  
 14  TRAPPIST-1e    8494 non-null   uint8  
dtypes: float64(6), int64(2), object(3), uint8(4)
memory usage: 829.5+ KB


## Feature Engineering

In [37]:
# Next: engineer features from Cabin and from the PassengerId group
df.head()

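| | PassengerId | CryoSleep | Cabin | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Europa | Mars | PSO J318.5-22 | TRAPPIST-1e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | 0 | B/0/P | 39.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 0002_01 | 0 | F/0/S | 24.0 | 0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0003_01 | 0 | A/0/S | 58.0 | 1 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0003_02 | 0 | A/0/S | 33.0 | 0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0004_01 | 0 | F/1/S | 16.0 | 0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 1 | 0 | 0 | 0 | 1 |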


In [38]:
df["Cabin"]

0          B/0/P
1          F/0/S
2          A/0/S
3          A/0/S
4          F/1/S
          ...   
8688      A/98/P
8689    G/1499/S
8690    G/1500/S
8691     E/608/S
8692     E/608/S
Name: Cabin, Length: 8494, dtype: object

In [39]:
df["Cabin"].value_counts()

G/734/S     8
G/109/P     7
B/201/P     7
G/1368/P    7
G/981/S     7
           ..
G/556/P     1
E/231/S     1
G/545/S     1
G/543/S     1
F/947/P     1
Name: Cabin, Length: 6560, dtype: int64

In [40]:
df['deck'] = df['Cabin'].apply(lambda x: x.split("/")[0])
df['num'] = df['Cabin'].apply(lambda x: x.split("/")[1])
df['side'] = df['Cabin'].apply(lambda x: x.split("/")[2])
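
The three `apply` calls can be replaced by one vectorized split. The sketch below uses a made-up toy frame and also converts `num` to a number, which the cell above leaves as a string. Unlike the lambda version, `str.split` tolerates missing cabins and simply yields missing parts.

```python
import pandas as pd

# Toy frame with one missing cabin; values are invented for illustration.
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/1/S", None, "G/734/S"]})

# expand=True returns one column per piece of the deck/num/side pattern.
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]

# Cabin numbers are numeric; missing cabins become NaN.
parts["num"] = pd.to_numeric(parts["num"])

toy = toy.join(parts)
print(toy)
```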

In [41]:
df.head()

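| | PassengerId | CryoSleep | Cabin | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Europa | Mars | PSO J318.5-22 | TRAPPIST-1e | deck | num | side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | 0 | B/0/P | 39.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 1 | B | 0 | P |
| 1 | 0002_01 | 0 | F/0/S | 24.0 | 0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 1 | 0 | 0 | 0 | 1 | F | 0 | S |
| 2 | 0003_01 | 0 | A/0/S | 58.0 | 1 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 0 | 1 | 0 | 0 | 1 | A | 0 | S |
| 3 | 0003_02 | 0 | A/0/S | 33.0 | 0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 0 | 1 | 0 | 0 | 1 | A | 0 | S |
| 4 | 0004_01 | 0 | F/1/S | 16.0 | 0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 1 | 0 | 0 | 0 | 1 | F | 1 | S |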


In [42]:
df.drop("Cabin",axis=1,inplace=True)

In [43]:
df['deck'].value_counts()

F    2794
G    2559
E     876
B     779
C     747
D     478
A     256
T       5
Name: deck, dtype: int64

In [44]:
df['num'].value_counts()

82      28
86      22
19      22
56      21
176     21
        ..
1644     1
1515     1
1639     1
1277     1
1894     1
Name: num, Length: 1817, dtype: int64

In [45]:
df['side'].value_counts()

S    4288
P    4206
Name: side, dtype: int64

In [46]:
# One-hot encode side (P is dropped, so S = 1 means Starboard), then swap the column
df4 = pd.get_dummies(df['side'],drop_first=True)
df.drop("side",axis=1,inplace=True)
df = pd.concat([df, df4], axis="columns")
df.head()

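| | PassengerId | CryoSleep | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Europa | Mars | PSO J318.5-22 | TRAPPIST-1e | deck | num | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | 0 | 39.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 1 | B | 0 | 0 |
| 1 | 0002_01 | 0 | 24.0 | 0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 1 | 0 | 0 | 0 | 1 | F | 0 | 1 |
| 2 | 0003_01 | 0 | 58.0 | 1 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 0 | 1 | 0 | 0 | 1 | A | 0 | 1 |
| 3 | 0003_02 | 0 | 33.0 | 0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 0 | 1 | 0 | 0 | 1 | A | 0 | 1 |
| 4 | 0004_01 | 0 | 16.0 | 0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 1 | 0 | 0 | 0 | 1 | F | 1 | 1 |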


In [54]:
df['gggg'] = df["PassengerId"].apply(lambda x: int(x.split("_")[0]))
df['pp'] = df["PassengerId"].apply(lambda x: int(x.split("_")[1]))
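
The raw group number `gggg` is an identifier, so a count of passengers per group may be a more usable feature than the number itself. The sketch below derives such a count on a made-up toy frame; the column name `group_size` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy ids following the gggg_pp pattern; values are invented for illustration.
toy = pd.DataFrame({"PassengerId": ["0001_01", "0003_01", "0003_02", "0004_01"]})

# Split once, then cast both pieces to integers.
ids = toy["PassengerId"].str.split("_", expand=True)
toy["gggg"] = ids[0].astype(int)
toy["pp"] = ids[1].astype(int)

# Number of passengers sharing each group id, broadcast back to every row.
toy["group_size"] = toy.groupby("gggg")["pp"].transform("size")
print(toy)
```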

In [55]:
df['gggg'].value_counts()

5133    8
4498    8
8168    8
8956    8
984     8
       ..
3487    1
3486    1
3483    1
3480    1
4638    1
Name: gggg, Length: 6118, dtype: int64

In [56]:
df['pp'].value_counts()

1    6083
2    1377
3     551
4     225
5     127
6      75
7      43
8      13
Name: pp, dtype: int64

In [58]:
df.drop("PassengerId",axis=1,inplace=True)

In [61]:
df.corr()['Transported'].sort_values()

RoomService     -0.246156
Spa             -0.217905
VRDeck          -0.205386
TRAPPIST-1e     -0.097219
Age             -0.076427
PSO J318.5-22    0.003396
ShoppingMall     0.011715
Mars             0.020332
gggg             0.022568
FoodCourt        0.048048
pp               0.065912
S                0.103775
Europa           0.176303
CryoSleep        0.459200
Transported      1.000000
Name: Transported, dtype: float64

## Normalizing the Data

In [62]:
from sklearn.preprocessing import MinMaxScaler

In [63]:
scaler = MinMaxScaler()
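
The scaler is created here but not yet applied. A minimal sketch of how it might be used on the numeric columns, assuming a made-up toy frame and a hypothetical `num_cols` list:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0, 16.0],
    "Spa": [0.0, 549.0, 6715.0, 565.0],
})
num_cols = ["Age", "Spa"]  # hypothetical list of columns to rescale

scaler = MinMaxScaler()
# fit_transform learns each column's min and max, then maps values into [0, 1].
toy[num_cols] = scaler.fit_transform(toy[num_cols])
print(toy)
```

When a train/test split is involved, the scaler is usually fitted on the training part only and then applied to the test part with `scaler.transform`, so that test-set statistics do not leak into training.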

## Splitting the Dataset

## Model Creation

## Different Models Compared

## Ensemble Learning Implemented