# Colin Ng Space Ship Compitetion
Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

# Evaluation
Submissions are evaluated based on their classification accuracy, the percentage of predicted labels that are correct.

# Outline:
1. Data Analysis
2. Data Preprocessing
3. Machine Learning
4. Submission

# 1.Data Analysis

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV,KFold,StratifiedKFold,train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier
import csv

In [2]:
train_df = pd.read_csv('train.csv')
#train_df = pd.read_csv('train.csv')
test_df  = pd.read_csv('test.csv')

In [3]:
train_df.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,8514.0,8512.0,8510.0,8485.0,8510.0,8505.0
mean,28.82793,224.687617,458.077203,173.729169,311.138778,304.854791
std,14.489021,666.717663,1611.48924,604.696458,1136.705535,1145.717189
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,47.0,76.0,27.0,59.0,46.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0


In [4]:
train_df['Transported'].value_counts()

True     4378
False    4315
Name: Transported, dtype: int64

In [5]:
train_df.isnull().sum().sort_values(ascending=False)

CryoSleep       217
ShoppingMall    208
VIP             203
HomePlanet      201
Name            200
Cabin           199
VRDeck          188
FoodCourt       183
Spa             183
Destination     182
RoomService     181
Age             179
PassengerId       0
Transported       0
dtype: int64

In [6]:
test_df.isnull().sum().sort_values(ascending=False)

FoodCourt       106
Spa             101
Cabin           100
ShoppingMall     98
Name             94
CryoSleep        93
VIP              93
Destination      92
Age              91
HomePlanet       87
RoomService      82
VRDeck           80
PassengerId       0
dtype: int64

# 2. Data Preprocessing

In [7]:
train_df = train_df.drop(['PassengerId', 'Name'], axis=1)
train_df.head(5)

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True


In [8]:
train_df[['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = train_df[['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].fillna(value=0)
train_df.isnull().sum().sort_values(ascending=False)

HomePlanet      201
Cabin           199
Destination     182
RoomService     181
Age             179
CryoSleep         0
VIP               0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Transported       0
dtype: int64

In [9]:
label = "Transported"
train_df[label] = train_df[label].astype(int)

In [10]:
train_df['VIP'] = train_df['VIP'].astype(int)
train_df['CryoSleep'] = train_df['CryoSleep'].astype(int)
train_df.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,0,B/0/P,TRAPPIST-1e,39.0,0,0.0,0.0,0.0,0.0,0.0,0
1,Earth,0,F/0/S,TRAPPIST-1e,24.0,0,109.0,9.0,25.0,549.0,44.0,1
2,Europa,0,A/0/S,TRAPPIST-1e,58.0,1,43.0,3576.0,0.0,6715.0,49.0,0
3,Europa,0,A/0/S,TRAPPIST-1e,33.0,0,0.0,1283.0,371.0,3329.0,193.0,0
4,Earth,0,F/1/S,TRAPPIST-1e,16.0,0,303.0,70.0,151.0,565.0,2.0,1


In [11]:
train_df[["Deck", "Cabin_num", "Side"]] = train_df["Cabin"].str.split("/", expand=True)
train_df = train_df.drop('Cabin', axis=1)
train_df = train_df.drop('Deck', axis=1)
train_df = train_df.drop('Side', axis=1)
train_df

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Cabin_num
0,Europa,0,TRAPPIST-1e,39.0,0,0.0,0.0,0.0,0.0,0.0,0,0
1,Earth,0,TRAPPIST-1e,24.0,0,109.0,9.0,25.0,549.0,44.0,1,0
2,Europa,0,TRAPPIST-1e,58.0,1,43.0,3576.0,0.0,6715.0,49.0,0,0
3,Europa,0,TRAPPIST-1e,33.0,0,0.0,1283.0,371.0,3329.0,193.0,0,0
4,Earth,0,TRAPPIST-1e,16.0,0,303.0,70.0,151.0,565.0,2.0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,0,55 Cancri e,41.0,1,0.0,6819.0,0.0,1643.0,74.0,0,98
8689,Earth,1,PSO J318.5-22,18.0,0,0.0,0.0,0.0,0.0,0.0,0,1499
8690,Earth,0,TRAPPIST-1e,26.0,0,0.0,0.0,1872.0,1.0,0.0,1,1500
8691,Europa,0,55 Cancri e,32.0,0,0.0,1049.0,0.0,353.0,3235.0,0,608


In [12]:
cat_cols = ['HomePlanet', 'Destination']
train_df2 = pd.get_dummies(train_df, columns=cat_cols)
train_df2

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Cabin_num,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0,39.0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,0,0,0,1
1,0,24.0,0,109.0,9.0,25.0,549.0,44.0,1,0,1,0,0,0,0,1
2,0,58.0,1,43.0,3576.0,0.0,6715.0,49.0,0,0,0,1,0,0,0,1
3,0,33.0,0,0.0,1283.0,371.0,3329.0,193.0,0,0,0,1,0,0,0,1
4,0,16.0,0,303.0,70.0,151.0,565.0,2.0,1,1,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,0,41.0,1,0.0,6819.0,0.0,1643.0,74.0,0,98,0,1,0,1,0,0
8689,1,18.0,0,0.0,0.0,0.0,0.0,0.0,0,1499,1,0,0,0,1,0
8690,0,26.0,0,0.0,0.0,1872.0,1.0,0.0,1,1500,1,0,0,0,0,1
8691,0,32.0,0,0.0,1049.0,0.0,353.0,3235.0,0,608,0,1,0,1,0,0


In [13]:
train_df2.to_csv('train_df2.csv')

In [14]:
train_df2.isnull().sum().sort_values(ascending=False)

Cabin_num                    199
RoomService                  181
Age                          179
CryoSleep                      0
VIP                            0
FoodCourt                      0
ShoppingMall                   0
Spa                            0
VRDeck                         0
Transported                    0
HomePlanet_Earth               0
HomePlanet_Europa              0
HomePlanet_Mars                0
Destination_55 Cancri e        0
Destination_PSO J318.5-22      0
Destination_TRAPPIST-1e        0
dtype: int64

In [15]:
train_df2[['Cabin_num', 'RoomService', 'Age']] = train_df2[['Cabin_num', 'RoomService', 'Age']].fillna(value=0)

# 3. Machine Learning

In [16]:
X = train_df2.drop(columns='Transported', axis=1)
Y = train_df2['Transported']

In [17]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=42)

In [18]:
X_train

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Cabin_num,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
3600,0,0.0,0,0.0,0.0,0.0,0.0,0.0,630,1,0,0,0,0,1
1262,1,17.0,0,0.0,0.0,0.0,0.0,0.0,201,1,0,0,0,0,1
8612,0,35.0,0,0.0,0.0,0.0,0.0,0.0,1483,1,0,0,0,1,0
5075,1,26.0,0,0.0,0.0,0.0,0.0,0.0,164,0,1,0,1,0,0
4758,0,13.0,0,0.0,0.0,60.0,1.0,5147.0,818,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4087,0,43.0,0,0.0,1947.0,0.0,0.0,1651.0,168,0,1,0,0,0,1
4406,0,38.0,0,183.0,203.0,0.0,110.0,374.0,951,1,0,0,0,0,1
7111,0,45.0,0,1.0,7.0,56.0,613.0,0.0,1229,1,0,0,0,1,0
426,1,24.0,0,0.0,0.0,0.0,0.0,0.0,65,1,0,0,0,1,0


In [19]:
X_test

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Cabin_num,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
3586,1,34.0,0,0.0,0.0,0.0,0.0,0.0,230,0,1,0,1,0,0
7173,1,4.0,0,0.0,0.0,0.0,0.0,0.0,1242,1,0,0,0,0,1
8559,0,25.0,0,410.0,32.0,14.0,1239.0,10.0,1766,0,0,1,0,0,1
6528,0,12.0,0,0.0,0.0,0.0,0.0,0.0,1319,0,0,1,0,0,1
7934,0,66.0,1,0.0,1828.0,1.0,1873.0,45.0,556,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3749,1,33.0,0,0.0,0.0,0.0,0.0,0.0,760,0,0,1,0,0,1
1637,0,15.0,0,1336.0,108.0,0.0,0.0,0.0,344,1,0,0,0,0,1
5820,1,14.0,0,0.0,0.0,0.0,0.0,0.0,996,1,0,0,0,1,0
5757,0,26.0,0,104.0,0.0,0.0,280.0,216.0,1165,1,0,0,0,1,0


In [20]:
Y_train

3600    1
1262    1
8612    0
5075    1
4758    0
       ..
4087    1
4406    0
7111    1
426     1
7925    1
Name: Transported, Length: 6954, dtype: int32

In [21]:
model = RandomForestClassifier()
model.fit(X_train, Y_train)

RandomForestClassifier()

In [22]:
y_pred=model.predict(X_test)

In [23]:
from sklearn.metrics import accuracy_score
score = accuracy_score(Y_test, y_pred)
print('Accuracy on Testing data : ', score)

Accuracy on Testing data :  0.7855089131684876


# 4. Submission

In [24]:
test_df = pd.read_csv('test.csv')

In [25]:
test_df = test_df.drop(['PassengerId', 'Name'], axis=1)
test_df.head(5)

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0
1,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0
2,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0
3,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0
4,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0


In [26]:
test_df[['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = test_df[['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].fillna(value=0)
test_df.isnull().sum().sort_values(ascending=False)

Cabin           100
Destination      92
Age              91
HomePlanet       87
RoomService      82
CryoSleep         0
VIP               0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
dtype: int64

In [27]:
test_df['VIP'] = train_df['VIP'].astype(int)
test_df['CryoSleep'] = train_df['CryoSleep'].astype(int)
test_df.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0,Earth,0,G/3/S,TRAPPIST-1e,27.0,0,0.0,0.0,0.0,0.0,0.0
1,Earth,0,F/4/S,TRAPPIST-1e,19.0,0,0.0,9.0,0.0,2823.0,0.0
2,Europa,0,C/0/S,55 Cancri e,31.0,1,0.0,0.0,0.0,0.0,0.0
3,Europa,0,C/1/S,TRAPPIST-1e,38.0,0,0.0,6652.0,0.0,181.0,585.0
4,Earth,0,F/5/S,TRAPPIST-1e,20.0,0,10.0,0.0,635.0,0.0,0.0


In [28]:
test_df[["Deck", "Cabin_num", "Side"]] = test_df["Cabin"].str.split("/", expand=True)
test_df = test_df.drop('Cabin', axis=1)
test_df = test_df.drop('Deck', axis=1)
test_df = test_df.drop('Side', axis=1)
test_df

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Cabin_num
0,Earth,0,TRAPPIST-1e,27.0,0,0.0,0.0,0.0,0.0,0.0,3
1,Earth,0,TRAPPIST-1e,19.0,0,0.0,9.0,0.0,2823.0,0.0,4
2,Europa,0,55 Cancri e,31.0,1,0.0,0.0,0.0,0.0,0.0,0
3,Europa,0,TRAPPIST-1e,38.0,0,0.0,6652.0,0.0,181.0,585.0,1
4,Earth,0,TRAPPIST-1e,20.0,0,10.0,0.0,635.0,0.0,0.0,5
...,...,...,...,...,...,...,...,...,...,...,...
4272,Earth,0,TRAPPIST-1e,34.0,0,0.0,0.0,0.0,0.0,0.0,1496
4273,Earth,0,TRAPPIST-1e,42.0,0,0.0,847.0,17.0,10.0,144.0,
4274,Mars,0,55 Cancri e,,0,0.0,0.0,0.0,0.0,0.0,296
4275,Europa,0,,,1,0.0,2680.0,0.0,0.0,523.0,297


In [29]:
cat_cols = ['HomePlanet', 'Destination']
test_df2 = pd.get_dummies(test_df, columns=cat_cols)
test_df2

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Cabin_num,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0,27.0,0,0.0,0.0,0.0,0.0,0.0,3,1,0,0,0,0,1
1,0,19.0,0,0.0,9.0,0.0,2823.0,0.0,4,1,0,0,0,0,1
2,0,31.0,1,0.0,0.0,0.0,0.0,0.0,0,0,1,0,1,0,0
3,0,38.0,0,0.0,6652.0,0.0,181.0,585.0,1,0,1,0,0,0,1
4,0,20.0,0,10.0,0.0,635.0,0.0,0.0,5,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4272,0,34.0,0,0.0,0.0,0.0,0.0,0.0,1496,1,0,0,0,0,1
4273,0,42.0,0,0.0,847.0,17.0,10.0,144.0,,1,0,0,0,0,1
4274,0,,0,0.0,0.0,0.0,0.0,0.0,296,0,0,1,1,0,0
4275,0,,1,0.0,2680.0,0.0,0.0,523.0,297,0,1,0,0,0,0


In [30]:
test_df2[['Cabin_num', 'RoomService', 'Age']] = test_df2[['Cabin_num', 'RoomService', 'Age']].fillna(value=0)
test_df2

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Cabin_num,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0,27.0,0,0.0,0.0,0.0,0.0,0.0,3,1,0,0,0,0,1
1,0,19.0,0,0.0,9.0,0.0,2823.0,0.0,4,1,0,0,0,0,1
2,0,31.0,1,0.0,0.0,0.0,0.0,0.0,0,0,1,0,1,0,0
3,0,38.0,0,0.0,6652.0,0.0,181.0,585.0,1,0,1,0,0,0,1
4,0,20.0,0,10.0,0.0,635.0,0.0,0.0,5,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4272,0,34.0,0,0.0,0.0,0.0,0.0,0.0,1496,1,0,0,0,0,1
4273,0,42.0,0,0.0,847.0,17.0,10.0,144.0,0,1,0,0,0,0,1
4274,0,0.0,0,0.0,0.0,0.0,0.0,0.0,296,0,0,1,1,0,0
4275,0,0.0,1,0.0,2680.0,0.0,0.0,523.0,297,0,1,0,0,0,0


In [31]:
pred = model.predict(test_df2)

In [32]:
pred

array([0, 0, 1, ..., 1, 1, 0])

In [33]:
sub_df = pd.read_csv('sample_submission.csv', index_col='PassengerId')
sub_df['total'] = 0
sub_df

Unnamed: 0_level_0,Transported,total
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
0013_01,False,0
0018_01,False,0
0019_01,False,0
0021_01,False,0
0023_01,False,0
...,...,...
9266_02,False,0
9269_01,False,0
9271_01,False,0
9273_01,False,0


In [34]:
sub_df['Transported'] = pred
sub_df

Unnamed: 0_level_0,Transported,total
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
0013_01,0,0
0018_01,0,0
0019_01,1,0
0021_01,1,0
0023_01,0,0
...,...,...
9266_02,0,0
9269_01,0,0
9271_01,1,0
9273_01,1,0


In [35]:
sub_df['Transported'] = sub_df['Transported'].astype(bool)
sub_df = sub_df.drop('total', axis=1)
sub_df

Unnamed: 0_level_0,Transported
PassengerId,Unnamed: 1_level_1
0013_01,False
0018_01,False
0019_01,True
0021_01,True
0023_01,False
...,...
9266_02,False
9269_01,False
9271_01,True
9273_01,True


In [36]:
sub_df.to_csv('ColinNgSpaceShip.csv')