Notes:
Year: 2912
We've received a transmission from four lightyears away.
The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.
You are challenged to predict which passengers were transported by the anomaly, using records recovered from the spaceship’s damaged computer system.

Evaluation Metric:
Submissions are evaluated based on their classification accuracy, the percentage of predicted labels that are correct.
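
As a quick illustration of the metric, here is a minimal sketch of classification accuracy in plain Python. The function name `accuracy` and the example labels are made up for illustration and are not part of the competition tooling.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predicted labels that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one label")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only
actual = [False, True, True, False]
predicted = [False, True, False, False]
print(accuracy(actual, predicted))  # 3 of 4 labels match
```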

Submission Format:
Submissions must be a CSV file in the following format:

PassengerId,Transported
0013_01,False
0018_01,False
0019_01,False
0021_01,False
etc.
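
A file in this layout could be produced roughly as follows. This is a sketch using only the standard library; `write_submission` and the example predictions are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write (PassengerId, Transported) pairs in the submission layout."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["PassengerId", "Transported"])
    for passenger_id, transported in predictions:
        # bool(...) makes sure the label is written as True or False
        writer.writerow([passenger_id, bool(transported)])


# Hypothetical predictions, for illustration only
preds = [("0013_01", False), ("0018_01", False)]
buffer = io.StringIO()
write_submission(preds, buffer)
print(buffer.getvalue())
```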

In [1]:
# Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
# PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
# HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
# CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
# Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
# Destination - The planet the passenger will be debarking to.
# Age - The age of the passenger.
# VIP - Whether the passenger has paid for special VIP service during the voyage.
# RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
# Name - The first and last names of the passenger.
# Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
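
The two structured fields described above (`PassengerId` as gggg_pp and `Cabin` as deck/num/side) can be unpacked with simple string splits. The helper names below are illustrative assumptions, and the sketch treats a missing cabin as `None`.

```python
def parse_passenger_id(passenger_id):
    """Split 'gggg_pp' into (group, number within the group)."""
    group, number = passenger_id.split("_")
    return group, int(number)


def parse_cabin(cabin):
    """Split 'deck/num/side' into its parts; return None for a missing cabin."""
    if cabin is None:
        return None
    deck, num, side = cabin.split("/")
    return deck, int(num), side


print(parse_passenger_id("0003_02"))
print(parse_cabin("A/0/S"))
```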

In [2]:
import pandas as pd

In [3]:
train = pd.read_csv(r'../data/train.csv')
train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


## Feature Engineering

##### Checking whether all members of a group share the same home planet, destination, and VIP status

In [4]:
# Passenger Group:
# Check whether all members of a group share the same home planet, destination, and VIP status
train['PassengerGroup'] = train['PassengerId'].str[:4]
x = train[['PassengerGroup','HomePlanet']].groupby('PassengerGroup').nunique().reset_index()
x.sort_values(['HomePlanet'],ascending=False)
# x.groupby('HomePlanet').nunique()
x['HomePlanet'].value_counts(normalize=True)*100
# So presumably everyone in a group is usually from the same HomePlanet. For NaNs, we can impute HomePlanet from other passengers in the same group
del(x)

In [5]:
x = train[['PassengerGroup','Destination']].groupby('PassengerGroup').nunique().reset_index()
x.sort_values(['Destination'],ascending=False)
# x.groupby('Destination').nunique()
x['Destination'].value_counts(normalize=True)*100
del(x)

In [6]:
x = train[['PassengerGroup','VIP']].groupby('PassengerGroup').nunique().reset_index()
x.sort_values(['VIP'],ascending=False)
# x.groupby('VIP').nunique()
x['VIP'].value_counts(normalize=True)*100
del(x)

# So everyone from one group is usually from the same HomePlanet, but may have a different destination & different VIP status
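
The three per-column checks can also be run in one pass. The sketch below uses a small made-up frame (`toy`) rather than the real data, and the names `distinct` and `share` are illustrative. Note that `nunique()` ignores missing values, so a group whose only known value is shared still counts as 1.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01", "0003_02"],
    "HomePlanet": ["Europa", "Earth", "Earth", "Mars", None],
    "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e",
                    "TRAPPIST-1e", "TRAPPIST-1e"],
    "VIP": [False, False, False, True, False],
})
toy["PassengerGroup"] = toy["PassengerId"].str[:4]

# Number of distinct values per group, for all three columns at once
distinct = toy.groupby("PassengerGroup")[["HomePlanet", "Destination", "VIP"]].nunique()

# Percentage of groups by number of distinct values, one column per feature
share = pd.DataFrame({
    col: distinct[col].value_counts(normalize=True) * 100
    for col in distinct.columns
}).fillna(0)

print(distinct)
print(share)
```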

In [7]:
# Divide Cabin (deck/num/side) into 3 variables, splitting only once
cabin_parts = train['Cabin'].str.split('/',expand=True)
train['Cabin1'] = cabin_parts[0]
train['Cabin2'] = cabin_parts[1]
train['Cabin3'] = cabin_parts[2]

In [8]:
# Total Amount Spent:
train['TotalSpent'] = train['RoomService'] + train['FoodCourt'] + train['ShoppingMall'] + train['Spa'] + train['VRDeck']
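
Adding the columns with `+` yields NaN whenever any single amenity is missing, which would explain the empty TotalSpent for passenger 0006_02 in the table further down. The sketch below, on made-up rows, contrasts that behaviour with `DataFrame.sum(min_count=1)`, which skips missing values but still returns NaN when every amenity is missing. Whether treating a missing amount as zero is appropriate here is an assumption to verify, not something the data confirms.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Made-up rows: complete, one missing amenity, all amenities missing
toy = pd.DataFrame(
    [[109.0, 9.0, 25.0, 549.0, 44.0],
     [0.0, 0.0, 0.0, np.nan, 0.0],
     [np.nan] * 5],
    columns=spend_cols,
)

# Same behaviour as chaining '+': any missing amenity makes the total NaN
strict = toy[spend_cols].sum(axis=1, skipna=False)

# Skips missing amenities, but stays NaN if no amenity is known at all
lenient = toy[spend_cols].sum(axis=1, min_count=1)

print(pd.DataFrame({"strict": strict, "lenient": lenient}))
```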

In [9]:
# Group Size:
x = train[['PassengerId','PassengerGroup']].groupby('PassengerGroup').nunique().reset_index()
x.columns = ['PassengerGroup','GroupSize']
train = pd.merge(left=train, right=x, on='PassengerGroup',how='left')
del(x)
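
The same group size can be computed without a temporary frame and merge by using `groupby(...).transform`, which returns one value per original row. A minimal sketch on made-up IDs:

```python
import pandas as pd

# Made-up IDs, for illustration only
toy = pd.DataFrame({"PassengerId": ["0001_01", "0003_01", "0003_02", "0004_01"]})
toy["PassengerGroup"] = toy["PassengerId"].str[:4]

# Number of distinct passengers in each row's group, aligned to the original rows
toy["GroupSize"] = toy.groupby("PassengerGroup")["PassengerId"].transform("nunique")
print(toy)
```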

In [10]:
# Family Size:
train['Last Name'] = train['Name'].str.split(' ',expand=True)[1]
x = train[['PassengerId','Last Name']].groupby('Last Name').nunique().reset_index()
x.columns = ['Last Name','FamilySize']
train = pd.merge(left=train, right=x, on='Last Name',how='left')
del(x)

## EDA & Missing Value Treatment:

In [11]:
train.columns

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported', 'PassengerGroup', 'Cabin1', 'Cabin2', 'Cabin3',
       'TotalSpent', 'GroupSize', 'Last Name', 'FamilySize'],
      dtype='object')

In [12]:
train.head(10)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,Name,Transported,PassengerGroup,Cabin1,Cabin2,Cabin3,TotalSpent,GroupSize,Last Name,FamilySize
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,...,Maham Ofracculy,False,1,B,0,P,0.0,1,Ofracculy,1.0
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,...,Juanna Vines,True,2,F,0,S,736.0,1,Vines,4.0
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,...,Altark Susent,False,3,A,0,S,10383.0,2,Susent,6.0
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,...,Solam Susent,False,3,A,0,S,5176.0,2,Susent,6.0
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,...,Willy Santantines,True,4,F,1,S,1091.0,1,Santantines,6.0
5,0005_01,Earth,False,F/0/P,PSO J318.5-22,44.0,False,0.0,483.0,0.0,...,Sandie Hinetthews,True,5,F,0,P,774.0,1,Hinetthews,7.0
6,0006_01,Earth,False,F/2/S,TRAPPIST-1e,26.0,False,42.0,1539.0,3.0,...,Billex Jacostaffey,True,6,F,2,S,1584.0,2,Jacostaffey,7.0
7,0006_02,Earth,True,G/0/S,TRAPPIST-1e,28.0,False,0.0,0.0,0.0,...,Candra Jacostaffey,True,6,G,0,S,,2,Jacostaffey,7.0
8,0007_01,Earth,False,F/3/S,TRAPPIST-1e,35.0,False,0.0,785.0,17.0,...,Andona Beston,True,7,F,3,S,1018.0,1,Beston,5.0
9,0008_01,Europa,True,B/1/P,55 Cancri e,14.0,False,0.0,0.0,0.0,...,Erraiam Flatic,True,8,B,1,P,0.0,3,Flatic,3.0


##### Destination:

In [13]:
# Check if passengers from a group travel to the same destination
x = train[['PassengerGroup','Destination']].groupby('PassengerGroup').nunique().reset_index()
# Ideally only records with a Destination would be kept here; nunique() ignores nulls, so groups with no recorded Destination show up as 0
x = x.sort_values(['Destination'],ascending=False)
print(pd.merge(left=x.groupby('Destination').nunique(),right=x['Destination'].value_counts(normalize=True)*100,left_index=True,right_index=True))
del(x)
# In most cases (~87%) the entire group travels to the same destination, while ~1.7% of the groups have no recorded Destination.
# We can use this knowledge to impute the missing destinations

             PassengerGroup  proportion
Destination                            
0                       103    1.656748
1                      5397   86.810359
2                       668   10.744732
3                        49    0.788161


In [14]:
# Passenger groups that contain at least one passenger with no Destination:
missing_groups = train.loc[train.Destination.isnull(),'PassengerGroup'].unique()
# Known destinations of the other passengers in those groups:
x = train.loc[(~train.Destination.isnull()) & (train.PassengerGroup.isin(missing_groups)),['PassengerGroup','Destination']]
# Some of these groups may have multiple destinations. If so, take the most frequent one (ties are broken arbitrarily by sort order, not randomly)
x = pd.DataFrame(x.groupby(['PassengerGroup','Destination'],as_index=False).size())
# The sorted result must be assigned back, otherwise first() below does not pick the most frequent destination
x = x.sort_values(['PassengerGroup','size'],ascending=False)
x = x.groupby('PassengerGroup').first().reset_index()[['PassengerGroup','Destination']]
x.columns = ['PassengerGroup','Destination2']
# Caution: pd.merge defaults to how='inner', so this keeps only passengers whose group appears in x
# (hence the 259 passengers counted in In [23]); pass how='left' to keep every passenger
train = pd.merge(left=train, right=x, on='PassengerGroup')
train['Destination'] = train['Destination'].fillna(train['Destination2'])
train.drop(['Destination2'], axis=1, inplace=True)
train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,Name,Transported,PassengerGroup,Cabin1,Cabin2,Cabin3,TotalSpent,GroupSize,Last Name,FamilySize
0,0045_01,Mars,False,F/10/P,TRAPPIST-1e,21.0,False,970.0,0.0,180.0,...,Zelowl Chmad,False,45,F,10,P,1214.0,2,Chmad,4.0
1,0045_02,Mars,True,F/10/P,TRAPPIST-1e,19.0,False,0.0,0.0,0.0,...,Mass Chmad,True,45,F,10,P,0.0,2,Chmad,4.0
2,0138_01,Earth,True,G/18/P,TRAPPIST-1e,13.0,False,0.0,0.0,0.0,...,Fayene Gambs,True,138,G,18,P,0.0,2,Gambs,4.0
3,0138_02,Earth,False,E/5/P,TRAPPIST-1e,34.0,False,0.0,22.0,0.0,...,Monah Gambs,False,138,E,5,P,793.0,2,Gambs,4.0
4,0504_01,Europa,True,B/19/S,55 Cancri e,18.0,False,0.0,0.0,0.0,...,Thabius Unpasine,True,504,B,19,S,0.0,6,Unpasine,6.0
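
Imputing from group members by way of a merge depends on the join type: `pd.merge` defaults to `how='inner'`, which keeps only rows whose key appears in both frames, while `how='left'` keeps every row of the left frame. The sketch below shows the difference on made-up data; the frame `toy` and the names `lookup`, `inner` and `left` are illustrative and not taken from the notebook.

```python
import pandas as pd

# Made-up rows: group 0002 has one missing Destination, group 0003 has no known one
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02", "0002_03", "0003_01"],
    "PassengerGroup": ["0001", "0002", "0002", "0002", "0003"],
    "Destination": ["TRAPPIST-1e", "55 Cancri e", "55 Cancri e", None, None],
})

# Most frequent known Destination in each group that has a missing value
missing_groups = toy.loc[toy["Destination"].isna(), "PassengerGroup"].unique()
known = toy[toy["Destination"].notna() & toy["PassengerGroup"].isin(missing_groups)]
counts = known.groupby(["PassengerGroup", "Destination"], as_index=False).size()
counts = counts.sort_values(["PassengerGroup", "size"], ascending=[True, False])
lookup = (counts.drop_duplicates("PassengerGroup")[["PassengerGroup", "Destination"]]
                .rename(columns={"Destination": "Destination2"}))

# Inner join: passengers from groups absent from lookup are dropped
inner = pd.merge(toy, lookup, on="PassengerGroup")

# Left join: every passenger is kept, then the gaps are filled where possible
left = pd.merge(toy, lookup, on="PassengerGroup", how="left")
left["Destination"] = left["Destination"].fillna(left["Destination2"])
left = left.drop(columns="Destination2")

print(len(toy), len(inner), len(left))
print(left)
```

Group 0003 stays unfilled in the left-joined result because none of its members has a known destination, so a fallback would still be needed for such rows.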


In [15]:
train[train['Destination'].isnull()]

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,Name,Transported,PassengerGroup,Cabin1,Cabin2,Cabin3,TotalSpent,GroupSize,Last Name,FamilySize


##### HomePlanet

In [16]:
len(train[train['HomePlanet'].isnull()])

8

In [17]:
x = train[['PassengerGroup','HomePlanet']].groupby('PassengerGroup').nunique().reset_index()
x.sort_values(['HomePlanet'],ascending=False,inplace=True)
print(pd.merge(left=x.groupby('HomePlanet').nunique(),right=x['HomePlanet'].value_counts(normalize=True)*100,left_index=True,right_index=True))
del(x)

            PassengerGroup  proportion
HomePlanet                            
1                       79       100.0


In [18]:
# Passenger groups that contain at least one passenger with no HomePlanet:
missing_groups = train.loc[train.HomePlanet.isnull(),'PassengerGroup'].unique()
# Known HomePlanet values of the other passengers in those groups (one row per distinct value):
x = train.loc[(~train.HomePlanet.isnull()) & (train.PassengerGroup.isin(missing_groups)),['PassengerGroup','HomePlanet']].drop_duplicates()
x.columns = ['PassengerGroup','HomePlanet2']
train = pd.merge(left=train, right=x, on='PassengerGroup',how='left')
train['HomePlanet'] = train['HomePlanet'].fillna(train['HomePlanet2'])
train.drop(['HomePlanet2'], axis=1, inplace=True)
train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,Name,Transported,PassengerGroup,Cabin1,Cabin2,Cabin3,TotalSpent,GroupSize,Last Name,FamilySize
0,0045_01,Mars,False,F/10/P,TRAPPIST-1e,21.0,False,970.0,0.0,180.0,...,Zelowl Chmad,False,45,F,10,P,1214.0,2,Chmad,4.0
1,0045_02,Mars,True,F/10/P,TRAPPIST-1e,19.0,False,0.0,0.0,0.0,...,Mass Chmad,True,45,F,10,P,0.0,2,Chmad,4.0
2,0138_01,Earth,True,G/18/P,TRAPPIST-1e,13.0,False,0.0,0.0,0.0,...,Fayene Gambs,True,138,G,18,P,0.0,2,Gambs,4.0
3,0138_02,Earth,False,E/5/P,TRAPPIST-1e,34.0,False,0.0,22.0,0.0,...,Monah Gambs,False,138,E,5,P,793.0,2,Gambs,4.0
4,0504_01,Europa,True,B/19/S,55 Cancri e,18.0,False,0.0,0.0,0.0,...,Thabius Unpasine,True,504,B,19,S,0.0,6,Unpasine,6.0


In [19]:
len(train[train['HomePlanet'].isnull()])

0
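
Since every group here has a single HomePlanet, the fill can also be written without a merge. This sketch relies on the groupby `"first"` aggregation returning the first non-null value in each group; the data is made up and the name `group_planet` is illustrative.

```python
import pandas as pd

# Made-up rows: group 9001 has one gap, group 9003 has no known HomePlanet
toy = pd.DataFrame({
    "PassengerGroup": ["9001", "9001", "9002", "9003"],
    "HomePlanet": ["Mars", None, "Earth", None],
})

# First non-null HomePlanet of each group, broadcast back to every row
group_planet = toy.groupby("PassengerGroup")["HomePlanet"].transform("first")
toy["HomePlanet"] = toy["HomePlanet"].fillna(group_planet)
print(toy)
```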

##### CryoSleep:

In [20]:
train[train['CryoSleep'].isnull()]

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,Name,Transported,PassengerGroup,Cabin1,Cabin2,Cabin3,TotalSpent,GroupSize,Last Name,FamilySize
76,2822_02,Earth,,G/450/S,TRAPPIST-1e,5.0,,0.0,0.0,0.0,...,Salley Harverez,False,2822,G,450.0,S,,5,Harverez,7.0
159,5090_01,Earth,,G/821/P,TRAPPIST-1e,0.0,False,0.0,0.0,0.0,...,,False,5090,G,821.0,P,0.0,6,,
185,5927_04,Europa,,B/201/P,55 Cancri e,49.0,False,0.0,1264.0,0.0,...,Alcoran Sysilstict,False,5927,B,201.0,P,6842.0,7,Sysilstict,10.0
189,6039_01,Earth,,G/983/S,TRAPPIST-1e,2.0,False,0.0,0.0,0.0,...,,True,6039,G,983.0,S,0.0,2,,
194,6324_02,Earth,,G/1025/S,55 Cancri e,44.0,False,0.0,0.0,0.0,...,Murie Hinetthews,True,6324,G,1025.0,S,0.0,3,Hinetthews,7.0
197,6405_02,Earth,,,55 Cancri e,2.0,False,0.0,0.0,0.0,...,Feline Toddleton,True,6405,,,,0.0,4,Toddleton,8.0
217,7584_01,Europa,,B/288/S,TRAPPIST-1e,19.0,False,0.0,0.0,0.0,...,Arrain Swingse,True,7584,B,288.0,S,0.0,3,Swingse,6.0
232,8574_02,Mars,,D/258/S,55 Cancri e,53.0,False,1042.0,0.0,2.0,...,Yakers Welte,False,8574,D,258.0,S,1056.0,5,Welte,6.0
255,9197_01,Europa,,C/308/P,55 Cancri e,44.0,False,0.0,0.0,0.0,...,Bellus Platch,True,9197,C,308.0,P,0.0,4,Platch,9.0


In [40]:
# Counts and within-group proportions of Transported for each Cabin1 / CryoSleep combination
cabin_cryo = train.groupby(['Cabin1','CryoSleep'])[['Cabin1','Transported']]
print(pd.concat([cabin_cryo.value_counts(dropna=False),cabin_cryo.value_counts(normalize=True, dropna=False)],axis=1))

                              count  proportion
Cabin1 CryoSleep Transported                   
A      False     False            2    0.500000
                 True             2    0.500000
       True      True             4    1.000000
B      False     False            8    0.666667
                 True             4    0.333333
       True      True            23    1.000000
C      False     False           11    0.733333
                 True             4    0.266667
       True      True            12    0.923077
                 False            1    0.076923
D      False     False            8    0.800000
                 True             2    0.200000
       True      True             5    1.000000
E      False     False           11    0.733333
                 True             4    0.266667
       True      True             2    0.666667
                 False            1    0.333333
F      False     False           28    0.636364
                 True            16    0

In [41]:
# While passengers in CryoSleep are more likely to be transported, this also seems to depend on the Cabin1 variable: being in Cabin1 G or E makes it less likely for you to be transported
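
A more compact way to compare these rates is the mean of the boolean target per deck and CryoSleep status, since the mean of True/False values is the fraction that is True. The sketch below uses made-up rows, and the names `rate` and `transported_rate` are illustrative.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Cabin1": ["B", "B", "B", "E", "E", "E"],
    "CryoSleep": [True, True, False, True, True, False],
    "Transported": [True, True, False, True, False, False],
})

# Number of passengers and share transported per (deck, CryoSleep) combination
rate = (toy.groupby(["Cabin1", "CryoSleep"])["Transported"]
           .agg(["count", "mean"])
           .rename(columns={"mean": "transported_rate"})
           .reset_index())
print(rate)
```

With counts as small as those in the table above, differences between decks should be read with caution.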

In [22]:
# I am inclined to say False if it is NA, but let's check what % of their group & family is in cryosleep:
def cryo_counts(mask):
    # Number of passengers per CryoSleep value (NaN included) among the rows selected by mask
    return train.loc[mask,['PassengerId','CryoSleep']].groupby('CryoSleep', dropna=False).nunique()

# (PassengerId, last name) pairs to inspect; 5090_01 has no recorded name
for pid, last_name in [('2822_02','Harverez'), ('5090_01',None), ('6405_02','Toddleton'), ('7584_01','Swingse')]:
    print(f"For passengerid {pid}:")
    print(cryo_counts(train.PassengerGroup==pid[:4]))
    if last_name is not None:
        print(cryo_counts(train['Last Name']==last_name))

print("Overall:")
print(train[['PassengerId','CryoSleep']].groupby('CryoSleep', dropna=False).nunique())

For passengerid 2822_02:
           PassengerId
CryoSleep             
True                 4
NaN                  1
           PassengerId
CryoSleep             
True                 4
NaN                  1
For passengerid 5090_01:
           PassengerId
CryoSleep             
NaN                  1
False                3
True                 2
For passengerid 6405_02:
           PassengerId
CryoSleep             
False                2
NaN                  1
True                 1
           PassengerId
CryoSleep             
False                2
NaN                  1
True                 1
For passengerid 7584_01:
           PassengerId
CryoSleep             
NaN                  1
False                1
True                 1
           PassengerId
CryoSleep             
NaN                  1
False                1
True                 1
Overall:
           PassengerId
CryoSleep             
False              136
True               114
NaN                  9
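
The per-passenger checks above can be generalised by computing, for every passenger with a missing CryoSleep value, the share of their group members known to be in cryosleep. This is a sketch on made-up IDs; the column names `InCryo` and `GroupCryoShare` are illustrative, and how to turn the share into an imputed value is left open.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "PassengerId": ["9001_01", "9001_02", "9001_03", "9002_01", "9002_02", "9002_03"],
    "CryoSleep": [True, None, True, None, False, True],
})
toy["PassengerGroup"] = toy["PassengerId"].str[:4]

# Share of each group's known CryoSleep values that are True
known = toy.dropna(subset=["CryoSleep"]).copy()
known["InCryo"] = known["CryoSleep"].astype(bool)
group_share = known.groupby("PassengerGroup")["InCryo"].mean()

# Attach that share to the passengers whose own value is missing
missing = toy[toy["CryoSleep"].isna()].copy()
missing["GroupCryoShare"] = missing["PassengerGroup"].map(group_share)
print(missing[["PassengerId", "GroupCryoShare"]])
```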


In [23]:
train.PassengerId.nunique()

259