# Spaceship Titanic - Prediction Model

I created this notebook to practice classification, data visualization, data manipulation, and analysis.

### Description
It is the year 2912. We've received a transmission from four light-years away, and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

### Mission
To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

### Citation & Source
Addison Howard, Ashley Chow, and Ryan Holbrook. [Spaceship Titanic](https://www.kaggle.com/competitions/spaceship-titanic/overview), 2022. Kaggle.

## Columns Description
<b>PassengerId:</b> A unique Id for each passenger. Each Id takes the form <i>gggg_pp</i> where <i>gggg</i> indicates a group the passenger is travelling with and <i>pp</i> is their number within the group. People in a group are often family members, but not always. <br>
<b>HomePlanet:</b> The planet the passenger departed from, typically their planet of permanent residence. <br>
<b>CryoSleep:</b> Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins. <br>
<b>Cabin:</b> The cabin number where the passenger is staying. Takes the form <i>deck/num/side</i>, where <i>side</i> can be either P for Port or S for Starboard. <br>
<b>Destination:</b> The planet the passenger will be debarking to. <br>
<b>Age:</b>  The age of the passenger. <br>
<b>VIP:</b> Whether the passenger has paid for special VIP service during the voyage. <br>
<b>RoomService, FoodCourt, ShoppingMall, Spa, VRDeck:</b> Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.<br>
<b>Name:</b> The first and last names of the passenger. <br>
<b>Transported:</b> Whether the passenger was transported to another dimension. This is the <b>Target</b>, the column you are trying to predict. 
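
To make the two composite formats concrete, here is a minimal sketch that splits a few `PassengerId` and `Cabin` values into their parts. The toy frame and the column names `Group`, `Member`, `Deck`, `Num` and `Side` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Toy rows using the documented formats gggg_pp and deck/num/side
toy = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "Cabin": ["A/0/S", "A/0/S", "F/1/S"],
})

# gggg_pp -> group number and position within the group
toy[["Group", "Member"]] = toy["PassengerId"].str.split("_", expand=True).astype(int)

# deck/num/side -> three separate columns
toy[["Deck", "Num", "Side"]] = toy["Cabin"].str.split("/", expand=True)
toy["Num"] = toy["Num"].astype(int)

print(toy[["Group", "Member", "Deck", "Num", "Side"]])
```

The first two passengers share group 3 and the same cabin, which is the kind of relationship the group and cabin features are meant to capture.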

## Import Libraries

In [4]:
#Data Manipulation
import pandas as pd  #collection of functions for data processing and analysis modeled after R dataframes with SQL like features
import numpy as np   #foundational package for scientific computing

#Common Model Algorithms
from sklearn.impute import KNNImputer
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Data Gathering 

### Load

In [7]:
import os
for dirname, _, filenames in os.walk('/kaggle/input/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./input/test.csv
./input/train.csv
./input/.ipynb_checkpoints\train-checkpoint.csv


In [8]:
df_train = pd.read_csv("/kaggle/input/train.csv")
df_train.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [9]:
df_test = pd.read_csv("/kaggle/input/test.csv")
df_test['Transported'] = False  # placeholder so train and test share the same columns
df_test.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | Earth | True | G/3/S | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Nelly Carsoning | False |
| 1 | 0018_01 | Earth | False | F/4/S | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | Lerome Peckers | False |
| 2 | 0019_01 | Europa | True | C/0/S | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Sabih Unhearfus | False |
| 3 | 0021_01 | Europa | False | C/1/S | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | Meratz Caltilter | False |
| 4 | 0023_01 | Earth | False | F/5/S | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | Brence Harperez | False |


In [10]:
df_train.info()
df_train.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1100 | 1165_03 | Europa | True | B/44/P | TRAPPIST-1e | 29.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Nusakab Waring | True |
| 7123 | 7588_02 | Earth | False | G/1219/P | 55 Cancri e | 15.0 | False | 0.0 | 714.0 | 0.0 | 0.0 | 0.0 | Aude Hersons | True |
| 4272 | 4549_01 | Earth | False | E/298/S | TRAPPIST-1e | 42.0 | False | 0.0 | 1.0 | 0.0 | 17.0 | 1601.0 | Leria Woody | False |
| 6160 | 6502_01 | Earth | False | G/1058/S | TRAPPIST-1e | 6.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Thelix Mcdanield | True |
| 5596 | 5957_03 | Europa | False | B/203/P | 55 Cancri e | 58.0 | True | 2.0 | 3862.0 | 0.0 | 1482.0 | 73.0 | Markard Chuble | False |


In [11]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4277 entries, 0 to 4276
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   4277 non-null   object 
 1   HomePlanet    4190 non-null   object 
 2   CryoSleep     4184 non-null   object 
 3   Cabin         4177 non-null   object 
 4   Destination   4185 non-null   object 
 5   Age           4186 non-null   float64
 6   VIP           4184 non-null   object 
 7   RoomService   4195 non-null   float64
 8   FoodCourt     4171 non-null   float64
 9   ShoppingMall  4179 non-null   float64
 10  Spa           4176 non-null   float64
 11  VRDeck        4197 non-null   float64
 12  Name          4183 non-null   object 
 13  Transported   4277 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 438.7+ KB


### Manipulation
Let's clean our data by:
1) Correcting aberrant values and outliers
2) Completing missing information
3) Creating new features for analysis
4) Converting fields to the correct format for calculations and presentation.

In [13]:
# Any Duplicates ?
print(f'Duplicates in train: {df_train.duplicated().sum()}, ({np.round(100*df_train.duplicated().sum()/len(df_train),1)}%)')
print(f'Duplicates in test: {df_test.duplicated().sum()}, ({np.round(100*df_test.duplicated().sum()/len(df_test),1)}%)')

Duplicates in train: 0, (0.0%)
Duplicates in test: 0, (0.0%)


In [14]:
# Any Missing values ?
print('Train columns with null values:\n', df_train.isnull().sum())
print("-"*10)

print('Test columns with null values:\n', df_test.isnull().sum())
print("-"*10)

df_train.describe(include = 'all')

Train columns with null values:
 PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
----------
Test columns with null values:
 PassengerId       0
HomePlanet       87
CryoSleep        93
Cabin           100
Destination      92
Age              91
VIP              93
RoomService      82
FoodCourt       106
ShoppingMall     98
Spa             101
VRDeck           80
Name             94
Transported       0
dtype: int64
----------


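| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8693 | 8492 | 8476 | 8494 | 8511 | 8514.0 | 8490 | 8512.0 | 8510.0 | 8485.0 | 8510.0 | 8505.0 | 8493 | 8693 |
| unique | 8693 | 3 | 2 | 6560 | 3 | NaN | 2 | NaN | NaN | NaN | NaN | NaN | 8473 | 2 |
| top | 0001_01 | Earth | False | G/734/S | TRAPPIST-1e | NaN | False | NaN | NaN | NaN | NaN | NaN | Gollux Reedall | True |
| freq | 1 | 4602 | 5439 | 8 | 5915 | NaN | 8291 | NaN | NaN | NaN | NaN | NaN | 2 | 4378 |
| mean | NaN | NaN | NaN | NaN | NaN | 28.82793 | NaN | 224.687617 | 458.077203 | 173.729169 | 311.138778 | 304.854791 | NaN | NaN |
| std | NaN | NaN | NaN | NaN | NaN | 14.489021 | NaN | 666.717663 | 1611.48924 | 604.696458 | 1136.705535 | 1145.717189 | NaN | NaN |
| min | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | NaN | 19.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | 27.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | 38.0 | NaN | 47.0 | 76.0 | 27.0 | 59.0 | 46.0 | NaN | NaN |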


In [15]:
df_train.nunique()

PassengerId     8693
HomePlanet         3
CryoSleep          2
Cabin           6560
Destination        3
Age               80
VIP                2
RoomService     1273
FoodCourt       1507
ShoppingMall    1115
Spa             1327
VRDeck          1306
Name            8473
Transported        2
dtype: int64

### What do we have so far?
We can separate the columns into the following categories:
<br><b>Continuous:</b> RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
<br><b>Categorical:</b> HomePlanet, CryoSleep, Destination, VIP
<br><b>Qualitative:</b> PassengerId, Cabin, Name

<b>Age</b> is a continuous variable, but we will split it into ranges, so we leave it out of the Continuous category.<br>
<b>Transported</b> is a categorical variable, but it is the target, so we leave it out of the Categorical category.<br>
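
The age ranges are not built in any of the cells shown here, so the following is only a sketch of how they could be created with `pd.cut`. The bin edges, the labels and the name `age_group` are assumptions chosen for illustration, not values from this notebook.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.0, 12.0, 17.0, 18.0, 39.0, 58.0, np.nan], name="Age")

# Hypothetical age ranges; right=False makes each bin [low, high)
edges = [0, 13, 18, 30, 50, np.inf]
labels = ["0-12", "13-17", "18-29", "30-49", "50+"]

age_group = pd.cut(ages, bins=edges, labels=labels, right=False)

# Missing ages stay missing, so they can still be imputed or flagged later
print(age_group.tolist())
```

The resulting categorical column could then be one-hot encoded in the same way as HomePlanet and Destination.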

Every column in the <b>Continuous</b> category records how much a passenger spent on an amenity, so let's call this group <b>Spending</b>.

In [17]:
spending_col=['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

In [18]:
data1 = df_train.copy(deep = True)
df = pd.concat([data1, df_test])

In [19]:
df['Spending']=df[spending_col].sum(axis=1)
df['No_Spending']=(df['Spending']==0).astype(int)
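
One detail worth checking: `sum(axis=1)` skips missing values by default, so a passenger whose amenity bills are all missing gets a `Spending` of 0 and is flagged as `No_Spending`. The toy frame below (made-up values, only two amenity columns) illustrates this, together with pandas' `min_count` option as one possible way to keep such rows missing instead.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "RoomService": [109.0, 0.0, np.nan, np.nan],
    "Spa":         [549.0, 0.0, 10.0,   np.nan],
})

# Default behaviour (as in the cell above): a missing bill counts as 0
spending = toy.sum(axis=1)

# Alternative: require at least one non-missing bill, otherwise return NaN
spending_strict = toy.sum(axis=1, min_count=1)

print(spending.tolist())
print(spending_strict.tolist())
```

Whether the last row should count as "no spending" or as "unknown" is a modeling choice; the notebook uses the default.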

In [20]:
df['Group'] = df['PassengerId'].str.split('_').str[0].astype(int)
df['Group_size'] = df['Group'].map(df['Group'].value_counts())  # number of passengers sharing the group
df['Solo'] = (df['Group_size'] == 1).astype(int)

In [21]:
# Split Cabin into its three parts (Deck, Num, Side) and use them as features for the model
df[['Deck','Num','Side']] = df['Cabin'].str.split('/',expand=True)

df.drop('Cabin', axis=1, inplace=True)

df['Deck'] = df['Deck'].map({'A':0, 'B':1, 'C':2, 'D':3, 'E':4, 'F':5, 'G':6,'T':7})
df['Side'] = df['Side'].map({'U':-1, 'P':1, 'S':2})

In [22]:
# Use the family name to estimate family size as a feature for the model
df['Name'] = df['Name'].fillna('Unknown Unknown')
df['Family_Name'] = df['Name'].str.split().str[-1]
df.drop('Name', axis=1, inplace=True)

df['Family_size'] = df['Family_Name'].map(df['Family_Name'].value_counts())
df.drop('Family_Name', axis=1, inplace=True)
# Sizes above 100 are treated as the 'Unknown' placeholder group, not a real family, so mark them as missing
df.loc[df['Family_size']>100,'Family_size']=np.nan


In [23]:
df['Destination'] = df['Destination'].fillna('Unknown')
df['HomePlanet'] = df['HomePlanet'].fillna('Unknown')

category_cols = ['HomePlanet','Destination']

for col in category_cols:
    df = pd.concat([df, pd.get_dummies(df[col], prefix = col)], axis = 1)
    df.drop(col, axis=1, inplace=True)

In [24]:
df.drop('PassengerId', axis=1, inplace=True)

## Modeling

In [26]:
impute_list = ['CryoSleep', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Deck', 'Num', 'Side', 'Family_size']
rest = [col for col in df.columns if col not in impute_list]  # keeps a stable column order
df_rest = df[rest]
imp = KNNImputer()  # fills in missing values from the nearest neighbours
df_imputed = imp.fit_transform(df[impute_list])
df_imputed = pd.DataFrame(df_imputed, columns = impute_list)
df = pd.concat([df_rest.reset_index(drop = True), df_imputed.reset_index(drop = True)], axis = 1)
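
The cell above fits the imputer on the training and test rows together. If you would rather keep the test rows out of the fitting step, one option is to fit on the training rows only and reuse the fitted imputer on the test rows. The sketch below uses made-up numbers and hypothetical names (`train_part`, `test_part`); `n_neighbors=2` is chosen only because the toy data is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

cols = ["Age", "Spa", "VRDeck"]
train_part = pd.DataFrame(
    [[20.0, 0.0, 5.0],
     [22.0, np.nan, 7.0],
     [40.0, 500.0, 300.0],
     [np.nan, 480.0, 310.0],
     [41.0, 520.0, 290.0]],
    columns=cols,
)
test_part = pd.DataFrame(
    [[21.0, np.nan, 6.0],
     [np.nan, 510.0, 305.0]],
    columns=cols,
)

imp = KNNImputer(n_neighbors=2)
# Neighbours are learned from the training rows only
train_filled = pd.DataFrame(imp.fit_transform(train_part), columns=cols)
# Test rows are filled using those training neighbours
test_filled = pd.DataFrame(imp.transform(test_part), columns=cols)

print(train_filled)
print(test_filled)
```

Observed values are left untouched; only the missing cells are replaced by the average of the nearest training rows.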

In [27]:
df['Num_0-299']=(df['Num']<300).astype(int) 
df['Num_300-599']=((df['Num']>=300) & (df['Num']<600)).astype(int)
df['Num_600-899']=((df['Num']>=600) & (df['Num']<900)).astype(int)
df['Num_900-1199']=((df['Num']>=900) & (df['Num']<1200)).astype(int)
df['Num_1200-1499']=((df['Num']>=1200) & (df['Num']<1500)).astype(int)
df['Num_1500-1799']=((df['Num']>=1500) & (df['Num']<1800)).astype(int)
df['Num_1800+']=(df['Num']>=1800).astype(int)
df.drop('Num', axis=1, inplace=True)

In [28]:
df.corr()['Transported'].sort_values(ascending=False)

Transported                  1.000000
No_Spending                  0.340510
CryoSleep                    0.324441
HomePlanet_Europa            0.131977
Destination_55 Cancri e      0.083625
Side                         0.074527
Group_size                   0.064970
Num_900-1199                 0.050854
Num_0-299                    0.047382
FoodCourt                    0.034706
Num_600-899                  0.021155
Group                        0.014628
HomePlanet_Unknown           0.006403
HomePlanet_Mars              0.005643
ShoppingMall                 0.004107
Destination_PSO J318.5-22    0.000760
Destination_Unknown         -0.000554
Num_1800+                   -0.008295
VIP                         -0.018645
Num_1500-1799               -0.033234
Family_size                 -0.038663
Num_1200-1499               -0.044645
Age                         -0.049919
Num_300-599                 -0.063042
Destination_TRAPPIST-1e     -0.072731
Solo                        -0.077944
Deck        

In [29]:
df_train, df_test = df[:df_train.shape[0]], df[df_train.shape[0]:]
df_test = df_test.drop(columns = 'Transported')
df_train.shape, df_test.shape

((8693, 32), (4277, 31))

In [30]:
X = df_train.drop(columns = 'Transported')
y = df_train['Transported']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1)

Model = [RandomForestClassifier(), DecisionTreeClassifier(), LogisticRegression(), XGBClassifier(), LGBMClassifier()]

In [31]:
for alg in Model:
    alg.fit(X_train, y_train)
    pred = alg.predict(X_test)
    print(f'For the model {alg} the accuracy score is {accuracy_score(y_test, pred)}')
    print('')

For the model RandomForestClassifier() the accuracy score is 0.79700977573318

For the model DecisionTreeClassifier() the accuracy score is 0.753306497987349



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


For the model LogisticRegression() the accuracy score is 0.7837837837837838

For the model XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              feature_weights=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=None,
              n_jobs=None, num_parallel_tree=None, ...) the accuracy score is 0.7981598619896493

[LightGBM] [Info] Number of positive: 3482, number of negative: 3472
[LightGBM] [Info] Auto-choosing row-w

## Models Accuracy Result
The best model so far for this prediction is the LGBMClassifier, with an accuracy of about 0.82. <br>

<b>I will keep updating this notebook with more models to look for a higher accuracy.</b>
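
When more models are added, a single 80/20 split can make close scores hard to rank. As a sketch of one way to compare candidates, the block below runs 5-fold cross-validation on synthetic data from `make_classification` (a stand-in for `X` and `y`, not the competition data). It also wraps `LogisticRegression` in a `StandardScaler` pipeline, which is the kind of remedy the convergence warning above suggests. The `candidates` dictionary and its keys are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "scaled_logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # One accuracy value per fold; the mean is steadier than a single split
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

The same loop should work with `XGBClassifier` and `LGBMClassifier` added to the dictionary, since both follow the scikit-learn estimator interface.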

In [33]:
best_model = LGBMClassifier() ## Accuracy of 0.82
best_model.fit(X_train, y_train)
df_dummy = pd.read_csv('/kaggle/input/test.csv')
pred = best_model.predict(df_test)

final = pd.DataFrame()
final['PassengerId'] = df_dummy['PassengerId']
final['Transported'] = pred

final.to_csv('/kaggle/output/submission.csv', index=False)

[LightGBM] [Info] Number of positive: 3482, number of negative: 3472
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001042 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2006
[LightGBM] [Info] Number of data points in the train set: 6954, number of used features: 31
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500719 -> initscore=0.002876
[LightGBM] [Info] Start training from score 0.002876
