# Space Titanic 🚀🚢🧊👽

**File and Data Field Descriptions**

train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
- PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
- CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- Destination - The planet the passenger will be debarking to.
- Age - The age of the passenger.
- VIP - Whether the passenger has paid for special VIP service during the voyage.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- Name - The first and last names of the passenger.
- Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

sample_submission.csv - A submission file in the correct format.
- PassengerId - Id for each passenger in the test set.
- **Transported** - The target. For each passenger, predict either True or False. This is the column we care about.
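
The `gggg_pp` format of PassengerId described above means the travel group can be recovered with a string split. A minimal sketch on made-up IDs; the column names `Group`, `NumInGroup` and `GroupSize` are illustrative and not part of the dataset:

```python
import pandas as pd

# Toy IDs in the gggg_pp format (made-up values).
df = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"]})

# Split on the underscore: group id on the left, position in the group on the right.
parts = df["PassengerId"].str.split("_", expand=True)
df["Group"] = parts[0]
df["NumInGroup"] = parts[1].astype(int)

# Number of passengers sharing each group id.
df["GroupSize"] = df.groupby("Group")["PassengerId"].transform("count")

print(df)
```

Whether group size helps the model is an open question; this only shows how the feature could be derived.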


## 1. Setup

In [None]:
#imports
import numpy as np
import pandas as pd
import os
import time

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import EasyEnsembleClassifier

import matplotlib.pyplot as plt
import seaborn as sns

random_state=7

In [None]:
#list files
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
#general path
path = '/kaggle/input/spaceship-titanic/'

#get data
sample_submission = pd.read_csv(path + 'sample_submission.csv')
train = pd.read_csv(path + 'train.csv')
test = pd.read_csv(path + 'test.csv')

### 1.1 train data

In [None]:
#train
train.head()

**Split up the cabin column into 3 columns: deck/num/side --> deck | num | side**

In [None]:
# split column and add new columns to df
train[['Deck', 'Num', 'Side']] = train['Cabin'].str.split('/', expand=True)
# display the dataframe
train.head()

**Are there "unlucky" names? Split each name into first and last name and keep the initials to find out.**

In [None]:
#scratch copy with only the name and the target, to explore the idea
unlucky_names = train[['Name','Transported']].copy()
unlucky_names[['First_Name', 'Last_Name']] = unlucky_names['Name'].str.split(' ', n=1, expand=True)
#.str[0] keeps missing names as NaN (astype(str) would turn NaN into the text 'nan', i.e. the letter 'n')
unlucky_names['First_Name_FirstLetter'] = unlucky_names['First_Name'].str[0]
unlucky_names['Last_Name_FirstLetter'] = unlucky_names['Last_Name'].str[0]
unlucky_names.head()

In [None]:
#add the same name features to the train dataframe
train[['First_Name', 'Last_Name']] = train['Name'].str.split(' ', n=1, expand=True)
#.str[0] keeps missing names as NaN so they can be imputed explicitly later
train['First_Name_FirstLetter'] = train['First_Name'].str[0]
train['Last_Name_FirstLetter'] = train['Last_Name'].str[0]
# display the dataframe
train.head()
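
One way to check for "unlucky" initials is to compare the share of transported passengers per letter. A small sketch on made-up rows; on the real data this would run on `train` while `Transported` is still boolean, and `min_count` is a hypothetical cutoff that hides letters with too few passengers to say anything:

```python
import pandas as pd

# Made-up rows standing in for the train dataframe.
toy = pd.DataFrame({
    "Last_Name_FirstLetter": ["A", "A", "B", "B", "B", "C"],
    "Transported": [True, False, True, True, False, True],
})

min_count = 2  # hypothetical cutoff: ignore letters with fewer passengers

# Share of transported passengers and number of passengers per initial.
rates = (
    toy.groupby("Last_Name_FirstLetter")["Transported"]
       .agg(rate="mean", count="size")
)
rates = rates[rates["count"] >= min_count].sort_values("rate", ascending=False)
print(rates)
```

A rate far from the overall average would only hint at a pattern; with roughly 8700 rows split over many letters, small differences are likely noise.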

In [None]:
#shape of the dataframe. we have 8693 rows to work with
train.shape

In [None]:
train.dtypes

In [None]:
#count the unique values in each column
train.nunique()

In [None]:
#unique values of every column except PassengerId and Age
for col in ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP',
            'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
            'Name', 'Transported', 'Deck', 'Num', 'Side',
            'First_Name', 'Last_Name', 'First_Name_FirstLetter', 'Last_Name_FirstLetter']:
    print(train[col].unique())

In [None]:
#drop the PassengerId, Cabin, and Name columns: Cabin and Name are already split into new columns, and PassengerId won't be helpful to the models.
train.drop(['PassengerId', 'Cabin','Name'], axis=1, inplace=True)

In [None]:
#check for missing/null values
train.isnull().sum()

**Deal with missing values:**

In [None]:
#Imputation
#replace missing values with the median for float64 columns and the mode for object columns
#the result is assigned back to the column; fillna(inplace=True) on a single column triggers a chained-assignment warning in recent pandas
#the fill values stay in variables because the test data is imputed with them later

#HomePlanet: 201 missing
HomePlanet = train['HomePlanet'].mode()
train['HomePlanet'] = train['HomePlanet'].fillna(HomePlanet[0])

#CryoSleep: 217 missing
CryoSleep = train['CryoSleep'].mode()
train['CryoSleep'] = train['CryoSleep'].fillna(CryoSleep[0])

#Destination: 182 missing
Destination = train['Destination'].mode()
train['Destination'] = train['Destination'].fillna(Destination[0])

#Age: 179 missing
Age_med = train['Age'].median()
train['Age'] = train['Age'].fillna(Age_med)

#VIP: 203 missing
VIP = train['VIP'].mode()
train['VIP'] = train['VIP'].fillna(VIP[0])

#RoomService: 181 missing
RoomService_med = train['RoomService'].median()
train['RoomService'] = train['RoomService'].fillna(RoomService_med)

#FoodCourt: 183 missing
FoodCourt_med = train['FoodCourt'].median()
train['FoodCourt'] = train['FoodCourt'].fillna(FoodCourt_med)

#ShoppingMall: 208 missing
ShoppingMall_med = train['ShoppingMall'].median()
train['ShoppingMall'] = train['ShoppingMall'].fillna(ShoppingMall_med)

#Spa: 183 missing
Spa_med = train['Spa'].median()
train['Spa'] = train['Spa'].fillna(Spa_med)

#VRDeck: 188 missing
VRDeck_med = train['VRDeck'].median()
train['VRDeck'] = train['VRDeck'].fillna(VRDeck_med)

#Deck, Num, Side: 199 missing each (they all come from Cabin)
Deck_mode = train['Deck'].mode()
train['Deck'] = train['Deck'].fillna(Deck_mode[0])

Num_mode = train['Num'].mode()
train['Num'] = train['Num'].fillna(Num_mode[0])

Side_mode = train['Side'].mode()
train['Side'] = train['Side'].fillna(Side_mode[0])

#First_Name, Last_Name: placeholder string for missing names
train['First_Name'] = train['First_Name'].fillna('NaN')
train['Last_Name'] = train['Last_Name'].fillna('NaN')

#First_Name_FirstLetter, Last_Name_FirstLetter: placeholder for missing initials
train['First_Name_FirstLetter'] = train['First_Name_FirstLetter'].fillna('ZZ')
train['Last_Name_FirstLetter'] = train['Last_Name_FirstLetter'].fillna('ZZ')
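
The imputation cell above repeats one pattern: median for numeric columns, most frequent value for everything else. A compact sketch of the same idea, learning the fill values from one frame so they can be reused on another. The helper name `learn_fill_values` and the toy frames are hypothetical, not part of this notebook:

```python
import numpy as np
import pandas as pd

def learn_fill_values(df):
    """Return {column: fill value}: median for numeric columns, mode otherwise."""
    fills = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fills[col] = df[col].median()
        else:
            fills[col] = df[col].mode()[0]
    return fills

# Made-up frames standing in for train and test.
toy_train = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "HomePlanet": ["Earth", "Earth", "Mars", None],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 25.0],
    "HomePlanet": [None, "Europa"],
})

# Learn the values on the training frame only, then apply them to both frames.
fills = learn_fill_values(toy_train)
toy_train = toy_train.fillna(fills)
toy_test = toy_test.fillna(fills)

print(fills)
print(toy_test)
```

Passing a dictionary to `DataFrame.fillna` fills each column with its own value, which keeps the train and test steps identical by construction.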


In [None]:
train.head()

In [None]:
#check for missing/null values - good all gone
train.isnull().sum()

In [None]:
#unique values again, now that the missing values are filled
for col in ['HomePlanet', 'CryoSleep', 'Destination', 'VIP',
            'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
            'Transported', 'Deck', 'Num', 'Side',
            'First_Name', 'Last_Name', 'First_Name_FirstLetter', 'Last_Name_FirstLetter']:
    print(train[col].unique())

In [None]:
train.describe()

**Get numerical values for all the categorical data (for now no dummy variables) --> for model building**

In [None]:
#helper: build both lookup directions for a sequence of distinct values
def make_codes(values):
    code_to_value = dict(enumerate(values))                     #code -> original value
    value_to_code = {v: k for k, v in code_to_value.items()}    #original value -> code
    return code_to_value, value_to_code

#Transported --> Target Variable
#fixed mapping so that 1 always means True, regardless of which value appears first in the data
train['Transported'] = train['Transported'].astype(str)
Transported_dict1, Transported_dict = make_codes(['False', 'True'])
train['Transported'] = train['Transported'].map(Transported_dict)

#categorical columns: codes follow the order of first appearance
#map is equivalent to replace here because every value in the column is a key of its dictionary
HomePlanet_dict1, HomePlanet_dict = make_codes(train['HomePlanet'].unique())
train['HomePlanet'] = train['HomePlanet'].map(HomePlanet_dict)

CryoSleep_dict1, CryoSleep_dict = make_codes(train['CryoSleep'].unique())
train['CryoSleep'] = train['CryoSleep'].map(CryoSleep_dict)

Destination_dict1, Destination_dict = make_codes(train['Destination'].unique())
train['Destination'] = train['Destination'].map(Destination_dict)

VIP_dict1, VIP_dict = make_codes(train['VIP'].unique())
train['VIP'] = train['VIP'].map(VIP_dict)

Deck_dict1, Deck_dict = make_codes(train['Deck'].unique())
train['Deck'] = train['Deck'].map(Deck_dict)

Num_dict1, Num_dict = make_codes(train['Num'].unique())
train['Num'] = train['Num'].map(Num_dict)

Side_dict1, Side_dict = make_codes(train['Side'].unique())
train['Side'] = train['Side'].map(Side_dict)

#------------------
#name columns: codes follow alphabetical order
First_Name_dict1, First_Name_dict = make_codes(np.sort(train['First_Name'].unique()))
train['First_Name'] = train['First_Name'].map(First_Name_dict)

Last_Name_dict1, Last_Name_dict = make_codes(np.sort(train['Last_Name'].unique()))
train['Last_Name'] = train['Last_Name'].map(Last_Name_dict)

First_Name_FirstLetter_dict1, First_Name_FirstLetter_dict = make_codes(np.sort(train['First_Name_FirstLetter'].unique()))
train['First_Name_FirstLetter'] = train['First_Name_FirstLetter'].map(First_Name_FirstLetter_dict)

Last_Name_FirstLetter_dict1, Last_Name_FirstLetter_dict = make_codes(np.sort(train['Last_Name_FirstLetter'].unique()))
train['Last_Name_FirstLetter'] = train['Last_Name_FirstLetter'].map(Last_Name_FirstLetter_dict)
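
The encoding cell above keeps two dictionaries per column: one from code to value and one from value to code. A minimal sketch on a made-up column shows how the pair round-trips; `code_to_value` and `value_to_code` are illustrative names that play the roles of `HomePlanet_dict1` and `HomePlanet_dict`:

```python
import pandas as pd

# Made-up column standing in for a categorical feature.
col = pd.Series(["Europa", "Earth", "Earth", "Mars"])

# code -> value, numbered in order of first appearance
code_to_value = dict(enumerate(col.unique()))
# value -> code, the inverse lookup used for encoding
value_to_code = {value: code for code, value in code_to_value.items()}

encoded = col.map(value_to_code)      # strings to integer codes
decoded = encoded.map(code_to_value)  # integer codes back to strings

print(value_to_code)
print(encoded.tolist())
print(decoded.tolist())
```

The integer codes impose an arbitrary order on the categories. Tree models such as a random forest can usually cope with that, while a linear model would generally need dummy variables instead.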

In [None]:
#example of dictionaries created
print(Transported_dict1)
print(Transported_dict)

In [None]:
#example of dictionaries created
print(HomePlanet_dict1)
print(HomePlanet_dict)

In [None]:
#example of dictionaries created
print(Deck_dict1)
print(Deck_dict)

In [None]:
#example of dictionaries created
print(Last_Name_FirstLetter_dict1)
print(Last_Name_FirstLetter_dict)

In [None]:
train.head()

In [None]:
print(train.columns.tolist())

In [None]:
train = train[['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Num', 'Side','RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck','Age','First_Name', 'Last_Name', 'First_Name_FirstLetter', 'Last_Name_FirstLetter','Transported']]

In [None]:
train.head()

In [None]:
#plot each variable against Transported with a simple for loop
sns.set_theme(style="whitegrid")

#put the column names into lists
categorical_variables = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side']
numeric_variables = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck','Age','Num','First_Name', 'Last_Name', 'First_Name_FirstLetter', 'Last_Name_FirstLetter'] #Num and the name codes are really categorical, but we treat them as numeric here

In [None]:
#for each of the categorical_variables
for variable in categorical_variables:
    #g = sns.catplot(data=train, kind="count", x='Transported', col=variable, hue='Transported', col_wrap=8, height=2, aspect=.5)
    ax = sns.countplot(x=variable, hue="Transported", data=train)
    plt.show()

In [None]:
#for each of the numeric_variables
for variable in numeric_variables:
    #ax = sns.histplot(data=train, x=variable, hue='Transported', kde=True)
    sns.boxplot(data=train, x='Transported', y=variable)
    plt.show()

### 1.2 test data

In [None]:
#test
test.head()

In [None]:
test.shape

In [None]:
test.isnull().sum()

In [None]:
submission = test[['PassengerId']].copy()
submission.head()

In [None]:
# split Cabin column and add new columns to df
test[['Deck', 'Num', 'Side']] = test['Cabin'].str.split('/', expand=True)
# display the dataframe
test.head()

In [None]:
#same name features as in train
test[['First_Name', 'Last_Name']] = test['Name'].str.split(' ', n=1, expand=True)
#.str[0] keeps missing names as NaN (astype(str) would turn NaN into the text 'nan', i.e. the letter 'n')
test['First_Name_FirstLetter'] = test['First_Name'].str[0]
test['Last_Name_FirstLetter'] = test['Last_Name'].str[0]
# display the dataframe
test.head()

In [None]:
#drop the passengerid, cabin, and name column because they won't be helpful to the models.
test.drop(['PassengerId', 'Cabin', 'Name'], axis=1, inplace=True)

In [None]:
#Imputation - use the medians and modes from the TRAIN DATA, not from test
#train median for float64 columns, train mode for object columns
#the result is assigned back to the column; fillna(inplace=True) on a single column triggers a chained-assignment warning in recent pandas

test['HomePlanet'] = test['HomePlanet'].fillna(HomePlanet[0])
test['CryoSleep'] = test['CryoSleep'].fillna(CryoSleep[0])
test['Destination'] = test['Destination'].fillna(Destination[0])
test['Age'] = test['Age'].fillna(Age_med)
test['VIP'] = test['VIP'].fillna(VIP[0])
test['RoomService'] = test['RoomService'].fillna(RoomService_med)
test['FoodCourt'] = test['FoodCourt'].fillna(FoodCourt_med)
test['ShoppingMall'] = test['ShoppingMall'].fillna(ShoppingMall_med)
test['Spa'] = test['Spa'].fillna(Spa_med)
test['VRDeck'] = test['VRDeck'].fillna(VRDeck_med)
test['Deck'] = test['Deck'].fillna(Deck_mode[0])
test['Num'] = test['Num'].fillna(Num_mode[0])
test['Side'] = test['Side'].fillna(Side_mode[0])

#missing names get the same placeholders as in train ('NaN' for names, 'ZZ' for initials)
test['First_Name'] = test['First_Name'].fillna('NaN')
test['Last_Name'] = test['Last_Name'].fillna('NaN')
test['First_Name_FirstLetter'] = test['First_Name_FirstLetter'].fillna('ZZ')
test['Last_Name_FirstLetter'] = test['Last_Name_FirstLetter'].fillna('ZZ')

In [None]:
test.isnull().sum()

**Get numerical values for all the categorical data (for now no dummy variables)**
* Apply the dictionary mappings created in train.

In [None]:
#categorical columns: apply the dictionaries built on train
#note: replace leaves any value that is not a key of the dictionary unchanged
test['HomePlanet'] = test['HomePlanet'].replace(HomePlanet_dict)
test['CryoSleep'] = test['CryoSleep'].replace(CryoSleep_dict)
test['Destination'] = test['Destination'].replace(Destination_dict)
test['VIP'] = test['VIP'].replace(VIP_dict)
test['Deck'] = test['Deck'].replace(Deck_dict)
test['Num'] = test['Num'].replace(Num_dict)
test['Side'] = test['Side'].replace(Side_dict)

#-----
#names: map gives NaN for names that never appear in train; those get the placeholder code 999999
test['First_Name'] = test['First_Name'].map(First_Name_dict)
test['First_Name'] = test['First_Name'].fillna(999999)

test['Last_Name'] = test['Last_Name'].map(Last_Name_dict)
test['Last_Name'] = test['Last_Name'].fillna(999999)

#initials
test['First_Name_FirstLetter'] = test['First_Name_FirstLetter'].replace(First_Name_FirstLetter_dict)
test['Last_Name_FirstLetter'] = test['Last_Name_FirstLetter'].replace(Last_Name_FirstLetter_dict)
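
The cell above uses both `.replace` and `.map`. The two behave differently for values that are missing from the dictionary, which matters whenever the test data contains a category that never appears in train. A small sketch with a made-up mapping; `-1` is an assumed placeholder code, not the notebook's choice:

```python
import pandas as pd

mapping = {"Earth": 0, "Mars": 1}  # stands in for a dictionary learned on train
# "Venus" is a made-up category that the mapping does not know.
test_col = pd.Series(["Mars", "Venus"], dtype=object)

replaced = test_col.replace(mapping)  # unknown values are left untouched
mapped = test_col.map(mapping)        # unknown values become NaN

print(replaced.tolist())
print(mapped.tolist())

# With map, the unknown values are easy to find and to fill with an explicit code.
filled = mapped.fillna(-1).astype(int)
print(filled.tolist())
```

With `.replace`, an unseen category silently stays a string inside an otherwise numeric column, so `.map` followed by `fillna` tends to make the problem visible earlier.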

In [None]:
test.head()

### 1.3 sample submission

* `sample_submission.csv` contains the PassengerIds from the test dataframe, so either one can supply the Id column for the submission.

In [None]:
#sample_submission
sample_submission.head()

In [None]:
sample_submission.shape

## 2. Random Forest

In [None]:
#select Transported as target variable:
y = train['Transported']

#features: every column except the target and the name-derived columns
X = train.drop(['Transported','First_Name', 'Last_Name','First_Name_FirstLetter','Last_Name_FirstLetter'],axis=1)

In [None]:
#now make the train-test splits
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=random_state)
print('Dimensions: \n x_train: {} \n x_test: {} \n y_train: {} \n y_test: {}'.format(x_train.shape, x_test.shape, y_train.shape, y_test.shape))

In [None]:
x_train

In [None]:
y_train

In [None]:
#create true negative, false positive, false negative, and true positive 
def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

In [None]:
#Setup classifier scorers
scorers = {'Accuracy': 'accuracy', 
           'roc_auc': 'roc_auc', 
           'Sensitivity':'recall', 
           'precision':'precision',
            'tp': make_scorer(tp), 
           'tn': make_scorer(tn),
           'fp': make_scorer(fp), 
           'fn': make_scorer(fn)}          

In [None]:
#change this name here to change the print name
classifier_name = 'Random Forest'

start_ts = time.time()
#5-fold cross-validation on the full training data; swap in another classifier here to compare
clf = RandomForestClassifier(n_estimators=600, max_depth=20, min_samples_split=20, criterion='entropy', random_state=random_state)
#clf = EasyEnsembleClassifier(n_estimators=10)
scores = cross_validate(clf, X, y, scoring=scorers, cv=5)

#mean confusion-matrix counts over the folds
tp_mean = scores['test_tp'].mean()
tn_mean = scores['test_tn'].mean()
fp_mean = scores['test_fp'].mean()
fn_mean = scores['test_fn'].mean()

Sensitivity = round(100 * tp_mean / (tp_mean + fn_mean), 1)   #TP/(TP+FN), also recall
Specificity = round(100 * tn_mean / (tn_mean + fp_mean), 1)   #TN/(TN+FP)
PPV = round(100 * tp_mean / (tp_mean + fp_mean), 1)           #TP/(TP+FP), also precision
NPV = round(100 * tn_mean / (tn_mean + fn_mean), 1)           #TN/(TN+FN)

scores_Acc = scores['test_Accuracy']
print(f"{classifier_name} Acc: {scores_Acc.mean():0.2f} (+/- {scores_Acc.std() * 2:0.2f})")
scores_AUC = scores['test_roc_auc']   #Only works with binary classes, not multiclass
print(f"{classifier_name} AUC: {scores_AUC.mean():0.2f} (+/- {scores_AUC.std() * 2:0.2f})")
scores_sensitivity = scores['test_Sensitivity']
print(f"{classifier_name} Recall: {scores_sensitivity.mean():0.2f} (+/- {scores_sensitivity.std() * 2:0.2f})")
scores_precision = scores['test_precision']
print(f"{classifier_name} Precision: {scores_precision.mean():0.2f} (+/- {scores_precision.std() * 2:0.2f})")
print(f"{classifier_name} Sensitivity = ", Sensitivity, "%")
print(f"{classifier_name} Specificity = ", Specificity, "%")
print(f"{classifier_name} PPV = ", PPV, "%")
print(f"{classifier_name} NPV = ", NPV, "%")

print("CV Runtime:", time.time()-start_ts)
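
The four percentages printed above all come from the same confusion-matrix counts. A worked example on made-up labels, using plain NumPy so the arithmetic is visible; the `n_` prefix is only there to avoid clashing with the `tp`/`tn`/`fp`/`fn` scorer functions defined earlier:

```python
import numpy as np

# Made-up labels: 1 = transported, 0 = not transported.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

n_tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted 1, truly 1
n_tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # predicted 0, truly 0
n_fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted 1, truly 0
n_fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted 0, truly 1

sensitivity = n_tp / (n_tp + n_fn)  # share of true 1s that were found (recall)
specificity = n_tn / (n_tn + n_fp)  # share of true 0s that were found
ppv = n_tp / (n_tp + n_fp)          # share of predicted 1s that are right (precision)
npv = n_tn / (n_tn + n_fn)          # share of predicted 0s that are right

print(n_tp, n_tn, n_fp, n_fn)
print(sensitivity, specificity, ppv, npv)
```

Dividing mean counts over the folds, as the cross-validation cell does, is the same as pooling the counts of all folds. That can differ slightly from the average of the per-fold recall and precision scores printed next to it.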

**Or a more basic approach:**

In [None]:
#rf = RandomForestClassifier(n_estimators=1000, max_depth=40, min_samples_split=20,criterion='entropy', random_state=random_state)
rf = RandomForestClassifier(n_estimators=600, max_depth=30, min_samples_split=12,criterion='entropy', random_state=random_state)
rf.fit(x_train, y_train)
y_pred_train = rf.predict(x_train)
y_pred_test = rf.predict(x_test)
print("Training accuracy: ", accuracy_score(y_train, y_pred_train))
print("Testing accuracy: ", accuracy_score(y_test, y_pred_test))

In [None]:
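def feature_importance(clf):
    #horizontal bar chart of the fitted model's feature importances, smallest to largest
    #the feature names come from the global X, so clf must have been fit on the columns of X
    importances = clf.feature_importances_
    i = np.argsort(importances)
    features = X.columns
    plt.title('Feature Importance')
    plt.barh(range(len(i)), importances[i], align='center')
    plt.yticks(range(len(i)), [features[x] for x in i])
    plt.xlabel('Importance')
    plt.show()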

In [None]:
feature_importance(rf)

* VIP and the name-initial features can probably be dropped (the name-derived columns are already left out of X above).

## 3. Submit to the competition

In [None]:
submission.head()

In [None]:
test.drop(['First_Name','Last_Name','First_Name_FirstLetter','Last_Name_FirstLetter'], axis=1, inplace=True)

In [None]:
test.head()

In [None]:
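#Now submit to the competition using the model:
#test still has its original column order, so reorder it to match the features the model was fit on
submission['Transported'] = rf.predict(test[X.columns])
submission['Transported'] = submission['Transported'].astype(bool)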
submission.to_csv("submission.csv", index=False)
submission.head()
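
Before uploading, it can help to compare the generated file with the sample submission. A minimal sketch with made-up frames standing in for `sample_submission` and `submission`; `check_submission` is a hypothetical helper, not a Kaggle API:

```python
import pandas as pd

# Made-up stand-ins for sample_submission and the generated submission.
sample = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                       "Transported": [False, False]})
mine = pd.DataFrame({"PassengerId": ["0013_01", "0018_01"],
                     "Transported": [True, False]})

def check_submission(mine, sample):
    """Raise AssertionError if `mine` does not match the layout of `sample`."""
    assert list(mine.columns) == list(sample.columns), "column names or order differ"
    assert len(mine) == len(sample), "row count differs"
    assert mine["PassengerId"].equals(sample["PassengerId"]), "PassengerIds differ"
    assert mine["Transported"].dtype == bool, "Transported must be boolean"
    return True

ok = check_submission(mine, sample)
print(ok)
```

On the real data the call would be `check_submission(submission, sample_submission)`.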