# Space Titanic
- Predict which passengers are transported to an alternate dimension

# File and Data Field Descriptions
- train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
  - PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
  - HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
  - CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
  - Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
  - Destination - The planet the passenger will be debarking to.
  - Age - The age of the passenger.
  - VIP - Whether the passenger has paid for special VIP service during the voyage.
  - RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
  - Name - The first and last names of the passenger.
  - Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
- test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
- sample_submission.csv - A submission file in the correct format.
  - PassengerId - Id for each passenger in the test set.
  - Transported - The target. For each passenger, predict either True or False.
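
As a quick illustration of the two composite fields described above, the sketch below splits a PassengerId of the form gggg_pp and a Cabin of the form deck/num/side into their parts. The sample values and the helper names split_passenger_id and split_cabin are made up for illustration.

```python
# Hypothetical helpers; the sample values below are illustrative only.
def split_passenger_id(passenger_id):
    """Split 'gggg_pp' into (group, number within group)."""
    group, number = passenger_id.split('_')
    return group, number

def split_cabin(cabin):
    """Split 'deck/num/side' into (deck, num, side)."""
    deck, num, side = cabin.split('/')
    return deck, num, side

group, number = split_passenger_id('0003_02')
deck, num, side = split_cabin('F/12/S')
print(group, number)    # group id and position within the group
print(deck, num, side)  # deck letter, cabin number, P (Port) or S (Starboard)
```

All parts come back as strings, which is also how pandas returns them when the same split is applied to a whole column.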



![spaceship-titanic](https://i.imgur.com/m5j2A0p.png)

## Image made using Stable Diffusion in the [DreamStudio beta](https://beta.dreamstudio.ai/dream)

Code to help grab the dataset when running outside of Kaggle. It is commented out so that it doesn't cause any issues on Kaggle.

In [16]:
# from fastkaggle import setup_comp
# path = setup_comp('spaceship-titanic', install='fastai "timm>=0.6.2.dev0"')

Grab the training and test data, store the most frequent value of each column for filling in missing values later, and replace the boolean Transported column with 0s and 1s

In [10]:
import pandas as pd

df = pd.read_csv('../input/spaceship-titanic/train.csv')
tst_df = pd.read_csv('../input/spaceship-titanic/test.csv')
# Most frequent value of each column, used later to fill in missing values
modes = df.mode().iloc[0]
# Convert the boolean target to 0/1
df['Transported'] = df['Transported'].astype(int)

## Data preprocessing

- To smooth the outliers in RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck, let's take the log of these columns (adding 1 first, since the log of 0 is undefined)
- The Cabin, Name, and PassengerId columns contain extra information that may be useful to our model if we split them into separate columns, so let's do that as well
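
To see why the log helps, here is a small sketch on made-up spending amounts (not taken from the dataset). np.log1p(x) computes log(1 + x), so a bill of 0 maps to 0 and very large bills are pulled much closer to the rest.

```python
import numpy as np

# Made-up bills: mostly small, with one large outlier
bills = np.array([0.0, 10.0, 100.0, 20000.0])

log_bills = np.log1p(bills)  # same as np.log(bills + 1)

# Ratio between the largest and the second-largest value, before and after
raw_ratio = bills[-1] / bills[-2]
log_ratio = log_bills[-1] / log_bills[-2]
print(log_bills)
print(raw_ratio, log_ratio)
```

On the raw scale the outlier is 200 times larger than the next value; on the log scale the gap shrinks to a small multiple.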

In [76]:
import numpy as np

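def add_features(df):
  # Fill missing values with the most frequent value of each column
  df.fillna(modes, inplace=True)
  # log(x+1) compresses the long right tail of the spending columns
  df['LogRoomService'] = np.log(df['RoomService']+1)
  df['LogFoodCourt'] = np.log(df['FoodCourt']+1)
  df['LogShoppingMall'] = np.log(df['ShoppingMall']+1)
  df['LogSpa'] = np.log(df['Spa']+1)
  df['LogVRDeck'] = np.log(df['VRDeck']+1)
  # Cabin has the form deck/num/side
  df[['Deck','CabinNumber','Side']] = df.Cabin.str.split('/', expand=True)
  # PassengerId has the form gggg_pp
  df[['Group', 'PassengerNumber']] = df.PassengerId.str.split('_', expand=True)
  # Split on the first space only, so we always get exactly two columns
  df[['FirstName', 'LastName']] = df.Name.str.split(' ', n=1, expand=True)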

add_features(df)
add_features(tst_df)

Here we use pandas' Categorical type so that each categorical column can be turned into numerical codes

In [77]:
def proc_data(df):
  df.fillna(modes, inplace=True)
  df['HomePlanet'] = pd.Categorical(df.HomePlanet)
  df["CryoSleep"] = pd.Categorical(df.CryoSleep)
  df['Destination'] = pd.Categorical(df.Destination)
  df['Deck'] = pd.Categorical(df.Deck)
  df['CabinNumber'] = pd.Categorical(df.CabinNumber)
  df['Side'] = pd.Categorical(df.Side)
  df['Group'] = pd.Categorical(df.Group)
  df['PassengerNumber'] = pd.Categorical(df.PassengerNumber)
  df['FirstName'] = pd.Categorical(df.FirstName)
  df['LastName'] = pd.Categorical(df.LastName)
  

proc_data(df)
proc_data(tst_df)


In [24]:
df.CryoSleep.cat.codes.head()
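
One thing to watch for: pd.Categorical assigns codes based on the values present in each frame, so calling it separately on the training and test data can give the same label different codes. The sketch below uses made-up planet names to show the issue and one possible way around it, passing a shared categories list. It is an illustration, not what the cells above do.

```python
import pandas as pd

# Made-up values: the test column is missing one label seen in training
train_col = pd.Series(['Earth', 'Europa', 'Mars'])
test_col = pd.Series(['Europa', 'Mars'])

# Encoded independently, 'Mars' gets a different code in each frame
train_codes = pd.Categorical(train_col).codes
test_codes = pd.Categorical(test_col).codes
print(train_codes[2], test_codes[1])

# Encoded with the training categories, the codes line up
shared = sorted(train_col.unique())
test_codes_shared = pd.Categorical(test_col, categories=shared).codes
print(test_codes_shared)
```

Labels that do not appear in the shared list are encoded as -1, which matters for high-cardinality columns such as names or cabin numbers.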

In [78]:
cats=["HomePlanet", "CryoSleep", "Destination", "Deck", "CabinNumber", "Side", "Group", "PassengerNumber", "FirstName", "LastName"]
conts=['Age', 'LogRoomService', 'LogFoodCourt', 'LogShoppingMall', 'LogSpa', 'LogVRDeck']
dep="Transported"

## Binary splits
- It may be useful to plot the transportation rate for each value of a given column next to the overall count for each value

In [25]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(11, 5))
sns.barplot(data=df, y=dep, x="CryoSleep", ax=axs[0]).set(title="Transportation rate")
sns.countplot(data=df, x='CryoSleep', ax=axs[1]).set(title='Histogram')

## Now that we have our dataset cleaned up a bit and the columns split into categorical, continuous, and dependent columns, we can split our data randomly into training and validation sets

In [81]:
from numpy import random
from sklearn.model_selection import train_test_split

random.seed(42)
trn_df, val_df = train_test_split(df, test_size=0.25)
trn_df[cats] = trn_df[cats].apply(lambda x: x.cat.codes)
val_df[cats] = val_df[cats].apply(lambda x: x.cat.codes)

In [82]:
def xs_y(df):
  xs = df[cats+conts].copy()
  return xs, df[dep] if dep in df else None

trn_xs, trn_y = xs_y(trn_df)
val_xs, val_y = xs_y(val_df)

Let's look at predictions based only on whether a passenger was in cryosleep

In [28]:
preds = val_xs.CryoSleep==1
val_y

We can look at the MAE using these cryosleep predictions

In [29]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(val_y, preds)

Not bad, looks like cryosleep is a good indicator of being transported
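
Since both the target and these predictions are 0 or 1, the MAE here is simply the fraction of passengers we got wrong, that is, 1 minus the accuracy. A small sketch with made-up labels:

```python
import numpy as np

# Made-up labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

mae = np.abs(y_true - y_pred).mean()   # mean absolute error
accuracy = (y_true == y_pred).mean()   # fraction predicted correctly
print(mae, accuracy)
```

Two of the eight predictions are wrong, so the MAE and the accuracy add up to 1.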

We can look at numeric columns with boxen and kde plots to see if there is anything that stands out

In [30]:
df_age = trn_df[trn_df.Age>0]
fig, axs = plt.subplots(1,2, figsize=(11, 5))
sns.boxenplot(data=df_age, x=dep, y="Age", ax=axs[0])
sns.kdeplot(data=df_age, x="Age", ax=axs[1]);

The age quantiles don't differ conclusively between transported and non-transported passengers

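We can write a few functions to score a binary split on a column: the standard deviation of the target on each side of the split, weighted by the number of rows on that side (lower is better)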

In [83]:
def _side_score(side, y):
  tot = side.sum()
  if tot<=1: return 0
  return y[side].std()*tot

In [84]:
def score(col, y, split):
  lhs = col<=split
  return (_side_score(lhs, y) + _side_score(~lhs, y))/len(y)

In [85]:
score(trn_xs["CryoSleep"], trn_y, 0.5)

In [86]:
score(trn_xs['Age'], trn_y, 20)
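
To get a feel for the score, the sketch below reuses the same two functions on a tiny made-up column. A split that separates the target perfectly scores 0, while a split that leaves both sides mixed scores higher.

```python
import pandas as pd

def _side_score(side, y):
    tot = side.sum()
    if tot <= 1: return 0
    return y[side].std() * tot

def score(col, y, split):
    lhs = col <= split
    return (_side_score(lhs, y) + _side_score(~lhs, y)) / len(y)

# Made-up data: a binary column and two possible targets
col = pd.Series([0, 0, 0, 1, 1, 1])
y_separable = pd.Series([0, 0, 0, 1, 1, 1])  # matches the column exactly
y_mixed = pd.Series([0, 1, 0, 1, 0, 1])      # unrelated to the column

perfect = score(col, y_separable, 0.5)
mixed = score(col, y_mixed, 0.5)
print(perfect, mixed)
```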

In [89]:
def iscore(nm, split):
  col = trn_xs[nm]
  return score(col, trn_y, split)

from ipywidgets import interact
interact(nm=conts, split=15.5)(iscore);

In [36]:
interact(nm=cats, split=2)(iscore);

## Instead of doing this manually, let's write a function that automatically finds the best split point for each of our columns

In [90]:
def min_col(df, nm):
  col, y = df[nm], df[dep]
  unq = col.dropna().unique()
  scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
  idx = scores.argmin()
  return unq[idx], scores[idx]

min_col(trn_df, "Age")

In [88]:
cols = cats+conts
{o:min_col(trn_df, o) for o in cols}
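
The dictionary maps each column name to a (split, score) pair, so the best single split is the entry with the lowest score. A minimal sketch, using made-up numbers in place of the real output:

```python
# Made-up (split, score) pairs standing in for the min_col results
results = {
    'CryoSleep': (0, 0.44),
    'Age': (5.0, 0.49),
    'LogSpa': (0.0, 0.47),
}

# Sort columns from best (lowest score) to worst
ranked = sorted(results.items(), key=lambda kv: kv[1][1])
best_col, (best_split, best_score) = ranked[0]
print(best_col, best_split, best_score)
```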

# Decision Tree
- Let's take a look at a simple decision tree classifier

In [91]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

m = DecisionTreeClassifier(max_leaf_nodes=4).fit(trn_xs, trn_y);

In [92]:
import graphviz
import re

def draw_tree(t, df, size=10, ratio=0.6, precision=2, **kwargs):
    s=export_graphviz(t, out_file=None, feature_names=df.columns, filled=True, rounded=True,
                      special_characters=True, rotate=False, precision=precision, **kwargs)
    return graphviz.Source(re.sub('Tree {', f'Tree {{ size={size}; ratio={ratio}', s))

In [93]:
draw_tree(m, trn_xs, size=10)

## We can see that CryoSleep is an important feature in predicting transportation

In [94]:
def gini(cond):
  act = df.loc[cond, dep]
  return 1 - act.mean()**2 - (1-act).mean()**2

In [95]:
gini(df.CryoSleep==0), gini(df.CryoSleep==1)
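
Gini impurity measures how mixed a group is: for a binary target with a fraction p of positives it is 1 - p^2 - (1-p)^2, which is 0 for a pure group and 0.5 for an even mix. A small sketch on made-up groups (gini_from_labels is a hypothetical helper, separate from the gini function above):

```python
import numpy as np

def gini_from_labels(labels):
    """Gini impurity of a binary 0/1 array (hypothetical helper)."""
    p = np.mean(labels)  # fraction of positives
    return 1 - p**2 - (1 - p)**2

pure = gini_from_labels(np.array([1, 1, 1, 1]))
even = gini_from_labels(np.array([0, 1, 0, 1]))
skewed = gini_from_labels(np.array([1, 1, 1, 0]))
print(pure, even, skewed)
```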

In [96]:
mean_absolute_error(val_y, m.predict(val_xs))

## Let's add more nodes to our tree and see if that lowers our MAE

In [97]:
m = DecisionTreeClassifier(min_samples_leaf=50)
m.fit(trn_xs, trn_y)
draw_tree(m, trn_xs, size=25)

In [98]:
mean_absolute_error(val_y, m.predict(val_xs))

Nice, it does! Let's submit this model.

In [99]:
tst_df[cats] = tst_df[cats].apply(lambda x: x.cat.codes)
tst_xs,_ = xs_y(tst_df)

def subm(preds, suff):
  # Convert 0/1 predictions back to the True/False values expected in the submission file
  tst_df['Transported'] = np.asarray(preds).astype(bool)
  sub_df = tst_df[['PassengerId', 'Transported']]
  sub_df.to_csv(f'sub-{suff}.csv', index=False)

subm(m.predict(tst_xs), 'tree')

In [100]:
#! kaggle competitions submit -c spaceship-titanic -f sub-tree.csv -m 'decision tree'

# Random Forest
- Let's use an ensemble of decision trees to build a predictive model


In [101]:
def get_tree(prop=0.75):
  n = len(trn_y)
  idxs = random.choice(n, int(n*prop))
  return DecisionTreeClassifier(min_samples_leaf=5).fit(trn_xs.iloc[idxs], trn_y.iloc[idxs])

In [102]:
trees = [get_tree() for t in range(100)]

In [103]:
all_probs = [t.predict(val_xs) for t in trees]
avg_probs = np.stack(all_probs).mean(0)

mean_absolute_error(val_y, avg_probs)
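
Each tree predicts 0 or 1, so averaging over trees gives, for every passenger, the fraction of trees that voted for transported. To turn those averages into class predictions we can threshold them at 0.5. A sketch with made-up votes from three trees:

```python
import numpy as np

# Made-up 0/1 predictions: one row per tree, one column per passenger
all_preds = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
])

avg = all_preds.mean(0)             # fraction of trees voting 1
majority = (avg > 0.5).astype(int)  # majority vote per passenger
print(avg)
print(majority)
```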

In [112]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(100, min_samples_leaf=10)
rf.fit(trn_xs, trn_y)
mean_absolute_error(val_y, rf.predict(val_xs))

In [105]:
subm(rf.predict(tst_xs), 'rf')

In [106]:
#! kaggle competitions submit -c spaceship-titanic -f sub-rf.csv -m 'random forest'

## A cool thing about random forests is the feature_importances_ attribute, which allows us to see what our model thinks are good predictive features for our dependent variable

In [107]:
pd.DataFrame(dict(cols=trn_xs.columns, imp=rf.feature_importances_)).plot('cols', 'imp', 'barh')
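
For a self-contained feel of what feature_importances_ reports, the sketch below fits a small forest on synthetic data where only one column (named signal here, a made-up name) determines the target. The importances sum to 1, and the informative column should come out on top.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic data: 'signal' drives the target, the other columns are noise
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=['signal', 'noise_a', 'noise_b'])
y = (X['signal'] > 0).astype(int)

forest = RandomForestClassifier(50, min_samples_leaf=10, random_state=0).fit(X, y)

# Pair each importance with its column name and sort, most important first
imp = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp)
```

Sorting the importances before plotting makes the bar chart easier to read when there are many columns.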

In [108]:
from sklearn.model_selection import cross_val_score

# XGBoost

In [109]:
import xgboost as xgb
xg_reg = xgb.XGBClassifier(colsample_bytree = 0.7, learning_rate = 0.001,
                max_depth = 8, alpha = 1, gamma = 3, n_estimators = 400)

scores = cross_val_score(xg_reg, trn_xs, trn_y, cv=5)
scores.mean()

In [74]:
xg_reg.fit(trn_xs, trn_y)
subm(xg_reg.predict(tst_xs), 'xgb')