<a href="https://www.kaggle.com/cameron858/spaceship-titanic-various-models?scriptVersionId=88956106" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

Load the training, test, and sample submission data into pandas DataFrames

In [1]:
import pandas as pd
import numpy as np

train = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
test = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
submission = pd.read_csv("/kaggle/input/spaceship-titanic/sample_submission.csv")

# info() prints its summary directly and returns None, so call it on its own
train.info()
print(train.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB
  PassengerId HomePlanet CryoSleep  Cabin  Destination   Age    VIP  \
0     0001_01     Europa     False  B/0/P  TRAPPIST-1e  39.0  False   


In [2]:
# examine NaNs

print(f'Training NaNs:\n{train.isnull().sum()}\n\nTesting NaNs:\n{test.isnull().sum()}')

Training NaNs:
PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

Testing NaNs:
PassengerId       0
HomePlanet       87
CryoSleep        93
Cabin           100
Destination      92
Age              91
VIP              93
RoomService      82
FoodCourt       106
ShoppingMall     98
Spa             101
VRDeck           80
Name             94
dtype: int64


Every feature column contains NaNs, roughly 2 to 2.5% of the 8693 training rows per column. They are left in place for now and dealt with after the feature creation step.
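
To put the raw counts in proportion, the share of missing values per column can be read off with `isnull().mean()`. The sketch below uses a small made-up frame (`toy` is a hypothetical name, not the competition data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Europa", "Mars"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "Transported": [True, False, True, False],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing entries
missing_share = toy.isnull().mean()
print(missing_share)
```

The same call on `train` or `test` would give the per-column fraction instead of the count.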

In [3]:
#train, test = train.dropna(), test.dropna()
#print(f'Training NaNs:\n{train.isnull().sum()}\n\nTesting NaNs:\n{test.isnull().sum()}')

# Feature creation #

Several columns pack more than one piece of information and can be split into separate features. The PassengerId has the format 'XXXX_XX', so the first 4 digits and the last 2 digits become two features. The Cabin has the format 'deck/number/side', where side is P (port) or S (starboard), so it becomes three columns. First name and family name features are created from the original Name feature, on the plausible assumption that families travel together and stay in the same rooms. The 3 original features are dropped afterwards.
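
The splitting rule described above can be tried on a few made-up rows first. This is only a sketch: the values and the `toy` frame are invented for illustration, and the last row shows that a missing Cabin or Name simply yields NaN in every derived column.

```python
import numpy as np
import pandas as pd

# Made-up rows in the formats described above; the last row has missing values
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", np.nan],
    "Name": ["Jane Doe", "John Roe", np.nan],
})

# n limits the number of splits; expand=True returns one column per part
toy[["PassengerId_0", "PassengerId_1"]] = toy["PassengerId"].str.split("_", n=1, expand=True)
toy[["Deck", "Number", "Side"]] = toy["Cabin"].str.split("/", n=2, expand=True)
toy[["First name", "Family name"]] = toy["Name"].str.split(" ", n=1, expand=True)

print(toy.drop(columns=["PassengerId", "Cabin", "Name"]))
```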

In [4]:
# splitting PassengerId feature (n is keyword-only in pandas 2.0+)
train[['PassengerId_0', 'PassengerId_1']] = train['PassengerId'].str.split('_', n=1, expand=True)
test[['PassengerId_0', 'PassengerId_1']] = test['PassengerId'].str.split('_', n=1, expand=True)

# splitting Cabin feature
train[['Deck', 'Number', 'Side']] = train['Cabin'].str.split('/', n=2, expand=True)
test[['Deck', 'Number', 'Side']] = test['Cabin'].str.split('/', n=2, expand=True)

# splitting Name feature
train[['First name', 'Family name']] = train['Name'].str.split(' ', n=1, expand=True)
test[['First name', 'Family name']] = test['Name'].str.split(' ', n=1, expand=True)

# drop old features
train.drop(['PassengerId', 'Cabin', 'Name'], axis=1, inplace=True)
test.drop(['PassengerId', 'Cabin', 'Name'], axis=1, inplace=True)

print(f'Training:\n{train.head()}\nTesting:\n{test.head()}')

Training:
  HomePlanet CryoSleep  Destination   Age    VIP  RoomService  FoodCourt  \
0     Europa     False  TRAPPIST-1e  39.0  False          0.0        0.0   
1      Earth     False  TRAPPIST-1e  24.0  False        109.0        9.0   
2     Europa     False  TRAPPIST-1e  58.0   True         43.0     3576.0   
3     Europa     False  TRAPPIST-1e  33.0  False          0.0     1283.0   
4      Earth     False  TRAPPIST-1e  16.0  False        303.0       70.0   

   ShoppingMall     Spa  VRDeck  Transported PassengerId_0 PassengerId_1 Deck  \
0           0.0     0.0     0.0        False          0001            01    B   
1          25.0   549.0    44.0         True          0002            01    F   
2           0.0  6715.0    49.0        False          0003            01    A   
3         371.0  3329.0   193.0        False          0003            02    A   
4         151.0   565.0     2.0         True          0004            01    F   

  Number Side First name  Family name  
0     

Scale continuous features and impute missing values

In [5]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

# scale continuous features: fit the scaler on train only, then reuse it on test
continuous_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
min_max_scaler = MinMaxScaler()
train[continuous_cols] = min_max_scaler.fit_transform(train[continuous_cols])
test[continuous_cols] = min_max_scaler.transform(test[continuous_cols])

# impute columns with missing values, using the most frequent training value per column
impute_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
imputer = SimpleImputer(strategy="most_frequent")
train[impute_cols] = imputer.fit_transform(train[impute_cols])
test[impute_cols] = imputer.transform(test[impute_cols])
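
A common convention is to learn the scaling range and the fill values from the training data only and then apply them unchanged to the test data. The sketch below illustrates that with made-up numbers; `toy_train` and `toy_test` are hypothetical names, not the competition frames.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Made-up numbers, not the competition data
toy_train = pd.DataFrame({"Age": [10.0, 20.0, 20.0, np.nan, 50.0]})
toy_test = pd.DataFrame({"Age": [30.0, np.nan, 60.0]})

scaler = MinMaxScaler()
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform learns min/max from the training frame (NaNs are ignored when fitting)
train_scaled = scaler.fit_transform(toy_train)
# fit_transform learns the most frequent scaled value and fills the NaN with it
train_filled = imputer.fit_transform(train_scaled)

# transform reuses the training min/max and the training mode on the test frame
test_scaled = scaler.transform(toy_test)
test_filled = imputer.transform(test_scaled)

print(test_filled.ravel())
```

Because the range comes from the training frame, a test value larger than the training maximum scales to a number above 1, which is expected with this approach.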

In [6]:
from sklearn.preprocessing import LabelEncoder

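def encode_df_cols(train_df, test_df, columns):
    for col in columns:
        encoder = LabelEncoder()
        # fit on the values of both frames so each category gets the same code in train and test
        encoder.fit(pd.concat([train_df[col], test_df[col]]).astype('str'))
        train_df[col] = encoder.transform(train_df[col].astype('str'))
        test_df[col] = encoder.transform(test_df[col].astype('str'))
    return train_df, test_df

cat_cols = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Number', 'Side', 'First name', 'Family name']
train, test = encode_df_cols(train, test, cat_cols)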
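
Integer codes are only comparable between train and test when both frames go through the same fitted mapping. One way to get that, sketched here with scikit-learn's `OrdinalEncoder` rather than the `LabelEncoder` used in this notebook, is to fit on the training column and map categories that only occur in test to a sentinel value. The frames and values below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categories; "G" never appears in toy_train
toy_train = pd.DataFrame({"Deck": ["B", "F", "A", "F"]})
toy_test = pd.DataFrame({"Deck": ["A", "G", "B"]})

# Categories unseen during fit are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train[["Deck"]])
test_codes = encoder.transform(toy_test[["Deck"]])

print(encoder.categories_)
print(test_codes.ravel())
```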

In [7]:
Y_train = train['Transported']
X_train = train.loc[:, train.columns != 'Transported']

# Models #

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# prevents slow optimisation every run
run_dtc_grid = False

if run_dtc_grid:
    params_dtc = {'max_depth': range(5,16), 'max_leaf_nodes': list(range(15, 26)), 'min_samples_split': list(range(2,5))}
    clf_dtc = GridSearchCV(DecisionTreeClassifier(), params_dtc, n_jobs=4, verbose=1)
    clf_dtc.fit(X_train, Y_train)
    decision_tree = clf_dtc.best_estimator_
    print(f'The best score was {clf_dtc.best_score_}')

In [9]:
from sklearn.ensemble import AdaBoostClassifier

run_ada = False

if run_ada:
    clf_ada = AdaBoostClassifier(n_estimators=500)
    clf_ada.fit(X_train, Y_train)
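    # bare expressions inside an if block display nothing, so print explicitly
    print(f'Estimator errors: {clf_ada.estimator_errors_}')
    print(f'Training accuracy: {clf_ada.score(X_train, Y_train)}')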

In [10]:
from sklearn.ensemble import GradientBoostingClassifier

run_gbc = True

if run_gbc:
    params_gbc = {'n_estimators': [100, 200, 300, 400, 500], 'max_depth': range(1,6)}
    clf_gbc = GridSearchCV(GradientBoostingClassifier(), params_gbc, n_jobs=4, verbose=1)
    clf_gbc.fit(X_train, Y_train)
    grad_boost = clf_gbc.best_estimator_
    print(f'The best score was {clf_gbc.best_score_}')

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   39.4s
[Parallel(n_jobs=4)]: Done 125 out of 125 | elapsed:  4.1min finished


The best score was 0.7572783321234708
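
The reported best score is the mean cross-validated accuracy of the best parameter combination. To see which combination that was and how the others compared, `best_params_` and `cv_results_` can be inspected. The sketch below does this on synthetic data with a deliberately tiny grid; `X_toy`, `y_toy`, and `search` are stand-ins, not the objects from the cell above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / Y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Deliberately tiny grid so the sketch runs in seconds
param_grid = {"n_estimators": [10, 20], "max_depth": [1, 2]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
search.fit(X_toy, y_toy)

# best_score_ is the mean cross-validated accuracy of the best parameter combination
print(search.best_params_, search.best_score_)

# cv_results_ holds one row per candidate; rank_test_score == 1 marks the best one
results = pd.DataFrame(search.cv_results_)
print(results[["param_n_estimators", "param_max_depth", "mean_test_score", "rank_test_score"]])
```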


In [11]:
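def create_submission(model):
    # predict on the test features in the same column order used for training
    submission['Transported'] = model.predict(test[X_train.columns])
    submission.to_csv("submission.csv", index=False)

# grad_boost is the best estimator found by the grid search above
create_submission(grad_boost)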