Spaceship Titanic
=================

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. While rounding Alpha Centauri en route to its first destination (the torrid 55 Cancri E), the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a fate similar to that of its namesake from 1,000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system. Help save them and change history!

https://www.kaggle.com/competitions/spaceship-titanic/overview

**Step 2/3: Create and train the model**

In [1]:
import pandas as pd
import joblib

import warnings
warnings.filterwarnings('ignore')

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# This import only works after the data notebook has been exported as a Python script:
# "File / Save and export notebook as... / Executable Script"
import Spaceship_Titanic_data

(train) Number of rows = 8693 and Number of cols = 14
(test) Number of rows = 4277 and Number of cols = 13


Create and train a model
------------------------

Create a model with the processing pipeline and one classifier.

In [2]:
model = Pipeline(
    [
        ('preproc', Spaceship_Titanic_data.preproc),
        ('drop_target', Spaceship_Titanic_data.drop_target),
        ('cla', LogisticRegression())
    ]
).set_output(transform='pandas')

Create a `GridSearchCV` to try many variants of the model, with different strategies and parameters, and find the combination with the best score.

In [3]:
param_grid = [
    {
        'preproc__imputer__num_imputer__strategy': ['mean', 'median'],
        'preproc__scale_encode__minmax_scaler__feature_range': [(0, 1), (-1, 1)],
        'cla': (LogisticRegression(),),
        'cla__C': [0.5, 1.0, 5.0],
        'cla__max_iter': [1000],
        'cla__class_weight': [None, 'balanced']
    },
    {
        'preproc__imputer__num_imputer__strategy': ['mean', 'median'],
        'preproc__scale_encode__minmax_scaler__feature_range': [(0, 1), (-1, 1)],
        'cla': (KNeighborsClassifier(),),
        'cla__n_neighbors': [3, 5, 7],
        'cla__weights': ['uniform', 'distance']
    },
    {
        'preproc__imputer__num_imputer__strategy': ['mean', 'median'],
        'preproc__scale_encode__minmax_scaler__feature_range': [(0, 1), (-1, 1)],
        'cla': (MLPClassifier(),),
        'cla__hidden_layer_sizes': [(20,), (25,), (30,)],
        'cla__activation': ['logistic', 'relu'],
        'cla__max_iter': [1500]
    },
    {
        'preproc__imputer__num_imputer__strategy': ['mean', 'median'],
        'preproc__scale_encode__minmax_scaler__feature_range': [(0, 1), (-1, 1)],
        'cla': (DecisionTreeClassifier(),),
        'cla__criterion': ['gini', 'entropy'],
        'cla__max_depth': [5, 8, 10]
    },
    {
        'preproc__imputer__num_imputer__strategy': ['mean', 'median'],
        'preproc__scale_encode__minmax_scaler__feature_range': [(0, 1), (-1, 1)],
        'cla': (RandomForestClassifier(),),
        'cla__n_estimators': [50, 100, 150],
        'cla__max_depth': [5, 8, 10]
    }
]

gs = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='roc_auc',
    error_score='raise',
    cv=5,
    verbose=1,  # Set to 10 to print traces and know the % progress (very verbose)
    n_jobs=-1   # -1 uses all CPU cores; you can give a number > 0 to use that number of cores
)
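
The grid above relies on two scikit-learn conventions: a key such as `cla__C` addresses parameter `C` of the step named `cla`, and a key equal to a step name (`cla`) replaces the whole step. The following is a minimal sketch of the same idea on synthetic data. The two-step `toy_model` and the names `toy_grid` and `toy_gs` are illustrative assumptions, not the pipeline defined in `Spaceship_Titanic_data`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the competition data (assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical two-step pipeline: a scaler and a replaceable classifier.
toy_model = Pipeline([
    ('minmax_scaler', MinMaxScaler()),
    ('cla', LogisticRegression()),
])

toy_grid = [
    {
        # '<step>__<parameter>' sets a parameter of that step.
        'minmax_scaler__feature_range': [(0, 1), (-1, 1)],
        # A bare step name swaps the estimator used for that step.
        'cla': (LogisticRegression(max_iter=1000),),
        'cla__C': [0.5, 1.0],
    },
    {
        'minmax_scaler__feature_range': [(0, 1), (-1, 1)],
        'cla': (DecisionTreeClassifier(random_state=0),),
        'cla__max_depth': [3, 5],
    },
]

toy_gs = GridSearchCV(toy_model, toy_grid, scoring='roc_auc', cv=3)
toy_gs.fit(X, y)

n_candidates = len(toy_gs.cv_results_['params'])
print(n_candidates)            # 2 * 2 candidates per sub-grid, 8 in total
print(toy_gs.best_params_)
```

Each dictionary in the list is expanded separately, so parameters that only exist for one classifier never get combined with another classifier.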

Fit all variants and display the scores.

In [4]:
gs.fit(Spaceship_Titanic_data.train_data,
       Spaceship_Titanic_data.train_data.Transported)

result = pd.DataFrame(gs.cv_results_).sort_values(by='rank_test_score').reset_index(drop=True)

result

Fitting 5 folds for each of 132 candidates, totalling 660 fits
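
The candidate count reported above can be checked by hand: each sub-grid contributes the product of the number of values listed for each of its parameters. This is a small sketch of that arithmetic; the dictionary `values_per_parameter` simply transcribes the list lengths from `param_grid`.

```python
from math import prod

# Number of values tried per parameter in each sub-grid of param_grid.
# Every sub-grid also varies the imputer strategy (2) and the scaler range (2).
values_per_parameter = {
    'LogisticRegression': [2, 2, 3, 1, 2],   # strategy, range, C, max_iter, class_weight
    'KNeighborsClassifier': [2, 2, 3, 2],    # strategy, range, n_neighbors, weights
    'MLPClassifier': [2, 2, 3, 2, 1],        # strategy, range, hidden_layer_sizes, activation, max_iter
    'DecisionTreeClassifier': [2, 2, 2, 3],  # strategy, range, criterion, max_depth
    'RandomForestClassifier': [2, 2, 3, 3],  # strategy, range, n_estimators, max_depth
}

candidates = {name: prod(sizes) for name, sizes in values_per_parameter.items()}
total_candidates = sum(candidates.values())
total_fits = total_candidates * 5  # cv=5 folds per candidate

print(candidates)
print(total_candidates, total_fits)  # 132 660
```

The 660 fits do not include the final refit of the best candidate on the full training data, which `GridSearchCV` performs by default.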


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_cla,param_cla__C,param_cla__class_weight,param_cla__max_iter,param_preproc__imputer__num_imputer__strategy,param_preproc__scale_encode__minmax_scaler__feature_range,...,param_cla__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,6.155337,0.635549,0.036098,0.001739,"MLPClassifier(hidden_layer_sizes=(30,), max_it...",,,1500,median,"(-1, 1)",...,,"{'cla': MLPClassifier(hidden_layer_sizes=(30,)...",0.872498,0.879525,0.875940,0.882748,0.892477,0.880638,0.006844,1
1,6.068546,0.690459,0.039708,0.002771,"MLPClassifier(hidden_layer_sizes=(30,), max_it...",,,1500,mean,"(-1, 1)",...,,"{'cla': MLPClassifier(hidden_layer_sizes=(30,)...",0.872903,0.878775,0.874396,0.882271,0.891393,0.879947,0.006606,2
2,5.521898,0.849068,0.043391,0.011489,"MLPClassifier(hidden_layer_sizes=(30,), max_it...",,,1500,median,"(-1, 1)",...,,"{'cla': MLPClassifier(hidden_layer_sizes=(30,)...",0.875247,0.876519,0.874645,0.882100,0.890049,0.879712,0.005803,3
3,5.695844,1.226276,0.037858,0.002093,"MLPClassifier(hidden_layer_sizes=(30,), max_it...",,,1500,median,"(-1, 1)",...,,"{'cla': MLPClassifier(hidden_layer_sizes=(30,)...",0.873245,0.877457,0.873632,0.882947,0.890992,0.879654,0.006657,4
4,5.354629,0.667446,0.037927,0.002483,"MLPClassifier(hidden_layer_sizes=(30,), max_it...",,,1500,mean,"(-1, 1)",...,,"{'cla': MLPClassifier(hidden_layer_sizes=(30,)...",0.872007,0.879345,0.873306,0.881809,0.888947,0.879083,0.006136,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
127,0.112178,0.002882,0.034323,0.000919,DecisionTreeClassifier(),,,,mean,"(0, 1)",...,,"{'cla': DecisionTreeClassifier(), 'cla__criter...",0.342201,0.702188,0.734466,0.844701,0.640140,0.652739,0.168859,128
128,0.137527,0.014452,0.046167,0.014268,DecisionTreeClassifier(),,,,median,"(-1, 1)",...,,"{'cla': DecisionTreeClassifier(), 'cla__criter...",0.382640,0.652619,0.730415,0.837804,0.607522,0.642200,0.151476,129
129,0.133290,0.015775,0.057924,0.023778,DecisionTreeClassifier(),,,,median,"(0, 1)",...,,"{'cla': DecisionTreeClassifier(), 'cla__criter...",0.382617,0.651713,0.731344,0.835478,0.607522,0.641734,0.150982,130
130,0.123360,0.003661,0.037455,0.002469,DecisionTreeClassifier(),,,,mean,"(-1, 1)",...,,"{'cla': DecisionTreeClassifier(), 'cla__criter...",0.349939,0.646234,0.722571,0.817448,0.603025,0.627843,0.156913,131
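
With 132 rows, it can help to condense the table to the best score reached by each classifier type. The helper below is a sketch: `best_per_classifier` is a hypothetical name, and the `demo` frame is a stand-in that holds only four of the rows displayed above, so it can run without the competition data.

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier


def best_per_classifier(result: pd.DataFrame) -> pd.Series:
    """Return the best mean_test_score reached by each classifier class."""
    # param_cla holds estimator objects; group them by class name.
    classifier_name = result['param_cla'].map(lambda est: type(est).__name__)
    return (
        result.groupby(classifier_name)['mean_test_score']
        .max()
        .sort_values(ascending=False)
    )


# Stand-in for `result`, holding only four of the rows displayed above.
demo = pd.DataFrame({
    'param_cla': [MLPClassifier(), MLPClassifier(),
                  DecisionTreeClassifier(), DecisionTreeClassifier()],
    'mean_test_score': [0.880638, 0.879947, 0.652739, 0.642200],
})

summary = best_per_classifier(demo)
print(summary)
```

In the notebook, the call would be `best_per_classifier(result)`. Note that the decision tree rows in the stand-in are the bottom-ranked ones shown above, so the demo figure is not the best decision tree score from the full search.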


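Choose the optimal model. `best_estimator_` is the candidate with the best mean cross-validated ROC AUC, refit on the full training data.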

In [5]:
optimal_model = gs.best_estimator_

optimal_model

Save the optimal model
----------------------

In [6]:
with open('model.jlb', 'wb') as file:
    joblib.dump(optimal_model, file)
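
A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below shows a round trip with `joblib.load`, using a small stand-in estimator trained on synthetic data (an assumption) and a temporary directory instead of the notebook's working directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the competition data and for optimal_model.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
stand_in_model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.jlb')
    joblib.dump(stand_in_model, path)    # joblib.dump also accepts a file path
    restored_model = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored_model.predict(X) == stand_in_model.predict(X)).all())
print(same_predictions)
```

Pickle-based formats store custom functions by reference, so any module that defines functions used inside the saved pipeline has to be importable in the session that loads it.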

Use the model to predict the test data
--------------------------------------

In [7]:
processor = optimal_model.steps[0][1]
drop_target = optimal_model.steps[1][1]
classifier = optimal_model.steps[2][1]

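# Workaround: without this call, optimal_model.predict fails on the test data because the model expects a
# 'Transported' column. The cause has not been confirmed; one likely explanation is that the pipeline was
# fitted on train_data, which includes that column. Note that fit_transform re-fits the preprocessing
# steps on the test data.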
dummy = classifier.predict(drop_target.fit_transform(processor.fit_transform(Spaceship_Titanic_data.test_data)))
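
One plausible explanation for this behaviour, not confirmed for this notebook, is that scikit-learn transformers remember the columns they were fitted on and reject input with different columns. The sketch below reproduces that effect with a `SimpleImputer` on two tiny hand-made frames (both are assumptions), and shows an alternative that avoids re-fitting: adding a placeholder `Transported` column to the test frame.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny stand-ins: the training frame carries the target column, the test frame does not.
train = pd.DataFrame({'Age': [20.0, None, 40.0], 'Transported': [1, 0, 1]})
test = pd.DataFrame({'Age': [30.0, None]})

imputer = SimpleImputer(strategy='mean').fit(train)

# Transforming a frame whose columns differ from the fitted ones is rejected.
try:
    imputer.transform(test)
    failed = False
except ValueError as exc:
    failed = True
    print(type(exc).__name__, exc)

# Alternative to re-fitting: add a placeholder column so the columns match.
padded = test.assign(Transported=0)
imputed = imputer.transform(padded)
print(imputed[:, 0])  # the missing Age is filled with the training mean
```

Applied to the notebook, this would mean calling `optimal_model.predict` on `test_data.assign(Transported=False)`, assuming the placeholder column ends up in the same position as in the training frame. The preprocessing steps then keep the statistics learned from the training data.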

In [8]:
prediction = optimal_model.predict(Spaceship_Titanic_data.test_data)

Generate the output file in the format required by the Kaggle competition.

In [9]:
output = pd.DataFrame({'PassengerId': Spaceship_Titanic_data.test_data['PassengerId'],
                       'Transported': prediction})
output.to_csv('submission.csv', index=False)
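
Before uploading, a quick sanity check of the submission frame can catch formatting mistakes. This is a sketch only: `check_submission` is a hypothetical helper, and the checks reflect the two-column layout written above rather than an official validator.

```python
import pandas as pd


def check_submission(frame: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(frame.columns) == ['PassengerId', 'Transported'], 'unexpected columns'
    assert len(frame) == expected_rows, 'unexpected number of rows'
    assert frame['PassengerId'].is_unique, 'duplicate passenger ids'
    assert frame['Transported'].dtype == bool, 'Transported should be boolean'


# Stand-in built from the first three predictions shown in the last cell's output.
demo_output = pd.DataFrame({
    'PassengerId': ['0013_01', '0018_01', '0019_01'],
    'Transported': [False, False, True],
})
check_submission(demo_output, expected_rows=3)
print('submission looks well-formed')
```

For the real file, the call would be `check_submission(output, expected_rows=4277)`, matching the number of test rows reported when the data module was imported.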

Display the predictions together with the passenger names.

In [10]:
pd.DataFrame({'PassengerId': Spaceship_Titanic_data.test_data['PassengerId'],
              'Name': Spaceship_Titanic_data.test_data['Name'],
              'Transported': prediction})

Unnamed: 0,PassengerId,Name,Transported
0,0013_01,Nelly Carsoning,False
1,0018_01,Lerome Peckers,False
2,0019_01,Sabih Unhearfus,True
3,0021_01,Meratz Caltilter,True
4,0023_01,Brence Harperez,False
...,...,...,...
4272,9266_02,Jeron Peter,True
4273,9269_01,Matty Scheron,False
4274,9271_01,Jayrin Pore,True
4275,9273_01,Kitakan Conale,True
