<h1><center>Spaceship Titanic Prediction - Supervised Learning</center></h1>
<center>October 2024</center>
<center>Celine Ng</center>

# Table of Contents

1. Project Introduction   
    1. Notebook Preparation
    1. Dataset
1. Data Cleaning
    1. Duplicate rows
    1. Datatypes
    1. Unique values
    1. Missing values
1. EDA
    1. Distribution
    1. Distribution according to target label
    1. Missing Values
1. Data Formatting
1. Preprocessing
    1. Transformations
    1. Data Splitting
1. Models
    1. Basic model
    1. Baseline model
    1. Hyperparameter Tuning
    1. Best model on test data
    1. Model Interpretation 
1. Deploy the model
1. Improvements

# 1. Project Introduction

## 1.1. Notebook Preparation

In [1]:
%%capture
%pip install -r requirements.txt

In [3]:
from IPython.display import HTML
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from utils.eda import *
from utils.model import *

from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import (train_test_split, StratifiedKFold, 
                                     cross_validate, GridSearchCV)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (OneHotEncoder, FunctionTransformer,
                                   StandardScaler)
from sklearn.compose import ColumnTransformer

from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score, 
                             precision_score, recall_score, make_scorer)
import graphviz
import shap
import pickle
from fastapi import FastAPI

## 1.2. Dataset

Objective: Brief overview of our dataset, including the features and target 
variable

The dataset was downloaded from the Kaggle competition [Spaceship Titanic](https://www.kaggle.com/competitions/spaceship-titanic/data?select=train.csv) on 18 October 2024. <br>
This dataset is part of an open Kaggle competition, 
where the task is to predict whether a passenger was transported to an 
alternate dimension during the Spaceship Titanic's collision with the 
spacetime anomaly. <br>
The data comes as two separate files, *train.csv* and *test.csv*. Each file 
contains a set of personal records recovered from the ship's damaged computer 
system. There are 13 columns of personal records; *train.csv* also has a 
14th column, the target.

- train.csv - Personal records for about two-thirds (~8700) of the 
passengers, to be used as training data.
- test.csv - Personal records for the remaining one-third (~4300) of the 
passengers, to be used as test data. Does not include the target variable. 
My task is to predict the value of Transported for the passengers in this set.

The columns in *train.csv* are:
<ol>
<li>PassengerId - A unique Id for each passenger. Each Id takes the form 
gggg_pp where gggg indicates a group the passenger is travelling with and pp
 is their number within the group. People in a group are often family 
 members, but not always.</li>
<li>HomePlanet - The planet the passenger departed from, typically their planet
 of permanent residence.</li>
<li>CryoSleep - Indicates whether the passenger elected to be put into 
suspended animation for the duration of the voyage. Passengers in cryosleep 
are confined to their cabins.</li>
<li>Cabin - The cabin number where the passenger is staying. Takes the form 
deck/num/side, where side can be either P for Port or S for Starboard.</li>
<li>Destination - The planet the passenger will be debarking to.</li>
<li>Age - The age of the passenger.</li>
<li>VIP - Whether the passenger has paid for special VIP service during the 
voyage.</li>
<li>RoomService - Amount the passenger has billed at this luxury amenity. </li>
<li>FoodCourt - Amount the passenger has billed at this luxury amenity. </li>
<li>ShoppingMall - Amount the passenger has billed at this luxury amenity. </li>
<li>Spa - Amount the passenger has billed at this luxury amenity. </li>
<li>VRDeck - Amount the passenger has billed at this luxury amenity. </li>
<li>Name - The first and last names of the passenger.</li>
<li>Transported - Whether the passenger was transported to another dimension. 
This is the target, the column I am trying to predict.</li>
</ol>
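
The composite formats of `PassengerId` (gggg_pp) and `Cabin` (deck/num/side) described above can be split into their parts with pandas string methods. The following is a minimal sketch on a few toy rows that follow those formats, not a step taken from this notebook; the new column names (`Group`, `GroupNumber`, `Deck`, `CabinNum`, `Side`, `GroupSize`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy rows that follow the documented formats; not the full dataset.
sample = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "Cabin": ["A/0/S", "A/0/S", "F/1/S"],
})

# PassengerId has the form gggg_pp: travel group, then number within the group.
# Column names here are illustrative assumptions.
sample[["Group", "GroupNumber"]] = (
    sample["PassengerId"].str.split("_", expand=True)
)

# Cabin has the form deck/num/side, where side is P (Port) or S (Starboard).
sample[["Deck", "CabinNum", "Side"]] = (
    sample["Cabin"].str.split("/", expand=True)
)

# Number of passengers sharing the same group prefix.
sample["GroupSize"] = (
    sample.groupby("Group")["PassengerId"].transform("count")
)

print(sample)
```

The split parts are kept as strings here; whether to cast `CabinNum` to a number or treat `Deck` and `Side` as categories would depend on the later preprocessing choices.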

In [12]:
spaceship_train = pd.read_csv('data/train.csv')
display(spaceship_train.head())
spaceship_train_shape = spaceship_train.shape
print(f"Number of rows on train data: {spaceship_train_shape[0]}\nNumber of "
      f"columns on train data: {spaceship_train_shape[1]}")

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Number of rows on train data: 8693
Number of columns on train data: 14


# 2. Data Cleaning
Objective:
1. Take a closer look at the values that make up our data
2. Check for duplicate rows and for missing or unusual values