<a href="https://www.kaggle.com/code/averilkan/spaceship-titanic-dataset-classification?scriptVersionId=221806399" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 💡 About The Competition: Spaceship Titanic

Task:
The objective of this task is to predict whether a passenger was transported to an alternate dimension aboard the Spaceship Titanic. This binary classification problem requires the model to predict the 'Transported' status (True or False) for each passenger in the test dataset, based on a variety of features provided.

Dataset:
The dataset for this competition includes information about passengers aboard the fictional spaceship Titanic. Features include demographic data, cabin information, and details about their in-ship expenditures. The target variable is Transported.

Exploration:
Begin with exploratory data analysis (EDA) to understand the distribution and relationships of features. Utilize visualization techniques to identify patterns and anomalies that could influence the predictive model's performance.

Evaluation:
Submissions are evaluated on the accuracy of the predictions. A higher accuracy score indicates a more effective model at classifying passengers' transportation status.

Submission Files:
train.csv - The training dataset, including the target variable Transported.
test.csv - The test dataset, where predictions are to be made.
sample_submission.csv - A sample submission file in the correct format.
Evaluation Metric:
The competition uses accuracy as the metric for evaluating submissions. This metric assesses the proportion of correctly predicted outcomes to total predictions.

# 1. Importing libraries

## 💡 About The Packages:

- **NumPy** and **Pandas** for data manipulation and analysis.
- **Matplotlib** and **Seaborn** for data visualization.
- **Warnings** to manage warning messages.
- **lazypredict** for quick model comparison.
- **Scikit-learn**, **XGBoost**, **LightGBM**, **CatBoost**, and other machine learning libraries for model building and evaluation.

In [1]:
!pip install lazypredict

import io
import sys
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import xgboost as xgb
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
import category_encoders as ce

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score

from IPython.utils.io import capture_output

from hyperopt import hp, fmin, tpe, Trials

from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args

from contextlib import redirect_stdout, redirect_stderr
from IPython.utils.io import capture_output

Collecting lazypredict
  Downloading lazypredict-0.2.13-py2.py3-none-any.whl.metadata (12 kB)
Downloading lazypredict-0.2.13-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.13


2025-02-10 14:06:20.416438: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-02-10 14:06:20.416565: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-02-10 14:06:20.795063: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# 2. Reading the Datasets

We start by reading the training and test datasets using the Pandas library. The training dataset contains the target variable `Transported`, which indicates whether a passenger was transported to another dimension. We convert this column to an integer type for compatibility with machine learning algorithms.

We then concatenate the train and test datasets for combined preprocessing and feature engineering. This helps to ensure consistency in data processing and avoid data leakage.

In [2]:
train_data = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
train_data['Transported'] = train_data['Transported'].astype(int)
train_data.head()

test_data = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
test_data.head()

train_data['Source'] = 'train'
test_data['Source'] = 'test'

df = pd.concat([train_data, test_data], axis=0)

# 3. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) involves summarizing and visualizing the main characteristics of the data. This step helps us understand the structure of the dataset, identify patterns, detect anomalies, and check assumptions.

We create a summary function to give an overview of the dataset, including:
- The number of unique values in each column.
- The number of missing values in each column.
- The data types of each column.

This overview helps us identify any issues that need to be addressed during data preprocessing.

In [3]:
def summary(df):
    print(f"Dataset has {df.shape[1]} features and {df.shape[0]} examples.")
    summary = pd.DataFrame(index=df.columns)
    summary["Unique"] = df.nunique().values
    summary["Missing"] = df.isnull().sum().values
    summary["Duplicated"] = df.duplicated().sum()
    summary["Types"] = df.dtypes
    return summary

summary(df)

Dataset has 15 features and 12970 examples.


Unnamed: 0,Unique,Missing,Duplicated,Types
PassengerId,12970,0,0,object
HomePlanet,3,288,0,object
CryoSleep,2,310,0,object
Cabin,9825,299,0,object
Destination,3,274,0,object
Age,80,270,0,float64
VIP,2,296,0,object
RoomService,1578,263,0,float64
FoodCourt,1953,289,0,float64
ShoppingMall,1367,306,0,float64


# 4. Feature Engineering

### 4.1 Extract Group Size from `PassengerId`

A unique Id for each passenger takes the form `gggg_pp` where `gggg` indicates a group the passenger is travelling with and `pp` is their number within the group. People in a group are often family members, but not always. Extracting the group size can help us understand the potential impact of group travel on the likelihood of being transported.

In [4]:
group = df['PassengerId'].apply(lambda x: x.split('_')[0]).value_counts().to_dict()
df['Group_size'] = df['PassengerId'].apply(lambda x: group[x.split('_')[0]])
df.set_index('PassengerId', inplace=True)

### 4.2 Handle Missing Values in `HomePlanet`

The `HomePlanet` column indicates the planet the passenger is from. To fill missing values, we use a random selection based on the distribution of existing values in the column. This helps maintain the original distribution of `HomePlanet` in the dataset.

In [5]:
v = df['HomePlanet'].value_counts().index
p = df['HomePlanet'].value_counts(normalize=True).values
df.loc[df['HomePlanet'].isna(), 'HomePlanet'] = np.random.choice(v, df['HomePlanet'].isna().sum(), p=p)

### 4.3 Handle Missing Values in CryoSleep

The `CryoSleep` column indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Given that being in CryoSleep is a notable status that would likely be recorded, missing values are assumed to indicate passengers were not in CryoSleep. Thus, we fill missing values with 0 and convert this column to an integer type.

In [6]:
df['CryoSleep'] = df['CryoSleep'].fillna(0).astype(int)

### 4.4 Extract Cabin Components and Handle Missing Values

The `Cabin` column provides information about the passenger's cabin, including the deck, number, and side. We extract these components into separate columns and handle missing values by randomly assigning values based on the distribution of the existing data.

In [7]:
tmp = df['Cabin'].apply(lambda x: x.split('/') if isinstance(x, str) else ['-1', '-1', '-1']).to_list()
tmp = np.array(tmp)

df['Cabin_deck'] = tmp[:, 0]
df['Cabin_num'] = tmp[:, 1].astype(int)
df['Cabin_side'] = pd.Series(tmp[:, 2]).map({'S': 0, 'P': 1})

df.drop(columns='Cabin', inplace=True)

df.loc[df['Cabin_deck'] == '-1', 'Cabin_deck'] = np.random.choice(['F', 'G'], sum(df['Cabin_deck'] == '-1'), p=[0.5, 0.5])
df.loc[df['Cabin_side'].isna(), 'Cabin_side'] = np.random.choice([0, 1], sum(df['Cabin_side'].isna()), p=[0.5, 0.5])

### 4.5 Handle Missing Values in `Destination`

The `Destination` column indicates the destination of the passenger. We fill missing values by randomly assigning values based on the distribution of the existing data. This maintains the original distribution of `Destination` in the dataset.

In [8]:
v = df['Destination'].value_counts().index
p = df['Destination'].value_counts(normalize=True).values
df.loc[df['Destination'].isna(), 'Destination'] = np.random.choice(v, df['Destination'].isna().sum(), p=p)

### 4.6 Handle Missing Values in `Age`

The `Age` column indicates the age of the passenger. To fill missing values, we use a random sample from a normal distribution with the mean and standard deviation of the existing ages. This helps maintain the original distribution of `Age` in the dataset.

In [9]:
mean_age = df["Age"].mean()
std_age = df["Age"].std()
is_null = df["Age"].isnull().sum()
rand_sample = np.random.uniform(mean_age - std_age, mean_age + std_age, size = is_null)
df.loc[df['Age'].isna(), 'Age'] = rand_sample

### 4.7 Handle Missing Values in `VIP`

The `VIP` column indicates whether the passenger paid for special VIP service during the voyage. We fill missing values with `False` (0), assuming that most passengers did not opt for VIP service. We then convert this column to an integer type.

In [10]:
df['VIP'] = df['VIP'].fillna(0).astype(int)

### 4.8 Handle Missing Values in Spending Columns

The columns `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, and `VRDeck` represent the amount of money spent by the passenger on various amenities. We fill missing values with the median value of each column. This helps maintain the original distribution of spending in the dataset.

In [11]:
cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

for col in cols:
    df[col] = df[col].fillna(df[col].median())

### 4.9 Create Total Spending Feature

We create a new feature `total_spending` by summing the values of the spending columns. This feature represents the total amount of money spent by the passenger on various amenities during the voyage. We then apply a logarithmic transformation to normalize the spending data, making it easier for the model to learn from.

In [12]:
df['total_spending'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']

for col in cols + ['total_spending']:
    df.loc[df[col]==0, col] = 0.367
    df[col] = np.log(df[col])

### 4.10 Drop the Name column as it does not contain valuable information

In [13]:
df.drop(columns='Name', inplace=True)

# 5. One-Hot Encoding for Categorical Features

One-hot encoding is a technique to convert categorical variables into numerical format. This step is crucial for machine learning algorithms, as they require numerical input.

We apply one-hot encoding to the categorical features `HomePlanet`, `Destination`, and `Cabin_deck`. This creates new binary columns for each category, allowing the model to understand and use the categorical data effectively.

In [14]:
categorical_features = ['HomePlanet', 'Destination', 'Cabin_deck']
df = pd.concat([df, pd.get_dummies(df[categorical_features], dtype=int)], axis=1)
df.drop(columns=categorical_features, inplace=True)

# 6. Splitting Data back to Train and Test Sets

After preprocessing and feature engineering, we split the combined dataset back into the original training and test sets. This ensures that we maintain the original structure of the data, with the training set used to train the model and the test set used to evaluate its performance.

We also separate the target variable `Transported` from the training data, as this is what we aim to predict.

In [15]:
train_df = df[df['Source'] == 'train'].drop(columns=['Source'])
test_df = df[df['Source'] == 'test'].drop(columns=['Source'])

X = train_df.drop(columns=['Transported'])
y = train_df['Transported']

train_df.head()
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8693 entries, 0001_01 to 9280_02
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   CryoSleep                  8693 non-null   int64  
 1   Age                        8693 non-null   float64
 2   VIP                        8693 non-null   int64  
 3   RoomService                8693 non-null   float64
 4   FoodCourt                  8693 non-null   float64
 5   ShoppingMall               8693 non-null   float64
 6   Spa                        8693 non-null   float64
 7   VRDeck                     8693 non-null   float64
 8   Transported                8693 non-null   float64
 9   Group_size                 8693 non-null   int64  
 10  Cabin_num                  8693 non-null   int64  
 11  Cabin_side                 8693 non-null   float64
 12  total_spending             8693 non-null   float64
 13  HomePlanet_Earth           8693 non-null   i

# 7. Initial Model Comparison with LazyPredict

We will use the LazyPredict library to quickly compare the performance of different machine learning models. LazyPredict provides an easy-to-use interface to fit multiple models and get a quick overview of their performance.

In [16]:
import lazypredict
from lazypredict.Supervised import LazyClassifier

X = train_df.drop('Transported', axis=1)
y = train_df['Transported'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)

100%|██████████| 29/29 [00:18<00:00,  1.53it/s]

[LightGBM] [Info] Number of positive: 3731, number of negative: 3658
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005584 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1963
[LightGBM] [Info] Number of data points in the train set: 7389, number of used features: 25
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.504940 -> initscore=0.019760
[LightGBM] [Info] Start training from score 0.019760
                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  \
Model                                                                           
LGBMClassifier                     0.80               0.80     0.80      0.80   
XGBClassifier                      0.79               0.79     0.79      0.79   
SVC                                0.78               0.78     0.78      0.78   
RandomForestClassifier             0.78   




# 8. Detailed Modeling with Pipelines and GridSearchCV

For a more detailed evaluation and fine-tuning, we will use several classifiers with pipelines and GridSearchCV. We will use k-fold cross-validation to ensure a robust estimation of model performance.

We will perform the following steps:

1. Set up the models and pipelines.
2. Define the hyperparameter grids for each model.
3. Use GridSearchCV with k-fold cross-validation to find the best parameters for each model.
4. Evaluate the performance of the best models on the test set.

In [17]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import accuracy_score

pipelines = {
    'adaboost': make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=123)),
    'xgboost': make_pipeline(StandardScaler(), XGBClassifier(random_state=123)),
    'catboost': make_pipeline(StandardScaler(), CatBoostClassifier(random_state=123)),
    'gradientboost': make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=123)),
    'lightgbm': make_pipeline(StandardScaler(), LGBMClassifier(random_state=123)),
    'randomforest': make_pipeline(StandardScaler(), RandomForestClassifier(random_state=123)),
    'logistic': make_pipeline(StandardScaler(), LogisticRegression(random_state=123)),
    'knn': make_pipeline(StandardScaler(), KNeighborsClassifier())
}

grid = {
    'adaboost': {
        'adaboostclassifier__n_estimators': [50, 100, 150],
        'adaboostclassifier__learning_rate': [0.01, 0.05, 0.1]
    },
    'xgboost': {
        'xgbclassifier__n_estimators': [50, 100, 150],
        'xgbclassifier__learning_rate': [0.01, 0.05],
        'xgbclassifier__max_depth': [3, 4],
        'xgbclassifier__gamma': [0.1, 0.2],
        'xgbclassifier__subsample': [0.6, 0.8]
    },
    'catboost': {
        'catboostclassifier__learning_rate': [0.01, 0.05, 0.1, 0.5],
        'catboostclassifier__depth': [2, 3, 4], 
        'catboostclassifier__iterations': [50, 100, 150]
    },
    'gradientboost': {
        'gradientboostingclassifier__n_estimators': [50, 100, 150],
        'gradientboostingclassifier__learning_rate': [0.01, 0.05],
        'gradientboostingclassifier__max_depth': [3, 4],
        'gradientboostingclassifier__subsample': [0.6, 0.8]
    },
    'lightgbm': {
        'lgbmclassifier__n_estimators': [50, 100, 150],
        'lgbmclassifier__learning_rate': [0.01, 0.05],
        'lgbmclassifier__max_depth': [3, 4],
        'lgbmclassifier__subsample': [0.6, 0.8]
    },
    'randomforest': {
        'randomforestclassifier__n_estimators': [50, 100, 150],
        'randomforestclassifier__max_depth': [3, 4, 5],
        'randomforestclassifier__min_samples_split': [2, 3, 4]
    },
    'logistic': {
        'logisticregression__C': [0.01, 0.1, 1, 10]
    },
    'knn': {
        'kneighborsclassifier__n_neighbors': [3, 5, 7],
        'kneighborsclassifier__weights': ['uniform', 'distance'],
        'kneighborsclassifier__metric': ['euclidean', 'manhattan']
    }
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

best_estimators = {}
kf = KFold(n_splits=5, shuffle=True, random_state=42)

metrics = []

f = io.StringIO()
e = io.StringIO()

with redirect_stdout(f), redirect_stderr(e):  # suppress outputs
    with capture_output():
        for name, pipeline in pipelines.items():
            grid_search = GridSearchCV(pipeline, grid[name], cv=kf, n_jobs=-1, verbose=0)
            grid_search.fit(X_train, y_train)
            best_estimators[name] = grid_search.best_estimator_
            yhat = grid_search.predict(X_test)
            accuracy = accuracy_score(y_test, yhat)
            precision = precision_score(y_test, yhat)
            recall = recall_score(y_test, yhat)
            metrics.append((name, accuracy, precision, recall))
            print(f'Metrics for {name}: accuracy- {accuracy:.3f}, precision- {precision:.3f}, recall- {recall:.3f}')
            print(f"Best parameters for {name}: {grid_search.best_params_}")



0:	learn: 0.6905940	total: 59.9ms	remaining: 2.94s
1:	learn: 0.6876475	total: 62.1ms	remaining: 1.49s
2:	learn: 0.6842125	total: 68.3ms	remaining: 1.07s
3:	learn: 0.6809990	total: 76.3ms	remaining: 877ms
4:	learn: 0.6783445	total: 78.6ms	remaining: 707ms
5:	learn: 0.6759075	total: 87.2ms	remaining: 639ms
6:	learn: 0.6727219	total: 94ms	remaining: 578ms
7:	learn: 0.6705179	total: 96.9ms	remaining: 509ms
8:	learn: 0.6676099	total: 104ms	remaining: 472ms
9:	learn: 0.6646919	total: 110ms	remaining: 440ms
10:	learn: 0.6624350	total: 117ms	remaining: 415ms
11:	learn: 0.6596734	total: 125ms	remaining: 396ms
12:	learn: 0.6569783	total: 127ms	remaining: 362ms
13:	learn: 0.6550299	total: 133ms	remaining: 343ms
14:	learn: 0.6529287	total: 140ms	remaining: 327ms
15:	learn: 0.6503793	total: 144ms	remaining: 305ms
16:	learn: 0.6483618	total: 150ms	remaining: 291ms
17:	learn: 0.6459235	total: 157ms	remaining: 279ms
18:	learn: 0.6434593	total: 164ms	remaining: 268ms
19:	learn: 0.6412568	total: 172ms	r

In [18]:
metrics.sort(key=lambda x: x[1], reverse=True)

print("\nSorted model metrics (by accuracy):")
for metric in metrics:
    print(f"Model: {metric[0]}, Accuracy: {metric[1]:.3f}, Precision: {metric[2]:.3f}, Recall: {metric[3]:.3f}")



Sorted model metrics (by accuracy):
Model: catboost, Accuracy: 0.797, Precision: 0.776, Recall: 0.830
Model: lightgbm, Accuracy: 0.795, Precision: 0.779, Recall: 0.819
Model: xgboost, Accuracy: 0.794, Precision: 0.768, Recall: 0.839
Model: gradientboost, Accuracy: 0.789, Precision: 0.764, Recall: 0.832
Model: adaboost, Accuracy: 0.766, Precision: 0.761, Recall: 0.771
Model: logistic, Accuracy: 0.765, Precision: 0.752, Recall: 0.787
Model: randomforest, Accuracy: 0.762, Precision: 0.779, Recall: 0.726
Model: knn, Accuracy: 0.738, Precision: 0.752, Recall: 0.705


# 9. Apply ensemble method on the test set

In [19]:
from sklearn.ensemble import VotingClassifier

top_models = [name for name, _, _, _ in metrics[:5]]

estimators = [(name, best_estimators[name]) for name in top_models]

ensemble_model = VotingClassifier(estimators=estimators, voting='soft', n_jobs=-1)

ensemble_model.fit(X, y)

test_df = test_df.drop("Transported",axis=1)
test_predictions = ensemble_model.predict(test_df)
test_predictions = test_predictions.astype(bool)

[LightGBM] [Info] Number of positive: 4378, number of negative: 4315
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001198 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1971
[LightGBM] [Info] Number of data points in the train set: 8693, number of used features: 25
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.503624 -> initscore=0.014495
[LightGBM] [Info] Start training from score 0.014495
0:	learn: 0.6534607	total: 6.81ms	remaining: 1.01s
1:	learn: 0.6208866	total: 12.8ms	remaining: 946ms
2:	learn: 0.5982503	total: 16.2ms	remaining: 793ms
3:	learn: 0.5752141	total: 20.7ms	remaining: 754ms
4:	learn: 0.5598762	total: 26.5ms	remaining: 768ms
5:	learn: 0.5461717	total: 30.6ms	remaining: 735ms
6:	learn: 0.5327222	total: 34.1ms	remaining: 698ms
7:	learn: 0.5227910	total: 36.9ms	remaining: 655ms
8:	learn: 0.5154181	total: 40.4ms	remai

In [20]:
submission = pd.DataFrame([test_data['PassengerId'], test_predictions]).T
submission.columns = ['PassengerID', 'Transported']

In [21]:
submission.head()

Unnamed: 0,PassengerID,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,True


In [22]:
submission.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
