## Spaceship Titanic 🚀
This notebook is an exercise in exploratory data analysis and machine learning. It uses the [Spaceship Titanic](https://www.kaggle.com/c/spaceship-titanic) data set, which contains information about the passengers' fate on the futuristic spaceship Titanic.

### Introduction
First, we will import the necessary libraries, load the data, and survey basic information about the data.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

train_set = pd.read_csv('train.csv')
test_set = pd.read_csv('test.csv')

train = train_set.copy()
test = test_set.copy()
train_set.head()

#### 1. Check the dataset details, missing values, data types, and other information.
This will help us choose the examination methods and determine the exploration strategy.



In [None]:
# check the 'Transported' column data distribution
sns.countplot(x='Transported', data=train_set)

In [None]:
train_set.dtypes ### check the data type of each column

We can see that the features are of type `object` and `float64`, and that the target `Transported` is `bool`: the `object` columns are categorical, and the `float64` columns are numerical.
A full description of the columns is available on the [Kaggle data page](https://www.kaggle.com/c/spaceship-titanic/data).
The columns are as follows:
- **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

- **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.

- **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

- **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

- **Destination** - The planet the passenger will be debarking to.

- **Age** - The age of the passenger.

- **VIP** - Whether the passenger has paid for special VIP service during the voyage.

- **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

- **Name** - The first and last names of the passenger.

- **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
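
The `gggg_pp` structure of **PassengerId** described above can be unpacked in the same way as the cabin string. The following is a minimal sketch on a made-up toy frame; the column names `Group`, `NumberInGroup`, and `GroupSize` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame with made-up ids in the gggg_pp format described above.
toy = pd.DataFrame({'PassengerId': ['0001_01', '0002_01', '0002_02', '0003_01']})

# Split the id into the travel group and the passenger's number within it.
toy[['Group', 'NumberInGroup']] = toy['PassengerId'].str.split('_', expand=True)

# Hypothetical derived feature: how many passengers share the same group.
toy['GroupSize'] = toy.groupby('Group')['PassengerId'].transform('count')

print(toy)
```

Whether group size actually helps the prediction would still need to be checked against the `Transported` column.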

In [None]:
train_set.isnull().sum() ### check the number of missing values in each column

In [None]:
train_set.isnull().sum()/train_set.shape[0]*100 ### check the percentage of missing values in each column
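
The percentage can also be computed with `isnull().mean()`, since the mean of a boolean mask is the fraction of `True` values. A small sketch on made-up toy data, sorted so the columns with the most gaps come first:

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; only the pattern of missing entries matters here.
toy = pd.DataFrame({
    'Age': [39.0, np.nan, 58.0, 33.0],
    'VIP': [False, False, np.nan, np.nan],
    'Spa': [0.0, 549.0, 6715.0, 3329.0],
})

# Fraction of missing values per column, expressed as a percentage.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```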

In [None]:
train_set.describe() ### check the statistical summary of each column (from the numerical columns)

#### 2. Check the data distribution, correlation and mutual information of the features.

In [None]:
### get the names of numerical and categorical columns
numerical_cols = train_set.dtypes[train_set.dtypes != 'object'].index
categorical_cols = train_set.dtypes[train_set.dtypes == 'object'].index

In [None]:
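### get the correlation matrix of numerical columns
### numeric_only=True skips the object columns, which recent pandas versions refuse to correlate
corr_matrix = train_set.corr(numeric_only=True)
corr_matrix.style.background_gradient() ### display the correlation matrix as a heat map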

Based on the correlation matrix, the amenity columns can be grouped as follows: Spa + VRDeck, FoodCourt + ShoppingMall, and RoomService on its own.
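
The heading of this section also mentions mutual information, which the notebook does not compute. As a rough sketch, the mutual information between a categorical feature and the target can be estimated from their joint frequency table. The function name `discrete_mutual_information` and the toy series are made up for illustration; for continuous features an estimator such as scikit-learn's `mutual_info_classif` would be a more suitable choice.

```python
import numpy as np
import pandas as pd


def discrete_mutual_information(x: pd.Series, y: pd.Series) -> float:
    """Mutual information (in nats) between two discrete variables."""
    joint = pd.crosstab(x, y, normalize=True).to_numpy()  # p(x, y)
    px = joint.sum(axis=1, keepdims=True)                  # p(x), column vector
    py = joint.sum(axis=0, keepdims=True)                  # p(y), row vector
    independent = px @ py                                  # p(x) * p(y)
    nonzero = joint > 0                                    # empty cells contribute 0
    return float((joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])).sum())


# Made-up toy data: one feature copies the target, the other is unrelated to it.
target = pd.Series([True, True, False, False], name='Transported')
copy_of_target = pd.Series(['yes', 'yes', 'no', 'no'], name='Informative')
unrelated = pd.Series(['a', 'b', 'a', 'b'], name='Unrelated')

mi_informative = discrete_mutual_information(copy_of_target, target)
mi_unrelated = discrete_mutual_information(unrelated, target)
print(mi_informative, mi_unrelated)
```

A feature that determines the target completely reaches the entropy of the target (log 2 for a balanced binary target), while an independent feature scores zero.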

In [None]:
sns.displot(data=train_set,x='HomePlanet', hue='Transported') ### check the distribution of home planet and transported

In [None]:
### check age distribution with division by transported column
sns.violinplot(x='Transported', y='Age', data=train_set)


In [None]:
### group people into bins and check the distribution of age with transported column
bins = [0, 15, 20, 30, 40, 50, 60, 100]
group_names = ['0-15', '16-20', '21-30', '31-40', '41-50', '51-60', '61-100']
### include_lowest=True keeps passengers with Age == 0 in the first bin
train_set['AgeGroup'] = pd.cut(train_set['Age'], bins, labels=group_names, include_lowest=True)
sns.histplot(hue='Transported', x='AgeGroup', data=train_set)
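
One detail of `pd.cut` is worth checking here: by default the intervals are closed on the right only, so a value equal to the lowest bin edge falls outside every bin unless `include_lowest=True` is passed. A short sketch with a few hand-picked ages at the bin edges:

```python
import pandas as pd

bins = [0, 15, 20, 30, 40, 50, 60, 100]
group_names = ['0-15', '16-20', '21-30', '31-40', '41-50', '51-60', '61-100']

# Hand-picked ages sitting on or next to the bin edges.
ages = pd.Series([0, 15, 16, 100])

# Default: intervals are (0, 15], (15, 20], ... so age 0 gets no bin.
default_cut = pd.cut(ages, bins, labels=group_names)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(ages, bins, labels=group_names, include_lowest=True)

print(pd.DataFrame({'Age': ages, 'default': default_cut, 'include_lowest': inclusive_cut}))
```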


In [None]:
sns.countplot(data=train_set, x='CryoSleep', hue='Transported') ### check the distribution of cryosleep and transported

In [None]:
### test the cabin split hypothesis
hypo = train_set[['Cabin','Transported']].copy()
hypo[['Deck', 'Num', 'Side']] = hypo['Cabin'].str.split('/', expand=True)

In [None]:
sns.countplot(x='Deck', data=hypo, hue='Transported')

In [None]:
sns.countplot(x='Side', data=hypo, hue='Transported')
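
The count plots can be complemented with the transported rate per category, which is easier to compare when the categories differ a lot in size. A minimal sketch on a made-up toy frame (the cabin values and outcomes are invented); note that a missing cabin yields missing values in all three split columns and is left out of the group means.

```python
import numpy as np
import pandas as pd

# Made-up cabins in the deck/num/side format, including one missing value.
toy = pd.DataFrame({
    'Cabin': ['B/0/P', 'F/0/S', 'F/1/S', np.nan, 'B/2/P'],
    'Transported': [False, True, True, False, True],
})
toy[['Deck', 'Num', 'Side']] = toy['Cabin'].str.split('/', expand=True)

# The mean of a boolean column is the share of True values in each group.
deck_rate = toy.groupby('Deck')['Transported'].mean()
side_rate = toy.groupby('Side')['Transported'].mean()

print(deck_rate)
print(side_rate)
```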

In [None]:
sns.countplot(data=train_set,x='VIP', hue='Transported')

In [None]:
amenities = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
# RoomService seems to be distinguishable from the other amenities
# FoodCourt, Spa and VRDeck are somewhat correlated, so these 3 variables could be summed
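
When amenity columns are summed, missing values need some care: adding two columns with `+` gives `NaN` as soon as either value is missing, while `DataFrame.sum(axis=1)` skips missing values. A small sketch on made-up amounts showing the three behaviours side by side:

```python
import numpy as np
import pandas as pd

# Made-up bills: complete row, one missing value, both values missing.
toy = pd.DataFrame({
    'Spa': [100.0, np.nan, np.nan],
    'VRDeck': [50.0, 20.0, np.nan],
})

# "+" propagates NaN: the result is missing if either input is missing.
plus_sum = toy['Spa'] + toy['VRDeck']

# sum(axis=1) skips NaN, so a fully missing row becomes 0.
skipna_sum = toy[['Spa', 'VRDeck']].sum(axis=1)

# min_count=1 keeps the row missing only when every input is missing.
min_count_sum = toy[['Spa', 'VRDeck']].sum(axis=1, min_count=1)

print(pd.DataFrame({'plus': plus_sum, 'sum': skipna_sum, 'sum_min_count': min_count_sum}))
```

Which variant is appropriate depends on whether a missing bill should be read as "no spending" or as "unknown".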

In [None]:
# create pipeline with column transformer and model
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from xgboost import XGBClassifier

# keep the target out of the feature matrix
X = train.drop(columns='Transported')
y = train['Transported']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

class spaceship_titanic_transformer(BaseEstimator, TransformerMixin):
    """Feature engineering derived from the exploration above."""
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # nothing to learn: all transformations are stateless
        return self
    def transform(self, X):
        df = X.copy()
        # split the cabin string into deck, number and side
        df[['Deck', 'Num', 'Side']] = df['Cabin'].str.split('/', expand=True)
        # bin the age; include_lowest=True keeps Age == 0 in the first bin
        bins = [0, 15, 20, 30, 40, 50, 60, 100]
        group_names = ['0-15', '16-20', '21-30', '31-40', '41-50', '51-60', '61-100']
        df['AgeGroup'] = pd.cut(df['Age'], bins, labels=group_names, include_lowest=True)
        # combine the amenity columns grouped during exploration
        df['SpaVR'] = df['Spa']+df['VRDeck']
        df['FoodMall'] = df['FoodCourt']+df['ShoppingMall']
        df = df.drop(['PassengerId', 'Name', 'Cabin', 'Spa', 'VRDeck', 'FoodCourt', 'ShoppingMall'], axis=1)
        return df
        


#numerical transformer
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

#categorical transformer
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

#preprocessing pipeline
column_transformer = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, ['SpaVR', 'FoodMall', 'RoomService']),
        ('cat', categorical_transformer, ['Deck', 'Num', 'Side', 'AgeGroup', 'VIP', 'CryoSleep', 'HomePlanet'])
    ])


ml_pipeline = Pipeline(steps=[('preprocessor', spaceship_titanic_transformer()),
    ('column_transformer', column_transformer),
    ('classifier', XGBClassifier())])

# search for hyperparameters by grid search
# parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {
    'classifier__n_estimators': [200, 500, 1000],
    'classifier__max_depth': [3, 5, 7],
    'classifier__learning_rate': [0.1, 0.3, 0.5, 0.7, 0.9],
    'classifier__min_child_weight': [1, 3, 5],
    'classifier__gamma': [0, 0.1, 0.3, 0.5, 0.7, 0.9],
    'classifier__subsample': [0.6, 0.8, 1.0],
    'classifier__colsample_bytree': [0.6, 0.8, 1.0],
    'classifier__reg_alpha': [0, 0.1, 0.3, 0.5, 0.7, 0.9],
}

# this is an exhaustive search over a large grid, so expect a long run time
grid_search = GridSearchCV(ml_pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)


# check the best parameters
print(grid_search.best_params_)
# get the mean cross-validated accuracy of the best estimator
print(grid_search.best_score_)
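
Before launching an exhaustive search it can help to count how many model fits it implies. The sketch below copies the value lists from the grid above and multiplies their lengths; the variable names `grid_values`, `n_combinations`, and `n_fits` are illustrative only.

```python
from math import prod

# Value lists copied from the grid above; only their lengths matter here.
grid_values = {
    'n_estimators': [200, 500, 1000],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.1, 0.3, 0.5, 0.7, 0.9],
    'min_child_weight': [1, 3, 5],
    'gamma': [0, 0.1, 0.3, 0.5, 0.7, 0.9],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0, 0.1, 0.3, 0.5, 0.7, 0.9],
}

cv_folds = 5  # matches cv=5 in the grid search above

# A full grid search fits one model per parameter combination and fold.
n_combinations = prod(len(values) for values in grid_values.values())
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 43740 218700
```

With this many fits, a smaller grid or a randomized search may be a more practical starting point.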


In [None]:
# build the submission: one row per passenger with the predicted target as True/False
predictions = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Transported': grid_search.predict(test).astype(bool),
})
predictions.to_csv('submission.csv', index=False)
