# 🌌 Spaceship Titanic: My Starter Model

Hello! This is a simple starter model. **Stay tuned for more updates...**

### File and Data Field Descriptions

**train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
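
The `PassengerId` and `Cabin` formats described above can be taken apart with plain string splitting. The following is a minimal sketch on a hypothetical two-row sample (the values and the column names `Group`, `NumInGroup`, `Deck`, `Num` and `Side` are made up for illustration), not the code used later in this notebook.

```python
import pandas as pd

# Hypothetical two-row sample that follows the formats described above.
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S"],
})

# PassengerId has the form gggg_pp: group id, then number within the group.
id_parts = sample["PassengerId"].str.split("_", expand=True)
sample["Group"] = id_parts[0]
sample["NumInGroup"] = id_parts[1].astype(int)

# Cabin has the form deck/num/side, where side is P (Port) or S (Starboard).
cabin_parts = sample["Cabin"].str.split("/", expand=True)
sample["Deck"] = cabin_parts[0]
sample["Num"] = cabin_parts[1]
sample["Side"] = cabin_parts[2]

print(sample)
```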

**test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

**sample_submission.csv** - A submission file in the correct format.

* PassengerId - Id for each passenger in the test set.
* Transported - The target. For each passenger, predict either True or False.
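
As a rough illustration of that layout, a submission can be built as a two-column DataFrame and written without the index. The ids and predictions here are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical ids and predictions, used only to illustrate the file layout.
passenger_ids = ["0013_01", "0018_01", "0019_01"]
predictions = [True, False, True]

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Transported": predictions,
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```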

# Loading Libraries...

In [None]:
%%time
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Setting Notebook Parameters...

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# DataFrame display limits: the number of rows and columns to show...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to only use 2 decimal places, tables look nicer.
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

# Loading Information from CSV...

In [None]:
%%time
trn_data = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
tst_data = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')

sub = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')

# Exploring the Information Available...

In [None]:
%%time
trn_data.info()

In [None]:
%%time
trn_data.head()

In [None]:
%%time
trn_data.describe()

In [None]:
%%time
def describe_categ(df):
    for col in df.columns:
        unique_samples = list(df[col].unique())
        unique_values = df[col].nunique()

        print(f' {col}: {unique_values} Unique Values,  Data Sample >> {unique_samples[:5]}')
    print(' ...')
    return None

In [None]:
%%time
describe_categ(trn_data)

In [None]:
%%time
describe_categ(tst_data)

In [None]:
%%time
trn_data.isnull().sum()

In [None]:
%%time
tst_data.head()

In [None]:
%%time
tst_data.isnull().sum()

In [None]:
%%time
sub.sample(10)

# Exploring the Target Variable...
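
A quick way to check class balance for a boolean target is `value_counts(normalize=True)`. This is a small sketch on a hypothetical four-value Series rather than the real `Transported` column.

```python
import pandas as pd

# Hypothetical boolean target; the real values come from train.csv.
target = pd.Series([True, False, True, True], name="Transported")

# normalize=True returns fractions that sum to 1 instead of raw counts.
shares = target.value_counts(normalize=True)

# The % format specifier multiplies by 100 and appends the percent sign.
for label, share in shares.items():
    print(f"{label}: {share:.2%}")
```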

In [None]:
%%time
def analyse_categ_target(df, target = 'Transported'):
    
    transported = df[df[target] == True].shape[0]
    not_transported = df[df[target] == False].shape[0]
    total = transported + not_transported
    
    print(f'Transported     : {transported / total:.2%}')
    print(f'Not Transported : {not_transported / total:.2%}')
    print(f'Total Passengers: {total}')
    print('...')

In [None]:
%%time
analyse_categ_target(trn_data)

In [None]:
%%time
trn_passenger_ids = set(trn_data['PassengerId'].unique())
tst_passenger_ids = set(tst_data['PassengerId'].unique())
intersection = trn_passenger_ids.intersection(tst_passenger_ids)
print('Overlapped Passengers:', len(intersection))

# Feature Engineering...
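
The imputation step in this section fills each frame from its own statistics. One possible variant, sketched here on hypothetical toy frames, derives the fill values from the training data only and reuses them on the test data so that both frames are filled consistently. The names `train`, `test` and `fill_values` are assumptions for illustration; this is not what the notebook itself does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with missing values in both files.
train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "HomePlanet": ["Earth", None, "Earth"],
})
test = pd.DataFrame({
    "Age": [np.nan, 10.0],
    "HomePlanet": [None, "Mars"],
})

# Mean for the numeric column, most common value for the categorical one,
# both computed on the training frame only.
fill_values = {
    "Age": train["Age"].mean(),
    "HomePlanet": train["HomePlanet"].mode()[0],
}

# fillna accepts a dict that maps column names to fill values.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(test_filled)
```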

In [None]:
%%time
def fill_missing(df):
    '''
    Fill NaN or missing values with the column mean (numeric columns) or the most common value (all other columns)...
    
    '''
    
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numeric_tmp = df.select_dtypes(include = numerics)
    categ_tmp = df.select_dtypes(exclude = numerics)

    for col in numeric_tmp.columns:
        print(col)
        df[col] = df[col].fillna(value = df[col].mean())
        
    for col in categ_tmp.columns:
        print(col)
        df[col] = df[col].fillna(value = df[col].mode()[0])
        
    print('...')
    
    return df

In [None]:
%%time
trn_data =  fill_missing(trn_data)
tst_data =  fill_missing(tst_data)

In [None]:
%%time
def total_billed(df):
    '''
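    Calculate the total amount billed to the passenger during the trip.
    Args:
        df: DataFrame with the RoomService, FoodCourt, ShoppingMall, Spa and VRDeck columns.
    Returns:
        The same DataFrame with a new Total_Billed column.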
    
    '''
    
    df['Total_Billed'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']
    return df

In [None]:
%%time
trn_data = total_billed(trn_data)
tst_data = total_billed(tst_data)

In [None]:
%%time
def name_ext(df):
    '''
    Split the Name of the passenger into First and Family...
    
    '''
    
    df['FirstName'] = df['Name'].str.split(' ', expand=True)[0]
    df['FamilyName'] = df['Name'].str.split(' ', expand=True)[1]
    df.drop(columns = ['Name'], inplace = True)
    return df

In [None]:
%%time
trn_data = name_ext(trn_data)
tst_data = name_ext(tst_data)

In [None]:
%%time
trn_relatives = trn_data.groupby('FamilyName')['PassengerId'].count().reset_index()
tst_relatives = tst_data.groupby('FamilyName')['PassengerId'].count().reset_index()

In [None]:
%%time
trn_relatives = trn_relatives.rename(columns = {'PassengerId': 'NumRelatives'})
tst_relatives = tst_relatives.rename(columns = {'PassengerId': 'NumRelatives'})

In [None]:
%%time
trn_data = trn_data.merge(trn_relatives, how = 'left', on = ['FamilyName'])
tst_data = tst_data.merge(tst_relatives, how = 'left', on = ['FamilyName'])

In [None]:
%%time
def cabin_separation(df):
    '''
    Split the Cabin name into Deck, Number and Side
    
    '''
    
    df['CabinDeck'] = df['Cabin'].str.split('/', expand=True)[0]
    df['CabinNum'] = df['Cabin'].str.split('/', expand=True)[1]
    df['CabinSide'] = df['Cabin'].str.split('/', expand=True)[2]
    df.drop(columns = ['Cabin'], inplace = True)
    return df

In [None]:
%%time
trn_data = cabin_separation(trn_data)
tst_data = cabin_separation(tst_data)

In [None]:
%%time
def route(df):
    '''
    Combine origin and destination into a single Route feature for training.
    Args:
        df: DataFrame with the HomePlanet and Destination columns.
    Returns:
        The same DataFrame with a new Route column.
    '''
    
    df['Route'] = df['HomePlanet'] + df['Destination']
    return df

In [None]:
%%time
trn_data = route(trn_data)
tst_data = route(tst_data)

In [None]:
def age_groups(df):
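    '''
    Add binary age flags: IsKid (Age <= 10), IsAdult (Age > 10) and IsOlder (Age >= 65).
    Note that IsOlder overlaps with IsAdult.
    '''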
    df['IsKid'] = np.where(df['Age'] <= 10, 1, 0)
    df['IsAdult'] = np.where(df['Age'] > 10, 1, 0)
    df['IsOlder'] = np.where(df['Age'] >= 65, 1, 0)
    return df

In [None]:
%%time
trn_data = age_groups(trn_data)
tst_data = age_groups(tst_data)

In [None]:
def extract_group(df):
    '''
    Extract the travel group (the gggg part of PassengerId) into TravelGroup.
    '''
    df['TravelGroup'] =  df['PassengerId'].str.split('_', expand = True)[0]
    return df

In [None]:
%%time
trn_data = extract_group(trn_data)
tst_data = extract_group(tst_data)

In [None]:
%%time
trn_data.head()

# Pre-Processing for Training
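
When train and test are one-hot encoded separately, the two frames can end up with different columns if a category is missing from one file. Whether that happens with this dataset is not checked here. The sketch uses hypothetical toy frames to show one way to guard against it, by reindexing the test columns to the training layout.

```python
import pandas as pd

# Hypothetical frames: the test set lacks one category seen in training.
train = pd.DataFrame({"HomePlanet": ["Earth", "Mars", "Europa"]})
test = pd.DataFrame({"HomePlanet": ["Earth", "Mars"]})

train_oh = pd.get_dummies(train["HomePlanet"], prefix="HomePlanet")
test_oh = pd.get_dummies(test["HomePlanet"], prefix="HomePlanet")

# Reindex the test columns to the training layout; categories missing from
# the test set become all-zero columns.
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)

print(list(test_oh.columns))
```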

In [None]:
%%time
# Numerical features: the original numeric columns plus the engineered Total_Billed
numerical_features = ['Age', 
                      'RoomService', 
                      'FoodCourt', 
                      'ShoppingMall', 
                      'Spa', 
                      'VRDeck', 
                      'Total_Billed'
                     ]

categorical_features = ['FirstName',
                        'FamilyName',
                        'CabinNum',
                        'TravelGroup',]


categorical_features_onehot = ['HomePlanet',
                               'CryoSleep',
                               'CabinDeck',
                               'CabinSide',
                               'Destination',
                               'VIP',]

target_feature = 'Transported'

In [None]:
%%time
from sklearn.preprocessing import LabelEncoder 

def encode_categorical(train_df, test_df, categ_feat = categorical_features):
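    '''
    Label-encode the given categorical columns, fitting each encoder on the
    combined train and test values so both frames share one mapping.
    '''
    encoder_dict = {}
    
    # Fit on train and test together so unseen labels cannot occur at transform time
    concat_data = pd.concat([train_df[categ_feat], test_df[categ_feat]])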
    
    for col in concat_data.columns:
        print('Encoding: ', col, '...')
        encoder = LabelEncoder()
        encoder.fit(concat_data[col])
        encoder_dict[col] = encoder

        train_df[col + '_Enc'] = encoder.transform(train_df[col])
        test_df[col + '_Enc'] = encoder.transform(test_df[col])
    
    train_df = train_df.drop(columns = categ_feat)
    test_df = test_df.drop(columns = categ_feat)

    return train_df, test_df

In [None]:
%%time
trn_data, tst_data = encode_categorical(trn_data, tst_data, categorical_features)

In [None]:
def one_hot(df, one_hot_categ):
    for col in one_hot_categ:
        tmp = pd.get_dummies(df[col], prefix = col)
        df = pd.concat([df, tmp], axis = 1)
    df = df.drop(columns = one_hot_categ)
    return df

In [None]:
trn_data = one_hot(trn_data, categorical_features_onehot) 
tst_data = one_hot(tst_data, categorical_features_onehot) 

In [None]:
trn_data.info(verbose=True)

# Simple CV Strategy 80/20 Split
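
A minimal sketch of an 80/20 hold-out split on hypothetical toy arrays (`X` and `y` are made up for illustration). The `stratify` argument is an optional extra that keeps the class ratio equal in both parts; the notebook's own split does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, balanced boolean target.
X = np.arange(200).reshape(100, 2)
y = np.array([True, False] * 50)

# test_size=0.2 keeps 80% of the rows for training; stratify=y preserves the
# class ratio in both parts; random_state makes the split repeatable.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_tr.shape, X_va.shape, y_va.mean())
```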

In [None]:
%%time
trn_data.columns

In [None]:
%%time
remove = ['PassengerId', 
          'Route', 
          'FirstName_Enc', 
          'CabinNum_Enc', 
          'Transported', 
          #'IsKid', 
          #'IsAdult', 
          #'IsOlder'
         ]
features = [feat for feat in trn_data.columns if feat not in remove]

In [None]:
%%time
features

In [None]:
%%time
from sklearn.model_selection import train_test_split
# Note: 0.01 holds out only 1% of the rows for validation; use 0.2 for an 80/20 split
test_size_pct = 0.01
X_train, X_valid, y_train, y_valid = train_test_split(trn_data[features], trn_data[target_feature], test_size = test_size_pct, random_state = 42)

# Training an ML Classifier

In [None]:
%%time
X_train.shape

In [None]:
%%time
from xgboost  import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

In [None]:
%%time
param = {'learning_rate': 0.05,
         'n_estimators': 1024,
         'n_jobs': -1,
         'random_state': 42,
         'objective': 'binary:logistic',
        }

In [None]:
%%time
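# Newer XGBoost releases deprecate eval_metric and early_stopping_rounds in fit(); there, pass them to the XGBClassifier constructor instead
cls = XGBClassifier(**param)
cls.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['logloss'], early_stopping_rounds = 128, verbose = False)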

In [None]:
%%time
from sklearn.metrics import accuracy_score

val_preds = cls.predict(X_valid[features])
val_preds = val_preds.astype('bool')
accuracy = accuracy_score(y_valid, val_preds)

In [None]:
%%time
print(f'Mean accuracy score: {accuracy}')

In [None]:
# Mean accuracy score: 0.7586206896551724
# Mean accuracy score: 0.7586206896551724
# Mean accuracy score: 0.7471264367816092
# Mean accuracy score: 0.7816091954022989
# Mean accuracy score: 0.7827586206896552
# Mean accuracy score: 0.7908045977011494 (One Hot Encode...)
# Mean accuracy score: 0.7862068965517242
# Mean accuracy score: 0.7954022988505747
# Mean accuracy score: 0.7701149425287356 (Best Model)

In [None]:
%%time
import matplotlib.pyplot as plt

def feature_importance(clf):
    importances = clf.feature_importances_
    i = np.argsort(importances)
    features = X_train.columns
    plt.title('Feature Importance')
    plt.barh(range(len(i)), importances[i], align='center')
    plt.yticks(range(len(i)), [features[x] for x in i])
    plt.xlabel('Scale')
    plt.show()

In [None]:
%%time
plt.figure(figsize = (10,7))
feature_importance(cls)

In [None]:
%%time
preds = cls.predict(tst_data[features])

In [None]:
%%time
# Cast to bool so the file holds True/False, matching the sample submission
sub['Transported'] = preds.astype('bool')
sub.to_csv('submission_simple_split_03272022.csv', index = False)

# Time for Optuna

In [None]:
%%time
import optuna

In [None]:
%%time
X_train, X_valid, y_train, y_valid = train_test_split(trn_data[features], trn_data[target_feature])

def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 8, 2048)
    max_depth = trial.suggest_int("max_depth", 2, 16)
    learning_rate = trial.suggest_float("learning_rate", 0.01, 0.2)
    subsample = trial.suggest_float("subsample", 0.5, 1)
    colsample_bytree = trial.suggest_float("colsample_bytree", 0.5, 1)
    reg_lambda = trial.suggest_float("reg_lambda", 1, 20)
    reg_alpha = trial.suggest_float("reg_alpha", 0, 20)
    gamma = trial.suggest_float("gamma", 0, 20)
    min_child_weight  = trial.suggest_int("min_child_weight", 0, 128)
    
    clf = XGBClassifier(n_estimators  = n_estimators,
                       learning_rate = learning_rate,
                       max_depth = max_depth,
                       subsample = subsample,
                       colsample_bytree = colsample_bytree,
                       reg_lambda = reg_lambda,
                       reg_alpha = reg_alpha,
                       gamma = gamma,
                       min_child_weight = min_child_weight,
                       random_state  = 69,
                       objective = 'binary:logistic',
                       tree_method = 'gpu_hist',
                      )
    
    clf.fit(X_train, y_train)
    
    valid_pred = clf.predict(X_valid)
    score = accuracy_score(y_valid, valid_pred)
    
    return score

In [None]:
%%time
#study = optuna.create_study(direction = "maximize")
#study.optimize(objective, n_trials = 100)

In [None]:
%%time
#parameters = study.best_params
#parameters

# Training an ML Classifier Using an N-Fold CV Loop
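
In this section each fold's model predicts class probabilities for the test set, and the per-fold results are averaged before the final label is chosen. The sketch shows that averaging step in isolation on hypothetical probabilities for three folds and two test rows; the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical predict_proba outputs from three folds for two test rows;
# columns are [P(False), P(True)].
fold_probs = [
    np.array([[0.8, 0.2], [0.4, 0.6]]),
    np.array([[0.6, 0.4], [0.3, 0.7]]),
    np.array([[0.7, 0.3], [0.55, 0.45]]),
]

# Average the probabilities across folds, then pick the likelier class.
mean_prob = np.mean(fold_probs, axis=0)
labels = np.argmax(mean_prob, axis=1).astype(bool)

print(mean_prob)
print(labels)
```

Averaging probabilities rather than hard labels lets a confident fold outweigh an uncertain one, as in the second row, where one fold leans the other way.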

In [None]:
%%time
import optuna
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [None]:
%%time
N_SPLITS = 20
folds = StratifiedKFold(n_splits = N_SPLITS, shuffle = True)

In [None]:
%%time
optuna_params = {'n_estimators': 474,
 'max_depth': 12,
 'learning_rate': 0.17092496820170439,
 'subsample': 0.8681931753955343,
 'colsample_bytree': 0.6753406152924646,
 'reg_lambda': 8.439432864212677,
 'reg_alpha': 1.6521594249189673,
 'gamma': 9.986385923158347,
 'min_child_weight': 11,
 'random_state': 69,
 'objective': 'binary:logistic',
 'tree_method':'gpu_hist',}

In [None]:
%%time

scores  = []
y_probs = []

for fold, (trn_id, val_id) in enumerate(folds.split(trn_data[features], trn_data[target_feature])):  
    X_train, y_train = trn_data[features].iloc[trn_id], trn_data[target_feature].iloc[trn_id]
    X_valid, y_valid = trn_data[features].iloc[val_id], trn_data[target_feature].iloc[val_id]
    
    #scaler = MinMaxScaler()
    #X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
    #X_valid[numerical_features] = scaler.transform(X_valid[numerical_features])
        
    model = XGBClassifier(**optuna_params)
    model.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['logloss'], early_stopping_rounds = 50, verbose = False)
    
    valid_pred = model.predict(X_valid)
    valid_score = accuracy_score(y_valid, valid_pred)
    
    print("Fold:", fold, "Accuracy:", valid_score)
    scores.append(valid_score)
    #tst_data[numerical_features] = scaler.transform(tst_data[numerical_features])
    y_probs.append(model.predict_proba(tst_data[features]))

In [None]:
%%time
print("Mean accuracy score:", np.array(scores).mean())

In [None]:
# Mean accuracy score: 0.8035192541977858
# Mean accuracy score: 0.8046655013507072
# Mean accuracy score: 0.8084726415594046
# Mean accuracy score: 0.8043222628317178
# Mean accuracy score: 0.8043269446979618
# Mean accuracy score: 0.8050191570881226
# Mean accuracy score: 0.7999563006515177
# Mean accuracy score: 0.8024831823719477
# Mean accuracy score: 0.7991493193495418
# Mean accuracy score: 0.8023674453096034 ... Best 

In [None]:
%%time
y_prob = sum(y_probs) / len(y_probs)
y_prob_results = np.argmax(y_prob, axis = 1)
y_prob_results = y_prob_results.astype('bool')

sub['Transported'] = y_prob_results
sub.to_csv('submission_twenty_fold_loop_03272022.csv', index = False)