# Titanic Survival Part 3: Training Classifiers for Accuracy

In Part 1 of this project I conduct exploratory data analysis (EDA) of the Titanic training data using R. This exploration can be found [here](http://rpubs.com/BigBangData/512981).

In Part 2 I continue the exploration using Python and build a couple of basic models. That part is not aimed at the goal of the competition; it is just an exploration of modeling in Python.

In Part 3 (this notebook) I create a pre-processing pipeline, train several models in Python using the scikit-learn library, and submit my predictions to the competition.


In [1]:
from datetime import datetime

# print today's date in YYYY-MM-DD format
print('Revised on: ' + datetime.now().strftime('%Y-%m-%d'))

Revised on: 2020-01-21


# Pre-Processing

In [2]:
# import modules
import pandas as pd
import numpy as np

# custom modules
import processing_pipeline as pp  
import modeling_functions as mf

# load training data
train_data = pd.read_csv("../input/train.csv")

# separate target from predictors in training set
survived_labels = train_data['Survived'].copy()
train_data_nolabel = train_data.drop('Survived', axis=1)

# get processed training data and labels
X = pp.process_train(train_data_nolabel)
y = survived_labels.to_numpy()

# Modeling

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')  # renamed 'seaborn-v0_8-whitegrid' in Matplotlib >= 3.6
import pickle

## Grid Search with Random Forests

I performed a grid search comparing the `gini` and `entropy` split criteria, bootstrapping vs. not, and a sweep of the number of estimators from 50 to 5000 (in steps of 250, so not very granular). The best parameters, shown below, reached a mean cross-validated accuracy of about 82%. I test this estimate by training a model with these parameters and comparing it with the accuracy on the real test set.

In [4]:
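# load the saved grid search, closing the file afterwards
with open('./RandomForest_GridSearch.sav', 'rb') as f:
    rf_grid_search_1 = pickle.load(f)

# best_score_ is the mean cross-validated accuracy of the best parameter set
rf_grid_search_1.best_params_, f'Accuracy: {rf_grid_search_1.best_score_:.2%}'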

({'bootstrap': True, 'criterion': 'entropy', 'n_estimators': 300},
 'Accuracy: 81.93%')
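
The pickled search object above was produced elsewhere, so its setup is not shown here. A minimal sketch of how such a search might be defined follows, assuming `GridSearchCV` with 5-fold cross-validation and accuracy scoring (the fold count and the exact `n_estimators` endpoints are assumptions, not taken from the original run). To keep it quick, the fitted search uses synthetic data from `make_classification` and only two `n_estimators` values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# full grid as described in the text (assumed endpoints: 50, 300, ..., 4800)
full_grid = {
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
    'n_estimators': list(range(50, 5001, 250)),
}
n_candidates = len(ParameterGrid(full_grid))
print('Candidates in full grid:', n_candidates)

# reduced grid and synthetic stand-in data, only so the sketch runs quickly
small_grid = dict(full_grid, n_estimators=[50, 300])
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

demo_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid=small_grid,
                           scoring='accuracy',
                           cv=5)
demo_search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the best candidate
print(demo_search.best_params_, f'Accuracy: {demo_search.best_score_:.2%}')
```

With the full grid, every candidate is refit once per fold, which is why a sweep up to several thousand trees is worth running once and pickling.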

## Training Best RF model plus Feature Selection

In [5]:
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# split training data into 20% test and 80% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# train best RF classifier
forest_clf = RandomForestClassifier(bootstrap=True, 
                                    criterion='entropy', 
                                    n_estimators=300, 
                                    random_state=42)
forest_clf.fit(X_train, y_train)

# select subset with most important features (threshold defined with elbow plot)
sfm = SelectFromModel(forest_clf, threshold=0.0275)
sfm.fit(X_train, y_train)

# create subsets with most important features
X_train_imp = sfm.transform(X_train)
X_test_imp = sfm.transform(X_test)

# train a new model on this subset
forest_clf_imp =  RandomForestClassifier(bootstrap=True, 
                                        criterion='entropy', 
                                        n_estimators=300, 
                                        random_state=42)
forest_clf_imp.fit(X_train_imp, y_train)

# predict on the hold-out set using the model trained on all features
y_pred = forest_clf.predict(X_test)
full_accuracy = f'{accuracy_score(y_test, y_pred):.3%}'

# predict on the hold-out set using the model trained on the important subset
y_pred_imp = forest_clf_imp.predict(X_test_imp)
important_accuracy = f'{accuracy_score(y_test, y_pred_imp):.3%}'

print('Full data accuracy: ' + full_accuracy)
print('Important subset accuracy: ' + important_accuracy)

Full data accuracy: 82.682%
Important subset accuracy: 81.006%
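
The `threshold=0.0275` used above was read off an elbow plot that is not shown. As a rough sketch of how one might inspect such a cutoff: sort the forest's `feature_importances_` in descending order (plotting these values gives the elbow plot) and count how many clear the threshold. The data here is synthetic (`make_classification`), so the importances and the number of retained features are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# synthetic stand-in for the processed Titanic features (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     n_informative=4, random_state=42)

threshold = 0.0275  # cutoff used in the notebook
demo_sfm = SelectFromModel(
    RandomForestClassifier(criterion='entropy', n_estimators=100, random_state=42),
    threshold=threshold)
demo_sfm.fit(X_demo, y_demo)

# importances of the forest fitted inside SelectFromModel, largest first
importances = demo_sfm.estimator_.feature_importances_
sorted_importances = np.sort(importances)[::-1]
for rank, value in enumerate(sorted_importances, start=1):
    marker = 'keep' if value >= threshold else 'drop'
    print(f'{rank:2d}  {value:.4f}  {marker}')

# SelectFromModel keeps features whose importance is at least the threshold
n_kept = int(demo_sfm.get_support().sum())
print('Features kept:', n_kept)
```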


The subset with the most important features appears to slightly underperform the full feature set. However, the full model may be overfitting, since we haven't performed any regularization. It will be interesting to compare the actual scores on the test set by submitting predictions from both models.
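
To put the gap in perspective, the two accuracies can be converted back into counts of correctly classified passengers. This sketch assumes the hold-out split has 179 rows (20% of the 891 training rows, rounded up), which is consistent with the reported figures; the standard error uses the usual binomial approximation.

```python
import math

n_holdout = 179          # assumed size of the 20% hold-out split
acc_full = 0.82682       # reported accuracy, all features
acc_imp = 0.81006        # reported accuracy, important subset

# convert accuracies back into numbers of correct predictions
correct_full = round(acc_full * n_holdout)
correct_imp = round(acc_imp * n_holdout)
print('Correct (full):', correct_full)
print('Correct (subset):', correct_imp)
print('Difference in passengers:', correct_full - correct_imp)

# binomial standard error of an accuracy estimated on n_holdout rows
std_err = math.sqrt(acc_full * (1 - acc_full) / n_holdout)
print(f'Approx. standard error: {std_err:.1%}')
```

Under these assumptions the two models differ by only three passengers, which is well within one standard error (roughly 2.8 percentage points), so a single split cannot really separate them.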

In [9]:
# Load and process test dataset - avoid warning for the single NaN before re-imputing 'Age'
import warnings
warnings.filterwarnings('ignore')

test_data = pd.read_csv("../input/test.csv")
test_PassengerId = test_data['PassengerId'].copy()

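# process_test differs from process_train by only one line (see the module code)
# note: X_test now holds the competition test set, replacing the validation split above
X_test = pp.process_test(test_data)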

In [10]:
X_test_imp = sfm.transform(X_test)

# get predictions using our two RF classifiers
y_pred_full = forest_clf.predict(X_test)
y_pred_imp = forest_clf_imp.predict(X_test_imp)

In [11]:
# full data CSV
dict_full = { 'PassengerId': test_PassengerId, 'Survived': pd.Series(y_pred_full) } 
rf_full = pd.DataFrame(dict_full) 
rf_full.to_csv('./rf_full.csv', index=False)

# important subset CSV
dict_imp = { 'PassengerId': test_PassengerId, 'Survived': pd.Series(y_pred_imp) } 
rf_imp = pd.DataFrame(dict_imp) 
rf_imp.to_csv('./rf_imp.csv', index=False)

## Random Forest Submissions

My second submission, the full-data random forest model that scored 82.682% in validation, scored 77.033% on the real test set and placed 10,210th. The third submission, the important-subset model, got exactly the same score. This is only slightly better than the `gender submission` "model", which simply predicts that all females survive and is the example the competition uses to show how to format and submit predictions.
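
For reference, the gender baseline mentioned above can be sketched in a few lines. The passenger rows below are made up for illustration; only the `PassengerId`, `Sex`, and `Survived` column names follow the competition's data format.

```python
import pandas as pd

# hypothetical rows standing in for test.csv
demo_test = pd.DataFrame({
    'PassengerId': [892, 893, 894, 895],
    'Sex': ['male', 'female', 'male', 'female'],
})

# baseline rule: predict survival for females, death for males
baseline = pd.DataFrame({
    'PassengerId': demo_test['PassengerId'],
    'Survived': (demo_test['Sex'] == 'female').astype(int),
})
print(baseline.to_csv(index=False))
```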

## SGD submission

## XGBoost submission

### XGBoost

In [87]:
from xgboost import XGBClassifier

# split training data into 20% test and 80% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# default model
xgb_def = XGBClassifier()
xgb_def.fit(X_train, y_train)
y_pred_def = xgb_def.predict(X_test)
accuracy_def = accuracy_score(y_test, y_pred_def)


# best model
xgb_best = XGBClassifier(objective='binary:logistic', 
                         colsample_bytree=0.8, 
                         learning_rate=0.15,
                         n_estimators=150,
                         random_state=42)
xgb_best.fit(X_train, y_train)
y_pred_best = xgb_best.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)

# a third model
xgb_third = XGBClassifier(objective='binary:logistic', 
                          colsample_bytree=0.8, 
                          learning_rate=0.1,
                          max_depth=3,
                          n_estimators=250,
                          reg_alpha=10,
                          random_state=42)
xgb_third.fit(X_train, y_train)
y_pred_third = xgb_third.predict(X_test)
accuracy_third = accuracy_score(y_test, y_pred_third)

print("Accuracy with defaults: %.2f%%" % (accuracy_def * 100.0))
print("Accuracy w/ best model: %.2f%%" % (accuracy_best * 100.0))
print("Accuracy w/ third model: %.2f%%" % (accuracy_third * 100.0))

Accuracy with defaults: 83.24%
Accuracy w/ best model: 83.80%
Accuracy w/ third model: 82.12%


## Fitting Entire Train Set

## Submission

I just submitted my first prediction on Kaggle and achieved the stunning result of being the **14,155th** entry on the leaderboard. (The leaderboard is quite heavily overfitted for this baby competition.) The XGBoost classifier that got 84.41% accuracy during validation got 73.68% accuracy on the real test set, so it lost more than 10 percentage points when generalizing; it was probably overfitting.

In real life I would have to stop now, since you only get one chance. But since this is Kaggle, I will submit the SGD classifier and RF classifier as well to see whether simpler models generalize better.
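
A gap like the one described above is easier to spot before submitting if training and validation accuracy are compared across several folds rather than on one split. The sketch below assumes scikit-learn's `cross_validate` with 5 folds, and it uses `GradientBoostingClassifier` on synthetic data as a stand-in for `XGBClassifier` and the Titanic features, so its numbers say nothing about the actual models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# synthetic, noisy stand-in data (assumption):
# flip_y randomly reassigns the class of 10% of the rows
X_demo, y_demo = make_classification(n_samples=400, n_features=12,
                                     n_informative=4, flip_y=0.1,
                                     random_state=42)

# stand-in for XGBClassifier, with settings similar to the "best" model above
demo_clf = GradientBoostingClassifier(learning_rate=0.15, n_estimators=150,
                                      random_state=42)

# score each fold on both its training part and its held-out part
scores = cross_validate(demo_clf, X_demo, y_demo, cv=5,
                        scoring='accuracy', return_train_score=True)

train_acc = scores['train_score'].mean()
valid_acc = scores['test_score'].mean()
print(f'Mean training accuracy:   {train_acc:.2%}')
print(f'Mean validation accuracy: {valid_acc:.2%}')
print(f'Gap: {(train_acc - valid_acc) * 100:.2f} percentage points')
```

A large gap between the two means hints at overfitting. Even so, cross-validation on the training data cannot reveal differences between the training and test distributions.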