# Titanic Survival Classification - First Submission (Part 5)

Now that we have our basic ensemble model, it's time to improve its performance and get a final model ready that is worthy of submission.

First, let's import everything we need, including all of the previously built functions.

To give credit, some of the ideas for the data transformations were taken from:

https://www.kaggle.com/sinakhorami/titanic-best-working-classifier?scriptVersionId=566580/notebook <br>
https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python?scriptVersionId=2368078

In [1]:
#First importing some relevant packages
import numpy as np
import pandas as pd

#Import Tensorflow
import tensorflow as tf

#Import Keras
from keras import layers
from keras.layers import Input, Dense, Activation, BatchNormalization, Dropout
from keras.layers.advanced_activations import LeakyReLU, PReLU
from keras.models import Model
from keras import regularizers

#Import mathematical functions
from random import *
import math

#Get regular expression package
import re

#Import  Scikit learn framework
import sklearn as sk
from sklearn import svm
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)

Using TensorFlow backend.


In [2]:
#Import the functions built in previous parts
from Titanic_Import import *

full_set = pd.read_csv('D:/Datasets/Titanic/train.csv')

So let's go right back to the start, have another look at our raw data, and revisit the transformations we built way back in Part 1.

In [3]:
full_set.head(10)

Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C


## Data Experimentation

For this notebook, most of the experimentation was done in an ad-hoc fashion and so is not shown explicitly. However, the results can be verified by modifying the functions below.

The architectures were derived by the random brute-force method described originally.

Unless stated otherwise, the experiments used numeric fields left raw (with medians inserted in place of NaN); I was experimenting with categorical variables encoded as numbers.

### Results Table

Test | Architecture | Train Accuracy | CV Accuracy
--- | --- | --- | ---
Original Benchmark | [22, 22, 11, 10, 7, 5, 3, 3, 3] | 0.8394437427014376 | 0.85 
Original Benchmark (2) | [13, 13, 11, 8, 8, 4] | 0.8419721877077587 | 0.86 
No OH, categorical deck, no norm | [6, 4, 3, 3, 3, 3] | 0.7572692797699409 | 0.83
Categorical deck, isalone OH, hascabin OH, no norm | [9, 8, 6, 6, 5, 4, 4] |  0.7484197225492311 | 0.83
Categorical deck, isalone OH, hascabin OH, normalized fare/age | [11, 7, 5] | 0.804045511482639 | 0.85
All tested variables plus name length (no norm) | [7, 6, 5, 5] | 0.7724399495064534 | 0.83
All tested variables plus name length (with norm) | [7, 6, 5, 5] | 0.8128950690047629 | 0.84
All tested variables plus name length (with norm) and title | [18, 10, 8, 4, 3, 3, 2] | 0.8343868523873813 | 0.89
Original Benchmark plus name length and title | [14, 12, 6, 6, 6] | 0.8381795198968629 | 0.89



# Key Learnings

## Experimentation with Data

* Initial intuitions about data normalization were accurate: normalization does improve performance.
* Using one-hot encodings does improve performance over leaving categorical fields as numeric codes.
* Adding binary features, such as whether a person is alone or whether they have a cabin, does not improve performance over adding the full data of a cleansed field.
* Our cross-validation set is very slightly biased in favour of survivors compared with our training data.
* As discovered from experimentation in an earlier notebook, L2 regularization with $\lambda = 0.01$ seems to be about the best value found with a naive search.
* Adding name length and title may give slight benefits over not having them, but random initialization has a bigger impact.
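
The first two points can be illustrated with a minimal sketch. The toy values below are hypothetical and only show the shape of the two transforms (mean normalization by range, and one-hot encoding with `pd.get_dummies`); they are not taken from the experiments.

```python
import pandas as pd

# Toy data with hypothetical values, only to illustrate the two transforms.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Embarked": ["S", "C", "S", "Q"],
})

# Mean normalization: subtract the mean, then divide by the range (max - min).
fare_range = toy["Fare"].max() - toy["Fare"].min()
toy["Norm_fare"] = (toy["Fare"] - toy["Fare"].mean()) / fare_range

# One-hot encoding: one 0/1 column per category instead of a single integer code.
emb = pd.get_dummies(toy["Embarked"], prefix="Emb")

features = pd.concat([toy[["Norm_fare"]], emb], axis=1)
print(features)
```

After normalization the column has zero mean and a total spread of exactly 1, which keeps it on a similar scale to the 0/1 one-hot columns.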

## Online research

* Initial intuitions about the cabin feature were correct; see https://www.kaggle.com/ccastleberry/titanic-cabin-features/notebook for another notebook which independently had the same intuitions.
* While I was initially disappointed with my scores after seeing that the best kernels have 100% accuracy, with 80%+ accuracy I'm actually doing pretty well - https://www.kaggle.com/pliptor/how-am-i-doing-with-my-score
* The Kaggle leaderboard is very skewed: the top submissions with 100% accuracy are all likely using the actual data from the event and not an ML algorithm.
* Below those, the submissions with 88-99% accuracy are all likely some genetically programmed algorithm that is not human-readable (judging by the publicly available models with that accuracy).
* Most other kernels use some form of decision tree with ensemble models and achieve about 80-85% accuracy on out-of-sample data. This therefore seems to be a good target to aim for.

# Final Model

## Data Pre-processing

In [4]:
#Get Title function from Sina
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""


#Creating our Training Set
def Cleanse_Training_Data_v2(df_in):
    #Work on a copy to avoid corrupting the original dataframe
    test_set = df_in.copy()
    
    #Fill missing ages with the median age of the passenger's class
    test_set['Age'] = test_set.groupby(['Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
    
    #Name Length from Anisotropic
    test_set['Name_length'] = test_set['Name'].apply(len)
    
    #Normalize numerical fields
    age_mean = test_set['Age'].mean()
    fare_mean = test_set['Fare'].mean()
    name_len_mean = test_set['Name_length'].mean()
    
    age_range = test_set['Age'].max() - test_set['Age'].min()
    fare_range = test_set['Fare'].max() - test_set['Fare'].min()
    name_len_range = test_set['Name_length'].max() - test_set['Name_length'].min()
    
    test_set['Norm_age'] = (test_set['Age'] - age_mean) / age_range
    test_set['Norm_fare'] = (test_set['Fare'] - fare_mean) / fare_range
    #test_set['Norm_name'] = (test_set['Name_length'] - name_len_mean) / name_len_range
    
    
    
    
    
    #Getting our Deck
    test_set['canc'] = test_set['Cabin'].str.replace(' ', '')
    test_set['Deckstr'] = test_set['canc'].str[0]
    test_set['Deckstr'] = test_set['Deckstr'].fillna(value = 'X')
    test_set['Deckstr'] = test_set['Deckstr'].map( {'A': 1, 'B': 2, 'C' : 3,'D' : 4, 'E' : 5,'F' : 6,'G' : 7, 'T' : 8 ,'X' : 0} ).astype(int)
    
    #Remap Gender and create number of family members present field
    test_set['Sex'] = test_set['Sex'].map( {'female': -1, 'male': 1} ).astype(int)
    test_set['Company'] = test_set['SibSp'] + test_set['Parch']
    
    #Alone classifier taken from Sina (Company counts relatives aboard, so 0 means alone)
    test_set['IsAlone'] = 0
    test_set.loc[test_set['Company'] == 0, 'IsAlone'] = 1
    
    #Has_Cabin flag (deck code 0 is the 'X' placeholder for a missing cabin)
    test_set['Has_Cabin'] = 1
    test_set.loc[test_set['Deckstr'] == 0, 'Has_Cabin'] = 0
    
    #Applying Title code from Sina
    test_set['Title'] = test_set['Name'].apply(get_title)
    test_set['Title'] = test_set['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    test_set['Title'] = test_set['Title'].replace('Mlle', 'Miss')
    test_set['Title'] = test_set['Title'].replace('Ms', 'Miss')
    test_set['Title'] = test_set['Title'].replace('Mme', 'Mrs')
    
    

    
    
    #Manually populate the 2 missing Embarked values with 'C' (chosen by comparing their fares with the average fare per port)
    values = {'Embarked': 'C'}
    test_set = test_set.fillna(value=values)
    test_set['Embarked'] = test_set['Embarked'].map( {'S': 0, 'C': 1, 'Q' : 2} ).astype(int)
    
    
    
    emb_set = pd.get_dummies(test_set.Embarked, prefix='Emb', dummy_na = False)
    title_set = pd.get_dummies(test_set.Title, prefix='ti', dummy_na = True)
    deck_set = pd.get_dummies(test_set.Deckstr, prefix='de', dummy_na = False)
     


    
    oh_set = pd.concat([test_set,  
                        emb_set, 
                        title_set, 
                        deck_set], axis=1)
    
    #Create output fully numeric dataframe
    out_set = oh_set.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 
                             'Cabin', 'canc', 'Embarked', 'Title', 'IsAlone', 'Has_Cabin', 
                           'Deckstr', 'Age', 'Fare', 'Name_length'], axis=1)

    
    #Segmenting data
    X_Train_df = out_set.head(791)
    X_CV_df = out_set.tail(100)

    #Getting our Y vectors
    Y_Train = X_Train_df['Survived'].values
    Y_CV = X_CV_df['Survived'].values

    #Dropping columns we don't want to feed into our ML algorithm
    X_Train_df = X_Train_df.drop(['Survived'], axis=1)
    X_CV_df = X_CV_df.drop(['Survived'], axis=1)

    #Getting our X vectors
    X_Train = X_Train_df.values
    X_CV = X_CV_df.values
    
    return X_Train, X_CV, Y_Train, Y_CV, out_set
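
To make the title step concrete, here is a small self-contained sketch of the extraction and grouping applied to a few names from the data shown earlier. The helper `group_title` and the two lookup constants are hypothetical names introduced only for this illustration; the notebook itself performs the grouping with `replace` calls.

```python
import re

def get_title(name):
    # Same pattern as above: a space, a run of letters, then a literal full stop.
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else ""

# Hypothetical lookup tables mirroring the replace() calls in the cleansing function.
RARE = {"Lady", "Countess", "Capt", "Col", "Don", "Dr",
        "Major", "Rev", "Sir", "Jonkheer", "Dona"}
ALIASES = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}

def group_title(title):
    # Infrequent titles collapse to 'Rare'; spelling variants map to a common title.
    if title in RARE:
        return "Rare"
    return ALIASES.get(title, title)

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
]
titles = [group_title(get_title(n)) for n in names]
print(titles)
```

Grouping the infrequent titles keeps the number of one-hot columns small and avoids columns that are almost always zero.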

## Train Model

In [5]:
X_Train, X_CV, Y_Train, Y_CV, test = Cleanse_Training_Data_v2(full_set)

We use the brute-force method to find the best model for now. Our old algorithm has been slightly modified to return the trained model itself, so we don't have to deal with the variance that comes from re-training with a new random initialization.

In [6]:
def Find_Architecture_v2(X_Train_2, Y_Train_2, X_CV_2, Y_CV_2, max_layers = 10, num_iters = 32): 
    best_perf = 0.0
    #Iterate through num_iters iterations
    for i in range(num_iters):
        #Reset hyperparameters and initialize nn depth
        layers = []
        num_layers = randint(3, max_layers)
        prev_layer = X_Train_2.shape[1]
        
        for j in range(num_layers):
            #Randomly generate number of units per layer
            min_size = math.ceil(prev_layer / 2.0)
            lay_size = randint(min_size, prev_layer)
            layers.append(lay_size)
            prev_layer = lay_size
            
        #Build and test model
        test_model = NN_model((X_Train_2.shape[1], ), layers, None, regularizers.l2(0.01))
        test_model.compile(optimizer = "Adam", loss = "binary_crossentropy", metrics = ["accuracy"])
        test_model.fit(x = X_Train_2, y = Y_Train_2, epochs = 32, verbose = 0)
        train_pred = test_model.evaluate(x = X_Train_2, y = Y_Train_2)
        cv_pred = test_model.evaluate(x = X_CV_2, y = Y_CV_2)
        
        #Evaluate performance by weighted sum of accuracies
        perform = train_pred[1]*0.6 + cv_pred[1]
        
        if perform > best_perf :
            best_perf = perform
            best_arch = layers
            best_train = train_pred
            best_cv = cv_pred
            best_model = test_model
        
    return best_arch, best_train, best_cv, best_model
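
The sampling rule inside the loop can be tried on its own, without Keras. The sketch below isolates it in a hypothetical helper `sample_architecture`; the input width of 23 is an assumed value for illustration only.

```python
import math
import random

def sample_architecture(n_inputs, max_layers=10, rng=random):
    # Pick a depth, then shrink each layer to between half and all of the previous width.
    num_layers = rng.randint(3, max_layers)
    sizes = []
    prev = n_inputs
    for _ in range(num_layers):
        size = rng.randint(math.ceil(prev / 2.0), prev)
        sizes.append(size)
        prev = size
    return sizes

# Seeded generator so the sampled architectures are repeatable.
rng = random.Random(0)
for _ in range(3):
    print(sample_architecture(23, rng=rng))
```

Because each layer is at most as wide as the one before it, every sampled network narrows (or stays level) towards the output.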

Warning: the step below can take a while to run. Reduce the num_iters argument to shorten the run time.

In [7]:
nn_architecture, train_perf, cv_perf, best_model = Find_Architecture_v2(X_Train, Y_Train, X_CV, Y_CV, 10, num_iters =  25)



In [8]:
print(nn_architecture)
print()
print ("Train Loss = " + str(train_perf[0]))
print ("Train Accuracy = " + str(train_perf[1]))
print()
print ("CV Loss = " + str(cv_perf[0]))
print ("CV Accuracy = " + str(cv_perf[1]))

[21, 14, 8, 5, 5]

Train Loss = 0.46168896597345016
Train Accuracy = 0.8419721877077587

CV Loss = 0.4022293281555176
CV Accuracy = 0.88
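
As a quick check, the selection score used by the search can be recomputed from the two printed accuracies. It weights training accuracy by 0.6 and CV accuracy by 1.0, so the largest possible score is 1.6.

```python
# Accuracies copied from the output above.
train_acc = 0.8419721877077587
cv_acc = 0.88

# Same weighting as in Find_Architecture_v2: 0.6 * train accuracy + CV accuracy.
score = 0.6 * train_acc + cv_acc
print(round(score, 4))  # 1.3852
```

Since CV accuracy carries the larger weight and the CV set has only 100 rows, a lucky initialization on those rows can easily win the search.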


Seems good so far, so let's have a look at the confusion matrix.

First, however, we need to normalize the output predictions, rounding the model's probabilities to 0/1 class labels with a simple function.

In [9]:
train_pred = best_model.predict(x = X_Train)
cv_pred = best_model.predict(x = X_CV)

In [10]:
def normalize_predictions(y_hat):
    y_out = y_hat.reshape((y_hat.shape[0],))
    y_out = np.around(y_out)
    
    return y_out
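
A short usage sketch with hypothetical probabilities shows what the function does to a Keras-style output of shape `(n, 1)`:

```python
import numpy as np

def normalize_predictions(y_hat):
    # Flatten (n, 1) -> (n,) and round each probability to the nearest class label.
    y_out = y_hat.reshape((y_hat.shape[0],))
    return np.around(y_out)

# Hypothetical sigmoid outputs, shaped like the result of model.predict.
probs = np.array([[0.08], [0.71], [0.49], [0.93]])
labels = normalize_predictions(probs)
print(labels)  # [0. 1. 0. 1.]
```

Note that `np.around` rounds halves to the nearest even value, so a prediction of exactly 0.5 becomes 0.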

In [11]:
train_hat = normalize_predictions(train_pred)
cv_hat = normalize_predictions(cv_pred)

In [12]:
acc1, score1, conf1 = Calc_Accuracy(Y_Train, train_hat)

print("Accuracy = ", acc1)
print("F1 Score = ", score1)
print("")
print("Confusion Matrix")
conf1[["Labels", "Actual True", "Actual False"]]

Accuracy =  84.19721871049305
F1 Score =  0.7920133111480866

Confusion Matrix


Index | Labels | Actual True | Actual False
--- | --- | --- | ---
0 | Pred True | 238.0 | 57.0
1 | Pred False | 68.0 | 428.0


In [13]:
acc2, score2, conf2 = Calc_Accuracy(Y_CV, cv_hat)

print("Accuracy = ", acc2)
print("F1 Score = ", score2)
print("")
print("Confusion Matrix")
conf2[["Labels", "Actual True", "Actual False"]]

Accuracy =  88.0
F1 Score =  0.8333333333333334

Confusion Matrix


Index | Labels | Actual True | Actual False
--- | --- | --- | ---
0 | Pred True | 30.0 | 6.0
1 | Pred False | 6.0 | 58.0


So our classifier's errors are not massively skewed towards either false positives or false negatives, and we are thus about ready to generate our submission file.
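
The reported accuracy and F1 values can be re-derived from the four cells of each confusion matrix. This is only a sketch of the standard formulas; `metrics_from_confusion` is a hypothetical helper and not necessarily how `Calc_Accuracy` is implemented.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    # Standard definitions computed from the four confusion-matrix cells.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Cells read from the tables above: the "Pred True" row holds TP and FP,
# the "Pred False" row holds FN and TN.
train = metrics_from_confusion(tp=238, fp=57, fn=68, tn=428)
cv = metrics_from_confusion(tp=30, fp=6, fn=6, tn=58)

print("Train accuracy, precision, recall, F1:", train)
print("CV accuracy, precision, recall, F1:", cv)
```

On the CV set the false positives and false negatives are equal (6 each), so precision and recall coincide; on the training set the model misses slightly more survivors than it wrongly predicts.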

## Cleanse Submission File Data

Having applied the transformations to our training data, let's now apply similar transforms to our submission data. However, we need to keep in mind that we manually populated the Embarked column in the training data: we looked at the rows with NaN values and at the average fares of people who boarded at each of the 3 locations, and chose the embarkation location based on the fare.

In [24]:
sub_set = pd.read_csv('D:/Datasets/Titanic/test.csv')

sub_set.head(10)

Index | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S
5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.225 | NaN | S
6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q
7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0 | NaN | S
8 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C
9 | 901 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.15 | NaN | S


In [15]:
sub_set.count()

PassengerId    418
Pclass         418
Name           418
Sex            418
Age            332
SibSp          418
Parch          418
Ticket         418
Fare           417
Cabin           91
Embarked       418
dtype: int64

Luckily we don't have any missing Embarked values, so we do not need to account for this.

We do, however, have missing ages and one missing fare. The missing ages are fine, as our original algorithm already handles them.

So let's have a look at our missing fare.

In [16]:
sub_set[sub_set['Fare'].isnull()]

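Index | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
152 | 1044 | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | 3701 | NaN | NaN | S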


So the best method of accounting for this will likely be to insert the median fare for his class.

In [17]:
testy = sub_set

In [18]:
testy['Fare'] = testy.groupby(['Pclass'])['Fare'].transform(lambda x: x.fillna(x.median()))

In [19]:
testy[testy['PassengerId'] == 1044]

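Index | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
152 | 1044 | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | 3701 | 7.8958 | NaN | S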
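
The class-wise median fill can be seen in isolation on a tiny frame. The values below are hypothetical; the sketch uses `groupby(...).transform`, which returns a result aligned to the original row index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one third-class fare is missing.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 60.0, 7.0, np.nan, 9.0],
})

# Fill each missing fare with the median fare of the same passenger class.
toy["Fare"] = toy.groupby("Pclass")["Fare"].transform(lambda s: s.fillna(s.median()))
print(toy)
```

The missing third-class fare is filled with 8.0, the median of the other two third-class fares, while the first-class rows are left untouched.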


That seemed to work fine, so let's create our submission file, first by cleansing our data.

A cursory analysis was also done to ensure our features line up correctly (so that our one-hot encodings match between the two data sets). We have all the same titles and embarked locations; however, our training data contains 1 extra deck. Fortunately, this extra deck is T (marked as 8 in our original data), which is the last deck column.

To account for this easily, we call the get_dummies function with the dummy_na parameter set to True. This adds an extra column which, because no deck values are missing at that point, will always be 0, and which appears at the same column index as the T column from our original data.

Other than that our columns will line up perfectly.
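
An alternative way to guarantee matching columns, which does not depend on the missing category being the last one, is to reindex the submission dummies against the training columns. This is not what the notebook does; it is a minimal sketch on hypothetical deck codes.

```python
import pandas as pd

# Hypothetical deck codes: code 8 (deck T) appears only in the training data.
train_deck = pd.Series([0, 3, 8, 5])
test_deck = pd.Series([0, 3, 5, 5])

train_oh = pd.get_dummies(train_deck, prefix="de")
test_oh = pd.get_dummies(test_deck, prefix="de")

# Align to the training columns; categories absent from the test data become all-zero columns.
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0).astype(int)

print(list(test_oh.columns))
print(test_oh)
```

With this approach the column order always follows the training data, so an unseen-in-test category anywhere in the list is handled the same way.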

In [25]:
#Creating our Submission Set
def Cleanse_Submission_Data_v2(df_in):
    #Work on a copy to avoid corrupting the original dataframe
    test_set = df_in.copy()
    
    #Fill missing ages and fares with the median for the passenger's class
    test_set['Age'] = test_set.groupby(['Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
    test_set['Fare'] = test_set.groupby(['Pclass'])['Fare'].transform(lambda x: x.fillna(x.median()))
    
    #Name Length from Anisotropic
    test_set['Name_length'] = test_set['Name'].apply(len)
    
    #Normalize numerical fields
    age_mean = test_set['Age'].mean()
    fare_mean = test_set['Fare'].mean()
    name_len_mean = test_set['Name_length'].mean()
    
    age_range = test_set['Age'].max() - test_set['Age'].min()
    fare_range = test_set['Fare'].max() - test_set['Fare'].min()
    name_len_range = test_set['Name_length'].max() - test_set['Name_length'].min()
    
    test_set['Norm_age'] = (test_set['Age'] - age_mean) / age_range
    test_set['Norm_fare'] = (test_set['Fare'] - fare_mean) / fare_range
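    #Kept disabled to match Cleanse_Training_Data_v2, where Norm_name is also commented out
    #(the model expects the same columns for training and submission data)
    #test_set['Norm_name'] = (test_set['Name_length'] - name_len_mean) / name_len_range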
    
    
    
    
    
    #Getting our Deck
    test_set['canc'] = test_set['Cabin'].str.replace(' ', '')
    test_set['Deckstr'] = test_set['canc'].str[0]
    test_set['Deckstr'] = test_set['Deckstr'].fillna(value = 'X')
    test_set['Deckstr'] = test_set['Deckstr'].map( {'A': 1, 'B': 2, 'C' : 3,'D' : 4, 'E' : 5,'F' : 6,'G' : 7, 'T' : 8 ,'X' : 0} ).astype(int)
    
    #Remap Gender and create number of family members present field
    test_set['Sex'] = test_set['Sex'].map( {'female': -1, 'male': 1} ).astype(int)
    test_set['Company'] = test_set['SibSp'] + test_set['Parch']
    
    #Alone classifier taken from Sina (Company counts relatives aboard, so 0 means alone)
    test_set['IsAlone'] = 0
    test_set.loc[test_set['Company'] == 0, 'IsAlone'] = 1
    
    #Has_Cabin flag (deck code 0 is the 'X' placeholder for a missing cabin)
    test_set['Has_Cabin'] = 1
    test_set.loc[test_set['Deckstr'] == 0, 'Has_Cabin'] = 0
    
    #Applying Title code from Sina
    test_set['Title'] = test_set['Name'].apply(get_title)
    test_set['Title'] = test_set['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    test_set['Title'] = test_set['Title'].replace('Mlle', 'Miss')
    test_set['Title'] = test_set['Title'].replace('Ms', 'Miss')
    test_set['Title'] = test_set['Title'].replace('Mme', 'Mrs')
    
    

    
    
    #Fill any missing Embarked values with 'C' (the submission data has none, so this changes nothing here)
    values = {'Embarked': 'C'}
    test_set = test_set.fillna(value=values)
    test_set['Embarked'] = test_set['Embarked'].map( {'S': 0, 'C': 1, 'Q' : 2} ).astype(int)
    
    
    
    emb_set = pd.get_dummies(test_set.Embarked, prefix='Emb', dummy_na = False)
    title_set = pd.get_dummies(test_set.Title, prefix='ti', dummy_na = True)
    deck_set = pd.get_dummies(test_set.Deckstr, prefix='de', dummy_na = True)

    
    oh_set = pd.concat([test_set,  
                        emb_set, 
                        title_set, 
                        deck_set], axis=1)
    
    #Create output fully numeric dataframe
    out_set = oh_set.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 
                             'Cabin', 'canc', 'Embarked', 'Title', 'IsAlone', 'Has_Cabin', 
                           'Deckstr', 'Age', 'Fare', 'Name_length'], axis=1)

    X_Test = out_set.values
    return X_Test

In [26]:
X_Test = Cleanse_Submission_Data_v2(sub_set)

In [27]:
sub_pred = best_model.predict(X_Test)
test_hat = normalize_predictions(sub_pred)

In [29]:
sub_df = Create_output_frame(sub_set, test_hat)

In [30]:
sub_df.head(10)

Index | PassengerId | Survived
--- | --- | ---
0 | 892 | 0.0
1 | 893 | 1.0
2 | 894 | 0.0
3 | 895 | 0.0
4 | 896 | 1.0
5 | 897 | 0.0
6 | 898 | 1.0
7 | 899 | 0.0
8 | 900 | 1.0
9 | 901 | 0.0


Now to output our data and submit.

In [31]:
sub_df.to_csv("Predictions.csv", index=False, float_format='%1d')

Well, it worked, but the actual submission was pretty terrible at 78% accuracy.

From running various versions of the above functions multiple times, it turns out that the random initialization played a pretty large part in the results. Overall, this approach was prone to overfitting to the cross-validation data, driven by the random initialization more than anything else.

So more work is needed to climb a bit.