# Titanic Survival Classification - Residual Layers and LightGBM (Part 10)

The primary focus of this notebook is to build and test residual blocks with skip connections for training very deep dense networks, in the same way they are used with convolutional blocks, potentially with the same sparsity constraints.

Once this has been tested, the next step is to look at the functionality and performance of the LightGBM package.



In [4]:
##### First importing some relevant packages
import numpy as np
import pandas as pd

#Stop pandas from truncating output view
pd.options.display.max_columns = None

#Import Tensorflow
import tensorflow as tf

#Import Keras
from keras import layers
from keras.layers import Input, Dense, Activation, BatchNormalization, Dropout, Reshape, Flatten, Add
from keras.layers.advanced_activations import LeakyReLU, PReLU
from keras.models import Sequential, Model
from keras import regularizers
from keras.optimizers import Adam, SGD

#Import mathematical functions
from random import *
import math
import matplotlib
import matplotlib.pyplot as plt

#Get regular expression package
import re

#Import  Scikit learn framework
import sklearn as sk
from sklearn import svm
from sklearn import linear_model
from sklearn.metrics import roc_auc_score, roc_curve

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [5]:
#Import the functions built in previous parts
from Titanic_Import import *

full_set = pd.read_csv('D:/Datasets/Titanic/train.csv')
sub_set = pd.read_csv('D:/Datasets/Titanic/test.csv')

In [6]:
append_set = full_set
append_set = append_set.append([sub_set], ignore_index =True )
clean_set = Cleanse_Data_v3(append_set)
X_Train, Y_Train, X_CV, Y_CV, X_Test = dataset_splitter(clean_set, cv_size = 200)

## Residual Dense Network

To start, I want to test something I have personally been curious about: how effective are residual layers in dense networks?

We know that they are extremely effective in convolutional networks and allow them to become significantly deeper: the skip connection mitigates vanishing/exploding gradients and makes the identity function trivial to represent, so a layer does not have to learn it.  So do they have the same effect in regular networks?  Or does the nature of the convolutional operator mean that this is a solution unique to convolutional networks?
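As a rough illustration of the identity argument, here is a minimal pure-Python sketch of a dense residual block. The helper names (`relu`, `dense`, `residual_block`) are hypothetical and batch normalization is left out, so this is not the Keras block built later in the notebook, only the same idea in miniature.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def residual_block(x, layers):
    """Apply dense+ReLU layers, then add the block input back before a final ReLU."""
    h = x
    for weights, bias in layers:
        h = relu(dense(h, weights, bias))
    return relu([hi + xi for hi, xi in zip(h, x)])

# With all-zero weights the stacked layers output zeros, so the block
# returns its input (for non-negative inputs, as the final ReLU clips negatives).
zero_layer = ([[0.0] * 3 for _ in range(3)], [0.0] * 3)
x = [0.5, 2.0, 0.0]
print(residual_block(x, [zero_layer, zero_layer]))  # [0.5, 2.0, 0.0]
```

Without the skip connection, the same zero-weight stack would output all zeros, so the layers would have to learn the identity explicitly.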

Given that dense networks don't typically have any residual layers, it is reasonable to assume that there's an inherent flaw with this design philosophy; however, I wish to satisfy my curiosity nonetheless.

Following the insights of standard convolutional approaches, the network is a stack of residual blocks, each with N units and L layers and a skip connection from the start of the block to its end.  We can also draw on the dense connectivity of DenseNet-style architectures (https://arxiv.org/abs/1802.08797) to connect every layer in each residual block to every other layer.  Then we will need a non-linear activation and likely a single dense layer to shrink the size for the next residual block.

This single layer that shrinks the output to U units, where U < N, is intuitively the reason residual blocks are not applied to traditional dense networks.  Each node in a dense network is orthogonal to the others, whereas in convolutional networks locality and spatial distance are a factor.

My initial idea for getting around this issue is a single dense layer that cuts down the width.  Most other forms of dimensionality reduction (such as autoencoders or PCA) won't work, due to the difficulty or impossibility of training them or the irreversibility of the operation.

Perhaps there is a form of dimensionality reduction I have not considered, but to keep it simple I will use a single dense layer, although I expect the pooling layer plays a crucial role in the viability of residual blocks.

I will experiment with two architectures: one with a single skip connection between the first and last layer of each block, and another with every layer of a residual block connected to every previous layer.

In [15]:
def ResBlock_v1(X, layers, units, act_reg = None, ker_reg = None):

    #Single Layer to reduce dimensionality
    X = Dense(units, activation='relu', activity_regularizer = act_reg, kernel_regularizer = ker_reg)(X)
    
    #Batch normalize input as it will be added at end
    X = BatchNormalization()(X)

    #Snapshot first layer to skip to end
    X_shortcut = X
    
    for i in range(layers):
        X = Dense(units, activation='relu', activity_regularizer = act_reg, kernel_regularizer = ker_reg)(X)
        
    #Batch norm last layer
    X = BatchNormalization()(X)
    
    #Add skip connection
    X = Add()([X, X_shortcut])
    #Apply non-linearity
    X = Activation('relu')(X)

    return X

In [16]:
def Build_Resnet(input_shape, layers, act_reg = None, ker_reg = None):
    X_input = Input(input_shape)
    
    num_layers = len(layers)
    
    #Chain the residual blocks; each entry of layers is (units, layers in block)
    X = X_input
    for i in range(num_layers):
        X = ResBlock_v1(X, layers[i][1], layers[i][0], act_reg = act_reg, ker_reg = ker_reg)
   
    #Build output layer
    X = Dense(1, activation='sigmoid')(X)
    
    model = Model(inputs = X_input, outputs = X, name='Residual_model')
    
    return model

Now that the building blocks are established, let's try it out.

In [17]:
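#Block index -> (units, layers in block); keys must run 0..n-1 because the builder indexes by range(len(...))
res_layers = {0: (25, 1), 1: (20, 3), 2: (10, 3), 3: (5, 5), 4: (3, 5)}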
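Because `Build_Resnet` looks blocks up with `range(len(layers))`, the keys of the block specification need to be exactly 0..n-1; a gap raises a `KeyError` at build time. A small check like the following catches that before any Keras layers are created. The helper `check_block_spec` is hypothetical and not part of `Titanic_Import`.

```python
def check_block_spec(spec):
    """Return the (units, layers) pairs in block order, or raise if keys are not 0..n-1."""
    expected = list(range(len(spec)))
    if sorted(spec) != expected:
        raise KeyError(f"block keys {sorted(spec)} should be {expected}")
    return [spec[i] for i in expected]

# Example specification with contiguous keys
blocks = check_block_spec({0: (25, 1), 1: (20, 3), 2: (10, 3)})
print(blocks)  # [(25, 1), (20, 3), (10, 3)]
```

A plain list of tuples would avoid the issue entirely, since list indices are contiguous by construction.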

In [23]:
test_model = Build_Resnet((X_Train.shape[1], ), res_layers, regularizers.l2(0.01), None)
test_model.compile(optimizer = "Adam", loss = "binary_crossentropy", metrics = ["accuracy", K_F1_score])
test_model.fit(x = X_Train, y = Y_Train, epochs = 256, verbose = 1)

Epoch 1/256
...
Epoch 256/256


<keras.callbacks.History at 0x219d45c0>

In [24]:
train_pred = test_model.predict(x = X_Train)
cv_pred = test_model.predict(x = X_CV)

train_hat = normalize_predictions(train_pred)
cv_hat = normalize_predictions(cv_pred)

show_acc(Y_Train, train_hat)
show_acc(Y_CV, cv_hat)

  prec = true_pos / (true_pos + false_pos)


Accuracy =  59.913169319826345
F1 Score =  nan

Confusion Matrix
       Labels  Actual True  Actual False
0   Pred True          0.0           0.0
1  Pred False        277.0         414.0
Accuracy =  67.5
F1 Score =  nan

Confusion Matrix
       Labels  Actual True  Actual False
0   Pred True          0.0           0.0
1  Pred False         65.0         135.0


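So that didn't work: the network collapsed to predicting "did not survive" for every passenger, which is why precision is 0/0 and the F1 score is nan.  I guess there's a good reason this type of architecture is not used. 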
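To make the nan concrete, here is a minimal sketch that recomputes accuracy and F1 from the four confusion-matrix counts. It is not the `show_acc` function from the earlier parts (whose source is not shown here); `confusion_metrics` is a hypothetical stand-in that returns nan when precision or recall is undefined.

```python
import math

def confusion_metrics(true_pos, false_pos, false_neg, true_neg):
    """Return (accuracy in percent, F1); F1 is nan when precision or recall is undefined."""
    total = true_pos + false_pos + false_neg + true_neg
    accuracy = 100.0 * (true_pos + true_neg) / total
    pred_pos = true_pos + false_pos      # everything predicted True
    actual_pos = true_pos + false_neg    # everything actually True
    if pred_pos == 0 or actual_pos == 0:
        return accuracy, math.nan        # 0/0: no positives to score
    precision = true_pos / pred_pos
    recall = true_pos / actual_pos
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)

# Training confusion matrix from the run above: no positive predictions at all
acc, f1 = confusion_metrics(true_pos=0, false_pos=0, false_neg=277, true_neg=414)
print(acc, f1)
```

The accuracy is simply the share of the majority class (414 of 691), which is what a constant "False" predictor scores.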

It's worth trying fully connected blocks for completeness' sake, and as practice implementing this type of architecture: while it may not be suited to this task, it works very well for convolutional networks.

In [33]:
def ResBlock_v2(X, layers, units, act_reg = None, ker_reg = None):
    #Define layer dictionary
    X_Layer = {}
    
    #Single Layer to reduce dimensionality
    X = Dense(units, activation='relu', activity_regularizer = act_reg, kernel_regularizer = ker_reg)(X)

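    #Store the block input; for k >= 1, X_Layer[k] holds the Dense output of layer k before any skip additions
    X_Layer[0] = X
    
    for i in range(layers):
        X = Dense(units, activation='relu', activity_regularizer = act_reg, kernel_regularizer = ker_reg)(X)
        X_Layer[i + 1] = X
        
        #Add the stored outputs X_Layer[0] .. X_Layer[i - 1];
        #note that range(i) stops before X_Layer[i], the Dense output of the previous layer
        for j in range(i):
            X = Add()([X, X_Layer[j]])
                       
        X = Activation('relu')(X)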

    return X
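To see which connections the nested loop in `ResBlock_v2` actually creates, the index arithmetic can be replayed without Keras. The helper `skip_sources` is hypothetical; it only mirrors the loop bounds (`range(layers)` outside, `range(i)` inside) and builds no layers.

```python
def skip_sources(num_layers):
    """For each loop index i, list the stored outputs j that are added to that layer."""
    return {i: list(range(i)) for i in range(num_layers)}

for i, sources in skip_sources(4).items():
    print(f"layer {i + 1}: adds stored outputs {sources}")

# Total number of Add() layers in a block with n layers is 0 + 1 + ... + (n - 1)
print(sum(len(s) for s in skip_sources(5).values()))  # 10
```

Under these loop bounds the first layer in a block receives no skip connection, and each later layer receives every stored output except the most recent one.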

In [34]:
def Build_Resnet_v2(input_shape, layers, act_reg = None, ker_reg = None):
    X_input = Input(input_shape)
    
    num_layers = len(layers)
    
    #Chain the residual blocks; each entry of layers is (units, layers in block)
    X = X_input
    for i in range(num_layers):
        X = ResBlock_v2(X, layers[i][1], layers[i][0], act_reg = act_reg, ker_reg = ker_reg)
   
    #Build output layer
    X = Dense(1, activation='sigmoid')(X)
    
    model = Model(inputs = X_input, outputs = X, name='Residual_model')
    
    return model

In [35]:
test_model = Build_Resnet_v2((X_Train.shape[1], ), res_layers, regularizers.l2(0.01), None)
test_model.compile(optimizer = "Adam", loss = "binary_crossentropy", metrics = ["accuracy", K_F1_score])
test_model.fit(x = X_Train, y = Y_Train, epochs = 256, verbose = 1)

Epoch 1/256
...

<keras.callbacks.History at 0x48adea90>

In [36]:
train_pred = test_model.predict(x = X_Train)
cv_pred = test_model.predict(x = X_CV)

train_hat = normalize_predictions(train_pred)
cv_hat = normalize_predictions(cv_pred)

show_acc(Y_Train, train_hat)
show_acc(Y_CV, cv_hat)

Accuracy =  59.913169319826345
F1 Score =  nan

Confusion Matrix
       Labels  Actual True  Actual False
0   Pred True          0.0           0.0
1  Pred False        277.0         414.0
Accuracy =  67.5
F1 Score =  nan

Confusion Matrix
       Labels  Actual True  Actual False
0   Pred True          0.0           0.0
1  Pred False         65.0         135.0


  prec = true_pos / (true_pos + false_pos)


And thus this experiment was an unfortunate failure; however, maybe there is a way to make it work and enable extremely deep conventional fully connected networks.  Either way, I'm happy I tried it to at least see it for myself.  Perhaps more insight can be gained by delving into the weights and observing how they evolve over training, but that's a different project for a different day.

No matter, the next thing to address in this notebook is a simple test to see how LightGBM works.

## LightGBM

With gradient boosting packages such as XGBoost (among others) being as popular as they are, it's worth having a look at the syntax of the one I decided to try out first, LightGBM, and then evaluating its performance in the context of the other notebooks in this series.

In [1]:
import lightgbm as lgb

So the package imported correctly; now to play around with the parameters.

In [206]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary',
    'num_leaves': 30,
    'learning_rate': 0.03,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

#boosting types available: gbdt (standard gradient boosted trees), rf (random forest),
#                          dart (dropout applied to additive regression trees), goss (gradient-based one-side sampling)


#LightGBM wraps the training and validation data in its own Dataset objects
lgb_train = lgb.Dataset(X_Train, Y_Train)
lgb_eval = lgb.Dataset(X_CV, Y_CV, reference=lgb_train)


In [207]:
testgbm = lgb.train(params,
                lgb_train,
                num_boost_round=500,
                valid_sets=lgb_eval,
                early_stopping_rounds=10)

[1]	valid_0's binary_logloss: 0.67999
Training until validation scores don't improve for 10 rounds.
[2]	valid_0's binary_logloss: 0.66812
[3]	valid_0's binary_logloss: 0.656368
[4]	valid_0's binary_logloss: 0.644288
[5]	valid_0's binary_logloss: 0.63363
[6]	valid_0's binary_logloss: 0.622875
[7]	valid_0's binary_logloss: 0.612415
[8]	valid_0's binary_logloss: 0.602441
[9]	valid_0's binary_logloss: 0.593052
[10]	valid_0's binary_logloss: 0.584099
[11]	valid_0's binary_logloss: 0.575881
[12]	valid_0's binary_logloss: 0.568109
[13]	valid_0's binary_logloss: 0.560739
[14]	valid_0's binary_logloss: 0.553518
[15]	valid_0's binary_logloss: 0.546957
[16]	valid_0's binary_logloss: 0.538992
[17]	valid_0's binary_logloss: 0.53232
[18]	valid_0's binary_logloss: 0.524997
[19]	valid_0's binary_logloss: 0.518535
[20]	valid_0's binary_logloss: 0.512675
[21]	valid_0's binary_logloss: 0.507524
[22]	valid_0's binary_logloss: 0.502107
[23]	valid_0's binary_logloss: 0.497146
[24]	valid_0's binary_logloss: 
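The log line "Training until validation scores don't improve for 10 rounds" describes a patience-based stopping rule. A minimal pure-Python sketch of that rule follows; it is not LightGBM's implementation, the function name `early_stopping_round` is hypothetical, and the loss values are made up for illustration rather than taken from the run above.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, best_loss), stopping once `patience` rounds pass without improvement.

    Rounds are 1-indexed to match the boosting log.
    """
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_round, best_loss

# Hypothetical validation losses: improvement stalls after round 4
losses = [0.68, 0.66, 0.65, 0.64, 0.645, 0.65, 0.66, 0.67]
print(early_stopping_round(losses, patience=3))  # (4, 0.64)
```

The returned round plays the role of `best_iteration`: predictions are made with the model as it stood at the best validation score, not at the round where training stopped.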

In [208]:
gbm_Train_pred = testgbm.predict(X_Train, num_iteration=testgbm.best_iteration)
gbm_CV_pred = testgbm.predict(X_CV, num_iteration=testgbm.best_iteration)


In [209]:
gbm_CV_pred.shape

(200,)

In [210]:
gbm_train_hat = normalize_predictions(gbm_Train_pred)
gbm_cv_hat = normalize_predictions(gbm_CV_pred)

show_acc(Y_Train, gbm_train_hat)
show_acc(Y_CV, gbm_cv_hat)

Accuracy =  86.97539797395079
F1 Score =  0.8192771084337349

Confusion Matrix
       Labels  Actual True  Actual False
0   Pred True        204.0          27.0
1  Pred False         63.0         397.0
Accuracy =  82.5
F1 Score =  0.7482014388489209

Confusion Matrix
       Labels  Actual True  Actual False
0   Pred True         52.0          12.0
1  Pred False         23.0         113.0


### Results

The results above are representative of a typical run.  

Overall, most configurations reached 87-89% training accuracy and 82-84% cross-validation accuracy.  Interestingly, random forest was one of the worst-performing configurations, in contrast to what I saw with the scikit-learn implementation.  

#### Positives
* Trains INSANELY fast (near-instant at almost any number of iterations)
* Automatic validation monitoring/early stopping for easy regularization
* Pretty solid out of the box performance
* Easy and simple to use out of the box (at least at first)

#### Negatives
* Will need to figure out some way to enforce generalization, which will take some reading and trial and error
* Definitely will need ensembling with other models
* Will need a bit of reading to understand exactly what's going on "under the hood" to tune it correctly
* Worse performance than some of the neural net models

This was a fairly short and simple notebook compared to some of the previous ones, but it was a simple exercise in curiosity.  

The next thing I want to experiment with is Genetic Programming.  It seems like a very interesting programming paradigm and I expect to find use for it in the future, and the best way to figure out how to use it is to try it out.