## Dropout with Keras

This notebook shows how to add dropout layers to Keras models in order to include the dropout regularisation technique when training neural networks.

Dropout is a regularisation technique introduced in 2014 by Srivastava et al. The original publication can be found here in pdf format: http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf

The authors of this paper found that including dropout in the training process improved the generalisation of their models. They demonstrate this by beating benchmark scores on several well-known data sets, including MNIST and CIFAR-10.

The images below show the basic principle and have been taken directly from the published paper.

![Pic](Images/dropout.png)

### How Dropout Works

As the image above shows, the term 'dropout' refers to dropping out units, visible or hidden, from the neural network. A dropped unit is temporarily removed from the network, along with all of its incoming and outgoing connections. Units are chosen at random: each unit is retained with a fixed probability p, independently of the other units. The value of p can be chosen by the user for each layer. Srivastava et al. state that the retention probability should be close to 1 for visible (input) units, while 0.5 is close to optimal for hidden units.
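To make the sampling step concrete, here is a minimal NumPy sketch of drawing a dropout mask. The helper name `dropout_mask` and the toy activations are my own, for illustration only; they are not part of Keras or of the paper.

```python
import numpy as np

def dropout_mask(n_units, p_retain, rng):
    """Sample an independent keep/drop flag for each unit.

    Each unit is kept with probability p_retain (hypothetical helper).
    """
    return rng.random(n_units) < p_retain

rng = np.random.default_rng(0)
activations = np.ones(10000)  # toy activations, all equal to 1

# Hidden units: retain each with probability 0.5
hidden_mask = dropout_mask(activations.size, p_retain=0.5, rng=rng)
# Input units: retain each with probability 0.9
input_mask = dropout_mask(activations.size, p_retain=0.9, rng=rng)

# Dropped units output zero for this training pass
thinned_hidden = activations * hidden_mask
thinned_input = activations * input_mask

hidden_kept = hidden_mask.mean()
input_kept = input_mask.mean()
print("Hidden kept: {:.2f}, input kept: {:.2f}".format(hidden_kept, input_kept))
```

The fraction of units kept should land close to the chosen retention probability for each layer.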

A network with n units can be thinned in 2^n different ways, so training with dropout is in theory like training 2^n different networks that share their weights. A new set of units is dropped at the beginning of each training pass. This means each individual thinned network gets trained very rarely, or not at all. At test time no units are dropped; instead, the outgoing weights of each unit are multiplied by its retention probability. This ensures that the output of a unit at test time matches its expected output during training. It is analogous to combining all 2^n networks into one final test model.
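The weight-scaling rule can be checked by brute force for a tiny case. The sketch below is an illustration under simplifying assumptions: a single linear unit with 4 inputs, a retention probability of 0.8, and random toy values for `x` and `w`. It enumerates all 2^n dropout masks, averages the outputs weighted by each mask's probability, and compares the result with the output obtained by scaling the weights by p. For a linear unit the two agree exactly; for a full network with non-linearities the scaling is only an approximation of the average.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4       # number of input units (toy size so 2^n masks can be enumerated)
p = 0.8     # retention probability (illustrative value)
x = rng.normal(size=n)   # toy inputs
w = rng.normal(size=n)   # toy weights of a single linear unit

expected = 0.0
n_masks = 0
for mask in itertools.product([0, 1], repeat=n):
    mask = np.array(mask)
    kept = mask.sum()
    # Probability of drawing exactly this mask
    prob = p ** kept * (1 - p) ** (n - kept)
    # Output of the thinned unit, weighted by the mask probability
    expected += prob * np.dot(w, mask * x)
    n_masks += 1

# Test-time output: no units dropped, weights scaled by p
test_output = np.dot(p * w, x)

print("Masks enumerated:", n_masks)
print("Outputs match:", np.isclose(expected, test_output))
```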

An additional aspect noted by the authors, relevant when training with backprop, is the combination of dropout with max-norm constraints. A max-norm constraint stops weight vectors from becoming too large by enforcing an upper bound on the magnitude of the incoming weight vector of each neuron. If a gradient descent step moves a weight vector so that its magnitude ||w||_2 becomes greater than the constraint C, the weight vector is projected back onto the surface of a ball of radius C. This prevents the weight vectors from growing out of control, as can happen when using a large learning rate.
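The projection step can be sketched in a few lines of NumPy. This is an illustration rather than the Keras implementation: the function name `max_norm_project` is hypothetical, and I assume a weight matrix of shape (n_inputs, n_units), so each column is the incoming weight vector of one unit.

```python
import numpy as np

def max_norm_project(W, c):
    """Rescale every column of W whose L2 norm exceeds c back to norm c.

    Columns with norm <= c are left unchanged (hypothetical helper).
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # Scale factor is 1 inside the ball and c / norm outside it
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

# Two units: the first has an incoming weight norm of 5, the second 0.5
W = np.array([[3.0, 0.3],
              [4.0, 0.4]])

W_proj = max_norm_project(W, c=3.0)
col_norms = np.linalg.norm(W_proj, axis=0)
print(col_norms)
```

Only the first column is rescaled, since the second already lies inside the ball of radius 3.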

The code below uses the MNIST dataset. The first model is a baseline neural network optimised with stochastic gradient descent. The second model adds dropout and max-norm constraints.

In [2]:
import pandas as pd
import numpy as np
import os
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.optimizers import SGD
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold, cross_val_score as cv

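def feature_label_split(path, data='train'):
    """Load mnist_<data>.csv from path and split it into features and labels."""
    df = pd.read_csv(os.path.join(path, 'mnist_{}.csv'.format(data)), header=None)
    y = df.iloc[:, 0]                   # first column holds the digit label
    X = df.drop(df.columns[0], axis=1)  # remaining columns hold the pixel values
    return X, y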

Using Theano backend.


In [3]:
cwd = os.getcwd()

X_train, y_train = feature_label_split(cwd, data='train')
print("Training Data: Rows, Columns")
print("Feature set {0} and labels {1}".format(X_train.shape, y_train.shape))

Training Data: Rows, Columns
Feature set (60000, 784) and labels (60000,)


In [4]:
X_test, y_test = feature_label_split(cwd, data='test')
print("Test Data: Rows, Columns")
print("Feature set {0} and labels {1}".format(X_test.shape, y_test.shape))

Test Data: Rows, Columns
Feature set (10000, 784) and labels (10000,)


In [5]:
X_train, X_test = np.asarray(X_train), np.asarray(X_test)
X_train.shape, X_test.shape

((60000, 784), (10000, 784))

In [6]:
y_train_ohe = np_utils.to_categorical(y_train)

for i in range(5):
    print("Original label: {0} --- One hot encoded: {1}".format(y_train[i], y_train_ohe[i]))

Original label: 5 --- One hot encoded: [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
Original label: 0 --- One hot encoded: [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
Original label: 4 --- One hot encoded: [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]
Original label: 1 --- One hot encoded: [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
Original label: 9 --- One hot encoded: [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
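For reference, one-hot encoding itself is simple to reproduce. The following is a rough NumPy sketch of the idea, not the Keras implementation; the function name `one_hot` is my own, and the labels are the first five shown in the output above.

```python
import numpy as np

def one_hot(labels, n_classes=None):
    """Return a (n_samples, n_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    if n_classes is None:
        # Assume classes are numbered 0..max(labels)
        n_classes = labels.max() + 1
    encoded = np.zeros((labels.size, n_classes))
    # Set the column matching each label to 1
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

sample = one_hot([5, 0, 4, 1, 9])
print(sample.shape)
print(sample[0])
```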


### Basic MLP - BackProp

The model is trained with 5-fold cross-validation. A 10-fold split is more commonly recommended, but the number of folds is restricted here because the code is slow to run in the notebook.
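As a rough illustration of what a stratified split does, the sketch below assigns samples to folds so that every fold contains the same share of each class. It is a simplified stand-in for scikit-learn's `StratifiedKFold`, not its actual algorithm; the helper `stratified_folds` and the toy labels are assumptions made for the example.

```python
import numpy as np

def stratified_folds(y, n_folds, rng):
    """Assign each sample a fold index, balancing classes across folds."""
    y = np.asarray(y)
    fold_of = np.empty(y.size, dtype=int)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        # Deal the shuffled samples of this class round-robin into the folds
        fold_of[idx] = np.arange(idx.size) % n_folds
    return fold_of

rng = np.random.default_rng(0)
y_toy = np.repeat(np.arange(10), 50)  # toy labels: 10 classes, 50 samples each
fold_of = stratified_folds(y_toy, n_folds=5, rng=rng)

for k in range(5):
    val_idx = np.flatnonzero(fold_of == k)
    train_idx = np.flatnonzero(fold_of != k)
    # A model would be fitted on train_idx and scored on val_idx here
    print("Fold {}: {} train, {} validation".format(k, train_idx.size, val_idx.size))
```

Each fold serves once as the validation set, and the reported accuracy is the mean over the folds.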

In [7]:
# Set a random seed so the model is replicable 
np.random.seed(0) 

def mnist_NN_model():
    model = Sequential()
    model.add(Dense(input_dim=X_train.shape[1], 
                output_dim=50, 
                init='uniform', 
                activation='tanh'))
    model.add(Dense(input_dim=50, 
                output_dim=50, 
                init='uniform', 
                activation='tanh'))
    model.add(Dense(input_dim=50, 
                output_dim=y_train_ohe.shape[1], 
                init='uniform', 
                activation='softmax'))
    sgd = SGD(lr=0.001, decay=1e-7, momentum=.9, nesterov=False)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

kf = StratifiedKFold(y_train, n_folds=5, shuffle=True, random_state=0)

# Wrap the model-building function in a scikit-learn compatible classifier
model = KerasClassifier(build_fn=mnist_NN_model, nb_epoch = 50, batch_size=300, verbose=0)

results = cv(model, X_train, y_train_ohe, cv=kf)

In [8]:
print("Accuracy: {:.2f}%, Std: {:.2f}".format(results.mean()*100, results.std()*100))

Accuracy: 93.16%, Std: 0.04


### MLP - BackProp with Dropout and max-norm

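Dropout can be applied to any layer by adding a Dropout layer with .add(). Note that the value passed to Dropout() is the fraction of units to drop, not the retention probability, so Dropout(0.1) on the inputs corresponds to a retention probability of 0.9.

Max-norm is applied when a layer is added, through the W_constraint keyword argument of Dense().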

In [9]:
# Set a random seed so the model is replicable 
np.random.seed(0) 

def mnist_NN_model():
    model = Sequential()
    model.add(Dropout(0.1, input_shape=(X_train.shape[1],)))
    model.add(Dense(output_dim=50, 
                init='uniform', 
                activation='tanh',
                W_constraint=maxnorm(3)))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim=50, 
                output_dim=50, 
                init='uniform', 
                activation='tanh',
                W_constraint=maxnorm(3)))
    model.add(Dense(input_dim=50, 
                output_dim=y_train_ohe.shape[1], 
                init='uniform', 
                activation='softmax'))
    sgd = SGD(lr=0.001, decay=1e-7, momentum=.9, nesterov=False)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

kf = StratifiedKFold(y_train, n_folds=5, shuffle=True, random_state=0)

# Wrap the model-building function in a scikit-learn compatible classifier
model = KerasClassifier(build_fn=mnist_NN_model, nb_epoch = 50, batch_size=300, verbose=0)

results = cv(model, X_train, y_train_ohe, cv=kf)

In [10]:
print("Accuracy: {:.2f}%, Std: {:.2f}".format(results.mean()*100, results.std()*100))

Accuracy: 91.00%, Std: 0.38


In this example I actually managed to decrease the mean accuracy, from 93.16% to 91.00%. This is most likely due to a poor choice of the max-norm constant. With proper hyperparameter optimisation I believe the scores would improve, as reported in the original paper. The standard deviation across folds is roughly 9 times larger than for the non-regularised model (0.38 vs 0.04). This could be an indication that the 'basic' model is slightly overfitting the training data.