## Dropout
Dropout is a regularization technique for neural network models. Dropout is a technique where randomly selected neurons are ignored during training. They are dropped-out randomly. This means that their contribution to the activation of downstream neurons is temporally removed on the forward pass and any weight updates are not applied to the neuron on the backward pass.

Link to paper:
- http://jmlr.org/papers/v15/srivastava14a.html


As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for specific features providing some specialization. Neighboring neurons come to rely on this specialization, which if taken too far can result in a fragile model too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptations. You can imagine that if neurons are randomly dropped out of the network during training, that other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.

The effect is that the network becomes less sensitive to the specific weights of neurons. This in turn results in a network that is capable of better generalization and is less likely to overfit the training data.

### Dropout Regularization in Keras
Dropout is easily implemented by randomly selecting nodes to be dropped-out with a given probability (e.g. 20%) each weight update cycle. Dropout is only used during the training of a model and is not used when evaluating the skill of the model. Next we will explore a few different ways of using Dropout in Keras.

Here use the Sonar dataset binary classification dataset. Evaluate the models using scikit-learn with 10-fold cross-validation.  There are 60 input values and a single output value and the input values are standardized before being used in the network. The baseline neural network model has two hidden layers, the first with 60 units and the second with 30. Stochastic gradient descent is used to train the model with a relatively low learning rate and momentum.

In [11]:
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm
from keras.optimizers import SGD
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# fix random seed for reproducibility
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [3]:
seed = 7
np.random.seed(seed)
 
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data")
dataset = pd.read_csv(url, header=None, prefix = 'V')

#print summary info
dataset.head()

Unnamed: 0,V0,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V51,V52,V53,V54,V55,V56,V57,V58,V59,V60
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,R


In [4]:
#seperate data into input and output variables
X = np.array(dataset.iloc[:, 0:60])
y = np.array(dataset.iloc[:, 60])

In [5]:
#encode class as integers
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)

In [6]:
#create baseline
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(60, input_dim=60, kernel_initializer= 'normal' , activation='relu' ))
    model.add(Dense(30, kernel_initializer='normal' , activation='relu' ))
    model.add(Dense(1, kernel_initializer= 'normal' , activation= 'sigmoid' ))
    # Compile model
    
    #Stochastic gradient descent optimizer.
    #Includes support for momentum, learning rate decay, and Nesterov momentum.
    #lr: float >= 0. Learning rate.
    #momentum: float >= 0. Parameter that accelerates SGD in the relevant direction and dampens oscillations.
    #decay: float >= 0. Learning rate decay over each update.
    #nesterov: boolean. Whether to apply Nesterov momentum.
    
    sgd = SGD(lr=0.01, momentum=0.8, decay=0.0, nesterov=False)
    model.compile(loss='binary_crossentropy' , optimizer='sgd', metrics=['accuracy'])
    return model

In [7]:
#evalaute the model
np.random.seed(seed)
estimators = []
estimators.append(('standardize',StandardScaler()))
estimators.append(('mlp',KerasClassifier(build_fn=create_baseline,epochs=200,batch_size = 20,verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits = 10,shuffle=True,random_state=seed)
results = cross_val_score(pipeline,X,encoded_y,cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Baseline: 83.64% (7.29%)


### Using Dropout on Visible layer
Add a new Dropout layer between the input (or visible layer) and the first hidden layer. The dropout
rate is set to 20%, meaning one in five inputs will be randomly excluded from each update cycle.

Additionally, as recommended in the original paper on dropout (see above), a constraint is imposed on
the weights for each hidden layer, ensuring that the maximum norm of the weights does not
exceed a value of 3. This is done by setting the **kernel_constraint** argument on the Dense
class when constructing the layers. The learning rate was lifted by one order of magnitude and
the momentum was increased to 0.9. These increases in the learning rate were also recommended
in the original dropout paper.

Use the same model as above but with differences outlined immediately above

In [8]:
# dropout in the input layer with weight constraint
def create_model():
    # create model
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(60,)))
    model.add(Dense(60, kernel_initializer='normal', activation='relu',kernel_constraint=maxnorm(3)))
    model.add(Dense(30, kernel_initializer='normal', activation='relu',kernel_constraint=maxnorm(3)))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    sgd = SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=False)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

In [12]:
#evaluate the model
np.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_model, epochs=300, batch_size=16,verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_y, cv=kfold)
print("Visible: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Visible: 84.59% (4.80%)


Better performance than baseline

### Using Dropout on Hidden layers

Dropout can be applied to hidden neurons in the body of the network model. Dropout is applied between the two hidden layers and between the last hidden layer and
the output layer. Again a dropout rate of 20% is used as is a weight constraint on those layers

In [13]:
# dropout in hidden layers with weight constraint
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu',kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(30, kernel_initializer='normal', activation='relu',kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    sgd = SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=False)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

In [15]:
np.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_model, epochs=300, batch_size=16,verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_y, cv=kfold)
print("Hidden: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Hidden: 81.64% (5.79%)


Worse performance than the baseline

Using dropout in the hidden layers did not lift performance. In fact, performance was worse than the baseline.
Maybe additional training epochs are required or that further tuning is required to
the learning rate ?

### Tips for using Dropout
Practical tips from the reference paper:
 - Generally use a small dropout value of 20%-50% of neurons with 20% providing a good
    starting point. A probability too low has minimal effect and a value too high results in
    under-learning by the network.
 - Use a larger network. You are likely to get better performance when dropout is used
    on a larger network, giving the model more of an opportunity to learn independent
    representations.
 - Use dropout on input (visible) as well as hidden layers. Application of dropout at each
    layer of the network has shown good results.
 - Use a large learning rate with decay and a large momentum. Increase your learning rate
    by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99.
 - Constrain the size of network weights. A large learning rate can result in very large
    network weights. Imposing a constraint on the size of network weights such as max-norm
    regularization with a size of 4 or 5 has been shown to improve results.