# Grid Search tuning of deep learning models

## Background
In a previous notebook on [Tuning CNN for Sentiment Analysis](https://github.com/Shumakriss/MachineLearningTutorials/blob/master/Keras%20Sentiment%20Analysis%20Part%202.ipynb), I tried my hand at manual hyperparameter tuning. After attempting to tune [hyperparameters](https://www.quora.com/What-are-hyperparameters-in-machine-learning) by hand and seeking some outside suggestions, I found a few [automated techniques](http://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/) and learned that manual tuning is really [not the best way to spend your time](http://machinelearningmastery.com/machine-learning-model-running/). While I am proud of myself for considering ensembling, it seems that we might want to [try some other techniques](http://machinelearningmastery.com/how-to-improve-machine-learning-results/) first. In this notebook, we will follow the [machinelearningmastery.com](http://machinelearningmastery.com) example for [tuning deep learning models with Grid Search in Keras](http://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/).

## Credit
If it was not obvious from the introduction, I have had a lot of help from [machinelearningmastery.com](http://machinelearningmastery.com), and I highly suggest following along with their tutorials. My contributions are the IPython Notebook format, which I hope you find useful, and my trials and tribulations, which I hope you might learn from (or avoid yourself).

## Approach
In this example, we use scikit-learn (a.k.a. sklearn) in Python to tune the hyperparameters of a deep learning model with Grid Search.

## Example 1: Tuning Batch Size and Number of Epochs
To begin, we build a very simple neural network with the Keras Sequential() object and load the Pima Indians Diabetes dataset.

In [6]:
# Use scikit-learn to grid search the batch size and epochs
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

Next, we prepare our model for use with sklearn by wrapping it with a KerasClassifier() object.

In [7]:
# create model
model = KerasClassifier(build_fn=create_model, verbose=0)

Grid search takes a list of candidate values for each parameter and evaluates every combination. We no longer need to write nested loops for each condition, and we get a nice report afterward.
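
To make the idea concrete, here is a minimal sketch of the expansion a grid search performs, written with only the standard library. The `expand_grid` helper is a hypothetical name used for illustration and is not part of scikit-learn; `GridSearchCV` additionally trains and cross-validates a model for each combination.

```python
from itertools import product

# Candidate values, the same ones used in the grid for this example.
param_grid = {
    "batch_size": [10, 20, 40, 60, 80, 100],
    "nb_epoch": [10, 50, 100],
}


def expand_grid(grid):
    """Return one dict per combination of values in the grid."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


combinations = expand_grid(param_grid)
print(len(combinations))  # 6 batch sizes x 3 epoch counts
for combo in combinations[:3]:
    print(combo)
```

The number of combinations is the product of the list lengths, so every extra parameter multiplies the number of models that must be trained.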

In [8]:
# define the grid search parameters
batch_size = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]
param_grid = dict(batch_size=batch_size, nb_epoch=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)

In [9]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.690104 using {'nb_epoch': 100, 'batch_size': 40}
0.348958 (0.024774) with: {'nb_epoch': 10, 'batch_size': 10}
0.572917 (0.134575) with: {'nb_epoch': 50, 'batch_size': 10}
0.662760 (0.033197) with: {'nb_epoch': 100, 'batch_size': 10}
0.597656 (0.022326) with: {'nb_epoch': 10, 'batch_size': 20}
0.567708 (0.161196) with: {'nb_epoch': 50, 'batch_size': 20}
0.645833 (0.030978) with: {'nb_epoch': 100, 'batch_size': 20}
0.566406 (0.008438) with: {'nb_epoch': 10, 'batch_size': 40}
0.627604 (0.030647) with: {'nb_epoch': 50, 'batch_size': 40}
0.690104 (0.012890) with: {'nb_epoch': 100, 'batch_size': 40}
0.497396 (0.123210) with: {'nb_epoch': 10, 'batch_size': 60}
0.546875 (0.142885) with: {'nb_epoch': 50, 'batch_size': 60}
0.648438 (0.019401) with: {'nb_epoch': 100, 'batch_size': 60}
0.627604 (0.019225) with: {'nb_epoch': 10, 'batch_size': 80}
0.665365 (0.017566) with: {'nb_epoch': 50, 'batch_size': 80}
0.647135 (0.014731) with: {'nb_epoch': 100, 'batch_size': 80}
0.608073 (0.053115) wit

## Example 2: Tuning the Training Optimizer Algorithm
In the second example, we vary the optimization algorithm, a choice we usually make based on conceptual knowledge rather than empirical results. As the authors put it:

> This is an odd example, because often you will choose one approach a priori and instead focus on tuning its parameters on your problem (e.g. see the next example).

In [14]:
# Use scikit-learn to grid search the optimization algorithm
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(optimizer='adam'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)

In [16]:
# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)

In [17]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.704427 using {'optimizer': 'Adam'}
0.348958 (0.024774) with: {'optimizer': 'SGD'}
0.348958 (0.024774) with: {'optimizer': 'RMSprop'}
0.471354 (0.156586) with: {'optimizer': 'Adagrad'}
0.669271 (0.029635) with: {'optimizer': 'Adadelta'}
0.704427 (0.031466) with: {'optimizer': 'Adam'}
0.682292 (0.016367) with: {'optimizer': 'Adamax'}
0.703125 (0.003189) with: {'optimizer': 'Nadam'}
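
The report is easier to compare when sorted by score. The sketch below ranks the rows printed above, best first. The tuples are copied from that report, and `rank_results` is a hypothetical helper name, not a scikit-learn function.

```python
# (mean accuracy, standard deviation, optimizer) copied from the report above.
results = [
    (0.348958, 0.024774, "SGD"),
    (0.348958, 0.024774, "RMSprop"),
    (0.471354, 0.156586, "Adagrad"),
    (0.669271, 0.029635, "Adadelta"),
    (0.704427, 0.031466, "Adam"),
    (0.682292, 0.016367, "Adamax"),
    (0.703125, 0.003189, "Nadam"),
]


def rank_results(rows):
    """Sort rows by mean score, highest first."""
    return sorted(rows, key=lambda row: row[0], reverse=True)


ranked = rank_results(results)
for mean, stdev, name in ranked:
    print("%f (%f) %s" % (mean, stdev, name))
```

Note that the gap between Adam and Nadam is much smaller than Adam's standard deviation across folds, so the ordering of the top two should not be read as a firm result.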


## Example 3: Tune Learning Rate and Momentum
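
Before searching over these two values, it may help to see what they control. The sketch below assumes the classical momentum update (`velocity = momentum * velocity - learn_rate * gradient`), which I believe is what the Keras SGD optimizer applies when Nesterov momentum is off. The toy objective `f(w) = w**2` and the helper names are illustrative assumptions and have nothing to do with the diabetes dataset.

```python
def sgd_momentum_step(weight, velocity, gradient, learn_rate=0.01, momentum=0.0):
    """Apply one SGD update with classical momentum to a single weight."""
    velocity = momentum * velocity - learn_rate * gradient
    weight = weight + velocity
    return weight, velocity


def minimize(learn_rate, momentum, steps=50):
    """Minimize f(w) = w**2 from w = 1.0; the gradient of f is 2 * w."""
    weight, velocity = 1.0, 0.0
    for _ in range(steps):
        gradient = 2.0 * weight
        weight, velocity = sgd_momentum_step(
            weight, velocity, gradient, learn_rate, momentum
        )
    return weight


# Same learning rate, with and without momentum.
print(minimize(learn_rate=0.01, momentum=0.0))
print(minimize(learn_rate=0.01, momentum=0.9))
```

On this toy problem momentum gets closer to the minimum in the same number of steps. The two parameters interact, which is why they are searched together rather than one at a time.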

In [20]:
# Use scikit-learn to grid search the learning rate and momentum
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import SGD

# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01, momentum=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    
    # Compile model
    optimizer = SGD(lr=learn_rate, momentum=momentum)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)

In [21]:
# define the grid search parameters
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
param_grid = dict(learn_rate=learn_rate, momentum=momentum)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.680990 using {'learn_rate': 0.01, 'momentum': 0.0}
0.348958 (0.024774) with: {'learn_rate': 0.001, 'momentum': 0.0}
0.348958 (0.024774) with: {'learn_rate': 0.001, 'momentum': 0.2}
0.467448 (0.151098) with: {'learn_rate': 0.001, 'momentum': 0.4}
0.665365 (0.010253) with: {'learn_rate': 0.001, 'momentum': 0.6}
0.669271 (0.030647) with: {'learn_rate': 0.001, 'momentum': 0.8}
0.666667 (0.035564) with: {'learn_rate': 0.001, 'momentum': 0.9}
0.680990 (0.024360) with: {'learn_rate': 0.01, 'momentum': 0.0}
0.677083 (0.026557) with: {'learn_rate': 0.01, 'momentum': 0.2}
0.427083 (0.134575) with: {'learn_rate': 0.01, 'momentum': 0.4}
0.427083 (0.134575) with: {'learn_rate': 0.01, 'momentum': 0.6}
0.544271 (0.146518) with: {'learn_rate': 0.01, 'momentum': 0.8}
0.651042 (0.024774) with: {'learn_rate': 0.01, 'momentum': 0.9}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.0}
0.651042 (0.024774) with: {'learn_rate': 0.1, 'momentum': 0.2}
0.572917 (0.134575) with: {'learn_rate': 

## Example 4: Tuning Network Weight Initialization

This example provided useful insight for me. There are actually many strategies to initialize the weights of a neural network and no one strategy works best in all scenarios.
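
As a rough illustration of how the strategies differ, the sketch below computes the sampling range for three of the uniform schemes using their commonly documented formulas. These formulas are an assumption on my part about the definitions; the Keras version used in this notebook may differ in details, and `uniform_limit` is a hypothetical helper name.

```python
import math


def uniform_limit(scheme, fan_in, fan_out):
    """Return L such that weights are drawn uniformly from [-L, L]."""
    if scheme == "lecun_uniform":
        return math.sqrt(3.0 / fan_in)
    if scheme == "glorot_uniform":
        return math.sqrt(6.0 / (fan_in + fan_out))
    if scheme == "he_uniform":
        return math.sqrt(6.0 / fan_in)
    raise ValueError("unknown scheme: %s" % scheme)


# Hidden layer used throughout this notebook: 8 inputs feeding 12 units.
for scheme in ("lecun_uniform", "glorot_uniform", "he_uniform"):
    limit = uniform_limit(scheme, fan_in=8, fan_out=12)
    print("%s: +/- %.4f" % (scheme, limit))
```

The ranges depend on the layer's shape, so the same scheme produces different starting weights for the hidden layer and the output layer.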

In [22]:
# Use scikit-learn to grid search the weight initialization
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(init_mode='uniform'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, init=init_mode, activation='relu'))
    model.add(Dense(1, init=init_mode, activation='sigmoid'))

    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)

In [23]:
# define the grid search parameters
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
param_grid = dict(init_mode=init_mode)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.720052 using {'init_mode': 'uniform'}
0.720052 (0.024360) with: {'init_mode': 'uniform'}
0.348958 (0.024774) with: {'init_mode': 'lecun_uniform'}
0.712240 (0.012075) with: {'init_mode': 'normal'}
0.651042 (0.024774) with: {'init_mode': 'zero'}
0.700521 (0.010253) with: {'init_mode': 'glorot_normal'}
0.674479 (0.011201) with: {'init_mode': 'glorot_uniform'}
0.661458 (0.028940) with: {'init_mode': 'he_normal'}
0.678385 (0.004872) with: {'init_mode': 'he_uniform'}


## Example 5: Tuning the Neuron Activation Function

From my own experience attending conferences and reading online articles, relu seems to be a popular all-around choice of activation function, and the authors agree:
> Generally, the rectifier activation function is the most popular, but it used to be the sigmoid and the tanh functions and these functions may still be more suitable for different problems.

In [24]:
# Use scikit-learn to grid search the activation function
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(activation='relu'):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, init='uniform', activation=activation))
    model.add(Dense(1, init='uniform', activation='sigmoid'))
    
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)

# define the grid search parameters
activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
param_grid = dict(activation=activation)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))


Best: 0.722656 using {'activation': 'softplus'}
0.649740 (0.009744) with: {'activation': 'softmax'}
0.722656 (0.033603) with: {'activation': 'softplus'}
0.688802 (0.001841) with: {'activation': 'softsign'}
0.720052 (0.018136) with: {'activation': 'relu'}
0.683594 (0.003189) with: {'activation': 'tanh'}
0.704427 (0.020752) with: {'activation': 'sigmoid'}
0.687500 (0.009568) with: {'activation': 'hard_sigmoid'}
0.697917 (0.019225) with: {'activation': 'linear'}


The authors found that a linear activation function performed best; in my case it was softplus. Softplus, relu, and linear all score close to one another. The authors also mention that the data should be pre-processed to suit each activation function, which was not done here. That omission likely contributes to the surprising results.
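
For reference, here is a minimal standard-library sketch of a few of the activations compared above. These are the textbook definitions, which I assume match what Keras computes up to numerical details. Their different output ranges (sigmoid stays in (0, 1), tanh in (-1, 1), relu and softplus are unbounded above) are presumably why the authors suggest preparing the data for each function.

```python
import math


def relu(x):
    """Rectifier: pass positive inputs through, clamp the rest to zero."""
    return max(0.0, x)


def softplus(x):
    """Smooth approximation of relu: log(1 + e^x)."""
    return math.log1p(math.exp(x))


def sigmoid(x):
    """Logistic function, squashing inputs into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(x):
    """Identity: the output equals the input."""
    return x


# Evaluate each function at a negative, zero, and positive input.
for x in (-2.0, 0.0, 2.0):
    print(
        "x=%+.1f relu=%.4f softplus=%.4f tanh=%+.4f sigmoid=%.4f linear=%+.1f"
        % (x, relu(x), softplus(x), math.tanh(x), sigmoid(x), linear(x))
    )
```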

## Example 6: Tuning Dropout Regularization
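
This example searches over two regularization settings at once: the dropout rate and a max-norm weight constraint. The sketch below shows, under my own assumptions, what each one does to a plain list of numbers. It uses inverted dropout (survivors are rescaled so the expected value is unchanged) and an L2 max-norm rescaling; both helper names are hypothetical and this is not the Keras implementation.

```python
import math
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [value / keep if rng.random() >= rate else 0.0 for value in values]


def max_norm(weights, limit):
    """Rescale the weight vector if its L2 norm exceeds `limit`."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= limit:
        return list(weights)
    return [w * limit / norm for w in weights]


rng = random.Random(7)  # seeded so repeated runs drop the same units
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))
print(max_norm([3.0, 4.0], limit=1.0))   # norm 5.0 is scaled down to 1.0
print(max_norm([3.0, 4.0], limit=10.0))  # already within the limit
```

Dropout is only applied during training; at prediction time all units are kept.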

In [26]:
# Use scikit-learn to grid search the dropout rate
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm

# Function to create model, required for KerasClassifier
def create_model(dropout_rate=0.0, weight_constraint=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, init='uniform', activation='linear', W_constraint=maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, init='uniform', activation='sigmoid'))
    
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)

# define the grid search parameters
weight_constraint = [1, 2, 3, 4, 5]
dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
param_grid = dict(dropout_rate=dropout_rate, weight_constraint=weight_constraint)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.725260 using {'weight_constraint': 4, 'dropout_rate': 0.3}
0.696615 (0.031948) with: {'weight_constraint': 1, 'dropout_rate': 0.0}
0.696615 (0.031948) with: {'weight_constraint': 2, 'dropout_rate': 0.0}
0.691406 (0.026107) with: {'weight_constraint': 3, 'dropout_rate': 0.0}
0.710938 (0.011049) with: {'weight_constraint': 4, 'dropout_rate': 0.0}
0.708333 (0.009744) with: {'weight_constraint': 5, 'dropout_rate': 0.0}
0.709635 (0.010253) with: {'weight_constraint': 1, 'dropout_rate': 0.1}
0.709635 (0.007366) with: {'weight_constraint': 2, 'dropout_rate': 0.1}
0.708333 (0.006639) with: {'weight_constraint': 3, 'dropout_rate': 0.1}
0.703125 (0.006379) with: {'weight_constraint': 4, 'dropout_rate': 0.1}
0.708333 (0.009744) with: {'weight_constraint': 5, 'dropout_rate': 0.1}
0.710938 (0.009568) with: {'weight_constraint': 1, 'dropout_rate': 0.2}
0.710938 (0.009568) with: {'weight_constraint': 2, 'dropout_rate': 0.2}
0.720052 (0.021710) with: {'weight_constraint': 3, 'dropout_rate': 0.

## Example 7: Tuning the Number of Neurons in the Hidden Layer
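
The number of neurons controls how many trainable weights the network has. As a quick worked example, the sketch below counts the parameters of the two-layer network used in this notebook (8 inputs, one hidden Dense layer, one output unit) for each value in the grid. `count_parameters` is a hypothetical helper, and the count assumes each Dense layer has one bias per unit.

```python
def count_parameters(neurons, inputs=8, outputs=1):
    """Count weights and biases for input -> hidden -> output Dense layers."""
    hidden = inputs * neurons + neurons   # weights plus one bias per hidden unit
    output = neurons * outputs + outputs  # weights plus one bias per output unit
    return hidden + output


# Same candidate sizes as the grid in this example.
for neurons in [1, 5, 10, 15, 20, 25, 30]:
    print("%2d neurons -> %3d parameters" % (neurons, count_parameters(neurons)))
```

With 768 rows in the results' denominators suggesting a small dataset, a few hundred parameters is already a sizeable model, which may be part of why larger layers do not clearly help here.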

In [27]:
# Use scikit-learn to grid search the number of neurons
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm

# Function to create model, required for KerasClassifier
def create_model(neurons=1):
    # create model
    model = Sequential()
    model.add(Dense(neurons, input_dim=8, init='uniform', activation='linear', W_constraint=maxnorm(4)))
    model.add(Dropout(0.2))
    model.add(Dense(1, init='uniform', activation='sigmoid'))
    
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

# create model
model = KerasClassifier(build_fn=create_model, nb_epoch=100, batch_size=10, verbose=0)

# define the grid search parameters
neurons = [1, 5, 10, 15, 20, 25, 30]
param_grid = dict(neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X, Y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.716146 using {'neurons': 20}
0.700521 (0.011201) with: {'neurons': 1}
0.714844 (0.011049) with: {'neurons': 5}
0.712240 (0.017566) with: {'neurons': 10}
0.695313 (0.014616) with: {'neurons': 15}
0.716146 (0.024150) with: {'neurons': 20}
0.714844 (0.033146) with: {'neurons': 25}
0.709635 (0.025976) with: {'neurons': 30}


## Conclusion

We used Grid Search to try many combinations of parameters. We also learned that some of the parameters we could change had more options than we knew about.