## Keras with SciKit GridSearchCV

This notebook shows how to perform hyperparameter tuning for deep learning models built using Keras. The GridSearchCV class from scikit-learn can be used to search over any tunable parameter. Some of the tunable parameters for a multilayer perceptron (MLP) are:

* Number of hidden neurons per layer
* Activation functions
* Initialisations
* Optimizers
* Learning rate
* Decay
* Momentum
* Dropout

Keras is a library that enables deep learning practitioners to build multi-layered deep learning models with minimal code. See the Keras Intro notebook for an introduction to the Keras library.

scikit-learn is the go-to machine learning library in Python. This notebook shows how to combine the functionality of scikit-learn with models built in Keras, which is especially useful for performing k-fold cross-validation, building pipelines and optimising hyperparameters.

I will use the Wine dataset from 1994 and predict the class of each wine from its set of attributes. The dataset can be obtained from the UCI machine learning archive. It is very small, which leaves time for an extensive grid search.

https://archive.ics.uci.edu/ml/datasets/Wine
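
As a point of reference for the k-fold cross-validation and pipeline features mentioned above, here is a minimal sketch that uses only scikit-learn. It makes several assumptions that are not part of this notebook: it loads the copy of the Wine data bundled with scikit-learn (`load_wine`) instead of the UCI file, it uses `LogisticRegression` as a stand-in for a Keras model, and it targets a scikit-learn release that provides `sklearn.model_selection`.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the Wine data (assumption: used here instead of the UCI download)
X_wine, y_wine = load_wine(return_X_y=True)

# Scaling sits inside the pipeline so it is re-fitted on every training fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold stratified cross-validation; the fold count is an arbitrary choice
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(pipeline, X_wine, y_wine, cv=cv)

print("Fold accuracies:", fold_scores.round(3))
print("Mean accuracy: {:.3f}".format(fold_scores.mean()))
```

Any estimator that follows the scikit-learn interface can take the place of `LogisticRegression` in such a pipeline, which is what the Keras wrapper used later in this notebook provides.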

In [1]:
import pandas as pd
import numpy as np

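url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
# Note: wine.data has no header row, so read_csv uses the first sample as the
# column names (see the output below); pass header=None to keep all 178 rows
df = pd.read_csv(url)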

df.head()

Unnamed: 0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
0,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
2,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
3,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
4,1,14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450


In [2]:
cols = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols',
        'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue',
        'OD280/OD315 of diluted wines', 'Proline']

# Replace the inferred header with the attribute names from the dataset description
df.columns = cols

# to_categorical expects labels that start at 0, so remap classes 1, 2, 3 to 0, 1, 2
class_ind = np.sort(df.Class.value_counts().index)

if class_ind[0] == 1:
    class_convert = {1: 0, 2: 1, 3: 2}
    df.Class = df.Class.map(lambda x: class_convert[x])

df.Class.value_counts()

1    71
0    58
2    48
Name: Class, dtype: int64

Create the feature data (X) and labels (y).

Split the data into training and test sets using scikit-learn's train_test_split function.

Convert the dataframes to NumPy arrays and one-hot encode the class labels.

In [None]:
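# sklearn.cross_validation is the older home of train_test_split;
# newer scikit-learn releases provide it in sklearn.model_selection
from sklearn.cross_validation import train_test_split
from keras.utils import np_utils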

X = df.drop(['Class'], axis=1)
y = df.Class

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=45)

# Convert all four splits to NumPy arrays (the test labels included)
X_train, X_test, y_train, y_test = (np.asarray(a) for a in (X_train, X_test, y_train, y_test))

y_train_ohe = np_utils.to_categorical(y_train)

for i in range(5):
    print("Original label: {0} --- One hot encoded: {1}".format(y_train[i], y_train_ohe[i]))

Using Theano backend.


Original label: 1 --- One hot encoded: [ 0.  1.  0.]
Original label: 1 --- One hot encoded: [ 0.  1.  0.]
Original label: 2 --- One hot encoded: [ 0.  0.  1.]
Original label: 1 --- One hot encoded: [ 0.  1.  0.]
Original label: 1 --- One hot encoded: [ 0.  1.  0.]
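
The encoding printed above can be reproduced with plain NumPy, which may help to show what `to_categorical` does. This is a small sketch with made-up labels rather than the notebook's data: indexing an identity matrix by the integer labels yields the one-hot rows, and `argmax` reverses the encoding.

```python
import numpy as np

labels = np.array([1, 1, 2, 1, 0])  # hypothetical integer class labels
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes)[labels]

# argmax along each row recovers the original integer labels
recovered = one_hot.argmax(axis=1)

for original, encoded in zip(labels, one_hot):
    print("Original label: {0} --- One hot encoded: {1}".format(original, encoded))
```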


The source code for understanding how Keras uses scikit-learn functionality can be found here:
https://github.com/fchollet/keras/blob/master/keras/wrappers/scikit_learn.py

The BaseWrapper class serves as the parent class. Rather than using it directly, instantiate one of its subclasses, KerasRegressor or KerasClassifier, which inherit its behaviour and keep the regression and classification cases separate.

The KerasClassifier, which we will use here, takes the argument 'build_fn'. This lets us define our own function, build an MLP inside it and pass that function to KerasClassifier.

The param_grid argument takes a dictionary of parameters. GridSearchCV forms every possible hyperparameter (HP) combination from it and performs 3-fold cross-validation on each one. For additional GridSearchCV functionality see the documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
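
To make the expansion of a parameter dictionary concrete, the sketch below lists the combinations for the values searched in the next cell. It assumes a scikit-learn release that provides `sklearn.model_selection.ParameterGrid`; GridSearchCV performs this expansion internally, so the snippet is for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched in this notebook
grid = ParameterGrid({
    "optimizer": ["rmsprop"],
    "init": ["normal"],
    "nb_epoch": [25, 50],
    "batch_size": [10, 20],
})

n_folds = 3  # fold count stated in the text above
combinations = list(grid)
total_fits = len(combinations) * n_folds

for combination in combinations:
    print(combination)
print("{} combinations x {} folds = {} model fits".format(len(combinations), n_folds, total_fits))
```

The number of fits grows multiplicatively with every extra value in every list, which is why a small dataset is convenient for this kind of search.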

In [None]:
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.grid_search import GridSearchCV

# Set a random seed so the model is replicable 
np.random.seed(0) 

def call_model(optimizer = 'sgd', init='uniform'):
    model = Sequential()
    model.add(Dense(input_dim=X_train.shape[1], 
                output_dim=50, 
                init=init, 
                activation='tanh'))
    model.add(Dense(input_dim=50, 
                output_dim=50, 
                init=init, 
                activation='tanh'))
    model.add(Dense(input_dim=50, 
                output_dim=y_train_ohe.shape[1], 
                init=init, 
                activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Call instance of the model
model = KerasClassifier(build_fn=call_model)

# Grid search over epochs and batch size. The optimizer and initialisation lists hold
# a single value each here, but all of these lists can be extended; the Keras
# documentation lists the available options
optimizers = ['rmsprop']
init = ['normal']
epochs = [25,50]
batches = [10, 20]

# Only 4 HP combinations are tried in this case (2 epoch settings x 2 batch sizes)
param_grid = dict(optimizer=optimizers, nb_epoch=epochs, batch_size=batches, init=init, verbose=[1])

gscv = GridSearchCV(estimator=model, param_grid=param_grid)
model_fit = gscv.fit(X_train, y_train_ohe)

We can evaluate the grid search in much the same way as a grid search over models from scikit-learn's own library.

The best score is available as best\_score\_ and the parameters that produced it as best\_params\_.

In [19]:
# summarize results
print("Best: {:.4f} using {}\n".format(model_fit.best_score_, model_fit.best_params_))
for params, mean_score, scores in model_fit.grid_scores_:
    print("Mean Score: {:.4f}  StDev: {:.4f} with: \n {}".format(scores.mean(), scores.std(), params))

Best: 0.9119 using {'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 10, 'init': 'normal', 'verbose': 1}

Mean Score: 0.7987  StDev: 0.0695 with: 
 {'optimizer': 'rmsprop', 'nb_epoch': 25, 'batch_size': 10, 'init': 'normal', 'verbose': 1}
Mean Score: 0.9119  StDev: 0.0471 with: 
 {'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 10, 'init': 'normal', 'verbose': 1}
Mean Score: 0.7358  StDev: 0.0308 with: 
 {'optimizer': 'rmsprop', 'nb_epoch': 25, 'batch_size': 20, 'init': 'normal', 'verbose': 1}
Mean Score: 0.8553  StDev: 0.0728 with: 
 {'optimizer': 'rmsprop', 'nb_epoch': 50, 'batch_size': 20, 'init': 'normal', 'verbose': 1}
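
The `grid_scores_` attribute used above belongs to the older `sklearn.grid_search` API. In newer scikit-learn releases the per-combination results are exposed through `cv_results_` instead. The sketch below shows the equivalent summary under a few assumptions: it uses `sklearn.model_selection.GridSearchCV`, the bundled `load_wine` data and a `LogisticRegression` stand-in, since the exact Keras wrapper setup depends on the installed versions.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(pipe, param_grid={"logisticregression__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_wine, y_wine)

# cv_results_ is a dict of arrays with one entry per parameter combination
results = search.cv_results_
print("Best: {:.4f} using {}\n".format(search.best_score_, search.best_params_))
for mean, std, params in zip(results["mean_test_score"], results["std_test_score"], results["params"]):
    print("Mean Score: {:.4f}  StDev: {:.4f} with: {}".format(mean, std, params))
```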


The best score achieved here is 0.9119. Both batch sizes scored higher with 50 epochs than with 25, so training for more epochs is likely to improve it further.

Testing on the test set yields the score shown below. With a 10% split the test set contains only 18 samples, so 88.89% corresponds to 16 correct predictions. This is quite a poor score for the dataset, as simple regression and clustering models score in excess of 95%, but the purpose of this notebook is to demonstrate the ability to use sklearn functionality with Keras.


In [27]:
pred = model_fit.predict(X_test)
accuracy = sum(pred == y_test) / len(y_test)
print("\n\nAccuracy: {:.2f}%".format(accuracy*100))


Accuracy: 88.89%
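
The accuracy cell above works because `predict` returns integer class labels that can be compared directly with `y_test`. If a model returns class probabilities instead, the labels have to be recovered with `argmax` first. The following sketch uses made-up probabilities and labels purely to illustrate the calculation.

```python
import numpy as np

# Hypothetical softmax outputs for four samples and three classes
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])
true_labels = np.array([0, 1, 2, 2])  # hypothetical ground truth

# The most probable class per row is the predicted label
predicted_labels = probabilities.argmax(axis=1)

# The mean of the boolean matches is the fraction of correct predictions
accuracy = np.mean(predicted_labels == true_labels)
print("Accuracy: {:.2f}%".format(accuracy * 100))  # prints "Accuracy: 75.00%"
```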


### HP Optimisation: Layered tuning

The following code repeats the procedure outlined above, but this time tunes parameters within the model itself, such as the number of hidden neurons and the activation function for each layer.

In [None]:
# Set a random seed so the model is replicable 
np.random.seed(0) 

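# activation and output_dim hold one entry per layer, so their defaults must be sequences too
def call_model(optimizer='rmsprop', init='normal', activation=('tanh', 'tanh', 'softmax'), output_dim=(50, 50, 3)):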
    model = Sequential()
    model.add(Dense(input_dim=X_train.shape[1], 
                output_dim = output_dim[0], 
                init=init, 
                activation = activation[0]))
    model.add(Dense(input_dim= output_dim[0], 
                output_dim = output_dim[1], 
                init=init, 
                activation= activation[1]))
    model.add(Dense(input_dim = output_dim[1], 
                output_dim = output_dim[2], 
                init=init, 
                activation=activation[2]))
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Call instance of the model
model = KerasClassifier(build_fn=call_model)

# This time we are casting a grid search over activation functions and number of neurons per layer
epochs = [50]
batches = [15]
activations = [['tanh','tanh','softmax'],['sigmoid','sigmoid','softmax']]
outputs = [[50,50,3],[75,75,3]]

param_grid = dict(nb_epoch=epochs, batch_size=batches, activation=activations, output_dim = outputs, verbose=[1])

gscv = GridSearchCV(estimator=model, param_grid=param_grid)
model_fit = gscv.fit(X_train, y_train_ohe)

In [34]:
# summarize results
print("Best: {:.4f} using {}\n".format(model_fit.best_score_, model_fit.best_params_))
for params, mean_score, scores in model_fit.grid_scores_:
    print("Mean Score: {:.4f}  StDev: {:.4f} with: \n {}".format(scores.mean(), scores.std(), params))

Best: 0.8113 using {'output_dim': [50, 50, 3], 'activation': ['tanh', 'tanh', 'softmax'], 'batch_size': 15, 'nb_epoch': 50, 'verbose': 1}

Mean Score: 0.8113  StDev: 0.0924 with: 
 {'output_dim': [50, 50, 3], 'activation': ['tanh', 'tanh', 'softmax'], 'batch_size': 15, 'nb_epoch': 50, 'verbose': 1}
Mean Score: 0.7673  StDev: 0.0583 with: 
 {'output_dim': [75, 75, 3], 'activation': ['tanh', 'tanh', 'softmax'], 'batch_size': 15, 'nb_epoch': 50, 'verbose': 1}
Mean Score: 0.6792  StDev: 0.0408 with: 
 {'output_dim': [50, 50, 3], 'activation': ['sigmoid', 'sigmoid', 'softmax'], 'batch_size': 15, 'nb_epoch': 50, 'verbose': 1}
Mean Score: 0.6792  StDev: 0.0308 with: 
 {'output_dim': [75, 75, 3], 'activation': ['sigmoid', 'sigmoid', 'softmax'], 'batch_size': 15, 'nb_epoch': 50, 'verbose': 1}
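
For comparison, a similar layered search can be sketched with scikit-learn's own `MLPClassifier`. This is not the Keras model from this notebook and rests on several assumptions: it uses the bundled `load_wine` data, it standardises the features inside a pipeline, and it relies on `MLPClassifier` applying one activation to all hidden layers, with `'logistic'` as its name for the sigmoid. The step names `scale` and `mlp` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=500, random_state=0)),
])

# Two hidden-layer layouts x two activations = 4 combinations;
# the output layer is sized and activated by MLPClassifier itself
param_grid = {
    "mlp__hidden_layer_sizes": [(50, 50), (75, 75)],
    "mlp__activation": ["tanh", "logistic"],
}

layer_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
layer_search.fit(X_wine, y_wine)

print("Best: {:.4f} using {}".format(layer_search.best_score_, layer_search.best_params_))
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, which is how the grid reaches into the `mlp` step.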
