# HyperParameter Optimization

Hyper-parameter optimization is the task of finding an optimal or near optimal (locally) set of hyper-parameters (or free parameters that are set manually or externally outside of a learning algorithm's self-adjustment of its internal parameters). In other words, we are searching for a good configuration of the various "knobs" one must set to achieve good generalization/out-of-sample performance.

To learn more about Convolutional Neural networks refer to this <a href= 'https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/'>Intutive Explanation.</a> 
Here, we create a Convolutional Neural Network trained on MNIST Dataset using Keras.

In [None]:
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import numpy as np

batch_size = 128
num_classes = 2
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Only look at 2s and 7s
train_picks = np.logical_or(y_train==2,y_train==7)
test_picks = np.logical_or(y_test==2,y_test==7)

x_train = x_train[train_picks]
x_test = x_test[test_picks]
y_train = np.array(y_train[train_picks]==7,dtype=int)
y_test = np.array(y_test[test_picks]==7,dtype=int)


if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# Convolutional Neural Network
model = Sequential()
model.add(Conv2D(4, kernel_size=(3, 3),activation='relu',input_shape=input_shape))
model.add(Conv2D(8, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

These are the values we get without Hyperparameter optimization:

Test Loss# = 4.89%, 
Test Accuracy# = 98.34%
('#' -> Results might differ)


Here, there are various parameters that can be tweaked to improve the accuracy of the network. They are:
1. Epochs.
2. Batch Size.
3. Optimization Algorithm.
4. Network Weight Initialization.
5. Activation Functions.
6. Dropout Regularization.
7. Number of Neurons.

Keras models can be used in scikit-learn by wrapping them with the KerasClassifier or KerasRegressor class. In this example, we can use scikit-learn's GridSearchCV function to perform Hyper Parameter Optimization with KerasClassifier.


It can be used as follows:
```python
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
def create_model():
	...
	return model

model = KerasClassifier(build_fn=create_model)
```


## 1. Epochs

In machine-learning parlance, an epoch is a complete pass through a given dataset. That is, by the end of one epoch, your neural network – be it a restricted Boltzmann machine, convolutional net or deep-belief network – will have been exposed to every record to example within the dataset once. Not to be confused with an iteration, which is simply one update of the neural net model’s parameters. Many iterations can occur before an epoch is over. Epoch and iteration are only synonymous if you update your parameters once for each pass through the whole dataset.

In the neural network we used a value of 12 for the **Epoch** parameter. We will experiment with the values of epochs to see if we can improve the accuracy. We will select *[10,20,30,40]* as the values to be fed to the neural network.


In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense

# Model that will be used for GridSearch
def build_model(epochs):
    print("\nThe current number of epochs are {}\n".format(epochs))
    model = Sequential()
    model.add(Conv2D(4, kernel_size=(3, 3),activation='relu',input_shape=input_shape))
    model.add(Conv2D(8, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))
    
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    return model

# Assigning 'model' to KerasClassifier to be used with GridSearchCV
model = KerasClassifier(build_fn = build_model)
parameters = {'epochs': [10, 20, 30, 40]}
grid_search = GridSearchCV(estimator = model,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 2)
grid_search = grid_search.fit(x_train, y_train)
print("\nThe best parameter is {}".format(grid_search.best_params_))
print("\nThe best_accuracy is {}".format(grid_search.best_score_))

### Output

In [4]:
#The best parameter is {'epochs':40}
#The best_accuracy is 0.9937004008835801

We use ***Loss VS Epochs*** and ***Accuracy Vs Epochs*** Graphs to understand what is happening
<table><tr><td><img src="./Epochs/AccuracyEpoch.png" width = "1000"/></td><td><img src="./Epochs/LossEpoch.png" width ="1000"/></td></tr></table>

From the above graphs we can conclude two things:
1. The accuracy is increasing as number of Epochs increase.
2. The loss is decreasing as number of Epochs increase.

The Epochs in the neural network can be increased, but after certain number of epochs the accuracy won't increase. So while training a neural network if a **Flat line** is observed in either Accuracy or Loss, there is no meaning in increasing the number of Epochs as this won't increase the Accuracy by a great difference.


## 2. Batch Size

Batch size defines number of samples that going to be propagated through the network.
For instance, let's say you have 1000 training samples and you want to set up batch_size equal to 100. Algorithm takes first 100 samples (from 1st to 100th) from the training dataset and trains network. Next it takes second 100 samples (from 101st to 200th) and train network again. We can keep doing this procedure until we will propagate through the networks all samples. The selection of Batch Size is a tricky task because:

1. The higher the batch size, the more memory space you'll need but the accuracy can be pretty good
2. The smaller the batch the less accurate estimate of the gradient. In the figure below you can see that mini-batch (green color) gradient's direction fluctuates compare to the full batch (blue color).

<img src = "./BatchSize.png" height = "500" width = "700" align= "center" />

*Stochastic is just a mini-batch with batch_size equal to 1. Gradient changes its direction even more often than a mini-batch.*

In the neural network, we used batch_size as 128. We will be experimenting with *[32, 64, 128, 192, 256]*  as batch_size values to be fed to the neural network to see it's effect on Loss and Accuracy.


In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense

# Model that will be used for GridSearch
def build_model(batch_size):
    print("\nThe current number of batch size is {}\n".format(batch_size))
    model = Sequential()
    model.add(Conv2D(4, kernel_size=(3, 3),activation='relu',input_shape=input_shape))
    model.add(Conv2D(8, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))
    
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    return model

# Assigning 'model' to KerasClassifier to be used with GridSearchCV
model = KerasClassifier(build_fn = build_model)
parameters = {'batch_size' : [32, 64, 128, 192, 256]}
grid_search = GridSearchCV(estimator = model,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 2)
grid_search = grid_search.fit(x_train, y_train)
print("\nThe best parameter is {}".format(grid_search.best_params_))
print("\nThe best_accuracy is {}".format(grid_search.best_score_))

### Output

In [2]:
#The best parameter is {'batch_size': 128, 'epochs': 40}
#The best_accuracy is 0.9947639695655731

<table><tr><td><img src="./Accuracy.png" width = "1000"/></td><td><img src="./Loss.png" width ="1000"/></td></tr></table>

Instead of finding the best parameter independently, both Epochs and Batch Size can be given as input to the GridSearchCV function. There is one thing to be noted here: 
The Best parameter for Epoch is 40 and Batch Size is 128. However, the best parameters can truly be found after giving both Epoch and Batch size at the same time as input to the GridSearchCV function.

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
def build_model(epochs,batch_size):
    print("\nThe current number of epochs are {}\nThe current number of batch size is {}\n".format(epochs,batch_size))
    model = Sequential()
    model.add(Conv2D(4, kernel_size=(3, 3),activation='relu',input_shape=input_shape))
    model.add(Conv2D(8, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))
    
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn = build_model)
parameters = {'epochs': [10,20,30,40],
              'batch_size' : [32,64,128,192,256]}
grid_search = GridSearchCV(estimator = model,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 2)
grid_search = grid_search.fit(x_train, y_train)
print("\nThe best parameter is {}".format(grid_search.best_params_))
print("\nThe best_accuracy is {}".format(grid_search.best_score_))

### Output

In [1]:
#The best parameter is {'batch_size': 64, 'epochs': 40}
#The best_accuracy is 0.9952548474188007

From the result above we can see that we get an increase in accuracy, but this comes at a cost of heavy computation. GridSearchCV looks at all the parameters given to it via the parameter grid (*param_grid*) which leads to perform every combination in the parameters to find the best results. An alternative to this can be RandomizedSearchCV function. RandomizedSearchCV is computationally less expensive as it uses randomness to find the best attributes instead of trying each and every parameter in the parameter grid.

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV
from keras.models import Sequential
from keras.layers import Dense
def build_model(epochs,batch_size):
    print("\nThe current number of epochs are {}\nThe current number of batch size is {}\n".format(epochs,batch_size))
    model = Sequential()
    model.add(Conv2D(4, kernel_size=(3, 3),activation='relu',input_shape=input_shape))
    model.add(Conv2D(8, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))
    
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn = build_model)
parameters = {'epochs': [10,20,30,40],
              'batch_size' : [32,64,128,192,256]}
random_search = RandomizedSearchCV(estimator = model,
                           param_distributions = parameters,
                           scoring = 'accuracy',
                           cv = 2)
random_search = random_search.fit(x_train, y_train)
print("\nThe best parameter is {}".format(random_search.best_params_))
print("\nThe best_accuracy is {}".format(random_search.best_score_))

### Output

In [3]:
#The best parameter is {'epochs': 20, 'batch_size': 128}
#The best_accuracy is 0.9936185879080423


Here, we are able to get an accuracy of **99.36%** which is just **0.1636%** less from the GridSearchCV but very faster.

Conclusions:
1. RandomizedSearchCV takes less time than GridSearchCV to find results which are pretty good and acceptable.

## 3. Optimization Algorithm

Keras offers many optimization algorithms that can be used to train the Neural Network. A list of all optimizers with their Arguments are listed <a href = "https://keras.io/optimizers/"> here.</a>

Here, we will be experimenting with SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, Nadam for improving the accuracy of the Neural Network. We will be using the best parameters from RandomizedSearchCV i.e batch_size = 128, epochs = 20.

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
def build_model(optimizer):
    print("\nThe current optimizer is {}\n".format(optimizer))
    model = Sequential()
    model.add(Conv2D(4, kernel_size=(3, 3),activation='relu',input_shape=input_shape))
    model.add(Conv2D(8, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))
    
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=optimizer,
                  metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn = build_model, epochs = 20 , batch_size = 128)
parameters = {'optimizer': ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']}
grid_search = GridSearchCV(estimator = model,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 2)
grid_search = grid_search.fit(x_train, y_train)
print("\nThe best parameter is {}".format(grid_search.best_params_))
print("\nThe best_accuracy is {}".format(grid_search.best_score_))

### Output

In [5]:
#The best parameter is {'optimizer': 'Nadam'}
#The best_accuracy is 0.9936185879080423

There is no hard and fast rule to choose an optimization algorithm. So, it is better to try everything. But, I'd suggest to look out for the latest optimizers implemented and try them first.

## 4. Network Weight Initialization

Usually weights in a Neural Network are selected to be random. But now there is a <a href = "https://keras.io/initializers/#usage-of-initializers">list</a> of Weight initializors in Keras that can be used. These do have an effect on the performance of the Neural Network.

We'll be using uniform, lecun_uniform, normal, zero, glorot_normal, glorot_uniform, he_normal, he_uniform for finding the best accuracy 

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
def build_model(init_mode):
    print("\nThe current init_mode is {}\n".format(init_mode))
    model = Sequential()
    model.add(Conv2D(4, kernel_size=(3, 3),activation='relu',input_shape=input_shape))
    model.add(Conv2D(8, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))
    
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer='Nadam',
                  metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn = build_model, epochs = 20 , batch_size = 128)
parameters = {'init_mode': ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']}
grid_search = GridSearchCV(estimator = model,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 2)
grid_search = grid_search.fit(x_train, y_train)
print("\nThe best parameter is {}".format(grid_search.best_params_))
print("\nThe best_accuracy is {}".format(grid_search.best_score_))

### Output

In [6]:
#The best parameter is {'init_mode': 'zero'}
#The best_accuracy is 0.9949275955166489

## 5. Activation Functions

