# Optimizing Neural Network Hyperparameters

We will consider an image recognition problem using the MNIST dataset of handwritten digits (28 x 28 images). The MNIST dataset has a training set of 60,000 images and a test set of 10,000 images. The digits have been size-normalized and centered in a fixed-size image.
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

In this notebook you will follow two different approaches to tune the hyperparameters:
- a "trial and error" approach;
- "grid search" hyperparameter optimization with the Scikit-Learn wrapper.

### Data Preparation

In [1]:
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data() # train_x, train_y, test_x, test_y

Using TensorFlow backend.


In [2]:
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
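
The reshape-and-scale step above can be checked on a small synthetic array without downloading MNIST. This is only a sketch: `fake_images` is a made-up stand-in, not real data.

```python
import numpy as np

# Synthetic stand-in for MNIST: 4 grayscale images, 28 x 28, uint8 pixels in [0, 255].
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Flatten each image into a 784-dimensional vector, then scale pixels to [0, 1].
flat = fake_images.reshape((fake_images.shape[0], 28 * 28))
scaled = flat.astype("float32") / 255

print(scaled.shape)  # (4, 784)
print(scaled.dtype)  # float32
```

Scaling the inputs to a small, fixed range generally makes gradient-based training better behaved than feeding raw 0-255 pixel values.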

In [3]:
from keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
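
To see what one-hot encoding produces, here is a minimal NumPy sketch of the same idea. The helper `one_hot` is a hypothetical illustration and is not the Keras implementation of `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    # Put a 1 in the column given by each label.
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

example = one_hot([5, 0, 4], num_classes=10)
print(example[0])  # a 1 at index 5, zeros elsewhere
```

This layout matches the 10-unit softmax output used later: each row is a target probability distribution over the ten digits.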

In [4]:
split_size = int(train_images.shape[0]*0.7)

train_images, val_images = train_images[:split_size], train_images[split_size:]
train_labels, val_labels = train_labels[:split_size], train_labels[split_size:]
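
The split above is a plain hold-out split: the first 70% of the rows are kept for training and the remaining 30% become the validation set, with no shuffling. A sketch using row indices as a stand-in for the real arrays:

```python
import numpy as np

# Stand-in for the 60,000 training rows: one index per row.
rows = np.arange(60000)

split_size = int(rows.shape[0] * 0.7)  # same 70% rule as the cell above
train_part, val_part = rows[:split_size], rows[split_size:]

print(len(train_part), len(val_part))  # 42000 18000
```

These sizes match the "Train on 42000 samples, validate on 18000 samples" message printed during training.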

### Building the Model

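Build a neural network with three layers (input, hidden, and output). The input layer is implied by the 784-dimensional input, so the model contains two Dense layers:
- a Dense layer with 50 hidden units and an appropriate activation function;
- a Dense layer with 10 output units and an appropriate activation function.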

First we define some useful parameters:

In [5]:
# define vars
input_num_units = 784
hidden_num_units = 50
output_num_units = 10

epochs = 5
batch_size = 128
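
As a rough guide to what `epochs` and `batch_size` mean for training cost, the sketch below counts weight updates, assuming the 42,000-sample training split created earlier and one update per mini-batch.

```python
import math

train_samples = 42000  # 70% of the 60,000 training images
batch_size = 128
epochs = 5

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)  # 329 1645
```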

**Import the Keras packages that you think you may need.**

In [6]:
# Import Keras modules
from keras.models import Sequential
from keras.layers import Dense

**Create the model as described above.**

In [7]:
# create model
model = Sequential([
    # Hidden layer: input_dim is only needed on the first layer
    Dense(units=hidden_num_units, input_dim=input_num_units, activation='relu'),
    # Output layer: one probability per digit class
    Dense(units=output_num_units, activation='softmax'),
])



**Configure the model with an optimizer and an appropriate loss function.**

In [8]:
# Compile the model with necessary attributes
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
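
To make the loss choice concrete, here is a NumPy sketch of softmax followed by categorical cross-entropy on one-hot targets. The function names, the clipping constant `eps`, and the sample values are illustrative assumptions; Keras's internal implementation may differ in detail.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(one_hot_targets, probs, eps=1e-7):
    # Mean over samples of -log(probability assigned to the true class).
    clipped = np.clip(probs, eps, 1.0)
    return float(-(one_hot_targets * np.log(clipped)).sum(axis=1).mean())

logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.2, 3.0]])   # made-up scores
targets = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # one-hot labels
probs = softmax(logits)
loss = categorical_crossentropy(targets, probs)
print(probs.sum(axis=1), loss)
```

The loss shrinks toward zero as the probability assigned to the correct class approaches 1, which is why it pairs naturally with a softmax output layer.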

**Train the model (also with the validation set).**

In [9]:
# Train the model
trained_model = model.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size, validation_data=(val_images, val_labels))



Train on 42000 samples, validate on 18000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Model Evaluation

**Test the model: get some predictions and evaluate the model.**

In [10]:
# Predicted class = index of the highest softmax probability
pred = model.predict(test_images).argmax(axis=1)

In [11]:
test_loss, test_acc = model.evaluate(test_images, test_labels) 



In [12]:
print('test_acc:', test_acc)

test_acc: 0.9572
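
The accuracy metric can be reproduced by hand from predicted probabilities and one-hot labels. The arrays below are made-up illustrative values, not outputs of the model above.

```python
import numpy as np

# Made-up softmax outputs for 4 samples and 3 classes (illustrative only).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
one_hot_labels = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])

predicted = probs.argmax(axis=1)          # most probable class per sample
actual = one_hot_labels.argmax(axis=1)    # undo the one-hot encoding
accuracy = float((predicted == actual).mean())

print(predicted, accuracy)  # [0 1 2 0] 0.75
```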


Let's try to improve it by tuning some Hyperparameters.

### Hyperparameters Optimization - Trial and Error

Some important parameters to look out for while optimizing neural networks are:
- Type of architecture;
- Number of layers;
- Number of neurons per layer;
- Regularization parameters;
- Learning rate;
- Type of optimization/backpropagation technique;
- Dropout rate;
- Weight sharing;

Now repeat all the previous steps (train, test, etc.), tuning the following parameters:
1. Make the model "wide": increase the number of neurons in the hidden layer;
2. Make the model "deep": increase the number of hidden layers, with 50 neurons each;
3. Add dropout to deal with overfitting;
4. Increase the number of epochs to 50;
5. Make the model both "wide" and "deep": more hidden layers, each with more than 50 neurons.

After every step, analyse your results and draw some conclusions.
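
One way to compare these variants before training them is to count trainable parameters. The sketch below uses the layer sizes from this notebook; the helper `dense_param_count` is a hypothetical name, and dropout layers are ignored because they add no parameters.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

architectures = {
    "baseline (1 x 50)": [784, 50, 10],
    "wide (1 x 500)": [784, 500, 10],
    "deep (5 x 50)": [784, 50, 50, 50, 50, 50, 10],
    "wide and deep (5 x 500)": [784, 500, 500, 500, 500, 500, 10],
}

for name, sizes in architectures.items():
    print(f"{name}: {dense_param_count(sizes):,} parameters")
```

Going wide multiplies the parameter count by roughly ten, while stacking five narrow layers adds comparatively few parameters, so the two changes affect capacity and training time very differently.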

### 1. Make the model "wide": increase number of neurons in the hidden layer.

**Define the new variables.**

In [13]:
input_num_units = 784
hidden_num_units = 500
output_num_units = 10
epochs = 5
batch_size = 128

**Build the network.**

In [14]:
model_1 = Sequential([
    Dense(units=hidden_num_units, input_dim=input_num_units, activation='relu'),
    Dense(units=output_num_units, activation='softmax'),
])


**Configure the network.**

In [15]:
model_1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

**Train the network.**

In [16]:
trained_model_1 = model_1.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size, validation_data=(val_images, val_labels))



Train on 42000 samples, validate on 18000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Considerations: what can you notice from these results? Is your model performing better than before?

**Results**: the model should perform significantly better than before.

### 2. Make the model "deep": Increase the number of hidden layers.

**Define the new variables.**

In [17]:
input_num_units = 784
hidden1_num_units = 50
hidden2_num_units = 50
hidden3_num_units = 50
hidden4_num_units = 50
hidden5_num_units = 50
output_num_units = 10

epochs = 5
batch_size = 128

**Build the network.**

In [18]:
model_2 = Sequential([
    Dense(units=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dense(units=hidden2_num_units, activation='relu'),
    Dense(units=hidden3_num_units, activation='relu'),
    Dense(units=hidden4_num_units, activation='relu'),
    Dense(units=hidden5_num_units, activation='relu'),
    Dense(units=output_num_units, activation='softmax'),
])

  


**Configure the network.**

In [19]:
model_2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

**Train the network.**

In [20]:
trained_model_2 = model_2.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size, validation_data=(val_images, val_labels))



Train on 42000 samples, validate on 18000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Considerations: what can you notice from these results? Is your model performing better than before?

**Results**: the model performs slightly worse than in the previous step. This may be due to a small amount of overfitting. To deal with this, we will use the dropout technique.
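
A simple way to look for overfitting is to compare training and validation accuracy epoch by epoch. The sketch below works on a plain dictionary shaped like the `history` attribute of the object returned by `fit`. The key names `accuracy` and `val_accuracy` are an assumption (older Keras versions use `acc` and `val_acc`), and the numbers are hypothetical, not results from this notebook.

```python
def generalization_gap(history):
    """Per-epoch difference between training and validation accuracy."""
    return [round(train - val, 4)
            for train, val in zip(history["accuracy"], history["val_accuracy"])]

# Hypothetical values for illustration only; NOT results from this notebook.
fake_history = {
    "accuracy": [0.90, 0.95, 0.97],
    "val_accuracy": [0.91, 0.94, 0.945],
}

gaps = generalization_gap(fake_history)
print(gaps)
```

A gap that keeps growing while validation accuracy stalls suggests overfitting; a small or negative gap suggests the model may simply need more training.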

### 3. Dropout

**Define the new variables; remember to also define the `dropout_ratio`.**

In [21]:
input_num_units = 784
hidden1_num_units = 50
hidden2_num_units = 50
hidden3_num_units = 50
hidden4_num_units = 50
hidden5_num_units = 50
output_num_units = 10

epochs = 5
batch_size = 128

dropout_ratio = 0.2
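
Before building the network, it may help to see what a dropout layer does to its input during training. The following is a NumPy sketch of "inverted" dropout with a rate of 0.2; `dropout_forward` is a hypothetical helper written for illustration and is not the Keras `Dropout` layer itself.

```python
import numpy as np

def dropout_forward(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of units and rescale the survivors."""
    if not training or rate == 0.0:
        return activations  # dropout is a no-op at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
acts = np.ones((1000, 50), dtype="float32")
dropped = dropout_forward(acts, rate=0.2, rng=rng)

zero_fraction = float((dropped == 0).mean())
print(round(zero_fraction, 3), round(float(dropped.mean()), 3))
```

Roughly 20% of the units are zeroed on each pass and the rest are scaled up, so the network cannot rely on any single unit, which is the regularizing effect this step is after.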

**Build the network.**

In [22]:
from keras.layers import Dropout

# Each hidden layer is followed by a Dropout layer with the same ratio
model_3 = Sequential([
    Dense(units=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(dropout_ratio),
    Dense(units=hidden2_num_units, activation='relu'),
    Dropout(dropout_ratio),
    Dense(units=hidden3_num_units, activation='relu'),
    Dropout(dropout_ratio),
    Dense(units=hidden4_num_units, activation='relu'),
    Dropout(dropout_ratio),
    Dense(units=hidden5_num_units, activation='relu'),
    Dropout(dropout_ratio),
    Dense(units=output_num_units, activation='softmax'),
])



**Configure the network.**

In [23]:
model_3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

**Train the network.**

In [24]:
trained_model_3 = model_3.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size, validation_data=(val_images, val_labels))



Train on 42000 samples, validate on 18000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Considerations: what can you notice from these results? Is your model improving?

**Results**: There seem to be some problems: the model is not performing well enough. One possible reason is that we are not using enough epochs to train the model. Let's try increasing the number of training epochs.

### 4. Increase training Epochs to 50.

This will take a while.

**Define the new variables.**

In [25]:
input_num_units = 784
hidden1_num_units = 50
hidden2_num_units = 50
hidden3_num_units = 50
hidden4_num_units = 50
hidden5_num_units = 50
output_num_units = 10

epochs = 50
batch_size = 128

**Build the network.**

In [26]:
model_4 = Sequential([
    Dense(units=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.2),
    Dense(units=hidden2_num_units, activation='relu'),
    Dropout(0.2),
    Dense(units=hidden3_num_units, activation='relu'),
    Dropout(0.2),
    Dense(units=hidden4_num_units, activation='relu'),
    Dropout(0.2),
    Dense(units=hidden5_num_units, activation='relu'),
    Dropout(0.2),
    Dense(units=output_num_units, activation='softmax'),
])

  


**Configure the network.**

In [27]:
model_4.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

**Train the network.**

In [28]:
trained_model_4 = model_4.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size, validation_data=(val_images, val_labels))



Train on 42000 samples, validate on 18000 samples
Epoch 1/50
...
Epoch 50/50


Considerations: what can you notice from these results? Has the accuracy increased?

**Results**: The results now look better: there is an increase in accuracy.

### 5. Make the model "wide" and "deep": more hidden layers, each with more than 50 neurons.

**Define the new variables.**

In [29]:
input_num_units = 784
hidden1_num_units = 500
hidden2_num_units = 500
hidden3_num_units = 500
hidden4_num_units = 500
hidden5_num_units = 500
output_num_units = 10

epochs = 25
batch_size = 128

**Build the network.**

In [30]:
model_5 = Sequential([
    Dense(units=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.2),
    Dense(units=hidden2_num_units, activation='relu'),
    Dropout(0.2),
    Dense(units=hidden3_num_units, activation='relu'),
    Dropout(0.2),
    Dense(units=hidden4_num_units, activation='relu'),
    Dropout(0.2),
    Dense(units=hidden5_num_units, activation='relu'),
    Dropout(0.2),
    Dense(units=output_num_units, activation='softmax'),
])

  


**Configure the network.**

In [31]:
model_5.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

**Train the network. Use 25 epochs if 50 takes too long.**

In [32]:
trained_model_5 = model_5.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size, validation_data=(val_images, val_labels))



Train on 42000 samples, validate on 18000 samples
Epoch 1/25
...
Epoch 25/25


Considerations: what do you think about your final model?

**Results**: We finally have a model that performs very well. Let's evaluate it with some predictions.

**Make some predictions and evaluate the network.**

In [33]:
pred_final = model_5.predict(test_images).argmax(axis=1)  # class with the highest probability
print(pred_final)

[7 2 1 ... 4 5 6]


In [34]:
test_loss_final, test_acc_final = model_5.evaluate(test_images, test_labels) 



In [35]:
print('test_acc:', test_acc_final)

test_acc: 0.9807


### Hyperparameters Optimization - Grid Search

Instead of proceeding with a "trial and error" approach, we can also use grid search to try combinations of all the hyperparameters we want to tune, or some of them. To do this, use the Keras `Sequential()` model as part of the Scikit-Learn workflow via the wrappers.
Check out how this workflow works.


Please note that without a GPU it is extremely time-consuming to tune all the hyperparameters in one shot with an appropriate number of epochs. The goal of this example is therefore for you to understand how to use `GridSearchCV` with a Keras model; you will probably not obtain an excellent model.

For this reason, try to tune only the number of neurons in the hidden layers (more than one hidden layer) with just 5-10 epochs.
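
To see why tuning everything at once is expensive, the sketch below enumerates a grid with `itertools.product`. Only the `neurons` values come from this notebook; the `dropout_rate` and `batch_size` entries and the fold count are hypothetical additions used to show how quickly the number of training runs grows.

```python
from itertools import product

param_grid = {
    "neurons": [50, 100, 250, 500],   # values used in this notebook
    "dropout_rate": [0.0, 0.2, 0.5],  # hypothetical extra dimension
    "batch_size": [32, 128],          # hypothetical extra dimension
}

names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

cv_folds = 3  # assumed number of cross-validation folds
print(len(combinations), len(combinations) * cv_folds)  # 24 72
```

Every extra hyperparameter multiplies the number of models to train, which is why this exercise restricts the search to a single dimension.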

**Import `GridSearchCV` and `KerasClassifier`.**

In [36]:
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.constraints import maxnorm

In [37]:
input_num_units = 784
output_num_units = 10

**Create a function called `create_model` that builds a Keras model whose hidden layers all have a number of units given by a parameter (you can call it `neurons`); `KerasClassifier` will wrap this function. Compile the model inside the function as well.**

In [38]:
# Function to create model, required for KerasClassifier
def create_model(neurons=1):
    # Five hidden layers of `neurons` units each, every one followed by dropout
    model = Sequential()
    model.add(Dense(units=neurons, input_dim=input_num_units, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(units=neurons, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(units=neurons, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(units=neurons, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(units=neurons, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(units=output_num_units, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [39]:
# Fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
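
Seeding NumPy makes its random draws repeatable, as the short check below shows. Note that this alone may not make Keras training fully deterministic, since the backend keeps its own random state.

```python
import numpy as np

np.random.seed(7)
first = np.random.rand(3)

np.random.seed(7)  # reset to the same seed
second = np.random.rand(3)

print(np.array_equal(first, second))  # True
```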


**Create a model wrapper.**

In [40]:
model = KerasClassifier(build_fn=create_model, epochs=5, batch_size=32, verbose=0)

**Create a parameter grid dictionary for the number of neurons in the hidden layers.**

In [41]:
neurons = [50, 100, 250, 500]
param_grid = dict(neurons=neurons) 

**Grid search: use `GridSearchCV` with the model you obtained from the wrapper as `estimator` and the dictionary you just created as `param_grid`.**

In [42]:
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=None)
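
`GridSearchCV` scores each candidate by cross-validation: the training data is cut into folds, and each fold takes a turn as the validation set. The sketch below is a simplified, unshuffled and unstratified version of that idea; `k_fold_indices` is a hypothetical helper, and the actual splitting strategy and number of folds depend on the `cv` argument and the scikit-learn version.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs over contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each candidate value of `neurons` is therefore trained once per fold, and the reported score is the mean over the folds.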

**Fit the grid search and call the result `grid_result`.**

In [43]:
grid_result = grid.fit(train_images, train_labels)



**Let's print some results. Fill in the #TO DOs with the `best_score_` and `best_params_` that you got after fitting.**

In [44]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) #TO DO, #TO DO
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
    

Best: 0.966071 using {'neurons': 250}
0.946095 (0.003069) with: {'neurons': 50}
0.961833 (0.001098) with: {'neurons': 100}
0.966071 (0.002458) with: {'neurons': 250}
0.965262 (0.000962) with: {'neurons': 500}
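
When reading these scores, it is worth checking whether the winner is clearly ahead. The sketch below re-uses the means and standard deviations printed above and compares the top two candidates; the "within one standard deviation" rule is just one informal way to judge a tie.

```python
# Mean and standard deviation of the cross-validated accuracy, copied from the output above.
neurons = [50, 100, 250, 500]
means = [0.946095, 0.961833, 0.966071, 0.965262]
stds = [0.003069, 0.001098, 0.002458, 0.000962]

best_index = max(range(len(means)), key=means.__getitem__)
best_neurons = neurons[best_index]

# Compare the winner with the 500-neuron candidate.
gap = means[best_index] - means[neurons.index(500)]
within_one_std = gap < stds[best_index]

print(best_neurons, round(gap, 6), within_one_std)  # 250 0.000809 True
```

Since the gap between 250 and 500 neurons is smaller than the spread across folds, the two settings are effectively tied on this run, and the smaller network is the cheaper choice.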
