#### Model Optimization

In this notebook I experiment with the different hyperparameters inherent in building neural networks. I hold the architecture of the network itself constant by defining a function that builds identical networks with fixed features. I then fit these identical networks with a wide range of hyperparameters, including the optimization function itself, the learning rate, decay, and type of loss calculation.

In [2]:
%run __initremote__.py

Using TensorFlow backend.


x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples


In [3]:
early_stop = keras.callbacks.EarlyStopping(monitor='val_acc', 
                                           min_delta=0, 
                                           patience=5, 
                                           verbose=0, 
                                           mode='auto')

In [4]:
def standard_network():
    model = Sequential()
    model.add(Conv2D(filters=32, kernel_size=(3,3), padding='same', input_shape=x_train.shape[1:]))
    model.add(Activation('relu'))
    model.add(Conv2D(32, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))


    model.add(Conv2D(64, (3, 3), padding='same'))
    model.add(Activation('relu'))
    model.add(Conv2D(64, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Flatten())
    model.add(Dense(512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10))
    model.add(Activation('softmax'))
    
    model.summary()
    return model

### Learning Rate

In [5]:
opt_1 = standard_network()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 32, 32, 32)        896       
_________________________________________________________________
activation_1 (Activation)    (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_2 (Activation)    (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 15, 15, 64)        18496     
__________

In [9]:
opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-6)
opt_1.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [12]:
opt_1.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

Train on 50000 samples, validate on 10000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100


<keras.callbacks.History at 0x7f8eb64d22e8>

In [13]:
opt_1.evaluate(x_test, y_test)



[0.69683952236175539, 0.7722]

This is the baseline model, recreated from the Keras example repository for CIFAR-10. What happens to the model's performance if the learning rate is increased? What if it is decreased?

In [14]:
opt_2_lr_inc = standard_network()

opt = keras.optimizers.RMSprop(lr=0.001, decay=1e-6)
opt_2_lr_inc.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

opt_2_lr_inc.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 32, 32, 32)        896       
_________________________________________________________________
activation_7 (Activation)    (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_8 (Activation)    (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 15, 15, 64)        18496     
__________

<keras.callbacks.History at 0x7f8eb4e234a8>

In [15]:
opt_2_lr_inc.evaluate(x_test, y_test)



[1.5240188974380493, 0.51859999999999995]

The learning rate controls how large a step the optimizer takes when it updates the weights after each backward pass. In other words, it sets how far the model moves along the gradient during gradient descent. When the learning rate increases, larger steps are made, which can lead to 'overshooting' the minimum of the loss function. Indeed, in the run above the model approaches 71% accuracy by epoch 7 (fairly quickly compared to the previous model) but drops from there, finishing at about 52% test accuracy. Instead of making very small steps towards the minimum, the model makes big, clunky steps and overshoots.
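To make the overshooting idea concrete, here is a minimal sketch using plain gradient descent on the toy function f(w) = w**2. This is an illustration only: the function, the starting point, and the helper name `gradient_descent` are assumptions chosen for the example, and the notebook's RMSprop optimizer adapts its steps in ways this toy does not.

```python
def gradient_descent(lr, steps=20, w=5.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the path of w."""
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w = w - lr * grad     # step size scales with the learning rate
        path.append(w)
    return path


# A small rate creeps towards 0, a moderate rate converges quickly,
# and a rate that is too large jumps past 0 and moves further away each step.
for lr in (0.01, 0.1, 1.1):
    final_w = gradient_descent(lr)[-1]
    print(f"lr={lr}: |w| after 20 steps = {abs(final_w):.4f}")
```

On this toy problem each update multiplies w by (1 - 2 * lr), so any rate above 1.0 makes the distance to the minimum grow instead of shrink.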

In [16]:
opt_2_lr_dec = standard_network()

opt = keras.optimizers.RMSprop(lr=0.00001, decay=1e-6)
opt_2_lr_dec.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

opt_2_lr_dec.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_9 (Conv2D)            (None, 32, 32, 32)        896       
_________________________________________________________________
activation_13 (Activation)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_14 (Activation)   (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_7 (Dropout)          (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 15, 15, 64)        18496     
__________

Epoch 98/100
Epoch 99/100
Epoch 100/100


<keras.callbacks.History at 0x7f8e8c5d7e10>

With a lower learning rate, even after 100 epochs the model continues to learn. However, despite many epochs, validation accuracy remains below the accuracy achieved after only 34 epochs with the standard learning rate.

Below I will run the model for another 100 epochs to see whether it continues to learn, and if so for how long. 

In [17]:
opt_2_lr_dec.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

Train on 50000 samples, validate on 10000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
11552/50000 [=====>........................] - ETA: 16s - loss: 0.8680 - acc: 0.6971


Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
 1184/50000 [..............................] - ETA: 20s - loss: 0.8542 - acc: 0.7010


Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100


In [21]:
opt_2_lr_dec.evaluate(x_test, y_test)



[0.79080277261734011, 0.73229999999999995]

Decreasing the learning rate, in this case, drastically slows the approach towards the minimum loss. In the first iteration of this neural network, 34 epochs were sufficient to produce a validation accuracy of 77%. With the lower learning rate, however, the model is still learning even after 150 epochs, because it makes incredibly small steps at each epoch. In this case, the extra compute time is not worth any perceived gain in the model's ability to classify. Indeed, it still only scores 73% validation accuracy even after so many epochs.

### Decay

Let's try a new rate of decay and see how that affects the model. Instead of 1e-6, let's try a smaller value, 1e-7.

In [6]:
opt_2_dec_inc = standard_network()

opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-7)
opt_2_dec_inc.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

history = opt_2_dec_inc.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_5 (Conv2D)            (None, 32, 32, 32)        896       
_________________________________________________________________
activation_7 (Activation)    (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_8 (Activation)    (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 15, 15, 64)        18496     
__________

Changing the decay rate appears to offer promising results. The validation accuracy reaches 78% on some epochs before early stopping (patience of 5) is triggered at just above 77%. It also does so in 53 epochs, far fewer than the run with the lower learning rate above.
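The reason training can end slightly below its best validation accuracy is the patience rule. The sketch below is a simplified stand-in for that rule, not the Keras `EarlyStopping` implementation itself: the helper name `early_stopping_epoch` and the accuracy values are made up for illustration, and the exact bookkeeping in Keras may differ between versions.

```python
def early_stopping_epoch(val_acc, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_acc, start=1):
        if acc - min_delta > best:
            best = acc    # new best score: reset the counter
            wait = 0
        else:
            wait += 1     # another epoch without improvement
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation accuracies, not taken from the runs in this notebook.
val_acc = [0.70, 0.75, 0.78, 0.77, 0.775, 0.776, 0.772, 0.771, 0.79]
print(early_stopping_epoch(val_acc))
```

With these made-up values the best score arrives at epoch 3, the next five epochs fail to beat it, and training stops at epoch 8 with a score below that best.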

Let's try increasing the decay rate and see how it affects things.

In [7]:
opt_2_dec_dec = standard_network()

opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-5)
opt_2_dec_dec.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

history_dec_up = opt_2_dec_dec.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_9 (Conv2D)            (None, 32, 32, 32)        896       
_________________________________________________________________
activation_13 (Activation)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_14 (Activation)   (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_7 (Dropout)          (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 15, 15, 64)        18496     
__________


Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100


Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100


Increasing the decay to 1e-5 does not seem to help or hinder the model in any drastic way. What if we made a bigger change to the decay?

In [8]:
opt_2_dec_del = standard_network()

opt = keras.optimizers.RMSprop(lr=0.0001, decay=1e-2)
opt_2_dec_del.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

history_dec_del = opt_2_dec_del.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_13 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
activation_19 (Activation)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_20 (Activation)   (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_7 (MaxPooling2 (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_10 (Dropout)         (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_15 (Conv2D)           (None, 15, 15, 64)        18496     
__________

Raising the decay by several orders of magnitude, to 1e-2, clearly hurts the network's ability to learn. Even after one hundred epochs, the network barely breaks 40% accuracy. Let's leave decay for now and test different optimization functions themselves.
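A rough calculation helps explain why a decay of 1e-2 is so damaging. The sketch below assumes the time-based schedule used by the legacy Keras optimizers, in which the effective rate after a given number of batch updates is `lr / (1 + decay * iterations)`. The helper name `decayed_lr` is hypothetical, and the step count assumes 50,000 training samples with a batch size of 32, as in the runs above.

```python
import math


def decayed_lr(initial_lr, decay, iterations):
    """Effective learning rate under time-based decay (assumed formula)."""
    return initial_lr / (1.0 + decay * iterations)


# One iteration is one batch update; the last batch of an epoch is partial.
steps_per_epoch = math.ceil(50000 / 32)

for decay in (1e-7, 1e-6, 1e-5, 1e-2):
    lr_1 = decayed_lr(1e-4, decay, steps_per_epoch)
    lr_100 = decayed_lr(1e-4, decay, 100 * steps_per_epoch)
    print(f"decay={decay:g}: lr after 1 epoch = {lr_1:.3e}, "
          f"after 100 epochs = {lr_100:.3e}")
```

Under this assumption the three small decay values leave the learning rate nearly unchanged within an epoch, while 1e-2 cuts it by more than a factor of ten before the first epoch is over.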

### Optimizers

The simplest kind of optimizer is stochastic gradient descent (SGD). At each iteration, SGD estimates the gradient of the loss from a small batch of samples and steps in the opposite direction, with the hope of approaching a minimum of the loss function. Here I implement it with plain, out-of-the-box settings (no momentum and no decay).

In [9]:
opt_3_sgd = standard_network()

opt = keras.optimizers.SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)
opt_3_sgd.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

history_sgd = opt_3_sgd.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_17 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
activation_25 (Activation)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_18 (Conv2D)           (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_26 (Activation)   (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_9 (MaxPooling2 (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_13 (Dropout)         (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_19 (Conv2D)           (None, 15, 15, 64)        18496     
__________

Standard SGD actually does really well at training the model. We do notice some overfitting happening, but in 75 epochs the network reaches 80% validation accuracy. Let's check against the test set.

In [10]:
opt_3_sgd.evaluate(x_test, y_test)



[0.62109199433326723, 0.8075]

Let's try out some different hyperparameters here.

In [11]:
opt_3_sgd_mo = standard_network()

opt = keras.optimizers.SGD(lr=0.01, momentum=0.1, decay=0.0, nesterov=False)
opt_3_sgd_mo.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

history_sgd_mo = opt_3_sgd_mo.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_21 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
activation_31 (Activation)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_22 (Conv2D)           (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_32 (Activation)   (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_16 (Dropout)         (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_23 (Conv2D)           (None, 15, 15, 64)        18496     
__________

In [23]:
opt_3_sgd_mo.evaluate(x_test, y_test)



[0.59636540532112126, 0.79749999999999999]

In [12]:
opt_3_sgd_nes = standard_network()

opt = keras.optimizers.SGD(lr=0.01, momentum=0.5, decay=0.0, nesterov=True)
opt_3_sgd_nes.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

history_sgd_nes = opt_3_sgd_nes.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_25 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
activation_37 (Activation)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_26 (Conv2D)           (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_38 (Activation)   (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_13 (MaxPooling (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_19 (Dropout)         (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_27 (Conv2D)           (None, 15, 15, 64)        18496     
__________

In [21]:
history_sgd_nes.history['val_acc'][-1]

0.80200000000000005

In [22]:
opt_3_sgd_nes.evaluate(x_test, y_test)



[0.62306540794372556, 0.80200000000000005]

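Here we see that enabling Nesterov momentum and raising momentum to 0.5 reaches 80.2% test accuracy in 50 epochs. That is slightly better than the 79.75% from momentum of 0.1, and comparable to the 80.75% plain SGD reached in 75 epochs.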
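For readers unfamiliar with the `momentum` and `nesterov` arguments, the sketch below shows one common formulation of the update for a single weight. It is written to mirror how I understand the legacy Keras `SGD` optimizer to apply the update, which is an assumption rather than a quote of its source. The helper names `sgd_step` and `minimize` and the toy loss f(w) = w**2 are illustrative only.

```python
def sgd_step(w, v, grad, lr=0.01, momentum=0.0, nesterov=False):
    """Apply one SGD update to weight w with velocity v; return (w, v)."""
    v = momentum * v - lr * grad           # velocity remembers past gradients
    if nesterov:
        w = w + momentum * v - lr * grad   # look ahead along the new velocity
    else:
        w = w + v                          # classical momentum step
    return w, v


def minimize(momentum, nesterov, steps=50, w=5.0):
    """Run SGD on the toy loss f(w) = w**2 and return the final weight."""
    v = 0.0
    for _ in range(steps):
        grad = 2.0 * w
        w, v = sgd_step(w, v, grad, momentum=momentum, nesterov=nesterov)
    return w


for momentum, nesterov in ((0.0, False), (0.5, False), (0.5, True)):
    w_final = minimize(momentum, nesterov)
    print(f"momentum={momentum}, nesterov={nesterov}: |w| = {abs(w_final):.4f}")
```

On this toy problem both momentum variants end much closer to the minimum than plain SGD for the same number of steps, which is consistent with the faster convergence seen above, although a one-dimensional quadratic says little about final accuracy on CIFAR-10.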

Let's try Adam.

In [24]:
opt_3_adam = standard_network()

opt = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
opt_3_adam.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

history_adam = opt_3_adam.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_29 (Conv2D)           (None, 32, 32, 32)        896       
_________________________________________________________________
activation_43 (Activation)   (None, 32, 32, 32)        0         
_________________________________________________________________
conv2d_30 (Conv2D)           (None, 30, 30, 32)        9248      
_________________________________________________________________
activation_44 (Activation)   (None, 30, 30, 32)        0         
_________________________________________________________________
max_pooling2d_15 (MaxPooling (None, 15, 15, 32)        0         
_________________________________________________________________
dropout_22 (Dropout)         (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_31 (Conv2D)           (None, 15, 15, 64)        18496     
__________

In [26]:
opt = keras.optimizers.Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
opt_3_adam.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

history_adam = opt_3_adam.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

Train on 50000 samples, validate on 10000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100


In [27]:
opt = keras.optimizers.Adam(lr=0.00001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
opt_3_adam.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

history_adam = opt_3_adam.fit(x_train, y_train,
              batch_size=32,
              epochs=100,
              validation_data=(x_test, y_test),
              shuffle=True,
              callbacks=[early_stop])

Train on 50000 samples, validate on 10000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100


The same model can be recompiled with an adjusted learning rate; recompiling swaps in a new optimizer but keeps the weights learned so far. By starting with a relatively high learning rate, one can reach a good approximation of the minimum in the first series of passes. Lowering the learning rate afterwards can then fine-tune the model if small percentage gains are desired.
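The coarse-to-fine idea can be illustrated with a toy problem. The sketch below minimizes f(w) = |w| with fixed-size steps, so the weight ends up bouncing around the minimum within a distance of roughly one learning rate. The helper names `sign` and `run_stage`, the starting point, and the three rates are assumptions made for the example; this is not the Adam training run above.

```python
def sign(x):
    """Return 1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def run_stage(w, lr, steps=20):
    """Take fixed-size gradient steps on f(w) = |w| and return the new w."""
    for _ in range(steps):
        w = w - lr * sign(w)   # the gradient of |w| is sign(w)
    return w


# Stage 1 covers most of the distance; later stages tighten the result.
w = 10.3
for lr in (1.0, 0.1, 0.01):
    w = run_stage(w, lr)
    print(f"after stage with lr={lr}: |w| = {abs(w):.4f}")

# Same total budget (60 steps) spent entirely at the smallest rate.
print(f"lr=0.01 only: |w| = {abs(run_stage(10.3, 0.01, steps=60)):.4f}")
```

Spending the whole budget at the smallest rate leaves the weight far from the minimum, while the staged schedule gets close quickly and then refines the result.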

In [28]:
opt_3_adam.evaluate(x_test, y_test)



[0.66349807536602023, 0.81499999999999995]

Here, we were able to produce a model that tests at 81.5% accuracy. To preserve computing time and power, let's save this model and move on to adding data augmentation in the next notebook.

In [29]:
opt_3_adam.save('60_epoch_adam.h5')