# Explore Optimizers

In this notebook, we'll use the MNIST dataset to explore different optimizers and their parameters.

In [1]:
# get the MNIST data functions
# matplotlib for plotting
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

## Let's start with the MNIST Dataset again!

In [2]:
## mnist.load_data() will automatically download the dataset if you don't have it
(MNIST_train_X, MNIST_train_y), (MNIST_test_X, MNIST_test_y) = mnist.load_data()

### A bit of preprocessing on the data before we can train (teach) our network

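For today, you can skip the details. Consider it a "necessary evil". It's really not evil, but it is necessary: the code flattens each 28x28 image into a vector of 784 values, scales the pixel values to the range 0 to 1, and one-hot encodes the labels.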

In [3]:
MNIST_train_X = MNIST_train_X.reshape((60000, 28 * 28))
MNIST_train_X = MNIST_train_X.astype('float32') / 255

MNIST_test_X = MNIST_test_X.reshape((10000, 28 * 28))
MNIST_test_X = MNIST_test_X.astype('float32') / 255

from keras.utils import to_categorical

MNIST_train_y = to_categorical(MNIST_train_y)
MNIST_test_y = to_categorical(MNIST_test_y)
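
If you're curious what the preprocessing above does, here is a minimal sketch on made-up data. The tiny 2x2 "images" and the `labels` array are assumptions for illustration only, and `np.eye` stands in for `to_categorical` so the example runs with NumPy alone.

```python
import numpy as np

# Hypothetical stand-in for MNIST: four tiny 2x2 "images" with pixel values 0..255.
images = np.array([[[0, 255], [128, 64]]] * 4, dtype="uint8")
labels = np.array([0, 2, 1, 2])

# Flatten each image into one row, then scale the pixels into the range [0, 1].
flat = images.reshape((4, 2 * 2)).astype("float32") / 255

# One-hot encode the labels: row i has a 1 in column labels[i] and 0 elsewhere.
one_hot = np.eye(labels.max() + 1, dtype="float32")[labels]

print(flat.shape)
print(one_hot)
```

The real cell does the same thing at a larger scale: 60000 images of 28 * 28 pixels and 10 label classes.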

In [4]:
from keras import models
from keras import layers

# Build the architecture of the network

Let's stick with the same one from the last exercise; this way we can see how the results change with different optimizers.

###  Optimizers with default settings
Now we'd like to compare some optimizers.
Let's take a look at what's available out of the box in Keras.

https://keras.io/optimizers/

There are a bunch of optimizers here!

Just like loss functions, there are a lot of choices and options.  Unless you spend time reading the papers on how each one was made, you should generally stick with the default options.  But, we can always experiment and try some things out.

I suggest you try

* `SGD` - Stochastic gradient descent, one of the earliest optimizers; the underlying idea dates back to Robbins and Monro (1951).
* `RMSprop` - Recommended in the book "Deep Learning with Python" (Manning).
* `Adam`


For all these optimizers, we'll follow the keras instructions and leave everything except for the learning rate at default.

In fact, for now let's leave everything at the default, just to compare the 3 optimizers.
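
To get a feeling for how the three optimizers differ, here is a rough sketch of their update rules in plain NumPy. These are simplified textbook versions, not the Keras implementations, and the toy loss, `SCALES`, and the `run_*` helper names are assumptions made up for this illustration. The default learning rates match the Keras defaults (`0.01` for SGD, `0.001` for RMSprop and Adam).

```python
import numpy as np

SCALES = np.array([10.0, 1.0])  # one steep direction, one shallow direction

def loss(w):
    # Toy loss for illustration only (not the MNIST loss).
    return 0.5 * np.sum(SCALES * w**2)

def grad(w):
    # Gradient of the toy loss above.
    return SCALES * w

def run_sgd(w, lr=0.01, steps=100):
    # Plain SGD: step against the gradient, scaled by the learning rate.
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_rmsprop(w, lr=0.001, rho=0.9, eps=1e-7, steps=100):
    # RMSprop: divide each gradient by a running RMS of recent gradients.
    avg_sq = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        avg_sq = rho * avg_sq + (1 - rho) * g**2
        w = w - lr * g / (np.sqrt(avg_sq) + eps)
    return w

def run_adam(w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7, steps=100):
    # Adam: running averages of the gradient and its square, with bias correction.
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

start = np.array([1.0, 1.0])
w_sgd = run_sgd(start)
w_rms = run_rmsprop(start)
w_adam = run_adam(start)

for name, w in [("SGD", w_sgd), ("RMSprop", w_rms), ("Adam", w_adam)]:
    print(name, w, loss(w))
```

In this toy setup, RMSprop and Adam move both coordinates by nearly the same amount even though one gradient is ten times larger, while plain SGD moves the steep coordinate much faster than the shallow one. That per-parameter rescaling is the main idea behind the adaptive optimizers.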

In [7]:
network = models.Sequential() #we'll stick to sequential for this course

network.add(layers.Dense(512, activation='relu', input_shape=(784,)))  # Dense is the same as fully connected.
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop',  # your value here: 'sgd', 'rmsprop', or 'adam'
                loss='categorical_crossentropy',  # keep the loss fixed while comparing optimizers
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y, epochs=5, batch_size=128)


test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y)
print('test_acc:', test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
test_acc: 0.9775000214576721


RMSprop : 0.9775

SGD : 0.9135

Adam : 0.9750

# Did you notice many (or any) differences in your results with the different optimizers?

your answer here:

# Now let's adjust the learning rate.

Let's stick with the `rmsprop` optimizer.

We want to see how changing the learning rate (LR) affects things.
RMSprop defaults to an LR of `0.001`, so we can try one order of magnitude in each direction: `0.01` and `0.0001`.
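
Before running the real experiment, it can help to see the effect of the learning rate on a problem small enough to follow by hand. The sketch below uses plain gradient descent (not RMSprop) on the toy loss `f(w) = w**2`; the function name `gradient_descent` and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(lr, start=1.0, steps=50):
    # Minimize f(w) = w**2, whose gradient is 2 * w. The minimum is at w = 0.
    w = start
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

# One learning rate that is too small, one reasonable, one too large.
final = {lr: gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}

for lr, w in final.items():
    print(f"lr={lr}: w after 50 steps = {w:.6g}")
```

With the smallest rate, `w` has barely moved from its starting point after 50 steps; with the middle rate it is very close to 0; with the largest rate every step overshoots the minimum and `w` grows instead of shrinking.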

### In order to tune the parameters of the optimizer, we'll have to import it and instantiate it.

In [8]:
from keras import optimizers

In [14]:
network = models.Sequential() #we'll stick to sequential for this course

network.add(layers.Dense(512, activation='relu', input_shape=(784,)))  # Dense is the same as fully connected.
network.add(layers.Dense(10, activation='softmax'))

# set the LR when we instantiate the optimizer
# (older Keras releases name this argument `lr` instead of `learning_rate`)
rmsprop = optimizers.RMSprop(learning_rate=0.01)  # start with 0.001 (default) to establish a baseline

#pass our optimizer object below

network.compile(optimizer=rmsprop,
                loss='categorical_crossentropy',
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y, epochs=5, batch_size=128)

test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y)
print('test_acc:', test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
test_acc: 0.9740999937057495


## What difference do you notice when you varied the Learning Rate?

If you didn't notice much of a difference, try it again with more extreme values. Try `0.1` or `0.00001`.

Your answer here:

0.01 : 0.9740

0.1 : 0.8858

0.00001 : 0.8978

0.0000001 : 0.0883

## Let's vary the Momentum.  

To vary the momentum, we'll have to use an optimizer that supports momentum.  Let's head over to the Keras documentation and choose one.

I see that `SGD`, or stochastic gradient descent, uses momentum.  The value must be `>=0` (and in practice it is kept below `1`), so let's choose a few different values for it.
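
Here is a minimal sketch of what momentum does, written in plain Python rather than Keras. It assumes the common formulation in which a velocity term accumulates past gradients; the toy loss `0.5 * w**2` and the helper name `sgd_with_momentum` are made up for this illustration.

```python
def sgd_with_momentum(momentum, lr=0.01, start=1.0, steps=100):
    # Minimize f(w) = 0.5 * w**2, whose gradient is simply w.
    w, velocity = start, 0.0
    for _ in range(steps):
        g = w
        # The velocity keeps a decaying memory of previous steps.
        velocity = momentum * velocity - lr * g
        w = w + velocity
    return w

# Same learning rate and number of steps, different momentum values.
results = {m: sgd_with_momentum(m) for m in (0.0, 0.5, 0.9)}

for m, w in results.items():
    print(f"momentum={m}: w after 100 steps = {w:.6g}")
```

In this toy setup, higher momentum ends up closer to the minimum at 0 after the same number of steps, because consecutive gradients point the same way and the velocity builds up.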

In [21]:
network = models.Sequential() #we'll stick to sequential for this course

network.add(layers.Dense(512, activation='relu', input_shape=(784,)))  # Dense is the same as fully connected.
network.add(layers.Dense(10, activation='softmax'))

# set the momentum when we instantiate the optimizer
sgd = optimizers.SGD(momentum=0.28)  # always run the default (momentum=0.0) first for a baseline value
 
#pass our optimizer object below

network.compile(optimizer=sgd,
                loss='categorical_crossentropy',
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y, epochs=5, batch_size=128)

test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y)
print('test_acc:', test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
test_acc: 0.9212999939918518


### Momentum Values
0.0 : 0.9121

0.08 : 0.9161

0.1 : 0.9158

0.28 : 0.9212

0.3 : 0.9204

0.8 : 0.9480

## These experiments can be extended by also varying the epochs and batch size.

Obviously they all interact with one another, so feel free to play with all of them now and try to get a feeling for what they do.
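
One way to see the interaction is to count how many weight updates the network actually gets. The small helper below is a sketch (the name `weight_updates` is made up here); it assumes one update per batch, with the last partial batch in each epoch still counting as a step.

```python
import math

def weight_updates(num_samples, batch_size, epochs):
    # One update per batch; a final partial batch still counts as a step.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs

# The settings used in this notebook: 60000 training images, batch_size=128, epochs=5.
print(weight_updates(60000, 128, 5))  # 469 steps per epoch * 5 epochs = 2345

# A smaller batch size means more (but noisier) updates for the same number of epochs.
print(weight_updates(60000, 32, 5))   # 1875 steps per epoch * 5 epochs = 9375
```

Keep this count in mind when you change the learning rate: a smaller step size needs more updates to cover the same distance.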

If the learning rate is too low -- the model won't learn, but would increasing the epochs help with that?  What could be possible reasons to do this?

How might you change your strategy based on how much data you have available to you?

your answers here: