# Explore Loss Functions, Batch Size and Epochs

In this notebook, we'll use the MNIST dataset to explore loss functions and the epoch and batch size parameters.

In [1]:
# get the MNIST data functions
# matplotlib for plotting
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

## Let's start with the MNIST Dataset

In [2]:
## mnist.load_data() will automatically download the dataset if you don't have it
(MNIST_train_X, MNIST_train_y), (MNIST_test_X, MNIST_test_y) = mnist.load_data()

I've renamed our previous variables and introduced some more notation.  We always call the images `X`.  The name tells us two things.  The letter 'x' is commonly used in machine learning to refer to the 'features', which are how we describe the data in numbers.  In the case of pictures, our features are just the pixel values.  The `X` is upper-case because it represents a matrix.  Each column of `X` is a single feature, and each row is a sample.

In contrast, we always call our labels (answers) `y`.  It's just common notation to have `y` be the labels.  It's a lower-case `y` because it's a vector, a single list of labels.
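To make the matrix/vector distinction concrete, here is a small sketch using random stand-in data rather than real MNIST images. The names `images`, `X` and `y` below are only for illustration.

```python
import numpy as np

# Stand-in for a small batch of images: 5 samples, each 28x28 pixels.
# (Random values, not real MNIST data.)
images = np.random.randint(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each image into one row of 784 pixel features.
X = images.reshape((images.shape[0], 28 * 28))

# One integer label per sample.
y = np.array([3, 0, 7, 1, 9])

print(X.shape)  # 2-D: rows are samples, columns are features
print(y.shape)  # 1-D: one label per row of X
```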

It goes without saying, but I'll say it anyway -- always make sure your `X` and `y` indices line up! If you preprocess or shuffle anything, apply the same reordering to both so they stay aligned.
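One common way to keep the two aligned is to draw a single permutation and index both arrays with it. This is a minimal sketch on toy data (the arrays here are made up so that a mismatch would be easy to spot), not the only way to do it.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy data: row i of X is filled with the value i, and y[i] == i,
# so a misaligned shuffle would be easy to spot.
X = np.repeat(np.arange(6).reshape(6, 1), 4, axis=1)
y = np.arange(6)

# Draw one permutation and use it to index both arrays.
perm = rng.permutation(len(X))
X_shuffled = X[perm]
y_shuffled = y[perm]

# Every row still matches its label.
print(np.all(X_shuffled[:, 0] == y_shuffled))
```

Shuffling `X` and `y` with two separate calls would draw two different orderings and silently break the pairing.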

### A bit of preprocessing on the data before we can train (teach) our network

For today, just ignore this.  Consider it a "Necessary evil". It's really not evil, but it is necessary.

In [3]:
MNIST_train_X = MNIST_train_X.reshape((60000, 28 * 28))
MNIST_train_X = MNIST_train_X.astype('float32') / 255

MNIST_test_X = MNIST_test_X.reshape((10000, 28 * 28))
MNIST_test_X = MNIST_test_X.astype('float32') / 255

from keras.utils import to_categorical

MNIST_train_y = to_categorical(MNIST_train_y)
MNIST_test_y = to_categorical(MNIST_test_y)
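If you are curious what the cell above does, here is a NumPy-only sketch of the two ideas: scaling pixel values into the range 0 to 1, and turning integer labels into one-hot rows. The helper `one_hot` is a hypothetical stand-in written for illustration; it is not the Keras `to_categorical` implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a 1 at each label's index."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# Pixel values start as integers in 0..255; dividing by 255 maps them to 0..1.
pixels = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = pixels.astype("float32") / 255

# Each label becomes a row of ten numbers with a single 1.
labels = np.array([2, 0, 9])
encoded = one_hot(labels, num_classes=10)
print(encoded[0])  # 1.0 at index 2, zeros elsewhere
```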

In [4]:
from keras import models
from keras import layers

# Build the architecture of the network

Let's stick with the same one from the last exercise, so we can see how it behaves with different loss functions.

### We compile the network with an optimizer, loss function and metric

Now we'd like to compare some loss functions.
Let's take a look at what's available out of the box in Keras:

https://keras.io/losses/

There are a bunch of losses here!

Choosing a loss is actually a very nuanced decision, but the important point for you is that most of the ones designed for classification will work pretty well on MNIST.
In fact, even the ones not designed for classification will work well!

Go ahead and try a few different ones.

I suggest you try

* `mape` (meant for regression)
* `mse` (meant for regression)
* `hinge` (meant for classification)
* `categorical_hinge`  (meant for multi-class classification)
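To get a feel for what a loss actually measures, here is a NumPy sketch of three losses applied to a single one-hot label. These follow the usual textbook definitions; the Keras versions may differ in details such as clipping and averaging, and `mape` and `hinge` are left out to keep the sketch short. The vectors `good` and `bad` are made-up predictions.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences across the class dimension.
    return np.mean((y_true - y_pred) ** 2, axis=-1)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Negative log of the probability assigned to the true class.
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=-1)

def categorical_hinge(y_true, y_pred):
    # Penalise the best wrong class for scoring within a margin of 1 of the true class.
    pos = np.sum(y_true * y_pred, axis=-1)
    neg = np.max((1.0 - y_true) * y_pred, axis=-1)
    return np.maximum(0.0, neg - pos + 1.0)

y_true = np.array([0.0, 0.0, 1.0])   # the true class is index 2
good = np.array([0.05, 0.05, 0.90])  # confident and correct
bad = np.array([0.70, 0.20, 0.10])   # confident and wrong

for loss in (mse, categorical_crossentropy, categorical_hinge):
    print(loss.__name__, loss(y_true, good), loss(y_true, bad))
```

All three assign a lower value to the correct prediction than to the wrong one, which is part of why even the regression losses can train a usable classifier here.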





In [8]:
network = models.Sequential() #we'll stick to sequential for this course

network.add(layers.Dense(512, activation='relu', input_shape=(784,)))  # Dense is the same as fully connected.
network.add(layers.Dense(10, activation='softmax'))

### Try different losses below

network.compile(optimizer='rmsprop',
                loss='mse',             # change this parameter here.
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y, epochs=5, batch_size=128)


test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y)
print('test_acc:', test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
test_acc: 0.9775999784469604


* `hinge`: 0.97259998
* `categorical_hinge`: 0.97259998
* `mse`: 0.97759997
* `mape`: 0.966300

# Did you notice many (or any) differences in your results with the different loss functions?

your answer here: Very little difference between the loss functions. Test accuracy ranged from 0.9663 (`mape`) to 0.9776 (`mse`), so `mse` performed best.


# Now let's adjust the batch size and epochs

Let's stick with the loss function `categorical_crossentropy`.
This is the most commonly used loss function for this type of problem.

We want to see how using different batch sizes and epochs affects the learning.

Let's try a larger batch size first.
Go ahead and run it with `1024` instead of `128`.

In [16]:
network = models.Sequential() #we'll stick to sequential for this course

network.add(layers.Dense(512, activation='relu', input_shape=(784,)))  # Dense is the same as fully connected.
network.add(layers.Dense(10, activation='softmax'))


network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y, epochs=5, batch_size=8)

test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y)
print('test_acc:', test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
test_acc: 0.9760000109672546


# What difference did you notice when you varied the batch size?

Try a range of numbers and see whether performance gets better or worse as you change the batch size.

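* `batch_size=8`: 0.9760
* `batch_size=16`: 0.9775
* `batch_size=32`: 0.9799
* `batch_size=64`: 0.9801
* `batch_size=1024`: 0.9627
* `batch_size=2048`: 0.9549
* `batch_size=4096`: 0.9298
* `batch_size=8192`: 0.9082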

# Next let's try the same experiment, but with the number of epochs

We've been doing `5` epochs; let's try `10`, and also fewer.  We'll go back to a batch size of 128 as a control.

In [19]:
network = models.Sequential() #we'll stick to sequential for this course

network.add(layers.Dense(512, activation='relu', input_shape=(784,)))  # Dense is the same as fully connected.
network.add(layers.Dense(10, activation='softmax'))


network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y, epochs=20, batch_size=128)

test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y)
print('test_acc:', test_acc)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
test_acc: 0.982699990272522


* `epochs=1`: 0.9581
* `epochs=10`: 0.9790
* `epochs=20`: 0.9826

# You should be able to get a feel for what's going on.

Try varying both epochs and batch sizes now.  You can even mix in a different loss function.
Which parameter seems to affect performance the most?
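If you want to try many combinations without editing the cell by hand each time, one option is a small loop over every pairing. The sketch below assumes a function `train_and_evaluate(batch_size, epochs)` that returns a score; that function is hypothetical, and a placeholder version is used here so the sketch runs without Keras.

```python
from itertools import product

def sweep(train_and_evaluate, batch_sizes, epoch_counts):
    """Call train_and_evaluate for every (batch_size, epochs) pair and collect the results."""
    results = {}
    for batch_size, epochs in product(batch_sizes, epoch_counts):
        results[(batch_size, epochs)] = train_and_evaluate(batch_size, epochs)
    return results

def fake_train_and_evaluate(batch_size, epochs):
    # Placeholder only: returns a made-up score, not a real accuracy.
    return 0.0

results = sweep(fake_train_and_evaluate,
                batch_sizes=[32, 128, 1024],
                epoch_counts=[1, 5, 10])

for (batch_size, epochs), score in sorted(results.items()):
    print(batch_size, epochs, score)
```

In the notebook, you would replace the placeholder with a function that builds a fresh `network`, calls `compile`, `fit` and `evaluate` as in the cells above, and returns `test_acc`. Building a new network inside the function matters, because otherwise each run would continue training from the previous run's weights.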

your answer here:

The smaller your batch size, the fewer epochs you need, because each epoch performs more weight updates.

The larger your batch size, the more epochs you need, because each epoch performs fewer weight updates.
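The arithmetic behind this can be sketched in a few lines. It assumes one weight update per batch and that the final, smaller batch still counts as an update; the helper name `updates_per_epoch` is made up for this example.

```python
import math

NUM_TRAIN_SAMPLES = 60000  # size of the MNIST training set used above

def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data (the last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

for batch_size in [8, 128, 1024, 8192]:
    per_epoch = updates_per_epoch(NUM_TRAIN_SAMPLES, batch_size)
    print(f"batch_size={batch_size:>5}: {per_epoch:>5} updates/epoch, "
          f"{5 * per_epoch:>6} updates in 5 epochs")
```

With a batch size of 8 the network gets 7500 updates per epoch, while with 8192 it gets only 8, so five epochs mean very different amounts of training in the two cases.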

## So, how much do the loss function, epochs and batch size matter?

## Try to sum up your intuitions here.

Your answer here:
