# Exploring MNIST And Deep Neural Networks

This Jupyter Notebook is a companion to [this article](). If you found this document some other way and feel you're missing some context, you might wish to read the article. The purpose of this document is to help you learn about deep neural networks and explore how changing a network's architecture affects its performance.

Before we can build any neural networks, we need to import a few things from Keras and prepare our data. The following code loads the MNIST dataset provided by Keras and flattens each 28x28 pixel image into a vector of length 784. It also converts the labels from numeric values 0-9 to one-hot encoded vectors.

In [28]:
import keras
from keras.datasets import mnist
from keras.layers import Dense
from keras.models import Sequential

# Preparing the dataset
# Setup train and test splits
(x_train, y_train), (x_test, y_test) = mnist.load_data()

image_size = 784 # 28 x 28
x_train = x_train.reshape(x_train.shape[0], image_size) 
x_test = x_test.reshape(x_test.shape[0], image_size)

# Convert class vectors to binary class matrices
num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
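
To make the two preprocessing steps concrete, here is a minimal sketch of what flattening and one-hot encoding do. It uses plain Python lists as a stand-in for the NumPy arrays above, and the helper names `flatten` and `one_hot` are made up for this illustration; they are not Keras functions.

```python
# Illustrative only: plain lists stand in for the NumPy arrays used in the notebook.

def flatten(image):
    """Concatenate the rows of a 2-D image into one flat vector."""
    return [pixel for row in image for pixel in row]

def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at index `label`."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

tiny_image = [[0, 255],
              [128, 64]]   # a made-up 2x2 "image"
flat = flatten(tiny_image)
encoded = one_hot(3, 10)

print(flat)     # [0, 255, 128, 64]
print(encoded)  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

A 28x28 image flattened the same way gives the 784-element vector the network expects.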

## First Network

Here is a first, simple network to solve MNIST. It has a single hidden layer with 32 nodes.

In [29]:
model = Sequential()

# The input layer requires the special input_shape parameter which should match
# the shape of our training data.
model.add(Dense(units=32, activation='sigmoid', input_shape=(image_size,)))
model.add(Dense(units=num_classes, activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_346 (Dense)            (None, 32)                25120     
_________________________________________________________________
dense_347 (Dense)            (None, 10)                330       
Total params: 25,450
Trainable params: 25,450
Non-trainable params: 0
_________________________________________________________________
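
The `Param #` column in the summary can be checked by hand: a dense layer has one weight for every input-output pair plus one bias per unit. The sketch below reproduces that arithmetic; `dense_param_counts` is a hypothetical helper written for this check, not part of Keras.

```python
def dense_param_counts(input_size, layer_sizes):
    """Parameters per dense layer: fan_in * units weights plus units biases."""
    counts = []
    fan_in = input_size
    for units in layer_sizes:
        counts.append(fan_in * units + units)
        fan_in = units  # this layer's outputs feed the next layer
    return counts

counts = dense_param_counts(784, [32, 10])
print(counts, sum(counts))  # [25120, 330] 25450
```

These match the 25,120 and 330 parameters reported for the two layers above.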


## Train & Evaluate The Network

This code trains and evaluates the model we defined above.

In [30]:
model.compile(optimizer="sgd", loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=128, epochs=5, verbose=True, validation_split=.1)
loss, accuracy  = model.evaluate(x_test, y_test, verbose=False)

print(f'Test loss: {loss:.3}')
print(f'Test accuracy: {accuracy:.3}')

Train on 54000 samples, validate on 6000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss: 0.512
Test accuracy: 0.887
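
The "Train on 54000 samples, validate on 6000 samples" line comes from `validation_split=.1`. According to the Keras documentation, the held-out samples are taken from the end of the training data before any shuffling. The sketch below mimics that split on a list of indices; `split_for_validation` is a hypothetical name and this is not Keras's actual implementation.

```python
def split_for_validation(samples, validation_split):
    """Hold out the trailing fraction of `samples` for validation."""
    num_val = round(len(samples) * validation_split)
    split_at = len(samples) - num_val
    return samples[:split_at], samples[split_at:]

# Indices stand in for the 60,000 MNIST training images.
train_part, val_part = split_for_validation(list(range(60000)), 0.1)

print(len(train_part), len(val_part))  # 54000 6000
print(val_part[0])                     # 54000
```

Note that the test set passed to `model.evaluate` is separate from both of these parts.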


## Some Helpers

There are a couple of things we're going to do a lot in this notebook: build a model and evaluate that model. These two functions will save us a bit of boilerplate overall. They will also help us compare "apples to apples", since every model built with `create_dense` and trained with `evaluate` uses the same hyperparameters and training regimen.

Both use some of the variables declared above, so both produce neural networks intended specifically for MNIST.

`create_dense` accepts a list of layer sizes and returns a Keras model of a fully connected neural network with the layer sizes specified. `create_dense([32, 64, 128])` will return a fully connected neural net with three hidden layers, the first with 32 nodes, the second with 64 nodes, and the third with 128 nodes. `create_dense` uses the `image_size` variable declared above, which means it assumes the input data will be a vector with 784 units. All the hidden layers use the sigmoid activation function; the output layer uses softmax.

`evaluate` prints a summary of the model, trains the model, and then prints the test loss and accuracy. This function always runs 5 training epochs and uses a fixed batch size of 128. It also uses the MNIST data from Keras that we processed above.

In [37]:
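def create_dense(layer_sizes):
    """Build a fully connected MNIST classifier with the given hidden layer sizes."""
    model = Sequential()
    # The first hidden layer also declares the input shape (a flat 784-pixel vector).
    model.add(Dense(units=layer_sizes[0], activation='sigmoid', input_shape=(image_size,)))

    # Remaining hidden layers.
    for s in layer_sizes[1:]:
        model.add(Dense(units=s, activation='sigmoid'))

    # Output layer: one softmax unit per digit class.
    model.add(Dense(units=num_classes, activation='softmax'))

    return model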

def evaluate(model):
    model.summary()
    model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=128, epochs=5, validation_split=.1, verbose=False)
    loss, accuracy  = model.evaluate(x_test, y_test, verbose=False)

    print()
    print(f'Test loss: {loss:.3}')
    print(f'Test accuracy: {accuracy:.3}')
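
As a quick way to see which architecture a given list of sizes produces, the sketch below lists the layers `create_dense` would stack, without needing Keras. `describe_dense` is a hypothetical helper for illustration only; it assumes the same 10-class softmax output used above.

```python
def describe_dense(layer_sizes, num_classes=10):
    """List (units, activation) for each layer create_dense would add."""
    layers = [(units, 'sigmoid') for units in layer_sizes]  # hidden layers
    layers.append((num_classes, 'softmax'))                 # output layer
    return layers

architecture = describe_dense([32, 64, 128])
for units, activation in architecture:
    print(f'{units:>4} units, {activation}')
```

For `[32, 64, 128]` this lists three sigmoid hidden layers followed by the 10-unit softmax output.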


## Comparing Longer Chains

The following code trains and evaluates models with different numbers of hidden layers. All the hidden layers have 32 nodes. The first model has 1 hidden layer, the second has 2, and so on, up to 4 hidden layers.

How did the deeper models compare to the shallower models?

In [38]:
for layers in range(1, 5):
    model = create_dense([32] * layers)
    evaluate(model)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_378 (Dense)            (None, 32)                25120     
_________________________________________________________________
dense_379 (Dense)            (None, 10)                330       
Total params: 25,450
Trainable params: 25,450
Non-trainable params: 0
_________________________________________________________________

Test loss: 0.493
Test accuracy: 0.89
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_380 (Dense)            (None, 32)                25120     
_________________________________________________________________
dense_381 (Dense)            (None, 32)                1056      
_________________________________________________________________
dense_382 (Dense)            (None, 10)                330       
Total params: 26,506
Trainable par

## Comparing Number Of Nodes Per Layer

Another way to add complexity is to add more nodes to each hidden layer. The following code creates several neural networks with a single hidden layer, each with more nodes in that layer than the last.
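
One rough measure of the complexity each width adds is the number of trainable parameters. For a single hidden layer of `h` nodes between 784 inputs and 10 outputs, that count works out to `795 * h + 10`, so it grows linearly with the width. The sketch below tabulates it; `single_hidden_layer_params` is a made-up helper name.

```python
def single_hidden_layer_params(hidden_units, input_size=784, num_classes=10):
    """Weights and biases of the hidden layer plus those of the output layer."""
    hidden = input_size * hidden_units + hidden_units
    output = hidden_units * num_classes + num_classes
    return hidden + output

widths = [32, 64, 128, 256, 512, 1024, 2048]
param_table = {h: single_hidden_layer_params(h) for h in widths}

for h, params in param_table.items():
    print(f'{h:>5} nodes: {params:>9,} parameters')
```

The values for 32 and 64 nodes (25,450 and 50,890) agree with the totals Keras prints in its model summaries.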

In [39]:
for nodes in [32, 64, 128, 256, 512, 1024, 2048]:
    model = create_dense([nodes])
    evaluate(model)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_392 (Dense)            (None, 32)                25120     
_________________________________________________________________
dense_393 (Dense)            (None, 10)                330       
Total params: 25,450
Trainable params: 25,450
Non-trainable params: 0
_________________________________________________________________

Test loss: 0.502
Test accuracy: 0.886
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_394 (Dense)            (None, 64)                50240     
_________________________________________________________________
dense_395 (Dense)            (None, 10)                650       
Total params: 50,890
Trainable params: 50,890
Non-trainable params: 0
_________________________________________________________________

Test loss: 0.373
Test accura

## More Nodes And More Layers

Now that we've looked at the number of nodes and the number of layers in isolation, let's look at what happens as we combine these two factors.

In [42]:
nodes_per_layer = 32
for layers in [1, 2, 3, 4, 5]:
    model = create_dense([nodes_per_layer] * layers)
    evaluate(model)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_428 (Dense)            (None, 32)                25120     
_________________________________________________________________
dense_429 (Dense)            (None, 10)                330       
Total params: 25,450
Trainable params: 25,450
Non-trainable params: 0
_________________________________________________________________

Test loss: 0.484
Test accuracy: 0.89
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_430 (Dense)            (None, 32)                25120     
_________________________________________________________________
dense_431 (Dense)            (None, 32)                1056      
_________________________________________________________________
dense_432 (Dense)            (None, 10)                330       
Total params: 26,506
Trainable par

In [43]:
nodes_per_layer = 128
for layers in [1, 2, 3, 4, 5]:
    model = create_dense([nodes_per_layer] * layers)
    evaluate(model)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_448 (Dense)            (None, 128)               100480    
_________________________________________________________________
dense_449 (Dense)            (None, 10)                1290      
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________

Test loss: 0.317
Test accuracy: 0.918
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_450 (Dense)            (None, 128)               100480    
_________________________________________________________________
dense_451 (Dense)            (None, 128)               16512     
_________________________________________________________________
dense_452 (Dense)            (None, 10)                1290      
Total params: 118,282
Trainable

In [44]:
nodes_per_layer = 512
for layers in [1, 2, 3, 4, 5]:
    model = create_dense([nodes_per_layer] * layers)
    evaluate(model)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_468 (Dense)            (None, 512)               401920    
_________________________________________________________________
dense_469 (Dense)            (None, 10)                5130      
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________

Test loss: 0.242
Test accuracy: 0.934
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_470 (Dense)            (None, 512)               401920    
_________________________________________________________________
dense_471 (Dense)            (None, 512)               262656    
_________________________________________________________________
dense_472 (Dense)            (None, 10)                5130      
Total params: 669,706
Trainable

## Apples To Apples Isn't Actually Fair

Unfortunately, comparing models of different complexity after the same amount of training isn't really fair. To increase accuracy we often have to add complexity, but a more complex model usually needs more training before that complexity pays off.

The `new_evaluate` function below supports this -- we can call `new_evaluate` repeatedly on the same model to continue training it from where we left off.

In [48]:
def new_evaluate(model):
    model.fit(x_train, y_train, batch_size=128, epochs=5, validation_split=.1, verbose=False)
    loss, accuracy  = model.evaluate(x_test, y_test, verbose=False)
    print(f'Test loss: {loss:.3}')
    print(f'Test accuracy: {accuracy:.3}')
    print()
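
Because `fit` continues from the model's current weights, each call to `new_evaluate` adds another 5 epochs on top of the training already done. The sketch below only does the bookkeeping, showing how many epochs a model has seen after each round; `cumulative_epochs` is a hypothetical helper, and the rounds are numbered from 0 to match the loops that follow.

```python
EPOCHS_PER_CALL = 5  # matches epochs=5 inside new_evaluate

def cumulative_epochs(num_rounds, epochs_per_call=EPOCHS_PER_CALL):
    """Pair each round index with the total epochs trained once it finishes."""
    return [(r, (r + 1) * epochs_per_call) for r in range(num_rounds)]

schedule = cumulative_epochs(5)
for round_index, total in schedule:
    print(f'Round {round_index}: {total} epochs so far')
```

So the accuracy printed for the last of five rounds reflects 25 epochs of training, not 5.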

In [49]:
nodes_per_layer = 32
for layers in [3, 4, 5]:
    model = create_dense([nodes_per_layer] * layers)
    model.summary()
    model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
    for i in range(5):
        print("Round ", i)
        new_evaluate(model)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_512 (Dense)            (None, 32)                25120     
_________________________________________________________________
dense_513 (Dense)            (None, 32)                1056      
_________________________________________________________________
dense_514 (Dense)            (None, 32)                1056      
_________________________________________________________________
dense_515 (Dense)            (None, 10)                330       
Total params: 27,562
Trainable params: 27,562
Non-trainable params: 0
_________________________________________________________________
Round  0
Test loss: 2.11
Test accuracy: 0.359

Round  1
Test loss: 1.77
Test accuracy: 0.613

Round  2
Test loss: 1.27
Test accuracy: 0.744

Round  3
Test loss: 0.867
Test accuracy: 0.802

Round  4
Test loss: 0.689
Test accuracy: 0.836

_________________________________

In [50]:
nodes_per_layer = 128
for layers in [3, 4, 5]:
    model = create_dense([nodes_per_layer] * layers)
    model.summary()
    model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
    for i in range(5):
        print("Round ", i)
        new_evaluate(model)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_527 (Dense)            (None, 128)               100480    
_________________________________________________________________
dense_528 (Dense)            (None, 128)               16512     
_________________________________________________________________
dense_529 (Dense)            (None, 128)               16512     
_________________________________________________________________
dense_530 (Dense)            (None, 10)                1290      
Total params: 134,794
Trainable params: 134,794
Non-trainable params: 0
_________________________________________________________________
Round  0
Test loss: 1.72
Test accuracy: 0.595

Round  1
Test loss: 0.922
Test accuracy: 0.83

Round  2
Test loss: 0.516
Test accuracy: 0.887

Round  3
Test loss: 0.377
Test accuracy: 0.908

Round  4
Test loss: 0.311
Test accuracy: 0.919

______________________________

In [51]:
nodes_per_layer = 256
for layers in [3, 4, 5]:
    model = create_dense([nodes_per_layer] * layers)
    model.summary()
    model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
    for i in range(5):
        print("Round ", i)
        new_evaluate(model)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_542 (Dense)            (None, 256)               200960    
_________________________________________________________________
dense_543 (Dense)            (None, 256)               65792     
_________________________________________________________________
dense_544 (Dense)            (None, 256)               65792     
_________________________________________________________________
dense_545 (Dense)            (None, 10)                2570      
Total params: 335,114
Trainable params: 335,114
Non-trainable params: 0
_________________________________________________________________
Round  0
Test loss: 1.25
Test accuracy: 0.739

Round  1
Test loss: 0.555
Test accuracy: 0.876

Round  2
Test loss: 0.374
Test accuracy: 0.905

Round  3
Test loss: 0.304
Test accuracy: 0.919

Round  4
Test loss: 0.267
Test accuracy: 0.928

_____________________________

## Forcing The Thing To Learn!

If you're really committed to a model, you can always try training it a *lot longer* and find out what happens. Below I reduce the batch size and dramatically increase the number of training epochs, which results in many more updates to the weights. Eventually ... it learns!
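
To put rough numbers on "many more updates": with mini-batch training, the weights are updated once per batch, so the total is the number of batches per epoch times the number of epochs. The sketch below compares the earlier settings (batch size 128, 5 epochs) with the ones used here (batch size 16, 30 epochs), assuming the same 54,000 training samples left after the validation split. `total_updates` is a hypothetical helper.

```python
import math

def total_updates(num_train, batch_size, epochs):
    """Mini-batch weight updates: batches per epoch (last one may be partial) times epochs."""
    return math.ceil(num_train / batch_size) * epochs

NUM_TRAIN = 54000  # 60,000 images minus the 10% validation split

baseline = total_updates(NUM_TRAIN, batch_size=128, epochs=5)
longer = total_updates(NUM_TRAIN, batch_size=16, epochs=30)

print(baseline, longer, round(longer / baseline))  # 2110 101250 48
```

That is roughly 48 times as many weight updates as in the earlier runs.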

In [52]:
model = create_dense([32] * 5)
model.summary()
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=16, epochs=30, validation_split=.1, verbose=True)
loss, accuracy  = model.evaluate(x_test, y_test, verbose=False)
print(f'Test loss: {loss:.3}')
print(f'Test accuracy: {accuracy:.3}')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_557 (Dense)            (None, 32)                25120     
_________________________________________________________________
dense_558 (Dense)            (None, 32)                1056      
_________________________________________________________________
dense_559 (Dense)            (None, 32)                1056      
_________________________________________________________________
dense_560 (Dense)            (None, 32)                1056      
_________________________________________________________________
dense_561 (Dense)            (None, 32)                1056      
_________________________________________________________________
dense_562 (Dense)            (None, 10)                330       
Total params: 29,674
Trainable params: 29,674
Non-trainable params: 0
_________________________________________________________________
Train 