<h1><center>MNIST classification using Keras</center></h1>

# Student : Frank Enrique Facundo Raime

# Importing Keras

In [1]:
# Import the Keras main module together with TensorFlow, which serves as its backend
import tensorflow as tf
import keras
print("Using tensorflow version " + str(tf.__version__))
print("Using keras version " + str(keras.__version__))

Using tensorflow version 2.1.0
Using keras version 2.3.1


Using TensorFlow backend.


## Loading and preparing the MNIST dataset

Load the MNIST dataset via keras.datasets. Again, turn train and test labels into one-hot encoding, and reshape and normalize data as in the first exercise. 

In [2]:
# The MNIST dataset is ready to be imported from Keras into RAM
# Warning: you cannot do that for larger databases (e.g., ImageNet)
from keras.datasets import mnist
# START CODE HERE

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# END CODE HERE


In [3]:
from keras.utils import to_categorical
# START CODE HERE
# One-hot encode the integer labels (10 digit classes)
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
# END CODE HERE
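
To make the one-hot step concrete, here is a minimal plain-Python sketch of what such an encoding does. The helper name `one_hot` and the three example labels are assumptions made for illustration; this is not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Return one row per label, with a single 1.0 at the label's index."""
    encoded = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
        row = [0.0] * num_classes
        row[label] = 1.0
        encoded.append(row)
    return encoded


# Three illustrative digit labels (not taken from the dataset)
example = one_hot([3, 0, 9], num_classes=10)
for row in example:
    print(row)
```

Each row contains exactly one 1.0, which is the target format expected by a categorical cross-entropy loss.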

In [4]:
# Flatten each 28x28 image into a 784-element vector, the input format
# expected by the fully connected (Dense) layers used below
# START CODE HERE

x_train = x_train.reshape((x_train.shape[0], 784))
x_test = x_test.reshape((x_test.shape[0], 784))
                         
# END CODE HERE

# Cast pixels from uint8 to float32
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# Now let us normalize the images. The exercise asks for zero mean and unit
# standard deviation; the code below instead rescales the pixels to [0, 1]
# Hint: are real testing data statistics known at training time?
# START CODE HERE
x_train /= 255.0
x_test /= 255.0
# END CODE HERE
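
The hint above points at an alternative to dividing by 255: standardizing with statistics computed on the training set only, then reusing them for the test set. The following is a minimal sketch of that idea on made-up pixel values; the helper names `fit_standardizer` and `standardize` and the small `eps` constant are assumptions, not part of the exercise.

```python
import statistics


def fit_standardizer(train_values):
    """Compute mean and standard deviation from training data only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std


def standardize(values, mean, std, eps=1e-7):
    """Shift and scale values using previously fitted statistics."""
    return [(v - mean) / (std + eps) for v in values]


# Made-up pixel intensities, for illustration only
train_pixels = [0.0, 0.0, 0.5, 1.0]
test_pixels = [0.25, 0.75]

mean, std = fit_standardizer(train_pixels)
train_std = standardize(train_pixels, mean, std)
# The test set reuses the training statistics; its own are never computed
test_std = standardize(test_pixels, mean, std)

print(statistics.mean(train_std), statistics.pstdev(train_std))
```

The training values end up with mean close to 0 and standard deviation close to 1, while the test values are only approximately standardized, which is expected.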

## Defining the neural network architecture (i.e., the network model)

Look at this [cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf) for some basic information on how to use Keras.

First, try to replicate the classifier of the first exercise. Second, create a fully connected network.
For the fully connected network, you can for example use this architecture:
$$ (784) \rightarrow (300) \rightarrow (10) $$
For this first implementation of the network, use only sigmoid activations in the hidden layer. Remember to use the right output activation function!
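
As a reminder of why the output activation matters, the sketch below compares element-wise sigmoids with a softmax on the same scores. The scores are made up for illustration and the functions are plain-Python stand-ins, not the Keras activations.

```python
import math


def sigmoid(z):
    """Squash a single score into (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up output scores for a 3-class problem
scores = [2.0, 1.0, -1.0]
sig = [sigmoid(s) for s in scores]
soft = softmax(scores)

print("sigmoid outputs sum to", sum(sig))
print("softmax outputs sum to", sum(soft))
```

Only the softmax outputs form a probability distribution over mutually exclusive classes, which is what a one-hot target assumes.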

#### Classifier of first exercise

In [5]:
# The Sequential module is a container for more complex NN elements and
# defines a loop-less NN architecture
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten

# START CODE HERE
input_shape = 784
output_shape = 10

model = Sequential()
model.add(Dense(output_shape, activation="sigmoid", input_shape = (input_shape,)))
model.add(Dense(output_shape, activation='sigmoid'))

model.summary()

# END CODE HERE

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 10)                7850      
_________________________________________________________________
dense_2 (Dense)              (None, 10)                110       
Total params: 7,960
Trainable params: 7,960
Non-trainable params: 0
_________________________________________________________________


#### Fully connected classifier with the architecture
$$ (784) \rightarrow (300) \rightarrow (10) $$

In [6]:
# The Sequential module is a container for more complex NN elements and
# defines a loop-less NN architecture
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten

# START CODE HERE
input_shape = 784
output_shape = 10

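model = Sequential()
# Note: this first Dense layer is an extra 784-unit hidden layer, so the network
# built here is 784 -> 784 -> 300 -> 10 rather than 784 -> 300 -> 10
model.add(Dense(input_shape, activation="sigmoid", input_shape = (input_shape,)))
model.add(Dense(300, activation='sigmoid'))
# Note: the output layer uses sigmoid here; softmax is the usual pairing
# with a categorical cross-entropy loss
model.add(Dense(output_shape, activation='sigmoid'))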

model.summary()

# END CODE HERE

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 784)               615440    
_________________________________________________________________
dense_4 (Dense)              (None, 300)               235500    
_________________________________________________________________
dense_5 (Dense)              (None, 10)                3010      
Total params: 853,950
Trainable params: 853,950
Non-trainable params: 0
_________________________________________________________________
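
The parameter counts in the two summaries above can be checked by hand: a Dense layer with `n_inputs` inputs and `n_units` units has `(n_inputs + 1) * n_units` parameters (weights plus one bias per unit). The helper names below are assumptions made for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units


def mlp_params(layer_sizes):
    """Per-layer parameter counts for a chain of Dense layers."""
    return [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]


# Layer sizes as built in the two cells above (input size first)
first = mlp_params([784, 10, 10])
second = mlp_params([784, 784, 300, 10])
# Layer sizes as written in the exercise statement
as_specified = mlp_params([784, 300, 10])

print(first, sum(first))
print(second, sum(second))
print(as_specified, sum(as_specified))
```

The first two results match the summaries (7,960 and 853,950 parameters). The third shows that the architecture as written in the exercise would be considerably smaller than the one actually built.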


Instantiate an SGD optimizer with a tentative learning rate of $\eta = 10^{-2}$, then compile the model using the appropriate loss function (called ```'categorical_crossentropy'``` in Keras).
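
For intuition, categorical cross-entropy for a single example is $-\sum_k y_k \log p_k$, where $y$ is the one-hot target and $p$ the predicted probabilities. The sketch below is a plain-Python illustration with made-up predictions; the clipping constant `eps` is an assumption added to avoid `log(0)` and is not claimed to match the Keras internals.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    loss = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        loss -= t * math.log(p)
    return loss


# Made-up predictions for a 3-class example whose true class is index 2
target = [0.0, 0.0, 1.0]
confident = [0.05, 0.05, 0.90]
wrong = [0.70, 0.20, 0.10]

loss_confident = categorical_crossentropy(target, confident)
loss_wrong = categorical_crossentropy(target, wrong)
print(loss_confident, loss_wrong)
```

Only the probability assigned to the true class contributes, so a confident correct prediction gives a small loss and a wrong one a large loss.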

In [7]:
# The optimizers module provides a number of optimization algorithms for updating
# a network's parameters according to the computed error gradients
from keras import optimizers

# START CODE HERE
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# END CODE HERE
# We can now have a look at the defined model topology
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 784)               615440    
_________________________________________________________________
dense_4 (Dense)              (None, 300)               235500    
_________________________________________________________________
dense_5 (Dense)              (None, 10)                3010      
Total params: 853,950
Trainable params: 853,950
Non-trainable params: 0
_________________________________________________________________
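
The optimizer above combines a learning rate, a time-based decay, momentum and the Nesterov variant. The sketch below shows one common formulation of such an update on a single scalar parameter; it is an illustration under that assumption, and the exact update performed inside Keras may differ in details. The function name `sgd_step` and the toy objective $f(w) = w^2$ are made up for the example.

```python
def sgd_step(param, grad, velocity, iteration,
             lr=0.01, decay=1e-6, momentum=0.9, nesterov=True):
    """One SGD update with time-based decay and (Nesterov) momentum."""
    lr_t = lr / (1.0 + decay * iteration)        # decayed learning rate
    velocity = momentum * velocity - lr_t * grad
    if nesterov:
        # Look ahead along the freshly updated velocity
        param = param + momentum * velocity - lr_t * grad
    else:
        param = param + velocity
    return param, velocity


# Toy problem: minimise f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for it in range(2000):
    w, v = sgd_step(w, 2.0 * w, v, it)

print(w)
```

On this toy problem the parameter is driven towards the minimum at 0; with `decay=1e-6` the learning rate barely changes over a few thousand iterations.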


## Training the network

Train the model for 10 epochs using the ```.fit()``` method, validating the model at each epoch and keeping track of the training history for later plotting. Make sure you enable ```.fit()``` verbose mode in order to visualize the training.

In order to accelerate training, use the ```batch_size``` option of ```.fit()```, which processes a batch of examples at the same time and makes a single update using the gradient averaged over the examples of the batch. You can begin with a small size, and experiment with a larger size later.
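
A quick way to see the effect of ```batch_size``` is to count the weight updates per epoch and to look at how per-example gradients are averaged. This is a plain-Python sketch; the helper names and the four made-up gradients are assumptions for illustration.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass (last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


def average_gradient(per_example_grads):
    """Average a list of per-example gradient vectors component-wise."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] for g in per_example_grads) / n for i in range(dim)]


# MNIST has 60000 training images
for batch_size in (32, 128):
    print(batch_size, updates_per_epoch(60000, batch_size))

# Made-up gradients of four examples with respect to two weights
batch_grads = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0], [2.0, 4.0]]
avg = average_gradient(batch_grads)
print(avg)
```

A larger batch gives a smoother gradient estimate but fewer updates for the same number of epochs, which is worth keeping in mind when comparing runs.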

In [8]:
# This is where the actual training-testing happens
# Number of epochs we want to train
epochs = 10

# START CODE HERE
history = model.fit(x_train,y_train,
          batch_size=32,
          epochs=epochs,
          verbose=1,
          validation_data = (x_test,y_test))
# END CODE HERE

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Visualizing the network performance

Visualize the training history using the ```pyplot``` package:
- In one graph, plot the train and validation loss functions,
- In another graph, the train and validation accuracy.

By comparing the training and testing curves, what can we conclude about the quality of the training?
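
One simple way to read the curves is to compare the final training and validation accuracy: a large positive gap usually suggests overfitting, while similar values suggest the model generalizes about as well as it fits. The sketch below works on a history-like dictionary and accepts either the `'acc'` or `'accuracy'` naming, since the key depends on the Keras version. The helper names and the numbers are made up for illustration and are not results from this notebook.

```python
def first_present(history, candidates):
    """Return the first candidate key that exists in the history dict."""
    for key in candidates:
        if key in history:
            return key
    raise KeyError(f"none of {candidates} found in {sorted(history)}")


def summarize(history):
    """Final train/validation accuracy and the gap between them."""
    acc_key = first_present(history, ("accuracy", "acc"))
    val_key = first_present(history, ("val_accuracy", "val_acc"))
    train_acc = history[acc_key][-1]
    val_acc = history[val_key][-1]
    return {"train": train_acc, "val": val_acc, "gap": train_acc - val_acc}


# Made-up history values, for illustration only
fake_history = {
    "accuracy": [0.80, 0.90, 0.95],
    "val_accuracy": [0.82, 0.89, 0.93],
}
summary = summarize(fake_history)
print(summary)
```

With a real run, `history.history` would be passed in place of `fake_history`.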

In [10]:
import matplotlib.pyplot as plt
# We now want to plot the train and validation loss functions and accuracy curves
print(history.history.keys())

# summarize history for accuracy
# START CODE HERE
plt.figure(figsize=(15,8))
# This Keras version reports the metric as 'accuracy' / 'val_accuracy',
# as the keys printed above show
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# END CODE HERE


# summarize history for loss
# START CODE HERE
plt.figure(figsize=(15,8))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# END CODE HERE


dict_keys(['val_loss', 'val_accuracy', 'loss', 'accuracy'])

### We finish with a validation accuracy of 0.9468, which indicates that the model predicts new samples well.

## Experiments

Note down the performance of the larger network in terms of training and validation accuracy as a reference (save the loss/accuracy graphs of the network).

Then, experiment as follows and compare performance with the reference scenario:

*  Experiment increasing the size of the batch and compare the performance with reference.
*  Experiment replacing the sigmoid activations with ReLUs and note what happens.
*  Experiment with a larger architecture, for example: 
$$ (784) \rightarrow (300) \rightarrow (128) \rightarrow (84) \rightarrow (10) $$

### Experiment increasing the size of the batch to 128

In [None]:
# 1

epochs = 10

input_shape = 784
output_shape = 10

model = Sequential()
model.add(Dense(input_shape, activation="sigmoid", input_shape = (input_shape,)))
model.add(Dense(300, activation='sigmoid'))
model.add(Dense(output_shape, activation='sigmoid'))

model.summary()

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train,y_train,
          batch_size=128,
          epochs=epochs,
          verbose=1,
          validation_data = (x_test,y_test))

print(history.history.keys())

# summarize history for accuracy
plt.figure(figsize=(15,8))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# summarize history for loss
plt.figure(figsize=(15,8))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

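<b>With a batch size of 128, the validation accuracy decreases by around 0.02 compared with the reference run, probably because the larger batch yields fewer weight updates over the same 10 epochs (469 per epoch instead of 1875).</b>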

### Experiment replacing the sigmoid activations with ReLUs

In [None]:
# 2

epochs = 10

input_shape = 784
output_shape = 10

model = Sequential()
model.add(Dense(input_shape, activation="relu", input_shape = (input_shape,)))
model.add(Dense(300, activation='relu'))
model.add(Dense(output_shape, activation='relu'))

model.summary()

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train,y_train,
          batch_size=32,
          epochs=epochs,
          verbose=1,
          validation_data = (x_test,y_test))

print(history.history.keys())

# summarize history for accuracy
plt.figure(figsize=(15,8))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# summarize history for loss
plt.figure(figsize=(15,8))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

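<b>We obtained very poor results when switching all activation functions to ReLUs. Note that the output layer also uses ReLU in this run, so it does not produce class probabilities; this is a likely cause of the poor results.</b>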
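
One plausible explanation, not verified against this run, is the ReLU on the output layer: whenever the pre-activations of all ten output units are negative, every output is 0 and so is every gradient flowing through that layer. The sketch below illustrates this on made-up pre-activation values for three units.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU with respect to its input (taken as 0 at z <= 0)."""
    return 1.0 if z > 0 else 0.0


# Made-up pre-activations of an output layer, all negative
pre_activations = [-1.5, -0.2, -3.0]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print("outputs:", outputs)      # no class receives any score
print("gradients:", grads)      # nothing to learn from through this layer
```

Keeping ReLU in the hidden layers while using softmax on the output layer avoids this particular failure mode.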

### Experiment with a larger architecture
$$ (784) \rightarrow (300) \rightarrow (128) \rightarrow (84) \rightarrow (10) $$

In [None]:
# 3

epochs = 10

input_shape = 784
output_shape = 10

model = Sequential()
model.add(Dense(input_shape, activation="sigmoid", input_shape = (input_shape,)))
model.add(Dense(300, activation='sigmoid'))
model.add(Dense(128, activation='sigmoid'))
model.add(Dense(84, activation='sigmoid'))
model.add(Dense(output_shape, activation='softmax'))

model.summary()

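# The SGD optimizer is kept for reference but is not used below:
# this experiment compiles the model with Adadelta instead
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
adadelta = optimizers.Adadelta(lr=1.0, rho=0.95)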
model.compile(optimizer=adadelta,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train,y_train,
          batch_size=32,
          epochs=epochs,
          verbose=1,
          validation_data = (x_test,y_test))

print(history.history.keys())

# summarize history for accuracy
plt.figure(figsize=(15,8))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# summarize history for loss
plt.figure(figsize=(15,8))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

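<b>This is our best result: the model reaches a validation accuracy of 0.9707. Note that this run also switches the optimizer to Adadelta and the output activation to softmax, so the gain cannot be attributed to the larger architecture alone.</b>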