# Deep Neural Networks for MNIST Classification

### Import relevant libraries 

In [1]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

### Data

In [2]:
mnist_dataset,mnist_info = tfds.load(name='mnist',with_info=True,as_supervised=True)

<li>We manually split the data into training and validation datasets</li>
<li>We can either count the number of training samples ourselves or read it from mnist_info</li>
<li>Normally we scale the data in some way to make the result more numerically stable (here, inputs between 0 and 1)</li>
<li>MNIST pixel values lie between 0 and 255, so dividing by 255 gives values between 0 and 1</li>
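
As a quick illustration of the points above, the sketch below uses plain Python numbers instead of a tf.data pipeline. The figure of 60,000 is the size of the MNIST training split; the helper name `scale_pixels` and the sample pixel row are made up for this example.

```python
# Illustrative only: plain-Python stand-ins for the split size and the scaling step.
NUM_TRAIN_EXAMPLES = 60_000      # size of the MNIST training split
VALIDATION_FRACTION = 0.1        # 10% is set aside for validation

num_validation_samples = round(VALIDATION_FRACTION * NUM_TRAIN_EXAMPLES)
num_remaining_train_samples = NUM_TRAIN_EXAMPLES - num_validation_samples


def scale_pixels(pixels):
    """Map 8-bit pixel intensities (0..255) to floats in [0, 1]."""
    return [p / 255.0 for p in pixels]


sample_row = [0, 51, 128, 255]   # hypothetical pixel values
scaled_row = scale_pixels(sample_row)

print(num_validation_samples, num_remaining_train_samples)
print(scaled_row)
```

The remaining training-set size is what determines the number of batches per epoch later on.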

In [3]:
mnist_train , mnist_test = mnist_dataset['train'],mnist_dataset['test']

num_validation_samples = 0.1*mnist_info.splits['train'].num_examples
num_validation_samples = tf.cast(num_validation_samples,tf.int64)

num_test_samples = mnist_info.splits['test'].num_examples
num_test_samples = tf.cast(num_test_samples , tf.int64)

def scale( image,label ):
    image = tf.cast(image,tf.float32)
    image /= 255.
    return image,label

scaled_train_and_validation_data = mnist_train.map(scale) 

test_data = mnist_test.map(scale)

<li>Shuffling = keeping the same information but in a different order</li>
<li>A buffer size is needed for enormous datasets that cannot be shuffled all at once: the samples are shuffled BUFFER_SIZE at a time</li>
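
To make the idea of a shuffle buffer concrete, here is a conceptual sketch in plain Python. It is not TensorFlow's implementation; the function name `buffered_shuffle` and its `seed` argument are assumptions made for this example. It only shows how a stream can be shuffled while holding at most `buffer_size` items in memory.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in shuffled order, keeping at most buffer_size of them in memory."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a random element from the buffer
        buffer[idx] = item               # and replace it with the incoming one
    rng.shuffle(buffer)                  # stream exhausted: emit what is left
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
unshuffled = list(buffered_shuffle(range(5), buffer_size=1))
print(shuffled)
print(unshuffled)
```

In this sketch a buffer size of 1 leaves the order unchanged, while a buffer at least as large as the dataset gives a uniform shuffle. Anything in between is a partial shuffle.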

In [4]:
BUFFER_SIZE = 10000

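# reshuffle_each_iteration=False keeps the shuffled order fixed across iterations,
# so the take/skip split below yields disjoint validation and training sets
shuffled_train_and_validation_data = scaled_train_and_validation_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)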

In [5]:
validation_data = shuffled_train_and_validation_data.take(num_validation_samples)

train_data = shuffled_train_and_validation_data.skip(num_validation_samples)

<li>Batch size = 1: stochastic gradient descent (SGD)</li>
<li>Batch size = # samples: (single-batch) gradient descent (GD)</li>
<li>1 < batch size < # samples: mini-batch GD</li>
<li>When batching, the loss and the accuracy are averaged over each batch</li>
<li>The model expects the validation dataset in batch form too</li>
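
The naming rules in the list above can be written down as a tiny sketch. The function names and the per-sample losses are hypothetical and serve only to illustrate the terminology and the averaging over a batch.

```python
def gradient_descent_variant(batch_size, num_samples):
    """Name the training regime implied by the batch size."""
    if batch_size == 1:
        return "SGD"
    if batch_size >= num_samples:
        return "single-batch GD"
    return "mini-batch GD"


def batch_average(values):
    """Average a per-sample quantity (loss or accuracy) over one batch."""
    return sum(values) / len(values)


NUM_SAMPLES = 54_000
variants = {b: gradient_descent_variant(b, NUM_SAMPLES) for b in (1, 100, 54_000)}

per_sample_losses = [0.2, 0.4, 0.6]      # made-up losses for a batch of three
batch_loss = batch_average(per_sample_losses)

print(variants)
print(batch_loss)
```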

In [6]:
BATCH_SIZE = 100

train_data = train_data.batch(BATCH_SIZE)
validation_data = validation_data.batch(num_validation_samples)
test_data = test_data.batch(mnist_info.splits['test'].num_examples)

<li>The MNIST data is iterable and in 2-tuple format ( as_supervised = True ) </li>

In [7]:
validation_inputs , validation_targets = next(iter(validation_data))

## Model

### Outline the model 

<li>The underlying assumption is that all hidden layers are of the same size</li>

In [8]:
input_size = 784
output_size = 10
hidden_layer_size = 150

model = tf.keras.Sequential([
                            tf.keras.layers.Flatten(input_shape=(28,28,1)),
                            tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
                            tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
                            tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
                            tf.keras.layers.Dense(output_size,activation='softmax')
                            ])
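
To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. The helper `count_dense_parameters` is an assumption of this example, not part of Keras. Each dense layer has one weight per input-output pair plus one bias per output.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out   # weights + biases
    return total


# input (28 * 28 = 784), three hidden layers of 150 units, 10 outputs
layer_sizes = [784, 150, 150, 150, 10]
num_parameters = count_dense_parameters(layer_sizes)
print(num_parameters)
```

For the sizes used here this comes to 164,560 parameters. Changing the width or depth in the exercises below changes this number, which is one reason why training time changes as well.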

### Choose the optimizer and the loss function

In [9]:
# we define the optimizer we'd like to use, 
# the loss function, 
# and the metrics we are interested in obtaining at each iteration
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
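
For intuition about the loss chosen above, here is a minimal plain-Python sketch of softmax followed by sparse categorical cross-entropy for a single sample. "Sparse" means the target is an integer class index rather than a one-hot vector. The logits are made up, and the functions are simplified stand-ins, not the Keras implementations.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probabilities, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[target_index])


logits = [2.0, 1.0, 0.1]                  # hypothetical scores for 3 classes
probabilities = softmax(logits)
loss_if_target_is_0 = sparse_categorical_crossentropy(probabilities, 0)
loss_if_target_is_2 = sparse_categorical_crossentropy(probabilities, 2)

print(probabilities)
print(loss_if_target_is_0, loss_if_target_is_2)
```

The loss is small when the model puts high probability on the correct class and large otherwise.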

### Training

In [10]:
NUM_EPOCHS = 100
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)
model.fit(train_data,epochs=NUM_EPOCHS,validation_data=(validation_inputs,validation_targets),verbose=2,callbacks=[early_stopping])

Epoch 1/100
540/540 - 7s - loss: 0.2875 - accuracy: 0.9164 - val_loss: 0.1409 - val_accuracy: 0.9578
Epoch 2/100
540/540 - 5s - loss: 0.1137 - accuracy: 0.9653 - val_loss: 0.0889 - val_accuracy: 0.9738
Epoch 3/100
540/540 - 5s - loss: 0.0765 - accuracy: 0.9765 - val_loss: 0.0762 - val_accuracy: 0.9765
Epoch 4/100
540/540 - 5s - loss: 0.0571 - accuracy: 0.9825 - val_loss: 0.0714 - val_accuracy: 0.9778
Epoch 5/100
540/540 - 5s - loss: 0.0472 - accuracy: 0.9846 - val_loss: 0.0579 - val_accuracy: 0.9832
Epoch 6/100
540/540 - 5s - loss: 0.0369 - accuracy: 0.9883 - val_loss: 0.0619 - val_accuracy: 0.9802
Epoch 7/100
540/540 - 5s - loss: 0.0297 - accuracy: 0.9903 - val_loss: 0.0355 - val_accuracy: 0.9883
Epoch 8/100
540/540 - 5s - loss: 0.0286 - accuracy: 0.9911 - val_loss: 0.0262 - val_accuracy: 0.9918
Epoch 9/100
540/540 - 5s - loss: 0.0218 - accuracy: 0.9929 - val_loss: 0.0294 - val_accuracy: 0.9895
Epoch 10/100
540/540 - 5s - loss: 0.0209 - accuracy: 0.9929 - val_loss: 0.0305 - val_accura

<tensorflow.python.keras.callbacks.History at 0x1b8620fee88>

The loss doesn't change much from epoch to epoch, because even after the first epoch there have already been 540 different weight and bias updates. The accuracy shows in what percentage of cases the outputs were equal to the targets. We usually keep an eye on the validation loss (or set early-stopping mechanisms) to determine whether the model is overfitting.
The validation accuracy, not the training accuracy, is the figure to watch during training, since it is measured on data the model was not trained on.
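
The early-stopping behaviour in the log above can be reproduced with a simplified sketch of the patience logic. This is an assumption-laden stand-in: the function name is made up, and options of the Keras callback such as `min_delta` or `restore_best_weights` are not modelled. The validation losses are copied from the training output.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss          # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1            # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


logged_val_losses = [0.1409, 0.0889, 0.0762, 0.0714, 0.0579,
                     0.0619, 0.0355, 0.0262, 0.0294, 0.0305]
stop_epoch = early_stopping_epoch(logged_val_losses, patience=2)
print(stop_epoch)
```

The best validation loss occurs at epoch 8. Epochs 9 and 10 fail to improve on it, so with a patience of 2 the run ends after epoch 10, as in the log.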

##### After changing the hidden layer size to 100 :
This drastically increased the accuracy of the model.

##### Width : 200
Find the variable "hidden_layer_size" and change it to 200.

The validation accuracy is significantly higher (as the algorithm with 50 hidden units was too simple a model).

Naturally, it takes the algorithm much longer to train (unless early stopping is triggered too soon): 5 to 6 s compared to 4 s for a width of 100.

A hidden layer size of 500 (or another larger width) works even better.

##### Depth : add another hidden layer :
We can see that the accuracy of the model does not necessarily improve. This is an important lesson for us. Fiddling with a single hyperparameter may not be enough. Sometimes, a deeper net needs to also be wider in order to have higher accuracy. Maybe you need more epochs?

**ADDITIONAL TASK: Try this deeper model, but make it wider (200-500 hidden units). Basically, combine this and the previous exercises**

In any case, it takes longer for the algorithm to train.

##### Width and depth : 5 hidden layers of size 1000 :
The result is that our model's training was going very well, until it overfit. It did so by quite a lot.

It took my personal computer around 5-6 minutes to train the model. 

##### Fiddle with activation functions :
Analogously to the previous lecture, we can change the activation functions. This time, though, we use different activations for different layers: ReLU for the first hidden layer and tanh for the second. The tanh activation is given by the string 'tanh'.

The result should not be significantly different. However, with different width and depth, that may change.
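
For reference, the two activations can be compared on a few inputs with plain Python. This is only a sketch of the mathematical functions, not of the Keras layers, and the input values are made up.

```python
import math


def relu(x):
    """Rectified linear unit: negative inputs become 0, positive ones pass through."""
    return max(0.0, x)


def tanh(x):
    """Hyperbolic tangent: squashes any input into the interval (-1, 1)."""
    return math.tanh(x)


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
relu_outputs = [relu(x) for x in inputs]
tanh_outputs = [tanh(x) for x in inputs]

print(relu_outputs)
print(tanh_outputs)
```

ReLU discards negative inputs and is unbounded above, whereas tanh is bounded and symmetric around zero.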

##### Batch size :
A bigger batch size results in slower learning: each epoch contains fewer updates, so the model needs more epochs to reach the same accuracy. That's what we expected from the theory. The reason we still batch is the large speed increase in processing each epoch.

Notice that the validation accuracy starts from a low number and with 5 epochs actually **finishes** at a lower number. That's because there are **fewer** updates in a single epoch.

Try a batch size of 30,000 or 50,000. That's very close to single-batch GD for this problem. What do you think about the speed? You will need to change the maximum number of epochs to 100 (for instance), as 5 epochs won't be enough to train the model.
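
The link between batch size and the number of updates per epoch is simple arithmetic. The sketch below assumes the 54,000 training samples used in this notebook; the function name is made up for the example.

```python
import math

NUM_TRAIN_SAMPLES = 54_000   # 60,000 training images minus 6,000 for validation


def updates_per_epoch(num_samples, batch_size):
    """One weight update per batch; the last batch may be smaller than the rest."""
    return math.ceil(num_samples / batch_size)


batch_sizes = (1, 100, 30_000, 50_000, 54_000)
updates = {b: updates_per_epoch(NUM_TRAIN_SAMPLES, b) for b in batch_sizes}

for batch_size, count in updates.items():
    print(f"batch size {batch_size:>6}: {count:>6} updates per epoch")
```

With a batch size of 100 this gives the 540 steps shown in the training log, while 30,000 or 50,000 leave only 2 updates per epoch.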

##### Batch size = 1 :
A batch size of 1 results in SGD. It takes the algorithm very little time to process a single batch (as it is one data point), but there are thousands of batches (54,000 to be precise), so the algorithm is actually slow. Remember that this also depends on the number of cores you train on: if you are using a CPU with 4 or 8 cores, you can only process 4 or 8 batches at once. The middle ground (mini-batching, such as 100 samples per batch) usually works best.

Notice that the validation accuracy starts from a high number. That's because there are **lots** of updates in a single epoch. Once the training is over, the accuracy is lower than for all other batch sizes (SGD was an approximation).

##### Adjust learning rate :
We create the custom optimizer with:

    custom_optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)

Then we change the respective argument in model.compile to reflect this: 

    model.compile(optimizer=custom_optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    

Since the learning rate is lower than normal, we may need to adjust the maximum number of epochs (to, say, 50).

The result is basically the same, but we reach it much more slowly.

##### Adjust learning rate : 0.02 :

While Adam adapts to the problem, if the orders of magnitude are too different, it may not have time to adjust accordingly. We start overfitting before we can reach a neat solution.

Therefore, for this problem, even 0.02 is a **HIGH** starting learning rate. What if you try a learning rate of 1?

It's a good practice to try 0.001, 0.0001, and 0.00001. If it makes no difference, pick whatever, otherwise it makes sense to fiddle with the learning rate.
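
The effect of the learning rate is easiest to see on a toy problem. The sketch below deliberately uses plain gradient descent (not Adam) on the function f(w) = w², whose minimum is at w = 0. The function name, the starting point and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w               # derivative of w**2
        w -= learning_rate * gradient
    return w


final_weights = {lr: gradient_descent(lr) for lr in (0.0001, 0.1, 1.1)}

for lr, w in final_weights.items():
    print(f"learning rate {lr}: final w = {w:.6g}")
```

In this toy setting the smallest rate has barely moved after 50 steps, the middle one ends up very close to the minimum, and the largest one overshoots on every step and diverges. Adam behaves differently in detail, but the same trade-off between speed and stability applies.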

## Testing the model

In [11]:
test_loss , test_accuracy = model.evaluate(test_data)



In [12]:
print('Test loss : {0:.2f}. Test accuracy : {1:.2f}%'.format(test_loss,test_accuracy*100.))

Test loss : 0.09. Test accuracy : 97.84%
