# Class - Training NNs

#### Motivation. 
This class is designed to familiarize you with using ML frameworks for deep learning. In particular, it has the following learning goals:

* Practice writing code that implements neural network training with TensorFlow.
* Practice replicating results from previously published work.
* Evaluate multiple optimization methods for training.
* Explore hand-tuning for hyperparameter optimization.
* Build on the previous assignment with a higher level of mastery of deep learning theory and programming/engineering skills. In particular, you will experience training a much deeper network on a large-scale dataset, and you will encounter practical issues that help you consolidate your learning.

In the previous assignment, you tackled image classification on MNIST. Here you will explore model architectures for densely connected neural networks to improve image classification performance.


### Fashion MNIST

MNIST has been over-explored: with accuracies above 99% already achieved, chasing the state of the art on MNIST no longer makes much sense. Fashion-MNIST is a dataset of Zalando's article images that serves as a 'drop-in' replacement for MNIST for benchmarking machine learning algorithms. The dataset is the same size, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from 10 classes.

Your trained network will take as input a feature vector of dimension 784 (corresponding to the pixel values of a 28×28 image). Each raw pixel value is an integer from 0–255, which we rescale to the range 0–1 before training. The class labels are in the following table:

| Label Value | Meaning |
| - | - |
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |

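As a minimal sketch of the input format, the snippet below flattens a 28×28 image into a 784-dimensional vector and rescales its 8-bit pixel values to the range 0–1. The image is random synthetic data standing in for a real Fashion-MNIST sample, and the variable names are illustrative.

```python
import numpy as np

# Synthetic stand-in for one Fashion-MNIST image: 28x28 integers in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten to a 784-dimensional feature vector and rescale to [0, 1].
features = image.reshape(-1).astype("float32") / 255.0

print(features.shape)                                   # (784,)
print(features.min() >= 0.0, features.max() <= 1.0)     # True True
```
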
#### Benchmarks of Fashion MNIST for various algorithms
* Human accuracy is 0.835

* Xiao, Han, Kashif Rasul, and Roland Vollgraf. "Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms." arXiv preprint arXiv:1708.07747 (2017).
[https://arxiv.org/pdf/1708.07747.pdf](https://arxiv.org/pdf/1708.07747.pdf)
* Benchmarks
[http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/#](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/#)

## Load libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("talk")
plt.rcParams["figure.figsize"] = [9.708,6]
import warnings
warnings.filterwarnings('ignore')
import tensorflow as tf
tf.random.set_seed(0)
np.random.seed(0)
# !pip install tensorflow
from tensorflow.keras.datasets import fashion_mnist

## Load the data


In [None]:
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']


## Prepare the data
We will reshape the inputs into the layout the model expects, rescale the pixel values to between 0 and 1, and one-hot encode the class labels.

In [None]:
num_classes = 10
input_shape = (X_train.shape[1],X_train.shape[2])
#normalize the data between 0-1
X_train = X_train.astype('float32') / 255
X_test  = X_test.astype( 'float32') / 255
#reshape to (samples, 1, height, width); the Flatten layer later turns each sample into a 784-vector
X_train = X_train.reshape(X_train.shape[0], 1, input_shape[0], input_shape[1])
X_test  = X_test.reshape( X_test.shape[0],  1, input_shape[0], input_shape[1])
#one hot encoding
Y_train = tf.keras.utils.to_categorical(y_train, num_classes)
Y_test  = tf.keras.utils.to_categorical(y_test,  num_classes)
#==============
print(X_train.shape[0], 'train samples')
print(X_test.shape[0],  'test samples')

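For intuition, one-hot encoding can be reproduced in plain NumPy. This is a sketch of the kind of array `to_categorical` returns for integer labels, not its actual implementation; the label values and variable names below are made up.

```python
import numpy as np

n_classes = 10
demo_labels = np.array([9, 0, 3])     # made-up labels: Ankle boot, T-shirt/top, Dress

# Row i of the identity matrix is the one-hot vector for class i.
demo_one_hot = np.eye(n_classes, dtype="float32")[demo_labels]

print(demo_one_hot.shape)             # (3, 10)
print(demo_one_hot[0])                # 1.0 in position 9, zeros elsewhere
```
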
## View the data

In [None]:
# preview the first few images of each class
ncols,nrows = len(class_names), 4
# plt.subplots creates the figure itself, so no separate plt.figure() call is needed
fig, axs = plt.subplots(nrows=nrows, ncols=ncols,figsize=(12,5))
print(axs.shape)
for i in range(ncols):
    inn = y_train==i
    Xi  = X_train[inn]
    for j in range(nrows):
        imgi = Xi[j].reshape((input_shape[0],input_shape[1]))
        axs[j,i].imshow(imgi,interpolation='nearest',cmap=plt.cm.gray)
        axs[j,i].axis('off')
plt.tight_layout()
fig.subplots_adjust(hspace=.1)
plt.show()

## Simple network
Build and compile a basic model.

Layers for our Network.

* **Input layer** - size 784 
    * flatten the input image (28x28).
* **Hidden layer** - size 100
    * Dense (fully connected) layer from the input layer to a hidden layer of 100 neurons (`d1` in the code below).
* **Dropout** - 0.2
    * randomly sets 20% of its input units to 0 at each step during training, which helps prevent overfitting.
* **Output layer** - size 10
    * Dense layer (fully connected to the 100-neuron hidden layer). The 10 is the number of classes. Given an input image, our network should **light up** the output neuron corresponding to the target class.
* **Softmax activation** - convert our output into a probability for each class.

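To make the layer list concrete, here is a minimal NumPy sketch of one training-time forward pass through such a network. The weights are random, the hidden width is just a parameter, and the dropout shown is the "inverted" variant that rescales the kept units by 1/(1 - rate). All names are illustrative; this is not how Keras implements these layers internally.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x, W1, b1, W2, b2, rate=0.2, training=True):
    h = np.maximum(0.0, x @ W1 + b1)           # dense layer + ReLU
    if training:
        keep = rng.random(h.shape) >= rate     # drop each unit with probability `rate`
        h = h * keep / (1.0 - rate)            # rescale so the expected activation is unchanged
    return softmax(h @ W2 + b2)                # dense layer + softmax

hidden = 100                                   # hidden width (illustrative; any width works)
x = rng.random((4, 784))                       # 4 fake flattened images
W1, b1 = rng.normal(0, 0.05, (784, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(0, 0.05, (hidden, 10)), np.zeros(10)

probs = forward(x, W1, b1, W2, b2)
print(probs.shape)                             # (4, 10)
print(probs.sum(axis=1))                       # each row sums to 1
```

At prediction time (`training=False`) the dropout step is skipped, so the output is deterministic for a given input.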

# Part 1a: 1-layer network

In [None]:
#keep these constant
epochs    = 30
batch_size= 256

Define our model

In [None]:
d1         = 100                                  # setting my number of neurons
tf.random.set_seed(0)                             # set our initial seed
model1 = tf.keras.models.Sequential([              # model type
  tf.keras.layers.Flatten(input_shape=X_train[1].shape),  # input layer
  tf.keras.layers.Dense(d1, activation='relu'),  # hidden layer
  tf.keras.layers.Dropout(0.2),                   # Dropout helps reduce overfitting 
  tf.keras.layers.Dense(10),                      # output to each class, could just stop here
  tf.keras.layers.Softmax()                       # convert to probability
])
#define our optimizer
sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=False, name='SGD')
#
model1.compile(optimizer=sgd,
              loss='categorical_crossentropy', #need to define our loss function
              metrics=['accuracy'])

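The optimizer above is SGD with classical (non-Nesterov) momentum. As a rough sketch of the update rule given in the Keras documentation (`velocity = momentum * velocity - learning_rate * gradient`, then `w = w + velocity`), the toy example below minimizes a one-dimensional quadratic. The objective and the function name are made up for illustration.

```python
def sgd_momentum_minimize(grad_fn, w0, learning_rate=0.1, momentum=0.9, steps=300):
    """Classical momentum update applied to a scalar parameter (toy sketch)."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        velocity = momentum * velocity - learning_rate * g
        w = w + velocity
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_final = sgd_momentum_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)   # close to 3.0
```
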
In [None]:
tstart = tf.timestamp()
history1 = model1.fit(X_train, Y_train, 
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split = 0.2) # Store Data for evaluation in history
total_time = tf.timestamp() - tstart
print("total time %3.3f seconds"%total_time)

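With `validation_split = 0.2`, Keras holds out the last 20% of the samples passed to `fit` (taken before any shuffling) for validation and trains on the rest. A back-of-the-envelope sketch of the resulting sizes, assuming the 60,000-image training set and the batch size used above; the helper name is hypothetical:

```python
import math

def split_sizes(n_samples, validation_split, batch_size):
    """Return (n_train, n_val, steps_per_epoch) for a held-out tail split."""
    n_train = math.floor(n_samples * (1.0 - validation_split))
    n_val = n_samples - n_train
    steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be partial
    return n_train, n_val, steps_per_epoch

print(split_sizes(60000, 0.2, 256))   # (48000, 12000, 188)
```
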
In [None]:
#we will use this a lot, so let's make a function
def printAccuracy(history,results_test):
    print("train loss %.5f \t train acc: %.5f"%(history.history['loss'][-1],history.history['accuracy'][-1]))
    print("valid loss %.5f \t valid acc: %.5f"%(history.history['val_loss'][-1],history.history['val_accuracy'][-1]))
    print("test loss  %.5f \t test acc:  %.5f"%(results_test[0],results_test[1]))
#we will do this a lot, so let's make a function for this
def plot_result(history,results_test):
    # Get training and validation histories
    training_acc = history.history['accuracy']
    val_acc      = history.history['val_accuracy']
    # Create count of the number of epochs
    epoch_count = range(1, len(training_acc) + 1)
    # Visualize accuracy history
    plt.plot(epoch_count, training_acc, 'b-o',label='Training')
    plt.plot(epoch_count, val_acc, 'r--',label='Validation')
    plt.plot(epoch_count, results_test[1]*np.ones(len(epoch_count)),'k--',label='Test')
    plt.legend()
    plt.title("Training and validation accuracy")
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
 

## Model accuracy and loss

In [None]:
#===    
results_test = model1.evaluate(X_test, Y_test, batch_size=128,verbose=0)    
printAccuracy(history1,results_test)

## Plot training and test accuracy per epoch

In [None]:
plot_result(history1,results_test)   
plt.title("Fashion MNIST 1-hidden layer, d=%d, in %3.2f s"%(d1,total_time)) #overwrite the title
plt.show()

## Benchmark, test accuracy 87.1%
That is our simple 1-hidden-layer network for Fashion MNIST. With a test accuracy of 87.1%, it is not bad for a first try. We can see that we are not stopping too soon, since the validation accuracy has leveled off.

# Part 1b: 2-layer network: Is a deeper neural network more accurate?
Here we’ll combine all our steps into the following code block:

In [None]:
d1         = 100    # setting my number of neurons
d2         = 100
#=====
tf.random.set_seed(0)                             # set our initial seed
model2 = tf.keras.models.Sequential([              # model type
  tf.keras.layers.Flatten(input_shape=X_train[1].shape),  # input layer
  tf.keras.layers.Dense(d1, activation='relu'),  # hidden layer
  tf.keras.layers.Dropout(0.2),                   # Dropout helps reduce overfitting 
  tf.keras.layers.Dense(d2, activation='relu'),  # hidden layer    
  tf.keras.layers.Dropout(0.2),                   # Dropout helps reduce overfitting 
  tf.keras.layers.Dense(10),                      # output to each class, could just stop here
  tf.keras.layers.Softmax()                       # convert to probability
])
#define our optimizer
sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=False, name='SGD')
#
model2.compile(optimizer=sgd,
              loss='categorical_crossentropy', #need to define our loss function
              metrics=['accuracy'])
#====
tstart   = tf.timestamp()
history2 = model2.fit(X_train, Y_train, 
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split = 0.2) # Store Data for evaluation in history
total_time = tf.timestamp() - tstart
print("total time %3.3f seconds"%total_time)

## Model accuracy and loss

In [None]:
#===    
results_test = model2.evaluate(X_test, Y_test, batch_size=128,verbose=0)    
printAccuracy(history2,results_test)

## Plot training and test accuracy per epoch

In [None]:
plot_result(history2,results_test)   
plt.title("Fashion MNIST 2-hidden layer, d=(%d,%d) in %3.2f s"%(d1,d2,total_time)) #overwrite the title
plt.show()

## Benchmark, test accuracy 87.7%
Our first 2-hidden-layer network for Fashion MNIST gained little over our 1-hidden-layer network (87.7% versus 87.1% test accuracy).

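One way to put the depth comparison in context is to count trainable parameters. The sketch below uses a hypothetical helper (not part of Keras) to count weights and biases for fully connected stacks with 100-unit hidden layers, as in this notebook. The Flatten, Dropout and Softmax layers add no parameters.

```python
def dense_param_count(layer_sizes):
    """Trainable parameters of a fully connected net: weights plus biases per layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

for hidden_layers in ([100], [100, 100], [100, 100, 100]):
    sizes = [784, *hidden_layers, 10]
    print(len(hidden_layers), "hidden layer(s):", dense_param_count(sizes), "parameters")
# 1 hidden layer(s): 79510 parameters
# 2 hidden layer(s): 89610 parameters
# 3 hidden layer(s): 99710 parameters
```

Most of the parameters sit in the first layer (784 × 100 weights), so each extra 100-unit hidden layer adds only 10,100 more.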

# Part 1c - 3 hidden layers - go big or go home

Let’s build a 5-layer network (3 hidden layers), keeping the same activation functions, layer widths and settings, so the only difference is the depth of the network. Here we’ll combine all our steps into the following code block:

In [None]:
d1         = 100    # setting my number of neurons
d2         = 100
d3         = 100
#=====
tf.random.set_seed(0)                             # set our initial seed
model3 = tf.keras.models.Sequential([              # model type
  tf.keras.layers.Flatten(input_shape=X_train[1].shape),  # input layer
  tf.keras.layers.Dense(d1, activation='relu'),  # hidden layer
  tf.keras.layers.Dropout(0.2),                   # Dropout helps reduce overfitting 
  tf.keras.layers.Dense(d2, activation='relu'),  # hidden layer    
  tf.keras.layers.Dropout(0.2),                   # Dropout helps reduce overfitting 
  tf.keras.layers.Dense(d3, activation='relu'),  # hidden layer    
  tf.keras.layers.Dropout(0.2),                   # Dropout helps reduce overfitting 
  tf.keras.layers.Dense(10),                      # output to each class, could just stop here
  tf.keras.layers.Softmax()                       # convert to probability
])
#define our optimizer
sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=False, name='SGD')
#
model3.compile(optimizer=sgd,
              loss='categorical_crossentropy', #need to define our loss function
              metrics=['accuracy'])
#====
tstart   = tf.timestamp()
history3 = model3.fit(X_train, Y_train, 
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split = 0.2) # Store Data for evaluation in history
total_time = tf.timestamp() - tstart
print("total time %3.3f seconds"%total_time)

## Model accuracy and loss

In [None]:
#===    
results_test = model3.evaluate(X_test, Y_test, batch_size=128,verbose=0)    
printAccuracy(history3,results_test)

## Plot training and test accuracy per epoch

In [None]:
plot_result(history3,results_test)   
plt.title("Fashion MNIST 3-hidden layer, d=(%d,%d,%d) in %3.2f s"%(d1,d2,d3,total_time)) #overwrite the title
plt.show()

For all 3 models, the general trend as the number of epochs increases is that the training accuracy approaches 1 (and the training loss decreases toward 0), the 'perfect score'. A near-perfect training score on its own is a warning sign of overfitting, a well-known problem in machine learning in which the model adapts too closely to a specific dataset and so does not generalize well to new data. This is the motivation for validating the model on held-out data. The techniques used to counter overfitting are collectively called regularization.

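A simple way to quantify this is the gap between training and validation accuracy at each epoch. The sketch below works on a dictionary shaped like `history.history`. The helper names are hypothetical, and the numbers are invented for illustration; they are not results from the models above.

```python
def generalization_gap(history_dict):
    """Per-epoch difference between training and validation accuracy."""
    return [t - v for t, v in zip(history_dict["accuracy"],
                                  history_dict["val_accuracy"])]

def best_epoch(history_dict):
    """1-based epoch with the highest validation accuracy (first one on ties)."""
    val = history_dict["val_accuracy"]
    return max(range(len(val)), key=val.__getitem__) + 1

# Invented example values, NOT results from this notebook.
fake_history = {
    "accuracy":     [0.80, 0.86, 0.90, 0.93, 0.95],
    "val_accuracy": [0.82, 0.86, 0.88, 0.88, 0.87],
}
print([round(g, 2) for g in generalization_gap(fake_history)])
print("best epoch:", best_epoch(fake_history))
```

A gap that keeps widening while the validation accuracy stalls or drops is the usual signature of overfitting.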

## Compare the results

In [None]:
val_acc1 = history1.history['val_accuracy']
val_acc2 = history2.history['val_accuracy']
val_acc3 = history3.history['val_accuracy']
# Create count of the number of epochs
epoch_count = range(1, len(val_acc1) + 1)
# Visualize validation accuracy history
plt.plot(epoch_count, val_acc1, label='1-hidden layer')
plt.plot(epoch_count, val_acc2, label='2-hidden layer')
plt.plot(epoch_count, val_acc3, label='3-hidden layer')
plt.legend()
plt.title("Fashion MNIST")
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show();

#### The 2-hidden-layer network performs marginally better over the 30 epochs, so we'll stick with that for the rest of our investigation.
#### Note: the 3-hidden-layer network seems to still be improving after 30 epochs (it's still learning). Come back to this for the homework.

# Part 2 - what effect does batch size have on accuracy?

Let's see how the mini-batch size changes our results.

In this part, we will explore how changing the minibatch size hyperparameter $B$ affects the validation accuracy. We will set this hyperparameter to each of the following values in turn:
* $B=100$
* $B=200$
* $B=400$
* $B=800$
* $B=1600$
* $B=3200$

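One practical consequence of the batch size is the number of weight updates per epoch. A quick sketch, assuming the 48,000 training images that remain after `validation_split = 0.2` is applied to 60,000 samples:

```python
import math

n_train = 48000                       # 60,000 samples minus the 20% validation split
batch_sizes = [100, 200, 400, 800, 1600, 3200]

# Each batch produces one gradient step, so updates per epoch = ceil(n_train / B).
updates_per_epoch = {B: math.ceil(n_train / B) for B in batch_sizes}
for B, n_updates in updates_per_epoch.items():
    print(f"B={B:5d}: {n_updates:4d} updates per epoch")
```

With a fixed number of epochs, the largest batch size therefore makes 32 times fewer updates than the smallest (15 versus 480 per epoch).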

In [None]:
epochs        = 30
hiddenNeurons = 100
Bsizes        = [100,200,400,800,1600,3200]

Since we run almost the same model and only change `batch_size`, we can put everything into a function and call it with a different batch size each time. This technique is really helpful and can save you some trouble on your homework.

We can also return just what we want: the validation accuracy.

In [None]:
def runNNModel_BatchSize(B):
    tf.random.set_seed(0)                             # set our initial seed
    modelB = tf.keras.models.Sequential([             # model type
      tf.keras.layers.Flatten(input_shape=X_train[1].shape),  # input layer
      tf.keras.layers.Dense(hiddenNeurons, activation='relu'),   # hidden layer
      tf.keras.layers.Dropout(0.2),                   # Dropout helps reduce overfitting 
      tf.keras.layers.Dense(10),                      # output to each class, could just stop here
      tf.keras.layers.Softmax()                       # convert to probability
    ])
    sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=False, name='SGD')
    modelB.compile(optimizer=sgd,
                  loss='categorical_crossentropy',    #need to define our loss function
                  metrics=['accuracy'])
    history = modelB.fit(X_train, Y_train, verbose=0,
                        epochs=epochs,
                        batch_size=B,
                        validation_split = 0.2) 
    return history.history['val_accuracy']            #all we want is the validation history

Now let's run the model for each of the batch sizes we care about.

In [None]:
valAccuracy = []   
for B in Bsizes:
    print("Starting batchsize %d"%B,end=' ');
    tstart = tf.timestamp()
    valAcc_B = runNNModel_BatchSize(B)
    valAccuracy.append(valAcc_B)
    total_time = tf.timestamp() - tstart
    print("total time %3.3f seconds"%total_time)

## Plot our results

In [None]:
fig = plt.figure()
for v,B in zip(valAccuracy,Bsizes):   
    epoch_count = range(1, len(v) + 1)               # Create count of the number of epochs
    plt.plot(epoch_count, v,label='B=%d'%B)
#===========    
plt.legend()
plt.title("Fashion MNIST")
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show();