# Assignment 4: Benchmarking Fashion-MNIST with Deep Neural Nets

### CS 4501 Machine Learning - Department of Computer Science - University of Virginia
"The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others." - **Zalando Research, Github Repo.**"

Fashion-MNIST is a dataset from the Zalando's article. Each example is a 28x28 grayscale image, associated with a label from 10 classes. They intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms.

![Here's an example how the data looks (each class takes three-rows):](https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png)

In this assignment, you will attempt to benchmark the Fashion-MNIST using Neural Networks. You must use it to train some neural networks on TensorFlow and predict the final output of 10 classes. For deliverables, you must write code in Python and submit this Jupyter Notebook file (.ipynb) to earn a total of 100 pts. You will gain points depending on how you perform in the following sections.


In [2]:
# You might want to use the following packages
import numpy as np
import os
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) #reduce annoying warning messages
from functools import partial

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)


---
## 1. PRE-PROCESSING THE DATA (10 pts)

You can load the Fashion MNIST directly from Tensorflow. **Partition of the dataset** so that you will have 50,000 examples for training, 10,000 examples for validation, and 10,000 examples for testing. Also, make sure that you platten out each of examples so that it contains only a 1-D feature vector.

Write some code to output the dimensionalities of each partition (train, validation, and test sets).



In [240]:
# Your code goes here for this section.
#fmnist = tf.keras.datasets.fashion_mnist.load_data();
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data();
X_train, X_val, y_train, y_val = x_train[:50000], x_train[50000:], y_train[:50000], y_train[50000:]

In [241]:
#turning the examples into a 1-D feature vector
#X_train = X_train.flatten().reshape(-1,28*28)
#X_val = X_val.flatten().reshape(10000,28*28)
#X_test = x_test.flatten().reshape(10000,28*28)

X_train = X_train.astype(np.float32).reshape(-1,28*28)
X_val = X_val.astype(np.float32).reshape(-1,28*28)
X_test = x_test.astype(np.float32).reshape(-1,28*28)



In [242]:
y_test = y_test.astype(np.int32)
y_train = y_train.astype(np.int32)
y_val = y_val.astype(np.int32)

In [243]:
print("X validation set is dimensions:", X_val.shape)
print("X test set is dimensions:", X_test.shape)
print("X train set is dimensions:", X_train.shape)

X validation set is dimensions: (10000, 784)
X test set is dimensions: (10000, 784)
X train set is dimensions: (50000, 784)


- - -
## 2. CONSTRUCTION PHASE (30 pts)

In this section, define at least three neural networks with different structures. Make sure that the input layer has the right number of inputs. The best structure often is found through a process of trial and error experimentation:
- You may start with a fully connected network structure with two hidden layers.
- You may try a few settings of the number of nodes in each layer.
- You may try a few activation functions to see if they affect the performance.

**Important Implementation Note:** For the purpose of learning Tensorflow, you must use low-level TensorFlow API to construct the network. Usage of high-level tools (ie. Keras) is not permited. 

In [246]:
# Your code goes here
reset_graph()

# Set some configuration here
n_inputs = 28*28  # Fashion-MNIST
n_outputs = 10

# Construct placeholder for the input layer
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

In [247]:
#defining activation functions
#its technically leaky relu
def relu(z):
    alpha = 0.01
    return tf.maximum(alpha*z, z)

In [248]:
#base case with leaky relu
with tf.name_scope("dnn1"):
    n_inputs = 28 * 28  # MNIST
    n_hidden1 = 300
    n_hidden2 = 100
    n_outputs = 10
    
    
  #implementation of the first net here
    hidden1 = tf.layers.dense(X, n_hidden1, activation=relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

In [249]:
#an additional hidden layer
with tf.name_scope("dnn2"):
  #implementation of the second net here
    n_inputs = 28 * 28  # MNIST
    n_hidden1 = 500
    n_hidden2 = 200
    n_hidden3 = 100
    n_outputs = 10
    
  #implementation of the first net here
    d2hidden1 = tf.layers.dense(X, n_hidden1, activation=relu, name="d2hidden1")
    d2hidden2 = tf.layers.dense(d2hidden1, n_hidden2, activation=relu, name="d2hidden2")
    d2hidden3 = tf.layers.dense(d2hidden2, n_hidden3, activation=relu, name="d2hidden3")
    blogits = tf.layers.dense(d2hidden3, n_outputs, name="boutputs")

In [250]:
#large funnel with relu
with tf.name_scope("dnn3"):
  #implementation of the third net here
    n_inputs = 28 * 28  # MNIST
    n_hidden1 = 1000
    n_hidden2 = 100
    n_outputs = 10
    
  #implementation of the first net here
    d3hidden1 = tf.layers.dense(X, n_hidden1, activation=relu, name="d3hidden1")
    d3hidden2 = tf.layers.dense(d3hidden1, n_hidden2, activation=relu, name="d3hidden2")
    clogits = tf.layers.dense(d3hidden2, n_outputs, name="coutputs")

In [251]:
with tf.name_scope("loss"):
#implementation of the loss function net here
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
    bxentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=blogits)
    cxentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=clogits)
    
    bloss = tf.reduce_mean(bxentropy, name="bloss")
    closs = tf.reduce_mean(cxentropy, name="closs")

In [252]:
#using gradient strategy since I have used relu and selu
with tf.name_scope("train"):
    learning_rate = 0.0011  #at learning rate of 0.01, I was getting NAN as results, need learning rate to be really low
    blearning_rate = .0011
    clearning_rate = .0011
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
    
    boptimizer = tf.train.GradientDescentOptimizer(blearning_rate)
    coptimizer = tf.train.GradientDescentOptimizer(clearning_rate)
    
    btraining_op = boptimizer.minimize(bloss)
    ctraining_op = coptimizer.minimize(closs)
    
    extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

In [253]:
with tf.name_scope("eval"):
  #implementation of the evaluation procedure here
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
    bcorrect = tf.nn.in_top_k(blogits, y, 1)
    baccuracy = tf.reduce_mean(tf.cast(bcorrect, tf.float32))
    
    ccorrect = tf.nn.in_top_k(clogits, y, 1)
    caccuracy = tf.reduce_mean(tf.cast(ccorrect, tf.float32))

- - -
## 3. EXECUTION PHASE (30 pts)

After you construct the three models of neural networks, you can compute the performance measure as the class accuracy. You will need to define the number of epochs and size of the training batch. You also might need to reset the graph each time your try a different model. To save time and avoid retraining, you should save the trained model and load it from disk to evaluate a test set. Pick the best model and answer the following:
- Which model yields the best performance measure for your dataset? Provide a reason why it yields the best performance.
- Why did you pick this many hidden layers?
- Provide some justifiable reasons for selecting the number of neurons per hidden layers. 
- Which activation functions did you use?

In the next session you will get a chance to finetune it further .



In [254]:
# Your code goes here
init = tf.global_variables_initializer()
saver = tf.train.Saver()

#used the epochs and batch sizes from notes
n_epochs = 40
batch_size = 500

# shuffle_batch() shuffle the examples in a batch before training
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch


In [255]:
with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    # implementation of the training ops here
    # implementation of the validation accuracy here
    for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        sess.run(btraining_op, feed_dict={X: X_batch, y: y_batch})
        sess.run(ctraining_op, feed_dict={X: X_batch, y: y_batch})
    if epoch % 5 == 0:
        
        acc_batch = sess.run(accuracy,feed_dict={X: X_batch, y: y_batch})
        acc_valid = sess.run(accuracy,feed_dict={X: X_val, y: y_val})
        
        
        bacc_batch = sess.run(baccuracy,feed_dict={X: X_batch, y: y_batch})
        bacc_valid = sess.run(baccuracy,feed_dict={X: X_val, y: y_val})
        
        cacc_batch = sess.run(caccuracy,feed_dict={X: X_batch, y: y_batch})
        cacc_valid = sess.run(caccuracy,feed_dict={X: X_val, y: y_val})
        
        print(epoch,"model 1: ", "Batch accuracy:", acc_batch, "Validation accuracy:", acc_valid)
        print(epoch,"model 2: ", "Batch accuracy:", bacc_batch, "Validation accuracy:", bacc_valid)
        print(epoch,"model 3: ", "Batch accuracy:", cacc_batch, "Validation accuracy:", cacc_valid)

    
  save_path = saver.save(sess, "./my_dnn_model.ckpt")

0 model 1:  Batch accuracy: 0.666 Validation accuracy: 0.6354
0 model 2:  Batch accuracy: 0.75 Validation accuracy: 0.6979
0 model 3:  Batch accuracy: 0.768 Validation accuracy: 0.7326
5 model 1:  Batch accuracy: 0.796 Validation accuracy: 0.7605
5 model 2:  Batch accuracy: 0.798 Validation accuracy: 0.7801
5 model 3:  Batch accuracy: 0.832 Validation accuracy: 0.8014
10 model 1:  Batch accuracy: 0.824 Validation accuracy: 0.7888
10 model 2:  Batch accuracy: 0.838 Validation accuracy: 0.7998
10 model 3:  Batch accuracy: 0.832 Validation accuracy: 0.811
15 model 1:  Batch accuracy: 0.83 Validation accuracy: 0.8048
15 model 2:  Batch accuracy: 0.838 Validation accuracy: 0.8118
15 model 3:  Batch accuracy: 0.826 Validation accuracy: 0.8223
20 model 1:  Batch accuracy: 0.828 Validation accuracy: 0.8142
20 model 2:  Batch accuracy: 0.85 Validation accuracy: 0.8224
20 model 3:  Batch accuracy: 0.874 Validation accuracy: 0.8335
25 model 1:  Batch accuracy: 0.824 Validation accuracy: 0.8206
25

In [256]:
with tf.Session() as sess:
    saver.restore(sess, "./my_dnn_model.ckpt")
    # implementation of the test set evaluation here
    atest = sess.run(accuracy,feed_dict={X: X_test, y: y_test})
    btest = sess.run(baccuracy,feed_dict={X: X_test, y: y_test})
    ctest = sess.run(caccuracy,feed_dict={X: X_test, y: y_test})

In [257]:
# print out the final accuracy here
print("Model 1 test accuracy: ", atest)
print("Model 2 test accuracy: ", btest)
print("Model 3 test accuracy: ", ctest)

Model 1 test accuracy:  0.8198
Model 2 test accuracy:  0.8305
Model 3 test accuracy:  0.833


I did end up using the same activation function for all three (with the same alpha). The main difference between the three models is in the size and number of the hidden layers.The first model had 2 hidden layers of 500 and 100, while the second model had 3 layers starting at 500 and end the 3rd hidden layer being 100. The final model took model 1 and created a funnel, with the first hidden layer being 1000 and the second hidden layer being 100 (500 to 100 vs. 1000 to 100). The increased number of layers did not lead to better performance over the other 2 models. Funneling led ot a higher test accuracy score.


- - -
## 4. FINETUNING THE NETWORK (25 pts)

The best performance on the Fashion MNIST of a non-neural-net classifier is the Support Vector Classifier {"C":10,"kernel":"poly"} with 0.897 accuracy. In this section, you will see how close you can get to that accuracy, or (better yet) beat it! You will be able to see the performance of other ML methods below:
http://fashion-mnist.s3-website.eu-central-1.amazonaws.com

Use the best model from the previous section and see if you can improve it further. To improve the performance of your model, You must make some modifications based upon the practical guidelines discuss in class. Here are a few decisions about the recommended network configurations you have to make:
1. Initialization: Use He Initialization for your model
2. Activation: Add ELU as the activation function throughout your hidden layers
3. Normalization: Incorporate the batch normalization at every layer
4. Regularization: Configure the dropout policy at 50% rate
5. Optimization: Change Gradient Descent into Adam Optimization
6. Your choice: make any other changes in 1-5 you deem necessary

Keep in mind that the execution phase is essentially the same, so you can just run it from the above. See how much you gain in classification accuracy. Provide some justifications for the gain in performance. 






In [258]:
reset_graph()

# Set some configuration here
n_inputs = 28*28  # Fashion-MNIST
n_outputs = 10

# Construct placeholder for the input layer
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

In [259]:
with tf.name_scope("dnnBenchmark"):
  # implementation of the new benchmarking DNN here
    n_hidden1 = 1000
    n_hidden2 = 500
    n_hidden3 = 100
    n_outputs = 10
    
  #implementation of the first net here

    he_init = tf.variance_scaling_initializer()
    training = tf.placeholder_with_default(False, shape=(), name='training')
    
    ohidden1 = tf.layers.dense(X, n_hidden1, activation=relu, kernel_initializer=he_init, name="ohidden1")
    b1 = tf.layers.batch_normalization(ohidden1, training=training, momentum=0.9)
    ob1 = tf.nn.elu(b1)
    
    ohidden2 = tf.layers.dense(ob1, n_hidden2, activation=relu, name="ohidden2")
    b2 = tf.layers.batch_normalization(ohidden2, training=training, momentum=0.9)
    ob2 = tf.nn.elu(b2)
    
    ohidden3 = tf.layers.dense(ob2, n_hidden3, activation=relu, name="ohidden3")
    b3 = tf.layers.batch_normalization(ohidden3, training=training, momentum=0.9)
    ob3 = tf.nn.elu(b3)
    
    ologits = tf.layers.dense(ob3, n_outputs, name="ooutputs")

    
with tf.name_scope("oloss"):
    oxentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=ologits)
    oloss = tf.reduce_mean(oxentropy, name="oloss")

with tf.name_scope("otrain"):
    learning_rate = 0.0005
    
    ooptimizer = tf.train.GradientDescentOptimizer(learning_rate)
    otraining_op = optimizer.minimize(oloss)
    
with tf.name_scope("oeval"):
  #implementation of the evaluation procedure here
    ocorrect = tf.nn.in_top_k(ologits, y, 1)
    oaccuracy = tf.reduce_mean(tf.cast(ocorrect, tf.float32))

In [260]:
#since I implemented the execution phase above using different lines, Il remake it here
init = tf.global_variables_initializer()
saver = tf.train.Saver()

#used the epochs and batch sizes from notes
n_epochs = 40
batch_size = 500



with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
    # implementation of the training ops here
    # implementation of the validation accuracy here
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(otraining_op, feed_dict={X: X_batch, y: y_batch})
        
        if epoch % 5 == 0:
            accbatch = sess.run(oaccuracy,feed_dict={X: X_batch, y: y_batch})
            accvalid = sess.run(oaccuracy,feed_dict={X: X_val, y: y_val})
      
            print(epoch,"optimal model: ", "Batch accuracy:", accbatch, "Validation accuracy:", accvalid)

    save_path = saver.save(sess, "./optimized model.ckpt")

0 optimal model:  Batch accuracy: 0.784 Validation accuracy: 0.7269
5 optimal model:  Batch accuracy: 0.842 Validation accuracy: 0.8073
10 optimal model:  Batch accuracy: 0.886 Validation accuracy: 0.8219
15 optimal model:  Batch accuracy: 0.874 Validation accuracy: 0.8299
20 optimal model:  Batch accuracy: 0.894 Validation accuracy: 0.8369
25 optimal model:  Batch accuracy: 0.89 Validation accuracy: 0.8411
30 optimal model:  Batch accuracy: 0.902 Validation accuracy: 0.8422
35 optimal model:  Batch accuracy: 0.906 Validation accuracy: 0.845


In [261]:
with tf.Session() as sess:
    saver.restore(sess, "./optimized model.ckpt")
    atest = sess.run(oaccuracy,feed_dict={X: X_test, y: y_test})
    print("optimized model accuracy: ", atest)

optimized model accuracy:  0.843


- - -
## 5. OUTLOOK (5 pts)

Plan for the outlook of your system: This may lead to the direction of your future project:
- Did your neural network outperform other "traditional ML technique? Why/why not?
- Does your model work well? If not, which model should be further investigated?
- Do you satisfy with your system? What do you think needed to improve?



I am not fully satisfied with my system. My primary issue is 1 things. First the learning rate I had to use was much smaller than the learning rate used in the notes. Using a learning rate any significant size larger for me ended with NAN results, which worries me. I thus could not mess with learning rate to improve my model. The main changes I have done were to implement batche normalisations at every stage, and to implement he at the beginning. I was unable to implement drop out stages which I believe may have been able to perform better than taditional ML techniques. Overall though, I believe my model functions decently.



- - - 
### NEED HELP?

In case you get stuck in any step in the process, you may find some useful information from:

 * Consult my lectures and/or the textbook
 * Talk to the TA, they are available and there to help you during OH
 * Come talk to me or email me <nn4pj@virginia.edu> with subject starting "CS4501 Assignment 4:...".
 * More on the Fashion-MNIST to be found here: https://hanxiao.github.io/2018/09/28/Fashion-MNIST-Year-In-Review/

Best of luck and have fun!