# Prepare Environment

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from IPython.display import display

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams["axes.grid"] = False
%matplotlib inline

In [None]:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)

**Note**: most of the code in this notebook is a simplified version of the TensorFlow tutorial example ([here](https://www.tensorflow.org/tutorials/keras/classification)).

# Import the Fashion MNIST dataset

This guide uses the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset which contains 70,000 grayscale images in 10 categories. The images show individual articles of clothing at low resolution (28 x 28 pixels), as seen here:

<table>
  <tr><td>
    <img src="https://tensorflow.org/images/fashion-mnist-sprite.png"
         alt="Fashion MNIST sprite"  width="600">
  </td></tr>
  <tr><td align="center">
    <b>Figure 1.</b> <a href="https://github.com/zalandoresearch/fashion-mnist">Fashion-MNIST samples</a> (by Zalando, MIT License).<br/>&nbsp;
  </td></tr>
</table>

Here, 60,000 images are used to train the network and 10,000 images to evaluate how accurately the network learned to classify images. Fashion MNIST ships with TensorFlow, so you can import and load it directly:

In [None]:
from tensorflow.keras.datasets import fashion_mnist

# Download Fashion MNIST dataset using `datasets` module in `tf.keras`
# Note: the data have already been split into training and test sets
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

print(f'Training set: {X_train.shape}, {y_train.shape}')
print(f'Test set: {X_test.shape}, {y_test.shape}')

In [None]:
from sklearn.model_selection import train_test_split
 
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, 
    random_state=42,
    test_size=10000)

print(f'Training set: {X_train.shape}, {y_train.shape}')
print(f'Validation set: {X_valid.shape}, {y_valid.shape}')
print(f'Test set: {X_test.shape}, {y_test.shape}')

The images are 28x28 NumPy arrays, with pixel values ranging from 0 to 255. The *labels* are an array of integers, ranging from 0 to 9. These correspond to the *class* of clothing the image represents:

<table>
  <tr>
    <th>Label</th>
    <th>Class</th>
  </tr>
  <tr>
    <td>0</td>
    <td>T-shirt/top</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Trouser</td>
  </tr>
    <tr>
    <td>2</td>
    <td>Pullover</td>
  </tr>
    <tr>
    <td>3</td>
    <td>Dress</td>
  </tr>
    <tr>
    <td>4</td>
    <td>Coat</td>
  </tr>
    <tr>
    <td>5</td>
    <td>Sandal</td>
  </tr>
    <tr>
    <td>6</td>
    <td>Shirt</td>
  </tr>
    <tr>
    <td>7</td>
    <td>Sneaker</td>
  </tr>
    <tr>
    <td>8</td>
    <td>Bag</td>
  </tr>
    <tr>
    <td>9</td>
    <td>Ankle boot</td>
  </tr>
</table>

Each image is mapped to a single label. Since the *class names* are not included with the dataset, store them here to use later when plotting the images:

In [None]:
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

Let's look at one example from Fashion MNIST.

In [None]:
plt.figure()
plt.imshow(X_train[0])
plt.colorbar()
plt.grid(False)
plt.xlabel(class_names[y_train[0]])
plt.show()

# Data Preprocessing

It is common practice to **normalize the range of independent variables or features of data**. This is mainly because many classifiers rely on the Euclidean distance between two points. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized **so that each feature contributes approximately proportionately to the final distance**.
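To make the distance argument concrete, here is a small NumPy sketch with made-up numbers. The two features (height in metres and yearly income in dollars) and the helper `euclidean` are hypothetical and not part of Fashion MNIST. The sketch compares Euclidean distances before and after min-max scaling each feature to [0, 1].

```python
import numpy as np

# Hypothetical samples: [height in metres, yearly income in dollars]
X = np.array([
    [1.60, 30000.0],
    [1.90, 30500.0],
    [1.61, 32000.0],
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Distances on the raw features: the income column dominates
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# Min-max scaling: map each column independently to [0, 1]
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)

# Distances on the scaled features: both columns now contribute
scaled_d01 = euclidean(X_scaled[0], X_scaled[1])
scaled_d02 = euclidean(X_scaled[0], X_scaled[2])

print(f'raw:    d(0,1)={raw_d01:.2f}, d(0,2)={raw_d02:.2f}')
print(f'scaled: d(0,1)={scaled_d01:.2f}, d(0,2)={scaled_d02:.2f}')
```

On the raw data the distances are set almost entirely by income, so the 30 cm height gap between samples 0 and 1 is invisible. After scaling, the two distances are of similar size because each feature spans the same range.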

There are many other feature scaling techniques, which are described [here](https://en.wikipedia.org/wiki/Feature_scaling).

In this example, we'll only scale the inputs to be in the range [0, 1] rather than [0, 255].

In [None]:
# Scale the Fashion MNIST data to be in the range [0, 1]
# Note: the maximum pixel value is 255
X_train = X_train / 255.0
X_valid = X_valid / 255.0
X_test = X_test / 255.0

To verify that the data is in the correct format and that you're ready to build and train the network, let's display the first 25 images from the *training*, *validation* and *test* sets and display the class name below each image.

In [None]:
def plot_data(images, labels):
    plt.figure(figsize=(10,10))
    for i in range(25):
        plt.subplot(5,5,i+1)
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(images[i], cmap=plt.cm.binary)
        plt.xlabel(class_names[labels[i]])
    plt.show()
    plt.close('all')

print("Training set")
plot_data(X_train, y_train)

print("Validation set")
plot_data(X_valid, y_valid)

print("Test set")
plot_data(X_test, y_test)

# Define a Model

We are going to define a neural network, or what is typically referred to as a deep learning model. Here, we will build a simple fully-connected network with three `Dense` layers.

<!--<img src="./img/fc_mnist.png" alt="Fully-connected Network" style="width:500px;"/>-->
<img src="https://www.dropbox.com/s/6a05qtkgmlih6s4/fc_mnist.png?raw=1" alt="Fully-connected Network" style="width:500px;"/>

In [None]:
# `keras` was already imported from TensorFlow above (`from tensorflow import keras`),
# so no additional imports are needed here.

num_classes = 10

model = keras.Sequential([
    # Layer 1 - Flatten the input from an image (28 * 28) to a vector (784)
    keras.layers.Flatten(input_shape=(28, 28)),
    # Layer 2 - Dense layer (i.e., fully-connected)
    keras.layers.Dense(128, activation='relu'),
    # Layer 3 - Dense layer (i.e., fully-connected)
    keras.layers.Dense(128, activation='relu'),
    # Layer 4 - Dense layer (i.e., fully-connected)
    # Note: the number of neurons in the last layer must be equal to the number
    #       of output classes (which is 10 in this example).
    keras.layers.Dense(num_classes, activation='softmax')
])

model.summary()

The first layer in this network, `tf.keras.layers.Flatten`, transforms the format of the images from a two-dimensional array (of 28 by 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data.

After the pixels are flattened, the network consists of a sequence of three `tf.keras.layers.Dense` layers. These are densely connected, or fully connected, neural layers. The first and second `Dense` layers have 128 nodes (or neurons) each. The third (and last) layer has 10 nodes with a softmax activation, so it returns an array of 10 probabilities that sum to 1 rather than raw logits. Each node's value is the predicted probability that the current image belongs to the corresponding class.
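The forward pass described above can be sketched in plain NumPy. This is only an illustration of the shapes and operations, not how Keras implements the layers: the weights are random (untrained), the input batch is random noise standing in for images, and the helper names `relu` and `softmax` are defined here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Element-wise max(z, 0)."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Row-wise softmax; subtracting the max keeps exp() numerically stable."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# A batch of 4 random stand-in "images" with values in [0, 1)
images = rng.random((4, 28, 28))

# Random, untrained weights with the same shapes as the model's layers
W1, b1 = rng.normal(0, 0.05, (784, 128)), np.zeros(128)
W2, b2 = rng.normal(0, 0.05, (128, 128)), np.zeros(128)
W3, b3 = rng.normal(0, 0.05, (128, 10)), np.zeros(10)

x = images.reshape(len(images), -1)   # Flatten: (4, 28, 28) -> (4, 784)
h1 = relu(x @ W1 + b1)                # Dense(128, relu)
h2 = relu(h1 @ W2 + b2)               # Dense(128, relu)
probs = softmax(h2 @ W3 + b3)         # Dense(10, softmax)

# Count the learnable parameters (weights + biases) of the three Dense layers
num_params = sum(p.size for p in (W1, b1, W2, b2, W3, b3))
print(x.shape, probs.shape, num_params)
```

The parameter count is (784 × 128 + 128) + (128 × 128 + 128) + (128 × 10 + 10) = 118,282, which should agree with the total reported by `model.summary()` for this architecture. The flatten step contributes nothing to that count.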

# Train a Model

In this section, we will first define several parameters that will be used during the training.

*   `epochs`: the number of training epochs (one epoch means the model has seen the entire training set once).
*   `batch_size`: the number of examples used in each training step.
*   `learning_rate`: a hyperparameter that controls how much the weights of the network are adjusted in the direction of the negative loss gradient at each step.
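As a quick worked example of how `epochs` and `batch_size` interact, the sketch below counts weight updates. It assumes the values used in this notebook: 50,000 training images (60,000 minus the 10,000 held out for validation), a batch size of 256, and 10 epochs. It also assumes the last, smaller batch of each epoch is kept rather than dropped. The `example_` names are chosen for this illustration only.

```python
import math

# Values assumed from this notebook
example_num_train = 50000
example_batch_size = 256
example_epochs = 10

# Number of weight updates per epoch (the last batch may be smaller)
steps_per_epoch = math.ceil(example_num_train / example_batch_size)

# Size of that final, partial batch
last_batch_size = example_num_train - (steps_per_epoch - 1) * example_batch_size

# Total number of weight updates over the whole training run
total_updates = steps_per_epoch * example_epochs

print(steps_per_epoch, last_batch_size, total_updates)
```

With these values each epoch consists of 196 steps (195 full batches plus one batch of 80 images), for 1,960 weight updates in total.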


In [None]:
epochs = 10
batch_size = 256
learning_rate = 0.01

## Loss Function

Before we train a model, we need to specify the **loss function**, `loss`, that will be used to quantify the error between the predicted and the target classes. As we would like to train our model to differentiate among 10 fashion classes in the dataset, a loss function that we can use is *cross-entropy*. Cross-entropy is a measure of how different your predicted distribution is from the target distribution (see [Wikipedia](https://en.wikipedia.org/wiki/Cross_entropy) for more details). 

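In this exercise, we will use the sparse categorical cross-entropy, because the labels are stored as integers (0 to 9) rather than one-hot vectors. We set `from_logits=False` because the last layer of the model already applies softmax and outputs probabilities.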
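For intuition, cross-entropy with integer labels can be sketched in NumPy as the mean negative log of the probability assigned to the true class. This is a simplified illustration, not the Keras implementation. The function name, the clipping constant `eps`, and the three-class toy predictions are assumptions made for the example.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean of -log(p[true class]) over the batch.

    y_true: integer class labels, shape (n,)
    probs:  predicted probabilities, shape (n, num_classes)
    """
    probs = np.clip(probs, eps, 1.0 - eps)           # avoid log(0)
    picked = probs[np.arange(len(y_true)), y_true]   # probability of the true class
    return float(np.mean(-np.log(picked)))

# Toy predictions for 3 samples over 3 classes (each row sums to 1)
toy_probs = np.array([
    [0.7, 0.2, 0.1],   # confident and correct (true class 0)
    [0.1, 0.8, 0.1],   # confident and correct (true class 1)
    [0.3, 0.3, 0.4],   # wrong: true class 0 only gets 0.3
])
toy_labels = np.array([0, 1, 0])

loss_value = sparse_categorical_crossentropy(toy_labels, toy_probs)
print(f'{loss_value:.4f}')
```

The third sample contributes most of the loss because the model assigns only 0.3 to its true class. A prediction that puts all probability on the correct class contributes a loss close to zero.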

Note: TF-Keras also provides many loss functions for other kinds of problems. You can read more [here](https://www.tensorflow.org/api_docs/python/tf/keras/losses).

In [None]:
# Cross-entropy loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

## Optimizer

Another component that we need to specify before the training is the **optimizer**, `optimizer`. The optimizers that are commonly used to train deep learning models are Stochastic Gradient Descent (SGD), Adam, RMSProp, Adadelta, etc. The list of optimizers provided by TF-Keras can be found [here](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers).

Here we will use SGD.
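The core of SGD is a single update rule: move each weight a small step against its gradient. The sketch below applies that rule to a toy quadratic loss with a known minimum. It is an illustration under simplifying assumptions: the gradient is computed exactly here, whereas in real training it is estimated from one mini-batch at a time. The names `sgd_step`, `step_size`, and `target` are hypothetical.

```python
import numpy as np

def sgd_step(w, grad, step_size):
    """One gradient-descent update: w <- w - step_size * grad."""
    return w - step_size * grad

# Toy loss f(w) = sum((w - target)^2), whose gradient is 2 * (w - target)
target = np.array([3.0, -1.0])
w = np.zeros(2)
step_size = 0.1

for _ in range(100):
    grad = 2.0 * (w - target)
    w = sgd_step(w, grad, step_size)

print(w)  # close to `target`, the minimum of the toy loss
```

With this step size the distance to the minimum shrinks by a constant factor on every update. A step size that is too large makes the iterates overshoot and diverge, and one that is too small makes progress very slow, which is why the learning rate is worth tuning.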

In [None]:
# Stochastic gradient descent (SGD)
optimizer = keras.optimizers.SGD(learning_rate=learning_rate)

## Compile the Model

Next, we configure the model for training by calling `compile` with the loss, the optimizer, and the metrics to report.

In [None]:
model.compile(
    loss=loss,
    optimizer=optimizer,
    metrics=['accuracy'])

## Train a model

We are now ready to train our model. Let's start feeding it the training data so that it learns to classify clothing images.

You can read more on the arguments for the `fit` function [here](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#fit).

In [None]:
hist = model.fit(
    X_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(X_valid, y_valid),
    # validation_split=0.1,
    verbose=1)

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(hist.history['loss'], label='train')
ax.plot(hist.history['val_loss'], label='valid')
ax.set_ylabel('Loss')
ax.set_xlabel('Epochs')
plt.legend()
plt.show()

fig, ax = plt.subplots(figsize=(8,6))
ax.plot(hist.history['accuracy'], label='train')
ax.plot(hist.history['val_accuracy'], label='valid')
ax.set_ylabel('Accuracy')
ax.set_xlabel('Epochs')
plt.legend()
plt.show()

plt.close('all')

Let's look at the model predictions in detail. Here we will apply the trained model to the validation set.

In [None]:
# Predict the labels of these images
y_hat_valid_probs = model.predict(X_valid)

print(y_hat_valid_probs.shape)
print(y_hat_valid_probs)

The `predict` function returns, for each input image, a probability distribution over the 10 classes. Typically, we select the class with the highest probability as the predicted class for each input image.

In [None]:
# Convert the probabilities to integer class labels (the index of the highest probability)
y_hat_valid = np.argmax(y_hat_valid_probs, axis=-1)

print(y_hat_valid.shape)
print(y_hat_valid)

To make it more human-friendly, we will visualize the input image and its corresponding prediction to see how our model performs.

In [None]:
def plot_image(i, probs, true_label, img):
    probs, true_label, img = probs, true_label[i], img[i]
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(img, cmap=plt.cm.binary)
    predicted_label = np.argmax(probs)
    if predicted_label == true_label:
        color = 'blue'
    else:
        color = 'red'
    plt.xlabel(
        '{} {:2.0f}% ({})'.format(
            class_names[predicted_label],
            100*np.max(probs),
            class_names[true_label]),
        color=color)
    
def plot_prob_dist(i, probs, true_label):
    probs, true_label = probs, true_label[i]
    plt.grid(False)
    plt.xticks(range(10))
    plt.yticks([])
    thisplot = plt.bar(range(10), probs, color="#777777")
    plt.ylim([0, 1])
    predicted_label = np.argmax(probs, axis=-1)
    thisplot[predicted_label].set_color('red')
    thisplot[true_label].set_color('blue')

def plot_output(probs, images, labels):
    num_rows = 5
    num_cols = 3
    num_images = num_rows*num_cols
    plt.figure(figsize=(2*2*num_cols, 2*num_rows))
    for i in range(num_images):
        plt.subplot(num_rows, 2*num_cols, 2*i+1)
        plot_image(i, probs[i], labels, images)
        plt.subplot(num_rows, 2*num_cols, 2*i+2)
        plot_prob_dist(i, probs[i], labels)
    plt.tight_layout()
    plt.show()

<table>
  <tr>
    <th>Label</th>
    <th>Class</th>
  </tr>
  <tr>
    <td>0</td>
    <td>T-shirt/top</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Trouser</td>
  </tr>
    <tr>
    <td>2</td>
    <td>Pullover</td>
  </tr>
    <tr>
    <td>3</td>
    <td>Dress</td>
  </tr>
    <tr>
    <td>4</td>
    <td>Coat</td>
  </tr>
    <tr>
    <td>5</td>
    <td>Sandal</td>
  </tr>
    <tr>
    <td>6</td>
    <td>Shirt</td>
  </tr>
    <tr>
    <td>7</td>
    <td>Sneaker</td>
  </tr>
    <tr>
    <td>8</td>
    <td>Bag</td>
  </tr>
    <tr>
    <td>9</td>
    <td>Ankle boot</td>
  </tr>
</table>

In [None]:
# Color correct predictions in blue and incorrect predictions in red.
plot_output(y_hat_valid_probs, X_valid, y_valid)

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Validation Set')
print(confusion_matrix(y_true=y_valid, y_pred=y_hat_valid))
print(f'Accuracy: {accuracy_score(y_true=y_valid, y_pred=y_hat_valid):.2f}')
print(f'Macro F1-score: {f1_score(y_true=y_valid, y_pred=y_hat_valid, average="macro"):.2f}')

# Evaluate Performance on Test Set

Once you have finished training the model, evaluate its classification performance on the test set (i.e., data the model has not seen during training or model selection).

In [None]:
# Predict the labels of these images
y_hat_test_probs = model.predict(X_test)

print(y_hat_test_probs.shape)
print(y_hat_test_probs)

In [None]:
# Convert the probabilities to integer class labels (the index of the highest probability)
y_hat_test = np.argmax(y_hat_test_probs, axis=-1)

In [None]:
# Output
print('Test Set')
print(confusion_matrix(y_true=y_test, y_pred=y_hat_test))
print(f'Accuracy: {accuracy_score(y_true=y_test, y_pred=y_hat_test):.2f}')
print(f'Macro F1-score: {f1_score(y_true=y_test, y_pred=y_hat_test, average="macro"):.2f}')

# Error Analysis

It's always a good idea to inspect the output and make sure everything looks fine. Here we'll look at some examples our model gets right, and some examples it gets wrong, on the test set.

First, we determine which samples are correct or incorrect on the test set.

In [None]:
correct_indices = np.where(y_hat_test == y_test)[0]
incorrect_indices = np.where(y_hat_test != y_test)[0]

Then we plot the images with their corresponding classes. In the incorrect case, we also plot the ground truth classes for comparison.

In [None]:
# Correct
idx = np.random.choice(np.arange(len(correct_indices)), 15, replace=False)  # sample without duplicates
print('Correct')
plot_output(
    y_hat_test_probs[correct_indices[idx]],
    X_test[correct_indices[idx]],
    y_test[correct_indices[idx]])

In [None]:
# Incorrect
idx = np.random.choice(np.arange(len(incorrect_indices)), 15, replace=False)  # sample without duplicates
print('Incorrect')
plot_output(
    y_hat_test_probs[incorrect_indices[idx]],
    X_test[incorrect_indices[idx]],
    y_test[incorrect_indices[idx]])

# Play around

Now it is your turn! Let's try to change the model architecture and the optimizer to see the effects.

For example,
* Change the number of fully-connected layers
    * e.g., 2, 3, 4 layers
* Change the number of hidden units
    * e.g., 10, 128, 256, 512
* Change the optimizers (i.e., `optimizer`)
    * e.g., [keras.optimizers.RMSprop](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop), [keras.optimizers.Adadelta](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adadelta), [keras.optimizers.Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam)
* Change the learning rate of the optimizer (i.e., `learning_rate`)
    * e.g., 10000, 0.00001, 0.001
* Change the number of training epochs (i.e., `epochs`)
    * e.g., 1, 10, 20