# Preface

In this notebook, we introduce the basics of building neural network models using the `keras` API on top of the `tensorflow` library. This API significantly simplifies model building and prototyping.

Install `tensorflow` (which now contains the `keras` API under `tf.keras`) by issuing

```
$ pip install tensorflow
```

We will assume that you are using tensorflow v2.0.0 or later. You can check this with:
```python
import tensorflow as tf
print(tf.__version__)
```

In [None]:
pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple

In [None]:
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
sns.set(font_scale=1.5)

# MNIST Dataset

The MNIST dataset (http://yann.lecun.com/exdb/mnist/) is one of the simplest classification benchmarks.

It consists of a collection of hand-written digits from 0 to 9. The goal is to build a classifier for this 10-class multi-class classification problem.

The dataset can be conveniently imported through the `tf.keras.datasets` API.

In [None]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Preprocessing

Let's first check the data shape and format.

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

In [None]:
sns.heatmap(x_train[0])

As can be seen, each input sample is a 28x28 image, with each pixel a grayscale value from 0 to 255.

We will perform the following preprocessing:
  * Flatten the input into a data matrix [num_samples, 28, 28] $\rightarrow$ [num_samples, 784]
  * Rescale the pixel values from [0,255] to [0,1]

In [None]:
x_train = x_train.reshape(-1, 784) / 255.0
x_test = x_test.reshape(-1, 784) / 255.0
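
To see what the two preprocessing steps do, here is a minimal sketch on a small random array that stands in for the images. The names `images` and `flat` and the random data are assumptions made for illustration only; they are not part of the MNIST pipeline above.

```python
import numpy as np

# Toy stand-in for the image array: 3 "images" of 28x28 integer pixels in [0, 255].
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image into a 784-vector and rescale the pixels to [0, 1].
flat = images.reshape(-1, 784) / 255.0

print(flat.shape)                            # (3, 784)
print(flat.min() >= 0.0, flat.max() <= 1.0)  # True True

# Flattening is row-major: pixel (r, c) of an image lands at index 28 * r + c.
r, c = 2, 5
print(np.isclose(flat[0, 28 * r + c], images[0, r, c] / 255.0))  # True
```

Using `-1` for the first dimension lets NumPy infer the number of samples, so the same line works for both the training and the test set.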

Next, let us look at the format of the labels:

In [None]:
print(y_train)

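The labels are the digit values themselves, stored as integers. For this classification problem, however, we should not use these integer labels as is, since that would impose an ordering on classes that are purely categorical. Instead, we convert them into a one-hot encoding, e.g.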
$$
    5 \rightarrow [0, 0, 0, 0, 0, 1, 0, 0, 0, 0] \\
    3 \rightarrow [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
$$

This can be done by calling `tf.keras.utils.to_categorical` on the labels.

In [None]:
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)

In [None]:
print(y_train[0])

In [None]:
y_train.shape
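
For intuition, one-hot encoding can be sketched in plain NumPy by indexing the rows of an identity matrix. The helper `one_hot` and the toy labels below are hypothetical and only illustrate the idea; this is not necessarily how `to_categorical` is implemented internally.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a [len(labels), num_classes] matrix with a single 1 per row."""
    labels = np.asarray(labels)
    # Row k of the identity matrix is exactly the one-hot vector for class k.
    return np.eye(num_classes, dtype=np.float32)[labels]

toy_labels = np.array([5, 3, 0])   # hypothetical labels for illustration
encoded = one_hot(toy_labels, num_classes=10)

print(encoded[0])                  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
print(encoded.argmax(axis=1))      # [5 3 0]
```

Taking `argmax` along the class axis recovers the original integer labels, which is how the one-hot labels are converted back later for evaluation.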

# Building a simple neural network

Here we will build and train a simple neural network for this classification problem.

We will use the `Sequential` model API. A *model* in keras is a high-level abstraction of a neural network. It consists of a collection of *layers*, and training, evaluation and other common tasks are all handled at the model level.

The *sequential* model is a special type of model that consists of a linear stack of layers, and it will suffice for our current task. In future demos we will explore other types of models supported by `keras`. For more information, see https://keras.io/models/about-keras-models/



In [None]:
from tensorflow.keras import Sequential

In [None]:
model = Sequential()

We want to build a simple 1-hidden-layer (shallow) neural network of the form
$$
    h = g(Wx + c) \\
    y = Vh + b
$$
We will use the ReLU activation
$$
    g(z) = \max(0, z)
$$
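The number of hidden units is given by the first dimension of $W$ (equivalently, the dimension of $c$), and the number of output units is 10, since we are considering a 10-class classification problem.

Both of these layers can be implemented by the `Dense` layer type from `tf.keras.layers`. We will first use a hidden dimension of 128. The output layer additionally applies a softmax activation, which turns the outputs $y$ into class probabilities suitable for the cross-entropy loss.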

In [None]:
from tensorflow.keras.layers import Dense

In [None]:
model.add(Dense(units=128, activation='relu'))
model.add(Dense(units=10, activation='softmax'))
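
To connect the two `Dense` layers to the equations above, the following is a rough NumPy sketch of a single forward pass. The random parameter values and the input `x` are assumptions for illustration (Keras uses its own initializers), and the `softmax` helper mirrors the `activation='softmax'` argument of the output layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 784, 128, 10

# Randomly initialised parameters (illustrative only, not Keras's initialiser).
W = rng.normal(scale=0.01, size=(n_hidden, n_in))
c = np.zeros(n_hidden)
V = rng.normal(scale=0.01, size=(n_out, n_hidden))
b = np.zeros(n_out)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = rng.random(n_in)         # a stand-in for one flattened image
h = relu(W @ x + c)          # hidden layer: h = g(Wx + c)
y = V @ h + b                # output layer: y = Vh + b
p = softmax(y)               # class probabilities

n_params = W.size + c.size + V.size + b.size
print(h.shape, y.shape)      # (128,) (10,)
print(n_params)              # 101770
```

The parameter count is $784 \times 128 + 128 = 100480$ for the hidden layer plus $128 \times 10 + 10 = 1290$ for the output layer.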

# Compiling and Training

Now, let us specify losses and optimizers using the `compile` method and train the neural network using the `fit` method.

Here, we will use the *cross-entropy* loss and the *SGD* optimizer, which stands for stochastic gradient descent. We have only introduced the usual gradient descent (GD) in class, but we will discuss its extension to the stochastic case, which handles large datasets, in a later lecture. If we set `batch_size` equal to the size of the training set, SGD reduces to GD.

In the `compile` method, we can also specify additional quantities to monitor during training, in addition to the loss. Recall that we are using the cross-entropy loss as a surrogate for the 0-1 loss (the classification error), so let us also monitor the accuracy.

In [None]:
from tensorflow.keras.optimizers import SGD

In [None]:
model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=0.25), metrics=['accuracy'])
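
As a reminder of what these two quantities measure, here is a small NumPy sketch of the categorical cross-entropy and the accuracy. The function names, the clipping constant `eps` and the toy arrays are assumptions for illustration; Keras's own implementation may differ in details such as clipping and reduction.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean over samples of -sum_k y_true[k] * log(y_prob[k])."""
    y_prob = np.clip(y_prob, eps, 1.0)   # avoid log(0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

def accuracy(y_true, y_prob):
    """Fraction of samples whose most probable class is the true class."""
    return float((y_true.argmax(axis=1) == y_prob.argmax(axis=1)).mean())

# Hypothetical one-hot labels and predicted probabilities: 3 samples, 3 classes.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.5, 0.3, 0.2]])

print(categorical_cross_entropy(y_true, y_prob))
print(accuracy(y_true, y_prob))   # 2 of the 3 argmax predictions are correct
```

Unlike the accuracy, the cross-entropy changes smoothly with the predicted probabilities, which is what makes it usable with gradient-based training.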

We are now ready to train the model.

We are going to use the inefficient GD, which requires `batch_size` to be set to the total number of samples. To keep training fast, we will use only the first 5000 data points.

Here `epochs` refers to the number of sweeps through our training set. Since we are doing GD, each iteration is one sweep, and hence one epoch. 

In [None]:
x_train = x_train[:5000]
y_train = y_train[:5000]

In [None]:
history = model.fit(x=x_train,
                    y=y_train,
                    epochs=200,
                    batch_size=x_train.shape[0],
                    validation_data=(x_test, y_test))
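
The relation between epochs, batch size and the number of gradient updates can be sketched as follows. The helper `updates_per_epoch` is hypothetical, and it assumes that a final, smaller batch still counts as one update.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of gradient updates in one sweep through the data."""
    return math.ceil(num_samples / batch_size)

print(updates_per_epoch(5000, 5000))        # 1   (full-batch GD)
print(updates_per_epoch(5000, 32))          # 157 (mini-batch SGD with batches of 32)
print(200 * updates_per_epoch(5000, 5000))  # 200 updates over 200 epochs
```

With full-batch GD, the 200 epochs above therefore correspond to exactly 200 parameter updates.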

Let us examine the training curves to see how we are doing.

In [None]:
import pandas as pd

In [None]:
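# Use a new name so that the History object returned by `fit` is not overwritten
# and this cell can be re-run safely.
history_df = pd.DataFrame(history.history)

In [None]:
history_df.plot(y=['loss', 'val_loss'], title='Loss')
history_df.plot(y=['accuracy', 'val_accuracy'], title='Accuracy')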

# Evaluation of our Model

Since we are faced with a classification problem, there is more than just accuracy we care about.

## Classification Report

This is a handy function from the `sklearn` library. It outputs the one-vs-all precision, recall, F1-score and support for each class. This is most useful when the dataset is unbalanced (not the case here).

In [None]:
from sklearn.metrics import classification_report

In [None]:
y_test_predict = model.predict(x_test)

In [None]:
print(classification_report(y_true=y_test.argmax(1), y_pred=y_test_predict.argmax(1)))
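
To make the one-vs-all quantities in the report concrete, here is a minimal NumPy sketch that computes them for a single class. The helper `one_vs_all_scores` and the toy labels are assumptions for illustration, not the `sklearn` implementation.

```python
import numpy as np

def one_vs_all_scores(y_true, y_pred, cls):
    """Precision, recall, F1 and support for class `cls` against all others."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == cls) & (y_true == cls)))  # true positives
    fp = int(np.sum((y_pred == cls) & (y_true != cls)))  # false positives
    fn = int(np.sum((y_pred != cls) & (y_true == cls)))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    support = tp + fn                                     # samples truly in `cls`
    return precision, recall, f1, support

# Hypothetical integer labels for illustration.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
print(one_vs_all_scores(y_true, y_pred, cls=1))
```

For class 1 in this toy example there are 2 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all equal 2/3 and the support is 3.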

## Confusion Matrix

We can also look at the so-called *confusion matrix*, which is a matrix whose $(i,j)$ entry is the number of samples belonging to class $i$ that were classified as class $j$.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cmatrix = confusion_matrix(y_true=y_test.argmax(1), y_pred=y_test_predict.argmax(1))

In [None]:
print(cmatrix)
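
The definition above can be sketched directly in NumPy by counting (true class, predicted class) pairs. The helper `confusion_counts` and the toy labels are hypothetical and serve only to illustrate the definition.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """Matrix whose (i, j) entry counts samples of class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when an (i, j) pair occurs repeatedly.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

# Hypothetical integer labels for illustration.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
cm = confusion_counts(y_true, y_pred, num_classes=3)
print(cm)
# [[1 1 0]
#  [0 2 1]
#  [0 0 1]]
```

Each row sums to the number of samples in that true class, and the trace divided by the total count gives the accuracy (here 4/6).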

Let us use `sns.heatmap` to visualize the confusion matrix:

In [None]:
sns.heatmap(cmatrix, annot=True, fmt="d")

We can also zero out the diagonal values (representing the correctly classified samples) for a clearer view of the main confusions:

In [None]:
np.fill_diagonal(cmatrix, 0)
sns.heatmap(cmatrix, annot=True, fmt="d")
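
The most frequent confusion can also be read off programmatically. The sketch below uses a small made-up matrix `cm` (not results from the model above) and works on a copy, because `np.fill_diagonal` modifies its argument in place.

```python
import numpy as np

# Hypothetical confusion matrix for illustration (rows: true, columns: predicted).
cm = np.array([[50, 2, 1],
               [4, 45, 0],
               [1, 7, 40]])

off_diag = cm.copy()            # copy so the original counts are preserved
np.fill_diagonal(off_diag, 0)   # drop the correctly classified samples

# Locate the largest remaining entry: true class i most often mistaken for class j.
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(i, j, off_diag[i, j])     # 2 1 7
```

Here the largest off-diagonal entry says that samples of class 2 are most often misclassified as class 1.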