# Training an MNIST Neural Net using NumPy (Only)

First let's install everything we need for this course.

In [None]:
!pip install -r requirements.txt

The goal of this notebook is to show you how to build and train a very simple feedforward neural network to classify MNIST purely in NumPy. In practice we would use a package such as `tensorflow` or `PyTorch`, but it is nevertheless useful to demystify what is happening under the hood.

In [None]:
from io import BytesIO

import gzip
import itertools
import pickle
import time

import requests
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## 1) Data Loading

In [None]:
def read_mnist(dataset="train", flatten=True):
    MNIST_REMOTE = (
        "https://s3-eu-west-1.amazonaws.com/faculty-client-teaching-materials"
        "/neural-networks/mnist.pkl.gz"
    )
    response = requests.get(MNIST_REMOTE)
    response.raise_for_status()  # fail early if the download did not succeed
    mnist = BytesIO(response.content)

    with gzip.open(mnist, "rb") as f:
        train_set, valid_set, test_set = pickle.load(f, encoding="bytes")

    if "train" in dataset.lower():
        images, labels = train_set
    elif "valid" in dataset.lower():
        images, labels = valid_set
    elif "test" in dataset.lower():
        images, labels = test_set
    else:
        raise ValueError(
            "dataset must be 'train', 'valid' or 'test'. "
            "Got '{}'".format(dataset)
        )
    if not flatten:
        images = images.reshape(-1, 28, 28)

    return images, labels

In [None]:
def show(image, label=None):
    fig, ax = plt.subplots()
    plot = ax.imshow(image.reshape(28, 28), cmap=plt.cm.gray)
    plot.set_interpolation("nearest")
    ax.xaxis.set_ticks_position("top")
    ax.yaxis.set_ticks_position("left")
    # Compare against None so that a label of 0 is still displayed
    if label is not None:
        ax.set_xlabel("Label: {}".format(label), size=14)
    plt.show()

In [None]:
X_train, Y_train = read_mnist("train")

In [None]:
show(X_train[0], Y_train[0])

## 2) Define the Neural Network

**Ex:** Complete the softmax and neural net prediction functions below. Recall that for this simple example we assume a neural net that simply applies a softmax to an affine function of its inputs, i.e.
$y = \textrm{softmax}(W \cdot x + b)$ where $ \textrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)} $

In [None]:
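def softmax(z):
    # Subtract the maximum before exponentiating to avoid overflow, and
    # normalise over the last axis so a batch is handled row by row.
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / exp_z.sum(axis=-1, keepdims=True)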

In [None]:
def neural_network_prediction(x, W, b):
    return softmax(np.dot(x, W) + b)
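
As a quick sanity check of the softmax definition, the sketch below uses a hypothetical helper, `stable_softmax`, which is not part of the notebook. It assumes the common trick of subtracting the maximum logit before exponentiating, which leaves the result unchanged but avoids overflow, and it normalises over the last axis so that a batch is treated one row at a time.

```python
import numpy as np


def stable_softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)  # largest exponent becomes 0
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)


logits = np.array([1.0, 2.0, 3.0])
probs = stable_softmax(logits)
print(probs, probs.sum())

# Adding a constant to every logit leaves the probabilities unchanged,
# so very large inputs do not overflow.
big = stable_softmax(logits + 1000.0)
print(np.allclose(probs, big))

# A batch of inputs (one row per example) is normalised row by row.
batch = stable_softmax(np.array([[0.0, 0.0], [1.0, 3.0]]))
print(batch.sum(axis=1))
```

The outputs are non-negative and each row sums to one, which is what lets us read them as a probability distribution over classes.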

## 3) Define a Loss Function for the Network to Minimise

**Ex:** Implement the cross entropy loss $H_p(q) = -\sum_x p(x) \log q(x)$

In [None]:
def cross_entropy_loss(p, q):
    return -(p * np.log(q)).sum()
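
To get a feel for the loss, here is a small worked example. The function `cross_entropy` and its `eps` clipping are illustrative assumptions rather than part of the exercise: clipping is one common way to avoid `log(0)` when a predicted probability is exactly zero. When `p` is one-hot, the sum collapses to minus the log of the probability assigned to the true class.

```python
import numpy as np


def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H_p(q); q is clipped so that log(0) never occurs."""
    q = np.clip(q, eps, 1.0)
    return -(p * np.log(q)).sum()


p = np.array([0.0, 1.0, 0.0])            # one-hot: the true class is 1
confident = np.array([0.05, 0.9, 0.05])  # most mass on the true class
unsure = np.array([0.4, 0.2, 0.4])       # little mass on the true class

loss_confident = cross_entropy(p, confident)
loss_unsure = cross_entropy(p, unsure)
print(loss_confident, loss_unsure)

# Without clipping, a prediction of exactly zero for the true class
# would give an infinite loss; with clipping it is large but finite.
loss_wrong = cross_entropy(p, np.array([1.0, 0.0, 0.0]))
print(loss_wrong)
```

The loss is small when the prediction puts most of its mass on the true class and grows as that mass shrinks.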

## 4) One-Hot Encode Labels to Compare With Predictions
The neural network takes a `28x28` pixel image (flattened to a vector of length 784) and outputs a prediction for the digit shown in the image. The output is a vector of length ten, which forms a probability distribution over the possible classes `0-9` for the input image. The predicted class is simply the class assigned the largest probability.

But our loss function wants to compare distributions, and so expects vectors of equal length. Currently the true labels in the dataset are digits, `0`, or `1`, or ... `9`, not vectors.

**Ex:** Let's encode labels as a vector, such that `3` -> `[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]`.

In [None]:
def one_hot_encode(label):
    ohe = np.zeros(10, dtype=int)
    ohe[label] = 1
    return ohe
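
If you want to encode many labels at once, one possible approach is to index into an identity matrix. The helper name `one_hot_encode_batch` and its `num_classes` parameter are hypothetical and only used for this illustration.

```python
import numpy as np


def one_hot_encode_batch(labels, num_classes=10):
    """Return an array of shape (len(labels), num_classes) of one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    # Row k of the identity matrix is exactly the one-hot vector for class k.
    return np.eye(num_classes, dtype=int)[labels]


encoded = one_hot_encode_batch([3, 0, 9])
print(encoded)

# argmax along each row recovers the original labels.
decoded = encoded.argmax(axis=1)
print(decoded)
```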

## 5) Gradients of Loss wrt Parameters
During gradient descent, we 'learn' by updating the parameters W, b so as to minimise the loss function. This requires calculating the gradient of the loss function w.r.t. the parameters.

I've done this bit for you (it can be done analytically in this case):

In [None]:
def dloss_dw(y_true, x, y_pred):
    """Analytic gradient of cross-entropy with respect to matrix W."""
    return np.outer(x, y_pred - y_true)


def dloss_db(y_true, x, y_pred):
    """Analytic gradient of cross-entropy with respect to vector b."""
    return y_pred - y_true
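
For a softmax output combined with cross-entropy loss, the gradient of the loss with respect to the pre-softmax activations is `y_pred - y_true`, so the gradient with respect to `W` is the outer product of `x` with that difference. A standard way to convince yourself of this is a finite-difference check. The sketch below is self-contained and uses small random arrays; the function names `loss`, `analytic_grads` and `numerical_grad` are assumptions made for this illustration only.

```python
import numpy as np


def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()


def loss(W, b, x, y_true):
    """Cross-entropy of softmax(x.W + b) against a one-hot target."""
    y_pred = softmax(x @ W + b)
    return -(y_true * np.log(y_pred)).sum()


def analytic_grads(W, b, x, y_true):
    """Gradients of the loss with respect to W and b."""
    delta = softmax(x @ W + b) - y_true
    return np.outer(x, delta), delta


def numerical_grad(f, param, h=1e-6):
    """Central finite differences of the scalar f() w.r.t. each entry of param."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + h
        plus = f()
        param[idx] = original - h
        minus = f()
        param[idx] = original  # restore the entry before moving on
        grad[idx] = (plus - minus) / (2 * h)
    return grad


rng = np.random.default_rng(0)
x = rng.random(5)
W = rng.random((5, 3)) - 0.5
b = rng.random(3) - 0.5
y_true = np.array([0.0, 1.0, 0.0])

grad_w, grad_b = analytic_grads(W, b, x, y_true)
num_w = numerical_grad(lambda: loss(W, b, x, y_true), W)
num_b = numerical_grad(lambda: loss(W, b, x, y_true), b)
max_diff = max(np.abs(grad_w - num_w).max(), np.abs(grad_b - num_b).max())
print(max_diff)
```

If the analytic and numerical gradients agree to within a small tolerance, the formula is very likely correct.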

## 6) Network Training Loop

To learn the neural network parameters, we have to
1. Loop through the data
2. Make a prediction for each data point 
3. Calculate the losses and gradients for a batch of data points - note the gradient for a batch should be the average of the gradients of the points
4. Update the network parameters given these gradients 
5. Repeat until converged (plotting metrics)

**Ex:** Implement the steps above in the function below:

In [None]:
def learn(W, b, max_iters=100000, learning_rate=0.1, batch_size=100):
    """Update W, b to reduce cross-entropy on MNIST using gradient descent."""
    W = W.copy()  # don't overwrite original parameters
    b = b.copy()
    grad_w = np.zeros(W.shape)  # Initialise gradients
    grad_b = np.zeros(b.shape)

    # keep track of some metrics
    total_loss = 0
    losses = []
    train_errors = []
    val_errors = []
    start = time.time()
    print_freq = max(1, max_iters // 10)  # print progress 10 times during learning

    # learn by cycling repeatedly through data for max_iters iterations
    X_train, y_train = read_mnist("train")
    X_val, y_val = read_mnist("valid")

    for i, (x, label) in enumerate(itertools.cycle(zip(X_train, y_train))):
        # predictions
        y_pred = neural_network_prediction(x, W, b)
        y_true = one_hot_encode(label)

        # gradients and loss
        grad_w += dloss_dw(y_true, x, y_pred) / batch_size
        grad_b += dloss_db(y_true, x, y_pred) / batch_size
        total_loss += cross_entropy_loss(y_true, y_pred)

        # apply the averaged gradient once a full batch has been accumulated
        if (i + 1) % batch_size == 0:
            W -= learning_rate * grad_w
            b -= learning_rate * grad_b
            grad_w = np.zeros(W.shape)
            grad_b = np.zeros(b.shape)

        # Calculate error metrics
        if i % print_freq == 0:
            train_error, val_error = get_metrics(
                i, total_loss, W, b, X_train, y_train, X_val, y_val, print_freq
            )

            # accumulate the metrics
            train_errors.append(train_error)
            val_errors.append(val_error)
            losses.append(total_loss)
            total_loss = 0

        if i > max_iters:
            return W, b, losses, train_errors, val_errors
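
The loop above processes one example at a time. As a point of comparison, here is a minimal sketch of the same idea with the whole batch handled in one matrix operation. It runs on small synthetic data rather than MNIST so that it is quick and self-contained, and the names `softmax_rows` and `minibatch_step` are hypothetical. It assumes the gradient with respect to the pre-softmax activations is the prediction minus the one-hot target, averaged over the batch.

```python
import numpy as np


def softmax_rows(z):
    """Row-wise softmax with the usual max-shift for stability."""
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)


def minibatch_step(W, b, X_batch, labels, learning_rate=0.1):
    """One gradient-descent update from a batch; returns new W, new b, mean loss."""
    n = X_batch.shape[0]
    Y = np.eye(W.shape[1])[labels]       # one-hot targets, shape (n, classes)
    P = softmax_rows(X_batch @ W + b)    # predictions, shape (n, classes)
    delta = (P - Y) / n                  # averaged gradient w.r.t. activations
    grad_w = X_batch.T @ delta
    grad_b = delta.sum(axis=0)
    mean_loss = -np.log(P[np.arange(n), labels]).mean()
    return W - learning_rate * grad_w, b - learning_rate * grad_b, mean_loss


# Synthetic stand-in for MNIST: 20 features, 4 classes, where the class
# is the index of the largest of the first four features.
rng = np.random.default_rng(0)
X = rng.random((200, 20))
labels = X[:, :4].argmax(axis=1)
W = np.zeros((20, 4))
b = np.zeros(4)

history = []
for step in range(200):
    idx = rng.choice(len(X), size=50, replace=False)  # sample a batch
    W, b, batch_loss = minibatch_step(W, b, X[idx], labels[idx])
    history.append(batch_loss)

print(history[0], history[-1])
```

Vectorising over the batch usually makes training much faster in NumPy, at the cost of being slightly less explicit about what happens to each individual example.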

## 7) Metrics

In [None]:
def get_metrics(
    iteration, total_loss, W, b, X_train, y_train, X_val, y_val, print_freq
):
    total_loss /= print_freq
    train_error = error_rate_mnist(W, b, X_train, y_train)
    val_error = error_rate_mnist(W, b, X_val, y_val)
    print(
        (
            "Iteration {iteration} | Loss: {loss:.4f} | "
            "Train error: {train:.4f} | Validation error: {val:.4f} | ".format(
                iteration=iteration,
                loss=total_loss,
                train=train_error,
                val=val_error,
            )
        )
    )
    return train_error, val_error

In [None]:
def error_rate_mnist(W, b, X, y):
    y_pred = np.argmax(neural_network_prediction(X, W, b), axis=1)
    return np.mean(y_pred != y)
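
The error rate is the fraction of examples whose most probable class differs from the true label. The toy arrays below are made up purely to illustrate the calculation.

```python
import numpy as np

# Each row is a predicted distribution over three classes for one example.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])
labels = np.array([0, 1, 2, 2])

predicted = probs.argmax(axis=1)           # most probable class per row
error_rate = np.mean(predicted != labels)  # fraction of mismatches
print(predicted, error_rate)
```

Here only the third example is misclassified, so one in four predictions is wrong.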

In [None]:
def plot_learning(losses, train_errors, val_errors):
    """Plot the train/val error rate and the loss"""
    fig, ax = plt.subplots(ncols=2, figsize=(14, 5), squeeze=True)

    ax[0].plot(
        range(len(train_errors)), train_errors, "-o", label="Training error"
    )
    ax[0].plot(
        range(len(val_errors)), val_errors, "-o", label="Validation error"
    )
    ax[0].set_ylim(0, 1)
    ax[0].legend(loc="upper right")
    ax[0].set_title("Error rate")

    ax[1].plot(range(len(losses)), losses, "-o")
    ax[1].set_title("Losses")
    plt.show()

## 8) Train Your Network

In [None]:
W1 = np.random.random((784, 10)) - 0.5
b1 = np.random.random((1, 10)) - 0.5

In [None]:
W, b, losses, train_errors, val_errors = learn(W1, b1)

In [None]:
plot_learning(losses=losses, train_errors=train_errors, val_errors=val_errors)