# Training a Neural Network
We will use Keras to classify images of digits from the MNIST dataset, as we have done earlier. The focus of this practical is on implementing various concepts that we have seen in the lecture. If needed, you can find the documentation for Keras [here](https://keras.io/applications/).

Let's import all the functions and libraries that we will use:

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.utils import to_categorical
import random
import matplotlib.pyplot as plt

## Initial processing
We will reuse the code from earlier to do the initial processing.

We load the MNIST dataset first.

In [None]:
with np.load('data/mnist.npz', allow_pickle=True) as f:
    images_train, labels_train = f['x_train'], f['y_train']
    images_test, labels_test = f['x_test'], f['y_test']

Next we reshape the dataset into vector form and convert it to a type suitable for Keras (`float32`). We have 60,000 training samples, 10,000 test samples and the samples (instances) are 28x28 arrays (together giving 784 components).

In [None]:
images_train = images_train.reshape(60000, 784) 
images_test = images_test.reshape(10000, 784)

images_train = images_train.astype('float32') 
images_test = images_test.astype('float32')

images_train /= 255  # scale pixel values to [0, 1]
images_test /= 255  # scale pixel values to [0, 1]

The labels are stored as integer values from 0 to 9. We need to tell Keras that these form the output categories, which we do by one-hot encoding them with the function `to_categorical` from `keras.utils`.

In [None]:
nb_classes = 10
labels_train = to_categorical(labels_train, nb_classes)
labels_test = to_categorical(labels_test, nb_classes)

We also need to normalize our data so that it has a mean of 0 and a standard deviation of 1. This is in addition to the initial processing we have done.

It is often very useful to normalize the data, as this typically makes training more stable and leads to better model performance.

The normalization is typically done using the following formula:
$$\hat{x}_i=\frac{x_i-\text{mean}(X^{train})}{\text{std}(X^{train})},$$

where $X^{train}$ represents the whole training dataset and $x_i$ is the current pixel.

We typically calculate the statistics using the whole training set, before taking a part of the training data for validation. The test set is normalized using the train set mean and standard deviation.

You will find it useful to use `numpy` for calculating the statistics and transforming the data.
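
As a minimal sketch of the formula above, the following uses a tiny made-up list of pixel values and plain Python instead of `numpy`, so that the arithmetic is easy to follow. The names `toy_train` and `toy_test` are hypothetical; the point is only that both sets are transformed with the statistics of the training set.

```python
import statistics

# Hypothetical toy data standing in for flattened pixel values in [0, 1]
toy_train = [0.0, 0.0, 0.5, 1.0]
toy_test = [0.25, 0.75]

# The statistics come from the training data only
train_mean = statistics.fmean(toy_train)
train_std = statistics.pstdev(toy_train)  # population std, like numpy's default

# Apply the same training statistics to both sets
toy_train_norm = [(x - train_mean) / train_std for x in toy_train]
toy_test_norm = [(x - train_mean) / train_std for x in toy_test]

print(train_mean, train_std)
```

After the transformation the toy training values have mean 0 and standard deviation 1, while the toy test values generally do not, which is expected.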

In [None]:
# add your code here
# calculate the mean and std



In [None]:
# add your code here
# normalize the data - both train and test



The mean and std values should be around 0.1307 and 0.3081 respectively.

It is important to use a validation set that is different from the test set - otherwise we could over-fit to the test set and report overly optimistic generalization performance.

Create your own validation set by taking the final 1000 examples from the training set. Call the images ``images_val`` and labels ``labels_val``.

In [None]:
# add your code here



Also redefine the training set so that it includes only the first 5000 examples (no overlap with the validation set). Keep the names ``images_train`` and ``labels_train``.
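
The two splits rely on slicing from the end and from the start of the data. A small sketch on a hypothetical list of 10 examples (the names `toy_examples`, `toy_val` and `toy_train` are made up for illustration) shows the pattern; the same slicing syntax works along the first axis of a `numpy` array.

```python
# Hypothetical stand-in for a dataset of 10 examples
toy_examples = list(range(10))

toy_val = toy_examples[-2:]   # final 2 examples -> validation
toy_train = toy_examples[:5]  # first 5 examples -> training

# The two subsets share no examples
overlap = set(toy_val) & set(toy_train)
print(toy_train, toy_val, overlap)
```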

In [None]:
# add your code here



You have seen how to implement and fit a model already, so we will focus on implementing several key concepts from scratch, together with learning how to use some of the more advanced techniques in Keras (e.g. learning rate annealing). This will help you understand the details of how these concepts work and will also allow you to potentially come up with your own versions!

## Cross-entropy loss
In this part we will implement multi-class cross-entropy loss from scratch.

As a reminder, the formula for multi-class cross-entropy is
$$\text{cross-entropy}=-\frac{1}{N}\sum_{i=1}^N \sum_{j=1}^C y_{i,j}\log(\hat{y}_{i,j}),$$
where $N$ is the number of examples in a mini-batch, $C$ is the number of classes, $y_{i,j}$ are the true labels and $\hat{y}_{i,j}$ are the predicted probabilities.

Name the function `my_cross_entropy_loss`. The function will expect two arguments: `y_true` labels that are already one-hot encoded and `y_pred` that has the predicted probabilities of the different classes.

While implementing your own cross-entropy loss, you may find the following functions useful:
* `tf.math.reduce_mean`
* `tf.math.reduce_sum`
* `tf.math.log`

Remember to select the correct axis when summing over the different classes: `axis=1` sums across the columns, i.e. over the classes within each example (row). Also note that we call TensorFlow functions directly because Keras is a high-level library that relies on TensorFlow (or an alternative backend, if selected) for the lower-level operations.
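
To make the formula concrete before writing the TensorFlow version, here is a sketch in plain Python that evaluates it by hand on two examples with three classes. The helper names `softmax` and `cross_entropy` are hypothetical and this is not the requested `my_cross_entropy_loss`; it only illustrates the inner sum over classes and the outer mean over examples.

```python
import math

def softmax(row):
    """Turn a list of logits into probabilities that sum to 1."""
    shifted = [v - max(row) for v in row]  # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred):
    """Mean over examples of the negative sum over classes of y * log(y_hat)."""
    per_example = [
        -sum(t * math.log(p) for t, p in zip(true_row, pred_row))
        for true_row, pred_row in zip(y_true, y_pred)
    ]
    return sum(per_example) / len(per_example)

logits = [[1, 2, 3], [4, 5, 6]]
y_true = [[0, 1, 0], [1, 0, 0]]
y_pred = [softmax(row) for row in logits]

ce = cross_entropy(y_true, y_pred)
print(round(ce, 4))
```

For these inputs the value comes out at roughly 1.91, which gives you a number to compare your own implementation against.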

In [None]:
# add your code here



Now test your implementation by running the cell below:

In [None]:
logits = tf.Variable(np.array([[1, 2, 3],[4, 5, 6]]), dtype = tf.float32)
y_pred = tf.nn.softmax(logits, axis = 1)
y_true = tf.Variable(np.array([[0, 1, 0],[1, 0, 0]]), dtype = tf.float32)

true_cross_entropy = tf.keras.losses.CategoricalCrossentropy()

true_ce = true_cross_entropy(y_true, y_pred).numpy()
own_ce = my_cross_entropy_loss(y_true, y_pred).numpy()

# compare with a tolerance, since floating-point results can differ in the last digits
if np.isclose(true_ce, own_ce):
    print('The implementation looks good')
else:
    print('There is some issue in the implementation')
    print('True CE: ' + str(true_ce))
    print('Your CE: ' + str(own_ce))

## ReLU activation function
Next we implement our own ReLU activation function. Recall that ReLU is defined using the following formula:
$$\text{ReLU}(x)=\max(0,x).$$

You will find it helpful to use a relevant function from the `tf.math` library. Since the tensors are expected to be of `float` type, you will need to compare the tensor value with a `float` rather than an integer. Name the function `my_relu_activation`.

In [None]:
# add your code here



You can test your implementation by running the cell below:

In [None]:
tensor = tf.Variable(np.array([[7, -2, 0],[-4, 5, 6]]), dtype = tf.float32)
true_relu_value = np.array([[7, 0, 0],[0, 5, 6]], dtype=np.float32)
own_relu_value = my_relu_activation(tensor).numpy()

if np.array_equal(own_relu_value, true_relu_value):
    print('The implementation looks good')
else:
    print('There is some issue in the implementation')
    print('True ReLU: ' + str(true_relu_value))
    print('Your ReLU: ' + str(own_relu_value))

## Glorot weight initialization scheme
Another key part of training a neural network is selecting a suitable initialization scheme. This matters because, for example, initializing all weights to zero would make every unit in a layer compute the same output and receive the same gradient, which would prevent the model from learning anything useful.

We will implement the Glorot weight initialization scheme, specifically its uniform version:
$$w_i\sim U\left(-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})}\right),$$

where $n_{in}$ is the number of units in the previous layer and $n_{out}$ is the number of units in the next layer.

Name your function `my_glorot_weight_init`. The function should accept two arguments: `shape` (this will give you $n_{in}$ and $n_{out}$) and `dtype`, which will be used when sampling the uniform values. Set the default value of `dtype` to `float32`. You may find it useful to use `tf.random.uniform` for sampling from the uniform distribution.
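
As a quick sanity check of the formula, the sketch below computes the sampling limit in plain Python for a layer with 784 inputs and 500 units, and draws a few values from the corresponding interval. The helper `glorot_limit` is a hypothetical name used only here; the exercise itself should use `tf.random.uniform` as described above.

```python
import math
import random

def glorot_limit(n_in, n_out):
    """Half-width of the uniform interval in the Glorot formula."""
    return math.sqrt(6.0 / (n_in + n_out))

# Example layer: 784 inputs feeding 500 units
limit = glorot_limit(784, 500)

# Draw a few illustrative weights from U(-limit, limit)
rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
samples = [rng.uniform(-limit, limit) for _ in range(5)]

print(round(limit, 4))
```

The limit is about 0.068 here, so wider layers start with smaller weights.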

In [None]:
# add your code here



You can check if your weights have the right shape using the cell below:

In [None]:
shape = [2, 3]

weights = my_glorot_weight_init(shape)

if len(weights.shape) == 2 and weights.shape[0] == 2 and weights.shape[1] == 3:
    print('Your weights have the right shape')
else:
    print('Your weights have incorrect shape: ' + str(weights.shape) + ' rather than (2, 3)')

## L2 regularization
L2 regularization (also known as weight decay in the case of the SGD optimizer) is a common method for preventing over-fitting and improving the generalization of the model. The value of L2 regularization is calculated as:
$$L_2 = \alpha \sum_i w_i^2,$$

where $\alpha$ is the strength of the regularization and we sum over all model weights $w_i$.
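
The formula can be checked by hand on a small weight matrix. The following plain-Python sketch uses a hypothetical helper `l2_penalty` and made-up weights; it is not the Keras regularizer class asked for in this section, only the arithmetic behind it.

```python
def l2_penalty(weights, alpha):
    """alpha times the sum of squared weights, for a nested list of weights."""
    return alpha * sum(w * w for row in weights for w in row)

toy_weights = [[7, -2, 0], [-4, 5, 6]]

# 49 + 4 + 0 + 16 + 25 + 36 = 130, scaled by alpha = 0.5
penalty = l2_penalty(toy_weights, alpha=0.5)
print(penalty)  # 65.0
```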

Implementing your own regularizer in Keras is slightly more complex if we want to make it general enough to be used with various regularization strengths. However, it will be a useful learning exercise.

Define a class called `MyWeightDecay` that inherits from `keras.regularizers.Regularizer`. Define an `__init__` method that has an argument for strength (with default value `0.001`) and stores it as an attribute of the object. Also define a `__call__` method that accepts the weights and returns the regularization, scaled by the chosen regularization strength. The returned value should be a scalar. As before, you may find methods from `tf.math` helpful.



In [None]:
# add your code here



Now test your implementation using the code below:

In [None]:
tensor = tf.Variable(np.array([[7, -2, 0],[-4, 5, 6]]), dtype = tf.float32)
weight_decay_obj = MyWeightDecay(0.5)
own_reg = weight_decay_obj(tensor).numpy()

if np.array_equal(own_reg, 65.0):
    print('The implementation looks good')
else:
    print('There is some issue in the implementation')
    print('True value: ' + str(65.0))
    print('Your value: ' + str(own_reg))

## Model training

We have implemented several key concepts from scratch and now we are ready to implement and train a model! Previously we used a Sequential model from Keras, however this does not allow us to use some of the more advanced features that may be useful. Consequently, we will define our own class.

Name the class `MultiLayerPerceptron` and make it inherit from `keras.Model` - this will give the class many useful additional functions.

Define a method `__init__` and use three arguments: `activation`, `weight_init`, `regularizer`. Set the default values to `relu`, `glorot_uniform` and `None` respectively. As part of the `__init__` method define three fully-connected layers (`keras.layers.Dense`):
* The first one with 500 units, `input_shape=(784,)` and with passed argument values for `activation`, `kernel_initializer`, `kernel_regularizer`.
* The second layer should have 300 units and again use the passed argument values.
* The final layer should have 10 output units, use `softmax` for activation and the passed argument values for `kernel_initializer` and `kernel_regularizer`.

Further define a method `call` that takes an input and passes it through the three layers that we have defined.

If needed, you can find additional help [here](https://keras.io/guides/making_new_layers_and_models_via_subclassing/).

Define the model:

In [None]:
# add your code here



Now we can create an instance of our `MultiLayerPerceptron` model - we can call it `model`. We will use our own methods for `activation`, `weight_init` and `regularizer` (selecting strength of 0.001).

In [None]:
# add your code here



We will also need to create the optimizer. We will use the SGD optimizer with Nesterov momentum, available as `keras.optimizers.SGD`, with learning rate 0.01 and momentum 0.9.

Define the optimizer:

In [None]:
# add your code here



When working with neural networks it is often very useful to use model check-pointing so that we can train the model for a fixed number of epochs and use the model that achieved the best validation accuracy over the training process. This is a form of early stopping and allows us to use the model from its best stage of training.
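
The idea behind keeping only the best checkpoint can be sketched without Keras. The validation accuracies below are hypothetical numbers chosen for illustration; the loop simply remembers the epoch at which the monitored metric was highest, which is the point at which a checkpoint callback would overwrite the saved weights.

```python
# Hypothetical validation accuracies, one per epoch
val_accuracy_per_epoch = [0.90, 0.93, 0.95, 0.94, 0.92]

best_accuracy = float("-inf")
best_epoch = None

for epoch, accuracy in enumerate(val_accuracy_per_epoch, start=1):
    # Mimics "save best only": keep a snapshot only when the metric improves
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_epoch = epoch  # in Keras, this is where the weights would be written

print(best_epoch, best_accuracy)
```

Here the best model is the one from epoch 3, even though training continued for two more epochs.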

We can set up model check-pointing using the following code:

In [None]:
# Do not modify the checkpoint filepath
# it is set so our model can retrieve checkpoints later
checkpoint_filepath = './.checkpoint.weights.h5'

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_filepath, 
    monitor='val_accuracy', 
    verbose=0, 
    save_best_only=True,
    save_weights_only=True, 
    mode='auto', 
    save_freq='epoch')

Now we are ready to train the model. Use function `compile` to prepare the model, using our own cross-entropy loss and the optimizer we have just defined.

When fitting the model, use a batch size of 50, train for 100 epochs, set `verbose=2` to see more details, use `callbacks=[model_checkpoint_callback]` and use the validation data that we have created. If you want to store the training statistics, assign the result of `model.fit` to a variable and then look at its `.history` attribute, which gives you the statistics in the form of a dictionary.

Afterwards load the best weights using `model.load_weights` from the path we have defined. Confirm that if you call `model.evaluate` on the validation data, you obtain the best validation accuracy that you have seen during the training. Further, evaluate the stored model also on the test data to see how it generalizes.

In [None]:
# add your code here



We have used SGD with the default learning rate of 0.01. But is it actually a good value? In this part we check what happens if we try values [0.1, 0.01, 0.001] separately.

Iterate over the different learning rates, in each step initializing a new model and optimizer and training it in the same way we have just done. For each learning rate, print the validation accuracy of the best model obtained during training (remember to also reinitialize the model check-pointing). Based on this we will be able to select which learning rate is the best from the given set.

In [None]:
# add your code here



It is quite likely that you have just found out that the learning rate of 0.01 is actually a really good one and works better than the other ones we have considered. We can consider using learning rate annealing to further improve the model performance. Using a larger learning rate is helpful at the beginning, followed by a smaller learning rate to get closer to the optimum.

We will try using `PiecewiseConstantDecay` learning rate scheduler from `keras.optimizers.schedules` (more information [here](https://keras.io/api/optimizers/learning_rate_schedules/piecewise_constant_decay/)). We need to specify the boundaries and the learning rate values to use. The boundary is expected in steps rather than epochs, so we will need to do a small calculation:
* There are 5,000 examples in the training set and our batch size is 50. Hence each epoch has 5000 / 50 = 100 steps.

Let's use a learning rate of 0.01 for the first 4 epochs (400 steps), followed by a learning rate of 0.005 for the remainder of training. When implementing the approach, use the same settings as we have done previously. The learning rate schedule can be conveniently passed directly to the `learning_rate` parameter. Have a look at both the best validation accuracy as well as the test accuracy.
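
The step calculation and the shape of the schedule can be sketched in plain Python. The function `piecewise_constant` below is a hypothetical stand-in written for illustration, not the Keras class; it assumes the first value applies up to and including the boundary step and the second value applies afterwards.

```python
import bisect

num_train_examples = 5000
batch_size = 50
steps_per_epoch = num_train_examples // batch_size  # 100 steps

boundaries = [4 * steps_per_epoch]  # switch after 4 epochs
values = [0.01, 0.005]              # one more value than boundaries

def piecewise_constant(step):
    """Return values[0] up to and including the boundary, values[1] afterwards."""
    return values[bisect.bisect_left(boundaries, step)]

for step in (0, 400, 401, 5000):
    print(step, piecewise_constant(step))
```

The same `boundaries` and `values` lists are the two pieces of information that `PiecewiseConstantDecay` needs.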

In [None]:
# add your code here



Hopefully learning rate annealing has helped for you too, but if not, it can simply be because of the noise associated with training neural networks.

In [None]:
# review model performance on random examples:
test_image = random.choice(images_test)
plt.imshow(test_image.reshape(28, 28), cmap="gray")
prediction = model.predict(test_image.reshape(1, 784)).argmax()
print(f"\nIs this a number {prediction}?")

In [None]:
# report final loss and accuracy on the test set
print('Test set results:')

loss, accuracy = model.evaluate(images_test, labels_test)
print(f"Test loss: {loss}")
print(f"Test accuracy: {accuracy}")
