## Morning practical 2 day 3

Welcome to the second practical of today. Here, you'll start actually implementing backpropagation. First, we prepare some functions we will need.



In [1]:
#run this cell to set things up
import ipywidgets as widgets, numpy as np, pandas as pd
from numpy.random import default_rng
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import math
import seaborn as sns
from IPython.display import display, Markdown
from scipy.optimize import fmin_bfgs

In [2]:
#important functions
def mySigmoid(data):
    output = 1/(1+ np.exp(-data))
    return output

## Loading in the MNIST data

We're going to be working with the MNIST data again, though now in a slightly different format of 28\*28 = 784 pixel images, where each pixel has a brightness from 0 to 255. We rescale this brightness to a value between 0 and 1.

In [3]:
import tensorflow as tf
from tensorflow import keras

#load-in taken from here: https://keras.io/examples/vision/mnist_convnet/
# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")


# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)


x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples


## Subsampling the training data

At this point, the data is in the format of 60,000 training samples, each consisting of a 28\*28 matrix of pixel intensities with 1 channel (i.e. there's just 1 colour here: grey. In a colour image you would have 3 channels: red, green and blue intensities).

For the dense neural network that we'll be working with, we'll just convert these 28\*28 matrices into vectors with 784 entries. We'll also subsample the data: this amount of data is fine for optimised routines, but for our own implementation 20,000 training samples is _plenty_. 
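
As a small illustration of the flattening step, here is a sketch on a made-up array rather than on the MNIST data itself (`images`, `vectorsStacked` and `vectorsReshaped` are names introduced only for this example). It shows that stacking the ravelled images and doing one reshape give the same result: one 784-entry row per image.

```python
import numpy as np

# Hypothetical stand-in for x_train: 5 'images' of 28*28 pixels with 1 channel
images = np.arange(5 * 28 * 28, dtype="float32").reshape(5, 28, 28, 1)

# Option 1: ravel every image and stack the results row by row
vectorsStacked = np.vstack([np.ravel(elem) for elem in images])

# Option 2: a single reshape; -1 lets numpy work out the 784 itself
vectorsReshaped = images.reshape(len(images), -1)

print(vectorsStacked.shape)
print(np.array_equal(vectorsStacked, vectorsReshaped))
```

Either way, each row is one sample and each column is one pixel.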

In [6]:
from numpy.random import default_rng
rng = default_rng(42)

print(np.unique(y_train, axis = 0, return_counts= True))
#the training set is not quite balanced, but let's keep the proportions it has nonetheless.

uniqueLabels = np.unique(y_train, axis = 0, return_counts= True)

#the if makes sure you don't subsample again if you rerun this cell
if np.sum(uniqueLabels[1]) >=20010:
    listChoiceIndices = []
    for index, label in enumerate(uniqueLabels[0]):
        print("Subsetting for label: " + str(label))
        nToPick = np.ceil(1/3*uniqueLabels[1][index]).astype(int)
        indicesToPick = np.where(np.all(y_train == label, axis = 1))
        choiceIndices = rng.choice(indicesToPick[0], size = nToPick, replace = False)
        listChoiceIndices.append(choiceIndices)
    choiceIndices = [val for sublist in listChoiceIndices for val in sublist]

    x_train, y_train = x_train[choiceIndices], y_train[choiceIndices]

#convert matrices into vectors
x_train_vectors = np.vstack([np.ravel(elem) for elem in x_train])
x_test_vectors  = np.vstack([np.ravel(elem) for elem in x_test])

#output for afternoon practical.
# np.savetxt("XTrainMNIST.csv", x_train_vectors, delimiter = ",")
# np.savetxt("XTestMNIST.csv", x_test_vectors, delimiter = ",")
# np.savetxt("yTrainMNIST.csv", y_train, delimiter=",")
# np.savetxt("yTestMNIST.csv", y_test, delimiter=",")
np.savez_compressed("dataMNISTNeuralNetwork", XTrain=x_train_vectors, XTest=x_test_vectors,
                   yTrain = y_train, yTest = y_test)

(array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32), array([1983, 1951, 2089, 1973, 1807, 1948, 2044, 1986, 2248, 1975],
      dtype=int64))


(20004, 784)

## Implementing the cost function

We'll be working with the same network architecture as yesterday (although slightly more inputs): a Hidden Layer (HL) with 25 nodes, and 10 output neurons (predicting each of the 10 possible digits).
![NN](NeuralNetwork.png)


First things first: we'll need to calculate the cost at the end of the network. We'll use the categorical cross-entropy, whose formula is: <br>
![CategoricalCrossEntropyFormula](CategoricalCrossEntropy.png) <br> <br>

Let's make a function that, ultimately, will return the gradients for every weight in our 2-layer neural network. We might want to use `fmin_bfgs` or another minimisation function on the weights (rather than bog-standard gradient descent). For this reason, we will make the function take in a single flat array of all the thetas of the network. That's what these methods want: one flat array of parameters going in (and, later, one flat array of gradients coming back). We internally turn this array back into the matrices $\Theta_1$ and $\Theta_2$. No stress: that's done for you. Your job: reimplement forward propagation in the function below, then calculate the cost with categorical cross-entropy. So:

* Implement forward propagation in the function below. This is the same as yesterday.
* After forward propagation, calculate the mean cost (1/m * total cost) for all the forward-propagated samples. There's no looping involved!

Hints:
* Don't forget to use `mySigmoid()`.
* First experiment outside the function body before putting things into the function!
* With the seed we've set for our rng, you should get a mean cost of approximately 6.85 for these random thetas.
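
To make the 'flat array in, matrices inside' idea concrete, here is a minimal sketch of the round trip on a toy network. The sizes (4 inputs, 3 hidden units, 2 outputs) and all names ending in `Toy` or `Back` are assumptions made up for this illustration; the real function uses 784, 25 and 10.

```python
import numpy as np
from numpy.random import default_rng

rng = default_rng(0)  # arbitrary seed, only for this illustration

# Toy sizes (assumptions for this sketch): 4 inputs, 3 hidden units, 2 outputs
nInputs, nHidden, nOutputs = 4, 3, 2
thetaOneToy = rng.uniform(-0.12, 0.12, size=(nHidden, nInputs + 1))
thetaTwoToy = rng.uniform(-0.12, 0.12, size=(nOutputs, nHidden + 1))

# Unroll both matrices into one flat array, as an optimiser would want it
flatThetas = np.append(np.ravel(thetaOneToy), np.ravel(thetaTwoToy))

# Turn the flat array back into the two matrices
splitPoint = nHidden * (nInputs + 1)
thetaOneBack = np.reshape(flatThetas[:splitPoint], (nHidden, nInputs + 1))
thetaTwoBack = np.reshape(flatThetas[splitPoint:], (nOutputs, nHidden + 1))

print(flatThetas.shape)
print(np.array_equal(thetaOneToy, thetaOneBack), np.array_equal(thetaTwoToy, thetaTwoBack))
```

Nothing is lost in the unrolling: as long as you know the layer sizes, you can always recover both matrices.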

In [97]:
rng = default_rng(42)
thetaOneMatrix = rng.uniform(-0.12, 0.12, size = (25, 785))
thetaTwoMatrix = rng.uniform(-0.12, 0.12, size = (10, 26))

nnThetas       = np.append(np.ravel(thetaOneMatrix), np.ravel(thetaTwoMatrix))
inputLayerSize = 784
hiddenLayerSize = 25
classLabels = 10
X = x_train_vectors
y = y_train

def nnCostFunction(nnThetas, X, y, lambda_ = 0, inputLayerSize = 784, hiddenLayerSize = 25, classLabels = 10):
    
    m = len(X)
    
    #reshaping the list of parameters to matrices
    hiddenLayerParamNr    = (hiddenLayerSize * (inputLayerSize+1))
    thetaOneMatrix        = np.reshape(nnThetas[0:hiddenLayerParamNr],
                                       newshape = (hiddenLayerSize, inputLayerSize+1))
    outputLayerParamStart = hiddenLayerParamNr 
    thetaTwoMatrix        = np.reshape(nnThetas[outputLayerParamStart:],
                                       newshape = (classLabels, hiddenLayerSize+1))
    
    #Your turn! Implement forward propagation below, then calculate the cost for the forward_propagated samples
    #Feel free to remove the def nnCostfunction: -part and just play with the steps in this cell or a new cell below
        #and to then later assemble it back into a function
    
    return None #change this to return the cost!

#answer

def nnCostFunction(nnThetas, X, y, lambda_ = 0, inputLayerSize = 784, hiddenLayerSize = 25, classLabels = 10):
   
    m = len(X)
    
    #reshaping the list of parameters to matrices
    hiddenLayerParamNr    = (hiddenLayerSize * (inputLayerSize+1))
    thetaOneMatrix        = np.reshape(nnThetas[0:hiddenLayerParamNr],
                                       newshape = (hiddenLayerSize, inputLayerSize+1))
    outputLayerParamStart = hiddenLayerParamNr 
    thetaTwoMatrix        = np.reshape(nnThetas[outputLayerParamStart:],
                                       newshape = (classLabels, hiddenLayerSize+1))
    
    #calculating the forward pass
    inputs        = np.c_[np.ones(shape = (len(X), 1)), X]
    weightedSumHL = inputs @ thetaOneMatrix.T
    activationHL  = mySigmoid(weightedSumHL)
    
    inputsOL      = np.c_[np.ones(shape = (len(activationHL), 1)), activationHL]
    weightedSumOL = inputsOL @ thetaTwoMatrix.T
    activationsOL = mySigmoid(weightedSumOL)
    
    #calculating the cost (skip the explanation to the 1 line of code if you want!)
    #remember: log 1 = 0.
    #log 0 = -Inf
    #So: y is, for example [1,0,0,0,0,0,0,0,0,0]
    #activations for this sample are, for instance: [0.8, 0.133, 0,0,0,0,0,0,0,0]
    #np.log(activations) = [-0.223, -2.02, -Inf, -Inf, -Inf, -Inf, -Inf, -Inf, -Inf, -Inf]
    #[1,0,0,0,0,0,0,0,0,0] * [-0.223, -2.02, -Inf, -Inf, -Inf, -Inf, -Inf, -Inf, -Inf, -Inf] = [-0.223, 0, 0, 0, etc.]
    #(idealised: numpy itself evaluates 0 * -Inf as nan. For the weighted sums we see here, a sigmoid output
    # is a tiny positive number rather than exactly 0, so these products do come out as 0)
    # - [-0.223, 0, 0, 0, etc.] = [0.223, 0, 0, 0, etc.] --> so 0.223 cost due to not correctly predicting the 1 but instead 0.8

    #second part:
    #1-y = [0,1,1,1,1,1,1,1,1,1]
    #1-activationsOL = [0.2, 0.867, 1, 1, 1, 1, 1, 1, 1, 1]
    #np.log(1-activationsOL) = [-1.61, -0.143, 0,0,0,0,0,0,0,0]
    #(1-y) * np.log(1-activationsOL) = [0, -0.143, 0,0,0,0,0,0,0,0]
    
    #finally:
    # [0.223, 0, 0, 0, 0, 0, 0, 0, 0, 0 ] - [0, -0.143, 0,0,0,0,0,0,0,0] = [0.223, 0.143, 0,0,0,0,0,0,0,0]
    # Hence, for this one sample, cost would be 0.366
    
    #Because everything here is an array, this is automatically done for every row (every sample) and the
    #np.sum then sums all numbers up to get to the total cost, which is then averaged by multiplying with 1/m
    J = 1/m * np.sum((- (y * np.log(activationsOL)) - ((1-y) * np.log(1-activationsOL))))
    
    return J


nnCostFunction(nnThetas, X, y)

6.8477887682042775
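
If you want to check the single-sample arithmetic from the comments in the answer yourself, the sketch below does so. It assumes activations of 0.8 and 0.133 as in the comments, but uses a tiny positive number (`1e-12`, an arbitrary choice) in place of the exact zeros, because numpy evaluates `0 * -inf` as `nan`. The names `yOneSample`, `aOneSample` and `costPerOutput` are made up for this example.

```python
import numpy as np

# One sample whose true class is the first one
yOneSample = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)

# Assumed activations: 0.8 and 0.133 as in the comments above, and a tiny
# positive number instead of an exact 0 for the remaining outputs
aOneSample = np.array([0.8, 0.133] + [1e-12] * 8)

# Same expression as in nnCostFunction, but without the sum and the 1/m
costPerOutput = -(yOneSample * np.log(aOneSample)) - ((1 - yOneSample) * np.log(1 - aOneSample))

print(np.round(costPerOutput[:2], 3))  # contributions of the first two outputs
print(round(float(np.sum(costPerOutput)), 3))  # total cost for this one sample
```

The first output contributes because it predicted 0.8 instead of 1, the second because it predicted 0.133 instead of 0; the rest contribute next to nothing.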

## Adding regularisation

You probably already reckoned from the inclusion of a lambda_ argument that there was going to be some regularisation. Indeed there is. We don't want to overfit our network to the training data, and it's still a good strategy to prevent that by adding a penalty for weights that are large in magnitude (strongly positive or strongly negative).

The regularised cost function looks as follows:
![RegCostFunction](RegularisedCostFunctionNN.PNG) <br>

It might seem a bit daunting, but all it's saying is 'loop over every weight (not the biases) for every unit in the HL and the output layer, and add the square of that number to a running sum, then scale with $\lambda$ and divide by twice the number of samples'. It's no different from before.

Up to you to:
* Copy your function from above in the cell below and edit it so that it can return a regularised cost.
* Once done, test it with a $\lambda$ of 100. The resultant cost should be ~7.09.

Hints:
* We don't regularise the bias!
* `np.square()` is probably your friend.
* Remember `lambda_/2*m != lambda_/(2*m)`
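
Two of these hints are easy to see in a few lines. The sketch below uses a made-up toy matrix (`thetaToy`, not one of the real theta matrices) to show how the bias column is left out of the penalty, and uses the $\lambda$ and $m$ from this practical to show how far apart the two ways of writing the scaling factor are.

```python
import numpy as np

lambda_, m = 100, 20004

wrongScale = lambda_ / 2 * m      # evaluated as (lambda_ / 2) * m
rightScale = lambda_ / (2 * m)    # what the regularisation term needs
print(wrongScale, rightScale)

# Toy theta matrix (made up): 2 units, each with a bias (column 0) and 3 weights
thetaToy = np.array([[10.0, 1.0, 2.0, 3.0],
                     [20.0, 0.5, 0.5, 0.5]])

# [:, 1:] skips column 0, so the (deliberately large) biases are not penalised
penalty = np.sum(np.square(thetaToy[:, 1:]))
print(penalty)
```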

In [112]:
#answer

def nnCostFunction(nnThetas, X, y, lambda_ = 0, inputLayerSize = 784, hiddenLayerSize = 25, classLabels = 10):
   
    m = len(X)
    
    #reshaping the list of parameters to matrices
    hiddenLayerParamNr    = (hiddenLayerSize * (inputLayerSize+1))
    thetaOneMatrix        = np.reshape(nnThetas[0:hiddenLayerParamNr],
                                       newshape = (hiddenLayerSize, inputLayerSize+1))
    outputLayerParamStart = hiddenLayerParamNr 
    thetaTwoMatrix        = np.reshape(nnThetas[outputLayerParamStart:],
                                       newshape = (classLabels, hiddenLayerSize+1))
    
    #calculating the forward pass
    inputs        = np.c_[np.ones(shape = (len(X), 1)), X]
    weightedSumHL = inputs @ thetaOneMatrix.T
    activationHL  = mySigmoid(weightedSumHL)
    
    inputsOL      = np.c_[np.ones(shape = (len(activationHL), 1)), activationHL]
    weightedSumOL = inputsOL @ thetaTwoMatrix.T
    activationsOL = mySigmoid(weightedSumOL)
    
    #cost
    J = 1/m * np.sum((- (y * np.log(activationsOL)) - ((1-y) * np.log(1-activationsOL))))
    
    #regularised cost
    #remember: units in the rows, their parameters in the columns
    #Hence, [:,1:] removes the columns with the bias term.
    regThetaOne = np.sum(np.square(thetaOneMatrix[:,1:]))
    regThetaTwo = np.sum(np.square(thetaTwoMatrix[:,1:]))
    regCost     = J + (lambda_/(2*m)) * (regThetaOne + regThetaTwo)
    
    return regCost

nnCostFunction(nnThetas, X, y, lambda_ = 100)

7.085673822927744

## Numerical gradient checking

Before we get to implementing backpropagation, there is one thing we need to do to be sure that we implement it correctly: make a function that can check this for us. In this case, we do that by numerically approximating the gradient. We do this by taking our current $\theta$ value and adding some small value $e$, and also taking our $\theta$ value and subtracting this small value $e$. Then, we can calculate: <br> <br>
$$\frac{\partial J(\theta)}{\partial \theta} \approx \frac{J(\theta + e) - J(\theta - e)}{2e} $$ 

This corresponds to the following image: ![numericalApproximationGradient](NumericalGradientComputation.PNG) <br>

If this explanation leaves you unsure, Welch Labs has got you covered in [this great video](https://www.youtube.com/watch?v=pHMzNW8Agq4).
Do note that the computational cost is **exorbitant**: you have to calculate the cost for all your training samples twice per parameter in the network, and our tiny network has 19,885 of them (25 \* 785 + 10 \* 26). 

Up to you to:
* Define a function `numericalGradientApproximation()` that takes in nnThetas, X, y, and lambda_ as arguments, runs `nnCostFunction()` for the current theta +$e$ and -$e$, and divides the difference between the two costs by 2$e$. Set $e$ to $1\cdot10^{-4}$ 
* **NOTE**: remember that you want to do this for each $\theta$ separately: we are calculating the partial derivative w.r.t. each parameter in the network. In other words: you should change _only one_ theta in each calculation (one time subtracting $e$ from it, one time adding $e$ to it) of the cost function and keep the rest the same. 
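
Before running this on the full network, it can help to see the idea on a function whose gradient you know by heart. The sketch below uses a made-up stand-in cost, `toyCost` ($J = \sum \theta^2$, so the true gradient is $2\theta$), and a hypothetical helper `toyNumericalGradient`; neither is part of the practical's own code.

```python
import numpy as np

def toyCost(thetas):
    # Hypothetical stand-in for nnCostFunction: J = sum(theta^2), so dJ/dtheta = 2*theta
    return np.sum(np.square(thetas))

def toyNumericalGradient(costFunction, thetas, e=1e-4):
    gradients = np.zeros_like(thetas)
    for index in range(len(thetas)):
        # Work on copies so that only one theta is nudged at a time
        plusE = thetas.copy()
        minusE = thetas.copy()
        plusE[index] += e
        minusE[index] -= e
        gradients[index] = (costFunction(plusE) - costFunction(minusE)) / (2 * e)
    return gradients

toyThetas = np.array([0.5, -1.0, 2.0])
numericalGradients = toyNumericalGradient(toyCost, toyThetas)
analyticalGradients = 2 * toyThetas

# Largest disagreement between the approximation and the true gradient
print(np.max(np.abs(numericalGradients - analyticalGradients)))
```

The copies matter: without them, every nudge would stay in the original array and later gradients would be computed at the wrong point.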

In [None]:
#answer
def numericalGradientApproximation(nnThetas, X, y, lambda_, e = 1e-4):
    listGradients = []
    for index in range(len(nnThetas)):
        if index % 500 == 0:
            print("Parameter " + str(index) + " out of " + str(len(nnThetas)))
        #work on copies: plain assignment would only alias nnThetas, so every change would pile up in it
        minusEThetas = nnThetas.copy(); minusEThetas[index] -= e
        minusECost   = nnCostFunction(minusEThetas, X, y, lambda_)
        plusEThetas  = nnThetas.copy(); plusEThetas[index] += e
        plusECost    = nnCostFunction(plusEThetas, X, y, lambda_)
        numericalGradApproxThisTheta = (plusECost - minusECost)/(2*e)
        listGradients.append(numericalGradApproxThisTheta)
    return listGradients




## Implementing backpropagation part 1: sigmoid gradient

As a first step to actually do backpropagation, your task is to implement the sigmoid gradient function in the code cell below: $sigmoid(x)*(1-sigmoid(x))$. In keeping with the age-old traditions of our people, call it `mySigmoidGradient`.

In [None]:
#answer
def mySigmoidGradient(x):
    outcome = mySigmoid(x) * (1-mySigmoid(x))
    return outcome
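
A quick sanity check for this function, sketched here with the two functions repeated so the snippet runs on its own: the gradient at 0 should be 0.25, and it should agree with a central-difference estimate of the slope of the sigmoid. The test points and the step size `e` are arbitrary choices for this example.

```python
import numpy as np

def mySigmoid(data):
    return 1 / (1 + np.exp(-data))

def mySigmoidGradient(x):
    return mySigmoid(x) * (1 - mySigmoid(x))

testValues = np.array([-5.0, 0.0, 5.0])
e = 1e-4

# Central-difference estimate of the slope of the sigmoid at each test value
numericalSlope = (mySigmoid(testValues + e) - mySigmoid(testValues - e)) / (2 * e)

print(mySigmoidGradient(0.0))
print(np.allclose(mySigmoidGradient(testValues), numericalSlope))
```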

## Implementing backpropagation part 2: the real deal

The equations you need to implement are in this figure:
![NNBackProp](NeuralNetworkBackprop.PNG)

Concretely:
* You can calculate the error for the output layer $\delta^{(3)}$: that's the activations minus the labels.
* You can use this to calculate $\delta^{(2)}$: $(\Theta^{(2)})^T \cdot \delta^{(3)} \odot sigmoidGradient(z^{(2)}) $. Here, .* or $\odot$ is elementwise multiplication, which is just * for numpy.
* The 'remove $\delta_0^{(2)}$' note in the figure just means that you shouldn't calculate a gradient for the +1 neuron.
* There's no error for the input layer, but you do need to change the weights of the Hidden Layer!

This might seem a little bit simple, but it does everything we need.
$\delta^{(3)}$ is a (10, 1) vector containing at each position how wrong a given output neuron was for a prediction (or averaged over predictions of many samples). $(\Theta^{(2)})^T$ is a (10, 26)$^T$ = (26, 10) matrix. Its 26 rows consist of 25 rows, one per hidden unit, each containing the weights from that unit to the 10 units in the output layer, plus 1 row with the 10 biases, one for each output layer unit. When we multiply this with the error in the output layer (10, 1), we get a (26, 1) vector that says, for each hidden unit (and for the bias unit), how its _activation_ should be changed to decrease the error. We change these activations by changing the inputs to the activation function (_weights and biases_), so we multiply by the derivative of the activation function at $z^{(2)}$ to find the gradients along which to change the weights and biases so as to reduce the error in the output layer. Just to be clear: $z^{(2)}$ is the weighted sum that layer 2 generates.
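
The shapes in this explanation can be traced in a few lines. The sketch below is for a single sample and uses random, made-up values (`thetaTwoToy`, `zTwoToy`, `deltaThree`), so only the shapes are meaningful, not the numbers. It assumes the same convention as the forward pass above, where the bias comes first, so the +1 neuron is the first row after transposing.

```python
import numpy as np
from numpy.random import default_rng

def mySigmoid(data):
    return 1 / (1 + np.exp(-data))

def mySigmoidGradient(x):
    return mySigmoid(x) * (1 - mySigmoid(x))

rng = default_rng(0)  # arbitrary seed, only for this illustration

# Made-up values for a single sample; only the shapes matter here
thetaTwoToy = rng.uniform(-0.12, 0.12, size=(10, 26))  # output layer weights
zTwoToy     = rng.normal(size=(25, 1))                  # weighted sums of the hidden layer
deltaThree  = rng.uniform(-1, 1, size=(10, 1))          # output-layer error

backpropagated = thetaTwoToy.T @ deltaThree             # (26, 10) @ (10, 1) -> (26, 1)
withoutBias    = backpropagated[1:, :]                  # drop the +1 neuron -> (25, 1)
deltaTwo       = withoutBias * mySigmoidGradient(zTwoToy)  # elementwise -> (25, 1)

print(backpropagated.shape, withoutBias.shape, deltaTwo.shape)
```

In your own implementation you will most likely do this for all samples at once, in which case the same steps apply with the samples along one extra axis.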

Up to you to:
* Make a copy of 