## Afternoon practical day 3

You've just learned about convolutional neural networks. There will be a sneak peek into them at the end. For now, however, we finally implement backpropagation.



In [36]:
#run this cell to set things up
import ipywidgets as widgets, numpy as np, pandas as pd
from numpy.random import default_rng
%matplotlib inline
import matplotlib.pyplot as plt
import math
import seaborn as sns
from IPython.display import display, Markdown, Math
from scipy.optimize import fmin_bfgs, fmin_cg, fmin
import sklearn

In [30]:
#important functions
def mySigmoid(data):
    data= np.array(data)
    output = 1/(1+ np.exp(-data))
    return output

def mySigmoidGradient(x):
    outcome = mySigmoid(x) * (1-mySigmoid(x))
    return outcome

def nnCostFunction(nnThetas, X, y, lambda_ = 0, inputLayerSize = 784, hiddenLayerSize = 25, classLabels = 10):
   
    m = len(X)
    
    #reshaping the list of parameters to matrices
    hiddenLayerParamNr    = (hiddenLayerSize * (inputLayerSize+1))
    thetaOneMatrix        = np.reshape(nnThetas[0:hiddenLayerParamNr],
                                       (hiddenLayerSize, inputLayerSize+1))
    outputLayerParamStart = hiddenLayerParamNr 
    thetaTwoMatrix        = np.reshape(nnThetas[outputLayerParamStart:],
                                       (classLabels, hiddenLayerSize+1))
    
    #calculating the forward pass
    inputs        = np.c_[np.ones(shape = (len(X), 1)), X]
    weightedSumHL = inputs @ thetaOneMatrix.T
    activationsHL  = mySigmoid(weightedSumHL)
    
    inputsOL      = np.c_[np.ones(shape = (len(activationsHL), 1)), activationsHL]
    weightedSumOL = inputsOL @ thetaTwoMatrix.T
    activationsOL = mySigmoid(weightedSumOL)
    
    #cost
    J = 1/m * np.sum((- (y * np.log(activationsOL)) - ((1-y) * np.log(1-activationsOL))))
    
    #regularised cost
    #remember: units in the rows, their parameters in the columns
    #Hence, [:,1:] removes the columns with the bias term.
    regThetaOne = np.sum(np.square(thetaOneMatrix[:,1:]))
    regThetaTwo = np.sum(np.square(thetaTwoMatrix[:,1:]))
    regCost     = J + (lambda_/(2*m)) * (regThetaOne + regThetaTwo)
    
    return regCost

def numericalGradientApproximation(nnThetas, X, y, lambda_, e = 1e-4):
    #central differences: perturb one parameter at a time by -e and +e
    nnThetas = np.asarray(nnThetas, dtype = float)
    listGradients = []
    for index in range(len(nnThetas)):
        if index % 500 == 0:
            print("Parameter " + str(index) + " out of " + str(len(nnThetas)))
        #work on copies so that the original parameters are never modified
        minusEThetas = nnThetas.copy(); minusEThetas[index] -= e
        minusECost   = nnCostFunction(minusEThetas, X, y, lambda_)
        plusEThetas  = nnThetas.copy(); plusEThetas[index] += e
        plusECost    = nnCostFunction(plusEThetas, X, y, lambda_)
        numericalGradApproxThisTheta = (plusECost - minusECost)/(2*e)
        listGradients.append(numericalGradApproxThisTheta)
    return listGradients

## Implementing backpropagation part 2: the real deal

Part of the equations you need to implement are in this figure, namely the calculation of the error per unit. Note that this figure has these calculations for **1 training example**. In reality we want to implement them with linear algebra such that we calculate for all training examples at the same time:
![NNBackProp](NeuralNetworkBackprop.PNG)

Concretely:
* You can calculate the error for the output layer $\delta^{(3)}$: that's the activations for this training sample minus its label vector. There is no extra factor $\sigma'(z^{(3)})$ here: with the cross-entropy cost used in `nnCostFunction` and a sigmoid output layer, that factor cancels against the derivative of the cost.
* You can use these to calculate $\delta^{(2)}$: $(\Theta^{(2)})^T \delta^{(3)} \odot \text{sigmoidGradient}(z^{(2)})$. Here, .* or $\odot$ is elementwise multiplication, which is just `*` in numpy. This calculates the part of the error that is due to the Hidden Layer weighting its inputs wrongly.
* The "remove $\delta_0^{(2)}$" step in the figure just means that you shouldn't calculate a gradient for the +1 neuron: it's not connected to the previous layer, so there's nothing to propagate back!
* There's no error for the input layer, so after this you're done propagating the error, but you do need to change the weights of the Hidden Layer!
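
The missing $\sigma'(z^{(3)})$ factor in the output-layer error can be checked with a short derivation. This is a sketch for a single output unit, assuming the cross-entropy cost from `nnCostFunction` and a sigmoid activation $a = \sigma(z)$; the regularisation term is left out because it does not depend on $z$:

$$J = -\left[\, y \log a + (1-y) \log(1-a) \,\right]$$

$$\frac{\partial J}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} = \frac{a-y}{a(1-a)}, \qquad \frac{\partial a}{\partial z} = \sigma'(z) = a(1-a)$$

$$\delta = \frac{\partial J}{\partial z} = \frac{a-y}{a(1-a)} \cdot a(1-a) = a - y$$

The sigmoid gradient appears in the chain rule, but it cancels against the denominator of the cost derivative, which leaves activations minus labels.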

These steps might seem a little bit simple, but they do everything we need.
For one training example, $\delta^{(3)}$ is a (10, 1) vector that contains, for each output neuron, how wrong its prediction was. $(\Theta^{(2)})^T$ is a (10, 26)$^T$ = (26, 10) matrix: 1 row for the bias unit plus 25 rows for the hidden units, and 10 columns, one per output unit. Each row holds the weights that connect that unit to the 10 units in the output layer. Multiplying this matrix with the error in the output layer (10, 1) gives a (26, 1) vector that says, for the bias unit and for each hidden unit, how its _activation_ should change to decrease the error. We can only change a hidden unit's activation by changing the input to its activation function (through its _weights and biases_), so we multiply by the derivative of the activation function at $z^{(2)}$. The result is what we need to find the gradients for those weights and biases, which we step down to reduce the error in the output layer. Just to be clear: $z^{(2)}$ is the weighted sum that layer 2 generates using matrix $\Theta^{(1)}$.
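
To make the shapes concrete, here is a minimal sketch for one training example. The random weights, the random $z^{(2)}$ and the label are made up for illustration (they are not the MNIST data), and the variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up values for ONE training example (illustration only)
theta_two = rng.uniform(-0.12, 0.12, size=(10, 26))  # Theta^(2): bias column + 25 hidden units
z_two = rng.normal(size=(25, 1))                      # z^(2): weighted sums of the hidden layer
a_two = np.vstack([np.ones((1, 1)), sigmoid(z_two)])  # hidden activations with the +1 unit on top
a_three = sigmoid(theta_two @ a_two)                  # output activations, shape (10, 1)
y = np.zeros((10, 1))
y[3] = 1.0                                            # one-hot label, class 3 chosen arbitrarily

delta_three = a_three - y                             # error of the output layer
back = theta_two.T @ delta_three                      # error sent back to the bias + 25 hidden units
sig_grad = sigmoid(z_two) * (1 - sigmoid(z_two))      # sigmoid gradient at z^(2)
delta_two = back[1:] * sig_grad                       # drop the bias row, then multiply elementwise

print(delta_three.shape, back.shape, delta_two.shape)  # (10, 1) (26, 1) (25, 1)
```

The bias row is dropped before the elementwise product, so the sigmoid gradient only needs to be computed for the 25 real hidden units.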

Up to you to:
* Make a copy of `nnCostFunction`. Call it `nnGradientFunction`. It keeps the same arguments and does exactly the same to begin with, but it will return the gradients calculated by backpropagation rather than stop at calculating the cost function. First set it to return `None`. 
* Calculate $\delta^{(3)}$ from the activations of the output layer and the labels `y`.
* Calculate $\delta^{(2)}$ using $\delta^{(3)}$. For _one_ training example that would be the formula above. But remember, your $\delta^{(3)}$ is a matrix of shape (20004, 10), with the first row corresponding to the errors on the first training example, etc., so you need the matrix version of that formula.
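
One possible way to go from the single-example formula to all training examples at once is sketched below. It uses small random matrices with hypothetical names rather than the real data, and compares the matrix version against an explicit loop over examples:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m = 5                                                  # a handful of made-up training examples
theta_two = rng.uniform(-0.12, 0.12, size=(10, 26))   # Theta^(2)
z_two = rng.normal(size=(m, 25))                       # one row of z^(2) per example
delta_three = rng.normal(size=(m, 10))                 # one row of delta^(3) per example
sig_grad = sigmoid(z_two) * (1 - sigmoid(z_two))

# Matrix version: examples are in the rows, so delta^(3) goes on the LEFT of Theta^(2)
# (m, 10) @ (10, 26) -> (m, 26); then drop the bias column and multiply elementwise
delta_two_batch = (delta_three @ theta_two)[:, 1:] * sig_grad

# Loop version: the single-example formula applied row by row
delta_two_loop = np.zeros((m, 25))
for i in range(m):
    back = theta_two.T @ delta_three[i]                # shape (26,)
    delta_two_loop[i] = back[1:] * sig_grad[i]

print(np.allclose(delta_two_batch, delta_two_loop))    # True
```

Because each training example is a row rather than a column, the transpose in the formula moves: $(\Theta^{(2)})^T \delta^{(3)}$ for one example becomes `delta_three @ theta_two` for the whole matrix.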

In [61]:
from numpy.random import default_rng
rng = default_rng(42)
thetaOneMatrix = rng.uniform(-0.12, 0.12, size = (25, 785))
thetaTwoMatrix = rng.uniform(-0.12, 0.12, size = (10, 26))

savedData = np.load("dataMNISTNeuralNetwork.npz")
X_train, X_test, y_train, y_test = savedData["XTrain"], savedData["XTest"], savedData["yTrain"], savedData["yTest"]


nnThetas       = np.append(np.ravel(thetaOneMatrix), np.ravel(thetaTwoMatrix))
inputLayerSize = 784
hiddenLayerSize = 25
classLabels = 10
X = X_train
y = y_train


# answer

def nnGradientFunction(nnThetas, X, y, lambda_ = 0, inputLayerSize = 784, hiddenLayerSize = 25, classLabels = 10):
   
    m = len(X)
    
    #reshaping the list of parameters to matrices
    hiddenLayerParamNr    = (hiddenLayerSize * (inputLayerSize+1))
    thetaOneMatrix        = np.reshape(nnThetas[0:hiddenLayerParamNr],
                                       (hiddenLayerSize, inputLayerSize+1))
    outputLayerParamStart = hiddenLayerParamNr 
    thetaTwoMatrix        = np.reshape(nnThetas[outputLayerParamStart:],
                                       (classLabels, hiddenLayerSize+1))
    
    #calculating the forward pass
    inputs        = np.c_[np.ones(shape = (len(X), 1)), X]
    weightedSumHL = inputs @ thetaOneMatrix.T
    activationsHL  = mySigmoid(weightedSumHL)
    
    inputsOL      = np.c_[np.ones(shape = (len(activationsHL), 1)), activationsHL]
    weightedSumOL = inputsOL @ thetaTwoMatrix.T
    activationsOL = mySigmoid(weightedSumOL)
    
    print("Activations output layer shape: " + str(activationsOL.shape))
    print("Activations for one training example: " + str(activationsOL[0,:]))
    
    #cost
    J = 1/m * np.sum((- (y * np.log(activationsOL)) - ((1-y) * np.log(1-activationsOL))))
    
    #regularised cost
    #remember: units in the rows, their parameters in the columns
    #Hence, [:,1:] removes the columns with the bias term.
    regThetaOne = np.sum(np.square(thetaOneMatrix[:,1:]))
    regThetaTwo = np.sum(np.square(thetaTwoMatrix[:,1:]))
    regCost     = J + (lambda_/(2*m)) * (regThetaOne + regThetaTwo)
    
    
    #calculate error layer 3
    smallDeltaThree = activationsOL - y
    print("delta^(3): " + str(smallDeltaThree))
    print("delta^(3) shape: " + str(smallDeltaThree.shape) + "\n")
    
    #z^(2): the weighted sums of the hidden layer, i.e. the values that go into its activation
    #function before being sent to the output layer (already calculated in the forward pass)
    zTwo = weightedSumHL
    print("z^(2): " + str(zTwo))
    print("z^(2) shape: " + str(zTwo.shape) + "\n")
    
    sigmoidGradientOfZTwo = mySigmoidGradient(zTwo)
    print("Sigmoid gradient: " + str(sigmoidGradientOfZTwo))
    print("Sigmoid gradient shape: " + str(sigmoidGradientOfZTwo.shape))
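    #propagate the error back: (m, 10) @ (10, 26) -> (m, 26). Drop the bias column, because
    #the +1 unit has no incoming weights, then multiply elementwise with the sigmoid gradient
    smallDeltaTwo   = (smallDeltaThree @ thetaTwoMatrix)[:, 1:] * sigmoidGradientOfZTwo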
    print("delta^(2): " + str(smallDeltaTwo))
    print("size of delta^(2): " + str(smallDeltaTwo.shape)) 
    
    
    #gradients: error of each unit times the activations coming into it,
    #summed over all training examples by the matrix product
    bigDeltaThree     = smallDeltaThree.T @ inputsOL
    print("Gradients output layer: " + str(bigDeltaThree))
    print("Gradients output layer shape: " + str(bigDeltaThree.shape))
    
    bigDeltaTwo       = smallDeltaTwo.T   @ inputs
    print("Gradients hidden layer: " + str(bigDeltaTwo))
    print("Gradients hidden layer shape: " + str(bigDeltaTwo.shape))
    
    #average values, we've now summed them over all training examples
    #note: lambda_ is not used here, so these are the unregularised gradients
    bigDeltaTwo   = bigDeltaTwo   / m
    bigDeltaThree = bigDeltaThree / m

    finalGradients = np.append(np.ravel(bigDeltaTwo), np.ravel(bigDeltaThree))
    
    return finalGradients
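
A common way to sanity-check a backpropagation implementation like this one is to compare it with a numerical gradient. The sketch below does that on a deliberately tiny network (4 inputs, 3 hidden units, 2 classes) with random data and no regularisation, so that the loop over parameters is fast. All function and variable names here are hypothetical stand-ins, not the notebook's own functions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, X, sizes):
    # sizes = (inputs, hidden units, classes); unpack the flat parameter vector
    n_in, n_hid, n_out = sizes
    split = n_hid * (n_in + 1)
    theta_one = thetas[:split].reshape(n_hid, n_in + 1)
    theta_two = thetas[split:].reshape(n_out, n_hid + 1)
    a_one = np.c_[np.ones((len(X), 1)), X]
    z_two = a_one @ theta_one.T
    a_two = np.c_[np.ones((len(X), 1)), sigmoid(z_two)]
    a_three = sigmoid(a_two @ theta_two.T)
    return theta_two, a_one, z_two, a_two, a_three

def cost(thetas, X, y, sizes):
    # unregularised cross-entropy cost, averaged over the examples
    a_three = forward(thetas, X, sizes)[-1]
    return np.sum(-y * np.log(a_three) - (1 - y) * np.log(1 - a_three)) / len(X)

def backprop_gradient(thetas, X, y, sizes):
    theta_two, a_one, z_two, a_two, a_three = forward(thetas, X, sizes)
    delta_three = a_three - y
    sig_grad = sigmoid(z_two) * (1 - sigmoid(z_two))
    delta_two = (delta_three @ theta_two)[:, 1:] * sig_grad
    grad_one = delta_two.T @ a_one / len(X)
    grad_two = delta_three.T @ a_two / len(X)
    return np.append(grad_one.ravel(), grad_two.ravel())

def numerical_gradient(thetas, X, y, sizes, e=1e-4):
    # central differences, one parameter at a time, always on a fresh copy
    grad = np.zeros_like(thetas)
    for i in range(len(thetas)):
        plus = thetas.copy()
        plus[i] += e
        minus = thetas.copy()
        minus[i] -= e
        grad[i] = (cost(plus, X, y, sizes) - cost(minus, X, y, sizes)) / (2 * e)
    return grad

rng = np.random.default_rng(0)
sizes = (4, 3, 2)
X_small = rng.normal(size=(6, 4))                      # 6 made-up examples
y_small = np.eye(2)[rng.integers(0, 2, size=6)]        # one-hot labels
thetas_small = rng.uniform(-0.12, 0.12, size=3 * 5 + 2 * 4)

analytic = backprop_gradient(thetas_small, X_small, y_small, sizes)
numeric = numerical_gradient(thetas_small, X_small, y_small, sizes)
max_diff = np.max(np.abs(analytic - numeric))
print("Largest difference between the two gradients: " + str(max_diff))
```

If the two gradients agree up to a very small difference, the backpropagation code is most likely correct. Running the same comparison on the full network would be slow, because every parameter needs two evaluations of the cost function.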



In [62]:
b = nnGradientFunction(nnThetas, X, y)
print(b.shape)
print(nnThetas.shape)

Activations output layer shape: (20004, 10)
Activations for one training example: [0.50679905 0.47026724 0.48045209 0.45695882 0.50439269 0.42537011
 0.45660324 0.55061342 0.58567024 0.48573213]
delta^(3): [[ 0.50679905  0.47026724  0.48045209 ...  0.55061342  0.58567024
  -0.51426787]
 [ 0.51597866  0.47184661  0.47270976 ...  0.57208504  0.58398438
  -0.52566946]
 [ 0.50011956  0.45834984  0.47915344 ...  0.55698424  0.59916884
  -0.51727237]
 ...
 [-0.46315771  0.47886494  0.45896251 ...  0.56575545  0.59230702
   0.48211795]
 [-0.47937306  0.46951334  0.45806689 ...  0.54667469  0.58760281
   0.47131704]
 [-0.48248302  0.47086103  0.45044764 ...  0.56747375  0.60631971
   0.48869242]]
delta^(3) shape: (20004, 10)

z^(2): [[ 1.47211529  0.72742786 -0.61089524 ...  0.31187328 -0.9862839
  -0.24400619]
 [ 1.97432473  1.22491872 -1.0079633  ...  0.44606513 -0.53983498
   0.30598892]
 [ 1.79004393  0.16434479 -0.54288865 ...  1.06426415 -0.48233109
  -0.21282825]
 ...
 [ 0.38756623  1.6

Gradients hidden layer: [[-221.88941345    0.            0.         ...    0.
     0.            0.        ]
 [ 130.18579792    0.            0.         ...    0.
     0.            0.        ]
 [  33.42637502    0.            0.         ...    0.
     0.            0.        ]
 ...
 [-487.67420661    0.            0.         ...    0.
     0.            0.        ]
 [ 140.56213249    0.            0.         ...    0.
     0.            0.        ]
 [  38.66465763    0.            0.         ...    0.
     0.            0.        ]]


['XTrain', 'XTest', 'yTrain', 'yTest']