# Trondvq - week 43


In [248]:
import autograd.numpy as np  # We need to use this numpy wrapper to make automatic differentiation work later
from autograd import grad, elementwise_grad
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score


# Defining some activation functions
def ReLU(z):
    return np.where(z > 0, z, 0)


# Derivative of the ReLU function
def ReLU_der(z):
    return np.where(z > 0, 1, 0)


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def mse(predict, target):
    return np.mean((predict - target) ** 2)

# Exercise 2 - Gradient with one layer using autograd

For the first few exercises, we will not use batched inputs. Only a single input vector is passed through the layer at a time.

In this exercise you will compute the gradient of a single layer. You only need to change the code in the cells right below an exercise, the rest works out of the box. Feel free to make changes and see how stuff works though!


**b)** Complete the feed_forward_one_layer function. It should use the sigmoid activation function. Also define the weigth and bias with the correct shapes.


In [249]:
def feed_forward_one_layer(W, b, x):
    z =  W @ x + b
    a = sigmoid(z)
    return a


def cost_one_layer(W, b, x, target):
    predict = feed_forward_one_layer(W, b, x)
    return mse(predict, target)


x = np.random.rand(2)
target = np.random.rand(3)

W = np.random.randn(3, 2) # target ->  output 3, x -> input 2
b = np.random.randn(3)

**c)** Compute the gradient of the cost function wrt. the weigth and bias by running the cell below. You will not need to change anything, just make sure it runs by defining things correctly in the cell above. This code uses the autograd package which uses backprogagation to compute the gradient!


In [250]:
autograd_one_layer = grad(cost_one_layer, [0, 1])
W_g, b_g = autograd_one_layer(W, b, x, target)
print(W_g, b_g)

[[-0.00955562 -0.006513  ]
 [ 0.01321374  0.00900633]
 [-0.03521398 -0.02400143]] [-0.01473166  0.02037129 -0.05428851]


# Exercise 3 - Gradient with one layer writing backpropagation by hand

Before you use the gradient you found using autograd, you will have to find the gradient "manually", to better understand how the backpropagation computation works. To do backpropagation "manually", you will need to write out expressions for many derivatives along the computation.


**a)** Which intermediary results can be reused between the two expressions?

The intermediary results we can use are:
$$
   \frac{da}{dz}
$$
as it comes from the derivative of the activation function 
$$
   \frac{dC}{da}
$$
as it comes from the derivative of the cost function

**b)** What is the derivative of the cost wrt. the final activation? You can use the autograd calculation to make sure you get the correct result. Remember that we compute the mean in mse.


In [251]:
z = W @ x + b
a = sigmoid(z)

predict = a

def mse_der(predict, target):
    return 2 * (predict - target) / len(target) 


print(mse_der(predict, target))

cost_autograd = grad(mse, 0)
print(cost_autograd(predict, target))

[-0.05974069  0.23076101 -0.22877004]
[-0.05974069  0.23076101 -0.22877004]


**c)** What is the expression for the derivative of the sigmoid activation function? You can use the autograd calculation to make sure you get the correct result.


In [252]:
def sigmoid_der(z):
    sigmoid = 1 / (1 + np.exp(-z))
    return sigmoid * (1 - sigmoid)


print(sigmoid_der(z))

sigmoid_autograd = elementwise_grad(sigmoid, 0)
print(sigmoid_autograd(z))

[0.24659339 0.08827872 0.23730602]
[0.24659339 0.08827872 0.23730602]


**d)** Using the two derivatives you just computed, compute this intermetidary gradient you will use later:

$$
\frac{dC}{dz} = \frac{dC}{da}\frac{da}{dz}
$$


In [253]:
dC_da = mse_der(a, target)
dC_dz = dC_da * sigmoid_der(z)

**e)** What is the derivative of the intermediary z wrt. the weight and bias? What should the shapes be? The one for the weights is a little tricky, it can be easier to play around in the next exercise first. You can also try computing it with autograd to get a hint.

z = W @ x + b

The derivative of intermediary z wrt. the weight:
As there is a matrix multiplication between the weight and x, the derivative will be x transposed.
$$
   \frac{dz}{dW} = x^T
$$

The shape will be the same as the weight matrix.

The derivative of intermediary z wrt. the bias:
There is a linear relationship between the bias and z, so the derivative is 1.
$$
   \frac{dz}{db} = 1
$$

The shape will be the same as the bias vector



**f)** Now combine the expressions you have worked with so far to compute the gradients! Note that you always need to do a feed forward pass while saving the zs and as before you do backpropagation, as they are used in the derivative expressions


In [254]:
dC_da = mse_der(a, target)
dC_dz = dC_da * sigmoid_der(z)
dC_dW = np.outer(dC_dz, x)
dC_db = dC_dz

print(dC_dW, dC_db)

[[-0.00955562 -0.006513  ]
 [ 0.01321374  0.00900633]
 [-0.03521398 -0.02400143]] [-0.01473166  0.02037129 -0.05428851]


You should get the same results as with autograd.


In [255]:
W_g, b_g = autograd_one_layer(W, b, x, target)
print(W_g, b_g)

[[-0.00955562 -0.006513  ]
 [ 0.01321374  0.00900633]
 [-0.03521398 -0.02400143]] [-0.01473166  0.02037129 -0.05428851]


# Exercise 4 - Gradient with two layers writing backpropagation by hand


Now that you have implemented backpropagation for one layer, you have found most of the expressions you will need for more layers. Let's move up to two layers.


In [256]:
x = np.random.rand(2)
target = np.random.rand(4)

W1 = np.random.rand(3, 2)
b1 = np.random.rand(3)

W2 = np.random.rand(4, 3)
b2 = np.random.rand(4)

layers = [(W1, b1), (W2, b2)]

In [257]:
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

We begin by computing the gradients of the last layer, as the gradients must be propagated backwards from the end.

**a)** Compute the gradients of the last layer, just like you did the single layer in the previous exercise.


In [258]:
dC_da2 = mse_der(a2, target)
dC_dz2 = dC_da2 * sigmoid_der(z2)
dC_dW2 = np.outer(dC_dz2, a1)
dC_db2 = dC_dz2 

To find the derivative of the cost wrt. the activation of the first layer, we need a new expression, the one furthest to the right in the following.

$$
\frac{dC}{da_1} = \frac{dC}{dz_2}\frac{dz_2}{da_1}
$$

**b)** What is the derivative of the second layer intermetiate wrt. the first layer activation? (First recall how you compute $z_2$)

$$
\frac{dz_2}{da_1}
$$


In [259]:
#z2 =  W2 @ a1 + b2 -> derivative with respect to a1

dz2_da1 = W2.T

**c)** Use this expression, together with expressions which are equivelent to ones for the last layer to compute all the derivatives of the first layer.

$$
\frac{dC}{dW_1} = \frac{dC}{da_1}\frac{da_1}{dz_1}\frac{dz_1}{dW_1}
$$

$$
\frac{dC}{db_1} = \frac{dC}{da_1}\frac{da_1}{dz_1}\frac{dz_1}{db_1}
$$


In [260]:
dC_da1 = dz2_da1 @ dC_dz2
dC_dz1 = dC_da1 * sigmoid_der(z1)
dC_dW1 = np.outer(dC_dz1, x)
dC_db1 = dC_dz1

In [261]:
print(dC_dW1, dC_db1)
print(dC_dW2, dC_db2)

[[0.00282524 0.0023705 ]
 [0.00317703 0.00266566]
 [0.001463   0.00122752]] [0.0031767  0.00357224 0.00164499]
[[ 0.01594189  0.01542827  0.01389762]
 [ 0.00509916  0.00493487  0.00444528]
 [-0.00514188 -0.00497622 -0.00448252]
 [ 0.03856889  0.03732625  0.03362309]] [ 0.01945072  0.00622149 -0.00627361  0.04705794]


**d)** Make sure you got the same gradient as the following code which uses autograd to do backpropagation.


In [262]:
def feed_forward_two_layers(layers, x):
    W1, b1 = layers[0]
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)

    W2, b2 = layers[1]
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)

    return a2

In [263]:
def cost_two_layers(layers, x, target):
    predict = feed_forward_two_layers(layers, x)
    return mse(predict, target)


grad_two_layers = grad(cost_two_layers, 0)
grad_two_layers(layers, x, target)

[(array([[0.00282524, 0.0023705 ],
         [0.00317703, 0.00266566],
         [0.001463  , 0.00122752]]),
  array([0.0031767 , 0.00357224, 0.00164499])),
 (array([[ 0.01594189,  0.01542827,  0.01389762],
         [ 0.00509916,  0.00493487,  0.00444528],
         [-0.00514188, -0.00497622, -0.00448252],
         [ 0.03856889,  0.03732625,  0.03362309]]),
  array([ 0.01945072,  0.00622149, -0.00627361,  0.04705794]))]

**e)** How would you use the gradient from this layer to compute the gradient of an even earlier layer? Would the expressions be any different?

This is the concept of backpropagation. Here we would use this layer's gradient to compute the gradient of the previous. This is done through the chain rule.  We use the gradient of this layer’s pre-activation output (z) with respect to the previous layer’s activation (a) and then apply the activation function’s derivative ($\frac{da}{dz}$). 

The expressions would remain the same as we follow the same steps for each layer. 

# Exercise 5 - Gradient with any number of layers writing backpropagation by hand


In [264]:
def create_layers(network_input_size, layer_output_sizes):
    layers = []

    i_size = network_input_size
    for layer_output_size in layer_output_sizes:
        W = np.random.randn(layer_output_size, i_size)
        b = np.random.randn(layer_output_size)
        layers.append((W, b))

        i_size = layer_output_size
    return layers


def feed_forward(input, layers, activation_funcs):
    a = input
    for (W, b), activation_func in zip(layers, activation_funcs):
        z = W @ a + b
        a = activation_func(z)
    return a


def cost(layers, input, activation_funcs, target):
    predict = feed_forward(input, layers, activation_funcs)
    return mse(predict, target)

You might have already have noticed a very important detail in backpropagation: You need the values from the forward pass to compute all the gradients! The feed forward method above is great for efficiency and for using autograd, as it only cares about computing the final output, but now we need to also save the results along the way.

Here is a function which does that for you.


In [265]:
def feed_forward_saver(input, layers, activation_funcs):
    layer_inputs = []
    zs = []
    a = input
    for (W, b), activation_func in zip(layers, activation_funcs):
        layer_inputs.append(a)
        z = W @ a + b
        a = activation_func(z)

        zs.append(z)

    return layer_inputs, zs, a

**a)** Now, complete the backpropagation function so that it returns the gradient of the cost function wrt. all the weigths and biases. Use the autograd calculation below to make sure you get the correct answer.


In [266]:
def backpropagation(
    input, layers, activation_funcs, target, activation_ders, cost_der=mse_der
):
    layer_inputs, zs, predict = feed_forward_saver(input, layers, activation_funcs)

    layer_grads = [() for layer in layers]

    # We loop over the layers, from the last to the first
    for i in reversed(range(len(layers))):
        layer_input, z, activation_der = layer_inputs[i], zs[i], activation_ders[i]

        if i == len(layers) - 1:
            # For last layer we use cost derivative as dC_da(L) can be computed directly
            dC_da = cost_der(predict, target)
        else:
            # For other layers we build on previous z derivative, as dC_da(i) = dC_dz(i+1) * dz(i+1)_da(i)
            (W, b) = layers[i + 1]
            # dC/dz from the next layer -> layer_grads[i + 1][1]
            dC_da = W.T @ (layer_grads[i + 1][1]) # as  dz(i+1)_da(i)  -> W.T

        dC_dz = dC_da * activation_der(z)
        dC_dW = np.outer(dC_dz, layer_input)
        dC_db =  dC_dz

        layer_grads[i] = (dC_dW, dC_db)

    return layer_grads

In [267]:
network_input_size = 2
layer_output_sizes = [3, 4]
activation_funcs = [sigmoid, ReLU]
activation_ders = [sigmoid_der, ReLU_der]

layers = create_layers(network_input_size, layer_output_sizes)

x = np.random.rand(network_input_size)
target = np.random.rand(4)

In [268]:
layer_grads = backpropagation(x, layers, activation_funcs, target, activation_ders)
print(layer_grads)

[(array([[0., 0.],
       [0., 0.],
       [0., 0.]]), array([0., 0., 0.])), (array([[-0., -0., -0.],
       [-0., -0., -0.],
       [-0., -0., -0.],
       [-0., -0., -0.]]), array([-0., -0., -0., -0.]))]


In [269]:
cost_grad = grad(cost, 0)
cost_grad(layers, x, [sigmoid, ReLU], target)

[(array([[0., 0.],
         [0., 0.],
         [0., 0.]]),
  array([0., 0., 0.])),
 (array([[0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.]]),
  array([0., 0., 0., 0.]))]

# Exercise 6 - Batched inputs

Make new versions of all the functions in exercise 5 which now take batched inputs instead. See last weeks exercise 5 for details on how to batch inputs to neural networks. You will also need to update the backpropogation function.


In [270]:
#From last week
def create_layers_batch(network_input_size, layer_output_sizes):
    layers = []
    i_size = network_input_size
    for layer_output_size in layer_output_sizes:
        W = np.random.randn(i_size,layer_output_size)
        b = np.random.randn(layer_output_size)
        layers.append((W, b))
        i_size = layer_output_size
    return layers
#From last week
def feed_forward_batch(inputs, layers, activation_funcs):
    a = inputs
    for (W, b), activation_func in zip(layers, activation_funcs):
        z = a @ W + b
        a = activation_func(z)
    return a
#From last week
def cost_batch(layers, input, activation_funcs, target):
    predict = feed_forward_batch(input, layers, activation_funcs)
    return mse(predict, target)

def feed_forward_saver_batch(input_batch, layers, activation_funcs):
    layer_inputs = []
    zs = []
    a = input_batch  
    for (W, b), activation_func in zip(layers, activation_funcs):
        layer_inputs.append(a)  
        z = a @ W + b #Transposed
        a = activation_func(z) 
        zs.append(z)  
    return layer_inputs, zs, a


def backpropagation_batch(
        input, layers, activation_funcs, target, activation_ders, cost_der=mse_der
):
    layer_inputs, zs, predict = feed_forward_saver_batch(input, layers, activation_funcs)
    layer_grads = [() for layer in layers]

    batch_size = input.shape[0]

    # We loop over the layers, from the last to the first
    for i in reversed(range(len(layers))):
        layer_input, z, activation_der = layer_inputs[i], zs[i], activation_ders[i]

        if i == len(layers) - 1:
            # For last layer we use cost derivative as dC_da(L) can be computed directly
            dC_da = cost_der(predict, target)
        else:
            # For other layers we build on previous z derivative, as dC_da(i) = dC_dz(i+1) * dz(i+1)_da(i)
            W = layers[i + 1][0]
            # dC/dz from the next layer -> layer_grads[i + 1][1]
            dC_da =(layer_grads[i + 1][1]) @  W.T # Transposed

        dC_dz = dC_da * activation_der(z)
        dC_dW = (layer_input.T @ dC_dz) / batch_size #batch
        dC_db = np.mean(dC_dz, axis=0) #
        
        layer_grads[i] = (dC_dW, dC_db)

    return layer_grads

In [271]:
network_input_size = 3
layer_output_sizes = [5, 2]
batch_size = 4

layers = create_layers_batch(network_input_size, layer_output_sizes)

inputs = np.random.randn(batch_size, network_input_size)
target = np.random.randn(batch_size, layer_output_sizes[-1])

activation_funcs = [sigmoid, ReLU]
activation_ders = [sigmoid_der, ReLU_der]

backpropagation_batch(inputs, layers, activation_funcs, target, activation_ders, cost_der=mse_der)


[(array([[ 0.03991761, -0.00855397, -0.00472669,  0.01363542,  0.00322598],
         [ 0.05642971, -0.02012056, -0.00664997,  0.02053988,  0.00640443],
         [-0.04220574,  0.0041889 ,  0.00470921, -0.01415179, -0.00284884]]),
  array([-0.12651016,  0.04127896,  0.01404353, -0.04688054, -0.01515062])),
 (array([[-0.12679767,  0.        ],
         [-0.02749372,  0.        ],
         [-0.07297573,  0.        ],
         [-0.12663614,  0.        ],
         [-0.01746809,  0.        ]]),
  array([-0.19811082,  0.        ]))]

# Exercise 7 - Training


**a)** Complete exercise 6 and 7 from last week, but use your own backpropogation implementation to compute the gradient.

**b)** Use stochastic gradient descent with momentum when you train your network.


In [272]:
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score


In [273]:
#a)
#From week 42
def cross_entropy_der(predict, target):
    return predict - target  # Cross-entropy derivative

def accuracy(predictions, targets):
    one_hot_predictions = np.zeros(predictions.shape)

    for i, prediction in enumerate(predictions):
        one_hot_predictions[i, np.argmax(prediction)] = 1
    return accuracy_score(one_hot_predictions, targets)

# from last week, but with backpropagation_batch
def train_network_backprop(inputs, layers, activation_funcs, activation_ders, targets, learning_rate=0.001, epochs=100):
    for i in range(epochs):
        layers_grad = backpropagation_batch(inputs, layers, activation_funcs, targets, activation_ders, cost_der=cross_entropy_der)
        for (W, b), (W_g, b_g) in zip(layers, layers_grad):
            W -= learning_rate * W_g
            b -= learning_rate * b_g



In [274]:
# From last week
iris = datasets.load_iris()
inputs = iris.data
targets = np.zeros((len(iris.data), 3))
for i, t in enumerate(iris.target):
    targets[i, t] = 1


network_input_size = 4
layer_output_sizes = [8, 3]
activation_funcs = [sigmoid, ReLU]
activation_ders = [sigmoid_der, ReLU_der]
layers = create_layers_batch(network_input_size, layer_output_sizes)

train_network_backprop(inputs, layers, activation_funcs, activation_ders, targets, epochs=100)
predictions = feed_forward_batch(inputs, layers, activation_funcs)
print("Accuracy with custom backpropagation:", accuracy(predictions, targets))

Accuracy with custom backpropagation: 0.3333333333333333


In [275]:
# b)
# Smiliar to SGD from week 41
def SGD_momentum_with_backprop(inputs, layers, activation_funcs, activation_ders, targets, batch_size=5, eta=0.01, momentum=0.8, epochs=100):
    M = batch_size
    m = inputs.shape[0]  // batch_size  # Number of mini-batches
    changes = [(np.zeros_like(W), np.zeros_like(b)) for W, b in layers]
    
    for epoch in range(epochs):
        for batch in range(m):
            random_indices = np.random.choice(inputs.shape[0], batch_size, replace=False)
            batch_inputs = inputs[random_indices]
            batch_targets = targets[random_indices]

            layer_grads = backpropagation_batch(batch_inputs, layers, activation_funcs, batch_targets, activation_ders, cost_der=cross_entropy_der)

            for i, ((W, b), (W_g, b_g), (W_change, b_change)) in enumerate(zip(layers, layer_grads, changes)):
                W_change_new = eta * W_g + momentum * W_change
                b_change_new = eta * b_g + momentum * b_change
                W -= W_change_new
                b -= b_change_new
                changes[i] = (W_change_new, b_change_new)



In [276]:
# From last week
iris = datasets.load_iris()
inputs = iris.data
targets = np.zeros((len(iris.data), 3))
for i, t in enumerate(iris.target):
    targets[i, t] = 1


network_input_size = 4
layer_output_sizes = [8, 3]
activation_funcs = [sigmoid, ReLU]
activation_ders = [sigmoid_der, ReLU_der]
layers = create_layers_batch(network_input_size, layer_output_sizes)

SGD_momentum_with_backprop(inputs, layers, activation_funcs, activation_ders, targets, batch_size=5, eta=0.01, momentum=0.8, epochs=100)
predictions = feed_forward_batch(inputs, layers, activation_funcs)
print("Accuracy(SGD + momentum):", accuracy(predictions, targets))

Accuracy(SGD + momentum): 0.3333333333333333


# Exercise 8 (Optional) - Object orientation

Passing in the layers, activations functions, activation derivatives and cost derivatives into the functions each time leads to code which is easy to understand in isoloation, but messier when used in a larger context with data splitting, data scaling, gradient methods and so forth. Creating an object which stores these values can lead to code which is much easier to use.

**a)** Write a neural network class. You are free to implement it how you see fit, though we strongly recommend to not save any input or output values as class attributes, nor let the neural network class handle gradient methods internally. Gradient methods should be handled outside, by performing general operations on the layer_grads list using functions or classes separate to the neural network.

We provide here a skeleton structure which should get you started.


In [277]:
import numpy as np

class NeuralNetwork:
    def __init__(
            self,
            network_input_size,
            layer_output_sizes,
            activation_funcs,
            activation_ders,
            cost_fun,
            cost_der,
            learning_rate=0.001,
    ):
        # Initialize the network layers
        self.layers = self.create_layers(network_input_size, layer_output_sizes)
        self.activation_funcs = activation_funcs
        self.activation_ders = activation_ders
        self.cost_fun = cost_fun 
        self.cost_der = cost_der  
        self.learning_rate = learning_rate

    def create_layers(self, network_input_size, layer_output_sizes):
        layers = []
        i_size = network_input_size
        for layer_output_size in layer_output_sizes:
            W = np.random.randn(i_size, layer_output_size)
            b = np.random.randn(layer_output_size)
            layers.append((W, b))
            i_size = layer_output_size
        return layers

    def predict(self, inputs):
        a = inputs
        for (W, b), activation_func in zip(self.layers, self.activation_funcs):
            z = a @ W + b
            a = activation_func(z)
        return a

    def cost(self, inputs, targets):
        predictions = self.predict(inputs)
        return self.cost_fun(predictions, targets)

    def _feed_forward_saver(self, inputs):
        layer_inputs = []
        zs = []
        a = inputs
        for (W, b), activation_func in zip(self.layers, self.activation_funcs):
            layer_inputs.append(a)
            z = a @ W + b
            a = activation_func(z)
            zs.append(z)
        return layer_inputs, zs, a

    def compute_gradient(self, inputs, targets):
        layer_inputs, zs, predictions = self._feed_forward_saver(inputs)
        batch_size = inputs.shape[0]
        layer_grads = [() for _ in self.layers]

        for i in reversed(range(len(self.layers))):
            layer_input, z, activation_der = layer_inputs[i], zs[i], self.activation_ders[i]

            if i == len(self.layers) - 1:
                dC_da = self.cost_der(predictions, targets)
            else:
                W = self.layers[i + 1][0]
                dC_da = (layer_grads[i + 1][1]) @ W.T

            dC_dz = dC_da * activation_der(z)
            dC_dW = (layer_input.T @ dC_dz) / batch_size
            dC_db = np.mean(dC_dz, axis=0)

            layer_grads[i] = (dC_dW, dC_db)

        return layer_grads

    def update_weights(self, layer_grads):
        for (W, b), (W_g, b_g) in zip(self.layers, layer_grads):
            W -= self.learning_rate * W_g
            b -= self.learning_rate * b_g


    def autograd_compliant_predict(self, layers, inputs):
        a = inputs
        for (W, b), activation_func in zip(layers, self.activation_funcs):
            z = a @ W + b
            a = activation_func(z)
        return a

    def autograd_gradient(self, inputs, targets):
        predictions = self.autograd_compliant_predict(self.layers, inputs)
        return self.cost_der(predictions, targets)