# Exercises week 42

**October 13-17, 2025**

Date: **Deadline is Friday October 17 at midnight**


# Overarching aims of the exercises this week

The aim of the exercises this week is to train the neural network you implemented last week.

To train neural networks, we use gradient descent, since there is no analytical expression for the optimal parameters. This means you will need to compute the gradient of the cost function wrt. the network parameters. And then you will need to implement some gradient method.

You will begin by computing gradients for a network with one layer, then two layers, then any number of layers. Keeping track of the shapes and doing things step by step will be very important this week.

We recommend that you do the exercises this week by editing and running this notebook file. It includes checks along the way that you have implemented the neural network correctly, and running small parts of the code at a time is important for understanding the methods. If you have trouble running a notebook, you can run this notebook in Google Colab instead (https://colab.research.google.com/drive/1FfvbN0XlhV-lATRPyGRTtTBnJr3zNuHL#offline=true&sandboxMode=true), though we recommend that you set up VSCode and your Python environment to run code like this locally.

First, some setup code that you will need.


In [1]:
import autograd.numpy as np  # We need to use this numpy wrapper to make automatic differentiation work later
from autograd import grad, elementwise_grad
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score


# Defining some activation functions
def ReLU(z):
    return np.where(z > 0, z, 0)


# Derivative of the ReLU function
def ReLU_der(z):
    return np.where(z > 0, 1, 0)


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def mse(predict, target):
    return np.mean((predict - target) ** 2)

# Exercise 1 - Understand the feed forward pass

**a)** Complete last week's exercises if you haven't already (recommended).


[W42 exercise 7: a - e](https://github.com/NatalliaDanilchanka/FYS-STK3155_MachineLearning_deliveries/blob/main/Week_41.ipynb)

# Exercise 2 - Gradient with one layer using autograd

For the first few exercises, we will not use batched inputs. Only a single input vector is passed through the layer at a time.

In this exercise you will compute the gradient of a single layer. You only need to change the code in the cells right below an exercise, the rest works out of the box. Feel free to make changes and see how stuff works though!


**a)** If the weights and bias of a layer have shapes (10, 4) and (10), what will the shapes of the gradients of the cost function wrt. these weights and this bias be?


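The cost is a scalar, so its gradient wrt. a parameter has the same shape as that parameter: the gradient wrt. the weights has shape (10, 4), and the gradient wrt. the bias has shape (10).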

**b)** Complete the feed_forward_one_layer function. It should use the sigmoid activation function. Also define the weight and bias with the correct shapes.


In [2]:
def feed_forward_one_layer(W, b, x):
    z = W @ x  + b
    a = sigmoid(z)
    return a


def cost_one_layer(W, b, x, target):
    predict = feed_forward_one_layer(W, b, x)
    return mse(predict, target)

num_inp = 2
num_out = 3

x = np.random.rand(num_inp)
target = np.random.rand(num_out)

W = np.random.randn(num_out, num_inp)
b = np.random.randn(num_out)

**c)** Compute the gradient of the cost function wrt. the weight and bias by running the cell below. You will not need to change anything, just make sure it runs by defining things correctly in the cell above. This code uses the autograd package, which uses backpropagation to compute the gradient!


In [3]:
autograd_one_layer = grad(cost_one_layer, [0, 1])
W_g, b_g = autograd_one_layer(W, b, x, target)
print(W_g, b_g)

[[-0.01095302 -0.00354917]
 [-0.0056232  -0.00182212]
 [ 0.01219003  0.00395   ]] [-0.01349215 -0.00692678  0.01501593]


# Exercise 3 - Gradient with one layer writing backpropagation by hand

Before you use the gradient you found using autograd, you will have to find the gradient "manually", to better understand how the backpropagation computation works. To do backpropagation "manually", you will need to write out expressions for many derivatives along the computation.


We want to find the gradient of the cost function wrt. the weight and bias. This is quite hard to do directly, so we instead use the chain rule to combine multiple derivatives which are easier to compute.

$$
\frac{dC}{dW} = \frac{dC}{da}\frac{da}{dz}\frac{dz}{dW}
$$

$$
\frac{dC}{db} = \frac{dC}{da}\frac{da}{dz}\frac{dz}{db}
$$


**a)** Which intermediary results can be reused between the two expressions?


The two expressions have the factors $\frac{dC}{da}$ and $\frac{da}{dz}$ in common, so their product can be computed once and reused for both gradients.

**b)** What is the derivative of the cost wrt. the final activation? You can use the autograd calculation to make sure you get the correct result. Remember that we compute the mean in mse.



\begin{equation}
\frac{dC}{da} = \frac{2}{n}(a - y)
\end{equation}

where $y$ is the target and $n$ is the number of elements in the target (`target.size`).



In [4]:
z = W @ x + b
a = sigmoid(z)

predict = a

def mse_der(predict, target):
    return (2 * (predict - target)) / target.size

print(mse_der(predict, target))

def mse(predict, target):
    return np.mean((predict - target) ** 2)
    
cost_autograd = grad(mse, 0)
print(cost_autograd(predict, target))

[-0.06804342 -0.03263867  0.16127725]
[-0.06804342 -0.03263867  0.16127725]


**c)** What is the expression for the derivative of the sigmoid activation function? You can use the autograd calculation to make sure you get the correct result.


The sigmoid function is defined as $a = \sigma(z) = \frac{1}{1 + e^{-z}}$.

Let $u = 1 + e^{-z}$, so that $a = \frac{1}{u}$. Then

$\frac{du}{dz} = -e^{-z}$

$\frac{da}{du} = -\frac{1}{u^2}$

$\frac{da}{dz} = \frac{da}{du} \cdot \frac{du}{dz} = -\frac{1}{u^2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}$

This can be rewritten as $\frac{da}{dz} = \sigma(z)(1-\sigma(z))$.
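
The rewriting in the last line follows by splitting the fraction and using the definition of $\sigma(z)$. A short worked version of that step:

$$
\frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \cdot \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = \sigma(z)\,(1 - \sigma(z))
$$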

In [5]:
def sigmoid_der(z):
    sigm = sigmoid(z)
    return sigm * (1 - sigm)
print(sigmoid_der(z))

sigmoid_autograd = elementwise_grad(sigmoid, 0)
print(sigmoid_autograd(z))

[0.19828733 0.21222604 0.09310629]
[0.19828733 0.21222604 0.09310629]


**d)** Using the two derivatives you just computed, compute this intermediary gradient you will use later:

$$
\frac{dC}{dz} = \frac{dC}{da}\frac{da}{dz}
$$


**e)** What is the derivative of the intermediary z wrt. the weight and bias? What should the shapes be? The one for the weights is a little tricky, it can be easier to play around in the next exercise first. You can also try computing it with autograd to get a hint.
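
One way to explore this question is to build the full derivative of $z$ wrt. $W$ explicitly and see what happens when it is combined with $\frac{dC}{dz}$. The sketch below is only an illustration under assumed sizes (2 inputs, 3 outputs) and uses a random stand-in for the upstream gradient `dC_dz`; the names `dz_dW` and `dz_db` are introduced here and are not part of the exercise code.

```python
import numpy as np

rng = np.random.default_rng(0)
num_inp, num_out = 2, 3
W = rng.normal(size=(num_out, num_inp))
b = rng.normal(size=num_out)
x = rng.random(num_inp)

# z_i = sum_k W_ik x_k + b_i, so dz_i/dW_jk = x_k if i == j, else 0.
# The full derivative is a 3-D array of shape (num_out, num_out, num_inp).
dz_dW = np.einsum("ij,k->ijk", np.eye(num_out), x)

# dz_i/db_j is 1 if i == j, else 0: the identity matrix.
dz_db = np.eye(num_out)

# Random stand-in for the upstream gradient dC/dz (assumption for illustration)
dC_dz = rng.normal(size=num_out)

# Chain rule with the full derivatives: sum over the index i of z
dC_dW_full = np.einsum("i,ijk->jk", dC_dz, dz_dW)
dC_db_full = dC_dz @ dz_db

# The same result without ever building the 3-D array
dC_dW = np.outer(dC_dz, x)
dC_db = dC_dz

print(dz_dW.shape, dC_dW.shape, dC_db.shape)
print(np.allclose(dC_dW_full, dC_dW), np.allclose(dC_db_full, dC_db))
```

Because most entries of the full derivative are zero, the contraction collapses to an outer product for the weights and to $\frac{dC}{dz}$ itself for the bias, which is why the gradients end up with the same shapes as `W` and `b`.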


**f)** Now combine the expressions you have worked with so far to compute the gradients! Note that you always need to do a feed forward pass while saving the `z`s and `a`s before you do backpropagation, as they are used in the derivative expressions.


In [6]:
dC_da = 2 * (predict - target) / target.size  # derivative of the mean in mse, no hard-coded size
dC_dz = dC_da * (a * (1 - a))
dC_dW =  np.outer(dC_dz, x)
dC_db = dC_dz

print(dC_dW, dC_db)

[[-0.01095302 -0.00354917]
 [-0.0056232  -0.00182212]
 [ 0.01219003  0.00395   ]] [-0.01349215 -0.00692678  0.01501593]


You should get the same results as with autograd.


In [7]:
W_g, b_g = autograd_one_layer(W, b, x, target)
print(W_g, b_g)

[[-0.01095302 -0.00354917]
 [-0.0056232  -0.00182212]
 [ 0.01219003  0.00395   ]] [-0.01349215 -0.00692678  0.01501593]


# Exercise 4 - Gradient with two layers writing backpropagation by hand


Now that you have implemented backpropagation for one layer, you have found most of the expressions you will need for more layers. Let's move up to two layers.


In [8]:
x = np.random.rand(2)
target = np.random.rand(4)

W1 = np.random.rand(3, 2)
b1 = np.random.rand(3)

W2 = np.random.rand(4, 3)
b2 = np.random.rand(4)

layers = [(W1, b1), (W2, b2)]

z1 = W1 @ x + b1
a1 = sigmoid(z1)


z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
print(a1)


[0.69214664 0.73632221 0.58237706]


We begin by computing the gradients of the last layer, as the gradients must be propagated backwards from the end.

**a)** Compute the gradients of the last layer, just like you did the single layer in the previous exercise.


In [9]:
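# Attempt based on the lecture example, kept for reference. It does not work here because
# a2 and dC_da2 are 1-D arrays: .T is a no-op on them, and np.matmul of two 1-D arrays
# is an inner product (a scalar), not the element-wise product with the sigmoid derivative.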
#dC_da2 = (a2 - target)*2/4
#dC_dz2 = np.matmul(a2.T, dC_da2)
#dC_dW2 = np.matmul(a1.T, dC_dz2)
#dC_db2 = dC_dz2
#print("Gradient dC_dW2:\n", dC_dW2)
#print("Gradient dC_db2:\n", dC_db2)



In [10]:
dC_da2 = (a2 - target)*2/4
dC_dz2 = dC_da2 * sigmoid_der(z2)
dC_dW2 = np.outer(dC_dz2, a1)
dC_db2 = dC_dz2
print("Gradient dC_dW2:\n", dC_dW2)
print("Gradient dC_db2:\n", dC_db2)


Gradient dC_dW2:
 [[0.0289412  0.03078834 0.02435133]
 [0.02127736 0.02263537 0.01790292]
 [0.00896275 0.00953479 0.00754132]
 [0.00713198 0.00758717 0.0060009 ]]
Gradient dC_db2:
 [0.04181368 0.03074112 0.01294921 0.01030415]


To find the derivative of the cost wrt. the activation of the first layer, we need a new expression, the one furthest to the right in the following.

$$
\frac{dC}{da_1} = \frac{dC}{dz_2}\frac{dz_2}{da_1}
$$

**b)** What is the derivative of the second layer intermediate wrt. the first layer activation? (First recall how you compute $z_2$)

$$
\frac{dz_2}{da_1} = \frac{d(W_2 a_1 + b_2)}{da_1} = W_2
$$
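
This result can be checked numerically. The sketch below is a hedged illustration, not part of the exercise: it uses assumed sizes (3 inputs, 4 outputs), fresh random values, and a hypothetical helper `z2_of`, and estimates the derivative with central finite differences instead of autograd.

```python
import numpy as np

rng = np.random.default_rng(1)
W2 = rng.random((4, 3))
b2 = rng.random(4)
a1 = rng.random(3)


def z2_of(a1):
    # Second layer intermediate as a function of the first layer activation
    return W2 @ a1 + b2


# Central finite differences, one component of a1 at a time
eps = 1e-6
jac = np.zeros((4, 3))
for k in range(3):
    step = np.zeros(3)
    step[k] = eps
    jac[:, k] = (z2_of(a1 + step) - z2_of(a1 - step)) / (2 * eps)

print(np.allclose(jac, W2, atol=1e-6))

# Propagating a gradient backwards through this layer: a random stand-in
# for dC/dz2 (assumption) multiplied by the derivative W2
dC_dz2 = rng.normal(size=4)
dC_da1 = dC_dz2 @ W2  # for 1-D arrays this equals W2.T @ dC_dz2
print(dC_da1.shape)
```

The entry `jac[i, k]` is $\frac{\partial z_{2,i}}{\partial a_{1,k}}$, so the derivative has shape (4, 3), the same as `W2`, and multiplying by it maps a gradient of length 4 back to a gradient of length 3.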


**c)** Use this expression, together with expressions which are equivalent to the ones for the last layer, to compute all the derivatives of the first layer.

$$
\frac{dC}{dW_1} = \frac{dC}{da_1}\frac{da_1}{dz_1}\frac{dz_1}{dW_1}
$$

$$
\frac{dC}{db_1} = \frac{dC}{da_1}\frac{da_1}{dz_1}\frac{dz_1}{db_1}
$$

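$$
\frac{\partial C}{\partial b_1} = \frac{\partial C}{\partial a_2} \frac{\partial a_2}{\partial z_2} \frac{\partial z_2}{\partial a_1} \frac{\partial a_1}{\partial z_1} \frac{\partial z_1}{\partial b_1} = \left(W_2^T \left(\frac{2}{n}(a_2 - y) \odot \sigma'(z_2)\right)\right) \odot \sigma'(z_1) = \delta_1
$$

Here $n$ is the number of outputs, $\odot$ is the element-wise product, and $\frac{\partial z_1}{\partial b_1}$ is the identity.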

In [11]:
# Backpropagation to layer 1
dC_da1 = dC_dz2 @ W2
dC_dz1 = dC_da1 * sigmoid_der(z1)  
dC_dW1 = np.outer(dC_dz1, x)                      
dC_db1 = dC_dz1                                  

# Print gradients
print("Gradient dC_dW1:\n", dC_dW1)
print("Gradient dC_db1:\n", dC_db1)
print()
print("Gradient dC_dW2:\n", dC_dW2)
print("Gradient dC_db2:\n", dC_db2)

Gradient dC_dW1:
 [[0.00428513 0.00064139]
 [0.00593419 0.00088821]
 [0.00559663 0.00083769]]
Gradient dC_db1:
 [0.00857461 0.0118744  0.01119893]

Gradient dC_dW2:
 [[0.0289412  0.03078834 0.02435133]
 [0.02127736 0.02263537 0.01790292]
 [0.00896275 0.00953479 0.00754132]
 [0.00713198 0.00758717 0.0060009 ]]
Gradient dC_db2:
 [0.04181368 0.03074112 0.01294921 0.01030415]


**d)** Make sure you got the same gradient as the following code which uses autograd to do backpropagation.


In [12]:
def feed_forward_two_layers(layers, x):
    W1, b1 = layers[0]
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)

    W2, b2 = layers[1]
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)

    return a2

def cost_two_layers(layers, x, target):
    predict = feed_forward_two_layers(layers, x)
    return mse(predict, target)


grad_two_layers = grad(cost_two_layers, 0)
grad_two_layers(layers, x, target)

[(array([[0.00428513, 0.00064139],
         [0.00593419, 0.00088821],
         [0.00559663, 0.00083769]]),
  array([0.00857461, 0.0118744 , 0.01119893])),
 (array([[0.0289412 , 0.03078834, 0.02435133],
         [0.02127736, 0.02263537, 0.01790292],
         [0.00896275, 0.00953479, 0.00754132],
         [0.00713198, 0.00758717, 0.0060009 ]]),
  array([0.04181368, 0.03074112, 0.01294921, 0.01030415]))]

**e)** How would you use the gradient from this layer to compute the gradient of an even earlier layer? Would the expressions be any different?
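
One possible way to answer this, assuming every layer has the form $z_l = W_l a_{l-1} + b_l$ with $a_l = \sigma_l(z_l)$ and $a_0 = x$ (the layer index $l$ and the shorthand $\delta_l = \frac{dC}{dz_l}$ are notation introduced here for illustration): the expressions keep exactly the same form for every earlier layer, and only the indices change.

$$
\delta_l = \left(W_{l+1}^T \delta_{l+1}\right) \odot \sigma_l'(z_l), \qquad \frac{dC}{dW_l} = \delta_l \, a_{l-1}^T, \qquad \frac{dC}{db_l} = \delta_l
$$

So the gradient wrt. $z$ of one layer is all that an earlier layer needs from the layers after it, which is what makes a loop over the layers possible.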


# Exercise 5 - Gradient with any number of layers writing backpropagation by hand


Well done on getting this far! Now it's time to compute the gradient with any number of layers.

First, some code from the general neural network code from last week. Note that we are still sending in one input vector at a time. We will change it to use batched inputs later.


In [13]:
def create_layers(network_input_size, layer_output_sizes):
    layers = []

    i_size = network_input_size
    for layer_output_size in layer_output_sizes:
        W = np.random.randn(layer_output_size, i_size)
        b = np.random.randn(layer_output_size)
        layers.append((W, b))

        i_size = layer_output_size
    return layers


def feed_forward(input, layers, activation_funcs):
    a = input
    for (W, b), activation_func in zip(layers, activation_funcs):
        z = W @ a + b
        a = activation_func(z)
    return a


def cost(layers, input, activation_funcs, target):
    predict = feed_forward(input, layers, activation_funcs)
    return mse(predict, target)

You might already have noticed a very important detail in backpropagation: You need the values from the forward pass to compute all the gradients! The feed forward method above is great for efficiency and for using autograd, as it only cares about computing the final output, but now we need to also save the results along the way.

Here is a function which does that for you.


In [14]:
def feed_forward_saver(input, layers, activation_funcs):
    layer_inputs = []
    zs = []
    a = input
    for (W, b), activation_func in zip(layers, activation_funcs):
        layer_inputs.append(a)
        z = W @ a + b
        a = activation_func(z)

        zs.append(z)

    return layer_inputs, zs, a

**a)** Now, complete the backpropagation function so that it returns the gradient of the cost function wrt. all the weights and biases. Use the autograd calculation below to make sure you get the correct answer.


In [15]:
def backpropagation(
    input, layers, activation_funcs, target, activation_ders, cost_der=mse_der
):
    layer_inputs, zs, predict = feed_forward_saver(input, layers, activation_funcs)

    layer_grads = [() for layer in layers]

    # We loop over the layers, from the last to the first
    for i in reversed(range(len(layers))):
        layer_input, z, activation_der = layer_inputs[i], zs[i], activation_ders[i]

        if i == len(layers) - 1:
            # For last layer we use cost derivative as dC_da(L) can be computed directly
            dC_da = cost_der(predict, target)
        else:
            # For other layers we build on previous z derivative, as dC_da(i) = dC_dz(i+1) * dz(i+1)_da(i)
            (W, b) = layers[i + 1]
            dC_da = dC_dz @ W

        dC_dz = dC_da * activation_der(z)
        dC_dW = np.outer(dC_dz, layer_input) 
        dC_db = dC_dz 

        layer_grads[i] = (dC_dW, dC_db)

    return layer_grads                     
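
If autograd is not available, a finite-difference check is another way to test a backpropagation implementation. The block below is a self-contained sketch under stated assumptions: it re-implements a compact version of the same backward loop (`backprop`, `forward` and `numerical_grad` are hypothetical names, not the notebook's functions), uses only sigmoid activations to avoid the kink in ReLU, and uses its own random layers.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def sigmoid_der(z):
    s = sigmoid(z)
    return s * (1 - s)


def mse(predict, target):
    return np.mean((predict - target) ** 2)


def forward(x, layers):
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a


def backprop(x, layers, target):
    # Forward pass, saving each layer's input and intermediate z
    layer_inputs, zs, a = [], [], x
    for W, b in layers:
        layer_inputs.append(a)
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)

    grads = [None] * len(layers)
    dC_da = 2 * (a - target) / target.size  # derivative of mse
    for i in reversed(range(len(layers))):
        dC_dz = dC_da * sigmoid_der(zs[i])
        grads[i] = (np.outer(dC_dz, layer_inputs[i]), dC_dz)
        dC_da = dC_dz @ layers[i][0]  # gradient wrt. the previous activation
    return grads


def numerical_grad(f, param, eps=1e-6):
    # Central differences, perturbing one entry of `param` in place at a time
    g = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        up = f()
        param[idx] = old - eps
        down = f()
        param[idx] = old
        g[idx] = (up - down) / (2 * eps)
    return g


rng = np.random.default_rng(0)
sizes = [2, 3, 4]
layers = [
    (rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
x, target = rng.random(2), rng.random(4)

analytic = backprop(x, layers, target)
f = lambda: mse(forward(x, layers), target)
max_err = 0.0
for (W, b), (dW, db) in zip(layers, analytic):
    max_err = max(
        max_err,
        np.max(np.abs(numerical_grad(f, W) - dW)),
        np.max(np.abs(numerical_grad(f, b) - db)),
    )
print(max_err)
```

A maximum difference many orders of magnitude smaller than the gradients themselves suggests the analytic gradients are right; finite differences are slow, so this is meant as a test on small networks only.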


In [16]:
network_input_size = 2
layer_output_sizes = [3, 4]
activation_funcs = [sigmoid, ReLU]
activation_ders = [sigmoid_der, ReLU_der]

layers = create_layers(network_input_size, layer_output_sizes)

x = np.random.rand(network_input_size)
target = np.random.rand(4)

layer_grads = backpropagation(x, layers, activation_funcs, target, activation_ders)
print(layer_grads)

[(array([[ 2.37899513e-03,  1.74013921e-03],
       [ 8.51362332e-05,  6.22737288e-05],
       [-8.48703559e-03, -6.20792503e-03]]), array([ 0.00348343,  0.00012466, -0.01242708])), (array([[-0.        , -0.        , -0.        ],
       [-0.        , -0.        , -0.        ],
       [-0.00166204, -0.0020844 , -0.00137003],
       [-0.04396832, -0.0551418 , -0.03624349]]), array([-0.        , -0.        , -0.00219165, -0.05797888]))]


In [17]:
cost_grad = grad(cost, 0)
cost_grad(layers, x, [sigmoid, ReLU], target)

[(array([[ 2.37899513e-03,  1.74013921e-03],
         [ 8.51362332e-05,  6.22737288e-05],
         [-8.48703559e-03, -6.20792503e-03]]),
  array([ 0.00348343,  0.00012466, -0.01242708])),
 (array([[ 0.        ,  0.        ,  0.        ],
         [ 0.        ,  0.        ,  0.        ],
         [-0.00166204, -0.0020844 , -0.00137003],
         [-0.04396832, -0.0551418 , -0.03624349]]),
  array([ 0.        ,  0.        , -0.00219165, -0.05797888]))]

# Exercise 6 - Batched inputs

Make new versions of all the functions in exercise 5 which now take batched inputs instead. See last week's exercise 5 for details on how to batch inputs to neural networks. You will also need to update the backpropagation function.


In [67]:
def create_layers_batch(network_input_size, layer_output_sizes):
    layers = []
    i_size = network_input_size  
    for layer_output_size in layer_output_sizes:
        W = np.random.randn(layer_output_size, i_size)
        W = W.T
        b = np.random.randn(layer_output_size)
        layers.append((W, b))
        i_size = layer_output_size
    return layers


def feed_forward_batch(inputs, layers, activation_funcs):
    a = inputs
    for (W, b), activation_func in zip(layers, activation_funcs):
        z = a @ W + b
        a = activation_func(z)
    return a

def cost(layers, input, activation_funcs, target):
    predict = feed_forward_batch(input, layers, activation_funcs)
    return mse(predict, target)


def feed_forward_saver_batch(inputs, layers, activation_funcs):
    layer_inputs = []
    zs = []
    a = inputs
    for (W, b), activation_func in zip(layers, activation_funcs):
        layer_inputs.append(a)        
        z = a @ W + b
        a = activation_func(z)
        zs.append(z)
    return layer_inputs, zs, a

def backpropagation_batch(
    inputs, layers, activation_funcs, target, activation_ders, cost_der=mse_der
):
    layer_inputs, zs, predict = feed_forward_saver_batch(inputs, layers, activation_funcs)

    layer_grads = [() for layer in layers]

    # We loop over the layers, from the last to the first
    for i in reversed(range(len(layers))):
        layer_input, z, activation_der = layer_inputs[i], zs[i], activation_ders[i]

        if i == len(layers) - 1:
            # For last layer we use cost derivative as dC_da(L) can be computed directly
            dC_da = (predict - target)*2/target.shape[1] # (n, tf), per sample; the 1/n is applied below
        else:
            # For other layers we build on previous z derivative, as dC_da(i) = dC_dz(i+1) * dz(i+1)_da(i)
            (W, b) = layers[i + 1]
            dC_da = dC_dz @ W.T # (n, output size of layer i)

        dC_dz = dC_da * activation_der(z) # (n, output size of layer i)
        dC_dW = np.matmul(layer_input.T, dC_dz)/target.shape[0] # sum over samples, then divide by n
        dC_db = np.mean(dC_dz, axis = 0) # (output size of layer i,)

        layer_grads[i] = (dC_dW, dC_db)

    return layer_grads 

In [68]:
len(target)

300

In [69]:
target.shape

(300, 4)

In [70]:
network_input_size = 2
layer_output_sizes = [3,3, 4]
activation_funcs = [sigmoid, ReLU, sigmoid]
activation_ders = [sigmoid_der, ReLU_der,sigmoid_der]

layers = create_layers_batch(network_input_size, layer_output_sizes)


x = np.random.rand(300,network_input_size)
target = np.random.rand(300, 4)

layer_grads = backpropagation_batch(x, layers, activation_funcs, target, activation_ders)
print(layer_grads)

[(array([[ 0.13474092, -0.17723495, -0.53509682],
       [ 0.10449262, -0.14411266, -0.51225619]]), array([ 0.00082328, -0.00111484, -0.00335893])), (array([[-2.84740243,  0.        ,  0.        ],
       [-0.86188525,  0.        ,  0.        ],
       [-1.23656009,  0.        ,  0.        ]]), array([-0.01072927,  0.        ,  0.        ])), (array([[-0.22430808,  0.4278522 , -0.54270834,  0.56281856],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ]]), array([-0.01551825,  0.01293471, -0.02460425,  0.02478435]))]


In [66]:
print(layers)


[(array([[ 0.59086711, -0.89000115, -0.01802279],
       [ 0.05028379,  0.85446639,  2.5474856 ]]), array([ 0.31253161, -1.8500788 , -0.82682587])), (array([[-1.90100266,  0.10132323,  0.24290091],
       [ 2.33774058,  2.42701912,  0.05449987],
       [ 1.30045569,  0.69898225,  0.80258499]]), array([-0.68625726,  1.78438919, -2.26010976])), (array([[ 0.26521573, -0.91793359,  0.01019282, -1.43885652],
       [-0.83083534, -0.23615136, -1.06871627,  0.85877987],
       [ 1.10804506, -0.22076153,  0.8238782 , -0.62356998]]), array([ 0.14847387, -0.91839742, -0.19410968, -1.48609056]))]


In [65]:
cost_grad = grad(cost, 0)
cost_grad(layers, x, [sigmoid, ReLU, sigmoid], target)

[(array([[0.00053754, 0.00618994, 0.00363736],
         [0.00052671, 0.00718346, 0.00323179]]),
  array([0.00112064, 0.01383242, 0.00740124])),
 (array([[0.        , 0.03190521, 0.        ],
         [0.        , 0.00668765, 0.        ],
         [0.        , 0.02840964, 0.        ]]),
  array([0.        , 0.04900224, 0.        ])),
 (array([[ 0.        ,  0.        ,  0.        ,  0.        ],
         [-0.05216996, -0.05642572, -0.02787697,  0.04692527],
         [ 0.        ,  0.        ,  0.        ,  0.        ]]),
  array([-0.02029738, -0.02187817, -0.01088383,  0.01786276]))]
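
The cell counters above (In [70], In [66], In [65]) suggest that the cells were run out of order, so the two printed gradient lists need not come from the same random layers and cannot be compared directly. A reproducible comparison can be sketched as below. This is an assumption-laden illustration rather than the notebook's code: it uses a fixed seed, sigmoid in every layer, hypothetical names (`backprop_batch`, `cost_batch`, `numerical_grad`), finite differences instead of autograd, and it puts the full $\frac{1}{n \cdot n_{out}}$ factor of the mean into `dC_da` instead of dividing by `n` at the end.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def sigmoid_der(z):
    s = sigmoid(z)
    return s * (1 - s)


def cost_batch(X, layers, T):
    a = X
    for W, b in layers:
        a = sigmoid(a @ W + b)  # rows are samples, W has shape (inputs, outputs)
    return np.mean((a - T) ** 2)


def backprop_batch(X, layers, T):
    layer_inputs, zs, a = [], [], X
    for W, b in layers:
        layer_inputs.append(a)
        z = a @ W + b
        zs.append(z)
        a = sigmoid(z)

    grads = [None] * len(layers)
    # np.mean averages over all n * n_out entries, hence the division by T.size
    dC_da = 2 * (a - T) / T.size
    for i in reversed(range(len(layers))):
        dC_dz = dC_da * sigmoid_der(zs[i])   # (n, outputs of layer i)
        dC_dW = layer_inputs[i].T @ dC_dz    # sums the per-sample outer products
        dC_db = dC_dz.sum(axis=0)            # sums over the samples
        grads[i] = (dC_dW, dC_db)
        dC_da = dC_dz @ layers[i][0].T       # (n, inputs of layer i)
    return grads


def numerical_grad(f, param, eps=1e-6):
    # Central differences, perturbing one entry of `param` in place at a time
    g = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        up = f()
        param[idx] = old - eps
        down = f()
        param[idx] = old
        g[idx] = (up - down) / (2 * eps)
    return g


rng = np.random.default_rng(0)
sizes = [2, 3, 3, 4]
layers = [
    (rng.normal(size=(n_in, n_out)), rng.normal(size=n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
X, T = rng.random((20, 2)), rng.random((20, 4))

grads = backprop_batch(X, layers, T)
f = lambda: cost_batch(X, layers, T)
max_err = max(
    max(
        np.max(np.abs(numerical_grad(f, W) - dW)),
        np.max(np.abs(numerical_grad(f, b) - db)),
    )
    for (W, b), (dW, db) in zip(layers, grads)
)
print(max_err)
```

Running the gradient computation and the check in the same cell, on the same `layers`, `X` and `T`, avoids comparing results from different random initialisations.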

# Exercise 7 - Training


**a)** Complete exercises 6 and 7 from last week, but use your own backpropagation implementation to compute the gradient.
- IMPORTANT: Do not implement the derivative terms for softmax and cross-entropy separately, it will be very hard!
- Instead, use the fact that the derivatives multiplied together simplify to **prediction - target** (see [source1](https://medium.com/data-science/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1), [source2](https://shivammehta25.github.io/posts/deriving-categorical-cross-entropy-and-softmax/))
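
The simplification can be checked numerically for a single sample. The sketch below is an illustration under assumptions (5 classes, a one-hot target, random logits `z`, and helper names chosen here); it compares `softmax(z) - target` with a finite-difference estimate of the derivative of the cross-entropy wrt. `z`.

```python
import numpy as np


def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)


def cross_entropy(z, target):
    # Cross-entropy of softmax(z) against a one-hot target, single sample
    return -np.sum(target * np.log(softmax(z)))


rng = np.random.default_rng(0)
z = rng.normal(size=5)
target = np.zeros(5)
target[2] = 1.0

# Combined derivative of cross-entropy and softmax wrt. z
analytic = softmax(z) - target

# Central finite differences, one component of z at a time
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (cross_entropy(z + step, target) - cross_entropy(z - step, target)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

In a backpropagation loop this means that, for the last layer, `dC_dz` can be set to `predict - target` directly instead of multiplying a cost derivative by an activation derivative. If the cost is averaged over a batch, the result also has to be divided by the batch size.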

**b)** Use stochastic gradient descent with momentum when you train your network.
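
A minimal sketch of a momentum update that works on the same list-of-`(W, b)` structure as `layer_grads` is shown below. The function name `momentum_step`, the hyperparameter values and the toy cost are assumptions made for illustration; in actual training, `layer_grads` would come from your backpropagation function evaluated on a randomly drawn mini-batch in each step.

```python
import numpy as np


def momentum_step(layers, layer_grads, velocities, learning_rate=0.1, momentum=0.9):
    # One gradient descent step with momentum; returns new layers and velocities
    new_layers, new_velocities = [], []
    for (W, b), (dW, db), (vW, vb) in zip(layers, layer_grads, velocities):
        vW = momentum * vW - learning_rate * dW
        vb = momentum * vb - learning_rate * db
        new_layers.append((W + vW, b + vb))
        new_velocities.append((vW, vb))
    return new_layers, new_velocities


# Toy problem standing in for a network: C = 0.5 * (sum(W**2) + sum(b**2)),
# whose gradient wrt. (W, b) is simply (W, b) and whose minimum is at zero.
layers = [(np.array([[1.0, -2.0]]), np.array([0.5]))]
velocities = [(np.zeros_like(W), np.zeros_like(b)) for W, b in layers]

for _ in range(300):
    layer_grads = [(W, b) for W, b in layers]
    layers, velocities = momentum_step(layers, layer_grads, velocities)

final_W, final_b = layers[0]
print(np.max(np.abs(final_W)), np.max(np.abs(final_b)))
```

Keeping the velocities outside the network, next to the layers, matches the recommendation in exercise 8 to handle gradient methods separately from the neural network itself.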


# Exercise 8 (Optional) - Object orientation

Passing in the layers, activation functions, activation derivatives and cost derivatives into the functions each time leads to code which is easy to understand in isolation, but messier when used in a larger context with data splitting, data scaling, gradient methods and so forth. Creating an object which stores these values can lead to code which is much easier to use.

**a)** Write a neural network class. You are free to implement it how you see fit, though we strongly recommend to not save any input or output values as class attributes, nor let the neural network class handle gradient methods internally. Gradient methods should be handled outside, by performing general operations on the layer_grads list using functions or classes separate to the neural network.

We provide here a skeleton structure which should get you started.


In [21]:
class NeuralNetwork:
    def __init__(
        self,
        network_input_size,
        layer_output_sizes,
        activation_funcs,
        activation_ders,
        cost_fun,
        cost_der,
    ):
        pass

    def predict(self, inputs):
        # Simple feed forward pass
        pass

    def cost(self, inputs, targets):
        pass

    def _feed_forward_saver(self, inputs):
        pass

    def compute_gradient(self, inputs, targets):
        pass

    def update_weights(self, layer_grads):
        pass

    # These last two methods are not needed in the project, but they can be nice to have! The first one has a layers parameter so that you can use autograd on it
    def autograd_compliant_predict(self, layers, inputs):
        pass

    def autograd_gradient(self, inputs, targets):
        pass