# Exercises week 42

**October 13-17, 2025**

Date: **Deadline is Friday October 17 at midnight**


# Overarching aims of the exercises this week

The aim of the exercises this week is to train the neural network you implemented last week.

To train neural networks, we use gradient descent, since there is no analytical expression for the optimal parameters. This means you will need to compute the gradient of the cost function wrt. the network parameters, and then implement a gradient method that uses it.

You will begin by computing gradients for a network with one layer, then two layers, then any number of layers. Keeping track of the shapes and doing things step by step will be very important this week.

We recommend that you do the exercises this week by editing and running this notebook file. It includes checks along the way that you have implemented the neural network correctly, and running small parts of the code at a time will be important for understanding the methods. If you have trouble running a notebook, you can run it in Google Colab instead (https://colab.research.google.com/drive/1FfvbN0XlhV-lATRPyGRTtTBnJr3zNuHL#offline=true&sandboxMode=true), though we recommend that you set up VSCode and your Python environment to run code like this locally.

First, some setup code that you will need.


In [1]:
import autograd.numpy as np  # We need to use this numpy wrapper to make automatic differentiation work later
from autograd import grad, elementwise_grad
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score


# Defining some activation functions
def ReLU(z):
    return np.where(z > 0, z, 0)


# Derivative of the ReLU function
def ReLU_der(z):
    return np.where(z > 0, 1, 0)


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def mse(predict, target):
    return np.mean((predict - target) ** 2)

# Exercise 1 - Understand the feed forward pass

**a)** Complete last week's exercises if you haven't already (recommended).

ok

# Exercise 2 - Gradient with one layer using autograd

For the first few exercises, we will not use batched inputs. Only a single input vector is passed through the layer at a time.

In this exercise you will compute the gradient of a single layer. You only need to change the code in the cells right below an exercise, the rest works out of the box. Feel free to make changes and see how stuff works though!


**a)** If the weights and bias of a layer have shapes (10, 4) and (10), what will the shapes of the gradients of the cost function wrt. these weights and this bias be?

Each gradient has the same shape as the parameter it updates: (10, 4) for the weights and (10) for the bias.
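One way to convince yourself of this is a finite-difference estimate, which perturbs one parameter entry at a time and therefore produces one gradient entry per parameter entry. The sketch below is only an illustration in plain NumPy: `numerical_grad`, `demo_cost` and the `_demo` arrays are hypothetical names that are not part of the exercise code.

```python
import numpy as np


def numerical_grad(f, param, eps=1e-6):
    """Central-difference estimate of df/dparam, one entry at a time.

    f takes no arguments and reads param, which is perturbed in place
    and restored afterwards.
    """
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        f_plus = f()
        param[idx] = original - eps
        f_minus = f()
        param[idx] = original  # restore the entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


# Hypothetical layer with the shapes from the question
rng = np.random.default_rng(0)
W_demo = rng.random((10, 4))
b_demo = rng.random(10)
x_demo = rng.random(4)
t_demo = rng.random(10)


def demo_cost():
    # One sigmoid layer followed by the mean squared error
    a = 1 / (1 + np.exp(-(W_demo @ x_demo + b_demo)))
    return np.mean((a - t_demo) ** 2)


print(numerical_grad(demo_cost, W_demo).shape)  # (10, 4)
print(numerical_grad(demo_cost, b_demo).shape)  # (10,)
```

This approach is far too slow for training, but it is a handy independent check on shapes and values.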


**b)** Complete the feed_forward_one_layer function. It should use the sigmoid activation function. Also define the weight and bias with the correct shapes.


In [2]:
def feed_forward_one_layer(W, b, x):
    z = W @ x + b
    a = sigmoid(z)
    return a


def cost_one_layer(W, b, x, target):
    predict = feed_forward_one_layer(W, b, x)
    return mse(predict, target)


x = (np.random.rand(2))
target = (np.random.rand(3))

W = np.random.rand(3, 2)
b = np.random.rand(3)

**c)** Compute the gradient of the cost function wrt. the weight and bias by running the cell below. You will not need to change anything, just make sure it runs by defining things correctly in the cell above. This code uses the autograd package, which uses backpropagation to compute the gradient!


In [3]:
autograd_one_layer = grad(cost_one_layer, [0, 1])
W_g, b_g = autograd_one_layer(W, b, x, target)
print(W_g, b_g)

[[0.01231316 0.03416783]
 [0.01519907 0.04217594]
 [0.0246748  0.0684702 ]] [0.04459187 0.05504313 0.08935933]


# Exercise 3 - Gradient with one layer writing backpropagation by hand

Before you use the gradient you found using autograd, you will have to find the gradient "manually", to better understand how the backpropagation computation works. To do backpropagation "manually", you will need to write out expressions for many derivatives along the computation.


We want to find the gradient of the cost function wrt. the weight and bias. This is quite hard to do directly, so we instead use the chain rule to combine multiple derivatives which are easier to compute.

$$
\frac{dC}{dW} = \frac{dC}{da}\frac{da}{dz}\frac{dz}{dW}
$$

$$
\frac{dC}{db} = \frac{dC}{da}\frac{da}{dz}\frac{dz}{db}
$$


**a)** Which intermediary results can be reused between the two expressions?

The product $\frac{dC}{da}\frac{da}{dz}$ appears in both expressions, so it can be computed once and reused for both the weight gradient and the bias gradient.

**b)** What is the derivative of the cost wrt. the final activation? You can use the autograd calculation to make sure you get the correct result. Remember that we compute the mean in mse.


In [4]:
z = W @ x + b
a = sigmoid(z)

predict = a


def mse_der(predict, target):
    return 2 * (predict - target) / target.size


print(mse_der(predict, target))

cost_autograd = grad(mse, 0)
print(cost_autograd(predict, target))

[0.22918949 0.32682512 0.45124147]
[0.22918949 0.32682512 0.45124147]
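As a short sketch of where the factor in `mse_der` comes from, assume $n$ denotes the number of elements in the target (`target.size` in the code) and $t_i$ the target values. The mean in `mse` then gives

$$
C = \frac{1}{n}\sum_{i=1}^{n}(a_i - t_i)^2
\quad\Longrightarrow\quad
\frac{\partial C}{\partial a_j} = \frac{2}{n}(a_j - t_j)
$$

since only the term with $i = j$ depends on $a_j$.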


**c)** What is the expression for the derivative of the sigmoid activation function? You can use the autograd calculation to make sure you get the correct result.


In [5]:
def sigmoid_der(z):
    return sigmoid(z) * (1 - sigmoid(z))


print(sigmoid_der(z))

sigmoid_autograd = elementwise_grad(sigmoid, 0)
print(sigmoid_autograd(z))

[0.19456334 0.16841769 0.19802996]
[0.19456334 0.16841769 0.19802996]
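For reference, the expression used in `sigmoid_der` follows from differentiating $\sigma(z) = (1 + e^{-z})^{-1}$ directly:

$$
\frac{d\sigma}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2}
= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
= \sigma(z)\,(1 - \sigma(z))
$$

where the last step uses $1 - \sigma(z) = \frac{e^{-z}}{1 + e^{-z}}$.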


**d)** Using the two derivatives you just computed, compute this intermediary gradient you will use later:

$$
\frac{dC}{dz} = \frac{dC}{da}\frac{da}{dz}
$$


In [6]:
dC_da = mse_der(predict, target)
dC_dz = dC_da * sigmoid_der(z)
dC_dz.shape

(3,)

**e)** What is the derivative of the intermediary z wrt. the weight and bias? What should the shapes be? The one for the weights is a little tricky, it can be easier to play around in the next exercise first. You can also try computing it with autograd to get a hint.

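The derivative of $z$ wrt. the bias is the identity, so $\frac{dC}{db} = \frac{dC}{dz}$ with shape (3,), the same as the bias. For the weights, $\frac{\partial z_i}{\partial W_{ij}} = x_j$, so $\frac{dC}{dW}$ is the outer product of $\frac{dC}{dz}$ and $x$, with shape (3, 2), where 3 is the output size and 2 is the input size from x.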
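The outer-product form of the weight gradient can be checked numerically. The sketch below assumes plain NumPy and uses hypothetical `_chk` arrays, with `upstream` standing in for $\frac{dC}{dz}$. Because $z$ is linear in $W$, the finite-difference estimate should agree up to rounding.

```python
import numpy as np

rng = np.random.default_rng(1)
W_chk = rng.random((3, 2))
b_chk = rng.random(3)
x_chk = rng.random(2)
upstream = rng.random(3)  # stands in for dC/dz


def weighted_z(W):
    # Scalar whose gradient wrt. W is dC/dz combined with dz/dW
    return upstream @ (W @ x_chk + b_chk)


eps = 1e-6
fd_grad = np.zeros_like(W_chk)
for i in range(W_chk.shape[0]):
    for j in range(W_chk.shape[1]):
        step = np.zeros_like(W_chk)
        step[i, j] = eps
        fd_grad[i, j] = (weighted_z(W_chk + step) - weighted_z(W_chk - step)) / (2 * eps)

# Entry (i, j) is upstream[i] * x_chk[j], which is the outer product
print(np.allclose(fd_grad, np.outer(upstream, x_chk)))
```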

**f)** Now combine the expressions you have worked with so far to compute the gradients! Note that you always need to do a feed forward pass first, saving the values of z and a, before you do backpropagation, as they are used in the derivative expressions.


In [7]:
dC_da = mse_der(predict, target)
dC_dz = (dC_da * sigmoid_der(z))
dC_dW =  np.outer(dC_dz, x)
dC_db = dC_dz

print(dC_dW, dC_db)

[[0.01231316 0.03416783]
 [0.01519907 0.04217594]
 [0.0246748  0.0684702 ]] [0.04459187 0.05504313 0.08935933]


You should get the same results as with autograd.


In [8]:
W_g, b_g = autograd_one_layer(W, b, x, target)
print(W_g, b_g)

[[0.01231316 0.03416783]
 [0.01519907 0.04217594]
 [0.0246748  0.0684702 ]] [0.04459187 0.05504313 0.08935933]


# Exercise 4 - Gradient with two layers writing backpropagation by hand


Now that you have implemented backpropagation for one layer, you have found most of the expressions you will need for more layers. Let's move up to two layers.


In [9]:
x = np.random.rand(2)
target = np.random.rand(4)

W1 = np.random.rand(3, 2)
b1 = np.random.rand(3)

W2 = np.random.rand(4, 3)
b2 = np.random.rand(4)

layers = [(W1, b1), (W2, b2)]

In [10]:
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

We begin by computing the gradients of the last layer, as the gradients must be propagated backwards from the end.

**a)** Compute the gradients of the last layer, just like you did the single layer in the previous exercise.


In [11]:
dC_da2 = mse_der(a2, target)
dC_dz2 = (dC_da2 * sigmoid_der(z2))
dC_dW2 = np.outer(dC_dz2, a1)
dC_db2 = dC_dz2

To find the derivative of the cost wrt. the activation of the first layer, we need a new expression, the one furthest to the right in the following.

$$
\frac{dC}{da_1} = \frac{dC}{dz_2}\frac{dz_2}{da_1}
$$

**b)** What is the derivative of the second layer intermediate wrt. the first layer activation? (First recall how you compute $z_2$)

$$
\frac{dz_2}{da_1}
$$

Since $z_2 = W_2 a_1 + b_2$, the derivative is $\frac{dz_2}{da_1} = W_2$. In the chain rule it is applied as $\frac{dC}{da_1} = W_2^T \frac{dC}{dz_2}$.
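As a sanity check on this derivative, the Jacobian of $z_2$ wrt. $a_1$ can be estimated column by column with finite differences. This is a sketch in plain NumPy with hypothetical `_chk` arrays that mirror the shapes used in this exercise.

```python
import numpy as np

rng = np.random.default_rng(2)
W2_chk = rng.random((4, 3))
b2_chk = rng.random(4)
a1_chk = rng.random(3)

eps = 1e-6
jacobian = np.zeros((4, 3))
for j in range(3):
    step = np.zeros(3)
    step[j] = eps
    z_plus = W2_chk @ (a1_chk + step) + b2_chk
    z_minus = W2_chk @ (a1_chk - step) + b2_chk
    # Column j holds the derivative of every z2 entry wrt. a1[j]
    jacobian[:, j] = (z_plus - z_minus) / (2 * eps)

print(np.allclose(jacobian, W2_chk))

# The chain rule multiplies by the transposed Jacobian, giving a vector shaped like a1
dC_dz2_chk = rng.random(4)
print((jacobian.T @ dC_dz2_chk).shape)  # (3,)
```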


**c)** Use this expression, together with expressions which are equivalent to the ones for the last layer, to compute all the derivatives of the first layer.

$$
\frac{dC}{dW_1} = \frac{dC}{da_1}\frac{da_1}{dz_1}\frac{dz_1}{dW_1}
$$

$$
\frac{dC}{db_1} = \frac{dC}{da_1}\frac{da_1}{dz_1}\frac{dz_1}{db_1}
$$


In [12]:
dC_da1 = W2.T @ dC_dz2
dC_dz1 = dC_da1 * sigmoid_der(z1)
dC_dW1 = np.outer(dC_dz1, x)
dC_db1 = dC_dz1

In [13]:
print(dC_dW1, dC_db1)
print(dC_dW2, dC_db2)

[[0.01306212 0.00206715]
 [0.00538999 0.00085299]
 [0.00194636 0.00030802]] [0.01821309 0.00751551 0.0027139 ]
[[0.00322421 0.00324285 0.00345423]
 [0.01516952 0.01525719 0.0162517 ]
 [0.00027894 0.00028055 0.00029884]
 [0.03864459 0.03886792 0.04140146]] [0.00521425 0.02453241 0.00045111 0.06249667]


**d)** Make sure you got the same gradient as the following code which uses autograd to do backpropagation.


In [14]:
def feed_forward_two_layers(layers, x):
    W1, b1 = layers[0]
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)

    W2, b2 = layers[1]
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)

    return a2

In [15]:
def cost_two_layers(layers, x, target):
    predict = feed_forward_two_layers(layers, x)
    return mse(predict, target)


grad_two_layers = grad(cost_two_layers, 0)
grad_two_layers(layers, x, target)

[(array([[0.01306212, 0.00206715],
         [0.00538999, 0.00085299],
         [0.00194636, 0.00030802]]),
  array([0.01821309, 0.00751551, 0.0027139 ])),
 (array([[0.00322421, 0.00324285, 0.00345423],
         [0.01516952, 0.01525719, 0.0162517 ],
         [0.00027894, 0.00028055, 0.00029884],
         [0.03864459, 0.03886792, 0.04140146]]),
  array([0.00521425, 0.02453241, 0.00045111, 0.06249667]))]

**e)** How would you use the gradient from this layer to compute the gradient of an even earlier layer? Would the expressions be any different?
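One possible way to write the general pattern, assuming layers are numbered $l = 1, \dots, L$, that $a_0 = x$, and that $f_l$ denotes the activation function of layer $l$ (notation introduced here for illustration only):

$$
\frac{dC}{da_l} = W_{l+1}^T \frac{dC}{dz_{l+1}}, \qquad
\frac{dC}{dz_l} = \frac{dC}{da_l} \odot f_l'(z_l), \qquad
\frac{dC}{dW_l} = \frac{dC}{dz_l}\, a_{l-1}^T, \qquad
\frac{dC}{db_l} = \frac{dC}{dz_l}
$$

Here $\odot$ is the elementwise product. The same four expressions are reused for every earlier layer; only the index changes.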


# Exercise 5 - Gradient with any number of layers writing backpropagation by hand


Well done on getting this far! Now it's time to compute the gradient with any number of layers.

First, some code from the general neural network code from last week. Note that we are still sending in one input vector at a time. We will change it to use batched inputs later.


In [16]:
def create_layers(network_input_size, layer_output_sizes):
    layers = []

    i_size = network_input_size
    for layer_output_size in layer_output_sizes:
        W = np.random.randn(layer_output_size, i_size)
        b = np.random.randn(layer_output_size)
        layers.append((W, b))

        i_size = layer_output_size
    return layers


def feed_forward(input, layers, activation_funcs):
    a = input
    for (W, b), activation_func in zip(layers, activation_funcs):
        z = W @ a + b
        a = activation_func(z)
    return a


def cost(layers, input, activation_funcs, target):
    predict = feed_forward(input, layers, activation_funcs)
    return mse(predict, target)

You might already have noticed a very important detail in backpropagation: You need the values from the forward pass to compute all the gradients! The feed forward method above is great for efficiency and for using autograd, as it only cares about computing the final output, but now we need to also save the results along the way.

Here is a function which does that for you.


In [17]:
def feed_forward_saver(input, layers, activation_funcs):
    layer_inputs = []
    zs = []
    a = input
    for (W, b), activation_func in zip(layers, activation_funcs):
        layer_inputs.append(a)
        z = W @ a + b
        a = activation_func(z)

        zs.append(z)

    return layer_inputs, zs, a

**a)** Now, complete the backpropagation function so that it returns the gradient of the cost function wrt. all the weights and biases. Use the autograd calculation below to make sure you get the correct answer.


In [18]:
def backpropagation(
    input, layers, activation_funcs, target, activation_ders, cost_der=mse_der
):
    layer_inputs, zs, predict = feed_forward_saver(input, layers, activation_funcs)

    layer_grads = [() for layer in layers]

    # We loop over the layers, from the last to the first
    for i in reversed(range(len(layers))):
        layer_input, z, activation_der = layer_inputs[i], zs[i], activation_ders[i]

        if i == len(layers) - 1:
            # For last layer we use cost derivative as dC_da(L) can be computed directly
            dC_da = cost_der(predict, target)
        else:
            # For other layers we build on previous z derivative, as dC_da(i) = dC_dz(i+1) * dz(i+1)_da(i)
            (W, b) = layers[i + 1]
            dC_da = W.T @ dC_dz

        dC_dz = dC_da * activation_der(z)
        dC_dW = np.outer(dC_dz,layer_input)
        dC_db = dC_dz

        layer_grads[i] = (dC_dW, dC_db)

    return layer_grads

In [19]:
network_input_size = 2
layer_output_sizes = [3, 4]
activation_funcs = [sigmoid, ReLU]
activation_ders = [sigmoid_der, ReLU_der]

layers = create_layers(network_input_size, layer_output_sizes)

x = np.random.rand(network_input_size)
target = np.random.rand(4)

In [20]:
layer_grads = backpropagation(x, layers, activation_funcs, target, activation_ders)
print(layer_grads)

[(array([[-0.00188195, -0.01645937],
       [ 0.00106259,  0.00929333],
       [ 0.01052162,  0.09202112]]), array([-0.02330049,  0.01315598,  0.13026852])), (array([[-0.19408739, -0.06323691, -0.062431  ],
       [-0.        , -0.        , -0.        ],
       [ 0.24112433,  0.07856233,  0.07756111],
       [ 0.57255527,  0.18654806,  0.18417063]]), array([-0.21581614, -0.        ,  0.26811903,  0.63665479]))]


In [21]:
cost_grad = grad(cost, 0)
cost_grad(layers, x, [sigmoid, ReLU], target)

[(array([[-0.00188195, -0.01645937],
         [ 0.00106259,  0.00929333],
         [ 0.01052162,  0.09202112]]),
  array([-0.02330049,  0.01315598,  0.13026852])),
 (array([[-0.19408739, -0.06323691, -0.062431  ],
         [ 0.        ,  0.        ,  0.        ],
         [ 0.24112433,  0.07856233,  0.07756111],
         [ 0.57255527,  0.18654806,  0.18417063]]),
  array([-0.21581614,  0.        ,  0.26811903,  0.63665479]))]

# Exercise 6 - Batched inputs

Make new versions of all the functions in exercise 5 which now take batched inputs instead. See last week's exercise 5 for details on how to batch inputs to neural networks. You will also need to update the backpropagation function.


In [22]:
# Flipping the W shape
def create_layers_batch(network_input_size, layer_output_sizes):
    layers = []

    i_size = network_input_size
    for layer_output_size in layer_output_sizes:
        W = np.random.rand(i_size, layer_output_size)
        b = np.random.rand(1,layer_output_size)
        layers.append((W, b))

        i_size = layer_output_size
    return layers

def cost(layers, input, activation_funcs, target):
    # The saver returns (layer_inputs, zs, a); only the final activation is the prediction
    _, _, predict = feed_forward_saver_batch(input, layers, activation_funcs)
    return mse(predict, target)

In [23]:
# Flipping the W shape
def feed_forward_saver_batch(input, layers, activation_funcs):
    layer_inputs = []
    zs = []
    a = input
    for (W, b), activation_func in zip(layers, activation_funcs):
        layer_inputs.append(a)
        z = a @ W + b
        a = activation_func(z)

        zs.append(z)

    return layer_inputs, zs, a

In [24]:
def backpropagation_batch(
    input, layers, activation_funcs, target, activation_ders, cost_der=mse_der
):
    layer_inputs, zs, predict = feed_forward_saver_batch(input, layers, activation_funcs)

    layer_grads = [() for layer in layers]

    # We loop over the layers, from the last to the first
    for i in reversed(range(len(layers))):
        layer_input, z, activation_der = layer_inputs[i], zs[i], activation_ders[i]

        if i == len(layers) - 1:
            # For last layer we use cost derivative as dC_da(L) can be computed directly
            dC_da = cost_der(predict, target)
        else:
            # For other layers we build on previous z derivative, as dC_da(i) = dC_dz(i+1) * dz(i+1)_da(i)
            (W, b) = layers[i + 1]
            dC_da = dC_dz @ W.T

        dC_dz = dC_da * activation_der(z)
        dC_dW = layer_input.T @ dC_dz 
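        # Note: this keeps one bias gradient per sample, shape (batch, n_out);
        # the version in exercise 7 sums over the batch axis to match the shape of b
        dC_db = dC_dz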

        layer_grads[i] = (dC_dW, dC_db)

    return layer_grads
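With the flipped weight shape, the single matrix product `layer_input.T @ dC_dz` adds up the per-sample outer products from exercise 5. The sketch below illustrates this in plain NumPy with hypothetical `_chk` arrays. It also shows summing over the batch axis, which is one way to give the bias gradient the same shape as `b`.

```python
import numpy as np

rng = np.random.default_rng(3)
batch, n_in, n_out = 10, 2, 3
layer_input_chk = rng.random((batch, n_in))
dC_dz_chk = rng.random((batch, n_out))

# Batched weight gradient, shape (n_in, n_out) like W in create_layers_batch
batched = layer_input_chk.T @ dC_dz_chk

# The same quantity built from one outer product per sample
per_sample = sum(
    np.outer(layer_input_chk[n], dC_dz_chk[n]) for n in range(batch)
)

print(batched.shape)  # (2, 3)
print(np.allclose(batched, per_sample))

# Summing over the batch axis gives a bias gradient shaped like b, i.e. (1, n_out)
bias_grad = np.sum(dC_dz_chk, axis=0, keepdims=True)
print(bias_grad.shape)  # (1, 3)
```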

In [25]:
network_input_size = 2
layer_output_sizes = [3, 4]
activation_funcs = [sigmoid, ReLU]
activation_ders = [sigmoid_der, ReLU_der]

layers = create_layers_batch(network_input_size, layer_output_sizes)

x = np.random.rand(10,network_input_size)
target = np.random.rand(10,4)

In [26]:
backpropagation_batch(x, layers, activation_funcs, target, activation_ders)

[(array([[0.08831393, 0.040351  , 0.10594327],
         [0.10544734, 0.04756942, 0.12469669]]),
  array([[0.02248005, 0.01047734, 0.03119896],
         [0.01837321, 0.00855545, 0.02163037],
         [0.01784806, 0.00846614, 0.02174907],
         [0.02375857, 0.01029839, 0.02688988],
         [0.02059961, 0.00989108, 0.02478739],
         [0.02602705, 0.0122301 , 0.03011172],
         [0.02514837, 0.01151729, 0.02517586],
         [0.01858566, 0.00757579, 0.01928286],
         [0.01288686, 0.00552679, 0.01684211],
         [0.01919558, 0.00923371, 0.02003074]])),
 (array([[0.49907906, 0.31969654, 0.31054551, 0.36890793],
         [0.55831546, 0.359078  , 0.34906102, 0.41557137],
         [0.51281034, 0.33151543, 0.3230219 , 0.38541462]]),
  array([[0.08358027, 0.05663615, 0.06576174, 0.03958198],
         [0.07593704, 0.03712242, 0.03729279, 0.0337683 ],
         [0.06141587, 0.04527677, 0.03477851, 0.05190472],
         [0.07986921, 0.04966859, 0.05150476, 0.05729417],
         [0.0974

# Exercise 7 - Training


**a)** Complete exercises 6 and 7 from last week, but use your own backpropagation implementation to compute the gradient.
- IMPORTANT: Do not implement the derivative terms for softmax and cross-entropy separately, it will be very hard!
- Instead, use the fact that the derivatives multiplied together simplify to **prediction - target** (see [source1](https://medium.com/data-science/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1), [source2](https://shivammehta25.github.io/posts/deriving-categorical-cross-entropy-and-softmax/))
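The simplification to prediction - target can be verified numerically. The sketch below assumes one-hot targets and a cross-entropy that is summed over the batch; `softmax_rows`, `cross_entropy_sum` and the `_chk` arrays are hypothetical names used only for this check.

```python
import numpy as np


def softmax_rows(z):
    # Row-wise softmax; subtracting each row's max only improves numerical stability
    e_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e_z / np.sum(e_z, axis=1, keepdims=True)


def cross_entropy_sum(z, target):
    # Cross-entropy of softmax(z), summed over the batch
    return np.sum(-target * np.log(softmax_rows(z)))


rng = np.random.default_rng(4)
z_chk = rng.normal(size=(5, 3))
target_chk = np.eye(3)[rng.integers(0, 3, size=5)]  # one-hot rows

eps = 1e-6
fd_grad = np.zeros_like(z_chk)
for idx in np.ndindex(z_chk.shape):
    step = np.zeros_like(z_chk)
    step[idx] = eps
    fd_grad[idx] = (
        cross_entropy_sum(z_chk + step, target_chk)
        - cross_entropy_sum(z_chk - step, target_chk)
    ) / (2 * eps)

# The finite-difference gradient wrt. z should match prediction - target
print(np.allclose(fd_grad, softmax_rows(z_chk) - target_chk, atol=1e-6))
```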

**b)** Use stochastic gradient descent with momentum when you train your network.
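One common form of the momentum update is sketched here, with $\theta$ standing for any weight or bias, $\eta$ for the learning rate, $\gamma$ for the momentum parameter and $v$ for a velocity initialised to zero (all symbols introduced here for illustration):

$$
v \leftarrow \gamma v + \eta \frac{dC}{d\theta}, \qquad
\theta \leftarrow \theta - v
$$

With $\gamma = 0$ this reduces to plain gradient descent. In the stochastic version, the gradient is computed on one mini-batch at a time.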


In [27]:
def softmax(z):
    """Compute softmax values for each set of scores in the rows of the matrix z.
    Used with batched input data."""
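    # Subtract each row's own max (axis=1) for numerical stability without changing the result
    e_z = np.exp(z - np.max(z, axis=1, keepdims=True))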
    return e_z / np.sum(e_z, axis=1)[:, np.newaxis]

def der_softmax(predict, target):
    """Combined derivative of softmax followed by cross-entropy wrt. z, which simplifies to predict - target.
    Used with batched input data."""
    return predict - target

def cross_entropy(predict, target):
    return np.sum(-target * np.log(predict))


def cost_batch(layers, input, activation_funcs, target):
    _, _, predict = feed_forward_saver_batch(input, layers, activation_funcs)
    return cross_entropy(predict, target)

In [40]:
def backpropagation_batch(
    input, layers, activation_funcs, target, activation_ders, cost_der=mse_der
):
    layer_inputs, zs, predict = feed_forward_saver_batch(input, layers, activation_funcs)

    layer_grads = [() for layer in layers]

    # We loop over the layers, from the last to the first
    for i in reversed(range(len(layers))):
        layer_input, z, activation_der = layer_inputs[i], zs[i], activation_ders[i]

        if i == len(layers) - 1:
            # For the last layer, cost_der must return dC_dz directly (e.g. predict - target for softmax with cross-entropy)
            dC_dz = cost_der(predict, target)
        else:
            # For other layers we build on previous z derivative, as dC_da(i) = dC_dz(i+1) * dz(i+1)_da(i)
            (W, b) = layers[i + 1]
            dC_da = dC_dz @ W.T
            dC_dz = dC_da * activation_der(z)

        dC_dW = layer_input.T @ dC_dz 
        dC_db = np.sum(dC_dz, axis=0, keepdims=True)

        layer_grads[i] = (dC_dW, dC_db)

    return layer_grads

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iterations = 10000
batch_size = 100
learning_rate = 0.001
momentum = 0.9

iris = datasets.load_iris()
scaler = StandardScaler()
input = iris.data

input_scaled = scaler.fit_transform(input)
targets = np.zeros((len(iris.data), 3))
for i, t in enumerate(iris.target):
    targets[i, t] = 1
network_input_size = 4
layer_output_sizes = [5,3]
activation_funcs = [sigmoid, softmax]
layers = create_layers_batch(network_input_size, layer_output_sizes)


X_train, X_test, y_train, y_test = train_test_split(input_scaled, targets, test_size=0.2)

velocities = []
for W, b in layers:
    v_W = np.zeros_like(W)
    v_b = np.zeros_like(b)
    velocities.append((v_W, v_b))

best_model_score = float('inf')
best_model = None

for i in range(iterations):
    for j in range(0, len(X_train), batch_size):
        x_batch = X_train[j : j + batch_size]
        y_batch = y_train[j : j + batch_size]

        layer_grads = backpropagation_batch(
            x_batch, layers, activation_funcs, y_batch, [sigmoid_der, der_softmax], cost_der=der_softmax
        )

        
        # Update weights and biases using gradient descent with momentum
        for k in range(len(layers)):
            W, b = layers[k]
            dC_dW, dC_db = layer_grads[k]
            v_W, v_b = velocities[k]

            v_W = momentum * v_W + learning_rate * dC_dW
            v_b = momentum * v_b + learning_rate * np.mean(dC_db, axis=0, keepdims=True)

            W -= v_W
            b -= v_b
            layers[k] = (W, b)
            velocities[k] = (v_W, v_b)
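    test_cost = cost_batch(layers, X_test, activation_funcs, y_test)
    if test_cost < best_model_score:
        best_model_score = test_cost
        # Copy the arrays: W and b are updated in place, so storing layers itself would alias the live model
        best_model = [(W.copy(), b.copy()) for W, b in layers]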
    if i % 1000 == 0:
        train_cost = cost_batch(layers, X_train, activation_funcs, y_train)
        test_cost = cost_batch(layers, X_test, activation_funcs, y_test)
        print(f"Iteration {i}, Train Cost: {train_cost}, Test Cost: {test_cost}")

(array([[-1.08495369, -0.35037467,  1.72719805,  1.33091219,  4.15005009],
       [ 0.65048982,  1.72600037,  0.67629567, -1.67112747, -1.3427484 ],
       [-1.33564038, -0.83272631,  1.7835273 ,  1.93031626,  4.67815805],
       [-1.34297368, -0.60851439,  2.10500195,  1.85161267,  4.81697091]]), array([[ 0.57840535, -0.82932269, -2.84624445, -0.55062073, -1.30598841]]))


In [37]:
def accuracy(predictions, targets):
    one_hot_predictions = np.zeros(predictions.shape)

    for i, prediction in enumerate(predictions):
        one_hot_predictions[i, np.argmax(prediction)] = 1
    return accuracy_score(one_hot_predictions, targets)

_ , _ ,predictions = feed_forward_saver_batch(X_test, best_model, activation_funcs)
print("Test Accuracy:", accuracy(predictions, y_test))

Test Accuracy: 1.0


# Exercise 8 (Optional) - Object orientation

Passing in the layers, activation functions, activation derivatives and cost derivatives into the functions each time leads to code which is easy to understand in isolation, but messier when used in a larger context with data splitting, data scaling, gradient methods and so forth. Creating an object which stores these values can lead to code which is much easier to use.

**a)** Write a neural network class. You are free to implement it how you see fit, though we strongly recommend to not save any input or output values as class attributes, nor let the neural network class handle gradient methods internally. Gradient methods should be handled outside, by performing general operations on the layer_grads list using functions or classes separate to the neural network.

We provide here a skeleton structure which should get you started.


In [31]:
class NeuralNetwork:
    def __init__(
        self,
        network_input_size,
        layer_output_sizes,
        activation_funcs,
        activation_ders,
        cost_fun,
        cost_der,
    ):
        pass

    def predict(self, inputs):
        # Simple feed forward pass
        pass

    def cost(self, inputs, targets):
        pass

    def _feed_forward_saver(self, inputs):
        pass

    def compute_gradient(self, inputs, targets):
        pass

    def update_weights(self, layer_grads):
        pass

    # These last two methods are not needed in the project, but they can be nice to have! The first one has a layers parameter so that you can use autograd on it
    def autograd_compliant_predict(self, layers, inputs):
        pass

    def autograd_gradient(self, inputs, targets):
        pass
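To illustrate the recommendation of handling gradient methods outside the network class, here is a minimal sketch of a separate momentum optimizer that only operates on lists of `(W, b)` tuples. The class name `MomentumOptimizer` and its `step` method are hypothetical and not part of the provided skeleton. The usage example minimises a simple quadratic rather than training a network.

```python
import numpy as np


class MomentumOptimizer:
    """Hypothetical helper: gradient descent with momentum on a list of (W, b) tuples."""

    def __init__(self, learning_rate=0.001, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocities = None  # created lazily, once the layer shapes are known

    def step(self, layers, layer_grads):
        """Return updated layers; the input arrays are not modified."""
        if self.velocities is None:
            self.velocities = [
                (np.zeros_like(W), np.zeros_like(b)) for W, b in layers
            ]
        new_layers = []
        for k, ((W, b), (dC_dW, dC_db)) in enumerate(zip(layers, layer_grads)):
            v_W, v_b = self.velocities[k]
            v_W = self.momentum * v_W + self.learning_rate * dC_dW
            v_b = self.momentum * v_b + self.learning_rate * dC_db
            self.velocities[k] = (v_W, v_b)
            new_layers.append((W - v_W, b - v_b))
        return new_layers


# Usage: minimise sum(W**2) + sum(b**2), whose gradients are 2 * W and 2 * b
layers_demo = [(np.ones((2, 3)), np.ones((1, 3)))]
optimizer = MomentumOptimizer(learning_rate=0.1, momentum=0.5)
for _ in range(100):
    grads_demo = [(2 * W, 2 * b) for W, b in layers_demo]
    layers_demo = optimizer.step(layers_demo, grads_demo)

print(np.abs(layers_demo[0][0]).max())  # close to zero after 100 steps
```

Returning new arrays instead of updating in place makes it easier to keep a copy of an earlier model, at the cost of some extra memory.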