# Neural networks and backpropagation

In this notebook we will exercise the feedforward, fully-connected neural network, aka multilayer perceptron (MLP). With a single hidden layer it looks like this:

![](figures/Colored_neural_network.svg)

Every neuron takes a list of inputs $x_i$, applies a weighted sum and feeds the result into an activation function $f$ such that the output $a_i$ is given by

$$a_i = f\left(\sum_j w_{ij} x_j\right) \quad \text{or} \quad \vec a = f(W\vec x)$$

Note how the argument of the activation function is essentially a **matrix multiplication** of the weight matrix with the input vector!

For an arbitrary number of layers, the output vector $\vec a^{(l)}$ of layer $l$ is then calculated from the output vector $\vec a^{(l-1)}$ of the previous layer:

$$\vec a^{(l)} = f(W^{(l)}\vec a^{(l-1)})$$
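
To make the recursion concrete, here is a minimal NumPy sketch of the forward pass. The layer sizes, the random weights and the helper names `relu` and `forward` are made up for illustration; they are not part of the notebook's later code.

```python
import numpy as np

def relu(z):
    # elementwise max(0, z)
    return np.maximum(z, 0.0)

def forward(x, weights, f=relu):
    """Apply a^(l) = f(W^(l) a^(l-1)) layer by layer, starting from a^(0) = x."""
    a = x
    for W in weights:
        a = f(W @ a)
    return a

rng = np.random.default_rng(0)
# hypothetical layer sizes: 2 inputs -> 3 hidden neurons -> 1 output
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
x = np.array([0.5, -1.0])

out = forward(x, weights)
print(out.shape)  # (1,)
```

Each weight matrix has shape (neurons in this layer, neurons in the previous layer), so the matrix-vector products chain without any reshaping.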

Popular choices for the activation function include ([source][2]):

![1]

We will implement a simple 1-hidden-layer neural network first using available libraries and then manually.

[1]: figures/activation_functions.png "overview of commonly used activation functions"
[2]: https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-deep-learning

We again use our toy example, the 2D two-moons dataset:

In [None]:
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
import numpy as np

In [None]:
X, y = make_moons(noise=0.25, n_samples=100, random_state=0)

In [None]:
plt.scatter(*X[y==0].T)
plt.scatter(*X[y==1].T)

# With sklearn

A simple MLP classifier is available in scikit-learn:

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
model = MLPClassifier().fit(X, y)

In [None]:
plt.plot(model.loss_curve_)

In [None]:
def visualize_classifier(predict, xmin, xmax, ymin, ymax, **kwargs):
    xx, yy = np.meshgrid(
        np.linspace(xmin, xmax, 100),
        np.linspace(ymin, ymax, 100),
    )
    X = np.stack([xx, yy], axis=-1).reshape(-1, 2)
    zz = predict(X).reshape(xx.shape)
    plt.contourf(xx, yy, zz, levels=100, **kwargs)

In [None]:
visualize_classifier(lambda x: model.predict_proba(x)[:, 1], -2, 2.5, -2, 2.5, cmap="RdBu")
plt.scatter(*X[y==0].T, color="red")
plt.scatter(*X[y==1].T, color="blue")

Let's tune the parameters a bit to make a relatively simple model that we can then reproduce more and more manually:

In [None]:
model = MLPClassifier(
    hidden_layer_sizes=(32,), solver="sgd", batch_size=len(X), learning_rate_init=0.2, max_iter=1000
).fit(X, y)

In [None]:
plt.plot(model.loss_curve_)

In [None]:
visualize_classifier(lambda x: model.predict_proba(x)[:, 1], -2, 2.5, -2, 2.5, cmap="RdBu")
plt.scatter(*X[y==0].T, color="red")
plt.scatter(*X[y==1].T, color="blue")

# With pytorch

[PyTorch](https://pytorch.org) is one of the most popular machine learning libraries to date (2024). It is mainly developed by Meta AI. Have a look at the [tutorials](https://pytorch.org/tutorials) to learn more.

Other popular choices are [TensorFlow](https://www.tensorflow.org)/[Keras](https://keras.io) and [JAX](https://jax.readthedocs.io).

The main objects in torch are so-called *tensors*, which have an API very similar to NumPy arrays. PyTorch builds computation graphs dynamically, which allows for a lot of flexibility and easy debugging.

In [None]:
import torch
from torch import nn

To convert a numpy array to a torch tensor, we can just call the `torch.tensor` constructor on it.

For some operations PyTorch is rather strict about not mixing data types of different precision. Since most NN parameters are initialized as 32-bit floating point numbers, we will also convert our data to this type:

In [None]:
X, y = torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)
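
The following sketch shows where the precision mismatch comes from. It uses a made-up array `arr` instead of the moons data and assumes torch's default dtype has not been changed.

```python
import numpy as np
import torch

arr = np.zeros((3, 2))                        # NumPy defaults to float64
t64 = torch.tensor(arr)                       # the dtype is inherited from the array
t32 = torch.tensor(arr, dtype=torch.float32)  # explicit conversion
layer = torch.nn.Linear(2, 1)                 # parameters default to float32

print(t64.dtype, t32.dtype, layer.weight.dtype)
out = layer(t32)                              # dtypes match, so this works
print(out.shape)
```

Feeding `t64` into the layer instead would typically fail with an error about mismatched dtypes, which is why we convert up front.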

We create our 1-hidden-layer MLP using the `nn.Sequential` constructor:

In [None]:
neurons = 32
model = nn.Sequential(nn.Linear(2, neurons), nn.ReLU(), nn.Linear(neurons, 1), nn.Sigmoid())
model

The `torch.optim` package contains implementations of various optimization algorithms. Here we will just use the plain stochastic gradient descent algorithm:

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

We told the optimizer about the parameters of the model (which are also just torch tensors). When we later call `.step()`, the optimizer will apply its update rule using the gradients that have been attached to the parameters.

<div class="alert alert-block alert-success">
    <h2>Exercise 1</h2><br> How many parameters has this model?
</div>

Let's fit the model:

In [None]:
history = []
for i in range(1000):
    optimizer.zero_grad()
    y_pred = model(X).squeeze(1)
    loss = nn.functional.binary_cross_entropy(y_pred, y)
    loss.backward()
    history.append(loss.item())
    optimizer.step()

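* `optimizer.zero_grad()` resets the gradients of all parameters. By default torch accumulates gradients: every call to `.backward()` adds to whatever is already stored in `.grad`, so without the reset the gradients of previous iterations would be added to the new ones. **Don't forget this**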
* the forward pass is calculated by `model(X)` - the `.squeeze(1)` changes the shape of the output from (N, 1) to (N,)
* `loss` is our objective we want to minimize, in this case the binary cross entropy (or negative log likelihood)
* `loss.backward()` will run the backward pass and store the gradient of the loss w.r.t. each parameter in its `.grad` attribute, for all parameters that have `requires_grad=True` set
* `optimizer.step()` will perform the actual gradient update
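
The accumulation behaviour mentioned in the first bullet is easy to see on a tiny example. This is a minimal sketch with a made-up tensor `w`, independent of the model above:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# first backward pass: d/dw sum(w**2) = 2*w
(w ** 2).sum().backward()
first = w.grad.clone()

# second backward pass without resetting: the new gradient is added on top
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# resetting and repeating recovers the single-pass gradient
w.grad.zero_()
(w ** 2).sum().backward()
reset = w.grad.clone()

print(first, accumulated, reset)
```

Here `accumulated` is twice `first`, while `reset` equals `first` again.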

In [None]:
plt.plot(history)

In [None]:
with torch.no_grad():
    visualize_classifier(
        lambda x: model(torch.tensor(x, dtype=torch.float32)).squeeze().numpy(),
        -2, 2.5, -2, 2.5,
        cmap="RdBu"
    )
plt.scatter(*X[y==0].T, color="red")
plt.scatter(*X[y==1].T, color="blue")

# With torch (manual optimizer)

Next, let's leave out the optimizer and perform the gradient update manually, still using pytorch to get the gradient.

To get the gradient of a loss w.r.t. some parameters we have to set `requires_grad=True` for the corresponding tensors:

In [None]:
x = torch.linspace(0, 2*np.pi, 100, requires_grad=True)

Now, all tensors that are created by operations on this tensor will have a `grad_fn` attribute:

In [None]:
f = torch.sin(x)  # f is the result of an operation on x, so it is part of the graph
f.grad_fn

To convert such tensors (implicitly or explicitly) to numpy arrays we need to "detach" them from the computation graph:

In [None]:
plt.plot(x.detach(), f.detach())

## The VJP
What does `grad_fn` do? It calculates the so-called *vector-Jacobian product* (VJP).

What is that? To calculate the gradient of a loss function we use the chain rule, in the single variable case

$$\frac{\partial f(g(x))}{\partial x} = \frac{\partial f}{\partial g}\frac{\partial g}{\partial x}$$

and in the [multivariable case](https://en.wikipedia.org/wiki/Chain_rule#General_rule:_Vector-valued_functions_with_multiple_inputs) ($x\in \mathbb{R}^n, g: \mathbb{R}^n\rightarrow\mathbb{R}^m, f: \mathbb{R}^m\rightarrow\mathbb{R}^k$)

$$\mathbf{J}_{f\circ g}(x) = \mathbf{J}_f(g(x))\,\mathbf{J}_g(x),\quad \text{in components (summing over the repeated index } k\text{)} \quad \frac{\partial f_i}{\partial x_j}=\frac{\partial f_i}{\partial g_k}\frac{\partial g_k}{\partial x_j}$$

For **the gradient of a scalar** (loss):

$$\frac{\partial f}{\partial x_j}=\underbrace{\frac{\partial f}{\partial g_k}}_{\mathrm{vector}}\underbrace{\frac{\partial g_k}{\partial x_j}}_{\mathrm{Jacobian}}$$

So we need to matrix multiply the (**incoming**) gradient (row) **vector** with the **Jacobian** in each step, the vector-Jacobian-product **VJP**.

The cool thing: usually we **don't need to compute the full Jacobian to get the VJP!**

E.g. here, our tensor `f` applies a sin function componentwise to an input `x`. It's pretty clear we don't need to compute the Jacobian (it's a diagonal matrix).
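
As a small sanity check of this claim, the sketch below builds the diagonal Jacobian explicitly in NumPy and compares the full product with the elementwise shortcut. The names `x_demo` and `v_demo` and the values are made up and kept separate from the notebook's `x` and `f`.

```python
import numpy as np

x_demo = np.linspace(0, 2 * np.pi, 5)
v_demo = np.arange(1.0, 6.0)   # an arbitrary incoming gradient (row) vector

# explicit route: build the full (5, 5) Jacobian of sin applied elementwise
J = np.diag(np.cos(x_demo))
vjp_full = v_demo @ J

# shortcut: only the diagonal is non-zero, so an elementwise product suffices
vjp_short = v_demo * np.cos(x_demo)

print(np.allclose(vjp_full, vjp_short))  # True
```

The explicit route needs storage and work that grow quadratically with the number of components, whereas the shortcut grows only linearly.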

So, the VJP-way of calculating the derivative componentwise would be to take an incoming gradient vector that is all ones:

In [None]:
v = torch.ones_like(f)
v

and calculate the VJP by just multiplying this with the componentwise derivative, here the cosine:

In [None]:
vjp = v * torch.cos(x)
vjp

This is also what `grad_fn` will do

In [None]:
f.grad_fn(v)

`.backward` will then go through the whole chain of computations backwards and calculate the VJP in each step.

Here we only have one step. Since our tensor is not a scalar we need to feed in the incoming gradient as well:

In [None]:
f.backward(v)

This doesn't return anything, but rather attaches the resulting gradient to all tensors with `requires_grad=True` (in torch called "leaf" tensors):

In [None]:
x.grad

In [None]:
plt.plot(x.detach(), f.detach())
plt.plot(x.detach(), x.grad)

With that, we will now fit our neural network again, implementing the gradient step manually:

In [None]:
model = nn.Sequential(nn.Linear(2, neurons), nn.ReLU(), nn.Linear(neurons, 1), nn.Sigmoid())

To update a tensor in-place by adding a value, we can use the `add_` method. Our gradient update just adds the negative gradient, scaled by a learning rate `lr`, to each of the parameters:

In [None]:
def step(lr=1):
    with torch.no_grad():
        for par in model.parameters():
            par.add_(-lr * par.grad)

The `torch.no_grad()` context manager ensures that all operations in this block won't be attached to the computation graph.

The training loop becomes:

In [None]:
history = []
for i in range(1000):
    model.zero_grad()
    y_pred = model(X).squeeze(1)
    loss = nn.functional.binary_cross_entropy(y_pred, y)
    loss.backward()
    step()
    history.append(loss.detach().item())

In [None]:
plt.plot(history)

In [None]:
with torch.no_grad():
    visualize_classifier(lambda x: model(torch.tensor(x, dtype=torch.float32)).squeeze(1), -2, 2.5, -2, 2.5, cmap="RdBu")
plt.scatter(*X[y==0].T, color="red")
plt.scatter(*X[y==1].T, color="blue")

# With torch (manual backpropagation)

Now we also want to calculate the backward pass completely manually, meaning we take the VJP in each step ourselves.

This part is inspired by [part 4](https://www.youtube.com/watch?v=q8SA3rM6ckI) of [Andrej Karpathy's *Neural Networks: Zero to Hero* tutorials](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)

First, when we define the parameters of the network, we also need to decide on starting values.

Typically they are initialized to small random values around 0. While there are [schemes that optimize this](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_uniform_), here we will just take a standard normal distribution, scaled by 0.1 for all the parameters:

In [None]:
w1 = (torch.randn(2, neurons) * 0.1).requires_grad_()

Let's start with a very simple neural network with one hidden layer and no bias, trained with the mean squared error loss function.

The first intermediate output is the matrix multiplication of the input vectors with the weight matrix of the hidden layer:

In [None]:
z1 = X @ w1; z1.retain_grad()
z1.shape

This was now a matrix-matrix multiplication since we did this for our whole input data at once! When training neural networks we will almost always work with batches of data, so this is very common.
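
Note that the batch version stores one sample per row, so the weight matrix appears transposed compared to the $W\vec x$ convention from the introduction. A minimal NumPy sketch with made-up shapes (the names `X_demo` and `W` are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 2))   # 4 samples with 2 features each, one per row
W = rng.normal(size=(3, 2))        # weights in the W @ x convention: (neurons, inputs)

# one matrix-vector product per sample ...
per_sample = np.stack([W @ x for x in X_demo])

# ... equals a single matrix-matrix product with the transposed weights
batched = X_demo @ W.T

print(batched.shape)                     # (4, 3)
print(np.allclose(per_sample, batched))  # True
```

In this picture, `w1` with shape `(2, neurons)` plays the role of `W.T`.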

To crosscheck our manual gradient calculations later we will set `retain_grad` for all intermediate outputs, such that torch will also attach gradients to these in the backward pass.

Next, we apply the relu activation function:

In [None]:
a1 = torch.relu(z1); a1.retain_grad()

Then we continue to the second (final) layer:

In [None]:
w2 = (torch.randn(neurons, 1) * 0.1).requires_grad_()

In [None]:
z2 = a1 @ w2; z2.retain_grad()
z2.shape

Since this is the final output with one value per data point we will squeeze the last dimension:

In [None]:
z2 = z2.squeeze(1); z2.retain_grad()
z2.shape

We don't apply an activation function, but instead use the linear output to calculate the mean squared error:

In [None]:
loss = torch.mean((y - z2) ** 2)
loss

As a crosscheck, we run the backward pass with torch:

In [None]:
loss.backward()

Now we manually go backwards. We will create variables with a `d` in front of them, where e.g. `dz2` means the gradient of the loss w.r.t. all components of `z2`.

The gradient of the loss $L = \frac{1}{N} \sum_i (y_i - \hat{y}_i)^2$ is given by $\frac{\partial L}{\partial \hat{y}_i} = -\frac{2}{N}(y_i - \hat{y}_i)$

In [None]:
dz2 = - 2 / len(z2) * (y - z2)
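
Independently of torch, a formula like this can be checked with finite differences. The sketch below does that in NumPy with made-up labels and predictions; `mse`, `y_true` and `y_hat` are illustrative names, not the notebook's variables.

```python
import numpy as np

def mse(y_true, y_hat):
    return np.mean((y_true - y_hat) ** 2)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=6).astype(float)  # made-up labels
y_hat = rng.normal(size=6)                         # made-up predictions

# analytic gradient from the formula above
analytic = -2 / len(y_hat) * (y_true - y_hat)

# central finite differences, one component at a time
eps = 1e-6
numeric = np.zeros_like(y_hat)
for i in range(len(y_hat)):
    shift = np.zeros_like(y_hat)
    shift[i] = eps
    numeric[i] = (mse(y_true, y_hat + shift) - mse(y_true, y_hat - shift)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

Finite differences need two loss evaluations per component, so they are only practical as a check on small examples, not as a replacement for backpropagation.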

Crosscheck with what torch got:

In [None]:
(z2.grad == dz2).all()
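
Comparing floats with `==` asks for bitwise-identical results. That can fail even for a correct manual gradient, because two mathematically equivalent computations may round differently. A tolerance-based comparison with `torch.allclose` is a more forgiving alternative. A minimal illustration with made-up numbers:

```python
import torch

a = torch.tensor(0.1, dtype=torch.float64) + torch.tensor(0.2, dtype=torch.float64)
b = torch.tensor(0.3, dtype=torch.float64)

exact = bool((a == b).all())   # bitwise comparison
close = torch.allclose(a, b)   # comparison within a relative/absolute tolerance

print(exact, close)
```

If one of the exact crosschecks in this notebook returns `False`, it is worth trying `torch.allclose(manual, tensor.grad)` before hunting for a bug.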

Now we need the VJP for a matrix multiplication

<div class="alert alert-block alert-success">
    <h2>Exercise 2</h2>
    What is the VJP for a matrix multiplication?<br>
    <b>Hint:</b> It's also going to be a matrix multiplication. You can try to guess this from the shapes of the involved tensors.
    It's also instructive to derive it once on a sheet of paper.
</div>

The matrix multiplication was `z2 = a1 @ w2`

Since we squeezed the last dimension of `z2` we will have to `unsqueeze` it again for the following operation:

In [None]:
dz2.unsqueeze(1).shape

So we have the following tensor shapes

In [None]:
dz2.unsqueeze(1).shape, a1.shape, w2.shape

From this we need to get `da1` and `dw2`, the gradients w.r.t `a1` and `w2` via a VJP with the `dz2` gradient vector

In [None]:
dw2 = ... # your task

In [None]:
dw2.shape

In [None]:
(dw2 == w2.grad).all()

In [None]:
da1 = ... # your task
da1.shape

In [None]:
(da1 == a1.grad).all()

What is the derivative of relu?

In [None]:
dz1 = (z1 > 0) * da1

In [None]:
(dz1 == z1.grad).all()
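
The mask `(z1 > 0)` is the derivative of relu: 1 where the input is positive and 0 where it is negative. The NumPy sketch below checks this against finite differences, using made-up values that avoid the kink at exactly 0.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

z = np.array([-2.0, -0.5, 0.3, 1.5])       # made-up pre-activations, none exactly 0
incoming = np.array([1.0, 2.0, 3.0, 4.0])  # made-up incoming gradient

# VJP via the mask: the gradient passes where z > 0 and is blocked elsewhere
vjp = (z > 0) * incoming

# finite-difference slope of relu at each component, times the incoming gradient
eps = 1e-6
slope = (relu(z + eps) - relu(z - eps)) / (2 * eps)

print(vjp)  # [0. 0. 3. 4.]
print(np.allclose(vjp, slope * incoming))
```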

And another matrix multiplication to get the gradient w.r.t. the weights of the first layer:

In [None]:
w1.shape, dz1.shape, X.shape

In [None]:
dw1 = X.T @ dz1
dw1.shape

In [None]:
(dw1 == w1.grad).all()

The full training loop then looks like this:

In [None]:
# initialize parameters
w1 = torch.randn(2, neurons) * 0.1
w2 = torch.randn(neurons, 1) * 0.1

# training loop
lr = 0.1
history = []
for i in range(100):
    # forward
    z1 = X @ w1
    a1 = torch.relu(z1)
    z2 = a1 @ w2
    z2 = z2.squeeze(1)
    loss = torch.mean((y - z2) ** 2)
    
    history.append(loss.item())

    # backward
    dz2 = - 2 / len(z2) * (y - z2)
    dw2 = ...
    da1 = ...
    dz1 = (z1 > 0) * da1
    dw1 = ...

    # gradient update
    for par, grad in [(w1, dw1), (w2, dw2)]:
        par.add_(-lr * grad)

In [None]:
plt.plot(history)

In [None]:
with torch.no_grad():
    visualize_classifier(
        lambda x: torch.relu(torch.tensor(x, dtype=torch.float32) @ w1) @ w2,
        -2, 2.5, -2, 2.5, cmap="RdBu"
    )
plt.scatter(*X[y==0].T, color="red")
plt.scatter(*X[y==1].T, color="blue")

This does not look too great. There are 3 things we can improve:

* implement a bias for the neurons
* apply a sigmoid activation function to constrain the output to lie between 0 and 1
* switch to the binary cross entropy loss

<div class="alert alert-block alert-success">
    <h2>Exercise 3</h2>
    Add a bias term to the neurons<br>
    <b>Hint:</b> For a single sample, the Jacobian of the output of a layer w.r.t. the biases is an identity matrix. Since the same bias is shared by all samples in the batch, the VJP is the sum of the incoming gradient over the batch dimension.<br>
    So if we have <code>z = a @ w + b</code>, the gradient <code>db = dz.sum(axis=0)</code>
</div>
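
The hint itself can be verified numerically before it goes into the training loop. The sketch below uses made-up arrays and a helper `scalar` whose gradient w.r.t. `z` is exactly the chosen `dz`; none of these names come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 3))    # made-up batch of 5 inputs to the layer
w = rng.normal(size=(3, 2))
b = rng.normal(size=2)
dz = rng.normal(size=(5, 2))   # made-up incoming gradient, one row per sample

def scalar(b):
    # a scalar function of z = a @ w + b whose gradient w.r.t. z is exactly dz
    return np.sum((a @ w + b) * dz)

db = dz.sum(axis=0)            # the hint: sum the incoming gradient over the batch

# central finite differences w.r.t. each bias component
eps = 1e-6
numeric = np.array([
    (scalar(b + eps * np.eye(2)[k]) - scalar(b - eps * np.eye(2)[k])) / (2 * eps)
    for k in range(2)
])

print(np.allclose(db, numeric, atol=1e-6))
```

The sum appears because `b` is broadcast to every row of `a @ w`, so every sample contributes to its gradient.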

<div class="alert alert-block alert-success">
    <h2>Exercise 4</h2>
    Apply a sigmoid activation function to the final output.<br>
    <b>Hint:</b> You can use <code>z = torch.sigmoid(a)</code>. The derivative of the sigmoid function is given by $f'(x) = f(x)(1-f(x))$
</div>
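
The derivative identity from the hint can be checked numerically as well. A minimal NumPy sketch, with `sigmoid` written out by hand and made-up inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)

# the identity from the hint: f'(x) = f(x) (1 - f(x))
analytic = sigmoid(x) * (1 - sigmoid(x))

# central finite differences for comparison
eps = 1e-5
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))
```

Since the derivative can be written in terms of the output alone, the backward pass can reuse the value already computed in the forward pass.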

<div class="alert alert-block alert-success">
    <h2>Exercise 5</h2>
    Change the loss function to binary cross entropy.<br>
    <b>Reminder:</b> The formula is $L = -\frac{1}{N}\sum_i \left[y_i\log(\hat y_i) + (1 - y_i)\log(1 - \hat y_i)\right]$ for NN outputs $\hat y_i$ and labels $y_i$
</div>