# Backpropagation

Take a look at the mathematics of the backpropagation algorithm.

We will cover the following:

- Notations

- Forward pass

- Backward pass

- The chain rule for the backward pass

Neural networks (NNs) are non-linear classifiers that can be formulated as a series of matrix multiplications interleaved with non-linear activation functions. Just like linear classifiers, they can be trained using the same principle we followed before, namely the gradient descent algorithm. The difficulty lies in computing the gradients.

But first things first. 

Let’s start with a straightforward example of a two-layered NN, with each layer containing just one neuron.

# Notations

- The superscript defines the layer that we are in.

- $o^{L}$ denotes the activation of layer L.

- $w^{L}$ is a scalar weight of the layer L.

- $b^{L}$ is the bias term of layer L.

- $C$ is the cost function, $t$ is our target class, and $f$ is the activation function.


# Forward pass

Our lovely model would look something like this in a simple sketch:


![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig09.PNG)

We can write the output of a neuron at layer $L$ as:

> $o^{L} = f(w^{L}o^{L-1} + b^{L})$

To simplify things, let’s define:

> $z^{L} = w^{L}o^{L-1} + b^{L}$

so that our basic equation becomes:

> $o^L = f(z^L)$

We also know that our loss function is:

> $C = (o^L - t)^2$

This is the so-called **forward pass**. 

We take some input and pass it through the network. From the output of the network, we can compute the loss $C$.
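The forward pass above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the sigmoid is an assumed choice for $f$ (the text leaves the activation unspecified), and the weights, biases, input `x`, and target `t` are hypothetical values picked for the example.

```python
import math


def sigmoid(z):
    """Logistic activation; an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, params, f=sigmoid):
    """Pass the input through each one-neuron layer and return all activations."""
    activations = [x]                    # o^0 is the input itself
    for w, b in params:                  # one (w, b) pair per layer
        z = w * activations[-1] + b      # z^L = w^L o^{L-1} + b^L
        activations.append(f(z))         # o^L = f(z^L)
    return activations


def cost(output, t):
    """Squared error C = (o^L - t)^2."""
    return (output - t) ** 2


# Hypothetical values, chosen only for illustration.
params = [(0.5, 0.1), (-0.3, 0.2)]       # (w^1, b^1), (w^2, b^2)
x, t = 1.0, 1.0

activations = forward(x, params)
print("activations:", activations)
print("loss C:", cost(activations[-1], t))
```

Keeping every intermediate activation, rather than only the final output, is a deliberate choice: the backward pass needs $o^{L-1}$ for each layer.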

# Backward pass

> The backward pass is the process of adjusting the weights $w$ in all the layers to minimize the loss $C$.

To adjust the weights based on the training example, we can use our known **update rule**:

> $w_t^L = w_{t-1}^L - \lambda \frac{\delta C}{\delta w^L}$

where $\lambda$ is the learning rate that scales down the gradient. Here the subscript $t$ counts update steps; it is not the target $t$ from the loss.

It should be clear by now that the only thing left to compute is the gradient $\frac{\delta C}{\delta w^L}$ (the derivative of the loss with respect to the weight).

One way to think about computing $\frac{\delta C}{\delta w^L}$ is through the following diagram, which is called a computational graph:


![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig10.PNG)

The graph summarizes the operations we performed in the forward pass. To convert it into math, we need to revisit the chain rule.

# The chain rule for the backward pass

To compute the gradient $\frac{\delta C}{\delta w^L}$, our most useful tool is calculus and the chain rule.

Using both, we can write:

> $\frac{\delta C}{\delta w^L} = \frac{\delta z^L}{\delta w^L} \frac{\delta o^L}{\delta z^L} \frac{\delta C}{\delta o^L}$

The gradient of a layer depends on quantities computed elsewhere in the network: the activation $o^{L-1}$ comes from the previous neuron, which in turn depends on the one before it. To compute the gradients for every layer, we need to go back (through the chain rule) all the way to the beginning of the network.

In other words, we need to propagate the error backwards. This is how the backpropagation algorithm got its name.
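As a sketch of what going back means in our two-layer example, the gradient for the first layer's weight can be written by extending the chain through the second layer (assuming the two layers are numbered $L-1$ and $L$):

> $\frac{\delta C}{\delta w^{L-1}} = \frac{\delta z^{L-1}}{\delta w^{L-1}} \frac{\delta o^{L-1}}{\delta z^{L-1}} \frac{\delta z^{L}}{\delta o^{L-1}} \frac{\delta o^{L}}{\delta z^{L}} \frac{\delta C}{\delta o^{L}}$

The last two factors are the same ones that appear in the gradient for $w^L$, and since $z^L = w^Lo^{L-1} + b^L$, the middle factor is simply $\frac{\delta z^L}{\delta o^{L-1}} = w^L$.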

To find the gradient, let’s compute each factor in turn. By using basic calculus, we get:

>> $C = (o^L - t)^2 \rightarrow \frac{\delta C}{\delta o^L} = 2(o^L - t)$

>> $o^L = f(w^Lo^{L-1} + b^L) = f(z^L) \rightarrow \frac{\delta o^L}{\delta z^L} = f'(z^L)$

>> $z^L = w^Lo^{L-1} + b^L \rightarrow \frac{\delta z^L}{\delta w^L} = o^{L-1}$

Combining them all together, we acquire our final gradient:

>> $\frac{\delta C}{\delta w^L} = o^{L-1} \ast f'(z^L) \ast 2(o^L - t)$
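One way to gain confidence in this result is to compare it with a numerical estimate. The sketch below multiplies the three factors, with the derivative of $f$ evaluated at the pre-activation $z^L$, and checks the product against a central finite difference. The sigmoid activation and all numeric values are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic activation; an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid, evaluated at the pre-activation z."""
    s = sigmoid(z)
    return s * (1.0 - s)


def loss(w, b, o_prev, t):
    """C = (f(w * o_prev + b) - t)^2 for a single neuron."""
    return (sigmoid(w * o_prev + b) - t) ** 2


def grad_w(w, b, o_prev, t):
    """Chain rule: dC/dw = dz/dw * do/dz * dC/do."""
    z = w * o_prev + b
    o = sigmoid(z)
    dz_dw = o_prev              # from z = w * o_prev + b
    do_dz = sigmoid_prime(z)    # f'(z)
    dC_do = 2.0 * (o - t)       # from C = (o - t)^2
    return dz_dw * do_dz * dC_do


# Hypothetical values, chosen only for illustration.
w, b, o_prev, t = -0.3, 0.2, 0.6, 1.0

analytic = grad_w(w, b, o_prev, t)

# Central finite difference as an independent estimate of dC/dw.
eps = 1e-6
numeric = (loss(w + eps, b, o_prev, t) - loss(w - eps, b, o_prev, t)) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two numbers agree to several decimal places, the derivation and the code are very likely consistent with each other.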

Similar equations can be derived for the biases. Instead of $\frac{\delta z^L}{\delta w^L}$, we would have:

>> $z^L = w^Lo^{L-1} + b^L \rightarrow \frac{\delta z^L}{\delta b^L} = 1$

For completeness, if we do the math, we get:

>> $\frac{\delta C}{\delta b^L} = f'(z^L) \ast 2(o^L - t)$
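The bias gradient can be checked in the same way. The following sketch again assumes a sigmoid activation and uses hypothetical values; the only difference from the weight case is that the factor $\frac{\delta z^L}{\delta b^L}$ equals 1.

```python
import math


def sigmoid(z):
    """Logistic activation; an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, o_prev, t):
    """C = (f(w * o_prev + b) - t)^2 for a single neuron."""
    return (sigmoid(w * o_prev + b) - t) ** 2


def grad_b(w, b, o_prev, t):
    """Chain rule: dC/db = dz/db * do/dz * dC/do, with dz/db = 1."""
    z = w * o_prev + b
    o = sigmoid(z)
    do_dz = o * (1.0 - o)       # f'(z) for the sigmoid
    dC_do = 2.0 * (o - t)
    return 1.0 * do_dz * dC_do


# Hypothetical values, chosen only for illustration.
w, b, o_prev, t = -0.3, 0.2, 0.6, 1.0

analytic_b = grad_b(w, b, o_prev, t)

# Central finite difference as an independent estimate of dC/db.
eps = 1e-6
numeric_b = (loss(w, b + eps, o_prev, t) - loss(w, b - eps, o_prev, t)) / (2 * eps)

print("analytic:", analytic_b)
print("numeric: ", numeric_b)
```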


Now we can adjust the weights and biases for a single training example using the update rule:

>> $w^L_t = w^L_{t-1} - \lambda \frac{\delta C}{\delta w^L}$

>> $b^L_t = b^L_{t-1} - \lambda \frac{\delta C}{\delta b^L}$

Next, we feed in the next example, readjust, and repeat. This is the infamous **BACKPROPAGATION**.
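Putting the pieces together, a complete training step for the two-layer, one-neuron-per-layer network might look like the sketch below. It assumes a sigmoid activation, reuses a single hypothetical training example at every step instead of feeding new ones, and picks the learning rate and initial parameters arbitrarily. The name `train_step` is made up for this illustration.

```python
import math


def sigmoid(z):
    """Logistic activation; an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def train_step(x, t, params, lr):
    """One forward pass, one backward pass, and one parameter update."""
    (w1, b1), (w2, b2) = params

    # Forward pass: keep the intermediate values for the backward pass.
    z1 = w1 * x + b1
    o1 = sigmoid(z1)
    z2 = w2 * o1 + b2
    o2 = sigmoid(z2)
    loss = (o2 - t) ** 2

    # Backward pass, layer 2: dC/dz2 = f'(z2) * 2 (o2 - t).
    dC_dz2 = o2 * (1.0 - o2) * 2.0 * (o2 - t)
    dC_dw2 = dC_dz2 * o1        # dz2/dw2 = o1
    dC_db2 = dC_dz2             # dz2/db2 = 1

    # Propagate the error back to layer 1 through dz2/do1 = w2.
    dC_do1 = dC_dz2 * w2
    dC_dz1 = dC_do1 * o1 * (1.0 - o1)
    dC_dw1 = dC_dz1 * x         # dz1/dw1 = x
    dC_db1 = dC_dz1             # dz1/db1 = 1

    # Update rule: parameter <- parameter - lr * gradient.
    new_params = [
        (w1 - lr * dC_dw1, b1 - lr * dC_db1),
        (w2 - lr * dC_dw2, b2 - lr * dC_db2),
    ]
    return new_params, loss


# Hypothetical values, chosen only for illustration.
params = [(0.5, 0.1), (-0.3, 0.2)]
x, t, lr = 1.0, 1.0, 0.5

losses = []
for step in range(200):
    params, loss = train_step(x, t, params, lr)
    losses.append(loss)

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

Note that the gradients are all computed from the old parameters before any of them is updated, so the update for layer 1 uses the same $w^2$ that produced the forward pass.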

You might argue that this is oversimplified because we only have one neuron per layer. To be honest, not much changes if we add more neurons per layer. We essentially arrive at the same equations.


![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig11.PNG)

Two final things to note here:

- The derivative with respect to the activation is a summation because the activation of a neuron now depends on the activations of all the neurons in the previous layer.

- The same derivative also depends on the derivatives of the next layer’s activation (backpropagation of the error).
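To make the summation concrete, here is a small sketch with two neurons feeding a fully connected layer of two neurons. It is an illustration under several assumptions not stated in the text: a sigmoid activation, a cost that sums the squared errors of all output neurons, the indexing convention `W[k][j]` (weight from input `j` to neuron `k`), and hypothetical numeric values. The derivative of the cost with respect to each incoming activation is accumulated over every neuron that activation feeds into, and then compared with a finite-difference estimate.

```python
import math


def sigmoid(z):
    """Logistic activation; an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W, b, o_prev):
    """Return (z, o) for a fully connected layer; W[k][j] links input j to neuron k."""
    z = [sum(W[k][j] * o_prev[j] for j in range(len(o_prev))) + b[k]
         for k in range(len(W))]
    o = [sigmoid(zk) for zk in z]
    return z, o


def cost(W, b, o_prev, t):
    """Assumed cost: sum of squared errors over the output neurons."""
    _, o = layer_forward(W, b, o_prev)
    return sum((ok - tk) ** 2 for ok, tk in zip(o, t))


def grad_wrt_prev_activations(W, b, o_prev, t):
    """dC/do_prev[j] = sum over k of W[k][j] * f'(z_k) * 2 (o_k - t_k)."""
    _, o = layer_forward(W, b, o_prev)
    # Error signal of each neuron k: dC/dz_k.
    delta = [ok * (1.0 - ok) * 2.0 * (ok - tk) for ok, tk in zip(o, t)]
    # Each input j contributes to every neuron k, hence the sum over k.
    return [sum(W[k][j] * delta[k] for k in range(len(W)))
            for j in range(len(o_prev))]


# Hypothetical values, chosen only for illustration.
W = [[0.2, -0.4], [0.7, 0.1]]
b = [0.0, -0.1]
o_prev = [0.6, 0.9]
t = [1.0, 0.0]

analytic = grad_wrt_prev_activations(W, b, o_prev, t)

# Central finite differences, one input activation at a time.
eps = 1e-6
numeric = []
for j in range(len(o_prev)):
    plus = list(o_prev)
    minus = list(o_prev)
    plus[j] += eps
    minus[j] -= eps
    numeric.append((cost(W, b, plus, t) - cost(W, b, minus, t)) / (2 * eps))

print("analytic:", analytic)
print("numeric: ", numeric)
```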

You now have a sense of how NNs learn, and that is no easy task.

> **Important note:** We will not be computing gradients by hand for every network that we define. Modern frameworks such as PyTorch compute the gradients automatically.

No more partial derivatives!