# Backpropagation

When training a model, we aim to minimise a predefined cost function (also called an error or loss function). This function measures how well the model is performing, usually by comparing the output of the network with the desired result. The lower the value of the cost function, the closer the model's predictions are to the desired outputs.

In training, we want to adjust the weights of the neuron connections within a network so that the cost function is minimised. The algorithm known as backpropagation is used to calculate the gradient of a network's cost function with respect to those weights. With this algorithm, we determine how sensitive the cost function is to small changes in each weight.

To understand this further, consider a very simple example: the connection between the last two layers of a simple network, where there is only one node in each layer and thus one weighted connection between the two.

The activation of the final layer can be given as $a^L$, with the activation of the previous layer given as $a^{L-1}$.

The desired output for this specific training sample can be described as $y$.

A simple cost function for this network for an individual training item can be calculated as:

\begin{equation}
 C_0 = (a^L-y)^2
\end{equation}

The weighted input to the final layer, $z^L$, is computed by multiplying the output of the previous layer, $a^{L-1}$, by a specific weight value $w^L$ and adding the associated bias $b^L$. This value is then passed into the non-linear activation function of the output layer, such as ReLU or softmax (ReLU is used here), which transforms the weighted input into the final output, $a^L$.

\begin{equation}
 z^L = a^{L-1}w^L + b^L
\end{equation}

\begin{equation}
 a^L = ReLU(z^L)
\end{equation}
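The forward computation above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `relu`, `forward` and `cost` and the numeric values are assumptions chosen for the example, not part of the original text.

```python
def relu(z):
    """Rectified linear unit: keeps positive inputs, clamps the rest to 0."""
    return max(0.0, z)


def forward(a_prev, w, b):
    """Single-neuron forward pass: z = a_prev * w + b, then a = ReLU(z)."""
    z = a_prev * w + b
    a = relu(z)
    return z, a


def cost(a, y):
    """Squared-error cost C_0 = (a - y)^2 for one training sample."""
    return (a - y) ** 2


# Illustrative values only (assumed, not taken from the text).
a_prev, w, b, y = 0.5, 0.8, 0.1, 1.0
z, a = forward(a_prev, w, b)
print("z =", z, "a =", a, "C_0 =", cost(a, y))
```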

Therefore, the weight, the previous activation and the bias each influence the activation of the last layer and thus its final output. The final output is then compared with the desired output $y$ to determine the value of the loss function.

To determine how sensitive the cost function is to changes in the weight, we can calculate the derivative of $C_0$ with respect to $w^L$.

\begin{equation}
\frac{\partial C_0}{\partial w^L}
\end{equation}

We want to determine how a change in $w^L$ affects the calculation of $z^L$, which in turn affects the calculation of $a^L$ and thus the overall loss function output $C_0$.

Therefore, we compute the derivative of $z^L$ with respect to $w^L$, the derivative of $a^L$ with respect to $z^L$, and the derivative of $C_0$ with respect to $a^L$. These three derivatives are multiplied together according to the chain rule. This is the calculation for just one training example; the derivative of the overall cost function of the network with respect to $w^L$ is the average across all $n$ training samples:

\begin{equation}
 \frac{\partial C}{\partial w^L} = \frac{1}{n} \sum_{k=0}^{n-1}\frac{\partial C_k}{\partial w^L}
\end{equation}
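For a single training sample, the chain-rule product described above can be written out explicitly. The expanded right-hand side assumes the squared-error cost and the ReLU activation used in this example, with $ReLU'(z^L)$ equal to 1 when $z^L > 0$ and 0 when $z^L < 0$:

\begin{equation}
 \frac{\partial C_0}{\partial w^L} = \frac{\partial z^L}{\partial w^L} \frac{\partial a^L}{\partial z^L} \frac{\partial C_0}{\partial a^L} = a^{L-1} \, ReLU'(z^L) \, 2(a^L-y)
\end{equation}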


To determine how sensitive the cost function is to changes in the bias, we perform the same calculations with $b^L$ in place of $w^L$ within the equations.
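The weight and bias gradients for the single-neuron example can be checked numerically. The sketch below assumes the squared-error cost and ReLU activation from the text; the helper names (`sample_grads`, `mean_grads`, `cost`) and all numeric values are hypothetical, and the derivative of ReLU at exactly zero is set to 0 by convention.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    # Derivative of ReLU; the value at exactly z == 0 is a convention (0 here).
    return 1.0 if z > 0 else 0.0


def cost(a_prev, w, b, y):
    """C_0 = (ReLU(a_prev * w + b) - y)^2 for one sample."""
    return (relu(a_prev * w + b) - y) ** 2


def sample_grads(a_prev, w, b, y):
    """Chain-rule gradients (dC_0/dw, dC_0/db) for one training sample."""
    z = a_prev * w + b
    a = relu(z)
    dC_da = 2.0 * (a - y)   # derivative of (a - y)^2 with respect to a
    da_dz = relu_grad(z)    # derivative of ReLU(z) with respect to z
    dz_dw = a_prev          # derivative of a_prev * w + b with respect to w
    dz_db = 1.0             # derivative of a_prev * w + b with respect to b
    return dz_dw * da_dz * dC_da, dz_db * da_dz * dC_da


def mean_grads(samples, w, b):
    """Average the per-sample gradients over a list of (a_prev, y) pairs."""
    grads = [sample_grads(a_prev, w, b, y) for a_prev, y in samples]
    n = len(grads)
    return sum(g[0] for g in grads) / n, sum(g[1] for g in grads) / n


w, b = 0.8, 0.1
dw, db = sample_grads(0.5, w, b, 1.0)

# Central finite difference as an independent check of dC_0/dw.
eps = 1e-6
dw_numeric = (cost(0.5, w + eps, b, 1.0) - cost(0.5, w - eps, b, 1.0)) / (2 * eps)

# Two made-up training samples, each given as (a_prev, y).
samples = [(0.5, 1.0), (1.0, 0.5)]
mean_dw, mean_db = mean_grads(samples, w, b)
print(dw, db, dw_numeric, mean_dw, mean_db)
```

The analytic and finite-difference values for the weight gradient should agree closely, which is a common sanity check when implementing gradients by hand.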

![image-2.png](attachment:image-2.png)


The same chain rule computation can be applied backwards through the network, allowing us to determine how the cost function output changes with respect to every weight and bias associated with the connections between earlier nodes in the network.
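As a rough sketch of this backward pass, the code below chains several single-neuron layers together and walks back from the cost to the first layer. It assumes ReLU in every layer and the squared-error cost; the function name `backprop_chain` and the numeric values are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    # Derivative of ReLU; 0 is used at z == 0 by convention.
    return 1.0 if z > 0 else 0.0


def backprop_chain(a0, weights, biases, y):
    """Gradients of C_0 = (a_last - y)^2 for a chain of one-neuron layers."""
    # Forward pass, storing every z and activation for reuse going backwards.
    activations = [a0]
    zs = []
    for w, b in zip(weights, biases):
        z = activations[-1] * w + b
        zs.append(z)
        activations.append(relu(z))

    # Backward pass, starting from dC_0/da at the output.
    dC_da = 2.0 * (activations[-1] - y)
    dws = [0.0] * len(weights)
    dbs = [0.0] * len(weights)
    for l in reversed(range(len(weights))):
        delta = dC_da * relu_grad(zs[l])   # dC_0/dz for this layer
        dws[l] = activations[l] * delta    # dz/dw is the incoming activation
        dbs[l] = delta                     # dz/db is 1
        dC_da = weights[l] * delta         # pass sensitivity to the previous activation
    return dws, dbs


# Two layers with made-up parameters.
dws, dbs = backprop_chain(a0=1.0, weights=[0.5, 2.0], biases=[0.5, -1.0], y=0.0)
print(dws, dbs)
```

Each step of the loop reuses the sensitivity computed for the layer after it, which is why the gradients are computed from the output backwards.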

When multiple neurons are involved in each layer, $a^L$ gains a subscript, $a^L_i$, with $i$ corresponding to the $i$-th neuron/node of the layer. The same applies to $a^{L-1}_j$, and so on. Each weighted connection between two nodes can be represented as $w^L_{ij}$, which is the weight assigned to the connection between neuron $j$ of layer $L-1$ and neuron $i$ of layer $L$.

The cost for a single training sample is then the sum, over the $n_L$ neurons of the last layer, of the squared differences between each neuron's activation and the corresponding desired output:

\begin{equation}
 C_0 = \sum_{i=0}^{n_L-1}(a^L_i-y_i)^2
\end{equation}

![image-3.png](attachment:image-3.png)

When calculating the derivative of the cost function with respect to the activation of a node in the previous layer, the individual derivative computations for each neuron of the next layer must be summed together, because one previous neuron now influences the output of more than one neuron in the next layer.

![image-4.png](attachment:image-4.png)
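The summation over next-layer neurons can be sketched with NumPy for one fully connected layer. The sketch assumes ReLU activations and the summed squared-error cost from the text, and stores $w^L_{ij}$ in row $i$, column $j$ of a matrix `W`; the function name `layer_backward` and the numeric values are hypothetical.

```python
import numpy as np


def layer_backward(a_prev, W, b, y):
    """Gradients of C_0 = sum_i (a_i - y_i)^2 for one dense ReLU layer.

    W has shape (n_L, n_{L-1}), so W[i, j] connects neuron j of the
    previous layer to neuron i of this layer.
    """
    z = W @ a_prev + b
    a = np.maximum(z, 0.0)
    delta = 2.0 * (a - y) * (z > 0)   # dC_0/dz_i for every neuron i
    dW = np.outer(delta, a_prev)      # dC_0/dw_ij = delta_i * a_prev_j
    db = delta                        # dC_0/db_i = delta_i
    da_prev = W.T @ delta             # sum over i of w_ij * delta_i
    return dW, db, da_prev


# Made-up layer with 2 inputs and 3 output neurons.
a_prev = np.array([1.0, 2.0])
W = np.array([[0.5, 0.25],
              [1.0, -0.5],
              [0.0, 1.0]])
b = np.array([0.0, 0.5, -1.0])
y = np.array([0.0, 1.0, 0.0])

dW, db, da_prev = layer_backward(a_prev, W, b, y)
print(dW, db, da_prev)
```

The product `W.T @ delta` performs the sum over all next-layer neurons $i$ for each previous neuron $j$ in one step.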

TensorFlow is able to perform, and keep track of, all these derivative calculations for us using automatic differentiation.
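As a minimal sketch of what this looks like in practice, the single-neuron example can be differentiated with `tf.GradientTape`. The numeric values are assumptions chosen for illustration, and the example assumes TensorFlow 2.x with eager execution enabled.

```python
import tensorflow as tf

# Illustrative values only (assumed, not taken from the text).
a_prev = tf.constant(0.5)
y = tf.constant(1.0)
w = tf.Variable(0.8)   # trainable weight w^L
b = tf.Variable(0.1)   # trainable bias b^L

# Operations executed inside the tape context are recorded so that
# gradients with respect to the variables can be computed afterwards.
with tf.GradientTape() as tape:
    z = a_prev * w + b
    a = tf.nn.relu(z)
    cost = tf.square(a - y)

dC_dw, dC_db = tape.gradient(cost, [w, b])
print(dC_dw.numpy(), dC_db.numpy())
```

The values returned by the tape should match the hand-derived chain rule result, $a^{L-1} \cdot 2(a^L - y)$ for the weight and $2(a^L - y)$ for the bias, since $z^L > 0$ here.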

https://www.3blue1brown.com/lessons/backpropagation-calculus

https://www.youtube.com/watch?v=tIeHLnjs5U8 