# Backpropagation in PyTorch

To explain backpropagation, let us look at the following composition of two functions:

$w = g(f(x))$

Here $y = f(x)$, and the composition requires the image of $f$ to be contained in the domain of $g$.<br>
When training a Neural Network, we want to minimise a (loss) function of this kind.<br>
We have done this previously by computing gradients, namely by computing the derivative of the loss with respect to the tensor $x$.<br>
In this instance, we want to compute:
$$\frac{dw}{dx} = \frac{dw}{dy} \frac{dy}{dx}$$

We have two tensors, $x$ and $y = f(x)$, and $w$ is obtained by applying a differentiable function $g$ to $y$.<br>
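
As a minimal sketch of this chain rule in PyTorch, the snippet below picks two illustrative functions (the choices $f(x) = x^2$ and $g(y) = 3y + 1$ and the input value 2.0 are assumptions, not part of the original example) and checks that the product of the two derivatives matches the derivative that autograd computes end to end.

```python
import torch

# Hypothetical f and g, chosen only for illustration
def f(x):
    return x ** 2

def g(y):
    return 3 * y + 1

x = torch.tensor(2.0, requires_grad=True)
y = f(x)   # intermediate tensor
w = g(y)   # final output

# The two factors of the chain rule, each computed on its own
(dw_dy,) = torch.autograd.grad(w, y, retain_graph=True)
(dy_dx,) = torch.autograd.grad(y, x, retain_graph=True)

# The end-to-end derivative computed by autograd
(dw_dx,) = torch.autograd.grad(w, x)

print(dw_dy.item(), dy_dx.item(), dw_dx.item())  # 3.0 4.0 12.0
```

`retain_graph=True` is passed on the first two calls so that the graph is still available for the later calls.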

## Local gradients

Every time we compute a gradient we are implicitly performing an operation on the computational graph.

In our example, the derivative is the product of two terms, which means we can compute the partial derivative of $w$ with respect to $y$ and the partial derivative of $y$ with respect to $x$ independently.<br>
These two quantities are called local gradients, and they play a key role in backpropagation.<br>
That is, local gradients are the partial derivatives of each operation in the computational graph with respect to its own inputs.

The following is a computation graph of an Ordinary Least Squares model:
$$^{x} _{w} \Rightarrow \hat{y} = w * x \Rightarrow l = ( \hat{y} - y ) \Rightarrow l^{2} \Rightarrow Loss$$ 

We have two inputs: the tensor $x$, which typically denotes the set of features, and the tensor $w$, which denotes the weights of the model. The predicted value $\hat{y}$ is obtained as a linear combination of those weights and $x$.<br>
We then define a loss function to evaluate the model performance.<br>

We split the calculation of the ordinary least squares loss into two steps:
1. computing the difference $l$ between the prediction ($\hat{y}$) and the actual value ($y$),
2. squaring that difference ($l^{2}$) to get the final $Loss$.
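
The graph can be built one node at a time in PyTorch. The sketch below assumes scalar values ($x = 1$, $y = 2$, $w = 1$) purely for illustration, and prints which tensors are tracked by autograd.

```python
import torch

x = torch.tensor(1.0)                      # feature, given
y = torch.tensor(2.0)                      # target, given
w = torch.tensor(1.0, requires_grad=True)  # weight to estimate

y_hat = w * x   # prediction node
l = y_hat - y   # step 1: difference between prediction and target
loss = l ** 2   # step 2: square of the difference

# Columns: name, requires_grad, is_leaf, has a grad_fn
for name, t in [("x", x), ("w", w), ("y_hat", y_hat), ("l", l), ("loss", loss)]:
    print(name, t.requires_grad, t.is_leaf, t.grad_fn is not None)
```

Only the tensors that depend on $w$ carry a `grad_fn`; the inputs `x` and `y` are leaves that autograd does not track.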

## Minimising Loss

Assuming we have performed the necessary forward step, computing the loss from the input tensors, we then need to calculate the local gradients at each node of the computational graph:

$$\frac{dLoss}{dl} , \frac{dl}{d\hat{y}} , \frac{d\hat{y}}{dw}$$

We use these local gradients to obtain the partial derivative of the loss with respect to the weights $w$ via backpropagation or, put more simply, by an application of the chain rule.
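
Written out for the graph above, and assuming scalar $x$ and $w$, the chain rule can be worked through using only the definitions $\hat{y} = w x$, $l = \hat{y} - y$ and $Loss = l^{2}$:

$$\frac{dLoss}{dl} = 2l , \quad \frac{dl}{d\hat{y}} = 1 , \quad \frac{d\hat{y}}{dw} = x$$

$$\frac{dLoss}{dw} = \frac{dLoss}{dl} \frac{dl}{d\hat{y}} \frac{d\hat{y}}{dw} = 2 (w x - y) x$$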

We do not need to compute the partial derivative of $l$ with respect to $y$ or the partial derivative of $\hat{y}$ with respect to $x$ because they are fixed tensors in a typical OLS application.<br>
In other words, both $x$ and $y$ are given, so when minimising a loss function, what changes is the set of parameters we wish to estimate (in our case the weights $w$), not the features.<br>
That is why we do not compute the local gradient of $\hat{y}$ with respect to $x$.
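
A small sketch of this behaviour, assuming illustrative scalar values: after a backward pass, only the tensor created with `requires_grad=True` receives a gradient, while the fixed tensors have none.

```python
import torch

x = torch.tensor(1.0)                      # fixed feature
y = torch.tensor(2.0)                      # fixed target
w = torch.tensor(1.0, requires_grad=True)  # parameter to estimate

loss = (w * x - y) ** 2
loss.backward()

print(w.grad)  # a gradient was computed for the weight
print(x.grad)  # None: x is not tracked
print(y.grad)  # None: y is not tracked
```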


In [1]:
import torch

In [2]:
x = torch.tensor(1.0)
y = torch.tensor(2.0)

In [3]:
w = torch.tensor(1.0, requires_grad=True)

In [4]:
y_hat = w * x
loss = (y_hat - y) ** 2

In [5]:
print(f"Forward step: {loss}")

Forward step: 1.0


From an algebraic point of view, $x$ is 1 and $w$ is 1, therefore their product $\hat{y}$ is 1.<br>
The difference $\hat{y} - y$ is $1 - 2 = -1$, and the square of -1 is 1, which is the $loss$.<br>
So everything is working as expected.

A minimisation process requires a forward step and then a backward step.<br>
To perform backpropagation, we call the `backward` function on the loss.

In [6]:
loss.backward()
print(f'Backward Step: {w.grad}')

Backward Step: -2.0
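
The value -2.0 can be reproduced by hand. The sketch below, in plain Python and without autograd, multiplies the three local gradients for the same inputs ($x = 1$, $y = 2$, $w = 1$); the variable names are illustrative.

```python
x, y, w = 1.0, 2.0, 1.0

# Forward step
y_hat = w * x
l = y_hat - y
loss = l ** 2

# Local gradients at each node
dloss_dl = 2 * l    # derivative of l**2 with respect to l
dl_dy_hat = 1.0     # derivative of (y_hat - y) with respect to y_hat
dy_hat_dw = x       # derivative of (w * x) with respect to w

# Chain rule: multiply the local gradients
dloss_dw = dloss_dl * dl_dy_hat * dy_hat_dw
print(dloss_dw)  # -2.0
```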


After updating the weights based on a learning rate, we continue the minimisation phase with a new forward and backward step until a minimum is reached.

We continue updating the weights, but without the update itself being part of the computation graph.<br>
In addition, the gradient has to be set to zero after updating the weights (at each epoch/step), because PyTorch accumulates gradients across backward calls.

We wrap the update operation inside the `torch.no_grad` context manager and specify that the weights are updated by subtracting the product of a learning rate and the gradient:

In [7]:
with torch.no_grad():
    w -= 0.01 * w.grad
    print(w) # 1.02
print(w.grad) # -2

tensor(1.0200, requires_grad=True)
tensor(-2.)


With a learning rate of 0.01 and a gradient of -2.0, the updated weight is $1 - 0.01 \cdot (-2) = 1.02$, as expected.<br>
The cell above does not reset the gradient, so `w.grad` still holds -2.0, the value we computed in the first backward step. It has to be set to zero (for example with `w.grad.zero_()`) before the next backward pass.
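
To illustrate why the gradient needs resetting, the sketch below (using the same scalar values as an assumption) calls `backward` twice without zeroing in between, and then resets the gradient in place.

```python
import torch

x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True)

# First backward pass: w.grad is populated
((w * x - y) ** 2).backward()
first = w.grad.item()

# Second backward pass without a reset: the new gradient is added to the old one
((w * x - y) ** 2).backward()
accumulated = w.grad.item()

# Reset the gradient in place before the next step
w.grad.zero_()
after_reset = w.grad.item()

print(first, accumulated, after_reset)
```

The second value is twice the first because the weight has not changed between the two passes, so the same gradient is added again.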

In a Neural Network training phase, we will have to perform several forward and backward passes.<br>
But the logic will remain the same.<br>
We update the weights based on the learning rate and the computed gradient and, because the update runs inside `torch.no_grad`, it is not recorded in the computation graph.
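
Putting the pieces together, a minimal training loop for this one-weight example might look like the sketch below. The number of steps (`n_steps = 1000`) is an assumption chosen for illustration; the learning rate matches the one used above.

```python
import torch

x = torch.tensor(1.0)
y = torch.tensor(2.0)
w = torch.tensor(1.0, requires_grad=True)

learning_rate = 0.01
n_steps = 1000  # hypothetical number of steps

for step in range(n_steps):
    # Forward step: prediction and loss
    y_hat = w * x
    loss = (y_hat - y) ** 2

    # Backward step: populates w.grad
    loss.backward()

    # Update the weight outside the computation graph
    with torch.no_grad():
        w -= learning_rate * w.grad

    # Reset the gradient so it does not accumulate into the next step
    w.grad.zero_()

print(f"w after training: {w.item():.4f}, final loss: {loss.item():.6f}")
```

With $x = 1$ and $y = 2$, the loss is minimised at $w = 2$, so the weight should approach that value as the loop runs.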