# Autograd Module
`torch.autograd` is PyTorch’s automatic differentiation engine that powers neural network (NN) training.

NNs are a collection of nested functions that are executed on some input data.<br>
These functions are defined by *parameters* (consisting of weights and biases), which are stored in tensors.

Training a NN happens in two steps:

**Forward Propagation**: In forward prop, the NN makes its best guess about the correct output.<br>
&emsp;It runs data through each of its functions to make this guess.

**Backward Propagation**: In backprop, the NN adjusts its parameters in proportion to the error in its guess.<br>
&emsp;It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (the gradients), and optimising the parameters using gradient descent.
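The two steps above can be sketched in plain Python before bringing in autograd. This is a minimal illustration, not how PyTorch does it: the one-parameter model `w * x`, the data point, and the learning rate `lr` are all assumptions chosen for the example, and the gradient is derived by hand.

```python
# Illustrative data (assumed): with x = 2 and target = 6, the ideal weight is 3.
x, target = 2.0, 6.0
w = 0.0     # initial guess for the single parameter
lr = 0.1    # learning rate, chosen for illustration

for step in range(50):
    # Forward propagation: run the input through the function to make a guess.
    pred = w * x
    error = (pred - target) ** 2

    # Backward propagation: derivative of the error with respect to w,
    # worked out by hand with the chain rule.
    grad = 2 * (pred - target) * x

    # Gradient descent: move the parameter against the gradient.
    w -= lr * grad

print(round(w, 4))
```

Autograd replaces the hand-derived `grad` line: it records the forward operations and computes the derivative automatically.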

In [15]:
import torch

In [16]:
x = torch.tensor([-1.2365, 1.8245, -1.0701, 0.0869, -0.0376])

In [17]:
loss = x**2 - 5

In [18]:
loss

tensor([-3.4711, -1.6712, -3.8549, -4.9924, -4.9986])

PyTorch applies the operations of the loss function element-wise to the tensor x.<br>
We can say that this is the forward propagation step.<br>
Given the input, we get an output based on a loss function.

For PyTorch to compute the gradients, we have to create the tensor with the argument `requires_grad=True`:

In [19]:
x = torch.tensor([-1.2365, 1.8245, -1.0701, 0.0869, -0.0376], requires_grad=True)

With `requires_grad=True`, every operation on the tensor is tracked in the computational graph.<br>
The PyTorch backend does this bookkeeping for us.

PyTorch uses a dynamic computation graph, whereas TensorFlow, in its original 1.x design, builds a static computation graph.<br>
The graph in PyTorch is dynamic because it is built as the code runs: any operation involving a tensor is executed immediately and recorded at that moment.
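One consequence of building the graph at run time is that ordinary Python control flow can decide which operations get recorded. The sketch below is illustrative only: the function `forward` and the tensors `x_pos` and `x_neg` are hypothetical names, not part of this notebook.

```python
import torch

def forward(t):
    # Plain Python branching: only the branch that actually runs is recorded.
    if t.sum() > 0:
        return (t ** 2).sum()
    return (t ** 3).sum()

x_pos = torch.tensor([1.0, 2.0], requires_grad=True)
x_neg = torch.tensor([-1.0, -2.0], requires_grad=True)

forward(x_pos).backward()   # graph for t**2, so the gradient is 2 * t
forward(x_neg).backward()   # graph for t**3, so the gradient is 3 * t**2

print(x_pos.grad)
print(x_neg.grad)
```

Each call builds a fresh graph, so the two inputs end up with gradients of different functions.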

In [20]:
loss = x**2 - 5

In [21]:
loss

tensor([-3.4711, -1.6712, -3.8549, -4.9924, -4.9986], grad_fn=<SubBackward0>)

In [22]:
loss.grad_fn

<SubBackward0 at 0x24100f72d00>

To compute the gradients, we need to call the `backward` function on the loss.<br>
Because the loss here is non-scalar, we have to pass the `gradient` argument to the `backward` call. It is a tensor with the same shape as the loss, which in this example is also the shape of tensor *x*.

In [23]:
loss.backward(gradient=torch.ones(5))

With this argument, the `backward` method computes the vector-Jacobian product to obtain the gradients.<br>
The result is the gradient with respect to the tensor *x*, which is accumulated into its `grad` attribute.<br>
Each entry of `x.grad` holds the partial derivative with respect to the corresponding element of the tensor, weighted by the vector passed as `gradient` (here all ones).

In [24]:
x.grad

tensor([-2.4730,  3.6490, -2.1402,  0.1738, -0.0752])
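To make the vector-Jacobian product more concrete, here is a small sketch with a non-uniform `gradient` vector. The tensors `t`, `s`, and `v` and their values are assumptions picked for illustration. For an element-wise function the Jacobian is diagonal, so the product reduces to `v * 2 * t`.

```python
import torch

t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = t ** 2 - 5                     # non-scalar output; Jacobian is diag(2 * t)
v = torch.tensor([1.0, 0.5, 0.0])    # illustrative weights, one per output element

out.backward(gradient=v)             # accumulates v^T J into t.grad
print(t.grad)                        # equals v * 2 * t

# With v = ones, the result matches calling backward() on the summed (scalar) output.
s = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
(s ** 2 - 5).sum().backward()
print(s.grad)                        # equals 2 * s
```

Passing a vector of ones, as in the cell above, is therefore equivalent to differentiating the sum of the loss elements.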

To check that PyTorch computes the gradients correctly, we can work out the algebraic derivative of the loss with respect to x by hand.

PyTorch automatically computes the partial derivative of the loss (written *y* below) with respect to the tensor x.<br>
Since the loss is `x**2 - 5`, this is nothing more than two times x:

$$\frac{\partial y}{\partial x} = 2x$$

In [25]:
2 * x

tensor([-2.4730,  3.6490, -2.1402,  0.1738, -0.0752], grad_fn=<MulBackward0>)
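Rather than comparing the printed values by eye, the check can also be done programmatically. The following is a small self-contained sketch that rebuilds the same tensor and uses `torch.allclose`; the variable names `analytic` and `matches` are introduced here for illustration.

```python
import torch

x = torch.tensor([-1.2365, 1.8245, -1.0701, 0.0869, -0.0376], requires_grad=True)
loss = x**2 - 5
loss.backward(gradient=torch.ones_like(loss))

# Hand-derived gradient; detach() gives the plain values without graph tracking.
analytic = 2 * x.detach()

matches = torch.allclose(x.grad, analytic)
print(matches)  # True
```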

The `autograd` package keeps track of all operations on tensors, along with the resulting new tensors, in a Directed Acyclic Graph (DAG).<br>
In this DAG, leaves are the input tensors and roots are the output tensors.<br>
This structure can be seen in the example above: *x* is a leaf tensor and `loss` is the root.
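The leaf and root roles can be inspected directly. The sketch below walks backwards from the root through `grad_fn.next_functions`. The list `names` is a helper introduced for this example, and the exact node class names may vary between PyTorch versions.

```python
import torch

x = torch.tensor([-1.2365, 1.8245, -1.0701, 0.0869, -0.0376], requires_grad=True)
loss = x**2 - 5

print(x.is_leaf, loss.is_leaf)   # the input is a leaf, the output is not

# Walk from the root towards the leaf, collecting the backward node names.
names = []
node = loss.grad_fn
while node is not None:
    names.append(type(node).__name__)
    # next_functions holds (node, index) pairs; non-tensor inputs such as
    # the constant 5 show up as None and are skipped here.
    parents = [fn for fn, _ in node.next_functions if fn is not None]
    node = parents[0] if parents else None

# Typically a subtraction node, then a power node, then the node that
# accumulates the gradient into x.grad.
print(names)
```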

This is a nice feature, but tracking every operation is not always desirable, for example when the input data or the structure of the computational graph is large or complicated.<br>
In such cases it is good practice to prevent PyTorch from keeping track of the full gradient history inside the DAG.

A typical example is training a neural network.<br>
After a backpropagation step, you update the weights of the net, and this update should not itself be part of the gradient computation.<br>
How can we prevent PyTorch from keeping the history of gradients?<br>
There are multiple ways, but we will look at two methods.

## Using detach
`detach` creates a new tensor with the same values that does not require gradients.<br>
When we compute a loss from the detached tensor, PyTorch will not create a gradient function to be stored inside the DAG:

In [26]:
a = torch.rand(2,2, requires_grad=True)

In [27]:
a.requires_grad

True

In [28]:
b = a.detach()

In [29]:
b.requires_grad

False
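Two further properties of `detach` are worth seeing in code: the detached tensor shares its underlying storage with the original, and results computed from it carry no `grad_fn`. This is a small sketch; the names `tracked` and `untracked` are introduced here for illustration.

```python
import torch

a = torch.rand(2, 2, requires_grad=True)
b = a.detach()

# detach() does not copy the data: both tensors point at the same memory.
print(b.data_ptr() == a.data_ptr())

tracked = (a ** 2).sum()     # computed from a, so it is recorded in the graph
untracked = (b ** 2).sum()   # computed from b, so nothing is recorded

print(tracked.grad_fn is not None)
print(untracked.grad_fn is None)
```

Because the storage is shared, modifying `b` in place also changes the values seen through `a`. Use `a.detach().clone()` when an independent copy is needed.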

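## Using the `torch.no_grad()` wrapper
Operations executed inside a `with torch.no_grad():` block are not tracked, even when their inputs require gradients: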

In [30]:
c = torch.rand(2,2, requires_grad=True)

In [31]:
with torch.no_grad():
    loss = c ** 2
    print(loss.requires_grad)

False
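This wrapper is what makes a manual weight update possible without recording the update itself in the graph. The following is a minimal sketch of one gradient descent step. The parameter `w`, the loss, and the learning rate `lr` are assumptions for illustration. In practice an optimizer from `torch.optim` usually performs this step.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)   # illustrative parameters
lr = 0.1                                            # illustrative learning rate

loss = (w ** 2).sum()   # scalar loss, so backward() needs no gradient argument
loss.backward()         # w.grad is now 2 * w

with torch.no_grad():
    w -= lr * w.grad    # in-place update, not recorded in the graph

w.grad.zero_()          # reset, because gradients accumulate across backward calls

print(w)                # values are now approximately [0.8, -1.6]
print(w.requires_grad, w.is_leaf)
```

After the update `w` is still a leaf tensor that requires gradients, so it can be used in the next forward pass.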
