# Automatic Differentiation with `torch.autograd`
When training neural networks, the most frequently used algorithm is **back propagation**. In this algorithm, model parameters (weights and biases) are adjusted according to the gradient of the loss function with respect to each parameter.

To compute those gradients, PyTorch has a built-in differentiation engine called `torch.autograd`. It supports automatic computation of gradients for any computational graph.

Consider the simplest one-layer neural network, with input `x`, parameters `w` and `b`, and some loss function. It can be defined in PyTorch in the following manner:

In [1]:
import torch
import torch.nn.functional as F

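x = torch.ones(5)   # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)  # weights, tracked by autograd
b = torch.randn(3, requires_grad=True)     # bias, tracked by autograd
z = torch.matmul(x, w) + b                 # logits
loss = F.binary_cross_entropy_with_logits(z, y)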

## Tensors, Functions and Computational Graph
This code defines the computational graph shown [here](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html). In this network, `w` and `b` are the **parameters** which we need to optimize through back propagation. Thus, we need to be able to compute the gradients of the loss function with respect to those variables. In order to do that, we set the `requires_grad` property of those tensors.

Note: `requires_grad` can be set during tensor creation, or later with `tensor.requires_grad_(True)`.
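
As a small illustration of the second option in the note above, the sketch below creates a tensor without gradient tracking and enables it afterwards. The tensor name and shape are arbitrary choices made for this example.

```python
import torch

# A tensor created without requires_grad is not tracked by autograd.
w_late = torch.randn(5, 3)
print(w_late.requires_grad)  # False

# The trailing underscore marks an in-place operation: tracking is
# switched on for this same tensor, no copy is made.
w_late.requires_grad_(True)
print(w_late.requires_grad)  # True
```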

A function that we apply to tensors to construct a computational graph is in fact an object of class `Function`. This object knows how to compute the function in the *forward* direction, and also how to compute its derivative during the *backward propagation* step. A reference to the backward propagation function is stored in the `grad_fn` property of a tensor. More information on `Function` is available in the [documentation](https://pytorch.org/docs/stable/autograd.html#function).

In [2]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x0000025EDA0EB8E0>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x0000025EDA0EB400>
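
To make the forward/backward pairing concrete, here is a minimal sketch of a user-defined `torch.autograd.Function`. The class `Square` and the tensor names are hypothetical and exist only for this illustration; built-in operations such as `matmul` ship with their own backward functions, so writing one by hand is rarely needed.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example Function computing y = x * x."""

    @staticmethod
    def forward(ctx, x):
        # Save the input: the derivative 2 * x needs it later.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        # Chain rule: incoming gradient times the local derivative.
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x

x_in = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y_out = Square.apply(x_in)   # custom Functions are invoked through .apply
print(y_out.grad_fn)         # a backward node generated for Square

y_out.sum().backward()
print(x_in.grad)             # tensor([2., 4., 6.])
```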


## Computing Gradients
To optimize the weights of parameters in the neural network, we need to compute the derivatives of our loss function with respect to the parameters, namely, we need ∂loss/∂w and ∂loss/∂b under some fixed values of `x` and `y`. To compute those derivatives, we call `loss.backward()`, and then retrieve the values from `w.grad` and `b.grad`.

In [3]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.2798, 0.0079, 0.2763],
        [0.2798, 0.0079, 0.2763],
        [0.2798, 0.0079, 0.2763],
        [0.2798, 0.0079, 0.2763],
        [0.2798, 0.0079, 0.2763]])
tensor([0.2798, 0.0079, 0.2763])
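
The values returned by autograd can be checked against a hand-derived formula. The sketch below assumes the default `reduction="mean"` of `binary_cross_entropy_with_logits`, for which ∂loss/∂z = (sigmoid(z) − y) / n with n the number of elements; the seed is an arbitrary choice for repeatability. It also shows why every row of `w.grad` is identical here: ∂loss/∂w is the outer product of `x` (all ones) with ∂loss/∂z.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = F.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Manual derivation, done outside the graph.
with torch.no_grad():
    dloss_dz = (torch.sigmoid(z) - y) / y.numel()
    manual_b_grad = dloss_dz                  # z depends on b with slope 1
    manual_w_grad = torch.outer(x, dloss_dz)  # dz_j/dw_ij = x_i

print(torch.allclose(w.grad, manual_w_grad))
print(torch.allclose(b.grad, manual_b_grad))
```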


Note:
- We can only obtain the `grad` properties for the leaf nodes of the computational graph that have the `requires_grad` property set to `True`. For all other nodes in our graph, gradients will not be available.
- We can only perform gradient calculations using `backward` once on a given graph for performance reasons. If we need to do several `backward` calls on the same graph, we need to pass `retain_graph=True` to the `backward` call.
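
Both points in the note can be seen in a small sketch. The tensor names are made up for this example, and `retain_grad()` is an addition not discussed above: it asks autograd to keep the gradient of a non-leaf tensor, which would otherwise not be stored.

```python
import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf tensor
h = a * 3                                         # non-leaf (intermediate) tensor
h.retain_grad()                                   # opt in to keeping h.grad
out = h.sum()

# Keep the graph alive so that backward can be called again.
out.backward(retain_graph=True)
print(a.grad)  # tensor([3., 3.])
print(h.grad)  # tensor([1., 1.])

# Reset the leaf gradient, then run backward a second time on the same graph.
a.grad.zero_()
out.backward()
print(a.grad)  # tensor([3., 3.])
```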

## Disabling Gradient Tracking
By default, all tensors with `requires_grad=True` track their computational history and support gradient computation. However, in some cases we do not need this, for example when we have trained the model and just want to apply it to some input data, i.e. we only want to do *forward* computations through the network. We can stop tracking computations by surrounding our computation code with a `torch.no_grad()` block:

In [5]:
z = torch.matmul(x, w) + b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w) + b
print(z.requires_grad)

True
False


Another way to achieve this is to use the `detach()` method on the tensor:

In [7]:
z = torch.matmul(x, w) + b
z_det = z.detach()
print(z_det.requires_grad)

False
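
One detail worth keeping in mind is that `detach()` does not copy data: the detached tensor is cut off from the graph but shares storage with the original. The sketch below, with arbitrary tensor names, illustrates this.

```python
import torch

w_src = torch.randn(3, requires_grad=True)
z_src = w_src * 2
z_cut = z_src.detach()

print(z_cut.requires_grad)                   # False
print(z_cut.grad_fn)                         # None: no link back into the graph
print(z_cut.data_ptr() == z_src.data_ptr())  # True: same underlying memory
```

Because the memory is shared, in-place changes to the detached tensor are visible through the original as well; call `.clone()` on it if an independent copy is needed.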


There are several reasons you might want to disable gradient tracking:
- To mark some parameters in the neural network as **frozen parameters**. This is very common when [finetuning a pretrained network](https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html).
- To **speed up computation** when only doing a forward pass, e.g. when running through test data, because computations on tensors that do not track gradients are more efficient.
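
The first use case can be sketched as follows. The two-layer `nn.Sequential` model is a hypothetical stand-in for a pretrained network; only the freezing pattern itself is the point.

```python
import torch
from torch import nn

# Hypothetical model standing in for a pretrained network.
model = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 3))

# Freeze the first layer: autograd stops tracking its parameters.
for param in model[0].parameters():
    param.requires_grad_(False)

out = model(torch.ones(2, 5)).sum()
out.backward()

print(model[0].weight.grad)              # None: frozen, no gradient computed
print(model[2].weight.grad is not None)  # True: still trainable
```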

## More on Computational Graphs
Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of [Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function) objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, gradients can be automatically computed using the chain rule.

In a forward pass, autograd does two things simultaneously:
- Run the requested operation to compute the resulting tensor
- Maintain the operation's *gradient function* in the DAG

The backward pass starts when the `.backward()` method is called on the DAG root. `autograd` then:
- Computes the gradients from each `.grad_fn`
- Accumulates them in the respective tensor's `.grad` attribute
- Using the chain rule, propagates all the way to the leaf tensors
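
The recorded graph can be inspected by following `grad_fn.next_functions` from the root. The helper `collect_nodes` below is a hypothetical name written for this sketch, and the exact node class names depend on the PyTorch version, so no particular output is shown.

```python
import torch
import torch.nn.functional as F

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
loss = F.binary_cross_entropy_with_logits(torch.matmul(x, w) + b, y)

def collect_nodes(fn, depth=0, out=None):
    """Depth-first walk from a grad_fn towards the leaves."""
    out = [] if out is None else out
    if fn is None:  # inputs that do not require gradients have no node
        return out
    out.append((depth, type(fn).__name__))
    for next_fn, _ in fn.next_functions:
        collect_nodes(next_fn, depth + 1, out)
    return out

nodes = collect_nodes(loss.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```

The `AccumulateGrad` nodes at the bottom of the printed tree correspond to the leaf tensors `w` and `b`; they are where gradients are added to `.grad`.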

Note:
**DAGs are dynamic in PyTorch**. An important thing to note is that the graph is recreated from scratch: after each `.backward()` call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the size, shape and operations at every iteration if needed.
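
A minimal sketch of this behaviour: the number of operations below changes between iterations through an ordinary Python loop, and each forward pass records its own graph. The variable names and the doubling operation are arbitrary choices for illustration.

```python
import torch

w_dyn = torch.tensor([1.0, 2.0], requires_grad=True)
grads_by_depth = {}

for n_layers in (1, 3):
    h = w_dyn
    # Plain Python control flow decides how many operations are recorded.
    for _ in range(n_layers):
        h = h * 2
    loss_dyn = h.sum()

    w_dyn.grad = None    # discard the gradient from the previous iteration
    loss_dyn.backward()  # differentiates the graph built in this iteration
    grads_by_depth[n_layers] = w_dyn.grad.clone()

print(grads_by_depth[1])  # tensor([2., 2.])
print(grads_by_depth[3])  # tensor([8., 8.])
```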

### Optional Reading: Tensor Gradients and Jacobian Products
In many cases, we have a scalar loss function, and we need to compute the gradient with respect to some parameters. However, there are cases when the output function is an arbitrary tensor. In this case, PyTorch allows the computation of so-called [Jacobian Products](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant), and not the actual gradient.

Instead of computing the full **Jacobian matrix** *J* itself, PyTorch allows you to compute the **Jacobian product** *v^T · J* for a given vector *v*. This is achieved by calling `backward` with *v* as an argument. The size of *v* must match the size of the tensor on which `backward` is called, i.e. the output with respect to which we want to compute the product.
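
To see that `backward(v)` really yields *v^T · J*, the sketch below compares it with an explicitly computed Jacobian from `torch.autograd.functional.jacobian`. The function `f`, its input and the vector `v` are made-up examples chosen so the result is easy to verify by hand.

```python
import torch
from torch.autograd.functional import jacobian

def f(t):
    # Hypothetical function from R^3 to R^2.
    return torch.stack([t[0] * t[1], t[1] + t[2]])

t_in = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 10.0])  # one entry per output element

# Jacobian product via backward: the result lands in t_in.grad.
f(t_in).backward(v)
print(t_in.grad)               # tensor([ 2., 11., 10.])

# Same quantity from the full Jacobian [[2, 1, 0], [0, 1, 1]].
J = jacobian(f, t_in.detach())
print(v @ J)                   # tensor([ 2., 11., 10.])
```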

In [9]:
inp = torch.eye(5, requires_grad=True)
out = (inp+1).pow(2)
out.backward(torch.ones_like(inp), retain_graph=True)
print(f"First call\n{inp.grad}")
out.backward(torch.ones_like(inp), retain_graph=True)
print(f"Second call\n{inp.grad}")
inp.grad.zero_()
out.backward(torch.ones_like(inp), retain_graph=True)
print(f"Call after zeroing gradients\n{inp.grad}")

First call
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])
Second call
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.],
        [4., 4., 4., 4., 8.]])
Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])


Notice that when we call `backward` for the second time with the same argument, the value of the gradient is different. This happens because, when doing *backward propagation*, PyTorch **accumulates the gradients**, i.e. the value of the computed gradients is added to the `grad` property of all leaf nodes of the computational graph. If you want to compute the proper gradients, you need to zero out the `grad` property beforehand. In real-life training an *optimizer* helps us to do this.
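
A minimal sketch of how an optimizer takes care of this in a training loop. The parameter values, the quadratic loss and the learning rate are arbitrary choices for illustration; the relevant part is the call to `optimizer.zero_grad()` before each backward pass.

```python
import torch

w_opt = torch.tensor([1.0, -1.0], requires_grad=True)
optimizer = torch.optim.SGD([w_opt], lr=0.1)

for step in range(3):
    optimizer.zero_grad()        # clear gradients left over from the last step
    loss_opt = (w_opt ** 2).sum()
    loss_opt.backward()          # gradient is 2 * w_opt, not an accumulated sum
    optimizer.step()             # w_opt <- w_opt - 0.1 * 2 * w_opt = 0.8 * w_opt

print(w_opt)  # values close to [0.512, -0.512], i.e. 0.8 ** 3 times the start
```

Without the `zero_grad()` call, each step would use the sum of all gradients computed so far, and the updates would no longer follow plain gradient descent.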

Note:
Previously we called the `backward()` function without parameters. This is essentially equivalent to calling `backward(torch.tensor(1.0))`, which is a useful way to compute the gradients in the case of a scalar-valued function, such as the loss during neural network training.
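
The equivalence can be checked directly. The sketch below uses two identical, arbitrarily chosen leaf tensors so that the two calls do not accumulate into the same `grad`.

```python
import torch

w_a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
(w_a ** 2).sum().backward()                   # implicit gradient argument

w_b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
(w_b ** 2).sum().backward(torch.tensor(1.0))  # explicit scalar 1.0

print(w_a.grad)                         # tensor([2., 4., 6.])
print(torch.equal(w_a.grad, w_b.grad))  # True
```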