# Automatic differentiation with `torch.autograd`

In [1]:
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

When training neural networks, the most frequently used algorithm is **backpropagation**. In this algorithm, parameters (model weights) are adjusted according to the **gradient** of the loss function with respect to the given parameter.

`torch.autograd` is PyTorch's built-in differentiation engine. It supports automatic computation of gradients for **any** computational graph.

## Tensors, Functions and Computational graph

The code above defines the following **computational graph**:

![computation_graph](./imgs/torch_tensor_function_computational_graph.png)

`w` and `b` are the **parameters** that we need to optimize. Thus, we need to be able to compute the gradients of the loss function with respect to those variables. To do that, we set the `requires_grad` property of those tensors.

|*Note|
|-|
|You can set the value of `requires_grad` when creating a tensor, or later with the in-place method `x.requires_grad_(True)`.|
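
As a small illustration of the note, the sketch below switches gradient tracking on in three ways. The tensor names `w`, `b`, and `c` are only placeholders for this example.

```python
import torch

# Option 1: request gradient tracking when the tensor is created.
w = torch.randn(5, 3, requires_grad=True)

# Option 2: create the tensor first, then switch tracking on in place.
# The trailing underscore marks an in-place method.
b = torch.randn(3)
b.requires_grad_(True)

# Option 3: assign to the attribute directly (allowed for leaf tensors).
c = torch.randn(3)
c.requires_grad = True

print(w.requires_grad, b.requires_grad, c.requires_grad)
```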

A function that we apply to tensors to construct a computational graph like the one above is in fact an object of class `Function`. This object **knows** how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in the `grad_fn` property of a tensor. You can find more information about the `Function` class in [this documentation](https://pytorch.org/docs/stable/autograd.html#function).

In [2]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x7f102bd40be0>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7f102bd430a0>
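
To make the forward/backward pairing concrete, here is a minimal sketch of a user-defined `torch.autograd.Function`. The class name `Square` and the operation it implements are hypothetical, chosen only for illustration; built-in operations such as `matmul` come with their own backward functions.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example op computing y = x ** 2."""

    @staticmethod
    def forward(ctx, x):
        # Save the input; the backward step needs it for the derivative.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        # Chain rule: dL/dx = dL/dy * dy/dx, with dy/dx = 2 * x.
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)      # custom Functions are invoked through .apply
print(y.grad_fn)         # the backward node recorded for y

y.sum().backward()
print(x.grad)            # tensor([2., 4., 6.])
```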


## Computing Gradients

To optimize the weights of the parameters in the neural network, we need to compute the derivatives of the loss function with respect to the parameters, namely $\frac{\partial loss}{\partial w}$ and $\frac{\partial loss}{\partial b}$, under some fixed values of `x` and `y`. To compute those derivatives, call `loss.backward()`, and then retrieve the values from `w.grad` and `b.grad`:

In [3]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.2986, 0.0666, 0.0132],
        [0.2986, 0.0666, 0.0132],
        [0.2986, 0.0666, 0.0132],
        [0.2986, 0.0666, 0.0132],
        [0.2986, 0.0666, 0.0132]])
tensor([0.2986, 0.0666, 0.0132])
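
The values above can be cross-checked by hand. For binary cross-entropy with logits averaged over the 3 outputs, $\frac{\partial loss}{\partial z_j} = \frac{\sigma(z_j) - y_j}{3}$, so $\frac{\partial loss}{\partial b}$ equals that vector and $\frac{\partial loss}{\partial w}$ is the outer product of `x` with it. Because `x` is all ones, every row of `w.grad` matches `b.grad`. The sketch below rebuilds the same graph with its own random values (the seed is an arbitrary choice) and compares the two results.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable
x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# Hand-derived gradients; detach z because no graph is needed for the check.
dloss_dz = (torch.sigmoid(z.detach()) - y) / y.numel()
expected_b_grad = dloss_dz
expected_w_grad = torch.outer(x, dloss_dz)

print(torch.allclose(b.grad, expected_b_grad, atol=1e-6))
print(torch.allclose(w.grad, expected_w_grad, atol=1e-6))
```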


|*Note|
|-|
|<ul><li>We can only obtain the grad properties for the leaf nodes of the computational graph, which have requires_grad property set to True. For all other nodes in our graph, gradients will not be available.</li><li>We can only perform gradient calculations using backward once on a given graph, for performance reasons. If we need to do several backward calls on the same graph, we need to pass retain_graph=True to the backward call.</li></ul>|
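
Both points in the note can be tried on a toy graph. The sketch below is an illustration under simple assumptions (a squared-sum loss on a random tensor). It uses `Tensor.retain_grad()` to opt in to keeping the gradient of an intermediate tensor, and `retain_graph=True` to allow a second backward pass.

```python
import torch

w = torch.randn(3, requires_grad=True)  # leaf tensor
h = w ** 2                               # intermediate (non-leaf) tensor
h.retain_grad()                          # opt in to keeping h.grad
loss = h.sum()

# First pass: keep the graph's saved buffers so backward can run again.
loss.backward(retain_graph=True)
print(w.grad)  # equals 2 * w
print(h.grad)  # all ones

# Second pass is allowed because of retain_graph=True above.
# Gradients accumulate, so w.grad is now 4 * w.
loss.backward()

# Third pass fails: the previous call freed the saved buffers.
third_call_failed = False
try:
    loss.backward()
except RuntimeError as err:
    third_call_failed = True
    print(f"RuntimeError: {err}")
```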

## Disabling Gradient Tracking

By default, all tensors with `requires_grad=True` are tracking their computational history and support gradient computation. However, there are some cases when we do not need to do that, e.g., when we have trained the model and just want to apply it to some input data, i.e. we only want to do `forward` computations through the network. We can stop tracking computations by surrounding our computation code with `torch.no_grad()` block:

In [4]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

True
False


Another way to achieve the same result is to use the `detach()` method on the tensor:

In [5]:
z = torch.matmul(x, w)+b
z_det = z.detach()
z_det.requires_grad

False

There are two main reasons to disable gradient tracking:
- To mark some parameters in the neural network as **frozen parameters**.
- To **speed up computation** when you are only doing a forward pass.
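
A minimal sketch of both use cases follows. The two-layer model and its layer sizes are assumptions made for the example, not part of the code above.

```python
import torch
from torch import nn

# Hypothetical model: the layer sizes are arbitrary.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer: its parameters no longer receive gradients.
for p in model[0].parameters():
    p.requires_grad_(False)

out = model(torch.randn(16, 4))
out.sum().backward()

frozen_grads = [p.grad for p in model[0].parameters()]
trainable_grads = [p.grad for p in model[2].parameters()]
print(frozen_grads)                                # [None, None]
print([g is not None for g in trainable_grads])    # [True, True]

# Forward-only use: no graph is recorded inside the no_grad block.
with torch.no_grad():
    pred = model(torch.randn(1, 4))
print(pred.requires_grad)                          # False
```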

## More on Computational graphs

Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of [Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function) objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:
- run the requested operation to compute a resulting tensor
- maintain the operation’s `gradient` function in the DAG.

The backward pass kicks off when `.backward()` is called on the DAG root. autograd then:
- computes the gradients from each `.grad_fn`,
- accumulates them in the respective tensor’s `.grad` attribute,
- using the chain rule, propagates all the way to the leaf tensors.
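
The recorded graph can be inspected by following `grad_fn.next_functions` from the root toward the leaves. The helper `collect_nodes` below is a hypothetical name introduced for this sketch, and the exact node names for intermediate operations may differ between PyTorch versions.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w) + b

def collect_nodes(fn, depth=0, out=None):
    """Depth-first walk from a root grad_fn toward the leaves."""
    if out is None:
        out = []
    if fn is None:  # inputs that do not require grad have no node
        return out
    out.append((depth, type(fn).__name__))
    for next_fn, _ in fn.next_functions:
        collect_nodes(next_fn, depth + 1, out)
    return out

nodes = collect_nodes(z.grad_fn)
for depth, name in nodes:
    # AccumulateGrad nodes mark leaf tensors whose .grad gets filled in.
    print("  " * depth + name)
```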

|*Note|
|-|
|**DAGs are dynamic in PyTorch.** An important thing to note is that the graph is recreated from scratch; after each `.backward()` call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.|
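
As an illustration of dynamic graphs, the sketch below uses a data-dependent `while` loop, so the number of recorded operations depends on the input. The function `double_until_large` and the threshold of 10 are made up for this example.

```python
import torch

def double_until_large(x):
    """Double x until its norm reaches 10; the loop length depends on the data."""
    y = x
    steps = 0
    while y.norm() < 10:
        y = y * 2
        steps += 1
    return y.sum(), steps

results = []
for values in ([1.0, 1.0], [4.0, 4.0]):
    x = torch.tensor(values, requires_grad=True)
    out, steps = double_until_large(x)
    out.backward()  # differentiates through however many steps were taken
    results.append((steps, x.grad.clone()))
    print(steps, x.grad)

# Prints:
# 3 tensor([8., 8.])
# 1 tensor([2., 2.])
```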

In many cases, we have a scalar loss function, and we need to compute the gradient with respect to some parameters. However, there are cases when the output function is an arbitrary tensor. In this case, PyTorch allows you to compute the so-called Jacobian product, and not the actual gradient.

For a vector function **$y = f(x)$**, where **$x = (x_1, ..., x_n)$** and **$y = (y_1, ..., y_m)$**, a gradient of **$y$** with respect to **$x$** is given by the Jacobian matrix:

$J = \left( \frac{\partial y_i}{\partial x_j} \right)_{1 \le i \le m,\ 1 \le j \le n}$

Instead of computing the Jacobian matrix itself, PyTorch allows you to compute the Jacobian product **$v^T \cdot J$** for a given input vector **$v = (v_1, ..., v_m)$**. This is achieved by calling `backward` with **$v$** as an argument. The size of **$v$** should be the same as the size of the original tensor with respect to which we want to compute the product.
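
To see that `backward(v)` really yields $v^T \cdot J$, the sketch below compares it against the full Jacobian from `torch.autograd.functional.jacobian` on a small hand-picked function. The function `f` and the vectors `x` and `v` are assumptions chosen so the result is easy to verify by hand.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Maps R^3 to R^2: y_1 = x_1 * x_2, y_2 = x_2 + x_3^2
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5])

# Vector-Jacobian product through backward: the result lands in x.grad.
y = f(x)
y.backward(v)
vjp_from_backward = x.grad.clone()

# Same product via the explicit 2 x 3 Jacobian matrix.
J = jacobian(f, x.detach())
vjp_from_jacobian = v @ J

print(J)                  # rows: [2, 1, 0] and [0, 1, 6]
print(vjp_from_backward)  # expected values: [2.0, 1.5, 3.0]
print(torch.allclose(vjp_from_backward, vjp_from_jacobian))
```

Building the full Jacobian costs more than a single backward pass, which is why the product form is what `backward` computes.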


In [6]:
inp = torch.eye(4, 5, requires_grad=True)
out = (inp+1).pow(2).t()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"First call\n{inp.grad}")
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nSecond call\n{inp.grad}")
inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")

First call
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])

Second call
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.]])

Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])


Notice that when we call `backward` for the second time with the same argument, the value of the gradient is different. This happens because PyTorch **accumulates the gradients** during backward propagation, i.e. the value of the computed gradients is added to the `grad` property of all leaf nodes of the computational graph. If you want to compute the proper gradients, you need to zero out the `grad` property beforehand. In real-life training, an optimizer helps us do this.
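
A minimal sketch of such a training loop is shown below, reusing the one-layer setup from the start of this page. The choice of `torch.optim.SGD`, the learning rate, the seed, and the number of steps are assumptions made for the example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable
x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = torch.optim.SGD([w, b], lr=0.1)  # lr is an arbitrary choice

losses = []
for step in range(20):
    optimizer.zero_grad()  # reset gradients left over from the previous step
    z = torch.matmul(x, w) + b
    loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
    loss.backward()        # fill w.grad and b.grad for this step only
    optimizer.step()       # update w and b using the fresh gradients
    losses.append(loss.item())

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}")
```

Without the `zero_grad()` call, each step would update the parameters with the sum of all gradients computed so far rather than the current one.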

|*Note|
|-|
|Previously we were calling `backward()` function without parameters. This is essentially equivalent to calling `backward(torch.tensor(1.0))`, which is a useful way to compute the gradients in case of a scalar-valued function, such as loss during neural network training.|
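
The equivalence stated in the note can be checked directly. The sketch below uses a toy scalar loss (an assumption for the example) and compares the gradient from `backward()` with the one from `backward(torch.tensor(1.0))`.

```python
import torch

w = torch.randn(4, requires_grad=True)

# Implicit seed gradient: allowed because the output is a scalar.
loss = (w ** 2).sum()
loss.backward()
grad_implicit = w.grad.clone()

# Explicit seed gradient of 1.0 on a freshly built graph.
w.grad.zero_()
loss = (w ** 2).sum()
loss.backward(torch.tensor(1.0))
grad_explicit = w.grad.clone()

print(torch.equal(grad_implicit, grad_explicit))  # True
```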