# Overview

> How autograd works and records operations.

Autograd is a reverse-mode automatic differentiation system. Conceptually, as you execute operations, autograd records a graph of all the operations that created the data, giving you a directed acyclic graph whose leaves are the input tensors and whose roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

Internally, autograd represents this graph as a graph of `Function` objects (really expressions), which can be `apply()`-ed to compute the result of evaluating the graph. When computing the forward pass, autograd simultaneously performs the requested computations and builds up a graph representing the function that computes the gradient (the `.grad_fn` attribute of each `torch.Tensor` is an entry point into this graph). When the forward pass is completed, PyTorch evaluates this graph in the backward pass to compute the gradients.

The graph is recreated from scratch at every iteration, and this is exactly what allows for using arbitrary Python control flow statements that can change the overall shape and size of the graph at every iteration. You don't have to encode all possible paths before you launch the training: what you run is what you differentiate.
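As a minimal sketch of this point (the helper `double_until_large` and its threshold of 10 are made up for illustration), the number of operations recorded below depends on the runtime value of the input, yet `backward()` differentiates exactly the path that was executed:

```python
import torch

def double_until_large(x, threshold=10.0):
    # Hypothetical helper: the number of loop iterations depends on the data.
    y = x
    while y.norm() < threshold:
        y = y * 2
    return y.sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)
out = double_until_large(x)
out.backward()

# The norm goes 2.24 -> 4.47 -> 8.94 -> 17.89, so the loop runs three times
# and the executed function is sum(8 * x).
print(x.grad)  # tensor([8., 8.])
```

A different input could take a different number of iterations, and the graph recorded for that run would simply be longer or shorter.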

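Some operations need to save intermediate results during the forward pass in order to execute the backward pass. For example, `x.pow(2)` saves its input `x`, while `x.exp()` saves its result. These saved tensors can be inspected through the `_saved_*` attributes of `grad_fn`:

In [1]: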
import torch

x=torch.randn(5, requires_grad=True)
y=x.pow(2)
# y.grad_fn._saved_self refers to the same Tensor object as x.
print(x.equal(y.grad_fn._saved_self))
print(x is y.grad_fn._saved_self)

# But that may not always be the case
x=torch.randn(5, requires_grad=True)
y=x.exp()
print(y.equal(y.grad_fn._saved_result))
print(y is y.grad_fn._saved_result)

True
True
True
False


Under the hood, to prevent reference cycles, PyTorch has *packed* the tensor upon saving and *unpacked* it into a different tensor for reading. Here, the tensor we get from accessing `y.grad_fn._saved_result` is a different tensor object than `y` (but they still share the same storage).

Whether a tensor will be packed into a different tensor object depends on whether it is an output of its own `grad_fn`, which is an implementation detail that is subject to change and that users should not rely on.

# Setting requires_grad

`requires_grad` is a flag, defaulting to `False` unless the tensor is wrapped in an `nn.Parameter`, that allows for fine-grained exclusion of subgraphs from gradient computation. It takes effect in both the forward and backward passes:

During the forward pass, an operation is only recorded in the backward graph if at least one of its input tensors requires grad. During the backward pass (`.backward()`), only leaf tensors with `requires_grad=True` will have gradients accumulated into their `.grad` fields.

It is important to note that even though every tensor has this flag, setting it only makes sense for **leaf tensors** (tensors that do not have a `grad_fn`, e.g., an `nn.Module`'s parameters). **Non-leaf tensors (tensors that do have a `grad_fn`) are tensors that have a backward graph associated with them.** Thus their gradients will be needed as an intermediary result to compute the gradient for a leaf tensor that requires grad. From this definition, it is clear that all non-leaf tensors will automatically have `requires_grad=True`.
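The forward and backward rules above can be checked on a few small tensors. This is only an illustrative sketch; the names `a`, `w`, `untracked` and `out` are arbitrary:

```python
import torch

a = torch.randn(3)                      # leaf, does not require grad
w = torch.randn(3, requires_grad=True)  # leaf, requires grad

untracked = a * 2    # no input requires grad: nothing is recorded
out = (a * w).sum()  # one input requires grad: recorded, non-leaf

print(untracked.requires_grad, untracked.grad_fn)  # False None
print(w.is_leaf, out.is_leaf)                      # True False
print(out.requires_grad)                           # True

out.backward()
print(a.grad)                  # None: a never required grad
print(torch.equal(w.grad, a))  # True, since d(sum(a * w))/dw = a
```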

Setting `requires_grad` should be the main way you control which parts of the model are part of the gradient computation, for example, if you need to freeze parts of your pretrained model during model fine-tuning.

To freeze parts of your model, simply apply `.requires_grad_(False)` to the parameters that you don't want updated. And as described above, since computations that use these parameters as inputs would not be recorded in the forward pass, they won't have their `.grad` fields updated in the backward pass because they won't be part of the backward graph in the first place, as desired.

Because this is such a common pattern, `requires_grad` can also be set at the module level with `nn.Module.requires_grad_()`. When applied to a module, `.requires_grad_()` takes effect on all of the module's parameters (which have `requires_grad=True` by default).
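A minimal sketch of freezing at the module level, assuming a toy two-layer `nn.Sequential` model that stands in for a pretrained network (the layer sizes are arbitrary):

```python
import torch
from torch import nn

# Toy stand-in for a pretrained model.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer: sets requires_grad=False on its weight and bias.
model[0].requires_grad_(False)

loss = model(torch.randn(3, 4)).sum()
loss.backward()

print(model[0].weight.grad)        # None: frozen, never part of the graph
print(model[2].weight.grad.shape)  # torch.Size([2, 8])
```

When fine-tuning, it is also common to pass only the parameters that still require grad to the optimizer, although that is a separate choice from freezing itself.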

# Locally disabling gradient computation

There are several mechanisms available from Python to locally disable gradient computation:

To disable gradients across entire blocks of code, there are context managers like:
* no-grad mode
* inference mode

For more fine-grained exclusion of subgraphs from gradient computation, you can set the `requires_grad` field of a tensor.

Below, in addition to discussing the mechanisms above, we also describe evaluation mode (`nn.Module.eval()`), a method that is not used to disable gradient computation but, because of its name, is often mixed up with the three.

# Grad Modes

Apart from setting `requires_grad`, there are also three grad modes:

* default mode (grad mode)
* no-grad mode
* inference mode

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/854/008/221/626/357/original/1d2fa9e43eb4c8dd.png" width="60%" heigh="60%" alt="Grad Models under different context"></div>


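## Default Mode (Grad Mode)

Default mode is the mode you are implicitly in when neither no-grad mode nor inference mode is enabled. It is the only mode in which `requires_grad` takes effect; in the other two modes, computations are not recorded regardless of the flag.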


## No-grad Mode

Computations in no-grad mode are never recorded in the backward graph, even if there are inputs that have `requires_grad=True`. **Enable no-grad mode when you need to perform operations that should not be recorded by autograd, but you would still like to use the outputs of these computations in grad mode later.**
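A small sketch of that use case, with arbitrary tensors `w` and `x`: a value is computed under `torch.no_grad()` and then reused in grad mode, where autograd treats it as a constant.

```python
import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

with torch.no_grad():
    scale = (w * x).sum()   # not recorded, although w requires grad

print(scale.requires_grad)  # False

loss = (scale * w).sum()    # back in grad mode; scale acts as a constant
loss.backward()

# d(sum(scale * w))/dw = scale for every element of w
print(torch.allclose(w.grad, torch.full_like(w, scale.item())))  # True
```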


## Inference Mode

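Inference mode is the extreme version of no-grad mode. Just like in no-grad mode, computations in inference mode are not recorded in the backward graph, but enabling inference mode will allow PyTorch to speed up your model even more. The trade-off is that tensors created in inference mode cannot be used in computations that are recorded by autograd after exiting inference mode.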
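A short sketch of entering inference mode with the `torch.inference_mode()` context manager and inspecting the result (the tensor names are arbitrary):

```python
import torch

w = torch.randn(3, requires_grad=True)

with torch.inference_mode():
    y = w * 2
    print(torch.is_inference_mode_enabled())  # True

print(torch.is_inference_mode_enabled())  # False
print(y.requires_grad)                    # False: nothing was recorded
print(y.is_inference())                   # True: y is marked as an inference tensor
```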


## Evaluation Mode (nn.Module.eval())

Evaluation mode is not a mechanism to locally disable gradient computation. It is included here anyway because it is sometimes confused with such a mechanism.

Functionally, `module.eval()` (or equivalently `module.train(False)`) is completely orthogonal to no-grad mode and inference mode. It only changes the behavior of modules that act differently during training and evaluation, such as `nn.Dropout` and `nn.BatchNorm2d`.

It is recommended that you always use `model.train()` when training and `model.eval()` when evaluating your model.
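The following sketch, using a toy model made of an `nn.Linear` and an `nn.Dropout` layer chosen only for illustration, shows that `eval()` by itself leaves autograd recording on, while `torch.no_grad()` is what turns it off:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(2, 4)

model.eval()                      # dropout becomes a no-op
out_eval = model(x)
print(model.training)             # False
print(out_eval.requires_grad)     # True: autograd is still recording

with torch.no_grad():             # this is what disables recording
    out_no_grad = model(x)
print(out_no_grad.requires_grad)  # False
```

For evaluation the two are typically combined: `model.eval()` for the layer behavior and `torch.no_grad()` (or inference mode) to skip graph construction.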

# Examples with code

When training neural networks, the most frequently used algorithm is **backpropagation**. In this algorithm, parameters (model weights) are adjusted according to the **gradient** of the loss function with respect to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation engine called `torch.autograd`. It supports automatic computation of gradients for any computational graph.

Consider the simplest one-layer neural network, with input `x`, parameters `w` and `b`, and some loss function. It can be defined in PyTorch in the following manner:

In [2]:
# input tensor
x=torch.ones(5)
# expected output
y=torch.zeros(3)

w=torch.randn(5,3,requires_grad=True)
b=torch.randn(3, requires_grad=True)
z=torch.matmul(x,w)+b
loss=torch.nn.functional.binary_cross_entropy_with_logits(z,y)
loss

tensor(0.6917, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)

## Tensors, Functions and Computational graph

The code defines the following **computational graph**:

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/854/136/166/407/621/original/ce241b35f174e9fe.png" width="60%" heigh="60%" alt="Computational graph of the Code"></div>


In this network, `w` and `b` are parameters, which we need to optimize. Thus, we need to be able to compute the gradients of the loss function with respect to those variables. In order to do that, we set the `requires_grad` property of those tensors.

A function that we apply to tensors to construct a computational graph is in fact an object of class `Function`. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in the `grad_fn` property of a tensor.

In [3]:
print(f'Gradient function for z ={z.grad_fn}')
print(f'Gradient function for loss={loss.grad_fn}')

Gradient function for z =<AddBackward0 object at 0x7e83efdb20e0>
Gradient function for loss=<BinaryCrossEntropyWithLogitsBackward0 object at 0x7e83efdb3550>
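To make the idea of a `Function` that knows both its forward and its backward computation more concrete, here is a sketch of a user-defined one built on `torch.autograd.Function`. The class `Square` is a hypothetical example; built-in operations such as the ones printed above are not literally written this way.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical custom function computing x * x."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # keep the input for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # chain rule: upstream gradient times d(x^2)/dx

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)
y.sum().backward()
print(x.grad)  # tensor([2., 4., 6.])
```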


## More on Computational Graphs

> Note: DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch; after each `.backward()` call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.

Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of `Function` objects. In this DAG, leaves are the input tensors and roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:
* run the requested operation to compute a resulting tensor
* maintain the operation's gradient function in the DAG

The backward pass kicks off when `.backward()` is called on the DAG root. `autograd` then:
* computes the gradients from each `.grad_fn`
* accumulates them in the respective tensor's `grad` attribute
* using the chain rule, propagates all the way to the leaf tensors
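The steps above can be sketched on a two-operation graph. The snippet walks from the root towards the leaf through `grad_fn.next_functions`, then compares the gradient from `backward()` with the chain rule written out by hand; the input value 1.5 is arbitrary.

```python
import torch

x = torch.tensor(1.5, requires_grad=True)
y = x ** 2
z = torch.sin(y)

# Walk from the root towards the leaf, printing each backward node's type.
node = z.grad_fn
while node is not None:
    print(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None

z.backward()

# Chain rule by hand: dz/dx = dz/dy * dy/dx = cos(x^2) * 2x
manual = torch.cos(x.detach() ** 2) * 2 * x.detach()
print(torch.allclose(x.grad, manual))  # True
```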



## Computing Gradients

**NOTE**

* We can only obtain the `grad` properties for the leaf nodes of the computational graph, which have the `requires_grad` property set to `True`. For all other nodes in our graph, gradients will not be available.

* We can only perform gradient calculations using `backward` once on a given graph, for performance reasons. If we need to do several `backward` calls on the same graph, we need to pass `retain_graph=True` to the `backward` call.
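The second note can be illustrated with a small sketch on an arbitrary tensor `w`. It assumes the documented `retain_graph` argument of `backward()`, and it also shows that repeated calls accumulate into `.grad` rather than overwrite it:

```python
import torch

w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()

loss.backward(retain_graph=True)  # keep saved tensors for another pass
first = w.grad.clone()

loss.backward()                   # allowed; gradients accumulate in .grad
print(torch.allclose(w.grad, 2 * first))  # True

try:
    loss.backward()               # saved tensors were freed by the previous call
except RuntimeError:
    print("a further backward() call raised RuntimeError")

w.grad.zero_()                    # reset before the next iteration
```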

To optimize weights of parameters in the neural network, we need to compute the derivatives of our loss function with respect to parameters, namely, we need $\frac{\partial loss}{\partial w}$ and $\frac{\partial loss}{\partial b}$ under some fixed values of `x` and `y`. 


In [4]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.0828, 0.2673, 0.0525],
        [0.0828, 0.2673, 0.0525],
        [0.0828, 0.2673, 0.0525],
        [0.0828, 0.2673, 0.0525],
        [0.0828, 0.2673, 0.0525]])
tensor([0.0828, 0.2673, 0.0525])
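The identical rows of `w.grad` above are what the chain rule predicts. With the default mean reduction of `binary_cross_entropy_with_logits`, the derivative of the loss with respect to `z` is `(sigmoid(z) - y) / 3`; `b.grad` equals that vector, and `w.grad` is the outer product of `x` with it, so an all-ones `x` repeats it in every row. The sketch below re-creates the example with fresh random weights (so the numbers differ from the output above) and checks this by hand:

```python
import torch
import torch.nn.functional as F

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = F.binary_cross_entropy_with_logits(z, y)
loss.backward()

with torch.no_grad():
    # d loss / d z for the mean-reduced loss over 3 outputs
    dz = (torch.sigmoid(z) - y) / y.numel()
    print(torch.allclose(b.grad, dz, atol=1e-6))                  # True
    print(torch.allclose(w.grad, torch.outer(x, dz), atol=1e-6))  # True
```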


## Disabling Gradient Tracking

By default, all tensors with `requires_grad=True` track their computational history and support gradient computation. However, there are some cases when we do not need to do that, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do forward computations through the network. We can stop tracking computations by surrounding our computation code with a `torch.no_grad()` block:

In [5]:
z=torch.matmul(x,w)+b
print(z.requires_grad)

with torch.no_grad():
    z=torch.matmul(x,w)+b
print(z.requires_grad)

True
False


Another way to achieve the same result is to use the `detach()` method on the tensor:

In [6]:
z=torch.matmul(x,w)+b
z_det=z.detach()
z_det.requires_grad

False
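As a further sketch with arbitrary tensors, a detached tensor shares its storage with the original, and when it is used in a later computation autograd treats it as a constant:

```python
import torch

w = torch.randn(3, requires_grad=True)
z = w * 2
z_det = z.detach()

print(z_det.requires_grad)               # False
print(z_det.data_ptr() == z.data_ptr())  # True: same underlying memory

loss = (z * z_det).sum()  # gradient flows through z only
loss.backward()

# d loss / d w = 2 * z_det = 4 * w, because z_det is treated as a constant
print(torch.allclose(w.grad, 4 * w.detach()))  # True
```

Had `z_det` not been detached, the loss would be `sum(z * z)` and the gradient would be `8 * w` instead.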

There are reasons you might want to disable gradient tracking:
* To mark some parameters in your neural networks as **frozen parameters**
* To **speed up computations** when you are only doing a forward pass, because computations on tensors that do not track gradients are more efficient.