In [1]:
import torch

**Autograd** 

Autograd is the PyTorch tool that computes derivatives via a technique called **automatic differentiation**. As quoted from the official documentation: `torch.autograd` *provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions.*

Automatic differentiation is a set of techniques for numerically evaluating the derivative of a function. It is required during the backpropagation pass (to compute the gradient of the loss function w.r.t. the weights) while training a neural network.
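
As a minimal sketch of what this means in practice, the snippet below differentiates f(x) = x^3 at x = 2 with autograd and compares the result against the hand-derived 3x^2. The function and the evaluation point are illustrative assumptions, not part of the original notebook.

```python
import torch

# Illustrative function (an assumption for this sketch): f(x) = x^3
x = torch.tensor(2.0, requires_grad=True)
f = x ** 3

# Autograd evaluates df/dx at x = 2 without us writing the derivative
f.backward()

# Hand-derived derivative for comparison: df/dx = 3 * x^2
analytic = 3 * x.detach() ** 2

print("autograd:", x.grad)
print("analytic:", analytic)
```

Both values should agree, which is the point: the derivative is obtained from the recorded operations rather than from a formula we supply.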

**Computation Graph**

So how does PyTorch (or any other DL library, for that matter) calculate gradients during backpropagation? It does so by generating a data structure called a **computation graph**. In a complex setup where gradients must be calculated for thousands of variables, the computation graph comes into the picture.

A **computation graph** is simply a map of references between variables (tensors) and operators (functions), generated for a set of algebraic equations, which autograd can traverse back to the leaves to calculate gradients.

Because PyTorch generates these graphs at runtime during the forward pass (the plain calculation of outputs from inputs), they are called Dynamic Computation Graphs (DCGs).
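
To illustrate what "dynamic" means here, the following sketch runs the same Python function twice with a different branch taken each time and inspects which operation was recorded. The helper `forward` and its `use_square` flag are hypothetical names introduced only for this example.

```python
import torch

def forward(x, use_square):
    # Ordinary Python control flow decides which operations run,
    # so the recorded graph can differ from one call to the next.
    if use_square:
        return x * x
    return x + 1.0

x = torch.tensor(3.0, requires_grad=True)

out_a = forward(x, use_square=True)
out_b = forward(x, use_square=False)

# The node type recorded for each output reflects the branch that ran
name_a = type(out_a.grad_fn).__name__
name_b = type(out_b.grad_fn).__name__
print(name_a)
print(name_b)
```

The two outputs carry different backward nodes because each graph was built while its own forward pass executed.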

**1. Important properties: requires_grad, grad_fn, is_leaf**


**1.1. requires_grad**

The `requires_grad` attribute tells autograd to track your operations. So if you want PyTorch to create a graph corresponding to these operations, you will have to set the `requires_grad` attribute of the Tensor to True.

There are two ways to do this: either pass `requires_grad=True` as an argument to `torch.tensor`, or explicitly set the `requires_grad` property to True afterwards.

*Remember that only tensors of ***floating-point*** (or complex) data types can require gradients (that is, ask autograd to record their operations).*

In [2]:
t2 = torch.DoubleTensor([1., 2.])  # dtype = torch.float64
t2.requires_grad=True
print(f't2 : {t2}')

t3 = torch.HalfTensor([1., 2.])  # dtype = torch.float16
t3.requires_grad=True
print(f't3 : {t3}')

t2 : tensor([1., 2.], dtype=torch.float64, requires_grad=True)
t3 : tensor([1., 2.], dtype=torch.float16, requires_grad=True)
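
The dtype restriction can be checked directly. This is a small sketch, with illustrative values, showing that an integer tensor is rejected while a floating-point copy of it is accepted; `requires_grad_()` is the in-place method form of setting the property.

```python
import torch

# Integer tensors cannot track gradients; PyTorch raises a RuntimeError
try:
    torch.tensor([1, 2], requires_grad=True)
    int_ok = True
except RuntimeError as err:
    int_ok = False
    print("integer tensor rejected:", err)

# Casting to a floating-point dtype first makes the request valid
t = torch.tensor([1, 2]).float().requires_grad_()
print(t.dtype, t.requires_grad)
```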


**Points to ponder:**

- A tensor produced by an operation on other tensors will also have `requires_grad = True`, provided at least one input tensor has `requires_grad = True`.

- This is also useful when we don't want to compute gradients for some tensors in a network, and hence don't want to update the weights associated with them. Setting `requires_grad = False` on those tensors keeps them out of the computation graph.
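
Both points can be seen in a short sketch. The names `w_frozen`, `w_train`, and `inp` are hypothetical and stand in for a frozen weight, a trainable weight, and an input.

```python
import torch

# Hypothetical names for this sketch: `w_frozen` is a weight we do not train
w_frozen = torch.tensor([1.0, 2.0])                     # requires_grad=False by default
w_train = torch.tensor([0.5, 0.5], requires_grad=True)
inp = torch.tensor([3.0, 4.0])

# One input requires grad, so the result does too
out = (w_frozen * inp + w_train * inp).sum()
print(out.requires_grad)

out.backward()
print(w_train.grad)    # gradient was computed for the trainable weight
print(w_frozen.grad)   # None: the frozen weight was never tracked
```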

**1.2. grad_fn**

The `grad_fn` property holds a reference to the function (mathematical operator) that created the tensor. It is very important during a backward pass, as this function is responsible for calculating the gradient and sending it to the appropriate next function in the graph.

- If `requires_grad` is False, `grad_fn` will be None.

**1.3. is_leaf**

The `is_leaf` property tells whether a tensor is a leaf node or not. Essentially, leaf tensors are the tensors for which we want to accumulate gradients, and they sit at the edge of the computation graph. **Only leaf Tensors will have their grad populated during a call to** `backward()`. Technically, leaf tensors are any tensors created in one of the following ways:

- Tensors resulting from operations on tensors that all have `requires_grad = False` are leaf tensors.

- Any tensor explicitly created by the user is a leaf tensor. Since it is not the result of an operation, `grad_fn = None`.

In the following example, the tensors `x`, `a`, and `b` are the leaf nodes. As `x` is a leaf node, its `grad_fn = None` (it is not obtained from any operation).

The tensor `y` has a multiplication operator as its `grad_fn`, since `y` is obtained from the multiplication of `a` and `x`. The same reasoning applies to `z`, whose `grad_fn` is an addition operator.

In [21]:
# leaf nodes
x = torch.tensor(3., requires_grad=True)
a = torch.tensor(4., requires_grad=True)
b = torch.tensor(5., requires_grad=True)

y = a * x
# y is non-leaf
y.retain_grad()

z = y + b

print("Tensor x")
print(f'grad function = {x.grad_fn}')
print(f'is leaf = {x.is_leaf}')
print(x.requires_grad)

print("\nTensor y")
print(f'grad function = {y.grad_fn}')
print(f'is leaf = {y.is_leaf}')
print(y.requires_grad)

print("\nTensor z")
print(f'grad function = {z.grad_fn}')
print(f'is leaf = {z.is_leaf}')
print(z.requires_grad)

Tensor x
grad function = None
is leaf = True
True

Tensor y
grad function = <MulBackward0 object at 0x13a552c80>
is leaf = False
True

Tensor z
grad function = <AddBackward0 object at 0x13a552c80>
is leaf = False
True
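
The two rules for leaf tensors listed earlier can also be checked in isolation. This sketch uses hypothetical tensors `p`, `q`, `r`, `u`, and `s` to cover an untracked result, a user-created tracked tensor, and a tracked result.

```python
import torch

# Neither input requires grad, so the result is untracked and counts as a leaf
p = torch.tensor(2.0)
q = torch.tensor(5.0)
r = p * q
print(r.requires_grad, r.grad_fn, r.is_leaf)

# A user-created tensor that requires grad is a leaf with no grad_fn
u = torch.tensor(2.0, requires_grad=True)
print(u.requires_grad, u.grad_fn, u.is_leaf)

# The result of an operation on a tracked tensor is not a leaf
s = u * q
print(s.requires_grad, type(s.grad_fn).__name__, s.is_leaf)
```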


**2. backward()**

The signature for `backward` is `backward(gradient=None, retain_graph=None, create_graph=False)`.

This is the most important of the tensor methods presented here. It computes the gradient of the current tensor w.r.t. the graph leaves, and is responsible for calculating gradients during a backward pass.

These are the typical steps involved in gradient calculation during a backward pass:

1. The backward function takes an **incoming gradient** from the part of the network in front of it.

2. Then it calculates the local gradient at a particular tensor.

3. Then it multiplies the local gradient with the incoming gradient.

4. Finally, it forwards the computed gradient to the tensor's inputs by invoking the backward method of their `grad_fn`, or simply saves the gradient in the `grad` property for leaf nodes.

Recall that z = y + b = (a*x) + b, so we can easily compute:

$$
\begin{aligned}
\frac{\partial z}{\partial x} &= a = 4 \\
\frac{\partial z}{\partial a} &= x = 3 \\
\frac{\partial z}{\partial b} &= 1
\end{aligned}
$$

In [22]:
z.backward()    # imagine pushing back to the leaf nodes

print("Tensor x")
print(f'grad function = {x.grad_fn}')

print("\nTensor a")
print(f'grad function = {a.grad_fn}')

print("\nTensor b")
print(f'grad function = {b.grad_fn}')

print("\nTensor y")
print(f'grad function = {y.grad_fn}')

print("\nTensor z")
print(f'grad function = {z.grad_fn}')

print('\n')
print('dz/dx:', x.grad)
print('dz/da:', a.grad)
print('dz/db:', b.grad)
print('(normally we don\'t compute grad of non-leaf nodes) dz/dy:', y.grad)

Tensor x
grad function = None

Tensor a
grad function = None

Tensor b
grad function = None

Tensor y
grad function = <MulBackward0 object at 0x13aa16560>

Tensor z
grad function = <AddBackward0 object at 0x13aa16560>


dz/dx: tensor(4.)
dz/da: tensor(3.)
dz/db: tensor(1.)
(normally we don't compute grad of non-leaf nodes) dz/dy: tensor(1.)


Consider what happens in the above example when `z.backward()` is called. The `grad_fn` of `z` is `<AddBackward>`.

1. The backward function of `<AddBackward>` takes a default incoming gradient of `torch.tensor(1.)`.

2. Then it calculates the local gradients for `y` and `b`. Both are 1, as the operator is an addition.

3. Each local gradient is multiplied with the incoming gradient, i.e. 1 * 1.

4. Now for `b`, `grad_fn = None`, so the computed gradient is stored directly in the `grad` property of tensor `b`. For tensor `y`, the backward function passes the gradient on to its `grad_fn` (i.e. `<MulBackward>`, since `y` is formed by multiplying `x` and `a`).

5. Similarly, the backward function of `y`'s `<MulBackward>` is called with the incoming gradient from `<AddBackward>`, i.e. `tensor(1.)` in this case.

As noted, the backward function is called recursively throughout the graph as we backtrack. You can access the gradients by reading the `grad` attribute of a tensor.
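
The walk-through above can be replayed by hand with plain floats and compared against what autograd stores. This is only a sketch of the chain-rule bookkeeping for this one equation; the variable names `incoming`, `grad_y`, and so on are introduced for illustration and do not mirror PyTorch internals.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
a = torch.tensor(4.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)
z = a * x + b
z.backward()

# Manual replay of the same steps with plain floats
incoming = 1.0                   # default gradient for a scalar output
grad_y = incoming * 1.0          # addition: local gradient dz/dy = 1
grad_b = incoming * 1.0          # addition: local gradient dz/db = 1
grad_a = grad_y * x.item()       # multiplication: local gradient dy/da = x
grad_x = grad_y * a.item()       # multiplication: local gradient dy/dx = a

print(grad_x, x.grad.item())
print(grad_a, a.grad.item())
print(grad_b, b.grad.item())
```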

>Note: the backward function only calculates gradients by traversing an already built backward graph. As discussed, the backward graph is generated during the forward pass.
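
One consequence worth seeing in code: a graph can be traversed a second time only if it was kept with `retain_graph=True`, and each traversal adds to the `grad` already stored on the leaves. The sketch below uses illustrative values and assumes nothing beyond the `backward` signature shown earlier.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
a = torch.tensor(4.0, requires_grad=True)
z = a * x

# retain_graph=True keeps the backward graph so it can be traversed again
z.backward(retain_graph=True)
first = x.grad.clone()

# A second pass over the same graph adds to the existing .grad
z.backward()
second = x.grad.clone()

print(first)
print(second)

# Reset before reuse so stale gradients do not leak into later passes
x.grad.zero_()
```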

**2.1. Calling backward() on non-scalar tensor**

Calling `backward()` with no arguments on a vector-valued tensor raises `RuntimeError: grad can be implicitly created only for scalar outputs`.

This is because for a non-scalar tensor a vector-Jacobian product has to be computed, so `backward` expects an incoming gradient as its input (usually the gradient of the differentiated function w.r.t. the corresponding tensor). Hence `backward` *expects the incoming gradient to be a Tensor of the same size as the current tensor*; only then is it able to backpropagate.
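
A minimal sketch that triggers the error, using the same `x` and `a` as the cells in this section:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
a = torch.tensor([4.0, 2.0], requires_grad=True)
z = x + a   # shape (2,), not a scalar

try:
    z.backward()          # no gradient argument supplied
    raised = False
except RuntimeError as err:
    raised = True
    print(err)
```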

**So you can either pass a tensor of the same shape,**

In [16]:
x = torch.tensor(3., requires_grad=True)
a = torch.tensor([4.,2.], requires_grad=True)
z = x + a

display(z)  # Broadcasting, see Ch. 02a
z.backward(torch.tensor([1., 1.])) # passing a gradient to backward.

tensor([7., 5.], grad_fn=<AddBackward0>)

In [17]:
print('dz/dx:', x.grad) 
print('dz/da:', a.grad) 

dz/dx: tensor(2.)
dz/da: tensor([1., 1.])


>Note: If you pass a tensor that is not all ones to `backward`, the gradients are scaled accordingly.

In [20]:
x = torch.tensor(3., requires_grad=True)
a = torch.tensor([4.,2.], requires_grad=True)
z = x + a

display(z)  # Broadcasting, see Ch. 02a
z.backward(torch.tensor([1000., 1000.])) # passing a gradient to backward.

tensor([7., 5.], grad_fn=<AddBackward0>)

In [21]:
print('dz/dx:', x.grad) 
print('dz/da:', a.grad) 

dz/dx: tensor(2000.)
dz/da: tensor([1000., 1000.])
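
The scaling seen here can be read as a vector-Jacobian product: the tensor passed to `backward` is the vector, and it is multiplied with the Jacobian of the output w.r.t. each leaf. The sketch below uses an elementwise square, chosen for illustration because its Jacobian is diagonal and the expected result is easy to write down; the vector `v` is arbitrary.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                       # elementwise, so the Jacobian is diag(2 * x)

v = torch.tensor([1.0, 0.1, 0.01])   # illustrative incoming gradient
y.backward(v)

# For a diagonal Jacobian, v^T J reduces to v * 2x
expected = v * 2 * x.detach()
print(x.grad)
print(expected)
```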


**Or simply reduce the current tensor to a scalar of size torch.Size([]), as expected by backward.**

In [7]:
x = torch.tensor(3., requires_grad=True)
a = torch.tensor([4.,2.], requires_grad=True)
z = x + a
z = z.mean()
z.backward()

In [8]:
print('dz/dx:', x.grad) 
print('dz/da:', a.grad) 

dz/dx: tensor(1.)
dz/da: tensor([0.5000, 0.5000])


In [9]:
x = torch.tensor(3., requires_grad=True)
a = torch.tensor([4.,2.], requires_grad=True)
z = x + a
z = z.sum()
z.backward()

In [10]:
print('dz/dx:', x.grad) 
print('dz/da:', a.grad) 

dz/dx: tensor(2.)
dz/da: tensor([1., 1.])
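
The two approaches are closely related. As a sketch, reducing with `sum()` gives the same gradients as passing a tensor of ones, and reducing with `mean()` over two elements gives the same gradients as passing a tensor filled with 0.5. The helper `grads` is a hypothetical convenience function for this comparison.

```python
import torch

def grads(run_backward):
    # Fresh tensors each time so gradients do not accumulate across runs
    x = torch.tensor(3.0, requires_grad=True)
    a = torch.tensor([4.0, 2.0], requires_grad=True)
    z = x + a
    run_backward(z)
    return x.grad, a.grad

# Reducing to a scalar first ...
sum_x, sum_a = grads(lambda z: z.sum().backward())
mean_x, mean_a = grads(lambda z: z.mean().backward())

# ... matches passing the reduction's own gradient explicitly
ones_x, ones_a = grads(lambda z: z.backward(torch.ones_like(z)))
half_x, half_a = grads(lambda z: z.backward(torch.full_like(z, 0.5)))

print(sum_x, ones_x)
print(mean_x, half_x)
```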


**We have now covered the basic functioning of Autograd in PyTorch, along with the functions that implement it. Before moving on, let's return to the computation graph diagram for the same equation to make the concept concrete.**

The following is the DCG of the same example (Example 1) when we have `requires_grad = True`:

- The tensors in green are leaf nodes.

- Tensors in yellow are intermediate nodes.

- `MulBackward` and `AddBackward` are two `grad_fn` for `y` and `z` respectively.

- `grad` attribute stores the value of calculated gradients.

![alt text](/Users/ilpreterosso/GitHub/VSCode/NN101/notebooks/Module2/img/Screenshot.png)
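
The same structure can be inspected in code by following `grad_fn.next_functions` from the output back towards the leaves. This is a rough sketch: the `walk` helper is hypothetical, and the `AccumulateGrad` nodes it reaches are the points where gradients are written into the `grad` of leaf tensors.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
a = torch.tensor(4.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)
z = a * x + b

def walk(fn, depth=0, seen=None):
    """Collect node type names by following next_functions from the root."""
    seen = [] if seen is None else seen
    if fn is None:
        return seen
    seen.append(type(fn).__name__)
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, seen)
    return seen

names = walk(z.grad_fn)
```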

**4. register_hook()**

The hook will be called every time a gradient with respect to the tensor is computed. The hook should have the following signature:

`hook(grad) -> Tensor or None`

This function returns a *handle* with a method `handle.remove()` that removes the hook from the tensor.

**So, the hook can take the value of grad and can return a new value or perform operations with the value.**

This is the best part: a hook can help to

- Modify the `grad` on the fly during a backward pass, without waiting for the pass to complete. This lets us influence how gradients are calculated in a graph.

- Debug the flow of gradients in your graph, inspecting the gradient at each step, even for non-leaf nodes.

Here is an example from the PyTorch documentation.

In [34]:
v = torch.tensor([0., 0., 0.], requires_grad=True)
h = v.register_hook(lambda grad: grad * 2)  # double the gradient
v.backward(torch.tensor([1., 2., 3.]))
display(v.grad)
h.remove() # removes the hook

tensor([2., 4., 6.])
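
For the debugging use case, a hook can also be attached to a non-leaf tensor to observe the gradient flowing through it. The sketch below reuses the earlier equation z = a*x + b; the list `seen` is a hypothetical container introduced to record what the hook receives.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
a = torch.tensor(4.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)

y = a * x          # non-leaf: its .grad is not kept by default
z = y + b

seen = []          # hypothetical container for recorded gradients
# list.append returns None, so the hook leaves the gradient unchanged
handle = y.register_hook(lambda grad: seen.append(grad.clone()))

z.backward()
handle.remove()    # detach the hook once it is no longer needed

print("dz/dy seen by hook:", seen[0])
```

Because the hook only records and returns None, the gradients stored on `x`, `a`, and `b` are the same as without it.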