## Autograd

Review Andrej Karpathy's micrograd notebooks first; they make this notebook much easier to follow.

Set `requires_grad=True` on every tensor whose gradient you want to track.

In [2]:
import torch
import numpy

In [5]:
t1 = torch.randn(5, requires_grad=True) #random normal 
t1

tensor([ 0.7659, -2.0262, -1.4139,  1.1822, -1.1543], requires_grad=True)

In [6]:
t2 = t1 + 2
print(t2)

tensor([ 2.7659, -0.0262,  0.5861,  3.1822,  0.8457], grad_fn=<AddBackward0>)


<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">NOTE:</span>
PyTorch automatically records how `t2` was computed, much like _micrograd_ does: `t2` has `t1` and `2` as parents and `+` as its operator,

which is why `grad_fn=<AddBackward0>` appears in the output of the above cell.
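
The recorded operator and parents can be inspected directly. This is a minimal sketch using the `grad_fn` and `next_functions` attributes; the variable names `op_name` and `parents` are just for illustration.

```python
import torch

t1 = torch.randn(5, requires_grad=True)
t2 = t1 + 2

# The node that produced t2 (the "operator" in micrograd terms).
op_name = type(t2.grad_fn).__name__
print(op_name)

# The inputs of that node (the "parents"). A leaf tensor shows up as an
# AccumulateGrad node; an input that needs no gradient shows up as None.
parents = [type(fn).__name__ if fn is not None else None
           for fn, _ in t2.grad_fn.next_functions]
print(parents)

# A tensor created directly by the user has no grad_fn.
print(t1.grad_fn)
```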

In [7]:
t3 = 2*t2**2
print(t3)

tensor([1.5300e+01, 1.3746e-03, 6.8693e-01, 2.0253e+01, 1.4306e+00],
       grad_fn=<MulBackward0>)


Similarly, for `t3` the last operation is `*`, with `2` and `t2**2` as parents (the power `t2**2` is recorded as its own node), <br>
so this time `grad_fn=<MulBackward0>`.
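
To see the whole chain behind `t3`, the graph can be walked from `t3.grad_fn` backwards. The helper `walk` below is a hypothetical function written for this note, not part of PyTorch.

```python
import torch

t1 = torch.randn(5, requires_grad=True)
t2 = t1 + 2
t3 = 2 * t2 ** 2

def walk(fn, depth=0, out=None):
    """Collect backward-node names depth-first, starting at `fn`."""
    out = [] if out is None else out
    if fn is None:          # inputs that need no gradient (e.g. constants)
        return out
    out.append("  " * depth + type(fn).__name__)
    for parent, _ in fn.next_functions:
        walk(parent, depth + 1, out)
    return out

chain = walk(t3.grad_fn)
print("\n".join(chain))
```

The multiplication node should sit on top of a power node, which in turn sits on the addition that produced `t2`.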

In [9]:
t4 = t3.mean()
t4

tensor(7.5344, grad_fn=<MeanBackward0>)

In [None]:
t4.backward()  # calling backward() on the final variable computes the gradients for all the variables before it

In [12]:
print('t1.grad = ', t1.grad)
print('t2.grad = ', t2.grad)
print('t3.grad = ', t3.grad)
print('t4.grad = ', t4.grad) #expected to be 1 since d(t4)/d(t4) = 1

t1.grad =  tensor([ 2.2127, -0.0210,  0.4688,  2.5458,  0.6766])
t2.grad =  None
t3.grad =  None
t4.grad =  None




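`t2.grad`, `t3.grad`, and `t4.grad` are all `None` (PyTorch also emits a `UserWarning` when `.grad` is read on a non-leaf tensor), which points to the fact that __gradients are stored only for leaf nodes by default__.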
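
The value of `t1.grad` can be checked by hand. Since $t_4 = \frac{1}{n}\sum_i 2\,(t_{1,i} + 2)^2$, the derivative is $\frac{\partial t_4}{\partial t_{1,i}} = \frac{4\,(t_{1,i} + 2)}{n}$. A small sketch comparing this with autograd (the names `expected` and `matches` are only for illustration):

```python
import torch

t1 = torch.randn(5, requires_grad=True)
t2 = t1 + 2
t3 = 2 * t2 ** 2
t4 = t3.mean()
t4.backward()

# By hand: d(t4)/d(t1) = 4 * (t1 + 2) / n
expected = 4 * (t1.detach() + 2) / t1.numel()
matches = torch.allclose(t1.grad, expected)
print(matches)
```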

#### What are leaf nodes? 

A leaf node is a torch.Tensor that:
- Was created by the user (not resulting from a computation), and
- Has requires_grad=True.
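
The `is_leaf` attribute shows which tensors count as leaves. One caveat in this sketch: by PyTorch's convention, `is_leaf` is also `True` for tensors with `requires_grad=False`, even though no gradient is ever stored for them.

```python
import torch

a = torch.randn(3, requires_grad=True)   # created by the user, tracked
b = a * 2                                # result of an operation
c = torch.randn(3)                       # created by the user, not tracked

print(a.is_leaf, a.requires_grad)
print(b.is_leaf, b.requires_grad)
print(c.is_leaf, c.requires_grad)
```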

So in the above process, `t1` is a leaf node (created by the user, `requires_grad=True`), but `t2`, `t3`, and `t4` are simply the results of mathematical operations. PyTorch doesn't store _intermediate gradients_ by default. Why? Because they are rarely needed in practice, and keeping them would cost extra memory.

For example, in a neural network with 2 inputs, 3 hidden-layer neurons, and 1 output:
```
x = torch.randn(1, 2, requires_grad=False)   # Input
W1 = torch.randn(2, 3, requires_grad=True)   # Leaf
b1 = torch.randn(3, requires_grad=True)      # Leaf

W2 = torch.randn(3, 1, requires_grad=True)   # Leaf
b2 = torch.randn(1, requires_grad=True)      # Leaf

# Forward pass
h = x @ W1 + b1      # Intermediate tensor (not leaf)
a = torch.relu(h)    # Intermediate
y = a @ W2 + b2      # Final output
```

__So the gradients of `h` and `a` are not stored, which is in line with our usage.__
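
A quick way to confirm this is to run a backward pass on the same small network and look at which tensors end up with a `.grad`. This is a sketch; `param_grad_shapes` is a name made up for this check, and the seed is arbitrary.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 2)                        # input, not tracked
W1 = torch.randn(2, 3, requires_grad=True)
b1 = torch.randn(3, requires_grad=True)
W2 = torch.randn(3, 1, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)

h = x @ W1 + b1
a = torch.relu(h)
y = a @ W2 + b2

y.sum().backward()   # reduce to a scalar before calling backward()

# Every parameter (leaf) received a gradient of its own shape.
param_grad_shapes = {name: tuple(p.grad.shape)
                     for name, p in [("W1", W1), ("b1", b1), ("W2", W2), ("b2", b2)]}
print(param_grad_shapes)

# The intermediates are not leaves, and the untracked input has no gradient.
print(h.is_leaf, a.is_leaf, x.grad)
```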

However, intermediate gradients can be accessed explicitly: call `t2.retain_grad()` before `backward()` so that `t2.grad` gets populated, or compute d(t4)/d(t2) directly with `torch.autograd.grad(t4, t2)`.

In [13]:
# consider the following case: 

t1 = torch.randn(3, requires_grad=True)
t2 = t1 + 2
t3 = 2 * t2 ** 2

In [None]:
t3.backward()

^ This raises `RuntimeError: grad can be implicitly created only for scalar outputs`, since `t3` is not a scalar.

#### Why? and what to do?

`backward()` needs the output tensor (`t3`) to be a scalar, or you must explicitly provide a "gradient vector" of the same shape as the output.
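
The failure is easy to reproduce in isolation. A minimal sketch that catches the exception instead of stopping the notebook (the flag `raised` is only for illustration):

```python
import torch

t1 = torch.randn(3, requires_grad=True)
t3 = 2 * (t1 + 2) ** 2    # shape (3,), not a scalar

try:
    t3.backward()          # no gradient argument given
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

# Nothing was computed, so the leaf still has no gradient.
print(t1.grad)
```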

`.backward()` actually computes a __vector-Jacobian product (VJP)__, that is, the transposed Jacobian applied to $v$: $$\left(\frac{\partial t_3}{\partial t_1}\right)^{T} \cdot v$$

So instead we do:
```
v = torch.tensor([0.1, 1.0, 0.01])
t3.backward(v)
```
You're saying:

"I don't need the full Jacobian, just give me the result of multiplying it by this vector `v`." Check the image below to see it in action when calculating d(t_3)/d(t_1), which is the gradient we need, since t_1 is the leaf node!


<img src="pictures/j.v working.jpg" width="50%">
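
The same product can be checked numerically by building the full Jacobian and multiplying it by `v` by hand. This sketch uses `torch.autograd.functional.jacobian`; the function `f` simply wraps the computation of `t3` and is introduced here only for the comparison.

```python
import torch
from torch.autograd.functional import jacobian

def f(t1):
    # same computation as t3 in the cells above
    return 2 * (t1 + 2) ** 2

t1 = torch.randn(3, requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.01])

# What backward(v) does: accumulate J^T v into t1.grad.
t3 = f(t1)
t3.backward(v)

# The explicit route: build J with J[i, j] = d(t3_i)/d(t1_j), then multiply.
J = jacobian(f, t1.detach())
explicit = J.T @ v

agree = torch.allclose(t1.grad, explicit)
print(agree)
```

For an element-wise function like this one the Jacobian is diagonal, so the full matrix is mostly zeros, which is one reason building it is usually avoided.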

In [15]:
# instead: passing a vector of ones gives the same gradient as t3.sum().backward()

v = torch.ones(t3.shape[0])

t3.backward(v)

In [17]:
# .grad attribute stores the gradient
t1.grad

tensor([ 7.0332,  9.2314, 10.7467])
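
Passing a vector of ones is the same as summing the output first. A short sketch of that equivalence, which also checks the result against the hand-derived gradient $4\,(t_1 + 2)$ (names such as `grad_from_ones` are illustrative):

```python
import torch

t1 = torch.randn(3, requires_grad=True)

# Route 1: backward with a vector of ones.
t3 = 2 * (t1 + 2) ** 2
t3.backward(torch.ones_like(t3))
grad_from_ones = t1.grad.clone()

# Route 2: reduce to a scalar with sum(), then backward.
t1.grad.zero_()
t3 = 2 * (t1 + 2) ** 2      # rebuild the graph for a second backward pass
t3.sum().backward()
grad_from_sum = t1.grad.clone()

equal = torch.allclose(grad_from_ones, grad_from_sum)
by_hand = torch.allclose(grad_from_ones, 4 * (t1.detach() + 2))
print(equal, by_hand)
```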

### Emptying gradients

Use `object.grad.zero_()`.
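
To see why this matters, here is a small sketch of what happens when the gradient is never zeroed: each `backward()` call adds to the existing `.grad` instead of replacing it. The list `grads` is only there to record the values.

```python
import torch

w = torch.ones(4, requires_grad=True)
grads = []

for _ in range(3):
    out = (w * 3).sum()
    out.backward()                 # adds 3 to every entry of w.grad
    grads.append(w.grad.clone())   # no zero_() call, so values pile up

for g in grads:
    print(g)
```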

In [19]:
# backward() accumulates the gradient for this tensor into .grad attribute.
# !!! We need to be careful during optimization !!!
# Use .zero_() to empty the gradients before a new optimization step!
weights = torch.ones(4, requires_grad=True)

for epoch in range(2):
    # just a dummy example
    model_output = (weights*3).sum()
    model_output.backward()
    
    print(weights.grad)

    # optimize model, i.e. adjust weights...
    with torch.no_grad():
        weights -= 0.1 * weights.grad

    # this is important! It affects the final weights & output
    weights.grad.zero_()

print(weights)
print(model_output)

tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
tensor([0.4000, 0.4000, 0.4000, 0.4000], requires_grad=True)
tensor(8.4000, grad_fn=<SumBackward0>)


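^ Both epochs print a gradient of 3 because `weights.grad` is zeroed after each step; the final weights are 1 - 2 × 0.1 × 3 = 0.4.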
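
In practice the update and the zeroing are usually delegated to an optimizer. The sketch below repeats the dummy loop with `torch.optim.SGD`, assuming the same learning rate of 0.1; `optimizer.step()` replaces the manual update and `optimizer.zero_grad()` replaces `weights.grad.zero_()`.

```python
import torch

weights = torch.ones(4, requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.1)

for epoch in range(2):
    model_output = (weights * 3).sum()   # same dummy "model" as above
    model_output.backward()

    optimizer.step()        # weights -= lr * weights.grad
    optimizer.zero_grad()   # clear gradients before the next step

print(weights)
```

The final weights should match the manual loop above.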