# Automatic differentiation and training loop

Why autograd is important and how it works: [Why Jacobians are useful to compute gradients in a NN?](https://suzyahyah.github.io/calculus/machine%20learning/2018/04/04/Jacobian-and-Backpropagation.html)

In [2]:
import torch 

We specify that our tensor requires a gradient, i.e., we want to compute a derivative w.r.t. that tensor:

In [3]:
x = torch.rand(4,requires_grad=True)
print(x)

tensor([9.5146e-01, 7.6532e-05, 6.8714e-01, 1.9988e-01], requires_grad=True)


Let's start doing some computations and see how PyTorch keeps track of them in a [computational graph](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/).

In [4]:
y = x*5
print(y)

tensor([4.7573e+00, 3.8266e-04, 3.4357e+00, 9.9938e-01],
       grad_fn=<MulBackward0>)


In [5]:
w = y**2
print(w)

tensor([2.2632e+01, 1.4643e-07, 1.1804e+01, 9.9876e-01],
       grad_fn=<PowBackward0>)


In [6]:
z = w.mean()
print(z)

tensor(8.8587, grad_fn=<MeanBackward0>)


As you can see, each output of a computation has an associated `grad_fn`. It records the operation that produced the tensor and links back, through the graph, to the initial variable `x`.
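To make the graph more tangible, here is a minimal sketch that walks backwards from `z` to `x` through the `grad_fn` nodes. It rebuilds the same chain of operations as above; following only the first parent of each node is a simplification that happens to be enough for this linear chain.

```python
import torch

x = torch.rand(4, requires_grad=True)
z = ((x * 5) ** 2).mean()  # same chain as above: multiply, square, mean

# Walk the graph from the output back to the leaf tensor.
node = z.grad_fn
names = []
while node is not None:
    names.append(type(node).__name__)
    # next_functions holds (parent_node, index) pairs; constants such as 5 appear as None
    parents = [fn for fn, _ in node.next_functions if fn is not None]
    node = parents[0] if parents else None

print(" -> ".join(names))
```

The last node in the chain is the one that accumulates the gradient into `x.grad`.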

So, if we want to compute $\frac{\partial z}{\partial x}$, we simply do:

In [7]:
z.backward()
print(x.grad)

tensor([1.1893e+01, 9.5665e-04, 8.5893e+00, 2.4984e+00])


`z.backward()` applies the chain rule backwards through the graph as a sequence of *vector-Jacobian products*, and the resulting gradient is stored in the `.grad` attribute of every leaf tensor that requires gradients (here, `x`).
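A small sketch of what this means in practice, under the assumption that we rebuild the same computation with fresh random values. It compares three routes to the same gradient: calling `backward()` on the scalar `z`, calling `backward()` on the non-scalar `w` with an explicit vector $v = \partial z / \partial w$, and the analytic result. Since $z = \frac{1}{4}\sum_i (5x_i)^2$, we expect $\frac{\partial z}{\partial x_i} = 12.5\,x_i$.

```python
import torch

x = torch.rand(4, requires_grad=True)

# Route 1: reduce to a scalar and call backward() with no arguments.
z = ((x * 5) ** 2).mean()
z.backward()
grad_scalar = x.grad.clone()

# Route 2: stop at the non-scalar w and pass the vector v = dz/dw = 1/4 explicitly.
x.grad = None  # discard the gradient stored by route 1
w = (x * 5) ** 2
v = torch.full_like(w, 1.0 / w.numel())
w.backward(gradient=v)
grad_vjp = x.grad.clone()

# Route 3: analytic derivative, dz/dx_i = 12.5 * x_i.
grad_manual = 12.5 * x.detach()

print(torch.allclose(grad_scalar, grad_vjp))
print(torch.allclose(grad_scalar, grad_manual))
```

You can check the analytic formula against the output printed above: the first entry of `x` was about 0.95146, and 12.5 times that is about 11.893.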

## How to prevent gradient computation

There are three main ways:

| Syntax      | Description |
| ----------- | ----------- |
| `with torch.no_grad():`  | Context manager; operations inside it are not tracked by autograd |
| `x.requires_grad_(False)` | Sets the `requires_grad` flag of `x` to `False` in place |
| `x.detach()`  | Returns a new tensor that shares data with `x` but is detached from the graph |


In [8]:
x = torch.rand(5,requires_grad=True)
print(x)
x.requires_grad_(False)
print(x)

tensor([0.9272, 0.4491, 0.1590, 0.3386, 0.3007], requires_grad=True)
tensor([0.9272, 0.4491, 0.1590, 0.3386, 0.3007])


In [9]:
x = torch.rand(5,requires_grad=True)
print(x)
y = x.detach()
print(y)

tensor([0.0322, 0.0197, 0.5358, 0.2156, 0.6406], requires_grad=True)
tensor([0.0322, 0.0197, 0.5358, 0.2156, 0.6406])


In [10]:
with torch.no_grad():
    y=x**2+2
    print(y)

tensor([2.0010, 2.0004, 2.2871, 2.0465, 2.4103])
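The printed tensors above only show that `requires_grad=True` is gone. The following sketch, using its own freshly created tensors, checks the flags directly and also shows that `detach()` does not copy the data.

```python
import torch

x = torch.rand(5, requires_grad=True)

y = x.detach()          # new tensor object, cut off from the graph
with torch.no_grad():
    u = x ** 2 + 2      # computed without recording a graph

print(y.requires_grad, u.requires_grad)  # neither result tracks gradients
print(u.grad_fn)                         # no backward node was recorded
print(y.data_ptr() == x.data_ptr())      # detach() shares memory with x
```

Because `y` shares memory with `x`, an in-place change to `y` is also visible in `x`. Use `x.detach().clone()` if you need an independent copy.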


One thing to be careful about: `.backward()` accumulates (sums) gradients into `.grad` rather than overwriting them, so in a training loop we usually need to reset the gradients to zero after each iteration.

Dummy example:

In [11]:
W = torch.ones(3, requires_grad=True)

for epoch in range(4):
    out = (W*2).sum()

    out.backward()

    print(W.grad)

tensor([2., 2., 2.])
tensor([4., 4., 4.])
tensor([6., 6., 6.])
tensor([8., 8., 8.])


As you can see, each iteration adds the new gradient to the one already stored. How do we solve this?

In [12]:
W = torch.ones(3, requires_grad=True)

for epoch in range(4):
    out = (W*2).sum()

    out.backward()

    print(W.grad)

    W.grad.zero_()

tensor([2., 2., 2.])
tensor([2., 2., 2.])
tensor([2., 2., 2.])
tensor([2., 2., 2.])


Here `.grad.zero_()` sets the gradients to zero in place at the end of each epoch. Later we will talk about optimizers, which have a built-in method that does the same job.
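As a preview, here is the same dummy loop with the reset delegated to an optimizer. This is only a sketch: the optimizer is created just to call `zero_grad()`, no weight update is performed, and the learning rate value is an arbitrary placeholder.

```python
import torch

W = torch.ones(3, requires_grad=True)
optimizer = torch.optim.SGD([W], lr=0.1)  # lr is unused here, no step() is called

grads = []
for epoch in range(4):
    out = (W * 2).sum()
    out.backward()
    grads.append(W.grad.clone())  # keep a copy before the reset
    optimizer.zero_grad()         # resets the gradient of every tensor it manages

for g in grads:
    print(g)
```

Note that in recent PyTorch versions `zero_grad()` resets `.grad` to `None` by default instead of filling it with zeros; the effect on the next `backward()` call is the same.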

## Dummy example: Perceptron with one input, one neuron and one output for one epoch

In [13]:
input = torch.tensor(2.0)
output = torch.tensor(10.0)

W = torch.tensor(1.0, requires_grad=True) # this requires grad since it is our learnable param
pred = W*input
loss = (pred - output)**2 #MSE loss
print(loss)
loss.backward() # backprop

tensor(64., grad_fn=<PowBackward0>)


Try to compute the gradient $\frac{\partial\,\text{loss}}{\partial W}$ by hand and compare it with `W.grad`!
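One way to do the comparison, sketched with renamed variables (`x_in` and `target` are used here instead of `input` and `output` to avoid shadowing the Python built-in `input`). With $\text{loss} = (W x - t)^2$, the chain rule gives $\frac{\partial\,\text{loss}}{\partial W} = 2\,(W x - t)\,x$.

```python
import torch

x_in = torch.tensor(2.0)     # plays the role of `input` above
target = torch.tensor(10.0)  # plays the role of `output` above

W = torch.tensor(1.0, requires_grad=True)
loss = (W * x_in - target) ** 2
loss.backward()

# Chain rule by hand: 2 * (W*x - t) * x = 2 * (2 - 10) * 2
manual_grad = 2 * (W.detach() * x_in - target) * x_in

print(W.grad, manual_grad)
```

Both values should agree at $2 \cdot (2 - 10) \cdot 2 = -32$.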

## PyTorch training loop

The basic PyTorch model pipeline is:

 1. Defining the model
 2. Defining the loss function and the optimizer
 3. Training the model:
    1. Compute the prediction: forward pass
    2. Compute the gradients: backward pass
    3. Update the parameters
    

In [18]:
import torch.nn as nn 

In [31]:
x = torch.randint(1,10,(5,), dtype=torch.float32)
y = torch.randint(1,10,(5,), dtype=torch.float32)
w = torch.zeros(1, requires_grad=True, dtype=torch.float32)

1. Define the model

In [32]:
def forward(x):
    return w*x

2. Define the loss and optimizer

In [33]:
lr = 0.01
n_iter = 11

In [34]:
loss = nn.MSELoss()
optimizer = torch.optim.SGD([w], lr = lr)
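To see what the optimizer will do for us, here is a sketch comparing one `optimizer.step()` of plain SGD (no momentum or weight decay) with the manual update $w \leftarrow w - \text{lr} \cdot \nabla_w$. The data values and the names `w_opt` and `w_man` are made up for this illustration.

```python
import torch

lr = 0.01
x = torch.tensor([1.0, 2.0, 3.0])  # illustrative data, not the notebook's
y = torch.tensor([2.0, 4.0, 6.0])

# Weight updated by the optimizer.
w_opt = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w_opt], lr=lr)
((w_opt * x - y) ** 2).mean().backward()
optimizer.step()

# Same starting weight, updated by hand.
w_man = torch.zeros(1, requires_grad=True)
((w_man * x - y) ** 2).mean().backward()
with torch.no_grad():          # the update itself must not be tracked
    w_man -= lr * w_man.grad

print(w_opt, w_man)
print(torch.allclose(w_opt, w_man))
```

The `torch.no_grad()` block is needed because an in-place change to a leaf tensor that requires gradients is otherwise rejected by autograd.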

3. Training loop

In [35]:
for epoch in range(n_iter):
    pred = forward(x)
    l = loss(pred, y) # MSELoss expects (prediction, target)
    l.backward() # gradients
    optimizer.step() #update weights
    optimizer.zero_grad() # set grads to 0

    if epoch % 5 == 0:
        print(f'epoch {epoch}: w = {w}, loss = {l}')

epoch 0: w = tensor([0.7400], requires_grad=True), loss = 38.599998474121094
epoch 5: w = tensor([0.9997], requires_grad=True), loss = 1.6000522375106812
epoch 10: w = tensor([1.0000], requires_grad=True), loss = 1.600000023841858


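We can see how the loss decreases over the epochs as the weight is adjusted. It levels off at about 1.6 rather than 0 because a single weight cannot fit this random data exactly.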
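As a sanity check on where training should end up: for the model `w*x` with MSE loss, the best weight has the closed form $w^* = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$. The sketch below uses fixed, made-up data (so the run is reproducible) and more iterations than above, then compares the trained weight with $w^*$.

```python
import torch
import torch.nn as nn

# Illustrative data, chosen by hand instead of torch.randint.
x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
y = torch.tensor([2.0, 4.0, 5.0, 8.0, 11.0])
w = torch.zeros(1, requires_grad=True)

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD([w], lr=0.01)

for epoch in range(200):
    loss_value = loss_fn(w * x, y)  # forward pass
    loss_value.backward()           # backward pass
    optimizer.step()                # parameter update
    optimizer.zero_grad()           # reset gradients

# Closed-form least-squares solution for a single weight without bias.
w_star = (x * y).sum() / (x * x).sum()

print(w.item(), w_star.item())
```

The two printed numbers should agree to several decimal places, which confirms that the loop is minimizing the loss it was given.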