# Automatic differentiation and gradient descent

Why autograd is important and how it works: [Why Jacobians are useful to compute gradients in a NN?](https://suzyahyah.github.io/calculus/machine%20learning/2018/04/04/Jacobian-and-Backpropagation.html)

In [1]:
import torch 

We specify that our tensor requires a gradient, i.e., we want to compute derivatives w.r.t. that tensor:

In [2]:
x = torch.rand(4,requires_grad=True)
print(x)

tensor([0.9014, 0.1748, 0.5335, 0.7465], requires_grad=True)


Let's start doing some computations and see how PyTorch keeps track of them in a [computational graph](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/).

In [3]:
y = x*5
print(y)

tensor([4.5071, 0.8739, 2.6675, 3.7325], grad_fn=<MulBackward0>)


In [4]:
w = y**2
print(w)

tensor([20.3136,  0.7637,  7.1155, 13.9315], grad_fn=<PowBackward0>)


In [5]:
z = w.mean()
print(z)

tensor(10.5311, grad_fn=<MeanBackward0>)


As you can see, each output of a computation has an associated `grad_fn` that records the operation that produced it. Chained together, these nodes form the graph that links every result back to the initial variable.
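To make the graph more tangible, here is a minimal sketch that walks backward from `z` through the `grad_fn` nodes using their `next_functions` attribute. It rebuilds the same computation as the cells above. The loop follows only the first input of each node, which is enough for this simple chain but would not cover a branching graph.

```python
import torch

x = torch.rand(4, requires_grad=True)
z = ((x * 5) ** 2).mean()

# Follow the first input of each node until the chain ends.
node = z.grad_fn
names = []
while node is not None:
    names.append(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None

print(names)
```

The list starts at the last operation (the mean) and moves toward the leaf tensor `x`, i.e. in the order that `backward()` visits the graph.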

So if we want to compute $\frac{\partial z}{\partial x}$, we simply do:

In [6]:
z.backward()
print(x.grad)

tensor([11.2676,  2.1848,  6.6687,  9.3312])


`z.backward()` applies the chain rule backward through the graph as a sequence of *vector-Jacobian products*, and the resulting gradient is stored in the `.grad` attribute of each leaf tensor that requires gradients (here, `x`).
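As a sanity check, the gradient can also be derived by hand: $z = \frac{1}{4}\sum_i (5x_i)^2$, so $\frac{\partial z}{\partial x_i} = 12.5\,x_i$. The sketch below compares that with autograd, then repeats the computation on the non-scalar `w`, where `backward()` needs an explicit vector `v` for the vector-Jacobian product. Choosing `v` as the weights of the mean (0.25 each) is just an illustration that reproduces the same gradient.

```python
import torch

x = torch.rand(4, requires_grad=True)

# Scalar output: backward() needs no arguments.
z = ((x * 5) ** 2).mean()
z.backward()

# Analytic gradient: dz/dx_i = 12.5 * x_i.
expected = 12.5 * x.detach()
scalar_matches = torch.allclose(x.grad, expected)
print(scalar_matches)

# Non-scalar output: pass the vector v of the vector-Jacobian product.
x.grad.zero_()  # clear the gradient stored by the first backward() call
w = (x * 5) ** 2
v = torch.full_like(w, 0.25)  # weights of the mean, so v^T J equals dz/dx
w.backward(gradient=v)
vector_matches = torch.allclose(x.grad, expected)
print(vector_matches)
```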

## How to prevent gradient computation

There are 3 main ways:

| Syntax      | Description |
| ----------- | ----------- |
| `with torch.no_grad():`  | Context manager that disables gradient tracking for the computations inside it |
| `x.requires_grad_(False)` | Sets the `requires_grad` attribute of `x` to `False` in place |
| `x.detach()`  | Returns a new tensor, detached from the graph, that does not require gradients |


In [7]:
x = torch.rand(5,requires_grad=True)
print(x)
x.requires_grad_(False)
print(x)

tensor([0.0090, 0.0596, 0.9173, 0.0520, 0.7266], requires_grad=True)
tensor([0.0090, 0.0596, 0.9173, 0.0520, 0.7266])


In [8]:
x = torch.rand(5,requires_grad=True)
print(x)
y = x.detach()
print(y)

tensor([0.7471, 0.1520, 0.9719, 0.5339, 0.8926], requires_grad=True)
tensor([0.7471, 0.1520, 0.9719, 0.5339, 0.8926])


In [9]:
with torch.no_grad():
    y = x**2 + 2
    print(y)

tensor([2.5581, 2.0231, 2.9447, 2.2851, 2.7967])
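The three options differ in what they affect. The following sketch puts them side by side and inspects the `requires_grad` flag after each one. The variable names (`d`, `y_tracked`, and so on) are only illustrative.

```python
import torch

x = torch.rand(5, requires_grad=True)

# 1. no_grad: results computed inside the block are not tracked.
with torch.no_grad():
    y = x ** 2 + 2
y_tracked = y.requires_grad        # False: y has no grad_fn
x_still_tracked = x.requires_grad  # True: x itself is unchanged

# 2. detach: a new tensor, cut off from the graph, sharing memory with x.
d = x.detach()
d_tracked = d.requires_grad                   # False
shares_memory = d.data_ptr() == x.data_ptr()  # True: same underlying storage

# 3. requires_grad_(False): flips the flag on x itself, in place.
x.requires_grad_(False)
x_tracked_after = x.requires_grad  # False

print(y_tracked, x_still_tracked, d_tracked, shares_memory, x_tracked_after)
```

Because `detach()` shares storage with the original tensor, in-place changes to `d` would also change `x`.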


One thing to be careful about is that `.grad` accumulates: each call to `backward()` adds the new gradients to whatever is already stored. In most training loops we therefore need to reset the gradients to zero after each iteration.

Dummy example:

In [15]:
W = torch.ones(3, requires_grad=True)

for epoch in range(4):
    out = (W*2).sum()

    out.backward()

    print(W.grad)

tensor([2., 2., 2.])
tensor([4., 4., 4.])
tensor([6., 6., 6.])
tensor([8., 8., 8.])


As you can see, each iteration adds its gradient to the values stored by the previous iterations. How do we solve this?

In [17]:
W = torch.ones(3, requires_grad=True)

for epoch in range(4):
    out = (W*2).sum()

    out.backward()

    print(W.grad)

    W.grad.zero_()

tensor([2., 2., 2.])
tensor([2., 2., 2.])
tensor([2., 2., 2.])
tensor([2., 2., 2.])


Here `.grad.zero_()` sets the gradients to zero in place at the end of each epoch. Later we will talk about optimizers, which have a built-in method (`zero_grad()`) that serves the same purpose.
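As a preview, here is a minimal sketch of the same loop using an optimizer. `torch.optim.SGD` and the learning rate `lr=0.1` are arbitrary choices made for illustration, not something fixed by the example above.

```python
import torch

W = torch.ones(3, requires_grad=True)
optimizer = torch.optim.SGD([W], lr=0.1)  # lr=0.1 is an assumed value

grads = []
for epoch in range(4):
    out = (W * 2).sum()
    out.backward()
    grads.append(W.grad.clone())  # copy before the gradient is cleared
    print(W.grad)
    optimizer.step()       # update W using the current gradient
    optimizer.zero_grad()  # clear the gradient before the next iteration
```

Each iteration should print `tensor([2., 2., 2.])`, as in the manual version. Depending on the PyTorch version, `zero_grad()` either fills `.grad` with zeros or resets it to `None`; in both cases the next `backward()` call starts from a clean state.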

## Dummy example: Perceptron with one input, one neuron and one output for one epoch

In [19]:
input = torch.tensor(2.0)
output = torch.tensor(10.0)

W = torch.tensor(1.0, requires_grad=True) # this requires grad since it is our learnable param
pred = W*input
loss = (pred - output)**2 # squared error (MSE for a single sample)
print(loss)
loss.backward() # backprop

tensor(64., grad_fn=<PowBackward0>)


Try computing the gradient by hand and compare it with the value stored in `W.grad`!
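One way to check your answer: by the chain rule, $\frac{\partial\,\text{loss}}{\partial W} = 2\,(W \cdot \text{input} - \text{output}) \cdot \text{input} = 2\,(2 - 10) \cdot 2 = -32$. The sketch below compares this with autograd and then takes a single gradient descent step by hand. The names `x` and `target` replace `input` and `output` to avoid shadowing Python's built-in `input`, and the learning rate `lr = 0.01` is an assumed value.

```python
import torch

x = torch.tensor(2.0)        # the input
target = torch.tensor(10.0)  # the desired output

W = torch.tensor(1.0, requires_grad=True)
loss = (W * x - target) ** 2
loss.backward()

# By hand: d(loss)/dW = 2 * (W * x - target) * x
manual_grad = 2 * (W.detach() * x - target) * x
print(W.grad, manual_grad)

# One gradient descent step; lr is an assumed learning rate.
lr = 0.01
with torch.no_grad():  # the update itself must not be tracked
    W -= lr * W.grad
print(W)
```

The update moves `W` from 1.0 toward larger values, which is the right direction here, since the prediction (2) was below the target (10).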