# Gradient calculation with autograd

In [1]:
import torch

Create a simple tensor and set its `requires_grad` argument to `True`. Autograd then tracks every operation applied to the tensor, so that the gradient of any function built from it can be computed with respect to this tensor

In [2]:
x = torch.randn(3, requires_grad=True)
print(x)

tensor([0.4051, 0.5039, 1.2624], requires_grad=True)


Create a `y` variable that depends on `x`

In [3]:
y = x + 2
print(y)

tensor([2.4051, 2.5039, 3.2624], grad_fn=<AddBackward0>)


Notice the `grad_fn` attribute
> Each variable has a `.grad_fn` attribute that references a function that has created a function (except for Tensors created by the user - these have `None` as `.grad_fn`) - [source](https://pytorch.org/tutorials/beginner/former_torchies/autograd_tutorial.html)
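
As a minimal sketch of the statement quoted above, the snippet below compares a tensor created directly by the user with one produced by an operation. The names `a` and `b` are illustrative only and do not appear elsewhere in this notebook.

```python
import torch

# Tensor created directly by the user: a leaf of the graph, no grad_fn
a = torch.ones(3, requires_grad=True)

# Tensor produced by an operation on `a`: grad_fn records that operation
b = a + 2

print(a.grad_fn)             # None
print(b.grad_fn)             # the backward node of the addition
print(a.is_leaf, b.is_leaf)  # True False
```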

Create a `z` variable that depends on `y`

In [4]:
z = (y**2*2).mean()
print(z)

tensor(15.1317, grad_fn=<MeanBackward0>)


Compute the gradient of `z` with respect to `x`. 

In [5]:
z.backward()  # dz/dx

Look at the gradient of `z` with respect to `x`

In [6]:
print(x.grad)

tensor([3.2068, 3.3386, 4.3499])
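
The printed values can be checked by hand. Since $z = \frac{1}{n}\sum_i 2(x_i + 2)^2$ with $n = 3$, the derivative is $\partial z / \partial x_i = 4(x_i + 2)/n$. The sketch below compares this formula with the autograd result; it assumes the rounded values of `x` printed earlier, so the last digit may differ slightly from the output above.

```python
import torch

# Values of x as printed above (rounded to 4 decimals)
x = torch.tensor([0.4051, 0.5039, 1.2624], requires_grad=True)

z = ((x + 2) ** 2 * 2).mean()
z.backward()

# Analytic gradient: dz/dx_i = 4 * (x_i + 2) / n
analytic = 4 * (x.detach() + 2) / x.numel()

print(x.grad)
print(analytic)
```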


Notice that `.backward()` can be called without arguments only when `z` is a scalar. If the output is a vector, you have to pass a vector $v$ with the same shape as the output (the `gradient` argument of `.backward()`), which weights each component of the output

In [7]:
z2 = y**2*2
print(z2)

tensor([11.5690, 12.5394, 21.2867], grad_fn=<MulBackward0>)


In [8]:
# v = [1, 1, 1]; the new gradient is added to the x.grad left by z.backward() above
z2.backward(torch.tensor([1.0, 1.0, 1.0]))
print(x.grad)

tensor([12.8272, 13.3544, 17.3996])


Indeed, in the background `.backward()` computes a vector-Jacobian product ($J^T v$), where $J$ is the Jacobian of the output with respect to `x`
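
To see which product `.backward()` actually forms, the sketch below builds the full Jacobian with `torch.autograd.functional.jacobian` and compares $J^T v$ with `x.grad`. The function `f` is a made-up example mapping 3 inputs to 2 outputs, chosen so that $J$ is not square and the shapes make the order of the product unambiguous.

```python
import torch
from torch.autograd.functional import jacobian

# Hypothetical function from R^3 to R^2 (not part of the notebook above)
def f(x):
    return torch.stack([x[0] * x[1], x[1] + x[2]])

x = torch.randn(3, requires_grad=True)
v = torch.tensor([1.0, 2.0])   # one weight per output component

out = f(x)
out.backward(v)                # fills x.grad

J = jacobian(f, x.detach())    # shape (2, 3): J[i, j] = d out_i / d x_j
vjp = J.T @ v                  # shape (3,), same as x

print(x.grad)
print(vjp)
```

Because `v` has one entry per output and `x.grad` has one entry per input, only $J^T v$ has the right shape here.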

## Stop gradient computation and tracking history on computation graph
It might be useful, in some cases, to exclude some operations from gradient computation. To do so, you can use:
- `x.requires_grad_(False)` : this method sets `requires_grad` of `x` to `False` (the trailing underscore in the method name means that the tensor is modified in place)
- `x.detach()` : this method returns a new tensor with `requires_grad` set to `False`; notice, however, that the new tensor shares its storage with the original one, so an in-place modification of either tensor is visible in the other
- `with torch.no_grad():` : this is a context manager; operations performed within it are not recorded in the computation graph, so their results have `requires_grad` set to `False` (the input tensors keep their own `requires_grad` flag)

In [9]:
x = torch.randn(3, requires_grad=True)
x.requires_grad_(False)
print(x)

tensor([ 0.5404,  1.3869, -1.6816])


In [10]:
x = torch.randn(3, requires_grad=True)
x_new = x.detach()
print(x_new)

tensor([-2.5733,  0.7729,  1.8764])


In [11]:
x = torch.randn(3, requires_grad=True)
with torch.no_grad():
    y = x*2
    print(y)

tensor([-5.6967e-01, -5.9780e-04, -3.7614e+00])
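
The differences between the three approaches are easier to see side by side. The following sketch, with illustrative variable names `d` and `flag_inside`, checks that a detached tensor shares storage with its source, that `torch.no_grad()` leaves the flag of `x` untouched, and that `requires_grad_(False)` changes `x` itself.

```python
import torch

x = torch.ones(3, requires_grad=True)

# detach(): new tensor without gradient tracking, same underlying storage
d = x.detach()
d[0] = 5.0                 # in-place write on the detached tensor
print(x)                   # the first element of x has changed too

# no_grad(): the result is not tracked, x keeps its own flag
with torch.no_grad():
    y = x * 2
    flag_inside = x.requires_grad
print(y.requires_grad, flag_inside)

# requires_grad_(False): flips the flag on x itself, in place
x.requires_grad_(False)
print(x.requires_grad)
```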


## Gradient accumulation using `.backward()`
Gradients should be reset at each iteration when the computation involves loops, since PyTorch accumulates (sums) gradients into `.grad` every time `.backward()` is called

In [12]:
# initialize x
x = torch.ones(4, requires_grad=True)

# create a loop: at each loop look at the value of x.grad
for epoch in range(3):

    # define a variable y that depends on x
    y = (x*3).sum()

    # compute the gradients wrt x
    y.backward()

    # print the gradient computed wrt x
    print(x.grad)

tensor([3., 3., 3., 3.])
tensor([6., 6., 6., 6.])
tensor([9., 9., 9., 9.])


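After the first iteration, `x.grad` no longer holds the gradient of the current `y`: it holds the sum of the gradients from all iterations so far. To fix this, reset it with `.grad.zero_()` at the end of each iteration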

In [13]:
# initialize x
x = torch.ones(4, requires_grad=True)

# create a loop: at each loop look at the value of x.grad
for epoch in range(3):

    # define a variable y that depends on x
    y = (x*3).sum()

    # compute the gradients wrt x
    y.backward()

    # print the gradient computed wrt x
    print(x.grad)

    # reset the gradients
    x.grad.zero_()

tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])


This is particularly useful when optimizing the weights of a neural network, i.e., the parameters with respect to which the gradient of the loss function is computed

In [15]:
# initialize x
x = torch.ones(4, requires_grad=True)
print(f'x: {x}', '\n')

# initialize the optimizer, defining the parameters to optimize and setting the learning rate
params, lr = [x], 0.01  # SGD expects an iterable of tensors (or of parameter-group dicts), not a bare tensor
optimizer = torch.optim.SGD(params, lr=lr)

# create a loop: at each loop look at the value of x.grad
for epoch in range(3):
    print(f'epoch {epoch+1}')

    # define a variable y that depends on x
    y = (x*3).sum()

    # compute the gradients wrt x
    y.backward()

    # print the gradient computed wrt x
    print(f'x.grad: {x.grad}')
    x_old = x.clone()

    # launch an optimizer step: it updates x in place, moving it against the gradient to decrease y
    optimizer.step()

    print(f'x_new[i] = x_old[i] - (x.grad[i] * lr) = {x_old[0]} - ({x.grad[0]} * {lr}) = {x_old[0] - (x.grad[0] * lr)}')
    print(f'x: {x}', '\n')

    # reset the gradients
    x.grad.zero_()

x: tensor([1., 1., 1., 1.], requires_grad=True) 

epoch 1
x.grad: tensor([3., 3., 3., 3.])
x_new[i] = x_old[i] - (x.grad[i] * lr) = 1.0 - (3.0 * 0.01) = 0.9700000286102295
x: tensor([0.9700, 0.9700, 0.9700, 0.9700], requires_grad=True) 

epoch 2
x.grad: tensor([3., 3., 3., 3.])
x_new[i] = x_old[i] - (x.grad[i] * lr) = 0.9700000286102295 - (3.0 * 0.01) = 0.940000057220459
x: tensor([0.9400, 0.9400, 0.9400, 0.9400], requires_grad=True) 

epoch 3
x.grad: tensor([3., 3., 3., 3.])
x_new[i] = x_old[i] - (x.grad[i] * lr) = 0.940000057220459 - (3.0 * 0.01) = 0.9100000858306885
x: tensor([0.9100, 0.9100, 0.9100, 0.9100], requires_grad=True) 
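
The same loop can also be written with `optimizer.zero_grad()`, which resets the gradients of every tensor handed to the optimizer instead of resetting each `.grad` by hand. The sketch below additionally checks each step against the plain SGD update rule; the name `expected` is illustrative only. Depending on the PyTorch version, `zero_grad()` either fills the gradients with zeros or sets them to `None`.

```python
import torch

lr = 0.01
x = torch.ones(4, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=lr)

for epoch in range(3):
    y = (x * 3).sum()
    y.backward()

    # Result of a plain SGD step computed by hand: x - lr * grad
    expected = x.detach() - lr * x.grad

    optimizer.step()
    assert torch.allclose(x.detach(), expected)

    # Reset the gradients of all tensors managed by the optimizer
    optimizer.zero_grad()

print(x)
```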

