# Gradient Calculation with Autograd

Importing `pytorch`.

In [16]:
import torch

Today we are talking about the `autograd` package which comes with `pytorch` and how we can calculate gradients with it.

Gradients are essential for model optimization, so it is important to understand how to compute them.

Now, let's create a tensor object that we will use to demonstrate `autograd`.

In [17]:
x = torch.randn(3)
print(x)

tensor([-0.9524, -0.7465,  0.4141])


If we later want to calculate the gradient of some function with respect to `x`, we have to pass the argument `requires_grad=True` when creating the tensor.

In [18]:
x = torch.randn(3, requires_grad = True)
print(x)

tensor([-0.0742, -0.8322,  0.7960], requires_grad=True)


Now, when we do operations with this tensor, `pytorch` will create a so-called computational graph for us. So, let's try an operation:

In [19]:
x = torch.randn(3, requires_grad = True)
print(x)

y = x + 2
print(y)

tensor([-1.9064, -0.4724,  0.4434], requires_grad=True)
tensor([0.0936, 1.5276, 2.4434], grad_fn=<AddBackward0>)


As we can see, `y` has the attribute `grad_fn`. It points to a gradient function, which is `AddBackward0` in this case. With this function, we can calculate the gradient of `y` with respect to `x`.
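As a minimal sketch of the difference between a tensor we create ourselves and the result of an operation, the snippet below inspects the `is_leaf` and `grad_fn` attributes. The variable names are only illustrative.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x + 2

# x was created directly by us, so it is a leaf of the graph and has no grad_fn
print(x.is_leaf, x.grad_fn)

# y is the result of an operation on x, so it records the function that produced it
print(y.is_leaf, y.grad_fn)

# The result of an operation on a tracked tensor is tracked as well
print(y.requires_grad)
```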

Let's see another operation in action. 

In [20]:
x = torch.randn(3, requires_grad = True)
print(x)

y = x + 2
print(y)

z = y*y*2
print(z)

tensor([-0.4800, -1.0948,  0.9835], requires_grad=True)
tensor([1.5200, 0.9052, 2.9835], grad_fn=<AddBackward0>)
tensor([ 4.6205,  1.6389, 17.8025], grad_fn=<MulBackward0>)


`z` also has a `grad_fn` attribute, but this time it is `MulBackward0`.

And, if we apply the `mean()` function, it will change again:

In [21]:
x = torch.randn(3, requires_grad = True)
print(x)

z = x.mean()
print(z)

tensor([ 1.8938,  1.2487, -1.0845], requires_grad=True)
tensor(0.6860, grad_fn=<MeanBackward0>)


Now, when we want to calculate the gradients, we just call the `backward()` function. In this case, it will calculate the gradient of `z` with respect to `x`. After we do this, `x` will have the gradients stored, so we can use `x.grad` to see them.

In [22]:
x = torch.randn(3, requires_grad = True)
print(x)

y = x + 2
print(y)

z = y*y*2
print(z)

z = z.mean()
print(z)

z.backward() #dz/dx

print(x.grad)

tensor([1.4458, 0.9939, 0.4736], requires_grad=True)
tensor([3.4458, 2.9939, 2.4736], grad_fn=<AddBackward0>)
tensor([23.7469, 17.9269, 12.2376], grad_fn=<MulBackward0>)
tensor(17.9705, grad_fn=<MeanBackward0>)
tensor([4.5944, 3.9919, 3.2982])
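To check the result, we can derive the gradient by hand. Since `z` is the mean of `2 * (x + 2)**2` over three elements, each partial derivative is `4 * (x_i + 2) / 3`. For the first element above this gives `4 * 3.4458 / 3 ≈ 4.5944`, which matches the printed gradient. A minimal sketch of the same check in code (the seed is an arbitrary choice, used only for reproducibility):

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only so that repeated runs agree

x = torch.randn(3, requires_grad=True)
z = (2 * (x + 2) ** 2).mean()
z.backward()  # fills x.grad with dz/dx

# Hand-derived gradient: d/dx_i of (1/3) * sum_j 2 * (x_j + 2)^2 = 4 * (x_i + 2) / 3
expected = 4 * (x.detach() + 2) / 3

print(x.grad)
print(expected)

grads_match = torch.allclose(x.grad, expected)
print(grads_match)
```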


Now, let's see what happens if we do not specify `requires_grad=True`.

In [23]:
x = torch.randn(3)
print(x)

y = x + 2
print(y)

z = y*y*2
print(z)

z = z.mean()
print(z)

z.backward() #dz/dx

print(x.grad)

tensor([ 0.9426, -0.6766, -2.6491])
tensor([ 2.9426,  1.3234, -0.6491])
tensor([17.3174,  3.5030,  0.8426])
tensor(7.2210)


RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

As you can see, there was an error after removing that argument. We cannot calculate the gradient without it. 

One thing to note is that, in the background, `backward()` computes a vector-Jacobian product. The Jacobian matrix holds the partial derivatives of the output with respect to the input, and multiplying it by a vector of upstream gradients gives the final gradients we are interested in. This is how the chain rule is applied in practice.

In general, this means the Jacobian has to be multiplied with a vector.
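Written out, and using notation that is not in the original notebook, suppose $y = f(x)$ with $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$, and let $v \in \mathbb{R}^m$ hold the gradients of some scalar $l$ with respect to $y$. The product then looks like this:

$$
J = \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix},
\qquad
\left(J^{\top} v\right)_j = \sum_{i=1}^{m} v_i \, \frac{\partial y_i}{\partial x_j} = \frac{\partial l}{\partial x_j}
$$

When the output is a scalar ($m = 1$), the vector reduces to the single value $1$, which is why no argument is needed in that case.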

In the previous examples, `z` is a single value, so we do not need to multiply it with anything. But, if it is not just a single value, we must take additional steps.

In [26]:
x = torch.randn(3, requires_grad = True)
print(x)

y = x + 2
print(y)

z = y*y*2
print(z)

z.backward() #dz/dx

print(x.grad)

tensor([0.1869, 0.5155, 1.6582], requires_grad=True)
tensor([2.1869, 2.5155, 3.6582], grad_fn=<AddBackward0>)
tensor([ 9.5655, 12.6556, 26.7646], grad_fn=<MulBackward0>)


RuntimeError: grad can be implicitly created only for scalar outputs

As you can see, now that we are not dealing with a scalar, we are getting an error. 

We have to create a vector of the same size as `z` and pass it to `backward()`.

In [28]:
x = torch.randn(3, requires_grad = True)
print(x)

y = x + 2
print(y)

z = y*y*2
print(z)

v = torch.tensor([0.1, 1.0, 0.001], dtype = torch.float32)
z.backward(v) #dz/dx

print(x.grad)

tensor([-0.6494,  1.0740,  0.3425], requires_grad=True)
tensor([1.3506, 3.0740, 2.3425], grad_fn=<AddBackward0>)
tensor([ 3.6485, 18.8986, 10.9749], grad_fn=<MulBackward0>)
tensor([5.4026e-01, 1.2296e+01, 9.3701e-03])
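Because each `z_i` depends only on `x_i` here, the Jacobian is diagonal with entries `4 * (x_i + 2)`, so the result should simply be `v * 4 * (x + 2)`. For the first element above, `0.1 * 4 * 1.3506 ≈ 0.5402`. The sketch below checks this, and also shows that passing `v` to `backward()` gives the same gradients as reducing `z` to a scalar with a weighted sum. The names `x2`, `vjp_matches` and `scalar_matches` are only illustrative.

```python
import torch

x = torch.randn(3, requires_grad=True)
z = 2 * (x + 2) ** 2                      # non-scalar output
v = torch.tensor([0.1, 1.0, 0.001])

z.backward(v)                             # vector-Jacobian product

# Diagonal Jacobian: dz_i/dx_i = 4 * (x_i + 2), so the product is element-wise
expected = v * 4 * (x.detach() + 2)
vjp_matches = torch.allclose(x.grad, expected)
print(vjp_matches)

# Same gradients via a scalar: weight z by v, sum, then call backward() with no argument
x2 = x.detach().clone().requires_grad_(True)
(2 * (x2 + 2) ** 2 * v).sum().backward()
scalar_matches = torch.allclose(x2.grad, x.grad)
print(scalar_matches)
```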


We should know that, in the background, this is a vector-Jacobian product. Typically, the last operation will be something that creates a scalar value, so we do not need an additional vector.

We can prevent `pytorch` from tracking the history and calculating the `grad_fn`. For example, when we update our weights during the training loop, this operation should not be part of our gradient computation. We will discuss this in the future.

To prevent `pytorch` from tracking the gradient, we have three options:

* `x.requires_grad_(False)`
* `x.detach()`
* `with torch.no_grad():`

Option 1:

In [30]:
x = torch.randn(3, requires_grad = True)
print(x)

x.requires_grad_(False)
print(x)

tensor([ 0.1721, -1.7907, -0.2643], requires_grad=True)
tensor([ 0.1721, -1.7907, -0.2643])


Option 2:

In [31]:
x = torch.randn(3, requires_grad = True)
print(x)

y = x.detach()
print(y)

tensor([ 0.1274, -1.3130, -0.7082], requires_grad=True)
tensor([ 0.1274, -1.3130, -0.7082])


This will create a new tensor with the same values (it shares the underlying data with `x`), but it does not require the gradient.
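One consequence worth knowing is that the detached tensor is not a copy. The sketch below shows that an in-place change to the detached tensor is visible through `x` as well, and that adding `.clone()` gives an independent copy. The names `shares_memory` and `independent` are only illustrative.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.detach()

print(y.requires_grad)                    # the detached tensor is not tracked

# Both tensors point at the same underlying storage
shares_memory = x.data_ptr() == y.data_ptr()
print(shares_memory)

# An independent copy, taken before any modification
independent = x.detach().clone()

y[0] = 100.0                              # in-place change on the detached tensor
print(x[0].item())                        # the change shows up in x as well
print(independent[0].item())              # the clone keeps the original value
```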

Option 3:

In [32]:
x = torch.randn(3, requires_grad = True)
print(x)

with torch.no_grad():
    y = x + 2
    print(y)

tensor([-0.6808, -1.6855, -0.9310], requires_grad=True)
tensor([1.3192, 0.3145, 1.0690])
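As a rough preview of the weight update mentioned earlier, the sketch below performs one manual gradient step inside `torch.no_grad()`. The learning rate `lr` and the toy loss are assumptions made only for illustration.

```python
import torch

w = torch.ones(3, requires_grad=True)

loss = (w * 3).sum()
loss.backward()                 # w.grad is now 3 for every element

lr = 0.1                        # hypothetical learning rate
with torch.no_grad():
    w -= lr * w.grad            # in-place update that is not recorded in the graph

print(w)                        # each value is 1 - 0.1 * 3 = 0.7
print(w.requires_grad)          # w is still tracked for later operations
```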


One more very important thing to notice is that whenever we call the `.backward()` function, the gradient for the tensor is accumulated into the `.grad` attribute. The values are summed up.

In [35]:
weights = torch.ones(4, requires_grad = True)

for epoch in range(1):
    model_output = (weights*3).sum()
    
    model_output.backward()
    
    print(weights.grad)

tensor([3., 3., 3., 3.])


Now, if we do another iteration, then the second `.backward()` call will again accumulate the values and write them into the `.grad` attribute.

In [36]:
weights = torch.ones(4, requires_grad = True)

for epoch in range(2):
    model_output = (weights*3).sum()
    
    model_output.backward()
    
    print(weights.grad)

tensor([3., 3., 3., 3.])
tensor([6., 6., 6., 6.])


And, if we do a third iteration:

In [37]:
weights = torch.ones(4, requires_grad = True)

for epoch in range(3):
    model_output = (weights*3).sum()
    
    model_output.backward()
    
    print(weights.grad)

tensor([3., 3., 3., 3.])
tensor([6., 6., 6., 6.])
tensor([9., 9., 9., 9.])


So all the values are summed up and our gradients are incorrect. Before we do the next iteration and optimization step, we must empty the gradient.

In [38]:
weights = torch.ones(4, requires_grad = True)

for epoch in range(3):
    model_output = (weights*3).sum()
    
    model_output.backward()
    
    print(weights.grad)
    
    weights.grad.zero_()

tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])
tensor([3., 3., 3., 3.])


Later, we will work with the `torch.optim` optimization package. With an optimizer, we can do an optimization step, and before we do the next iteration, we must call the `optimizer.zero_grad()` function. Note that the optimizer expects an iterable of tensors, so `weights` is wrapped in a list. It will look something like:

`weights = torch.ones(4, requires_grad = True)`

`optimizer = torch.optim.SGD([weights], lr = 0.01)`
`optimizer.step()`
`optimizer.zero_grad()`
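The calls above can be put together into a small runnable loop. This is only a sketch that reuses the toy `model_output` from the earlier cells; a real training loop would compute a loss from data.

```python
import torch

weights = torch.ones(4, requires_grad=True)

# The optimizer takes an iterable of tensors, hence the list
optimizer = torch.optim.SGD([weights], lr=0.01)

for epoch in range(3):
    model_output = (weights * 3).sum()
    model_output.backward()     # gradient is 3 for every weight

    optimizer.step()            # weights <- weights - lr * weights.grad
    optimizer.zero_grad()       # clear the gradients before the next iteration

print(weights)                  # each value is 1 - 3 * 0.01 * 3 = 0.91
```

Depending on the PyTorch version, `zero_grad()` either fills `weights.grad` with zeros or resets it to `None`. In both cases the next `backward()` call starts from a clean state.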

We will discuss optimization in more detail later. For now, we must remember three things: when we want to calculate gradients, we have to specify `requires_grad = True`; we can calculate the gradients by calling the `.backward()` function; and before the next iteration of the optimization loop, we must empty the gradient by calling `.grad.zero_()` (or `optimizer.zero_grad()` when using an optimizer).