# Gradient Computation
The autograd package provides automatic differentiation for all operations on Tensors.

## Gradient

In [1]:
import torch

Setting `requires_grad=True` tells autograd to track all operations on the tensor.

In [2]:
x = torch.randn(3, requires_grad=True)
print(x)

tensor([ 1.3567,  0.8755, -0.2205], requires_grad=True)


`y` was created as the result of an operation, so it has a `grad_fn` attribute.  
`grad_fn` references the Function that created the tensor.

In [3]:
y = x + 2
print(y)

tensor([3.3567, 2.8755, 1.7795], grad_fn=<AddBackward0>)
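
As a small illustrative sketch (not part of the original notebook), the difference between a user-created tensor and the result of an operation can be checked directly. The variable names mirror the cells above; `is_leaf` is a standard tensor attribute.

```python
import torch

x = torch.randn(3, requires_grad=True)  # created directly by the user: a leaf tensor
y = x + 2                               # created by an operation on x

print(x.is_leaf, x.grad_fn)  # True None: leaf tensors have no grad_fn
print(y.is_leaf)             # False: y is the result of an operation
print(y.grad_fn)             # the backward Function recorded for the addition
```

Only leaf tensors with `requires_grad=True` get their `.grad` attribute populated by `backward()`.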


<img src="images/1.jpg" width=600>

In [4]:
z = y * y * 2
print(z)

tensor([22.5349, 16.5370,  6.3329], grad_fn=<MulBackward0>)


In [5]:
z = z.mean()
print(z)

tensor(15.1349, grad_fn=<MeanBackward0>)


Let's compute the gradients with backpropagation. Once the computation is finished, we can call `.backward()` and have all the gradients computed automatically. The gradient for a tensor is accumulated into its `.grad` attribute. It is the partial derivative of the function w.r.t. the tensor.

In [6]:
z.backward() # dz/dx (Will not work if requires_grad=False)
print(x.grad)

tensor([4.4756, 3.8340, 2.3726])


Generally speaking, `torch.autograd` is an engine for computing vector-Jacobian products. It computes partial derivatives by applying the chain rule.
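
To make the chain rule concrete, the gradient from the cells above can be compared against a hand-derived formula. This is a minimal sketch, and the name `expected` is introduced here for illustration. Since `z` is the mean of `2 * (x + 2)^2` over three elements, each partial derivative is `4 * (x_i + 2) / 3`.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x + 2
z = (y * y * 2).mean()
z.backward()

# Hand-derived: z = (1/3) * sum_i 2 * (x_i + 2)^2  =>  dz/dx_i = 4 * (x_i + 2) / 3
expected = 4 * (x.detach() + 2) / 3

print(x.grad)
print(expected)
print(torch.allclose(x.grad, expected))  # the two should agree up to float rounding
```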

### Model with non-scalar output:
If a tensor is non-scalar (it has more than one element), `backward()` needs a `gradient` argument: a tensor with the same shape as the output. This tensor is the vector in the vector-Jacobian product.

In [7]:
x = torch.randn(3, requires_grad=True)
y = x + 2
z = y * y * 2
print(z)
try:
    z.backward()  # fails: z is not a scalar
except RuntimeError as e:
    print("Error: ", e)

# Pass a vector of matching shape to compute the vector-Jacobian product
v = torch.tensor([0.1, 1.0, 0.001], dtype=torch.float)
z.backward(v)
print(x.grad)

tensor([7.1237, 9.5409, 8.9240], grad_fn=<MulBackward0>)
Error:  grad can be implicitly created only for scalar outputs
tensor([7.5492e-01, 8.7365e+00, 8.4494e-03])
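
The following sketch checks what `backward(v)` computes by building the full Jacobian with `torch.autograd.functional.jacobian` and multiplying it by `v` explicitly. The helper name `f` is introduced here for illustration and is not part of the original notebook.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Same computation as the cell above
    y = x + 2
    return y * y * 2

x = torch.randn(3, requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.001], dtype=torch.float)

f(x).backward(v)             # accumulates J^T v into x.grad
J = jacobian(f, x.detach())  # full 3x3 Jacobian; diagonal for this elementwise function

print(x.grad)
print(J.T @ v)
print(torch.allclose(x.grad, J.T @ v))
```

Autograd never materializes `J` during `backward()`; computing it here is only for comparison.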


## Prevent Gradient
Sometimes a tensor should stop tracking history. For example, when we update the weights during the training loop, the update operation should not be part of the gradient computation. There are three ways to do this:

- call `x.requires_grad_(False)`
- call `x.detach()`
- wrap the code in `with torch.no_grad():`

In [8]:


x = torch.randn(3, requires_grad=True)
print(x)
x.requires_grad_(False)
print(x)

tensor([-1.1390,  0.0124, -0.4944], requires_grad=True)
tensor([-1.1390,  0.0124, -0.4944])


In [9]:
x = torch.randn(3, requires_grad=True)
print(x)
x = x.detach()
print(x)

tensor([-0.5216, -0.3057,  1.9626], requires_grad=True)
tensor([-0.5216, -0.3057,  1.9626])


In [10]:
x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    y = x * 2
    print(y) # No gradients are computed for y

tensor([ 3.7750,  1.7014, -0.7601])
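
As a hedged sketch of the training-loop case mentioned at the start of this section, a manual gradient-descent update can be wrapped in `torch.no_grad()` so that the update is not recorded. The names `w` and `lr` and the toy loss are assumptions chosen for illustration.

```python
import torch

w = torch.ones(3, requires_grad=True)
lr = 0.1  # hypothetical learning rate

loss = (w * 3).sum()
loss.backward()       # w.grad is now [3., 3., 3.]

with torch.no_grad():
    w -= lr * w.grad  # in-place update, not tracked by autograd
w.grad.zero_()        # reset the gradient for the next step

print(w)              # each entry is 1 - 0.1 * 3, i.e. about 0.7
print(w.requires_grad, w.is_leaf)  # True True: w is still a trainable leaf tensor
```

Updating in place keeps `w` the same leaf tensor, so it can be reused in the next iteration.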


## Gradient Accumulates While Training
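`backward()` accumulates the gradient for a tensor into its `.grad` attribute, adding to whatever is already stored there.  
This matters during optimization: gradients left over from earlier steps would otherwise be summed into the next update.  
Call `.grad.zero_()` to reset the gradients before a new optimization step!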


In [11]:
weights = torch.ones(3, requires_grad=True)
for epoch in range(3):
    model_output = (weights * 3).sum()
    model_output.backward()
    print(weights.grad)

tensor([3., 3., 3.])
tensor([6., 6., 6.])
tensor([9., 9., 9.])


In [12]:
weights = torch.ones(3, requires_grad=True)
for epoch in range(3):
    model_output = (weights * 3).sum()
    model_output.backward()
    print(weights.grad)
    weights.grad.zero_()

tensor([3., 3., 3.])
tensor([3., 3., 3.])
tensor([3., 3., 3.])


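Optimizers provide a `zero_grad()` method that resets the gradients of all parameters they manage:  
`optimizer = torch.optim.SGD([weights], lr=0.1)`  
During training, call it after every update step:  
`optimizer.step()`  
`optimizer.zero_grad()`  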
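
The optimizer calls above can be put together into a small runnable loop. This is a minimal sketch that reuses the toy `weights` and `model_output` from the earlier cells, not a full training setup.

```python
import torch

weights = torch.ones(3, requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.1)

for epoch in range(3):
    model_output = (weights * 3).sum()
    model_output.backward()  # gradient is [3., 3., 3.] on every iteration
    optimizer.step()         # weights <- weights - lr * grad
    optimizer.zero_grad()    # reset gradients so they do not accumulate

# Each step subtracts 0.1 * 3 = 0.3, so the weights end close to 0.1
print(weights)
```

Depending on the PyTorch version, `zero_grad()` either sets `.grad` to `None` or fills it with zeros; both prevent accumulation across steps.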