# Motivation

Gradients and optimization are essential to training neural networks, and PyTorch makes both quite convenient to work with. That is why, in this notebook, I want to discuss AutoGrad and backpropagation.

# 1 AutoGrad

AutoGrad is PyTorch's automatic differentiation engine. In the previous notebook, we looked at an example with gradient descent. Here I want to discuss the topic in a more concrete way.

### 1.1 Intro

Suppose we have an initial vector $x$ and build other vectors from it, such as $y = x + 2$ and $z = 2y^2$, where every operation is applied elementwise. We can then calculate the gradient of the final vector with respect to the initial one. In our case, the chain rule gives:

$$\frac{\partial{z}}{\partial{x}} = \frac{\partial{z}}{\partial{y}} \frac{\partial{y}}{\partial{x}} = 4y \cdot 1 = 4y.$$

By default, backward() computes gradients of a scalar output. This is usually sufficient, since in neural networks we want to minimize a loss such as MSE, which is typically the average of the individual losses. Let's see what this looks like in PyTorch.

In [83]:
import torch
import numpy as np

In [84]:
x = torch.randn(3, requires_grad=True)
x

tensor([ 0.8894, -0.1854, -1.1037], requires_grad=True)

In [85]:
y = x + 2
y

tensor([2.8894, 1.8146, 0.8963], grad_fn=<AddBackward0>)

In [86]:
z = (2 * y ** 2).mean()  # backward() without arguments requires a scalar output
z.backward()

In [87]:
x.grad

tensor([3.8525, 2.4194, 1.1950])

In [88]:
4 * y / 3
# analytic gradient: dz/dx_i = 4 * y_i / 3 (the factor 1/3 comes from the mean over 3 elements)
# it matches x.grad, so the gradient was calculated correctly

tensor([3.8525, 2.4194, 1.1950], grad_fn=<DivBackward0>)
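The scalar requirement can be worked around when the output is a vector. The sketch below is only an illustration: it uses a fixed input instead of torch.randn so that the result is reproducible, and it passes a tensor of ones as the gradient argument of backward(), which is equivalent to differentiating z.sum().

```python
import torch

# Fixed input (an assumption for reproducibility; the cells above use torch.randn).
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x + 2
z = 2 * y ** 2  # vector output, not a scalar

# For a non-scalar z, backward() needs a tensor of the same shape as z.
# Passing ones gives the gradient of z.sum() with respect to x.
z.backward(gradient=torch.ones_like(z))

print(x.grad)                                  # elementwise 4 * y
print(torch.allclose(x.grad, 4 * y.detach()))  # compare with the chain-rule result
```

Without the mean there is no factor of 1/3, so the result matches the formula $4y$ from the chain rule directly.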

### 1.2 One clarification

Suppose we want to create a new vector from another vector (for example, to test something), but we do not need its gradients. In that case we can save time and memory by wrapping the computation in the torch.no_grad() context manager.

In [96]:
x = torch.tensor([1.0,2.0,3.0], requires_grad=True)
y = x * 5
y

tensor([ 5., 10., 15.], grad_fn=<MulBackward0>)

In [97]:
x = torch.tensor([1.0,2.0,3.0], requires_grad=True)
with torch.no_grad():
    y = x * 5
y
# y has no grad_fn: the operation was not tracked, so no gradient can flow through y

tensor([ 5., 10., 15.])
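Two related ways to skip gradient tracking are sketched below. torch.no_grad() can also be applied as a decorator to a whole function, and detach() returns a tensor that is cut off from the graph. The helper name scale and its factor parameter are hypothetical and exist only for this illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

@torch.no_grad()  # decorator form: the whole function body runs without tracking
def scale(t, factor=5.0):  # hypothetical helper, for illustration only
    return t * factor

y_tracked = x * 5                # part of the computation graph
y_untracked = scale(x)           # same values, but no graph is recorded
y_detached = y_tracked.detach()  # shares data with y_tracked, cut off from the graph

print(y_tracked.requires_grad)    # True
print(y_untracked.requires_grad)  # False
print(y_detached.requires_grad)   # False
```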

# 2 Backpropagation

Basically, backpropagation is just the chain rule (I provided an example in section 1.1). We use local gradients to compute the gradient of the loss with respect to our parameters. There are two steps:

1) Forward pass. In this step, the neural network makes its best guess about the correct output, based on what it has already learned. We also calculate the loss, which is the quantity we want to minimize.

2) Backpropagation. Here we compute the gradient of the loss with respect to each parameter and use it to adjust the parameters.

Again, let's see how this is done with the simplest possible example: $y = 2x$.

In [112]:
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
m = torch.tensor(1.0, requires_grad=True) # our initial guess

### 2.1 Forward pass

In [113]:
y_pred = m * x
loss = ((y_pred - y) ** 2).mean() # calculate MSE
loss

tensor(4.6667, grad_fn=<MeanBackward0>)

### 2.2 Backpropagation and parameter update

In [114]:
loss.backward() # calculate gradient
m.grad

tensor(-9.3333)
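To connect this number to the chain rule, the gradient can also be assembled by hand from the two local gradients: the derivative of the loss with respect to each prediction, and the derivative of each prediction with respect to m. This is a minimal sketch with the same data; the intermediate names dloss_dpred, dpred_dm, and manual_grad are my own.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
m = torch.tensor(1.0, requires_grad=True)

# Autograd result
loss = ((m * x - y) ** 2).mean()
loss.backward()

# Manual chain rule with local gradients (no tracking needed here)
with torch.no_grad():
    y_pred = m * x
    dloss_dpred = 2 * (y_pred - y) / x.numel()    # d(loss) / d(y_pred_i)
    dpred_dm = x                                  # d(y_pred_i) / d(m)
    manual_grad = (dloss_dpred * dpred_dm).sum()  # sum over all paths to m

print(m.grad, manual_grad)
print(torch.allclose(m.grad, manual_grad))
```

Both values equal $-28/3 \approx -9.3333$, the same number autograd reports.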

In [115]:
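learning_rate = .01
with torch.no_grad():  # the update itself should not be tracked by autograd
    m -= learning_rate * m.grad  # in-place, so m stays a leaf tensor
m
# m moved from 1.0 toward the true slope 2.0, so we are moving in the right direction

tensor(1.0933, requires_grad=True)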
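In practice the forward pass, backward pass, and update are repeated many times. The loop below is a minimal sketch under a few assumptions: the number of iterations n_steps = 100 is arbitrary, the update is done in place under torch.no_grad() so that m remains a leaf tensor, and the gradient is reset after every step because backward() accumulates into m.grad.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
m = torch.tensor(1.0, requires_grad=True)  # initial guess
learning_rate = 0.01
n_steps = 100  # assumed value, chosen only for illustration

for step in range(n_steps):
    loss = ((m * x - y) ** 2).mean()  # forward pass: MSE loss
    loss.backward()                   # backward pass: fills m.grad
    with torch.no_grad():
        m -= learning_rate * m.grad   # gradient descent update, not tracked
    m.grad.zero_()                    # reset, since gradients accumulate

print(m.item())  # close to the true slope 2.0
```

For real models the manual update is usually replaced by an optimizer from torch.optim, but the sequence of steps stays the same.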

# References

1) https://www.youtube.com/watch?v=c36lUUr864M&t=2181s - a great PyTorch course for beginners

2) https://pytorch.org/docs/stable/index.html - PyTorch documentation.