# Gradient Descent with Autograd & Backpropagation

In this tutorial, we will work through a concrete example of how we optimize a model with automatic gradient computation using the `pytorch` `autograd` package. There will be 4 steps.

**Step 1**:

* Prediction: Manually
* Gradients Computation: Manually
* Loss Computation: Manually
* Parameters Updates: Manually

**Step 2**:

* Prediction: Manually
* Gradients Computation: `Autograd`
* Loss Computation: Manually
* Parameters Updates: Manually

**Step 3**:

* Prediction: Manually
* Gradients Computation: `Autograd`
* Loss Computation: `Pytorch Loss`
* Parameters Updates: `Pytorch Optimizer`

**Step 4**:

* Prediction: `Pytorch Model`
* Gradients Computation: `Autograd`
* Loss Computation: `Pytorch Loss`
* Parameters Updates: `Pytorch Optimizer`

Once we understand the manual process, we will move on to using `pytorch` to do everything for us. This tutorial will cover steps 1 and 2. In the next video, we will perform steps 3 and 4.

First, we will import `numpy`.

In [1]:
import numpy as np

We are using linear regression, so our model is a weight *w* multiplied by an input *x*.

In [2]:
# f = w * x

# the true weight is 2, so f = 2 * x

#making training samples
X = np.array([1,2,3,4], dtype = np.float32)
y = np.array([2,4,6,8], dtype = np.float32)

#initializing the weight at 0
w = 0

#inspect the training samples
print(X)
print(y)

[1. 2. 3. 4.]
[2. 4. 6. 8.]


Now, we will make our model predictions. We will define a function called `forward`, following `pytorch` conventions.

In [3]:
def forward(x):
    return w * x

Now, we will define our loss function. The loss depends on the targets `y` and the model output `y_predicted`. For linear regression we use the mean squared error (MSE).

In [4]:
def loss(y, y_predicted):
    return ((y_predicted - y)**2).mean()

Now we have to calculate the gradient of the loss with respect to our parameters. The MSE formula is:

$$ J = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2 $$

The derivative of this function with respect to w is:

$$ \frac{dJ}{dw} = \frac{1}{N} \sum_{i=1}^{N} 2 x_i \cdot (w \cdot x_i - y_i) $$

In [5]:
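def gradient(x, y, y_predicted):
    # np.dot sums 2*x_i*(y_pred_i - y_i) over all samples and returns a scalar,
    # so the trailing .mean() has no effect: this is the summed gradient,
    # which is N times the derivative of the MSE given above
    return np.dot(2*x, y_predicted - y).mean()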
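
To see how the summed gradient relates to the derivative of the MSE, the following minimal sketch evaluates both at `w = 0` and compares them with a central finite difference of the loss. The names `grad_sum`, `grad_mean`, `grad_fd`, `mse`, and `eps` are made up for this illustration, and `float64` is used here (an assumption, not the dtype of the cells above) so that the finite difference is accurate.

```python
import numpy as np

# same training samples as above, in float64 for an accurate finite difference
X = np.array([1, 2, 3, 4], dtype=np.float64)
y = np.array([2, 4, 6, 8], dtype=np.float64)
w = 0.0
y_pred = w * X

# summed gradient: what np.dot(2*x, y_pred - y) computes
grad_sum = np.dot(2 * X, y_pred - y)

# averaged gradient: the derivative of the MSE, including the 1/N factor
grad_mean = (2 * X * (y_pred - y)).mean()

# central finite difference of the MSE as an independent check
def mse(w):
    return ((w * X - y) ** 2).mean()

eps = 1e-3
grad_fd = (mse(w + eps) - mse(w - eps)) / (2 * eps)

print(grad_sum, grad_mean, grad_fd)
```

At `w = 0` the summed version gives -120 and the averaged version gives -30, and the finite difference agrees with the averaged one. Both point in the same direction; the summed version simply takes steps that are N = 4 times larger for the same learning rate.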

Now, let's print our prediction before the training.

In [6]:
print(f' Prediction before training: f(5) = {forward(5):.3f}')

 Prediction before training: f(5) = 0.000


Now, let's start our training. 

In [9]:
#making training samples
X = np.array([1,2,3,4], dtype = np.float32)
y = np.array([2,4,6,8], dtype = np.float32)

#initializing the weight at 0
w = 0

#learning rate
learning_rate = 0.01
n_iters = 10

#now we do the training loop
for epoch in range(n_iters):
    
    #prediction = forward
    y_pred = forward(X)
    
    #calculate the loss
    l = loss(y, y_pred)
    
    #gradients with respect to w
    dw = gradient(X, y, y_pred)
    
    #update our weights - we go in the negative direction of the gradient
    w -= learning_rate * dw
    
    #printing some information
    if epoch % 1 == 0:
        print(f'epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}, gradient = {dw:.3f}')
        
print(f' Prediction after training: f(5) = {forward(5):.3f}')

epoch 1: w = 1.200, loss = 30.00000000, gradient = -120.000
epoch 2: w = 1.680, loss = 4.79999924, gradient = -48.000
epoch 3: w = 1.872, loss = 0.76800019, gradient = -19.200
epoch 4: w = 1.949, loss = 0.12288000, gradient = -7.680
epoch 5: w = 1.980, loss = 0.01966083, gradient = -3.072
epoch 6: w = 1.992, loss = 0.00314570, gradient = -1.229
epoch 7: w = 1.997, loss = 0.00050332, gradient = -0.492
epoch 8: w = 1.999, loss = 0.00008053, gradient = -0.197
epoch 9: w = 1.999, loss = 0.00001288, gradient = -0.079
epoch 10: w = 2.000, loss = 0.00000206, gradient = -0.031
 Prediction after training: f(5) = 9.999


We see that each training step increases our weight and decreases our loss. After training, the prediction is 9.999, which is close to the expected value of 10, since $5 \cdot 2 = 10$. If we increase the number of iterations in the training loop, we will get even closer to 10.
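
The steady improvement can be explained with a little algebra. Because the targets satisfy `y = 2 * x` exactly, the gradient returned by `gradient` equals `2 * sum(x_i^2) * (w - 2)`, so every update multiplies the distance between `w` and 2 by a constant factor. The sketch below is an illustration under that assumption; `factor` and `w_true` are names introduced here, not part of the tutorial code.

```python
import numpy as np

X = np.array([1, 2, 3, 4], dtype=np.float64)
learning_rate = 0.01
w_true = 2.0  # the weight that generated y = 2 * x

# with the summed gradient used in the training loop, one update gives
#   w_new - w_true = (1 - 2 * learning_rate * sum(x_i^2)) * (w - w_true)
factor = 1 - 2 * learning_rate * np.sum(X ** 2)  # 1 - 0.02 * 30 = 0.4

w = 0.0
for epoch in range(3):
    w = w_true + factor * (w - w_true)
    print(f'epoch {epoch + 1}: w = {w:.3f}')
```

This prints `w = 1.200`, `1.680`, and `1.872`, matching the first three epochs of the training output. The error shrinks by a factor of 0.4 per step, which is why more iterations move the prediction closer to 10.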



In [10]:
#making training samples
X = np.array([1,2,3,4], dtype = np.float32)
y = np.array([2,4,6,8], dtype = np.float32)

#initializing the weight at 0
w = 0

#learning rate
learning_rate = 0.01
n_iters = 20

#now we do the training loop
for epoch in range(n_iters):
    
    #prediction = forward
    y_pred = forward(X)
    
    #calculate the loss
    l = loss(y, y_pred)
    
    #gradients with respect to w
    dw = gradient(X, y, y_pred)
    
    #update our weights - we go in the negative direction of the gradient
    w -= learning_rate * dw
    
    #printing some information
    if epoch % 2 == 0:
        print(f'epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}, gradient = {dw:.3f}')
        
print(f' Prediction after training: f(5) = {forward(5):.3f}')

epoch 1: w = 1.200, loss = 30.00000000, gradient = -120.000
epoch 3: w = 1.872, loss = 0.76800019, gradient = -19.200
epoch 5: w = 1.980, loss = 0.01966083, gradient = -3.072
epoch 7: w = 1.997, loss = 0.00050332, gradient = -0.492
epoch 9: w = 1.999, loss = 0.00001288, gradient = -0.079
epoch 11: w = 2.000, loss = 0.00000033, gradient = -0.013
epoch 13: w = 2.000, loss = 0.00000001, gradient = -0.002
epoch 15: w = 2.000, loss = 0.00000000, gradient = -0.000
epoch 17: w = 2.000, loss = 0.00000000, gradient = -0.000
epoch 19: w = 2.000, loss = 0.00000000, gradient = -0.000
 Prediction after training: f(5) = 10.000


As we can see, by iteration 11 the loss is almost 0 and `w` prints as 2.000, as it should be. By iteration 15, the loss is 0 to eight decimal places.

In this first example we did everything manually. Now, let's replace the gradient calculation with the `autograd` package from `pytorch`. First, we need to import `torch`. 

In [12]:
import torch

Now, `X` and `y` need to become tensors. `w` also needs to become a tensor and we need to set the argument `requires_grad = True` since we need to calculate the gradient/derivative of the loss with respect to `w`.

In [13]:
#making training samples with pytorch
X = torch.tensor([1,2,3,4], dtype = torch.float32)
y = torch.tensor([2,4,6,8], dtype = torch.float32)

#initializing the weight at 0 - we need to set requires_grad = True
w = torch.tensor(0.0, dtype = torch.float32, requires_grad = True)
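
As a quick check of what `requires_grad = True` enables, the sketch below runs a single forward and backward pass at `w = 0` and compares `w.grad` with the MSE derivative written out by hand. The names `l` and `manual` are chosen for this illustration only.

```python
import torch

X = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
y = torch.tensor([2, 4, 6, 8], dtype=torch.float32)
w = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)

# the forward pass and the loss build the computational graph
l = ((w * X - y) ** 2).mean()

# the backward pass fills w.grad with dl/dw
l.backward()

# the same derivative written out by hand: mean of 2 * x * (w * x - y)
with torch.no_grad():
    manual = (2 * X * (w * X - y)).mean()

print(w.grad.item(), manual.item())
```

Both values are -30.0, so for this loss `autograd` returns the derivative of the mean squared error, including the 1/N factor.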

The forward pass function will still be the same and the loss function will still be the same.

In [14]:
#forward pass
def forward(x):
    return w * x

#loss function 
def loss(y, y_predicted):
    return ((y_predicted - y)**2).mean()

Now, in our training loop, the forward pass and the loss stay the same. However, the gradient computation, which is the backward pass, will be different: we will not calculate it manually but use the `autograd` package from `pytorch`.

In [17]:
#making training samples with pytorch
X = torch.tensor([1,2,3,4], dtype = torch.float32)
y = torch.tensor([2,4,6,8], dtype = torch.float32)

#initializing the weight at 0 - we need to set requires_grad = True
w = torch.tensor(0.0, dtype = torch.float32, requires_grad = True)

#learning rate
learning_rate = 0.01
n_iters = 20

#now we do the training loop
for epoch in range(n_iters):
    
    #prediction = forward pass
    y_pred = forward(X)
    
    #calculate the loss
    l = loss(y, y_pred)
    
    #gradients with respect to w = backward pass 
    l.backward() #gradient of the loss with respect to w
    
    #update our weights - we go in the negative direction of the gradient
    #we have to be careful here - this operation should not be part of our gradient tracking graph 
    #it should not be part of the computational graph, so we need to wrap it 
    with torch.no_grad():
        w -= learning_rate * w.grad
    
    # we must zero the gradients - remember, whenever we call backward, it will write our gradients and accumulate them
    # in the w.grad attribute (a tensor attribute, not a method)
    #we want to be sure that our gradients are 0 before the next iteration
    w.grad.zero_()
    
    #printing some information (the gradient lives in w.grad and has just been zeroed)
    if epoch % 2 == 0:
        print(f'epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}')

print(f' Prediction after training: f(5) = {forward(5):.3f}')

epoch 1: w = 0.300, loss = 30.00000000
epoch 3: w = 0.772, loss = 15.66018772
epoch 5: w = 1.113, loss = 8.17471695
epoch 7: w = 1.359, loss = 4.26725292
epoch 9: w = 1.537, loss = 2.22753215
epoch 11: w = 1.665, loss = 1.16278565
epoch 13: w = 1.758, loss = 0.60698116
epoch 15: w = 1.825, loss = 0.31684780
epoch 17: w = 1.874, loss = 0.16539653
epoch 19: w = 1.909, loss = 0.08633806
 Prediction after training: f(5) = 9.612
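
The training loop zeroes `w.grad` after every update because `backward()` adds to the stored gradient instead of overwriting it. A minimal sketch of that behaviour, using a hypothetical helper `loss_at` and the same data:

```python
import torch

X = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
y = torch.tensor([2, 4, 6, 8], dtype=torch.float32)
w = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)

# hypothetical helper: MSE of the linear model for a given weight
def loss_at(w):
    return ((w * X - y) ** 2).mean()

loss_at(w).backward()
first = w.grad.item()        # gradient from one backward pass

loss_at(w).backward()        # second pass without zeroing in between
accumulated = w.grad.item()  # first gradient plus the second one

w.grad.zero_()               # reset the stored gradient in place
after_zero = w.grad.item()

print(first, accumulated, after_zero)
```

The three values are -30.0, -60.0, and 0.0. Without the call to `zero_()`, each iteration would step along the sum of all previous gradients rather than the current one.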


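Now we also see that `w` increases and the loss decreases. Here we had 20 iterations, but unlike before that is not enough. This is not because backpropagation is less exact: `autograd` returns the exact derivative of the mean squared error, which is -30 at `w = 0`. The manual `gradient` function sums over the samples instead of averaging them (`np.dot` already returns a scalar, so `.mean()` has no effect), so its gradient was -120 and each of its steps was 4 times larger. Let's increase the number of iterations to 1000 and see what happens.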

In [19]:
#making training samples with pytorch
X = torch.tensor([1,2,3,4], dtype = torch.float32)
y = torch.tensor([2,4,6,8], dtype = torch.float32)

#initializing the weight at 0 - we need to set requires_grad = True
w = torch.tensor(0.0, dtype = torch.float32, requires_grad = True)

#learning rate
learning_rate = 0.01
n_iters = 1000

#now we do the training loop
for epoch in range(n_iters):
    
    #prediction = forward pass
    y_pred = forward(X)
    
    #calculate the loss
    l = loss(y, y_pred)
    
    #gradients with respect to w = backward pass 
    l.backward() #gradient of the loss with respect to w
    
    #update our weights - we go in the negative direction of the gradient
    #we have to be careful here - this operation should not be part of our gradient tracking graph 
    #it should not be part of the computational graph, so we need to wrap it 
    with torch.no_grad():
        w -= learning_rate * w.grad
    
    # we must zero the gradients - remember, whenever we call backward, it will write our gradients and accumulate them
    # in the w.grad attribute (a tensor attribute, not a method)
    #we want to be sure that our gradients are 0 before the next iteration
    w.grad.zero_()
    
    #printing some information every 20 epochs (the gradient lives in w.grad and has just been zeroed)
    if epoch % 20 == 0:
        print(f'epoch {epoch + 1}: w = {w:.3f}, loss = {l:.8f}')

print(f' Prediction after training: f(5) = {forward(5):.3f}')

epoch 1: w = 0.300, loss = 30.00000000
epoch 21: w = 1.934, loss = 0.04506890
epoch 41: w = 1.997, loss = 0.00006770
epoch 61: w = 2.000, loss = 0.00000010
epoch 81: w = 2.000, loss = 0.00000000
epoch 101: w = 2.000, loss = 0.00000000
epoch 121: w = 2.000, loss = 0.00000000
epoch 141: w = 2.000, loss = 0.00000000
epoch 161: w = 2.000, loss = 0.00000000
epoch 181: w = 2.000, loss = 0.00000000
epoch 201: w = 2.000, loss = 0.00000000
epoch 221: w = 2.000, loss = 0.00000000
epoch 241: w = 2.000, loss = 0.00000000
epoch 261: w = 2.000, loss = 0.00000000
epoch 281: w = 2.000, loss = 0.00000000
epoch 301: w = 2.000, loss = 0.00000000
epoch 321: w = 2.000, loss = 0.00000000
epoch 341: w = 2.00

As we can see, `w` is now correct, and so is the prediction. In the next video, we will complete steps 3 and 4.