# Introduction to computational graph and implementation of backpropagation and Gradient descent

**INSTRUCTIONS**

Each week you will be given an assignment related to the associated module. You have roughly one week to complete and submit each of them. There are 3 weekly group sessions available to help you complete the assignments. Attendance is not mandatory but recommended. However, **assignments are graded each week and not submitting them or submitting them after the deadline will give you no points**

- **FORMAT**: Jupyter notebook
- **DEADLINE**:  Sunday 21st February, 23:59


## Introduction

Last week we saw how adding regularization to SGD could reduce overfitting. This week we will try to understand what happens exactly inside your model during the training process when applying the gradient descent method, with or without regularization and momentum. To do so we will simply implement first the backprogation algorithm which is how we compute the gradients and then the gradient descent, which is the trainable parameter update in response to this gradient.


## Contents

1. Utils  

2. MiniNet  

3. Introduction to PyTorch's computational graph  
  3.1 Weight values and update  
  3.2 Gentle (but still painful) introduction to computational graph  
4. From an optimization problem to the backprogation algorithm  

5. Implementing Gradient descent in Pytorch inside the training loop   
  5.0 LetNet5  
  5.1 Manual weight optimization  
  5.2 Compare results to optim.SGD: vanilla version  
  5.3 Compare results to optim.SGD: (with Regularization)  
  5.4 Compare results to optim.SGD: (with Momentum)  
  5.5 Compare results to optim.SGD: (with Regularization and Momentum)  
6. Implementing an Optimizer in PyTorch
  
## Andrew's Videos related to today's assignent (entirely optional! just here to help you if you're confused about something specific!)

- [Computation Graph (C1W2L07)](https://www.youtube.com/watch?v=hCP1vGoCdYU&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=13)
- [Derivatives With Computation Graphs (C1W2L08)](https://www.youtube.com/watch?v=nJyUyKN-XBQ&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=14)
- [Gradient Descent (C1W2L04)](https://www.youtube.com/watch?v=uJryes5Vk1o&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=10)
- [Gradient Descent For Neural Networks (C1W3L09)](https://www.youtube.com/watch?v=7bLEWDZng_M&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=33) (includes backprogation)
- [Backpropagation Intuition (C1W3L10)](https://www.youtube.com/watch?v=yXcQ4B-YSjQ&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=34)
- [Deep L-Layer Neural Network (C1W4L01)](https://www.youtube.com/watch?v=2gw5tE2ziqA&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=36) (clarify notations)
- [Forward and Backward Propagation (C1W4L06)](https://www.youtube.com/watch?v=qzPQ8cEsVK8&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&index=41)
- [Regularization (C2W1L04)](https://www.youtube.com/watch?v=6g0t3Phly2M&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=4)
- [Why Regularization Reduces Overfitting (C2W1L05)](https://www.youtube.com/watch?v=NyG-7nRpsW8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=5)
- [Gradient Descent With Momentum (C2W2L06)](https://www.youtube.com/watch?v=k8fTYJPd3_I&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=20)

In [None]:
import torch
from torch.optim import Optimizer
from torch import nn, optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import datetime
import copy
import collections

## 1. Utils

Nothing to see in the cell below, just the definition of 3 functions we'll need later, you don't even need to read them, just know that there is:

- ``load_cifar``: load CIFAR-2 (keeping only birds and plane)
- ``training_loop``
- ``validate``

In [4]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Device {device}.")


def load_cifar(data_path='../data/'):
    """
    Return CIFAR-2 keeping only birds and plane
    """
    cifar10_train = datasets.CIFAR10(
        data_path,       
        train=True,      
        download=True,   
        transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.4915, 0.4823, 0.4468),
                                (0.2470, 0.2435, 0.2616))
        ]))

    cifar10_val = datasets.CIFAR10(
        data_path, 
        train=False,
        download=True,
        transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.4915, 0.4823, 0.4468),
                                (0.2470, 0.2435, 0.2616))
        ]))

    label_map = {0: 0, 2: 1}
    class_names = ['airplane', 'bird']

    cifar2_train = [(img, label_map[label]) for img, label in cifar10_train if label in [0, 2]]
    cifar2_val = [(img, label_map[label]) for img, label in cifar10_val if label in [0, 2]]
    print('Size of the training dataset: ', len(cifar2_train))
    print('Size of the validation dataset: ', len(cifar2_val))

    return cifar2_train, cifar2_val

def training_loop(n_epochs, optimizer, model, loss_fn, train_loader):
    """
    Train our model
    """
    model.train()
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            imgs = imgs.to(device=device) 
            labels = labels.to(device=device)

            outputs = model(imgs)
            loss = loss_fn(outputs, labels)

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            loss_train += loss.item()

        if epoch == 1 or epoch % 10 == 0:
            print('{}  |  Epoch {}  |  Training loss {:.3f}'.format(
                datetime.datetime.now(), epoch,
                loss_train / len(train_loader)))

def validate(model, train_loader, val_loader):
    """
    Plot training and validation accuracy
    """
    model.eval()
    accdict = {}
    for name, loader in [("train", train_loader), ("val", val_loader)]:
        correct = 0
        total = 0

        with torch.no_grad():
            for imgs, labels in loader:
                imgs = imgs.to(device=device)
                labels = labels.to(device=device)

                outputs = model(imgs)
                _, predicted = torch.max(outputs, dim=1)
                total += labels.shape[0]
                correct += int((predicted == labels).sum())

        print("Accuracy {}: {:.2f}".format(name , correct / total))
        accdict[name] = correct / total
    return accdict

cifar2_train, cifar2_val = load_cifar()

Device cuda.
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ../data/cifar-10-python.tar.gz


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Extracting ../data/cifar-10-python.tar.gz to ../data/
Files already downloaded and verified
Size of the training dataset:  10000
Size of the validation dataset:  2000


## 2. MiniNet

Here is the definition of a very simple MLP that we will use in the first part of this assignment. It corresponds to the following network (I tried to follow Andrew's notation):

![MiniNet architecture](MiniNet.jpeg)

As you will use this implementation, it might be useful to read the code bellow

In [5]:
class MiniNet(nn.Module):
    def __init__(self):
        super().__init__() 
        self.fc1 = nn.Linear(in_features=2, out_features=3)
        self.fc2 = nn.Linear(in_features=3, out_features=2)

        # Where we will store our neuron values
        # - z: before activation function 
        # - a: after activation function (a=f(z))
        self.z = {str(i) : None for i in range(1,3)}
        self.a = {str(i) : None for i in range(3)}

        # Where we will store the gradients for our custom backpropagation algo
        self.dL_dw = {str(i) : None for i in range(1,3)}
        self.dL_db = {str(i) : None for i in range(1,3)}

        # Derivatives of our activation functions
        self.df = {str(i) : None for i in range(1,3)}
        self.df['1'] = lambda x : torch.div(1, torch.cosh(x)**2)
        self.df['2'] = lambda x : torch.div(1, torch.cosh(x)**2)
        
    def forward(self, x):
        # The first dimension of the input must be the batch size
        out = x.view(-1, 2)

        # Input layer
        self.a['0'] = out
        
        # First layer (hidden layer)
        self.z['1'] = self.fc1(out)
        self.a['1'] = torch.tanh(self.z['1'])
        
        # Second layer (output layer)
        self.z['2'] = self.fc2(self.a['1'])
        self.a['2'] = torch.tanh(self.z['2'])

        return self.a['2']

## 3. Introduction to PyTorch's computational graph

### 3.1 Weight values and update

Get more familiar with models in Pytorch. Look at the cell below to understand how to:

- Access neuron values
- Access parameter values
- Update parameters

In [18]:
model = MiniNet()

def print_parameters(model):
    """
    Print trainable parameters of our MiniNet
    """
    print(" ============== Parameters ============== ")
    for name, p in model.named_parameters():
        print("\nName : ", name, "\nValue: ", p.data)

def print_neuron_values(model):
    """
    Print neuron values (a and z) of our MiniNet 
    """
    print("\n ============== Neuron values ============== ")
    print("\n -------------- Input ---------------- ")
    print("a0:            ", model.a['0'] )
    print("\n -------------- First Layer ---------------- ")
    print("z1:            ", model.z['1'] )
    print("a1 = tanh(z1): ", model.a['1'] )
    print("\n --------------  2nd Layer  ---------------- ")
    print("z2:            ", model.z['2'] )
    print("a2 = tanh(z2): ", model.a['2'] )


print_parameters(model) # We can see that all parameters are randomly initialized

# We can access the different parameters using their names as specified in model.state_dict()
print("\n ========= Print state_dict() ========= ")
print(model.state_dict())
# We can now access one of our parameters using its name
# And we can update its value using '.data' after its name
model.fc1.weight.data = torch.ones(3,2)
model.fc1.weight.data[0,0] = 42

model.fc2.bias.data[:] = torch.ones(2)
model.fc2.bias.data[0] = 42

print("\n ========== Updated parameters ==========" )

print(model.fc1.weight)
print(model.fc2.bias)


# We have not given any input to our model yet, so all neuron values should be None
print_neuron_values(model)
# Now we give some input...
input = torch.Tensor([1, 1])
model(input)
# ... and everything has been computed in the forward pass 
print_neuron_values(model)




Name :  fc1.weight 
Value:  tensor([[ 0.3983,  0.6245],
        [-0.4197,  0.2017],
        [-0.2798,  0.5716]])

Name :  fc1.bias 
Value:  tensor([0.2636, 0.0850, 0.3211])

Name :  fc2.weight 
Value:  tensor([[-0.2217, -0.1855, -0.1029],
        [-0.4510, -0.3690, -0.4821]])

Name :  fc2.bias 
Value:  tensor([ 0.3209, -0.3564])

OrderedDict([('fc1.weight', tensor([[ 0.3983,  0.6245],
        [-0.4197,  0.2017],
        [-0.2798,  0.5716]])), ('fc1.bias', tensor([0.2636, 0.0850, 0.3211])), ('fc2.weight', tensor([[-0.2217, -0.1855, -0.1029],
        [-0.4510, -0.3690, -0.4821]])), ('fc2.bias', tensor([ 0.3209, -0.3564]))])

Parameter containing:
tensor([[42.,  1.],
        [ 1.,  1.],
        [ 1.,  1.]], requires_grad=True)
Parameter containing:
tensor([42.,  1.], requires_grad=True)


 -------------- Input ---------------- 
a0:             None

 -------------- First Layer ---------------- 
z1:             None
a1 = tanh(z1):  None

 --------------  2nd Layer  ---------------- 
z2:  

### 3.2 Gentle (but still painful) introduction to computational graph

If you take a closer look at the output of "Updated parameters" you can see that it mentions "``requires_grad=True``". Then, if you take a closer look at the output of "neuron values" you see that it mentions "``grad_fn=<TanhBackward>``"  "``grad_fn=<AddmmBackward>``"

What is this all about? Well, it has to do with the *computational graph* which is how Pytorch manages all the operations made during the forward pass (``outputs = model(inputs)`` i.e ``forward`` method) so that it can compute all the gradients in the backward pass (``loss.backward()``) and finally update parameters accordingly when calling ``optimizer.step()``. 

Now to illustrate how necessary it is to have some understanding of this computational graph, let's try to initialize our weights in a custom way and check how easily we can mess up things.

This is okay:

- ``model.layer.param.data = new_values``
- ``model.layer.param.data[:] = new_values``

This is **NOT** okay:

- ``model.layer.param = new_values``     Raises an error (unless ``new_values`` are of nn.Parameter and not ``torch.Tensor``): ``TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)``
- ``model.layer.param[:] = new_values``  Will remove the parameter from the list of leaves and put ``CopySlices`` as gradient function

Basically, we want each of our trainable parameters (weights) to require grad ([requires_grad](https://pytorch.org/docs/stable/autograd.html?highlight=requires#torch.Tensor.requires_grad)) and to be a leaf ([is_leaf](https://pytorch.org/docs/stable/autograd.html#torch.Tensor.is_leaf)). Variables that have nothing to do with the computational graph (not a part of the network) should be detached of the computational graph. (see [detach](https://pytorch.org/docs/stable/autograd.html#torch.Tensor.detach) or [detach_](https://pytorch.org/docs/stable/autograd.html#torch.Tensor.detach_) for in place version). 

The concept of leaf might be counter intuitive in PyTorch. The fact that your weight is in the middle of your network does not mean that it should not be a leaf. It should always be a leaf. Basically weights are leaves because in the forward pass their values do not depend on the values of the input nor the weights, etc. Their values only change when you call optimizer.step() or when you initialize them manually. 

In [7]:
def weight_initialization(model):
    """
    Initialize our MiniNet's weights
    """
    model.fc1.weight.data = torch.arange(1,7, dtype=torch.float).view(3,2)/10
    model.fc1.bias.data = torch.arange(7,10, dtype=torch.float)/10
    model.fc2.weight.data = torch.arange(10,16, dtype=torch.float).view(2,3)/10
    model.fc2.bias.data = torch.arange(16,18, dtype=torch.float)/10
    return model

def check_computational_graph(model):
    """
    Make sure all trainable parameters require grad and are leaves
    """
    res = True
    # Go through layer 1,2
    for i_layer in range(1,3):
        # Each layer has a weight and bias parameter
        for param_name in ['weight', 'bias']:
            
            layer_name = "fc"+str(i_layer)
            # 'getattr(object, string variable)' is like `object.myattribut` when variable = "myattribute"
            if not getattr(getattr(model, layer_name), param_name).requires_grad:
                print(" ==== WARNING ==== ", layer_name + "." + param_name + "does not require grad!")
                print(getattr(getattr(model, layer_name), param_name))
                res = False
            if not getattr(getattr(model, layer_name), param_name).is_leaf:
                print(" ==== WARNING ==== ", layer_name + "." + param_name + " is not a leaf!")
                print(getattr(getattr(model, layer_name), param_name))
                res = False
    if res:
        print("\nAll parameters are correctly attached to the computational graph! :) ")


model = MiniNet()
check_computational_graph(model)      # So far so good, since we have not done anything yet
model = weight_initialization(model)
print_parameters(model)               # To check that parameters have been updated
check_computational_graph(model)      # To check that there are still correctly attached to the graph
model.fc1.weight[:,:] = torch.arange(1,7, dtype=torch.float).view(3,2)/10
check_computational_graph(model)      # Now fc1.weight is not a leaf anymore! (and see "grad_fn=<CopySlices>"")

# This would raise an error if uncommented
#model.fc1.weight=torch.arange(1,7, dtype=torch.float).view(3,2)/10 



All parameters are correctly attached to the computational graph! :) 

Name :  fc1.weight 
Value:  tensor([[0.1000, 0.2000],
        [0.3000, 0.4000],
        [0.5000, 0.6000]])

Name :  fc1.bias 
Value:  tensor([0.7000, 0.8000, 0.9000])

Name :  fc2.weight 
Value:  tensor([[1.0000, 1.1000, 1.2000],
        [1.3000, 1.4000, 1.5000]])

Name :  fc2.bias 
Value:  tensor([1.6000, 1.7000])

All parameters are correctly attached to the computational graph! :) 
Parameter containing:
tensor([[0.1000, 0.2000],
        [0.3000, 0.4000],
        [0.5000, 0.6000]], grad_fn=<CopySlices>)


## 4. From an optimization problem to the backprogation algorithm

**Optimization problem**

In optimization, (so not only specific to machine learning) the problem often boils down to minimize (or maximize) some loss function (or score). The loss function can for instance be the mean squared error:

$$L(\theta) = \frac{1}{m}\sum_{s=1}^{m}\frac{1}{2}||\mathbf{y_s} - \mathbf{\hat{y}_s}||^2_2$$ 

with:

- $m$ total number of sample in your dataset (or batch)
- $\theta$ all the parameters to be optimized ($\in \mathrm{R}^q$)
- $\mathbf{y}$ your expected result ($\in \mathrm{R}^{n \times m}$)
- $\mathbf{\hat{y}}$ your predicted result ($\in \mathrm{R}^{n \times m}$)
- $L : \mathrm{R}^q \rightarrow \mathrm{R}$ your loss function

**Gradient descent**

One way to solve this problem (still not only specific to machine learning) is to minimize the loss function iteratively using the gradient descent method. That is to say, we iteratively update all the parameters as follows (in simplest version of the gradient descent algorithm):

$$\theta = \theta - \alpha \nabla L(\theta)  $$

with 
- $\alpha$ often called "step" in optimization and "learning rate" in machine learning
- $\nabla L(\theta)$ the gradient of the loss function 

So if $\theta = \begin{bmatrix} w_{1} \cdots w_{q} \end{bmatrix}^T$  then $\nabla L(\theta) = \begin{bmatrix} \frac{\partial L}{\partial w_{1}}(\theta) \cdots \frac{\partial L}{\partial w_{q}}(\theta) \end{bmatrix}^T$

**Backpropagation algorithm**

So the only thing we need is to compute $\frac{\partial L}{\partial w_{i}}, \forall i \in [1..q]$. In machine learning, we use the backpropagation algorithm to compute $\frac{\partial L}{\partial w_{i}}$ for all the weights $w_{i}$ of a neural network. This is a recursive algorithm, from the output layer to the input layer (hence the name *backpropagation*). Again, I tried to follow Andrew's notation.

$$\frac{\partial L}{\partial w^{[l]}_{i,j}} = \delta^{[l]}_i \times a_j^{[l-1]} \qquad  \qquad \forall l \in [1 .. L] \quad \text{with }  \delta^{[l]}_i \text{ called "local gradient" }\qquad \delta^{[l]}_i = \frac{\partial L}{\partial z^{[l]}_{i}} = \frac{\partial L}{\partial a^{[l]}_{i}} \times \frac{\partial a^{[l]}_{i}}{\partial z^{[l]}_{i}}$$

For the output layer, since $a^{[L]}_{i} = \hat{y}_{i}$:

$$\delta^{[L]}_i = \frac{\partial L}{\partial z^{[L]}_{i}} = \frac{\partial L}{\partial \hat{y}_{i}} \times \frac{\partial \hat{y}_i}{\partial z^{[L]}_{i}} = e_i'(\hat{y}_i) \times f'^{[L]}_i (z^{[L]}_{i})$$

For the hidden layers (general case):

$$\delta^{[l]}_i = \frac{\partial L}{\partial z^{[l]}_{i}} = \frac{\partial L}{\partial a^{[l]}_{i}} \times \frac{\partial a^{[l]}_{i}}{\partial z^{[l]}_{i}} = \Big( \sum_{k=1}^{n^{[l+1]}} \frac{\partial a^{[l]}_{i}}{\partial a^{[l+1]}_{k}} \frac{\partial a^{[l+1]}_{k}}{\partial a^{[l]}_{i}} \Big)\times \frac{\partial a^{[l]}_{i}}{\partial z^{[l]}_{i}} = \Big( \sum_{k=1}^{n^{[l+1]}} \delta_k^{[l+1]} w^{[l+1]}_{k,i} \Big) \times f'^{[l]}_i (z^{[l]}_{i})$$

## TODO:

Write a function ``backpropagation`` that computes:

- $\frac{\partial L}{\partial w^{[l]}_{i,j}}$ and store them in ``model.dL_dw[str(l)][i,j]`` for $l \in [1,2]$ 
- $\frac{\partial L}{\partial b^{[l]}_{j}}$ and store them in ``model.dL_db[str(l)][j]`` for $l \in [1,2]$ 

**Hints**

- Look at how MiniNet is implemented and at the cell above (``weight_initialization`` and ``check_computational_graph`` to see how to access MiniNet weights and update MiniNet attributes)

Once you think you are done:

- You can check if your function works well by runnning the next cell.
- Since ``dL_dw`` are not trainable parameters of your model they should not be in the computational graph. Check that all your ``dL_dw``  (or at least one of them) have its attribute ``require_grad`` set to ``False``. If not use the [detach](https://pytorch.org/docs/stable/autograd.html#torch.Tensor.detach) or [detach_](https://pytorch.org/docs/stable/autograd.html#torch.Tensor.detach_) methods appropriately in your ``backpropagation`` function

In [156]:
def mse(y_true, y_pred):
    return torch.sum((y_true - y_pred)**2)/2/len(y_true)

def dmse(y_true, y_pred):
    return torch.sum(-(y_true - y_pred))/len(y_true)


def backpropagation(model, y_true, y_pred, dloss=mse):
    dL = dloss(y_true, y_pred)
    # for l in range(1, 3):
    #     for j in range(???):
    #         for i in range(???):

    # Go through layer 1,2
    for i_layer in range(1,3):
        layer_name = "fc"+str(i_layer)

        # if model.dL_dw[str(i_layer)].require_grad:
        #     print("*******")

        param_name = 'weight'
        model.dL_dw[str(i_layer)] = model.df[str(i_layer)](dL/getattr(getattr(model, layer_name), param_name))
        model.dL_dw[str(i_layer)].detach_()
        param_name = 'bias'
        model.dL_db[str(i_layer)] = model.df[str(i_layer)](dL/getattr(getattr(model, layer_name), param_name))
        
    return None


In [157]:
def compare_with_gradients(model):
    print( " =========== COMPARE GRADIENTS =========== ")
    print(" --- Our computation:\n", model.fc1.weight.grad)
    print(" --- Autograd's computation:\n",  model.dL_dw['1'])
    print("Norm (fc1.weight.grad - model.dL_dw['1']):   %.2f" %torch.norm(model.fc1.weight.grad.flatten() - model.dL_dw['1'].flatten()) )
    print("Norm (fc2.weight.grad - model.dL_dw['2']):   %.2f" %torch.norm(model.fc2.weight.grad.flatten() - model.dL_dw['2'].flatten()))
    print("Norm (fc1.weight.bias - model.dL_db['1']):   %.2f" %torch.norm(model.fc1.bias.grad.flatten() - model.dL_db['1'].flatten()))
    print("Norm (fc2.weight.bias - model.dL_db['2']):   %.2f" %torch.norm(model.fc2.bias.grad.flatten() - model.dL_db['2'].flatten()))
    return None

def test_backprop(model, optimizer, loss_fn):
    model.train()
    for i, (x, y) in enumerate(zip(inputs, y_true)):
        print("\n === Input %d === "%i)
    
        out = model(x)
        loss = loss_fn(out, y)

        optimizer.zero_grad()
        loss.backward()
        backpropagation(model, y, out)

        optimizer.step()
        
        compare_with_gradients(model)

N = 20
inputs = [torch.normal( torch.tensor([-i,i], dtype=torch.float), torch.tensor([1.,2.]) ) for i in range(N)]
y_true = [i*torch.ones(2) for i in range(N)]

model = MiniNet()
model = weight_initialization(model)    

loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)

test_backprop(model, optimizer, loss_fn)


print("\n ====  Check that weights have been updated ======")
print(model.fc1.weight)


 === Input 0 === 
 --- Our computation:
 tensor([[1.5105e-04, 3.2374e-04],
        [9.0159e-05, 1.9323e-04],
        [4.9590e-05, 1.0628e-04]])
 --- Autograd's computation:
 tensor([[1.8223e-04, 2.6639e-02],
        [1.3318e-01, 2.8062e-01],
        [4.2020e-01, 5.3473e-01]])
Norm (fc1.weight.grad - model.dL_dw['1']):   0.75
Norm (fc2.weight.grad - model.dL_dw['2']):   2.08
Norm (fc1.weight.bias - model.dL_db['1']):   1.19
Norm (fc2.weight.bias - model.dL_db['2']):   1.29

 === Input 1 === 
 --- Our computation:
 tensor([[1.0906, 8.9605],
        [0.6097, 5.0098],
        [0.1851, 1.5207]])
 --- Autograd's computation:
 tensor([[2.5687e-10, 3.2053e-05],
        [1.6007e-03, 1.1260e-02],
        [3.5951e-02, 7.6939e-02]])
Norm (fc1.weight.grad - model.dL_dw['1']):   10.44
Norm (fc2.weight.grad - model.dL_dw['2']):   0.95
Norm (fc1.weight.bias - model.dL_db['1']):   2.96
Norm (fc2.weight.bias - model.dL_db['2']):   2.37

 === Input 2 === 
 --- Our computation:
 tensor([[0.0103, 0.0052],

  return F.mse_loss(input, target, reduction=self.reduction)


## 5. Implementing Gradient descent in Pytorch inside the training loop 

Now that we know how to backpropagate the gradient, the only thing missing to the learning process of our network is how to update parameters using the gradient descent update equation. 

In the previous section we used a very simple MLP so the only derivatives we had to compute were the derivates of the activation functions. Computing all the terms required by the backpropagation algorithm might become harder when you start including more complex layers such as convolutions. In this section we will go back to using ``loss.backward()`` and will focus on updating parameters once the backpropagation is already done.

The update step does not depend on the architecture's complexity. To illustrate this fact, we will now use a LeNet5 model.

### 5.0 LetNet5

## TODO:

Copy paste your favorite implementation of LeNet5 architecture.

In [10]:
#TODO!

num_classes = 2

class MyLeNet5(nn.Module):
    def __init__(self, dropout=False):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(8 * 8 * 8, 32)
        self.fc2 = nn.Linear(32, num_classes)
        self.dropout = dropout
        self.drop = nn.Dropout2d(p=0.5)
        
    def forward(self, x):
        out = F.max_pool2d(torch.tanh(self.conv1(x)), 2)
        if self.dropout:
            out = self.drop(out)
        out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
        if self.dropout:
            out = self.drop(out)
        out = out.view(-1, 8 * 8 * 8)
        out = torch.tanh(self.fc1(out))
        out = self.fc2(out)
        return out

 

### 5.1 Manual weight optimization

## TODO: (read until the end of this cell)

1. Remove the line corresponding to the parameter update step in the training loop below  
2. Replace it by a manual update of the parameters using the gradient descent rule $\theta = \theta - \alpha \nabla L(\theta)  $   
3. Run the '5.2 Compare results` cell to compare with the PyTorch implementation of SGD with consistent optimizer parameters  

Once the 3rd point works well (it's normal if values are not exactly equal, but they should follow the same trend):

4. Add L2 regularization (using the ``weight_decay`` parameter) to your optimizer.
5. Run the '5.3 Compare results` cell (keep the 5.2 one for vanilla SGD so that we can easily compare) to compare with the PyTorch implementation of SGD with consistent optimizer parameters and using regularization

Once the 5th point works well:

6. Add momentum (using the ``momentum_coeff`` parameter) to your optimizer. **Warning** there are different equations for momentum, use the one used by PyTorch so that comparisons make sense. (See SGD's [note](https://pytorch.org/docs/stable/optim.html?highlight=sgd#torch.optim.SGD))
7. Run the '5.4 Compare results` cell (keep the 5.3 one for SGD+L2 so that we can easily compare) to compare with the PyTorch implementation of SGD with consistent optimizer parameters and using momentum **WITHOUT** regularization

Once the 7th point works well:

9. Run the '5.5 Compare results` cell (keep the 5.4 one for SGD+Momentum so that we can easily compare) to compare with the PyTorch implementation of SGD with consistent optimizer parameters and using momentum **WITH** regularization

**Hints** 

- Add the `with torch.no_grad():` before updating parameters
- Remember that you can access each of the parameters of your model using for example ``for p in model.parameters()``
- Might be useful to detach some tensors at some point :) 
- Remember to be careful when updating trainable parameters!
- You'll still have to re-initialize the gradients after each step (see [torch.Tensor.zero_](https://pytorch.org/docs/stable/tensors.html?highlight=zero_#torch.Tensor.zero_))

In [52]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Device {device}.")

def training_loop_manual_optim(n_epochs, model, loss_fn, train_loader, lr=1e-2, momentum_coeff=0., weight_decay=0.):
    model.train()
    
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            imgs = imgs.to(device=device) 
            labels = labels.to(device=device)

            # TODO!
            outputs = model(imgs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            # optimizer.step()
            with torch.no_grad():
                for p in model.parameters():
                    if v is None:
                        v = torch.zeros_like(p.grad)
                    v = p.grad + momentum_coeff * v# Momentum
                    p -= lr * v #SGD
                    p -= weight_decay * p.grad# Weight_decay            ## += or -= I think "-" is correct.
                    # new_val = update_function(p, p.grad, loss, other_params)
                    # p.copy_(new_val)
                    p.grad.zero_()

            loss_train += loss.item()

        if epoch == 1 or epoch % 10 == 0:
            print('{}  |  Epoch {}  |  Training loss {:.3f}'.format(
                datetime.datetime.now(), epoch,
                loss_train / len(train_loader)))

Device cuda.


### 5.2 Compare results to optim.SGD: vanilla version

**NOTE** You can reduce the number of epochs (and also increase how often they are printed) if your computer is slow when you are still trying to make it work. Once it works try to go back to a higher number of epochs before submitting your assignment.



In [35]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Training on device {device}.")


train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=64, shuffle=True)
model = MyLeNet5()  # TODO: Change this with your own LeNet5 class name
loss_fn = nn.CrossEntropyLoss()

n_epochs = 31
# n_epochs = 11
lr = 1e-2

# print("\n ========= Training using Pytorch's SGD =========")
# model01 = copy.deepcopy(model).to(device=device) 

# # TODO: Choose the parameter values consistently so that comparisons make sense
# optimizer = optim.SGD(model01.parameters(), lr=lr)


# training_loop(
#     n_epochs = n_epochs,
#     optimizer = optimizer, 
#     model = model01,
#     loss_fn = loss_fn,
#     train_loader = train_loader,
# )

print("\n ==== Using manual update inside training loop ======")
model03 = copy.deepcopy(model).to(device=device) 

# TODO: Choose the parameter values consistently so that comparisons make sense
training_loop_manual_optim(
    n_epochs = n_epochs,
    model = model03,
    loss_fn = loss_fn,
    train_loader = train_loader,
    lr = lr,
)

print("\n ========= Training using Pytorch's SGD =========")
model01 = copy.deepcopy(model).to(device=device) 

# TODO: Choose the parameter values consistently so that comparisons make sense
optimizer = optim.SGD(model01.parameters(), lr=lr)


training_loop(
    n_epochs = n_epochs,
    optimizer = optimizer, 
    model = model01,
    loss_fn = loss_fn,
    train_loader = train_loader,
)

# UNCOMMENT ONCE SECTION 6 IS DONE
# print("\n ========= Training using our SGD =========")
# model02 = copy.deepcopy(model).to(device=device) 
# # TODO: Choose the parameter values consistently so that comparisons make sense
# optimizer = MySGD(model02.parameters(), lr=lr)

# training_loop(
#     n_epochs = n_epochs,
#     optimizer = optimizer,
#     model = model02,
#     loss_fn = loss_fn,
#     train_loader = train_loader,
# )

train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=64, shuffle=False)
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64, shuffle=False)

print("\n ========= Using Pytorch's SGD =========")
validate(model01, train_loader, val_loader)
print("\n ==== Using manual update inside training loop ======")
validate(model03, train_loader, val_loader)
# UNCOMMENT ONCE SECTION 6 IS DONE
# print("\n ========= Using our SGD =========")
# validate(model02, train_loader, val_loader)

Training on device cuda.

2021-02-21 17:54:29.734727  |  Epoch 1  |  Training loss 0.565
2021-02-21 17:54:33.244825  |  Epoch 10  |  Training loss 0.330
2021-02-21 17:54:37.153431  |  Epoch 20  |  Training loss 0.297
2021-02-21 17:54:41.042328  |  Epoch 30  |  Training loss 0.275

2021-02-21 17:54:41.802002  |  Epoch 1  |  Training loss 0.564
2021-02-21 17:54:45.004742  |  Epoch 10  |  Training loss 0.332
2021-02-21 17:54:48.563154  |  Epoch 20  |  Training loss 0.297
2021-02-21 17:54:52.097406  |  Epoch 30  |  Training loss 0.272

Accuracy train: 0.89
Accuracy val: 0.88

Accuracy train: 0.89
Accuracy val: 0.88


{'train': 0.8885, 'val': 0.877}

### 5.3 Compare results to optim.SGD: (with Regularization)

**NOTE** You can reduce the number of epochs (and also increase how often they are printed) if your computer is slow when you are still trying to make it work. Once it works try to go back to a higher number of epochs before submitting your assignment.

In [48]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Training on device {device}.")

train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=64, shuffle=True)
model = MyLeNet5()  # TODO: Change this with your own LeNet5 class name
loss_fn = nn.CrossEntropyLoss()

n_epochs = 31
lr = 1e-2
weight_decay = 0.001

# print("\n ========= Training using Pytorch's SGD =========")
# model01 = copy.deepcopy(model).to(device=device) 

# # TODO: Choose the parameter values consistently so that comparisons make sense
# optimizer = optim.SGD(model01.parameters(), lr=lr, weight_decay=weight_decay)


# training_loop(
#     n_epochs = n_epochs,
#     optimizer = optimizer, 
#     model = model01,
#     loss_fn = loss_fn,
#     train_loader = train_loader,
# )

print("\n ==== Using manual update inside training loop ======")
model03 = copy.deepcopy(model).to(device=device) 

# TODO: Choose the parameter values consistently so that comparisons make sense
training_loop_manual_optim(
    n_epochs = n_epochs,
    model = model03,
    loss_fn = loss_fn,
    train_loader = train_loader,
    lr = lr,
    weight_decay=weight_decay,
)

print("\n ========= Training using Pytorch's SGD =========")
model01 = copy.deepcopy(model).to(device=device) 

# TODO: Choose the parameter values consistently so that comparisons make sense
optimizer = optim.SGD(model01.parameters(), lr=lr, weight_decay=weight_decay)


training_loop(
    n_epochs = n_epochs,
    optimizer = optimizer, 
    model = model01,
    loss_fn = loss_fn,
    train_loader = train_loader,
)

# UNCOMMENT ONCE SECTION 6 IS DONE
# print("\n ========= Training using our SGD =========")
# model02 = copy.deepcopy(model).to(device=device) 
# # TODO: Choose the parameter values consistently so that comparisons make sense
# optimizer = MySGD(model02.parameters(), lr=lr, weight_decay=weight_decay)

# training_loop(
#     n_epochs = n_epochs,
#     optimizer = optimizer,
#     model = model02,
#     loss_fn = loss_fn,
#     train_loader = train_loader,
# )

train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=64, shuffle=False)
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64, shuffle=False)

print("\n ========= Using Pytorch's SGD =========")
validate(model01, train_loader, val_loader)
print("\n ==== Using manual update inside training loop ======")
validate(model03, train_loader, val_loader)
# UNCOMMENT ONCE SECTION 6 IS DONE
# print("\n ========= Using our SGD =========")
# validate(model02, train_loader, val_loader)

Training on device cuda.

2021-02-21 18:44:26.200464  |  Epoch 1  |  Training loss 0.575
2021-02-21 18:44:29.768624  |  Epoch 10  |  Training loss 0.356
2021-02-21 18:44:33.722637  |  Epoch 20  |  Training loss 0.305
2021-02-21 18:44:37.640056  |  Epoch 30  |  Training loss 0.279

2021-02-21 18:44:38.411956  |  Epoch 1  |  Training loss 0.570
2021-02-21 18:44:41.678463  |  Epoch 10  |  Training loss 0.348
2021-02-21 18:44:45.270098  |  Epoch 20  |  Training loss 0.303
2021-02-21 18:44:48.865315  |  Epoch 30  |  Training loss 0.274

Accuracy train: 0.89
Accuracy val: 0.88

Accuracy train: 0.88
Accuracy val: 0.86


{'train': 0.8789, 'val': 0.8645}

### 5.4 Compare results to optim.SGD: (with Momentum)

**NOTE** You can reduce the number of epochs (and also increase how often they are printed) if your computer is slow when you are still trying to make it work. Once it works try to go back to a higher number of epochs before submitting your assignment.

In [53]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Training on device {device}.")

train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=64, shuffle=True)
model = MyLeNet5()  # TODO: Change this with your own LeNet5 class name
loss_fn = nn.CrossEntropyLoss()

n_epochs = 31
lr = 1e-2
momentum_coeff = 0.6


# print("\n ========= Training using Pytorch's SGD =========")
# model01 = copy.deepcopy(model).to(device=device) 

# # TODO: Choose the parameter values consistently so that comparisons make sense
# optimizer = optim.SGD(model01.parameters(), lr=lr, momentum=momentum_coeff)


# training_loop(
#     n_epochs = n_epochs,
#     optimizer = optimizer, 
#     model = model01,
#     loss_fn = loss_fn,
#     train_loader = train_loader,
# )

print("\n ==== Using manual update inside training loop ======")
model03 = copy.deepcopy(model).to(device=device) 

# TODO: Choose the parameter values consistently so that comparisons make sense
training_loop_manual_optim(
    n_epochs = n_epochs,
    model = model03,
    loss_fn = loss_fn,
    train_loader = train_loader,
    lr = lr,
    momentum_coeff=momentum_coeff
)

print("\n ========= Training using Pytorch's SGD =========")
model01 = copy.deepcopy(model).to(device=device) 

# TODO: Choose the parameter values consistently so that comparisons make sense
optimizer = optim.SGD(model01.parameters(), lr=lr, momentum=momentum_coeff)


training_loop(
    n_epochs = n_epochs,
    optimizer = optimizer, 
    model = model01,
    loss_fn = loss_fn,
    train_loader = train_loader,
)

# UNCOMMENT ONCE SECTION 6 IS DONE
# print("\n ========= Training using our SGD =========")
# model02 = copy.deepcopy(model).to(device=device) 
# # TODO: Choose the parameter values consistently so that comparisons make sense
# optimizer = MySGD(model02.parameters(), lr=lr, momentum_coeff=momentum_coeff)

# training_loop(
#     n_epochs = n_epochs,
#     optimizer = optimizer,
#     model = model02,
#     loss_fn = loss_fn,
#     train_loader = train_loader,
# )

train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=64, shuffle=False)
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64, shuffle=False)

print("\n ========= Using Pytorch's SGD =========")
validate(model01, train_loader, val_loader)
print("\n ==== Using manual update inside training loop ======")
validate(model03, train_loader, val_loader)
# UNCOMMENT ONCE SECTION 6 IS DONE
# print("\n ========= Using our SGD =========")
# validate(model02, train_loader, val_loader)

Training on device cuda.

2021-02-21 18:59:53.348808  |  Epoch 1  |  Training loss 0.587
2021-02-21 18:59:57.165142  |  Epoch 10  |  Training loss 0.322
2021-02-21 19:00:01.418598  |  Epoch 20  |  Training loss 0.280
2021-02-21 19:00:05.670779  |  Epoch 30  |  Training loss 0.248

2021-02-21 19:00:06.474540  |  Epoch 1  |  Training loss 0.552
2021-02-21 19:00:10.070924  |  Epoch 10  |  Training loss 0.303
2021-02-21 19:00:13.860776  |  Epoch 20  |  Training loss 0.255
2021-02-21 19:00:17.624119  |  Epoch 30  |  Training loss 0.211

Accuracy train: 0.90
Accuracy val: 0.88

Accuracy train: 0.90
Accuracy val: 0.88


{'train': 0.9, 'val': 0.8835}

### 5.5 Compare results to optim.SGD: (with Regularization and Momentum)

**NOTE** You can reduce the number of epochs (and also increase how often they are printed) if your computer is slow when you are still trying to make it work. Once it works try to go back to a higher number of epochs before submitting your assignment.

In [117]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Training on device {device}.")


train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=64, shuffle=True)
model = MyLeNet5()  # TODO: Change this with your own LeNet5 class name
loss_fn = nn.CrossEntropyLoss()

n_epochs = 31
lr = 1e-2
momentum_coeff = 0.6
weight_decay = 0.001


print("\n ========= Training using Pytorch's SGD =========")
model01 = copy.deepcopy(model).to(device=device) 

# TODO: Choose the parameter values consistently so that comparisons make sense
optimizer = optim.SGD(model01.parameters(), lr=lr, momentum=momentum_coeff, weight_decay=weight_decay)


training_loop(
    n_epochs = n_epochs,
    optimizer = optimizer, 
    model = model01,
    loss_fn = loss_fn,
    train_loader = train_loader,
)

print("\n ==== Using manual update inside training loop ======")
model03 = copy.deepcopy(model).to(device=device) 

# TODO: Choose the parameter values consistently so that comparisons make sense
training_loop_manual_optim(
    n_epochs = n_epochs,
    model = model03,
    loss_fn = loss_fn,
    train_loader = train_loader,
    lr = lr,
    momentum_coeff=momentum_coeff,
    weight_decay=weight_decay,

)

# # UNCOMMENT ONCE SECTION 6 IS DONE
# print("\n ========= Training using our SGD =========")
# model02 = copy.deepcopy(model).to(device=device) 
# # TODO: Choose the parameter values consistently so that comparisons make sense
# optimizer = MySGD(model02.parameters(), lr=lr, momentum_coeff=momentum_coeff, weight_decay=weight_decay)

# training_loop(
#     n_epochs = n_epochs,
#     optimizer = optimizer,
#     model = model02,
#     loss_fn = loss_fn,
#     train_loader = train_loader,
# )

# train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=64, shuffle=False)
# val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64, shuffle=False)

print("\n ========= Using Pytorch's SGD =========")
validate(model01, train_loader, val_loader)
print("\n ==== Using manual update inside training loop ======")
validate(model03, train_loader, val_loader)
# UNCOMMENT ONCE SECTION 6 IS DONE
# print("\n ========= Using our SGD =========")
# validate(model02, train_loader, val_loader)

Training on device cuda.

2021-02-21 20:31:04.054379  |  Epoch 1  |  Training loss 0.534
2021-02-21 20:31:07.560588  |  Epoch 10  |  Training loss 0.293
2021-02-21 20:31:11.447181  |  Epoch 20  |  Training loss 0.245
2021-02-21 20:31:15.327619  |  Epoch 30  |  Training loss 0.213

2021-02-21 20:31:16.152758  |  Epoch 1  |  Training loss 0.554
2021-02-21 20:31:19.975598  |  Epoch 10  |  Training loss 0.313
2021-02-21 20:31:24.235124  |  Epoch 20  |  Training loss 0.271
2021-02-21 20:31:28.495703  |  Epoch 30  |  Training loss 0.235

Accuracy train: 0.92
Accuracy val: 0.89

Accuracy train: 0.91
Accuracy val: 0.88


{'train': 0.9099, 'val': 0.881}

## 6. Implementing an Optimizer in PyTorch

You might have noticed that when implementing more and more complex optimizers you start polluting a bit too much your training loop. We might prefer to seperate the code consisting in applying the gradient descent method and the code consisting in feeding your model with new inputs. This is what PyTorch does naturally. Indeed, we instantiate an optimizer (e.g. optim.SGD) with all its parameters, (learning rate, momentum, weight_decay, etc) and then we simply need to call ``optimizer.step()``  and ``optimizer.zero_grad()`` in our training loop independently of the optimization method used and of its parameters.

Similarly to how we defined our custom neural network in Pytorch, we will create a class to define our custom optimizer. But instead of inheriting the ``nn.Module`` class, we will inherit the [torch.optim.Optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Optimizer). As written in the documentation, it is simply the "Base class for all optimizers". Keeping the ``nn.Module`` comparison, instead of having to implement a ``forward`` method (regardless of our model) we will have to implement a [step](https://pytorch.org/docs/stable/optim.html#torch.optim.Optimizer.step) method (regardless of the optimization method chosen )

## TODO:

Same questions as above (implement SGD + momentum + L2 successively and compare with PyTorch's implementation by uncommenting the block of code calling ``MySGD``)

**Hints**

- You might want to add an attribute to your Optimizer when implementing the momentum version
- Might be useful to detach some tensors at some point :) 
- Remember to be careful when updating trainable parameters!
- It's okay if you don't succeed here. Try and we will grade generously. :) 


In [128]:
class MySGD(Optimizer):
    """Implements SGD (optionally with momentum and regularization)
    """

    def __init__(
        self, 
        params,
        lr=0.01,
        momentum_coeff=0.,
        weight_decay=0., 
    ):
        # Don't pay attention to this 'default' thing
        # it's just something required by PyTorch
        defaults = dict(
            lr=lr, 
            momentum_coeff=momentum_coeff,
            weight_decay=weight_decay, 
        )
        super().__init__(params, defaults)
        self.state = {"step": 0}
        self.velocity = None

    @torch.no_grad()
    def step(self, closure=None):
        """Performs a single optimization step.
        """  
        # Don't pay attention to this 'closure' thing
        # it's just something required by PyTorch


        # In the optimizer object, the parameters are stored 
        # in groups for more flexibility (e.g each layer 
        # could have a different learning rate for example)
        # In our case just consider that :
        #
        # for group in self.param_groups:
        #   for p in group['params']:
        #
        # Is the optimizer counterpart of:
        #
        # for p in model.parameters()
        #
        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum_coeff = group['momentum_coeff']
            lr = group['lr']

            for p in group['params']:
                if p.grad is not None:

                    # Now we are safe
                    grad = torch.clone(p.grad).detach()
                    # grad.to(device=device)
      
                    # TODO! parameter update
                    if self.velocity is None:
                        # self.velocity = torch.zeros([16, 3, 3, 3]).to(device=device)
                        # self.velocity = torch.zeros([16]).to(device=device)
                        self.velocity = torch.zeros_like(grad).to(device=device)
                    self.velocity *= momentum_coeff
                    self.velocity += grad
                    grad -= lr * self.velocity
                    grad -= weight_decay * grad

        self.state["step"] += 1


In [129]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Training on device {device}.")


train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=64, shuffle=True)
model = MyLeNet5()  # TODO: Change this with your own LeNet5 class name
loss_fn = nn.CrossEntropyLoss()

n_epochs = 31
lr = 1e-2
momentum_coeff = 0.6
weight_decay = 0.001

# UNCOMMENT ONCE SECTION 6 IS DONE
print("\n ========= Training using our SGD =========")
model02 = copy.deepcopy(model).to(device=device) 
# TODO: Choose the parameter values consistently so that comparisons make sense
optimizer = MySGD(model02.parameters(), lr=lr, momentum_coeff=momentum_coeff, weight_decay=weight_decay)

training_loop(
    n_epochs = n_epochs,
    optimizer = optimizer,
    model = model02,
    loss_fn = loss_fn,
    train_loader = train_loader,
)

train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=64, shuffle=False)
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64, shuffle=False)

print("\n ========= Using Pytorch's SGD =========")
validate(model01, train_loader, val_loader)
print("\n ==== Using manual update inside training loop ======")
validate(model03, train_loader, val_loader)
# UNCOMMENT ONCE SECTION 6 IS DONE
print("\n ========= Using our SGD =========")
validate(model02, train_loader, val_loader)

Training on device cuda.



RuntimeError: ignored

In [None]:
%debug

In [None]:
# grad.size() ~ torch.Size([16])
# self.velocity.size() ~ torch.Size([16, 3, 3, 3])

