# Lesson 14 

Backpropagation continued. Jeremy points out a nice explanation by one of the students: [Simple Neural Net Backward Pass](https://nasheqlbrm.github.io/blog/posts/2021-11-13-backward-pass.html). This is similar to what I suggested in the previous notebook: work out the derivatives longhand, which is a good way to understand the process.

This clarified for me that when he writes `inp.g`, it's the gradient of the loss *with respect to* the input, i.e. $\frac{\partial L}{\partial x}$ where $x$ is the input. The same holds for all the other similarly named variables.
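As a small check of this reading, the sketch below computes $\frac{\partial L}{\partial x}$ longhand for a single linear layer and compares it with what autograd stores in `x.grad`. The layer sizes, the mean-of-squares loss, and the names `out_g` and `inp_g` are assumptions chosen for illustration; they are not taken from the course notebook.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3, requires_grad=True)   # input batch (illustrative sizes)
w = torch.randn(3, 2)                        # weights of one linear layer
b = torch.zeros(2)

out = x @ w + b
loss = out.pow(2).mean()                     # assumed loss: mean of squares
loss.backward()                              # autograd stores dL/dx in x.grad

# Longhand version of the same quantity
out_g = 2 * out.detach() / out.numel()       # dL/d(out) for the mean of squares
inp_g = out_g @ w.t()                        # dL/dx via the chain rule

print(torch.allclose(inp_g, x.grad, atol=1e-6))
```

If the two agree, `inp_g` here plays the same role as `inp.g` in the lesson: the gradient of the loss with respect to the layer's input.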

As a reminder, the course notebooks are here: [Course Repo](https://github.com/fastai/course22p2)

## Recap

Let's get back to where we were before. This section uses notebook [04_minibatch_training.ipynb](https://github.com/fastai/course22p2/blob/master/nbs/04_minibatch_training.ipynb)

In [1]:
import pickle,gzip,math,os,time,shutil,torch,matplotlib as mpl,numpy as np,matplotlib.pyplot as plt
from pathlib import Path
from torch import tensor,nn
import torch.nn.functional as F

torch.set_printoptions(precision=2, linewidth=140, sci_mode=False)
torch.manual_seed(1)
mpl.rcParams['image.cmap'] = 'gray'

path_data = Path('data')
path_gz = path_data/'mnist.pkl.gz'
with gzip.open(path_gz, 'rb') as f: ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
x_train, y_train, x_valid, y_valid = map(tensor, [x_train, y_train, x_valid, y_valid])

# if gpu available, use it
#device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
#x_train, y_train, x_valid, y_valid = x_train.to(device), y_train.to(device), x_valid.to(device), y_valid.to(device)

In [2]:

n,m = x_train.shape
c = y_train.max()+1
nh = 50


class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = [nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out)]
        
    def __call__(self, x):
        for l in self.layers: x = l(x)
        return x

In [3]:
model = Model(m, nh, 10)
pred = model(x_train)
pred.shape

torch.Size([50000, 10])

In [4]:
loss_func = F.cross_entropy

bs=50                  # batch size

xb = x_train[0:bs]     # a mini-batch from x
preds = model(xb)      # predictions
preds[0], preds.shape

(tensor([-0.09, -0.21, -0.08,  0.10, -0.04,  0.08, -0.04, -0.03,  0.01,  0.06], grad_fn=<SelectBackward0>),
 torch.Size([50, 10]))

In [5]:
# helper functions
def accuracy(out, yb): return (out.argmax(dim=1)==yb).float().mean()
def report(loss, preds, yb): print(f'{loss:.2f}, {accuracy(preds, yb):.2f}')
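To make the `accuracy` helper concrete, here is a tiny worked example on a made-up batch of three predictions over two classes. The tensors `toy_out` and `toy_yb` are hypothetical values, used only to show the arithmetic.

```python
import torch

def accuracy(out, yb): return (out.argmax(dim=1)==yb).float().mean()

toy_out = torch.tensor([[0.1, 0.9],    # argmax -> class 1
                        [0.8, 0.2],    # argmax -> class 0
                        [0.3, 0.7]])   # argmax -> class 1
toy_yb = torch.tensor([1, 0, 0])       # so the last prediction is wrong
toy_acc = accuracy(toy_out, toy_yb)
print(f'{toy_acc:.2f}')                # 0.67, i.e. 2 correct out of 3
```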

Simple training loop:

In [6]:
lr = 0.5
epochs = 3

for epoch in range(epochs):
    for i in range(0, n, bs):
        s = slice(i, min(n,i+bs))
        xb,yb = x_train[s],y_train[s]
        preds = model(xb)
        loss = loss_func(preds, yb)
        loss.backward()
        with torch.no_grad():
            for l in model.layers:
                if hasattr(l, 'weight'):
                    l.weight -= l.weight.grad * lr
                    l.bias   -= l.bias.grad   * lr
                    l.weight.grad.zero_()
                    l.bias  .grad.zero_()
    report(loss, preds, yb)

0.12, 0.98
0.12, 0.94
0.08, 0.96


NOTE: on WSL 2, the above takes 2 seconds on CPU using the pdl_gpu environment (and about the same on GPU), but 3 minutes using the pdl_p environment.

As a reminder, we have implemented everything used there ourselves, so as per our rule we can now use the torch versions, as we do above.

## Refactoring


### Parameters

* Using parameters: PyTorch modules keep track of their parameters, so you can call `model.parameters()` to get them all.

In [10]:
class MLP(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.l1 = nn.Linear(n_in,nh)
        self.l2 = nn.Linear(nh,n_out)
        self.relu = nn.ReLU()
        
    def forward(self, x): return self.l2(self.relu(self.l1(x)))

Note the use of `forward` rather than `__call__`. This is how PyTorch modules are meant to be used: the base class `__call__` does some additional work (such as running hooks) that we were bypassing. Our own module implementation was meant to be used the same way; we should have been defining `forward` instead of `__call__`.
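One way to see the extra work done by the base class `__call__` is to attach a forward hook. In the sketch below, `Doubler` is a hypothetical toy module (not from the course) that only defines `forward`; calling the module like a function goes through `nn.Module.__call__`, which runs `forward` and then the registered hook.

```python
import torch
from torch import nn

class Doubler(nn.Module):              # hypothetical toy module
    def forward(self, x): return x * 2

calls = []                             # filled in by the hook
dbl = Doubler()
dbl.register_forward_hook(lambda module, inp, out: calls.append(out))

y_out = dbl(torch.tensor([1., 2.]))    # __call__ -> forward -> hook
print(y_out, len(calls))
```

Had the class overridden `__call__` directly, as our first `Model` did, the hook would never have been triggered.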

In [11]:
model = MLP(m, nh, 10)
list(model.parameters())
list(model.named_children())

[('l1', Linear(in_features=784, out_features=50, bias=True)),
 ('l2', Linear(in_features=50, out_features=10, bias=True)),
 ('relu', ReLU())]

Now we can just use `model.parameters()` to get all the parameters:

In [12]:
def fit():
    for epoch in range(epochs):
        for i in range(0, n, bs):
            s = slice(i, min(n,i+bs))
            xb,yb = x_train[s],y_train[s]
            preds = model(xb)
            loss = loss_func(preds, yb)
            loss.backward()
            with torch.no_grad():
                for p in model.parameters(): p -= p.grad * lr
                model.zero_grad()
        report(loss, preds, yb)

Let's understand how this works. It works by overriding `__setattr__`, so that every layer assigned as an attribute is recorded in the module, which can then collect the parameters of all its layers. This is a bit of Python magic.

In [13]:
class MyModule:
    def __init__(self, n_in, nh, n_out):
        self._modules = {}
        self.l1 = nn.Linear(n_in,nh)
        self.l2 = nn.Linear(nh,n_out)

    def __setattr__(self,k,v):
        if not k.startswith("_"): self._modules[k] = v
        super().__setattr__(k,v)

    def __repr__(self): return f'{self._modules}'
    
    def parameters(self):
        # `yield from` is a shortcut for a loop that yields each element of an iterable
        for l in self._modules.values(): yield from l.parameters()

In [14]:
mdl = MyModule(m, nh, 10)
for p in mdl.parameters(): print(p.shape)

torch.Size([50, 784])
torch.Size([50])
torch.Size([10, 50])
torch.Size([10])
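The `yield from` used in `parameters()` can be illustrated without torch. In this sketch the nested lists are stand-ins for each layer's parameters, and both function names are hypothetical; the two generators are meant to produce the same sequence.

```python
def flatten_with_loop(groups):
    # explicit inner loop: yield every item of every group
    for g in groups:
        for item in g: yield item

def flatten_with_yield_from(groups):
    # same behaviour, delegating the inner loop to `yield from`
    for g in groups: yield from g

groups = [['w1', 'b1'], ['w2', 'b2']]           # stand-ins for per-layer parameters
print(list(flatten_with_yield_from(groups)))    # ['w1', 'b1', 'w2', 'b2']
```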


In our first `Model` above we stored the layers in a plain Python list, which `nn.Module` does not know about, so their parameters are not registered. One way to fix this is to call `self.add_module` manually for each layer, but instead we can use `nn.ModuleList`:

In [15]:
layers = [nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10)] 

In [16]:
class SequentialModel(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        
    def forward(self, x):
        for l in self.layers: x = l(x)
        return x
    
model = SequentialModel(layers)
model

SequentialModel(
  (layers): ModuleList(
    (0): Linear(in_features=784, out_features=50, bias=True)
    (1): ReLU()
    (2): Linear(in_features=50, out_features=10, bias=True)
  )
)
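To see the difference registration makes, the sketch below counts the parameters found by `parameters()` when the same layers are held in a plain list versus an `nn.ModuleList`. The class names `PlainList` and `Registered` and the small layer sizes are made up for this illustration.

```python
from torch import nn

class PlainList(nn.Module):      # hypothetical: layers kept in a Python list
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2)]

class Registered(nn.Module):     # hypothetical: same layers in nn.ModuleList
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2)])

n_plain = len(list(PlainList().parameters()))
n_reg = len(list(Registered().parameters()))
print(n_plain, n_reg)            # 0 4 (two weights and two biases)
```

With the plain list, an optimizer built from `parameters()` would have nothing to update.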

`nn.Sequential` is a convenience class for this kind of model:

In [17]:
model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10))
fit()

0.16, 0.96
0.10, 0.94
0.07, 0.98


### Optimizers

We can also refactor the common optimization tasks into a class:

In [18]:
class Optimizer():
    def __init__(self, params, lr=0.5): self.params,self.lr=list(params),lr

    def step(self):
        with torch.no_grad():
            for p in self.params: p -= p.grad * self.lr

    def zero_grad(self):
        for p in self.params: p.grad.data.zero_()

In [20]:
model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10))
opt = Optimizer(model.parameters())

def fit(model, epochs):
    for epoch in range(epochs):
        for i in range(0, n,  bs):
            s = slice(i, min(n,i+bs))
            xb,yb = x_train[s],y_train[s]
            preds = model(xb)
            loss = loss_func(preds, yb)
            loss.backward()
            opt.step()  # much easier
            opt.zero_grad()
        report(loss, preds, yb)

fit(model, 3)

0.15, 0.96
0.13, 0.98
0.08, 0.98
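The `zero_grad` step is needed because PyTorch adds each new gradient to whatever is already stored in `.grad`. A minimal sketch with a made-up two-element tensor `w_demo`:

```python
import torch

w_demo = torch.tensor([1.0, 2.0], requires_grad=True)

(w_demo * 3).sum().backward()
first = w_demo.grad.clone()      # tensor([3., 3.])

(w_demo * 3).sum().backward()    # no zeroing in between
second = w_demo.grad.clone()     # accumulated: tensor([6., 6.])

w_demo.grad.zero_()              # what Optimizer.zero_grad does per parameter
print(first, second, w_demo.grad)
```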


PyTorch of course has this built in, as `optim.SGD`.

In [21]:
from torch import optim

# convenient function to get model and optimizer
def get_model():
    model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10))
    return model, optim.SGD(model.parameters(), lr=lr)

model,opt = get_model()
loss_func(model(xb), yb)


tensor(2.30, grad_fn=<NllLossBackward0>)
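As a sanity check that plain `optim.SGD` (no momentum or weight decay) performs the same update as our `p -= p.grad * lr`, the sketch below applies one step of each to two copies of the same small layer. The layer size, the random data, the MSE loss, and all `demo_`/`net_` names are assumptions made for the illustration.

```python
import copy, torch
import torch.nn.functional as F
from torch import nn, optim

torch.manual_seed(0)
net_a = nn.Linear(3, 2)
net_b = copy.deepcopy(net_a)                 # identical starting weights
demo_x, demo_y = torch.randn(8, 3), torch.randn(8, 2)
demo_lr = 0.5

# One manual step, as in our own Optimizer.step
F.mse_loss(net_a(demo_x), demo_y).backward()
with torch.no_grad():
    for p in net_a.parameters(): p -= p.grad * demo_lr

# One step with the built-in optimizer
demo_opt = optim.SGD(net_b.parameters(), lr=demo_lr)
F.mse_loss(net_b(demo_x), demo_y).backward()
demo_opt.step()

same = all(torch.allclose(pa, pb, atol=1e-6)
           for pa, pb in zip(net_a.parameters(), net_b.parameters()))
print(same)
```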

### Dataset and DataLoader

In [22]:
# this simple class just pairs x and y, and implements len and getitem

class Dataset():
    def __init__(self, x, y): self.x,self.y = x,y
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i],self.y[i]

train_ds,valid_ds = Dataset(x_train, y_train),Dataset(x_valid, y_valid)

train_ds[0:5]

(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 tensor([5, 0, 4, 1, 9]))

The dataloader takes care of the batching part. We will define it as an iterable whose iterator yields the next batch of data.

In [23]:
class DataLoader():
    def __init__(self, ds, bs): self.ds,self.bs = ds,bs
    def __iter__(self):
        for i in range(0, len(self.ds), self.bs): yield self.ds[i:i+self.bs]

In [24]:
train_dl = DataLoader(train_ds, bs)
xb,yb = next(iter(train_dl))
xb.shape,yb.shape

(torch.Size([50, 784]), torch.Size([50]))
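The same batching logic can be tried on plain Python lists. This sketch uses hypothetical `ToyDataset` and `ToyLoader` classes that mirror the ones above, with 7 items and a batch size of 3, to show that the final batch is simply shorter when the dataset size is not a multiple of the batch size.

```python
class ToyDataset:
    def __init__(self, x, y): self.x,self.y = x,y
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i],self.y[i]

class ToyLoader:
    def __init__(self, ds, bs): self.ds,self.bs = ds,bs
    def __iter__(self):
        # slicing past the end of a list just returns what is left
        for i in range(0, len(self.ds), self.bs): yield self.ds[i:i+self.bs]

toy_ds = ToyDataset(list(range(7)), list('abcdefg'))
toy_dl = ToyLoader(toy_ds, 3)
batch_sizes = [len(xb_toy) for xb_toy, yb_toy in toy_dl]
print(batch_sizes)               # [3, 3, 1]
```

Because `__iter__` is a generator function, each `for` loop over `toy_dl` starts again from the first batch, which is what lets the training loop reuse `train_dl` every epoch.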

In [25]:

def fit():
    for epoch in range(epochs):
        for xb,yb in train_dl:
            preds = model(xb)
            loss = loss_func(preds, yb)
            loss.backward()
            opt.step()
            opt.zero_grad()
        report(loss, preds, yb)

fit()

0.10, 0.96
0.06, 0.96
0.04, 1.00


## At about 35:30 into the video