# Lesson 13

Much of the beginning of the video is a review of the basics of a neural network (matrix multiplications and pointwise nonlinear functions).  He starts with a simple 2-layer network: a hidden layer with 50 units and an output layer with a single unit that predicts the digit (the code below uses 40 hidden units).  The loss is then the MSE between the actual digit and the output of this last layer. This is a simplified model that is not really suitable for classification, but it is a good starting point for understanding the basics of a neural network.

First we have to set things up again:

In [1]:

import pickle,gzip,math,os,time,shutil,torch,matplotlib as mpl, numpy as np
from pathlib import Path
from torch import tensor
from fastcore.test import test_close
torch.manual_seed(42)

mpl.rcParams['image.cmap'] = 'gray'
torch.set_printoptions(precision=2, linewidth=125, sci_mode=False)
np.set_printoptions(precision=2, linewidth=125)

path_data = Path('data')
path_gz = path_data/'mnist.pkl.gz'
with gzip.open(path_gz, 'rb') as f: ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
x_train, y_train, x_valid, y_valid = map(tensor, [x_train, y_train, x_valid, y_valid])

He then explains that, with the loss in hand, we need gradients to know how to adjust the weights. There is a long discussion giving some intuition for the chain rule, which I will not repeat here, as I am quite familiar with calculus and the chain rule.
The chain rule is important because we have a function (the prediction) that is a composition of other functions.
 

The approach is to start at the end, take the derivative, and keep multiplying (chain rule!) by the derivative of the preceding function until we reach the beginning. This is, in essence, the backpropagation algorithm.
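
As a tiny illustration of that backwards walk (my own toy example, not from the video; the composition and the variable names are made up), we can apply the chain rule by hand to `y = f(g(x))` and compare with what autograd computes:

```python
import torch

# Hypothetical toy composition: y = f(g(x)) with g(x) = 3x + 1 and f(u) = u**2
x = torch.tensor(2.0, requires_grad=True)
u = 3 * x + 1      # g(x)
y = u ** 2         # f(u)

# Manual chain rule, starting from the end and multiplying backwards
dy_du = 2 * u.item()   # derivative of f, evaluated at u
du_dx = 3.0            # derivative of g
manual = dy_du * du_dx

y.backward()           # autograd performs the same backwards walk
print(manual, x.grad.item())   # 42.0 42.0
```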

Let's see how this works. We will set up this super simple model:

In [17]:
n,m = x_train.shape
c = y_train.max()+1  # number of classes.
n,m,c

nh = 40  # hidden layer.
w1 = torch.randn(m,nh)
b1 = torch.zeros(nh)
w2 = torch.randn(nh,1)
b2 = torch.zeros(1)

def lin(x, w, b): return x@w + b
def relu(x): return x.clamp_min(0.)
     
# very basic model:
def model(xb):
    l1 = lin(xb, w1, b1)
    l2 = relu(l1)
    return lin(l2, w2, b2)

# reminder, mse loss, which is NOT really suitable for classification.
def mse(output, targ): return (output[:,0]-targ).pow(2).mean()

Now to compute the gradients using the chain rule. The video walks through this, and it is certainly worth doing.  The way I did it was to write out the equations longhand and work through the derivatives, keeping track of the Jacobians as I went. This is a bit tedious, but it is a good exercise for understanding the chain rule.  I don't document that here though.
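
For reference, here is a short summary of where that working-out ends up. This is my own summary rather than something taken from the video, and the notation is an assumption: $n$ is the batch size, $k$ indexes samples, $o = xW + b$ is a linear layer, $a = \max(z, 0)$ is the ReLU, and $L$ is the MSE loss.

$$
\frac{\partial L}{\partial o_k} = \frac{2}{n}\,(o_k - y_k)
\qquad \text{for } L = \frac{1}{n}\sum_{k=1}^{n} (o_k - y_k)^2
$$

$$
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial o}\, W^\top, \qquad
\frac{\partial L}{\partial W} = x^\top\, \frac{\partial L}{\partial o}, \qquad
\frac{\partial L}{\partial b} = \sum_{k=1}^{n} \left(\frac{\partial L}{\partial o}\right)_k
$$

$$
\frac{\partial L}{\partial z} = \mathbb{1}[z > 0] \odot \frac{\partial L}{\partial a}
$$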

We save the gradients (Jacobians) as attributes of the tensors, e.g. `v.g`.

In [40]:
def lin_grad(inp, out, w, b):
    # grad of matmul with respect to input
    inp.g = out.g @ w.t()    # note w.t() is the transpose of w, same as w.T  for 2D tensors.
    #w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
    w.g = inp.T @ out.g  # this is the same as the above line, but more concise. 
    b.g = out.g.sum(0)    
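
The claim that the commented-out line and `inp.T @ out.g` are the same can be checked on small random tensors. This is a sketch with made-up sizes; `out_g` is just a stand-in for `out.g`:

```python
import torch

torch.manual_seed(0)
inp = torch.randn(5, 3)     # batch of 5 samples, 3 input features (made-up sizes)
out_g = torch.randn(5, 2)   # stand-in for out.g: gradient w.r.t. 2 outputs

# Long form: per-sample outer products, summed over the batch dimension
long_form = (inp.unsqueeze(-1) * out_g.unsqueeze(1)).sum(0)
# Short form: one matrix product does the same sum
short_form = inp.T @ out_g

print(long_form.shape, torch.allclose(long_form, short_form, atol=1e-5))
```

The long form builds a `(5, 3, 2)` intermediate tensor before summing, so the matrix product is also much cheaper in memory for a full batch.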

In [41]:
def forward_and_backward(inp, targ):
    # forward pass, we can't use model() here, as we need to keep track of intermediate values.
    l1 = lin(inp, w1, b1)
    l2 = relu(l1)
    out = lin(l2, w2, b2)
    diff = out[:,0]-targ
    loss = diff.pow(2).mean() # note we don't care about the value of the loss, just the gradient, so this line is not needed.
    
    # backward pass:
    out.g = 2.*diff[:,None] / inp.shape[0]
    lin_grad(l2, out, w2, b2)
    l1.g = (l1>0).float() * l2.g
    lin_grad(inp, l1, w1, b1)

In [42]:
forward_and_backward(x_train, y_train)

Using VS Code's 'debug cell' we can explore what is happening by looking at the gradients in each layer.  In the video he uses pdb directly, via `pdb.set_trace()`.  The VS Code debugger is much nicer when it works, but I think it is a bit buggy. For example, you have to make sure that you have executed the current version of the functions you will be stepping into, and I think you have to avoid any extra blank lines at the top of cells. It is probably worth knowing the basics of pdb though.

In [43]:
# Save for testing against later
def get_grad(x): return x.g.clone()
chks = w1,w2,b1,b2,x_train
grads = w1g,w2g,b1g,b2g,ig = tuple(map(get_grad, chks))
     

Use PyTorch's autograd to check our results:

In [8]:
def mkgrad(x): return x.clone().requires_grad_(True)
ptgrads = w12,w22,b12,b22,xt2 = tuple(map(mkgrad, chks))

def forward(inp, targ):
    l1 = lin(inp, w12, b12)
    l2 = relu(l1)
    out = lin(l2, w22, b22)
    return mse(out, targ)

loss = forward(xt2, y_train)
loss.backward()

for a,b in zip(grads, ptgrads): test_close(a, b.grad, eps=0.01)

### NN from scratch

We can use this to create our own version of the basic PyTorch modules by refactoring the code above.  
In the video he uses this as an opportunity to introduce `__call__`.  
I skipped an intermediate refactor where each module has repeated code to save inputs and outputs.    
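
As a quick reminder of what `__call__` does (a toy sketch; the class `Scale` is made up for illustration), defining it lets an instance be used like a function:

```python
class Scale:
    """Hypothetical toy class: calling an instance runs __call__."""
    def __init__(self, factor=2): self.factor = factor
    def __call__(self, x): return x * self.factor

d = Scale()
print(d(21))   # 42, exactly the same as d.__call__(21)
```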

In [10]:
class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]
        self.loss = Mse()
        
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)
    
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()


class Module():
    def __call__(self, *args):
        # save inputs and outputs for later.
        self.args = args
        self.out = self.forward(*args)
        return self.out

    def forward(self, *args): raise NotImplementedError
    def backward(self): self.bwd(self.out, *self.args) # the * unpacks the saved tuple of arguments.
    def bwd(self, out, *args): raise NotImplementedError
     

class Relu(Module):
    def forward(self, inp): return inp.clamp_min(0.)
    def bwd(self, out, inp): inp.g = (inp>0).float() * out.g
     

class Lin(Module):
    def __init__(self, w, b): self.w,self.b = w,b
    def forward(self, inp): return inp@self.w + self.b
    def bwd(self, out, inp):
        inp.g = self.out.g @ self.w.t()
        self.w.g = inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)
     

class Mse(Module):
    def forward (self, inp, targ): return (inp.squeeze() - targ).pow(2).mean()
    def bwd(self, out, inp, targ): inp.g = 2*(inp.squeeze()-targ).unsqueeze(-1) / targ.shape[0]
     

model = Model(w1, b1, w2, b2)
     

loss = model(x_train, y_train)
     

model.backward()
     

test_close(w2g, w2.g, eps=0.01)
test_close(b2g, b2.g, eps=0.01)
test_close(w1g, w1.g, eps=0.01)
test_close(b1g, b1.g, eps=0.01)
test_close(ig, x_train.g, eps=0.01)

### pytorch 
Or better, use autograd and `nn.Module`.   We don't need to define `backward` (but can!), because autograd takes care of it.

In [16]:
from torch import nn
import torch.nn.functional as F

class Linear(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.w = torch.randn(n_in,n_out).requires_grad_()
        self.b = torch.zeros(n_out).requires_grad_()
    def forward(self, inp): return inp@self.w + self.b

class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = [Linear(n_in,nh), nn.ReLU(), Linear(nh,n_out)]
        
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return F.mse_loss(x, targ[:,None])
    
model = Model(m, nh, 1)
loss = model(x_train, y_train.to(torch.float))
loss.backward()

l0 = model.layers[0]
l0.b.grad

tensor([ 41.11,   3.76,  14.97, -14.83, -22.24, -15.44,  62.55,  -4.69,   7.02,  -1.47,   1.37,  -4.08, -29.48,  -0.13,
         -0.27,  -3.64,  -3.19,  -2.51, -41.20,   0.99,  -3.10,  35.40,  20.97,  14.38, -24.57,  16.74, -10.15,   8.52,
          5.86,  29.04, -10.40,  -6.18,  16.78, -28.45,   5.12,  15.02,  -2.23,  -2.19, -13.74,   0.83])
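
On the "but can!" remark: one way to supply your own backward while still using autograd is `torch.autograd.Function`. This is my own sketch, not something shown in these notes, and `MyRelu` is a made-up name:

```python
import torch

class MyRelu(torch.autograd.Function):
    """Hypothetical example: ReLU with a hand-written backward."""
    @staticmethod
    def forward(ctx, inp):
        ctx.save_for_backward(inp)       # keep the input for the backward pass
        return inp.clamp_min(0.)

    @staticmethod
    def backward(ctx, grad_out):
        inp, = ctx.saved_tensors
        return grad_out * (inp > 0).float()   # same rule as Relu.bwd above

x = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)
MyRelu.apply(x).sum().backward()
print(x.grad)   # tensor([0., 1., 1.])
```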

But now that we have seen how it works, we can use the PyTorch versions in the sequel!

## Minibatch training (notebook number 4)

About 1:18 into the video, he jumps to notebook number 4.

### Improve the loss: use cross entropy loss / softmax

In the video he works through how to get from this (using the basic math of logs and exponentials):

$$
\log \text{softmax}(x)_i = \log \frac{e^{x_i}}{\sum_j e^{x_j}}
$$


to this:


In [48]:
def log_softmax(x): return x - x.logsumexp(-1,keepdim=True)

Note that `logsumexp` uses the trick of factoring out $e^{\max(x)}$ to avoid numerical instability.  
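
To see why that matters, here is a small sketch with made-up, deliberately large logits. The naive computation overflows in float32, while subtracting the max first gives the same answer as the built-in:

```python
import torch

x = torch.tensor([1000.0, 1001.0, 1002.0])   # made-up large logits

# Naive version: exp(1000) overflows float32, so the result is inf
naive = x.exp().sum().log()

# Stable version: factor out e^max(x), i.e. subtract the max before exp
m = x.max()
stable = m + (x - m).exp().sum().log()

print(naive, stable, x.logsumexp(-1))
```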



In [49]:
# model without loss
class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = [nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out)]
        
    def __call__(self, x):
        for l in self.layers: x = l(x)
        return x
    
model = Model(m, nh, 10) # 10 classes.
pred = model(x_train) # raw predictions (untrained)
sm_pred = log_softmax(pred) # log softmax of predictions.

The cross entropy loss is:

$$
\text{cross entropy loss} = - \sum_i y_i \log \hat{y}_i
$$

The $\hat{y}_i$ here are the probabilities computed by the softmax.  $y_i$ is just 0 or 1 depending on whether class $i$ is the target (a one-hot encoding), so all this really does is look up the log of the probability (the log of the softmax) for the correct class.  For a full batch we average over all the samples.
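
The equivalence between the sum in the formula and a simple lookup can be checked directly. This is a sketch on made-up random logits (the tensors here are illustrative, not the MNIST model):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 10)         # made-up raw outputs: 3 samples, 10 classes
targets = torch.tensor([5, 0, 4])   # integer class labels

log_probs = logits.log_softmax(dim=-1)

# The formula as written: -sum_i y_i * log(y_hat_i), with y one-hot encoded
one_hot = F.one_hot(targets, num_classes=10).float()
via_sum = -(one_hot * log_probs).sum(dim=-1)

# Equivalent: just look up the log-probability of the target class
via_lookup = -log_probs[torch.arange(3), targets]

print(torch.allclose(via_sum, via_lookup))
print(torch.allclose(via_lookup.mean(), F.cross_entropy(logits, targets)))
```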


Let's look at this for a batch of just 3:

In [52]:
y_train[:3]  # targets are just integers

tensor([5, 0, 4])

In [54]:
sm_pred[[0,1,2], y_train[:3]]   # the log of the softmax for the correct class of each of the first 3 items

tensor([-2.34, -2.36, -2.39], grad_fn=<IndexBackward0>)

In [57]:
def nll(input, target): return -input[range(target.shape[0]), target].mean()

nll(sm_pred, y_train)

tensor(2.32, grad_fn=<NegBackward0>)

PyTorch combines log softmax and nll into a single function, `F.cross_entropy()`:

In [58]:
F.cross_entropy(pred, y_train)

tensor(2.32, grad_fn=<NllLossBackward0>)

### Basic training loop


In [61]:
loss_func = F.cross_entropy
bs = 64  # batch size

lr = 0.5  # learning rate
epochs = 3  # how many epochs to train for

# compute accuracy metric
def accuracy(out, yb): return (torch.argmax(out, dim=1)==yb).float().mean()

accuracy(pred, y_train) # random

tensor(0.14)

In [63]:
for epoch in range(epochs):
    print(f'epoch {epoch}')
    for i in range(0, n, bs):
        s = slice(i, min(i+bs, n))
        xb, yb  = x_train[s], y_train[s]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        with torch.no_grad():
            for l in model.layers:
                if hasattr(l, 'weight'):
                    l.weight -= l.weight.grad * lr
                    l.bias   -= l.bias.grad   * lr
                    l.weight.grad.zero_()
                    l.bias  .grad.zero_()
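
The `zero_()` calls in the loop above matter because PyTorch adds into `.grad` on every `backward()` call rather than overwriting it. A minimal sketch with a made-up scalar parameter `w`:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()     # gradient of 3*w with respect to w

(3 * w).backward()        # no zeroing in between: the new gradient is added
second = w.grad.item()

w.grad.zero_()            # reset, as the training loop does after each step
(3 * w).backward()
third = w.grad.item()

print(first, second, third)   # 3.0 6.0 3.0
```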

Even for just 3 epochs this is VERY slow

In [65]:
pred = model(x_train)
accuracy(pred, y_train) # after training

tensor(0.96)

Next time: refactor this loop