# Mini batch training
In this notebook, we are first going to implement cross entropy, then move on to mini-batch training.

## Setup

In [395]:
import pickle,gzip,math,os,time,shutil,torch,matplotlib as mpl,numpy as np,matplotlib.pyplot as plt
from pathlib import Path
from torch import tensor,nn
import torch.nn.functional as F

In [396]:
from fastcore.test import test_close

torch.set_printoptions(precision=2, linewidth=140, sci_mode=False)
torch.manual_seed(1)
mpl.rcParams['image.cmap'] = 'gray'

path_data = Path('data')
path_gz = path_data/'mnist.pkl.gz'
with gzip.open(path_gz, 'rb') as f: ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
x_train, y_train, x_valid, y_valid = map(tensor, [x_train, y_train, x_valid, y_valid])

### Data

In [397]:
num_input, input_dim = x_train.shape
c = y_train.max()+1
nh = 50

In [398]:
class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = [nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out)]
        
    def __call__(self, x):
        for l in self.layers: x = l(x)
        return x

In [399]:
model = Model(input_dim, nh, c)
pred = model(x_train)
pred.shape

torch.Size([50000, 10])

## Cross entropy loss
We implement the log of the softmax rather than the softmax itself: the cross entropy loss only needs log-probabilities, and the log form is easier to compute.

In [400]:
torch.sum(torch.exp(pred), dim=1, keepdim=True).shape

torch.Size([50000, 1])

Because the sum has shape `[50000, 1]`, broadcasting lets us divide `pred.exp()` by it row by row: each $e^{x_i}$ is divided by $\sum_j e^{x_j}$, where $x_j$ is the model output for class $j$. Taking the log turns this division into a subtraction.

In [401]:
def log_softmax(x):
    return x - x.exp().sum(-1, keepdim=True).log()

In [402]:
log_softmax(pred)

tensor([[-2.37, -2.49, -2.36,  ..., -2.31, -2.28, -2.22],
        [-2.37, -2.44, -2.44,  ..., -2.27, -2.26, -2.16],
        [-2.48, -2.33, -2.28,  ..., -2.30, -2.30, -2.27],
        ...,
        [-2.33, -2.52, -2.34,  ..., -2.31, -2.21, -2.16],
        [-2.38, -2.38, -2.33,  ..., -2.29, -2.26, -2.17],
        [-2.33, -2.55, -2.36,  ..., -2.29, -2.27, -2.16]], grad_fn=<SubBackward0>)

In [403]:
log_softmax(pred).shape

torch.Size([50000, 10])
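As a quick sanity check, the hand-written `log_softmax` can be compared against `F.log_softmax`. This is a minimal, self-contained sketch: the random `logits` tensor is a hypothetical stand-in for `pred`, not the notebook's actual predictions.

```python
import torch
import torch.nn.functional as F

def log_softmax(x):
    # log(exp(x_i) / sum_j exp(x_j)) = x_i - log(sum_j exp(x_j))
    return x - x.exp().sum(-1, keepdim=True).log()

torch.manual_seed(0)
logits = torch.randn(4, 10)  # hypothetical stand-in for `pred`

ours = log_softmax(logits)
ref = F.log_softmax(logits, dim=-1)

# Exponentiating each row should give probabilities that sum to 1.
row_sums = ours.exp().sum(-1)
print(torch.allclose(ours, ref, atol=1e-6), row_sums)
```

If both checks pass, the function returns valid log-probabilities that agree with PyTorch's version on these inputs.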

Next, we want to use the LogSumExp trick. The exponent of a large value can overflow the floating-point range (or lose precision), which would corrupt the loss and therefore the gradients. Subtracting the row maximum $m$ before exponentiating avoids this, using the identity $\log \sum_j e^{x_j} = m + \log \sum_j e^{x_j - m}$.
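To see why this matters, the sketch below evaluates the log of the summed exponentials directly and with the row maximum subtracted first. The function names and the large input values are illustrative assumptions chosen to force an overflow in 32-bit floats.

```python
import torch

def naive_logsumexp(x):
    # Direct evaluation: exp() overflows to inf for large inputs.
    return x.exp().sum(-1).log()

def stable_logsumexp(x):
    # Subtract the row maximum first, so the largest exponent is exp(0) = 1.
    m = x.max(-1)[0]
    return m + (x - m[:, None]).exp().sum(-1).log()

big = torch.tensor([[1000., 1001., 1002.]])  # illustrative large logits

naive = naive_logsumexp(big)
stable = stable_logsumexp(big)
print(naive, stable)
```

The naive version returns `inf`, while the shifted version stays finite and agrees with `torch.logsumexp`.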

In [404]:
m = pred.max(-1)[0]
m

tensor([0.10, 0.14, 0.21,  ..., 0.14, 0.11, 0.14], grad_fn=<MaxBackward0>)

In [405]:
m[:, None].shape

torch.Size([50000, 1])

In [406]:
pred.shape

torch.Size([50000, 10])

In [407]:
def logsumexp_softmax(x):
    m = x.max(-1)[0]
    return (m + (x - m[:, None]).exp().sum(-1).log()).unsqueeze(-1)

In [408]:
logsumexp_softmax(pred).shape

torch.Size([50000, 1])

In [409]:
logsumexp_softmax(pred)

tensor([[2.28],
        [2.30],
        [2.29],
        ...,
        [2.30],
        [2.28],
        [2.30]], grad_fn=<UnsqueezeBackward0>)

In [410]:
pred.exp().sum(-1, keepdim=True)

tensor([[ 9.79],
        [ 9.97],
        [ 9.91],
        ...,
        [10.01],
        [ 9.80],
        [ 9.94]], grad_fn=<SumBackward1>)

In [411]:
def log_softmax(x):
    return x - logsumexp_softmax(x)

In [412]:
log_softmax(pred)

tensor([[-2.37, -2.49, -2.36,  ..., -2.31, -2.28, -2.22],
        [-2.37, -2.44, -2.44,  ..., -2.27, -2.26, -2.16],
        [-2.48, -2.33, -2.28,  ..., -2.30, -2.30, -2.27],
        ...,
        [-2.33, -2.52, -2.34,  ..., -2.31, -2.21, -2.16],
        [-2.38, -2.38, -2.33,  ..., -2.29, -2.26, -2.17],
        [-2.33, -2.55, -2.36,  ..., -2.29, -2.27, -2.16]], grad_fn=<SubBackward0>)

Our LogSumExp implementation produces the same output as before. PyTorch already provides this as `logsumexp`.

In [413]:
def log_softmax(x):
    return x - x.logsumexp(-1, keepdim=True)

In [414]:
sm_pred = log_softmax(pred)
sm_pred

tensor([[-2.37, -2.49, -2.36,  ..., -2.31, -2.28, -2.22],
        [-2.37, -2.44, -2.44,  ..., -2.27, -2.26, -2.16],
        [-2.48, -2.33, -2.28,  ..., -2.30, -2.30, -2.27],
        ...,
        [-2.33, -2.52, -2.34,  ..., -2.31, -2.21, -2.16],
        [-2.38, -2.38, -2.33,  ..., -2.29, -2.26, -2.17],
        [-2.33, -2.55, -2.36,  ..., -2.29, -2.27, -2.16]], grad_fn=<SubBackward0>)

Now that we have calculated the log softmax, we can compute the cross entropy loss for a target distribution $x$ and predicted probabilities $p(x)$ as $- \sum x \log p(x)$.

Our targets are effectively one-hot encoded (in practice, `y_train` stores the index of the true class), so the sum has a single non-zero term and simplifies to $- \log(p_i)$, where $p_i$ is the predicted probability of the true class.

This is also known as the negative log likelihood loss.
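The simplification can be checked numerically. The sketch below uses toy log-probabilities and toy targets (both assumptions for illustration) and compares the full sum over a one-hot target matrix with direct indexing by class.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
log_probs = F.log_softmax(torch.randn(5, 3), dim=-1)  # toy log-probabilities
targets = torch.tensor([0, 2, 1, 1, 0])               # toy class indices

# Full cross entropy with an explicit one-hot target matrix.
one_hot = F.one_hot(targets, num_classes=3).float()
ce_one_hot = -(one_hot * log_probs).sum(-1).mean()

# Shortcut: pick out the log-probability of the true class directly.
ce_indexed = -log_probs[torch.arange(len(targets)), targets].mean()

print(ce_one_hot, ce_indexed)
```

Both values should match, and they should also match `F.nll_loss(log_probs, targets)`.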

In [415]:
y_train

tensor([5, 0, 4,  ..., 8, 4, 8])

In [416]:
sm_pred[torch.arange(sm_pred.shape[0]), y_train].shape, y_train.shape

(torch.Size([50000]), torch.Size([50000]))

In [417]:
sm_pred[torch.arange(sm_pred.shape[0]), y_train].mean()

tensor(-2.30, grad_fn=<MeanBackward0>)

In [418]:
def nll(sm_pred, tgt):
    return -1. * sm_pred[torch.arange(sm_pred.shape[0]), tgt].mean()

In [419]:
nll(sm_pred, y_train)

tensor(2.30, grad_fn=<MulBackward0>)

In PyTorch, log softmax and negative log likelihood loss are combined in the cross entropy loss, available as `nn.CrossEntropyLoss` or `F.cross_entropy`.

In [420]:
loss = nn.CrossEntropyLoss()
loss(pred, y_train)

tensor(2.30, grad_fn=<NllLossBackward0>)
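The equivalence can be verified on a small example. In this sketch the `logits` and `targets` tensors are random toy data, not the MNIST tensors used above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)            # toy raw model outputs
targets = torch.randint(0, 10, (8,))   # toy class indices

# Two explicit steps: log softmax, then negative log likelihood.
two_step = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
# One step: cross entropy applied to the raw outputs.
one_step = F.cross_entropy(logits, targets)
print(two_step, one_step)
```

Note that `F.cross_entropy` expects raw model outputs, so no softmax should be applied before calling it.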

## Basic training loop
A basic training loop repeats the following steps:
- get the output of the model on a batch of inputs
- compare the output to the labels we have and compute a loss
- calculate the gradients of the loss with respect to every parameter of the model
- update said parameters with those gradients to make them a little bit better
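The four steps above can be sketched on a tiny problem. Everything here is an assumption for illustration: the random toy data, the single linear layer `toy_model`, and the learning rate of 0.1.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
# Toy data (hypothetical): 20 samples, 4 features, 3 classes.
xb = torch.randn(20, 4)
yb = torch.randint(0, 3, (20,))
toy_model = nn.Linear(4, 3)
lr = 0.1

losses = []
for step in range(50):
    preds = toy_model(xb)                 # 1. forward pass
    loss = F.cross_entropy(preds, yb)     # 2. compute the loss
    loss.backward()                       # 3. gradients w.r.t. every parameter
    with torch.no_grad():                 # 4. update, then reset the gradients
        for p in toy_model.parameters():
            p -= lr * p.grad
            p.grad.zero_()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The loss on this fixed batch should decrease from the first step to the last.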

In [421]:
loss_function = F.cross_entropy

In [422]:
bs = 50
xb = x_train[0:bs]
preds = model(xb)

In [423]:
preds[0], preds.shape

(tensor([-0.09, -0.21, -0.08,  0.10, -0.04,  0.08, -0.04, -0.03,  0.01,  0.06], grad_fn=<SelectBackward0>),
 torch.Size([50, 10]))

In [424]:
y_preds = preds.argmax(dim=1).float()
y_preds

tensor([3., 9., 3., 8., 5., 9., 3., 9., 3., 9., 5., 3., 9., 9., 3., 9., 9., 5., 8., 7., 9., 5., 3., 8., 9., 5., 9., 5., 5., 9., 3., 5., 9.,
        7., 5., 7., 9., 9., 3., 9., 3., 5., 3., 8., 3., 5., 9., 5., 9., 5.])

In [425]:
yb = y_train[0:bs]
yb

tensor([5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1, 7, 2, 8, 6, 9, 4, 0, 9, 1, 1, 2, 4, 3, 2, 7, 3, 8, 6, 9, 0, 5, 6, 0, 7, 6, 1, 8, 7, 9,
        3, 9, 8, 5, 9, 3])

In [426]:
loss_function(model(xb), yb)

tensor(2.30, grad_fn=<NllLossBackward0>)

In [427]:
def accuracy(preds, actual): 
    return 1. * sum(preds.argmax(dim=1) == actual) / actual.shape[0]

In [428]:
accuracy(preds, yb)

tensor(0.08)

In [429]:
def report(loss, preds, yb): print(f'{loss:.2f}, {accuracy(preds, yb):.2f}')

In [430]:
lr=0.5
epochs=3

In [431]:
xb, yb = x_train[:bs], y_train[:bs]
y_pred = model(xb)
report(loss_function(y_pred, yb), y_pred, yb)

2.30, 0.08


We can now write our training loop.

In [432]:
for i in range(epochs):
    for b in range(0, num_input, bs):
        s = slice(b, min(b+bs, num_input))
        xb, yb = x_train[s], y_train[s]
        y_pred = model(xb)
        loss = loss_function(y_pred, yb)
        loss.backward()

        with torch.no_grad():
            for l in model.layers:
                if hasattr(l, 'weight'):
                    l.weight -= l.weight.grad * lr
                    l.bias -= l.bias.grad * lr
                    l.weight.grad.zero_()
                    l.bias.grad.zero_()

    report(loss, y_pred, yb)

0.12, 0.98
0.12, 0.94
0.08, 0.96

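The reported loss and accuracy are computed on the last mini-batch of each epoch, so they are noisy, but they show that the model fits the training set well after just 3 epochs. Recall that we have 10 classes, so random guessing on a balanced dataset would give about 10% accuracy, and 96% is very good.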

## Use parameters and optim

### Parameters
We can now use `nn.Module`'s `parameters()` to reduce the number of lines of code. The parameters are the weights and biases of each layer, which `nn.Module` registers automatically when a layer is assigned as an attribute.

In [433]:
class MLP(nn.Module):
    def __init__(self, input_dim, nh, output_dim):
        super().__init__()
        self.l1 = nn.Linear(input_dim, nh)
        self.l2 = nn.Linear(nh, output_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.l2(self.relu(self.l1(x)))

In [434]:
model = MLP(input_dim, nh, c.item())

In [435]:
list(model.parameters())

[Parameter containing:
 tensor([[ 0.04,  0.03, -0.02,  ...,  0.03,  0.03,  0.00],
         [ 0.04,  0.01,  0.03,  ..., -0.01,  0.02,  0.03],
         [ 0.03,  0.03, -0.01,  ...,  0.01, -0.01,  0.00],
         ...,
         [-0.01, -0.02, -0.03,  ..., -0.01,  0.00, -0.00],
         [-0.01, -0.02,  0.00,  ...,  0.00,  0.01, -0.01],
         [ 0.03,  0.02, -0.02,  ..., -0.01, -0.01, -0.04]], requires_grad=True),
 Parameter containing:
 tensor([-0.01, -0.01, -0.00,  0.01, -0.03,  0.02, -0.03,  0.01, -0.01, -0.00,  0.03, -0.02, -0.01, -0.02,  0.00,  0.03, -0.04,  0.01, -0.00,
         -0.02, -0.01, -0.02,  0.03,  0.02, -0.01,  0.02, -0.02, -0.02, -0.00, -0.02, -0.01, -0.01,  0.02,  0.01,  0.03,  0.03,  0.02, -0.02,
          0.02, -0.00,  0.01, -0.03,  0.03,  0.01,  0.03, -0.01,  0.02, -0.01,  0.03,  0.00], requires_grad=True),
 Parameter containing:
 tensor([[    -0.05,      0.02,      0.07,      0.09,      0.10,      0.12,     -0.13,      0.10,     -0.09,     -0.08,     -0.12,     -0.01,


We see that calling `parameters()` on an object inheriting from `nn.Module` gives us an iterator over the weights and biases in that model.
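Registration only happens for modules assigned as attributes (or held in a container such as `nn.ModuleList`). The earlier `Model` class keeps its layers in a plain Python list, which is why the first training loop had to walk `model.layers` by hand. The sketch below contrasts the two cases; the class names and layer sizes are hypothetical.

```python
from torch import nn

class ListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list is invisible to nn.Module's bookkeeping.
        self.layers = [nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2)]

class RegisteredModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer as a submodule.
        self.layers = nn.ModuleList([nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2)])

n_list = len(list(ListModel().parameters()))
n_registered = len(list(RegisteredModel().parameters()))
print(n_list, n_registered)  # 0 and 4 (two weights, two biases)
```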

We can now shorten our training loop.

In [436]:
for i in range(epochs):
    for b in range(0, x_train.shape[0], bs):
        s = slice(b, min(x_train.shape[0], b+bs))
        xb, yb = x_train[s], y_train[s]
        y_pred = model(xb)
        loss = loss_function(y_pred, yb)
        loss.backward()

        with torch.no_grad():
            for p in model.parameters():
                p -= p.grad * lr
            model.zero_grad()

    report(loss, y_pred, yb)

0.13, 0.96
0.08, 0.98
0.09, 0.98


We see that our code is now shorter. Next, we can use PyTorch's `optim` to make the code even shorter.

### Optim
PyTorch's `optim` offers various optimisation algorithms for training neural networks, such as SGD and Adam. We will first implement our own SGD optimizer from scratch.

In [437]:
class SGDOptimizer():
    def __init__(self, params, lr=0.5):
        self.params = list(params)
        self.lr = lr

    def step(self):
        with torch.no_grad():
            for p in self.params:
                p -= p.grad * self.lr

    def zero_grad(self):
        for p in self.params:
            p.grad.zero_()

In [438]:
model.parameters()

<generator object Module.parameters at 0x17a72c740>

In [439]:
model = nn.Sequential(nn.Linear(input_dim, nh), nn.ReLU(), nn.Linear(nh, c))
optim = SGDOptimizer(model.parameters())

We can further shorten our training loop.

In [440]:
for i in range(epochs):
    for b in range(0, num_input, bs):
        s = slice(b, min(num_input, b+bs))
        xb, yb = x_train[s], y_train[s]
        y_pred = model(xb)
        loss = loss_function(y_pred, yb)
        loss.backward()
        optim.step()
        optim.zero_grad()
            
    report(loss, y_pred, yb)

0.11, 0.98
0.12, 0.96
0.12, 0.98


Now that we have implemented SGD Optimizer from scratch, we can use PyTorch's `optim`.

In [441]:
def get_model():
    model = nn.Sequential(nn.Linear(input_dim,nh), nn.ReLU(), nn.Linear(nh,10))
    return model, torch.optim.SGD(model.parameters(), lr=lr)
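To check that the hand-written optimizer behaves like `torch.optim.SGD` (with no momentum or weight decay), the sketch below applies one update step with each to two identical copies of a small model. The toy batch and the single linear layer are assumptions for illustration.

```python
import copy
import torch
from torch import nn
import torch.nn.functional as F

class SGDOptimizer:
    def __init__(self, params, lr=0.5):
        self.params, self.lr = list(params), lr
    def step(self):
        with torch.no_grad():
            for p in self.params: p -= p.grad * self.lr
    def zero_grad(self):
        for p in self.params: p.grad.zero_()

torch.manual_seed(0)
xb, yb = torch.randn(16, 4), torch.randint(0, 3, (16,))  # toy batch
model_a = nn.Linear(4, 3)
model_b = copy.deepcopy(model_a)  # identical starting weights

opt_a = SGDOptimizer(model_a.parameters(), lr=0.5)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.5)

# One identical training step with each optimizer.
for model, opt in ((model_a, opt_a), (model_b, opt_b)):
    F.cross_entropy(model(xb), yb).backward()
    opt.step()
    opt.zero_grad()

same = all(torch.allclose(pa, pb)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(same)
```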

## Dataset and Dataloader
It is clunky to retrieve `xb`, `yb` separately in our code. We want to retrieve them together.

In [442]:
class Dataset:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __len__(self):
        return len(self.x)
    
    def __getitem__(self, i):
        return self.x[i], self.y[i]

In [443]:
train_ds, valid_ds = Dataset(x_train, y_train), Dataset(x_valid, y_valid)
assert(len(train_ds) == len(x_train))
assert(len(valid_ds) == len(x_valid))

In [444]:
xb, yb = train_ds[0:5]
assert xb.shape == (5, 28*28)
assert yb.shape == (5,)

Let's shorten our training loop.

In [445]:
model, optim = get_model()

In [446]:
for i in range(epochs):
    for b in range(0, num_input, bs):
        s = slice(b, min(num_input, b+bs))
        xb, yb = train_ds[s]
        y_pred = model(xb)
        loss = loss_function(y_pred, yb)
        loss.backward()
        optim.step()
        optim.zero_grad()
            
    report(loss, y_pred, yb)

0.16, 0.96
0.10, 0.98
0.10, 0.96


Our current training loop iterates over elements in the dataset as:
`for b in range(0, num_input, bs):
  s = slice(b, min(num_input, b+bs))
  xb, yb = train_ds[s]`

Let's replace this with a Dataloader.

In [447]:
class Dataloader:
    def __init__(self, dataset, bs):
        self.dataset = dataset
        self.bs = bs

    # defines how an object of this class behaves in an iteration context
    def __iter__(self):
        for i in range(0, len(self.dataset), self.bs): yield self.dataset[i: i+self.bs]
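Writing `__iter__` as a generator method means every `for` loop over the object starts a fresh pass, which is what lets the same loader be reused across epochs. Here is a minimal sketch of that behaviour with a plain list; the `Batches` class is a hypothetical stand-in for the `Dataloader` above.

```python
class Batches:
    def __init__(self, data, bs):
        self.data, self.bs = data, bs

    def __iter__(self):
        # A generator method: each call to iter() begins a new pass over the data.
        for i in range(0, len(self.data), self.bs):
            yield self.data[i:i + self.bs]

batches = Batches(list(range(7)), bs=3)
first_pass = list(batches)
second_pass = list(batches)
print(first_pass)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The last batch is shorter than `bs` because slicing past the end of a sequence simply truncates.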

In [448]:
train_dl = Dataloader(train_ds, bs)
valid_dl = Dataloader(valid_ds, bs)

In [449]:
model, optim = get_model()

In [450]:
for i in range(epochs):
    for batch in train_dl:
        xb, yb = batch
        y_pred = model(xb)
        loss = loss_function(y_pred, yb)
        loss.backward()
        optim.step()
        optim.zero_grad()
            
    report(loss, y_pred, yb)

0.12, 0.98
0.13, 0.96
0.11, 0.96


We have now implemented a Dataloader from scratch!

### Random sampling
We want our Dataloader to sample randomly during training, so that the order in which the data was prepared does not bias the batches, and to sample in a fixed order during validation.

In [451]:
import random

In [452]:
class Sampler():
    def __init__(self, ds, shuffle=False):
        self.n, self.shuffle = len(ds), shuffle

    def __iter__(self):
        res = list(range(self.n))
        if self.shuffle: random.shuffle(res)
        return iter(res)

In [453]:
from itertools import islice

In [454]:
ss = Sampler(train_ds)

In [455]:
it = iter(ss)
for i in range(5): print(next(it))

0
1
2
3
4


In [456]:
ss = Sampler(train_ds, shuffle=True)
it = iter(ss)
for i in range(5): print(next(it))

38864
6875
17031
11550
23101


The sampler now yields shuffled indices. Let's use them to select the data.

In [457]:
import fastcore.all as fc

In [458]:
??fc

Type:        module
String form: <module 'fastcore.all' from '/Users/pj/miniconda3/envs/np/lib/python3.11/site-packages/fastcore/all.py'>
File:        ~/miniconda3/envs/np/lib/python3.11/site-packages/fastcore/all.py
Source:     
from .imports import *
from .foundation import *
from .dispatch import *
from .utils import *
from .parallel import *
from .net import *
from .transform import *
from .test

In [459]:
class BatchSampler():
    def __init__(self, sampler, bs, drop_last=False):
        fc.store_attr()

    def __iter__(self):
        yield from fc.chunked(iter(self.sampler), self.bs, drop_last=self.drop_last)
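The chunking that `BatchSampler` relies on can be sketched with the standard library alone. The `chunked` function below is a hypothetical stand-in written for illustration, not fastcore's implementation; it shows what a `drop_last` flag is meant to do with a final, incomplete batch.

```python
from itertools import islice

def chunked(iterable, size, drop_last=False):
    # Hypothetical stand-in for a chunking helper; yields lists of up to `size` items.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        if drop_last and len(chunk) < size:
            return
        yield chunk

keep = list(chunked(range(10), 4))
drop = list(chunked(range(10), 4, drop_last=True))
print(keep)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(drop)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```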

In [460]:
batchs = BatchSampler(ss, 50)

In [461]:
def collate(b):
    xs, ys = zip(*b)
    return torch.stack(xs), torch.stack(ys)
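As a small illustration of what `collate` does, the sketch below turns a list of individual `(x, y)` samples into one batched pair of tensors. The three toy samples are assumptions for the example.

```python
import torch

def collate(b):
    xs, ys = zip(*b)
    return torch.stack(xs), torch.stack(ys)

# Three toy (x, y) samples, each x with 4 features and y a scalar label.
samples = [(torch.full((4,), float(i)), torch.tensor(i)) for i in range(3)]

xb, yb = collate(samples)
print(xb.shape, yb)  # torch.Size([3, 4]) tensor([0, 1, 2])
```

`zip(*b)` separates inputs from labels, and `torch.stack` adds the leading batch dimension.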

In [462]:
class Dataloader():
    def __init__(self, ds, batchs, collate_fn=collate): fc.store_attr()

    def __iter__(self): yield from (self.collate_fn(self.ds[i] for i in b) for b in self.batchs)

In [463]:
train_samp = BatchSampler(Sampler(train_ds, shuffle=True), bs)
val_samp = BatchSampler(Sampler(valid_ds), bs)

In [464]:
train_dl = Dataloader(train_ds, batchs=train_samp)
val_dl = Dataloader(valid_ds, batchs=val_samp)

In [465]:
model, opt = get_model()

In [466]:
for i in range(epochs):
    num_train_rows = 0
    for batch in train_dl:
        xb, yb = batch
        num_train_rows += xb.shape[0]
        y_pred = model(xb)
        loss = loss_function(y_pred, yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
            
    print("Num training rows in this epoch:", num_train_rows)
    report(loss, y_pred, yb)

Num training rows in this epoch: 50000
0.05, 1.00
Num training rows in this epoch: 50000
0.04, 0.98
Num training rows in this epoch: 50000
0.09, 0.92


### Multiprocessing Dataloader

In [467]:
import torch.multiprocessing as mp
from fastcore.basics import store_attr

In [468]:
train_ds[[3,6,8,1]]

(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 tensor([1, 1, 1, 0]))

In [469]:
train_ds.__getitem__([3, 6, 8, 1])

(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 tensor([1, 1, 1, 0]))

We see that calling `__getitem__()` is the same as indexing into the dataset directly.

To translate this to multiple workers, we can use `map` to call `__getitem__()` on each batch of indices.

In [470]:
for o in map(train_ds.__getitem__, ([3,6], [8,1])): print(o)

(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]]), tensor([1, 1]))
(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]]), tensor([1, 0]))


In [471]:
class Dataloader():
    def __init__(self, ds, batchs, num_workers=1, collate_fn=collate):
        fc.store_attr()

    def __iter__(self):
        with mp.Pool(self.num_workers) as ex:
            print("started worker")
            yield from ex.map(self.ds.__getitem__, iter(self.batchs))

In [477]:
train_dl = Dataloader(train_ds, batchs=train_samp, num_workers=2)
it = iter(train_dl)

In [478]:
xb, yb = next(it)
xb.shape, yb.shape

started worker


KeyboardInterrupt: 

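The call never returned and had to be interrupted: the workers appear to get stuck while executing the mapping. One possible cause (an assumption, not verified here) is that `multiprocessing` starts workers with the `spawn` method on this platform, in which case classes defined inside the notebook, such as `Dataset`, may not be importable by the worker processes.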

### PyTorch DataLoader
We can now use PyTorch's DataLoader.

In [479]:
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler, BatchSampler

In [485]:
train_samp = BatchSampler(RandomSampler(train_ds), bs, drop_last=False)
valid_samp = BatchSampler(SequentialSampler(valid_ds), bs, drop_last=False)

In [486]:
train_dl = DataLoader(train_ds, batch_sampler=train_samp, collate_fn=collate)
val_dl = DataLoader(valid_ds, batch_sampler=valid_samp, collate_fn=collate)

In [487]:
model, opt = get_model()
for i in range(epochs):
    for batch in train_dl:
        xb, yb = batch
        y_pred = model(xb)
        loss = loss_function(y_pred, yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
            
    report(loss, y_pred, yb)

0.27, 0.94
0.03, 1.00
0.08, 0.96


PyTorch can actually automatically generate the `BatchSampler` and `RandomSampler` for us.

In [488]:
train_dl = DataLoader(train_ds, bs, shuffle=True, drop_last=True, num_workers=2)
val_dl = DataLoader(valid_ds, bs, shuffle=False, num_workers=2)
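The effect of `shuffle` and `drop_last` can be seen on a tiny dataset. This sketch uses a hypothetical 10-row `TensorDataset` and keeps `num_workers=0` so that all loading happens in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

xs = torch.arange(10).float().unsqueeze(1)  # toy inputs, shape (10, 1)
ys = torch.arange(10)                        # toy labels 0..9
toy_ds = TensorDataset(xs, ys)

# num_workers=0 keeps loading in the main process for this small demo.
keep_dl = DataLoader(toy_ds, batch_size=4, shuffle=True, drop_last=False, num_workers=0)
drop_dl = DataLoader(toy_ds, batch_size=4, shuffle=True, drop_last=True, num_workers=0)

keep_sizes = [len(yb) for _, yb in keep_dl]
drop_sizes = [len(yb) for _, yb in drop_dl]
# Shuffling changes the order, but every row is still visited once per pass.
seen = sorted(int(y) for _, yb in keep_dl for y in yb)

print(keep_sizes, drop_sizes)  # [4, 4, 2] [4, 4]
```

With `drop_last=True` the incomplete final batch is skipped, so a few rows go unused in each epoch.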

In [489]:
model, opt = get_model()
for i in range(epochs):
    for batch in train_dl:
        xb, yb = batch
        y_pred = model(xb)
        loss = loss_function(y_pred, yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
            
    report(loss, y_pred, yb)



0.10, 0.98
0.07, 0.98
0.08, 0.98


## Validation
We check our model's performance on a validation set at the end of each epoch. We call `model.train()` before training and `model.eval()` before inference, because layers such as dropout and batch normalisation behave differently in the two modes.

In [495]:
model, opt = get_model()

for i in range(epochs):
    model.train()
    for batch in train_dl:
        xb, yb = batch
        y_pred = model(xb)
        loss = loss_function(y_pred, yb)
        loss.backward()
        opt.step()
        opt.zero_grad()

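    model.eval()
    val_loss = 0.
    val_acc = 0.
    count = 0
    # Gradients are not needed for validation, so skip building the autograd graph
    with torch.no_grad():
        for batch in val_dl:
            xb, yb = batch
            n = len(xb)
            y_pred = model(xb)
            loss = loss_function(y_pred, yb)
            # Weight each batch by its size so a smaller last batch is averaged correctly
            val_loss += loss.item() * n
            val_acc += accuracy(y_pred, yb).item() * n
            count += n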

    print(i, val_loss/count, val_acc/count)
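The model used in this notebook has no mode-dependent layers, so `train()` and `eval()` make no visible difference here. To illustrate what the two modes do, the sketch below uses a hypothetical network with dropout and also shows `torch.no_grad()`, which is commonly used during validation to avoid recording gradients.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical model with dropout, so the two modes visibly differ.
net = nn.Sequential(nn.Linear(4, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(3, 4)

net.eval()                      # dropout disabled: outputs are deterministic
with torch.no_grad():           # no autograd graph is recorded
    out1, out2 = net(x), net(x)

net.train()                     # dropout active: units are randomly zeroed
out3 = net(x)

print(torch.equal(out1, out2), out1.requires_grad, out3.requires_grad)
```

In evaluation mode, two forward passes over the same input agree exactly, and outputs computed under `torch.no_grad()` carry no gradient history.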



0 0.14612301837652922 0.9564999985694885
1 0.13874237489886582 0.9602999991178512
2 0.10953246485907585 0.968100000321865




We have now built our own versions of a dataset, samplers, and a dataloader from scratch, and seen how they map onto PyTorch's `DataLoader`.