# MNIST 99 Percent
> Aiming for 99% accuracy across the full MNIST data set.

- toc: false 
- badges: true
- comments: true
- author: Paul D
- image: images/2020-05-05-mnist-99-title.png
- categories: [deep learning, fastai, gradient descent, sgd, mnist]
- hide: false

### Introduction
In my [previous blog post](https://pdito.github.io/blog/deep%20learning/fastai/gradient%20descent/sgd/mnist/2020/04/17/fastai2-ch4-mnist-first-principles.html) I ran through classification for a subset of the MNIST data (3s and 7s only) as a learning experience, following along with Fastbook chapter 4.

From here, I look to take what I've learned previously to build a model to tackle the full MNIST data set, attempting to eventually hit an accuracy of > 99% on my validation set.

I won't be going into as much detail for each step, so please review the previous blog post for a verbose explanation of what is going on.

### Getting Started

As usual, we start by importing the necessary libraries. 

In [5]:
from fastai2.vision.all import *
from utils import *
import itertools
import math

matplotlib.rc('image', cmap='Greys')

We then use fastai's built in ```untar_data``` function to download and extract the full MNIST data set.

In [6]:
path = untar_data(URLs.MNIST)
path.ls()

(#2) [Path('/home/pdito/.fastai/data/mnist_png/training'),Path('/home/pdito/.fastai/data/mnist_png/testing')]

It looks like we have both training and testing data sub-folders. In this case, we'll use the testing data as our validation data.

*Note: Really we should split our training data into training and validation data and keep our testing data separate, but since we are not building something that will ever make it into production, we use the simpler approach.*

Next we iterate through all the images to create a list of all our training and validation images.

In [7]:
train_path = (path/'training').ls().sorted()
valid_path = (path/'testing').ls().sorted()

In [8]:
train_images = list(itertools.chain.from_iterable(([x.ls().sorted() for x in train_path])))
valid_images = list(itertools.chain.from_iterable(([x.ls().sorted() for x in valid_path])))

We then convert the list into a tensor, where dimension 0 represents each individual image.

In [9]:
train_x = [tensor(Image.open(o)) for o in train_images]
valid_x = [tensor(Image.open(o)) for o in valid_images]

In [10]:
train_x = torch.stack(train_x).float()/255
valid_x = torch.stack(valid_x).float()/255
train_x.shape, valid_x.shape

(torch.Size([60000, 28, 28]), torch.Size([10000, 28, 28]))

As usual, the tensor is then reshaped to combine the row and column pixel images into one long tensor, row by row.

In [11]:
train_x = train_x.view(-1, 28*28)
valid_x = valid_x.view(-1, 28*28)
train_x.shape

torch.Size([60000, 784])

Next, we need to create our labels. We can use the same list we used to create our ```train_x``` and ```valid_x``` tensors, iterating though to generate a tensor of values (in this case an int for the number) based on the parent folder name of the image.

In [12]:
train_y = torch.stack([tensor(int(os.path.basename(os.path.dirname(o)))) for o in train_images])
valid_y = torch.stack([tensor(int(os.path.basename(os.path.dirname(o)))) for o in valid_images])
train_y.shape, valid_y.shape

(torch.Size([60000]), torch.Size([10000]))

We use the zip function to create a list of tuples for the images and their labels.

In [13]:
train_dset = list(zip(train_x, train_y))
valid_dset = list(zip(valid_x, valid_y))

Now we have our training set we can create our ```Dataloaders```, which pass mini-batches of our data to our training model. Note, its typically good practice to shuffle our training data. In our example, this step is **essential**. Since if we don't shuffle, most mini-batches will contain images of only one number (as our data set is ordered by folder).

In [14]:
train_dl = DataLoader(train_dset, batch_size=256, shuffle=True)
valid_dl = DataLoader(valid_dset, batch_size=256, shuffle=False)

We then create a function which can be used to randomly initialise our parameters, applying ```.requires_grad_()```to tell PyTorch to calculate our gradients.

In [15]:
def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()

Now we create our model. In this case we are starting with a simple linear model, wx + b. We are however applying a log softmax function to the result. Softmax in effect squashes our output vector to values between 0 and 1, where those values sum to 1. Our output vector can be interpreted as the probability of something belonging to a given class.

We also take the log of the results, for reasons which are explained next.

In [16]:
def model(xb): return torch.log_softmax((xb@weights + bias), 1)

#Example of log_softmax
#def log_softmax(x):
#    return x - x.exp().sum(-1).log().unsqueeze(-1)

For our loss function, we use Negative Log Likelihood. The better our prediction, the lower the NLL. This function focuses only on our prediction for what would have been the correct class. 

As an example, lets assume our dataset only contains numbers 1 - 4. For a given image, our softmax output is ```[0.1, 0.1, 0.1, 0.7]``` and our label tensor is ```[0, 0, 0, 1]```. In this case our NLL is the negative log of ```(0 * 0.1) + (0 * 0.1) + (0 * 0.1) + (1 * 0.7)```. In other words ```-ln(0.7) = 0.155```.

In that case, our model was making a correct guess with 70% confidence. Let's now look at the example where that guess was incorrect, by changing our label tensor to ```[0, 0, 1, 0]```. In this case our NLL is the negative log of ```(0 * 0.1) + (0 * 0.1) + (1 * 0.1) + (0 * 0.7) = -ln(0.1) = 1```. So a much higher loss. 

The reason we took the log of softmax earlier is because the ```nll_loss``` function expects its input to be the log of probabilities as opposed to the probabilities themselves.

In [17]:
def mnist_loss(predictions, targets):
    return F.nll_loss(predictions, targets)

loss_func = mnist_loss

#Example of NLL
#def mnist_loss_manual(predictions, targets): return -predictions[range(targets.shape[0]), targets].mean()

Next we define our step process, which calculates our predictions for a given mini-batch, calculates the loss of those predictions using our loss function and then calculates the gradients of our parameters based on that loss.

In [18]:
def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = loss_func(preds, yb)
    loss.backward()

We then create our training function, which represents an entire epoch. In this case it loops through every mini_batch, calculating the gradients, adjusting our parameters by their gradient multiplied by the learning rate and then resetting the gradients to zero (since they are additive otherwise, which is not what we want).

In [19]:
def train_epoch(model, lr, params):
    for xb, yb in train_dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad * lr
            p.grad.zero_()

Right now we only have a loss to measure performance. This is great for training, but not great for us to know how we're doing. Below we create a function that outputs the accuracy of a given mini batch (taking the index of our highest probability prediction and comparing that to our label for each image).

We then create a function that performs this on our entire validation set that we can call after each training epoch.

In [20]:
def batch_accuracy(xb, yb):
    preds = torch.argmax(xb, dim=1)
    return (preds == yb).float().mean()

def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb, yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 6)

Now it's time to set our model in to action. We start by initialising our parameters, and we also define a learning rate.

In [21]:
weights = init_params((28*28,10))
bias = init_params(10)
params = weights, bias
lr = 1.

Press play...

In [22]:
 def train_model(model, epochs, lr):
    for i in range(epochs):
        train_epoch(model, lr, params)
        print(validate_epoch(model), end=' ')

train_model(model, 20, lr)

0.853223 0.86416 0.875488 0.880273 0.892285 0.890625 0.894043 0.891699 0.897949 0.900977 0.899805 0.905762 0.906055 0.901367 0.908789 0.906543 0.90957 0.904102 0.900391 0.906934 

Around 90% accuracy. Pretty good for a simple linear model. Our model struggles to improve much beyond the 10th epch, perhaps a learning rate that is too high.

### Cleaning Up Code with PyTorch/fastai

Let's simplify our code by using PyTorch's built in nn.Linear to create our model. This also handles parameter initialisation for us.

In [24]:
linear_model = nn.Linear(28*28, 10)
w, b = linear_model.parameters()
lr = 0.1

Since we are no longer taking the log_softmax in our model, we can introduce the PyTorch loss function F.cross_entropy which combines both log softmax and NLL into one function.

In [25]:
loss_func = F.cross_entropy

To tidy thing up, we can also wrap our step and zero grad functions into an optimiser class.

In [26]:
class BasicOptim:
    def __init__(self, params, lr): self.params, self.lr = list(params), lr
    
    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr
    
    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None
            
opt = BasicOptim(linear_model.parameters(), lr)

def train_epoch_lm(model):
    for xb, yb in train_dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()
        
def train_model_lm(model, epochs):
    for i in range(epochs):
        train_epoch_lm(model)
        print(validate_epoch(model), end=' ')


In [27]:
train_model_lm(linear_model, 20)

0.88291 0.895898 0.902441 0.902734 0.909668 0.907422 0.911035 0.90957 0.911621 0.912793 0.912793 0.915137 0.91416 0.914746 0.916406 0.915918 0.918066 0.915918 0.916406 0.918164 

Better results than before, around 92% accuracy. Why, since nothing changed in our actual model architecture?

Actually, something did change, we reduced the learning rate from 1.0 to 0.1. Everything else remain consistent, just represented in a cleaner way using less and more reusable code.

### Replacing BasicOptim with SGD

To further simplify, fastai provides us with a built in SGD class, similar to the BasicOptim class we created above.

In [28]:
linear_model = nn.Linear(28*28, 10)
opt = SGD(linear_model.parameters(), lr)
train_model_lm(linear_model, 20)

0.883691 0.894336 0.901367 0.903223 0.908203 0.908984 0.910156 0.911523 0.91543 0.912305 0.916895 0.916309 0.914453 0.915723 0.91582 0.918555 0.916309 0.915332 0.917773 0.918164 

Again, 92% accuracy. Similar results, which makes sense, since nothing has changed.

### Using fastai Learner

Finally, before we start to improve our model, we use a fastai ```Learner```to replace out training loop in order to further simplify our code. 

In [29]:
dls = DataLoaders(train_dl, valid_dl)
learn = Learner(dls, nn.Linear(28*28,10), opt_func=SGD, loss_func=F.cross_entropy, metrics=batch_accuracy)
learn.fit(20, lr=lr)

epoch,train_loss,valid_loss,batch_accuracy,time
0,0.52186,0.446785,0.8891,00:01
1,0.408741,0.378355,0.8993,00:01
2,0.373681,0.3501,0.906,00:01
3,0.355128,0.335873,0.9078,00:01
4,0.347222,0.323164,0.9126,00:01
5,0.33139,0.315915,0.9144,00:01
6,0.3201,0.309472,0.9151,00:01
7,0.315955,0.305583,0.9154,00:01
8,0.307917,0.301444,0.9164,00:01
9,0.318447,0.298375,0.9169,00:01


92% accuracy here too, just what we wanted to see since again, nothing has changed.

### Adding Non-Linearity

Now we do want things to change. To improve our model, let's add some non-linearity. We'll sandwich a ReLU activation function in between two linear layers.

In [30]:
neural_net = nn.Sequential(
    nn.Linear(28*28, 30),
    nn.ReLU(),
    nn.Linear(30,10)
)

In [31]:
learn = Learner(dls, neural_net, opt_func=SGD, loss_func=F.cross_entropy, metrics=batch_accuracy)
learn.fit(20, 0.1)

epoch,train_loss,valid_loss,batch_accuracy,time
0,0.480248,0.397933,0.8904,00:01
1,0.339421,0.317796,0.9103,00:01
2,0.307536,0.290654,0.9178,00:01
3,0.281915,0.269593,0.9225,00:01
4,0.263428,0.251829,0.9278,00:01
5,0.247231,0.242583,0.9307,00:01
6,0.223604,0.224113,0.9357,00:01
7,0.221571,0.218439,0.9364,00:01
8,0.209643,0.217177,0.9367,00:01
9,0.196226,0.199739,0.9427,00:01


A pretty significant improvement, over 95% accurate, still using just a very simple architecture. Looking at the output we could definitely afford to train this over more epochs and expect continued improvement.

### Using ResNet18

Since we want to achieve an accuracy of over 99%, let's use a more complex neural net, in this case, the infamous ResNet18 architecture.

We want to use fastai's convenience methods for this, so we use the```DataBlock``` function to ensure our data is presented in the desired format.

We have two blocks, an ```ImageBlock``` (our data) and a ```CategoryBlock``` (our labels). We use ```PILImage``` even though our images are greyscale (which would be ```PILImageBW```) as ResNet18 was designed to be used on RGB images and expects its inputs to be structured accordingly.

```get_image_files``` is a helper function that returns all the images under the path.

```GrandparenterSplitter``` let's us specify the training and validation data split by the images' parent's parent (ie. grandparent) folder.

```parent_label```let's us define our image labels as the folder name they are contained within.

We then run ```dataloaders```on our ```DataBlock```to get our ```DataLoaders```.

*Note: nothing actually runs in the DataBlock until we call its dataloaders property against a path.

In [32]:
mnist = DataBlock(blocks=(ImageBlock(cls=PILImage), CategoryBlock), 
                  get_items=get_image_files, 
                  splitter=GrandparentSplitter(train_name='training', valid_name='testing'),
                  get_y=parent_label)

dls = mnist.dataloaders(untar_data(URLs.MNIST), batch_sz=128)

We create our Learner, using the resnet18 architecture without pretrained weights. We also directly reference ```F.cross_entropy```in our Learner and use the fastai's built in ```accuracy``` metric. We use fastai's ```.fit_one_cycle``` training method which is a more sophisticated version of ```.fit```.

I'm sure we'll blog about this soon, but you can read more [here](https://docs.fast.ai/callbacks.one_cycle.html).

In [33]:
learn = cnn_learner(dls, resnet18, pretrained=False, loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(5)

epoch,train_loss,valid_loss,accuracy,time
0,0.189028,0.167352,0.9463,00:27
1,0.099857,0.090754,0.9756,00:28
2,0.05201,0.045006,0.9873,00:28
3,0.02118,0.017461,0.9947,00:28
4,0.013487,0.015505,0.9949,00:28


Finally, success. 99.5% accuracy after just 5 epochs and two and half minutes of training. We achieve this result in just 4 lines of code. A good indication of the power of fastai.

### Doing Something Ridiculous Like Using ResNet152

Just as an aside, let's try an extremely deep model to see if we get any improvement.

In [4]:
learn = cnn_learner(dls, resnet152, pretrained=False, loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(5)

epoch,train_loss,valid_loss,accuracy,time
0,0.551158,0.449306,0.8583,02:03
1,0.264751,0.383378,0.8803,02:03
2,0.075151,0.046404,0.9857,01:58
3,0.063741,0.031806,0.9907,02:05
4,0.015324,0.021084,0.9941,02:03


In this case, no additional benefit from further complexity.

Of course, we could try more epochs, but this comes at risk of overfitting. Investigating what our model got wrong and using that to form the basis of our next steps would be the best way forward. But for now, we're content with our >99%.

Thanks for reading.