<a href="https://colab.research.google.com/github/Angus-Eastell/Intro_to_AI/blob/main/7_2_nns_for_mnist.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural networks for MNIST

Now, we're going to put everything together and train a fully connected network to recognise MNIST handwritten digits.

The point of this notebook is for you to have a play with tweaking the optimizer.

Here are some things to try:
* changing the SGD learning rate.
* changing the SGD momentum.
* changing the optimizer to Adam.
* changing the Adam learning rate.
* changing the `beta1` and `beta2` parameters in Adam.

Overall, what's the best result you can get after 5 epochs?  Is your best result with Adam or SGD?

You will need to check out the docs for [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) and [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html).
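
As a starting point for the experiments listed above, here is a minimal sketch of how the three optimizer variants could be constructed. The stand-in `model` and the variable names `opt_sgd`, `opt_momentum` and `opt_adam` are illustrative only, and the hyperparameter values are just places to start. Note that PyTorch's Adam takes `beta1` and `beta2` together as a single `betas` tuple.

```python
import torch
import torch.nn as nn

# Stand-in model with the same layer sizes as the network in this notebook.
model = nn.Sequential(nn.Linear(784, 500), nn.ReLU(), nn.Linear(500, 10))

# Plain SGD: only a learning rate.
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.1)

# SGD with momentum.
opt_momentum = torch.optim.SGD(model.parameters(), lr=0.015, momentum=0.95)

# Adam: betas=(beta1, beta2). These values are PyTorch's defaults.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```

In the notebook, only one of these would be assigned to `opt` at a time, and the model should be re-created before each run so that every optimizer starts from freshly initialised weights.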

The notebook will be faster on GPU, but it's still perfectly fine on CPU.  Remember, to switch to GPU, go to "Runtime" -> "Change Runtime Type".

In [65]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
torch.manual_seed(0)

# Check whether we have a GPU.  Use it if we do.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')



# MNIST train and test datasets.  I'm not going to talk about these in this course.
# You should just be able to follow "recipes" online.
train_dataset = torchvision.datasets.MNIST(root='data',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='data',
                                          train=False,
                                          transform=transforms.ToTensor())

# Data loaders for the train and test sets.  I'm not going to talk about these in this course.
# However, note that I'm using a much bigger batch size at test-time.  That's
# because at training time, we have to backprop, so we have to save all the
# intermediate variables, which takes a lot of memory.  We don't have to do that
# at test-time.
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=100,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=1000,
                                          shuffle=False)
# Define network
input_size = 784
hidden_size = 500
num_classes = 10

model = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, num_classes)
).to(device)

#############################
#### Tweak this line !!! ####
#############################
opt = torch.optim.SGD(model.parameters(), lr=0.015, momentum=0.95)

def train():
    # Does one training epoch (i.e. one pass over the data.)
    for images, labels in train_loader:
        # Move tensors to the GPU device, and convert image to vector.
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)

        # Forward pass
        logits = model(images)

        # Backpropagation and optimization
        loss = nn.functional.cross_entropy(logits, labels)
        loss.backward()
        opt.step()
        opt.zero_grad()

def test(epoch):
    # Do one pass over the test data.
    # In the test phase, we don't need to compute gradients (this saves memory).
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in test_loader:
            # Convert image pixels to vector
            images = images.reshape(-1, 28*28).to(device)
            labels = labels.to(device)

            # Forward pass
            logits = model(images)

            # Compute total correct so far
            predicted = torch.argmax(logits, -1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
        print(f'Test accuracy after {epoch+1} epochs: {100 * correct / total} %')


# Run training
for epoch in range(5):
    train()
    test(epoch)


Test accuracy after 1 epochs: 95.39 %
Test accuracy after 2 epochs: 96.78 %
Test accuracy after 3 epochs: 97.32 %
Test accuracy after 4 epochs: 97.55 %
Test accuracy after 5 epochs: 97.67 %


SGD:
Impact of altering the learning rate for SGD:
- Lowering the learning rate reduces the accuracy reached in five epochs: the steps are too small for the network to get close to a good minimum in that time, and it could get stuck in a poor local minimum.
- Increasing the learning rate speeds up how quickly the network reaches high accuracy, but it can cause oscillations and instability around the minimum of the loss.
- Ideal learning rate around 0.1-0.3
Best accuracy was 97.87% at lr = 0.2
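
The three regimes described above (too slow, about right, unstable) can be reproduced on a toy problem. This is a minimal sketch, not the MNIST experiment: it runs plain gradient descent on the one-dimensional quadratic f(w) = 0.5 * w**2, and the function name `gradient_descent` and the learning rates used are chosen purely for illustration.

```python
def gradient_descent(lr, steps=20, w0=1.0, curvature=1.0):
    """Minimise f(w) = 0.5 * curvature * w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = curvature * w  # derivative of f at the current w
        w = w - lr * grad     # gradient descent update
    return w

# Small lr: barely moves.  Moderate lr: close to the minimum at 0.
# Very large lr: every step overshoots by more than it corrects, so w diverges.
for lr in (0.01, 0.5, 2.5):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.4g}")
```

The values of the learning rate that count as "too large" depend on the curvature of the loss, so the thresholds here do not carry over to the network above.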

Impact of adding momentum:
- Low momentum increases the time to convergence, as each update relies more on the current gradient than on past gradients. It can also lead to less stability with higher learning rates.
- High momentum can overshoot the minimum, because the update is less responsive to new gradients. It can also lead to more instability with high learning rates.
- Ideal momentum around 0.9
Best accuracy was 97.67% at lr = 0.015 and momentum 0.95
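
Both effects of momentum noted above, faster progress and overshooting, show up on the same kind of toy quadratic. The sketch below assumes the update rule that PyTorch's SGD uses by default (velocity = momentum * velocity + grad, then w = w - lr * velocity) and applies it to f(w) = 0.5 * w**2. The helper name `sgd_trajectory` is made up for this example.

```python
def sgd_trajectory(lr, momentum, steps=50, w0=1.0):
    """Return the list of iterates of SGD with momentum on f(w) = 0.5 * w**2."""
    w, velocity = w0, 0.0
    trajectory = [w]
    for _ in range(steps):
        grad = w                                # derivative of 0.5 * w**2
        velocity = momentum * velocity + grad   # accumulate past gradients
        w = w - lr * velocity
        trajectory.append(w)
    return trajectory

plain = sgd_trajectory(lr=0.015, momentum=0.0)
with_momentum = sgd_trajectory(lr=0.015, momentum=0.95)

# After 10 steps the momentum run is much closer to the minimum at 0 ...
print(f"step 10: plain={plain[10]:.3f}, momentum={with_momentum[10]:.3f}")
# ... but it later overshoots (w goes negative), which plain SGD never does here.
print(f"lowest w: plain={min(plain):.3f}, momentum={min(with_momentum):.3f}")
```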

Adam:
- The learning rate for Adam needs to be significantly lower, around 0.001, as higher learning rates (even 0.01) can lead to instability. Because Adam combines momentum with adaptive per-parameter learning rates, it needs less training time, so a high learning rate is unnecessary.
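
One way to see why useful learning rates for Adam and SGD differ is to look at the size of a single update. The sketch below implements the first step of the standard Adam rule (with bias correction) for one scalar parameter. The function name `adam_first_step` is illustrative, and `eps=1e-8` matches PyTorch's default.

```python
import math

def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the parameter change produced by the first Adam update."""
    m = (1 - beta1) * grad        # moving average of gradients
    v = (1 - beta2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1 - beta1)       # bias correction at step t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

# An SGD step scales with the gradient; the first Adam step has size ~lr
# whatever the gradient's scale.
for grad in (0.01, 1.0, 100.0):
    print(f"grad={grad}: SGD step={-0.001 * grad:.3g}, "
          f"Adam step={adam_first_step(grad):.3g}")
```

Because Adam normalises each step by the running RMS of the gradient, `lr` acts roughly as a per-step movement size rather than a multiplier on the raw gradient.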

Adjusting betas:
- Beta 1 should be around 0.9 (range 0.85-0.95). It sets the decay rate of the exponential moving average of the gradients (the momentum term). If beta 1 is too low (around 0.5), the optimisation becomes too responsive to recent gradients, making it more unstable. If beta 1 is too high (around 0.99), it becomes less responsive to recent gradients and slow to react to changes in the loss landscape.
- Beta 2 should be around 0.999. It sets the decay rate of the exponential moving average of the squared gradients (the RMS term). If beta 2 is too low (around 0.9), the optimisation reacts to changes in the landscape faster, causing step sizes to fluctuate too much. If beta 2 is too high (around 0.9999), the optimisation is more stable, but it can be slow to adapt to steep gradients.
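
The responsiveness trade-off described for the betas can be sketched with a bare exponential moving average. The example below assumes an average that has settled at +1 and then receives gradients of -1 on every step, and counts how many updates pass before the average changes sign. The helper name `steps_to_flip` is made up for illustration.

```python
def steps_to_flip(beta, max_steps=10_000):
    """Count updates until an EMA that started at +1 turns negative
    once every new gradient is -1."""
    ema = 1.0
    for step in range(1, max_steps + 1):
        ema = beta * ema + (1 - beta) * (-1.0)  # same form as Adam's averages
        if ema < 0:
            return step
    return None

# Larger beta means a longer memory, so the average reacts more slowly.
for beta in (0.5, 0.9, 0.99):
    print(f"beta={beta}: EMA changes sign after {steps_to_flip(beta)} updates")
```

The same averaging is applied to the squared gradients with beta 2, which is why a very high beta 2 makes the step-size scaling slow to adapt.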

Best accuracy was 98% at lr = 0.001, betas = (0.9, 0.995)


Overall, Adam outperformed SGD, with and without momentum, over 5 epochs.