<a href="https://colab.research.google.com/github/SelAw432/DeepLearning/blob/main/Practical2/ThingsWillGoWrongAgain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning Practical 2 - Part 3 - Things will go wrong again!
---

## Author : Amir Atapour-Abarghouei, amir.atapour-abarghouei@durham.ac.uk

This notebook will provide you with another exercise in identifying issues when building and training a simple neural network.

Copyright (c) 2022 Amir Atapour-Abarghouei, UK.

License : LGPL - http://www.gnu.org/licenses/lgpl.html

Just like [our last exercise](https://colab.research.google.com/gist/atapour/11bbb081bcac21a45aa1bf985c644d36#scrollTo=caZAL4nPAV2H), we are going to have the entire setup and training loop in one cell. This time, we are going to use the FashionMNIST dataset and, to speed up the process, a large batch size.

In fact, the "client" that has hired us to complete the project has insisted that we use a batch size of 1950 and the input images have to be resized to 128x128 - *they have their reasons, don't poke at the scenario too much!!! :o)*. We should hopefully be able to change any other parameter we want.

The point is to use as much of the GPU memory as possible without actually going over the limit. This has been tested on Google Colab and NCC Jupyter Hub (on the res partition). On a different GPU, you might run out of memory right off the bat. That is not meant to happen, so you may have to adjust the batch size and image resolution so you don't run out of memory before training starts.
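To get a feel for the numbers involved, here is a rough back-of-the-envelope calculation of how much memory a single input batch occupies. It is only a sketch: it assumes the images are stored as `float32` tensors and it counts the inputs alone, not the model weights, activations or gradients, which typically take up far more space.

```python
# Rough size of ONE input batch (inputs only, not activations or gradients).
batch_size = 1950        # fixed by the "client"
channels = 3             # Grayscale(num_output_channels=3)
height = width = 128     # Resize(128) applied to square images
bytes_per_value = 4      # assumption: float32 tensors

num_values = batch_size * channels * height * width
batch_bytes = num_values * bytes_per_value
batch_mib = batch_bytes / 2**20

print(f'{num_values:,} values -> {batch_mib:.1f} MiB per input batch')
# prints: 95,846,400 values -> 365.6 MiB per input batch
```

If you have to change the batch size or resolution for your GPU, re-running this with your own values gives a quick sanity check of the input footprint.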

Make sure you go through and understand every part of the code. Ask questions if there is something you don't understand.

In [None]:
!pip install livelossplot --quiet

import torch
import torch.nn as nn
import torchvision
import matplotlib.pyplot as plt
from livelossplot import PlotLosses
import os.path

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Device is {device}!')

train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.FashionMNIST('data', train=True, download=True, transform=torchvision.transforms.Compose([
        torchvision.transforms.Resize(128),
        torchvision.transforms.Grayscale(num_output_channels=3),
        torchvision.transforms.ToTensor()
    ])),
    shuffle=True, batch_size=1950)

print('created the dataloader for training set!')

# we will use a ResNet model here - you will learn about these later
# but you can see how easy CNNs are to use in PyTorch
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
num_infea = model.fc.in_features
model.fc = nn.Linear(num_infea, 10)
model = model.to(device)

# create the optimiser:
optimiser = torch.optim.SGD(model.parameters(), lr=9.9)

# FashionMNIST has 10 classes, so we use (multi-class) cross entropy as the loss function
criterion = nn.CrossEntropyLoss()

# to calculate the total loss over entire training:
total_loss = 0

# initialising epoch variable
epoch = 0

# to plot losses
liveloss = PlotLosses()

# to keep the logs for loss plots
logs = {}

# main training loop:
while epoch < 5:

    for j, batch in enumerate(train_loader):

        print(f'Training step: {j}')

        x, y = batch
        x, y = x.to(device), y.to(device)

        output = model(x)
        loss = criterion(output, y)
        total_loss += loss

        model.zero_grad()

        # backward pass
        loss.backward()
        optimiser.step()

        # calculating the accuracy
        _, argmax = torch.max(output, dim=1)
        accuracy = argmax.eq(y).float().mean() * 100

    logs['Loss'] = loss.item()
    logs['Accuracy'] = accuracy.item()
    liveloss.update(logs)
    liveloss.send()
    print(f'Total loss is {total_loss}')

You should "hopefully" see that after a number of steps, you are getting an error. How can you be getting such an error after the model has correctly trained for a few steps? What is this error? Why is it happening?

Try to investigate and find the issue.

Use your favourite debugging techniques and find out what the issue could be.
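If you suspect the error is related to GPU memory, one possible debugging technique is to print how much memory PyTorch has allocated at each training step and watch how it changes. The helper below is a minimal sketch, not part of the original exercise: the name `gpu_memory_mib` is made up for illustration, and it simply reports zeros when no GPU is available.

```python
import torch

def gpu_memory_mib():
    """Return current and peak GPU memory allocated by PyTorch tensors, in MiB.

    Hypothetical helper for illustration only. Reports zeros on a CPU-only machine.
    """
    if not torch.cuda.is_available():
        return {'allocated': 0.0, 'peak': 0.0}
    return {
        'allocated': torch.cuda.memory_allocated() / 2**20,
        'peak': torch.cuda.max_memory_allocated() / 2**20,
    }

# Example: call this once per training step, e.g. next to the 'Training step' print.
stats = gpu_memory_mib()
print(f"allocated: {stats['allocated']:.1f} MiB, peak: {stats['peak']:.1f} MiB")
```

If the allocated figure keeps climbing from one step to the next, something is probably being kept alive between iterations, and that is a good place to start looking.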

You may also find the following material useful in your quest:

https://pytorch.org/docs/stable/notes/faq.html

Right! Hopefully by this point, you will have fixed the error and your model should be training. If not, ask for help!

However, fixing errors is just half the battle. Even if the error is gone, you should notice that the model is not actually training and performing well.

Try to investigate further and find the issue there.

The following material should help you understand the issue:

https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/
(the practical examples here use Keras - but don't mind that, PyTorch is better!)
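To see the kind of behaviour that article describes without training a full network, here is a toy sketch in plain Python. It runs gradient descent on the one-dimensional function f(w) = w², whose gradient is 2w, with two different learning rates. The function name `gradient_descent` and the specific learning rates are chosen purely for illustration and are not taken from the exercise.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    for _ in range(steps):
        grad = 2 * w        # derivative of w**2
        w = w - lr * grad   # standard gradient descent update
    return w

w_small = gradient_descent(lr=0.1)   # each step multiplies w by 0.8
w_large = gradient_descent(lr=1.5)   # each step multiplies w by -2

print(f'lr=0.1 -> final w = {w_small:.4f}')
print(f'lr=1.5 -> final w = {w_large:.1f}')
```

With the small learning rate, w shrinks towards the minimum at 0. With the large one, every update overshoots the minimum by more than the previous step, so w grows without bound. A real network is far more complicated, but the same overshooting effect can stop it from learning.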

So go back and fix the issue and make sure that the model is training well!