# Before you start
1. **Don't edit this file, make a copy first:**
  * Click on File -> Save a copy in Drive

2. Also do the following:
  * Click on Runtime -> Change runtime type -> Make sure hardware accelerator is set to GPU

# An Overview Before We Begin
Here are a couple of important concepts to note before we start:
- No single methodology works best in all scenarios. This applies to everything from model architecture to learning rate, loss function, and optimizer.
- Like any other engineering project, validation of what we have built is just as important as building it.

# Library Imports

In [None]:
import torch
from torch import nn
from torch import optim
from torchvision import datasets, transforms, models

from tqdm.notebook import tqdm

# Defining Path Variables

In [None]:
train_path = 'data/train'
valid_path = 'data/valid'

# Creating DataLoader
- Creating a Dataset. This bundles the data in a way that the model can understand.
- Creating a DataLoader (which wraps an iterable around the dataset). This controls how the images are fed to the model, including the batch_size, num_workers, and shuffle settings.

## [Instantiating Transforms](https://pytorch.org/vision/stable/transforms.html)

  - `transforms.Normalize(mean, std)` shifts and rescales each colour channel: it subtracts the channel's mean and divides by its standard deviation. The values used below are the widely used ImageNet channel statistics, which is what the pretrained ResNet18 expects.
  - This may seem a bit strange. Why do we do this? It turns out that normalizing the data before training often gives noticeable performance gains and a shorter training time, because putting the inputs on a common scale makes the model easier to train.
  - So why does normalizing help so much? By themselves, the RGB values of the raw data can have differing ranges. For example, the blue channel may range over 0->125 while the red channel ranges over 120->245. These differing ranges cause headaches for the optimizer, and gradient descent converges much more slowly because it has to cater to the differing conditions of both channels.
  - Normalizing the inputs makes the RGB ranges roughly similar, so the optimizer doesn't have such a hard time catering for all the different ranges, and gradient descent converges faster. (This is input normalization, which is not the same thing as batch normalization, a layer inside the network.)
  - More information here: https://medium.com/@urvashilluniya/why-data-normalization-is-necessary-for-machine-learning-models-681b65a05029

Now, in the block of code below, come up with a set of transforms for the training and validation datasets that you think might be suitable. Try using transforms such as rotation, flipping, normalizing, etc.

Find the documentation for the transforms here : https://pytorch.org/vision/stable/transforms.html

In [None]:
# Define transforms for the training and validation set
training_transforms = transforms.Compose([# Insert random rotation, 30 degrees
                                          # Insert random horizontal flip
                                          # Convert to tensor
                                          ?,
                                          transforms.Normalize([0.485, 0.456, 0.406],
                                                               [0.229, 0.224, 0.225])])

validation_transforms = transforms.Compose([#
                                            ?,
                                            transforms.Normalize([0.485, 0.456, 0.406],
                                                                 [0.229, 0.224, 0.225])])

## [Torchvision datasets](https://pytorch.org/docs/stable/torchvision/datasets.html)
There are a number of ways to create a dataset, for example:
- Use an available Torchvision dataset
    - We're using one below called CIFAR10 which we used in the first training session
- Use ImageFolder to create a dataset from folders
- Write your own dataset as a subclass of torch.utils.data.Dataset
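As an illustration of the third option, here is a minimal sketch of a custom dataset. `RandomImageDataset` is a made-up class that serves random tensors instead of real images. A subclass needs two methods: `__len__` and `__getitem__`.

```python
import torch
from torch.utils.data import Dataset


class RandomImageDataset(Dataset):
    """Hypothetical dataset of random 'images' with integer class labels."""

    def __init__(self, num_samples=8, num_classes=10):
        generator = torch.Generator().manual_seed(0)
        self.images = torch.rand(num_samples, 3, 32, 32, generator=generator)
        self.labels = torch.randint(0, num_classes, (num_samples,), generator=generator)

    def __len__(self):
        # Number of samples in the dataset
        return len(self.labels)

    def __getitem__(self, index):
        # Return one (image, label) pair
        return self.images[index], self.labels[index]


toy_dataset = RandomImageDataset()
image, label = toy_dataset[0]
print(len(toy_dataset), image.shape, label)
```

A real dataset would typically load an image from disk inside `__getitem__` and apply the transforms there.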

In [None]:
training_dataset = datasets.CIFAR10(train_path, train=True, transform=?, download=True)
validation_dataset = datasets.CIFAR10(valid_path, train=False, transform=?, download=True)

Files already downloaded and verified
Files already downloaded and verified


## [Instantiating DataLoader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)
- This tells the model *how* to receive the data: the dataset is the input, with batch size and shuffle as arguments.

Now code up data loaders for the training and validation datasets as per the following specs:

- In the training_loader, we set batch_size = 32 and shuffle the data at the start of each epoch.
- In the validation_loader, we also set batch_size = 32 but we DON'T shuffle. We want to evaluate on the data in the same order every time, so we can be sure the model really is improving and didn't just hit a fluke ordering of the dataset.

In [None]:
from torch.utils.data import DataLoader

training_loader = DataLoader(?, ?, ?) # What do you want to pass to the DataLoader? There should be at least 3 important arguments
validation_loader = DataLoader(?, ?, ?)
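To see what the three arguments do, here is a sketch on a small synthetic dataset. The `toy_` names and the 100 random samples are made up for illustration and are not part of the exercise.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 100 random "images", labelled 0..99 so the order is easy to see
toy_dataset = TensorDataset(torch.rand(100, 3, 32, 32), torch.arange(100))

toy_train_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)   # reshuffled every epoch
toy_valid_loader = DataLoader(toy_dataset, batch_size=32, shuffle=False)  # always the same order

# Number of batches per epoch: 100 samples in batches of 32 (the last batch is smaller)
print(len(toy_train_loader))

# Without shuffling, the first batch is always the first 32 samples
first_images, first_labels = next(iter(toy_valid_loader))
print(first_images.shape, first_labels[:5])
```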


In [None]:
# Check what classes are in our dataset

training_dataset.classes, validation_dataset.classes

# make a note of how many classes there are

(['airplane',
  'automobile',
  'bird',
  'cat',
  'deer',
  'dog',
  'frog',
  'horse',
  'ship',
  'truck'],
 ['airplane',
  'automobile',
  'bird',
  'cat',
  'deer',
  'dog',
  'frog',
  'horse',
  'ship',
  'truck'])

# Instantiating ResNet18
- In PyTorch, when a pretrained model is downloaded, you need to reconfigure its 'classification' layer. The model was pretrained on ImageNet, so it comes ready to classify 1000 classes (we only need it to classify 10 classes).
- In addition, downloaded models from PyTorch come unfrozen, which means we need to 'freeze' the entire network except for the classification layer before we perform the first round of training.
- Unfrozen here means that the weights in each layer can be updated during training. In some cases this is not what we want, since the pretrained weights have already been optimized to classify images in the ImageNet dataset with high accuracy.

In [None]:
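# Newer torchvision releases (0.13+) prefer models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model = models.resnet18(pretrained=True)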

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:01<00:00, 27.1MB/s]


## Freezing The Model
- *%%capture* is an IPython cell magic (it works in Colab too). It stops the cell from printing any output. This is purely for *aesthetic* purposes.
- We loop through all the parameters in the model and set requires_grad = False, which 'freezes' the entire model.
  - requires_grad stands for 'requires gradient'. When requires_grad is False, no gradient is computed for that parameter, so its weights are not updated and it is 'frozen'.

In [None]:
%%capture
for param in model.parameters():
  param.requires_grad = False
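To see what freezing does, here is a small sketch on a made-up two-layer network (not the ResNet18 above). Only the first layer is frozen, so after a backward pass it has no gradient while the second layer does.

```python
import torch
from torch import nn

# Hypothetical tiny network, used only to demonstrate freezing
toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer only
for param in toy_model[0].parameters():
    param.requires_grad = False

# Run a forward and backward pass on random input
toy_loss = toy_model(torch.rand(5, 4)).sum()
toy_loss.backward()

frozen_grad = toy_model[0].weight.grad      # stays None: no gradient was computed
trainable_grad = toy_model[2].weight.grad   # a tensor: this layer can still learn
print(frozen_grad is None, trainable_grad is not None)
```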

## Reconfiguring The Classification Layer
- model.fc = the last layer of the network
- model.fc.in_features = the number of features going into the last layer
- The pretrained ResNet18 is designed for 1000 classes, but we only need it to classify 10 classes, so we have to replace the last layer.
  - To do so, we use nn.Linear(512, out_ftrs), where out_ftrs is the number of classes.
  - This way, we keep the same number of features going into the last layer (512) and only change the number of features going out, which in this case is 10.
  - We then reinsert it into the model using model.fc = nn.Linear(512, out_ftrs)

In [None]:
# Print last layer of the model
model.fc

Linear(in_features=512, out_features=1000, bias=True)

In [None]:
# Redefine final linear layer

out_ftrs = ? # Number of features coming OUT of the last layer. What do you think this will be? Remember how many classes we have
model.fc = nn.Linear(512, out_ftrs) # Redefine last linear layer of network to output 10 classes

# The Training Function
- PyTorch has no built-in fit() function, so we write our own train() function to 'fit' the model.

Every time we run through a 'batch' of data we need to do a few things:
1. Clear the gradients from the previous loop
2. Perform a forward pass (put the input through the model once)
3. Calculate the loss
4. Backpropagate the loss
5. Update the parameter weights by taking a step with the optimizer
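The five steps can be sketched on a toy problem as follows. Everything here (the linear model, the random data, the SGD optimizer and the `toy_` names) is made up for illustration and is separate from the `train()` function you will complete below.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical toy setup: 64 random samples, 4 features, 3 classes
toy_model = nn.Linear(4, 3)
toy_loss_fn = nn.CrossEntropyLoss()
toy_optimizer = optim.SGD(toy_model.parameters(), lr=0.1)
toy_inputs = torch.randn(64, 4)
toy_labels = torch.randint(0, 3, (64,))

losses = []
for step in range(20):
    toy_optimizer.zero_grad()                 # 1. clear gradients from the previous loop
    outputs = toy_model(toy_inputs)           # 2. forward pass
    loss = toy_loss_fn(outputs, toy_labels)   # 3. calculate the loss
    loss.backward()                           # 4. backpropagate the loss
    toy_optimizer.step()                      # 5. update the weights
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The loss at the last step should be lower than at the first, which is the whole point of the loop.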

In [None]:
# Function for the training

def train(model, train_loader, loss_fn, optimizer, device):
    ? # puts the model in training mode
    running_loss = 0
    with tqdm(total=len(train_loader)) as pbar:
        for i, data in enumerate(train_loader, 0): # loops through training data
            inputs, labels = ? # separate inputs and labels (outputs),
            inputs, labels = inputs.to(device), labels.to(device) # puts the data on the GPU

            # forward + backward + optimize
            optimizer.zero_grad() # clear the gradients in model parameters
            outputs = ? # forward pass and get predictions, how do we get outputs?
            loss = ? # calculate loss, how do we get loss
            loss.backward() # calculates gradients of the loss w.r.t. all parameters in the model that have requires_grad=True
            optimizer.step() # iterates over all parameters in the model with requires_grad=True and updates their weights

            running_loss += loss.item() # accumulate the loss over the current epoch for printing later

            pbar.update(1) # increment our progress bar

    return running_loss/len(train_loader) # returns the average training loss per batch for the epoch

# The Validation Function
- A validation function is essential in any model training, because it shows how well your model performs on data it was not trained on (the validation dataset).

Note: the validation function measures model performance by passing the entire validation set through the model ONCE. Also note that we calculate the loss but don't propagate it back or update any weights!

In [None]:
# Function for the validation pass

def validation(model, val_loader, loss_fn, device):
    model.eval() # puts the model in evaluation mode
    running_loss = 0
    total = 0
    correct = 0

    with torch.no_grad(): # save memory by not saving gradients which we don't need
        with tqdm(total=len(val_loader)) as pbar:
            for images, labels in iter(val_loader):
                images, labels = images.to(device), labels.to(device) # put the data on the GPU
                outputs = ? # passes the images through the model and gets an output: one raw score (logit) per class

                val_loss = loss_fn(outputs, labels) # calculates val_loss from model predictions and true labels
                running_loss += val_loss.item()
                _, predicted = torch.max(outputs, 1) # picks the highest-scoring class as the predicted label
                total += labels.size(0) # sums the number of predictions
                correct += (predicted == labels).sum().item() # sums the number of correct predictions

                pbar.update(1)

        return running_loss/len(val_loader), correct/total # return loss value, accuracy
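The accuracy bookkeeping in `validation()` can be checked by hand on a tiny example. The scores and labels below are made up: three samples, three classes.

```python
import torch

# Hypothetical model outputs: one row per sample, one score per class
toy_outputs = torch.tensor([[2.0, 0.1, -1.0],
                            [0.3, 1.5, 0.2],
                            [0.9, 0.2, 0.1]])
toy_labels = torch.tensor([0, 1, 2])

# torch.max along dim 1 returns (max values, indices); the indices are the predicted classes
_, toy_predicted = torch.max(toy_outputs, 1)

toy_correct = (toy_predicted == toy_labels).sum().item()  # number of correct predictions
toy_accuracy = toy_correct / toy_labels.size(0)           # fraction correct
print(toy_predicted.tolist(), toy_correct, toy_accuracy)
```

Here the first two samples are classified correctly and the third is not, so the accuracy is two out of three.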

# Things to note about our training and validation functions

## What's the difference between `model.train()` and `model.eval()`?
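These two are extremely important to your training and validation loops. `model.eval()` switches layers that behave differently during training, such as dropout and batch normalisation, into evaluation mode: dropout is disabled and batch normalisation uses its stored running statistics instead of per-batch statistics. `model.train()` switches them back. It's important to always use `model.train()` when training and `model.eval()` when evaluating.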
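A quick way to see the difference is to run a single dropout layer in both modes. This is a standalone sketch with a made-up input of ones, not part of the model above.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()          # training mode: elements are randomly zeroed, the rest are scaled by 1 / (1 - p)
train_out = dropout(x)

dropout.eval()           # evaluation mode: dropout does nothing
eval_out = dropout(x)

print(sorted(set(train_out.tolist())), torch.equal(eval_out, x))
```

In training mode the output only contains zeros and twos; in evaluation mode the input passes through unchanged.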

## Why do we need torch.no_grad()?
Running code inside `with torch.no_grad()` tells PyTorch not to track gradients. During validation or testing we don't update any weights, so we don't need to record gradients. This saves memory and computation by not doing work we don't need.
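The effect is easy to check on a single layer. This sketch uses a made-up `nn.Linear` layer and random input.

```python
import torch
from torch import nn

layer = nn.Linear(3, 2)
x = torch.rand(4, 3)

# Normal forward pass: PyTorch records the operations so it can backpropagate later
with_grad = layer(x)

# Forward pass inside no_grad: nothing is recorded
with torch.no_grad():
    without_grad = layer(x)

print(with_grad.requires_grad, without_grad.requires_grad)
```

The two outputs hold the same numbers; the only difference is whether PyTorch kept the bookkeeping needed for `backward()`.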



# Setting Up Training
- When training models, it is substantially faster to train on NVIDIA GPUs, because they offer a parallel computing platform called [cuda](https://developer.nvidia.com/cuda-zone) that speeds up these computations dramatically (cudnn is NVIDIA's library of deep learning routines built on top of cuda).
- So here we check if cuda is available with torch.cuda.is_available().
  - Following which, we send the model to the cuda device so the computation can be done on the GPU.

In [None]:
%%capture
import torch.backends.cudnn as cudnn
torch.cuda.empty_cache()
cudnn.benchmark = True  # Optimise for hardware

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device) # send model to GPU

# Loss Function & Optimizers
- The **Loss Function** calculates how 'far' the model's class probability predictions are from the actual labels.
  - Notice how I'm saying "how far the model's class probability predictions are from the actual labels" instead of "how inaccurate the model is". That's because accuracy is the percentage of correctly predicted labels, whereas the loss also reflects how confident each prediction was. This may sound the same to you but just keep it in mind, it will all make sense in due time.
  - CrossEntropyLoss is one way of calculating the loss of a model. Other loss functions include Kullback-Leibler Divergence Loss, Sparse Multiclass Cross-Entropy Loss, and many more.

- **The Optimizer** is a way of updating the weights of the model to minimize loss. In other words, the optimizer is the part of deep learning that helps a model 'learn'.
- In this case we're using the Adam optimizer. This is a fairly arbitrary choice, as no particular optimizer can be said to be superior to the others in every case. There's an important concept in deep learning called "no free lunch", which means that no particular methodology will achieve the best outcome for all scenarios. What it comes down to is experimentation.
  - A lr of 0.001 is also chosen. This is usually a good learning rate to start from with the Adam optimizer (it is the PyTorch default), but finding a better learning rate requires experimentation. (The PyTorch documentation lists the defaults for each optimizer.)
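The difference between loss and accuracy shows up in a small made-up example: two sets of predictions that are both 100% accurate but have very different losses. The logits and the `toy_` names below are invented for illustration.

```python
import torch
from torch import nn

toy_loss_fn = nn.CrossEntropyLoss()
toy_labels = torch.tensor([0, 1])

# Both sets of scores pick the right class, but with very different confidence
confident_logits = torch.tensor([[5.0, 0.0], [0.0, 5.0]])
hesitant_logits = torch.tensor([[0.6, 0.4], [0.4, 0.6]])


def toy_accuracy(logits, labels):
    # Fraction of samples whose highest-scoring class matches the label
    return (logits.argmax(dim=1) == labels).float().mean().item()


confident_acc = toy_accuracy(confident_logits, toy_labels)
hesitant_acc = toy_accuracy(hesitant_logits, toy_labels)
confident_loss = toy_loss_fn(confident_logits, toy_labels).item()
hesitant_loss = toy_loss_fn(hesitant_logits, toy_labels).item()

print(confident_acc, hesitant_acc)
print(confident_loss, hesitant_loss)
```

Accuracy cannot tell these two models apart, but the loss can, which is why the loss is what the optimizer minimizes.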


In [None]:
loss_fn = ? #again, what loss function would be good here?
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Let The Training Begin! Part 1


In [None]:
total_epoch = ? #feel free to play around with how many epochs you guys wanna do
for epoch in range(total_epoch): # loops through number of epochs
  train_loss = train(model, training_loader, loss_fn, optimizer, device)  # train the model for one epoch
  val_loss, accuracy = validation(model, validation_loader, loss_fn, device) # after training for one epoch, run the validation() function to see how the model is doing on the validation dataset
  print("Epoch: {}/{}, Training Loss: {}, Val Loss: {}, Val Accuracy: {}".format(epoch+1, total_epoch, train_loss, val_loss, accuracy))
  print('-' * 20)

print("Finished Training")


In [None]:
# We then save the model so we can come back to it later if need be
torch.save(model.state_dict(), 'stage-1')
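Coming back to a saved model means loading the state dict into a model with the same architecture. Here is a sketch of the round trip on a made-up small layer, writing to a temporary folder rather than to 'stage-1'.

```python
import os
import tempfile

import torch
from torch import nn

toy_model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical checkpoint file inside a temporary folder
    checkpoint_path = os.path.join(tmp_dir, "toy-stage-1")
    torch.save(toy_model.state_dict(), checkpoint_path)

    # Create a fresh model with the same architecture, then load the saved weights into it
    restored_model = nn.Linear(4, 2)
    restored_model.load_state_dict(torch.load(checkpoint_path))

weights_match = torch.equal(toy_model.weight, restored_model.weight)
print(weights_match)
```

For the ResNet18 in this notebook, the fresh model would also need its `fc` layer replaced before loading, so that the shapes match the saved weights.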

# Let The Training Begin! Part 2
- The cell below unfreezes the whole network, which allows the rest of the model to be optimized for this specific task.
- The cell after is exactly the same as the training of the model in 'Let The Training Begin! Part 1', except that we're now training for 10 epochs.

In [None]:
%%capture
for param in model.parameters():
  param.requires_grad = True

In [None]:
total_epoch = 10
for epoch in range(total_epoch): # loops through number of epochs
  train_loss = train(model, training_loader, loss_fn, optimizer, device) # train the model for one epoch
  val_loss, accuracy = validation(model, validation_loader, loss_fn, device) # after training for one epoch, run the validation() function to see how the model is doing on the validation dataset
  print("Epoch: {}/{}, Training Loss: {}, Val Loss: {}, Val Accuracy: {}".format(epoch+1, total_epoch, train_loss, val_loss, accuracy))
  print('-' * 20)

print("Finished Training")