# HW 1

For this homework, please design, train, and evaluate an MLP (Multi-Layer Perceptron, a.k.a. a neural network whose layers are all Linear layers) on the FashionMNIST dataset.
The dataset can be downloaded via PyTorch, just like how I downloaded the CIFAR-10 dataset.


# 0. Introduction and Importing

First, we import the most fundamental packages/modules of PyTorch.

In [2]:
import torch # a Tensor library like NumPy, with strong GPU support
import torch.optim # functions related to optimization algorithms

import torch.nn # a neural networks library deeply integrated with autograd designed for maximum flexibility
import torch.nn.functional
# nn and nn.functional work similarly. If you are interested in their subtle differences, please check out this discussion: https://discuss.pytorch.org/t/what-is-the-difference-between-torch-nn-and-torch-nn-functional/33597

import torch.utils.data # utility functions such as DataLoader


# import torch # * a Tensor library like NumPy, with strong GPU support
# import torch.optim as optim # functions related to optimization algorithms

# import torch.nn as nn # a neural networks library deeply integrated with autograd designed for maximum flexibility
# import torch.nn.functional as F
# # nn and nn.functional work similarly. If you are interested in their subtle differences, please check out this discussion: https://discuss.pytorch.org/t/what-is-the-difference-between-torch-nn-and-torch-nn-functional/33597

# import torch.utils.data as data # utility functions such as DataLoader

PyTorch essentially does two things:
- Manipulates the so-called tensor data structure on the GPU, just like NumPy manipulates ndarrays on the CPU.
- Provides an automatic differentiation engine and some convenient helper functions for deep learning.

A tensor is a data structure that can be thought of as a generalization of a matrix. A grayscale image is a matrix, but a color image with 3 channels can be thought of as a tensor.
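As a quick illustration of the shapes involved, here is a minimal sketch. The image sizes and the batch size are chosen only for the example, and the channels-first layout is the one PyTorch image tensors conventionally use.

```python
import torch

# a grayscale image is a 2-D tensor (height x width), i.e. a matrix
gray_image = torch.zeros(28, 28)

# a color image adds a channel dimension: (channels, height, width)
color_image = torch.zeros(3, 32, 32)

# a batch of images adds one more dimension in front: (batch, channels, height, width)
batch = torch.zeros(64, 3, 32, 32)

print(gray_image.shape, color_image.shape, batch.shape)
```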

Check whether a GPU is available. Computation will be very slow without one.

In [3]:
torch.cuda.is_available()

True

Next, we import some vision-related packages.

In [5]:
import torchvision
import torchvision.datasets

Finally, we import some generic helper packages.

In [6]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np

import copy
import random
import time
import cv2


The code sets a seed for the random number generators used by the random module, the numpy library, and the PyTorch library. By setting a seed, the code ensures that the results of the random number generation will be deterministic and reproducible, meaning that each time the code is run, the same sequence of random numbers will be generated. This is useful for debugging and testing, as well as for reproducing experimental results.

Additionally, the code sets the device to either the GPU (if available) or the CPU. The PyTorch library allows computations to be performed on either the GPU or the CPU, and the device to be used can be specified by setting the device variable.

Finally, the code sets torch.backends.cudnn.deterministic to True. This flag controls the deterministic behavior of the cuDNN library, which is used by PyTorch for GPU acceleration. By setting this flag to True, the code ensures that the cuDNN library will produce deterministic results and further improves the reproducibility of the code.

In [7]:
SEED = 1234


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
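To see what seeding buys us, here is a minimal sketch using only Python's built-in `random` module; the same idea applies to the NumPy and PyTorch generators seeded above. The variable names are illustrative.

```python
import random

random.seed(1234)
first_run = [random.random() for _ in range(3)]

random.seed(1234)  # reset the generator to the same starting state
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True: same seed, same sequence
```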

# 1. Using Pretrained Network

In [18]:
import torchvision.models
from PIL import Image

# * load a ResNet-18 whose weights were pre-trained on ImageNet
resnet18_model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
resnet18_model = resnet18_model.to(device)
resnet18_model.eval()

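def prepare_an_img_resnet(img_path):
    # standard ImageNet preprocessing: resize, center-crop, convert to tensor, standardize
    preprocess = torchvision.transforms.Compose([
        torchvision.transforms.Resize(256),
        torchvision.transforms.CenterCrop(224),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    img = Image.open(img_path).convert('RGB')  # force 3 channels (handles grayscale/RGBA files)
    img = preprocess(img)
    img = torch.unsqueeze(img, 0)  # add a batch dimension: (3, 224, 224) -> (1, 3, 224, 224)
    return img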


In [20]:
# Since we are not training the model, the gradients for this operation are not needed and can be temporarily disabled using torch.no_grad().
with torch.no_grad():
    # ! get the tensor form of the image
    output = resnet18_model(prepare_an_img_resnet('cat.jpg').to(device))

probabilities = torch.nn.functional.softmax(output[0], dim=0)
# Download ImageNet labels
!wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt

# Read the categories
with open("imagenet_classes.txt", "r") as f:
    categories = [s.strip() for s in f.readlines()]
    
# Show top categories per image
top5_prob, top5_catid = torch.topk(probabilities, 5)
for i in range(top5_prob.size(0)):
    print(categories[top5_catid[i]], top5_prob[i].item())

Egyptian cat 0.30802950263023376
Siamese cat 0.16848227381706238
Angora 0.12675005197525024
tabby 0.07222902774810791
hamper 0.04242419824004173


'wget' is not recognized as an internal or external command,
operable program or batch file.
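The `wget` shell command is not available on every system, as the error message above shows. A portable alternative, sketched here with only the Python standard library, is to fetch the file with `urllib.request.urlretrieve`. The helper name `download_if_missing` is hypothetical; the URL is the one from the cell above.

```python
import os
import urllib.request

LABELS_URL = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"

def download_if_missing(url, path):
    """Download url to path unless the file already exists; return the path."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# usage (needs network access the first time):
# download_if_missing(LABELS_URL, "imagenet_classes.txt")
```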


Now let's work on our own model. We will train a slight variation of a network called *AlexNet*. This is a landmark model in deep learning, and arguably kickstarted the current (and ongoing, and massive) wave of innovation in modern AI in 2012. AlexNet was the first real-world demonstration of a *deep* classifier that was trained end-to-end on data and that outperformed all other ML models up to that point.

We will train AlexNet using the [FashionMNIST](https://github.com/zalandoresearch/fashion-mnist) dataset, which consists of 60,000 training images (plus 10,000 test images), each a 28x28 grayscale image, in 10 classes, with 6,000 training images per class. The classes are:

0	T-shirt/top

1	Trouser

2	Pullover

3	Dress

4	Coat

5	Sandal

6	Shirt

7	Sneaker

8	Bag

9	Ankle boot
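For later use (for example, when labeling a confusion matrix), the table above can be written as a Python list indexed by label. This is a minimal sketch; `FASHION_MNIST_CLASSES` and `label_to_name` are illustrative names, not part of torchvision.

```python
# index -> class name, in the label order listed above
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def label_to_name(label):
    """Translate an integer label (0-9) into its class name."""
    return FASHION_MNIST_CLASSES[int(label)]

print(label_to_name(9))  # Ankle boot
```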


This process is called finetuning. Please take a look at this [article](https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html)

# 2. Data Loading and Pre-processing

The [FashionMNIST](https://github.com/zalandoresearch/fashion-mnist) dataset is included in PyTorch (via torchvision) because it is so widely used.

In [45]:
ROOT = '.data' # folder that contains the dataset

train_dataset = torchvision.datasets.FashionMNIST(root='./data', train=True, transform=torchvision.transforms.ToTensor(), download=True)
test_dataset = torchvision.datasets.FashionMNIST(root='./data', train=False, transform=torchvision.transforms.ToTensor(), download=True)

### Data augmentation


Next, we will do data augmentation. DL models are data-hungry. A good trick to increase the size of a dataset without the hard work of acquiring/labeling more data is data augmentation.

For each training image, we will randomly rotate it (by up to 5 degrees), flip/mirror it with probability 0.5, and pad it by 2 pixels per side before taking a random 32x32 crop.

In [46]:
# here we compose all the data augmentation actions we want to do.
# note that there is no need to do data augmentation on the testing set.
train_transforms = [torchvision.transforms.RandomRotation(5),
                  torchvision.transforms.RandomHorizontalFlip(0.5),
                  torchvision.transforms.RandomCrop(32, padding = 2),
                  torchvision.transforms.ToTensor()]

### Normalization and Standardization


To put it simply:

**normalize**: rescaling your data to the range [0, 1]

**standardize**: rescaling your data to have mean=0 and std=1

In modern deep learning, it is often okay to skip these steps, but they usually help with faster training and better accuracy. Please see this [article](https://stats.stackexchange.com/questions/185853/why-do-we-need-to-normalize-the-images-before-we-put-them-into-cnn).
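The difference between the two operations can be seen on a handful of numbers. The sketch below uses made-up pixel values and only the Python standard library; it uses the population standard deviation for simplicity.

```python
import statistics

pixels = [0, 51, 102, 153, 204, 255]  # made-up raw pixel values in [0, 255]

# normalize: rescale into [0, 1]
normalized = [p / 255 for p in pixels]

# standardize: subtract the mean and divide by the standard deviation
mean = statistics.mean(normalized)
std = statistics.pstdev(normalized)
standardized = [(p - mean) / std for p in normalized]

print(min(normalized), max(normalized))
print(round(statistics.mean(standardized), 6), round(statistics.pstdev(standardized), 6))
```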

Calculate the mean and standard deviation of pixel values so we can standardize the dataset later. 

In [47]:
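# work on a float copy of the raw pixels so that train_dataset itself is not overwritten
train_pixels = train_dataset.data.float()
# raw pixels are in [0, 255]; dividing by 255 matches the [0, 1] range produced by ToTensor()
means = train_pixels.mean(axis = (0,1,2)) / 255
stds = train_pixels.std(axis = (0,1,2)) / 255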

In [48]:
# append the standardization to the list of transformations we want to do.
train_transforms.append(torchvision.transforms.Normalize(means, stds))
train_transforms = torchvision.transforms.Compose(train_transforms)

test_transforms = torchvision.transforms.Compose([
                           torchvision.transforms.ToTensor(),
                           torchvision.transforms.Normalize(mean = means, 
                                                std = stds)
                       ])

In [49]:
train_transforms

Compose(
    RandomRotation(degrees=[-5.0, 5.0], interpolation=nearest, expand=False, fill=0)
    RandomHorizontalFlip(p=0.5)
    RandomCrop(size=(32, 32), padding=2)
    ToTensor()
    Normalize(mean=0.2860405743122101, std=0.35302427411079407)
)

Apply these transformations to our training set and testing set separately.

In [52]:
train_data = torchvision.datasets.FashionMNIST(ROOT, 
                              train = True, 
                              download = True, 
                              transform = train_transforms)

test_data = torchvision.datasets.FashionMNIST(ROOT, 
                             train = False, 
                             download = True, 
                             transform = test_transforms)

Leave out 10% of the training data as the validation set. **The model won't train on the validation set; it will only run inference on it.**

The validation set is similar to the test set (hence the similar transformations), but it is good practice to run your model on the test set only **once**, and to use the validation set as a gauge of how well your model generalizes while you tweak hyper-parameters.

In [53]:
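VALIDATION_RATIO = 0.9 # despite the name, this is the fraction kept for training; the remaining 10% becomes the validation set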

n_train_examples = int(len(train_data) * VALIDATION_RATIO)
n_valid_examples = len(train_data) - n_train_examples

train_data, valid_data = torch.utils.data.random_split(train_data, [n_train_examples, n_valid_examples])

valid_data = copy.deepcopy(valid_data)
valid_data.dataset.transform = test_transforms # we do not want data augmentation on the validation set, so we use the test transforms
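The split itself can be tried on a toy dataset. This is a minimal sketch, assuming a `TensorDataset` of 100 dummy samples in place of FashionMNIST and an explicitly seeded generator (an addition for reproducibility, not part of the cell above); the `toy_*` names are illustrative.

```python
import torch
import torch.utils.data

toy_data = torch.utils.data.TensorDataset(torch.arange(100))

n_train = int(len(toy_data) * 0.9)
n_valid = len(toy_data) - n_train

toy_train, toy_valid = torch.utils.data.random_split(
    toy_data, [n_train, n_valid],
    generator=torch.Generator().manual_seed(1234))

print(len(toy_train), len(toy_valid))  # 90 10

# the two subsets share no sample: together they cover every index exactly once
all_indices = sorted(list(toy_train.indices) + list(toy_valid.indices))
print(all_indices == list(range(100)))  # True
```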


The final step is to create a DataLoader object.

A DataLoader can be thought of as a Python iterator. Deep learning datasets are usually too large to fit entirely in memory (RAM, usually 8GB to 32GB), so we want a DataLoader that can spit out a fixed-size chunk of the dataset every time we need more data to process.

The batch size is the number of data points we ask the DataLoader to spit out at a time. After the DataLoader returns a batch, we send it to the GPU's memory (VRAM) so the GPU can work on it. The GPU also has limited memory, usually ranging from a few GB to 40GB, so the batch size should be adjusted according to the VRAM of your GPU.

In [54]:
BATCH_SIZE = 64

# we only shuffle the training set 
train_iterator = torch.utils.data.DataLoader(train_data,
                                             batch_size=BATCH_SIZE, 
                                             shuffle=True)

valid_iterator = torch.utils.data.DataLoader(valid_data,
                                             batch_size=BATCH_SIZE,
                                             shuffle=False)

test_iterator = torch.utils.data.DataLoader(test_data,
                                            batch_size=BATCH_SIZE, 
                                            shuffle=False)
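To see how a DataLoader hands out batches, here is a minimal sketch on dummy data. The tensor shapes, the batch size of 4, and the `toy_loader` name are assumptions made for the example.

```python
import torch
import torch.utils.data

# 10 dummy "images" of shape (1, 28, 28) with integer labels
images = torch.zeros(10, 1, 28, 28)
labels = torch.arange(10)
toy_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(images, labels),
    batch_size=4, shuffle=False)

batch_sizes = []
for x, y in toy_loader:
    # in a training loop, x and y would be moved to the device here, e.g. x.to(device)
    batch_sizes.append(x.shape[0])

print(batch_sizes)  # [4, 4, 2]: the last batch holds the leftover samples
```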

# 3. Defining the Model

Next up is defining the model.

AlexNet will have the following architecture:

* There are 5 2D convolutional layers (which serve as *feature extractors*), followed by 3 linear layers (which serve as the *classifier*).
* All layers (except the last one) have `ReLU` activations. (Use `inplace=True` while defining your ReLUs.)
* All convolutional layers have kernel size 3 x 3 and padding 1.
* Convolutional layer 1 has stride 2. All others have the default stride (1).
* Convolutional layers 1, 2, and 5 are followed by a 2D maxpool of size 2.
* Linear layers 1 and 2 are preceded by Dropouts with Bernoulli parameter 0.5.

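* For the convolutional layers, the number of channels is set as follows. We start with 3 channels (note that FashionMNIST images as loaded above are grayscale, i.e. 1 channel, so either the input or the first layer needs to be adapted) and then proceed like this: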

  - $3 \rightarrow 64 \rightarrow 192 \rightarrow 384 \rightarrow 256 \rightarrow 256$

  In the end, if everything is correct, a $32 \times 32$ input should give a feature map of size $2 \times 2 \times 256 = 1024$.

* For the linear layers, the feature sizes are as follows:

  - $1024 \rightarrow 4096 \rightarrow 4096 \rightarrow 10$.

  (The 10, of course, is because 10 is the number of classes in FashionMNIST).
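The expected feature-map size can be verified with the standard convolution output-size formula. The sketch below assumes a 32x32 input (the size the training augmentation produces) and that each maxpool uses a stride equal to its kernel size; `conv_out` and `pool_out` are illustrative helper names.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2):
    """Output size of a maxpool whose stride equals its kernel size."""
    return size // kernel

size = 32                                   # assumed input height/width
size = pool_out(conv_out(size, stride=2))   # conv 1 (stride 2) + maxpool: 32 -> 16 -> 8
size = pool_out(conv_out(size))             # conv 2 + maxpool:            8 -> 8 -> 4
size = conv_out(conv_out(size))             # conv 3 and conv 4:           4 -> 4 -> 4
size = pool_out(conv_out(size))             # conv 5 + maxpool:            4 -> 4 -> 2

n_features = size * size * 256
print(size, n_features)  # 2 1024
```

Running the same arithmetic with a 28x28 input gives a 1x1 feature map instead, so the input size matters for the first linear layer.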