# HW 1

For this homework, please design, train, and evaluate an MLP (Multi-Layer Perceptron, i.e., a neural network whose layers are all linear layers) on the FashionMNIST dataset.
The dataset can be downloaded via PyTorch, just like how I downloaded the CIFAR-10 dataset.


# 0. Introduction and Importing

First, we import the most fundamental packages/modules of PyTorch.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms
import cv2
import torch.nn.functional
import numpy as np
import os

PyTorch essentially does two things:
- Manipulates the tensor data structure, on CPU or GPU, much like NumPy manipulates ndarrays on CPU.
- Provides an automatic differentiation engine and some convenient helper functions for deep learning.

A tensor is a data structure that generalizes a matrix to any number of dimensions. A grayscale image is a matrix, while a color image with 3 channels can be thought of as a 3-dimensional tensor.
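As a small illustration of the shapes involved, here is a sketch with zero-filled tensors. The sizes are arbitrary examples chosen for this illustration, not values read from the dataset.

```python
import torch

# A grayscale image is a 2-D tensor (height x width), i.e. a matrix.
gray_image = torch.zeros(28, 28)

# A color image adds a channel dimension: (channels, height, width).
color_image = torch.zeros(3, 32, 32)

# A batch of images adds one more leading dimension:
# (batch, channels, height, width).
image_batch = torch.zeros(64, 1, 28, 28)

print(gray_image.ndim, color_image.ndim, image_batch.ndim)
print(tuple(image_batch.shape))
```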

Check if we are using GPU. Computation will be very slow if not.

In [2]:
torch.cuda.is_available()

True

Next we import some vision-related packages.

In [3]:
import torchvision
import torchvision.datasets

Finally some generic helper packages

In [4]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np

import copy
import random
import time
import cv2


The code sets a seed for the random number generators used by the random module, the numpy library, and the PyTorch library. By setting a seed, the code ensures that the results of the random number generation will be deterministic and reproducible, meaning that each time the code is run, the same sequence of random numbers will be generated. This is useful for debugging and testing, as well as for reproducing experimental results.

Additionally, the code sets the device to either the GPU (if available) or the CPU. The PyTorch library allows computations to be performed on either the GPU or the CPU, and the device to be used can be specified by setting the device variable.

Finally, the code sets torch.backends.cudnn.deterministic to True. This flag controls the deterministic behavior of the cuDNN library, which is used by PyTorch for GPU acceleration. By setting this flag to True, the code ensures that the cuDNN library will produce deterministic results and further improves the reproducibility of the code.

In [5]:
SEED = 1234


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
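To see what seeding buys us, the sketch below re-seeds the three generators and draws one number from each. The helper name `draw_after_seeding` is made up for this illustration; it runs on the CPU only.

```python
import random

import numpy as np
import torch


def draw_after_seeding(seed):
    """Seed all three generators, then draw one number from each."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    return random.random(), float(np.random.rand()), torch.rand(1).item()


first_run = draw_after_seeding(1234)
second_run = draw_after_seeding(1234)   # same seed, same numbers
other_seed = draw_after_seeding(4321)   # different seed, different numbers

print(first_run == second_run)
print(first_run == other_seed)
```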

Now let's work on our own model: an MLP (Multi-Layer Perceptron, i.e., a neural network whose layers are all linear layers) trained on the FashionMNIST dataset.

We will train our Linear_MLP on FashionMNIST, which consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.

The classes are: 

0	T-shirt/top

1	Trouser

2	Pullover

3	Dress

4	Coat

5	Sandal

6	Shirt

7	Sneaker

8	Bag

9	Ankle boot


# 1. Data Loading and Pre-processing

The [FashionMNIST](https://github.com/zalandoresearch/fashion-mnist) dataset is so widely used that it ships with torchvision, so we can download it directly.

In [6]:
ROOT = './data' # folder that contains the dataset

train_dataset = torchvision.datasets.FashionMNIST(root=ROOT, train=True, transform=torchvision.transforms.ToTensor(), download=True)
test_dataset = torchvision.datasets.FashionMNIST(root=ROOT, train=False, transform=torchvision.transforms.ToTensor(), download=True)

### Data augmentation


Next, we will do data augmentation. DL models are data hungry. A good trick to increase the effective size of the dataset without the hard work of acquiring/labeling more data is data augmentation.

For each training image, we will randomly rotate it (by up to 5 degrees) and flip/mirror it horizontally with probability 0.5.

In [7]:
# here we compose all the data augmentation actions we want to do.
# note that there is no need to do data augmentation on the testing set.
train_transforms = [torchvision.transforms.RandomRotation(5),
                  torchvision.transforms.RandomHorizontalFlip(0.5),
                  torchvision.transforms.ToTensor()]

### Normalization and Standardization


To put it simply:

**normalize**: rescale your data to the range [0, 1]

**standardize**: shift and scale your data so that mean=0 and std=1

In modern deep learning it's often okay to skip these, but they usually help with faster training and better accuracy. Please see this [article](https://stats.stackexchange.com/questions/185853/why-do-we-need-to-normalize-the-images-before-we-put-them-into-cnn).
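The two operations can be sketched on a handful of made-up pixel intensities. The values below are illustrative only, not taken from FashionMNIST.

```python
import torch

# Four made-up 8-bit pixel intensities.
pixels = torch.tensor([0.0, 51.0, 102.0, 255.0])

# Normalize: rescale into [0, 1] (this is what dividing by 255 does).
normalized = pixels / 255.0

# Standardize: subtract the mean and divide by the standard deviation.
standardized = (normalized - normalized.mean()) / normalized.std()

print(normalized)
print(standardized.mean().item(), standardized.std().item())
```

After standardization the mean is 0 and the standard deviation is 1, up to floating-point error.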

In [8]:
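# Per-pixel statistics over all training images (each of shape 28x28),
# divided by 255 to match the [0, 1] range produced by ToTensor()
train_pixels = train_dataset.data.float()
means = train_pixels.mean(axis = (0)) / 255
stds = train_pixels.std(axis = (0)) / 255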

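The cell above calculates the per-pixel mean and standard deviation of the training images, divided by 255 to match the [0, 1] range produced by `ToTensor()`, so we can standardize the dataset later.

Apply these transformations on our training set and testing set separately.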

In [9]:
# append the standardization to the list of transformations we want to do.
train_transforms.append(torchvision.transforms.Normalize(mean = means, std = stds))
train_transforms = torchvision.transforms.Compose(train_transforms)

test_transforms = torchvision.transforms.Compose([
                           torchvision.transforms.ToTensor(),
                           torchvision.transforms.Normalize(mean = means, 
                                                std = stds)
                       ])

# Load the FashionMNIST dataset
train_dataset = torchvision.datasets.FashionMNIST(root='data/',
                                                   train=True,
                                                   transform=train_transforms,
                                                   download=True)

test_dataset = torchvision.datasets.FashionMNIST(root='data/',
                                                  train=False,
                                                  transform=test_transforms,
                                                  download=True)

Leave out 10% of the training set as the validation set. **The model won't train on the validation set, but only do inference on it.**

The validation set is similar to the test set (hence the same transformations), but it's good practice to run your model on the test set only **once**, and to use the validation set as a gauge of how well your model generalizes while tweaking hyper-parameters.

In [10]:
VALID_RATIO = 0.9 # fraction kept for training; the remaining 10% becomes the validation set

n_train_examples = int(len(train_dataset) * VALID_RATIO)
n_valid_examples = len(train_dataset) - n_train_examples

train_dataset, valid_dataset = torch.utils.data.random_split(train_dataset, 
                                           [n_train_examples, n_valid_examples])

valid_dataset = copy.deepcopy(valid_dataset)
valid_dataset.dataset.transform = test_transforms
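The `copy.deepcopy` step matters because both subsets returned by `random_split` point at the same underlying dataset object, so changing its transform would otherwise affect the training subset as well. Here is a minimal sketch with a toy dataset; `ToyDataset` and its negating "transform" are hypothetical stand-ins for FashionMNIST and `test_transforms`.

```python
import copy

from torch.utils.data import Dataset, random_split


class ToyDataset(Dataset):
    """Hypothetical stand-in for FashionMNIST: items are the integers 1..n."""

    def __init__(self, n):
        self.items = list(range(1, n + 1))
        self.transform = None

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        item = self.items[index]
        return self.transform(item) if self.transform else item


full_dataset = ToyDataset(100)
n_train = int(len(full_dataset) * 0.9)
n_valid = len(full_dataset) - n_train
train_subset, valid_subset = random_split(full_dataset, [n_train, n_valid])

# Both subsets wrap the very same dataset object.
shared_before_copy = train_subset.dataset is valid_subset.dataset

# Copy first, then change the transform only on the copy.
valid_subset = copy.deepcopy(valid_subset)
valid_subset.dataset.transform = lambda item: -item

print(shared_before_copy, train_subset[0], valid_subset[0])
```

Without the deep copy, the assignment to `valid_subset.dataset.transform` would also change what the training subset returns.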

The final step is to create a DataLoader object.

A DataLoader can be thought of as a Python iterator over the dataset. Deep learning datasets are often too large to fit in memory (RAM, usually 8GB to 32GB) entirely, so we want a DataLoader that yields a fixed-size chunk of the dataset each time we need more data to process.

The batch size is the number of data points the DataLoader yields at a time. After the DataLoader yields a batch, we send it to the GPU's memory (VRAM) so the GPU can work on it. GPU memory is also limited, usually ranging from a few GB to 40GB, so the batch size should be adjusted according to the VRAM of your GPU.
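The batching behavior can be sketched on a tiny made-up dataset of 10 samples. The tensors and names below are illustrative and unrelated to FashionMNIST.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 made-up samples with one feature each, plus integer labels.
toy_features = torch.arange(10, dtype=torch.float32).unsqueeze(1)
toy_labels = torch.arange(10)

toy_loader = DataLoader(TensorDataset(toy_features, toy_labels),
                        batch_size=4,
                        shuffle=False)

# Each iteration yields one batch; the last batch may be smaller.
batch_sizes = [len(batch_labels) for _, batch_labels in toy_loader]
print(len(toy_loader), batch_sizes)
```

By the same arithmetic, 54,000 training examples with a batch size of 64 give 844 batches per epoch: 843 full batches plus a final batch of 48.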

In [11]:
BATCH_SIZE = 64

# we only shuffle the training set 
train_iterator = torch.utils.data.DataLoader(train_dataset,
                                             batch_size=BATCH_SIZE, 
                                             shuffle=True)

validation_iterator = torch.utils.data.DataLoader(valid_dataset,
                                             batch_size=BATCH_SIZE,
                                             shuffle=False)

test_iterator = torch.utils.data.DataLoader(test_dataset,
                                            batch_size=BATCH_SIZE, 
                                            shuffle=False)

# 2. Defining the Model

Next up is defining the model.

Linear_MLP will have the following architecture:

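* There are 4 fully connected linear layers (which serve as *feature extractors*), followed by 1 linear layer (which serves as the *classifier*).
* The 4 feature-extractor layers have `ReLU` activations; the final classifier layer outputs raw scores (logits) with no activation. (Use `inplace=True` if you define your ReLUs as `nn.ReLU` modules; the code below uses the functional `torch.relu` instead.)

* For the linear layers, the feature sizes are as follows:

  - $784 \rightarrow 1024 \rightarrow 512 \rightarrow 256 \rightarrow 128 \rightarrow 10$.

  (The 784 is the 28 × 28 pixels of a flattened input image, and the 10, of course, is the number of classes in FashionMNIST.)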
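As a quick sanity check on the model size, the number of trainable parameters can be worked out by hand. The sketch below uses the layer sizes from the `Linear_MLP` definition (784 inputs, then 1024, 512, 256, 128, and 10 outputs); each linear layer has `fan_in * fan_out` weights plus `fan_out` biases.

```python
# Layer widths, from the flattened 28x28 input to the 10 class scores.
layer_sizes = [784, 1024, 512, 256, 128, 10]

# Weights plus biases for every consecutive pair of widths.
params_per_layer = [fan_in * fan_out + fan_out
                    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(params_per_layer)

print(params_per_layer)
print(total_params)
```

The first layer alone accounts for more than half of the roughly 1.5 million parameters.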

In [12]:
# Define the Linear_MLP model
class Linear_MLP(torch.nn.Module):
    def __init__(self):
        super(Linear_MLP, self).__init__()
        self.fully_connected_layer_1 = torch.nn.Linear(28 * 28, 1024)
        self.fully_connected_layer_2 = torch.nn.Linear(1024, 512)
        self.fully_connected_layer_3 = torch.nn.Linear(512, 256)
        self.fully_connected_layer_4 = torch.nn.Linear(256, 128)
        self.fully_connected_layer_5 = torch.nn.Linear(128, 10)

    def forward(self, image_tensor):
        image_tensor = image_tensor.view(-1, 28 * 28)
        image_tensor = torch.relu(self.fully_connected_layer_1(image_tensor))
        image_tensor = torch.relu(self.fully_connected_layer_2(image_tensor))
        image_tensor = torch.relu(self.fully_connected_layer_3(image_tensor))
        image_tensor = torch.relu(self.fully_connected_layer_4(image_tensor))
        image_tensor = self.fully_connected_layer_5(image_tensor)
        return image_tensor

In [13]:
model = Linear_MLP()

# 3. Training the Model

Before we start the training, we need to initialize our model. To put it simply, we are assigning the initial values of the weights. We cannot just set them all to 0: every unit in a layer would then compute the same output and receive the same gradient, so the network would fail to learn useful features. Data scientists have come up with smarter ways to do this, based on random values scaled by the layer sizes.

For the linear layers we initialize the weights using the *Xavier Normal* scheme, also known as *Glorot Normal*, with the gain recommended for ReLU. We initialize the bias terms to zeros.
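For intuition, Xavier Normal draws weights from a zero-mean normal distribution with standard deviation `gain * sqrt(2 / (fan_in + fan_out))`, and the gain PyTorch recommends for ReLU is `sqrt(2)`. The sketch below computes that standard deviation by hand for the layer sizes used here; the helper name `xavier_normal_std` is made up for this illustration.

```python
import math


def xavier_normal_std(fan_in, fan_out, gain=1.0):
    """Standard deviation used by Xavier (Glorot) normal initialization."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))


relu_gain = math.sqrt(2.0)  # same value as nn.init.calculate_gain('relu')
layer_sizes = [784, 1024, 512, 256, 128, 10]

for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    std = xavier_normal_std(fan_in, fan_out, gain=relu_gain)
    print(f'{fan_in:>4} -> {fan_out:<4} std = {std:.4f}')
```

Wider layers get a smaller standard deviation, which keeps the scale of the activations roughly stable from layer to layer.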

In [14]:
def initialize_parameters(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight.data, gain = nn.init.calculate_gain('relu'))
        nn.init.constant_(m.bias.data, 0)

In [15]:
model.apply(initialize_parameters)

Linear_MLP(
  (fully_connected_layer_1): Linear(in_features=784, out_features=1024, bias=True)
  (fully_connected_layer_2): Linear(in_features=1024, out_features=512, bias=True)
  (fully_connected_layer_3): Linear(in_features=512, out_features=256, bias=True)
  (fully_connected_layer_4): Linear(in_features=256, out_features=128, bias=True)
  (fully_connected_layer_5): Linear(in_features=128, out_features=10, bias=True)
)

Next we create an optimizer and a loss function.

Here the optimizer is Adam. It's a more advanced variant of the common optimization algorithm called gradient descent, which adapts the step size for each parameter. There are a few other optimizers out there, but for most common tasks we will just use Adam.

The loss function is the cross entropy loss. Notice that in our model definition, there is no activation function for the very last layer. This is because `nn.CrossEntropyLoss` itself has softmax baked in to do multi-class classification. Part of the design choice is explained [here](https://stackoverflow.com/questions/57516027/does-pytorch-apply-softmax-automatically-in-nn-linear).
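The "softmax baked in" claim can be checked numerically. The sketch below feeds made-up logits to `nn.CrossEntropyLoss` and compares the result with applying log-softmax followed by the negative log-likelihood loss by hand.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw, un-activated outputs for 4 samples
labels = torch.tensor([0, 3, 9, 1])    # made-up class labels

# Built-in loss applied directly to the raw logits.
loss_builtin = torch.nn.CrossEntropyLoss()(logits, labels)

# The same computation done in two explicit steps.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(loss_builtin.item(), loss_manual.item())
```

Because the two values agree, adding a softmax layer to the model before this loss would apply softmax twice.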

In [16]:
optimizer = optim.Adam(model.parameters(), lr = 1e-3)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
criterion = nn.CrossEntropyLoss()

model = model.to(device)
criterion = criterion.to(device)

cuda


In [17]:
def calculate_accuracy(y_pred, y):
    top_pred = y_pred.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc

In [18]:
def train(model, iterator, optimizer, criterion, device):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for i, (images, y) in enumerate(iterator):
        images = images.to(device)
        y = y.to(device)
        
        optimizer.zero_grad()
        
        y_pred = model(images)

        loss = criterion(y_pred, y)
        acc = calculate_accuracy(y_pred, y)

        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


The training function works as follows:

- put our model into train mode with `model.train()`. Some layers (such as dropout or batch normalization) act differently during training than testing.

For each iteration:

- acquire a batch of `BATCH_SIZE` (image, label) pairs from the data loader
- send the batch to the device (the GPU, if available)
- clear the gradients calculated in the previous iteration
- pass the batch of images, `images`, through the model to get predictions, `y_pred`
- calculate the loss between our predictions and the actual labels
- calculate the accuracy of our predictions against the actual labels
- calculate the gradient of the loss with respect to each parameter by backpropagation (`loss.backward()`)
- update the parameters by taking an optimizer step (`optimizer.step()`)
- update our running metrics
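The reason for clearing gradients in every iteration is that PyTorch adds new gradients onto whatever is already stored. A minimal sketch with a single made-up parameter `w`:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first_grad = w.grad.item()        # gradient of 3*w with respect to w

(3 * w).backward()                # no reset: the new gradient is added on top
accumulated_grad = w.grad.item()

w.grad.zero_()                    # reset, serving the same purpose as optimizer.zero_grad()
(3 * w).backward()
grad_after_reset = w.grad.item()

print(first_grad, accumulated_grad, grad_after_reset)
```

Skipping the reset would make each update use the sum of gradients from all previous batches.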

In [19]:
def evaluate(model, iterator, criterion, device):
    
    epoch_loss = 0
    epoch_accuracy = 0
    
    model.eval()
    
    with torch.no_grad():
        
        for (images, y) in iterator:
    
            images = images.to(device)
            y = y.to(device)

            y_pred = model(images)

            loss = criterion(y_pred, y)

            accuracy = calculate_accuracy(y_pred, y)

            epoch_loss += loss.item()
            epoch_accuracy += accuracy.item()
        
    return epoch_loss / len(iterator), epoch_accuracy / len(iterator)

The evaluation loop is similar to the training loop, with a few differences:
1. We put our model into evaluation mode with `model.eval()`, for the same reason we called `model.train()` during training.
2. We wrap the iterations inside `with torch.no_grad()` because at test time we no longer need gradients, and skipping them saves memory and computation time.
3. We do not call the optimizer, because we are no longer updating the model.
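The effect of `torch.no_grad()` can be seen on a single made-up linear layer: the outputs are the same, but no computation graph is recorded for them.

```python
import torch

toy_layer = torch.nn.Linear(3, 2)
toy_input = torch.randn(4, 3)

# Normal forward pass: autograd records how the output was computed.
tracked_output = toy_layer(toy_input)

# Forward pass under no_grad: same numbers, but nothing is recorded.
with torch.no_grad():
    untracked_output = toy_layer(toy_input)

print(tracked_output.requires_grad, untracked_output.requires_grad)
```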


In [20]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally we can start training.

In each epoch, we run the training function once, which passes over the entire training set exactly once and updates the model after every batch. We then run the updated model through the evaluation function to get the validation accuracy, which gauges how well the model generalizes.

We repeat this for 25 epochs here.

In [21]:
EPOCHS = 25

# used to record the history of the training
train_loss_history = []
train_accuracy_history = []
validation_loss_history = []
validation_accuracy_history = []

for epoch in range(EPOCHS):
    start_time = time.time() # record start time

    train_loss, train_acc = train(model=model, 
                                    iterator=train_iterator, 
                                    optimizer=optimizer, 
                                    criterion=criterion, 
                                    device=device)
    torch.save(model, './model_'+str(epoch)+'.pt')
    
    train_loss_history.append(train_loss)
    train_accuracy_history.append(train_acc)
    
    validation_loss, validation_accuracy = evaluate(model=model, 
                                 iterator=validation_iterator, 
                                 criterion=criterion, 
                                 device=device)
    
    validation_loss_history.append(validation_loss) 
    validation_accuracy_history.append(validation_accuracy)
    end_time = time.time()
    minute, second = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1}/{EPOCHS}') 
    print(f'Training Loss: {train_loss}.. Validation Loss: {validation_loss}')
    print(f'Training Accuracy: {train_acc}.. Validtion Accuracy: {validation_accuracy}')
    print(f'Time Elapsed: {minute} in minute.. {second} in second')
    print('')

Epoch: 1/25
Training Loss: 0.5600574032284354.. Validation Loss: 0.43378684574619253
Training Accuracy: 0.8077730055527664.. Validtion Accuracy: 0.8389849295007422
Time Elapsed: 0 in minute.. 14 in second

Epoch: 2/25
Training Loss: 0.4122523602221814.. Validation Loss: 0.4164559430581458
Training Accuracy: 0.8532965541309655.. Validtion Accuracy: 0.849678634963137
Time Elapsed: 0 in minute.. 12 in second

Epoch: 3/25
Training Loss: 0.37520327194818953.. Validation Loss: 0.366104356627515
Training Accuracy: 0.8638366409952607.. Validtion Accuracy: 0.8643617021276596
Time Elapsed: 0 in minute.. 12 in second

Epoch: 4/25
Training Loss: 0.35483999490314183.. Validation Loss: 0.3884074401031149
Training Accuracy: 0.8748395537595614.. Validtion Accuracy: 0.8645279255319149
Time Elapsed: 0 in minute.. 11 in second

Epoch: 5/25
Training Loss: 0.32781164094770404.. Validation Loss: 0.35412098657577595
Training Accuracy: 0.8804119767453433.. Validtion Accuracy: 0.874501329787234
Time Elapsed: 0
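The cell that follows loads the checkpoint from the final epoch. One alternative, shown here only as a sketch and not what this notebook does, is to pick the checkpoint with the lowest validation loss. The list below holds the validation losses of the first five epochs logged above, rounded to four decimals; the file name pattern matches the `torch.save` call in the training loop.

```python
# Validation losses rounded from the first five epochs logged above.
logged_validation_losses = [0.4338, 0.4165, 0.3661, 0.3884, 0.3541]

# Index (0-based epoch) of the smallest validation loss.
best_epoch = min(range(len(logged_validation_losses)),
                 key=lambda epoch: logged_validation_losses[epoch])

# Same naming scheme as the checkpoints written during training.
best_checkpoint = './model_' + str(best_epoch) + '.pt'

print(best_epoch, best_checkpoint)
```

In a real run, `validation_loss_history` from the training cell would take the place of the hard-coded list.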

In [22]:
test_model = Linear_MLP()
test_model = torch.load('model_24.pt', map_location=device)
test_model = test_model.to(device)
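The cells above save and load the whole pickled model object. A commonly recommended alternative is to save only the parameters (the `state_dict`) and load them into a freshly constructed model. The sketch below shows that pattern on a small stand-in layer; the names `toy_model`, `restored_model`, and `checkpoint.pt` are made up for this illustration.

```python
import os
import tempfile

import torch

toy_model = torch.nn.Linear(4, 2)   # stand-in for a trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, 'checkpoint.pt')

    # Save only the parameters, not the pickled model object.
    torch.save(toy_model.state_dict(), checkpoint_path)

    # Rebuild the architecture, then load the saved parameters into it.
    restored_model = torch.nn.Linear(4, 2)
    restored_model.load_state_dict(torch.load(checkpoint_path, map_location='cpu'))

print(torch.equal(toy_model.weight, restored_model.weight))
```

With this pattern the class definition has to be available at load time, which is already the case here because `Linear_MLP` is defined in the notebook.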

In [23]:
_, test_acc = evaluate(model=test_model, 
                        iterator=test_iterator, 
                        criterion=criterion, 
                        device=device)
print('Our test accuracy is:', test_acc*100, '%')

Our test accuracy is: 88.81369426751591 %


In [24]:
# Load test images from local folder
test_images_folder = './test_images'
test_images = []
test_labels = []

for image_name in os.listdir(test_images_folder):
    image = cv2.imread(os.path.join(test_images_folder, image_name), cv2.IMREAD_GRAYSCALE)
    image = cv2.resize(image, (28, 28))
    label = image_name.split('_')[0] # assuming that the label is the first part of the file name, separated by '_'
    
    test_images.append(image)
    test_labels.append(int(label))

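# Stack the images into one NumPy array first (building a tensor from a list of arrays is slow),
# then convert to a tensor and scale to [0, 1]
test_images = torch.tensor(np.array(test_images), dtype=torch.float32) / 255
test_images = test_images.view(-1, 1, 28, 28)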
test_labels = torch.tensor(test_labels)

# Move all tensors to the same device
test_model = test_model.to(device)
test_images = test_images.to(device)
test_labels = test_labels.to(device)

# Evaluate the model on the test images
test_model.eval()
with torch.no_grad():
    outputs = test_model(test_images)
    _, predicted = torch.max(outputs.data, 1)
    print("Predicted labels: ", predicted)
    print("Actual labels: ", test_labels)

    correct = (predicted == test_labels).sum().item()
    accuracy = correct / len(test_labels)

print(f'Accuracy of the network on the local test images: {accuracy * 100}%')


Predicted labels:  tensor([8, 8, 8], device='cuda:0')
Actual labels:  tensor([1, 4, 7], device='cuda:0')
Accuracy of the network on the local test images: 0.0%
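Every local image being predicted as class 8 suggests the inputs look different from what the model saw during training. One difference visible in the code: the training and test pipelines standardize each image with `means` and `stds`, whereas the local images are only divided by 255. The sketch below applies the same two steps to a batch of raw images. It is an untested suggestion, not a verified fix; the helper name and the `fake_*` tensors are hypothetical stand-ins for the real statistics and images.

```python
import torch


def preprocess_like_training(images_uint8, means, stds):
    """Scale to [0, 1], then standardize with the training statistics."""
    images = images_uint8.float() / 255
    images = (images - means) / stds
    return images.view(-1, 1, 28, 28)


# Hypothetical stand-ins for the notebook's `means`, `stds`, and local images.
torch.manual_seed(0)
fake_means = torch.full((28, 28), 0.3)
fake_stds = torch.full((28, 28), 0.35)
fake_images = torch.randint(0, 256, (3, 28, 28), dtype=torch.uint8)

local_batch = preprocess_like_training(fake_images, fake_means, fake_stds)
print(tuple(local_batch.shape))
```

FashionMNIST images also show light objects on a black background, so local photos with a light background may need to be inverted (`255 - image`) before this step; that is an assumption about the local files, which are not shown here.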


