# Assignment 1: Multi-Layer Perceptron with MNIST Dataset

In this assignment, you are required to train two MLPs to classify images from the [MNIST](http://yann.lecun.com/exdb/mnist/) handwritten digit database using PyTorch.

The process will be broken down into the following steps:
1. Load and visualize the data.
2. Define a neural network. (30 marks)
3. Train the models. (30 marks)
4. Evaluate the performance of our trained models on the test dataset. (20 marks)
5. Analyze your results. (20 marks)

In [1]:
import torch
import numpy as np
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

device

device(type='cuda')

---
## Load and Visualize the Data

Downloading may take a few moments, and you should see your progress as the data is loading. You may also choose to change the `batch_size` if you want to load more data at a time.

This cell will create DataLoaders for each of our datasets.
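
To make the effect of `batch_size` concrete, the plain-Python sketch below mimics how a loader without shuffling splits a dataset into consecutive batches. The helper `iter_batches` is hypothetical and only illustrates the idea; it is not how `DataLoader` is implemented.

```python
def iter_batches(num_samples, batch_size):
    """Yield lists of sample indices, one list per batch (no shuffling)."""
    for start in range(0, num_samples, batch_size):
        # the last batch may be smaller if batch_size does not divide num_samples
        yield list(range(start, min(start + batch_size, num_samples)))

# The MNIST training split has 60,000 images, so batch_size = 20
# gives 60000 / 20 = 3000 batches per epoch.
batches = list(iter_batches(60000, 20))
num_batches = len(batches)
```

A larger `batch_size` means fewer, larger batches per epoch, and therefore fewer optimizer steps per epoch.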

In [12]:
from torchvision import datasets
import torchvision.transforms as transforms

# number of subprocesses to use for data loading
num_workers = 0
# how many samples per batch to load
batch_size = 20

# convert data to torch.FloatTensor
transform = transforms.ToTensor()

# choose the training and test datasets
train_data = datasets.MNIST(root='data', train=True,
                                   download=True, transform=transform)
test_data = datasets.MNIST(root='data', train=False,
                                  download=True, transform=transform)

# prepare data loaders
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, num_workers=num_workers)

Dataset MNIST
    Number of datapoints: 10000
    Root location: data
    Split: Test
    StandardTransform
Transform: ToTensor()

### Visualize a Batch of Training Data

The first step in a classification task is to take a look at the data, make sure it is loaded in correctly, then make any initial observations about patterns in that data.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
    
# obtain one batch of training images
dataiter = iter(train_loader)
images, labels = next(dataiter)  # iterators no longer expose .next() in recent PyTorch versions
images = images.numpy()

# plot the images in the batch, along with the corresponding labels
fig = plt.figure(figsize=(25, 4))
for idx in np.arange(20):
    ax = fig.add_subplot(2, 20//2, idx+1, xticks=[], yticks=[])  # integer division: the grid size must be an int
    ax.imshow(np.squeeze(images[idx]), cmap='gray')
    # print out the correct label for each image
    # .item() gets the value contained in a Tensor
    ax.set_title(str(labels[idx].item()))

In [None]:
images.shape

### View an Image in More Detail

In [None]:
img = np.squeeze(images[1])

fig = plt.figure(figsize = (12,12)) 
ax = fig.add_subplot(111)
ax.imshow(img, cmap='gray')
width, height = img.shape
thresh = img.max()/2.5
for x in range(width):
    for y in range(height):
        val = round(img[x][y],2) if img[x][y] !=0 else 0
        ax.annotate(str(val), xy=(y,x),
                    horizontalalignment='center',
                    verticalalignment='center',
                    color='white' if img[x][y]<thresh else 'black')

---
## Define the Network Architecture (30 marks)

* Input: a 784-dim Tensor of pixel values for each image.
* Output: a 10-dim Tensor of class scores, one score per digit class, for an input image.
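
As a rough illustration of these shapes, the sketch below flattens a 28×28 image into 784 values and turns 10 class scores into probabilities with a softmax. It uses plain Python lists rather than tensors, the helper names `flatten` and `softmax` are assumptions made for this example, and the scores are made up.

```python
import math

def flatten(image):
    """Flatten a 2-D list (rows x cols) into a 1-D list, row by row."""
    return [pixel for row in image for pixel in row]

def softmax(scores):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

image = [[0.0] * 28 for _ in range(28)]  # dummy all-black image
x = flatten(image)                       # 784 input features
scores = [0.1, 2.0, -1.0, 0.3, 0.0, 0.5, -0.2, 1.0, 0.4, 0.2]  # made-up class scores
probs = softmax(scores)
# the predicted digit is the index of the largest score (or probability)
predicted_class = max(range(10), key=lambda k: probs[k])
```

The network itself only needs to output the raw scores; the softmax is applied inside the loss function when a cross-entropy loss on logits is used.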

You need to implement the following:
1. a vanilla multi-layer perceptron. (10 marks)
2. a multi-layer perceptron with regularization (dropout or L2 or both). (10 marks)
3. the corresponding loss functions and optimizers. (10 marks)
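
For intuition about the two regularizers named above, here is a minimal plain-Python sketch of inverted dropout and of an SGD step with an L2 penalty. It is a simplified picture of what `nn.Dropout` and the `weight_decay` argument of `torch.optim.SGD` do, not their actual implementation; the function names `dropout` and `sgd_step_l2` are hypothetical.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each value with probability p and
    scale the survivors by 1 / (1 - p) so the expected value is unchanged."""
    if not training or p == 0.0:
        return list(activations)  # dropout is the identity at evaluation time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def sgd_step_l2(weights, grads, lr, weight_decay):
    """One SGD step where weight_decay * w is added to each gradient,
    which corresponds to an L2 penalty of (weight_decay / 2) * w**2."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

rng = random.Random(0)                        # fixed seed for reproducibility
hidden = [1.0] * 1000
dropped = dropout(hidden, p=0.5, rng=rng)     # values are either 0.0 or 2.0
eval_out = dropout(hidden, p=0.5, training=False)

# With zero gradients, the step only shrinks the weights toward zero.
new_w = sgd_step_l2([1.0, -2.0], grads=[0.0, 0.0], lr=0.1, weight_decay=0.001)
```

Dropout only acts in training mode, which is why switching the model to evaluation mode before testing matters for a model that uses it.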

### Build model_1

In [6]:
## Define the MLP architecture
import torch.nn as nn
class VanillaMLP(nn.Module):
    def __init__(self):
        super(VanillaMLP, self).__init__()
        
        # implement your codes here
        num_inputs, num_outputs, num_hiddens = 784, 10, [64, 128, 256] 
        self.l1 = nn.Linear(num_inputs, num_hiddens[0])
        self.l2 = nn.Linear(num_hiddens[0], num_hiddens[1])
        self.l3 = nn.Linear(num_hiddens[1], num_hiddens[2])
        self.relu = nn.ReLU()
        self.l4 = nn.Linear(num_hiddens[2], num_outputs)

    def forward(self, x):
        # flatten image input
        x = x.view(-1, 28 * 28)

        # implement your codes here
        x = self.relu(self.l1(x))
        x = self.relu(self.l2(x))
        x = self.relu(self.l3(x))
        x = self.l4(x)
        return x

# initialize the MLP
model_1 = VanillaMLP()
model_1.to(device)

# specify loss function
# implement your codes here
loss_1 = torch.nn.CrossEntropyLoss()
loss_1.to(device)
# specify your optimizer
# implement your codes here
optimizer_1 = torch.optim.SGD(model_1.parameters(), lr = 0.1)

### Build model_2

In [17]:
## Define the MLP architecture
class RegularizedMLP(nn.Module):
    def __init__(self):
        super(RegularizedMLP, self).__init__()
        
        # implement your codes here
        num_inputs, num_outputs, num_hiddens = 784, 10, [64, 128, 256] 
        drop_prob = [0.2, 0.3, 0.5]
        self.l1 = nn.Linear(num_inputs, num_hiddens[0])
        self.l2 = nn.Linear(num_hiddens[0], num_hiddens[1])
        self.l3 = nn.Linear(num_hiddens[1], num_hiddens[2])
        self.relu = nn.ReLU()
        self.dropout1 = nn.Dropout(drop_prob[0])
        self.dropout2 = nn.Dropout(drop_prob[1])
        self.dropout3 = nn.Dropout(drop_prob[2])
        self.l4 = nn.Linear(num_hiddens[2], num_outputs)


    def forward(self, x):
        # flatten image input
        x = x.view(-1, 28 * 28)

        # implement your codes here
        x = self.dropout1(self.relu(self.l1(x)))
        x = self.dropout2(self.relu(self.l2(x)))
        x = self.dropout3(self.relu(self.l3(x)))
        x = self.l4(x)
        
        return x

# initialize the MLP
model_2 = RegularizedMLP()
model_2.to(device)
print(model_2)
# specify loss function
# implement your codes here
loss_2 = torch.nn.CrossEntropyLoss()
loss_2.to(device)

# specify your optimizer
# implement your codes here
weight_decay = 0.001
weight_decay_list = (param for name, param in model_2.named_parameters() if name[-4:] != 'bias' and "bn" not in name)
no_decay_list = (param for name, param in model_2.named_parameters() if name[-4:] == 'bias' or "bn" in name)

parameters = [{'params': weight_decay_list},
              {'params': no_decay_list, 'weight_decay': 0.}]

optimizer_2 = torch.optim.SGD(parameters, weight_decay=weight_decay, lr = 0.1)


RegularizedMLP(
  (l1): Linear(in_features=784, out_features=64, bias=True)
  (l2): Linear(in_features=64, out_features=128, bias=True)
  (l3): Linear(in_features=128, out_features=256, bias=True)
  (relu): ReLU()
  (dropout1): Dropout(p=0.2, inplace=False)
  (dropout2): Dropout(p=0.3, inplace=False)
  (dropout3): Dropout(p=0.5, inplace=False)
  (l4): Linear(in_features=256, out_features=10, bias=True)
)


---
## Train the Network (30 marks)

Train your models in the following two cells.

The following loop trains for 30 epochs; feel free to change this number. For now, we suggest somewhere between 20 and 50 epochs. As you train, watch how the training loss decreases over time. We want it to decrease while also avoiding overfitting the training data.
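
One detail worth checking when monitoring the loss: `nn.CrossEntropyLoss` averages over the batch by default, so adding up its per-batch outputs and dividing by the dataset size under-reports the per-sample loss by a factor of the batch size. The plain-Python sketch below shows the per-sample loss and a weighted way to average batch means; `cross_entropy` and `epoch_average` are hypothetical helpers and the numbers are made up.

```python
import math

def cross_entropy(scores, target):
    """Cross-entropy of one sample: -log(softmax(scores)[target])."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]

def epoch_average(batch_means, batch_sizes):
    """Per-sample average loss computed from per-batch mean losses."""
    total = sum(mean * n for mean, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)

# Equal scores for all 10 classes give a loss of log(10).
uniform_loss = cross_entropy([0.0] * 10, target=3)

# Two full batches of 20 samples and a final partial batch of 5 (made-up values).
avg = epoch_average([0.50, 0.40, 0.30], [20, 20, 5])
```

Weighting each batch mean by its batch size keeps the epoch average correct even when the last batch is smaller than the others.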

**The key parts in the training process are left for you to implement.**

### Train model_1

In [7]:
# number of epochs to train the model
n_epochs = 30  # suggest training between 20-50 epochs
model_1.train() # prep model for training

def accuracy(y_hat, y):
    #print((y_hat.argmax(dim=1)== y).float().sum())
    return (y_hat.argmax(dim=1) == y).float().sum().item()

for epoch in range(n_epochs):
    # monitor training loss
    train_loss = 0.0
    total_correct = 0
    
    for data, target in train_loader:
        # implement your code here
        data = data.to(device)
        target = target.to(device)
        predicts = model_1(data)
        #print(predicts, target)
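        l = loss_1(predicts, target)  # CrossEntropyLoss already averages over the batch, so l is a scalar
       
        optimizer_1.zero_grad()
        l.backward()
        optimizer_1.step()
        
        train_loss += l.item() # the mean loss of this batch; .item() returns a Python float, so no autograd graph is kept
        total_correct += accuracy(predicts, target) # the accumulated number of correctly classified samples of this batch
        
    # print training statistics 
    # calculate average loss and accuracy over an epoch
    # note: each batch loss is already a mean, so this value equals the per-sample loss divided by batch_size
    train_loss = train_loss / len(train_loader.dataset)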
    train_acc = 100. * total_correct / len(train_loader.dataset)
    
    print('Epoch: {} \tTraining Loss: {:.6f} \tTraining Acc: {:.2f}%'.format(
        epoch+1, 
        train_loss,
        train_acc
        ))

Epoch: 1 	Training Loss: 0.017278 	Training Acc: 89.17%
Epoch: 2 	Training Loss: 0.006289 	Training Acc: 96.17%
Epoch: 3 	Training Loss: 0.004551 	Training Acc: 97.29%
Epoch: 4 	Training Loss: 0.003568 	Training Acc: 97.88%
Epoch: 5 	Training Loss: 0.002905 	Training Acc: 98.25%
Epoch: 6 	Training Loss: 0.002445 	Training Acc: 98.51%
Epoch: 7 	Training Loss: 0.002214 	Training Acc: 98.69%
Epoch: 8 	Training Loss: 0.001848 	Training Acc: 98.89%
Epoch: 9 	Training Loss: 0.001562 	Training Acc: 98.96%
Epoch: 10 	Training Loss: 0.001407 	Training Acc: 99.11%
Epoch: 11 	Training Loss: 0.001348 	Training Acc: 99.10%
Epoch: 12 	Training Loss: 0.001347 	Training Acc: 99.07%
Epoch: 13 	Training Loss: 0.001220 	Training Acc: 99.20%
Epoch: 14 	Training Loss: 0.000990 	Training Acc: 99.32%
Epoch: 15 	Training Loss: 0.001075 	Training Acc: 99.27%
Epoch: 16 	Training Loss: 0.000827 	Training Acc: 99.45%
Epoch: 17 	Training Loss: 0.000980 	Training Acc: 99.34%
Epoch: 18 	Training Loss: 0.000762 	Trai

### Train model_2

In [18]:
# number of epochs to train the model
n_epochs = 30  # suggest training between 20-50 epochs

model_2.train() # prep model for training

for epoch in range(n_epochs):
    # monitor training loss
    train_loss = 0.0
    total_correct = 0
    
    for data, target in train_loader:
        # implement your code here
        data, target = data.to(device), target.to(device)
        predicts = model_2(data)
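        l = loss_2(predicts, target)  # CrossEntropyLoss already averages over the batch, so l is a scalar
       
        optimizer_2.zero_grad()
        l.backward()
        optimizer_2.step()
        
        train_loss += l.item() # the mean loss of this batch; .item() returns a Python float, so no autograd graph is kept
        total_correct += accuracy(predicts, target) # the accumulated number of correctly classified samples of this batch
        
    # print training statistics 
    # calculate average loss and accuracy over an epoch
    # note: each batch loss is already a mean, so this value equals the per-sample loss divided by batch_size
    train_loss = train_loss / len(train_loader.dataset)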
    train_acc = 100. * total_correct / len(train_loader.dataset)
    
    print('Epoch: {} \tTraining Loss: {:.6f} \tTraining Acc: {:.2f}%'.format(
        epoch+1, 
        train_loss,
        train_acc
        ))

Epoch: 1 	Training Loss: 0.024614 	Training Acc: 84.87%
Epoch: 2 	Training Loss: 0.012468 	Training Acc: 92.77%
Epoch: 3 	Training Loss: 0.010735 	Training Acc: 93.72%
Epoch: 4 	Training Loss: 0.010144 	Training Acc: 94.14%
Epoch: 5 	Training Loss: 0.009851 	Training Acc: 94.25%
Epoch: 6 	Training Loss: 0.009311 	Training Acc: 94.57%
Epoch: 7 	Training Loss: 0.008983 	Training Acc: 94.82%
Epoch: 8 	Training Loss: 0.009018 	Training Acc: 94.73%
Epoch: 9 	Training Loss: 0.008721 	Training Acc: 94.90%
Epoch: 10 	Training Loss: 0.008652 	Training Acc: 94.90%
Epoch: 11 	Training Loss: 0.008468 	Training Acc: 95.10%
Epoch: 12 	Training Loss: 0.008502 	Training Acc: 94.97%
Epoch: 13 	Training Loss: 0.008481 	Training Acc: 95.05%
Epoch: 14 	Training Loss: 0.008370 	Training Acc: 95.03%
Epoch: 15 	Training Loss: 0.008474 	Training Acc: 95.10%
Epoch: 16 	Training Loss: 0.008266 	Training Acc: 95.11%
Epoch: 17 	Training Loss: 0.008420 	Training Acc: 95.10%
Epoch: 18 	Training Loss: 0.008320 	Trai

---
## Test the Trained Network (20 marks)

Test the performance of the trained models on the test data. In addition to the overall test accuracy, you should calculate the accuracy for each class.
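
The per-class bookkeeping can be sketched in plain Python as follows. The helper `per_class_accuracy` is hypothetical, uses lists instead of tensors, and runs on made-up labels with three classes to keep the example short.

```python
def per_class_accuracy(predictions, targets, num_classes=10):
    """Return (class_correct, class_total), two lists indexed by class label."""
    class_correct = [0] * num_classes
    class_total = [0] * num_classes
    for pred, label in zip(predictions, targets):
        class_total[label] += 1          # count every sample under its true class
        if pred == label:
            class_correct[label] += 1    # count it as correct only if the prediction matches
    return class_correct, class_total

preds = [0, 1, 1, 2, 2, 2]     # made-up predicted labels
targets = [0, 1, 2, 2, 2, 0]   # made-up true labels
correct, total = per_class_accuracy(preds, targets, num_classes=3)
overall = sum(correct) / sum(total)
```

The key point is that both counters are indexed by the true label and accumulated across all batches, rather than overwritten per batch.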

### Test model_1

In [22]:
# initialize lists to monitor test loss and accuracy
test_loss = 0.0
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))

model_1.eval() # prep model for *evaluation*

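with torch.no_grad():  # gradients are not needed for evaluation
    for data, target in test_loader:
        # implement your code here
        data, target = data.to(device), target.to(device)
        preds = model_1(data)
        # loss_1 returns the batch mean, so scale by the batch size to get the total loss of this batch
        test_loss += loss_1(preds, target).item() * data.size(0)

        # compare the predicted class of each sample with its true label
        correct = preds.argmax(dim=1) == target
        for label in range(10):
            mask = target == label  # samples of this batch whose true class is `label`
            class_correct[label] += correct[mask].sum().item() # number of correctly classified samples of this class in this batch
            class_total[label] += mask.sum().item() # total number of samples of this class in this batch
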


# calculate and print avg test loss
test_loss = test_loss / len(test_loader.dataset)
print('Test Loss: {:.6f}\n'.format(test_loss))

for i in range(10):
    if class_total[i] > 0:
        print('Test Accuracy of class %d: %.2f%%' % (i, 100 * class_correct[i] / class_total[i]))
    else:
        print('Test Accuracy of class %d: N/A (no training examples)' % (i))

print('\nTest Accuracy (Overall): %.2f%%' % (100. * np.sum(class_correct) / np.sum(class_total)))

Test Loss: 0.000000

Test Accuracy of class 0: 100.00%
Test Accuracy of class 1: N/A (no training examples)
Test Accuracy of class 2: N/A (no training examples)
Test Accuracy of class 3: N/A (no training examples)
Test Accuracy of class 4: N/A (no training examples)
Test Accuracy of class 5: N/A (no training examples)
Test Accuracy of class 6: N/A (no training examples)
Test Accuracy of class 7: N/A (no training examples)
Test Accuracy of class 8: N/A (no training examples)
Test Accuracy of class 9: N/A (no training examples)

Test Accuracy (Overall): 100.00%


### Test model_2

In [None]:
# initialize lists to monitor test loss and accuracy
test_loss = 0.0
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))

model_2.eval() # prep model for *evaluation*

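with torch.no_grad():  # gradients are not needed for evaluation
    for data, target in test_loader:
        # implement your code here
        data, target = data.to(device), target.to(device)
        preds = model_2(data)
        # loss_2 returns the batch mean, so scale by the batch size to get the total loss of this batch
        test_loss += loss_2(preds, target).item() * data.size(0)

        # compare the predicted class of each sample with its true label
        correct = preds.argmax(dim=1) == target
        for label in range(10):
            mask = target == label  # samples of this batch whose true class is `label`
            class_correct[label] += correct[mask].sum().item() # number of correctly classified samples of this class in this batch
            class_total[label] += mask.sum().item() # total number of samples of this class in this batch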

# calculate and print avg test loss
test_loss = test_loss / len(test_loader.dataset)
print('Test Loss: {:.6f}\n'.format(test_loss))

for i in range(10):
    if class_total[i] > 0:
        print('Test Accuracy of class %d: %.2f%%' % (i, 100 * class_correct[i] / class_total[i]))
    else:
        print('Test Accuracy of class %d: N/A (no training examples)' % (i))

print('\nTest Accuracy (Overall): %.2f%%' % (100. * np.sum(class_correct) / np.sum(class_total)))

---
## Analyze Your Result (20 marks)
Compare the performance of your models by answering the following questions. Both English and Chinese answers are acceptable.
1. Does your vanilla MLP overfit to the training data? (5 marks)

Answer:

2. If yes, how do you observe it? If no, why? (5 marks)

Answer:

3. Does the regularized model help prevent overfitting? (5 marks)

Answer:

4. Compare the overall performance of the two models. (5 marks)

Answer:
