# CS549 Machine Learning
# Assignment 8: Optimization of Deep Neural Networks

**Total points: 15**

In this assignment, you will implement a multiple layer feed-forward neural network for a multi-class classification task.

In [1]:
!pip install torch
!pip install torchvision

Collecting torch
  Obtaining dependency information for torch from https://files.pythonhosted.org/packages/50/9e/acf04ff375b0b49a45511c55d188bcea5c942da2aaf293096676110086d1/torch-2.7.1-cp311-cp311-win_amd64.whl.metadata
  Using cached torch-2.7.1-cp311-cp311-win_amd64.whl.metadata (28 kB)
Using cached torch-2.7.1-cp311-cp311-win_amd64.whl (216.1 MB)
Installing collected packages: torch
Successfully installed torch-2.7.1
Collecting torchvision
  Obtaining dependency information for torchvision from https://files.pythonhosted.org/packages/e5/73/1b009b42fe4a7774ba19c23c26bb0f020d68525c417a348b166f1c56044f/torchvision-0.22.1-cp311-cp311-win_amd64.whl.metadata
  Using cached torchvision-0.22.1-cp311-cp311-win_amd64.whl.metadata (6.1 kB)
Using cached torchvision-0.22.1-cp311-cp311-win_amd64.whl (1.7 MB)
Installing collected packages: torchvision
Successfully installed torchvision-0.22.1


In [2]:
import torch
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)

from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

## Task 1: Build a deep neural network model
**Points: 3**

Implement the `NeuralNetModel1` class. The model takes a $28\times 28$ grey-scale image as input, and pass it through a deep neural network.

The network has 2 hidden layers and 1 output layers, whose sizes are: 512 -> 512 -> 10. That is, the number of output classes is 10. The activation function for each hidden layer is `ReLU`.

The input image is first passed through a `nn.Flatten()` layer so that a 2D tensor becomes 1D.

In [3]:
class NeuralNetModel1(nn.Module):
    def __init__(self):
        super(NeuralNetModel1, self).__init__()
        ### START YOUR CODE ###
        self.flatten = nn.Flatten() # Use nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512), # Input size is 28*28
            nn.ReLU(), # ReLU
            nn.Linear(512, 512), # 512 -> 512
            nn.ReLU(), # ReLU
            nn.Linear(512, 10), # 512 -> 10
        )
        ### END YOUR CODE ###

    def forward(self, x):
        ### START YOUR CODE ###
        x = self.flatten(x) # Call self.flatten()
        logits = self.linear_relu_stack(x) # Call self.linear_relu_stack()
        ### END YOUR CODE ###

        return logits

In [4]:
# Do not change the test code here
sample_input = torch.randn(5, 28, 28)
print('input size:', sample_input.size())

model1 = NeuralNetModel1()
with torch.no_grad():
    output = model1(sample_input)
print('output size:', output.size())

input size: torch.Size([5, 28, 28])
output size: torch.Size([5, 10])


**Expected output**:

input size: torch.Size([5, 28, 28])\
output size: torch.Size([5, 10])

---

## Task 2: Use dataloader
**Points: 1**

Download the FashionMNIST dataset provided by PyTorch to the folder "data", which takes some time for the first time execution.
Use the `DataLoader` module to wrap the loaded training and test data. Specify the `batch_size` correctly for both training and test dataloader.

See <https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader> for more information.

In [5]:
training_data = datasets.FashionMNIST(
    root="data",
    train=True, # True
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False, # False
    download=True,
    transform=ToTensor()
)

batch_size = 64

### START YOUR CODE ###
train_loader = DataLoader(training_data, batch_size=batch_size) # Specify data source and batch size correctly
test_loader = DataLoader(test_data, batch_size=batch_size)
### END YOUR CODE ###

100%|██████████| 26.4M/26.4M [00:04<00:00, 6.59MB/s]
100%|██████████| 29.5k/29.5k [00:00<00:00, 168kB/s]
100%|██████████| 4.42M/4.42M [00:01<00:00, 2.76MB/s]
100%|██████████| 5.15k/5.15k [00:00<?, ?B/s]


In [6]:
# Do not change the test code here
print('Training data size:', len(training_data))
print('Testing data size:', len(test_data))

count = 0
for batch in train_loader:
    X, y = batch
    print('X size:', X.size())
    print('y size:', y.size())
    count += 1
    if count > 0:
        break

Training data size: 60000
Testing data size: 10000
X size: torch.Size([64, 1, 28, 28])
y size: torch.Size([64])


**Expected output**:

Training data size: 60000\
Testing data size: 10000\
X size: torch.Size([64, 1, 28, 28])\
y size: torch.Size([64])

## Task 3: Define loss and optimizer
**Points: 1**

Use `nn.CrossEntropyLoss()` as the loss function, and use `torch.optim.SGD()` as the optimizer. Specify the arguments for `SGD()`, including the learning rate correctly.

See <https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html> and <https://pytorch.org/docs/stable/optim.html> for more information.

In [7]:
learning_rate = 1e-3

### START YOUR CODE ###
loss_fn = nn.CrossEntropyLoss()
optimizer_sgd = torch.optim.SGD(model1.parameters(), lr=learning_rate)
### END YOUR CODE ###

In [8]:
# Do not change the test code here
print(loss_fn)
print(type(optimizer_sgd))

CrossEntropyLoss()
<class 'torch.optim.sgd.SGD'>


**Expected output**:

CrossEntropyLoss()
<class 'torch.optim.sgd.SGD'>

---

## Task 4: Implement train and test functions
**Points: 6**

Implement the code for training the model in `train()`. Implement the code for testing the model in `test()`. For the backpropagation step, you need to first zero out all gradients by calling `optimizer.zero_grad()` before carrying out `backward()` and `step()` to update parameters.

In `test()`, you need to calculate the number of correct prediction in the current batch, and add it to the `correct` variable.
Finally, you need to divide `correct` by the total number of test examples to obtain the test accuracy.

In [9]:
def train_loop(dataloader, model, loss_fn, optimizer, verbose=True):
    for i, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        ### START YOUR CODE ###
        pred = model(X) # Get the prediction output from model
        loss = loss_fn(pred, y) # compute loss by calling loss_fn()
        ### END YOUR CODE ###

        # Backpropagation
        ### START YOUR CODE ###
        optimizer.zero_grad() # zero_grad()
        loss.backward() # backward()
        optimizer.step() # step()
        ### END YOUR CODE ###

        if verbose and i % 100 == 0:
            loss = loss.item()
            current_step = i * len(X)
            print(f"loss: {loss:>7f}  [{current_step:>5d}/{len(dataloader.dataset):>5d}]")

In [10]:
@torch.no_grad()
def test_loop(dataloader, model, loss_fn):
    test_loss, correct = 0, 0

    for X, y in dataloader:
        ### START YOUR CODE ###
        pred = model(X) # Similar to how it is computed in train()
        loss = loss_fn(pred, y)
        test_loss += loss.item()
        correct += (pred.argmax(dim=1) == y).sum().item() # Add the number of correct prediction in the current batch to `correct`
        ### END YOUR CODE ###

    test_loss /= len(dataloader)
    ### START YOUR CODE ###
    test_acc = correct / len(dataloader.dataset) # Use `correct` to compute accuracy
    ### END YOUR CODE ###

    print(f"Test Error: \n Accuracy: {(100*test_acc):>0.1f}%, Avg loss: {test_loss:>8f} \n")

Next, execute the following cell to start the training and testing loop. Make sure that the cell containing the loss function and optimizers has already been executed.

In [11]:
model1 = NeuralNetModel1() # Reset the model
### START YOUR CODE ###
optimizer_sgd = torch.optim.SGD(model1.parameters(), lr=learning_rate) # Because the model1 is reset, the optimizer also needs redefined.
### END YOUR CODE ###

epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    ### START YOUR CODE ###
    train_loop(train_loader, model1, loss_fn, optimizer_sgd, verbose=True) # Use verbose=False, if you want to see less information
    test_loop(test_loader, model1, loss_fn)
    ### END YOUR CODE ###

print("Done!")

Epoch 1
-------------------------------
loss: 2.296701  [    0/60000]
loss: 2.284246  [ 6400/60000]
loss: 2.266165  [12800/60000]
loss: 2.270359  [19200/60000]
loss: 2.252516  [25600/60000]
loss: 2.231632  [32000/60000]
loss: 2.239763  [38400/60000]
loss: 2.206202  [44800/60000]
loss: 2.202750  [51200/60000]
loss: 2.174662  [57600/60000]
Test Error: 
 Accuracy: 40.0%, Avg loss: 2.171465 

Epoch 2
-------------------------------
loss: 2.178695  [    0/60000]
loss: 2.166121  [ 6400/60000]
loss: 2.112746  [12800/60000]
loss: 2.130644  [19200/60000]
loss: 2.084809  [25600/60000]
loss: 2.037293  [32000/60000]
loss: 2.054000  [38400/60000]
loss: 1.979529  [44800/60000]
loss: 1.982393  [51200/60000]
loss: 1.910792  [57600/60000]
Test Error: 
 Accuracy: 58.6%, Avg loss: 1.913630 

Epoch 3
-------------------------------
loss: 1.944621  [    0/60000]
loss: 1.911706  [ 6400/60000]
loss: 1.797971  [12800/60000]
loss: 1.834720  [19200/60000]
loss: 1.737230  [25600/60000]
loss: 1.692371  [32000/600

**Expected output**

The test accuracy from the last epoch should be above 70%.

---

Next, train an ADAM optimizer. Note that the model needs be reset.

In [12]:
model1 = NeuralNetModel1() # Reset the model

### START YOUR CODE ###
optimizer_adam = torch.optim.Adam(model1.parameters(), lr=learning_rate)
### END YOUR CODE ###

epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    ### START YOUR CODE ###
    train_loop(train_loader, model1, loss_fn, optimizer_adam, verbose=True) # Use verbose=False, if you want to see less information
    test_loop(test_loader, model1, loss_fn)
    ### END YOUR CODE ###

print("Done!")

Epoch 1
-------------------------------
loss: 2.299667  [    0/60000]
loss: 0.565681  [ 6400/60000]
loss: 0.392925  [12800/60000]
loss: 0.491291  [19200/60000]
loss: 0.450433  [25600/60000]
loss: 0.447974  [32000/60000]
loss: 0.378937  [38400/60000]
loss: 0.541005  [44800/60000]
loss: 0.462291  [51200/60000]
loss: 0.490549  [57600/60000]
Test Error: 
 Accuracy: 85.0%, Avg loss: 0.414252 

Epoch 2
-------------------------------
loss: 0.258051  [    0/60000]
loss: 0.344899  [ 6400/60000]
loss: 0.303737  [12800/60000]
loss: 0.394993  [19200/60000]
loss: 0.418085  [25600/60000]
loss: 0.374468  [32000/60000]
loss: 0.320626  [38400/60000]
loss: 0.510361  [44800/60000]
loss: 0.378971  [51200/60000]
loss: 0.416497  [57600/60000]
Test Error: 
 Accuracy: 85.4%, Avg loss: 0.394885 

Epoch 3
-------------------------------
loss: 0.222765  [    0/60000]
loss: 0.343310  [ 6400/60000]
loss: 0.238116  [12800/60000]
loss: 0.317114  [19200/60000]
loss: 0.369154  [25600/60000]
loss: 0.345552  [32000/600

**Expected output**:

You can find that the training converges much faster using ADAM.

---

## Task 5: Add batchnorm and dropout
**Points: 4**

Use `torch.nn.BatchNorm1d()` and `nn.Dropout()` after the ReLU activation of each hidden layer. `Batchnorm1d()` takes the size of previous activation as input. `Dropout()` takes the probability of dropout as input.

For more information, see <https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html> and <https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html>.

In [13]:
class NeuralNetModel2(nn.Module):
    def __init__(self, dropout = 0.1): # Note the additional dropout parameter here
        """
        :param dropout: float, the probability of dropout
        """
        super(NeuralNetModel2, self).__init__()
        ### START YOUR CODE ###
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(), # ReLU
            nn.BatchNorm1d(512), # Batchnorm
            nn.Dropout(p=dropout), # Dropout, use the `dropout` parameter

            nn.Linear(512, 512),
            nn.ReLU(), # ReLU
            nn.BatchNorm1d(512), # Batchnorm
            nn.Dropout(p=dropout), # Dropout, use the `dropout` parameter

            nn.Linear(512, 10),
        )
        ### END YOUR CODE ###

    def forward(self, x):
        ### START YOUR CODE ###
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        ### END YOUR CODE ###

        return logits

In the following cell, test with different `dropout` rates, and observe how that affects the test accuracy.

In [14]:
### START YOUR CODE ###
model2 = NeuralNetModel2(dropout=0.1) # Call NeuralNetModel2() with the dropout value you want to try
optimizer = torch.optim.Adam(model2.parameters(), lr=learning_rate) # You may try Adam/SGD optimizer
### END YOUR CODE ###

epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    ### START YOUR CODE ###
    train_loop(train_loader, model2, loss_fn, optimizer, verbose=True) # Use verbose=False, if you want to see less information
    test_loop(test_loader, model2, loss_fn)
    ### END YOUR CODE ###

print("Done!")

Epoch 1
-------------------------------
loss: 2.612558  [    0/60000]
loss: 0.503965  [ 6400/60000]
loss: 0.426678  [12800/60000]
loss: 0.491898  [19200/60000]
loss: 0.582632  [25600/60000]
loss: 0.432575  [32000/60000]
loss: 0.317510  [38400/60000]
loss: 0.605072  [44800/60000]
loss: 0.488901  [51200/60000]
loss: 0.490692  [57600/60000]
Test Error: 
 Accuracy: 83.6%, Avg loss: 0.448692 

Epoch 2
-------------------------------
loss: 0.353831  [    0/60000]
loss: 0.382516  [ 6400/60000]
loss: 0.316304  [12800/60000]
loss: 0.397499  [19200/60000]
loss: 0.414132  [25600/60000]
loss: 0.354182  [32000/60000]
loss: 0.312440  [38400/60000]
loss: 0.538872  [44800/60000]
loss: 0.420980  [51200/60000]
loss: 0.478935  [57600/60000]
Test Error: 
 Accuracy: 84.5%, Avg loss: 0.425928 

Epoch 3
-------------------------------
loss: 0.350996  [    0/60000]
loss: 0.356971  [ 6400/60000]
loss: 0.323866  [12800/60000]
loss: 0.348539  [19200/60000]
loss: 0.455284  [25600/60000]
loss: 0.360837  [32000/600

**Expected output**

In theory, you should see that the larger dropout rate you use, the lower test accuracy you will get, at the same epoch number.

But the model trained with some dropout rate should generalize better to new data.