1. [10 points] Consider a dense neural network classifier with 2D input, 1 hidden layer with 3 neurons with ReLU activation, and 3D output with softmax. Generate random numbers for the weights and compute the output for $(4,5)$. Then, compute the gradient with respect to the weights using backpropagation for an MSE loss. (Paper and pencil question)
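
A hand calculation for question 1 can be cross-checked numerically. The sketch below is only one possible setup, not the graded solution: it assumes bias terms are included, a one-hot target for class 0, an MSE averaged over the three outputs, and an arbitrary seed so the numbers are repeatable. It uses the softmax Jacobian $\partial y_k / \partial z_m = y_k(\delta_{km} - y_m)$ in the backward pass.

```python
import math
import random

random.seed(0)  # assumed seed, only so the run is repeatable

x = [4.0, 5.0]        # input from the question
t = [1.0, 0.0, 0.0]   # assumed one-hot target (class 0)

# Random parameters: W1 is 3x2 (hidden x input), W2 is 3x3 (output x hidden)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b1 = [random.uniform(-1, 1) for _ in range(3)]
W2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
b2 = [random.uniform(-1, 1) for _ in range(3)]

def forward(W1, b1, W2, b2):
    """Return hidden pre-activations, hidden activations, softmax output, MSE loss."""
    z1 = [sum(W1[j][i] * x[i] for i in range(2)) + b1[j] for j in range(3)]
    h = [max(0.0, z) for z in z1]                          # ReLU
    z2 = [sum(W2[k][j] * h[j] for j in range(3)) + b2[k] for k in range(3)]
    m = max(z2)                                            # subtract max for stability
    e = [math.exp(z - m) for z in z2]
    y = [v / sum(e) for v in e]                            # softmax
    loss = sum((y[k] - t[k]) ** 2 for k in range(3)) / 3   # MSE (mean over outputs)
    return z1, h, y, loss

z1, h, y, loss = forward(W1, b1, W2, b2)

# Backward pass (chain rule, layer by layer)
dL_dy = [2 * (y[k] - t[k]) / 3 for k in range(3)]
s = sum(dL_dy[k] * y[k] for k in range(3))
dz2 = [y[m] * (dL_dy[m] - s) for m in range(3)]            # through softmax
dW2 = [[dz2[k] * h[j] for j in range(3)] for k in range(3)]
db2 = dz2[:]
dh = [sum(W2[k][j] * dz2[k] for k in range(3)) for j in range(3)]
dz1 = [dh[j] if z1[j] > 0 else 0.0 for j in range(3)]      # through ReLU
dW1 = [[dz1[j] * x[i] for i in range(2)] for j in range(3)]
db1 = dz1[:]

print("output:", y)
print("loss:", loss)
print("dL/dW2:", dW2)
print("dL/dW1:", dW1)
```

Comparing any single entry of `dW1` or `dW2` against a finite-difference estimate of the loss is a quick way to catch sign or indexing mistakes in the paper derivation.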

2. [30 points] Use a feedforward NN to classify the CIFAR-10 dataset, and tune its hyperparameters as best you can. You must use PyTorch. Requirements below.

In [2]:
import torch.nn as nn
import torch
import torch.optim as optim
from torch.utils.data import DataLoader, random_split, Subset
from torchsummary import summary
from torchvision.datasets import CIFAR10
from torchvision import transforms
import torch.nn.init as init

import matplotlib.pyplot as plt

# Set random seed for reproducibility
random_state = 1
torch.manual_seed(random_state)

# CIFAR-10 dataset preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))  # Normalization for CIFAR-10
])

# Load CIFAR-10 dataset
dataset = CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = CIFAR10(root='./data', train=False, download=True, transform=transform)

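# Split the 50,000 training images 60/20/20 into train/val/test subsets.
# Note: this overwrites the official CIFAR-10 test set loaded above with a slice of the training data.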
train_size = int(0.6 * len(dataset))
val_size = int(0.2 * len(dataset))
test_size = len(dataset) - train_size - val_size

train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])

# Two hidden layers (256 and 128 units), with dropout after the first
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32*32*3, 256),  # Input size for CIFAR-10 is 3072 (32x32x3)
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)



# Use CrossEntropyLoss for classification
criterion = nn.CrossEntropyLoss()

# Adam optimizer with L2 regularization (weight_decay is the L2 penalty)
optimizer = optim.Adam(model.parameters(), lr=0.003, weight_decay=1e-4)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

summary(model, input_size=(3072,))

Files already downloaded and verified
Files already downloaded and verified
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
           Flatten-1                 [-1, 3072]               0
            Linear-2                  [-1, 256]         786,688
              ReLU-3                  [-1, 256]               0
           Dropout-4                  [-1, 256]               0
            Linear-5                  [-1, 128]          32,896
              ReLU-6                  [-1, 128]               0
            Linear-7                   [-1, 10]           1,290
Total params: 820,874
Trainable params: 820,874
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 0.03
Params size (MB): 3.13
Estimated Total Size (MB): 3.17
----------------------------------------------------------------
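
The parameter counts in the summary above can be reproduced by hand: a fully connected layer has one weight per input-output pair plus one bias per output. A small sketch (the helper name `linear_params` is made up for illustration):

```python
def linear_params(in_features, out_features):
    # One weight per (input, output) pair, plus one bias per output unit
    return in_features * out_features + out_features

# Layer sizes taken from the nn.Sequential model above
layer_sizes = [(32 * 32 * 3, 256), (256, 128), (128, 10)]
counts = [linear_params(i, o) for i, o in layer_sizes]
total = sum(counts)

print(counts, total)  # → [786688, 32896, 1290] 820874
```

Almost all of the parameters sit in the first layer, because every one of the 3072 input pixels connects to every hidden unit.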


In [3]:
# Create DataLoader for batching
batch_size = 256
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

def one_hot_encoding(targets, num_classes=10, device='cpu'):
    # Generate one-hot encoding and move it to the same device as targets
    return torch.eye(num_classes, device=device)[targets]

# Train function
def train(model, device, train_loader, optimizer, epoch):
    model.train()
    train_loss = 0
    correct = 0
    total = 0
    
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        
        # Forward pass
        output = model(data)
        
        # One-hot encode the target
        target_one_hot = one_hot_encoding(target, num_classes=10, device=device)  # Pass the correct device
        
        # Calculate cross-entropy loss (criterion is nn.CrossEntropyLoss, given one-hot targets)
        loss = criterion(output, target_one_hot)
        loss.backward()
        optimizer.step()
        
        # Track loss
        train_loss += loss.item()
        
        # Track accuracy
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()
        total += target.size(0)
    
    avg_train_loss = train_loss / len(train_loader)
    train_accuracy = correct / total
    return avg_train_loss, train_accuracy

# Test function
def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            
            # One-hot encode the target
            target_one_hot = one_hot_encoding(target, num_classes=10, device=device)  # Pass the correct device
            
            # Calculate cross-entropy loss (criterion is nn.CrossEntropyLoss, given one-hot targets)
            test_loss += criterion(output, target_one_hot).item()
            
            # Track accuracy
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    
    avg_test_loss = test_loss / len(test_loader)
    test_accuracy = correct / len(test_loader.dataset)
    return avg_test_loss, test_accuracy

# Training loop
num_epochs = 50
train_losses = []
test_losses = []
train_accuracies = []
test_accuracies = []

for epoch in range(num_epochs):
    avg_train_loss, train_accuracy = train(model, device, train_loader, optimizer, epoch)
    avg_test_loss, test_accuracy = test(model, device, test_loader)
    
    train_losses.append(avg_train_loss)
    test_losses.append(avg_test_loss)
    train_accuracies.append(train_accuracy)
    test_accuracies.append(test_accuracy)
    
    print(f'Epoch {epoch}: \tTrain Loss: {avg_train_loss:.4f} \tTest Loss: {avg_test_loss:.4f}'
          + f'\tTrain Accuracy: {train_accuracy:.4f}, \tTest Accuracy: {test_accuracy:.4f}')

# For normalization & standardization
# 0. runtime 28.4s train_accuracy 0.40, test_accuracy 0.39
# 3. The previous attempt showed progress, but slowly, so I raised the learning rate from 0.001 to 0.01. runtime 33.4s train_accuracy 0.3545, test_accuracy 0.3269 Epoch 14
# 4. Reduced the learning rate to 0.005. runtime 32.9s train_accuracy 0.3814, test_accuracy 0.3707 Epoch 19
# 5. Reduced the learning rate to 0.003. runtime 32.9s train_accuracy 0.3964, test_accuracy 0.3885 Epoch 19
# 6. Increased epochs to 30. runtime 49.9s train_accuracy 0.4057, test_accuracy 0.3900 Epoch 29
# 7. Increased epochs to 50. runtime 49.9s train_accuracy 0.4179, test_accuracy 0.3755 Epoch 47
# 16. Added the normalization function correctly (using the code from week 6). Learned that correct normalization is really important. However, the model now overfits heavily. runtime 1m 54.7s train_accuracy 0.7005, test_accuracy 0.4705 Epoch 46


# For weight initialization
# 9. Used He initialization, which is intended for ReLU activations. However, the results did not improve; I attribute this to the architecture not being big enough. runtime 1m 26.2s train_accuracy 0.4914 test_accuracy 0.4392 Epoch 48
# 15. Removed He initialization: the train accuracy improved slightly and the test accuracy rose. runtime 1m 26.1s train_accuracy 0.5634, test_accuracy 0.4867 Epoch 49

# For architectures
# 1. Added a hidden layer with 100 nodes. runtime 32.4s train_accuracy 0.4070, test_accuracy 0.3798 Epoch 17
# 2. Added an extra hidden 50-node layer and a 20-node layer. runtime 33.4s train_accuracy 0.4085, test_accuracy 0.3887 Epoch 19
# 13. Increasing the batch size to 256 reduced the runtime and improved the accuracy. However, this might cause some overfitting. runtime 1m 15.0s train_accuracy 0.5025, test_accuracy 0.4572 Epoch 49
# 14. Used a vanilla network built from existing, proven functions. runtime 1m 24.8s train_accuracy 0.5598, test_accuracy 0.4633 Epoch 48
# 16. Added dropout with rate 0.5 on 2 layers. runtime 1m 55.7s train_accuracy 0.3511, test_accuracy 0.3887 Epoch 49
# 17. Changed the dropout rate to 0.2 on 2 layers. runtime 1m 55.0s train_accuracy 0.5500, test_accuracy 0.4853 Epoch 49
# 18. Changed the dropout rate to 0.1 on 2 layers. runtime 1m 54.2s train_accuracy 0.6350, test_accuracy 0.4979 Epoch 49
# 19. Removed one dropout layer. Train accuracy down, test accuracy up. runtime 1m 55.5s train_accuracy 0.6985, test_accuracy 0.4983 Epoch 49

# For activation functions
# 8. Added 2 ReLU layers. runtime 1m 25.0s train_accuracy 0.5139, test_accuracy 0.4468 Epoch 46


# For the loss function, cross-entropy loss was used initially.
# Stuck with the original function because cross-entropy gave the best test accuracy.

# For regularization, L2 was used initially (1e-4).
# 10. Updated L2 to 1e-3, which looked successful at first, but later the accuracy plateaued at about 0.45. runtime 1m 26.3s train_accuracy 0.4586, test_accuracy 0.4314 Epoch 47
# 11. Updated L2 to 5e-4: accuracy increased a bit, then plateaued at 0.47. runtime 1m 25.3s train_accuracy 0.4717, test_accuracy 0.4224 Epoch 48
# 12. Tried 7e-4, but decided to stick with the original L2 value.

Epoch 0: 	Train Loss: 1.7898 	Test Loss: 1.6404	Train Accuracy: 0.3670, 	Test Accuracy: 0.4125
Epoch 1: 	Train Loss: 1.5670 	Test Loss: 1.5561	Train Accuracy: 0.4473, 	Test Accuracy: 0.4546
Epoch 2: 	Train Loss: 1.4888 	Test Loss: 1.5263	Train Accuracy: 0.4773, 	Test Accuracy: 0.4633
Epoch 3: 	Train Loss: 1.4294 	Test Loss: 1.5157	Train Accuracy: 0.4948, 	Test Accuracy: 0.4688
Epoch 4: 	Train Loss: 1.3875 	Test Loss: 1.5471	Train Accuracy: 0.5096, 	Test Accuracy: 0.4700
Epoch 5: 	Train Loss: 1.3558 	Test Loss: 1.5085	Train Accuracy: 0.5225, 	Test Accuracy: 0.4805
Epoch 6: 	Train Loss: 1.3283 	Test Loss: 1.5035	Train Accuracy: 0.5317, 	Test Accuracy: 0.4778
Epoch 7: 	Train Loss: 1.2931 	Test Loss: 1.5045	Train Accuracy: 0.5407, 	Test Accuracy: 0.4784
Epoch 8: 	Train Loss: 1.2673 	Test Loss: 1.4862	Train Accuracy: 0.5548, 	Test Accuracy: 0.4816
Epoch 9: 	Train Loss: 1.2370 	Test Loss: 1.4941	Train Accuracy: 0.5607, 	Test Accuracy: 0.4893
Epoch 10: 	Train Loss: 1.2141 	Test Loss: 1.5071	T
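
The training loop above one-hot encodes the labels before passing them to the cross-entropy criterion. For a one-hot target, cross-entropy reduces to the negative log-probability of the true class, so passing the integer class index should give the same loss. The sketch below checks that identity for a single sample in plain Python; the logits are made-up values and the helper names are illustrative, not PyTorch API.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax: subtract the max before exponentiating
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def ce_from_index(logits, target):
    # Cross-entropy with an integer class label
    return -log_softmax(logits)[target]

def ce_from_one_hot(logits, one_hot):
    # Cross-entropy with a probability-vector target
    return -sum(p * l for p, l in zip(one_hot, log_softmax(logits)))

logits = [2.0, -1.0, 0.5]   # made-up scores for a 3-class example
target = 2
one_hot = [1.0 if k == target else 0.0 for k in range(len(logits))]

print(ce_from_index(logits, target), ce_from_one_hot(logits, one_hot))
```

Both forms agree, so the one-hot step is optional for this loss; it would only be needed for a loss such as MSE that compares full output vectors.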

3. [40 points] Repeat problem 2 using a CNN and the Imagenette dataset. Run at least 10 training experiments. You are free to use any techniques you choose, but much of the credit is based on your reasoning: your progression requires a rationale for why you're tuning the hyperparameters you choose to tune. Required: Experiment with data augmentation. [GPU computing is recommended for this.]

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 224, 224]           1,792
              ReLU-2         [-1, 64, 224, 224]               0
         MaxPool2d-3         [-1, 64, 112, 112]               0
            Conv2d-4        [-1, 128, 112, 112]          73,856
              ReLU-5        [-1, 128, 112, 112]               0
         MaxPool2d-6          [-1, 128, 56, 56]               0
            Conv2d-7          [-1, 256, 56, 56]         295,168
              ReLU-8          [-1, 256, 56, 56]               0
         MaxPool2d-9          [-1, 256, 28, 28]               0
          Flatten-10               [-1, 200704]               0
           Linear-11                 [-1, 1024]     205,521,920
             ReLU-12                 [-1, 1024]               0
          Dropout-13                 [-1, 1024]               0
           Linear-14                  [

In [8]:
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F
from torchsummary import summary
import torch.optim as optim

# import v2 for data augmentation
from torchvision.transforms import v2

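# Test-time transform: no augmentation, but the same normalization as training
# so that train and test inputs share one distribution
test_transform = transforms.Compose([
    transforms.Resize((224,224)),         # Resize images to 224x224
    transforms.ToTensor(),          # Convert images to PyTorch tensors
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Training transform with data augmentation (random resized crop and horizontal flip)
train_transform = v2.Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),  # PIL image -> float32 tensor scaled to [0, 1]
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])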

# Load the Imagenette dataset from a local copy (ImageFolder does not download it)
train_dataset = datasets.ImageFolder(root='./imagenette2-320/train', transform=train_transform)
test_dataset = datasets.ImageFolder(root='./imagenette2-320/val', transform=test_transform)

# Create data loaders
# Shuffle the training data each epoch; keep the test order fixed for consistent evaluation
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False) 

# image, label = train_dataset[0]
# print(image.shape) torch.Size([3, 224, 224])

# import LeNetReg model shown in class first.
class model(nn.Module):
    def __init__(self, num_classes=10):
        super(model, self).__init__()
        
        # First convolutional block
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)  # Output: (32, 112, 112)
        
        # Second convolutional block
        self.conv2 = nn.Conv2d(32, 128, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(128)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)  # Output: (128, 56, 56)
        
        # Third convolutional block
        self.conv3 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(256)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)  # Output: (256, 28, 28)
        
        # Fourth convolutional block
        self.conv4 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(512)
        self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2)  # Output: (512, 14, 14)
        
        # Fifth convolutional block
        self.conv5 = nn.Conv2d(512, 1024, kernel_size=3, padding=1)
        self.bn5 = nn.BatchNorm2d(1024)
        self.pool5 = nn.MaxPool2d(kernel_size=2, stride=2)  # Output: (1024, 7, 7)
        
        # Fully connected layers
        self.fc1 = nn.Linear(50176, 1024)  # 1024 * 7 * 7 = 50176 flattened features
        self.fc2 = nn.Linear(1024, 512)
        self.fc3 = nn.Linear(512, num_classes)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # First conv block
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.pool1(x)
        
        # Second conv block
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool2(x)
        
        # Third conv block
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool3(x)
        
        # Fourth conv block
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.pool4(x)
        
        # Fifth conv block
        x = F.relu(self.bn5(self.conv5(x)))
        x = self.pool5(x)
        
        # Flatten for fully connected layers
        x = x.view(x.size(0), -1)
        
        # Fully connected layers
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        
        return x
    
# Initialize model, optimizer, and loss function
model = model() 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device = device)

summary(model, input_size=(3,224,224))


----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 32, 224, 224]             896
       BatchNorm2d-2         [-1, 32, 224, 224]              64
         MaxPool2d-3         [-1, 32, 112, 112]               0
            Conv2d-4        [-1, 128, 112, 112]          36,992
       BatchNorm2d-5        [-1, 128, 112, 112]             256
         MaxPool2d-6          [-1, 128, 56, 56]               0
            Conv2d-7          [-1, 256, 56, 56]         295,168
       BatchNorm2d-8          [-1, 256, 56, 56]             512
         MaxPool2d-9          [-1, 256, 28, 28]               0
           Conv2d-10          [-1, 512, 28, 28]       1,180,160
      BatchNorm2d-11          [-1, 512, 28, 28]           1,024
        MaxPool2d-12          [-1, 512, 14, 14]               0
           Conv2d-13         [-1, 1024, 14, 14]       4,719,616
      BatchNorm2d-14         [-1, 1024,
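
The input width of `fc1` (50176) follows from the spatial size left after the five conv/pool blocks. A minimal sketch of that arithmetic, using the standard output-size formula for convolution and pooling layers; the helper name `conv2d_out` is made up for illustration:

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    # Output length along one spatial dimension for a conv or pooling layer
    return (size + 2 * padding - kernel) // stride + 1

size = 224                              # input images are 3 x 224 x 224
channels = [32, 128, 256, 512, 1024]    # output channels of the five conv blocks
shapes = []
for c in channels:
    size = conv2d_out(size, kernel=3, stride=1, padding=1)  # 3x3 conv, padding 1: size unchanged
    size = conv2d_out(size, kernel=2, stride=2)             # 2x2 max pool: size halved
    shapes.append((c, size, size))

flat_features = channels[-1] * size * size
print(shapes[-1], flat_features)  # → (1024, 7, 7) 50176
```

If the input resolution or the number of pooling layers changes, rerunning this calculation gives the new `in_features` value for the first fully connected layer.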

In [85]:


criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Train function
def train(model, device, train_loader, optimizer, epoch):
    model.train()
    train_loss = 0
    correct = 0
    total = 0
    
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        
        # Forward pass
        output = model(data)
        
        # Calculate loss with CrossEntropyLoss
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        # Track loss and accuracy
        train_loss += loss.item()
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()
        total += target.size(0)
    
    avg_loss = train_loss / len(train_loader)
    accuracy = correct / total
    print(f'Train Epoch: {epoch}, Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}')
    return avg_loss, accuracy


# Test function
def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    total = 0

    
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            
            # Calculate loss
            loss = criterion(output, target)
            test_loss += loss.item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
            total += target.size(0)
    
    avg_loss = test_loss / len(test_loader)
    accuracy = correct / total
    print(f'Test Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}')
    return avg_loss, accuracy


# Exp 1: early stopping when validation loss or accuracy stops improving
best_val_loss = float('inf')
best_accuracy = 0
patience = 7  # stop training if no improvement for 'patience' epochs
loss_patience_counter = 0
accuracy_patience_counter = 0
    

# Training and validation loop
num_epochs = 20
train_losses, val_losses = [], []
train_accuracies, val_accuracies = [], []

for epoch in range(num_epochs):
    train_loss, train_accuracy = train(model, device, train_loader, optimizer, epoch)
    val_loss, val_accuracy = test(model, device, test_loader)
    
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    train_accuracies.append(train_accuracy)
    val_accuracies.append(val_accuracy)

    print('\n')
    
    # Exp 1: early stopping when validation loss or accuracy stops improving
    if val_loss < best_val_loss :
        best_val_loss = val_loss
        loss_patience_counter = 0  # Reset the patience counter
    else:
        loss_patience_counter += 1

    if val_accuracy > best_accuracy:
        best_accuracy = val_accuracy
        accuracy_patience_counter = 0  # Reset the patience counter on improvement
    else:
        accuracy_patience_counter += 1

        
    if (loss_patience_counter >= patience) or (accuracy_patience_counter >= patience):
        print(f"Early stopping after {epoch} epochs.")
        break

# Test the model on the test set
test_loss, test_accuracy = test(model, device, test_loader)


Train Epoch: 0, Loss: 3.0404, Accuracy: 0.1737
Test Loss: 2.2739, Accuracy: 0.1480


Train Epoch: 1, Loss: 2.1630, Accuracy: 0.1953
Test Loss: 2.3230, Accuracy: 0.1060


Train Epoch: 2, Loss: 2.1504, Accuracy: 0.2114
Test Loss: 2.2601, Accuracy: 0.1468


Train Epoch: 3, Loss: 2.1458, Accuracy: 0.2157
Test Loss: 2.2681, Accuracy: 0.1763


Train Epoch: 4, Loss: 2.1156, Accuracy: 0.2283
Test Loss: 2.2517, Accuracy: 0.1651


Train Epoch: 5, Loss: 2.0967, Accuracy: 0.2385
Test Loss: 2.2407, Accuracy: 0.1623


Train Epoch: 6, Loss: 2.0913, Accuracy: 0.2375
Test Loss: 2.5557, Accuracy: 0.1022


Early stopping after 6 epochs.
Test Loss: 2.5557, Accuracy: 0.1022
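
The patience logic in the training loop above can be factored into a small helper. The sketch below is a simplified variant that monitors only the validation loss (the loop above also tracks accuracy); the class name `EarlyStopping` and the loss history are made up for illustration.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=7):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        # Reset the counter on improvement, otherwise count a stalled epoch
        if val_loss < self.best:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


stopper = EarlyStopping(patience=3)
history = [2.27, 2.32, 2.26, 2.27, 2.30, 2.29, 2.20]  # made-up validation losses
stopped_at = None
for epoch, val_loss in enumerate(history):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # → 5 2.26
```

Note that the last value in the made-up history (2.20) is never reached: a short patience can stop training just before a later improvement, which is the trade-off behind raising patience to 7 in the experiments.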


### Training Experiment Track
| Exp num | Description (change) | Note | Epoch | Train_accuracy | Test_accuracy | Runtime |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 0. | initial run with data augmentation (flip) <br> cross-entropy loss <br> lr = 0.001 | training accuracy went up, but test accuracy went down. Overfitting is the problem | 20 | 0.5703 | 0.1516 | 34m 3.4s |
| 1. | 1. add early stopping (stop when validation is not improving) to avoid wasted time <br> 2. L2 regularization to reduce overfitting, weight_decay=1e-4 | overfitting is still a problem. <br> ran to epoch 15, and the highest accuracy was at epoch 10. <br> will include dropout and increase the number of layers | 10 | 0.5760 | 0.1299 | 27m 46.0s |
| 2. | added extra layers and added dropout | validation accuracy increased slightly. <br> But more time spent | 9 | 0.4541 | 0.2293 | 60m 49.3s |
| 3. | add fully connected layers and increase dropout rate. <br> [0.2, 0.2, 0.2, 0.4] -> [0.3, 0.3, 0.3, 0.5] <br> fc(90,10) -> fc(90,50,10) | no improvement | 6 | 0.3974 | 0.1768 | 42m 55.8s |
| 4. | roll back dropout rate | no improvement | 5 | 0.3924 | 0.1332 | 36m 52.1s |
| 5. | add one more fc layer <br> increase patience to 7 to see more results <br> fc(90,50,10) -> fc(90,128,64,10) | accuracy went up slightly, but no real improvement | 10 | 0.5907 | 0.1847 | 69m 20.0s |
| 6. | couldn't figure out how to improve the CNN, so implemented the same architecture as in class | the accuracy fell. <br> Revert to previous CNN | 6 | 0.0980 | 0.1068 | 117m 31.3s |
| 7. | correctly implemented dropout. | the accuracy fell. <br> Revert to previous CNN | 6 | 0.0980 | 0.1068 | 117m 31.3s |