# CSCI-5501 – Deep Learning Applications - Assignment 2
## Deadline June 22, 2025
### Total points: 100

## Instructions
* This is an individual assignment
* All your solutions, code, analysis, graphs, and explanations should go in this notebook.
* Please attempt to solve these questions by yourself. You can read the official PyTorch documentation and online resources to build your understanding, but please refrain from using LLMs (e.g., ChatGPT) to directly generate the answers.
* Please execute all cells before generating the PDF and submitting the notebook on Crowdmark. Plots that have not already been generated will receive no points.
* **IMPORTANT:** Read every cell extremely carefully and attempt to understand the code clearly. Make note of any questions and bring to the next tutorial/TA-office-hours for further discussion and clarification.
* Note: This notebook includes results from a completed, correct implementation. Outputs are not guaranteed to match exactly across runs; however, they should give you a sense of whether your implementation is working as expected.
* Use Google Colab to train the models, since CPU-based training is compute-intensive.

## Learning Goals
* Familiarize yourself with PyTorch.
* Implement and analyse Fully-connected networks and convolutional networks.
* Transfer-learning by fine-tuning a pretrained ImageNet classifier on CIFAR10.


In [None]:
import random

import numpy as np

import torch
from torch import nn
import torch.nn.functional as F
from torchvision import transforms, datasets, models
from torch.utils.data import DataLoader, random_split
import torch.optim as optim

import matplotlib.pyplot as plt

import copy
import tqdm.notebook  # ensures tqdm.notebook.tqdm (used by Trainer) is importable

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using CUDA:", torch.cuda.get_device_name(0))  # On Colab, make sure a GPU runtime is selected.
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using Apple MPS (Metal Performance Shaders)")
else:
    device = torch.device("cpu")
    print("Using CPU")

# A. Fully-Connected Networks (10 points)

In the following, we will define and train a Fully-Connected Classifier. Some parts of the function are filled out for you.
Your tasks in this section are as follows:

**A.1. [5 points]** Create the layers of fully-connected network.

**A.2. [5 points]** Fill out the forward method. Remember to flatten the input.
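
As a rough illustration of A.1, the sketch below traces how `in_features` is passed from one `nn.Linear` layer to the next. It is plain Python (no PyTorch), and the helper `linear_layer_shapes` is hypothetical, not part of the assignment code; it assumes `image_size` is given as (C, H, W).

```python
from math import prod

def linear_layer_shapes(image_size, hidden_units, num_classes):
    """Return the (in_features, out_features) pair of every Linear layer.

    Hypothetical helper for illustration only; it mirrors the loop that
    builds the layers but creates no PyTorch modules.
    """
    in_features = prod(image_size)  # size of the flattened image
    shapes = []
    for units in hidden_units:
        shapes.append((in_features, units))
        in_features = units  # the next layer consumes this layer's output
    shapes.append((in_features, num_classes))  # final layer produces the logits
    return shapes

mnist_shapes = linear_layer_shapes((1, 28, 28), [128, 64], 10)
print(mnist_shapes)  # [(784, 128), (128, 64), (64, 10)]
```

The first `in_features` value (784 here) is the reason the forward pass has to flatten each image before the first linear layer.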

In [None]:
class FullyConnectedClassifier(nn.Module):
    """
    A fully-connected neural network for image classification.

    Args:
        image_size (tuple): Size of the input images (C, H, W) or (H, W).
        hidden_units (list): List of integers specifying the number of hidden units in each layer.
        activation_fn (nn.Module): Activation function to use between layers (e.g., nn.ReLU()).
        num_classes (int): Number of output classes.
        device (str): Device to use for training and evaluation (e.g., 'cpu' or 'cuda').
    """

    def __init__(self, image_size, hidden_units=[128, 64], activation_fn=nn.ReLU(), num_classes=10, device='cpu'):
        super(FullyConnectedClassifier, self).__init__()

        ##########################################################################
        ########################          A.1.           #########################
        ######################## COMPLETE THIS FUNCTION. #########################
        ##########################################################################


        self.input_size = image_size[0] * image_size[1] * (image_size[2] if len(image_size) == 3 else 1)
        layers = []
        in_features = self.input_size
        for units in hidden_units:
            """
            Write code here.
            """
            layers.append(nn.Linear(in_features, units))
            layers.append(activation_fn)
            in_features = units

        layers.append(nn.Linear(in_features, num_classes))

        # Convert the list to ModuleList to register the parameters.
        self.layers = nn.ModuleList(layers)

        # Store the activation function.
        self.activation_fn = activation_fn

        # To enable GPU-usage.
        self.device = device
        self.to(device)

    def forward(self, x):
        """
        Apply the classifier.
        # Hint: Every linear layer is followed by non-linearity except the final one
        Input: x
        Returns: logits.
        """
        ##########################################################################
        ########################          A.2.           #########################
        ######################## FILL OUT THIS FUNCTION. #########################
        ##########################################################################
                 
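        # Flatten each image to a vector: (B, C, H, W) -> (B, C*H*W).
        x = torch.flatten(x, start_dim=1)

        # self.layers already interleaves Linear layers and activations,
        # with no activation after the final Linear layer.
        for layer in self.layers:
            x = layer(x)

        return x  # Return the raw logits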



# B. Convolutional Networks (20 points)

In the following, we will define and train a Convolutional Classifier Network. Some parts of the function are filled out for you.
Your tasks in this section are as follows:

**B.1. [5 points]** The below table shows the sequence of transformations to be applied on an input image. **Fill out the shape after each transformation labelled with ??**

**B.2. [10 points]** Fill out the constructor to initialize the network.

**B.3. [5 points]** Fill out the forward method to apply the convolutional-network to the inputs.
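
The shapes asked for in B.1 can be checked with the usual output-size formula for convolution and pooling, `out = floor((in + 2*padding - kernel) / stride) + 1`. The sketch below is plain Python; the helper name `out_size` is hypothetical, and it assumes each 3x3 convolution uses stride 1 and padding 1 (the table itself does not state the padding). BatchNorm and activation layers leave the shape unchanged, so they are omitted.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (hypothetical helper)."""
    return (size + 2 * padding - kernel) // stride + 1

size, channels = 32, 3  # input image: 32x32 with 3 channels
trace = [(size, size, channels)]
for filters in (12, 24, 48):
    size = out_size(size, kernel=3, stride=1, padding=1)  # 3x3 conv keeps H and W
    channels = filters
    trace.append((size, size, channels))
    size = out_size(size, kernel=2, stride=2)  # 2x2 max-pool halves H and W
    trace.append((size, size, channels))

flattened = size * size * channels
print(trace)
print(flattened)  # 768
```

Without padding, each 3x3 convolution would shrink the height and width by 2, and the flattened size would no longer be 768.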


In [None]:
#############################################################################
########################          B.1.           ############################
######################## COMPUTE THE OUTPUT SHAPES. #########################
#############################################################################

architecture = [
    {
        "Layer Type": "Input",
        "Description": "Input image",
        "Output Shape (HxWxC)": "(32, 32, 3)",
    },
    {
        "Layer Type": "Conv2D",
        "Description": "12 filters, 3x3 kernel",
        "Output Shape (HxWxC)": "(32, 32, 12)",
    },
    {
        "Layer Type": "BatchNorm",
        "Description": "Normalize 12 channels",
        "Output Shape (HxWxC)": "(32, 32, 12)",
    },
    {
        "Layer Type": "Activation",
        "Description": "ReLU/LeakyReLU/etc",
        "Output Shape (HxWxC)": "(32, 32, 12)",
    },
    {
        "Layer Type": "MaxPooling",
        "Description": "2x2 pool, stride 2",
        "Output Shape (HxWxC)": "(16, 16, 12)",
    },
    {
        "Layer Type": "Conv2D",
        "Description": "24 filters, 3x3 kernel",
        "Output Shape (HxWxC)": "(16, 16, 24)",
    },
    {
        "Layer Type": "BatchNorm",
        "Description": "Normalize 24 channels",
        "Output Shape (HxWxC)": "(16, 16, 24)",
    },
    {
        "Layer Type": "Activation",
        "Description": "ReLU/LeakyReLU/etc",
        "Output Shape (HxWxC)": "(16, 16, 24)",
    },
    {
        "Layer Type": "MaxPooling",
        "Description": "2x2 pool, stride 2",
        "Output Shape (HxWxC)": "(8, 8, 24)",
    },
    {
        "Layer Type": "Conv2D",
        "Description": "48 filters, 3x3 kernel",
        "Output Shape (HxWxC)": "(8, 8, 48)",
    },
    {
        "Layer Type": "BatchNorm",
        "Description": "Normalize 48 channels",
        "Output Shape (HxWxC)": "(8, 8, 48)",
    },
    {
        "Layer Type": "Activation",
        "Description": "ReLU/LeakyReLU/etc",
        "Output Shape (HxWxC)": "(8, 8, 48)",
    },
    {
        "Layer Type": "MaxPooling",
        "Description": "2x2 pool, stride 2",
        "Output Shape (HxWxC)": "(4, 4, 48)",
    },
    {
        "Layer Type": "Flatten",
        "Description": "Flatten",
        "Output Shape (HxWxC)": "768",
    },
    {
        "Layer Type": "Fully Connected",
        "Description": "64 units",
        "Output Shape (HxWxC)": "64",
    },
    {
        "Layer Type": "Activation",
        "Description": "ReLU/LeakyReLU/etc",
        "Output Shape (HxWxC)": "64",
    },
    {
        "Layer Type": "Fully Connected",
        "Description": "Output layer, 10 classes",
        "Output Shape (HxWxC)": "10",
    }
]



# from google.colab import data_table # Turn on if error on Google Colab
import pandas as pd
df = pd.DataFrame(architecture)
df[["Layer Type","Description","Output Shape (HxWxC)"]]


In [None]:
class ConvNet(nn.Module):
    """
    A simple convolutional neural network for image classification.

    Works with input images of size (32, 32, 3).
    """

    def __init__(self, num_classes=10, activation_fn=nn.ReLU, device='cpu'):
        super(ConvNet, self).__init__()
        ##########################################################################
        ########################          B.2.           #########################
        ######################## COMPLETE THIS FUNCTION. #########################
        ##########################################################################

    
        self.classifier = nn.Sequential(
            # Conv Block 1
            nn.Conv2d(3, 12, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(12),
            activation_fn(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # Conv Block 2
            nn.Conv2d(12, 24, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(24),
            activation_fn(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (8, 8, 24)

            # Conv Block 3
            nn.Conv2d(24, 48, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(48),
            activation_fn(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # (4, 4, 48)

            # Classifier head
            nn.Flatten(),  # (4, 4, 48) -> 768

            nn.Linear(768, 64),
            activation_fn(),

            nn.Linear(64, num_classes)
            )



        # To enable GPU-usage.
        self.device = device
        self.to(device)

    def forward(self, x):
        """
        Apply the classifier.

        Input: x
        Returns: logits.
        """
        ##########################################################################
        ########################          B.3.           #########################
        ######################## FILL OUT THIS FUNCTION. #########################
        ##########################################################################
        
        return self.classifier(x)
     
        """
        Write code here.
        """

# C. Define Trainer-class to train a classifier model. (20 Points)

**C.1 [5 points]** Fill out the code required for **Training Loop**  

**C.2 [5 points]** Fill out the code for the function **def accuracy(self, data_loader)** 

**C.3 [10 points]** Fill out the code required for **Early Stopping**  
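
The patience logic for C.3 can be sketched independently of PyTorch by replaying it on a list of recorded validation losses. The helper `early_stopping_trace` and the loss values below are hypothetical and only meant to show how the counter behaves.

```python
def early_stopping_trace(val_losses, patience):
    """Return (best_epoch, stop_epoch) as 0-based indices.

    stop_epoch is None if training would run through every epoch.
    Hypothetical helper: it replays the patience logic on recorded losses.
    """
    best_loss = float("inf")
    best_epoch = None
    patience_counter = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            patience_counter = 0  # improvement: reset the counter
        else:
            patience_counter += 1  # no improvement this epoch
            if patience_counter >= patience:
                return best_epoch, epoch  # stop here
    return best_epoch, None

losses = [1.00, 0.80, 0.90, 0.85, 0.81, 0.70]  # made-up validation losses
print(early_stopping_trace(losses, patience=3))             # (1, 4)
print(early_stopping_trace(losses, patience=float("inf")))  # (5, None)
```

With `patience=3` training stops at epoch index 4 and never sees the later improvement at index 5, which is the trade-off a small patience value makes. With an infinite patience the stopping condition is never met.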


In [None]:
class Trainer:
    """
    Trainer class to handle the training and evaluation of the model.

    Args:
        model (nn.Module): The model to be trained.
        device (str): Device to use for training and evaluation (e.g., 'cpu' or 'cuda').
    """

    def __init__(self, model, device='cpu'):
        self.model = model
        self.device = device

    def train_model(self, train_loader, val_loader, epochs=10, lr=0.001, weight_decay=0.1, early_stopping_patience=3, verbose=True, plot_graph=True, plot_title=""):
        """
        Train the model.

        Args:
            train_loader (DataLoader): DataLoader for training data.
            val_loader (DataLoader): DataLoader for validation data.
            epochs (int): Number of training epochs.
            lr (float): Learning rate for the optimizer.
            weight_decay (float): Weight decay for the optimizer.
            early_stopping_patience (int): Patience for early stopping based on validation loss.
        """
        optimizer = optim.AdamW(self.model.parameters(), lr=lr, weight_decay=weight_decay)
        criterion = nn.CrossEntropyLoss()

        best_val_loss = float('inf')
        best_model_params = None
        patience_counter = 0

        train_losses = []
        val_losses = []

        for epoch in tqdm.notebook.tqdm(range(epochs)):
            # Training loop
            self.model.train()

            train_loss = 0.0
            for inputs, targets in train_loader:
                inputs, targets = inputs.to(self.device), targets.to(self.device)
                ##########################################################################
                ########################   Complete the code     #########################
                ########################   for training loop.    #########################
                ##########################################################################
                # pass the inputs through the model
                # calculate the loss
                # compute the gradients (backpropagation) and update the model parameters
                # add the loss to the train_loss variable

                optimizer.zero_grad()                     # Reset gradients
                outputs = self.model(inputs)             # Forward pass
                loss = criterion(outputs, targets)       # Compute loss
                loss.backward()                          # Backpropagation
                optimizer.step()                         # Update weights

                train_loss += loss.item()                # Accumulate training loss

            train_loss /= len(train_loader)
            train_losses.append(train_loss)

            """
            For this part you are required to implement a simple early stopping mechanism based on validation loss.

            Objective: The objective is to prevent overfitting by stopping training when the validation loss does not improve for a certain number of epochs (known as patience).

            ### Steps to Implement:

            1. Track Validation Loss:  
            After each training epoch, compute the validation loss using your model.  
            - Implement the function:  
                ```python
                def accuracy(self, data_loader):
                    # Your code to compute validation accuracy or loss
                ```
            - Use this function to evaluate the model on the validation set after each epoch.

            2. Monitor Improvement:  
            - If the current validation loss (`val_loss`) is lower than the best validation loss observed so far (`best_val_loss`), update `best_val_loss` and reset the patience counter (`patience_counter`) to zero.

            3. Count Non-Improving Epochs:
            - If the validation loss does not improve (i.e., `val_loss >= best_val_loss`), increment the patience counter (`patience_counter += 1`).
            - If `patience_counter` reaches or exceeds `early_stopping_patience` (e.g., `early_stopping_patience = 5`), stop training early. This means that the validation loss has not improved for `early_stopping_patience` consecutive epochs.

            REMINDER:
            - You are only required to stop training based on validation loss.
            - Make sure your code prints out when early stopping is triggered
            """
            # Validation loop
            val_acc, val_loss = self.accuracy(val_loader)# you are required to implement this function.
            val_losses.append(val_loss)
            if verbose:
              print(f"Epoch {epoch + 1}/{epochs} - Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f} Val Acc: {val_acc:.2f}")

            ##########################################################################
            ########################   Complete the code  ############################
            ########################   Early Stopping.    ############################
            ##########################################################################
            
            
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                # Update best validation loss and reset patience counter
                best_model_params = copy.deepcopy(self.model.state_dict())
                patience_counter = 0
                if verbose:
                  print(f"Best val-loss updated: {best_val_loss}")
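            else:
                # No improvement: increment the patience counter.
                patience_counter += 1
                if verbose:
                  print(f"Early stopping {patience_counter} of {early_stopping_patience}.")
                # Stop once the validation loss has not improved for `early_stopping_patience` epochs.
                if patience_counter >= early_stopping_patience:
                    print(f"Early stopping triggered at epoch {epoch + 1}.")
                    break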

        final_model_params = copy.deepcopy(self.model.state_dict())

        # Plot losses
        if plot_graph:
          plt.figure(figsize=(10, 5))
          plt.plot(train_losses, label='Train Loss')
          plt.plot(val_losses, label='Validation Loss')
          plt.xlabel('Epochs')
          plt.ylabel('Loss')
          plt.title(plot_title)
          plt.legend()
          plt.show()
        return best_model_params, final_model_params

    def accuracy(self, data_loader):
        """
        Compute accuracy on a given DataLoader.

        Args:
            data_loader: for which the accuracy is to be evaluated.

        Returns:
            accuracy: percentage accuracy.
            loss: CE loss.
        """
        self.model.eval()
        criterion = nn.CrossEntropyLoss(reduction='sum')
        correct = 0
        loss = 0
        total = 0
        with torch.no_grad():
            for inputs, targets in data_loader:
                inputs, targets = inputs.to(self.device), targets.to(self.device)
                ##########################################################################
                ########################   Complete the code     #########################
                ########################   for evaluation loop.    #######################
                ##########################################################################
                #pass the inputs through the model to get raw logit predictions
                #compute the loss using the criterion
                #get the predicted classes by taking the argmax of the raw logit predictions https://docs.pytorch.org/docs/main/generated/torch.argmax.html
                #increase the total number of samples counter (you can use pre built-in functions e.g size or shape https://docs.pytorch.org/docs/stable/size.html)
                #calculate the number of correct predictions (you can use pre built-in functions e.g sum https://docs.pytorch.org/docs/stable/generated/torch.sum.html)
                
                outputs = self.model(inputs)
                loss += criterion(outputs, targets).item()
                predicted = torch.argmax(outputs, dim=1)
                total += targets.size(0)
                correct += (predicted == targets).sum().item()


        return 100 * correct / total, loss/total
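
One reason `accuracy` sums the per-sample losses (`reduction='sum'`) and divides by `total` at the end is that averaging per-batch means can be biased when the last batch is smaller than the others. A small plain-Python sketch with made-up per-sample loss values (not results from this notebook):

```python
# Made-up per-sample losses, split into one batch of 3 and a final batch of 1.
batches = [[1.0, 1.0, 1.0], [4.0]]

# Sum over all samples, then divide by the sample count (what `accuracy` does).
total_loss = sum(sum(batch) for batch in batches)
total_samples = sum(len(batch) for batch in batches)
per_sample_mean = total_loss / total_samples

# Average of per-batch means: the single-sample batch gets too much weight.
mean_of_batch_means = sum(sum(b) / len(b) for b in batches) / len(batches)

print(per_sample_mean)      # 1.75
print(mean_of_batch_means)  # 2.5
```

The two numbers agree only when every batch has the same size.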

# D. Helper code to create models and define datasets (20 points)

#### **All code in this section has been written for you**
 
### D.1 Reading comprehension questions

Read the provided `train_model` function carefully and answer the following questions. You may refer to external documentation, blogs, or tutorials while answering — if you do, please include links.

* These questions are designed to help you clearly understand the rationale behind the code implementation.
* Answer each question as clearly and concisely as possible. You are encouraged to reference official PyTorch documentation or other reliable sources.
---

### Dataset Preparation

1. Why do we apply `RandomHorizontalFlip()` to CIFAR-10 but not to MNIST?  

Answer: CIFAR-10 contains images of objects such as animals and vehicles, which still show the same object when flipped horizontally. Mirroring an MNIST digit (for example a "6" or a "2") can turn it into a different-looking or unreadable digit, so we avoid flipping MNIST images.

2. What is the purpose of `RandomCrop(32, padding=4)` in the CIFAR-10 preprocessing?  

Answer: This operation adds padding around the image and then randomly crops it back to 32×32. It mimics small shifts or movements of objects in the image, which helps the model handle slight variations in position.

3. Why is `RandomRotation(10)` used for MNIST? What kind of variability does it introduce? 

Answer: Handwritten digits may be slightly slanted in real-life writing. RandomRotation introduces minor angle variations to help the model become more tolerant to rotated digits.

4. Explain why MNIST normalization uses `raw_train_dataset.data.float().mean() / 255` whereas CIFAR-10 normalization uses `raw_train_dataset.data.mean(axis=(0, 1, 2)) / 255`.  

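Answer: MNIST has a single grayscale channel, so one mean over all pixels of all images is enough. The CIFAR-10 data array has shape (N, 32, 32, 3); averaging over axes (0, 1, 2) (images, height, width) leaves one mean per color channel (red, green, blue), so each channel is normalized separately. Dividing by 255 rescales the statistics to the [0, 1] range produced by `ToTensor()`.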

5. What is the role of the `val_test_transform` pipeline? Why don't we apply any data augmentation there?

Answer: This transformation is used for validation and test data. We skip augmentations here because we want to measure how well the model performs on clean, unaltered data — not on modified versions.
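
To make the per-channel statistics from question 4 concrete, here is a minimal plain-Python sketch of what `data.mean(axis=(0, 1, 2)) / 255` computes for an array laid out as (N, H, W, C). The helper `per_channel_mean` and the toy pixel values are hypothetical; the notebook itself uses NumPy for this step.

```python
def per_channel_mean(images):
    """Mean of each channel over all images and pixels, scaled to [0, 1].

    `images` is a nested list indexed as [N][H][W][C] with values in 0..255.
    Hypothetical helper mirroring `data.mean(axis=(0, 1, 2)) / 255`.
    """
    num_channels = len(images[0][0][0])
    sums = [0.0] * num_channels
    pixel_count = 0
    for image in images:
        for row in image:
            for pixel in row:
                pixel_count += 1
                for channel, value in enumerate(pixel):
                    sums[channel] += value
    return [channel_sum / pixel_count / 255 for channel_sum in sums]

# Two tiny 1x2 RGB "images" with made-up pixel values.
toy_images = [
    [[[255, 0, 51], [0, 0, 51]]],
    [[[255, 0, 51], [0, 0, 51]]],
]
channel_means = per_channel_mean(toy_images)
print(channel_means)  # one mean per channel, roughly [0.5, 0.0, 0.2]
```

For a grayscale dataset such as MNIST the channel axis has length 1, so the same computation collapses to a single number.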

---

### Dataset Loading and Splitting

6. Why do we download the dataset again with different transformations (`train_dataset` vs `train_dataset_val`)?  

Answer: We use different transformations for training and validation. So we load the dataset twice — once with augmentation for training, and once without for validation.

7. What is the purpose of calling `random_split` twice on `train_dataset` and `train_dataset_val`?  

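Answer: Both calls use the same seed, so they produce the same partition of indices. We keep the training part from the augmented dataset (`train_subset`) and the validation part from the non-augmented dataset (`val_subset`). This keeps the two subsets disjoint while ensuring that validation images are never augmented.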

8. Why do we use `torch.Generator().manual_seed(0)` in `random_split()`? What problem does it prevent?
 
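Answer: Setting the random seed ensures the split is the same every time the code runs, which keeps results consistent and makes debugging or comparing models easier. Here it also guarantees that the two `random_split` calls produce the same partition, so no validation sample leaks into the training subset.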
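
The effect of reusing one seed for both splits can be illustrated with the standard library. This is an analogy only: `seeded_split` is a hypothetical stand-in built on `random.Random`, not PyTorch's `random_split`, and the dataset size of 10 is made up.

```python
import random

def seeded_split(num_samples, val_size, seed):
    """Shuffle indices with a fixed seed and split them (stand-in for random_split)."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return indices[:-val_size], indices[-val_size:]

# Two independent calls with the same seed, mirroring the two random_split calls.
train_idx, _ = seeded_split(10, val_size=3, seed=0)
_, val_idx = seeded_split(10, val_size=3, seed=0)

same_seed_overlap = set(train_idx) & set(val_idx)
print(same_seed_overlap)  # set()
```

Because both calls shuffle identically, the training indices from the first call and the validation indices from the second call never overlap and together cover every sample.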

---

### DataLoader Setup

9. Why is `shuffle=True` used for `train_loader` but not for `val_loader` and `test_loader`?  

Answer: Shuffling the training data gives the model varied batches each time, which improves learning. We don’t shuffle validation or test data because we want repeatable, fair evaluations.

10. What is the impact of batch size on the training process?

Answer: Batch size controls how many samples are processed per optimizer step. Larger batches usually make each epoch faster but use more memory and can generalize slightly worse. Smaller batches give noisier gradient estimates and slower epochs, but often help the model generalize better to new data.
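
One concrete consequence of the batch size is the number of optimizer steps per epoch. The sketch below assumes the CIFAR-10 training subset used in this notebook (50,000 training images minus the 10,000 held out for validation) and a `DataLoader` that keeps the last partial batch; `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

train_size = 50_000 - 10_000  # CIFAR-10 training images minus the validation split
for batch_size in (16, 64, 256):
    print(batch_size, steps_per_epoch(train_size, batch_size))
```

With the notebook's batch size of 64 this gives 625 parameter updates per epoch; quadrupling the batch size cuts that to 157.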

---

### Training and Evaluation

11. Why do we save both `best_model_params` and `final_model_params`?  

Answer: We save the best model (lowest validation loss) to use for evaluation. The final model might have overfitted, so saving both lets us compare them and choose the better one later.

12. What function does `early_stopping_patience` serve?  

Answer: It stops training if the model hasn't improved for a certain number of epochs. This helps avoid wasting time and also prevents the model from overfitting.

13. The model is evaluated both on validation and test datasets. What is the difference between these two evaluations? 

Answer: Validation is used during training to monitor generalization, drive early stopping, and select the best checkpoint and hyperparameters. The test set is only used at the end to estimate how well the model performs on completely unseen data.

14. Why do we reload model parameters before each evaluation (using `model.load_state_dict(params)`)?

Answer: To make sure we’re evaluating the exact version of the model we want — either the best or the final one. It avoids using any partially trained or outdated weights.

---

### Trainer Logic

15. What is the responsibility of the `Trainer` class? Which parts of the training pipeline does it abstract away?

Answer: The Trainer class takes care of training, validation, early stopping, and loss tracking. It simplifies the entire training process so that you don’t have to write the same code again for each new model.


In [None]:
class ModelFactory:
    """
    Model factory class.
    """
    @staticmethod
    def create_model(model_type: str, **kwargs) -> nn.Module:
        """
        Create a model instance based on the specified type.

        Args:
            model_type (str): The type of model to create. Must be one of ["ConvNet", "fully_connected"].
            **kwargs: Additional arguments required for model initialization.

        Returns:
            nn.Module: An instance of the requested model.
        """
        model_type = model_type.lower()
        if model_type == "convnet":
            return ConvNet(**kwargs)
        elif model_type == "fully_connected":
            return FullyConnectedClassifier(**kwargs)
        else:
            raise ValueError(f"Unknown model {model_type}")

In [None]:
def train_model(model=None,
                dataset_name="cifar10",
                epochs=20,
                lr=0.001,
                weight_decay=0.1,
                early_stopping_patience=float('inf'),
                mean_std = None,
                plot_graph=True,
                verbose=True):
    if dataset_name.lower() == "mnist":
        # MNIST preprocessing
        raw_train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
        image_size = (1, 28, 28)
        num_classes = 10
        if mean_std is not None:
          mean, std = mean_std
        else:
          mean = raw_train_dataset.data.float().mean() / 255
          std = raw_train_dataset.data.float().std() / 255

        # Train transformation specific to MNIST
        train_transform = transforms.Compose([
            transforms.RandomRotation(10),
            transforms.ToTensor(),
            transforms.Normalize(mean, std)
        ])
    elif dataset_name.lower() == "cifar10":
        # CIFAR-10 preprocessing

        raw_train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transforms.ToTensor())
        image_size = (3, 32, 32)
        num_classes = 10
        if mean_std is not None:
          mean, std = mean_std
        else:
          mean = raw_train_dataset.data.mean(axis=(0, 1, 2)) / 255
          std = raw_train_dataset.data.std(axis=(0, 1, 2)) / 255

        # Train transformation specific to CIFAR-10
        train_transform = transforms.Compose([
            transforms.RandomHorizontalFlip(),
            transforms.RandomCrop(32, padding=4),
            transforms.ToTensor(),
            transforms.Normalize(mean, std)
        ])
    else:
        raise ValueError(f"Unknown dataset {dataset_name}")

    val_test_transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean, std)
        ])

    # Load datasets
    if dataset_name.lower() == "mnist":
        train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=train_transform)
        train_dataset_val = datasets.MNIST(root='./data', train=True, download=True, transform=val_test_transform)
        test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=val_test_transform)
    else:
        train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
        train_dataset_val = datasets.CIFAR10(root='./data', train=True, download=True, transform=val_test_transform)
        test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=val_test_transform)

    # Split train dataset into training and validation
    val_size = 10000
    train_size = len(train_dataset) - val_size
    train_subset, _ = random_split(train_dataset, [train_size, val_size],generator=torch.Generator().manual_seed(0))
    _, val_subset = random_split(train_dataset_val, [train_size, val_size],generator=torch.Generator().manual_seed(0))

    train_loader = DataLoader(train_subset, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_subset, batch_size=64, shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

    # Initialize the Trainer
    trainer = Trainer(model, device=model.device)

    # Train the model
    best_model_params, final_model_params = trainer.train_model(train_loader,
                                                                val_loader,
                                                                epochs=epochs,
                                                                lr=lr,
                                                                weight_decay=weight_decay,
                                                                early_stopping_patience=early_stopping_patience,
                                                                verbose=verbose,
                                                                plot_graph=plot_graph,
                                                                plot_title=f"{str(type(model).__name__)} : {dataset_name.upper()}"
                                                                )
    results = dict()
    for description, params in  zip(["Best-checkpoint", "Final-checkpoint"],[best_model_params, final_model_params]):
      print(f"Evaluating {description}")
      model.load_state_dict(params)

      val_accuracy, val_loss = trainer.accuracy(val_loader)
      print(f"\tValidation Accuracy: {val_accuracy:.2f}%")
      print(f"\tValidation loss: {val_loss:.4f}")

      print()

      test_accuracy, test_loss = trainer.accuracy(test_loader)
      print(f"\tTest Accuracy: {test_accuracy:.2f}%")
      print(f"\tTest loss: {test_loss:.4f}")

      results[f"{description}:state_dict"] = params
      results[f"{description}:val_loss"] = val_loss
      results[f"{description}:val_acc"] = val_accuracy
      results[f"{description}:test_loss"] = test_loss
      results[f"{description}:test_acc"] = test_accuracy

    return results

# E. Train Models (15 points)

**All code has been written in this section.**

**E.1 [15 points]** Run the following cells and provide an analysis comparing between the architectures and datasets.

**Answer:**

| Model | Dataset  | Best Validation Accuracy | Best Test Accuracy |
| ----- | -------- | ------------------------ | ------------------ |
| FCN   | MNIST    | **98.52%**               | **98.41%**         |
| FCN   | CIFAR-10 | **44.75%**               | **45.55%**         |
| CNN   | CIFAR-10 | **76.94%**               | **77.36%**         |

1. Dataset Complexity Affects Architecture Suitability
MNIST consists of simple, grayscale digit images (28×28). The digits are centered and size-normalized, so a model can do well without explicitly exploiting spatial structure. Because of this, a fully connected model performs exceptionally well, achieving over 98% accuracy without any convolutional operations.

CIFAR-10, on the other hand, is much more complex. It contains color images (32×32 RGB) of real-world objects with varying backgrounds, positions, and textures. Fully connected layers flatten the image, discarding the layout of pixels, which makes them ill-suited for this type of data. The resulting performance drops to around 45%.


2. CNNs Handle Image Structure Better
CNNs preserve the two-dimensional structure of the image and use filters to detect patterns like edges, corners, and textures. This makes them ideal for handling visual features.

On CIFAR-10, the convolutional model greatly outperforms the FCN, achieving 77.36% test accuracy. This confirms the importance of spatial awareness when dealing with more realistic image data.

3. Validation and Loss Comparison

| Model     | Best Val Loss | Best Test Loss |
| --------- | ------------- | -------------- |
| MNIST FCN | 0.0550        | 0.0554         |
| CIFAR FCN | 1.5370        | 1.5091         |
| CIFAR CNN | 0.6685        | 0.6681         |

The FCN reaches very low loss on MNIST (about 0.055), which reflects how easy the dataset is for this architecture.

On CIFAR-10, the FCN's loss stays high (about 1.5), showing that it does not generalize well on this dataset.

The CNN cuts the CIFAR-10 loss to about 0.67, less than half the FCN's value, in line with its higher accuracy.
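
One way to read these loss values is to compare them with chance level. The sketch below assumes the reported loss is a mean cross-entropy with natural logarithm (the loss function is not shown in this section, so this is an assumption). Under that assumption, a uniform guess over 10 classes has loss ln(10), and exp(-loss) is the geometric mean of the probabilities assigned to the correct class.

```python
import math

num_classes = 10

# Loss of a classifier that always predicts the uniform distribution.
chance_loss = math.log(num_classes)

# Best test losses copied from the table above.
best_test_loss = {"MNIST FCN": 0.0554, "CIFAR FCN": 1.5091, "CIFAR CNN": 0.6681}

# exp(-mean cross-entropy): geometric-mean probability of the correct class.
typical_confidence = {name: math.exp(-loss) for name, loss in best_test_loss.items()}

print(f"chance-level loss: {chance_loss:.4f}")
for name, loss in best_test_loss.items():
    print(f"{name}: loss={loss:.4f}, exp(-loss)={typical_confidence[name]:.3f}")
```

Read this way, the CIFAR-10 FCN is clearly better than chance but still assigns low probability to the correct class on average, while the CNN sits roughly midway between the FCN and the MNIST model.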

4. Stability of Training
The difference between the best and final checkpoints is small across all experiments, suggesting that the training process was relatively stable. Early stopping itself was never triggered: every run passes `early_stopping_patience=float('inf')`, so each model trained for all 50 epochs.
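
Since every training call in this section passes `early_stopping_patience=float('inf')`, it helps to see why that setting never stops training. The sketch below shows a common patience rule. It is an assumption about how such a rule typically works, not the notebook's `train_model` code, and both the `epochs_run` helper and the loss values are made up for illustration.

```python
def epochs_run(val_losses, patience):
    """Return how many epochs run before a patience-based rule halts training."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # With patience = inf this comparison can never be true.
        if epochs_without_improvement >= patience:
            return epoch
    return len(val_losses)


# Made-up validation losses that stop improving after epoch 3.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]

stopped_at = epochs_run(losses, patience=2)
never_stopped = epochs_run(losses, patience=float("inf"))
print(stopped_at, never_stopped)
```

With a finite patience, training halts shortly after the validation loss plateaus. With infinite patience, all epochs run regardless of the validation curve.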


Conclusion

Fully connected networks are efficient and effective for simple datasets like MNIST, where spatial relationships are not critical.

Convolutional networks are far better suited for image classification tasks that involve complex visuals, like those in CIFAR-10.

For natural-image data such as CIFAR-10, the convolutional model was needed in these experiments to reach strong accuracy and low loss.



In [None]:
mnist_FC = ModelFactory.create_model(
    model_type="fully_connected",
    image_size=(1, 28, 28),
    hidden_units=[512,256],
    activation_fn=nn.SiLU(),
    num_classes=10,
    device=device
)

_ = train_model(dataset_name="mnist", model=mnist_FC, epochs=50, early_stopping_patience=float('inf'),verbose=True)


In [None]:
cifar_FC = ModelFactory.create_model(
    model_type="fully_connected",
    image_size=(3, 32, 32),
    hidden_units=[1024,512,256],
    activation_fn=nn.SiLU(),
    num_classes=10,
    device=device
)

_ = train_model(dataset_name="cifar10",model=cifar_FC, epochs=50, early_stopping_patience=float('inf'),verbose=True)

In [None]:
cifar_CNN = ModelFactory.create_model(
    model_type="convnet",
    activation_fn=nn.SiLU,
    num_classes=10,
    device=device
)

_ = train_model(dataset_name="cifar10",model=cifar_CNN, epochs=50, early_stopping_patience=float('inf'),verbose=True)

# F. Fine-tuning Pretrained ResNet-18 model (15 points)

In this section, we will take a ResNet-18 model pretrained on ImageNet and fine-tune it on CIFAR10. When fine-tuning a pretrained model on a downstream task, only a subset of the parameters is adapted and the rest are kept frozen.
The ResNet18 model architecture is shown below. At a high-level, the architecture can be categorized into input-convolution (conv1/bn1), layer1, layer2, layer3, layer4 and fc. To adapt ResNet18 to CIFAR10, we have two steps:
* Firstly, we replace fc with a new linear-layer that maps to 10 classes instead of 1000 classes.
* Then, we fine-tune this network on CIFAR10 **for 1 epoch** and experiment with freezing different layers of this network.


```
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=10, bias=True)
)
```
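
The strides in the printout above determine how quickly the feature maps shrink, which is relevant to the 224×224 resize performed in the `ResNet18` wrapper's forward pass. The sketch below traces the spatial size stage by stage using the standard convolution output-size formula. The helper names are hypothetical, and only the kernel/stride/padding values are taken from the printout.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet18_feature_sizes(input_size):
    """Trace the spatial size through the stages printed above."""
    sizes = {"input": input_size}
    x = conv_out(input_size, kernel=7, stride=2, padding=3)  # conv1
    sizes["conv1"] = x
    x = conv_out(x, kernel=3, stride=2, padding=1)           # maxpool
    sizes["maxpool"] = x
    sizes["layer1"] = x                                      # stride 1 throughout
    for name in ("layer2", "layer3", "layer4"):
        # The first block of each of these layers has a stride-2 3x3 conv.
        x = conv_out(x, kernel=3, stride=2, padding=1)
        sizes[name] = x
    return sizes


sizes_224 = resnet18_feature_sizes(224)
sizes_32 = resnet18_feature_sizes(32)
print(sizes_224)
print(sizes_32)
```

At the native 32×32 CIFAR10 resolution the feature map would already be a single pixel before the average pool, whereas a 224×224 input leaves a 7×7 map after layer4.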

**All code has been written in this section.**

**F.1 [15 points]** Run the following cells and provide an analysis connecting the configuration of frozen-layers and observed validation/test accuracies.

**Answer:**
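1. Freezing conv1 and bn1 gives the best performance
The `["conv1", "bn1"]` configuration reached almost 89% accuracy, the highest observed in the experiment. Because the freezing code matches block names as substrings of parameter names, this setting freezes not only the input convolution and batch norm but also the `conv1` and `bn1` parameters inside every residual block (for example `layer3.0.conv1.weight`). The frozen input layers extract generic low-level features (like edges and textures), which are already well learned from ImageNet. Reusing pretrained weights likely lets the model concentrate on learning dataset-specific features in the parameters that remain trainable, which helps when training for only one epoch.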

2. Freezing progressively more residual layers lowers accuracy slightly
When more layers were frozen (layer1, then layer1 and layer2, then layer1 through layer3), the model still performed well but with slightly reduced accuracy. This suggests that mid-level and high-level layers benefit from fine-tuning: CIFAR-10 differs from ImageNet in its label set and in its low-resolution 32×32 source images, so the deeper, more task-specific features need some adaptation to the target domain.

3. Freezing all residual blocks (layer1–4) significantly hurts performance
With layer1 through layer4 frozen, only the input convolution and batch norm (conv1/bn1) and the new classifier (fc) remain trainable. This led to only 78% accuracy, nearly equal to the baseline in which all layers were fine-tuned. It suggests that relying almost entirely on frozen pretrained features limits the model’s ability to adapt to new patterns in CIFAR-10.

4. Fine-tuning all layers is not optimal
Interestingly, not freezing anything (fine-tuning the whole pretrained model) gave the lowest performance. One possible explanation is that the model “forgets” useful pretrained features during the early updates, a known issue in transfer learning when learning rates are not carefully tuned. The single epoch of training and the `weight_decay=0.1` setting may contribute as well.
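
To connect each configuration to the number of trainable parameters, the sketch below rebuilds the parameter names and sizes from the printed architecture and applies the same substring rule as the freezing code (`any(block in name for block in freeze_blocks)`). It is pure Python and assumes torchvision's parameter names follow the module names in the printout. `resnet18_param_sizes` and `trainable_count` are hypothetical helpers.

```python
def resnet18_param_sizes(num_classes=10):
    """Parameter name -> element count, read off the printed architecture."""
    params = {"conv1.weight": 64 * 3 * 7 * 7, "bn1.weight": 64, "bn1.bias": 64}
    in_ch = 64
    for i, out_ch in enumerate([64, 128, 256, 512], start=1):
        for b in range(2):
            prefix = f"layer{i}.{b}"
            block_in = in_ch if b == 0 else out_ch
            params[f"{prefix}.conv1.weight"] = out_ch * block_in * 3 * 3
            params[f"{prefix}.bn1.weight"] = out_ch
            params[f"{prefix}.bn1.bias"] = out_ch
            params[f"{prefix}.conv2.weight"] = out_ch * out_ch * 3 * 3
            params[f"{prefix}.bn2.weight"] = out_ch
            params[f"{prefix}.bn2.bias"] = out_ch
            if block_in != out_ch:
                # 1x1 convolution + batch norm on the shortcut path.
                params[f"{prefix}.downsample.0.weight"] = out_ch * block_in
                params[f"{prefix}.downsample.1.weight"] = out_ch
                params[f"{prefix}.downsample.1.bias"] = out_ch
        in_ch = out_ch
    params["fc.weight"] = num_classes * 512
    params["fc.bias"] = num_classes
    return params


def trainable_count(params, freeze_blocks):
    """Apply the notebook's substring rule and sum what stays trainable."""
    if not freeze_blocks:
        return sum(params.values())
    return sum(
        size for name, size in params.items()
        if not any(block in name for block in freeze_blocks)
    )


params = resnet18_param_sizes()
configs = [
    None,
    ["conv1", "bn1"],
    ["layer1"],
    ["layer1", "layer2"],
    ["layer1", "layer2", "layer3"],
    ["layer1", "layer2", "layer3", "layer4"],
]
for config in configs:
    label = ",".join(config) if config else "none frozen"
    print(f"{label:32s} trainable: {trainable_count(params, config):,}")

# Names frozen by ["conv1", "bn1"] beyond the input layers themselves.
extra_frozen = [
    name for name in params
    if name.startswith("layer") and any(b in name for b in ["conv1", "bn1"])
]
print(len(extra_frozen), "block-level parameters also match conv1/bn1")
```

Under these assumptions, freezing layer1 through layer4 leaves only 14,666 trainable parameters (the input layers plus `fc`), while the `["conv1", "bn1"]` setting freezes roughly 40% of the network because it also matches parameters inside every residual block.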

In [None]:
class ResNet18(nn.Module):
    def __init__(self, num_classes=10, freeze_blocks=None,device='cpu'):
        super(ResNet18, self).__init__()
        self.device = device
        self.num_classes = num_classes

        # Load pretrained ResNet-18 model from torchvision.models
        # Read more here: https://pytorch.org/vision/main/models.html
        self.model = models.resnet18(weights='DEFAULT')


        # Modify the last fully connected layer to match the number of classes
        self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)

        # Freeze certain blocks if specified
        if freeze_blocks:
            for name, param in self.model.named_parameters():
                if any(block in name for block in freeze_blocks):
                    param.requires_grad = False

        self.model = self.model.to(self.device)

    def forward(self, x):
        # Resize images to 224x224 in the forward pass
        x = F.interpolate(x, size=(224, 224), mode='bilinear')
        return self.model(x)


In [None]:
freeze_blocks_options = [
    None,  # Train all layers
    ["conv1", "bn1"],  # Freeze initial conv/batchnorm
    ["layer1"],  # Freeze first residual layer
    ["layer1", "layer2"],  # Freeze first two residual layers
    ["layer1", "layer2", "layer3"],  # Freeze first three residual layers
    ["layer1", "layer2", "layer3", "layer4"],  # Freeze all four residual layers
]
for freeze_blocks in freeze_blocks_options:
    # Build a readable label for the current freezing configuration
    config = "Frozen-layers:" + ",".join(freeze_blocks) if freeze_blocks else "Training all layers"
    print(f"{'='*5} Fine-tuning ResNet18: {config} {'='*5}")
    _ = train_model(
        model=ResNet18(freeze_blocks=freeze_blocks, device=device),
        dataset_name="cifar10",
        epochs=1,
        lr=0.001,
        weight_decay=0.1,
        early_stopping_patience=float('inf'),
        # Normalize using ImageNet mean and std
        # Read: https://pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html
        mean_std=[(0.485, 0.456, 0.406), (0.229, 0.224, 0.225)],
        verbose=False,
        plot_graph=False
    )



