<a href="https://colab.research.google.com/github/Ananya1306/applied-machine-learning-algorithms-3806104/blob/main/Copy_of_Spring2025_hw1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS229-EE242 - Spring 2025 - Homework 1

# Due: Friday, April 25, 2025 @ 11:59pm

### Maximum points: 50 pts


## Submit your solution to elearn:
1. Submit a single PDF to **HW1**
2. Submit your jupyter notebook to **HW1-code**

**Both code and pdf need to show code outputs. See the additional submission instructions at the end of this notebook**


## Enter your information below:

### Your Name (submitter):

### Your student ID (submitter):
    
    
<b>By submitting this notebook, I assert that the work below is my own work, completed for this course.  Except where explicitly cited, none of the portions of this notebook are duplicated from anyone else's work or my own previous work.</b>


## Academic Integrity
Each assignment should be done individually. You may discuss general approaches with other students in the class and ask the TAs questions, but you must only submit work that is yours. If you receive help from any external source (other than the TA and the instructor), you must properly credit that source, and if the help is significant, the appropriate grade reduction will be applied. If you fail to do so, the instructor and the TAs are obligated to take the appropriate actions outlined at http://conduct.ucr.edu/policies/academicintegrity.html. Please read carefully the UCR academic integrity policies included in the link.


### PyTorch Materials:

If you're not familiar with PyTorch, please review the introductory tutorials before proceeding with this notebook:

- [PyTorch Beginner Tutorials](https://pytorch.org/tutorials/beginner/basics/intro.html)
- [PyTorch Neural Networks Tutorial](https://pytorch.org/tutorials/beginner/nn_tutorial.html)

You can also refer to this notebook for an introduction:  
[PyTorch Tutorial - CS231N Spring 2024](https://colab.research.google.com/drive/1FERNv6t8xpX9Nly_JdnePWEPllI7F3Fx?usp=sharing)

# Vision Transformer (ViT) Homework

## Overview

In this homework, we will build and train a Vision Transformer (ViT)[1] from scratch. You will implement key components, including **patch embedding**, **normalizations**, and **self-attention layers**, and **train** the model on an image classification task. Additionally, you will train the model on randomly assigned labels to analyze its generalization behavior.


<p align="center">
  <img src="https://production-media.paperswithcode.com/methods/Screen_Shot_2021-01-26_at_9.43.31_PM_uI4jjMq.png" />
  <img src="https://theaisummer.com/static/aa65d942973255da238052d8cdfa4fcd/7d4ec/the-transformer-block-vit.png" alt="Vision Transformer Architecture" />
</p>

<p align="center">
  Figure 1. Vision Transformer Architecture
</p>


### Expected Outcomes

- Understanding of Vision Transformer architecture.
- Insights into model memorization and generalization behavior.

### References

[1] Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." *International Conference on Learning Representations (ICLR)*, 2021.


## Read *all* cells carefully and answer all parts


In [None]:
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Subset


from torch.utils.data import DataLoader
from torchvision import datasets, transforms


## Question 1. Implement and train the MLP Block [8 pts]


### Question 1.1 Implement the MLP Block [3 pts]

The MLP accepts input as a vector and contains two layers with a GELU non-linearity.

**TODO**:
Complete the implementation of the `MLPBlock` class. Define the layers in the constructor (`__init__`) and implement the forward pass (`forward`). The module has two linear layers with an activation layer in between.

**Hints:**
- **Linear Layers**: Use `nn.Linear` to map from `in_dim` → `hidden_dim`, then from `hidden_dim` → `out_dim`.
- **Activation**: Insert an activation function (e.g., `nn.ReLU()` or `nn.GELU()`) between the linear layers.
- **Forward**: Apply the layers sequentially: the **first linear layer**, then the **activation**, then the **second linear layer**.
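
For intuition about the activation named in the hints, here is a minimal sketch of the exact GELU, $\text{GELU}(x) = x\,\Phi(x)$ with $\Phi$ the standard normal CDF, next to ReLU. The helper names `gelu` and `relu` are illustrative and not part of the assignment; in the notebook you would normally use `nn.GELU()` directly.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x: float) -> float:
    """ReLU for comparison: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

# Compare the two activations on a few sample inputs.
for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={v:+.1f}  relu={relu(v):.4f}  gelu={gelu(v):.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative values pass through slightly attenuated rather than clipping them to zero.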


In [None]:
class MLPBlock(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        """
        Initialize the MLP block with two linear layers and an activation function.

        Args:
            in_dim (int): Input feature dimension.
            hidden_dim (int): Hidden layer dimension.
            out_dim (int): Output feature dimension.
        """
        super().__init__()
        # TODO: Define the layers here
        # self.layers =


    def forward(self, x):
        """
        Forward pass through the MLP block.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, in_dim).

        Returns:
            torch.Tensor: Output tensor of shape (batch_size, out_dim).
        """
        # TODO: Implement the forward pass
        return x

### Question 1.2 Train the MLP Block [5 pts]

In this section, we will train a simple **2-layer MLP (Multilayer Perceptron)** for image classification using the CIFAR-10 dataset. The model will incorporate a GELU (Gaussian Error Linear Unit) activation function, which is commonly used in transformer architectures.

**Steps:**
1. **Dataset**: We'll use the CIFAR-10 dataset, which contains 60,000 32x32 color images in 10 classes (50,000 for training and 10,000 for testing).
2. **Model**: The model will be a basic 2-layer MLP, with a GELU activation function between the layers.
3. **Training**: Train the model for some epochs using Cross-Entropy loss and the Adam optimizer.
4. **Evaluation**: Evaluate the model's performance on the test set.

#### Dataset and dataloading

In the code below, we will define the dataset and dataloader, which will be used in our training loop.

Note that we are using `batch_size = 16`; you can keep this batch size or experiment with other values.



In [None]:
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

train_data = datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
test_data = datasets.CIFAR10(root='./data', train=False, download=True, transform=test_transform)

train_loader = DataLoader(train_data, batch_size=16, shuffle=True)
test_loader = DataLoader(test_data, batch_size=16)

#### Define the model [1 pts]

In this section, we will define a classification model using a simple 2-layer MLP. The first layer transforms the input, followed by a GELU activation, and the second layer produces the final output for classification.

**Hints:**
- The input dimension is the total number of pixels in the image, including all color channels. For CIFAR-10, `in_dim = 32 * 32 * 3 = 3072`.
- Use `.view(x.size(0), -1)` to flatten the image into a 1D vector for the MLP.
- The size of the hidden layer (`hidden_dim`) is a hyperparameter. You can experiment with different sizes to find what works best.
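- The size of `out_dim` will be 10, because the model outputs one score (logit) per class for the 10 classes. `nn.CrossEntropyLoss` applies the softmax internally, so the model itself should not apply a softmax.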
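
The flattening hint can be checked on a dummy batch. This is a small sketch that only verifies shapes; the tensor `images` is a placeholder filled with zeros, not real CIFAR-10 data.

```python
import torch

# A dummy batch shaped like CIFAR-10 input: (batch_size, channels, height, width).
images = torch.zeros(16, 3, 32, 32)

# Keep the batch dimension and merge everything else into one feature axis.
flat = images.view(images.size(0), -1)

print(flat.shape)  # torch.Size([16, 3072])
```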






In [None]:
model =

#### Implement training loop [2 pts]

In this section, you will implement the training loop for the MLP model. Below is a basic structure for the training process. We have implemented some parts and you will fill in the rest.

- **Training Loop**: Implement the loop to train the model for some epochs, calculate the loss, and update the model parameters.

- **Evaluation**: Compute and store `test_accuracy` and `train_accuracy` at every epoch.


**You should get at least 50% accuracy on the test data.**
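
Accuracy here means the percentage of samples whose highest-scoring class matches the label. The following is a minimal sketch on made-up values; `toy_logits` and `toy_targets` are hypothetical tensors, not model outputs or dataset labels.

```python
import torch

# Four samples, three classes: each row holds the class scores for one sample.
toy_logits = torch.tensor([[2.0, 1.0, 0.0],
                           [0.0, 3.0, 1.0],
                           [1.0, 0.0, 5.0],
                           [4.0, 0.0, 1.0]])
toy_targets = torch.tensor([0, 1, 2, 1])

# The predicted class is the index of the largest score in each row.
predicted = toy_logits.argmax(dim=1)

# Count matches and convert to a percentage.
correct = (predicted == toy_targets).sum().item()
toy_accuracy = 100.0 * correct / toy_targets.size(0)
print(toy_accuracy)  # 75.0
```

When iterating over a data loader, one would typically accumulate the `correct` count and the number of samples across all batches and divide once at the end of the epoch.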


In [None]:
# Define the optimizer and loss function
learning_rate = 2e-4
num_epochs = 20
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

if torch.cuda.is_available():
  model.cuda()

test_accuracy = []
train_accuracy = []

# Training loop
for epoch in range(num_epochs):
    model.train()

    running_loss = 0.0
    for inputs, targets in train_loader:
        # Move data to device if cuda is available
        if torch.cuda.is_available():
          inputs, targets = inputs.cuda(), targets.cuda()

        # Zero the gradients
        optimizer.zero_grad()

        # TODO:
        # Fill the following sections
        # Forward pass
        # Compute the loss
        # Backward propagation on the loss
        # Update model parameters using optimizer

        outputs =
        loss =


        running_loss += loss.item()

    # TODO
    # Compute accuracy on the training set


    # Print the average loss for the epoch
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss / len(train_loader):.4f}')

    # Evaluate on the test set
    model.eval()  # Set model to evaluation mode

    with torch.no_grad():
        for inputs, targets in test_loader:
            # Move data to device if cuda is available
            if torch.cuda.is_available():
              inputs, targets = inputs.cuda(), targets.cuda()

            # TODO:
            # Fill the following sections

            # Forward pass
            outputs =

            # Get predictions
            predicted =


    # TODO:
    # Compute accuracy on the test set

    accuracy =
    test_accuracy.append(accuracy)
    print(f'Accuracy on the test set: {accuracy:.2f}%')

#### 1.2.3 Plot training and test accuracy [2 pts]

In your training loop above, make sure to compute the training and test accuracy at every epoch and append them to the `train_accuracy` and `test_accuracy` lists, respectively.

We have written the plotting code in the cell below.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

axs[0].plot(train_accuracy, label='Train Accuracy', color='blue')
axs[0].set_title('Train Accuracy vs. Epoch')
axs[0].set_xlabel('Epoch')
axs[0].set_ylabel('Accuracy (%)')
axs[0].grid(True)
axs[0].legend()

axs[1].plot(test_accuracy, label='Test Accuracy', color='green')
axs[1].set_title('Test Accuracy vs. Epoch')
axs[1].set_xlabel('Epoch')
axs[1].set_ylabel('Accuracy (%)')
axs[1].grid(True)
axs[1].legend()

plt.tight_layout()
plt.show()

## Question 2. Implement Patch Embedding Layer [5 pts]

Now, we will implement a patch embedding layer that divides each input image into fixed-size patches. Then we will embed each patch independently into a vector (aka token). The tokens corresponding to all the patches will then be used as inputs to the transformer.

**Hints:**
- **Patch Size:** The image is typically divided into square patches of size `(patch_size, patch_size)`. For example, if the image is 224x224 and the patch size is 16, there will be 224/16 = 14 patches along each dimension, resulting in `N=14 x 14 = 196` patches. CIFAR images are small (32 x 32 pixels), so you may want to use a smaller patch size.

- **Conv2D**: Suppose we want to divide a `3 x H x W` image into `N` patches (each of size `3 x p x p`) and then transform each patch into a `d`-dimensional vector (token). In the code below, we refer to `d` as `embed_dim` or `hidden_dim`. We can do all of this using a single convolutional layer (`nn.Conv2d`) with `d` output channels. The kernel size should be equal to the patch size, and the stride should be the same to avoid overlap. The resulting `(d x H/p x W/p)` tensor will have `d` channels, each with `(H/p x W/p)` spatial features.



- **Flatten**: After embedding the patches, we can flatten them to create a sequence of patch embeddings. This will be done across the last two dimensions (height, width) after convolution.
  Suppose we are using a batch of shape `(batch_size, 3, H, W)`; after Conv2D and Flatten, we should get a tensor of shape `(batch_size, d, N)`.

- **Transpose**: After flattening, transpose the output to `(batch_size, num_patches, hidden_dim)` or `(batch_size, N, d)` for the final format.
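
The shape bookkeeping in the hints above can be traced on a dummy batch. This is a sketch under assumed values (`patch_size = 4`, `embed_dim = 48`, batch size 2), chosen only so that every dimension is easy to tell apart; it is not the required implementation.

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 4, 48      # illustrative values only

# Kernel size and stride both equal the patch size, so patches do not overlap.
conv = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(2, 3, 32, 32)      # (batch_size, 3, H, W)
feat = conv(x)                     # (batch_size, d, H/p, W/p)
seq = feat.flatten(2)              # (batch_size, d, N), flattening height and width
tokens = seq.transpose(1, 2)       # (batch_size, N, d)

print(feat.shape, seq.shape, tokens.shape)
```

With a 32 x 32 image and 4 x 4 patches, there are 8 patches along each side, so `N = 64`.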

In [None]:

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size, in_channels, embed_dim):
        """
        Initialize the Patch Embedding layer.

        Args:
            patch_size (int): Size of each patch (patch_size x patch_size).
            in_channels (int): Number of input channels (This will be 3 for RGB images).
            embed_dim (int): Dimension of the embedding for each patch.
        """
        super().__init__()
        # TODO: Define the layers here
        # self.layers =

    def forward(self, x):
        """
        Forward pass through the PatchEmbedding layer.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, in_channels, height, width).

        Returns:
            torch.Tensor: Output tensor of shape (batch_size, num_patches, embed_dim).

            num_patches = (height * width) / (patch_size * patch_size)
        """
        # TODO: Implement the forward pass
        return x

## Question 3. Implement Multi-Head Self-Attention (MHSA) [10 pts]

In this task, we will implement a Multi-Head Self-Attention (MHSA) layer. As the name suggests, MHSA performs self-attention with multiple heads.

The **self-attention** module combines the input sequence of tokens (based on some similarity metric) to generate a set of output tokens. The key equations for self-attention can be written as follows.

For a given $N\times d$ input token matrix $Z$, with $N$ tokens each of dimension $d$, we generate Query, Key, and Value matrices that can be written as
  $$
  Q = Z W_Q, \quad K = Z W_K, \quad V = Z W_V,
  $$
where  $W_Q, W_K, W_V$ are weight matrices that we learn for the query, key, and value projections. You can implement each of them as a linear layer.
  
For simplicity, we can assume that all of the projection matrices have the same size (e.g., $d\times d$ matrices that map each $d$-dimensional token to a $d$-dimensional space; we will see shortly that with multiple heads we prefer a reduced dimension for each head).

**Scaled Dot-Product Attention** is computed as

  $$
  \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V
  $$

  Where:
  - $d$ is the embedding dimension of tokens in the Q, K, and V matrices. In the code, we also refer to it as `hidden_dim`.
  - The softmax function is applied over the rows of $\frac{Q K^T}{\sqrt{d}}$.
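
The equations above can be traced on a tiny single-head example. This is a minimal sketch with random, untrained weight matrices and assumed sizes (`N = 4`, `d = 8`); it ignores the batch dimension and uses plain tensors instead of `nn.Linear` layers.

```python
import math
import torch

torch.manual_seed(0)
N, d = 4, 8                              # illustrative sizes only

Z = torch.randn(N, d)                    # N input tokens of dimension d
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))

# Query, key, and value projections of the same input tokens.
Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V

# Scaled similarity between every query and every key: shape (N, N).
scores = Q @ K.T / math.sqrt(d)

# Softmax over each row, so the weights for one query sum to 1.
attn = torch.softmax(scores, dim=-1)

# Each output token is a weighted combination of the value vectors.
out = attn @ V                           # shape (N, d)
print(attn.sum(dim=-1), out.shape)
```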

**Multi-Head Self-Attention (MHSA)**: As the saying goes, multiple heads are better than one. In practice, we use multiple heads, each of which performs a separate attention operation with its own set of learned projections.

Let us assume that we want to use $h$ heads, all of which receive the same input tokens $Z$ and provide a set of outputs that we denote as $(\text{head}_1,\ldots, \text{head}_h)$. We compute each output head as

$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i),$$

where $Q_i=ZW^i_Q, K_i = Z W^i_K, V_i = ZW^i_V$ are the (Q, K, V) for head i with learnable weight matrices: $W_Q^i, W_K^i, W_V^i$.

The choice of embedding dimension for each head is in our control, but for simplicity we use $d_k = d/h$ as the embedding dimension for each head. For this reason, we prefer to choose $h$ so that it divides $d$ evenly. Inside each head, the scaling factor in the attention formula accordingly becomes $\sqrt{d_k}$.

To compute the final MHSA output, we concatenate the outputs of all the heads and apply an output linear transform to get an $N\times d$ output token matrix as
  
  $$
  \text{MHSA}(Z) = \text{concat}(\text{head}_1, \dots, \text{head}_h) W_O,
  $$
where $W_O$ denotes the $d\times d$ output transform matrix.
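
One common way to organize the per-head computation (an implementation choice, not the only option) is to apply a single $d\times d$ projection and then split its output into $h$ chunks of size $d_k$, which is equivalent to stacking the $h$ per-head matrices side by side. The sketch below shows only the reshaping, with assumed sizes and a random tensor standing in for the projected queries.

```python
import torch

batch_size, N, d, h = 2, 5, 12, 3        # illustrative sizes only
d_k = d // h                             # per-head dimension

Q = torch.randn(batch_size, N, d)        # stand-in for the projected queries

# Split the last axis into heads: (B, N, d) -> (B, N, h, d_k) -> (B, h, N, d_k).
Q_heads = Q.view(batch_size, N, h, d_k).transpose(1, 2)

# Per-head attention would operate here on tensors of shape (B, h, N, d_k).

# Merge the heads back (the concat step): (B, h, N, d_k) -> (B, N, d).
merged = Q_heads.transpose(1, 2).contiguous().view(batch_size, N, d)

print(Q_heads.shape, merged.shape)
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout.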

In [None]:
class MHSA(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        """
        Initialize the Multi-Head Self-Attention (MHSA) layer.

        Args:
            hidden_dim (int): Embedding dimension for the input patches ($d$).
            num_heads (int): Number of attention heads.
        """
        super(MHSA, self).__init__()
        # TODO: Define the layers here

    def forward(self, x):
        """
        Forward pass through the Multi-Head Self-Attention (MHSA) layer.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, num_patches, hidden_dim).

        Returns:
            torch.Tensor: Output tensor after applying multi-head attention.
        """
        # TODO:
        # 1. Apply linear projections to input x to obtain Q, K, and V.
        # 2. Compute attention scores for each head.
        # 3. Apply attention to V: output = attention @ V
        # 4. Concatenate outputs from all heads and apply final linear projection

        return x


## Question 4. Build the Vision Transformer Model [10 pts]



### 4.1 Implement the Encoder Module [5 pts]

The EncoderLayer is a core building block of the Vision Transformer (ViT). Each encoder layer consists of two key components: Multi-Head Self-Attention (MHSA) and a Multi-Layer Perceptron (MLP). Each of these components is followed by residual connections and [layer normalization](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html#layernorm) (take a look at Figure 1 for visual reference).

The `attn_drop_out` argument controls the dropout rate applied to the attention weights. If you set `attn_drop_out=0`, it should disable attention dropout. Use `nn.Dropout` to implement it.
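
To make the two building blocks mentioned above concrete, here is a short sketch of how `nn.LayerNorm` and `nn.Dropout` behave on a dummy token tensor. The sizes are assumptions for illustration, and the variable names are not part of the assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 16)                # (batch_size, num_tokens, hidden_dim)

# LayerNorm normalizes each token over its last (feature) dimension.
norm = nn.LayerNorm(16)
y = norm(x)
token_means = y.mean(dim=-1)                   # approximately 0 for every token
token_vars = y.var(dim=-1, unbiased=False)     # approximately 1 for every token

# Dropout with p=0 leaves the input unchanged, even in training mode.
drop_off = nn.Dropout(p=0.0)
same_when_p_zero = torch.equal(drop_off(x), x)

# Dropout with p>0 is also the identity once the module is in eval mode.
drop = nn.Dropout(p=0.5)
drop.eval()
same_in_eval = torch.equal(drop(x), x)

print(same_when_p_zero, same_in_eval)
```

This is why calling `model.eval()` before evaluation matters: it switches off dropout so that test-time predictions are deterministic.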




In [None]:

class EncoderLayer(nn.Module):
    def __init__(self, hidden_dim, num_heads, mlp_hidden_dim, attn_drop_out = 0):
        """
          An encoder layer in the Vision Transformer (ViT).

          Args:
              hidden_dim (int): The dimension of the embedding space ($d$).
              num_heads (int): The number of attention heads in the Multi-Head Self-Attention (MHSA) layer.
              mlp_hidden_dim (int): Hidden dimension of the MLP block.
              attn_drop_out (float): Dropout rate applied to the attention weights. Set to 0 to disable dropout and allow full attention flow during training.
        """

        super(EncoderLayer, self).__init__()
        # TODO define Multi-Head Self-Attention (MHSA)
        self.mhsa =

        # TODO define Layer normalization
        self.norm1 =

        # TODO define MLP
        self.mlp =

        # TODO define a second Layer normalization
        self.norm2 =

    def forward(self, x):
        """
        Forward pass through the Encoder layer.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, num_patches, embed_dim).

        Returns:
            torch.Tensor: Encoder output
        """
        # TODO:
        # 1. Apply MHSA to x
        # 2. Perform layer normalization with residual connection
        # 3. Perform projection using MLP with residual connection
        # 4. Compute output by performing second layer normalization

        return x

### 4.2 Build the Vision Transformer Model [5 pts]

We have almost everything ready to build a vision transformer. We need to take care of two things: 1) positional encoding; 2) the `cls` embedding token.

**Positional encoding.** Note that the attention mechanism is permutation equivariant (i.e., if we shuffle the patches in a random order, the attention layers produce the same output tokens, shuffled in the same order), so by itself it carries no information about where each patch came from. To encode positional information in the patches, we add position embedding vectors to all the input embedded patches. The position embedding vectors have the same dimension ($d$) as the embedded patches. A common choice for position embedding is the sinusoidal embedding, but below we will use a learnable embedding.

**CLS token.** As we discussed in class, we usually prepend an extra learnable `cls` token to the input embedding (as shown in Figure 1). We then use the output token corresponding to the `cls` token at the last layer to perform the classification. The total number of tokens in the sequence will therefore be `num_patches+1`.
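
The token bookkeeping described above can be sketched as follows. The sizes are assumptions for illustration, and `patch_tokens` is a random stand-in for the output of a patch embedding layer.

```python
import torch
import torch.nn as nn

batch_size, num_patches, d = 2, 64, 48   # illustrative sizes only
patch_tokens = torch.randn(batch_size, num_patches, d)

# One learnable cls token and one learnable position vector per sequence slot.
cls_token = nn.Parameter(torch.randn(1, 1, d))
pos_embedding = nn.Parameter(torch.empty(1, num_patches + 1, d).normal_(std=0.02))

# Repeat the cls token for every sample in the batch and prepend it.
cls = cls_token.expand(batch_size, -1, -1)         # (batch_size, 1, d)
tokens = torch.cat([cls, patch_tokens], dim=1)     # (batch_size, num_patches + 1, d)

# The position embedding broadcasts over the batch dimension.
tokens = tokens + pos_embedding

# After the encoder layers, the classifier reads only the first token.
cls_out = tokens[:, 0]                             # (batch_size, d)
print(tokens.shape, cls_out.shape)
```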





In [None]:
class VisionTransformer(nn.Module):
    def __init__(self, num_heads, num_layers, hidden_dim, num_classes, patch_size, in_channels, img_size, mlp_hidden_dim, attn_drop_out):
        """
        Vision Transformer (ViT) model.

        Args:
            num_heads (int): The number of attention heads in each Multi-Head Self-Attention (MHSA) layer.
            num_layers (int): The number of transformer encoder layers to be stacked in the model.
            hidden_dim (int): The dimension of the embedding space ($d$).
            num_classes (int): The number of output classes for the classification task.
            patch_size (int): The size of the patches that the input image will be divided into (e.g., 16x16).
            in_channels (int): The number of channels of the input image (e.g, 3 for RGB images).
            img_size (int): The size of the input image (e.g., 224 for 224x224 images).
            mlp_hidden_dim (int): The hidden dimension of the MLP block.
            attn_drop_out (float): Dropout rate applied to the attention weights. Set to 0 to disable dropout and allow full attention flow during training.

        """

        super(VisionTransformer, self).__init__()

        self.cls_token = nn.Parameter(torch.randn(1, 1, hidden_dim)) # A class Token (Learnable Parameter)

        # Learnable Position embedding
        seq_length = (img_size // patch_size) ** 2 + 1
        self.pos_embedding = nn.Parameter(torch.empty(1, seq_length, hidden_dim).normal_(std=0.02))

        # Complete the rest of the layers

        # TODO: Create a Patch Embedding Layer
        self.patch_embed =

        # TODO: Create `num_layers` Encoder Layers
        self.encoder_layers =

        # TODO: Create a Final Linear Layer for Classification
        self.fc =

    def forward(self, x):
        """
        Forward pass through the ViT model.

        Args:
            x (torch.Tensor): Input image tensor of shape (batch_size, in_channels, img_size, img_size).

        Returns:
            torch.Tensor: Class logits of shape (batch_size, num_classes).
        """
        # TODO
        # 1. Apply patch embedding and concatenate class token
        # 2. Add position encoding and perform forward pass through each encoder layer
        # 3. Extract the class token's output for classification and pass it through the final classification layer
        # 4. Return output

        return x

## Question 5. Training and Evaluation [17 pts]

### 5.1 Image classification [10 pts]

In this section, we will implement a simple image classification model using the CIFAR-10 dataset. The model will be a **2-layer Vision Transformer (ViT)**, where the images are divided into patches, and a Transformer encoder is used for classification.

**Steps:**
1. **Dataset**: We'll use the CIFAR-10 dataset, which contains 60,000 32x32 color images in 10 classes (50,000 for training and 10,000 for testing).
2. **Model**: A basic 2-layer Vision Transformer (ViT) with patch embedding, Transformer layers, and a classification head.
3. **Training**: Train the model for some epochs using Cross-Entropy loss and the Adam optimizer.
4. **Evaluation**: Evaluate the model's performance on the test set.


#### Data loading and training setup

In [None]:
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

train_data = datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
test_data = datasets.CIFAR10(root='./data', train=False, download=True, transform=test_transform)

train_loader = DataLoader(train_data, batch_size=16, shuffle=True)
test_loader = DataLoader(test_data, batch_size=16)

#### 5.1.1 Define the model [3 pts]

In this section, you will create an object of the VisionTransformer class that you implemented in **Question 4.**

Use the following parameters to instantiate the model:
- `num_heads`: Number of attention heads in each transformer layer ($h$).
- `num_layers`: Number of transformer layers.
- `hidden_dim`: The dimensionality of the embedding space ($d$).
- `num_classes`: The number of output classes (CIFAR-10 has 10 classes).
- `patch_size`: The size of the patches to divide the image.
- `in_channels`: The number of input channels (3 for RGB images).
- `img_size`: The size of the input images (32x32 for CIFAR-10).
- `mlp_hidden_dim`: The hidden dimension of the MLP
- `attn_drop_out`: Dropout rate applied to the attention weights.
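
Before instantiating the model, it can help to sanity-check that the chosen hyperparameters are compatible. The sketch below uses assumed values purely for illustration; substitute your own.

```python
# Illustrative values only; replace with your own choices.
hidden_dim, num_heads = 256, 8
img_size, patch_size = 32, 4

# The image must split into a whole number of patches along each side.
assert img_size % patch_size == 0

# Each head gets an equal slice of the embedding dimension.
assert hidden_dim % num_heads == 0

num_patches = (img_size // patch_size) ** 2   # patches per image
seq_length = num_patches + 1                  # plus the cls token
head_dim = hidden_dim // num_heads            # d_k = d / h

print(num_patches, seq_length, head_dim)  # 64 65 32
```

A smaller `patch_size` gives more tokens per image; since attention compares every token with every other token, its cost grows quadratically with `seq_length`.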


In [None]:
# Example values for these parameters are given below, but we strongly encourage you to try your own values.
num_heads = 8
num_layers = 4
hidden_dim = 256
num_classes = 10
patch_size = 4
in_channels = 3
img_size = 32
mlp_hidden_dim = 256*4
attn_drop_out = 0.1


In [None]:
# TODO
num_heads =
num_layers =
hidden_dim =
num_classes =
patch_size =
in_channels =
img_size =
mlp_hidden_dim =
attn_drop_out =

# Instantiate the model
model =

#### 5.1.2 Training loop [5 pts]

In this section, you will implement the training loop for the Vision Transformer model. Below is a basic structure for the training process. We have implemented some parts and you will fill in the rest.

- **Training Loop**: Implement the loop to train the model for some epochs, calculate the loss, and update the model parameters.

- **Evaluation**: Compute and store `test_accuracy` and `train_accuracy` at every epoch.


**You should get at least 75% accuracy on the test data.**

In [None]:
# Define the optimizer and loss function
learning_rate = 2e-4
num_epochs = 20
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

if torch.cuda.is_available():
  model.cuda()

test_accuracy = []
train_accuracy = []

# Training loop
for epoch in range(num_epochs):
    model.train()

    running_loss = 0.0
    for inputs, targets in train_loader:
        # Move data to device if cuda is available
        if torch.cuda.is_available():
          inputs, targets = inputs.cuda(), targets.cuda()

        # Zero the gradients
        optimizer.zero_grad()

        # TODO:
        # Fill the following sections
        # Forward pass
        # Compute the loss
        # Backward propagation on the loss
        # Update model parameters using optimizer

        outputs =
        loss =


        running_loss += loss.item()

    # TODO
    # Compute accuracy on the training set


    # Print the average loss for the epoch
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss / len(train_loader):.4f}')

    # Evaluate on the test set
    model.eval()  # Set model to evaluation mode

    with torch.no_grad():
        for inputs, targets in test_loader:
            # Move data to device if cuda is available
            if torch.cuda.is_available():
              inputs, targets = inputs.cuda(), targets.cuda()

            # TODO:
            # Fill the following sections

            # Forward pass
            outputs =

            # Get predictions
            predicted =


    # TODO:
    # Compute accuracy on the test set

    accuracy =
    test_accuracy.append(accuracy)
    print(f'Accuracy on the test set: {accuracy:.2f}%')

#### 5.1.3 Plot training and test accuracy. [2 pts]


In your training loop above, make sure to compute the training and test accuracy at every epoch and append them to the `train_accuracy` and `test_accuracy` lists, respectively.

We have written the plotting code in the cell below.





In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

axs[0].plot(train_accuracy, label='Train Accuracy', color='blue')
axs[0].set_title('Train Accuracy vs. Epoch')
axs[0].set_xlabel('Epoch')
axs[0].set_ylabel('Accuracy (%)')
axs[0].grid(True)
axs[0].legend()

axs[1].plot(test_accuracy, label='Test Accuracy', color='green')
axs[1].set_title('Test Accuracy vs. Epoch')
axs[1].set_xlabel('Epoch')
axs[1].set_ylabel('Accuracy (%)')
axs[1].grid(True)
axs[1].legend()

plt.tight_layout()
plt.show()

### 5.2 Generalization vs Memorization [7 pts]

In this experiment, we explore the capacity of deep neural networks to memorize arbitrary data by training a model on randomly shuffled labels. Both the training and test labels are randomized. The goal is to observe the model’s behavior when trained and tested on randomly shuffled labels.

**Your model should be able to memorize the random training data. That means you should achieve high training accuracy.**
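
As a reference point for interpreting the test accuracy in this experiment: when labels are drawn uniformly at random and independently of the images, any predictor is expected to match them about one time in `num_classes`. The sketch below simulates this with the standard library only; the sample size and seed are arbitrary assumptions.

```python
import random

rng = random.Random(0)                   # fixed seed for repeatability
num_classes, n = 10, 10_000

# Labels drawn uniformly at random, independent of any input.
random_labels = [rng.randrange(num_classes) for _ in range(n)]

# Any guesses that are independent of the labels behave the same way.
guesses = [rng.randrange(num_classes) for _ in range(n)]

matches = sum(g == y for g, y in zip(guesses, random_labels))
chance_accuracy = 100.0 * matches / n
print(f"{chance_accuracy:.1f}%")         # close to 10%
```

Training accuracy well above this level on random labels therefore reflects memorization of the training set rather than a learned pattern.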


For more details on label randomization experiments, refer to [2].


**Reference**

[2] Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." *International Conference on Learning Representations (ICLR)*, 2017.

#### Dataset and dataloader setup

Here, we will create a dataset that has random labels.

In [None]:
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])


random_train_data = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
random_test_data = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

# Apply random labels to the datasets
num_classes = 10
random_train_data.targets = np.random.randint(0, num_classes, size=len(random_train_data))
random_test_data.targets = np.random.randint(0, num_classes, size=len(random_test_data))

# To make the training simpler, let us subsample the dataset and pick only 5000 training and 500 test images
train_subset_indices = torch.randperm(len(random_train_data))[:5000]
test_subset_indices = torch.randperm(len(random_test_data))[:500]

random_train_data = Subset(random_train_data, train_subset_indices)
random_test_data = Subset(random_test_data, test_subset_indices)

batch_size = 8
random_train_loader = DataLoader(random_train_data, batch_size=batch_size, shuffle=True)
random_test_loader = DataLoader(random_test_data, batch_size=batch_size)

#### 5.2.1 Implement training and evaluation [3 pts]

Write a training loop similar to ***Question 5.1.2*** but instead use `random_train_loader` and `random_test_loader` data loaders.

**Make sure to create a new model object.**


Compute and store both test accuracy and training accuracy for this randomized experiment at every epoch.

**Hint**. Your model should be large enough to memorize training data.

In [None]:
# TODO
# Write training loop

#### 5.2.3 Plot training and test accuracy for the randomized training. [2 pts]


#### 5.2.4 Generalization Capability of the Model [2 pts]

Discuss your observations during the randomization test. What did you notice about the model's ability to fit the training data and generalize to the test set when trained on randomized labels?

**Answer:**

---

## **Submission instructions**
1. Download this Colab to ipynb, and convert it to PDF. Follow similar steps as [here](https://stackoverflow.com/questions/53460051/convert-ipynb-notebook-to-html-in-google-colab) but convert to PDF.
 - Download your .ipynb file. You can do it using only Google Colab. `File` -> `Download` -> `Download .ipynb`
 - Reupload it so Colab can see it. Click on the `Files` icon on the far left to expand the side bar. You can directly drag the downloaded .ipynb file to the area. Or click `Upload to session storage` icon and then select & upload your .ipynb file.
 - Conversion using %%shell.
 ```
!sudo apt-get update
!sudo apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic pandoc
!pip install pypandoc
!jupyter nbconvert --log-level CRITICAL --to pdf name_of_hw.ipynb
  ```
 - Your PDF file is ready. Click 3 dots and `Download`.
**Note: Please follow these instructions to generate the PDF. Do not use any other method, such as `ctrl+p`.**

2. Upload the PDF to elearn and **select** the correct pages for each question. Refer to the week 1 discussion video or contact the TAs if you face any issues. **Important!**


3. Upload the `.ipynb` file to elearn. Make sure that both the **code** and the **PDF** are uploaded. **Important!**


Note: If the conversion fails, check your LaTeX and debug. In Markdown, when you write in LaTeX math mode, do not leave any leading or trailing whitespace inside the dollar signs ($). For example, write `(dollarSign)\mathbf{w}(dollarSign)` instead of `(dollarSign)(space)\mathbf{w}(dollarSign)`. Otherwise, nbconvert will throw an error and the generated PDF will be incomplete.

In [None]:
!sudo apt-get update
!sudo apt-get install texlive-xetex texlive-fonts-recommended texlive-plain-generic pandoc
!pip install pypandoc


In [None]:
!jupyter nbconvert --log-level INFO --to pdf Spring2025_hw1.ipynb # make sure the ipynb name is correct

[NbConvertApp] Converting notebook Spring2025_hw1.ipynb to pdf
[NbConvertApp] Writing 83837 bytes to notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', 'notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: ['bibtex', 'notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 120043 bytes to Spring2025_hw1.pdf
