<h1 align="center" style="color:green;font-size: 3em;">Assignment 3: LoRA Fine-tuning for LLMs</h1>

# Part 1: Introduction

In this homework assignment, you will implement several fine-tuning methods using a pre-trained [DistildBERT](https://arxiv.org/abs/1910.01108) model for sentiment classification on IMDB text reviews. Specifically, you will:
- Implement Full-finetuning
- Implement LoRA finetuning

**Sentiment classification**  is a text classification task where a model learns to determine whether a given review expresses a positive sentiment (label 1) or a negative sentiment (label 0).

For example, given the following review:
```
'This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe on the moon" It reminds me of Sinatra\'s song High Hopes, it is fun and inspirational. The Music is great throughout and my favorite song is sung by the King, Hank (bing Crosby) and Sir "Saggy" Sagamore. OVerall a great family movie or even a great Date movie. This is a movie you can watch over and over again. The princess played by Rhonda Fleming is gorgeous. I love this movie!! If you liked Danny Kaye in the Court Jester then you will definitely like this movie.'
```
The model should classify it as positive (label 1).


**Instructions:**

1. Use the provided notebook sections to implement each fine-tuning technique.
2. Complete the code cells marked with `TODO`
3. Ensure all code runs correctly by the end of the notebook.

In [1]:
!pip install datasets -q

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/485.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m481.3/485.4 kB[0m [31m32.9 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m485.4/485.4 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/116.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m8.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.8/194.8 kB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
import os
import io
from PIL import Image
from tqdm.notebook import tqdm
import torch
import torchvision
from torch import nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from torchvision import transforms
from torch.utils.data import DataLoader

import numpy as np
from matplotlib import pyplot as plt

# Libraries from HugginFace.
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"You selected as device: {device}")

You selected as device: cuda:0


# Part 2: Dowload and prepare the IMDB sentiment classification dataset

The IMDB sentiment classification dataset concists of 25000 training examples and 25000 test examples. From the latter, test dataset, we will use only 3200 examples for faster evaluation.

In [3]:
# Download the dataset
dataset = load_dataset("imdb")

# The train dataset has 25000 examples.
train_dataset = dataset["train"].shuffle(seed=42)
# The test dataset has 25000 examples, from which we will use 3200.
test_dataset = dataset["test"].shuffle(seed=42).select([i for i in list(range(3200))])

label_names = {0: "Negative", 1: "Positive"}

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Print one positive and one negative training example.

In [4]:
# Print the first two training examples
negative = [idx for idx, x in enumerate(train_dataset) if x["label"] == 0]
positive = [idx for idx, x in enumerate(train_dataset) if x["label"] == 1]
print(f"In the training set there are #{len(negative)} negative reviews and #{len(positive)} positive reviews")

print("\nPositive review example:")
print("\tReview:", train_dataset[positive[1]]["text"])
print("\tLabel:", label_names[train_dataset[positive[0]]["label"]])
print("\nNegative review example:")
print("\tReview:", train_dataset[negative[1]]["text"])
print("\tLabel:", label_names[train_dataset[negative[0]]["label"]])

In the training set there are #12500 negative reviews and #12500 positive reviews

Positive review example:
	Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe on the moon" It reminds me of Sinatra's song High Hopes, it is fun and inspirational. The Music is great throughout and my favorite song is sung by the King, Hank (bing Crosby) and Sir "Saggy" Sagamore. OVerall a great family movie or even a great Date movie. This is a movie you can watch over and over again. The princess played by Rhonda Fleming is gorgeous. I love this movie!! If you liked Danny Kaye in the Court Jester then you will definitely like this movie.
	Label: Positive

Negative review example:
	Review: Yeh, I know -- you're quivering with excitement. Well, *The Secret Lives of Dentists* will not upset your expectations: it's solidly made but essentially u

In this assignment, we will use the **Hugging Face Transformers** Python library (`transformers`). This library provides access to a wide range of pre-trained models, including **DistilBERT** and the latest **LLMs** like **Llama 3**. It also includes model-specific **tokenizers** and various tools for working with deep learning models. Additionally, Hugging Face offers the `datasets` package, which we have already used to download and process the IMDB dataset.  

### **Tokenization and Model Inputs**  
Pre-trained models come with built-in **tokenizers**, which are essential for preprocessing text inputs. A tokenizer converts raw text (single strings or lists of strings) into a structured format that the model can process. The output is a dictionary that includes:  

- **`input_ids`**: A list of token IDs representing the input text.  
- **`attention_mask`**: A list of 0s and 1s, indicating which tokens belong to the original text (**1**) and which are padding tokens (**0**).  

Padding ensures that all inputs have the same length (e.g., **L=512 tokens**), making it possible to process multiple examples together in mini-batches (**batch size > 1**). To achieve this, tokenizers **pad** shorter inputs with **0s**. The `attention_mask` helps the model’s **self-attention layers** ignore these padding tokens during processing.

### **Using Tokenizers**  
You can load tokenizers in two ways:  
- Use a **model-specific tokenizer** (e.g., `DistilBertTokenizer` for DistilBERT).  
- Use `AutoTokenizer`, which automatically loads the correct tokenizer for any model.  

Next, we will use the **DistilBERT tokenizer** to preprocess the IMDB dataset.  

In [5]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

tokenized_train = train_dataset.map(
    lambda example: tokenizer(example['text'], padding=True, truncation=True), # https://huggingface.co/docs/transformers/pad_truncation
    batched=True,
    batch_size=16
)
tokenized_train = tokenized_train.remove_columns(["text"])
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_train.set_format("torch")

tokenized_test = test_dataset.map(
    lambda example: tokenizer(example['text'], padding=True, truncation=True), # https://huggingface.co/docs/transformers/pad_truncation
    batched=True,
    batch_size=16
)
tokenized_test = tokenized_test.remove_columns(["text"])
tokenized_test = tokenized_test.rename_column("label", "labels")
tokenized_test.set_format("torch")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3200 [00:00<?, ? examples/s]

Visualize the tokenization for the previous positive example


In [6]:
print("Positive review example:")
print("\n\tReview:\n", train_dataset[positive[1]]["text"])
print("\n\tTokenized review (input_ids):\n", tokenized_train[positive[1]]["input_ids"])
print("\n\tAttention mask showing the padding:\n", tokenized_train[positive[1]]["attention_mask"])

Positive review example:

	Review:
 This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe on the moon" It reminds me of Sinatra's song High Hopes, it is fun and inspirational. The Music is great throughout and my favorite song is sung by the King, Hank (bing Crosby) and Sir "Saggy" Sagamore. OVerall a great family movie or even a great Date movie. This is a movie you can watch over and over again. The princess played by Rhonda Fleming is gorgeous. I love this movie!! If you liked Danny Kaye in the Court Jester then you will definitely like this movie.

	Tokenized review (input_ids):
 tensor([  101,  2023,  3185,  2003,  1037,  2307,  1012,  1996,  5436,  2003,
         2200,  2995,  2000,  1996,  2338,  2029,  2003,  1037,  4438,  2517,
         2011,  2928, 24421,  1012,  1996,  3185,  4627,  1997,  2007,  1037,
         3496,  2073,

# Part 3: Downloading and using a pretrained DistilBERT model  

Here we use the Hugging Face `transformers` package to download a pretrained **DistilBERT** model and adapt it for text classification.  

We achieve this with the following command:  

```python
AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```
-  `"distilbert-base-uncased"` specifies the pretrained model to download.
- `num_labels=2` sets the number of output labels for our classification task (Positive vs. Negative reviews).

This call automatically loads DistilBERT and attaches a randomly initialized classification head with two output neurons for sentiment prediction.

In [7]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

def set_dropout(model, dropout_prob=0.05):
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = dropout_prob

def count_parameters(model):
    num_params_both_trainable_and_frozen = sum(p.numel() for p in model.parameters())
    num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    num_params_head = (
        sum(p.numel() for p in model.classifier.parameters() if p.requires_grad) +
        sum(p.numel() for p in model.pre_classifier.parameters() if p.requires_grad))

    num_params_backbone = num_params - num_params_head

    print("\n")
    #print(f"Total number of parameters: {num_params_both_trainable_and_frozen}")
    print(f"Number of trainable parameters:")
    print(f"\t Classification head:  {num_params_head}")
    print(f"\t Transformer backbone: {num_params_backbone}")
    print(f"\t Total:                {num_params}")

set_dropout(model, dropout_prob=0.05)

print("Model:")
print(model)

count_parameters(model)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model:
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.05, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.05, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.05, inpla

Below we use the model (un-trained for now) in order to make predict the sentiment of a review.

In [8]:
input_ids = tokenized_train[0]["input_ids"].unsqueeze(0) # We do the unsqueeze(0) to add a batch-dimension
attention_mask = tokenized_train[0]["attention_mask"].unsqueeze(0) # We do the unsqueeze(0) to add a batch-dimension
print(f'input_ids.shape {input_ids.shape}')
print(f'attention_mask.shape {attention_mask.shape}')

model.eval()
predictions = model(input_ids=input_ids, attention_mask=attention_mask)["logits"]
pred_label = predictions.argmax()
print(f"Logits: {predictions.tolist()} -- predicted label {pred_label}")

input_ids.shape torch.Size([1, 512])
attention_mask.shape torch.Size([1, 512])
Logits: [[-0.08124226331710815, 0.12949447333812714]] -- predicted label 1


# Part 4: Implement a fine-tuning routine

This time, you will implement a function called `finetune_bert_model`. This function will fine-tune a pretrained model with a classification head using the **AdamW** optimizer.  

### **Function Signature:**  
```python
finetune_bert_model(model, tokenized_train, tokenized_test, n_epochs, batch_size, lr, wd)
```

### **Function Inputs:**  
- `model`: The pretrained model with the classification head attached.  
- `tokenized_train`: The tokenized training dataset.  
- `tokenized_test`: The tokenized test dataset.  
- `n_epochs`: Number of training epochs.  
- `batch_size`: Batch size for training.  
- `lr`: Learning rate.  
- `wd`: Weight decay.  

### **Function Requirements:**  
- Train the model for `n_epochs` using the given `batch_size`, `lr`, and `wd`.  
- After each training epoch, run an evaluation epoch on the test dataset.  
- After each training plust evaluation epoch, print:  
  - Average **training loss**  
  - Average **training accuracy**  
  - Average **test loss**  
  - Average **test accuracy**  
- At the end of training (after all epochs), print the **final test accuracy** and **final test loss**. In other words, the test accuracy and test loss of the last evaluation epoch.
- The ``finetune_bert_model`` should return i) the **fine-tuned model**, ii) the **final test accuracy**, and iii) the **final test loss**.

### **Hint:**
As shown in the previous code cell, to obtain sentiment classification logits from the Hugging Face transformer model given an input text (`input_ids` and `attention_mask`), use the following call:  

```python
predictions = model(input_ids=input_ids, attention_mask=attention_mask)["logits"]
```

In [11]:
def finetune_bert_model(model, tokenized_train, tokenized_test, n_epochs, batch_size, lr, wd):

    train_dataloader = DataLoader(tokenized_train, batch_size=batch_size, num_workers=2)
    eval_dataloader = DataLoader(tokenized_test, batch_size=batch_size, num_workers=2)

    model = model.to(device)

    # TODO: fill the remaining code.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(n_epochs):
        model.train()
        total_train_loss = 0
        correct_train = 0
        total_train = 0

        for batch in train_dataloader:
            optimizer.zero_grad()

            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs["logits"]
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()

            total_train_loss += loss.item()
            correct_train += (logits.argmax(dim = -1) == labels).sum().item()
            total_train += labels.size(0)

        train_loss = total_train_loss / len(train_dataloader)
        train_acc = correct_train / total_train

        model.eval()
        total_eval_loss = 0
        correct_eval = 0
        total_eval = 0

        with torch.no_grad():
            for batch in eval_dataloader:
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                labels = batch["labels"].to(device)

                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs["logits"]
                loss = loss_fn(logits, labels)
                total_eval_loss += loss.item()
                correct_eval += (logits.argmax(dim = -1) == labels).sum().item()
                total_eval += labels.size(0)

        eval_loss = total_eval_loss / len(eval_dataloader)
        eval_acc = correct_eval / total_eval

        print(f"Epoch {epoch+1}/{n_epochs}:")
        print(f"\tTrain Loss: {train_loss:.4f} -- Train Acc: {train_acc:.4f}")
        print(f"\tEval Loss: {eval_loss:.4f} -- Eval Acc: {eval_acc:.4f}")

    print(f"\nFinal test accuracy: {eval_acc:.4f}")
    print(f"Final test loss: {eval_loss:.4f}")

    return model, eval_loss, eval_acc

# Part 5: Full-finetuning

Now, it's time to use the `finetune_bert_model` function to perform **full fine-tuning** on the sentiment classification task.  

We will also print the number of trainable parameters involved in full fine-tuning.  

For quick experiments, we will set `n_epochs=1`. You may need to adjust the learning rate (`lr`) for optimal results.  

In [12]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
set_dropout(model, dropout_prob=0.05)
count_parameters(model)
model_ftune_head, test_loss, test_acc = finetune_bert_model(model, tokenized_train, tokenized_test, n_epochs=1, batch_size=16, lr=2e-5, wd=0.01)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Number of trainable parameters:
	 Classification head:  592130
	 Transformer backbone: 66362880
	 Total:                66955010
Epoch 1/1:
	Train Loss: 0.2427 -- Train Acc: 0.9017
	Eval Loss: 0.2065 -- Eval Acc: 0.9128

Final test accuracy: 0.9128
Final test loss: 0.2065


# Part 6: Fine-Tuning Only the Classification Head  

Now, we will fine-tune only the **classification head**, which consists of the `pre_classifier` and `classifier` layers in DistilBERT:
```
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
  )
  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
```

### **Implementation Hints:**  
1. Each model parameter has an attribute called `.requires_grad`, which is `True` by default. Setting it to `False` makes the parameter **non-trainable** (it will not be updated during training).  
2. To train only the classification head you need to:
   - Set `.requires_grad = False` for **all** model parameters.  
   - Keep `.requires_grad = True` **only** for the parameters of the `pre_classifier` and `classifier` layers (both weights and biases).  

For quick experiments, we will set `n_epochs=1`. You may need to adjust the learning rate (`lr`) for optimal results.  


In [13]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# TODO: Set the `.requires_grad` attribute of each model parameter so that only the classification head is trainable.
for param in model.parameters():
  param.requires_grad = False

for param in model.pre_classifier.parameters():
  param.requires_grad = True

for param in model.classifier.parameters():
  param.requires_grad = True

# Do the fine-tuning
set_dropout(model, dropout_prob=0.05)
count_parameters(model)
model_ftune_head, test_loss, test_acc = finetune_bert_model(model, tokenized_train, tokenized_test, n_epochs=1, batch_size=16, lr=2e-5, wd=0.01)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Number of trainable parameters:
	 Classification head:  592130
	 Transformer backbone: 0
	 Total:                592130
Epoch 1/1:
	Train Loss: 0.5097 -- Train Acc: 0.7978
	Eval Loss: 0.3988 -- Eval Acc: 0.8284

Final test accuracy: 0.8284
Final test loss: 0.3988


# Part 6: Implement LoRA fine-tuning

## Part 6.1: Implement a LoRA linear layer

#### **Objective**
In this assignment, you will implement a wrapper layer called `LinearWithLoRA` that adapts an existing linear layer using the LoRA (Low-Rank Adaptation) approach. LoRA is a technique for fine-tuning large neural networks efficiently by introducing low-rank matrices into the model's weight updates.

#### **Background**
LoRA works by freezing the weights of a pre-trained model and injecting trainable low-rank matrices into the architecture. These matrices are used to approximate the weight updates during fine-tuning, significantly reducing the number of trainable parameters.

For a linear layer with weight matrix $W \in \mathbb{R}^{m \times n}$, LoRA introduces two low-rank matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, where $r$ is the rank of the adaptation. The forward pass of the adapted layer is computed as:    
$$
\text{output} = Wx + \alpha \cdot (A \cdot B)x
$$
where $\alpha$ is a scaling factor. This factor determines the magnitude of the changes introduced by the LoRA adapter layer to the model's existing weights. A higher value of alpha means larger adjustments to the model's behavior, while a lower value results in more subtle changes.

### **Instructions**
You are given a skeleton code for the `LinearWithLoRA` class. Your task is to complete the `__init__` and `forward` methods.

#### **1. Complete the `__init__` Method**
The `__init__` method initializes the LoRA wrapper. It takes three arguments:
- `linear`: An existing `torch.nn.Linear` layer that you will adapt.
- `rank`: The rank $r$ of the low-rank matrices $A$ and $B$.
- `alpha`: A scaling factor for the LoRA adaptation.

Hints:
- You will need to create two trainable parameters, `A` and `B`, using `torch.nn.Parameter`.
- Initialize `A` with random values scaled by $\frac{1}{\sqrt{r}}$ (this helps with stability during training).
- Initialize `B` with zeros.

#### **2. Complete the `forward` Method**
The `forward` method defines how the input `x` is processed by the adapted layer. Here’s what you need to do:
- Compute the original linear output for the input `x`
- Compute the LoRA adaptation for the input `x`
- Combine the Outputs of the previous two steps.


In [31]:
class LinearWithLoRA(torch.nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        # TODO
        self.linear = linear
        self.rank = rank
        self.alpha = alpha

        self.A = torch.nn.Parameter(torch.randn(linear.out_features, rank) * (1 / (rank ** 0.5)))
        self.B = torch.nn.Parameter(torch.zeros(rank, linear.in_features))


    def forward(self, x):
        # TODO
        original_output = self.linear(x)

        lora_update = self.A @ self.B  # (out_features, in_features)
        lora_update = torch.matmul(x, lora_update.T)

        return original_output + self.alpha * lora_update

**Example** Here’s how the completed class might be used:


In [32]:
# Create a linear layer
linear_layer = torch.nn.Linear(10, 5)

# Wrap it with LoRA
lora_layer = LinearWithLoRA(linear_layer, rank=4, alpha=8)

# Forward pass
input_tensor = torch.randn(3, 10)  # Batch of 3 samples, each with 10 features
output = lora_layer(input_tensor)
print(output.shape)  # Should print torch.Size([3, 5])

torch.Size([3, 5])


## Part 6.2: Adding LoRA to DistilBERT

**Objective**
Here you will implement a function called `add_lora_to_bert_model` that replaces specific linear layers in a DistilBERT model with the `LinearWithLoRA` layer. This will allow you to fine-tune the model using the LoRA adapter that you implemented above.

**Background**
DistilBERT consists of multiple transformer layers, each containing:
1. **Self-Attention Mechanism**:
   - Query (`q_lin`), Key (`k_lin`), and Value (`v_lin`) linear layers.
   - Output projection (`out_lin`) linear layer.
2. **Feed-Forward Network (MLP)**:
   - Two linear layers (`lin1` and `lin2`).

Your task is to replace these linear layers with the `LinearWithLoRA` layer, depending on the user's input.

**Instructions**

**1. Understand the Function Arguments**
The function takes four arguments:
- `model`: A DistilBERT model
- `rank`: The rank $r$ of the low-rank matrices $A$ and $B$ in the LoRA adaptation.
- `alpha`: A scaling factor for the LoRA adaptation.
- `layers`: A list of strings specifying which linear layers to replace. Possible values are:
  - `"query"`: Replace the query linear layer (`q_lin`) in self-attention.
  - `"key"`: Replace the key linear layer (`k_lin`) in self-attention.
  - `"value"`: Replace the value linear layer (`v_lin`) in self-attention.
  - `"proj"`: Replace the output projection linear layer (`out_lin`) in self-attention.
  - `"mlp"`: Replace the feed-forward network (MLP) linear layers (`lin1` and `lin2`) in self-attention.

**2. Iterate Over Transformer Layers**
The DistilBERT model contains multiple transformer layers, which are accessible via `model.distilbert.transformer.layer`. You need to iterate over these layers and replace the specified linear layers with `LinearWithLoRA`.

**3. Replace Linear Layers**
For each transformer layer, check which layers are specified in the `layers` argument and replace the corresponding linear layers:
- If `"query"` is in `layers`, replace `layer.attention.q_lin` with `LinearWithLoRA`.
- If `"key"` is in `layers`, replace `layer.attention.k_lin` with `LinearWithLoRA`.
- If `"value"` is in `layers`, replace `layer.attention.v_lin` with `LinearWithLoRA`.
- If `"proj"` is in `layers`, replace `layer.attention.out_lin` with `LinearWithLoRA`.
- If `"mlp"` is in `layers`, replace `layer.ffn.lin1` and `layer.ffn.lin2` with `LinearWithLoRA`.

**Hints**
  - Use the `LinearWithLoRA` class to wrap the existing linear layers. Example: `layer.attention.q_lin = LinearWithLoRA(layer.attention.q_lin, rank, alpha)`.

In [33]:
def add_lora_to_bert_model(model, rank=8, alpha=16, layers=["query", "value", "mlp"]):
    # Possible entries in the layers list: ["query", "key", "value", "proj", "mlp"]
    for k in layers:
      assert k in ["query", "key", "value", "proj", "mlp"]

    # TODO: Replace linear layers with the LinearWithLora layer. The linear layers that will be replaced should depent on the arguement layers.
    for layer in model.distilbert.transformer.layer:
      if "query" in layers:
        layer.attention.q_lin = LinearWithLoRA(layer.attention.q_lin, rank, alpha)
      if "key" in layers:
        layer.attention.k_lin = LinearWithLoRA(layer.attention.k_lin, rank, alpha)
      if "value" in layers:
        layer.attention.v_lin = LinearWithLoRA(layer.attention.v_lin, rank, alpha)
      if "proj" in layers:
        layer.attention.out_lin = LinearWithLoRA(layer.attention.out_lin, rank, alpha)
      if "mlp" in layers:
        layer.ffn.lin1 = LinearWithLoRA(layer.ffn.lin1, rank, alpha)
        layer.ffn.lin2 = LinearWithLoRA(layer.ffn.lin2, rank, alpha)

**Exeample:** add LoRA adapter to the model. Print the model layers before and after adding the LoRA adapters.

In [34]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

print(model)

for param in model.parameters():
  param.requires_grad = False

add_lora_to_bert_model(model, rank=8, alpha=16, layers=["query", "value", "mlp"])

print(model)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


## 6.3: Fine-tuning using the LoRA adapters.

**Instructions**

1. **Freeze the Model Parameters**: Set `requires_grad` to `False` for all parameters of the model **except** the classification head (`pre_classifier` and `classifier`). These layers should remain unfrozen (set `requires_grad` to `True`).

2. **Add LoRA Adapters**: Use the `add_lora_to_bert_model` function to add LoRA adapters to the `query`, `value`, and `mlp` linear layers of DistilBERT.

3. **Print the Number of Trainable Parameters**: After adding LoRA adapters, print the number of trainable parameters to verify that only the LoRA parameters and the classification head are trainable.

4. **Fine-Tune the Model**: Use the `finetune_bert_model` function to fine-tune the model on the sentiment classification task.

For quick experiments, we will set n_epochs=1. You may need to adjust the learning rate (lr) for optimal results.


In [35]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# TODO: Set the `.requires_grad` attribute of each model parameter so that only the classification head is trainable.
for param in model.parameters():
  param.requires_grad = False

for param in model.pre_classifier.parameters():
  param.requires_grad = True

for param in model.classifier.parameters():
  param.requires_grad = True


# Do the fine-tuning
add_lora_to_bert_model(model, rank=8, alpha=16, layers=["query", "value", "mlp"])
set_dropout(model, dropout_prob=0.05)
print(model)

count_parameters(model)
model_lora, test_loss, test_acc = finetune_bert_model(model, tokenized_train, tokenized_test, n_epochs=1, batch_size=16, lr=2e-5, wd=0.01)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.05, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.05, inplace=False)
            (q_lin): LinearWithLoRA(
              (linear): Linear(in_features=768, out_features=768, bias=True)
            )
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): LinearWithLoRA(
              (linear): Linear(in_features=768, out_features=768, bias=True)
            )
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), e

## 6.4: Explore the Impact of `rank` and `layers` Hyperparameters**

In this section, you will analyze the impact of the `rank` and `layers` hyperparameters using the LoRA implementation you coded earlier.
#### **Tasks**: run the following experiments
1. **Explore the Impact of `rank`**:
   - Fix `layers=["query", "value", "mlp"]`.
   - Try `rank` values: `2`, `8`, and `32`.

2. **Explore the Impact of `layers`**:
   - Fix `rank=8`.
   - Use the following `layers` configurations:
     - `["query", "value"]`
     - `["query", "value", "mlp"]`
     - `["query", "value", "key", "proj", "mlp"]`.

#### **Instructions**
Before each configuration/experiment, you will have:
- Reinitialize the model using:
  ```python
  model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
  ```
- Set the `.requires_grad` attribute of each model parameter so that only the classification head is trainable.
- Add LoRA adapters using `add_lora_to_bert_model`.
- Fine-tune the model and record performance (e.g., accuracy) and the number of trainable parameters (the latter, using `count_parameters`).

#### **Report**
In the technical report that you will submit together with the colab notebook, you will have to:
- Provide the results in a table, including:
  - Performance and trainable parameters for each LoRA configuration.
  - Results for full fine-tuning and fine-tuning only the classification head (from previous tasks).
- Discuss your observations:
  - How do the number of parameters and performance compare to full fine-tuning?
  - What is the trade-off between parameter efficiency and performance?

In [36]:
# Set parameters
rank_values = [2, 8, 32]
layer_configs = [
    ["query", "value"],
    ["query", "value", "mlp"],
    ["query", "value", "key", "proj", "mlp"]
]

# Record experiment result
results = []

# Experiment 1: 'rank'
for rank in rank_values:
    # Reinitialize DistilBERT
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

    # Freeze all parameters, only train the classification head
    for param in model.parameters():
        param.requires_grad = False

    for param in model.pre_classifier.parameters():
        param.requires_grad = True

    for param in model.classifier.parameters():
        param.requires_grad = True

    # Add LoRA（fix layers）
    add_lora_to_bert_model(model, rank=rank, alpha=16, layers=["query", "value", "mlp"])

    # Computer the amount of trainable parameters
    num_trainable_params = count_parameters(model)

    # Do the fine-tuning
    model_lora, test_loss, test_acc = finetune_bert_model(
        model, tokenized_train, tokenized_test,
        n_epochs=1, batch_size=16, lr=2e-5, wd=0.01
    )

    results.append({
        "Experiment": f"Rank={rank}",
        "Trainable Params": num_trainable_params,
        "Test Loss": test_loss,
        "Test Accuracy": test_acc
    })

# Experiment 2: 'layers'
rank = 8  # Fix 'rank'
for layers in layer_configs:
    # Reinitialize DistilBERT
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

    # Freeze all parameters, only train the classification head
    for param in model.parameters():
        param.requires_grad = False

    for param in model.pre_classifier.parameters():
        param.requires_grad = True

    for param in model.classifier.parameters():
        param.requires_grad = True

    # Add LoRA（fix layers）
    add_lora_to_bert_model(model, rank=rank, alpha=16, layers=layers)

    # Computer the amount of trainable parameters
    num_trainable_params = count_parameters(model)

    # Do the fine-tuning
    model_lora, test_loss, test_acc = finetune_bert_model(
        model, tokenized_train, tokenized_test,
        n_epochs=1, batch_size=16, lr=2e-5, wd=0.01
    )

    results.append({
        "Experiment": f"Layers={layers}",
        "Trainable Params": num_trainable_params,
        "Test Loss": test_loss,
        "Test Accuracy": test_acc
    })

import pandas as pd
df_results = pd.DataFrame(results)
print(df_results)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Number of trainable parameters:
	 Classification head:  592130
	 Transformer backbone: 129024
	 Total:                721154


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/1:
	Train Loss: 0.3069 -- Train Acc: 0.8695
	Eval Loss: 0.2951 -- Eval Acc: 0.8697

Final test accuracy: 0.8697
Final test loss: 0.2951


Number of trainable parameters:
	 Classification head:  592130
	 Transformer backbone: 516096
	 Total:                1108226


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/1:
	Train Loss: 0.3076 -- Train Acc: 0.8713
	Eval Loss: 0.2585 -- Eval Acc: 0.8950

Final test accuracy: 0.8950
Final test loss: 0.2585


Number of trainable parameters:
	 Classification head:  592130
	 Transformer backbone: 2064384
	 Total:                2656514


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/1:
	Train Loss: 0.3104 -- Train Acc: 0.8692
	Eval Loss: 0.2970 -- Eval Acc: 0.8678

Final test accuracy: 0.8678
Final test loss: 0.2970


Number of trainable parameters:
	 Classification head:  592130
	 Transformer backbone: 147456
	 Total:                739586


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/1:
	Train Loss: 0.2994 -- Train Acc: 0.8700
	Eval Loss: 0.2450 -- Eval Acc: 0.8962

Final test accuracy: 0.8962
Final test loss: 0.2450


Number of trainable parameters:
	 Classification head:  592130
	 Transformer backbone: 516096
	 Total:                1108226


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/1:
	Train Loss: 0.2962 -- Train Acc: 0.8785
	Eval Loss: 0.2595 -- Eval Acc: 0.8941

Final test accuracy: 0.8941
Final test loss: 0.2595


Number of trainable parameters:
	 Classification head:  592130
	 Transformer backbone: 663552
	 Total:                1255682
Epoch 1/1:
	Train Loss: 0.3070 -- Train Acc: 0.8728
	Eval Loss: 0.2581 -- Eval Acc: 0.8928

Final test accuracy: 0.8928
Final test loss: 0.2581
                                        Experiment Trainable Params  \
0                                           Rank=2             None   
1                                           Rank=8             None   
2                                          Rank=32             None   
3                        Layers=['query', 'value']             None   
4                 Layers=['query', 'value', 'mlp']             None   
5  Layers=['query', 'value', 'key', 'proj', 'mlp']             None   

   Test Loss  Test Accuracy  
0   0.295089       0.869687  
1   0.258524       0.89500