<br>
<font>
<div dir=ltr align=center>
<img src="https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png" width=150 height=150> <br>
<font color=0F5298 size=7>
    Machine learning <br>
<font color=2565AE size=5>
    Computer Engineering Department <br>
    Fall 2024<br>
<font color=3C99D size=5>
    Practical Assignment 5 - NLP - Transformer & BERT <br>
</div>
<div dir=ltr align=center>
<font color=0CBCDF size=4>
   &#x1F349; Masoud Tahmasbi  &#x1F349;  &#x1F353; Arash Ziyaei &#x1F353;
<br>
<font color=0CBCDF size=4>
   &#x1F335; Amirhossein Akbari  &#x1F335;
</div>

____

<font color=9999FF size=4>
&#x1F388; Full Name : Sina Beyrami
<br>
<font color=9999FF size=4>
&#x1F388; Student Number : 400105433

<font color=0080FF size=3>
This notebook covers two key topics. First, we implement a transformer model from scratch and apply it to a specific task. Second, we fine-tune the BERT model using LoRA for efficient adaptation to a downstream task.
</font>
<br>

**Note:**
<br>
<font color=66B2FF size=2>In this notebook, you are free to use any function or model from PyTorch to assist with the implementation. However, TensorFlow is not permitted for this exercise. This ensures consistency and alignment with the tools being focused on.</font>
<br>
<font color=red size=3>**Run All Cells Before Submission**</font>: <font color=FF99CC size=2>Before saving and submitting your notebook, please ensure you run all cells from start to finish. This practice guarantees that your notebook is self-consistent and can be evaluated correctly by others.</font>

# Section 2: BERT and LoRA

Welcome to Section 2 of our Machine Learning assignment! I hope you've been enjoying the journey so far! 😊

 In this section, you will gain hands-on experience with [BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers) and [LoRA](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation) for text classification tasks. The section is divided into three main parts, each focusing on different aspects of NLP techniques.

## Assignment Structure

### Part 1: Data Preparation and Preprocessing
In this part, you will work with a text classification dataset. You will learn how to:
- Download and load the dataset
- Perform necessary preprocessing steps
- Implement data cleaning and transformation techniques
- Prepare the data in a format suitable for BERT training

### Part 2: Building a Small BERT Model
You will create and train a small BERT model from scratch using the Hugging Face [Transformers](https://huggingface.co/docs/transformers/en/index) library. This part will help you understand:
- The architecture of BERT
- How to configure and initialize a BERT model
- Training process and optimization
- Model evaluation and performance analysis

### Part 3: Fine-tuning with LoRA
In the final part, you will work with a pre-trained [TinyBERT](https://arxiv.org/abs/1909.10351) model and use LoRA for efficient fine-tuning. You will:
- Load a pre-trained TinyBERT model
- Implement LoRA adaptation and fine-tune the model on our classification task
- Compare the results with the previous approach

---

> **NOTE**:  
> Throughout this notebook, make an effort to include sufficient visualizations to enhance understanding:  
> - In the data processing section, display the results of your operations (e.g., show data samples or distributions after preprocessing).  
> - In the classification section, report various evaluation metrics such as accuracy, precision, recall, and F1-score to thoroughly assess your model's performance.  
> - Additionally, take a moment to compare the sizes of the models discussed in this notebook with today’s enormous models. This will help you appreciate the challenges and computational demands associated with training such massive models. 😵‍💫

---
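
The metrics listed in the note can be computed directly from the true and predicted label ids. The following is a minimal sketch in plain Python that uses toy labels rather than the complaint data; `per_class_metrics` is a hypothetical helper name. In the notebook itself, scikit-learn's `classification_report` or `precision_recall_fscore_support` would be the more usual choice.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return ({label: (precision, recall, f1)}, macro-averaged (precision, recall, f1))."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # the true class t was missed
    report = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = (precision, recall, f1)
    # Macro average: unweighted mean over classes, so rare classes count as much as common ones
    macro = tuple(sum(m[i] for m in report.values()) / len(report) for i in range(3))
    return report, macro

# Toy labels (hypothetical, not taken from the complaint data)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

report, macro = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy: {accuracy:.4f}")
for label, (p, r, f1) in report.items():
    print(f"class {label}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
print("macro (precision, recall, f1):", tuple(round(m, 4) for m in macro))
```

Because the complaint classes are heavily imbalanced, the macro average is worth reporting next to accuracy: accuracy alone can look good while small classes are predicted poorly.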


## Part 1: Data Preparation and Preprocessing
We'll be working with the [Consumer Complaint](https://catalog.data.gov/dataset/consumer-complaint-database) dataset, which contains ***complaints*** submitted by consumers about financial products and services. Our goal is to build a classifier that can automatically identify the type of complaint based on the consumer's text description. For this task, we will work with a smaller subset of the dataset, available for download through this [link](https://drive.google.com/file/d/1SpIHksR-WzruEgUjp1SQKGG8bZPnJJoN/view?usp=sharing).

In [2]:
import torch
from torch.utils.data import Dataset, DataLoader
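from transformers import BertTokenizer, BertForSequenceClassification, BertConfig
from torch.optim import AdamW  # transformers.AdamW is deprecated; PyTorch's AdamW is the supported replacement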
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
from google.colab import drive
import os
import zipfile
import re

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


### 1.2 Loading the Data

In [4]:
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
######################  TODO  ########################
######################  TODO  ########################
drive_path = "/content/drive/My Drive/ML_Projects/BERT/"
zip_file_path = os.path.join(drive_path, "complaints_small.zip")
extracted_folder = os.path.join(drive_path, "extracted_complaints")

os.makedirs(extracted_folder, exist_ok=True)
with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(extracted_folder)

csv_file = os.path.join(extracted_folder, "complaints_small.csv")

df = pd.read_csv(csv_file)

print(df.head())
######################  TODO  ########################
######################  TODO  ########################

                                             Product  \
0  Credit reporting, credit repair services, or o...   
1                                       Student loan   
2  Credit reporting or other personal consumer re...   
3  Credit reporting, credit repair services, or o...   
4  Credit reporting or other personal consumer re...   

                        Consumer complaint narrative  
0  My credit reports are inaccurate. These inaccu...  
1  Beginning in XX/XX/XXXX I had taken out studen...  
2  I am disputing a charge-off on my account that...  
3  I did not consent to, authorize, nor benefit f...  
4  I am a federally protected consumer and I am a...  


### 1.3 Data Sampling and Class Distribution Analysis

Working with large datasets can be computationally intensive during development. Additionally, imbalanced class distribution can affect model performance. In this section, you'll sample the data and analyze class distributions to make informed decisions about your training dataset.

---

We'll work with a manageable portion of the data to develop and test our approach. While using the complete dataset would likely yield better results, a smaller sample allows us to prototype our solution more efficiently.


In [6]:
######################  TODO  ########################
######################  TODO  ########################

df.rename(columns={'Consumer complaint narrative': 'text', 'Product': 'label'}, inplace=True)

# - Sample a portion of the complete dataset
sampled_df = df.sample(frac=0.15, random_state=42)

# - Display the first few rows of your sampled dataset
print(sampled_df.head())

# - Print the shape of your original and sampled datasets
print("-----------------------------------------------------------")
print(f"Original dataset shape: {df.shape}")
print(f"Sampled dataset shape: {sampled_df.shape}")

######################  TODO  ########################
######################  TODO  ########################

                                                    label  \
335123  Credit reporting or other personal consumer re...   
601718                                           Mortgage   
847752  Credit reporting, credit repair services, or o...   
765316  Credit reporting or other personal consumer re...   
798300  Credit reporting, credit repair services, or o...   

                                                     text  
335123  Upon reviewing my credit report, I have identi...  
601718  I was doing a rate check to refinance. The age...  
847752  This is my 2nd request that I have been a vict...  
765316  I'm sending this compliant to inform credit bu...  
798300  Im submitting a complaint to you today to info...  
-----------------------------------------------------------
Original dataset shape: (941128, 2)
Sampled dataset shape: (141169, 2)


---

Let's examine the distribution of ***complaint*** types in our dataset. You'll notice that some products have significantly more instances than others, and some categories are quite similar. For example:

- Multiple categories might refer to similar financial products
- Some categories might have very few examples
- Certain categories might be subcategories of others

You have two main approaches to handle this situation:

1. **Merging Similar Classes:** Identify categories that represent similar products or services and combine them to create more robust, general categories.

2. **Selecting Major Classes:** Keep only the categories with sufficient representation.



> You may choose any approach, but after this step, your data must include **at least five** distinct classes.
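
The first approach (merging similar classes) can be sketched with a simple mapping. The keys below are category names that appear in the class distribution printed in this notebook; the merged target names and the toy counts are illustrative assumptions, and `merge_labels` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical merge map: keys are raw category names, values are the merged names
MERGE_MAP = {
    "Credit reporting, credit repair services, or other personal consumer reports": "Credit reporting",
    "Credit reporting or other personal consumer reports": "Credit reporting",
    "Credit card or prepaid card": "Credit card",
}

def merge_labels(labels, merge_map):
    """Replace each label by its merged name; labels without a mapping pass through unchanged."""
    return [merge_map.get(label, label) for label in labels]

# Toy label list (counts are illustrative, not the real distribution)
raw_labels = (
    ["Credit reporting, credit repair services, or other personal consumer reports"] * 3
    + ["Credit reporting or other personal consumer reports"] * 2
    + ["Credit reporting"]
    + ["Credit card or prepaid card"] * 2
    + ["Credit card"]
    + ["Mortgage"] * 2
)

merged_counts = Counter(merge_labels(raw_labels, MERGE_MAP))
print(merged_counts.most_common())
```

On a DataFrame, the same mapping could be applied with `sampled_df['label'].replace(MERGE_MAP)`. Merging before filtering keeps more training examples than dropping the near-duplicate categories would.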



In [7]:
######################  TODO  ########################
######################  TODO  ########################

# - Display the number of complaints in each product category
class_distribution = sampled_df['label'].value_counts()
print("Class distribution:\n", class_distribution)
print("-----------------------------------------------------------")

# - Identify which classes are under-represented
underrepresented_classes = class_distribution[class_distribution < 50]
print("Underrepresented classes:\n", underrepresented_classes)
print("-----------------------------------------------------------")

# - Handle class imbalance by choosing and implementing one of these approaches:
#   1. Merge similar product categories (e.g., combining related categories)
#   2. Keep only the major classes with sufficient examples

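# I chose approach 2: keep only the classes with at least 100 examples.
# Note: `balanced_df` is only printed here; the following cells keep working with `sampled_df`.

major_classes = class_distribution[class_distribution >= 100].index
balanced_df = sampled_df[sampled_df['label'].isin(major_classes)]

print("Balanced dataset class distribution:\n", balanced_df['label'].value_counts())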

######################  TODO  ########################
######################  TODO  ########################

Class distribution:
 label
Credit reporting, credit repair services, or other personal consumer reports    48565
Credit reporting or other personal consumer reports                             37593
Debt collection                                                                 17550
Mortgage                                                                         7345
Checking or savings account                                                      6714
Credit card or prepaid card                                                      6485
Credit card                                                                      3801
Student loan                                                                     2838
Money transfer, virtual currency, or money service                               2704
Vehicle loan or lease                                                            2108
Credit reporting                                                                 1891
Payday loan, title loan, or

---
### 1.4 Data Encoding and Text Preprocessing

Before training our model, we need to prepare both our target labels and text data. This involves converting categorical labels into numerical format and cleaning our text data to improve model performance.
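
To make the label conversion concrete, here is a small sketch of what label encoding does, written in plain Python. It mirrors the behaviour of scikit-learn's `LabelEncoder`, which assigns integer ids in sorted order of the class names; `fit_label_encoder` and the toy label list are hypothetical.

```python
def fit_label_encoder(labels):
    """Return (classes, to_id): sorted unique class names and a name -> integer id mapping."""
    classes = sorted(set(labels))
    to_id = {name: idx for idx, name in enumerate(classes)}
    return classes, to_id

# Toy labels (a few category names taken from the printed class distribution)
labels = ["Mortgage", "Debt collection", "Student loan", "Mortgage"]

classes, to_id = fit_label_encoder(labels)
encoded = [to_id[name] for name in labels]   # names -> ids, as fit_transform would do
decoded = [classes[idx] for idx in encoded]  # ids -> names, as inverse_transform would do

print(classes)
print(encoded)
print(decoded)
```

Keeping the fitted encoder (or the `classes` list) around matters later: it is the only way to turn predicted ids back into readable category names when reporting per-class results.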

In [8]:
######################  TODO  ########################
######################  TODO  ########################

# Label Encoding
# - Apply label encoding to convert product categories into numeric values
label_encoder = LabelEncoder()
sampled_df['label'] = label_encoder.fit_transform(sampled_df['label'])

# Text Preprocessing
# Choose and implement preprocessing steps that you think will improve the quality of your text data.
# Here are some suggestions:

# - Remove special characters and punctuation
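# - Remove very short complaints (e.g., fewer than 10 words)
# - Remove HTML tags if present

def clean_text(text):
    text = re.sub(r'<[^>]+>', '', text)       # strip HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)   # keep only letters and whitespace (digits and punctuation are dropped)
    text = re.sub(r'\s+', ' ', text).strip()  # collapse repeated whitespace
    return text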

sampled_df['text'] = sampled_df['text'].apply(clean_text)

sampled_df = sampled_df[sampled_df['text'].apply(lambda x: len(x.split()) >= 10)]

print(sampled_df.head())

print(f"Dataset shape after preprocessing: {sampled_df.shape}")

######################  TODO  ########################
######################  TODO  ########################

        label                                               text
335123      6  Upon reviewing my credit report I have identif...
601718     12  I was doing a rate check to refinance The agen...
847752      7  This is my nd request that I have been a victi...
765316      6  Im sending this compliant to inform credit bur...
798300      7  Im submitting a complaint to you today to info...
Dataset shape after preprocessing: (139922, 2)


### 1.5 Dataset Creation and Tokenization

For training our BERT model, we need to:
1. Create a custom Dataset class that will handle tokenization
2. Split the data into training and testing sets
3. Use BERT's tokenizer to convert text into a format suitable for the model
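
The third step produces fixed-length `input_ids` together with an `attention_mask`. The sketch below imitates only that padding and truncation logic in plain Python; it is not the real tokenizer. The special-token ids 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) are the ones used by `bert-base-uncased`, while the other token ids and the helper name `pad_or_truncate` are made up for illustration.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids of bert-base-uncased

def pad_or_truncate(token_ids, max_len):
    """Wrap token ids in [CLS] ... [SEP], then truncate or pad to exactly max_len."""
    body = token_ids[: max_len - 2]          # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 = real token
    n_pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad            # 0 = padding, ignored by attention
    return input_ids, attention_mask

# A short sequence gets padded, a long one gets truncated (token ids are arbitrary examples)
short_ids, short_mask = pad_or_truncate([2023, 2003], max_len=6)
long_ids, long_mask = pad_or_truncate([2023, 2003, 1037, 2146, 7615, 2182], max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

This is why padded rows in a batch end in zeros with a matching zero mask, while truncated rows end in 102 with a mask of all ones.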

In [9]:
######################  TODO  ########################
######################  TODO  ########################

class ComplaintDataset(Dataset):
    """A custom Dataset class for handling consumer complaints text data with BERT tokenization.

    Parameters:
        texts (List[str]): List of complaint texts to be processed
        labels (List[int]): List of encoded labels corresponding to each text
        tokenizer (BertTokenizer): A BERT tokenizer instance for text processing
        max_len (int, optional): Maximum length for padding/truncating texts. Defaults to 512

    Returns:
        dict: For each item, returns a dictionary containing:
            - input_ids (torch.Tensor): Encoded token ids of the text
            - attention_mask (torch.Tensor): Attention mask for the padded sequence
            - labels (torch.Tensor): Encoded label as a tensor
    """
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors="pt"
        )

        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.long)
        }

######################  TODO  ########################
######################  TODO  ########################

In [10]:
######################  TODO  ########################
######################  TODO  ########################

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    sampled_df["text"].tolist(),
    sampled_df["label"].tolist(),
    test_size=0.2,
    random_state=42
)

# Initialize tokenizer and create datasets
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_dataset = ComplaintDataset(X_train, y_train, tokenizer, max_len=128)
test_dataset = ComplaintDataset(X_test, y_test, tokenizer, max_len=128)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

######################  TODO  ########################
######################  TODO  ########################

# Checking the first batch
for batch in train_loader:
    print(batch)
    break

{'input_ids': tensor([[  101,  2023, 12087,  ...,  2441,  1996,   102],
        [  101, 19045,  4769,  ...,     0,     0,     0],
        [  101, 22038, 20348,  ...,  1037,  7016,   102],
        ...,
        [  101,  3395, 24641,  ...,  3450,  2104,   102],
        [  101,  1045,  3728,  ...,     0,     0,     0],
        [  101,  1045,  2572,  ...,  9837, 16396,   102]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([ 7,  6,  8,  6,  8,  6,  3,  7,  7,  6, 15,  8,  8,  6,  6,  8])}


## Part 2: Training a Small-Size BERT Model

In this part, we will explore how to build and train a small-sized BERT model for our classification task. Instead of using the full-sized BERT model, which is computationally expensive, we will create a smaller version using the Transformers library.

In [13]:
######################  TODO  ########################
######################  TODO  ########################

# 1. Define your BERT model for sequence classification
#    Ensure that you set up the configuration properly (e.g., specify the number of output labels).
num_labels = len(sampled_df['label'].unique())
model_config = BertConfig.from_pretrained("bert-base-uncased", num_labels=num_labels)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", config=model_config)
model.to(device)

# 2. Print the total number of trainable parameters in the model to understand its size.
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")

######################  TODO  ########################
######################  TODO  ########################

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Total parameters: 109498389
Trainable parameters: 109498389


---
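
The parameter count printed above can be reproduced by hand, which also shows how the model size depends on the configuration. The sketch below assumes the standard BERT layout (embeddings, encoder layers, pooler, linear classifier) and the published `bert-base-uncased` dimensions. The value `num_labels = 21` is inferred from the printed total rather than stated in the notebook, and the smaller configuration is purely hypothetical.

```python
def bert_classifier_param_count(vocab_size, hidden, layers, intermediate,
                                num_labels, max_positions=512, type_vocab=2):
    """Count the parameters of a BERT encoder with pooler and a linear classification head."""
    layer_norm = 2 * hidden                                       # weight + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm       # Q, K, V, output projection
    feed_forward = ((hidden * intermediate + intermediate)        # up-projection
                    + (intermediate * hidden + hidden)            # down-projection
                    + layer_norm)
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier

# bert-base-uncased dimensions; num_labels=21 is an inference from the printed total
base_total = bert_classifier_param_count(
    vocab_size=30522, hidden=768, layers=12, intermediate=3072, num_labels=21)

# A hypothetical smaller configuration, for comparison only
small_total = bert_classifier_param_count(
    vocab_size=30522, hidden=256, layers=4, intermediate=1024, num_labels=21)

print(f"bert-base-sized classifier: {base_total:,}")
print(f"smaller configuration:      {small_total:,}")
```

Under these assumptions the first number matches the 109,498,389 parameters reported by the cell above. In the smaller configuration most of the remaining parameters sit in the token embeddings, so shrinking the hidden size reduces the model far more than removing layers alone.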

Now that you have defined your model, it's time to train it!☠️

Training a model of this size can take some time, depending on the available resources. To manage this, you can train your model for just **2–3 epochs** to demonstrate progress. Here are some hints:
- **Training Metrics:** Ensure you print enough metrics, such as loss and accuracy, to track the training progress.
- **Interactive Monitoring:** Use the `tqdm` library to display the progress of your training loop in real-time.

In [14]:
######################  TODO  ########################
######################  TODO  ########################

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3

# Training loop
for epoch in range(num_epochs):

    print(f"Epoch {epoch + 1}/{num_epochs}")

    model.train()

    total_loss = 0
    correct = 0
    total = 0

    for batch in tqdm(train_loader):
        optimizer.zero_grad()

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        # TODO: Perform backpropagation and update the optimizer. Hint: Use outputs.loss to access the model's loss.
        loss = outputs.loss
        logits = outputs.logits
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        preds = torch.argmax(logits, dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

    # TODO: Monitor the training process by reporting metrics such as loss and accuracy.
    epoch_loss = total_loss / len(train_loader)
    epoch_accuracy = correct / total
    print(f"Epoch {epoch + 1} - Loss: {epoch_loss:.4f}, Accuracy: {epoch_accuracy:.4f}")

######################  TODO  ########################
######################  TODO  ########################

Epoch 1/3


100%|██████████| 6997/6997 [47:36<00:00,  2.45it/s]


Epoch 1 - Loss: 0.9208, Accuracy: 0.6788
Epoch 2/3


100%|██████████| 6997/6997 [47:44<00:00,  2.44it/s]


Epoch 2 - Loss: 0.7439, Accuracy: 0.7375
Epoch 3/3


100%|██████████| 6997/6997 [47:41<00:00,  2.45it/s]

Epoch 3 - Loss: 0.6690, Accuracy: 0.7602





In [15]:
# TODO : Evaluate the model on test dataset
model.eval()
correct = 0
total = 0
total_loss = 0

with torch.no_grad():
    for batch in tqdm(test_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        total_loss += loss.item()
        preds = torch.argmax(logits, dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

test_loss = total_loss / len(test_loader)
test_accuracy = correct / total
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")

100%|██████████| 1750/1750 [04:47<00:00,  6.08it/s]

Test Loss: 0.7470, Test Accuracy: 0.7390





In [16]:
output_dir = "/content/bert_model"

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model and tokenizer saved to {output_dir}")


Model and tokenizer saved to /content/bert_model


In [17]:
import shutil

source_folder = "/content/bert_model"
destination_folder = "/content/drive/My Drive/ML_Projects/BERT/Model"

shutil.copytree(source_folder, destination_folder, dirs_exist_ok=True)

print(f"Model files copied to: {destination_folder}")


Model files copied to: /content/drive/My Drive/ML_Projects/BERT/Model


## Part 3: Fine-Tuning TinyBERT with LoRA

As you have experienced, training even a small-sized BERT model can be computationally intensive and time-consuming. To address these challenges, we explore **Parameter-Efficient Fine-Tuning (PEFT)** methods, which allow us to utilize the power of large pretrained models without requiring extensive resources.

---

### **Parameter-Efficient Fine-Tuning (PEFT)**

PEFT methods focus on fine-tuning only a small portion of the model’s parameters while keeping most of the pretrained weights frozen. This drastically reduces the computational and storage requirements while leveraging the rich knowledge embedded in pretrained models.

One popular PEFT method is LoRA (Low-Rank Adaptation).

- **What is LoRA?**

LoRA introduces a mechanism to fine-tune large language models by injecting small low-rank matrices into the model's architecture. Instead of updating all parameters during training, LoRA trains these small matrices while keeping the majority of the original parameters frozen.  This is achieved as follows:

1. **Frozen Weights**: The pretrained weights of the model, represented as a weight matrix $ W \in \mathbb{R}^{d \times k} $, remain **frozen** during fine-tuning.

2. **Low-Rank Decomposition**:
   Instead of directly updating $ W $, LoRA introduces two trainable matrices, $ A \in \mathbb{R}^{d \times r} $ and $ B \in \mathbb{R}^{r \times k} $, where $ r \ll \min(d, k) $.  
   These matrices approximate the update to $ W $ as:
   $$
   \Delta W = A \cdot B
   $$

   Here, $ r $, the rank of the decomposition, is a key hyperparameter that determines the trade-off between computational cost and model capacity.

3. **Adaptation**:
   During training, instead of updating $ W $, the adapted weight is:
   $$
   W' = W + \Delta W = W + A \cdot B
   $$
   Only the low-rank matrices $ A $ and $ B $ are optimized, while $ W $ remains fixed.

4. **Efficiency**:
   Since $ r $ is much smaller than $ d $ and $ k $, the number of trainable parameters in $ A $ and $ B $ is significantly less than in $ W $. This makes the approach highly efficient both in terms of computation and memory.

---
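
The efficiency argument can be checked with a little arithmetic: a full update of $W$ has $d \cdot k$ entries, while $A$ and $B$ together have $r(d + k)$. The sketch below applies this to the setup used later in this notebook. The hidden size (128) and layer count (2) of `prajjwal1/bert-tiny` are taken from its published configuration, `num_labels = 21` is inferred from the printed parameter counts, and the classification head is assumed to stay trainable alongside the adapters; treat all of these as assumptions.

```python
def lora_params(d, k, r):
    """Trainable parameters added by one LoRA pair: A is (d x r), B is (r x k)."""
    return r * (d + k)

d = k = 128            # assumed hidden size of prajjwal1/bert-tiny
r = 8                  # rank used in the LoraConfig of this notebook
num_layers = 2         # assumed number of encoder layers in bert-tiny
targets_per_layer = 2  # LoRA is applied to "query" and "value"
num_labels = 21        # assumed: inferred from the printed counts

full_update = d * k                                                # entries in a full update of one W
low_rank_update = lora_params(d, k, r)                             # entries in A and B together
adapter_total = num_layers * targets_per_layer * low_rank_update   # all LoRA matrices in the model
classifier_head = d * num_labels + num_labels                      # weight + bias of the linear head
trainable = adapter_total + classifier_head

print(f"full update per matrix:     {full_update}")
print(f"low-rank update per matrix: {low_rank_update}")
print(f"all LoRA adapters:          {adapter_total}")
print(f"adapters + classifier head: {trainable}")
```

Under these assumptions the last number equals the 10,901 trainable parameters reported further down. The saving per matrix is modest here because $d$ and $k$ are small; for larger hidden sizes the ratio $r(d + k) / (d \cdot k)$ shrinks quickly.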

###  **Fine-Tuning TinyBERT**

For this part, we will fine-tune **TinyBERT**, a distilled version of BERT, using the LoRA method.

- **What is TinyBERT?**

TinyBERT is a lightweight version of the original BERT model created through knowledge distillation. It significantly reduces the model size and inference latency while preserving much of the original BERT’s effectiveness. Here are some key characteristics of TinyBERT:
- It is designed to be more resource-efficient for tasks such as classification, question answering, and more.
- TinyBERT retains a compact structure with fewer layers and parameters, making it ideal for fine-tuning with limited computational resources.


> Similar to the previous section, training this model might take some time. Given the resource limitations, you can train the model for just **2-3 epochs** to demonstrate the process.


In [18]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np
from tqdm import tqdm

In [19]:
######################  TODO  ########################
######################  TODO  ########################

# Load a pre-trained tiny BERT checkpoint (prajjwal1/bert-tiny: 2 encoder layers, hidden size 128)
model_name = "prajjwal1/bert-tiny"
base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(sampled_df['label'].unique()))
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define LoRA Configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"]
)

######################  TODO  ########################
######################  TODO  ########################

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
######################  TODO  ########################
######################  TODO  ########################

# Apply LoRA to model
lora_model = get_peft_model(base_model, lora_config)
lora_model.to(device)

# TODO: Show the number of trainable parameters
total_params = sum(p.numel() for p in lora_model.parameters())
trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params}")
print(f"Trainable parameters with LoRA: {trainable_params}")

# Training configuration
optimizer = AdamW(lora_model.parameters(), lr=5e-5)
criterion = nn.CrossEntropyLoss()  # not used below: the model returns its own loss when `labels` is passed

######################  TODO  ########################
######################  TODO  ########################

Total parameters: 4399530
Trainable parameters with LoRA: 10901




In [23]:
######################  TODO  ########################
######################  TODO  ########################

num_epochs = 3

# Training loop
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    lora_model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch in tqdm(train_loader):
        optimizer.zero_grad()

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = lora_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        logits = outputs.logits

        # TODO: Perform backpropagation and update the optimizer. Hint: Use outputs.loss to access the model's loss.
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        preds = torch.argmax(logits, dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

    # TODO: Monitor the training process by reporting metrics such as loss and accuracy.
    epoch_loss = total_loss / len(train_loader)
    epoch_accuracy = correct / total
    print(f"Epoch {epoch + 1} - Loss: {epoch_loss:.4f}, Accuracy: {epoch_accuracy:.4f}")

######################  TODO  ########################
######################  TODO  ########################

Epoch 1/3


100%|██████████| 6997/6997 [06:56<00:00, 16.78it/s]


Epoch 1 - Loss: 1.7120, Accuracy: 0.3908
Epoch 2/3


100%|██████████| 6997/6997 [06:52<00:00, 16.94it/s]


Epoch 2 - Loss: 1.4634, Accuracy: 0.4726
Epoch 3/3


100%|██████████| 6997/6997 [06:52<00:00, 16.96it/s]

Epoch 3 - Loss: 1.3551, Accuracy: 0.5111





In [24]:
# TODO : Evaluate the model on test dataset

lora_model.eval()
correct = 0
total = 0
total_loss = 0

with torch.no_grad():
    for batch in tqdm(test_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = lora_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        total_loss += loss.item()
        preds = torch.argmax(logits, dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

print("---------------------------------------------------------------------------")
test_loss = total_loss / len(test_loader)
test_accuracy = correct / total
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")

100%|██████████| 1750/1750 [01:34<00:00, 18.46it/s]

---------------------------------------------------------------------------
Test Loss: 1.2584, Test Accuracy: 0.5461





In [25]:
lora_model_path = "/content/lora_model"

if not os.path.exists(lora_model_path):
    os.makedirs(lora_model_path)

lora_model.save_pretrained(lora_model_path)
tokenizer.save_pretrained(lora_model_path)

print(f"LoRA model and tokenizer saved to: {lora_model_path}")


LoRA model and tokenizer saved to: /content/lora_model


In [26]:
drive_path = "/content/drive/My Drive/ML_Projects/BERT/LoRA_Model"

shutil.copytree(lora_model_path, drive_path, dirs_exist_ok=True)

print(f"LoRA model copied to Google Drive at: {drive_path}")


LoRA model copied to Google Drive at: /content/drive/My Drive/ML_Projects/BERT/LoRA_Model


In [27]:
drive_folder_path = "/content/drive/My Drive/ML_Projects/BERT/"
os.listdir(drive_folder_path)

['complaints_small.zip', 'extracted_complaints', 'Model', 'LoRA_Model']

In [28]:
drive_folder_path = "/content/drive/My Drive/ML_Projects/BERT/Model/"
os.listdir(drive_folder_path)

['vocab.txt',
 'config.json',
 'tokenizer_config.json',
 'model.safetensors',
 'special_tokens_map.json']

In [29]:
drive_folder_path = "/content/drive/My Drive/ML_Projects/BERT/LoRA_Model/"
os.listdir(drive_folder_path)

['vocab.txt',
 'adapter_model.safetensors',
 'README.md',
 'adapter_config.json',
 'tokenizer.json',
 'tokenizer_config.json',
 'special_tokens_map.json']

**Note for grader**

I implemented this exercise in Colab. Because I trained for only a few epochs, the accuracy of my model remained low. In addition, my Colab usage quota expired, so I could not increase the learning rate and retrain the model. The LoRA fine-tuning run was affected in the same way, since its number of epochs was also limited.

However, as mentioned in the instructions, the number of epochs was kept low to demonstrate the training process rather than to achieve high accuracy. Therefore, I hope the lower accuracy on the training and test datasets does not reduce the assignment's grade.