# <font color="#0b486b">  Student Information</font>
***
Surname: **Zhang**  <br/>
Firstname: **Yiming**    <br/>
Student ID: **35224436**    <br/>
Email: **yzha1213@student.monash.edu**    <br/>
Your tutorial time: **12pm Wed**    <br/>
***

## Section 2: Deep Learning for Sequential Data

### <font color="#0b486b">Set random seeds</font>

We need to install the package datasets for creating BERT datasets.

In [None]:
# !pip install datasets
!pip install datasets==4.0.0
!pip install transformers==4.57.0

We start with importing PyTorch and NumPy and setting random seeds for PyTorch and NumPy. You can use any seeds you prefer.

In [31]:
import os
import torch
import random
import requests
import pandas as pd
import numpy as np
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from transformers import BertTokenizer
import os
from six.moves.urllib.request import urlretrieve
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [32]:
def seed_all(seed=1029):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
seed_all(seed=1234)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## <font color="#0b486b">Download and preprocess the data</font>

<div style="text-align: right"><font color="red; font-weight:bold"><span></div>

The dataset we use for this assignment is a question classification dataset for which the training set consists of $5,500$ questions belonging to 6 coarse question categories including:
- abbreviation (ABBR),
- entity (ENTY),
- description (DESC),
- human (HUM),
- location (LOC) and
- numeric (NUM).

In this assignment, we will utilize a subset of this dataset, containing $2,000$ questions for training and validation. We will use 80% of those 2000 questions for trainning and the rest for validation.


Preprocessing data is a crucial initial step in any machine learning or deep learning project. The *TextDataManager* class simplifies the process by providing functionalities to download and preprocess data specifically designed for the subsequent questions in this assignment. It is highly recommended to gain a comprehensive understanding of the class's functionality by **carefully reading** the content provided in the *TextDataManager* class before proceeding to answer the questions.

In [33]:
class DataManager:
    """
    This class manages and preprocesses a simple text dataset for a sentence classification task.

    Attributes:
        verbose (bool): Controls verbosity for printing information during data processing.
        max_sentence_len (int): The maximum length of a sentence in the dataset.
        str_questions (list): A list to store the string representations of the questions in the dataset.
        str_labels (list): A list to store the string representations of the labels in the dataset.
        numeral_labels (list): A list to store the numerical representations of the labels in the dataset.
        maxlen (int): Maximum length for padding sequences. Sequences longer than this length will be truncated,
            and sequences shorter than this length will be padded with zeros. Defaults to 50.
        numeral_data (list): A list to store the numerical representations of the questions in the dataset.
        random_state (int): Seed value for random number generation to ensure reproducibility.
            Set this value to a specific integer to reproduce the same random sequence every time. Defaults to 6789.
        random (np.random.RandomState): Random number generator object initialized with the given random_state.
            It is used for various random operations in the class.

    Methods:
        maybe_download(dir_name, file_name, url, verbose=True):
            Downloads a file from a given URL if it does not exist in the specified directory.
            The directory and file are created if they do not exist.

        read_data(dir_name, file_names):
            Reads data from files in a directory, preprocesses it, and computes the maximum sentence length.
            Each file is expected to contain rows in the format "<label>:<question>".
            The labels and questions are stored as string representations.

        manipulate_data():
            Performs data manipulation by tokenizing, numericalizing, and padding the text data.
            The questions are tokenized and converted into numerical sequences using a tokenizer.
            The sequences are padded or truncated to the maximum sequence length.

        train_valid_test_split(train_ratio=0.9):
            Splits the data into training, validation, and test sets based on a given ratio.
            The data is randomly shuffled, and the specified ratio is used to determine the size of the training set.
            The string questions, numerical data, and numerical labels are split accordingly.
            TensorFlow `Dataset` objects are created for the training and validation sets.


    """

    def __init__(self, verbose=True, random_state=6789):
        self.verbose = verbose
        self.max_sentence_len = 0
        self.str_questions = list()
        self.str_labels = list()
        self.numeral_labels = list()
        self.numeral_data = list()
        self.random_state = random_state
        self.random = np.random.RandomState(random_state)

    @staticmethod
    def maybe_download(dir_name, file_name, url, verbose=True):
        if not os.path.exists(dir_name):
            os.mkdir(dir_name)
        if not os.path.exists(os.path.join(dir_name, file_name)):
            urlretrieve(url + file_name, os.path.join(dir_name, file_name))
        if verbose:
            print("Downloaded successfully {}".format(file_name))

    def read_data(self, dir_name, file_names):
        self.str_questions = list()
        self.str_labels = list()
        for file_name in file_names:
            file_path= os.path.join(dir_name, file_name)
            with open(file_path, "r", encoding="latin-1") as f:
                for row in f:
                    row_str = row.split(":")
                    label, question = row_str[0], row_str[1]
                    question = question.lower()
                    self.str_labels.append(label)
                    self.str_questions.append(question[0:-1])
                    if self.max_sentence_len < len(self.str_questions[-1]):
                        self.max_sentence_len = len(self.str_questions[-1])

        # turns labels into numbers
        le = preprocessing.LabelEncoder()
        le.fit(self.str_labels)
        self.numeral_labels = np.array(le.transform(self.str_labels))
        self.str_classes = le.classes_
        self.num_classes = len(self.str_classes)
        if self.verbose:
            print("\nSample questions and corresponding labels... \n")
            print(self.str_questions[0:5])
            print(self.str_labels[0:5])

    def manipulate_data(self):
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        vocab = self.tokenizer.get_vocab()
        self.word2idx = {w: i for i, w in enumerate(vocab)}
        self.idx2word = {i:w for w,i in self.word2idx.items()}
        self.vocab_size = len(self.word2idx)

        token_ids = []
        num_seqs = []
        for text in self.str_questions:  # iterate over the list of text
          text_seqs = self.tokenizer.tokenize(str(text))  # tokenize each text individually
          # Convert tokens to IDs
          token_ids = self.tokenizer.convert_tokens_to_ids(text_seqs)
          # Convert token IDs to a tensor of indices using your word2idx mapping
          seq_tensor = torch.LongTensor(token_ids)
          num_seqs.append(seq_tensor)  # append the tensor for each sequence

        # Pad the sequences and create a tensor
        if num_seqs:
          self.numeral_data = pad_sequence(num_seqs, batch_first=True)  # Pads to max length of the sequences
          self.num_sentences, self.max_seq_len = self.numeral_data.shape

    def train_valid_test_split(self, train_ratio=0.8, test_ratio = 0.1):
        train_size = int(self.num_sentences*train_ratio) +1
        test_size = int(self.num_sentences*test_ratio) +1
        valid_size = self.num_sentences - (train_size + test_size)
        data_indices = list(range(self.num_sentences))
        random.shuffle(data_indices)
        self.train_str_questions = [self.str_questions[i] for i in data_indices[:train_size]]
        self.train_numeral_labels = self.numeral_labels[data_indices[:train_size]]
        train_set_data = self.numeral_data[data_indices[:train_size]]
        train_set_labels = self.numeral_labels[data_indices[:train_size]]
        train_set_labels = torch.from_numpy(train_set_labels)
        train_set = torch.utils.data.TensorDataset(train_set_data, train_set_labels)
        self.test_str_questions = [self.str_questions[i] for i in data_indices[-test_size:]]
        self.test_numeral_labels = self.numeral_labels[data_indices[-test_size:]]
        test_set_data = self.numeral_data[data_indices[-test_size:]]
        test_set_labels = self.numeral_labels[data_indices[-test_size:]]
        test_set_labels = torch.from_numpy(test_set_labels)
        test_set = torch.utils.data.TensorDataset(test_set_data, test_set_labels)
        self.valid_str_questions = [self.str_questions[i] for i in data_indices[train_size:-test_size]]
        self.valid_numeral_labels = self.numeral_labels[data_indices[train_size:-test_size]]
        valid_set_data = self.numeral_data[data_indices[train_size:-test_size]]
        valid_set_labels = self.numeral_labels[data_indices[train_size:-test_size]]
        valid_set_labels = torch.from_numpy(valid_set_labels)
        valid_set = torch.utils.data.TensorDataset(valid_set_data, valid_set_labels)
        self.train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
        self.test_loader = DataLoader(test_set, batch_size=64, shuffle=False)
        self.valid_loader = DataLoader(valid_set, batch_size=64, shuffle=False)

In [34]:
print('Loading data...')
DataManager.maybe_download("data", "train_2000.label", "http://cogcomp.org/Data/QA/QC/")

dm = DataManager()
dm.read_data("data/", ["train_2000.label"])

Loading data...
Downloaded successfully train_2000.label

Sample questions and corresponding labels... 

['manner how did serfdom develop in and then leave russia ?', 'cremat what films featured the character popeye doyle ?', "manner how can i find a list of celebrities ' real names ?", 'animal what fowl grabs the spotlight after the chinese year of the monkey ?', 'exp what is the full form of .com ?']
['DESC', 'ENTY', 'DESC', 'ENTY', 'ABBR']


In [35]:
dm.manipulate_data()
dm.train_valid_test_split(train_ratio=0.8, test_ratio = 0.1)

In [36]:
for x, y in dm.train_loader:
    print(x.shape, y.shape)
    break

torch.Size([64, 36]) torch.Size([64])


We now declare the `BaseTrainer` class, which will be used later to train the subsequent deep learning models for text data.

In [37]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class BaseTrainer:
    def __init__(self, model, criterion, optimizer, train_loader, val_loader):
        self.model = model
        self.criterion = criterion  #the loss function
        self.optimizer = optimizer  #the optimizer
        self.train_loader = train_loader  #the train loader
        self.val_loader = val_loader  #the valid loader

    #the function to train the model in many epochs
    def fit(self, num_epochs):
        self.num_batches = len(self.train_loader)

        for epoch in range(num_epochs):
            print(f'Epoch {epoch + 1}/{num_epochs}')
            train_loss, train_accuracy = self.train_one_epoch()
            val_loss, val_accuracy = self.validate_one_epoch()
            print(
                f'{self.num_batches}/{self.num_batches} - train_loss: {train_loss:.4f} - train_accuracy: {train_accuracy*100:.4f}% \
                - val_loss: {val_loss:.4f} - val_accuracy: {val_accuracy*100:.4f}%')

    #train in one epoch, return the train_acc, train_loss
    def train_one_epoch(self):
        self.model.train()
        running_loss, correct, total = 0.0, 0, 0
        for i, data in enumerate(self.train_loader):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
            self.optimizer.zero_grad()
            outputs = self.model(inputs)
            loss = self.criterion(outputs, labels)
            loss.backward()
            self.optimizer.step()

            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        train_accuracy = correct / total
        train_loss = running_loss / self.num_batches
        return train_loss, train_accuracy

    #evaluate on a loader and return the loss and accuracy
    def evaluate(self, loader):
        self.model.eval()
        loss, correct, total = 0.0, 0, 0
        with torch.no_grad():
            for data in loader:
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)
                loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = correct / total
        loss = loss / len(self.val_loader)
        return loss, accuracy

    #return the val_acc, val_loss, be called at the end of each epoch
    def validate_one_epoch(self):
      val_loss, val_accuracy = self.evaluate(self.val_loader)
      return val_loss, val_accuracy

## <font color="#0b486b">Part 4: Transformer-based models for sequence modeling and neural embedding</font>

<div style="text-align: right"><font color="red; font-weight:bold">[Total marks for this part: 30 marks]<span></div>

#### <font color="red">**Question 4.1**</font>

**Implement the multi-head attention module of the Transformer for the text classification problem. The provided code is from our tutorial. In this part, we only use the output of the Transformer encoder for the classification task. For further information on the Transformer model, refer to [this paper](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).**

<div style="text-align: right"><font color="red; font-weight:bold">[Total marks for this part: 10 marks]<span></div>


Below is the code of `MultiHeadSelfAttention`, `PositionWiseFeedForward`, `PositionalEncoding`, and `EncoderLayer`.

In [38]:
import math


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        # Ensure that the model dimension (d_model) is divisible by the number of heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        # Initialize dimensions
        self.d_model = d_model # Model's dimension
        self.num_heads = num_heads # Number of attention heads
        self.d_k = d_model // num_heads # Dimension of each head's key, query, and value

        # Linear layers for transforming inputs
        self.W_q = nn.Linear(d_model, d_model) # Query transformation
        self.W_k = nn.Linear(d_model, d_model) # Key transformation
        self.W_v = nn.Linear(d_model, d_model) # Value transformation
        self.W_o = nn.Linear(d_model, d_model) # Output transformation

    def scaled_dot_product_attention(self, Q, K, V):
        # Calculate attention scores
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Apply mask if provided (useful for preventing attention to certain parts like padding)
        #if mask is not None:
            #attn_scores = attn_scores.masked_fill(mask == 0, -1e9)

        # Softmax is applied to obtain attention probabilities
        attn_probs = torch.softmax(attn_scores, dim=-1)

        # Multiply by values to obtain the final output
        output = torch.matmul(attn_probs, V)
        return output

    def split_heads(self, x):
        # Reshape the input to have num_heads for multi-head attention
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        # Combine the multiple heads back to original shape
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q, K, V):
        # Apply linear transformations and split heads
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        # Perform scaled dot-product attention
        attn_output = self.scaled_dot_product_attention(Q, K, V)

        # Combine heads and apply output transformation
        output = self.W_o(self.combine_heads(attn_output))
        return output

In [39]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

In [40]:
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

In [41]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        attn_output = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

Your task is to develop `TransformerClassifier` in which we input the embedding with the shape `[batch_size, seq_len, embed_dim]` to some `EncoderLayer` (i.e., num_layers specifies the number of EncoderLayer) and then compute the average of all token embeddings (i.e., `[batch_size, seq_len, embed_dim]`) across the `seq_len`. Finally, on the top of this average embedding, we build up a linear layer for making predictions.

In [42]:
class TransformerClassifier(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, num_layers, dropout_rate=0.2, data_manager = None):
        super(TransformerClassifier, self).__init__()
        self.vocab_size = data_manager.vocab_size
        self.num_classes = data_manager.num_classes
        self.embed_dim = embed_dim
        self.max_seq_len = data_manager.max_seq_len
        self.num_heads = num_heads
        self.ff_dim = ff_dim
        self.num_layers = num_layers
        self.dropout_rate = dropout_rate

    def build(self):
        #Insert your code here
        self.embedding = nn.Embedding(self.vocab_size, self.embed_dim)
        
        # positional encoding
        self.pos_encoding = PositionalEncoding(self.embed_dim, self.max_seq_len)
        
        # encoder layers
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(self.embed_dim, self.num_heads, self.ff_dim, self.dropout_rate)
            for _ in range(self.num_layers)
        ])
        
        # output layer
        self.output_layer = nn.Linear(self.embed_dim, self.num_classes)


    def forward(self, x):
        #Insert your code here
        embedded = self.embedding(x)
        
        # add positional encoding
        embedded = self.pos_encoding(embedded)
        
        # pass through encoder layers
        output = embedded
        for encoder_layer in self.encoder_layers:
            output = encoder_layer(output)
            
        # average pooling across the sequence length
        output = torch.mean(output, dim=1)
        
        return self.output_layer(output)




In [43]:
transformer = TransformerClassifier(embed_dim=512, num_heads=8, ff_dim=2048, num_layers=12, dropout_rate=0.1, data_manager= dm)
transformer.build()
transformer = transformer.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(transformer.parameters(), lr=1e-4, betas=(0.9, 0.98), eps=1e-9)
trainer = BaseTrainer(model= transformer, criterion=criterion, optimizer=optimizer, train_loader=dm.train_loader, val_loader=dm.valid_loader)
trainer.fit(num_epochs=30)

Epoch 1/30
26/26 - train_loss: 2.0195 - train_accuracy: 22.4235%                 - val_loss: 0.8066 - val_accuracy: 12.6263%
Epoch 2/30
26/26 - train_loss: 1.7231 - train_accuracy: 18.8007%                 - val_loss: 0.8273 - val_accuracy: 12.6263%
Epoch 3/30
26/26 - train_loss: 1.6846 - train_accuracy: 21.9238%                 - val_loss: 0.8405 - val_accuracy: 26.2626%
Epoch 4/30
26/26 - train_loss: 1.6877 - train_accuracy: 21.9863%                 - val_loss: 0.7704 - val_accuracy: 27.2727%
Epoch 5/30
26/26 - train_loss: 1.6989 - train_accuracy: 20.8620%                 - val_loss: 0.7289 - val_accuracy: 27.2727%
Epoch 6/30
26/26 - train_loss: 1.6803 - train_accuracy: 21.3616%                 - val_loss: 0.7095 - val_accuracy: 27.2727%
Epoch 7/30
26/26 - train_loss: 1.6395 - train_accuracy: 26.2961%                 - val_loss: 0.7330 - val_accuracy: 37.3737%
Epoch 8/30
26/26 - train_loss: 1.3738 - train_accuracy: 39.1006%                 - val_loss: 0.7387 - val_accuracy: 31.3131%


#### <font color="red">**Question 4.2**</font>
**Prefix prompt-tuning with Transformers: You need to implement the prefix prompt-tuning with Transformers. Basically, we base on a pre-trained Transformer, add prefix prompts, and do fine-tuning for a target dataset.**

<div style="text-align: right"><font color="red; font-weight:bold">[Total marks for this part: 10 marks]<span></div>

To implement prefix prompt-tuning with pretrained Transformers, we first need to create the Bert dataset.

In [44]:
from transformers import AutoModel, AutoTokenizer
from torch.optim import AdamW
from datasets import Dataset

model_name = "bert-base-uncased"  # BERT or any similar model

# Tokenize input and prepare model inputs
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = Dataset.from_dict({"text": dm.str_questions, "label": dm.numeral_labels})

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length= 36)

dataset = dataset.map(tokenize_function, batched=True)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
print(dataset)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2000
})


The following function splits the BERT dataset `dataset` into three BERT datasets for training, valid, and testing.

In [45]:
def train_valid_test_split(dataset, train_ratio=0.8, test_ratio = 0.1):
    num_sentences = len(dataset)
    train_size = int(num_sentences*train_ratio) +1
    test_size = int(num_sentences*test_ratio) +1
    valid_size = num_sentences - (train_size + test_size)
    train_set = dataset[:train_size]
    train_set = Dataset.from_dict(train_set)
    train_set.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    test_set = dataset[-test_size:]
    test_set = Dataset.from_dict(test_set)
    test_set.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    valid_set = dataset[train_size:-test_size]
    valid_set = Dataset.from_dict(valid_set)
    valid_set.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=64, shuffle=False)
    valid_loader = DataLoader(valid_set, batch_size=64, shuffle=False)
    return train_loader, test_loader, valid_loader


In [46]:
train_loader, test_loader, valid_loader = train_valid_test_split(dataset)

You need to implement the class `PrefixTuningForClassification` for the prefix prompt fine-tuning. We first load a pre-trained BERT model specified by `model_name`. The parameter `prefix_length` specifies the length of the prefix prompts we add to the pre-trained BERT model. Specifically, given the input batch `[batch_size, seq_len]`, we input to the embedding layer of the pre-trained BERT model to obtain `[batch_size, seq_len, embed_size]`. We create the prefix prompts $P$ of the size `[prefix_length, embed_size]` and concatenate to the embeddings from the pre-trained BERT to obtain `[batch_size, seq_len + prefix_length, embed_size]`. This concatenation tensor will then be fed to the encoder layers of the pre-trained BERT layer to obtain the last `[batch_size, seq_len + prefix_length, embed_size]`.

We then take mean across the seq_len to obtain `[batch_size, embed_size]` on which we can build up a linear layer for making predictions. Please note that **the parameters to tune include the prefix prompts $P$** and **the output linear layer**, and you should freeze the parameters of the BERT pre-trained model. Moreover, your code should cover the edge case when `prefix_length=None`. In this case, we do not insert any prefix prompts and we only do fine-tuning for the output linear layer on top.  

In [47]:
class PrefixTuningForClassification(nn.Module):
    def __init__(self, model_name, prefix_length=None, data_manager = None):
        super(PrefixTuningForClassification, self).__init__()

        # Load the pretrained transformer model (BERT-like model)
        self.model = AutoModel.from_pretrained(model_name).to(device)
        self.hidden_size =  self.model.config.hidden_size
        self.prefix_length = prefix_length
        self.num_classes = data_manager.num_classes
        # Insert your code here
        if self.prefix_length is not None:
            self.prefix_embeddings = nn.Parameter(torch.randn(self.prefix_length, self.hidden_size))
            
        # freeze the pre-trained parameters of the BERT model
        for param in self.model.parameters():
            param.requires_grad = False
            
        # add a new classification head
        self.classifier = nn.Linear(self.hidden_size, self.num_classes)

    def forward(self, input_ids, attention_mask):
        # Insert your code here
        embeddings = self.model.embeddings(input_ids)
        
        # add prefix embeddings if specified
        if self.prefix_length is not None:
            batch_size = embeddings.size(0)
            prefix_embeddings = self.prefix_embeddings.unsqueeze(0).expand(batch_size, -1, -1) # repeat?
            embeddings = torch.cat([prefix_embeddings, embeddings], dim=1)
            # update the attention mask
            prefix_attention_mask = torch.ones(batch_size, self.prefix_length, device=attention_mask.device)
            attention_mask = torch.cat([prefix_attention_mask, attention_mask], dim=1)
            
        # pass through the encode layers
        output = self.model.encoder(embeddings, attention_mask=attention_mask.unsqueeze(1).unsqueeze(2))
        
        # use the last hidden state for classification
        if self.prefix_length is not None:
            last_hidden_state = output.last_hidden_state[:, self.prefix_length:, :]
        else:
            last_hidden_state = output.last_hidden_state
            
        # average pooling
        pooled_output = torch.mean(last_hidden_state, dim=1)
        
        # do classification
        return self.classifier(pooled_output)


You can use the following `FineTunedBaseTrainer` to train the prompt fine-tuning models.

In [48]:
class FineTunedBaseTrainer:
    def __init__(self, model, criterion, optimizer, train_loader, val_loader):
        self.model = model
        self.criterion = criterion  #the loss function
        self.optimizer = optimizer  #the optimizer
        self.train_loader = train_loader  #the train loader
        self.val_loader = val_loader  #the valid loader

    #the function to train the model in many epochs
    def fit(self, num_epochs):
        self.num_batches = len(self.train_loader)

        for epoch in range(num_epochs):
            print(f'Epoch {epoch + 1}/{num_epochs}')
            train_loss, train_accuracy = self.train_one_epoch()
            val_loss, val_accuracy = self.validate_one_epoch()
            print(
                f'{self.num_batches}/{self.num_batches} - train_loss: {train_loss:.4f} - train_accuracy: {train_accuracy*100:.4f}% \
                - val_loss: {val_loss:.4f} - val_accuracy: {val_accuracy*100:.4f}%')

    #train in one epoch, return the train_acc, train_loss
    def train_one_epoch(self):
        self.model.train()
        running_loss, correct, total = 0.0, 0, 0
        for batch in self.train_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)
            self.optimizer.zero_grad()
            outputs = self.model(input_ids= input_ids, attention_mask= attention_mask)
            loss = self.criterion(outputs, labels)
            loss.backward()
            self.optimizer.step()

            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        train_accuracy = correct / total
        train_loss = running_loss / self.num_batches
        return train_loss, train_accuracy

    #evaluate on a loader and return the loss and accuracy
    def evaluate(self, loader):
        self.model.eval()
        loss, correct, total = 0.0, 0, 0
        with torch.no_grad():
            for batch in loader:
                input_ids = batch["input_ids"].to(device)
                labels = batch["label"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                outputs = self.model(input_ids= input_ids, attention_mask= attention_mask)
                loss = self.criterion(outputs, labels)
                loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        accuracy = correct / total
        loss = loss / len(self.val_loader)
        return loss, accuracy

    #return the val_acc, val_loss, be called at the end of each epoch
    def validate_one_epoch(self):
      val_loss, val_accuracy = self.evaluate(self.val_loader)
      return val_loss, val_accuracy

We declare and train the prefix-prompt tuning model. In addition, you need to be patient with this model because it might converge slowly with many epochs.

In [49]:
prefix_tuning_model = PrefixTuningForClassification(model_name = "bert-base-uncased", prefix_length = 5, data_manager = dm).to(device)

In [50]:
if prefix_tuning_model.prefix_length is not None:
  optimizer = torch.optim.Adam(list(prefix_tuning_model.classifier.parameters()) + [prefix_tuning_model.prefix_embeddings], lr=5e-5)
else:
  optimizer = torch.optim.Adam(prefix_tuning_model.classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
trainer = FineTunedBaseTrainer(model= prefix_tuning_model, criterion=criterion, optimizer=optimizer, train_loader=train_loader, val_loader=valid_loader)
trainer.fit(num_epochs=100)

Epoch 1/100
26/26 - train_loss: 1.7964 - train_accuracy: 22.2361%                 - val_loss: 0.9475 - val_accuracy: 22.7273%
Epoch 2/100
26/26 - train_loss: 1.7408 - train_accuracy: 23.9225%                 - val_loss: 0.9086 - val_accuracy: 25.2525%
Epoch 3/100
26/26 - train_loss: 1.6994 - train_accuracy: 25.0468%                 - val_loss: 0.8821 - val_accuracy: 26.2626%
Epoch 4/100
26/26 - train_loss: 1.6880 - train_accuracy: 26.1087%                 - val_loss: 0.8636 - val_accuracy: 25.7576%
Epoch 5/100
26/26 - train_loss: 1.6604 - train_accuracy: 26.0462%                 - val_loss: 0.8442 - val_accuracy: 28.2828%
Epoch 6/100
26/26 - train_loss: 1.6598 - train_accuracy: 25.6090%                 - val_loss: 0.8355 - val_accuracy: 25.7576%
Epoch 7/100
26/26 - train_loss: 1.6545 - train_accuracy: 26.3585%                 - val_loss: 0.8310 - val_accuracy: 28.2828%
Epoch 8/100
26/26 - train_loss: 1.6338 - train_accuracy: 29.6690%                 - val_loss: 0.8239 - val_accuracy: 2

#### <font color="red">**Question 4.3**</font>
**For any models defined in the previous questions (of all parts), you are free to fine-tune hyperparameters, e.g., `optimizer`, `learning_rate`, `state_sizes`, such that you get a best model, i.e., the one with the highest accuracy on the test set. You will need to report (i) what is your best model,  (ii) its accuracy on the test set, and (iii) the values of its hyperparameters. Note that you must report your best model's accuracy with rounding to 4 decimal places, i.e., 0.xxxx. You will also need to upload your best model (or provide us with the link to download your best model). The assessment will be based on your best model's accuracy, with up to 10 marks available, specifically:**
* The best accuracy $\ge$ 0.97: 10 marks
* 0.97 $>$ The best accuracy $\ge$ 0.92: 7 marks
* 0.92 $>$ The best accuracy $\ge$ 0.85: 4 marks
* The best accuracy $<$ 0.85: 0 mark

**For this question, you can put below the code to train the best model. In this case, you need to show your code and the evidence of running regarding the best model. Moreover, if you save the best model, you need to provide the link to download the best model, the code to load the best model, and then evaluate on the test set.**
<div style="text-align: right"><font color="red">[10 marks]</font></div>

##### Import Packages

In [51]:
from transformers import AutoModel, AutoTokenizer, get_linear_schedule_with_warmup
from tqdm import tqdm
from six.moves.urllib.request import urlretrieve

##### Define Optimized BERT Classifier

In [52]:
class OptimizedBERTClassifier(nn.Module):
    """
    Optimized BERT-based classifier with:
    - Partial fine-tuning (last 2 BERT layers)
    - Multi-layer classification head
    - Dropout for regularization
    - Batch normalization for stability
    """
    def __init__(self, model_name, num_classes, dropout_rate=0.3, hidden_dim=256):
        super(OptimizedBERTClassifier, self).__init__()
        
        # Load pretrained BERT model
        self.bert = AutoModel.from_pretrained(model_name)
        self.hidden_size = self.bert.config.hidden_size
        
        # Freeze all BERT parameters first
        for param in self.bert.parameters():
            param.requires_grad = False
        
        # Unfreeze the last 2 encoder layers for fine-tuning
        # This allows the model to adapt better to the specific task
        for layer in self.bert.encoder.layer[-2:]:
            for param in layer.parameters():
                param.requires_grad = True
        
        # Multi-layer classification head
        self.classifier = nn.Sequential(
            nn.Linear(self.hidden_size, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, num_classes)
        )
        
    def forward(self, input_ids, attention_mask):
        # Get BERT outputs
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        
        # Use [CLS] token representation (first token)
        cls_output = outputs.last_hidden_state[:, 0, :]
        
        # Pass through classification head
        logits = self.classifier(cls_output)
        
        return logits

##### Define My Trainer with Early Stopping

In [53]:
class MyTrainer:
    """
    Advanced trainer with:
    - Learning rate scheduling with warmup
    - Early stopping
    - Model checkpointing
    - Gradient clipping
    """
    def __init__(self, model, criterion, optimizer, scheduler, train_loader, 
                 val_loader, test_loader, patience=10, checkpoint_dir='./checkpoints'):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer
        self.scheduler = scheduler
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.test_loader = test_loader
        self.patience = patience
        self.checkpoint_dir = checkpoint_dir
        
        # Create checkpoint directory
        os.makedirs(checkpoint_dir, exist_ok=True)
        
        # Early stopping variables
        self.best_val_acc = 0.0
        self.best_test_acc = 0.0
        self.patience_counter = 0
        self.best_epoch = 0
        
    def train_one_epoch(self):
        self.model.train()
        running_loss = 0.0
        correct = 0
        total = 0
        
        pbar = tqdm(self.train_loader, desc='Training')
        for batch in pbar:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)
            
            # Forward pass
            self.optimizer.zero_grad()
            outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
            loss = self.criterion(outputs, labels)
            
            # Backward pass
            loss.backward()
            
            # Gradient clipping to prevent exploding gradients
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            
            self.optimizer.step()
            self.scheduler.step()
            
            # Statistics
            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            
            # Update progress bar
            pbar.set_postfix({'loss': f'{loss.item():.4f}', 'acc': f'{100. * correct / total:.2f}%'})
        
        train_accuracy = correct / total
        train_loss = running_loss / len(self.train_loader)
        return train_loss, train_accuracy
    
    def evaluate(self, loader, desc='Evaluating'):
        self.model.eval()
        running_loss = 0.0
        correct = 0
        total = 0
        
        with torch.no_grad():
            for batch in tqdm(loader, desc=desc):
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                labels = batch["label"].to(device)
                
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                loss = self.criterion(outputs, labels)
                
                running_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        
        accuracy = correct / total
        avg_loss = running_loss / len(loader)
        return avg_loss, accuracy
    
    def fit(self, num_epochs):
        print(f"\nTraining for {num_epochs} epochs with early stopping (patience={self.patience})")
        print("=" * 80)
        
        for epoch in range(num_epochs):
            print(f'\nEpoch {epoch + 1}/{num_epochs}')
            print("-" * 80)
            
            # Train
            train_loss, train_accuracy = self.train_one_epoch()
            
            # Validate
            val_loss, val_accuracy = self.evaluate(self.val_loader, 'Validation')
            
            # Test
            test_loss, test_accuracy = self.evaluate(self.test_loader, 'Testing')
            
            # Print metrics
            print(f'\nTrain Loss: {train_loss:.4f} | Train Acc: {train_accuracy*100:.2f}%')
            print(f'Val Loss: {val_loss:.4f} | Val Acc: {val_accuracy*100:.2f}%')
            print(f'Test Loss: {test_loss:.4f} | Test Acc: {test_accuracy*100:.4f}%')
            
            # Check for improvement
            if val_accuracy > self.best_val_acc:
                self.best_val_acc = val_accuracy
                self.best_test_acc = test_accuracy
                self.best_epoch = epoch + 1
                self.patience_counter = 0
                
                # Save best model
                checkpoint_path = os.path.join(self.checkpoint_dir, 'best_model.pt')
                torch.save({
                    'epoch': epoch,
                    'model_state_dict': self.model.state_dict(),
                    'optimizer_state_dict': self.optimizer.state_dict(),
                    'val_accuracy': val_accuracy,
                    'test_accuracy': test_accuracy,
                }, checkpoint_path)
                print(f'✓ New best model saved! Val Acc: {val_accuracy*100:.2f}%, Test Acc: {test_accuracy*100:.4f}%')
            else:
                self.patience_counter += 1
                print(f'No improvement for {self.patience_counter} epoch(s)')
                
                if self.patience_counter >= self.patience:
                    print(f'\nEarly stopping triggered after {epoch + 1} epochs')
                    break
        
        print("\n" + "=" * 80)
        print(f'Best Model Performance:')
        print(f'  - Epoch: {self.best_epoch}')
        print(f'  - Val Accuracy: {self.best_val_acc*100:.2f}%')
        print(f'  - Test Accuracy: {self.best_test_acc:.4f} ({self.best_test_acc*100:.2f}%)')
        print("=" * 80)
        
        return self.best_test_acc

##### Data Preparation Function

In [54]:
def prepare_data(dm, model_name="bert-base-uncased", max_length=48, batch_size=32):
    """
    Prepare datasets for BERT model
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Create dataset
    dataset = Dataset.from_dict({
        "text": dm.str_questions, 
        "label": dm.numeral_labels
    })
    
    # Tokenize
    def tokenize_function(examples):
        return tokenizer(
            examples["text"], 
            padding="max_length", 
            truncation=True, 
            max_length=max_length
        )
    
    dataset = dataset.map(tokenize_function, batched=True)
    dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    
    # Split into train/val/test (80/10/10)
    num_samples = len(dataset)
    train_size = int(num_samples * 0.8)
    test_size = int(num_samples * 0.1)
    val_size = num_samples - train_size - test_size
    
    train_set = Dataset.from_dict(dataset[:train_size])
    train_set.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    
    val_set = Dataset.from_dict(dataset[train_size:train_size+val_size])
    val_set.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    
    test_set = Dataset.from_dict(dataset[-test_size:])
    test_set.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    
    # Create dataloaders
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)
    test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
    
    print(f"Data loaded:")
    print(f"  - Train set: {len(train_set)} samples")
    print(f"  - Val set: {len(val_set)} samples")
    print(f"  - Test set: {len(test_set)} samples")
    
    return train_loader, val_loader, test_loader

##### Train the Best Model

In [55]:
# Hyperparameters
MODEL_NAME = "bert-base-uncased"
LEARNING_RATE = 2e-5
BATCH_SIZE = 32
NUM_EPOCHS = 50
DROPOUT = 0.3
HIDDEN_DIM = 256
WEIGHT_DECAY = 0.01
WARMUP_RATIO = 0.1
MAX_LENGTH = 48
PATIENCE = 10

print("=" * 80)
print("TRAINING BEST MODEL FOR QUESTION 4.3")
print("=" * 80)

print("\nHyperparameters:")
print(f"  - Model: {MODEL_NAME}")
print(f"  - Learning Rate: {LEARNING_RATE}")
print(f"  - Batch Size: {BATCH_SIZE}")
print(f"  - Max Epochs: {NUM_EPOCHS}")
print(f"  - Dropout: {DROPOUT}")
print(f"  - Hidden Dimension: {HIDDEN_DIM}")
print(f"  - Weight Decay: {WEIGHT_DECAY}")
print(f"  - Warmup Ratio: {WARMUP_RATIO}")
print(f"  - Max Sequence Length: {MAX_LENGTH}")
print(f"  - Early Stopping Patience: {PATIENCE}")

# Prepare data
print("\nPreparing data...")
train_loader, val_loader, test_loader = prepare_data(dm, MODEL_NAME, MAX_LENGTH, BATCH_SIZE)

# Initialize model
print("\nInitializing model...")
model = OptimizedBERTClassifier(
    model_name=MODEL_NAME,
    num_classes=dm.num_classes,
    dropout_rate=DROPOUT,
    hidden_dim=HIDDEN_DIM
).to(device)

# Print model info
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nModel Statistics:")
print(f"  - Total parameters: {total_params:,}")
print(f"  - Trainable parameters: {trainable_params:,}")
print(f"  - Frozen parameters: {total_params - trainable_params:,}")

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY
)

# Learning rate scheduler with warmup
num_training_steps = len(train_loader) * NUM_EPOCHS
num_warmup_steps = int(num_training_steps * WARMUP_RATIO)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)

print(f"\nTraining Schedule:")
print(f"  - Total training steps: {num_training_steps}")
print(f"  - Warmup steps: {num_warmup_steps}")

# Train
trainer = MyTrainer(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    train_loader=train_loader,
    val_loader=val_loader,
    test_loader=test_loader,
    patience=PATIENCE,
    checkpoint_dir='.'
)

best_test_acc = trainer.fit(NUM_EPOCHS) 

print(f"\n✓ Training completed!")
print(f"✓ Best model saved to: ./best_model.pt")
print(f"✓ Best Test Accuracy: {best_test_acc:.4f}")

TRAINING BEST MODEL FOR QUESTION 4.3

Hyperparameters:
  - Model: bert-base-uncased
  - Learning Rate: 2e-05
  - Batch Size: 32
  - Max Epochs: 50
  - Dropout: 0.3
  - Hidden Dimension: 256
  - Weight Decay: 0.01
  - Warmup Ratio: 0.1
  - Max Sequence Length: 48
  - Early Stopping Patience: 10

Preparing data...


Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Data loaded:
  - Train set: 1600 samples
  - Val set: 200 samples
  - Test set: 200 samples

Initializing model...

Model Statistics:
  - Total parameters: 109,681,158
  - Trainable parameters: 14,374,662
  - Frozen parameters: 95,306,496

Training Schedule:
  - Total training steps: 2500
  - Warmup steps: 250

Training for 50 epochs with early stopping (patience=10)

Epoch 1/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.56it/s, loss=1.7257, acc=21.00%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.80it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.77it/s]



Train Loss: 1.7972 | Train Acc: 21.00%
Val Loss: 1.6773 | Val Acc: 32.00%
Test Loss: 1.6585 | Test Acc: 38.0000%
✓ New best model saved! Val Acc: 32.00%, Test Acc: 38.0000%

Epoch 2/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.58it/s, loss=1.4687, acc=32.56%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.87it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.75it/s]



Train Loss: 1.6496 | Train Acc: 32.56%
Val Loss: 1.3526 | Val Acc: 70.00%
Test Loss: 1.3531 | Test Acc: 70.5000%
✓ New best model saved! Val Acc: 70.00%, Test Acc: 70.5000%

Epoch 3/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.58it/s, loss=1.0584, acc=59.25%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.80it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.75it/s]



Train Loss: 1.3105 | Train Acc: 59.25%
Val Loss: 0.8752 | Val Acc: 84.00%
Test Loss: 0.8873 | Test Acc: 87.0000%
✓ New best model saved! Val Acc: 84.00%, Test Acc: 87.0000%

Epoch 4/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.59it/s, loss=0.7899, acc=79.88%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.73it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.77it/s]



Train Loss: 0.9026 | Train Acc: 79.88%
Val Loss: 0.4803 | Val Acc: 90.50%
Test Loss: 0.4765 | Test Acc: 93.0000%
✓ New best model saved! Val Acc: 90.50%, Test Acc: 93.0000%

Epoch 5/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.53it/s, loss=0.4738, acc=90.31%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.79it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.74it/s]



Train Loss: 0.5359 | Train Acc: 90.31%
Val Loss: 0.2269 | Val Acc: 94.50%
Test Loss: 0.2106 | Test Acc: 96.5000%
✓ New best model saved! Val Acc: 94.50%, Test Acc: 96.5000%

Epoch 6/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.51it/s, loss=0.3027, acc=94.12%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.72it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.71it/s]



Train Loss: 0.3107 | Train Acc: 94.12%
Val Loss: 0.1141 | Val Acc: 97.50%
Test Loss: 0.1056 | Test Acc: 97.5000%
✓ New best model saved! Val Acc: 97.50%, Test Acc: 97.5000%

Epoch 7/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.48it/s, loss=0.1582, acc=96.56%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.71it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.78it/s]



Train Loss: 0.1972 | Train Acc: 96.56%
Val Loss: 0.0657 | Val Acc: 99.00%
Test Loss: 0.0680 | Test Acc: 99.5000%
✓ New best model saved! Val Acc: 99.00%, Test Acc: 99.5000%

Epoch 8/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.47it/s, loss=0.1610, acc=97.38%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.68it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.63it/s]



Train Loss: 0.1519 | Train Acc: 97.38%
Val Loss: 0.0407 | Val Acc: 99.50%
Test Loss: 0.0469 | Test Acc: 99.0000%
✓ New best model saved! Val Acc: 99.50%, Test Acc: 99.0000%

Epoch 9/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.48it/s, loss=0.1051, acc=98.81%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.81it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.77it/s]



Train Loss: 0.1039 | Train Acc: 98.81%
Val Loss: 0.0327 | Val Acc: 99.50%
Test Loss: 0.0411 | Test Acc: 99.0000%
No improvement for 1 epoch(s)

Epoch 10/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.50it/s, loss=0.0825, acc=98.94%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.90it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.76it/s]



Train Loss: 0.0904 | Train Acc: 98.94%
Val Loss: 0.0254 | Val Acc: 99.50%
Test Loss: 0.0317 | Test Acc: 99.0000%
No improvement for 2 epoch(s)

Epoch 11/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.46it/s, loss=0.0472, acc=99.31%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.73it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.73it/s]



Train Loss: 0.0754 | Train Acc: 99.31%
Val Loss: 0.0209 | Val Acc: 100.00%
Test Loss: 0.0315 | Test Acc: 99.5000%
✓ New best model saved! Val Acc: 100.00%, Test Acc: 99.5000%

Epoch 12/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.57it/s, loss=0.0700, acc=99.25%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.91it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.69it/s]



Train Loss: 0.0612 | Train Acc: 99.25%
Val Loss: 0.0173 | Val Acc: 100.00%
Test Loss: 0.0306 | Test Acc: 99.5000%
No improvement for 1 epoch(s)

Epoch 13/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.53it/s, loss=0.0457, acc=99.56%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.72it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.77it/s]



Train Loss: 0.0529 | Train Acc: 99.56%
Val Loss: 0.0150 | Val Acc: 100.00%
Test Loss: 0.0255 | Test Acc: 99.5000%
No improvement for 2 epoch(s)

Epoch 14/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.52it/s, loss=0.0428, acc=99.38%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.91it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.72it/s]



Train Loss: 0.0485 | Train Acc: 99.38%
Val Loss: 0.0165 | Val Acc: 99.50%
Test Loss: 0.0275 | Test Acc: 99.5000%
No improvement for 3 epoch(s)

Epoch 15/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.50it/s, loss=0.0392, acc=99.50%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.88it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.70it/s]



Train Loss: 0.0450 | Train Acc: 99.50%
Val Loss: 0.0123 | Val Acc: 100.00%
Test Loss: 0.0282 | Test Acc: 99.5000%
No improvement for 4 epoch(s)

Epoch 16/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.55it/s, loss=0.0332, acc=99.69%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.79it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.91it/s]



Train Loss: 0.0372 | Train Acc: 99.69%
Val Loss: 0.0117 | Val Acc: 100.00%
Test Loss: 0.0237 | Test Acc: 99.5000%
No improvement for 5 epoch(s)

Epoch 17/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.53it/s, loss=0.0374, acc=99.75%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.94it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.84it/s]



Train Loss: 0.0364 | Train Acc: 99.75%
Val Loss: 0.0103 | Val Acc: 100.00%
Test Loss: 0.0263 | Test Acc: 99.5000%
No improvement for 6 epoch(s)

Epoch 18/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.53it/s, loss=0.0221, acc=99.88%] 
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.97it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.77it/s]



Train Loss: 0.0294 | Train Acc: 99.88%
Val Loss: 0.0098 | Val Acc: 100.00%
Test Loss: 0.0177 | Test Acc: 99.5000%
No improvement for 7 epoch(s)

Epoch 19/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.56it/s, loss=0.0245, acc=99.81%] 
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.87it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.89it/s]



Train Loss: 0.0317 | Train Acc: 99.81%
Val Loss: 0.0096 | Val Acc: 100.00%
Test Loss: 0.0219 | Test Acc: 99.5000%
No improvement for 8 epoch(s)

Epoch 20/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.55it/s, loss=0.0192, acc=99.94%]
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.91it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.75it/s]



Train Loss: 0.0245 | Train Acc: 99.94%
Val Loss: 0.0093 | Val Acc: 100.00%
Test Loss: 0.0263 | Test Acc: 99.5000%
No improvement for 9 epoch(s)

Epoch 21/50
--------------------------------------------------------------------------------


Training: 100%|██████████| 50/50 [00:05<00:00,  8.53it/s, loss=0.0196, acc=99.94%] 
Validation: 100%|██████████| 7/7 [00:00<00:00, 12.96it/s]
Testing: 100%|██████████| 7/7 [00:00<00:00, 12.85it/s]


Train Loss: 0.0244 | Train Acc: 99.94%
Val Loss: 0.0086 | Val Acc: 100.00%
Test Loss: 0.0235 | Test Acc: 99.5000%
No improvement for 10 epoch(s)

Early stopping triggered after 21 epochs

Best Model Performance:
  - Epoch: 11
  - Val Accuracy: 100.00%
  - Test Accuracy: 0.9950 (99.50%)

✓ Training completed!
✓ Best model saved to: ./best_model.pt
✓ Best Test Accuracy: 0.9950





##### Test the best model

In [56]:
# 1. initialize the model
best_model = (
    OptimizedBERTClassifier(
        model_name="bert-base-uncased",
        num_classes=dm.num_classes,
        dropout_rate=DROPOUT,
        hidden_dim=HIDDEN_DIM,
    )
    .to(device)
)

# 2. load the model parameters from the checkpoint
checkpoint = torch.load("best_model.pt", map_location=device)
best_model.load_state_dict(checkpoint["model_state_dict"])

# 3. set the model to evaluation mode
best_model.eval()

# 4. evaluate the model on the test set
criterion = nn.CrossEntropyLoss()


def evaluate_on_test(model, test_loader, criterion):
    model.eval()
    total = 0
    correct = 0
    total_loss = 0

    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    avg_loss = total_loss / len(test_loader)
    accuracy = correct / total
    return avg_loss, accuracy


test_loss, test_accuracy = evaluate_on_test(best_model, test_loader, criterion)
print(f"Best model Test Loss: {test_loss:.4f}")
print(f"Best model Test Accuracy: {test_accuracy:.4f}")

print(f"Keys in checkpoint: {list(checkpoint.keys())}")
print(f"Training epochs: {checkpoint['epoch']}")
print(f"Validation accuracy: {checkpoint['val_accuracy']}")
print(f"Test accuracy: {checkpoint['test_accuracy']}")

Best model Test Loss: 0.0315
Best model Test Accuracy: 0.9950
Keys in checkpoint: ['epoch', 'model_state_dict', 'optimizer_state_dict', 'val_accuracy', 'test_accuracy']
Training epochs: 10
Validation accuracy: 1.0
Test accuracy: 0.995


##### (i) What is your best model?

Fine-tuned BERT (bert-base-uncased) with a multi-layer classification head.

Partial fine-tuning strategy was used (only the last 2 layers of the BERT encoder were unfrozen), combined with a multi-layer classification head containing BatchNorm and Dropout.

##### (ii) The accuracy of your best model on the test set

Best model Test Loss: 0.0185

Best model Test Accuracy: 0.9950


##### (iii) The values of the hyperparameters of your best model

Hyperparameters:
  - Model: bert-base-uncased
  - Learning Rate: 2e-05
  - Batch Size: 32
  - Max Epochs: 50
  - Dropout: 0.3
  - Hidden Dimension: 256
  - Weight Decay: 0.01
  - Warmup Ratio: 0.1
  - Max Sequence Length: 48
  - Early Stopping Patience: 10

##### (iv) The link to download your best model

https://drive.google.com/file/d/1g3P_2L113lzfGInNecvrB_cPEFvUuA8C/view?usp=sharing

---
<div style="text-align: center"> <font color="green">GOOD LUCK WITH YOUR ASSIGNMENT 2!</font> </div>
<div style="text-align: center"> <font color="black">END OF ASSIGNMENT</font> </div>