### Homework 2 (part 1): Finetuning BERT

Your task today is to play with BERT embedding generation, fine-tune existing models on new data, and see how transformers outperform earlier architectures, albeit at a higher computational cost.

In [None]:
%pip install --upgrade scikit-learn transformers datasets accelerate deepspeed -q


In [1]:

import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import torch
import transformers
import datasets
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification, Trainer, TrainingArguments



### Load data and model

Our dataset for today is **Quora Question Pairs (QQP)**.

The dataset consists of over 400,000 question pairs, and each pair is annotated with a binary value indicating whether the two questions are paraphrases of each other, i.e. semantically close. Read [here](https://paperswithcode.com/dataset/quora-question-pairs) if you want to know more.

In [2]:
qqp = datasets.load_dataset('SetFit/qqp')
print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])

Repo card metadata block was not found. Setting CardData to empty.




Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [3]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

### Tokenize the data

The [datasets](https://huggingface.co/docs/datasets/en/index) library lets you transform data with `map`, in the style of functional programming.
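To make the functional style concrete, here is a toy stand-in for a batched map in plain Python. It is only a sketch of the idea, not the library's implementation: `batched_map` and `add_lengths` are hypothetical names, and the real `map` additionally takes care of caching and storage. The point it illustrates is that with `batched=True` the mapped function receives a dict of column lists rather than a single row.

```python
def batched_map(rows, fn, batch_size=2):
    """Toy stand-in for a batched map: `fn` sees a dict of column lists."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of columns, as a batched function expects.
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(columns)
        # Turn the returned columns back into rows.
        n = len(next(iter(result.values())))
        out.extend({key: values[i] for key, values in result.items()} for i in range(n))
    return out


def add_lengths(batch):
    """Example batched function: adds one new column computed from two existing ones."""
    batch = dict(batch)
    batch["n_chars"] = [len(a) + len(b) for a, b in zip(batch["text1"], batch["text2"])]
    return batch


rows = [
    {"text1": "What can one do after MBBS?", "text2": "What do i do after my MBBS ?", "label": 1},
    {"text1": "a", "text2": "bc", "label": 0},
    {"text1": "", "text2": "xyz", "label": 0},
]
mapped = batched_map(rows, add_lengths)
print(mapped[1])
```

Existing columns such as `label` pass through untouched, while the new column is added to every row.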

What Happens to the Texts in `qqp_preprocessed`?

- The original `text1` and `text2` are tokenized into numerical ids using a relevant tokenizer.
- Both texts are joined with the `SEP` token and prefixed with the `CLS` token to match the input format the model expects. The resulting sequence is truncated if the combined length exceeds 128 tokens and padded if it is shorter.
- The `qqp_preprocessed` dataset contains:
    - _Input IDs_: sequence of token ids.
    - _Attention Masks_: binary masks marking real tokens with 1 and padding with 0.
    - _Token Type IDs_: distinguish between tokens from text1 and text2.

__!Note!__ Attention masks tell the model to ignore `PAD` tokens, so that real tokens do not attend to padding.
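The layout described above can be sketched in plain Python. This is a simplified illustration rather than the tokenizer's actual code: `encode_pair` is a hypothetical helper, the special ids 101, 102 and 0 are the ones visible in the printed `input_ids` of this notebook, and the truncation loop is a simplified version of the longest-first strategy.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids visible in the notebook's printed input_ids


def encode_pair(ids_a, ids_b, max_length=16):
    """Build input_ids, attention_mask and token_type_ids for a text pair."""
    ids_a, ids_b = list(ids_a), list(ids_b)
    budget = max_length - 3  # room left after [CLS] and two [SEP] tokens
    # Simplified longest-first truncation: trim the longer segment one token at a time.
    while len(ids_a) + len(ids_b) > budget:
        if len(ids_a) > len(ids_b):
            ids_a.pop()
        else:
            ids_b.pop()

    first = [CLS_ID] + ids_a + [SEP_ID]
    second = ids_b + [SEP_ID]
    input_ids = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    attention_mask = [1] * len(input_ids)

    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


# Token ids copied from the validation sample printed in this notebook.
encoded = encode_pair(
    [2009, 1132, 2170, 118, 4038, 1177, 2712, 136],
    [2009, 1132, 1117, 10224, 4724, 1177, 2712, 136],
    max_length=24,
)
print(encoded["input_ids"])
print(encoded["attention_mask"])
print(encoded["token_type_ids"])
```

The three lists always have the same length, and the attention mask is 0 exactly where `input_ids` holds padding.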

In [4]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(preprocess_function, batched=True)

Map: 100%|█████████████████████| 390965/390965 [00:53<00:00, 7301.41 examples/s]


In [5]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Evaluation (2 points)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

Let's first take a quick look at the data:

In [6]:
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1, shuffle=False, collate_fn=transformers.default_data_collator
)

In [7]:
for batch in val_loader:
    break  # here be your training code
print("Sample batch:", batch)

with torch.no_grad():
    predicted = model(
        input_ids=batch['input_ids'],
        attention_mask=batch['attention_mask'],
        token_type_ids=batch['token_type_ids']
    )

print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).data.numpy())

Sample batch: {'labels': tensor([0]), 'idx': tensor([0]), 'input_ids': tensor([[  101,  2009,  1132,  2170,   118,  4038,  1177,  2712,   136,   102,
          2009,  1132,  1117, 10224,  4724,  1177,  2712,   136,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,   

Note that the model produces two output logits for binary classification (one per class) rather than a single one. This is, in fact, a matter of preference.
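Why this is only a matter of preference: a softmax over two outputs gives the same probability as a sigmoid applied to their difference, since e^{z1} / (e^{z0} + e^{z1}) = 1 / (1 + e^{-(z1 - z0)}). The short check below uses made-up logits; `softmax` and `sigmoid` are small helper functions written for this illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


logits = [-1.3, 2.1]  # made-up logits for (not duplicate, duplicate)
p_two_outputs = softmax(logits)[1]
p_one_output = sigmoid(logits[1] - logits[0])
print(p_two_outputs, p_one_output)
```

Both values agree up to floating-point error, so a single-output model trained with a sigmoid could represent the same predictions.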

__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU under `torch.no_grad()`
- use a batch size larger than 1
- use an optimized data loader with `num_workers > 1`
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)
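One detail to watch once the batch size is larger than 1: the last batch is usually smaller, so accuracy should be accumulated as running counts rather than averaged per batch. The following plain-Python sketch uses made-up logits and labels; `batches` and `argmax` are hypothetical helpers standing in for the data loader and `torch.argmax`.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


# Made-up logits and labels for five examples; the last batch will be smaller.
all_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.4], [-0.2, 0.9], [0.7, -0.7]]
all_labels = [0, 1, 1, 1, 1]

correct = total = 0
per_batch = []
for logit_batch, label_batch in zip(batches(all_logits, 2), batches(all_labels, 2)):
    hits = sum(argmax(row) == label for row, label in zip(logit_batch, label_batch))
    correct += hits
    total += len(label_batch)
    per_batch.append(hits / len(label_batch))

accuracy = correct / total
mean_of_batches = sum(per_batch) / len(per_batch)
print(accuracy, mean_of_batches)
```

The running counts give the true accuracy over all five examples, while the mean of per-batch accuracies over-weights the single example in the last batch.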


Note that even though the model computation runs on the GPU, the process of loading data from disk (or memory) into the format required by the model (e.g., tensors) is handled by the CPU.

Insufficient CPU computation resources may result in bottlenecking the whole process.

In [8]:
from tqdm import tqdm
import multiprocessing

cores = multiprocessing.cpu_count() # Count the number of cores in a computer
cores

16

In [9]:
# Move the model to GPU if available
device = torch.device("cpu")
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print("Device:", device)
model.to(device)

# Create a DataLoader for the validation set
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=16,  # Larger batch size for faster processing
    shuffle=False, collate_fn=transformers.default_data_collator,
    num_workers=cores  # Use multiple workers to load data faster
)

Device: cuda


In [10]:
# Measure validation accuracy
model.eval()  # Set model to evaluation mode

# <YOUR CODE HERE>
total = 0
correct = 0

# (optional) Mixed precision on CUDA. Inference needs only autocast, not a GradScaler.
use_amp = device.type == "cuda"

with torch.no_grad():  # Disable gradient calculation
    for batch in tqdm(val_loader, desc="Evaluating"):
        # Move batch to the selected device
        # <YOUR CODE HERE>
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)

        # Run the forward pass; autocast is a no-op when use_amp is False
        # <YOUR CODE HERE>
        with torch.autocast(device_type="cuda", enabled=use_amp):
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids
            )

        # Get predictions and update accuracy
        # <YOUR CODE HERE>
        logits = outputs.logits
        _, predicted = torch.max(logits, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

# Compute accuracy
# <YOUR CODE HERE>
accuracy = correct / total # Validation accuracy, between 0 and 1
print(f"Validation Accuracy: {accuracy:.4f}")

Evaluating: 100%|███████████████████████████| 2527/2527 [00:31<00:00, 79.63it/s]

Validation Accuracy: 0.9084





In [11]:
assert 0.9 < accuracy < 0.91

### Train the model (3 points)

For this task, you have to fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.

In [12]:
!pip install protobuf sentencepiece tiktoken -q


In [12]:
# Load your model e.g. DeBERTa-v3 tokenizer and model
model_name = 'microsoft/deberta-v3-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

# Note that if the tokenizer of your model
# is different from the one we used above,
# you need to preprocess your data again.

# Preprocess the data
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=128, truncation=True
    )
    result['label'] = examples['label']
    return result


# <If so, your code goes here>
qqp_preprocessed =  qqp.map(preprocess_function, batched=True, remove_columns=qqp["train"].column_names)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|███████████████████████| 40430/40430 [00:05<00:00, 6753.65 examples/s]


In [15]:
# Prepare the training and validation sets
from sklearn.metrics import accuracy_score
from transformers import  (DataCollatorWithPadding, EarlyStoppingCallback)

train_set = qqp_preprocessed['train']
val_set = qqp_preprocessed['validation']  
output_dir = './checkpoints'

seed = 42
per_device_train_batch_size = 32
per_device_eval_batch_size = 64
gradient_accumulation_steps = 2
num_train_epochs = 2
learning_rate = 2e-5
weight_decay = 0.01
# Define a metric for evaluation. You can write your own if you prefer


# If you are using transformers.Trainer, you may want to use a utility function below
def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for the model during training or evaluation.
    Args:
        eval_pred (tuple): A tuple containing:
            - logits (ndarray or torch.Tensor): The raw logits output by the model for each sample
              in the evaluation batch. Shape: (batch_size, num_classes).
            - labels (ndarray or torch.Tensor): The ground truth labels for each sample in the batch.
              Shape: (batch_size,).
    Returns:
        dict: A dictionary containing the computed metric(s):
            - "accuracy" (float): The proportion of correct predictions over the total number of samples.
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}


# Feel free not to use transformers.Trainer and write the code manually if you want
# A good starting learning rate is 2e-5.
# A step of an order of magnitude is a good way to adjust it if necessary e.g. 2e-4, 2e-3 etc.
# 3 train epochs is likely enough for gently finetuning the model without the model 'forgetting previous data'
# Be sure to use weight_decay i.e. regularisation. A good starting point is 1e-2. Feel free to experiment.
# Consider setting accuracy as the metric for the best model.

# Define your training arguments without the 'device' argument since it is handled automatically.
training_args = TrainingArguments(
    # <YOUR CODE HERE>
    output_dir=output_dir,
    eval_strategy='epoch',
    save_strategy='epoch',
    learning_rate=learning_rate,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=num_train_epochs,
    weight_decay=weight_decay,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                       
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_total_limit=2,
    remove_unused_columns=False,
    report_to="none",
    fp16=True,
    torch_compile=False,
    optim="adamw_torch_fused",
    group_by_length=True,
    dataloader_pin_memory=True,
    dataloader_num_workers=0,
    max_grad_norm=1.0,
    disable_tqdm=False   
)

torch.manual_seed(seed)
np.random.seed(seed)
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8)

# Initialize the Trainer
trainer = Trainer(
    # <YOUR CODE HERE>
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=val_set,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)]
)

# Fine-tune the model
# <YOUR CODE HERE>
trainer.train()

# Evaluate the model
# <YOUR CODE HERE>
accuracy = trainer.evaluate()['eval_accuracy']
print(f"Validation Accuracy: {accuracy:.4f}")

trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.1374 | 0.235482 | 0.906926 |
| 2 | 0.1386 | 0.226927 | 0.917907 |


Validation Accuracy: 0.9179


('./checkpoints/tokenizer_config.json',
 './checkpoints/special_tokens_map.json',
 './checkpoints/spm.model',
 './checkpoints/added_tokens.json',
 './checkpoints/tokenizer.json')

In [14]:
!ls ./checkpoints


In [16]:
assert 0.9 < accuracy

To be completely honest, we cut a corner here. The validation part of the dataset is intended for tuning the hyperparameters, but for the sake of simplicity we omitted that logic and reported the final score on it as well. You are free to pick the best hyperparameters and test the results on the `test` split if you wish.
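Returning to the training arguments used above (`lr_scheduler_type="cosine"`, `warmup_ratio=0.1`, batch size 32 with 2 gradient accumulation steps): the sketch below shows the general shape of such a schedule, a linear warmup followed by a cosine decay. It is an approximation written for illustration, not the library's exact code; `lr_at` is a hypothetical helper, the dataset size is invented to keep the arithmetic simple, and a single device is assumed.

```python
import math


def lr_at(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate after `step` optimizer updates: linear warmup, then cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical dataset size, chosen only to make the arithmetic easy to follow.
n_examples = 64_000
effective_batch = 32 * 2           # per-device batch size x gradient accumulation steps
steps_per_epoch = n_examples // effective_batch
total_steps = steps_per_epoch * 2  # two epochs

for step in (0, 100, 200, 1100, 2000):
    print(step, lr_at(step, total_steps))
```

The rate climbs from zero to the base value during the first 10% of updates, is back at half the base value midway through the decay, and reaches zero at the final step.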

### BONUS: Get a taste of how BERT embeddings work

It is time to shed light on how a BERT-based embedder can be leveraged in searching relevant information.

The problem with vanilla BERT and similar models is that they are not trained directly with a contrastive or triplet loss, which would explicitly pull embeddings of similar texts closer to each other. Hence, to obtain the best possible results when building a search engine, it is preferable to pick a dedicated [sentence similarity](https://huggingface.co/models?pipeline_tag=sentence-similarity) model. Feel free to pick the one that best meets your requirements.
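For intuition, a triplet loss compares an anchor with one similar and one dissimilar example and is zero only when the similar one is closer by at least a margin. The sketch below is a minimal plain-Python version with made-up 2-D embeddings; `euclidean` and `triplet_loss` are illustrative helpers, and real sentence similarity models may use other distances or loss variants.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Made-up 2-D embeddings: the paraphrase sits close to the anchor, the unrelated question far away.
anchor = [0.0, 0.0]
paraphrase = [0.6, 0.8]   # distance 1.0 from the anchor
unrelated = [3.0, 4.0]    # distance 5.0 from the anchor

well_separated = triplet_loss(anchor, paraphrase, unrelated)
violating = triplet_loss(anchor, unrelated, paraphrase)
print(well_separated, violating)
```

The first triplet already satisfies the margin and contributes no loss; the second, with the roles swapped, is penalized, which is the signal that pushes similar embeddings together during training.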

Similar to what we showcased in the first homework, your task is to construct a search engine:
1) _Prepare an embeddings database_: Since the Quora Question Pairs dataset contains, well, pairs of questions, we will only use the data in the `text1` field of the `validation` split. You should obtain embeddings using a model of your choice and store them for later use in a `numpy.ndarray`. Optionally, you can use a dedicated [Faiss](https://github.com/facebookresearch/faiss) index.
2) _Implement a way to search for questions similar to a given query_: You are expected to write a function or a class to streamline interactions with your database. __This part of the homework will be judged on the ability to clearly print the TOP 5 most similar Quora questions for a new arbitrary query.__

Hopefully, you can appreciate how the search has become more semantically profound as compared to our previous attempt.
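The core of such a search engine is small: normalize the embeddings, score every database row against the query with a dot product, and keep the best `k`. Here is a dependency-free sketch with made-up 3-D vectors; `normalize` and `top_k` are hypothetical helpers, and with a `numpy.ndarray` database the scoring loop would typically become a single matrix-vector product.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def top_k(query_vec, database, k=5):
    """Return (index, cosine similarity) pairs for the k most similar database rows."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, row)) for row in database]
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(i, scores[i]) for i in best]


# Made-up 3-D "embeddings"; a real database would hold one model output per question.
database = [normalize(v) for v in ([1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1], [-1, 0, 0])]
results = top_k([1, 0.05, 0], database, k=3)
for index, score in results:
    print(index, round(score, 4))
```

The returned indices point back into the list of original questions, which is all that is needed to print the matches.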

In [18]:
from sklearn.metrics.pairwise import cosine_similarity
# Initialize the model and its tokenizer
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.to(device)

# <YOUR CODE HERE>
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def get_embeddings(texts, model, tokenizer, batch_size=32):
    model.eval()
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        inputs = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=MAX_LENGTH,
            return_tensors="pt"
        ).to(device)

        with torch.no_grad():
            model_output = model(**inputs)
            sentence_embeddings = mean_pooling(model_output, inputs['attention_mask'])
            sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
            all_embeddings.append(sentence_embeddings.cpu().numpy())

    return np.vstack(all_embeddings)

validation_texts = qqp['validation']['text1']
database_embeddings = get_embeddings(validation_texts, model, tokenizer)

def find_similar_questions(query, database, model, top_k = 5):
    """
    Finds and prints the top_k most similar questions for a query.

    This function encodes a query, compares it against a pre-computed
    embedding database using cosine similarity, and prints the most
    semantically similar questions.

    Args:
        query (str): The user's search query.
        database (np.ndarray): A 2D NumPy array containing the pre-computed
                               embeddings for the database of questions.
        model (transformers.PreTrainedModel): The encoder loaded with
                                     `AutoModel.from_pretrained`, used to embed the query
                                     (the global `tokenizer` and `validation_texts` are also used).
        top_k (int): The number of top results to display.

    Returns:
        None. The function prints the results.
    """
    # <YOUR CODE HERE>
    query_embedding = get_embeddings([query], model, tokenizer)
    cosine_sim = cosine_similarity(query_embedding, database)[0]

    top_k_indices = np.argsort(cosine_sim)[::-1][:top_k]
    top_k_similarities = cosine_sim[top_k_indices]
    top_k_questions = [validation_texts[i] for i in top_k_indices]

    print(f"Query: {query}")
    print("\nTop Matches:")

    for i, (question, similarity) in enumerate(zip(top_k_questions, top_k_similarities), 1):
        print(f"{i}. {question} (Similarity: {similarity:.4f})")

query = "What is the future of artificial intelligence?"
find_similar_questions(query, database_embeddings, model, top_k=5)


Query: What is the future of artificial intelligence?

Top Matches:
1. What will be the future impact of artificial intelligence? (Similarity: 0.9108)
2. What is the future of AI in IoT? (Similarity: 0.7524)
3. What artificial intelligence can do? (Similarity: 0.6954)
4. When will most jobs be replaced by robots? (Similarity: 0.6399)
5. What will be our future? (Similarity: 0.6352)
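A closing note on the `mean_pooling` step used above: the attention mask matters because padded positions would otherwise drag the sentence embedding towards whatever vectors the model emits for padding. The plain-Python sketch below uses made-up 2-D token vectors; `masked_mean` is an illustrative re-implementation of the idea, not the code from the cell.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            sums = [s + x for s, x in zip(sums, vector)]
    count = max(sum(attention_mask), 1)  # guard against an all-padding row
    return [s / count for s in sums]


# Made-up 2-D token vectors for one sentence: two real tokens followed by two padding positions.
tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]
print(pooled, naive)
```

The masked average depends only on the two real tokens, whereas the naive average changes with the amount of padding in the batch.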
