https://github.com/facebookresearch/fairseq/tree/main/examples/roberta/commonsense_qa#3-evaluate

# NLP FS25 Course Project 2: Commonsense Question Answering with Transformers

By David Hodel

Weights & Biases Project: https://wandb.ai/dhodel-hslu-nlp/hslu-fs25-nlp-qa-transformers

## Introduction

In this notebook, I present my solution to the second course project of the FS25 NLP module at HSLU.

The task is to compare three Transformer models on commonsense question answering:
1) A randomly initialized Transformer
2) A pre-trained Transformer (which was not trained / finetuned on CommonsenseQA)
3) An LLM (1B+ parameters) of my choice

I fine-tune the first two models and use prompt engineering for the LLM, then compare the performance of all three models.

### Dataset

We use the CommonsenseQA ([Talmor et al., 2019](https://aclanthology.org/N19-1421/)) dataset in this project. The dataset consists of 12,247 questions with 5 choices each, where only one is correct. The questions are designed to require commonsense reasoning to answer correctly.

The dataset was created by taking concepts from ConceptNet, a semantic network of commonsense knowledge.

## Setup

We first import the necessary libraries to run the code.

In [1]:
from collections import Counter
from datetime import datetime
import random

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import trange
from sklearn.metrics import confusion_matrix

import torch
import torch.nn as nn
import torcheval.metrics as metrics
from torch.utils.data import Dataset, DataLoader

import pytorch_lightning as pl
from pytorch_lightning import seed_everything
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import WandbLogger

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import AutoModelForCausalLM, pipeline

import wandb

We set up a fixed random seed to (at least try to) ensure reproducibility.

In [None]:
SEED = 42
seed_everything(SEED, workers=True)

Since we use Weights & Biases for experiment tracking, we first have to log in to our account.

In [None]:
wandb.login()

### Data Splits

The data is available on Hugging Face: https://huggingface.co/datasets/tau/commonsense_qa.
Since only the train and validation splits have an answer key, we will use our own dataset splits.
We perform the splitting as presented in the lecture slides: we separate the last 1,000 samples from the training set as the validation set and use the original validation set as the test set.

In [None]:
train = load_dataset("tau/commonsense_qa", split="train[:-1000]")
valid = load_dataset("tau/commonsense_qa", split="train[-1000:]")
test = load_dataset("tau/commonsense_qa", split="validation")

print(len(train), len(valid), len(test))

## Data Exploration

First, we want to take a look at the data to understand its structure and content.

In [5]:
datasets = {
  "train": train,
  "validation": valid,
  "test": test
}

We ensure that all three splits have the same structure and that the answers are in the same format.

In [None]:
print(train.column_names)
assert train.column_names == valid.column_names == test.column_names

print(train[0])

unique_answers = set([ex["answerKey"] for ex in train] + [ex["answerKey"] for ex in valid] + [ex["answerKey"] for ex in test])
print(f"Unique answer keys: {unique_answers}")

assert len(unique_answers) == 5

We then display a sample question and its answer for each split to get a feeling for the type of questions and answers.

In [None]:
for split, data in datasets.items():
    print(f"\n=== {split} Split ===")
    print(f"Question: {data[0]['question']}")
    for j, choice in enumerate(data[0]['choices']['text']):
        print(f"{chr(65+j)}) {choice}")  # A, B, C, etc.
    print(f"Correct Answer: {data[0]['answerKey']}")
    print("=" * 50)

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(14, 5))

labels = sorted(list(unique_answers))

colors = sns.color_palette("pastel")[0:5]

for i, (split, data) in enumerate(datasets.items()):
    answer_counts = Counter([ex["answerKey"] for ex in data])
    
    # Sort by labels to ensure consistent order
    counts = [answer_counts[label] for label in labels]
    
    ax[i].bar(labels, counts, color=colors)
    ax[i].set_xlabel("Answer Keys")
    ax[i].set_ylabel("Absolute Frequency")
    ax[i].set_title(f"{split.capitalize()} Set ({len(data)} samples)")
    
    # Add percentage annotations
    total = sum(counts)
    for j, count in enumerate(counts):
        percentage = count / total * 100
        ax[i].annotate(f"{percentage:.1f}%", 
                      xy=(labels[j], count),
                      xytext=(0, 3),
                      textcoords="offset points",
                      ha='center')

plt.suptitle("Distribution of Answer Keys Across Dataset Splits", fontsize=14)
plt.tight_layout()

We see that the distribution is relatively balanced, with a slight preference for answer `B` in the validation and test set.

We also plot the distribution of the number of characters in the questions.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(12, 10))

colors = sns.color_palette("pastel")[0:3]
all_question_lengths = []

for i, (split, data) in enumerate(datasets.items()):
  question_lengths = [len(ex["question"]) for ex in data]
  all_question_lengths.append(question_lengths)
  
  # Histogram plots (top row)
  axes[0, i].hist(question_lengths, bins=30, color=colors[i])
  axes[0, i].set_xlabel("Question Length (characters)")
  axes[0, i].set_ylabel("Absolute Frequency")
  axes[0, i].set_title(f"{split.capitalize()} Question Length Distribution")
  
  # Add statistics as text
  axes[0, i].text(0.6, 0.95, 
      f"Min: {min(question_lengths)}\nMax: {max(question_lengths)}\nMean: {np.mean(question_lengths):.1f}\nMedian: {np.median(question_lengths)}",
      transform=axes[0, i].transAxes,
      bbox=dict(facecolor='white'),
      verticalalignment='top')
  
  # Boxplot (bottom row)
  axes[1, i].boxplot(question_lengths, patch_artist=True)
  axes[1, i].set_title(f"{split.capitalize()} Length Boxplot")
  axes[1, i].set_ylabel("Characters")
  
  # Set the boxplot fill color
  for patch in axes[1, i].get_children():
    if isinstance(patch, plt.matplotlib.patches.PathPatch):
      patch.set_facecolor(colors[i])

plt.suptitle("Distribution of Question Lengths Across Dataset Splits", fontsize=14)
plt.tight_layout()

We see that the questions are relatively short, with most of them having fewer than 100 characters. The three splits have similar distributions and similar mean and median values.

The longest question has 376 characters, which is easily manageable for a Transformer model.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(12, 10))

colors = sns.color_palette("pastel")[0:3]
all_choice_lengths = []

for i, (split, data) in enumerate(datasets.items()):
  choice_lengths = np.array([[len(choice) for choice in ex["choices"]["text"]] for ex in data]).flatten()
  all_choice_lengths.append(choice_lengths)
  
  # Histogram plots (top row)
  axes[0, i].hist(choice_lengths, bins=30, color=colors[i])
  axes[0, i].set_xlabel("Choice Length (characters)")
  axes[0, i].set_ylabel("Absolute Frequency")
  axes[0, i].set_title(f"{split.capitalize()} Choice Length Distribution")
  
  # Add statistics as text
  axes[0, i].text(0.6, 0.95, 
      f"Min: {min(choice_lengths)}\nMax: {max(choice_lengths)}\nMean: {np.mean(choice_lengths):.1f}\nMedian: {np.median(choice_lengths)}",
      transform=axes[0, i].transAxes,
      bbox=dict(facecolor='white'),
      verticalalignment='top')
  
  # Boxplot (bottom row)
  axes[1, i].boxplot(choice_lengths, patch_artist=True)
  axes[1, i].set_title(f"{split.capitalize()} Length Boxplot")
  axes[1, i].set_ylabel("Characters")
  
  # Set the boxplot fill color
  for patch in axes[1, i].get_children():
    if isinstance(patch, plt.matplotlib.patches.PathPatch):
      patch.set_facecolor(colors[i])

plt.suptitle("Distribution of Choice Lengths Across Dataset Splits", fontsize=14)
plt.tight_layout()

## Preprocessing

In [7]:
roberta_tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base", use_fast=False) # disable fast tokenizer for multi-threaded tokenization (https://stackoverflow.com/a/72926996)

In [None]:
encoded = roberta_tokenizer.encode("Is this working?", return_tensors="pt")
decoded = roberta_tokenizer.decode(encoded[0])

print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")

### Prepare Data for Transformers

In [6]:
def answer_key_to_index(answer_key):
  return ord(answer_key) - ord("A")

def index_to_answer_key(index):
  return chr(index + ord("A"))

assert answer_key_to_index("A") == 0
assert index_to_answer_key(0) == "A"

In [10]:
class CommonsenseQADataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length, debug=False):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.debug = debug
        
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        example = self.dataset[idx]
        question = example["question"]
        choices = example["choices"]["text"]
        
        answer_index = answer_key_to_index(example["answerKey"])
            
        # Tokenize all question-answer pairs but don't pad yet
        encodings = []
        for choice in choices:
            encoding = self.tokenizer(
                question,
                choice,
                truncation=False,
                return_tensors=None  # Return lists, not tensors
            )

            if self.debug:
                # Assert that max_length is respected
                assert len(encoding["input_ids"]) <= self.max_length, "Input exceeds max length"

            encodings.append(encoding)
            
        return encodings, answer_index

In [11]:
class MultipleChoiceCollator:
    def __init__(self, tokenizer, debug=False):
        self.tokenizer = tokenizer
        self.debug = debug
        
    def __call__(self, batch):
        # Unpack the batch - each item is now a tuple of (encodings, label)
        encodings_list = [item[0] for item in batch]  # List of lists of encodings
        labels = [item[1] for item in batch]  # List of labels
        
        # Flatten all encodings
        flat_encodings = [encoding for encodings in encodings_list for encoding in encodings]
        
        # Pad to the longest in this batch
        padded_encodings = self.tokenizer.pad(
            flat_encodings,
            padding=True,
            return_tensors="pt"
        )

        num_choices = 5
        batch_size = len(batch)

        # Reshape back to [batch_size, num_choices, seq_length]
        input_ids = padded_encodings["input_ids"].view(batch_size, num_choices, -1)
        attention_mask = padded_encodings["attention_mask"].view(batch_size, num_choices, -1)
        
        # Convert labels to tensor
        labels = torch.tensor(labels, dtype=torch.long)
        
        # Return a tuple of (input_ids, attention_mask, labels)
        return input_ids, attention_mask, labels
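
To make the flatten, pad, and reshape steps of the collator concrete, here is a minimal plain-Python sketch. It uses hand-written token ID lists instead of a real tokenizer, two choices per question instead of five, and a hypothetical helper `pad_and_group` with an assumed padding ID of 1. All of these are assumptions for illustration only.

```python
def pad_and_group(batch, num_choices, pad_id=1):
    """Flatten per-question encodings, pad to the longest, regroup per question."""
    # Flatten: [batch][choice][tokens] -> [batch * choice][tokens]
    flat = [ids for encodings in batch for ids in encodings]
    assert len(flat) == len(batch) * num_choices

    # Pad every sequence to the longest one in this batch
    max_len = max(len(ids) for ids in flat)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in flat]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in flat]

    # Regroup: [batch * choice][max_len] -> [batch][choice][max_len]
    def group(rows):
        return [rows[i:i + num_choices] for i in range(0, len(rows), num_choices)]

    return group(input_ids), group(attention_mask)


# Two questions with two (made-up) question-choice encodings each
toy_batch = [
    [[0, 11, 12, 2], [0, 11, 2]],
    [[0, 21, 2], [0, 21, 22, 23, 2]],
]
toy_input_ids, toy_attention_mask = pad_and_group(toy_batch, num_choices=2)
print(toy_input_ids)
print(toy_attention_mask)
```

Because padding happens per batch, the sequence length is that of the longest question-choice pair in the batch rather than `max_length`.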

In [12]:
class CommonsenseQADataModule(pl.LightningDataModule):
    def __init__(self, train_dataset, valid_dataset, test_dataset, tokenizer, batch_size=16, max_length=512, num_workers=8, debug=False):
        super().__init__()
        self.train_dataset = train_dataset
        self.valid_dataset = valid_dataset
        self.test_dataset = test_dataset
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.max_length = max_length
        self.num_workers = num_workers
        self.debug = debug
        
    def setup(self, stage=None):
        # Create datasets
        if stage == 'fit' or stage is None:
            self.train_ds = CommonsenseQADataset(self.train_dataset, self.tokenizer, self.max_length, debug=self.debug)
            self.val_ds = CommonsenseQADataset(self.valid_dataset, self.tokenizer, self.max_length, debug=self.debug)
            
            if self.debug:
                # Ensure datasets have expected properties
                assert len(self.train_ds) == len(self.train_dataset), "Train dataset length mismatch"
                assert len(self.val_ds) == len(self.valid_dataset), "Validation dataset length mismatch"
        
        if stage == 'test' or stage is None:
            self.test_ds = CommonsenseQADataset(self.test_dataset, self.tokenizer, self.max_length, debug=self.debug)

            if self.debug:
                assert len(self.test_ds) == len(self.test_dataset), "Test dataset length mismatch"
    
    def train_dataloader(self):
        loader = torch.utils.data.DataLoader(
            self.train_ds,
            batch_size=self.batch_size,
            collate_fn=MultipleChoiceCollator(self.tokenizer, debug=self.debug),
            shuffle=True,
            num_workers=self.num_workers
        )
        return loader
    
    def val_dataloader(self):
        loader = torch.utils.data.DataLoader(
            self.val_ds,
            batch_size=self.batch_size,
            collate_fn=MultipleChoiceCollator(self.tokenizer, debug=self.debug),
            shuffle=False,
            num_workers=self.num_workers
        )
        return loader
    
    def test_dataloader(self):
        loader = torch.utils.data.DataLoader(
            self.test_ds,
            batch_size=self.batch_size,
            collate_fn=MultipleChoiceCollator(self.tokenizer, debug=self.debug),
            shuffle=False,
            num_workers=self.num_workers
        )
        return loader

In [13]:
# Initialize DataModule
max_input_length = 512 # 514 (max_position_embeddings in config.json of distilroberta-base) - 2 (position indices reserved by RoBERTa's padding offset)
data_module = CommonsenseQADataModule(train, valid, test, roberta_tokenizer, batch_size=24, max_length=max_input_length, debug=False)

### Prepare Data for LLM

In [5]:
class PhiPromptDataset(Dataset):
    def __init__(self, dataset, random_subset_size=1.0):
        self.dataset = dataset
        self.prompts = []
        self.correct_answers = []

        self.prompt_template = """<|system|>You are a helpful assistant in a multiple-choice question-answering task. Answer the following multiple-choice commonsense reasoning question with just the letter of the correct answer (A, B, C, D, or E). Do not provide any explanations or additional information. Your response must not contain the full answer, only the letter.<|end|>
<|user|>Question: What do students do in school?
Choices:
A They play outside.
B They eat lunch.
C They go home.
D They learn and study.
E They sleep.<|end|>
<|assistant|>D<|end|>
<|user|>Question: If you leave ice out in the sun, what will most likely happen to it?
Choices:
A It will catch fire
B It will melt
C It will grow bigger
D It will turn into dust
E It will start glowing<|end|>
<|assistant|>B<|end|>
<|user|>Question: If you are hungry, what is the most logical thing to do?
Choices:
A Take a nap
B Go for a swim
C Eat some food
D Buy new shoes
E Read a book<|end|>
<|assistant|>C<|end|>
<|user|>Question: {question}
Choices:
A {choice_a}
B {choice_b}
C {choice_c}
D {choice_d}
E {choice_e}<|end|>
<|assistant|>"""

        
        self.prepare_data(random_subset_size)
    
    def prepare_data(self, random_subset_size=1.0):
        # If random_subset_size is 1.0, use the entire dataset
        if random_subset_size >= 1.0:
            subset = self.dataset
        else:
            # Calculate the number of examples to include
            subset_size = max(1, int(len(self.dataset) * random_subset_size))
            
            # Get random indices without replacement
            indices = random.sample(range(len(self.dataset)), subset_size)
            
            # Create the subset
            subset = [self.dataset[i] for i in indices]
        
        # Process the subset
        for example in subset:
            question = example["question"]
            choices = example["choices"]["text"]
            correct_answer = answer_key_to_index(example["answerKey"])
            prompt = self.create_prompt(question, choices)
            
            self.prompts.append(prompt)
            self.correct_answers.append(correct_answer)
    
    def create_prompt(self, question, choices):
        prompt = self.prompt_template.format(
            question=question,
            choice_a=choices[0],
            choice_b=choices[1],
            choice_c=choices[2],
            choice_d=choices[3],
            choice_e=choices[4]
        )
        
        return prompt
    
    def __len__(self):
        return len(self.prompts)
    
    def __getitem__(self, idx):
        return {
            "prompt": self.prompts[idx],
            "correct_answer": self.correct_answers[idx]
        }

In [None]:
phi_prompt_valid = PhiPromptDataset(valid, random_subset_size=0.1) # around 100 samples is usually enough to see the model's performance
phi_prompt_test = PhiPromptDataset(test)

phi_prompt_valid[0], phi_prompt_test[0]
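
As a rough check on how much a 10% validation subset can tell us, the following sketch computes the standard error of an accuracy estimate under a simple binomial assumption. The helper `accuracy_standard_error` and the accuracy value of 0.7 are hypothetical and only illustrate the order of magnitude. They are not results from this project.

```python
import math


def accuracy_standard_error(accuracy, num_samples):
    """Standard error of an accuracy estimate, treating each sample as an independent Bernoulli trial."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_samples)


hypothetical_accuracy = 0.7  # placeholder value, not a measured result
for n in (100, 1000):
    se = accuracy_standard_error(hypothetical_accuracy, n)
    # 1.96 standard errors give an approximate 95% interval (normal approximation)
    print(f"n={n}: accuracy = {hypothetical_accuracy:.2f} +/- {1.96 * se:.3f}")
```

Under these assumptions, an estimate from about 100 samples is only accurate to within several percentage points, which should be enough for coarse prompt comparisons but not for judging small differences.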

## Models

### 1) Pre-trained Transformer

I decided to use a distilled version of the [RoBERTa base model](https://huggingface.co/FacebookAI/roberta-base) for this task. The model is available on Hugging Face ([distilbert/distilroberta-base](https://huggingface.co/distilbert/distilroberta-base)) and was trained using the same procedure as DistilBERT.

The model has 6 layers, a hidden dimension of 768, and 12 attention heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base). According to Hugging Face, the model runs on average twice as fast as RoBERTa-base.
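
The parameter counts can be reproduced approximately with a back-of-the-envelope calculation. The sketch below assumes the standard RoBERTa-base hyperparameters (vocabulary size 50,265, 514 position embeddings, feed-forward size 3,072, one token type, and a pooler layer), which are not stated in this notebook. The function name `roberta_param_count` is my own.

```python
def roberta_param_count(num_layers, hidden=768, ffn=3072, vocab=50265, positions=514, token_types=1):
    """Approximate parameter count of a RoBERTa-style encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab + positions + token_types) * hidden + layer_norm

    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm  # one LayerNorm after each sub-block

    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


for name, layers in (("distilroberta-base", 6), ("roberta-base", 12)):
    print(f"{name}: {roberta_param_count(layers) / 1e6:.1f}M parameters")
```

Under these assumptions the calculation gives roughly 82M and 125M parameters. The number of attention heads does not enter the count, because the heads split the same projection matrices.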

In [14]:
pretrained_distilroberta = AutoModel.from_pretrained("distilbert/distilroberta-base")

In [15]:
class RobertaMultipleChoiceModel(pl.LightningModule):
    def __init__(self, roberta_model, dropout_prob=0.1, learning_rate=1e-5, weight_decay=1e-3, use_layer_norm=True, hidden_size_multiplier=1.0, debug=False):
        super().__init__()
        
        self.save_hyperparameters(ignore=['roberta_model'])

        self.roberta = roberta_model
        self.roberta.train()

        hidden_size = int(hidden_size_multiplier * self.roberta.config.hidden_size)
        print(f"Hidden size: {hidden_size}")
        
        # Custom classification head
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Sequential(
            nn.Linear(self.roberta.config.hidden_size, hidden_size),
            nn.LayerNorm(hidden_size) if use_layer_norm else nn.Identity(),
            nn.ReLU(), # non-linearity
            nn.Dropout(dropout_prob),
            nn.Linear(hidden_size, 1) # single score per candidate
        )
        
        # Loss function
        self.criterion = nn.CrossEntropyLoss()
        
        # Metrics
        self.train_accuracy = metrics.MulticlassAccuracy(num_classes=5)
        self.val_accuracy = metrics.MulticlassAccuracy(num_classes=5)
        self.test_accuracy = metrics.MulticlassAccuracy(num_classes=5)

        self.test_y = []
        self.test_y_pred = []

        self.debug = debug
    
    def forward(self, input_ids, attention_mask):
        # input_ids and attention_mask have shape: [batch_size, num_choices, seq_length]
        this_batch_size, num_choices, seq_length = input_ids.shape

        if self.debug:
            assert num_choices == 5, "Number of choices should be 5 for CommonsenseQA"
            assert seq_length <= self.roberta.config.max_position_embeddings, "Sequence length exceeds model's max position embeddings"
            assert input_ids.shape == attention_mask.shape, "Input IDs and attention mask should have the same shape"
        
        # Reshape to feed through the model
        input_ids = input_ids.view(-1, seq_length)  # [batch_size * num_choices, seq_length]
        attention_mask = attention_mask.view(-1, seq_length)  # [batch_size * num_choices, seq_length]

        if self.debug:
            assert input_ids.shape == attention_mask.shape, "Input IDs and attention mask should have the same shape"
            assert input_ids.shape[0] == attention_mask.shape[0] == this_batch_size * num_choices, "First dimension should be batch size * num choices"
            assert input_ids.shape[1] == attention_mask.shape[1] == seq_length, "Second dimension should be sequence length"
        
        # Forward pass through base model
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        
        # Get the first token (<s>) representation
        pooled_output = outputs.last_hidden_state[:, 0]  # [batch_size * num_choices, hidden_size]
        pooled_output = self.dropout(pooled_output)

        if self.debug:
            assert pooled_output.shape == (this_batch_size * num_choices, self.roberta.config.hidden_size), "Pooled output should have shape [batch_size * num_choices, hidden_size]"
        
        # Get logits for each choice
        logits = self.classifier(pooled_output)  # [batch_size * num_choices, 1]

        if self.debug:
            assert logits.shape == (this_batch_size * num_choices, 1), "Logits should have shape [batch_size * num_choices, 1]"
        
        # Reshape logits back to [batch_size, num_choices]
        reshaped_logits = logits.view(this_batch_size, num_choices)

        if self.debug:
            assert reshaped_logits.shape == (this_batch_size, num_choices), "Reshaped logits should have shape [batch_size, num_choices]"
        
        return reshaped_logits
    
    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        logits = self(input_ids, attention_mask)
        loss = self.criterion(logits, labels)
        
        # Update metrics
        self.train_accuracy.update(logits, labels)
        
        # Log metrics
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        self.log("train_acc", self.train_accuracy.compute().item(), on_step=False, on_epoch=True, prog_bar=True)
        
        return loss
    
    def validation_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        logits = self(input_ids, attention_mask)
        loss = self.criterion(logits, labels)
        
        # Update metrics
        self.val_accuracy.update(logits, labels)
        
        # Log metrics
        self.log("val_loss", loss, on_epoch=True, prog_bar=True)
        self.log("val_acc", self.val_accuracy.compute().item(), on_epoch=True, prog_bar=True)
        
        return loss
    
    def test_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        logits = self(input_ids, attention_mask)
        loss = self.criterion(logits, labels)

        # Update metrics
        self.test_accuracy.update(logits, labels)

        # Log metrics
        self.log("test_loss", loss, on_epoch=True)
        self.log("test_acc", self.test_accuracy.compute().item(), on_epoch=True)

        self.test_y_pred.extend(logits.argmax(dim=1).cpu().numpy())
        self.test_y.extend(labels.cpu().numpy())

        return loss
    
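    def reset_test_arrays(self):
        self.test_y = []
        self.test_y_pred = []

    # torcheval metrics accumulate state until they are reset, so clear them at the
    # start of each epoch; otherwise counts from previous epochs carry over
    def on_train_epoch_start(self):
        self.train_accuracy.reset()

    def on_validation_epoch_start(self):
        self.val_accuracy.reset()

    def on_test_epoch_start(self):
        self.test_accuracy.reset()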

    def configure_optimizers(self):
        # Group parameters to apply a lower learning rate to the transformer layers
        transformer_lr_multiplier = 0.1
        transformer_lr = transformer_lr_multiplier * self.hparams.learning_rate
        classifier_lr = self.hparams.learning_rate

        optimizer_grouped_parameters = [
            {
                "params": [p for _, p in self.roberta.named_parameters()],
                "lr": transformer_lr,
                "weight_decay": self.hparams.weight_decay,
            },
            {
                "params": [p for _, p in self.classifier.named_parameters()],
                "lr": classifier_lr,
                "weight_decay": self.hparams.weight_decay,
            },
        ]

        optimizer = torch.optim.AdamW(optimizer_grouped_parameters)
        
        # Set up learning rate scheduler
        total_steps = self.trainer.estimated_stepping_batches
        warmup_steps = total_steps // 10
        
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=[transformer_lr, classifier_lr],  # Specify max_lr for each group
            total_steps=total_steps,
            pct_start=warmup_steps / total_steps,
            div_factor=100,
            final_div_factor=1000,
            anneal_strategy="linear"
        )
        
        scheduler_config = {
            "scheduler": scheduler,
            "interval": "step",
            "frequency": 1,
        }
        
        return [optimizer], [scheduler_config]
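
The model produces one score per question-choice pair and treats the five scores of a question as the logits of a 5-way classification. The following plain-Python sketch shows what the cross-entropy loss computes for a single question. The score values are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Negative log-probability of the correct choice."""
    return -math.log(softmax(scores)[label])


choice_scores = [0.2, 2.5, -1.0, 0.3, 0.0]  # hypothetical scores for choices A-E
correct_index = 1  # answer key "B"

probs = softmax(choice_scores)
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(f"Predicted choice: {chr(ord('A') + predicted_index)}")
print(f"Loss: {cross_entropy(choice_scores, correct_index):.4f}")
```

A model that scores all five choices equally has a loss of ln(5) ≈ 1.609, which is a useful reference value when reading the loss curves.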

In [None]:
roberta_use_layer_norm = True
roberta_hidden_size_multiplier = 1.0

pretrained_roberta_model = RobertaMultipleChoiceModel(pretrained_distilroberta, dropout_prob=0.1, learning_rate=1e-5, weight_decay=1e-3, use_layer_norm=roberta_use_layer_norm, hidden_size_multiplier=roberta_hidden_size_multiplier, debug=False)
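
With `learning_rate=1e-5`, `configure_optimizers` requests a linear warm-up over the first 10% of steps followed by a linear decay, with the transformer layers at one tenth of the classifier's rate. The sketch below approximates that shape in plain Python. It reflects my reading of the two-phase linear mode of `OneCycleLR` and does not call PyTorch, so exact step boundaries may differ slightly. The helper `one_cycle_linear_lr` and the step count of 1,000 are hypothetical.

```python
def one_cycle_linear_lr(step, total_steps, max_lr, pct_start=0.1, div_factor=100, final_div_factor=1000):
    """Approximate learning rate at `step` (0-based) for a linear one-cycle schedule."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1  # step at which max_lr is reached
    last_step = total_steps - 1

    if step <= warmup_end:
        # Warm-up: rise linearly from initial_lr to max_lr
        fraction = step / warmup_end
        return initial_lr + fraction * (max_lr - initial_lr)

    # Decay: fall linearly from max_lr to min_lr
    fraction = (step - warmup_end) / (last_step - warmup_end)
    return max_lr + fraction * (min_lr - max_lr)


total_steps = 1000  # hypothetical; the notebook uses trainer.estimated_stepping_batches
classifier_max_lr = 1e-5
transformer_max_lr = 0.1 * classifier_max_lr

for step in (0, 99, 500, 999):
    transformer_lr = one_cycle_linear_lr(step, total_steps, transformer_max_lr)
    classifier_lr = one_cycle_linear_lr(step, total_steps, classifier_max_lr)
    print(f"step {step}: transformer lr = {transformer_lr:.3e}, classifier lr = {classifier_lr:.3e}")
```

Under these assumptions, the classifier's rate starts at 1e-7, peaks at 1e-5 after the warm-up, and ends near 1e-10, while the transformer layers follow the same curve scaled by 0.1.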

### 2) Randomly Initialized Transformer

In [None]:
# Create a randomly initialized model using the same configuration
random_config = AutoConfig.from_pretrained("distilbert/distilroberta-base")
random_initialized_roberta = AutoModel.from_config(random_config)

# Initialize the PyTorch Lightning model with random weights
random_initialized_roberta_model = RobertaMultipleChoiceModel(random_initialized_roberta, dropout_prob=0.1, learning_rate=1e-5, weight_decay=1e-3, use_layer_norm=roberta_use_layer_norm, hidden_size_multiplier=roberta_hidden_size_multiplier, debug=False)

### 3) LLM - Phi-4-mini

For the LLM approach, I'll use Phi-4-mini, which is a 3.8 billion parameter model released by Microsoft.
Unlike the previous models, this model won't be fine-tuned but will use prompt engineering techniques.

In [None]:
phi_model_name = "microsoft/Phi-4-mini-instruct"

phi_tokenizer = AutoTokenizer.from_pretrained(phi_model_name, use_fast=False) # disable fast tokenizer for multi-threaded tokenization (https://stackoverflow.com/a/72926996)

phi_model = AutoModelForCausalLM.from_pretrained(
    phi_model_name, 
    torch_dtype="auto",  
    trust_remote_code=True,
)

In [20]:
class PhiPromptEngineering:
    def __init__(self, model, tokenizer, debug=False):
        self.model = model
        self.debug = debug

        self.pipe = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            max_new_tokens=1, # only need a short response (just the letter)
            do_sample=False, # we want deterministic output
            return_full_text=False # only return the newly generated text
        )
    
    def predict(self, prompt):
        response = self.pipe(prompt)[0]["generated_text"].strip()
        
        if self.debug:
            print(f"Response: '{response}'")

        assert response, "Response is empty"
        assert len(response) == 1, "Response should be a single character"
        assert response[0] in "ABCDE", "Response should be one of the choices (A-E)"

        return answer_key_to_index(response[0].upper())

    
    def evaluate(self, dataset, log_wandb=True):
        correct = 0
        total = 0
        accuracy = 0.0

        y = []
        y_pred = []
        
        samples_count = len(dataset)
        for i in (pbar := trange(samples_count)):
            pbar.set_description(f"Sample {i}/{samples_count}")

            sample = dataset[i]
            prompt, correct_answer = sample["prompt"], sample["correct_answer"]
            predicted_index = self.predict(prompt)

            y.append(correct_answer)
            y_pred.append(predicted_index)
            
            total += 1
            if predicted_index == correct_answer:
                correct += 1

            accuracy = correct / total
            pbar.set_postfix({"accuracy": accuracy})

            if log_wandb:
                wandb.log({
                    "accuracy": accuracy
                })

        return accuracy, y, y_pred

## Training

### 1) Pre-trained Transformer

In [None]:
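# Note: this function reuses the name `train`, which shadows the `train` dataset split loaded above.
# This works here because `data_module` already holds a reference to that dataset.
def train(model, data_module, max_epochs, checkpoints_path, wandb_run_prefix, early_stopping_patience=None, debug=False):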
  if early_stopping_patience is None:
    early_stopping_patience = max_epochs + 1 # disable early stopping

  if debug:
    max_epochs = 1

  best_checkpoint_callback = ModelCheckpoint(
      dirpath=checkpoints_path,
      filename="best-{epoch:02d}-{val_acc:.4f}",
      save_top_k=1,
      monitor="val_acc",
      mode="max"
  )

  regular_checkpoint_callback = ModelCheckpoint(
      dirpath=checkpoints_path,
      filename="latest-{epoch:02d}",
      save_top_k=1, # only keep the most recent checkpoint
      every_n_epochs=1, # save every epoch
  )

  early_stop_callback = EarlyStopping(
    monitor="val_acc",
    patience=early_stopping_patience,
    mode="max"
  )
  
  lr_callback = LearningRateMonitor()

  wandb_logger = WandbLogger(
    entity="dhodel-hslu-nlp",
    project="hslu-fs25-nlp-qa-transformers",
    name=f"{wandb_run_prefix}-{datetime.now().strftime('%Y-%m-%dT%H:%M:%S')}",
    reinit=True,
    log_model=(not debug)
  )

  torch.set_float32_matmul_precision('high')
  trainer = pl.Trainer(
    max_epochs=max_epochs,
    accelerator="auto", # Uses GPU if available, otherwise CPU
    callbacks=[best_checkpoint_callback, regular_checkpoint_callback, early_stop_callback, lr_callback],
    logger=wandb_logger,
    log_every_n_steps=10,
  )

  trainer.fit(model, data_module)

  wandb.finish()

  return trainer, best_checkpoint_callback.best_model_path

In [None]:
pretrained_checkpoints_path = "./checkpoints/pretrained"

pretrained_roberta_trainer, pretrained_roberta_best_checkpoint = train(
  model=pretrained_roberta_model,
  data_module=data_module,
  max_epochs=50,
  checkpoints_path=pretrained_checkpoints_path,
  early_stopping_patience=5,
  wandb_run_prefix="pretrained-roberta",
  debug=True
)

### 2) Randomly Initialized Transformer

In [None]:
random_initialized_checkpoints_path = "./checkpoints/random-initialized"

randomly_initialized_roberta_trainer, randomly_initialized_roberta_best_checkpoint = train(
  model=random_initialized_roberta_model,
  data_module=data_module,
  max_epochs=50,
  checkpoints_path=random_initialized_checkpoints_path,
  early_stopping_patience=5,
  wandb_run_prefix="random-initialized-roberta",
)

### 3) Prompt Engineering with Phi LLM

For the Phi model, we don't need training since we're using prompt engineering.
We first evaluate the model on a random sample of the validation set to save time and resources; the full test set is used later in the evaluation section.

In [None]:
phi_prompt_engineering = PhiPromptEngineering(
    model=phi_model,
    tokenizer=phi_tokenizer,
    debug=False
)

In [None]:
wandb.init(
    entity="dhodel-hslu-nlp",
    project="hslu-fs25-nlp-qa-transformers",
    name=f"phi-prompt-engineering-{datetime.now().strftime('%Y-%m-%dT%H:%M:%S')}",
    reinit=True,
    config={
        "model_name": phi_model_name,
        "prompt_template": phi_prompt_valid.prompt_template,
    }
)

In [None]:
phi_accuracy, y, y_pred = phi_prompt_engineering.evaluate(phi_prompt_valid, log_wandb=True)

In [None]:
wandb.finish()

## Evaluation

In [None]:
pretrained_roberta_trainer.logger = False
randomly_initialized_roberta_trainer.logger = False

In [None]:
pretrained_roberta_model.reset_test_arrays()
pretrained_test_results = pretrained_roberta_trainer.test(pretrained_roberta_model, datamodule=data_module, ckpt_path=pretrained_roberta_best_checkpoint)

random_initialized_roberta_model.reset_test_arrays()
random_test_results = randomly_initialized_roberta_trainer.test(random_initialized_roberta_model, datamodule=data_module, ckpt_path=randomly_initialized_roberta_best_checkpoint)

pretrained_test_labels = pretrained_roberta_model.test_y
pretrained_test_preds = pretrained_roberta_model.test_y_pred

random_test_labels = random_initialized_roberta_model.test_y
random_test_preds = random_initialized_roberta_model.test_y_pred

In [None]:
phi_accuracy, phi_test_labels, phi_test_preds = phi_prompt_engineering.evaluate(phi_prompt_test, log_wandb=False)

In [None]:
label_mapping = {i: chr(65 + i) for i in range(5)}  # 0->A, 1->B, etc.
label_names = list(label_mapping.values())

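# Fix the label order so every matrix is 5x5 and lines up with label_names,
# even if a model never predicts one of the choices
cm_labels = list(label_mapping.keys())
pretrained_cm = confusion_matrix(pretrained_test_labels, pretrained_test_preds, labels=cm_labels)
random_cm = confusion_matrix(random_test_labels, random_test_preds, labels=cm_labels)
phi_cm = confusion_matrix(phi_test_labels, phi_test_preds, labels=cm_labels)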

# Determine the global min and max values for consistent scaling
global_vmin = min(pretrained_cm.min(), random_cm.min(), phi_cm.min())
global_vmax = max(pretrained_cm.max(), random_cm.max(), phi_cm.max())

# Create a figure with three subplots
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(21, 6))

# Plot confusion matrices
sns.heatmap(pretrained_cm, annot=True, fmt="d", cmap="Blues", 
            xticklabels=label_names, yticklabels=label_names, 
            ax=ax1, vmin=global_vmin, vmax=global_vmax)
ax1.set_title("Pretrained RoBERTa Model Confusion Matrix", fontsize=14)
ax1.set_xlabel("Predicted Choice", fontsize=12)
ax1.set_ylabel("True Choice", fontsize=12)

sns.heatmap(random_cm, annot=True, fmt="d", cmap="Blues", 
            xticklabels=label_names, yticklabels=label_names, 
            ax=ax2, vmin=global_vmin, vmax=global_vmax)
ax2.set_title("Random Initialized RoBERTa Model Confusion Matrix", fontsize=14)
ax2.set_xlabel("Predicted Choice", fontsize=12)
ax2.set_ylabel("True Choice", fontsize=12)

sns.heatmap(phi_cm, annot=True, fmt="d", cmap="Blues", 
            xticklabels=label_names, yticklabels=label_names, 
            ax=ax3, vmin=global_vmin, vmax=global_vmax)
ax3.set_title("Phi-4-mini Confusion Matrix", fontsize=14)
ax3.set_xlabel("Predicted Choice", fontsize=12)
ax3.set_ylabel("True Choice", fontsize=12)

plt.tight_layout()
plt.savefig("graphics/confusion_matrices_all_models.svg", bbox_inches="tight")
plt.show()

# Create a bar chart to compare model performance
model_names = ['Pretrained RoBERTa', 'Random Initialized RoBERTa', 'Phi-4-mini']
accuracies = [
    pretrained_test_results[0]['test_acc'],
    random_test_results[0]['test_acc'],
    phi_accuracy
]

plt.figure(figsize=(10, 6))
bars = plt.bar(model_names, accuracies, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
plt.ylim(0, 1.0)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Model Performance Comparison', fontsize=14)
plt.xticks(rotation=15, ha='right')

# Add the accuracy values on top of the bars
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2., 
             bar.get_height() + 0.01, 
             f'{acc:.4f}', 
             ha='center')

plt.tight_layout()
plt.savefig('graphics/model_comparison.svg', bbox_inches='tight')
plt.show()

## Interpretation

### Interpretation of Results

The comparison between our three models - pre-trained RoBERTa, randomly initialized RoBERTa, and Phi-4-mini LLM (with prompt engineering) - reveals several interesting patterns in their performance on the CommonsenseQA task.

1. **Pre-trained RoBERTa**: The model leverages transfer learning from its pre-training phase, giving it a strong foundation for understanding language patterns and semantics before fine-tuning on our specific task.

2. **Randomly Initialized RoBERTa**: Starting from scratch, this model had to learn language patterns solely from our training data, which is much more challenging given the limited size of the dataset compared to typical pre-training datasets.

3. **Phi-4-mini LLM (Prompt Engineering)**: This approach uses a much larger model (3.8B parameters) without any task-specific fine-tuning, relying instead on prompt engineering to elicit the desired behavior.

The confusion matrices reveal each model's specific strengths and weaknesses in predicting different answer choices. The bar chart provides a clear comparison of overall accuracy between the three approaches.
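
One way to make the per-choice strengths and weaknesses concrete is to compute recall per answer choice from the label and prediction lists. The following is a minimal sketch, assuming labels and predictions are integer-like values from 0 to 4 (as in `label_mapping`). `per_choice_recall` is a hypothetical helper that is not part of the notebook, and the toy lists are made-up illustration data, not results.

```python
from collections import Counter

def per_choice_recall(y_true, y_pred, num_choices=5):
    """Return recall per answer choice: correct predictions / true occurrences.

    Choices that never occur in y_true get None instead of a number.
    """
    y_true = [int(t) for t in y_true]
    y_pred = [int(p) for p in y_pred]
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return [
        correct[c] / totals[c] if totals[c] else None
        for c in range(num_choices)
    ]

# Toy data for illustration only (not model output)
toy_true = [0, 0, 1, 1, 2, 3, 4]
toy_pred = [0, 1, 1, 1, 2, 4, 4]
toy_recall = per_choice_recall(toy_true, toy_pred)
```

Under the same assumption, the helper could be applied to pairs such as `pretrained_test_labels` and `pretrained_test_preds`. A choice with clearly lower recall than the others would point to a positional or lexical bias worth inspecting in the corresponding confusion matrix row.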

### Key Observations

- **Effect of Pre-training**: The accuracy gap between the pre-trained and randomly initialized models shows the value of transfer learning, especially for tasks with limited labeled data such as CommonsenseQA, whose roughly 12k questions are far too few to learn language from scratch.

- **LLM with Prompt Engineering**: Phi's performance shows how effectively large language models can be adapted to specific tasks without fine-tuning, using only careful prompt design.

- **Error Patterns**: The confusion matrices show different patterns of errors across models, suggesting they may be making different types of mistakes despite being evaluated on the same task.
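
Because the models are compared on a finite test set (and Phi only on a sample of it), it can help to attach an uncertainty estimate to each accuracy before reading much into a gap. The sketch below computes a Wilson score interval for an accuracy value. It is an added illustration, not part of the original evaluation; `wilson_interval` is a hypothetical helper and the counts in the example are placeholders, not results from this project.

```python
import math

def wilson_interval(num_correct, num_total, z=1.96):
    """Approximate Wilson score interval for an accuracy estimate (z=1.96 gives about 95%)."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    p = num_correct / num_total
    denom = 1 + z**2 / num_total
    centre = (p + z**2 / (2 * num_total)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / num_total + z**2 / (4 * num_total**2)
    ) / denom
    return centre - half_width, centre + half_width

# Placeholder counts for illustration only
low, high = wilson_interval(num_correct=50, num_total=100)
```

As a rough rule, two accuracies whose intervals do not overlap differ by more than sampling noise alone would explain. Overlapping intervals call for a paired test on the per-question predictions before drawing a conclusion.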

### Conclusions

This comparison illustrates the trade-offs between different approaches to transformer-based question answering:

- Pre-trained + fine-tuned models offer strong performance with reasonable computational requirements
- Randomly initialized models struggle without transfer learning benefits
- LLMs with prompt engineering can achieve competitive results without task-specific training, but at higher computational cost

The results highlight how different transformer-based approaches can be selected based on available resources, performance requirements, and deployment constraints.

## Tools Used

- Visual Studio Code as IDE
- Jupyter Notebook for interactive development
- Python 3.9.21
- GitHub for version control
- Weights & Biases for experiment tracking and hyperparameter optimization
- Claude 3.7 Sonnet for troubleshooting, finding bugs and discussing ideas
- GitHub Copilot Chat for troubleshooting and finding bugs