# Question-Answering Fine-tuning

This notebook demonstrates fine-tuning a BERT-based model for question answering on the Stanford Question Answering Dataset (SQuAD).

More information about the dataset is available at: https://huggingface.co/datasets/rajpurkar/squad

## 📦 Imports and Setup

In [None]:
# Import necessary libraries
from time import time
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline, DataCollatorWithPadding, TrainingArguments, Trainer
import pandas as pd
import numpy as np
import re
import logging
logging.getLogger("transformers").setLevel(logging.ERROR)
import torch
print("Is CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("Number of GPUs available:", torch.cuda.device_count())

# Set pandas display options for better readability
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)
pd.set_option('display.colheader_justify', 'center')

Is CUDA available: True
CUDA version: 11.8
Number of GPUs available: 1


## 📊 Data Loading

SQuAD (Stanford Question Answering Dataset) is used primarily to train and evaluate reading comprehension models. Each record combines a question, a context passage, and one or more reference answers.

In [None]:
# Load the SQuAD dataset
dataset = load_dataset("squad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

To keep training time short, the dataset is filtered to keep only records whose _context_ field is shorter than 300 characters.

In [None]:
# Filtering function definition
def filtra_por_longitud(ejemplo):
    return len(ejemplo["context"]) < 300

# Filter the dataset
ds_tarea = dataset.filter(filtra_por_longitud)
ds_tarea

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 3466
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 345
    })
})
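
As a quick check of what the filter does, the sketch below applies the same predicate to two made-up records and computes how much of each split survives, using the row counts printed above. Note that `len` counts characters, not tokens. The helper name `is_short` and the toy records are illustrative assumptions, not part of the notebook.

```python
def is_short(example, max_chars=300):
    """Same test as the notebook's filter: context length in characters."""
    return len(example["context"]) < max_chars

# Two made-up records, one just under and one exactly at the limit
toy = [{"context": "a" * 299}, {"context": "a" * 300}]
kept = [ex for ex in toy if is_short(ex)]
print(len(kept))  # only the 299-character record passes

# Row counts before and after filtering, copied from the outputs above
sizes = {"train": (87599, 3466), "validation": (10570, 345)}
retention = {
    split: round(100 * after / before, 2)
    for split, (before, after) in sizes.items()
}
print(retention)  # percentage of each split that is kept
```

Under 4% of each split is kept, so the fine-tuning and evaluation below cover only a small, short-context slice of SQuAD.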

## 🧪 Tokenization and Model Setup

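Define the model checkpoint, tokenizer, and question-answering model. The checkpoint used here has already been fine-tuned on SQuAD, so this notebook continues training it on the filtered subset.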

In [None]:
# Model checkpoint definition
model_checkpoint = "bert-large-uncased-whole-word-masking-finetuned-squad"

# Tokenizer definition
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Model definition
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Preprocessing function definition
def preproces_function(x):
    """Preprocess a batch of examples for extractive question answering.

    Args:
        x (dict): A batch with 'question', 'context', and 'answers' keys, each holding a list.
    Returns:
        dict: Tokenized inputs plus the start and end token positions of each answer.
    """

    # Tokenization of questions and contexts
    questions = [q.strip() for q in x["question"]] # Strip whitespace from questions
    inputs = tokenizer(
        questions, 
        x["context"],
        max_length=384, 
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    ) # Tokenize questions and contexts with specified parameters

    # Prepare the initial and final positions of answers
    offset_mapping = inputs.pop("offset_mapping") # Get offset mappings for tokenized inputs
    sample_mapping = inputs.pop("overflow_to_sample_mapping") # Get sample mappings for overflowed tokens
    answers = x["answers"] # Get answers from the input dictionary
    start_pos = [] # Positions of start of answers
    end_pos = [] # Positions of end of answers

    # Iterate through each tokenized input
    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_mapping[i] # Get the index of the sample corresponding to the tokenized input
        answer = answers[sample_idx] # Get the answer for the corresponding sample
        start_char = answer["answer_start"][0] # Position of start of the answer
        end_char = start_char + len(answer["text"][0]) # Position of end of the answer
        sequence_ids = inputs.sequence_ids(i) # Identify the sequence IDs for the tokenized input

        idx = 0 # Initialize index to find the context in the tokenized input

        #  Beginning of context
        while idx < len(sequence_ids) and sequence_ids[idx] != 1: 
            idx += 1 
        context_start = idx

        # End of context
        while idx < len(sequence_ids) and sequence_ids[idx] == 1: 
            idx += 1 
        context_end = idx - 1

        # Verify if the answer is within the context
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            # Answer is not fully inside this window: label it (0, 0), the [CLS] position
            start_pos.append(0)
            end_pos.append(0)
        else: # If the answer is within the context
            # Find the start and end positions of the answer within the context
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_pos.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_pos.append(idx + 1)

    inputs["start_positions"] = start_pos # Positions of start of answers
    inputs["end_positions"] = end_pos # Positions of end of answers

    return inputs 

# Apply the preprocessing function to the dataset
tokenized_ds = ds_tarea.map(
    preproces_function,
    batched=True,
    remove_columns=ds_tarea["train"].column_names,
)
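
The step that converts a character-level answer span into token indices is the least obvious part of the preprocessing. The sketch below reproduces the same two scans on a toy example, assuming a whitespace tokenizer in place of the BERT tokenizer. The helper name `char_span_to_token_span` and the example sentence are hypothetical.

```python
import re

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive token indices.

    `offsets` holds one (start, end) character pair per context token,
    playing the role of the tokenizer's offset mapping.
    """
    first, last = 0, len(offsets) - 1
    # Answer not fully inside this window: fall back to (0, 0)
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return 0, 0
    # Scan forward past every token that starts at or before the answer start
    idx = first
    while idx <= last and offsets[idx][0] <= start_char:
        idx += 1
    start_tok = idx - 1
    # Scan backward past every token that ends at or after the answer end
    idx = last
    while idx >= first and offsets[idx][1] >= end_char:
        idx -= 1
    end_tok = idx + 1
    return start_tok, end_tok

context = "Paris is the capital of France"
tokens = context.split()
offsets = [m.span() for m in re.finditer(r"\S+", context)]

answer = "the capital"
start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span, tokens[span[0]:span[1] + 1])

# If the window is cut before the answer ends, the label becomes (0, 0)
truncated = char_span_to_token_span(offsets[:3], start_char, end_char)
print(truncated)
```

In the notebook the scan runs between `context_start` and `context_end` rather than over the whole token list, because the question tokens come first in each input.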

## 🏋️ Training the Model
Setting up training arguments, initializing the Trainer, and training the model.

In [None]:
seed = 99 # Set a seed for reproducibility

# Trainer definition
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="steps",
    eval_steps=250,
    save_steps=250,
    logging_dir="./logs",
    logging_strategy="steps",
    logging_steps=100,

    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,

    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,

    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    save_total_limit=3,

    remove_unused_columns=False,
    seed=seed
)

# Data collator definition
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    return_tensors="pt"
    )

# Trainer instantiation
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator
)
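
The batch settings above determine how many optimizer steps the run takes, which in turn fixes how often evaluation happens and how long warmup lasts. The following is a rough estimate, assuming a single GPU and that each of the 3,466 training records yields exactly one tokenized feature (plausible for contexts under 300 characters, but not verified here). The helper `optimizer_steps` is hypothetical, and the Trainer's exact rounding may differ between library versions.

```python
import math

def optimizer_steps(num_features, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate the total number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_features / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)

# Values taken from the TrainingArguments above
effective_batch = 8 * 2  # per-device batch size x gradient accumulation steps
total_steps = optimizer_steps(num_features=3466, per_device_batch=8, grad_accum=2, epochs=5)
warmup_steps = math.ceil(0.1 * total_steps)  # warmup_ratio=0.1

print(effective_batch, total_steps, warmup_steps)
```

With `eval_steps=250`, an estimate of this size implies four evaluations over the run.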



In [None]:
# Training the model
start = time()

trainer.train()

end = time()
print(f">>>>>>>>>>>>> elapsed time: {(end-start)/60:.0f}m")

| Step | Training Loss | Validation Loss |
|---|---|---|
| 250 | 0.3501 | 1.102824 |
| 500 | 0.1254 | 1.3151 |
| 750 | 0.1009 | 1.517075 |
| 1000 | 0.0375 | 1.714367 |


>>>>>>>>>>>>> elapsed time: 17m
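
The logged losses can be read against the `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"` settings. The sketch below uses only the four logged rows and picks the evaluation step with the lowest validation loss, which is the checkpoint those settings should restore after training. It is a reading aid, not the Trainer's own selection code.

```python
# (step, training loss, validation loss) copied from the training log
log = [
    (250, 0.3501, 1.102824),
    (500, 0.1254, 1.3151),
    (750, 0.1009, 1.517075),
    (1000, 0.0375, 1.714367),
]

# Lowest validation loss wins because greater_is_better=False
best_step, _, best_val = min(log, key=lambda row: row[2])

# Training loss falls at every evaluation while validation loss rises
train_falls = all(a[1] > b[1] for a, b in zip(log, log[1:]))
val_rises = all(a[2] < b[2] for a, b in zip(log, log[1:]))

print(best_step, best_val, train_falls, val_rises)
```

Falling training loss combined with rising validation loss is the usual signature of overfitting, so the checkpoint from step 250 is the one worth evaluating.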


## 📈 Evaluation and Predictions

Evaluating the model on validation samples and computing similarity scores.

In [None]:
# Define the device for inference
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Define the question-answering pipeline
question_answerer = pipeline(
    "question-answering",
    model=model, 
    tokenizer=tokenizer, 
    device=device
    )

Device set to use cuda
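
An extractive question-answering model produces one start score and one end score per token, and the predicted answer is the span with the best combined score. The sketch below shows that idea on made-up scores. It is a simplified illustration: the real pipeline post-processing does more (for example, it restricts candidates to context tokens), and the function name `best_span` and its `max_answer_len` default are assumptions chosen for this example.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return the (start, end) token pair with the highest summed score.

    Only spans with start <= end and at most `max_answer_len` tokens are considered.
    """
    best_score, best_start, best_end = float("-inf"), 0, 0
    for start, start_score in enumerate(start_logits):
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score, best_start, best_end = score, start, end
    return best_start, best_end

# Made-up scores for a four-token input
span_a = best_span([0.1, 2.0, 0.3, 0.1], [0.0, 0.5, 3.0, 0.2])

# Taking each argmax independently would give start=2, end=0, which is not
# a valid span; the joint search returns a valid one instead
span_b = best_span([0.0, 0.0, 5.0], [4.0, 0.0, 1.0])

print(span_a, span_b)
```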


In [None]:
# Function to calculate sentence similarity (Jaccard overlap of word sets, as a percentage)
def calculate_sentence_similarity(sentence1, sentence2):
    # Drop punctuation and lowercase before splitting into words
    sentence1 = re.sub(r'[^a-zA-Z0-9\s]', '', sentence1).lower()
    sentence2 = re.sub(r'[^a-zA-Z0-9\s]', '', sentence2).lower()
    words1 = set(sentence1.split())
    words2 = set(sentence2.split())
    matches = len(words1.intersection(words2))
    total_words = len(words1.union(words2))
    if total_words == 0:
        return 0.0
    return (matches / total_words) * 100

# Evaluate the model on validation samples
samples = [324,342,249,176,70,168,120,58,90,192,278,289,197,146,323,248,260,273,112,211]
evaluation_list = []

for ii in samples:
    example = ds_tarea['validation'][ii]
    context = example['context']
    question = example['question']
    answers = list(example['answers']['text'])  # All reference answers for this sample
    prediction = question_answerer(context=context, question=question)['answer']
    # Score the prediction against its closest reference answer
    match = max(calculate_sentence_similarity(w, prediction) for w in answers)
    evaluation_list.append((ii, context, question, answers, prediction, match))

print("*** evaluation_df ***")
evaluation_df = pd.DataFrame(evaluation_list, columns=['sample', 'context', 'question', 'real_answers', 'predicted_answer', 'match'])
evaluation_df[['sample','real_answers','predicted_answer', 'match']]

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


*** evaluation_df ***


| row | sample | real_answers | predicted_answer | match |
|---|---|---|---|---|
| 0 | 324 | [Hospitality Business/Financial Centre, Downtown Riverside, Hospitality Business/Financial Centre] | Hospitality Business/Financial Centre | 100.0 |
| 1 | 342 | [Rugby, Rugby, Rugby] | Rugby | 100.0 |
| 2 | 249 | [extremely high, high, extremely high] | high | 100.0 |
| 3 | 176 | ["A Machine to End War", "A Machine to End War", A Machine to End War] | A Machine to End War | 100.0 |
| 4 | 70 | [Death Wish Coffee, Death Wish Coffee, Death Wish Coffee] | Death Wish Coffee | 100.0 |
| 5 | 168 | [antagonistic, antagonistic, antagonistic] | antagonistic | 100.0 |
| 6 | 120 | [1892 to 1894, from 1892 to 1894, from 1892 to 1894] | 1892 to 1894 | 100.0 |
| 7 | 58 | [Vince Lombardi Trophy, the Vince Lombardi Trophy, Vince Lombardi Trophy] | Vince Lombardi | 66.666667 |
| 8 | 90 | [5 Live Sports Extra, 5 Live Sports Extra, 5 Live Sports Extra] | 5 Live Sports Extra | 100.0 |
| 9 | 192 | [time, time complexity, time complexity] | time complexity | 100.0 |
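
The `match` column is a word-set overlap (Jaccard index), which differs from the exact-match and token-level F1 scores normally reported for SQuAD. The sketch below places the two side by side on the sample 58 prediction. The `jaccard` function mirrors the notebook's normalization. The `token_f1` function is a swapped-in, simplified version of SQuAD-style F1 written for this comparison, and is not the official evaluation script.

```python
import re
import string
from collections import Counter

def jaccard(prediction, reference):
    """Word-set overlap, using the same cleaning as calculate_sentence_similarity."""
    a = set(re.sub(r"[^a-zA-Z0-9\s]", "", prediction).lower().split())
    b = set(re.sub(r"[^a-zA-Z0-9\s]", "", reference).lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def squad_tokens(text):
    """Lowercase, strip punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall."""
    pred, ref = squad_tokens(prediction), squad_tokens(reference)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

prediction = "Vince Lombardi"
references = ["Vince Lombardi Trophy", "the Vince Lombardi Trophy"]

# As in the notebook, keep the score of the closest reference answer
best_jaccard = max(jaccard(prediction, ref) for ref in references)
best_f1 = max(token_f1(prediction, ref) for ref in references)

print(round(best_jaccard * 100, 2), round(best_f1 * 100, 2))
```

The Jaccard score penalizes the missing word more heavily than F1 does, so the percentages in the table are not directly comparable to published SQuAD F1 numbers.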


## ✅ Conclusions

- The model achieved a 100% match on most of the evaluated samples (e.g., rows 0-6, 8-11, 13-15, and 18-19 of the evaluation table), which suggests it extracts the correct answer span reliably when the context is clear and the answer is well defined.

- Some predictions matched only partially (e.g., row 7 with 66.67% or row 12 with 50%). These cases typically involve variations in phrasing or a prediction that covers only part of the reference answer.

- A few rows had a very low or 0% match (e.g., row 16 with 16.67% or row 17 with 0%), which may reflect challenging questions, model confusion, or ambiguous contexts.

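- Overall, predictions were consistent across the 20 evaluated samples. This is a small sample, however, and validation loss rose from 1.10 at step 250 to 1.71 at step 1000 while training loss kept falling, which points to overfitting. A broader evaluation is needed before drawing conclusions about generalization.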