# Fine-Tuning Transformers for Question Answering

## Introduction

### Classification vs. Question Answering (QA)

- Classification: assigns a discrete label (or labels) to an entire input example.
  Example: Given a movie review, predict 'positive' or 'negative'. The model's output
  is typically a softmax over classes, computed from one pooled representation of the
  whole input (for BERT, the [CLS] token).

- Question Answering (extractive QA, SQuAD-style): given a context paragraph and a
  question, the model must extract a span from the context that answers the question.
  Instead of classifying the whole input, the model predicts start and end token positions
  within the context, and the final answer is the substring defined by those positions.

Key difference: QA is span-prediction (token-level, pointer-style) whereas classification
is label-prediction (example-level).
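
To make the span-prediction idea concrete, here is a minimal sketch of how a predicted answer can be decoded from per-token start and end scores. The token list, the logit values, the `best_span` helper and its `max_answer_len` cap are all illustrative assumptions, not output from the model trained in this notebook.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best = (0, 0, float("-inf"))
    for s in range(len(start_logits)):
        # Only consider end positions at or after the start, within the length cap
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Hypothetical tokens and scores: one start score and one end score per token
tokens = ["who", "wrote", "it", "?", "jane", "austen", "wrote", "emma"]
start_logits = [0.1, 0.0, 0.2, 0.0, 4.0, 1.0, 0.3, 0.5]
end_logits = [0.0, 0.1, 0.0, 0.2, 1.5, 3.5, 0.2, 0.4]

start, end, score = best_span(start_logits, end_logits)
answer_tokens = tokens[start:end + 1]   # the end index is inclusive
print(answer_tokens)
```

A classifier would instead reduce the whole sequence to one vector and score it against a fixed label set. Here the "label set" is every valid (start, end) pair in the input.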

## 1.1 Install Required Libraries

In [None]:
!pip install transformers datasets evaluate accelerate

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.5


## 1.2 Import Libraries

In [None]:
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    DefaultDataCollator,
    pipeline
)
import evaluate
import numpy as np

## 2. Load and Explore Dataset

In [None]:
dataset = load_dataset("squad")

# Print one example to understand structure
print(dataset["train"][0])

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
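
The `answer_start` field is a character offset into `context`, not a token index, which is why the tokenization step has to translate it later. The sketch below checks that relationship on a small hand-written record that mimics the SQuAD structure. The record and the `answer_char_span` helper are illustrative assumptions, not data from the dataset.

```python
# Hypothetical record with the same fields as a SQuAD example
example = {
    "context": "The Grotto is a replica of the grotto at Lourdes, France.",
    "question": "Where is the original grotto?",
    "answers": {"text": ["Lourdes, France"], "answer_start": [41]},
}

def answer_char_span(example):
    """Return the (start, end) character span of the first answer; end is exclusive."""
    start = example["answers"]["answer_start"][0]
    end = start + len(example["answers"]["text"][0])
    return start, end

start_char, end_char = answer_char_span(example)
recovered = example["context"][start_char:end_char]
print(start_char, end_char, repr(recovered))
```

Slicing the context with the character span should reproduce the answer text exactly. A mismatch would point to a corrupted record.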


### Using a Subset for Faster Training

In [None]:
small_train = dataset["train"].select(range(4000))       # 4,000 training samples
small_valid = dataset["validation"].select(range(2000))  # 2,000 validation samples

## 3. Tokenization

In [None]:
model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Helper function: map answers to token positions
def prepare_train_features(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Map start/end positions of answers
    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        sequence_ids = tokenized.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        if len(answers["answer_start"]) == 0:
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                start_positions.append(cls_index)
                end_positions.append(cls_index)
            else:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                start_positions.append(token_start_index - 1)

                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                end_positions.append(token_end_index + 1)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized

# Tokenize subsets
tokenized_train = small_train.map(
    prepare_train_features,
    batched=True,
    remove_columns=small_train.column_names
)
tokenized_valid = small_valid.map(
    prepare_train_features,
    batched=True,
    remove_columns=small_valid.column_names
)
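
The helper above does two things: it splits long contexts into overlapping windows (`max_length`, `stride`), and it converts character offsets into token positions, falling back to the `[CLS]` index when the answer is not fully inside a window. The following is a simplified sketch of the same idea. It uses a toy regex tokenizer rather than the BERT tokenizer, and `toy_tokenize`, `sliding_windows` and `label_window` are hypothetical names introduced only for illustration. The sketch returns `None` where the real helper would store the `[CLS]` index.

```python
import re

def toy_tokenize(text):
    """Toy tokenizer: returns tokens and their (start, end) character offsets."""
    matches = list(re.finditer(r"\w+|[^\w\s]", text))
    return [m.group() for m in matches], [(m.start(), m.end()) for m in matches]

def sliding_windows(n_tokens, max_len, stride):
    """Return (first, last_exclusive) token ranges; neighbours share `stride` tokens."""
    windows, start = [], 0
    while True:
        end = min(start + max_len, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            return windows
        start += max_len - stride   # requires max_len > stride

def label_window(offsets, window, start_char, end_char):
    """Return window-relative (start_tok, end_tok), or None if the answer is cut off."""
    first, last = window
    if not (offsets[first][0] <= start_char and offsets[last - 1][1] >= end_char):
        return None
    start_tok = max(i for i in range(first, last) if offsets[i][0] <= start_char)
    end_tok = min(i for i in range(first, last) if offsets[i][1] >= end_char)
    return start_tok - first, end_tok - first

context = "It is a replica of the grotto at Lourdes, France where Mary appeared in 1858."
answer = "Lourdes, France"
start_char = context.index(answer)
end_char = start_char + len(answer)

tokens, offsets = toy_tokenize(context)
windows = sliding_windows(len(tokens), max_len=8, stride=3)
labels = [label_window(offsets, w, start_char, end_char) for w in windows]
print(windows)
print(labels)
```

Only the middle window contains the whole answer, so only that window gets real start and end labels. This is also why one original example can turn into several training features.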


## 4. Model Setup

In [None]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

print(model)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, 

## 5. Fine-Tuning

In [None]:
batch_size = 16
args = TrainingArguments(
    output_dir="qa-finetuned-bert",  # directory where checkpoints are written
    eval_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none"
)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    tokenizer=tokenizer,  # newer transformers releases prefer processing_class=tokenizer
    data_collator=DefaultDataCollator(),
)


In [None]:
# Train model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 1.585368        |
| 2     | 1.488500      | 1.55762         |
| 3     | 1.488500      | 1.670121        |


TrainOutput(global_step=768, training_loss=1.1959187189737956, metrics={'train_runtime': 1013.357, 'train_samples_per_second': 12.085, 'train_steps_per_second': 0.758, 'total_flos': 2537844749798400.0, 'train_loss': 1.1959187189737956, 'epoch': 3.0})
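
The numbers in this output can be sanity-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the Trainer's default `logging_steps` of 500. None of these are set explicitly in the arguments above, so treat them as assumptions.

```python
global_step = 768        # from TrainOutput above
num_train_epochs = 3
batch_size = 16
logging_steps = 500      # assumed: Trainer default, not set in this notebook

steps_per_epoch = global_step // num_train_epochs

# Feature counts consistent with that many steps per epoch (last batch may be partial)
max_features = steps_per_epoch * batch_size
min_features = (steps_per_epoch - 1) * batch_size + 1

# Training loss is only reported once at least one logging step has passed
epoch_end_steps = [steps_per_epoch * e for e in range(1, num_train_epochs + 1)]
has_log = [step >= logging_steps for step in epoch_end_steps]

print(steps_per_epoch, (min_features, max_features))
print(list(zip(epoch_end_steps, has_log)))
```

Under these assumptions the 4,000 selected examples expanded into somewhat more than 4,000 features, because long contexts are split into several windows. Epoch 1 ends before step 500, which would explain the "No log" entry, and epochs 2 and 3 both show the single value logged at step 500. The `training_loss` in `TrainOutput` is lower because it averages over all 768 steps.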

## 6. Evaluation

In [None]:
import evaluate
metric = evaluate.load("squad")

# Build the pipeline once; creating it inside the loop re-initializes it for every example
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

predictions = []
references  = []

for example in small_valid:
    # Use the trained model + tokenizer to predict an answer for each validation example
    pred = qa_pipeline(question=example["question"], context=example["context"])

    # The metric matches predictions to references by id
    predictions.append({"id": example["id"], "prediction_text": pred["answer"]})
    references.append({"id": example["id"], "answers": example["answers"]})

# Compute EM and F1
results = metric.compute(predictions=predictions, references=references)
print("Validation EM and F1:", results)

Device set to use cuda:0


Validation EM and F1: {'exact_match': 66.05, 'f1': 74.66009338026294}
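
For intuition about what these two numbers measure, here is a small re-implementation sketch of SQuAD-style Exact Match and token-level F1 for a single prediction and a single reference. It follows the usual normalization (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace). The function names are my own, and the `squad` metric additionally takes the best score over all reference answers and reports averages as percentages.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

em_full = exact_match("The Eiffel Tower", "eiffel tower")
em_partial = exact_match("Bernadette Soubirous", "Saint Bernadette Soubirous")
f1_partial = f1_score("Bernadette Soubirous", "Saint Bernadette Soubirous")
print(em_full, em_partial, round(f1_partial, 2))
```

A partially correct span earns no Exact Match credit but still earns F1 credit, which is why F1 is the higher of the two scores above.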



## 7. Test on Custom Questions

In [None]:
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "Who developed the theory of relativity?"
context = "Albert Einstein developed the theory of relativity in the early 20th century."
print(qa_pipeline(question=question, context=context))

question2 = "Where is the Eiffel Tower located?"
context2 = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France."
print(qa_pipeline(question=question2, context=context2))

Device set to use cuda:0


{'score': 0.985287070274353, 'start': 0, 'end': 15, 'answer': 'Albert Einstein'}
{'score': 0.4273541569709778, 'start': 73, 'end': 86, 'answer': 'Paris, France'}
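
The `start` and `end` values in these dictionaries are character offsets into the context (end exclusive), and `score` is the pipeline's confidence in the span. The sketch below re-checks the two outputs printed above without running the model. The 0.5 cut-off is a hypothetical threshold chosen only to show how low-confidence answers might be flagged.

```python
context = "Albert Einstein developed the theory of relativity in the early 20th century."
context2 = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France."

# Pipeline outputs copied from the cell output above
outputs = [
    (context, {"score": 0.985287070274353, "start": 0, "end": 15, "answer": "Albert Einstein"}),
    (context2, {"score": 0.4273541569709778, "start": 73, "end": 86, "answer": "Paris, France"}),
]

# Slicing the context with the reported offsets should reproduce the answer text
recovered = [ctx[out["start"]:out["end"]] for ctx, out in outputs]

MIN_SCORE = 0.5   # hypothetical threshold, not part of the pipeline API
confident = [out["score"] >= MIN_SCORE for _, out in outputs]

print(recovered)
print(confident)
```

The second answer is correct but has a much lower score than the first. A fixed threshold like this one would flag it, so any cut-off should be tuned on validation data rather than assumed.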


## 8. Reflection

Through this project, I learned how Question Answering differs from classification.
Instead of predicting one label, the model must predict the start and end positions of an answer in a passage.
I explored how tokenization must encode the question and context together as one paired input, split long contexts into overlapping windows, and map character-level answer offsets to token positions, none of which single-sentence classification requires.
I also practiced fine-tuning a pre-trained transformer model on a 4,000-example subset of SQuAD to keep training time short.
Finally, I evaluated my model using Exact Match and F1 metrics and tested it on two custom questions.
This project deepened my understanding of transformers for span-based tasks.