**Introduction:**
Classification and Question Answering (QA) are two key tasks in Natural Language Processing (NLP). In classification, a model assigns predefined labels to input text (e.g., identifying a sentence as positive or negative, or categorizing it as assertive, interrogative, etc.).

In contrast, QA involves finding or generating an answer to a given question based on a context passage. While classification focuses on categorizing text, QA focuses on understanding and extracting specific information from text.

QA Example:

Context: "Albert Einstein developed the theory of relativity in the early 20th century."

Question: "Who developed the theory of relativity?"

Answer: "Albert Einstein"
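
In extractive QA datasets such as SQuAD (used later in this notebook), an answer is stored as a text span together with its character offset in the context. The snippet below is a minimal sketch of that representation using the example above; `make_squad_answer` is a hypothetical helper written for illustration, not part of any library.

```python
def make_squad_answer(context, answer_text):
    """Build a SQuAD-style answer record (hypothetical helper for illustration)."""
    start = context.find(answer_text)  # character offset of the first match
    if start == -1:
        raise ValueError("answer text does not occur in the context")
    return {"text": [answer_text], "answer_start": [start]}


context = "Albert Einstein developed the theory of relativity in the early 20th century."
answer = make_squad_answer(context, "Albert Einstein")

start = answer["answer_start"][0]
end = start + len(answer["text"][0])

print(answer)              # {'text': ['Albert Einstein'], 'answer_start': [0]}
print(context[start:end])  # Albert Einstein
```

Slicing the context with the stored offset recovers the answer text, which is the property an extractive model is trained to reproduce at the token level.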


##Setup

In [None]:
# Install required libraries
!pip install -q transformers datasets evaluate

In [None]:
# Import libraries
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer
import evaluate
import torch

import warnings
warnings.filterwarnings('ignore')

##Load and Explore Dataset

In [None]:
from datasets import load_dataset

# Load SQuAD v1.1
dataset = load_dataset("squad")

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


In [None]:
# Explore a sample
print(dataset['train'][0])

import pprint
pprint.pprint(dataset["train"][0])

{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
{'answers': {'answer_start': [515], 'text': ['Saint Be

In [None]:
dataset['train'][0]['context']

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

##Tokenization

In [None]:
# Tokenization

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    start_positions = []
    end_positions = []

    # Map each span back to the original example
    for i, offsets in enumerate(inputs["offset_mapping"]):
        input_ids = inputs["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        sequence_ids = inputs.sequence_ids(i)
        sample_index = inputs["overflow_to_sample_mapping"][i]
        answers = examples["answers"][sample_index]

        if len(answers["answer_start"]) == 0:
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Find token indices
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # If answer out of span
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                start_positions.append(cls_index)
                end_positions.append(cls_index)
            else:
                # Start token
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                start_positions.append(token_start_index - 1)
                # End token
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                end_positions.append(token_end_index + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
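
The two ideas inside `preprocess_function` can be illustrated without a tokenizer. The sketch below is a simplified illustration, not the tokenizer's actual algorithm: `sliding_windows` ignores the question and special tokens that are also packed into each window, and it treats `stride` as the number of overlapping tokens. `char_span_to_token_span` mirrors the label logic above (last token starting at or before the answer start, first token ending at or after the answer end). Both function names and the whitespace "tokenizer" are assumptions made for the example.

```python
import re


def sliding_windows(n_tokens, max_len, stride):
    """Return (start, end) token ranges of overlapping windows (simplified)."""
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    windows = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += max_len - stride  # advance while keeping `stride` tokens of overlap
    return windows


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (start_token, end_token).

    Returns None when the span is not fully inside the window described by
    `offsets`, the case the notebook labels with the [CLS] index.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    start_token = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    end_token = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_token, end_token


text = "Albert Einstein developed the theory"
# Whitespace tokens stand in for word pieces; each offset is (char_start, char_end).
offsets = [m.span() for m in re.finditer(r"\S+", text)]

print(sliding_windows(10, max_len=4, stride=2))     # [(0, 4), (2, 6), (4, 8), (6, 10)]
print(char_span_to_token_span(offsets, 0, 15))      # (0, 1)
print(char_span_to_token_span(offsets[2:], 0, 15))  # None
```

The last call shows why some windows receive the `[CLS]` label: the answer lies outside that window, so no token span inside it can cover the answer.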



##Model Setup

In [None]:
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
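
The warning refers to the `qa_outputs` layer, which is the only part of the model that starts from random weights. In the Hugging Face BERT implementation it is a single linear layer that projects each token's hidden vector to two numbers, a start logit and an end logit. The plain-Python sketch below shows that projection with toy sizes; the function name and the weight values are illustrative assumptions only.

```python
def qa_head(hidden_states, weight, bias):
    """Project each token vector to a (start_logit, end_logit) pair.

    hidden_states: one vector per token (size 2 here; 768 for bert-base)
    weight: two rows (start, end), each as long as a token vector
    bias: two values (start, end)
    """
    start_logits, end_logits = [], []
    for h in hidden_states:
        start_logits.append(sum(w * x for w, x in zip(weight[0], h)) + bias[0])
        end_logits.append(sum(w * x for w, x in zip(weight[1], h)) + bias[1])
    return start_logits, end_logits


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, hidden size 2
weight = [[2.0, 0.0], [0.0, 3.0]]              # toy values; random before fine-tuning
bias = [0.0, -1.0]

start_logits, end_logits = qa_head(hidden, weight, bias)
print(start_logits)  # [2.0, 0.0, 2.0]
print(end_logits)    # [-1.0, 2.0, 2.0]
```

Fine-tuning adjusts these weights, along with the rest of BERT, so that the highest start and end logits fall on the answer's first and last tokens.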


##Fine-tuning

In [None]:
import numpy as np

# Load metric
metric = evaluate.load("squad")

def compute_metrics(eval_pred):
    start_logits, end_logits = eval_pred.predictions
    start_labels, end_labels = eval_pred.label_ids

    start_preds = np.argmax(start_logits, axis=1)
    end_preds = np.argmax(end_logits, axis=1)

    predictions = []
    references = []

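    # Loop over each evaluated feature. The indexing below assumes that eval_dataset
    # is the first len(start_preds) rows of tokenized_datasets["validation"],
    # which matches the Trainer setup further down.
    for i in range(len(start_preds)):
        input_ids = tokenized_datasets["validation"][i]["input_ids"]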
        # Decode predicted answer
        pred_text = tokenizer.decode(input_ids[start_preds[i]:end_preds[i]+1])
        # Decode true answer
        true_text = tokenizer.decode(input_ids[start_labels[i]:end_labels[i]+1])

        predictions.append({"id": str(i), "prediction_text": pred_text})
        references.append({"id": str(i), "answers": {"text": [true_text], "answer_start": [0]}})

    results = metric.compute(predictions=predictions, references=references)
    return results



training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.1,
    report_to="none",
    logging_dir='./logs',
    logging_steps=100,
    disable_tqdm=False,
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(10000)),
    eval_dataset=tokenized_datasets["validation"].select(range(2000)),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Exact Match | F1 |
|---|---|---|---|---|
| 1 | 1.5028 | 1.496712 | 52.8 | 64.389361 |
| 2 | 1.1189 | 1.44245 | 55.8 | 67.272621 |


TrainOutput(global_step=1250, training_loss=1.6723446014404297, metrics={'train_runtime': 1530.8809, 'train_samples_per_second': 13.064, 'train_steps_per_second': 0.817, 'total_flos': 3919451351040000.0, 'train_loss': 1.6723446014404297, 'epoch': 2.0})
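
The Exact Match and F1 columns come from the `squad` metric. As a rough illustration of what those two numbers measure, the sketch below reimplements them in plain Python. It approximates the SQuAD-style normalization (lowercasing, removing punctuation and English articles) and should be read as a sketch under that assumption, not as the metric's exact code.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(exact_match("paris, france", "Paris"))            # 0.0
print(round(f1_score("paris, france", "Paris"), 3))     # 0.667
```

The last two lines show why F1 is always at least as high as Exact Match: a prediction that overlaps the reference without matching it exactly still earns partial credit.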

##Evaluation & Custom Testing

In [None]:
# Evaluate on Validation
trainer.evaluate()

{'eval_loss': 1.4424500465393066,
 'eval_exact_match': 55.8,
 'eval_f1': 67.27262066158745,
 'eval_runtime': 48.9564,
 'eval_samples_per_second': 40.853,
 'eval_steps_per_second': 2.553,
 'epoch': 2.0}
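
The step counts and throughput figures reported above follow from the dataset sizes, batch size, and runtimes. The short check below reproduces them, assuming a single device and no gradient accumulation (an assumption, though it is consistent with the reported `global_step=1250`).

```python
# Figures copied from the training and evaluation outputs above.
train_features, eval_features = 10_000, 2_000
batch_size, epochs = 16, 2
train_runtime, eval_runtime = 1530.8809, 48.9564  # seconds

steps_per_epoch = -(-train_features // batch_size)  # ceiling division
eval_steps = -(-eval_features // batch_size)
global_steps = steps_per_epoch * epochs

train_samples_per_second = train_features * epochs / train_runtime
eval_samples_per_second = eval_features / eval_runtime
eval_steps_per_second = eval_steps / eval_runtime

print(global_steps)                        # 1250
print(round(train_samples_per_second, 3))  # 13.064
print(round(eval_samples_per_second, 3))   # 40.853
print(round(eval_steps_per_second, 3))     # 2.553
```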

In [None]:
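device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # make sure dropout is disabled for inference

def answer_question(question, context):
    """Return the span of `context` that the model predicts as the answer to `question`."""
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384).to(device)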

    # Get model outputs
    with torch.no_grad():
        outputs = model(**inputs)

    start_idx = torch.argmax(outputs.start_logits)
    end_idx = torch.argmax(outputs.end_logits)

    # Decode the answer from the input_ids
    answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx+1])

    return answer

# Custom Testing
ques1 = "Who developed the theory of relativity?"
context1 = "Albert Einstein developed the theory of relativity in the early 20th century."

ques2 = "In which century did Einstein develop his theory?"
context2 = "Albert Einstein developed the theory of relativity in the early 20th century."

print("Question:", ques1)
print("Answer:", answer_question(ques1, context1))
print("Question:", ques2)
print("Answer:", answer_question(ques2, context2))


print("\n")

# More Testing

# Context
context = "The Eiffel Tower, located in Paris, France, was completed in 1889 and is one of the most famous landmarks in the world."

# Custom questions
questions = [
    "Where is the Eiffel Tower located?",
    "When was the Eiffel Tower completed?",
    "What is the Eiffel Tower famous for?",
    "Which city has the Eiffel Tower?",
]

# Testing
for q in questions:
    print("Question:", q)
    print("Answer:", answer_question(q, context))
    print()

Question: Who developed the theory of relativity?
Answer: albert einstein
Question: In which century did Einstein develop his theory?
Answer: 20th


Question: Where is the Eiffel Tower located?
Answer: paris, france

Question: When was the Eiffel Tower completed?
Answer: 1889

Question: What is the Eiffel Tower famous for?
Answer: one of the most famous landmarks in the world

Question: Which city has the Eiffel Tower?
Answer: paris, france
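
`answer_question` takes the argmax of the start and end logits independently, so the predicted end can fall before the predicted start, in which case the slice decodes to an empty string. A common alternative is to score every valid (start, end) pair jointly. The sketch below shows that idea on plain Python lists; `best_span` and the default `max_answer_len=30` are assumptions chosen for the example.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) with the highest start + end logit sum,
    subject to start <= end and a span of at most max_answer_len tokens."""
    best = None
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [0.0, 4.0, 0.5, 3.0]

# Independent argmax picks start=2 and end=1, which is not a valid span.
naive_start = start_logits.index(max(start_logits))
naive_end = end_logits.index(max(end_logits))
print(naive_start, naive_end)                # 2 1

print(best_span(start_logits, end_logits))   # (2, 3, 8.0)
```

To try this in `answer_question`, the logits could be converted with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()` before calling `best_span`.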



This project demonstrates fine-tuning a pre-trained BERT model (`bert-base-uncased`) for extractive Question Answering. Trained on SQuAD v1.1, where each record pairs a question and a context passage with an answer span, the model learns to predict the start and end tokens of the answer within the passage. Each question-context pair is tokenized into windows of up to 384 tokens, and the answer's character offsets are converted into token indices that serve as training labels. Training used the first 10,000 tokenized features, and evaluation used the first 2,000 validation features. After two epochs, the model reached an evaluation loss of 1.44, an Exact Match (EM) score of 55.8%, and an F1 score of 67.27% on that evaluation subset, which shows it can extract answers with reasonable accuracy even from a small training subset. Evaluation took about 49 seconds, or roughly 41 samples per second. This project helped me understand the importance of task-specific fine-tuning, tokenization, and performance evaluation, and gave me practical insight into building QA systems.