# Load data

In [22]:
!pip install -q evaluate streamlit colab-xterm

In [23]:
import kagglehub

path = kagglehub.dataset_download("akashdesarda/squad-v11")
print("Path to dataset files:", path)

Using Colab cache for faster access to the 'squad-v11' dataset.
Path to dataset files: /kaggle/input/squad-v11


In [24]:
import pandas as pd

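# The CSV sits at the root of the downloaded dataset directory.
data = pd.read_csv(path + "/SQuAD-v1.1.csv")
# Lowercase the text columns so questions, contexts, and reference answers share one casing.
for col in ['title', 'context', 'question', 'answer']:
    data[col] = data[col].str.lower()
data.head()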

| Unnamed: 0 | title | context | question | answer | answer_start | answer_end |
|---|---|---|---|---|---|---|
| 0 | university_of_notre_dame | architecturally, the school has a catholic cha... | to whom did the virgin mary allegedly appear i... | saint bernadette soubirous | 515 | 541 |
| 1 | university_of_notre_dame | architecturally, the school has a catholic cha... | what is in front of the notre dame main building? | a copper statue of christ | 188 | 213 |
| 2 | university_of_notre_dame | architecturally, the school has a catholic cha... | the basilica of the sacred heart at notre dame... | the main building | 279 | 296 |
| 3 | university_of_notre_dame | architecturally, the school has a catholic cha... | what is the grotto at notre dame? | a marian place of prayer and reflection | 381 | 420 |
| 4 | university_of_notre_dame | architecturally, the school has a catholic cha... | what sits on top of the main building at notre... | a golden statue of the virgin mary | 92 | 126 |
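
Because the text columns are lowercased before `answer_start` and `answer_end` are used, it can be worth confirming that the character offsets still point at the answer text (for a few Unicode characters, lowercasing changes the string length). The sketch below is a hypothetical helper, `find_misaligned`, that is not part of the notebook. It works on plain dictionaries and assumes `answer_end` is an exclusive offset, which is consistent with the rows shown above (541 - 515 equals the 26 characters of "saint bernadette soubirous").

```python
def find_misaligned(rows):
    """Return the rows whose [answer_start, answer_end) slice differs from the answer."""
    bad = []
    for row in rows:
        span = row["context"][row["answer_start"]:row["answer_end"]]
        if span != row["answer"]:
            bad.append(row)
    return bad

# Toy rows standing in for the DataFrame (hypothetical data).
rows = [
    {"context": "my name is mohamed", "answer": "mohamed",
     "answer_start": 11, "answer_end": 18},  # offsets point at the answer
    {"context": "my name is mohamed", "answer": "mohamed",
     "answer_start": 10, "answer_end": 17},  # offsets are shifted by one
]
misaligned = find_misaligned(rows)
print(len(misaligned), "misaligned row(s)")
```

With the notebook's DataFrame, the same helper could presumably be fed `data.to_dict("records")`.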


# Model selection

## BERT

In [25]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-large-uncased-whole-word-masking-finetuned-squad")
BERT_model = AutoModelForQuestionAnswering.from_pretrained("google-bert/bert-large-uncased-whole-word-masking-finetuned-squad")

Some weights of the model checkpoint at google-bert/bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [26]:
import torch
from tqdm import tqdm

predictions = []

# Evaluate on the first 1,000 rows.
for i in tqdm(range(1000)):
    question = data["question"][i]
    context = data["context"][i]
    # Encode question and context as one pair, truncated to 512 tokens.
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        outputs = BERT_model(**inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits

    # Most likely start and end token; +1 makes the end exclusive for slicing.
    start_idx = torch.argmax(start_logits)
    end_idx = torch.argmax(end_logits) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_idx:end_idx])
    )
    # Stored as a one-element set; the evaluation cell unpacks it again.
    predictions.append({answer})

100%|██████████| 1000/1000 [45:07<00:00,  2.71s/it]
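
Taking the two argmaxes independently, as the loop above does, can produce an end position that lies before the start position, which decodes to an empty string. A common alternative is to score every (start, end) pair with start <= end under a length cap and keep the best one. The sketch below does this on plain Python lists; `best_span` and `max_answer_len` are illustrative names rather than part of the notebook, and the inputs would come from something like `start_logits[0].tolist()` and `end_logits[0].tolist()`. It does not exclude question or special tokens, which a fuller version probably should.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end_exclusive) maximizing start_logit + end_logit,
    subject to start <= end and at most max_answer_len tokens."""
    best_score = float("-inf")
    best = (0, 1)
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e + 1)  # +1 so the pair can be used directly as a slice
    return best

# Toy logits where the independent argmaxes disagree:
# start argmax is 2, end argmax is 0, so the slice [2:1] would be empty.
start = [0.1, 0.2, 5.0, 0.3]
end = [4.0, 0.1, 0.2, 3.0]
span = best_span(start, end)
print(span)
```

The double loop costs O(n * max_answer_len) per example, which is small next to the model's forward pass for 512-token inputs.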


In [27]:
import evaluate

metric = evaluate.load("squad")
references = [
    {"id": str(i), "answers": {"text": [data["answer"][i]], "answer_start": [data["answer_start"][i]]}}
    for i in range(len(predictions))
]
formatted_predictions = [{"id": str(i), "prediction_text": list(predictions[i])[0]} for i in range(len(predictions))]
results = metric.compute(predictions=formatted_predictions, references=references)
print(results)

{'exact_match': 79.1, 'f1': 87.30905782126362}
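
For reference, both numbers are computed per example and then averaged. The sketch below approximates the normalization used by the SQuAD v1.1 evaluation (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace) and then computes exact match and token-level F1 for a single prediction. It illustrates the idea and is not the code inside `evaluate`; the function names are assumptions.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize(text):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The main building", "the main building")
f1 = f1_score("golden statue", "a golden statue of the virgin mary")
print(em, round(f1, 3))
```

A partially correct answer such as "golden statue" scores 0 on exact match but still earns partial F1 credit, which is why the F1 figures in this notebook are consistently higher than the EM figures.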


## RoBERTa

In [28]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
RoBERTa_model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

In [29]:
import torch
from tqdm import tqdm

predictions = []

# Evaluate on the first 1,000 rows.
for i in tqdm(range(1000)):
    question = data["question"][i]
    context = data["context"][i]
    # Encode question and context as one pair, truncated to 512 tokens.
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        outputs = RoBERTa_model(**inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits

    # Most likely start and end token; +1 makes the end exclusive for slicing.
    start_idx = torch.argmax(start_logits)
    end_idx = torch.argmax(end_logits) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_idx:end_idx])
    )
    # Stored as a one-element set; the evaluation cell unpacks it again.
    predictions.append({answer})

100%|██████████| 1000/1000 [12:58<00:00,  1.28it/s]
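
Because the loop above uses `truncation=True` with `max_length=512`, anything that does not fit is dropped, so an answer near the end of a long passage could be cut off. One common workaround is to split the context into overlapping windows and run the model on each window. The sketch below shows only the windowing arithmetic on a list of token ids; `sliding_windows`, `window_size`, and `overlap` are hypothetical names. In practice the Hugging Face tokenizers can produce such windows themselves through `return_overflowing_tokens=True` together with `stride`.

```python
def sliding_windows(token_ids, window_size, overlap):
    """Split token_ids into windows of at most window_size tokens,
    where consecutive windows share `overlap` tokens.
    Returns a list of (start_offset, window) pairs."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + window_size]))
        if start + window_size >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += step
    return windows

# Ten toy token ids, windows of four tokens, two tokens shared between neighbours.
windows = sliding_windows(list(range(10)), window_size=4, overlap=2)
for offset, window in windows:
    print(offset, window)
```

Keeping the start offset with each window makes it possible to map a predicted span back to a position in the full context.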


In [30]:
import evaluate

metric = evaluate.load("squad")
references = [
    {"id": str(i), "answers": {"text": [data["answer"][i]], "answer_start": [data["answer_start"][i]]}}
    for i in range(len(predictions))
]
formatted_predictions = [{"id": str(i), "prediction_text": list(predictions[i])[0]} for i in range(len(predictions))]
results = metric.compute(predictions=formatted_predictions, references=references)
print(results)

{'exact_match': 77.8, 'f1': 86.65935352316549}


## DistilBERT

In [None]:
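# Load model directly
# Note: this checkpoint is cased, whereas the data was lowercased at load time,
# so its scores here may understate what it achieves on properly cased text.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased-distilled-squad")
DistilBERT_model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-cased-distilled-squad")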

In [32]:
import torch
from tqdm import tqdm

predictions = []

# Evaluate on the first 1,000 rows.
for i in tqdm(range(1000)):
    question = data["question"][i]
    context = data["context"][i]
    # Encode question and context as one pair, truncated to 512 tokens.
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        outputs = DistilBERT_model(**inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits

    # Most likely start and end token; +1 makes the end exclusive for slicing.
    start_idx = torch.argmax(start_logits)
    end_idx = torch.argmax(end_logits) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_idx:end_idx])
    )
    # Stored as a one-element set; the evaluation cell unpacks it again.
    predictions.append({answer})

100%|██████████| 1000/1000 [06:36<00:00,  2.52it/s]


In [33]:
import evaluate

metric = evaluate.load("squad")
references = [
    {"id": str(i), "answers": {"text": [data["answer"][i]], "answer_start": [data["answer_start"][i]]}}
    for i in range(len(predictions))
]
formatted_predictions = [{"id": str(i), "prediction_text": list(predictions[i])[0]} for i in range(len(predictions))]
results = metric.compute(predictions=formatted_predictions, references=references)
print(results)

{'exact_match': 69.9, 'f1': 81.23808877582852}
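
The three runs can be placed side by side using only the numbers printed above. The sketch below converts each tqdm wall-clock time into examples per second and lists it next to EM and F1 (F1 rounded to two decimals). The dictionary layout and the `to_seconds` helper are illustrative and not part of the notebook.

```python
def to_seconds(mm_ss):
    """Convert a tqdm-style 'MM:SS' string to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)

# Values copied from the cell outputs in this notebook.
runs = {
    "BERT-large": {"time": "45:07", "em": 79.1, "f1": 87.31},
    "RoBERTa-base": {"time": "12:58", "em": 77.8, "f1": 86.66},
    "DistilBERT": {"time": "06:36", "em": 69.9, "f1": 81.24},
}
n_examples = 1000  # each loop ran over range(1000)

for name, run in runs.items():
    run["examples_per_s"] = n_examples / to_seconds(run["time"])
    print(f"{name:13s} EM={run['em']:.1f}  F1={run['f1']:.2f}  "
          f"{run['examples_per_s']:.2f} examples/s")
```

Timings like these depend on the hardware and on whatever else the Colab session was doing, so they are best read as a rough ordering rather than a benchmark.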


# Command-line interface

In [36]:
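from transformers import AutoTokenizer

# In notebook order, the global `tokenizer` now belongs to DistilBERT,
# so load the tokenizer that matches BERT_model explicitly.
bert_tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-large-uncased-whole-word-masking-finetuned-squad")

def predict(question, context):
    """Return [answer text, start token index, exclusive end token index] from BERT_model."""
    inputs = bert_tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = BERT_model(**inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits
    # Most likely start and end token; +1 makes the end exclusive for slicing.
    start_idx = torch.argmax(start_logits)
    end_idx = torch.argmax(end_logits) + 1
    answer = bert_tokenizer.convert_tokens_to_string(
        bert_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_idx:end_idx])
    )
    return [answer, start_idx, end_idx]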

In [37]:
from transformers import pipeline

print("\n🧠 Simple BERT Question Answering CLI\n" + "-" * 50)

print("Loading model... (this happens only once)")
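# Note: this DistilBERT pipeline is loaded but never called below;
# the answers come from predict(), which uses BERT_model.
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")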

while True:
    context = input("\n📜 Enter passage (or type 'exit' to quit):\n> ")
    if context.lower().strip() == "exit":
        print("\nGoodbye! 👋")
        break

    question = input("\n❓ Enter your question:\n> ")

    result = predict(question, context)
    print("\n✅ Answer:",result[0])
    print("✅ Start index:",result[1])
    print("✅ End index:",result[2])

    again = input("\nDo you want to ask another question? (y/n): ").lower().strip()
    if again != "y":
        print("\nGoodbye! 👋")
        break



🧠 Simple BERT Question Answering CLI
--------------------------------------------------
Loading model... (this happens only once)


Device set to use cuda:0



📜 Enter passage (or type 'exit' to quit):
> my name is mohamed

❓ Enter your question:
> what is my name?

✅ Answer: mohamed
✅ Start index: tensor(10)
✅ End index: tensor(13)

Do you want to ask another question? (y/n): n

Goodbye! 👋



# High-Level Notebook Summary: Question Answering with Transformers

This notebook compares three extractive **Question Answering (QA)** checkpoints from the Hugging Face **Transformers** library on SQuAD v1.1 data. It does not train or fine-tune anything: it runs already fine-tuned BERT, RoBERTa, and DistilBERT models on the first 1,000 rows of the dataset, scores their answers, and then exposes the BERT model through a small interactive loop for new passages and questions.

## Structure Overview

1. **Setup and Imports**
   - The notebook installs `evaluate`, `streamlit`, and `colab-xterm`, and imports `kagglehub`, `pandas`, `transformers`, `torch`, `tqdm`, and `evaluate`.
   - No training environment is set up; every model is used for inference only.

2. **Dataset Loading and Preprocessing**
   - SQuAD v1.1 is downloaded with `kagglehub.dataset_download()` and read from `SQuAD-v1.1.csv` with `pandas.read_csv()`.
   - The `title`, `context`, `question`, and `answer` columns are lowercased.
   - Each question and context pair is tokenized with the checkpoint's own tokenizer and truncated to 512 tokens.

3. **Model Selection**
   - Three checkpoints that were already fine-tuned for QA are loaded via `AutoModelForQuestionAnswering`: `google-bert/bert-large-uncased-whole-word-masking-finetuned-squad`, `deepset/roberta-base-squad2`, and `distilbert/distilbert-base-cased-distilled-squad`.
   - The notebook performs no further fine-tuning.

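4. **Evaluation and Metrics**
   - Each model predicts answers for the first 1,000 rows of the CSV; no separate validation or test split is used.
   - **Exact Match (EM)** and **F1 score** are computed with the `squad` metric from `evaluate`: BERT-large reaches 79.1 EM / 87.3 F1, RoBERTa-base 77.8 / 86.7, and DistilBERT 69.9 / 81.2.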

5. **Inference (Command-Line Interface)**
   - The BERT model is wrapped in a `predict(question, context)` function that returns the answer text together with its start and end token indices.
   - An input loop lets users enter custom passages and questions and see the model's predictions interactively.

## Key Insights

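- Already fine-tuned checkpoints can be evaluated directly; the only preprocessing applied here is lowercasing and truncation to 512 tokens.
- **BERT-large** and **RoBERTa-base** score within about 1.3 EM points of each other (79.1 vs 77.8), while RoBERTa-base ran roughly 3.5 times faster (12:58 vs 45:07 for 1,000 examples).
- **EM** and **F1** give complementary views of performance: EM rewards only exact string matches, while F1 gives partial credit for token overlap.
- DistilBERT was the fastest model (6:36 for 1,000 examples) but scored about 9 EM points below BERT-large (69.9 vs 79.1), so it trades accuracy for speed.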

---
**Overall**, this notebook provides a complete inference workflow for question answering: loading the data, running three pre-trained QA models, evaluating them with EM and F1, and answering user-supplied questions.
