# Question Answering with DistilBERT (SQuAD v1.1)

This notebook trains a DistilBERT question-answering model on the SQuAD v1.1 dataset. It includes:

- Preprocessing that maps character-level answer spans to token indices using tokenizer offsets.
- A custom token-level IoU metric for evaluation.
- Training via Hugging Face's Trainer API and an inference pipeline using the trained model.

Data and output paths in this notebook are set for a Kaggle environment (`/kaggle/input` and `/kaggle/working`). Adjust them if you run locally.

In [1]:
import torch
import transformers
import pandas as pd
import numpy as np
import json

from sklearn import model_selection, metrics
from datasets import Dataset
from tqdm.notebook import tqdm

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Configuration and hyperparameters

Set the model checkpoint (also used to load the tokenizer), maximum token length, batch sizes, learning rate, number of epochs, output directory, and a `debug` flag for quick iterations. Edit the `config` dictionary below to tune training behavior.

In [2]:
config = {
    "max_length": 512,
    "model_path": "distilbert-base-uncased",
    "output_dir": "/kaggle/working/my-qa-model",
    "train_batch_size": 8,
    "valid_batch_size": 8,
    "learning_rate": 3e-5,
    "epochs": 1,
    "debug": True,
}

## Preprocessing

We tokenize question+context with `return_offsets_mapping=True` and `truncation='only_second'` so we can:

- Identify which token indices correspond to the context (second sequence).
- Build a character-to-token mapping for the context using the offsets.
- Convert the dataset's character-level `answer_start` and `answer_end` into token-level `start_positions` and `end_positions` for training.

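If the answer is not fully inside the tokenized context (for example, because truncation cut it off), the preprocess function labels both positions with the first context token as a fallback. Such examples carry incorrect labels, so consider sliding-window tokenization for full coverage.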
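The mapping described above can be sketched on hand-written data, without loading a tokenizer. The offsets, `sequence_ids`, and the helper name `char_span_to_token_span` below are illustrative assumptions, not real tokenizer output or part of the notebook; the logic mirrors the character-to-token lookup used in the preprocessing cell.

```python
# Toy offsets written by hand for illustration (not real tokenizer output).
# Each entry is (char_start, char_end) into the context string; (0, 0) marks
# a special or question token that does not belong to the context.
context = "Denver Broncos won"
offsets = [(0, 0), (0, 0), (0, 6), (7, 14), (15, 18), (0, 0)]
sequence_ids = [None, 0, 1, 1, 1, None]


def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a half-open character span to inclusive token indices.

    Returns None when either end of the span is not covered by a context
    token, which is what happens when truncation cuts the answer off.
    """
    char_to_token = {}
    for idx, ((c_start, c_end), seq_id) in enumerate(zip(offsets, sequence_ids)):
        # Keep only context tokens (sequence id 1) with a non-empty span
        if seq_id != 1 or c_end <= c_start:
            continue
        for c in range(c_start, c_end):
            char_to_token[c] = idx

    last_char = end_char - 1  # make the end inclusive
    if start_char in char_to_token and last_char in char_to_token:
        return char_to_token[start_char], char_to_token[last_char]
    return None


answer = "Broncos"
start = context.index(answer)
print(char_span_to_token_span(offsets, sequence_ids, start, start + len(answer)))

# Simulate truncation by dropping the trailing tokens: "won" is no longer covered
print(char_span_to_token_span(offsets[:4], sequence_ids[:4], 15, 18))
```

Returning `None` for an uncovered span makes the truncated case explicit, so the caller can decide how to label it (the notebook falls back to the first context token).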

In [3]:
def preprocess_function(sample):
    # Tokenize with offsets so we can map character positions to token indices
    inputs = tokenizer(
        sample["question"],
        sample["context"],
        max_length=config["max_length"],
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sequence_ids = inputs.sequence_ids()

    # Find the token index span that corresponds to the context (the second sequence)
    context_start = None
    context_end = None
    for idx, s in enumerate(sequence_ids):
        if s == 1 and context_start is None:
            context_start = idx
        if s == 1:
            context_end = idx + 1

    # If there are no context tokens (should not happen), default labels to 0
    if context_start is None:
        inputs["start_positions"] = 0
        inputs["end_positions"] = 0
        return inputs

    # Offsets for context tokens only
    context_offsets = offset_mapping[context_start:context_end]

    # Build a char -> token index mapping for the context string
    char_to_token = {}
    for i_token, (char_start, char_end) in enumerate(context_offsets):
        # some tokens (special tokens) may have (0, 0) offsets; skip those
        if char_end <= char_start:
            continue
        abs_token_idx = context_start + i_token
        for c in range(char_start, char_end):
            char_to_token[c] = abs_token_idx

    # Character-level answer span (from the original dataset)
    answer_start_char = sample["answer_start"]
    answer_end_char = sample["answer_end"] - 1  # make inclusive

    # Map character-level span to token-level span. If the answer was truncated out,
    # fall back to the context start token (safe default).
    if (answer_start_char in char_to_token) and (answer_end_char in char_to_token):
        token_start = char_to_token[answer_start_char]
        token_end = char_to_token[answer_end_char]
    else:
        # answer not fully inside the tokenized context (likely truncated)
        token_start = context_start
        token_end = context_start

    inputs["start_positions"] = token_start
    inputs["end_positions"] = token_end

    return inputs

## Load SQuAD data

This section reads the SQuAD JSON files from the Kaggle dataset location and flattens the nested structure into a DataFrame with columns `context`, `question`, `answer`, and `answer_start`; `answer_end` is then derived as `answer_start + len(answer)`. Only the first listed answer of each question is kept. The dev set serves as the validation set. When `debug=True`, a later cell samples small subsets (1,000 training and 500 validation rows) for fast iteration.
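As a rough illustration of the flattening step, the sketch below applies the same three nested loops to a hand-made miniature of the SQuAD v1.1 layout. The data and the helper name `flatten_squad_rows` are made up for this example; only the key names (`data`, `paragraphs`, `qas`, `answers`, `answer_start`) follow the real files.

```python
# Hand-made miniature of the SQuAD v1.1 JSON layout (illustrative data only).
mini_squad = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The sky is blue. Grass is green.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What color is the sky?",
                            "answers": [{"text": "blue", "answer_start": 11}],
                        },
                        {
                            "id": "q2",
                            "question": "What is green?",
                            "answers": [{"text": "Grass", "answer_start": 17}],
                        },
                    ],
                }
            ],
        }
    ]
}


def flatten_squad_rows(squad):
    """Return one dict per question, keeping only the first listed answer."""
    rows = []
    for article in squad["data"]:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                first = qa["answers"][0]
                start = first["answer_start"]
                rows.append({
                    "context": para["context"],
                    "question": qa["question"],
                    "answer": first["text"],
                    "answer_start": start,
                    # exclusive end, so context[start:end] == answer
                    "answer_end": start + len(first["text"]),
                })
    return rows


rows = flatten_squad_rows(mini_squad)
for row in rows:
    # Sanity check: slicing the context with the span recovers the answer
    print(row["question"], "->", row["context"][row["answer_start"]:row["answer_end"]])
```

Slicing the context with the stored span is a cheap way to confirm that `answer_start` and `answer_end` line up before any tokenization happens.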

In [4]:
# Load SQuAD JSONs from Kaggle input paths
train_path = "/kaggle/input/stanford-question-answering-dataset/train-v1.1.json"
dev_path = "/kaggle/input/stanford-question-answering-dataset/dev-v1.1.json"

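# Use context managers so the file handles are closed after reading
with open(train_path, encoding="utf-8") as f:
    train_data = json.load(f)
with open(dev_path, encoding="utf-8") as f:
    dev_data = json.load(f)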

flattened_train = []
for sample in train_data["data"]:
    for para in sample["paragraphs"]:
        for qas in para["qas"]:
            flattened_train.append({
                "context": para["context"],
                "question": qas["question"],
                "answer": qas["answers"][0]["text"],
                "answer_start": qas["answers"][0]["answer_start"],  
            })

flattened_dev = []
for sample in dev_data["data"]:
    for para in sample["paragraphs"]:
        for qas in para["qas"]:
            flattened_dev.append({
                "context": para["context"],
                "question": qas["question"],
                "answer": qas["answers"][0]["text"],
                "answer_start": qas["answers"][0]["answer_start"],
            })


df_train = pd.DataFrame(flattened_train)
df_train["answer_end"] = df_train["answer_start"] + df_train["answer"].apply(len)

df_valid = pd.DataFrame(flattened_dev)
df_valid["answer_end"] = df_valid["answer_start"] + df_valid["answer"].apply(len)

print("train:", df_train.shape, "valid:", df_valid.shape)

tokenizer = transformers.AutoTokenizer.from_pretrained(config["model_path"])

train: (87599, 5) valid: (10570, 5)


## Training setup and custom metric

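We use Hugging Face's `Trainer` with a custom `compute_metrics` function that computes token-level IoU between predicted and gold token spans, averaged over the whole evaluation set. `TrainingArguments` is configured to evaluate and save once per epoch and to load the best model (by `token_iou`) at the end. `fp16` is enabled only when CUDA is available. Before defining these, the next cells inspect one validation example, its tokenization, and the untrained model's output.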
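To make the metric concrete, here is a minimal standalone sketch of IoU on two inclusive token ranges. The helper name `token_iou` and the example indices are chosen for illustration; the arithmetic follows the same intersection-over-union rule as the notebook's `compute_metrics`.

```python
def token_iou(pred_start, pred_end, gold_start, gold_end):
    """IoU of two inclusive token index ranges."""
    # Collapse reversed spans (end before start) to a single token
    pred_end = max(pred_end, pred_start)
    gold_end = max(gold_end, gold_start)

    # Overlap length is zero when the ranges do not intersect
    inter = max(0, min(pred_end, gold_end) - max(pred_start, gold_start) + 1)
    union = (pred_end - pred_start + 1) + (gold_end - gold_start + 1) - inter
    return inter / union  # union >= 1 because each range has at least one token


# Prediction covers tokens 10..13, gold covers 12..15:
# 2 shared tokens out of 6 covered tokens
print(token_iou(10, 13, 12, 15))
print(token_iou(5, 7, 5, 7))   # identical spans
print(token_iou(0, 1, 5, 6))   # disjoint spans
```

Unlike exact match, this score gives partial credit when the predicted span overlaps the gold span but has different boundaries.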

In [5]:
sample = df_valid.iloc[1]

print("Question:\n{}".format(sample["question"]))
print()
print("Context:\n{}".format(sample["context"]))
print()
print("Answer:", sample["answer"])
print()
print(sample["context"][sample["answer_start"] : sample["answer_end"]])

Question:
Which NFL team represented the NFC at Super Bowl 50?

Context:
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

Answer: Carolina Panthers

Carolina Panthers


In [6]:
enc = tokenizer(
    sample["question"],
    sample["context"],
    
    return_offsets_mapping=True
)


enc

{'input_ids': [101, 2029, 5088, 2136, 3421, 1996, 22309, 2012, 3565, 4605, 2753, 1029, 102, 3565, 4605, 2753, 2001, 2019, 2137, 2374, 2208, 2000, 5646, 1996, 3410, 1997, 1996, 2120, 2374, 2223, 1006, 5088, 1007, 2005, 1996, 2325, 2161, 1012, 1996, 2137, 2374, 3034, 1006, 10511, 1007, 3410, 7573, 14169, 3249, 1996, 2120, 2374, 3034, 1006, 22309, 1007, 3410, 3792, 12915, 2484, 1516, 2184, 2000, 7796, 2037, 2353, 3565, 4605, 2516, 1012, 1996, 2208, 2001, 2209, 2006, 2337, 1021, 1010, 2355, 1010, 2012, 11902, 1005, 1055, 3346, 1999, 1996, 2624, 3799, 3016, 2181, 2012, 4203, 10254, 1010, 2662, 1012, 2004, 2023, 2001, 1996, 12951, 3565, 4605, 1010, 1996, 2223, 13155, 1996, 1000, 3585, 5315, 1000, 2007, 2536, 2751, 1011, 11773, 11107, 1010, 2004, 2092, 2004, 8184, 28324, 2075, 1996, 4535, 1997, 10324, 2169, 3565, 4605, 2208, 2007, 3142, 16371, 28990, 2015, 1006, 2104, 2029, 1996, 2208, 2052, 2031, 2042, 2124, 2004, 1000, 3565, 4605, 1048, 1000, 1007, 1010, 2061, 2008, 1996, 8154, 2071, 14500,

In [7]:
print(enc.sequence_ids())

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]


In [8]:
if config["debug"]:
    print("DEBUG MODE!!!")
    # sample smaller subsets for fast iteration in debug
    df_train = df_train.sample(1000, random_state=42)
    df_valid = df_valid.sample(500, random_state=42)

# Use preloaded dev set as validation
train = df_train.reset_index(drop=True)
valid = df_valid.reset_index(drop=True)

DEBUG MODE!!!


In [9]:
train_ds = Dataset.from_pandas(train)
valid_ds = Dataset.from_pandas(valid)

In [10]:
%%time

train_ds = train_ds.map(preprocess_function)
valid_ds = valid_ds.map(preprocess_function)

CPU times: user 1.66 s, sys: 14.4 ms, total: 1.68 s
Wall time: 1.7 s


In [11]:
# train_ds[0]

In [12]:
model = transformers.AutoModelForQuestionAnswering.from_pretrained(
    config["model_path"]
)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
model

DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
     

In [14]:
with torch.no_grad():
    out = model(**tokenizer("hello", "world", return_tensors="pt"))
    
out

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-0.0367, -0.0569, -0.0442, -0.0953, -0.0441]]), end_logits=tensor([[ 0.1562,  0.3723, -0.0067,  0.0697, -0.0067]]), hidden_states=None, attentions=None)

In [15]:
import inspect

def compute_metrics(eval_pred):
    # eval_pred is an EvalPrediction object with .predictions and .label_ids
    start_logits, end_logits = eval_pred.predictions

    # Convert logits to token indices
    pred_starts = np.argmax(start_logits, axis=-1)
    pred_ends = np.argmax(end_logits, axis=-1)

    labels = eval_pred.label_ids
    if labels is None:
        return {"token_iou": 0.0}

    # For QA models, Trainer usually passes label_ids as a
    # (start_positions, end_positions) tuple; also accept a (batch, 2) array.
    if isinstance(labels, (tuple, list)):
        label_starts, label_ends = labels
    elif labels.ndim == 2:
        label_starts = labels[:, 0]
        label_ends = labels[:, 1]
    else:
        # fallback if only a single label array was provided
        label_starts = labels
        label_ends = labels

    scores = []
    for ps, pe, ls, le in zip(pred_starts, pred_ends, label_starts, label_ends):
        # normalize poorly-formed predictions
        if pe < ps:
            pe = ps
        if le < ls:
            le = ls

        # intersection and union on token indices (inclusive ranges)
        inter_start = max(ps, ls)
        inter_end = min(pe, le)
        inter_len = max(0, inter_end - inter_start + 1)
        pred_len = pe - ps + 1
        label_len = le - ls + 1
        union_len = pred_len + label_len - inter_len if (pred_len + label_len - inter_len) > 0 else 1
        iou = inter_len / union_len
        scores.append(iou)

    return {"token_iou": float(np.mean(scores))}


# Build TrainingArguments across transformers versions: newer releases name
# the evaluation schedule `eval_strategy`, older ones `evaluation_strategy`.
# Checking the signature avoids silently training without evaluation.
ta_params = inspect.signature(transformers.TrainingArguments.__init__).parameters
eval_key = "eval_strategy" if "eval_strategy" in ta_params else "evaluation_strategy"

training_args = transformers.TrainingArguments(
    output_dir=config["output_dir"],
    per_device_train_batch_size=config["train_batch_size"],
    per_device_eval_batch_size=config["valid_batch_size"],
    learning_rate=config["learning_rate"],
    num_train_epochs=config["epochs"],
    dataloader_num_workers=4,
    save_total_limit=2,
    report_to="none",
    fp16=torch.cuda.is_available(),
    # Evaluate and save every epoch so the best checkpoint can be reloaded
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="token_iou",
    greater_is_better=True,
    **{eval_key: "epoch"},
)

## Inference pipeline

The next two cells build the `Trainer`, train, and save the model to `output_dir`. After training, we create a `pipeline(task='question-answering')` that loads the model from `output_dir`, with `device=0` when CUDA is available and `device=-1` (CPU) otherwise. Running the pipeline on validation rows lets you compare sample predictions with the gold answers.

In [16]:
trainer = transformers.Trainer(
    model,
    training_args,
    
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



In [17]:
trainer.train()
trainer.save_state()
trainer.save_model()




In [18]:
# Inference

from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

qa_pipeline = pipeline(
    task="question-answering",
    model=config["output_dir"],
    device=device,
)

Device set to use cuda:0


In [19]:
preds = []

# Run the pipeline on the first 11 validation rows only
for _, row in valid.head(11).iterrows():
    pred = qa_pipeline(question=row["question"], context=row["context"])
    preds.append(pred)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


In [20]:
pred_df = pd.DataFrame(preds)
pred_df["gold_answer"] = valid["answer"].tolist()[: len(pred_df)]

pred_df

| | score | start | end | answer | gold_answer |
|---|---|---|---|---|---|
| 0 | 0.001156 | 345 | 394 | carbon monoxide from the heme group of hemoglobin | carbon monoxide |
| 1 | 0.002601 | 309 | 375 | High Definition content which had previously b... | no |
| 2 | 0.000569 | 181 | 250 | Hulagu Khan destroyed much of Iran's northern ... | 1237 |
| 3 | 0.000477 | 205 | 209 | K-12 | the University of Chicago campus |
| 4 | 0.001032 | 168 | 218 | atmospheric concentrations of the greenhouse g... | substantially increasing the atmospheric conce... |
| 5 | 0.001099 | 77 | 93 | Orellana in 1542 | 11,000 years |
| 6 | 0.002837 | 192 | 212 | New England Patriots | Pittsburgh Steelers |
| 7 | 0.000719 | 138 | 185 | launch vehicle to be used in the Apollo program | Joseph Shea |
| 8 | 0.002227 | 246 | 260 | statutory rape | the sex offenders register |
| 9 | 0.002235 | 66 | 82 | 15 February 1546 | that they convert |
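Eyeballing predicted versus gold strings only goes so far, so one option is to score them with text-level exact match and token F1. The sketch below is an assumption-laden illustration, not part of the notebook: the helper names are hypothetical, and the normalization (lowercasing, stripping punctuation and English articles) is modeled on the usual SQuAD-style evaluation rather than copied from any official script. The two example pairs are taken from rows 0 and 6 of the prediction table.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction, gold):
    """Harmonic mean of word-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == gold_tokens)
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


pairs = [
    ("carbon monoxide from the heme group of hemoglobin", "carbon monoxide"),
    ("New England Patriots", "Pittsburgh Steelers"),
]
for prediction, gold in pairs:
    print(exact_match(prediction, gold), round(token_f1(prediction, gold), 3))
```

With a `pred_df` like the one above, these functions could be applied row by row to the `answer` and `gold_answer` columns to get an aggregate score for the sampled rows.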
