
# Advanced NLP QA Project — From Scratch to SOTA (T5 & DistilBERT)
**Goal:** Upgrade a scratch Transformer QA notebook into an **advanced-level NLP project** featuring modern tokenization, pretrained models, rigorous evaluation, decoding strategies, interpretability, and a live demo.

**Highlights**
- Subword tokenization with Hugging Face tokenizers (SentencePiece for T5, WordPiece for DistilBERT)
- Fine-tuning **T5-small** (generative QA) on custom QA CSV or SQuAD
- Fine-tuning **DistilBERT** (extractive QA) on SQuAD with EM/F1
- Metrics: **BLEU (sacrebleu), ROUGE**, **EM/F1** (SQuAD)
- Decoding: greedy, **beam search**, **top‑k**, **nucleus (top‑p)**
- **Attention visualization** for interpretability
- **Gradio demo** for interactive inference
- (Optional) **FastAPI** service stub for production deployment

> You can run **either** on your local QA CSV (2 columns: `question`, `answer`; tab-separated by default, see `CSV_SEP`) **or** on SQuAD via `datasets`.


## 0. Environment Setup

In [None]:

# If running locally, uncomment to install dependencies.
# Note: In some environments, you may need to restart the kernel after installation.

# !pip install -U transformers datasets accelerate evaluate sacrebleu rouge-score gradio matplotlib torch torchvision torchaudio --quiet
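# EM/F1 are computed by the helper functions defined later in this notebook, so no extra
# package is needed for them (seqeval targets sequence labeling, not SQuAD-style QA).
# Optional, only for the FastAPI stub in Section 7:
# !pip install -U fastapi uvicorn --quiet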


## 1. Imports & Config

In [None]:

import os
import random
import math
from dataclasses import dataclass
from typing import List, Dict, Any, Optional

import numpy as np
import torch
from torch.utils.data import Dataset

import matplotlib.pyplot as plt

# Hugging Face
from transformers import (
    T5ForConditionalGeneration, T5TokenizerFast,
    AutoTokenizer, AutoModelForQuestionAnswering,
    DataCollatorForSeq2Seq,
    TrainingArguments, Trainer
)

from datasets import load_dataset, Dataset as HFDataset, DatasetDict
import evaluate

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device



## 2. Data — Use Custom CSV or SQuAD
Choose one of the two paths below:

- **Custom CSV** (two columns: `question`, `answer`) — set `CSV_PATH`.
- **SQuAD** — set `USE_SQUAD=True`.
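
For the custom path, the loader expects a header row with `question` and `answer` and, with the default `CSV_SEP = '\t'`, tab-separated fields. The snippet below is a minimal sketch of that layout using only the standard library. The two rows are made-up placeholders, not taken from the real dataset.

```python
import csv
import io

# Hypothetical rows, for illustration only.
rows = [
    {"question": "What is a stop-loss order?",
     "answer": "An order to sell once a price level is reached."},
    {"question": "What is diversification?",
     "answer": "Spreading capital across several assets."},
]

# Write a tab-separated file with the two required columns.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "answer"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Read it back and check that the required columns are present.
buffer.seek(0)
reader = csv.DictReader(buffer, delimiter="\t")
parsed = list(reader)
assert {"question", "answer"}.issubset(reader.fieldnames)
print(parsed[0]["question"])
```

If your file is comma-separated instead, set `CSV_SEP = ','` in the config cell.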


In [None]:

# ---- Config ----
USE_SQUAD = False  # set True to use SQuAD automatically
CSV_PATH = 'data/reminiscences_of_a_stock_operator_qa.csv'  # your local QA CSV
CSV_SEP = '\t'  # change if comma-separated
VAL_SPLIT = 0.1  # validation split for custom CSV
MAX_TRAIN_SAMPLES = None  # set an int to subsample for faster experiments

# Model names
T5_MODEL = "t5-small"  # generative
EXTRACTIVE_MODEL = "distilbert-base-uncased"  # extractive


In [None]:

def load_custom_csv(path: str, sep: str='\t', val_split: float=0.1) -> DatasetDict:
    import pandas as pd
    df = pd.read_csv(path, sep=sep)
    assert {'question','answer'}.issubset(df.columns), "CSV must have 'question' and 'answer' columns."
    if MAX_TRAIN_SAMPLES:
        df = df.sample(min(MAX_TRAIN_SAMPLES, len(df)), random_state=SEED)
    # Split
    val_size = max(1, int(len(df)*val_split))
    df = df.sample(frac=1, random_state=SEED).reset_index(drop=True)
    df_train = df.iloc[:-val_size].reset_index(drop=True)
    df_val = df.iloc[-val_size:].reset_index(drop=True)
    # Convert to HF datasets
    ds_train = HFDataset.from_pandas(df_train)
    ds_val = HFDataset.from_pandas(df_val)
    return DatasetDict(train=ds_train, validation=ds_val)

if USE_SQUAD:
    # Load SQuAD for both generative and extractive tracks
    ds_squad = load_dataset("squad")
    if MAX_TRAIN_SAMPLES:
        ds_squad = ds_squad.shuffle(seed=SEED)
        ds_squad = DatasetDict(
            train = ds_squad["train"].select(range(min(MAX_TRAIN_SAMPLES, len(ds_squad["train"])))),
            # Keep at least one validation example even when MAX_TRAIN_SAMPLES < 10.
            validation = ds_squad["validation"].select(range(min(max(1, MAX_TRAIN_SAMPLES // 10), len(ds_squad["validation"]))))
        )
else:
    ds_custom = load_custom_csv(CSV_PATH, sep=CSV_SEP, val_split=VAL_SPLIT)

print("Datasets ready.")


## 3. Generative QA — T5-small (SentencePiece subword tokenization)

In [None]:

t5_tokenizer = T5TokenizerFast.from_pretrained(T5_MODEL)
t5_model = T5ForConditionalGeneration.from_pretrained(T5_MODEL).to(device)

MAX_INPUT_LENGTH = 256
MAX_TARGET_LENGTH = 64

def format_examples_for_t5(batch):
    # For SQuAD, include context; for custom CSV, question->answer only
    if USE_SQUAD:
        inputs = [f"question: {q}  context: {c}" for q, c in zip(batch['question'], batch['context'])]
        targets = batch['answers']
        targets = [ans['text'][0] if len(ans['text'])>0 else "" for ans in targets]
    else:
        inputs = [f"question: {q}" for q in batch['question']]
        targets = batch['answer']
    model_inputs = t5_tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)
    labels = t5_tokenizer(text_target=targets, max_length=MAX_TARGET_LENGTH, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

if USE_SQUAD:
    ds_t5 = ds_squad.map(format_examples_for_t5, batched=True, remove_columns=ds_squad['train'].column_names)
else:
    ds_t5 = ds_custom.map(format_examples_for_t5, batched=True, remove_columns=ds_custom['train'].column_names)

data_collator_t5 = DataCollatorForSeq2Seq(tokenizer=t5_tokenizer, model=t5_model)


In [None]:

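from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

output_dir_t5 = "t5-gen-qa"

# predict_with_generate is only accepted by Seq2SeqTrainingArguments;
# plain TrainingArguments rejects it.
args_t5 = Seq2SeqTrainingArguments(
    output_dir=output_dir_t5,
    # Recent transformers releases rename this to eval_strategy;
    # use whichever name your installed version accepts.
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    predict_with_generate=True,
    generation_max_length=MAX_TARGET_LENGTH,
    logging_steps=50,
    report_to="none"
)

# Load the metrics once instead of on every evaluation call.
bleu_metric = evaluate.load("sacrebleu")
rouge_metric = evaluate.load("rouge")

def compute_bleu_rouge(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # -100 marks padding and cannot be decoded, so swap it for the pad token id.
    preds = np.where(preds != -100, preds, t5_tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, t5_tokenizer.pad_token_id)
    decoded_preds = t5_tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = t5_tokenizer.batch_decode(labels, skip_special_tokens=True)
    bleu = bleu_metric.compute(predictions=decoded_preds, references=[[l] for l in decoded_labels])
    rouge = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": bleu["score"], **{f"rouge_{k}": v for k, v in rouge.items()}}

# Seq2SeqTrainer calls generate() during evaluation, so compute_metrics receives token ids.
trainer_t5 = Seq2SeqTrainer(
    model=t5_model,
    args=args_t5,
    train_dataset=ds_t5['train'],
    eval_dataset=ds_t5['validation'],
    tokenizer=t5_tokenizer,
    data_collator=data_collator_t5,
    compute_metrics=compute_bleu_rouge
)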

# To train: uncomment the next line
# trainer_t5.train()
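
The metric function has to deal with the value `-100`, which `DataCollatorForSeq2Seq` uses to pad labels so that the loss skips those positions. The snippet below is a small pure-Python sketch of that convention. The helper names and the token ids are invented for illustration; the pad id of 0 matches T5 but is an assumption for other models.

```python
IGNORE_INDEX = -100  # marks label positions that the loss should skip
PAD_TOKEN_ID = 0     # T5's pad token id; an assumption for other models

def unmask_labels(label_rows, pad_token_id=PAD_TOKEN_ID):
    """Swap every IGNORE_INDEX for pad_token_id so a tokenizer can decode the rows."""
    return [[pad_token_id if tok == IGNORE_INDEX else tok for tok in row]
            for row in label_rows]

def count_scored_tokens(label_rows):
    """Count the positions that contribute to the loss (everything except IGNORE_INDEX)."""
    return sum(tok != IGNORE_INDEX for row in label_rows for tok in row)

# Toy ids, not real tokenizer output: the first row was padded by two positions.
labels = [[71, 1733, 1, -100, -100], [27, 43, 3, 9, 1]]
decodable = unmask_labels(labels)
scored = count_scored_tokens(labels)
print(decodable, scored)
```

Decoding with `skip_special_tokens=True` then drops the pad tokens again, so the padded positions never reach BLEU or ROUGE.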


### Decoding: greedy, beam search, top‑k, top‑p

In [None]:

def t5_generate(text_prompt: str, 
                 max_new_tokens: int=64,
                 decoding: str="greedy",
                 num_beams: int=4,
                 top_k: int=50,
                 top_p: float=0.95,
                 temperature: float=1.0):
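    # Truncate the prompt to the same length used for training inputs.
    inputs = t5_tokenizer(text_prompt, return_tensors="pt", truncation=True, max_length=MAX_INPUT_LENGTH).to(device)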
    gen_kwargs = dict(max_new_tokens=max_new_tokens)
    if decoding == "greedy":
        pass
    elif decoding == "beam":
        gen_kwargs.update(dict(num_beams=num_beams, early_stopping=True))
    elif decoding == "topk":
        gen_kwargs.update(dict(do_sample=True, top_k=top_k, temperature=temperature))
    elif decoding == "topp":
        gen_kwargs.update(dict(do_sample=True, top_p=top_p, temperature=temperature))
    else:
        raise ValueError("decoding must be one of: greedy | beam | topk | topp")
    outputs = t5_model.generate(**inputs, **gen_kwargs)
    return t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example (after training or with base T5):
# print(t5_generate("question: What is a stop-loss order?", decoding="beam"))
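
To make the difference between the strategies concrete, here is a simplified pure-Python sketch of how greedy, top-k, and nucleus (top-p) selection narrow a next-token distribution. The probabilities are a made-up toy example and the helper functions are illustrative, not the `transformers` implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

# Toy next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.25, 0.15, 0.07, 0.03]

greedy = max(range(len(probs)), key=lambda i: probs[i])  # always the single best token
top2 = top_k_filter(probs, 2)                            # sample among the 2 best
nucleus = top_p_filter(probs, 0.8)                       # sample among tokens covering 80% mass
print(greedy, sorted(top2), sorted(nucleus))
```

Sampling then draws from the filtered distribution, which is why `topk` and `topp` can return different answers on repeated calls while `greedy` and `beam` are deterministic.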


### Attention Visualization (Encoder Self-Attention)

In [None]:

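# Attentions are requested per call below (output_attentions=True), so the global
# model config is left unchanged and training/generation do not pay the extra memory cost.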

def visualize_attention(prompt: str, layer: int=0, head: int=0):
    inputs = t5_tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = t5_model.encoder(**inputs, output_attentions=True, return_dict=True)
    attentions = outputs.attentions  # tuple: (layers) x (batch, heads, seq_len, seq_len)
    attn = attentions[layer][0, head].detach().cpu().numpy()
    tokens = t5_tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    plt.figure(figsize=(6,6))
    plt.imshow(attn)
    plt.title(f"Encoder Layer {layer} Head {head}")
    plt.xlabel("Keys")
    plt.ylabel("Queries")
    plt.xticks(range(len(tokens)), tokens, rotation=90)
    plt.yticks(range(len(tokens)), tokens)
    plt.tight_layout()
    plt.show()

# Example:
# visualize_attention("question: Explain diversification in portfolio management.")
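
When reading the heatmap, it helps to remember that each row (one query token) is a softmax over all key tokens, so every row sums to 1. The sketch below shows this with the generic scaled dot-product form on toy vectors. T5's own attention differs in details such as its relative position bias, so treat this as an illustration of how to read the plot rather than a reimplementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return one softmax-normalized row of weights per query vector."""
    d = len(keys[0])
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    return [softmax(row) for row in scores]

# Toy 2-dimensional vectors: two queries attending over three keys.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

weights = attention_weights(queries, keys)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

A bright cell at row i, column j therefore means that query token i puts a large share of its attention on key token j.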


## 4. Extractive QA — DistilBERT on SQuAD (EM/F1)
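
An extractive model outputs one start logit and one end logit per token, and the predicted answer is the span whose start and end scores add up to the highest total. The following is a minimal sketch of that selection step. The tokens and logits are invented for illustration, and `best_span` and `max_answer_len` are hypothetical names, not part of `transformers`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Made-up tokens and logits for a single question/context pair.
tokens = ["[CLS]", "who", "wrote", "it", "[SEP]", "edwin", "lefevre", "wrote", "it", "[SEP]"]
start_logits = [0.1, -1.0, -1.2, -0.8, -2.0, 4.0, 1.0, -0.5, -0.9, -2.0]
end_logits = [0.0, -1.1, -0.7, -0.9, -2.0, 0.5, 3.5, -0.2, -0.4, -2.0]

s, e, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[s:e + 1])
print(answer, score)
```

A full post-processing step would also restrict candidates to context tokens and map token indices back to character offsets, which is what the `offset_mapping` kept by the validation features is for.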

In [None]:

if USE_SQUAD:
    tokenizer_ex = AutoTokenizer.from_pretrained(EXTRACTIVE_MODEL, use_fast=True)
    model_ex = AutoModelForQuestionAnswering.from_pretrained(EXTRACTIVE_MODEL).to(device)

    max_length = 384
    doc_stride = 128

    def prepare_train_features(examples):
        tokenized_examples = tokenizer_ex(
            examples["question"],
            examples["context"],
            truncation="only_second",
            max_length=max_length,
            stride=doc_stride,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding="max_length",
        )
        sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
        offset_mapping = tokenized_examples.pop("offset_mapping")
        start_positions = []
        end_positions = []
        for i, offsets in enumerate(offset_mapping):
            input_ids = tokenized_examples["input_ids"][i]
            cls_index = input_ids.index(tokenizer_ex.cls_token_id) if tokenizer_ex.cls_token_id in input_ids else 0
            sequence_ids = tokenized_examples.sequence_ids(i)
            sample_index = sample_mapping[i]
            answers = examples["answers"][sample_index]
            if len(answers["answer_start"]) == 0:
                start_positions.append(cls_index)
                end_positions.append(cls_index)
                continue
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                start_positions.append(cls_index)
                end_positions.append(cls_index)
            else:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                start_positions.append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                end_positions.append(token_end_index + 1)
        tokenized_examples["start_positions"] = start_positions
        tokenized_examples["end_positions"] = end_positions
        return tokenized_examples

    def prepare_validation_features(examples):
        tokenized_examples = tokenizer_ex(
            examples["question"],
            examples["context"],
            truncation="only_second",
            max_length=max_length,
            stride=doc_stride,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding="max_length",
        )
        sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
        tokenized_examples["example_id"] = []
        for i in range(len(tokenized_examples["input_ids"])):
            sequence_ids = tokenized_examples.sequence_ids(i)
            context_index = 1
            sample_index = sample_mapping[i]
            tokenized_examples["example_id"].append(examples["id"][sample_index])
            tokenized_examples["offset_mapping"][i] = [
                (o if sequence_ids[k] == context_index else None)
                for k, o in enumerate(tokenized_examples["offset_mapping"][i])
            ]
        return tokenized_examples

    squad_tokenized_train = ds_squad["train"].map(prepare_train_features, batched=True, remove_columns=ds_squad["train"].column_names)
    squad_tokenized_val = ds_squad["validation"].map(prepare_validation_features, batched=True, remove_columns=ds_squad["validation"].column_names)

    args_ex = TrainingArguments(
        output_dir="distilbert-extractive-qa",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=3e-5,
        per_device_train_batch_size=12,
        per_device_eval_batch_size=12,
        num_train_epochs=2,
        weight_decay=0.01,
        report_to="none",
        logging_steps=50
    )

    data_collator_ex = None  # default is fine for QA

    trainer_ex = Trainer(
        model=model_ex,
        args=args_ex,
        train_dataset=squad_tokenized_train,
        eval_dataset=squad_tokenized_val,
        tokenizer=tokenizer_ex,
        data_collator=data_collator_ex,
    )

    # To train: uncomment
    # trainer_ex.train()
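
The `doc_stride` setting controls how long contexts are split: consecutive windows overlap by `stride` tokens, so an answer near a window boundary still appears whole in at least one window. The sketch below is a simplified illustration of that arithmetic. `context_windows` and `window_size` are hypothetical names, and `window_size=350` stands in for whatever part of `max_length` remains after the question and special tokens.

```python
def context_windows(n_context_tokens, window_size, stride):
    """Return (start, end) token ranges; consecutive windows overlap by `stride` tokens."""
    if stride >= window_size:
        raise ValueError("stride must be smaller than window_size")
    windows, start = [], 0
    while True:
        end = min(start + window_size, n_context_tokens)
        windows.append((start, end))
        if end == n_context_tokens:
            break
        # Advance so that the next window re-reads the last `stride` tokens.
        start += window_size - stride
    return windows

windows = context_windows(n_context_tokens=600, window_size=350, stride=128)
print(windows)
```

For a 600-token context this yields three windows, `(0, 350)`, `(222, 572)`, and `(444, 600)`, which is why one SQuAD example can produce several training features.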


### EM / F1 Evaluation (SQuAD-style)

In [None]:

def normalize_text(s):
    import re, string
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

def get_tokens(s):
    if not s: return []
    return normalize_text(s).split()

def compute_em_f1(prediction: str, ground_truth: str):
    pred_tokens = get_tokens(prediction)
    gt_tokens = get_tokens(ground_truth)
    common = set(pred_tokens) & set(gt_tokens)
    num_same = sum(min(pred_tokens.count(w), gt_tokens.count(w)) for w in common)
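    if len(pred_tokens) == 0 or len(gt_tokens) == 0:
        # If either side is empty, EM and F1 are both 1 only when both are empty.
        both_empty = int(pred_tokens == gt_tokens)
        return both_empty, float(both_empty)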
    if num_same == 0:
        return 0, 0
    precision = 1.0 * num_same / len(pred_tokens)
    recall = 1.0 * num_same / len(gt_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    em = int(normalize_text(prediction) == normalize_text(ground_truth))
    return em, f1
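
A worked example makes the token-overlap F1 easier to check by hand. The snippet below is a compact, self-contained restatement of the same normalization and F1 logic (using `Counter` for the overlap), applied to a made-up prediction and reference.

```python
import re
import string
from collections import Counter

PUNCT = set(string.punctuation)

def normalize(text):
    """Lowercase, drop punctuation, drop articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# "stop-loss" loses its hyphen and becomes the single token "stoploss",
# so only "order" overlaps: precision = 1/3, recall = 1/2.
f1 = token_f1("the stop loss order", "a stop-loss order")
em = int(normalize("the stop loss order") == normalize("a stop-loss order"))
print(round(f1, 3), em)
```

Here F1 works out to 2 · (1/3) · (1/2) / (1/3 + 1/2) = 0.4 while EM is 0, which shows how punctuation stripping can penalize a prediction that a human would consider correct.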


### Validation Loop for T5 (BLEU/ROUGE + optional EM/F1 if ground truth provided)

In [None]:

def evaluate_t5(ds_eval, sample_size: Optional[int]=64):
    sacrebleu = evaluate.load("sacrebleu")
    rouge = evaluate.load("rouge")
    preds, refs = [], []
    ems, f1s = [], []
    n = len(ds_eval) if sample_size is None else min(sample_size, len(ds_eval))
    for i in range(n):
        ex = ds_eval[i]
        if USE_SQUAD:
            q = ex['question']; c = ex['context']; refs_i = ex['answers']['text']
            inp = f"question: {q}  context: {c}"
            gold = refs_i[0] if len(refs_i)>0 else ""
        else:
            q = ex['question']; gold = ex['answer']
            inp = f"question: {q}"
        pred = t5_generate(inp, decoding="beam", num_beams=4)
        preds.append(pred); refs.append([gold])
        if gold is not None:
            em, f1 = compute_em_f1(pred, gold)
            ems.append(em); f1s.append(f1)
    bleu = sacrebleu.compute(predictions=preds, references=refs)
    rouge_scores = rouge.compute(predictions=preds, references=[r[0] for r in refs])
    out = {"bleu": bleu["score"], **{f"rouge_{k}": v for k, v in rouge_scores.items()}}
    if ems:
        out.update({"em": float(np.mean(ems)), "f1": float(np.mean(f1s))})
    return out

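# Example after training. Pass the raw split, which still has the text columns;
# ds_t5 only holds token ids and would raise a KeyError on ex['question'].
# raw_val = ds_squad['validation'] if USE_SQUAD else ds_custom['validation']
# results = evaluate_t5(raw_val, sample_size=64)
# results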


## 5. Gradio Demo

In [None]:

import gradio as gr

def t5_answer_fn(question: str, context: str=""):
    prompt = f"question: {question}" + (f"  context: {context}" if context else "")
    return t5_generate(prompt, decoding="beam", num_beams=4)

demo = gr.Interface(
    fn=t5_answer_fn,
    inputs=[gr.Textbox(label="Question"), gr.Textbox(label="Context (optional)")],
    outputs="text",
    title="Generative QA (T5-small)",
    examples=[
        ["What is diversification?", ""],
        ["Who wrote the book 'Reminiscences of a Stock Operator'?", ""],
    ]
)

# To launch locally: uncomment
# demo.launch()


## 6. Save / Load

In [None]:

def save_t5(path="t5-gen-qa-final"):
    os.makedirs(path, exist_ok=True)
    t5_model.save_pretrained(path)
    t5_tokenizer.save_pretrained(path)
    print("Saved to", path)

def load_t5(path="t5-gen-qa-final"):
    global t5_model, t5_tokenizer
    t5_tokenizer = T5TokenizerFast.from_pretrained(path)
    t5_model = T5ForConditionalGeneration.from_pretrained(path).to(device)
    print("Loaded from", path)

# Example:
# save_t5()
# load_t5()


## 7. (Optional) FastAPI Service Stub

In [None]:

# Writes a minimal FastAPI service stub to fastapi_app.py
# (running it needs `fastapi` and an ASGI server such as `uvicorn`).
fastapi_code = r'''from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

app = FastAPI()
MODEL_PATH = "t5-gen-qa-final"
tokenizer = T5TokenizerFast.from_pretrained(MODEL_PATH)
model = T5ForConditionalGeneration.from_pretrained(MODEL_PATH)

class QARequest(BaseModel):
    question: str
    context: str = ""

@app.post("/qa")
def qa(req: QARequest):
    prompt = f"question: {req.question}" + (f"  context: {req.context}" if req.context else "")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4, early_stopping=True)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"answer": answer}
'''
with open("fastapi_app.py", "w", encoding="utf-8") as f:
    f.write(fastapi_code)
print("Wrote fastapi_app.py")
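
Once the stub is running, the `/qa` endpoint can be called with a JSON body that matches `QARequest`. The client below is a minimal standard-library sketch. It assumes the service was started with something like `uvicorn fastapi_app:app` and is reachable at `http://localhost:8000`; the address and the helper names are assumptions, not part of the stub.

```python
import json
import urllib.request

def build_qa_request(question, context="", base_url="http://localhost:8000"):
    """Build a POST request for the /qa endpoint; base_url is an assumed local address."""
    payload = json.dumps({"question": question, "context": context}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/qa",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question, context=""):
    """Send the request and return the 'answer' field (requires the service to be running)."""
    with urllib.request.urlopen(build_qa_request(question, context)) as resp:
        return json.loads(resp.read().decode("utf-8"))["answer"]

# Building the request does not need a running server; ask() does.
req = build_qa_request("What is diversification?")
print(req.full_url, req.get_method())
```

Calling `ask("What is diversification?")` against a live service should return the generated answer string.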



---

### How to Use
1. Set `USE_SQUAD=True` **or** point `CSV_PATH` to your QA CSV.
2. Run **Sections 0→7** in order.
3. Train T5 (uncomment `trainer_t5.train()`).
4. (Optional) Train DistilBERT on SQuAD (set `USE_SQUAD=True` and uncomment `trainer_ex.train()`).
5. Evaluate with `evaluate_t5(...)` and record **BLEU/ROUGE/EM/F1**.
6. Save model with `save_t5()`, then run **Gradio demo** or deploy via **FastAPI**.

**Deliverables to show on a resume**
- Comparison table: **scratch Transformer vs. T5 vs. DistilBERT** on your dataset/SQuAD.
- Plots of metrics vs. epochs.
- Screenshots/GIF of the **Gradio** app.
- Short write-up on decoding strategies & attention visualization findings.
