# T5 Question Generation — Colab Training

Standalone notebook for training and evaluating T5 topic-controlled question generation on Google Colab.

**Use this notebook when:**
- You want to train on Colab's GPU using datasets already in the repository
- You want to evaluate with the full metric suite against paper baselines

**Steps:**
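1. Set up the environment and clone the repo (data files included)
2. Verify the training data in the cloned repository
3. Initialise the pipeline and tune settings to the GPU (optionally run the hyperparameter sweep)
4. Train one or more model variants with `pipe.train()`
5. Run a quick generation test with `pipe.generate()`
6. Evaluate with `pipe.evaluate()` against paper baselines
7. Download the trained model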

**Expected CSV files** (committed to `data/training/` in the repo):
```
data/training/squad/baseline/    train.csv  val.csv  test.csv
data/training/squad/mixsquad/    train.csv  val.csv  test.csv
data/training/squad/mixsquad2x/  train.csv  val.csv  test.csv
data/training/khanq/mixkhanq/    data.csv
```

## 1. Setup

In [1]:
# Check GPU
!nvidia-smi

import torch
print(f"\nPyTorch : {torch.__version__}")
print(f"CUDA    : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    mem  = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU     : {name} ({mem:.0f} GB)")
    # Suggest batch size based on available VRAM
    suggested_batch = 128 if mem >= 35 else 64 if mem >= 15 else 32
    print(f"Suggested batch size: {suggested_batch}")
else:
    print("WARNING: No GPU detected. Training will be very slow.")

Mon Feb 23 17:02:53 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   38C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

In [None]:
import sys, os
from pathlib import Path

# ── Clone repository ──────────────────────────────────────────────────────────
# TODO: replace with your actual repository URL
REPO_URL = "https://github.com/Byambaa0325/question-generation-distillation.git"
!git clone {REPO_URL} /content/ai4ed-qg -q
%cd /content/ai4ed-qg

# ── Install dependencies ──────────────────────────────────────────────────────
# transformers>=4.46 required: the sweep cell passes eval_strategy and
# processing_class (the newer names for evaluation_strategy and tokenizer=,
# which are deprecated)
!pip install -q torch "transformers>=4.46.0" datasets accelerate sentencepiece \
                evaluate rouge_score nltk sentence-transformers \
                pyyaml tqdm pandas python-dotenv

import nltk
for res in ('punkt', 'punkt_tab', 'wordnet', 'omw-1.4'):
    nltk.download(res, quiet=True)

sys.path.insert(0, '/content/ai4ed-qg')
os.chdir('/content/ai4ed-qg')
print(f"Working dir: {os.getcwd()}")

import transformers
print(f"transformers: {transformers.__version__}")

In [None]:
# Data is included in the cloned repository — no Google Drive needed.
from pathlib import Path

REPO_DIR = Path('/content/ai4ed-qg')
print(f"Repo    : {REPO_DIR}")
print(f"Data dir: {REPO_DIR / 'data'} — exists: {(REPO_DIR / 'data').exists()}")

## 2. Verify Training Data

Training data is committed to the repository and was cloned in Step 1. No upload or Drive mounting needed.

Run the cell below to confirm all expected files are present.

In [None]:
# Verify training data from cloned repo
from pathlib import Path

REPO_DIR = Path('/content/ai4ed-qg')

check_paths = [
    # SQuAD — baseline (WAT, plain context, 70/15/15 split of 37,388 entries)
    'data/training/squad/baseline/train.csv',
    'data/training/squad/baseline/val.csv',
    'data/training/squad/baseline/test.csv',
    # SQuAD — MixSQuAD (WAT, 10k random mixed-context pairs)
    'data/training/squad/mixsquad/train.csv',
    'data/training/squad/mixsquad/val.csv',
    'data/training/squad/mixsquad/test.csv',
    # SQuAD — MixSQuAD2X (MixSQuAD doubled, 20k entries)
    'data/training/squad/mixsquad2x/train.csv',
    'data/training/squad/mixsquad2x/val.csv',
    'data/training/squad/mixsquad2x/test.csv',
    # KhanQ — MixKhanQ evaluation set (Wikifier, 653 entries, no split)
    'data/training/khanq/mixkhanq/data.csv',
]

all_ok = True
prev_group = None
for rel in check_paths:
    group = '/'.join(rel.split('/')[:4])
    if group != prev_group:
        print()
        prev_group = group
    p = REPO_DIR / rel
    if p.exists():
        print(f"  [OK]      {rel}  ({p.stat().st_size:,} bytes)")
    else:
        print(f"  [MISSING] {rel}")
        all_ok = False

print()
if all_ok:
    print("All training files present — ready to train.")
else:
    print("Some files missing. Available CSVs in data/training/:")
    for f in sorted((REPO_DIR / 'data/training').rglob('*.csv')):
        print(f"  {f.relative_to(REPO_DIR)}")

In [None]:
# No file placement needed — data is already in the correct paths from the cloned repo.
print("Data paths are set up by the repository structure. Proceed to Step 3.")

In [None]:
# Preview first few rows of a training file to confirm format
import pandas as pd
from pathlib import Path

REPO_DIR = Path('/content/ai4ed-qg')
sample_csv = REPO_DIR / 'data/training/squad/mixsquad/train.csv'

if sample_csv.exists():
    df = pd.read_csv(sample_csv)
    print(f"mixsquad/train.csv — {len(df):,} rows, columns: {list(df.columns)}")
    display(df.head(3))
else:
    print(f"File not found: {sample_csv}")

## 3. Initialise Pipeline

In [5]:
from src.pipeline import Pipeline

pipe = Pipeline('config/pipeline.yaml')
pipe.status()


Pipeline status:
  [-] convert.squad.text
  [-] convert.squad.question
  [-] convert.khanq.text
  [-] convert.khanq.question
  [-] wikify.squad.text
  [-] wikify.squad.question
  [-] wikify.khanq.text
  [-] wikify.khanq.question
  [-] topics.squad.enriched
  [-] topics.squad.filtered
  [-] topics.khanq.enriched
  [-] topics.khanq.filtered
  [-] dataset.squad.baseline
  [-] dataset.squad.mixsquad
  [-] dataset.squad.mixsquad2x
  [-] dataset.khanq.baseline
  [-] dataset.khanq.mixsquad
  [-] dataset.khanq.mixsquad2x
  [-] train.baseline
  [-] train.topic
  [-] train.topic2x


{'convert.squad.text': False,
 'convert.squad.question': False,
 'convert.khanq.text': False,
 'convert.khanq.question': False,
 'wikify.squad.text': False,
 'wikify.squad.question': False,
 'wikify.khanq.text': False,
 'wikify.khanq.question': False,
 'topics.squad.enriched': False,
 'topics.squad.filtered': False,
 'topics.khanq.enriched': False,
 'topics.khanq.filtered': False,
 'dataset.squad.baseline': False,
 'dataset.squad.mixsquad': False,
 'dataset.squad.mixsquad2x': False,
 'dataset.khanq.baseline': False,
 'dataset.khanq.mixsquad': False,
 'dataset.khanq.mixsquad2x': False,
 'train.baseline': False,
 'train.topic': False,
 'train.topic2x': False}

In [None]:
import torch
from src.pipeline import Pipeline

pipe = Pipeline('config/pipeline.yaml')
tc = pipe.config.training

# ── Auto-tune to the available GPU ───────────────────────────────────────────
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram = torch.cuda.get_device_properties(0).total_memory / 1e9

    # bf16 is natively fast on Ampere (A100) and Hopper (H100).
    # fp16 is the right choice for Turing (T4) and Volta (V100).
    # fp32 is the fallback for GPUs with less than 14 GB.
    # Check the compute capability (major version >= 8 means Ampere or newer)
    # rather than torch.cuda.is_bf16_supported(), which can report True for
    # emulated bf16 on older cards such as the T4.
    supports_bf16 = torch.cuda.get_device_capability(0)[0] >= 8

    if vram >= 70:           # 80 GB cards (H100, A100 80 GB)
        tc.batch, tc.fp16, tc.bf16, tc.grad_accum = 256, False, True,  1
    elif vram >= 38:         # A100 40 GB
        tc.batch, tc.fp16, tc.bf16, tc.grad_accum = 128, False, True,  1
    elif vram >= 14:         # T4 16 GB  /  V100 16 GB
        tc.batch, tc.fp16, tc.bf16, tc.grad_accum = 64,  not supports_bf16, supports_bf16, 1
    else:                    # smaller GPU: use grad accumulation to compensate
        tc.batch, tc.fp16, tc.bf16, tc.grad_accum = 16,  False, False, 4

    tc.dataloader_workers = 2   # Colab is Linux — safe to use background workers
else:
    name, vram = "CPU", 0

# ── Anti-overfitting defaults ─────────────────────────────────────────────────
# Val loss bottomed at ~1.77 around epoch 5, then rose to ~3.0 by epoch 50.
# Early stopping + weight decay fix this without tuning epochs by hand.
tc.epochs                    = 25     # safety cap; early stopping fires first
tc.lr                        = 1e-3
tc.warmup_steps              = 200    # ~2 epochs on T4 (7k samples, batch 64)
tc.weight_decay              = 0.01   # L2 regularisation
tc.early_stopping_patience   = 5      # stop after 5 epochs with no val improvement

# ── Manual overrides (uncomment to adjust) ───────────────────────────────────
# tc.batch                   = 32     # reduce if CUDA OOM
# tc.grad_accum              = 2      # effective batch = batch * grad_accum
# tc.lr                      = 5e-4   # lower LR if val curve is still noisy
# tc.early_stopping_patience = 0      # 0 = disabled (run all epochs)

print(f"GPU          : {name} ({vram:.0f} GB)")
print(f"Batch        : {tc.batch}  (effective: {tc.batch * tc.grad_accum}  grad_accum={tc.grad_accum})")
print(f"Precision    : {'bf16' if tc.bf16 else 'fp16' if tc.fp16 else 'fp32'}")
print(f"Epochs (max) : {tc.epochs}   LR: {tc.lr}   warmup_steps: {tc.warmup_steps}")
print(f"Weight decay : {tc.weight_decay}   early_stopping_patience: {tc.early_stopping_patience}")
print(f"Workers      : {tc.dataloader_workers}")
print(f"Model        : {tc.model_name}")

### Hyperparameter Sweep (optional)

Grid search over `lr` × `weight_decay` to find the best combination before committing to a full 25-epoch run. Each trial runs up to 15 epochs with early stopping patience=3, so most trials exit in 6–9 epochs.
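
To make the "6–9 epochs" estimate concrete, here is a simplified pure-Python model of a patience rule: stop once validation loss has failed to improve for `patience` consecutive epochs. It is a stand-in for illustration, not the Trainer's callback code, and the loss values are hypothetical (shaped like the curve noted in Section 3, with a minimum of 1.77 at epoch 5).

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-based, under a patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Ran out of epochs before the patience budget was used up
    return best_epoch, len(val_losses)


# Hypothetical validation losses, one per epoch (not real measurements)
val_losses = [2.60, 2.10, 1.90, 1.80, 1.77, 1.79, 1.84, 1.90, 1.97, 2.05]

best_epoch, stop_epoch = simulate_early_stopping(val_losses, patience=3)
print(best_epoch, stop_epoch)  # 5 8
```

With the minimum at epoch 5 and a patience of 3, the trial stops after epoch 8, which is why a 15-epoch cap is rarely reached.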

**Grid (9 runs):**

| | `wd=0.0` | `wd=0.01` | `wd=0.1` |
|---|---|---|---|
| `lr=1e-3` | run 1 | run 2 | run 3 |
| `lr=5e-4` | run 4 | run 5 | run 6 |
| `lr=3e-4` | run 7 | run 8 | run 9 |

**Estimated time:** ~3–4 min/run on T4 → ~30 min total. ~8 min total on A100.

Skip to **Section 4 (Train)** if you want to train directly with the defaults from Section 3.
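
The run numbers in the grid follow row-major order, which is also the order in which `itertools.product(lr_values, wd_values)` yields pairs. A small sketch for looking up the settings behind a run number (`run_settings` is an illustrative name, not something the pipeline defines):

```python
import itertools

lr_values = [1e-3, 5e-4, 3e-4]
wd_values = [0.0, 0.01, 0.1]

# product() varies the last iterable fastest, so runs 1-3 share lr=1e-3,
# runs 4-6 share lr=5e-4, and runs 7-9 share lr=3e-4.
run_settings = {
    run: {"lr": lr, "weight_decay": wd}
    for run, (lr, wd) in enumerate(itertools.product(lr_values, wd_values), start=1)
}

print(run_settings[5])
```

Run 5, for example, is the centre of the grid: `lr=5e-4` with `weight_decay=0.01`.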

In [None]:
import shutil, itertools
import torch
import pandas as pd
from pathlib import Path
from torch.utils.data import Dataset
from transformers import (
    DataCollatorForSeq2Seq, EarlyStoppingCallback,
    Seq2SeqTrainer, Seq2SeqTrainingArguments,
    T5ForConditionalGeneration, T5Tokenizer,
)

# ── Inline dataset class (mirrors train.py) ───────────────────────────────────
class _QGDataset(Dataset):
    def __init__(self, data_file, tokenizer, max_input_len, max_output_len):
        df = pd.read_csv(data_file)
        self.tokenizer = tokenizer
        self.max_input_len  = max_input_len
        self.max_output_len = max_output_len
        self.examples = [
            {
                "input_text":  f"<topic> {row['topic']} <context> {row['text']} ",
                "target_text": str(row["question"]),
            }
            for _, row in df.iterrows()
        ]

    def __len__(self):  return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        enc = self.tokenizer(
            ex["input_text"], max_length=self.max_input_len,
            padding="max_length", truncation=True,
        )
        labels = self.tokenizer(
            ex["target_text"], max_length=self.max_output_len,
            padding="max_length", truncation=True,
        ).input_ids
        labels = [(l if l != self.tokenizer.pad_token_id else -100) for l in labels]
        return {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"], "labels": labels}


def sweep_run(train_csv, val_csv, tc, hparams, run_idx):
    """
    Train one sweep trial. Returns metrics dict.
    Output is written to /tmp/sweep_{run_idx}/ and deleted afterwards.
    """
    run_dir = Path(f"/tmp/sweep_{run_idx}")
    run_dir.mkdir(parents=True, exist_ok=True)

    tokenizer = T5Tokenizer.from_pretrained(tc.model_name, legacy=False)
    model     = T5ForConditionalGeneration.from_pretrained(tc.model_name)
    tokenizer.add_tokens(tc.special_tokens)
    model.resize_token_embeddings(len(tokenizer))

    train_ds = _QGDataset(train_csv, tokenizer, tc.max_input_len, tc.max_output_len)
    val_ds   = _QGDataset(val_csv,   tokenizer, tc.max_input_len, tc.max_output_len)

    args = Seq2SeqTrainingArguments(
        output_dir           = str(run_dir),
        num_train_epochs     = hparams.get("max_sweep_epochs", 15),
        per_device_train_batch_size = tc.batch,
        per_device_eval_batch_size  = tc.batch,
        learning_rate        = hparams["lr"],
        warmup_steps         = hparams.get("warmup_steps", 200),
        weight_decay         = hparams["weight_decay"],
        fp16                 = tc.fp16,
        bf16                 = tc.bf16,
        gradient_accumulation_steps = tc.grad_accum,
        save_strategy        = "epoch",
        eval_strategy        = "epoch",
        load_best_model_at_end    = True,
        metric_for_best_model     = "eval_loss",
        greater_is_better    = False,
        save_total_limit     = 1,
        predict_with_generate= False,
        logging_steps        = 99999,   # suppress per-step output
        dataloader_num_workers = tc.dataloader_workers,
        report_to            = "none",
    )

    collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True, label_pad_token_id=-100)
    trainer  = Seq2SeqTrainer(
        model            = model,
        args             = args,
        train_dataset    = train_ds,
        eval_dataset     = val_ds,
        processing_class = tokenizer,
        data_collator    = collator,
        callbacks        = [EarlyStoppingCallback(early_stopping_patience=hparams.get("patience", 3))],
    )

    trainer.train()

    eval_logs = [l for l in trainer.state.log_history if "eval_loss" in l]
    best      = min(eval_logs, key=lambda x: x["eval_loss"]) if eval_logs else {}

    shutil.rmtree(run_dir, ignore_errors=True)   # clean up temp checkpoint

    return {
        **hparams,
        "best_val_loss": round(best.get("eval_loss", float("nan")), 5),
        "best_epoch":    round(best.get("epoch",     -1),           1),
        "epochs_run":    round(trainer.state.epoch,                 1),
        "n_train":       len(train_ds),
    }

print("sweep_run() defined — ready to run grid.")

In [None]:
# ── Run hyperparameter sweep ──────────────────────────────────────────────────
# Grid: lr × weight_decay (9 trials, ~30 min on T4)
# Prerequisite: run Section 3 (code-config) first so `tc` and `pipe` are defined.

from pathlib import Path
import itertools

REPO_DIR  = Path('/content/ai4ed-qg')
train_csv = REPO_DIR / 'data/training/squad/mixsquad/train.csv'
val_csv   = REPO_DIR / 'data/training/squad/mixsquad/val.csv'

lr_values  = [1e-3, 5e-4, 3e-4]
wd_values  = [0.0,  0.01, 0.1]
sweep_grid = list(itertools.product(lr_values, wd_values))

sweep_results = []
for i, (lr, wd) in enumerate(sweep_grid):
    hparams = {
        "lr":               lr,
        "weight_decay":     wd,
        "warmup_steps":     200,
        "patience":         3,
        "max_sweep_epochs": 15,
    }
    print(f"\n[{i+1}/{len(sweep_grid)}] lr={lr}  weight_decay={wd}")
    result = sweep_run(train_csv, val_csv, tc, hparams, run_idx=i)
    sweep_results.append(result)
    print(f"  best_val_loss={result['best_val_loss']}  "
          f"best_epoch={result['best_epoch']}  "
          f"epochs_run={result['epochs_run']}")

In [None]:
# ── Sweep results + apply best config ────────────────────────────────────────
import pandas as pd

df_sweep = (
    pd.DataFrame(sweep_results)
    [["lr", "weight_decay", "warmup_steps", "best_val_loss", "best_epoch", "epochs_run"]]
    .sort_values("best_val_loss")
    .reset_index(drop=True)
)
df_sweep.index += 1   # 1-based rank
print("Sweep results (sorted by val loss):")
display(df_sweep)

# Apply best hyperparameters to tc for the full training run in Section 4 (Train)
best = df_sweep.iloc[0]
tc.lr           = float(best["lr"])
tc.weight_decay = float(best["weight_decay"])
tc.warmup_steps = int(best["warmup_steps"])
tc.epochs       = 25    # restore full epoch cap

print(f"\nBest config applied to tc:")
print(f"  lr={tc.lr}  weight_decay={tc.weight_decay}  warmup_steps={tc.warmup_steps}")
print(f"  -> Now run Section 4, Train (code-train-topic), to train with these settings.")

## 4. Train

Train the model variant you need. The pipeline uses the correct paper format for all modes:
```
Input:  <topic> {topic} <context> {combined text}
Target: {question}
```
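
As a rough illustration of that layout, the sketch below builds one input/target pair by hand. `format_example` is a hypothetical helper that mirrors the f-string in the sweep cell's `_QGDataset` (including its trailing space); the pipeline's own formatting code may differ in detail, and the question text is made up.

```python
def format_example(topic: str, context: str, question: str) -> dict:
    """Build one input/target pair in the <topic> ... <context> ... layout."""
    return {
        # Trailing space kept to match the sweep cell's dataset class
        "input_text": f"<topic> {topic} <context> {context} ",
        "target_text": question,
    }


example = format_example(
    topic="Electronegativity",
    context="Fluorine has the highest electronegativity (4.0).",
    question="Which element has the highest electronegativity?",  # illustrative
)
print(example["input_text"])
```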

Saved to `models/{mode}/best_model/` (best checkpoint by validation loss).

### Training best practices

**Overfitting observed:** val loss bottomed at ~1.77 around epoch 5, then climbed to ~3.0 by epoch 50.
With 7k training samples and T5-small, the model memorises the data well before 50 epochs.
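
With roughly 7k training samples, the number of optimizer steps per epoch depends on the effective batch size, and that in turn decides how many epochs a fixed `warmup_steps=200` covers. The sketch below works through the arithmetic; `steps_per_epoch` is an illustrative helper and the step count is an approximation of what the Trainer actually schedules.

```python
import math


def steps_per_epoch(n_train: int, batch: int, grad_accum: int = 1) -> int:
    """Approximate optimizer steps per epoch for an effective batch size."""
    return math.ceil(n_train / (batch * grad_accum))


n_train = 7_000      # approximate training-set size quoted above
warmup_steps = 200

for batch in (32, 64, 128):
    per_epoch = steps_per_epoch(n_train, batch)
    print(f"batch={batch:<3}  steps/epoch={per_epoch:<3}  "
          f"warmup spans {warmup_steps / per_epoch:.1f} epochs")
```

Under these assumptions the warmup covers about 1.8 epochs at batch 64 but about 3.6 epochs at batch 128, so it may be worth lowering `warmup_steps` when the auto-tuned batch size is larger.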

| Practice | Why | Setting |
|----------|-----|---------|
| **Early stopping** | Halt once val loss stops improving — no need to guess epochs | `early_stopping_patience=5` (default) |
| **Weight decay** | L2 regularisation penalises large weights, slows memorisation | `weight_decay=0.01` (default) |
| **Warmup steps** | Avoids large gradient updates in the first few steps | `warmup_steps=200` (~2 epochs on T4) |
| **Fewer max epochs** | Safety cap — early stopping should fire first, but 20–25 is safer than 50 | `epochs=25` |
| **Lower LR** | If still overfitting, try `5e-4`; slower but smoother val curve | `lr=5e-4` |

> `load_best_model_at_end=True` is always on — the saved checkpoint is the epoch with the **lowest val loss**, not the last epoch.

In [None]:
# ── TopicQG — trained on MixSQuAD (10k mixed pairs) ─────────────────────────
model_path = pipe.train(mode='topic', dataset='squad')
print(f"\nModel saved to: {model_path}")

In [None]:
# ── Baseline — context only, no topic signal ─────────────────────────────────
# model_path = pipe.train(mode='baseline', dataset='squad')
# print(f"Model saved to: {model_path}")

In [None]:
# ── TopicQG2X — trained on MixSQuAD2X (20k, reversed context order) ─────────
# model_path = pipe.train(mode='topic2x', dataset='squad')
# print(f"Model saved to: {model_path}")

## 5. Quick Generation Test

In [None]:
topic   = "Electronegativity"
context = (
    "Electronegativity is a measure of the tendency of an atom to attract "
    "a bonding pair of electrons. The Pauling scale is the most commonly "
    "used. Fluorine has the highest electronegativity (4.0). "
    "Electronegativity increases across a period and decreases down a group."
)

question = pipe.generate(topic=topic, context=context, mode='topic')
print(f"Topic   : {topic}")
print(f"Question: {question}")

## 6. Evaluate

Runs the full metric suite (word-level BLEU, char-level BLEU, F1, METEOR, ROUGE-L, Perplexity) and prints a comparison table against paper baselines.

**KhanQ evaluation** uses the `mixkhanq/data.csv` set (653 pairs, `topic2`/`question2` columns — paper's method).
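
The gap between the word-level and char-level BLEU columns comes down to tokenisation. The sketch below computes BLEU-1 (clipped unigram precision times the brevity penalty) under both tokenisations. `bleu1` is an illustrative helper, not the repository's metric code, and the two sentences are made up.

```python
import math
from collections import Counter


def bleu1(reference_tokens, hypothesis_tokens):
    """BLEU-1: clipped unigram precision multiplied by the brevity penalty."""
    if not hypothesis_tokens:
        return 0.0
    ref_counts = Counter(reference_tokens)
    hyp_counts = Counter(hypothesis_tokens)
    # Each hypothesis token is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    precision = clipped / len(hypothesis_tokens)
    # Brevity penalty: only hypotheses shorter than the reference are penalised
    if len(hypothesis_tokens) >= len(reference_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference_tokens) / len(hypothesis_tokens))
    return bp * precision


reference = "what does electronegativity measure"
hypothesis = "what is electronegativity measuring"

word_score = bleu1(reference.split(), hypothesis.split())  # 0.5: two of four words match
char_score = bleu1(list(reference), list(hypothesis))      # characters overlap far more

print(f"word-level BLEU-1: {word_score:.3f}")
print(f"char-level BLEU-1: {char_score:.3f}")
```

Because near-miss words such as "measure" and "measuring" share most of their characters, char-level scores tend to run well above word-level ones, so the two sets of columns should not be compared with each other.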

In [None]:
# Evaluate T5 models only (no Ollama/Gemini needed)
results = pipe.evaluate(
    models='t5:topic',          # or 't5:baseline,t5:topic,t5:topic2x' or 'all'
    dataset='khanq',
)

In [None]:
import pandas as pd

rows = []
for key, m in results.items():
    rows.append({
        'model':       key,
        'n':           m.get('num_samples', '-'),
        'B1 (word)':   round(m.get('bleu1',      0), 3),
        'B4 (word)':   round(m.get('bleu4',      0), 3),
        'B1c (paper)': round(m.get('bleu1_char', 0), 3),
        'B4c (paper)': round(m.get('bleu4_char', 0), 3),
        'F1':          round(m.get('f1',          0), 3),
        'METEOR':      round(m.get('meteor',      0), 3),
        'ROUGE-L':     round(m.get('rouge_l',     0), 3),
        'PPL':         round(m.get('perplexity',  float('nan')), 3),
    })

df = pd.DataFrame(rows).set_index('model')
pd.set_option('display.max_columns', None)
df

### Paper Baselines (char-level BLEU, KhanQ)

| Model | B1c | B2c | B3c | B4c | F1 | METEOR | ROUGE-L | PPL |
|-------|-----|-----|-----|-----|----|--------|---------|-----|
| Baseline | 0.519 | 0.316 | 0.216 | 0.175 | 0.319 | 0.216 | 0.207 | 1.303 |
| TopicQGedu | 0.551 | 0.335 | 0.221 | 0.177 | 0.302 | 0.216 | 0.204 | 1.360 |
| **TopicQG** | **0.551** | **0.343** | **0.236** | **0.191** | **0.330** | **0.233** | **0.230** | **1.323** |
| TopicQG 8-bit | 0.546 | 0.339 | 0.231 | 0.186 | 0.319 | 0.226 | 0.225 | 1.327 |
| TopicQG 4-bit | 0.543 | 0.337 | 0.231 | 0.186 | 0.318 | 0.223 | 0.223 | 1.334 |
| TopicQG2X | 0.536 | 0.328 | 0.221 | 0.177 | 0.321 | 0.220 | 0.216 | 1.345 |
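
For a quick numeric comparison against the TopicQG row, a sketch like the following subtracts the paper's values from your own. It assumes each entry of `results` is a dict keyed as in the results cell (`bleu1_char`, `bleu4_char`, `f1`, `meteor`, `rouge_l`). `delta_vs_paper` is a hypothetical helper and the example scores are invented.

```python
# Paper values for TopicQG on KhanQ, copied from the table above
PAPER_TOPICQG = {
    "bleu1_char": 0.551,
    "bleu4_char": 0.191,
    "f1": 0.330,
    "meteor": 0.233,
    "rouge_l": 0.230,
}


def delta_vs_paper(metrics: dict, paper: dict = PAPER_TOPICQG) -> dict:
    """Return metric -> (ours - paper) for every metric present in both dicts."""
    return {k: round(metrics[k] - paper[k], 3) for k in paper if k in metrics}


# Hypothetical scores for illustration only (not real results)
example_metrics = {"bleu1_char": 0.540, "bleu4_char": 0.185, "f1": 0.325}
deltas = delta_vs_paper(example_metrics)
print(deltas)
```

Negative values mean the run scored below the paper on that metric. In the notebook you would pass `results['t5:topic']` (or whichever key `pipe.evaluate()` returned) instead of `example_metrics`.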

> Use `B1c`/`B4c` columns from the results table above for direct comparison.

## 7. Download Trained Model

Download the trained model as a zip file to your local machine. The first cell below optionally saves the evaluation results to `results/eval_results.json`; the second zips `models/topic/best_model/` and triggers a browser download.

In [None]:
# Optional: save results summary to a local file before downloading
import json
from pathlib import Path

results_dir = Path('/content/ai4ed-qg/results')
results_dir.mkdir(parents=True, exist_ok=True)

if 'results' in dir():
    out = results_dir / 'eval_results.json'
    with open(out, 'w') as f:
        json.dump(results, f, indent=2, default=str)
    print(f"Results saved to: {out}")
else:
    print("No evaluation results yet — run Section 6 first.")

In [None]:
# Download best model as zip
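import shutil
from pathlib import Path   # imported here so the cell runs even if the optional cell above was skipped
from google.colab import files as colab_files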

model_dir = Path('/content/ai4ed-qg/models/topic/best_model')
if model_dir.exists():
    shutil.make_archive('/content/t5_topic_best_model', 'zip', model_dir)
    colab_files.download('/content/t5_topic_best_model.zip')
else:
    print("Model not found — train first")