# T5 Question Generation — Colab Training

Standalone notebook for training and evaluating T5 topic-controlled question generation on Google Colab.

**Use this notebook when:**
- You want to train on Colab's GPU using datasets already in the repository
- You want to evaluate with the full metric suite against paper baselines

**Steps:**
1. Set up the environment and clone the repo (data files included)
2. Verify training data from the cloned repository
3. Initialise the pipeline and adjust the training config
4. Train one or more model variants with `pipe.train()`
5. Run a quick generation test with `pipe.generate()`
6. Evaluate with `pipe.evaluate()` against paper baselines
7. Download the trained model

**Expected CSV files** (committed to `data/training/` in the repo):
```
data/training/squad/baseline/   train.csv  val.csv  test.csv
data/training/squad/mixsquad/   train.csv  val.csv  test.csv
data/training/khanq/mixkhanq/   data.csv
```

## 1. Setup

In [1]:
# Check GPU
!nvidia-smi

import torch
print(f"\nPyTorch : {torch.__version__}")
print(f"CUDA    : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    mem  = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU     : {name} ({mem:.0f} GB)")
    # Suggest batch size based on available VRAM
    suggested_batch = 128 if mem >= 35 else 64 if mem >= 15 else 32
    print(f"Suggested batch size: {suggested_batch}")
else:
    print("WARNING: No GPU detected. Training will be very slow.")

Mon Feb 23 17:02:53 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   38C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+


In [2]:
import sys, os
from pathlib import Path

# ── Clone repository ──────────────────────────────────────────────────────────
# TODO: replace with your actual repository URL
REPO_URL = "https://github.com/Byambaa0325/question-generation-distillation.git"
!git clone {REPO_URL} /content/ai4ed-qg -q
%cd /content/ai4ed-qg

# ── Install dependencies ──────────────────────────────────────────────────────
!pip install -q torch transformers datasets accelerate sentencepiece \
                evaluate rouge_score nltk sentence-transformers \
                pyyaml tqdm pandas python-dotenv

import nltk
for res in ('punkt', 'punkt_tab', 'wordnet', 'omw-1.4'):
    nltk.download(res, quiet=True)

sys.path.insert(0, '/content/ai4ed-qg')
os.chdir('/content/ai4ed-qg')
print(f"Working dir: {os.getcwd()}")

/content/ai4ed-qg
  Preparing metadata (setup.py) ... done
     84.1/84.1 kB 3.9 MB/s eta 0:00:00
  Building wheel for rouge_score (setup.py) ... done
Working dir: /content/ai4ed-qg


In [None]:
# Data is included in the cloned repository — no Google Drive needed.
from pathlib import Path

REPO_DIR = Path('/content/ai4ed-qg')
print(f"Repo    : {REPO_DIR}")
print(f"Data dir: {REPO_DIR / 'data'} — exists: {(REPO_DIR / 'data').exists()}")

## 2. Verify Training Data

Training data is committed to the repository and was cloned in Step 1. No upload or Drive mounting is needed.

Run the cell below to confirm all expected files are present.

In [None]:
# Verify training data from cloned repo
from pathlib import Path

REPO_DIR = Path('/content/ai4ed-qg')

check_paths = [
    'data/training/squad/baseline/train.csv',
    'data/training/squad/baseline/val.csv',
    'data/training/squad/baseline/test.csv',
    'data/training/squad/mixsquad/train.csv',
    'data/training/squad/mixsquad/val.csv',
    'data/training/squad/mixsquad/test.csv',
    'data/training/khanq/mixkhanq/data.csv',
]

all_ok = True
for rel in check_paths:
    p = REPO_DIR / rel
    if p.exists():
        print(f"  [OK]      {rel}  ({p.stat().st_size:,} bytes)")
    else:
        print(f"  [MISSING] {rel}")
        all_ok = False

if all_ok:
    print("\nAll training files present — ready to train.")
else:
    print("\nSome files missing. Available CSVs in data/training/:")
    for f in sorted((REPO_DIR / 'data/training').rglob('*.csv')):
        print(f"  {f.relative_to(REPO_DIR)}")

In [None]:
# No file placement needed — data is already in the correct paths from the cloned repo.
print("Data paths are set up by the repository structure. Proceed to Step 3.")

In [None]:
# Preview first few rows of a training file to confirm format
import pandas as pd
from pathlib import Path

REPO_DIR = Path('/content/ai4ed-qg')
sample_csv = REPO_DIR / 'data/training/squad/mixsquad/train.csv'

if sample_csv.exists():
    df = pd.read_csv(sample_csv)
    print(f"mixsquad/train.csv — {len(df):,} rows, columns: {list(df.columns)}")
    display(df.head(3))
else:
    print(f"File not found: {sample_csv}")

## 3. Initialise Pipeline

In [5]:
from src.pipeline import Pipeline

pipe = Pipeline('config/pipeline.yaml')
pipe.status()


Pipeline status:
  [-] convert.squad.text
  [-] convert.squad.question
  [-] convert.khanq.text
  [-] convert.khanq.question
  [-] wikify.squad.text
  [-] wikify.squad.question
  [-] wikify.khanq.text
  [-] wikify.khanq.question
  [-] topics.squad.enriched
  [-] topics.squad.filtered
  [-] topics.khanq.enriched
  [-] topics.khanq.filtered
  [-] dataset.squad.baseline
  [-] dataset.squad.mixsquad
  [-] dataset.squad.mixsquad2x
  [-] dataset.khanq.baseline
  [-] dataset.khanq.mixsquad
  [-] dataset.khanq.mixsquad2x
  [-] train.baseline
  [-] train.topic
  [-] train.topic2x


{'convert.squad.text': False,
 'convert.squad.question': False,
 'convert.khanq.text': False,
 'convert.khanq.question': False,
 'wikify.squad.text': False,
 'wikify.squad.question': False,
 'wikify.khanq.text': False,
 'wikify.khanq.question': False,
 'topics.squad.enriched': False,
 'topics.squad.filtered': False,
 'topics.khanq.enriched': False,
 'topics.khanq.filtered': False,
 'dataset.squad.baseline': False,
 'dataset.squad.mixsquad': False,
 'dataset.squad.mixsquad2x': False,
 'dataset.khanq.baseline': False,
 'dataset.khanq.mixsquad': False,
 'dataset.khanq.mixsquad2x': False,
 'train.baseline': False,
 'train.topic': False,
 'train.topic2x': False}

In [6]:
# ── Tweak training config for your GPU ───────────────────────────────────────
# Edit pipeline.yaml to make changes permanent, or override here:

pipe.config.training.batch  = 64     # 128 for A100, 64 for T4/V100, 32 if OOM
pipe.config.training.epochs = 50     # paper uses 50
pipe.config.training.lr     = 1e-3

t = pipe.config.training
print(f"Model      : {t.model_name}")
print(f"Batch size : {t.batch}")
print(f"Epochs     : {t.epochs}")
print(f"LR         : {t.lr}")
print(f"Max input  : {t.max_input_len} tokens")
print(f"Max output : {t.max_output_len} tokens")

Model      : google-t5/t5-small
Batch size : 64
Epochs     : 50
LR         : 0.001
Max input  : 200 tokens
Max output : 45 tokens
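
The batch size and epoch count above determine how many optimizer steps a run takes. The sketch below does that arithmetic under stated assumptions: no gradient accumulation, the last partial batch of each epoch is kept, and a placeholder training-set size of 10,000 examples (check the row count of your own `train.csv`). `total_optimizer_steps` is a hypothetical helper, not part of the pipeline.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Count optimizer steps, assuming one step per batch and no dropped batches."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# ASSUMPTION: 10,000 is a placeholder, not the measured size of any train.csv.
NUM_EXAMPLES = 10_000

for batch_size in (32, 64, 128):
    steps = total_optimizer_steps(NUM_EXAMPLES, batch_size, epochs=50)
    print(f"batch={batch_size:>3}  total steps={steps:,}")
```

Halving the batch size roughly doubles the number of steps, which is worth keeping in mind when falling back to 32 after an out-of-memory error.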


## 4. Train

Train the model variant you need. In every mode, the pipeline formats examples as in the paper:
```
Input:  <topic> {topic} <context> {combined text}
Target: {question}
```

The best checkpoint by validation loss is saved to `models/{mode}/best_model/`.
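
To make the input format concrete, here is a minimal sketch of how such a string could be assembled. `format_input` is a hypothetical helper written for illustration; the pipeline builds its inputs internally, and the whitespace normalisation shown here is an assumption rather than documented behaviour.

```python
def format_input(topic: str, context: str) -> str:
    """Build an input string in the `<topic> ... <context> ...` layout."""
    # ASSUMPTION: collapse runs of whitespace so newlines do not leak into the input.
    topic = " ".join(topic.split())
    context = " ".join(context.split())
    return f"<topic> {topic} <context> {context}"


example = format_input(
    "Electronegativity",
    "Fluorine has the highest electronegativity (4.0).",
)
print(example)
```

The target side needs no template: it is the question text on its own.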

In [None]:
# ── TopicQG — trained on MixSQuAD (10k mixed pairs) ─────────────────────────
model_path = pipe.train(mode='topic', dataset='squad')
print(f"\nModel saved to: {model_path}")

In [None]:
# ── Baseline — context only, no topic signal ─────────────────────────────────
# model_path = pipe.train(mode='baseline', dataset='squad')
# print(f"Model saved to: {model_path}")

In [None]:
# ── TopicQG2X — trained on MixSQuAD2X (20k, reversed context order) ─────────
# model_path = pipe.train(mode='topic2x', dataset='squad')
# print(f"Model saved to: {model_path}")
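
The cell comments mention "mixed pairs" and a "reversed context order" variant that doubles the data from 10k to 20k. The dataset builder itself is not shown in this notebook, so the following is only one plausible reading: each example concatenates the context that answers the question with an unrelated distractor context, and the 2X variant also emits the same pair with the two contexts swapped. `make_mixed_examples` and its parameters are hypothetical names.

```python
from typing import Dict, List


def make_mixed_examples(
    topic: str,
    question: str,
    context: str,
    distractor: str,
    both_orders: bool = False,
) -> List[Dict[str, str]]:
    """Return one mixed example, or two when both context orders are requested."""
    orders = [(context, distractor)]
    if both_orders:
        # ASSUMPTION: the 2X variant repeats the pair with the contexts swapped.
        orders.append((distractor, context))
    return [
        {
            "input": f"<topic> {topic} <context> {first} {second}",
            "target": question,
        }
        for first, second in orders
    ]


pairs = make_mixed_examples(
    topic="Electronegativity",
    question="Which element has the highest electronegativity?",
    context="Fluorine has the highest electronegativity (4.0).",
    distractor="The Pauling scale is the most commonly used.",
    both_orders=True,
)
for pair in pairs:
    print(pair["input"])
```

Under this reading, the topic is the only signal telling the model which half of the combined text the question should be about.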

## 5. Quick Generation Test

In [None]:
topic   = "Electronegativity"
context = (
    "Electronegativity is a measure of the tendency of an atom to attract "
    "a bonding pair of electrons. The Pauling scale is the most commonly "
    "used. Fluorine has the highest electronegativity (4.0). "
    "Electronegativity increases across a period and decreases down a group."
)

question = pipe.generate(topic=topic, context=context, mode='topic')
print(f"Topic   : {topic}")
print(f"Question: {question}")

## 6. Evaluate

This step runs the full metric suite (word-level BLEU, char-level BLEU, F1, METEOR, ROUGE-L, Perplexity) and prints a comparison table against the paper baselines.

**KhanQ evaluation** uses the `mixkhanq/data.csv` set (653 pairs, `topic2`/`question2` columns — paper's method).
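
Word-level and char-level BLEU can differ a lot on the same output, because a near-miss word still shares most of its characters with the reference. The sketch below shows this with clipped n-gram precision, the building block of BLEU. It is a simplified illustration and not the pipeline's metric code: it omits the brevity penalty and the geometric mean over n-gram orders, and the lowercasing and whitespace tokenisation are assumptions.

```python
from collections import Counter
from typing import List


def tokens(text: str, level: str) -> List[str]:
    """Split into lowercase words, or into characters (spaces included) for 'char'."""
    text = text.lower()
    return text.split() if level == "word" else list(text)


def ngram_precision(reference: str, hypothesis: str, n: int, level: str = "word") -> float:
    """Clipped n-gram precision of the hypothesis against a single reference."""
    ref = tokens(reference, level)
    hyp = tokens(hypothesis, level)
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "what is electronegativity"
hypothesis = "what is electronegative"
word_p1 = ngram_precision(reference, hypothesis, n=1, level="word")
char_p1 = ngram_precision(reference, hypothesis, n=1, level="char")
print(f"word-level 1-gram precision: {word_p1:.3f}")
print(f"char-level 1-gram precision: {char_p1:.3f}")
```

The char-level score is higher here because the mismatched word still overlaps the reference character by character, so the two families of scores should not be compared with each other.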

In [None]:
# Evaluate T5 models only (no Ollama/Gemini needed)
results = pipe.evaluate(
    models='t5:topic',          # or 't5:baseline,t5:topic,t5:topic2x' or 'all'
    dataset='khanq',
)

In [None]:
import pandas as pd

rows = []
for key, m in results.items():
    rows.append({
        'model':       key,
        'n':           m.get('num_samples', '-'),
        'B1 (word)':   round(m.get('bleu1',      0), 3),
        'B4 (word)':   round(m.get('bleu4',      0), 3),
        'B1c (paper)': round(m.get('bleu1_char', 0), 3),
        'B4c (paper)': round(m.get('bleu4_char', 0), 3),
        'F1':          round(m.get('f1',          0), 3),
        'METEOR':      round(m.get('meteor',      0), 3),
        'ROUGE-L':     round(m.get('rouge_l',     0), 3),
        'PPL':         round(m.get('perplexity',  float('nan')), 3),
    })

df = pd.DataFrame(rows).set_index('model')
pd.set_option('display.max_columns', None)
df

### Paper Baselines (char-level BLEU, KhanQ)

| Model | B1c | B2c | B3c | B4c | F1 | METEOR | ROUGE-L | PPL |
|-------|-----|-----|-----|-----|----|--------|---------|-----|
| Baseline | 0.519 | 0.316 | 0.216 | 0.175 | 0.319 | 0.216 | 0.207 | 1.303 |
| TopicQGedu | 0.551 | 0.335 | 0.221 | 0.177 | 0.302 | 0.216 | 0.204 | 1.360 |
| **TopicQG** | **0.551** | **0.343** | **0.236** | **0.191** | **0.330** | **0.233** | **0.230** | **1.323** |
| TopicQG 8-bit | 0.546 | 0.339 | 0.231 | 0.186 | 0.319 | 0.226 | 0.225 | 1.327 |
| TopicQG 4-bit | 0.543 | 0.337 | 0.231 | 0.186 | 0.318 | 0.223 | 0.223 | 1.334 |
| TopicQG2X | 0.536 | 0.328 | 0.221 | 0.177 | 0.321 | 0.220 | 0.216 | 1.345 |

> For a direct comparison, use the `B1c (paper)` and `B4c (paper)` columns from the results table above; the paper reports character-level BLEU.
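
One way to read your numbers against the table is to subtract the paper's value from each metric. The sketch below does this for the TopicQG row. `delta_vs_paper` is a hypothetical helper, and it assumes each entry of `results` is a dict keyed by the metric names used in the results cell (`bleu1_char`, `bleu4_char`, `f1`, `meteor`, `rouge_l`, `perplexity`).

```python
from typing import Dict

# TopicQG row of the paper-baseline table, keyed by the results cell's metric names.
PAPER_TOPICQG: Dict[str, float] = {
    "bleu1_char": 0.551,
    "bleu4_char": 0.191,
    "f1": 0.330,
    "meteor": 0.233,
    "rouge_l": 0.230,
    "perplexity": 1.323,
}


def delta_vs_paper(
    metrics: Dict[str, float],
    paper: Dict[str, float] = PAPER_TOPICQG,
) -> Dict[str, float]:
    """Return (your value - paper value) for every metric present in both dicts."""
    return {
        name: round(metrics[name] - value, 3)
        for name, value in paper.items()
        if name in metrics
    }


# Sanity check: comparing the paper row with itself gives all-zero deltas.
print(delta_vs_paper(PAPER_TOPICQG))
```

In the notebook you would pass one entry of `results` instead, for example `delta_vs_paper(results[key])` for a model key of your choice. Positive deltas are improvements for every metric except perplexity, where lower is better.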

## 7. Download Trained Model

Download the trained model as a zip file to your local machine. The first cell below optionally saves the evaluation results to `results/eval_results.json`; the second zips `models/topic/best_model/` and triggers a browser download.

In [None]:
# Optional: save results summary to a local file before downloading
import json
from pathlib import Path

results_dir = Path('/content/ai4ed-qg/results')
results_dir.mkdir(parents=True, exist_ok=True)

if 'results' in dir():
    out = results_dir / 'eval_results.json'
    with open(out, 'w') as f:
        json.dump(results, f, indent=2, default=str)
    print(f"Results saved to: {out}")
else:
    print("No evaluation results yet — run Section 6 first.")

In [None]:
# Download best model as zip
import shutil
from pathlib import Path  # imported here so the cell also runs on its own
from google.colab import files as colab_files

model_dir = Path('/content/ai4ed-qg/models/topic/best_model')
if model_dir.exists():
    shutil.make_archive('/content/t5_topic_best_model', 'zip', model_dir)
    colab_files.download('/content/t5_topic_best_model.zip')
else:
    print("Model not found — train first")