# T5 Question Generation — Colab Training

Standalone notebook for training and evaluating T5 models for topic-controlled question generation on Google Colab.

**Use this notebook when:**
- You have already run the data pipeline locally (stages 1–4) and have the CSV files
- You want to train on Colab's GPU and evaluate with the full metric suite

**Steps:**
1. Set up the environment and clone the repo
2. Upload training CSVs from your local pipeline run
3. Train one or more model variants with `pipe.train()`
4. Evaluate with `pipe.evaluate()` against paper baselines
5. Download the trained model

**Expected CSV files** (produced by `01_data_generation.ipynb` or `pipeline.py dataset`):
```
data/training/squad/baseline/  train.csv  val.csv  test.csv
data/training/squad/mixsquad/  train.csv  val.csv  test.csv
data/training/khanq/mixkhanq/  data.csv
```
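Before training, it can help to confirm that these files are actually in place. The sketch below uses a hypothetical helper (`missing_csvs` is not part of the repository) that checks the layout listed above relative to a root directory and returns whatever is absent.

```python
from pathlib import Path

# Expected layout, copied from the listing above.
EXPECTED_CSVS = {
    "data/training/squad/baseline": ("train.csv", "val.csv", "test.csv"),
    "data/training/squad/mixsquad": ("train.csv", "val.csv", "test.csv"),
    "data/training/khanq/mixkhanq": ("data.csv",),
}

def missing_csvs(root="."):
    """Return the expected CSV paths (relative to root) that do not exist."""
    root = Path(root)
    missing = []
    for folder, names in EXPECTED_CSVS.items():
        for name in names:
            rel = Path(folder) / name
            if not (root / rel).is_file():
                missing.append(rel.as_posix())
    return missing
```

Calling `missing_csvs('/content/ai4ed-qg')` after the upload step returns an empty list when every file is present. You only need the files for the mode you plan to train, so a non-empty result is not necessarily an error.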

## 1. Setup

In [None]:
# Check GPU
!nvidia-smi

import torch
print(f"\nPyTorch : {torch.__version__}")
print(f"CUDA    : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    mem  = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU     : {name} ({mem:.0f} GB)")
    # Suggest batch size based on available VRAM
    suggested_batch = 128 if mem >= 35 else 64 if mem >= 15 else 32
    print(f"Suggested batch size: {suggested_batch}")
else:
    print("WARNING: No GPU detected. Training will be very slow.")

In [None]:
import sys, os
from pathlib import Path

# ── Clone repository ──────────────────────────────────────────────────────────
# TODO: replace with your actual repository URL
REPO_URL = "https://github.com/YOUR_ORG/YOUR_REPO.git"
# Clone only if the directory is missing, so this cell can be re-run safely
if not Path('/content/ai4ed-qg').exists():
    !git clone -q {REPO_URL} /content/ai4ed-qg
%cd /content/ai4ed-qg

# ── Install dependencies ──────────────────────────────────────────────────────
!pip install -q torch transformers datasets accelerate sentencepiece \
                evaluate rouge_score nltk sentence-transformers \
                pyyaml tqdm pandas python-dotenv

import nltk
for res in ('punkt', 'punkt_tab', 'wordnet', 'omw-1.4'):
    nltk.download(res, quiet=True)

sys.path.insert(0, '/content/ai4ed-qg')
os.chdir('/content/ai4ed-qg')
print(f"Working dir: {os.getcwd()}")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

DRIVE_DIR = Path('/content/drive/MyDrive/ai4ed_qg')
DRIVE_DIR.mkdir(parents=True, exist_ok=True)
print(f"Drive directory: {DRIVE_DIR}")

# Restore any previously saved models and results from Drive
import shutil
for subdir in ('models', 'results'):
    src = DRIVE_DIR / subdir
    dst = Path('/content/ai4ed-qg') / subdir
    if src.exists():
        shutil.copytree(src, dst, dirs_exist_ok=True)
        print(f"Restored {subdir}/ from Drive")

## 2. Upload Training Data

Upload the CSV files produced by the data pipeline. You need at minimum:
- `train.csv` + `val.csv` for the mode you want to train
- `data.csv` (MixKhanQ) for evaluation

- **Option A**: Upload from your local machine using the cell below.
- **Option B**: Copy from Drive if you already uploaded them.
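Because the baseline and MixSQuAD splits share the same filenames, the destination of an uploaded file depends on which mode you intend to train. As a rough sketch, the filename-to-destination mapping could be generated rather than typed by hand. `placement_for` is a hypothetical helper, and the mode-to-directory mapping is taken from the comments in this notebook; where the `topic2x` data lives is not stated here, so that mode is left out.

```python
# Mode → training directory, as described in this notebook's comments.
MODE_DIRS = {
    "topic": "data/training/squad/mixsquad",
    "baseline": "data/training/squad/baseline",
}
EVAL_CSV = "data/training/khanq/mixkhanq/data.csv"

def placement_for(mode, uploaded_names):
    """Map uploaded filenames to destination paths for one training mode."""
    if mode not in MODE_DIRS:
        raise ValueError(f"unknown mode: {mode!r}")
    placement = {}
    for name in uploaded_names:
        if name == "data.csv":
            placement[name] = EVAL_CSV
        elif name in ("train.csv", "val.csv", "test.csv"):
            placement[name] = f"{MODE_DIRS[mode]}/{name}"
        # Any other filename is ignored.
    return placement
```

The returned dict has the same shape as the `file_placement` mapping used in the placement cell, so it could be assigned to that variable directly.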

In [None]:
# ── Option A: upload from local machine ──────────────────────────────────────
# Run this cell and select your CSV files.
# Uploads land in the current working directory; the next cell moves them
# into the correct data/training/ subdirectory.

from google.colab import files

print("Select CSV files to upload (train.csv, val.csv, test.csv, data.csv)...")
uploaded = files.upload()

for filename in uploaded:
    print(f"Uploaded: {filename}")

# Destinations to use in the next cell's `file_placement` mapping:
# data/training/squad/mixsquad/train.csv  → for 'topic' mode
# data/training/squad/baseline/train.csv  → for 'baseline' mode
# data/training/khanq/mixkhanq/data.csv   → for evaluation

In [None]:
# ── Place uploaded files into the correct directories ─────────────────────────
# Edit this mapping to match what you uploaded.
# Keys are uploaded filenames, values are destination paths.

import shutil
from pathlib import Path

file_placement = {
    # 'train.csv': 'data/training/squad/mixsquad/train.csv',
    # 'val.csv':   'data/training/squad/mixsquad/val.csv',
    # 'test.csv':  'data/training/squad/mixsquad/test.csv',
    # 'data.csv':  'data/training/khanq/mixkhanq/data.csv',
}

for src_name, dst_rel in file_placement.items():
    src = Path(src_name)
    dst = Path(dst_rel)
    if src.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, dst)
        print(f"Placed: {src_name} → {dst}")
    else:
        print(f"Not found: {src_name}")

In [None]:
# ── Option B: copy from Drive ─────────────────────────────────────────────────
# If you already have data on Drive, copy it here:
import shutil
for subdir in ('processed', 'training'):
    src = DRIVE_DIR / subdir
    dst = Path('/content/ai4ed-qg/data') / subdir
    if src.exists():
        shutil.copytree(src, dst, dirs_exist_ok=True)
        print(f"Restored data/{subdir}/ from Drive")
    else:
        print(f"Not found in Drive: {subdir}/")

## 3. Initialise Pipeline

In [None]:
from src.pipeline import Pipeline

pipe = Pipeline('config/pipeline.yaml')
pipe.status()

In [None]:
# ── Tweak training config for your GPU ───────────────────────────────────────
# Edit pipeline.yaml to make changes permanent, or override here:

pipe.config.training.batch  = 64     # 128 for A100, 64 for T4/V100, 32 if OOM
pipe.config.training.epochs = 50     # paper uses 50
pipe.config.training.lr     = 1e-3

t = pipe.config.training
print(f"Model      : {t.model_name}")
print(f"Batch size : {t.batch}")
print(f"Epochs     : {t.epochs}")
print(f"LR         : {t.lr}")
print(f"Max input  : {t.max_input_len} tokens")
print(f"Max output : {t.max_output_len} tokens")

## 4. Train

Train the model variant you need. The pipeline applies the paper's input format in all modes:
```
Input:  <topic> {topic} <context> {combined text}
Target: {question}
```

The best checkpoint (by validation loss) is saved to `models/{mode}/best_model/`.
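To make the input format concrete, here is a minimal sketch of how one input string could be assembled. `format_example` is a hypothetical helper for illustration only; the pipeline's own formatting code may differ, for instance in whitespace handling or truncation.

```python
def format_example(topic, context):
    """Build a model input in the '<topic> ... <context> ...' layout shown above."""
    # Collapse runs of whitespace so line breaks in the inputs do not leak in.
    topic = " ".join(topic.split())
    context = " ".join(context.split())
    return f"<topic> {topic} <context> {context}"

source = format_example(
    "Electronegativity",
    "Fluorine has the highest\nelectronegativity.",
)
print(source)
```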

In [None]:
# ── TopicQG — trained on MixSQuAD (10k mixed pairs) ─────────────────────────
model_path = pipe.train(mode='topic', dataset='squad')
print(f"\nModel saved to: {model_path}")

In [None]:
# ── Baseline — context only, no topic signal ─────────────────────────────────
# model_path = pipe.train(mode='baseline', dataset='squad')
# print(f"Model saved to: {model_path}")

In [None]:
# ── TopicQG2X — trained on MixSQuAD2X (20k, reversed context order) ─────────
# model_path = pipe.train(mode='topic2x', dataset='squad')
# print(f"Model saved to: {model_path}")

## 5. Quick Generation Test

In [None]:
topic   = "Electronegativity"
context = (
    "Electronegativity is a measure of the tendency of an atom to attract "
    "a bonding pair of electrons. The Pauling scale is the most commonly "
    "used. Fluorine has the highest electronegativity (4.0). "
    "Electronegativity increases across a period and decreases down a group."
)

question = pipe.generate(topic=topic, context=context, mode='topic')
print(f"Topic   : {topic}")
print(f"Question: {question}")

## 6. Evaluate

This step runs the full metric suite (word-level BLEU, char-level BLEU, F1, METEOR, ROUGE-L, Perplexity) and prints a comparison table against the paper baselines.

**KhanQ evaluation** uses the `mixkhanq/data.csv` set (653 pairs, read from the `topic2`/`question2` columns, following the paper's method).
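Word-level and character-level BLEU differ only in what counts as a token. The sketch below is not the pipeline's metric code. It computes clipped n-gram precision, the building block of BLEU, under both tokenisations to show why character-level scores tend to be higher. The function name and the example sentences are made up for illustration.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision between two token sequences."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

hyp = "what is electronegativity"
ref = "what does electronegativity measure"

word_p1 = ngram_precision(hyp.split(), ref.split())  # tokens are words
char_p1 = ngram_precision(list(hyp), list(ref))      # tokens are characters
print(f"word-level: {word_p1:.3f}  char-level: {char_p1:.3f}")
```

For this pair the character-level precision is higher than the word-level one, because a single mismatched word still shares most of its characters with the reference. This is why word-level and character-level BLEU numbers should not be compared with each other.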

In [None]:
# Evaluate T5 models only (no Ollama/Gemini needed)
results = pipe.evaluate(
    models='t5:topic',          # or 't5:baseline,t5:topic,t5:topic2x' or 'all'
    dataset='khanq',
)

In [None]:
import pandas as pd

rows = []
for key, m in results.items():
    rows.append({
        'model':       key,
        'n':           m.get('num_samples', '-'),
        'B1 (word)':   round(m.get('bleu1',      0), 3),
        'B4 (word)':   round(m.get('bleu4',      0), 3),
        'B1c (paper)': round(m.get('bleu1_char', 0), 3),
        'B4c (paper)': round(m.get('bleu4_char', 0), 3),
        'F1':          round(m.get('f1',          0), 3),
        'METEOR':      round(m.get('meteor',      0), 3),
        'ROUGE-L':     round(m.get('rouge_l',     0), 3),
        'PPL':         round(m.get('perplexity',  float('nan')), 3),
    })

df = pd.DataFrame(rows).set_index('model')
pd.set_option('display.max_columns', None)
df

### Paper Baselines (char-level BLEU, KhanQ)

| Model | B1c | B2c | B3c | B4c | F1 | METEOR | ROUGE-L | PPL |
|-------|-----|-----|-----|-----|----|--------|---------|-----|
| Baseline | 0.519 | 0.316 | 0.216 | 0.175 | 0.319 | 0.216 | 0.207 | 1.303 |
| TopicQGedu | 0.551 | 0.335 | 0.221 | 0.177 | 0.302 | 0.216 | 0.204 | 1.360 |
| **TopicQG** | **0.551** | **0.343** | **0.236** | **0.191** | **0.330** | **0.233** | **0.230** | **1.323** |
| TopicQG 8-bit | 0.546 | 0.339 | 0.231 | 0.186 | 0.319 | 0.226 | 0.225 | 1.327 |
| TopicQG 4-bit | 0.543 | 0.337 | 0.231 | 0.186 | 0.318 | 0.223 | 0.223 | 1.334 |
| TopicQG2X | 0.536 | 0.328 | 0.221 | 0.177 | 0.321 | 0.220 | 0.216 | 1.345 |

> Use `B1c`/`B4c` columns from the results table above for direct comparison.
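A small sketch of that comparison, assuming each entry of `results` is a metric dict with keys such as `bleu1_char` and `bleu4_char`, as in the table-building cell above. The paper values are copied from the TopicQG row; `delta_vs_paper` and the sample numbers are hypothetical.

```python
# TopicQG row from the paper-baselines table above.
PAPER_TOPICQG = {
    "bleu1_char": 0.551,
    "bleu4_char": 0.191,
    "f1": 0.330,
    "meteor": 0.233,
    "rouge_l": 0.230,
    "perplexity": 1.323,
}

def delta_vs_paper(metrics, paper=PAPER_TOPICQG):
    """Return (metric - paper value) for every metric present in both dicts."""
    return {k: round(metrics[k] - v, 3) for k, v in paper.items() if k in metrics}

# Made-up numbers standing in for one entry of `results`.
example = {"bleu1_char": 0.540, "bleu4_char": 0.200}
print(delta_vs_paper(example))
```

Positive values mean the run scored above the paper on that metric. For perplexity, lower is better, so the sign reads the other way.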

## 7. Save to Drive

In [None]:
import shutil

# Sync models and results to Drive
for subdir in ('models', 'results'):
    src = Path('/content/ai4ed-qg') / subdir
    dst = DRIVE_DIR / subdir
    if src.exists():
        shutil.copytree(src, dst, dirs_exist_ok=True)
        print(f"Synced {subdir}/ to Drive")

print(f"\nAll files saved to: {DRIVE_DIR}")

In [None]:
# Download best model as zip
import shutil
from google.colab import files as colab_files

model_dir = Path('/content/ai4ed-qg/models/topic/best_model')
if model_dir.exists():
    shutil.make_archive('/content/t5_topic_best_model', 'zip', model_dir)
    colab_files.download('/content/t5_topic_best_model.zip')
else:
    print("Model not found — train first")