# Zero-Shot T5 Baseline — KhanQ Evaluation

Evaluates the **base pretrained `t5-small`** (no fine-tuning) against the KhanQ evaluation set.

**Purpose:** Establish how much fine-tuning on MixSQuAD contributes over T5's pre-training alone.

**Prompt:** `generate a question about {topic}: {context}`
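
As a minimal sketch of how this template might be filled in, the helper below formats one topic/context pair. `build_prompt` and `PROMPT_TEMPLATE` are hypothetical names used for illustration; the actual prompt construction happens inside `src.pipeline` and may differ, for example in how it handles whitespace or truncation.

```python
PROMPT_TEMPLATE = "generate a question about {topic}: {context}"


def build_prompt(topic: str, context: str) -> str:
    """Fill the zero-shot prompt template with one topic/context pair."""
    # Assumption: collapse runs of whitespace so multi-line contexts
    # become a single line before being passed to the tokenizer.
    topic = " ".join(topic.split())
    context = " ".join(context.split())
    return PROMPT_TEMPLATE.format(topic=topic, context=context)


prompt = build_prompt("photosynthesis", "Plants convert light\ninto chemical energy.")
print(prompt)
```

Because the base model was never trained on this prefix, the prompt acts only as a plain-text hint rather than a learned task instruction.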

**Dataset:** `data/training/khanq/mixkhanq/data.csv` (653 pairs), using the `topic2` and `question2` columns, following the paper's method

**Metrics:** word-level BLEU, char-level BLEU (paper), F1, METEOR, ROUGE-L, Perplexity
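
Word-level and character-level BLEU can diverge sharply, because character n-grams still match on shared spelling when whole words differ. The sketch below is a simplified pure-Python BLEU (clipped n-gram precision, uniform weights, brevity penalty, no smoothing) intended only to illustrate the two tokenizations. It is not the pipeline's metric code, and keeping spaces as characters in the character-level variant is an assumption about the paper's method.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(pred_tokens, ref_tokens, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    if not pred_tokens:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts = ngrams(pred_tokens, n)
        ref_counts = ngrams(ref_tokens, n)
        total = sum(pred_counts.values())
        # Clipped matches: a predicted n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        if total == 0 or matches == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty for predictions shorter than the reference.
    if len(pred_tokens) >= len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(pred_tokens))
    return bp * math.exp(sum(log_precisions) / max_n)


pred = "what is the role of chlorophyll"
ref = "what is the function of chlorophyll"

word_b1 = bleu(pred.split(), ref.split(), max_n=1)
word_b4 = bleu(pred.split(), ref.split(), max_n=4)
char_b1 = bleu(list(pred), list(ref), max_n=1)  # spaces kept as characters
char_b4 = bleu(list(pred), list(ref), max_n=4)

print(f"word B1={word_b1:.3f}  word B4={word_b4:.3f}")
print(f"char B1={char_b1:.3f}  char B4={char_b4:.3f}")
```

In this toy pair a single substituted word removes every matching word 4-gram, so word-level B4 drops to zero while character-level B4 stays positive. This is why the two families of scores should not be compared with each other.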

In [None]:
import sys, os
from pathlib import Path

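# Add project root to path (assumes this notebook sits one level below the root).
# The guard keeps a re-run of this cell from climbing one more directory.
if 'ROOT' not in globals():
    ROOT = Path('..').resolve()
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))
os.chdir(ROOT)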

import torch
print(f'PyTorch     : {torch.__version__}')
print(f'CUDA        : {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU         : {torch.cuda.get_device_name(0)}')

import transformers
print(f'transformers: {transformers.__version__}')
print(f'Working dir : {os.getcwd()}')

In [None]:
from src.pipeline import Pipeline

pipe = Pipeline('config/pipeline.yaml')

results = pipe.evaluate(
    models='t5:zero',
    dataset='khanq',
)

In [None]:
import pandas as pd

# Use NaN for missing metrics so an absent value is not shown as a real score of 0.
NAN = float('nan')

rows = []
for key, m in results.items():
    rows.append({
        'model':       key,
        'n':           m.get('num_samples', '-'),
        'B1 (word)':   round(m.get('bleu1',      NAN), 3),
        'B4 (word)':   round(m.get('bleu4',      NAN), 3),
        'B1c (paper)': round(m.get('bleu1_char', NAN), 3),
        'B4c (paper)': round(m.get('bleu4_char', NAN), 3),
        'F1':          round(m.get('f1',         NAN), 3),
        'METEOR':      round(m.get('meteor',     NAN), 3),
        'ROUGE-L':     round(m.get('rouge_l',    NAN), 3),
        'PPL':         round(m.get('perplexity', NAN), 3),
    })

df = pd.DataFrame(rows).set_index('model')
pd.set_option('display.max_columns', None)
display(df)

### Paper Baselines (char-level BLEU, KhanQ)

| Model | B1c | B2c | B3c | B4c | F1 | METEOR | ROUGE-L | PPL |
|-------|-----|-----|-----|-----|----|--------|---------|-----|
| Baseline | 0.519 | 0.316 | 0.216 | 0.175 | 0.319 | 0.216 | 0.207 | 1.303 |
| **TopicQG** | **0.551** | **0.343** | **0.236** | **0.191** | **0.330** | **0.233** | **0.230** | **1.323** |

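Compare the `B1c (paper)` and `B4c (paper)` columns of the results table above against this table. The word-level `B1`/`B4` columns use a different tokenization and are not directly comparable to the paper's numbers.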
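
The comparison can also be made explicit in code. The sketch below copies the character-level BLEU values from the table above and reports how far a set of measured metrics falls below each paper model. `PAPER` and `gap_to_paper` are hypothetical names, and the `example` values are made-up placeholders rather than results from this notebook.

```python
# Character-level BLEU values copied from the paper-baseline table above.
PAPER = {
    "Baseline": {"bleu1_char": 0.519, "bleu4_char": 0.175},
    "TopicQG": {"bleu1_char": 0.551, "bleu4_char": 0.191},
}


def gap_to_paper(metrics: dict) -> dict:
    """Return (paper score - measured score) per paper model and metric."""
    gaps = {}
    for model, scores in PAPER.items():
        gaps[model] = {
            name: round(value - metrics[name], 3)
            for name, value in scores.items()
            if name in metrics  # skip metrics the evaluation did not report
        }
    return gaps


# Placeholder numbers for illustration only, NOT measured results.
example = {"bleu1_char": 0.300, "bleu4_char": 0.050}
gaps = gap_to_paper(example)
print(gaps)
```

In the notebook, the same function could be called on the metrics dictionary of the evaluated model, for example `gap_to_paper(results[key])`.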

In [None]:
# Show a few sample predictions from the first evaluated model
key = next(iter(results))
m = results[key]
preds = m.get('predictions', [])
refs  = m.get('references',  [])

print(f'Sample predictions from {key}:\n')
for i, (p, r) in enumerate(zip(preds[:10], refs[:10]), start=1):
    print(f'[{i}] Ref : {r}')
    print(f'     Pred: {p}')
    print()