# 04 — Concept Extraction & Targeted Question Generation

This notebook uses the pipeline's Wikifier/WAT integration to extract Wikipedia concepts from any educational text, then generates one targeted question per concept with the trained T5 model.

**Workflow:**
1. Wikify a passage — extract ranked Wikipedia entities
2. Select top concepts by relevance score
3. Generate a question per concept using `pipe.generate()`
4. Export question bank as JSON / CSV

This is a lightweight alternative to a full knowledge graph — it produces a `(concept, context, question)` triple set directly usable for assessment or adaptive learning.

## Setup

In [None]:
import sys
from pathlib import Path

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    REPO_URL = "https://github.com/YOUR_ORG/YOUR_REPO.git"  # TODO: set your URL
    !git clone {REPO_URL} /content/ai4ed-qg -q
    %cd /content/ai4ed-qg
    !pip install -q torch transformers sentence-transformers sentencepiece \
                    requests pyyaml tqdm pandas python-dotenv

    from google.colab import drive
    drive.mount('/content/drive')
    DRIVE_DIR = Path('/content/drive/MyDrive/ai4ed_qg')

    # Restore trained model from Drive
    import shutil
    src = DRIVE_DIR / 'models'
    if src.exists():
        shutil.copytree(src, Path('/content/ai4ed-qg/models'), dirs_exist_ok=True)
        print("Restored models/ from Drive")
else:
    DRIVE_DIR = None

import os
project_root = Path('/content/ai4ed-qg') if IN_COLAB else Path.cwd()
if project_root.name == 'notebooks':
    project_root = project_root.parent
os.chdir(project_root)
sys.path.insert(0, str(project_root))
print(f"Working dir: {os.getcwd()}")

In [None]:
# Wikifier API key — https://wikifier.org/register.html
if IN_COLAB:
    from google.colab import userdata
    try:
        os.environ['WIKIFIER_API_KEY'] = userdata.get('WIKIFIER_API_KEY')
        print("Wikifier key loaded from Colab Secrets")
    except Exception:
        os.environ['WIKIFIER_API_KEY'] = ""  # ← paste your key here
else:
    try:
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass

key = os.environ.get('WIKIFIER_API_KEY', '')
print(f"WIKIFIER_API_KEY: {'set' if key else 'NOT SET — wikification will fail'}")

## Initialise Pipeline

In [None]:
from src.pipeline import Pipeline

pipe = Pipeline('config/pipeline.yaml')
pipe.status()

## Define Educational Text

Paste any educational passage below. The Wikifier will annotate it with Wikipedia concepts.

In [None]:
# ── Edit this block with your own educational content ─────────────────────────

passages = [
    {
        "id": "chem-01",
        "text": (
            "Electronegativity is a measure of the tendency of an atom to attract "
            "a bonding pair of electrons. The Pauling scale is the most commonly used. "
            "Fluorine is the most electronegative element (4.0 on Pauling scale). "
            "Electronegativity increases across a period and decreases down a group "
            "in the periodic table. The difference in electronegativity between atoms "
            "determines the polarity of a chemical bond."
        ),
    },
    {
        "id": "bio-01",
        "text": (
            "Photosynthesis is a process by which plants, algae, and some bacteria "
            "convert light energy into chemical energy stored as glucose. "
            "It occurs in the chloroplasts, specifically using chlorophyll to absorb "
            "sunlight. The overall reaction combines carbon dioxide and water to produce "
            "glucose and oxygen. The light-dependent reactions occur in the thylakoid "
            "membranes, while the Calvin cycle takes place in the stroma."
        ),
    },
]

print(f"Loaded {len(passages)} passage(s)")
for p in passages:
    print(f"  [{p['id']}] {p['text'][:80]}...")

## Wikify Passages

Annotate each passage with Wikipedia entities using the Wikifier API.

In [None]:
from src.wikification import get_wikifier

# Construct the wikifier directly (no need to run the full pipeline)
wikifier = get_wikifier('wikifier', pipe.config)

annotated = []
for passage in passages:
    annotations = wikifier.annotate(passage['text'])
    annotated.append({
        **passage,
        'entities': annotations,  # list of {title, wiki_id, score, spot}
    })
    print(f"[{passage['id']}] Found {len(annotations)} entities")
    for e in annotations[:5]:  # show top 5
        print(f"    {e['score']:6.4f}  {e['title']}")
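
Re-running the cell above sends every passage to the API again. If that becomes a concern, the call can be memoised on the passage text. The sketch below is only an illustration: `make_cached_annotator` is a hypothetical helper, not part of the pipeline, and `fake_annotate` stands in for `wikifier.annotate` so the example runs without an API key.

```python
import hashlib
from typing import Callable, Dict, List

Annotate = Callable[[str], List[Dict]]


def make_cached_annotator(annotate: Annotate) -> Annotate:
    """Wrap an annotate(text) callable so identical texts are only sent once."""
    cache: Dict[str, List[Dict]] = {}

    def cached(text: str) -> List[Dict]:
        # Hash the text so long passages do not become dictionary keys themselves
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = annotate(text)
        return cache[key]

    return cached


# Stand-in for wikifier.annotate: records each call and returns an invented entity.
calls = []


def fake_annotate(text: str) -> List[Dict]:
    calls.append(text)
    return [{"title": "Photosynthesis", "score": 0.5}]


annotate_once = make_cached_annotator(fake_annotate)
annotate_once("Photosynthesis converts light energy into chemical energy.")
annotate_once("Photosynthesis converts light energy into chemical energy.")
print(f"API calls made: {len(calls)}")
```

In this notebook the wrapper would be built as `make_cached_annotator(wikifier.annotate)`. The cache lives in memory only, so it is lost when the kernel restarts.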

## Select Top Concepts

Filter entities to the top-N by relevance score (pageRank for Wikifier, rho for WAT).

In [None]:
TOP_N = 5  # concepts per passage

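concept_list = []  # one dict per (passage, concept) pair: passage_id, text, topic, score

for item in annotated:
    # Sort explicitly rather than relying on the order the wikifier returns
    ranked = sorted(item['entities'], key=lambda e: e['score'], reverse=True)
    top_entities = ranked[:TOP_N]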
    for e in top_entities:
        concept_list.append({
            'passage_id': item['id'],
            'text':       item['text'],
            'topic':      e['title'],
            'score':      e['score'],
        })

print(f"Total concept-passage pairs: {len(concept_list)}")
for c in concept_list:
    print(f"  [{c['passage_id']}]  {c['score']:.4f}  {c['topic']}")
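
The selection above keeps `TOP_N` entities per passage however low their scores are, and a title may appear more than once if the wikifier reports it for several mentions. Below is a minimal sketch of a stricter selection step, assuming each entity is a dict with `title` and `score` keys as in the cell above. `select_concepts` and `min_score` are hypothetical names, and the sample scores are invented.

```python
from typing import Dict, List


def select_concepts(entities: List[Dict], top_n: int = 5, min_score: float = 0.0) -> List[Dict]:
    """Return up to top_n entities with score >= min_score, best first, one per title."""
    ranked = sorted(entities, key=lambda e: e["score"], reverse=True)
    seen = set()
    selected: List[Dict] = []
    for entity in ranked:
        if len(selected) >= top_n:
            break
        if entity["score"] < min_score:
            break  # ranked is descending, so every later entity is below the cutoff too
        key = entity["title"].casefold()
        if key in seen:
            continue  # keep only the highest-scoring entry for each title
        seen.add(key)
        selected.append(entity)
    return selected


# Made-up scores for illustration only; real values come from the wikifier.
sample = [
    {"title": "Fluorine", "score": 0.02},
    {"title": "Electronegativity", "score": 0.31},
    {"title": "Periodic table", "score": 0.12},
    {"title": "Electronegativity", "score": 0.05},
]
picked = select_concepts(sample, top_n=3, min_score=0.05)
print([e["title"] for e in picked])
```

A sensible `min_score` depends on the annotator: pageRank values from Wikifier and rho values from WAT are on different scales, so a cutoff tuned for one should not be reused for the other.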

## Generate Questions

For each concept, generate a question using the trained T5 model.

> Requires `models/topic/best_model/` from `02_distillation_training.ipynb`. If not available, set `USE_GEMINI = True` to use Gemini instead.

In [None]:
from tqdm.notebook import tqdm

USE_GEMINI = False   # Set True to use Gemini instead of T5
GEMINI_MODEL = 'gemini-2.5-flash'

if USE_GEMINI:
    from src.evaluation.models import GeminiBaseline
    generator = GeminiBaseline(GEMINI_MODEL)
    gen_fn = lambda topic, text: generator.generate_question(topic, text)
else:
    gen_fn = lambda topic, text: pipe.generate(
        topic=topic, context=text, mode='topic'
    )

question_bank = []

for concept in tqdm(concept_list, desc="Generating"):
    try:
        question = gen_fn(concept['topic'], concept['text'])
    except Exception as exc:  # includes FileNotFoundError when the model is missing
        question = f"[ERROR: {exc}]"

    entry = {
        **concept,
        'question': question,
    }
    question_bank.append(entry)
    print(f"  [{concept['passage_id']}]  {concept['topic']:30s}  →  {question}")

print(f"\nGenerated {len(question_bank)} questions")
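
Because the loop stores failures as `[ERROR: ...]` strings in the `question` field, they end up in the question bank alongside real questions. Below is a small sketch of separating the two before review or export. `split_errors` is a hypothetical helper and the sample entries are invented for illustration.

```python
from typing import Dict, List, Tuple

ERROR_PREFIX = "[ERROR:"  # matches the f-string used in the generation loop


def split_errors(bank: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split a question bank into (usable entries, failed entries)."""
    ok: List[Dict] = []
    failed: List[Dict] = []
    for entry in bank:
        question = str(entry.get("question", "")).strip()
        # Treat empty output as a failure too, not just caught exceptions
        if not question or question.startswith(ERROR_PREFIX):
            failed.append(entry)
        else:
            ok.append(entry)
    return ok, failed


sample_bank = [
    {"passage_id": "chem-01", "topic": "Fluorine",
     "question": "Which element is the most electronegative?"},
    {"passage_id": "chem-01", "topic": "Pauling scale",
     "question": "[ERROR: model not found]"},
    {"passage_id": "bio-01", "topic": "Chloroplast", "question": ""},
]
usable, failed = split_errors(sample_bank)
print(f"{len(usable)} usable, {len(failed)} failed")
```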

## Review Question Bank

In [None]:
import pandas as pd

df = pd.DataFrame(question_bank)[['passage_id', 'topic', 'score', 'question']]
df['score'] = df['score'].round(4)
pd.set_option('display.max_colwidth', 80)
df

## Export Question Bank

In [None]:
import json
from datetime import datetime

out_dir = project_root / 'results' / f'question_bank_{datetime.now().strftime("%Y%m%d_%H%M%S")}'
out_dir.mkdir(parents=True, exist_ok=True)

# CSV
csv_path = out_dir / 'question_bank.csv'
df.to_csv(csv_path, index=False)
print(f"CSV : {csv_path}")

# JSON (full, includes original text)
json_path = out_dir / 'question_bank.json'
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(question_bank, f, indent=2, ensure_ascii=False)
print(f"JSON: {json_path}")
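
Downstream tools can read the JSON export back without pandas. The sketch below assumes the file is a list of records with `passage_id`, `topic`, and `question` keys, as written by the cell above. `load_by_passage` is a hypothetical helper, and the demo writes a throwaway file rather than touching `results/`.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def load_by_passage(json_path: Path) -> Dict[str, List[Dict]]:
    """Load an exported question bank and group its records by passage_id."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    grouped = defaultdict(list)
    for record in records:
        grouped[record["passage_id"]].append(record)
    return dict(grouped)


# Demo with invented records written to a temporary directory.
demo_records = [
    {"passage_id": "chem-01", "topic": "Fluorine", "question": "Q1"},
    {"passage_id": "bio-01", "topic": "Chloroplast", "question": "Q2"},
    {"passage_id": "chem-01", "topic": "Pauling scale", "question": "Q3"},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "question_bank.json"
    demo_path.write_text(json.dumps(demo_records, ensure_ascii=False), encoding="utf-8")
    by_passage = load_by_passage(demo_path)

for pid, items in by_passage.items():
    print(pid, [item["topic"] for item in items])
```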

In [None]:
# Colab: download the CSV
if IN_COLAB:
    from google.colab import files
    files.download(str(csv_path))

## Next Steps

- **Scale up**: Pass a larger document set in the `passages` list, or load from a file
- **Use WAT**: Change `get_wikifier('wikifier', ...)` to `get_wikifier('wat', ...)` for higher-quality entity linking (requires WAT token)
- **Zero-shot comparison**: Set `USE_GEMINI = True` to compare T5 vs Gemini question quality
- **Filter by score**: Add a minimum-score cutoff on top of `TOP_N` to keep only high-confidence concepts
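
For the scale-up option, one possibility is to keep passages in a JSON Lines file with one `{"id": ..., "text": ...}` object per line. Both this file layout and the `load_passages` helper are assumptions for illustration; nothing in this notebook prescribes an input format.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_passages(path: Path) -> List[Dict[str, str]]:
    """Read passages from a JSON Lines file, skipping blank lines."""
    passages: List[Dict[str, str]] = []
    seen_ids = set()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            pid, text = record["id"], record["text"].strip()
            # Fail early: duplicate ids would make exported rows ambiguous
            if pid in seen_ids:
                raise ValueError(f"line {line_no}: duplicate passage id {pid!r}")
            if not text:
                raise ValueError(f"line {line_no}: passage {pid!r} has empty text")
            seen_ids.add(pid)
            passages.append({"id": pid, "text": text})
    return passages


# Demo: write two passages (and a blank line) to a temporary file, then load them.
sample_lines = [
    json.dumps({"id": "chem-01", "text": "Electronegativity increases across a period."}),
    "",
    json.dumps({"id": "bio-01", "text": "The Calvin cycle takes place in the stroma."}),
]
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "passages.jsonl"
    demo_file.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")
    loaded = load_passages(demo_file)

print(f"Loaded {len(loaded)} passage(s)")
```

The returned list has the same shape as the `passages` list defined earlier, so it can be passed to the wikification loop unchanged.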