# End-to-End Textbook RAG Demo

This notebook demonstrates the complete RAG pipeline for a textbook, including:
1. **Ingestion**: Parsing a textbook (PDF/JSON) into structure nodes and content atoms.
2. **Ground Truth Extraction**: Extracting vocabulary using the specialized extractor for verification.
3. **RAG Query**: Generating a quiz for a specific unit using the RAG engine with safeguards.
4. **Verification**: Checking that the generated quiz respects the vocabulary constraints of the target unit.

**Prerequisites:**
- Postgres database running (with `pgvector` extension).
- `OPENAI_API_KEY` environment variable set.
- Dependencies installed (`pip install -r requirements.txt`).
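
A quick preflight check can catch a missing prerequisite before any cell runs. The sketch below is illustrative only: `missing_env_vars` is a hypothetical helper, and it checks just the two variable names that appear in this notebook (`OPENAI_API_KEY` and `POSTGRES_HOST`). Which other Postgres variables the project reads is not stated here, so extend the tuple to match your setup.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Assumption: only these two names are known from the notebook itself.
REQUIRED_VARS = ("OPENAI_API_KEY", "POSTGRES_HOST")

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Passing an explicit mapping as `env` makes the helper easy to try out without touching the real environment.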

In [None]:
import os
import re
import sys
import uuid
from pathlib import Path

# Ensure project root is in path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../..')))

from ingest.service import IngestionService
from ingest.infra.postgres import PostgresStructureNodeRepository
from ingest.hybrid_ingestor import HybridIngestor
from ingest.docling_parser import load_docling_blocks
from ingest.segmentation import SegmentationRules, segment_lessons
from ingest.vocab_extractor import extract_vocab_entries, link_vocab_to_lessons
from app.rag_engine import retrieve_and_generate
from app.schemas import GenerateItemsRequest, ConceptPack

## Configuration

In [None]:
# Constants
FILE_PATH = "../../data/toy_green_line_1_docling.json"  # Relative path from docs/notebooks/
BOOK_ID = uuid.uuid4()
UNIT_TO_TEST = 1
TOPIC = f"Create a quiz for Unit {UNIT_TO_TEST}"  # Keep the prompt in sync with the unit under test
CATEGORY = "language"

# Check for API Key
if not os.getenv("OPENAI_API_KEY"):
    print("WARNING: OPENAI_API_KEY not set. Please set it to use real embeddings/LLM.")
    should_mock = True
else:
    should_mock = False
    print("OPENAI_API_KEY detected.")

## Step 1: Ingestion
We use the `IngestionService` to process the book. It parses the content, creates structure nodes in Postgres, and indexes content atoms in the vector store.
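
To make the two kinds of record concrete, here is a minimal sketch of how a book outline might be split into structure nodes and content atoms. Everything in it is a hypothetical stand-in: the `StructureNode` and `ContentAtom` classes, their fields, and `split_book` are assumptions for illustration, not the schema that `IngestionService` actually writes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StructureNode:
    """Hypothetical outline entry (for example, a unit) that atoms attach to."""
    node_id: int
    title: str
    sequence_index: int  # position of the node in reading order


@dataclass(frozen=True)
class ContentAtom:
    """Hypothetical smallest retrievable piece of text, tagged with its node."""
    node_id: int
    sequence_index: int
    text: str


def split_book(units: Dict[str, List[str]]) -> Tuple[List[StructureNode], List[ContentAtom]]:
    """Turn {unit title: [paragraphs]} into one node per unit and one atom per paragraph."""
    nodes: List[StructureNode] = []
    atoms: List[ContentAtom] = []
    for index, (title, paragraphs) in enumerate(units.items(), start=1):
        nodes.append(StructureNode(node_id=index, title=title, sequence_index=index))
        for paragraph in paragraphs:
            atoms.append(ContentAtom(node_id=index, sequence_index=index, text=paragraph))
    return nodes, atoms


# Toy input, not taken from the real textbook file.
toy_book = {
    "Unit 1": ["Hello, my name is Tom.", "This is my school."],
    "Unit 2": ["We went to London last year."],
}
nodes, atoms = split_book(toy_book)
print(f"{len(nodes)} structure nodes, {len(atoms)} content atoms")
```

Tagging each atom with the position of its unit is what allows a later retrieval step to exclude material beyond the target unit.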

In [None]:
print(f"--- Starting Ingestion for Book ID: {BOOK_ID} ---")

# Initialize Components
repo = PostgresStructureNodeRepository()
ingestor = HybridIngestor()
service = IngestionService(
    structure_repo=repo,
    ingestor=ingestor,
    should_mock_embedding=should_mock
)

# Run Ingestion
# Note: Postgres must be running (locally or via Docker) before this cell is executed.
try:
    service.ingest_book(FILE_PATH, book_id=BOOK_ID, category=CATEGORY)
except Exception as e:
    print(f"Ingestion failed: {e}")
    print("If this is a connection error, ensure your Postgres database is running and accessible via env vars (POSTGRES_HOST, etc.).")

## Step 2: Extract Ground Truth Vocabulary
To verify our safeguards, we extract the "official" vocabulary list for Unit 1 with the `vocab_extractor` module. The extraction reads the parsed source file directly, independently of the RAG engine, so it can serve as a reference list of allowed words.
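
The filtering idea is simple enough to show on toy data. The sketch below assumes each linked entry exposes a `term` and a `unit`, as the attribute access in the following cell suggests. `VocabEntry` and `allowed_terms` are hypothetical names, and the optional cumulative mode (also allowing words from earlier units) is an assumption about how the check could be extended, not something the extractor is known to do.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class VocabEntry:
    """Hypothetical stand-in for an extracted vocabulary entry linked to a unit."""
    term: str
    unit: int


def allowed_terms(entries: Iterable[VocabEntry], unit: int, cumulative: bool = False) -> Set[str]:
    """Collect lower-cased terms for `unit`, or for every unit up to it when cumulative."""
    return {
        entry.term.strip().lower()
        for entry in entries
        if (entry.unit <= unit if cumulative else entry.unit == unit)
    }


# Toy entries, not taken from the real textbook file.
entries = [
    VocabEntry("School", 1),
    VocabEntry("friend", 1),
    VocabEntry("holiday", 2),
    VocabEntry("museum", 3),
]
print(sorted(allowed_terms(entries, unit=1)))
print(sorted(allowed_terms(entries, unit=2, cumulative=True)))
```

Lower-casing on the way in keeps the later comparison against quiz text case-insensitive.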

In [None]:
print("Extracting Ground Truth Vocabulary...")
path = Path(FILE_PATH)
blocks = load_docling_blocks(path)
rules = SegmentationRules()

# Segment into lessons/units
lessons = segment_lessons(blocks, rules, textbook_id=str(BOOK_ID))

# Extract vocab
vocab_entries = extract_vocab_entries(blocks, rules, textbook_id=str(BOOK_ID))
linked_vocab = link_vocab_to_lessons(vocab_entries, lessons)

# Filter for Unit 1
unit_1_vocab = {v.term.lower() for v in linked_vocab if v.unit == UNIT_TO_TEST}
print(f"Found {len(unit_1_vocab)} vocabulary words for Unit {UNIT_TO_TEST}.")
if unit_1_vocab:
    # Sort so the sample is stable across runs (set iteration order is arbitrary)
    print(f"Sample: {sorted(unit_1_vocab)[:5]}")

## Step 3: Run RAG Pipeline
We now request a quiz for Unit 1. The RAG engine (`retrieve_and_generate`) searches the vector store for relevant content atoms and generates the quiz items from them.

**Safeguard:** The `SearchService` (called internally) applies a `MetadataFilter` that restricts retrieval to atoms whose `sequence_index` is at or below that of the target unit (here, Unit 1), or that otherwise match the unit context.
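
The effect of such a filter can be illustrated without a vector store. The following sketch is an in-memory approximation under stated assumptions: `Atom`, `filter_by_unit`, and the toy relevance scores are hypothetical, and the real `SearchService` and `MetadataFilter` may expose a different interface and apply the restriction inside the database query.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Atom:
    """Hypothetical content atom carrying the metadata the filter inspects."""
    text: str
    sequence_index: int  # position of the atom's unit in the book
    score: float         # pretend similarity to the query, higher is better


def filter_by_unit(atoms: Iterable[Atom], max_sequence_index: int, top_k: int = 3) -> List[Atom]:
    """Drop atoms beyond the target unit first, then rank what remains by score."""
    eligible = [atom for atom in atoms if atom.sequence_index <= max_sequence_index]
    return sorted(eligible, key=lambda atom: atom.score, reverse=True)[:top_k]


# Toy candidates, not real retrieval results.
candidates = [
    Atom("Past tense of irregular verbs", sequence_index=3, score=0.95),
    Atom("Greetings and introductions", sequence_index=1, score=0.80),
    Atom("Classroom objects", sequence_index=1, score=0.60),
    Atom("Talking about holidays", sequence_index=2, score=0.90),
]

for atom in filter_by_unit(candidates, max_sequence_index=1):
    print(atom.sequence_index, atom.text)
```

Filtering before ranking matters: the best-scoring candidate here belongs to a later unit and is excluded regardless of its score.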

In [None]:
if should_mock:
    print("Skipping LLM generation due to missing API key.")
else:
    print("Generating Quiz...")
    try:
        response = retrieve_and_generate(
            book_id=str(BOOK_ID),
            unit=UNIT_TO_TEST,
            topic=TOPIC,
            category=CATEGORY
        )
        
        print("\n--- Generated Quiz ---")
        quiz_text = ""
        for item in response.items:
            q_str = f"Q: {item.stem}"
            if item.options:
                q_str += f" Options: {item.options}"
            print(f"{q_str} (Ans: {item.answer})")
            quiz_text += f"{item.stem} {item.answer} "
            
    except Exception as e:
        print(f"RAG Execution Failed: {e}")

## Step 4: Verify Safeguards
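We analyze the generated quiz text to see whether it uses the vocabulary from Unit 1. The LLM will also use common English words (stop words), so we do not require every word to come from the list. Instead, we look for the presence of Unit 1 vocabulary terms as evidence that unit-specific content was retrieved and used. This is a presence check only: it does not prove that vocabulary from later units is absent.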
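
A stricter variant of this check can be sketched on toy data. It differs from the notebook's cell in two ways, both of which are assumptions rather than part of the pipeline: it matches whole phrases instead of intersecting single-word tokens, so multi-word terms can be found, and it also reports terms that belong to later units. `find_terms` and `check_quiz` are hypothetical helpers, and the vocabulary mapping is invented for the example.

```python
import re
from typing import Dict, Set, Tuple


def find_terms(text: str, terms: Set[str]) -> Set[str]:
    """Return the terms that occur in `text` as whole words or whole phrases."""
    lowered = text.lower()
    found = set()
    for term in terms:
        # Lookarounds require a non-word character (or the text boundary) on both
        # sides, so "art" does not match inside "party"; re.escape guards punctuation.
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(term)
    return found


def check_quiz(
    quiz_text: str,
    vocab_by_unit: Dict[int, Set[str]],
    target_unit: int,
) -> Tuple[Set[str], Set[str]]:
    """Return (target-unit terms used, later-unit terms that leaked into the quiz)."""
    used = find_terms(quiz_text, vocab_by_unit.get(target_unit, set()))
    leaked: Set[str] = set()
    for unit, terms in vocab_by_unit.items():
        if unit > target_unit:
            leaked |= find_terms(quiz_text, terms)
    return used, leaked


# Toy data, not taken from the real textbook or a real quiz.
toy_vocab = {1: {"school", "ice cream"}, 2: {"holiday"}}
toy_quiz = "Where do you eat ice cream? At school. When is your holiday?"

used, leaked = check_quiz(toy_quiz, toy_vocab, target_unit=1)
print("Used from target unit:", sorted(used))
print("Leaked from later units:", sorted(leaked))
```

An empty `leaked` set is the stronger signal here: it indicates that no known later-unit term appears in the quiz, which the presence check alone cannot show.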

In [None]:
if not should_mock and 'quiz_text' in locals():
    print("\nVerifying Vocabulary Usage...")
    # Tokenize into single lower-cased words; multi-word vocabulary terms will not match.
    words_in_quiz = set(re.findall(r'\b\w+\b', quiz_text.lower()))

    # Vocabulary terms from the target unit that appear in the quiz
    used_target_vocab = words_in_quiz.intersection(unit_1_vocab)

    print(f"Total unique words in quiz: {len(words_in_quiz)}")
    print(f"Unit {UNIT_TO_TEST} vocab words used: {len(used_target_vocab)}")

    if used_target_vocab:
        # Sort so the examples are stable across runs
        print(f"Examples used: {sorted(used_target_vocab)[:10]}")
        print(f"SUCCESS: The generated quiz incorporates specific vocabulary from Unit {UNIT_TO_TEST}.")
    elif unit_1_vocab:
        print(f"WARNING: No specific Unit {UNIT_TO_TEST} vocabulary detected. The quiz might be too generic.")
    else:
        print(f"Note: No Unit {UNIT_TO_TEST} vocabulary was extracted to check against.")