# Notebook 04: Evaluation & Gradio Demo

## Goal

Build a lightweight evaluation set, test answer composition with verbatim quotes, and wire up a Gradio demo.


## Building a Tiny Gold QA Set

Manually create 10–20 entries, each with three fields:

- **Questions**: Focused, answerable from the text (e.g., "What does Lord Henry say about influence?")
- **Acceptable answer keywords**: Terms that should appear in retrieved chunks (e.g., "influence", "young", "soul")
- **Notes**: Optional context about expected answer structure

Save this as `data/interim/qa_dev.csv` with columns: `question`, `acceptable_answer_keywords`, `notes`
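
One way to write and reload this file is sketched below. It uses only the standard library `csv` module; the resulting file can also be read with pandas. The helper names (`write_qa_csv`, `read_qa_csv`), the example row, and the choice of `;` to separate keywords inside one cell are assumptions made for illustration, not part of the project code.

```python
import csv
from pathlib import Path

COLUMNS = ["question", "acceptable_answer_keywords", "notes"]

# Illustrative row only; write your own 10-20 entries.
QA_ROWS = [
    {
        "question": "What does Lord Henry say about influence?",
        # Assumption: all keywords share one cell, separated by ";".
        "acceptable_answer_keywords": "influence;young;soul",
        "notes": "",
    },
]


def write_qa_csv(rows, path):
    """Write QA rows to `path`, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def read_qa_csv(path):
    """Read the CSV back, splitting the keyword cell into a lowercase list."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        cell = row["acceptable_answer_keywords"]
        row["keywords"] = [k.strip().lower() for k in cell.split(";") if k.strip()]
    return rows
```

Calling `write_qa_csv(QA_ROWS, "data/interim/qa_dev.csv")` produces the file described above; `read_qa_csv` returns each row with an extra `keywords` list that is convenient for the retrieval check.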


## Metrics: Recall@k & Groundedness

### Recall@k (Retrieval)

For each question, check whether at least one of the top-k retrieved chunks contains any of the acceptable answer keywords.

- **Recall@5**: Proportion of questions for which at least one top-5 chunk contains an acceptable keyword
- **Target**: ≥ 0.8 (relevant chunks are retrieved for at least 80% of questions)
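
The keyword-based hit test can be sketched as follows. The functions take plain lists of chunk texts rather than calling `retrieve`, because its signature is not shown here; `keyword_hit` and `recall_at_k` are hypothetical names. Matching is a case-insensitive substring test, which is an assumption: it means "soul" also matches "souls".

```python
def keyword_hit(chunks, keywords):
    """Return True if any chunk contains any keyword (case-insensitive substring)."""
    lowered = [chunk.lower() for chunk in chunks]
    return any(kw.lower() in text for kw in keywords for text in lowered)


def recall_at_k(retrieved_per_question, keywords_per_question, k=5):
    """Fraction of questions with at least one keyword hit in the top-k chunks.

    retrieved_per_question: list of ranked chunk-text lists, one per question.
    keywords_per_question:  list of keyword lists, aligned with the above.
    """
    if not retrieved_per_question:
        return 0.0
    hits = sum(
        keyword_hit(chunks[:k], keywords)
        for chunks, keywords in zip(retrieved_per_question, keywords_per_question)
    )
    return hits / len(retrieved_per_question)
```

Because only the first `k` chunks are inspected, a relevant chunk ranked below position `k` does not count as a hit.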

### Groundedness (Composition)

Measure how well answers are grounded in quotes:

- **Quote presence**: % of answers with ≥1 quote
- **Attribution score**: Mean fraction of answer sentences that share ≥2 content words with some quote
- **Target**: Quote presence (groundedness) ≥ 0.95, Attribution ≥ 0.7
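
A minimal sketch of both scores is shown below. The stopword list, the regex-based sentence splitter, and the function names are assumptions chosen for brevity; a real tokenizer or stopword list could be swapped in. Results are assumed to be dicts with a `quotes` key holding a list of strings.

```python
import re

# Small illustrative stopword list (assumption); extend as needed.
STOPWORDS = frozenset({
    "the", "and", "that", "this", "all", "was", "were", "are", "for",
    "from", "with", "his", "her", "has", "have", "had", "but", "not",
})


def content_words(text):
    """Lowercase word set, dropping stopwords and words shorter than 3 letters."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def split_sentences(text):
    """Naive split on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def attribution_score(answer, quotes, min_shared=2):
    """Fraction of answer sentences sharing >= min_shared content words with a quote."""
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    quote_words = [content_words(q) for q in quotes]
    supported = sum(
        1
        for s in sentences
        if any(len(content_words(s) & qw) >= min_shared for qw in quote_words)
    )
    return supported / len(sentences)


def quote_presence(results):
    """Fraction of results whose 'quotes' list is non-empty."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("quotes")) / len(results)
```

The attribution score reported for the whole QA set is then the mean of `attribution_score` over all answers.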

### Latency (UX)

Mean wall-clock time for retrieval plus composition on CPU, per query. Target: < 2 seconds.
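
Latency can be measured with `time.perf_counter`, as in the sketch below. Here `pipeline` stands for any callable that takes a question and runs retrieval plus composition; that wrapper, the `warmup` parameter, and the function name are assumptions.

```python
import time
from statistics import mean


def mean_latency(pipeline, queries, warmup=1):
    """Mean wall-clock seconds per query for `pipeline(query)`.

    The first `warmup` queries are run once untimed so that one-off costs
    (model loading, cache fills) do not skew the average.
    """
    if not queries:
        return 0.0
    for query in queries[:warmup]:
        pipeline(query)
    timings = []
    for query in queries:
        start = time.perf_counter()
        pipeline(query)
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

Comparing the returned value against the 2-second target gives a quick pass/fail check for the local CPU setup.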


## Answer Style Guide

When composing answers:

- **Length**: 2–4 sentences, ~100–140 words
- **Tone**: Assertive but qualified ("the text suggests…", "the narrator frames…")
- **No inventions**: Every factual clause must be traceable to a quote
- **References**: Cite quotes as [1], [2], [3] in the answer; each number must match an entry in the citations list
- **Quote selection**: Prefer one quote that defines, one that illustrates, and one that contrasts (when available)
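
The mechanical parts of this guide (length and reference numbering) can be checked automatically. The sketch below is one possible checker; the function name, the returned keys, and the naive sentence splitter are assumptions, and tone or quote selection still need a human read.

```python
import re


def check_answer_style(answer, num_citations, min_words=100, max_words=140):
    """Check sentence count, word count, and [n] markers against the style guide."""
    markers = [int(m) for m in re.findall(r"\[(\d+)\]", answer)]
    # Strip citation markers so they are not counted as words.
    plain = re.sub(r"\s*\[\d+\]", "", answer).strip()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", plain) if s]
    n_words = len(plain.split())
    return {
        "sentences": len(sentences),
        "words": n_words,
        "length_ok": 2 <= len(sentences) <= 4 and min_words <= n_words <= max_words,
        # Markers that point outside the citations list [1..num_citations].
        "dangling_refs": sorted({m for m in markers if not 1 <= m <= num_citations}),
    }
```

An answer passes the mechanical checks when `length_ok` is true and `dangling_refs` is empty.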


## Step 1: Create QA Development Set

Manually create a small QA dataset.


In [None]:
# === TODO (you code this) ===
# Create a small QA dataframe: {question, acceptable_answer_keywords, notes}.
# Acceptance: CSV written to data/interim/qa_dev.csv

import pandas as pd

# Create 10-20 QA pairs manually
# Save to data/interim/qa_dev.csv


## Step 2: Evaluate Retrieval (Recall@k)

For each question, retrieve top-k chunks and check if any contain the acceptable keywords.


In [None]:
# === TODO (you code this) ===
# Evaluate retrieval: for each Q, if any retrieved chunk contains a keyword → hit.
# Acceptance: print Recall@k summary.

from src.retrieve import retrieve

# Load QA set
# For each question:
#   - Retrieve top-k chunks
#   - Check if any chunk contains acceptable keywords
# Compute Recall@k


## Step 3: Compose Answers with Quotes

Test the answer composition pipeline with retrieved chunks.


In [None]:
# === TODO (you code this) ===
# Compose an answer from retrieved chunks (no LLM), with quotes and citations.
# Acceptance: dict with 'answer', 'quotes', 'used_chunks'.

from src.compose import compose_answer

# Test on a few example queries
# Verify that answers include quotes and citations


## Step 4: Evaluate Groundedness

Measure how well answers are grounded in quotes.


In [None]:
# === TODO (you code this) ===
# Evaluate "groundedness": % of answers containing ≥1 quote, plus an attribution score: the fraction of answer sentences that overlap tokens with at least one quote.
# Hints:
# 1) For a small QA set, run retrieve -> compose; check non-empty quotes.
# 2) Token-overlap heuristic: for each answer sentence, require >= t shared content words with some quote (t ~ 2–3).
# Acceptance:
# - Print groundedness rate and sample diagnostics for 3 questions.

# For each question:
#   - Retrieve and compose answer
#   - Check quote presence
#   - Compute token overlap between answer sentences and quotes
# Report groundedness and attribution scores


## Step 5: Wire Gradio Demo

Create a simple Gradio interface for interactive Q&A.
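
For orientation, here is one possible shape for such an interface, independent of the project's own `launch_app()`. It assumes a `pipeline` callable that takes a question and returns a dict with `answer` and `quotes` (a list of strings); `format_response` and `build_demo` are hypothetical names. Gradio is imported inside `build_demo` so the formatting helper can be used and tested without it.

```python
def format_response(result):
    """Render an answer dict as Markdown: answer text, then numbered quotes."""
    quotes = result.get("quotes") or []
    if not quotes:
        return result["answer"] + "\n\n_No supporting quotes found._"
    parts = [result["answer"], "**Quotes**"]
    parts += [f'[{i}] "{quote}"' for i, quote in enumerate(quotes, start=1)]
    return "\n\n".join(parts)


def build_demo(pipeline):
    """Build a Gradio Interface around `pipeline` (question -> answer dict)."""
    import gradio as gr  # lazy import: only needed when the demo is built

    def respond(question):
        return format_response(pipeline(question))

    return gr.Interface(
        fn=respond,
        inputs=gr.Textbox(label="Question"),
        outputs=gr.Markdown(),
        title="Quote-grounded Q&A",
    )
```

With Gradio installed, `build_demo(my_pipeline).launch()` starts a local server, where `my_pipeline` is whatever function chains retrieval and composition in your project.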


In [None]:
# === TODO (you code this) ===
# Wire a simple Gradio demo using launch_app() from src/app.py.

from src.app import launch_app

# Launch the demo
# Test with a few questions interactively


## Step 6: Prepare for Hugging Face Spaces Deployment

Optional: Prepare the app for deployment to Hugging Face Spaces.


In [None]:
# === TODO (you code this) ===
# Prepare for Hugging Face Spaces deployment.
# Hints:
# 1) Ensure app entrypoint is src/app.py with a function `launch_app()` or `demo = gr.Interface(...)`.
# 2) Create a `README.md` for the Space (use SPACE_CARD.md text).
# 3) Runtime: set "Hardware: CPU basic", "SDK: Gradio", "Space Timeout: 120s".
# Acceptance:
# - Space builds successfully; responds within ~2–5 seconds per query on CPU.

# See SPACE_CARD.md for the README content to use in the Space.


## Summary

At this point, you should have:

- ✅ QA development set created (`data/interim/qa_dev.csv`)
- ✅ Recall@k evaluated (target: ≥ 0.8)
- ✅ Groundedness evaluated (target: ≥ 0.95)
- ✅ Answer composition tested with quotes and citations
- ✅ Gradio demo working locally
- ✅ (Optional) Space deployment ready

## Next Steps

- Optional LLM rewrite step (keeps quotes, improves fluency)
- Named-entity & character graph for richer answers
- Multi-book corpus with per-source filtering
