# Notebook 04: Evaluation & Gradio Demo

## Goal

Build a lightweight evaluation set, test answer composition with verbatim quotes, and wire up a Gradio demo.


## Building a Tiny Gold QA Set

Create 10–20 question-answer pairs manually:

- **Questions**: Focused, answerable from the text (e.g., "What does Lord Henry say about influence?")
- **Acceptable answer keywords**: Terms that should appear in retrieved chunks (e.g., "influence", "young", "soul")
- **Notes**: Optional context about expected answer structure

Save this as `data/interim/qa_dev.csv` with the columns `question`, `acceptable_answer_keywords`, and `notes`.


## Metrics: Recall@k & Groundedness

### Recall@k (Retrieval)

For each question, check if at least one retrieved chunk (top-k) contains any of the acceptable answer keywords.

- **Recall@5**: Proportion of questions where a gold-supporting chunk appears in top-5
- **Target**: ≥ 0.8 (80% of questions have relevant chunks retrieved)
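
The definition above can also be written as a formula. The symbols are introduced here for illustration and do not come from the project code: $Q$ is the question set, $K_q$ the acceptable keywords for question $q$, and $R_k(q)$ the top-$k$ retrieved chunks.

$$
\text{Recall@}k \;=\; \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\!\left[\,\exists\, c \in R_k(q),\ \exists\, w \in K_q \;:\; w \text{ occurs in } c\,\right]
$$

For example, if 8 of 10 questions have at least one keyword-bearing chunk in their top 5, then Recall@5 = 8/10 = 0.8, which just meets the target.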

### Groundedness (Composition)

Measure how well answers are grounded in quotes:

- **Quote presence**: % of answers with ≥1 quote
- **Attribution score**: Mean fraction of answer sentences that share ≥2 content words with some quote
- **Target**: Groundedness ≥ 0.95, Attribution ≥ 0.7
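
A minimal sketch of the two groundedness measures above. It assumes each result is a dict with `answer` and `quotes` keys; the stopword list, the naive sentence splitter, and the function names are illustrative choices, not part of the project code.

```python
import re

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it",
             "as", "his", "he", "by", "for", "with", "on", "was"}


def content_words(text: str) -> set:
    """Lowercase word tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def split_sentences(text: str) -> list:
    """Naive split on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def attribution_score(answer: str, quotes: list, min_shared: int = 2) -> float:
    """Fraction of answer sentences sharing >= min_shared content words with some quote."""
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    quote_words = [content_words(q) for q in quotes]
    supported = sum(
        any(len(content_words(s) & qw) >= min_shared for qw in quote_words)
        for s in sentences
    )
    return supported / len(sentences)


def groundedness_report(results: list) -> dict:
    """Quote presence and mean attribution over a list of result dicts."""
    if not results:
        return {"quote_presence": 0.0, "attribution": 0.0}
    n = len(results)
    return {
        "quote_presence": sum(1 for r in results if r["quotes"]) / n,
        "attribution": sum(attribution_score(r["answer"], r["quotes"]) for r in results) / n,
    }
```

Raising `min_shared` to 3 makes the attribution check stricter, at the cost of penalising short answer sentences.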

### Latency (UX)

Mean retrieval + compose time on CPU for one query. Target: < 2 seconds.
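
One way to measure this is sketched below with `time.perf_counter`. The `fake_pipeline` function is a stand-in for the real retrieve-then-compose call, and `mean_latency` is a hypothetical helper name.

```python
import time
from statistics import mean


def mean_latency(pipeline, queries, repeats: int = 3) -> float:
    """Mean wall-clock seconds per call of pipeline(query)."""
    timings = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            pipeline(query)
            timings.append(time.perf_counter() - start)
    return mean(timings)


def fake_pipeline(query: str) -> str:
    """Stand-in for retrieve + compose; replace with the real calls."""
    time.sleep(0.01)
    return query.upper()


latency = mean_latency(fake_pipeline, ["Who is Basil?", "Describe Dorian Gray."])
print(f"Mean latency: {latency:.3f}s (target: < 2s)")
```

Running one untimed warm-up query first keeps one-off model loading out of the average.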


## Answer Style Guide

When composing answers:

- **Length**: 2–4 sentences, ~100–140 words
- **Tone**: Assertive but qualified ("the text suggests…", "the narrator frames…")
- **No inventions**: Every factual clause must be traceable to a quote
- **References**: Use [1], [2], [3] in the answer; match to citations list
- **Quote selection**: Prefer one quote that defines, one that illustrates, and one that contrasts (when available)
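
The reference rule can be checked mechanically. The sketch below assumes citations are kept in a 1-indexed list parallel to the `[n]` markers; `check_citations` is a hypothetical helper, not part of the project code.

```python
import re


def check_citations(answer: str, citations: list) -> dict:
    """Compare [n] markers in the answer against a 1-indexed citations list."""
    used = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    valid = set(range(1, len(citations) + 1))
    return {
        "dangling": sorted(used - valid),  # markers with no matching citation
        "unused": sorted(valid - used),    # citations never referenced
        "word_count": len(answer.split()),
    }
```

An answer passes when both `dangling` and `unused` are empty and `word_count` falls in the target range.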


## Query Length Flexibility

**Important**: Queries do NOT need to be the same length!

The embedding model (`sentence-transformers/all-MiniLM-L6-v2`) handles queries of **widely varying length**. By default it truncates input longer than 256 word pieces, which is far more than any question used here:
- Short queries (1-6 words): "Who is Basil?"
- Medium queries (7-10 words): "What does Lord Henry say about beauty?"
- Long queries (11+ words): "What does Lord Henry claim about influence on young people and how does he explain his philosophy?"

The model automatically:
1. Tokenizes the query (handles variable-length text)
2. Generates a fixed-size embedding (384 dimensions)
3. Normalizes the embedding for semantic search
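
Step 3 matters because, once vectors are unit length, a plain dot product equals cosine similarity. The sketch below shows this with toy 3-dimensional vectors standing in for the 384-dimensional embeddings; the vectors and helper names are made up for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query_vec = [3.0, 4.0, 0.0]  # toy stand-in for a query embedding
chunk_vec = [6.0, 8.0, 1.0]  # toy stand-in for a chunk embedding

similarity = dot(l2_normalize(query_vec), l2_normalize(chunk_vec))
print(f"dot of normalized vectors: {similarity:.4f}")
print(f"cosine similarity:         {cosine(query_vec, chunk_vec):.4f}")
```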

**Best Practice**: Write queries naturally, at whatever length best expresses your question. Longer queries can be more specific, but short queries often work well too.


## Step 1: Create QA Development Set

Manually create a small QA dataset.


In [10]:
# === TODO (you code this) ===
# Create a small QA dataframe: {question, acceptable_answer_keywords, notes}.
# Acceptance: CSV written to data/interim/qa_dev.csv

import pandas as pd
from pathlib import Path

# Create diverse QA pairs with varying lengths and question types
# Note: Queries can be ANY length - the embedding model handles variable-length text!

# Short queries (1-6 words)
query_1 = "What does the portrait look like?"
query_2 = "Who is Basil Hallward?"
query_3 = "Describe Dorian Gray."

# Medium queries (7-10 words)
query_4 = "What does Lord Henry say about beauty and intellect?"
query_5 = "Why doesn't Basil want to exhibit the portrait?"
query_6 = "How does Basil describe meeting Dorian for the first time?"
query_7 = "What happens to the portrait as the story progresses?"

# Long queries (11+ words)
query_8 = "What does Lord Henry claim about influence on young people and how does he explain his philosophy?"
query_9 = "How does the story describe the relationship between Dorian Gray and his portrait?"
query_10 = "What are Lord Henry's views on art, beauty, and the purpose of life according to the text?"

# Create list of queries (note: they vary significantly in length!)
queries = [
    query_1, query_2, query_3, query_4, query_5,
    query_6, query_7, query_8, query_9, query_10
]

# Define acceptable answer keywords for each query
keywords = [
    "portrait, painting, young man, beauty",  # query_1
    "Basil Hallward, artist, painter",  # query_2
    "Dorian Gray, young, beautiful, handsome",  # query_3
    "beauty, intellect, intellectual expression, harmony",  # query_4
    "exhibit, portrait, too much, himself",  # query_5
    "Basil, meeting, Dorian, first time, Lady Brandon",  # query_6
    "portrait, changes, ages, corruption",  # query_7
    "influence, immoral, soul, self-development, nature",  # query_8
    "portrait, relationship, mirror, reflection, corruption",  # query_9
    "art, beauty, life, purpose, philosophy, Lord Henry"  # query_10
]

# Create notes for context
notes = [
    "Should retrieve description of the portrait from early chapters",
    "Character introduction - should be in first few chapters",
    "Physical description of Dorian - early in book",
    "Lord Henry's philosophy about beauty vs intellect",
    "Basil's reason for not exhibiting - early conversation",
    "Basil's story about meeting Dorian at Lady Brandon's party",
    "Portrait's transformation - later in book",
    "Lord Henry's famous speech about influence - Chapter 2",
    "Central theme - portrait as mirror of soul",
    "Lord Henry's aesthetic philosophy throughout the book"
]

# Create dataframe
qa_dev = pd.DataFrame({
    'question': queries,
    'acceptable_answer_keywords': keywords,
    'notes': notes
})

# Save to CSV
output_dir = Path("../data/interim")
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / "qa_dev.csv"
qa_dev.to_csv(output_path, index=False)

print(f"✅ Created QA development set with {len(queries)} questions")
print(f"   Saved to: {output_path}")
print(f"\n📊 Query length distribution:")
for i, q in enumerate(queries, 1):
    word_count = len(q.split())
    print(f"   Q{i}: {word_count} words - \"{q[:50]}{'...' if len(q) > 50 else ''}\"")
print(f"\n💡 Note: Queries vary from {min(len(q.split()) for q in queries)} to {max(len(q.split()) for q in queries)} words - all work fine!")


✅ Created QA development set with 10 questions
   Saved to: ../data/interim/qa_dev.csv

📊 Query length distribution:
   Q1: 6 words - "What does the portrait look like?"
   Q2: 4 words - "Who is Basil Hallward?"
   Q3: 3 words - "Describe Dorian Gray."
   Q4: 9 words - "What does Lord Henry say about beauty and intellec..."
   Q5: 8 words - "Why doesn't Basil want to exhibit the portrait?"
   Q6: 10 words - "How does Basil describe meeting Dorian for the fir..."
   Q7: 9 words - "What happens to the portrait as the story progress..."
   Q8: 17 words - "What does Lord Henry claim about influence on youn..."
   Q9: 13 words - "How does the story describe the relationship betwe..."
   Q10: 17 words - "What are Lord Henry's views on art, beauty, and th..."

💡 Note: Queries vary from 3 to 17 words - all work fine!


## Step 2: Evaluate Retrieval (Recall@k)

For each question, retrieve top-k chunks and check if any contain the acceptable keywords.


In [None]:
# === TODO (you code this) ===
# Evaluate retrieval: for each Q, if any retrieved chunk contains a keyword → hit.
# Acceptance: print Recall@k summary.

from src.retrieve import retrieve

# Load QA set
# For each question:
#   - Retrieve top-k chunks
#   - Check if any chunk contains acceptable keywords
# Compute Recall@k
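
One possible shape for this cell is sketched below. The signature and return type of the project's `retrieve` are not shown in this notebook, so the sketch takes any callable `retrieve_fn(question, k)` that returns a list of chunk strings (an assumption). An in-memory CSV and a toy retriever stand in for `qa_dev.csv` and the real index.

```python
import csv
import io


def parse_keywords(cell: str) -> list:
    """Split the comma-separated keyword cell into lowercase terms."""
    return [kw.strip().lower() for kw in cell.split(",") if kw.strip()]


def is_hit(chunks: list, keywords: list) -> bool:
    """True if any chunk contains any keyword (case-insensitive substring match)."""
    return any(kw in chunk.lower() for chunk in chunks for kw in keywords)


def recall_at_k(rows, retrieve_fn, k: int = 5) -> float:
    """Share of questions with at least one keyword-bearing chunk in the top k."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(
        is_hit(retrieve_fn(row["question"], k),
               parse_keywords(row["acceptable_answer_keywords"]))
        for row in rows
    )
    return hits / len(rows)


# In-memory stand-in for data/interim/qa_dev.csv (same column names).
SAMPLE_CSV = '''question,acceptable_answer_keywords,notes
Who is Basil Hallward?,"Basil Hallward, artist, painter",Character introduction
What happens to the portrait?,"changes, ages, corruption",Later in book
'''


def toy_retrieve(question: str, k: int) -> list:
    """Toy retriever with made-up chunk text; swap in the real retrieve call."""
    corpus = {
        "Who is Basil Hallward?": ["Basil is an artist who paints portraits."],
        "What happens to the portrait?": ["Lord Henry talks at length about youth."],
    }
    return corpus.get(question, [])[:k]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
score = recall_at_k(rows, toy_retrieve, k=5)
print(f"Recall@5: {score:.2f}")
```

Substring matching is deliberately loose: a keyword such as "ages" also matches "images", so whole-word matching may be preferable for short keywords.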


## Step 3: Compose Answers with Quotes

Test the answer composition pipeline with retrieved chunks.


In [None]:
# === TODO (you code this) ===
# Compose an answer from retrieved chunks (no LLM), with quotes and citations.
# Acceptance: dict with 'answer', 'quotes', 'used_chunks'.

from src.compose import compose_answer

# Test on a few example queries
# Verify that answers include quotes and citations
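
The project's `compose_answer` is not shown here, so the following is only a sketch of an extractive, no-LLM composer that returns the three keys named in the acceptance criterion. The function `compose_sketch`, its sentence-scoring rule, and the answer template are all hypothetical.

```python
import re


def split_sentences(text: str) -> list:
    """Naive split on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def best_sentence(chunk: str, query: str) -> str:
    """Pick the chunk sentence that shares the most words with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    return max(
        split_sentences(chunk),
        key=lambda s: len(query_words & set(re.findall(r"\w+", s.lower()))),
    )


def compose_sketch(query: str, chunks: list, max_quotes: int = 3) -> dict:
    """Build an answer from verbatim quotes, one per chunk, with [n] markers."""
    # Keep original chunk indices so 'used_chunks' stays accurate if blanks are skipped.
    selected = [(i, c) for i, c in enumerate(chunks) if c.strip()][:max_quotes]
    quotes = [best_sentence(c, query) for _, c in selected]
    lines = [f'The text states: "{q}" [{n}]' for n, q in enumerate(quotes, 1)]
    return {
        "answer": " ".join(lines),
        "quotes": quotes,
        "used_chunks": [i for i, _ in selected],
    }
```

Because every answer sentence wraps a verbatim quote, the output is grounded by construction; fluency is the trade-off.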


## Step 4: Evaluate Groundedness

Measure how well answers are grounded in quotes.


In [None]:
# === TODO (you code this) ===
# Evaluate "groundedness": % of answers that contain ≥1 quote AND whose every claim sentence shares tokens with at least one quote.
# Hints:
# 1) For a small QA set, run retrieve -> compose; check non-empty quotes.
# 2) Token-overlap heuristic: for each answer sentence, require >= t shared content words with some quote (t ~ 2–3).
# Acceptance:
# - Print groundedness rate and sample diagnostics for 3 questions.

# For each question:
#   - Retrieve and compose answer
#   - Check quote presence
#   - Compute token overlap between answer sentences and quotes
# Report groundedness and attribution scores


## Step 5: Wire Gradio Demo

Create a simple Gradio interface for interactive Q&A.


In [None]:
# === TODO (you code this) ===
# Wire a simple Gradio demo using src/app.launch_app().



# Force reload the app


# Launch the demo
# Test with a few questions interactively
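
The body of `src/app.launch_app()` is not shown in this notebook, so the sketch below only illustrates what such a wrapper might look like. It assumes an `answer_fn(question)` that returns a dict with `answer` and `quotes` keys; `format_result`, `build_demo`, and the title string are placeholders.

```python
def format_result(result: dict) -> str:
    """Render a compose-style dict as answer text plus a numbered citations list."""
    citations = "\n".join(f"[{i}] {q}" for i, q in enumerate(result["quotes"], 1))
    return f"{result['answer']}\n\nCitations:\n{citations}"


def build_demo(answer_fn):
    """Wrap answer_fn(question) -> dict in a text-in, text-out Gradio interface."""
    import gradio as gr  # imported lazily so format_result works without Gradio

    return gr.Interface(
        fn=lambda question: format_result(answer_fn(question)),
        inputs="text",
        outputs="text",
        title="Book Q&A",  # placeholder title
    )


# Usage (requires `pip install gradio`):
# demo = build_demo(my_answer_fn)
# demo.launch()
```

When reloading an edited module in a notebook, import the module itself first (for example `import src.app`) before calling `importlib.reload` on it.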




## Step 6: Prepare for Hugging Face Spaces Deployment

Optional: Prepare the app for deployment to Hugging Face Spaces.


In [None]:
# === TODO (you code this) ===
# Prepare for Hugging Face Spaces deployment.
# Hints:
# 1) Ensure app entrypoint is src/app.py with a function `launch_app()` or `demo = gr.Interface(...)`.
# 2) Create a `README.md` for the Space (use SPACE_CARD.md text).
# 3) Runtime: set "Hardware: CPU basic", "SDK: Gradio", "Space Timeout: 120s".
# Acceptance:
# - Space builds successfully; interacts within ~2–5 seconds per query on CPU.

# See SPACE_CARD.md for the README content to use in the Space.


## Summary

At this point, you should have:

- ✅ QA development set created (`data/interim/qa_dev.csv`)
- ✅ Recall@k evaluated (target: ≥ 0.8)
- ✅ Groundedness evaluated (target: ≥ 0.95)
- ✅ Answer composition tested with quotes and citations
- ✅ Gradio demo working locally
- ✅ (Optional) Space deployment ready

## Next Steps

- Optional LLM rewrite step (keeps quotes, improves fluency)
- Named-entity & character graph for richer answers
- Multi-book corpus with per-source filtering
