## Prerequisites

1. ✅ foundation/00-setup-postgres-schema.ipynb (creates evaluation_groundtruth table)
2. ✅ foundation/02-rag-postgresql-persistent.ipynb (generates and stores embeddings)

## What This Notebook Does

This notebook creates a **ground-truth test set** for evaluating RAG systems:

1. **Phase 1: Synthetic Generation** - Generate candidate questions from chunks using:
   - LLM-based generation (using Ollama, local and free)
   - Template-based generation (fallback, pattern matching)
   - Blended approach (use both)

2. **Phase 2: Deduplication** - Remove near-duplicate questions using semantic similarity

3. **Phase 3: Interactive Curation** - A human reviews each question and picks one option:
   - [G]ood - Accept as-is
   - [B]ad - Reject (stored with rating 'bad')
   - [E]dit - Rewrite the question and accept the edited version
   - [N]otes - Flag as ambiguous and attach a note
   - [S]kip - Leave unrated; the question is not stored

4. **Phase 4: Storage** - Store curated questions to `evaluation_groundtruth` table

## Quick Start

1. Set `INTERACTIVE_MODE = True` in the Configuration cell to enable human curation
2. Set `SAMPLE_SIZE = 20` to start with a small test run, then raise it once the pipeline works
3. Run all cells from top to bottom
4. Answer curation prompts as they appear
5. View summary statistics at the end

## Output

Questions stored in PostgreSQL `evaluation_groundtruth` table with:
- question (text)
- source_type ('llm_generated' or 'template_based', with an '_edited' suffix when the curator rewrote the question)
- relevant_chunk_ids (which chunks answer this)
- quality_rating ('good', 'bad', 'ambiguous')
- human_notes (optional curator comments)
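
The field list above can be turned into a quick sanity check before anything is written to the table. The sketch below is a hypothetical helper, not part of the notebook: `validate_record`, `ALLOWED_RATINGS`, and `BASE_SOURCES` are illustrative names, and the record uses the `chunk_ids` key that the curation code later maps to the `relevant_chunk_ids` column.

```python
from typing import Dict, List

# Values taken from the field list above; the names themselves are illustrative.
ALLOWED_RATINGS = {"good", "bad", "ambiguous"}
BASE_SOURCES = {"llm_generated", "template_based"}
EDITED_SUFFIX = "_edited"


def validate_record(record: Dict) -> List[str]:
    """Return a list of problems found in one curated-question dict."""
    problems = []

    if not str(record.get("question") or "").strip():
        problems.append("question is empty")

    # Strip the optional "_edited" suffix before checking the base source type.
    source = str(record.get("source_type") or "")
    base = source[: -len(EDITED_SUFFIX)] if source.endswith(EDITED_SUFFIX) else source
    if base not in BASE_SOURCES:
        problems.append(f"unexpected source_type: {source!r}")

    if record.get("quality_rating") not in ALLOWED_RATINGS:
        problems.append(f"unexpected quality_rating: {record.get('quality_rating')!r}")

    chunk_ids = record.get("chunk_ids")
    if not chunk_ids or not all(isinstance(c, int) for c in chunk_ids):
        problems.append("chunk_ids must be a non-empty list of integers")

    return problems


example = {
    "question": "Who developed the theory of relativity?",
    "source_type": "llm_generated_edited",
    "quality_rating": "good",
    "chunk_ids": [1],
    "human_notes": None,
}
print(validate_record(example))
```

An empty list means the record matches the documented fields. Whether the real table enforces the same constraints depends on the schema created in the setup notebook.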

## Configuration

In [None]:
import ollama
import psycopg2
import json
import pandas as pd
import numpy as np
import re
import random
import time
from datetime import datetime
from typing import List, Dict, Tuple, Optional

In [None]:
# PostgreSQL configuration
POSTGRES_CONFIG = {
    'host': 'localhost',
    'port': 5432,
    'database': 'rag_db',
    'user': 'postgres',
    'password': 'postgres',
}

# LLM Configuration
LLM_MODEL = 'llama3.2:1b'  # Ollama model for generation (local, free)
GENERATION_APPROACH = 'blended'  # 'llm', 'template', or 'blended'
QUESTIONS_PER_CHUNK = 3
BATCH_SIZE = 10  # Questions per curation batch

# Deduplication
SIMILARITY_THRESHOLD = 0.85

# Sampling
SAMPLE_SIZE = 20  # Number of chunks to generate questions from (start small)

# Interactive Mode
INTERACTIVE_MODE = True  # Set False to skip review and auto-accept every question as 'good'

# Which embedding model to use for deduplication
EMBEDDING_MODEL = 'hf.co/CompendiumLabs/bge-base-en-v1.5-gguf'
EMBEDDING_MODEL_ALIAS = 'bge_base_en_v1.5'

## Phase 1: Synthetic Generation

### Option A: LLM-Based Generation
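
The response-parsing step can be tried without a running Ollama server. The following is a minimal sketch that assumes the model answers with lines marked either `Q:` or with a list number. `parse_questions` and the canned response are hypothetical and only approximate the parsing rules used in the cell below.

```python
import re
from typing import List

# Matches a leading "Q:" marker or a list number such as "1." or "2)".
_MARKER = re.compile(r'^(?:Q:|\d+[.)])\s*(.+)$')


def parse_questions(content: str, min_length: int = 10) -> List[str]:
    """Extract question text from marked lines, dropping very short ones."""
    questions = []
    for line in content.splitlines():
        match = _MARKER.match(line.strip())
        if match:
            question = match.group(1).strip()
            if len(question) > min_length:
                questions.append(question)
    return questions


# Hand-written stand-in for an LLM response (not real model output).
canned = """Here are your questions:
Q: Who developed the theory of relativity?
2. When did Einstein win the Nobel Prize?
3) Why?
Thanks!"""

print(parse_questions(canned))
```

Lines without a marker and the too-short `Why?` are discarded, so only the two full questions remain.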

In [None]:
def generate_questions_with_llm(chunk_text: str, chunk_id: int, num_questions: int = 3, 
                               llm_model: str = 'llama3.2:1b') -> List[Tuple[str, int, str]]:
    """
    Generate questions from a chunk using Ollama LLM.
    
    Args:
        chunk_text: Text chunk to generate questions from
        chunk_id: ID of the chunk (for tracking)
        num_questions: How many questions to generate
        llm_model: Ollama model name
        
    Returns:
        List of (question, chunk_id, source_type) tuples
    """
    prompt = f"""Based on the following text, generate exactly {num_questions} clear, specific questions that can be answered using ONLY the information in this text.

Write each question on its own line, prefixed with "Q:". Output only the questions, with no numbering or extra commentary.

Text:
{chunk_text[:1000]}

Generate exactly {num_questions} questions:"""
    
    try:
        response = ollama.chat(
            model=llm_model,
            messages=[{
                'role': 'user',
                'content': prompt
            }]
        )
        
        # Parse response to extract questions
        content = response['message']['content']
        questions = []
        
        for line in content.split('\n'):
            line = line.strip()
            
            # Pattern 1: "Q: What is..."
            if line.startswith('Q:'):
                question = line[2:].strip()
                if question and len(question) > 10:  # Skip too-short questions
                    questions.append((question, chunk_id, 'llm_generated'))
            
            # Pattern 2: "1. What is..." or "1) What is..."
            elif line and (line[0].isdigit()):
                # Split on the first "." or ")" that follows the list number
                parts = re.split(r'[\.\)]\s+', line, maxsplit=1)
                if len(parts) > 1:
                    question = parts[1].strip()
                    if question and len(question) > 10 and ('?' in question or any(q in question for q in ['What', 'How', 'When', 'Where', 'Who', 'Why'])):
                        questions.append((question, chunk_id, 'llm_generated'))
        
        return questions[:num_questions]
    
    except Exception as e:
        print(f"  ✗ LLM generation failed for chunk {chunk_id}: {e}")
        return []

# Test LLM generation
print("Testing LLM-based question generation...")
test_chunk = """
Albert Einstein was a German-born theoretical physicist who developed the theory of relativity. 
He won the Nobel Prize in Physics in 1921. Einstein's mass-energy equivalence formula, E=mc², 
is one of the most famous equations in science. He spent his later years at Princeton University.
"""

test_qs = generate_questions_with_llm(test_chunk, chunk_id=1, num_questions=3)
print(f"Generated {len(test_qs)} questions:")
for q, cid, src in test_qs:
    print(f"  - {q} (chunk_id={cid}, source={src})")
print()

### Option B: Template-Based Generation
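
A small sketch of the pattern-matching idea on a hand-written sample. It also shows a `re.findall` detail that matters for year extraction: when the pattern contains a capturing group, `findall` returns only the group, not the whole match. The sample text and `STOP_WORDS` list are illustrative assumptions.

```python
import re

sample = ("Albert Einstein won the Nobel Prize in Physics in 1921. "
          "He moved to Princeton in 1933.")

# Runs of capitalized words are treated as candidate entities.
raw_entities = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', sample)

# Sentence-initial pronouns also match, so a small stop list helps.
STOP_WORDS = {"He", "She", "It", "The", "A"}  # illustrative, not exhaustive
entities = [e for e in dict.fromkeys(raw_entities) if e not in STOP_WORDS]

# Non-capturing group: findall returns the full four-digit year.
years = list(dict.fromkeys(re.findall(r'\b(?:19|20)\d{2}\b', sample)))

# Capturing group: findall returns only the captured century prefix.
prefixes = re.findall(r'\b(19|20)\d{2}\b', sample)

questions = [f"What is {e}?" for e in entities]
questions += [f"What happened in {y}?" for y in years]

print(entities)
print(years, prefixes)
print(questions)
```

Template questions like these are cheap to produce but often vague, which is one reason the curation step exists.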

In [None]:
def generate_questions_with_templates(chunk_text: str, chunk_id: int) -> List[Tuple[str, int, str]]:
    """
    Generate questions using template patterns as fallback.
    
    Args:
        chunk_text: Text to generate questions from
        chunk_id: ID of the chunk
        
    Returns:
        List of (question, chunk_id, source_type) tuples
    """
    questions = []
    
    # Pattern 1: Extract capitalized entities (potential names/places)
    entities = re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', chunk_text)
    entities = list(dict.fromkeys(entities))[:5]  # Unique, max 5, preserve order
    
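    # Pattern 2: Extract numbers and years
    # Non-capturing group so findall returns the full year, not just "19" or "20"
    years = re.findall(r'\b(?:19|20)\d{2}\b', chunk_text)
    years = list(dict.fromkeys(years))[:3]  # Unique, max 3, preserve order
    
    # "%" gets no trailing \b: a word boundary never matches between "%" and a space
    numbers_with_units = re.findall(r'\b(\d+)\s*(?:%|(?:percent|million|billion|thousand)\b)', chunk_text, re.IGNORECASE)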
    
    # Generate questions from entities
    for entity in entities:
        if entity in chunk_text and entity not in ['Article', 'Text', 'The', 'A']:
            questions.append((f"What is {entity}?", chunk_id, 'template_based'))
            questions.append((f"Tell me about {entity}.", chunk_id, 'template_based'))
    
    # Generate questions from years
    for year in years:
        questions.append((f"What happened in {year}?", chunk_id, 'template_based'))
    
    # Generate questions from numbers
    for num in numbers_with_units[:2]:
        questions.append((f"What does {num} refer to in this text?", chunk_id, 'template_based'))
    
    # Generic questions
    if len(chunk_text) > 100:
        questions.append(("What is the main topic discussed?", chunk_id, 'template_based'))
        questions.append(("What key information is provided?", chunk_id, 'template_based'))
    
    # Remove duplicates and return
    seen = set()
    unique_questions = []
    for q, cid, src in questions:
        q_lower = q.lower()
        if q_lower not in seen:
            seen.add(q_lower)
            unique_questions.append((q, cid, src))
    
    return unique_questions[:10]  # Max 10 template questions

# Test template generation
print("Testing template-based question generation...")
test_qs = generate_questions_with_templates(test_chunk, chunk_id=1)
print(f"Generated {len(test_qs)} questions:")
for q, cid, src in test_qs:
    print(f"  - {q} (chunk_id={cid}, source={src})")
print()

### Deduplicate Candidates
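
The greedy strategy used for deduplication can be illustrated with hand-made vectors in place of real embeddings. This is a sketch under that assumption: `greedy_dedup` is a hypothetical name, and the 2-D toy vectors stand in for the 768-dimensional embeddings the cell below requests from Ollama.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_dedup(vectors: Sequence[Sequence[float]], threshold: float = 0.85) -> List[int]:
    """Keep an item only if it is not too similar to any item kept so far."""
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine_similarity(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors: items 0 and 1 point the same way, as do items 2 and 3.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(greedy_dedup(toy))
```

Because each item is compared only against items already kept, the first question in a cluster of near-duplicates always survives, so the input order affects which wording is retained.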

In [None]:
def deduplicate_questions(candidate_questions: List[Tuple[str, int, str]], 
                         threshold: float = 0.85) -> List[Tuple[str, int, str]]:
    """
    Remove near-duplicate questions using semantic similarity with embeddings.
    
    Args:
        candidate_questions: List of (question, chunk_id, source_type) tuples
        threshold: Cosine-similarity cutoff; a question scoring above it against any kept question is dropped as a duplicate
        
    Returns:
        Deduplicated list of questions
    """
    if len(candidate_questions) <= 1:
        return candidate_questions
    
    print(f"Deduplicating {len(candidate_questions)} questions (threshold: {threshold})...")
    
    # Extract just the question text
    question_texts = [q[0] for q in candidate_questions]
    
    # Generate embeddings for all questions
    embeddings = []
    for i, q_text in enumerate(question_texts):
        try:
            emb_response = ollama.embed(model=EMBEDDING_MODEL, input=q_text)
            emb = emb_response['embeddings'][0]
            embeddings.append(emb)
        except Exception as e:
            print(f"  Warning: Failed to embed question {i}: {e}")
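            # Zero-vector fallback (768 is the bge-base embedding size); its cosine
            # similarity to every vector is 0, so the question is always kept
            embeddings.append([0.0] * 768)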
    
    # Compute cosine similarity between pairs
    def cosine_similarity(a, b):
        """Compute cosine similarity between two vectors."""
        a_arr = np.array(a)
        b_arr = np.array(b)
        dot_product = np.dot(a_arr, b_arr)
        norm_a = np.linalg.norm(a_arr)
        norm_b = np.linalg.norm(b_arr)
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot_product / (norm_a * norm_b)
    
    # Keep indices of non-duplicate questions
    keep_indices = []
    for i in range(len(embeddings)):
        is_duplicate = False
        
        # Check similarity with all previously kept questions
        for j in keep_indices:
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if sim > threshold:
                is_duplicate = True
                break
        
        if not is_duplicate:
            keep_indices.append(i)
    
    # Return non-duplicate questions
    deduplicated = [candidate_questions[i] for i in keep_indices]
    print(f"  Deduplicated: {len(candidate_questions)} → {len(deduplicated)} questions")
    
    return deduplicated

# Test deduplication
print("Testing deduplication...")
test_questions = [
    ("What is Albert Einstein?", 1, "llm_generated"),
    ("Tell me about Einstein.", 1, "template_based"),
    ("Who was Einstein?", 1, "template_based"),
    ("What is physics?", 1, "template_based"),
]
dedup = deduplicate_questions(test_questions, threshold=0.85)
print(f"Deduplication result: {len(test_questions)} → {len(dedup)} questions")
for q, cid, src in dedup:
    print(f"  - {q}")
print()

## Phase 3: Interactive Human Curation

Curator reviews each generated question and provides feedback.
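
The mapping from a curator's keypress to a stored record can be written as a pure function, which makes it testable without `input()`. This is a sketch only: `record_for_choice` is a hypothetical helper that mirrors the ratings and notes used in the cell below, and it raises on an empty edit where the interactive loop would need to re-prompt.

```python
from typing import Dict, Optional


def record_for_choice(choice: str, question: str, chunk_id: int,
                      source_type: str, text: str = "") -> Optional[Dict]:
    """Translate one curation choice into a record, or None for a skip.

    `text` carries the edited question for 'e' or the note for 'n'.
    """
    choice = choice.strip().lower()
    base = {"question": question, "chunk_ids": [chunk_id], "source_type": source_type}

    if choice == "g":
        return {**base, "quality_rating": "good", "human_notes": None}
    if choice == "b":
        return {**base, "quality_rating": "bad", "human_notes": "Rejected during curation"}
    if choice == "e":
        edited = text.strip()
        if not edited:
            raise ValueError("edited question must not be empty")
        return {**base, "question": edited, "source_type": f"{source_type}_edited",
                "quality_rating": "good", "human_notes": f"Edited from: {question}"}
    if choice == "n":
        return {**base, "quality_rating": "ambiguous",
                "human_notes": text.strip() or "Flagged for ambiguity"}
    if choice == "s":
        return None  # skipped questions are not recorded
    raise ValueError(f"unknown choice: {choice!r}")


edited = record_for_choice("e", "What is Einstein?", 1, "template_based",
                           "Who was Albert Einstein?")
print(edited["source_type"], "|", edited["human_notes"])
```

Separating this logic from the prompt loop also makes it easier to replay a saved list of decisions.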

In [None]:
def curate_questions_interactively(candidate_questions: List[Tuple[str, int, str]], 
                                   db_connection,
                                   batch_size: int = 10) -> List[Dict]:
    """
    Interactive curation with human-in-the-loop feedback.
    
    Args:
        candidate_questions: List of (question, chunk_id, source_type) tuples
        db_connection: PostgreSQL connection (currently unused here; storage happens later in store_curated_questions)
        batch_size: Questions per batch
        
    Returns:
        List of curated question dicts
    """
    curated = []
    
    print(f"\n{'='*70}")
    print(f"INTERACTIVE CURATION - {len(candidate_questions)} questions to review")
    print(f"{'='*70}")
    print("\nOptions: [G]ood | [B]ad | [E]dit | [N]otes | [S]kip")
    print("="*70 + "\n")
    
    for batch_num, i in enumerate(range(0, len(candidate_questions), batch_size)):
        batch = candidate_questions[i:i+batch_size]
        
        print(f"\n--- Batch {batch_num + 1} ({len(batch)} questions) ---\n")
        
        for idx, (question, chunk_id, source_type) in enumerate(batch):
            print(f"\nQuestion {idx+1}/{len(batch)}:")
            print(f"  Text: {question}")
            print(f"  Source: {source_type}")
            print(f"  Chunk ID: {chunk_id}")
            
            # Get user feedback
            while True:
                choice = input("  Rate: [G]ood / [B]ad / [E]dit / [N]otes / [S]kip: ").strip().lower()
                
                if choice == 'g':
                    curated.append({
                        'question': question,
                        'chunk_ids': [chunk_id],
                        'source_type': source_type,
                        'quality_rating': 'good',
                        'human_notes': None
                    })
                    print("  ✓ Marked as GOOD\n")
                    break
                
                elif choice == 'b':
                    curated.append({
                        'question': question,
                        'chunk_ids': [chunk_id],
                        'source_type': source_type,
                        'quality_rating': 'bad',
                        'human_notes': 'Rejected during curation'
                    })
                    print("  ✗ Marked as BAD\n")
                    break
                
                elif choice == 'e':
                    edited_q = input("  Enter edited question: ").strip()
                    if not edited_q:
                        # Re-prompt instead of silently dropping the question
                        print("  Empty edit ignored. Choose an option again.")
                        continue
                    curated.append({
                        'question': edited_q,
                        'chunk_ids': [chunk_id],
                        'source_type': f"{source_type}_edited",
                        'quality_rating': 'good',
                        'human_notes': f"Edited from: {question}"
                    })
                    print("  ✓ Saved EDITED version\n")
                    break
                
                elif choice == 'n':
                    notes = input("  Enter notes: ").strip()
                    curated.append({
                        'question': question,
                        'chunk_ids': [chunk_id],
                        'source_type': source_type,
                        'quality_rating': 'ambiguous',
                        'human_notes': notes if notes else 'Flagged for ambiguity'
                    })
                    print("  ⚠ Marked as AMBIGUOUS with notes\n")
                    break
                
                elif choice == 's':
                    print("  ⊘ Skipped\n")
                    break
                
                else:
                    print("  Invalid choice. Try again.")
    
    return curated

print("Interactive curation function defined.")
print("Set INTERACTIVE_MODE = True to enable curation when running the full pipeline.")

## Phase 4: Storage

In [None]:
def store_curated_questions(curated_questions: List[Dict], db_connection):
    """
    Store curated questions to evaluation_groundtruth table.
    
    Args:
        curated_questions: List of question dicts with all metadata
        db_connection: PostgreSQL connection
        
    Returns:
        Number of questions stored
    """
    if not curated_questions:
        print("No curated questions to store.")
        return 0
    
    try:
        stored_count = 0
        with db_connection.cursor() as cur:
            for q in curated_questions:
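                # A failed statement aborts the whole PostgreSQL transaction, so
                # wrap each INSERT in a savepoint and roll back only the failed row.
                cur.execute('SAVEPOINT groundtruth_row')
                try:
                    cur.execute('''
                        INSERT INTO evaluation_groundtruth 
                        (question, source_type, relevant_chunk_ids, quality_rating, human_notes)
                        VALUES (%s, %s, %s, %s, %s)
                    ''', (
                        q['question'],
                        q['source_type'],
                        q['chunk_ids'],  # psycopg2 adapts a Python list to a PostgreSQL array
                        q['quality_rating'],
                        q['human_notes']
                    ))
                    cur.execute('RELEASE SAVEPOINT groundtruth_row')
                    stored_count += 1
                except Exception as e:
                    cur.execute('ROLLBACK TO SAVEPOINT groundtruth_row')
                    print(f"  Warning: Failed to store question '{q['question'][:50]}...': {e}")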
        
        db_connection.commit()
        print(f"\n✓ Stored {stored_count}/{len(curated_questions)} questions to evaluation_groundtruth table")
        return stored_count
    
    except Exception as e:
        db_connection.rollback()
        print(f"\n✗ Error storing questions: {e}")
        return 0

print("Storage function defined.")

## Summary

In [None]:
def show_summary_statistics(curated_questions: List[Dict]):
    """
    Display comprehensive summary of curation results.
    
    Args:
        curated_questions: List of curated question dicts
    """
    if not curated_questions:
        print("No curated questions to summarize.")
        return
    
    # Create DataFrame for analysis
    df = pd.DataFrame(curated_questions)
    
    print("\n" + "="*70)
    print("CURATION SUMMARY STATISTICS")
    print("="*70)
    
    print(f"\nTotal Questions Curated: {len(curated_questions)}")
    
    # Quality rating distribution
    print("\nBy Quality Rating:")
    rating_counts = df['quality_rating'].value_counts()
    for rating, count in rating_counts.items():
        pct = (count / len(curated_questions)) * 100
        print(f"  {rating:12s}: {count:3d} ({pct:5.1f}%)")
    
    # Source type distribution
    print("\nBy Source Type:")
    source_counts = df['source_type'].value_counts()
    for source, count in source_counts.items():
        pct = (count / len(curated_questions)) * 100
        print(f"  {source:20s}: {count:3d} ({pct:5.1f}%)")
    
    # Chunk ID statistics
    chunk_counts = df['chunk_ids'].apply(len)
    print(f"\nChunk ID Statistics:")
    print(f"  Average chunk IDs per question: {chunk_counts.mean():.2f}")
    print(f"  Min chunk IDs: {chunk_counts.min()}")
    print(f"  Max chunk IDs: {chunk_counts.max()}")
    
    # Notes statistics
    has_notes = df[df['human_notes'].notna()].shape[0]
    print(f"\nHuman Notes:")
    print(f"  Questions with notes: {has_notes} ({(has_notes/len(curated_questions))*100:.1f}%)")
    
    # Good questions summary
    good_count = (df['quality_rating'] == 'good').sum()
    print(f"\nQuality Summary:")
    print(f"  Good questions (ready for evaluation): {good_count}")
    print(f"  Ambiguous questions (need clarification): {(df['quality_rating'] == 'ambiguous').sum()}")
    print(f"  Bad questions (rejected): {(df['quality_rating'] == 'bad').sum()}")
    
    print("\n" + "="*70)

# Test summary function
print("Summary function defined.")

In [None]:
# =============================================================================
# MAIN PIPELINE: Generate, Deduplicate, Curate, and Store Ground Truth
# =============================================================================

print("\n" + "="*70)
print("GROUND TRUTH CREATION PIPELINE")
print("="*70)

# Step 1: Connect to database
print("\n[1/6] Connecting to PostgreSQL...")
try:
    conn = psycopg2.connect(
        host=POSTGRES_CONFIG['host'],
        port=POSTGRES_CONFIG['port'],
        database=POSTGRES_CONFIG['database'],
        user=POSTGRES_CONFIG['user'],
        password=POSTGRES_CONFIG['password']
    )
    print("✓ Connected to PostgreSQL")
except Exception as e:
    print(f"✗ Failed to connect: {e}")
    print("Make sure PostgreSQL is running. Exiting.")
    raise

# Step 2: Load sample chunks from database
print("\n[2/6] Loading sample chunks from embeddings database...")
try:
    table_name = f'embeddings_{EMBEDDING_MODEL_ALIAS.replace(".", "_")}'
    with conn.cursor() as cur:
        # Get total count
        cur.execute(f'SELECT COUNT(*) FROM {table_name}')
        total_count = cur.fetchone()[0]
        
        if total_count == 0:
            print(f"✗ No embeddings found in {table_name}")
            print("Please run foundation/02-rag-postgresql-persistent.ipynb first")
            raise ValueError("No embeddings in database")
        
        # Get random sample
        actual_sample_size = min(SAMPLE_SIZE, total_count)
        cur.execute(f'''
            SELECT id, chunk_text FROM {table_name}
            ORDER BY RANDOM()
            LIMIT %s
        ''', (actual_sample_size,))
        
        chunks = cur.fetchall()
        print(f"✓ Loaded {len(chunks)} sample chunks from {total_count:,} total embeddings")
except Exception as e:
    print(f"✗ Failed to load chunks: {e}")
    raise

# Step 3: Generate candidate questions
print("\n[3/6] Generating candidate questions...")
start_time = time.time()
candidate_questions = []

for chunk_id, chunk_text in chunks:
    if GENERATION_APPROACH in ['llm', 'blended']:
        # LLM-based generation
        llm_questions = generate_questions_with_llm(
            chunk_text, 
            chunk_id=chunk_id, 
            num_questions=QUESTIONS_PER_CHUNK,
            llm_model=LLM_MODEL
        )
        candidate_questions.extend(llm_questions)
    
    if GENERATION_APPROACH in ['template', 'blended']:
        # Template-based generation (fallback or blended)
        template_questions = generate_questions_with_templates(chunk_text, chunk_id=chunk_id)
        candidate_questions.extend(template_questions)

generation_time = time.time() - start_time
print(f"✓ Generated {len(candidate_questions)} candidate questions in {generation_time:.1f}s")
print(f"  - Approach: {GENERATION_APPROACH}")
print(f"  - Sample size: {len(chunks)} chunks")

# Step 4: Deduplicate questions
print("\n[4/6] Deduplicating questions...")
start_time = time.time()
deduplicated_questions = deduplicate_questions(candidate_questions, threshold=SIMILARITY_THRESHOLD)
dedup_time = time.time() - start_time
print(f"✓ Deduplication complete in {dedup_time:.1f}s")
print(f"  - Before: {len(candidate_questions)} questions")
print(f"  - After: {len(deduplicated_questions)} questions")
print(f"  - Removed: {len(candidate_questions) - len(deduplicated_questions)} duplicates")

# Randomize order for curation
random.shuffle(deduplicated_questions)

# Step 5: Interactive curation (if enabled)
print("\n[5/6] Curating questions...")
if INTERACTIVE_MODE:
    curated_questions = curate_questions_interactively(
        deduplicated_questions, 
        conn, 
        batch_size=BATCH_SIZE
    )
else:
    # Automated labeling: mark all as good
    print("Running in automated mode (not interactive)")
    curated_questions = []
    for question, chunk_id, source_type in deduplicated_questions:
        curated_questions.append({
            'question': question,
            'chunk_ids': [chunk_id],
            'source_type': source_type,
            'quality_rating': 'good',
            'human_notes': 'Auto-accepted (non-interactive mode)'
        })
    print(f"✓ Auto-curated {len(curated_questions)} questions")

# Step 6: Store to database and show summary
print("\n[6/6] Storing results and generating summary...")
stored = store_curated_questions(curated_questions, conn)
show_summary_statistics(curated_questions)

# Final summary
print("\n" + "="*70)
print("PIPELINE COMPLETE")
print("="*70)
print(f"\nResults:")
print(f"  Generated: {len(candidate_questions)} questions")
print(f"  Deduplicated: {len(deduplicated_questions)} questions")
print(f"  Curated: {len(curated_questions)} questions")
print(f"  Stored: {stored} questions to evaluation_groundtruth table")
print(f"\nYou can now use these questions in evaluation-lab/02+ notebooks")
print(f"to measure RAG retrieval and generation quality.")
print("="*70)

# Close connection
conn.close()
print("\n✓ Database connection closed")