# 🚀 Advanced RAG Comparison - Experiment Notebook

This notebook demonstrates:
1. Building all three retriever types
2. Testing with sample queries
3. Running comprehensive evaluation
4. Generating benchmark results for Streamlit

**Methods Compared:**
- Basic RAG (baseline)
- Sentence Window Retrieval
- Auto-Merging Retrieval

## 📦 Setup & Imports

In [None]:
import sys
sys.path.append('..')

from src.config import Config
from src.utils import (
    load_documents,
    setup_rag_system,
    print_query_results,
    create_sample_eval_questions,
)
from src.retrievers import (
    build_basic_retriever,
    build_sentence_window_retriever,
    build_auto_merging_retriever,
)
from src.evaluation import (
    RetrieverEvaluator,
    compare_retrievers,
)

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

print("✅ Imports successful!")

## ⚙️ Configuration

In [None]:
# Print current configuration
Config.print_config()

# Validate configuration
Config.validate_config()

## 🔧 Initialize System Components

In [None]:
# Setup RAG system (LLM, embeddings, reranker)
system = setup_rag_system()

llm = system["llm"]
embed_model = system["embed_model"]
reranker = system["reranker"]

## 📂 Load Documents

In [None]:
# Load documents from data directory
documents = load_documents()

total_chars = sum(len(doc.text) for doc in documents)
# Guard against an empty data directory to avoid ZeroDivisionError
avg_chars = total_chars / len(documents) if documents else 0

print("\n📊 Document Statistics:")
print(f"Total documents: {len(documents)}")
print(f"Total characters: {total_chars:,}")
print(f"Average doc length: {avg_chars:,.0f} chars")

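## 🔨 Build Retrievers

Build all three retriever types. Each builder takes an `index_name` and a `force_rebuild` flag; set `force_rebuild=True` to rebuild an index.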

### 1. Basic RAG (Baseline)

In [None]:
basic_retriever = build_basic_retriever(
    documents=documents,
    llm=llm,
    embed_model=embed_model,
    reranker=reranker,
    index_name="basic_index",
    force_rebuild=False,  # Set to True to rebuild
)

print(f"✅ {basic_retriever.get_retriever_name()} ready!")

### 2. Sentence Window Retrieval

In [None]:
sentence_window_retriever = build_sentence_window_retriever(
    documents=documents,
    llm=llm,
    embed_model=embed_model,
    reranker=reranker,
    window_size=3,  # Try 1, 3, 5, 7
    index_name="sentence_window_index",
    force_rebuild=False,
)

print(f"✅ {sentence_window_retriever.get_retriever_name()} ready!")
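
The idea behind sentence window retrieval can be sketched in plain Python: each sentence is indexed on its own, but it carries a wider window of neighboring sentences that is handed to the LLM at answer time. This is a minimal sketch, assuming `window_size` counts sentences on each side of the match; `split_sentences` and `build_windows` are hypothetical helpers, and `build_sentence_window_retriever` may work differently internally.

```python
import re


def split_sentences(text):
    """Naive splitter on ., !, ? followed by whitespace (illustrative only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_windows(text, window_size=3):
    """Pair each sentence with up to `window_size` neighbors on each side."""
    sentences = split_sentences(text)
    windows = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        windows.append(
            {"sentence": sentence, "window": " ".join(sentences[start:end])}
        )
    return windows


text = (
    "RAG retrieves context. The LLM reads it. "
    "Then it answers. Quality depends on retrieval."
)
windows = build_windows(text, window_size=1)
for item in windows:
    print(item["sentence"], "->", item["window"])
```

A larger window gives the LLM more surrounding context per match at the cost of longer prompts, which is the trade-off the `window_size` values 1, 3, 5, 7 explore.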

### 3. Auto-Merging Retrieval

In [None]:
auto_merging_retriever = build_auto_merging_retriever(
    documents=documents,
    llm=llm,
    embed_model=embed_model,
    reranker=reranker,
    chunk_sizes=[2048, 512, 128],  # Try different hierarchies
    index_name="auto_merging_index",
    force_rebuild=False,
)

print(f"✅ {auto_merging_retriever.get_retriever_name()} ready!")
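
The merging step can be illustrated without any library: documents are split into a hierarchy (for example 2048, 512, and 128 tokens), the smallest chunks are retrieved, and when enough children of one parent come back, the parent chunk replaces them. The sketch below is an assumption-laden illustration of that rule, not the code behind `build_auto_merging_retriever`; the `auto_merge` function, the `threshold` parameter, and the toy hierarchy are all hypothetical.

```python
from collections import defaultdict


def auto_merge(retrieved_leaf_ids, parent_of, children_of, threshold=0.5):
    """Replace leaves with their parent when enough siblings were retrieved."""
    hits_by_parent = defaultdict(list)
    for leaf_id in retrieved_leaf_ids:
        hits_by_parent[parent_of[leaf_id]].append(leaf_id)

    merged = []
    for parent_id, hits in hits_by_parent.items():
        ratio = len(hits) / len(children_of[parent_id])
        if ratio > threshold:
            merged.append(parent_id)  # enough siblings: return the parent chunk
        else:
            merged.extend(hits)  # too few: keep the individual leaves
    return merged


# Toy hierarchy: two parents, each split into four leaf chunks.
children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {
    leaf: parent for parent, leaves in children_of.items() for leaf in leaves
}

result = auto_merge(["a", "b", "c", "e"], parent_of, children_of)
print(result)
```

Here three of the four children of `P1` were retrieved, so `P1` is returned whole, while the single hit under `P2` stays a leaf.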

## 🧪 Test Single Query

Test a single query across all retrievers

In [None]:
test_query = "What is the main topic of the document?"

retrievers = {
    "Basic RAG": basic_retriever,
    "Sentence Window": sentence_window_retriever,
    "Auto-Merging": auto_merging_retriever,
}

for name, retriever in retrievers.items():
    print(f"\n{'='*80}")
    print(f"Testing: {name}")
    print(f"{'='*80}")
    
    response, nodes = retriever.query(test_query, return_nodes=True)
    
    print(f"\n💬 Response:\n{response}\n")
    print(f"📚 Retrieved {len(nodes)} nodes")
    
    # Show first retrieved context
    if nodes:
        print(f"\n📄 Top Context (snippet):")
        print(nodes[0].node.get_content()[:300] + "...")

## 📊 Comprehensive Evaluation

Evaluate all retrievers on multiple questions

### Define Test Questions

In [None]:
# Define your test questions
test_questions = [
    "What is the main topic discussed in the document?",
    "What are the key findings or conclusions?",
    "What methodology was used?",
    "What are the main recommendations?",
    "What are the limitations mentioned?",
]

# Or load from file
# import json
# with open('../data/eval_questions.json', 'r') as f:
#     questions_data = json.load(f)
#     test_questions = [q['question'] for q in questions_data]

print(f"📝 Test questions: {len(test_questions)}")
for i, q in enumerate(test_questions, 1):
    print(f"   {i}. {q}")
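
The commented-out loader above expects a JSON list of objects, each with a `question` key. A minimal sketch of that loader with a fallback to the inline list is shown here; the `load_questions` helper and the temporary sample file are hypothetical and only demonstrate the assumed file shape.

```python
import json
import tempfile
from pathlib import Path


def load_questions(path, fallback):
    """Read [{"question": ...}, ...] from `path`; use `fallback` if it is missing."""
    path = Path(path)
    if not path.exists():
        return list(fallback)
    with path.open("r", encoding="utf-8") as f:
        records = json.load(f)
    return [record["question"] for record in records]


# Demonstrate both branches with a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "eval_questions.json"
    sample.write_text(
        json.dumps([{"question": "What methodology was used?"}]),
        encoding="utf-8",
    )
    loaded = load_questions(sample, fallback=[])
    missing = load_questions(
        Path(tmp) / "absent.json", fallback=["What are the key findings?"]
    )

print(loaded, missing)
```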

### Run Evaluation

In [None]:
# Create evaluator
evaluator = RetrieverEvaluator()

# Evaluate all retrievers
results = evaluator.evaluate_multiple_retrievers(
    retrievers=[basic_retriever, sentence_window_retriever, auto_merging_retriever],
    questions=test_questions,
    ground_truths=None,  # Add if you have ground truth answers
    verbose=True,
)
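
The comparison reports an `avg_response_time` per retriever. How `RetrieverEvaluator` measures it is not shown in this notebook, so the following is only a sketch of one common approach: time each `query()` call with a monotonic clock and average the durations. `EchoRetriever` and `average_response_time` are hypothetical names used for illustration.

```python
import time
from statistics import mean


class EchoRetriever:
    """Hypothetical stub exposing a query() method like the notebook's retrievers."""

    def query(self, question):
        return f"answer to: {question}"


def average_response_time(retriever, questions):
    """Return (answers, mean seconds per query) using a monotonic clock."""
    answers, durations = [], []
    for question in questions:
        start = time.perf_counter()
        answers.append(retriever.query(question))
        durations.append(time.perf_counter() - start)
    return answers, mean(durations)


answers, avg_seconds = average_response_time(EchoRetriever(), ["q1", "q2"])
print(f"{len(answers)} answers, {avg_seconds:.6f}s on average")
```

With real retrievers the first query is often slower than the rest because of model loading or cache warm-up, so a warm-up query before timing can make the averages more comparable.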

### View Results

In [None]:
# Get comparison dataframe
comparison_df = evaluator.get_comparison_dataframe()

print("\n📊 Comparison Results:")
display(comparison_df)

# Highlight best scores
styled_df = comparison_df.style.highlight_max(
    subset=['faithfulness', 'answer_relevancy', 'context_relevancy'],
    color='lightgreen'
).highlight_min(
    subset=['avg_response_time'],
    color='lightgreen'
)

display(styled_df)
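
The highlighting above encodes a rule: higher is better for the three quality metrics, lower is better for response time. The same rule can be applied programmatically to name a winner per metric. This is a sketch over plain dictionaries; `best_per_metric` is a hypothetical helper and the numbers are placeholders, not measured results.

```python
HIGHER_IS_BETTER = ["faithfulness", "answer_relevancy", "context_relevancy"]
LOWER_IS_BETTER = ["avg_response_time"]


def best_per_metric(rows):
    """Map each metric to the name of the retriever with the best value."""
    best = {}
    for metric in HIGHER_IS_BETTER:
        best[metric] = max(rows, key=lambda row: row[metric])["retriever"]
    for metric in LOWER_IS_BETTER:
        best[metric] = min(rows, key=lambda row: row[metric])["retriever"]
    return best


# Placeholder numbers for illustration only; these are not measured results.
rows = [
    {"retriever": "retriever_a", "faithfulness": 0.5, "answer_relevancy": 0.25,
     "context_relevancy": 0.5, "avg_response_time": 1.0},
    {"retriever": "retriever_b", "faithfulness": 0.75, "answer_relevancy": 0.5,
     "context_relevancy": 0.25, "avg_response_time": 2.0},
]
winners = best_per_metric(rows)
print(winners)
```

Assuming `comparison_df` has these columns, `comparison_df.to_dict("records")` would produce rows in this shape.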

### Save Results for Streamlit

In [None]:
# Save results to be used in Streamlit dashboard
evaluator.save_results(Config.STORAGE_DIR / "eval_results.json")

print("✅ Results saved! You can now view them in the Streamlit dashboard.")
print("\nRun: streamlit run streamlit_app.py")

## 🔬 Experiment: Different Window Sizes

Compare different sentence window sizes

In [None]:
window_sizes = [1, 3, 5, 7]
window_retrievers = []

for window_size in window_sizes:
    print(f"\n🪟 Building retriever with window_size={window_size}...")
    
    retriever = build_sentence_window_retriever(
        documents=documents,
        llm=llm,
        embed_model=embed_model,
        reranker=reranker,
        window_size=window_size,
        index_name=f"sentence_window_{window_size}",
        force_rebuild=False,
    )
    
    window_retrievers.append(retriever)

print("\n✅ All window size retrievers built!")

In [None]:
# Evaluate window size variants
window_evaluator = RetrieverEvaluator()

window_results = window_evaluator.evaluate_multiple_retrievers(
    retrievers=window_retrievers,
    questions=test_questions[:3],  # Use subset for faster testing
    verbose=True,
)

# Compare results
window_comparison = window_evaluator.get_comparison_dataframe()
display(window_comparison)

## 🔬 Experiment: Different Chunk Sizes (Auto-Merging)

Compare different hierarchical chunk configurations

In [None]:
chunk_configs = [
    [2048, 512, 128],
    [1024, 256, 64],
    [4096, 1024, 256],
]

chunk_retrievers = []

for chunk_sizes in chunk_configs:
    print(f"\n🔄 Building retriever with chunk_sizes={chunk_sizes}...")
    
    retriever = build_auto_merging_retriever(
        documents=documents,
        llm=llm,
        embed_model=embed_model,
        reranker=reranker,
        chunk_sizes=chunk_sizes,
        index_name=f"auto_merge_{'_'.join(map(str, chunk_sizes))}",
        force_rebuild=False,
    )
    
    chunk_retrievers.append(retriever)

print("\n✅ All chunk size retrievers built!")

In [None]:
# Evaluate chunk size variants
chunk_evaluator = RetrieverEvaluator()

chunk_results = chunk_evaluator.evaluate_multiple_retrievers(
    retrievers=chunk_retrievers,
    questions=test_questions[:3],
    verbose=True,
)

# Compare results
chunk_comparison = chunk_evaluator.get_comparison_dataframe()
display(chunk_comparison)
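
When adding your own entries to `chunk_configs`, it may help to check each hierarchy before building an index, since a parent level only makes sense if it is larger than its children. The sketch below assumes sizes must be positive and strictly decreasing; `validate_chunk_sizes` and `index_name_for` are hypothetical helpers, and the naming mirrors the f-string used in the build loop above.

```python
def validate_chunk_sizes(chunk_sizes):
    """Require at least two positive sizes in strictly decreasing order."""
    if len(chunk_sizes) < 2:
        raise ValueError("need at least two levels to form a hierarchy")
    if any(size <= 0 for size in chunk_sizes):
        raise ValueError("chunk sizes must be positive")
    pairs = zip(chunk_sizes, chunk_sizes[1:])
    if any(parent <= child for parent, child in pairs):
        raise ValueError("chunk sizes must strictly decrease from parent to leaf")
    return chunk_sizes


def index_name_for(chunk_sizes):
    """Build an index name such as auto_merge_2048_512_128."""
    return "auto_merge_" + "_".join(map(str, chunk_sizes))


for sizes in ([2048, 512, 128], [1024, 256, 64], [4096, 1024, 256]):
    print(index_name_for(validate_chunk_sizes(sizes)))
```

Because each configuration maps to a distinct index name, the configurations do not share a stored index.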

## 📈 Visualize Results

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
plt.figure(figsize=(12, 6))

# Plot metrics comparison
metrics = ['faithfulness', 'answer_relevancy', 'context_relevancy']
x = range(len(comparison_df))
width = 0.25

for i, metric in enumerate(metrics):
    if metric in comparison_df.columns:
        plt.bar(
            [xi + width * i for xi in x],
            comparison_df[metric],
            width=width,
            label=metric.replace('_', ' ').title()
        )

plt.xlabel('Retriever')
plt.ylabel('Score')
plt.title('Retriever Performance Comparison')
plt.xticks([xi + width for xi in x], comparison_df['retriever'])
plt.legend()
plt.tight_layout()
plt.show()
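
The tick offset `xi + width` in the cell above centers the labels only when exactly three metrics are drawn; if a metric column is missing and skipped, the labels drift off center. A small sketch of the general positioning arithmetic follows, with `grouped_bar_positions` as a hypothetical helper that can feed `plt.bar` and `plt.xticks`.

```python
def grouped_bar_positions(n_groups, n_series, width):
    """Return (x-positions per series, tick positions centered under each group)."""
    series_positions = [
        [group + width * series for group in range(n_groups)]
        for series in range(n_series)
    ]
    # The center of a group lies halfway between its first and last bar.
    ticks = [group + width * (n_series - 1) / 2 for group in range(n_groups)]
    return series_positions, ticks


positions, ticks = grouped_bar_positions(n_groups=3, n_series=3, width=0.25)
print(ticks)
```

For three series and `width = 0.25` this reproduces the `xi + width` offset used above, and it stays centered when the number of plotted metrics changes.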

## 💡 Next Steps

1. **Add More Documents**: Place PDF/TXT files in the `data/` directory
2. **Customize Questions**: Create your own evaluation questions
3. **Tune Parameters**: Experiment with different:
   - Window sizes (1, 3, 5, 7, 9)
   - Chunk sizes ([2048, 512, 128], [1024, 256, 64], etc.)
   - Top-K values
   - Reranker settings
4. **View Dashboard**: Run `streamlit run streamlit_app.py` to see interactive results
5. **Compare Methods**: Analyze which method works best for your use case

## 🎯 Summary

This notebook demonstrated:
- ✅ Building three advanced RAG retrieval methods
- ✅ Evaluating with RAGAS metrics (faithfulness, answer relevancy, context relevancy)
- ✅ Comparing performance across methods
- ✅ Experimenting with different configurations
- ✅ Generating results for Streamlit dashboard

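**Key Findings:**
- Basic RAG is the baseline that the other two methods are measured against
- Sentence Window matches on individual sentences, then passes the surrounding sentences to the LLM for richer context
- Auto-Merging retrieves small chunks and swaps in the larger parent chunk when enough of its children are retrieved, balancing granularity and context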

Choose the method that best fits your:
- Document type
- Query complexity
- Latency requirements
- Quality needs