# RAG Pipeline Evaluation ⚖️

This notebook evaluates the RAG system by comparing the **Prototype** (small scale, locally embedded) vector store against the **Production** (full scale, pre-embedded) vector store.

## Objectives
1. **Load Pipelines**: Initialize RAG pipelines for both collections.
2. **Define Test Queries**: Assemble a set of realistic consumer-complaint questions.
3. **Run Comparisons**: Query both systems and compare retrieval quality and answers.
4. **Qualitative Analysis**: Discuss differences in response quality.

In [None]:
import sys
import os
import pandas as pd

# Add the project's src directory to the import path so rag_pipeline can be found
sys.path.append(os.path.abspath('../src'))
from rag_pipeline import RAGPipeline

## 1. Initialize Pipelines

We connect to both vector stores. The 'Production' store must already have been built with `src/index_production.py`; otherwise the second pipeline fails to load.

In [None]:
VECTOR_DB_PATH = '../vector_store'

print("Initializing PROTOTYPE Pipeline...")
try:
    rag_proto = RAGPipeline(vector_db_path=VECTOR_DB_PATH, collection_name='complaints_prototype')
    print("Prototype Loaded.")
except Exception as e:
    print(f"Prototype Load Failed: {e}")
    rag_proto = None

print("\nInitializing PRODUCTION Pipeline...")
try:
    rag_prod = RAGPipeline(vector_db_path=VECTOR_DB_PATH, collection_name='complaints_production')
    print("Production Loaded.")
except Exception as e:
    print(f"Production Load Failed (Did you run index_production.py?): {e}")
    rag_prod = None

## 2. Test Queries
We select five distinct questions covering different products and issues.

In [None]:
test_queries = [
    "How do consumers complain about credit card late fees?",
    "What issues are reported regarding mortgage escrow accounts?",
    "Are there complaints about identity theft in checking accounts?",
    "What do people say about debt collection harassment?",
    "How are student loan servicing errors described?"
]

## 3. Comparative Evaluation
We run each query against both systems.

In [None]:
results = []

for q in test_queries:
    row = {"Question": q}
    
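    # Prototype: record the answer and how many source chunks were retrieved
    if rag_proto is not None:
        ans_proto, docs_proto, _ = rag_proto.query(q, n_results=3)
        row["Prototype Answer"] = ans_proto
        row["Prototype Sources"] = len(docs_proto)
    else:
        # Pipeline failed to load; fill both columns so every row has the same keys
        row["Prototype Answer"] = "N/A"
        row["Prototype Sources"] = 0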

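    # Production: record the answer and how many source chunks were retrieved
    if rag_prod is not None:
        ans_prod, docs_prod, _ = rag_prod.query(q, n_results=3)
        row["Production Answer"] = ans_prod
        row["Production Sources"] = len(docs_prod)
    else:
        # Pipeline failed to load; fill both columns so every row has the same keys
        row["Production Answer"] = "N/A"
        row["Production Sources"] = 0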
        
    results.append(row)

eval_df = pd.DataFrame(results)
pd.set_option('display.max_colwidth', None)
display(eval_df)
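
The table above shows answers and source counts side by side, but not how much the two stores agree on *which* chunks they retrieve. One simple way to quantify that is Jaccard overlap between the two retrieved sets. The sketch below is an illustration rather than part of the pipeline: `jaccard_overlap` is a hypothetical helper, the sample chunks are made up, and it assumes `query` returns its documents as a list of strings.

```python
def jaccard_overlap(docs_a, docs_b):
    """Return |A ∩ B| / |A ∪ B| for two lists of retrieved chunk texts."""
    set_a, set_b = set(docs_a), set(docs_b)
    union = set_a | set_b
    if not union:
        # Both retrievals were empty; report zero overlap rather than dividing by zero
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical retrieved chunks, for illustration only (assumes docs are plain strings)
proto_docs = ["late fee charged twice", "fee not waived", "interest rate raised"]
prod_docs = ["late fee charged twice", "fee posted after on-time payment", "fee not waived"]

overlap = jaccard_overlap(proto_docs, prod_docs)
print(f"Retrieval overlap: {overlap:.2f}")
```

In the notebook, the same helper could be called with `docs_proto` and `docs_prod` inside the loop, if those are indeed lists of strings. Low overlap is not a problem in itself, since the larger store is expected to surface chunks that the prototype sample never contained.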

## 4. Analysis
**Hypotheses**:
- **Coverage**: The Production index covers 464k documents, versus a small subset (e.g., 5,000) in the Prototype. We expect Production answers to be more comprehensive and citation-rich.
- **Latency**: Retrieval speed should be comparable, since ChromaDB handles vector search efficiently, though the larger index may be slightly slower without optimization.
- **Quality**: Production answers should reference specific details that may not exist in the small Prototype sample.
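
The latency hypothesis can be checked by timing a query a few times and taking the median. The following is a minimal sketch: `time_query` and `StubPipeline` are hypothetical names introduced here, and the stub exists only so the example runs without a vector store. It assumes the `query(question, n_results=...)` call used earlier in this notebook. It times the whole call, so the figure includes answer generation as well as retrieval.

```python
import time
from statistics import median


def time_query(pipeline, question, n_results=3, repeats=5):
    """Return the median wall-clock seconds of pipeline.query over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        pipeline.query(question, n_results=n_results)
        durations.append(time.perf_counter() - start)
    return median(durations)


class StubPipeline:
    """Hypothetical stand-in for RAGPipeline so the sketch runs on its own."""

    def query(self, question, n_results=3):
        # Mirrors the three-value return unpacked earlier in the notebook
        return "stub answer", ["doc"] * n_results, None


latency = time_query(StubPipeline(), "How do consumers complain about credit card late fees?")
print(f"Median latency: {latency * 1000:.3f} ms")
```

With the real pipelines loaded, calling `time_query(rag_proto, q)` and `time_query(rag_prod, q)` for each test query would give a per-question comparison. The median is used rather than the mean so that a slow first call (for example, a cold cache) does not dominate the result.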