# RAGAS Evaluation - Interactive Tutorial

This notebook demonstrates how to evaluate your RAG system using RAGAS metrics.

## What is RAGAS?

RAGAS (RAG Assessment) is a framework for evaluating Retrieval-Augmented Generation (RAG) systems using metrics such as:
- **Faithfulness**: whether the answer is grounded in the retrieved context
- **Answer Relevancy**: whether the answer addresses the question
- **Context Relevancy**: whether the retrieved contexts are useful for answering the question
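
Each metric is scored from three pieces of data per sample: the question, the generated answer, and the retrieved contexts. A minimal sketch of that record shape follows. `EvalSample` and `validate_sample` are hypothetical names used only for illustration; they are not part of RAGAS or of this project's `evaluation` module, and the strings are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class EvalSample:
    """Hypothetical container for one evaluation record."""
    question: str
    answer: str
    contexts: list[str] = field(default_factory=list)


def validate_sample(sample: EvalSample) -> list[str]:
    """Return a list of problems that would make the sample unusable."""
    problems = []
    if not sample.question.strip():
        problems.append("empty question")
    if not sample.answer.strip():
        problems.append("empty answer")
    if not any(c.strip() for c in sample.contexts):
        problems.append("no non-empty contexts")
    return problems


# Placeholder values; a real sample would come from the pipeline output.
sample = EvalSample(
    question="Show me clients who have defaulted on loans",
    answer="Placeholder answer text",
    contexts=["Placeholder context passage"],
)
print(validate_sample(sample))
```

Checking samples this way before evaluation can catch empty retrievals early, since a sample without contexts gives the context-based metrics nothing to score against.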

## Setup

In [None]:
# Import required modules
import os
import sys
from pathlib import Path

# Add project root to path
project_root = os.path.abspath('..')
sys.path.insert(0, project_root)

from evaluation import RAGASEvaluator, DatasetBuilder, get_ragas_metrics
from rag_core.pipeline.query_pipeline import QueryPipeline
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

print("✅ Setup complete!")

## Step 1: Initialize RAG Pipeline

In [None]:
# Initialize your RAG pipeline
pipeline = QueryPipeline()

print("✅ RAG Pipeline initialized")

## Step 2: Run Sample Queries

In [None]:
# Define test queries
test_queries = [
    "Find clients with high income and education level",
    "Show me clients who have defaulted on loans",
    "Retrieve information about clients with overdue payments"
]

# Run queries and collect results
query_results = []

for query in test_queries:
    print(f"\n🔍 Query: {query}")
    results = pipeline.execute(query, top_k=3, verbose=False)
    query_results.append({
        'query': query,
        'results': results
    })
    print(f"   Retrieved {len(results['results'])} results")

print("\n✅ All queries executed")

## Step 3: Build Evaluation Dataset

In [None]:
# Initialize dataset builder
builder = DatasetBuilder()

# Add each query result to the dataset
for item in query_results:
    query = item['query']
    results = item['results']['results']
    
    # Extract contexts
    contexts = [r['text'] for r in results]
    
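    # Generate a simple placeholder answer (in production, use an LLM).
    # Guard against an empty result list so contexts[0] cannot raise IndexError.
    preview = contexts[0][:200] if contexts else ""
    answer = f"Based on the search, I found {len(contexts)} relevant clients. {preview}..."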
    
    # Add to dataset
    builder.add_sample(
        question=query,
        answer=answer,
        contexts=contexts
    )

# Build the dataset
dataset = builder.build_dataset()

print(f"\n✅ Dataset built with {len(dataset)} samples")
print(f"\nDataset summary: {builder.get_summary()}")

## Step 4: Run RAGAS Evaluation

In [None]:
# Initialize RAGAS evaluator
evaluator = RAGASEvaluator()

# Get the metrics that do not require ground-truth answers
metrics = get_ragas_metrics(include_all=False)

print("🔍 Running RAGAS evaluation...\n")
print(f"Metrics: {[m.name for m in metrics]}\n")

# Run evaluation
results = evaluator.evaluate_dataset(
    dataset=dataset,
    metrics=metrics,
    verbose=True
)

print("\n✅ Evaluation complete!")

## Step 5: Analyze Results

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Create results dataframe
results_df = pd.DataFrame([results])

# Display results
print("\n📊 RAGAS Scores:")
print("="*50)
for metric, score in results.items():
    print(f"{metric:.<30} {score:.4f}")

# Visualize results
plt.figure(figsize=(10, 6))
plt.bar(results.keys(), results.values(), color='skyblue')
plt.xlabel('Metric')
plt.ylabel('Score')
plt.title('RAGAS Evaluation Results')
plt.ylim(0, 1)
plt.xticks(rotation=45)
plt.axhline(y=0.7, color='g', linestyle='--', label='Reference line (0.7)')
plt.legend()
plt.tight_layout()
plt.show()

print("\n✅ Analysis complete!")

## Step 6: Interpret Results

### Score Interpretation
- **Faithfulness**: Measures whether the answer's claims are supported by the retrieved context
  - >0.8: Excellent - minimal hallucination
  - 0.6-0.8: Good
  - <0.6: Needs improvement

- **Answer Relevancy**: Measures if answers address the question
  - >0.7: Excellent
  - 0.5-0.7: Good
  - <0.5: Needs improvement

- **Context Relevancy**: Measures if retrieved contexts are useful
  - >0.6: Excellent
  - 0.4-0.6: Good
  - <0.4: Needs improvement
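
The bands above can be applied programmatically. The following is a minimal sketch, assuming the metric keys are named `faithfulness`, `answer_relevancy`, and `context_relevancy`; the keys in your own results may differ. `SCORE_BANDS` and `interpret_score` are hypothetical helpers, not part of RAGAS or the `evaluation` module.

```python
# Bands copied from the interpretation list: (excellent_above, good_from).
# The metric key names are an assumption; adjust them to match your results.
SCORE_BANDS = {
    "faithfulness": (0.8, 0.6),
    "answer_relevancy": (0.7, 0.5),
    "context_relevancy": (0.6, 0.4),
}


def interpret_score(metric: str, score: float) -> str:
    """Map a score to the label of the band it falls into."""
    excellent_above, good_from = SCORE_BANDS[metric]
    if score > excellent_above:
        return "Excellent"
    if score >= good_from:
        return "Good"
    return "Needs improvement"


print(interpret_score("faithfulness", 0.85))
print(interpret_score("answer_relevancy", 0.49))
```

A score that sits exactly on the upper boundary (for example, a faithfulness of 0.8) is treated as "Good" here, because the list defines "Excellent" as strictly greater than the boundary.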

## Optional: Save Results

In [None]:
# Save dataset
builder.save_to_csv('../evaluation/notebook_test_dataset.csv')

# Save results
results_df.to_csv('../evaluation/notebook_evaluation_results.csv', index=False)

print("\n✅ Results saved!")
print("   - Dataset: evaluation/notebook_test_dataset.csv")
print("   - Results: evaluation/notebook_evaluation_results.csv")

## Next Steps

1. Try with more test queries
2. Add ground truth answers for full metrics
3. Compare different configurations (top_k, filters, etc.)
4. Integrate LLM for better answer generation
5. Set up automated evaluation pipeline
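
For step 3, one way to compare configurations is to collect the score dictionary from each run and pick the best configuration per metric. This is a sketch under assumptions: `best_config_per_metric` is a hypothetical helper, the configuration labels are arbitrary, and the scores are placeholders, not real evaluation results.

```python
def best_config_per_metric(runs: dict[str, dict[str, float]]) -> dict[str, str]:
    """Map each metric to the configuration name with the highest score."""
    best: dict[str, str] = {}
    for config_name, scores in runs.items():
        for metric, score in scores.items():
            # Keep the first configuration seen unless a later one scores higher.
            if metric not in best or score > runs[best[metric]][metric]:
                best[metric] = config_name
    return best


# Placeholder scores for illustration only; replace with real run outputs.
runs = {
    "top_k=3": {"faithfulness": 0.70, "answer_relevancy": 0.60},
    "top_k=5": {"faithfulness": 0.75, "answer_relevancy": 0.55},
}
print(best_config_per_metric(runs))
```

In practice, each entry in `runs` would hold the dictionary returned by one evaluation run with a given configuration.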

See `RAGAS_INTEGRATION_GUIDE.md` for more details!