# 📊 RAG Evaluation Basics

This notebook covers the fundamentals of evaluating RAG (Retrieval-Augmented Generation) systems. We'll explore key metrics, methodologies, and practical examples to help you understand how to measure and improve your RAG system's performance.

In [None]:
# Install required packages if not already installed
# !pip install ragas langchain-openai datasets

In [None]:
import os
import pandas as pd
import numpy as np
from datasets import Dataset
from ragas import evaluate
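from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    # NOTE: harmfulness and conciseness are not used in the evaluation below;
    # depending on the installed Ragas version they may need to be imported
    # from a different module.
    harmfulness,
    conciseness,
)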

# Import other necessary libraries
from src.core.bootstrap import get_container
from src.application.use_cases.ask_question_hybrid import AskQuestionHybridUseCase

# Initialize the container
container = get_container()

## 🎯 Understanding RAG Evaluation Metrics

RAG systems require specialized metrics that evaluate not just the final answer, but the entire pipeline including retrieval and generation components.

### Key Metrics:
- **Faithfulness**: Does the answer stick to the provided context?
- **Answer Relevancy**: Is the answer relevant to the question?
- **Context Precision**: Are the retrieved chunks relevant to the question?
- **Context Recall**: Does the retrieved context contain the information needed to produce the ground-truth answer?
- **Harmfulness**: Does the system generate harmful content?
- **Conciseness**: Is the answer appropriately concise?
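
To make the two retrieval metrics more concrete, here is a minimal sketch that computes a rank-aware precision and a claim-level recall from hand-supplied labels. It is only an illustration of the idea: Ragas derives its scores from model-based judgments, and the function names and label values below are hypothetical, not part of the Ragas API.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    return sum(top_k) / len(top_k) if top_k else 0.0


def rank_aware_precision(relevance: Sequence[int]) -> float:
    """Mean of precision@k, taken at every rank k that holds a relevant chunk."""
    hits = [
        precision_at_k(relevance, k)
        for k, is_relevant in enumerate(relevance, start=1)
        if is_relevant
    ]
    return sum(hits) / len(hits) if hits else 0.0


def recall_from_claims(claim_supported: Sequence[bool]) -> float:
    """Fraction of ground-truth claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical labels: chunks 1 and 3 of three retrieved chunks are relevant.
chunk_relevance = [1, 0, 1]
# Hypothetical labels: two of three ground-truth claims appear in the context.
claim_supported = [True, True, False]

context_precision_score = rank_aware_precision(chunk_relevance)
context_recall_score = recall_from_claims(claim_supported)
print(f"precision: {context_precision_score:.3f}, recall: {context_recall_score:.3f}")
```

Ranking a relevant chunk lower reduces the precision score even when the same chunks are retrieved, which is why re-ordering results can matter as much as filtering them.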

## 📋 Creating an Evaluation Dataset

To evaluate our RAG system, we need a dataset with questions, ground truths, and contexts. In practice, you'd create this from your actual documents and expected answers.

In [None]:
# Sample evaluation data - in practice, you'd generate this from your documents
sample_data = {
    "question": [
        "What is the main advantage of hybrid search in RAG systems?",
        "How does chunking affect RAG performance?",
        "What is the purpose of re-ranking in RAG?"
    ],
    "answer": [
        "Hybrid search combines vector and keyword search to improve retrieval accuracy by leveraging both semantic and lexical matching.",
        "Chunking affects RAG performance by determining how documents are split, impacting both retrieval relevance and context preservation.",
        "Re-ranking improves the relevance of initially retrieved documents by applying a secondary ranking algorithm."
    ],
    "contexts": [
        [
            "Hybrid search combines vector search with keyword search to leverage the strengths of both approaches. Vector search captures semantic meaning, while keyword search handles exact matches and named entities.",
            "The effectiveness of retrieval in RAG systems depends on how well the search balances semantic and lexical matching.",
            "Vector databases store embeddings that represent semantic meaning of text fragments."
        ],
        [
            "Chunking strategies determine how documents are divided for storage in the vector database. Different strategies affect retrieval quality.",
            "Large chunks may contain more context but could dilute relevance. Small chunks increase precision but may lose important context.",
            "Optimal chunk size varies depending on the document type and query patterns."
        ],
        [
            "Re-ranking is a post-processing step that re-orders initially retrieved documents based on a more sophisticated relevance model.",
            "Initial retrieval might use fast but less accurate methods, while re-ranking applies more precise but computationally expensive models.",
            "Cross-encoder models are commonly used for re-ranking due to their effectiveness."
        ]
    ],
    "ground_truth": [
        "Hybrid search combines vector and keyword search to improve retrieval accuracy by leveraging both semantic and lexical matching.",
        "Chunking affects RAG performance by determining how documents are split, impacting both retrieval relevance and context preservation.",
        "Re-ranking improves the relevance of initially retrieved documents by applying a secondary ranking algorithm."
    ]
}

# Convert to HuggingFace Dataset format
eval_dataset = Dataset.from_dict(sample_data)
print(f"Dataset created with {eval_dataset.num_rows} samples")
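
Before spending API calls on an evaluation, it can help to check that the data is structurally consistent. The following is a minimal sketch, assuming the four column names used in `sample_data` (other Ragas versions may expect different names); `validate_eval_data` is a hypothetical helper, not a library function.

```python
from typing import Any, Dict, List

# Column names mirror the sample_data dictionary in this notebook (an assumption).
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def validate_eval_data(data: Dict[str, List[Any]]) -> List[str]:
    """Return a list of problems; an empty list means the data looks consistent."""
    problems: List[str] = []

    missing = [col for col in REQUIRED_COLUMNS if col not in data]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems

    # Every column must describe the same number of samples.
    lengths = {col: len(data[col]) for col in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")

    # Each sample's contexts must be a non-empty list of strings.
    for i, ctx in enumerate(data["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"row {i}: contexts must be a list of strings")
        elif not ctx:
            problems.append(f"row {i}: contexts is empty")

    return problems


good = {
    "question": ["What is the purpose of re-ranking in RAG?"],
    "answer": ["Re-ranking re-orders retrieved documents with a more precise model."],
    "contexts": [["Re-ranking is a post-processing step that re-orders retrieved documents."]],
    "ground_truth": ["Re-ranking improves the relevance of initially retrieved documents."],
}
# A common mistake: passing a bare string instead of a list of chunks.
bad = dict(good, contexts=["a bare string instead of a list"])

print(validate_eval_data(good))
print(validate_eval_data(bad))
```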

## 📊 Running the Evaluation

Now we'll run the evaluation using Ragas metrics to assess our RAG system's performance.

In [None]:
# Define the metrics to evaluate
metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
]

# Run the evaluation
try:
    results = evaluate(
        dataset=eval_dataset,
        metrics=metrics
    )
    
    print("Evaluation Results:")
    print(results)
    
    # Convert to a more readable format
    results_df = pd.DataFrame([results])
    print("\nResults as DataFrame:")
    print(results_df.T)
    
except Exception as e:
    print(f"Evaluation failed with error: {e}")
    print("This is expected if the required models or API keys are not configured")
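
Aggregate scores hide which questions are dragging a metric down. As a rough sketch, assuming you have one dictionary of metric scores per sample (the values below are made-up placeholders, not real evaluation output, and both helper names are hypothetical), you could average each metric and locate the weakest sample like this:

```python
from statistics import mean
from typing import Dict, List


def summarize_scores(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all samples."""
    if not rows:
        return {}
    return {metric: mean(row[metric] for row in rows) for metric in rows[0]}


def weakest_sample(rows: List[Dict[str, float]], metric: str) -> int:
    """Index of the sample with the lowest score for the given metric."""
    return min(range(len(rows)), key=lambda i: rows[i][metric])


# Placeholder per-sample scores, invented purely for illustration.
per_sample = [
    {"faithfulness": 1.0, "context_recall": 1.0},
    {"faithfulness": 0.5, "context_recall": 1.0},
    {"faithfulness": 0.75, "context_recall": 0.5},
]

summary = summarize_scores(per_sample)
worst_faithfulness = weakest_sample(per_sample, "faithfulness")
print(summary)
print(f"Lowest faithfulness at sample index {worst_faithfulness}")
```

Reading the question, answer, and contexts of the weakest sample is often more informative than the mean alone.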

## 🧪 A/B Testing Different Configurations

Let's compare the performance of different RAG configurations to understand the impact of various parameters.

In [None]:
# Example: Comparing hybrid search vs vector-only search
def run_ab_test():
    # This is a conceptual example - actual implementation would require
    # different configurations of your RAG system
    
    print("Running A/B test between different configurations...")
    
    # Configuration 1: Vector-only search
    print("Configuration 1: Vector-only search")
    # results_config1 = evaluate_with_config(config="vector_only")
    
    # Configuration 2: Hybrid search (vector + keyword)
    print("Configuration 2: Hybrid search (vector + keyword)")
    # results_config2 = evaluate_with_config(config="hybrid")
    
    # Configuration 3: Hybrid with re-ranking
    print("Configuration 3: Hybrid search with re-ranking")
    # results_config3 = evaluate_with_config(config="hybrid_with_rerank")
    
    print("\nComparison would show:")
    print("- Hybrid search typically improves context recall")
    print("- Re-ranking typically improves context precision")
    print("- Vector-only might be faster but less accurate")

run_ab_test()
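
Once each configuration has been evaluated, the comparison itself is simple arithmetic. Here is a minimal sketch that reports per-metric deltas against a chosen baseline. The configuration names come from the comments in `run_ab_test`; the scores are invented placeholders chosen only to mirror the expectations printed there, and `compare_to_baseline` is a hypothetical helper.

```python
from typing import Dict

Scores = Dict[str, float]


def compare_to_baseline(scores_by_config: Dict[str, Scores], baseline: str) -> Dict[str, Scores]:
    """Per-metric score difference of every configuration relative to the baseline."""
    base = scores_by_config[baseline]
    return {
        name: {metric: round(scores[metric] - base[metric], 3) for metric in base}
        for name, scores in scores_by_config.items()
        if name != baseline
    }


# Placeholder scores, NOT measurements: replace with real evaluation output.
placeholder_scores = {
    "vector_only": {"context_precision": 0.60, "context_recall": 0.65},
    "hybrid": {"context_precision": 0.62, "context_recall": 0.80},
    "hybrid_with_rerank": {"context_precision": 0.75, "context_recall": 0.80},
}

deltas = compare_to_baseline(placeholder_scores, baseline="vector_only")
for name, delta in deltas.items():
    print(name, delta)
```

For a fair comparison, every configuration should be evaluated on the same questions and ground truths, so that differences reflect the configuration rather than the data.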

## 📈 Interpreting Results

Understanding what the metrics tell us about our RAG system's performance:

In [None]:
def interpret_results(results):
    """
    Interpret evaluation results and provide insights
    """
    print("Interpretation of RAG Evaluation Results:")
    print("=========================================")
    
    if 'faithfulness' in results:
        faithfulness_score = results['faithfulness']
        print(f"Faithfulness: {faithfulness_score:.3f}")
        if faithfulness_score > 0.8:
            print("  ✓ Good: Answers are well-grounded in provided context")
        else:
            print("  ⚠️  Improvement needed: Answers may contain hallucinations")
    
    if 'answer_relevancy' in results:
        relevancy_score = results['answer_relevancy']
        print(f"Answer Relevancy: {relevancy_score:.3f}")
        if relevancy_score > 0.7:
            print("  ✓ Good: Answers are relevant to the questions")
        else:
            print("  ⚠️  Improvement needed: Answers may be off-topic")
            
    if 'context_precision' in results:
        precision_score = results['context_precision']
        print(f"Context Precision: {precision_score:.3f}")
        if precision_score > 0.7:
            print("  ✓ Good: Retrieved context is relevant to the question")
        else:
            print("  ⚠️  Improvement needed: Retrieved context may contain irrelevant information")
            
    if 'context_recall' in results:
        recall_score = results['context_recall']
        print(f"Context Recall: {recall_score:.3f}")
        if recall_score > 0.7:
            print("  ✓ Good: Retrieved context contains information needed to answer")
        else:
            print("  ⚠️  Improvement needed: Important information may be missing from retrieved context")

# Example interpretation (using placeholder values, since no actual evaluation was run)
example_results = {
    'faithfulness': 0.85,
    'answer_relevancy': 0.78,
    'context_precision': 0.72,
    'context_recall': 0.81
}

interpret_results(example_results)
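
If you want to reuse these checks programmatically, for example in a test suite, the thresholds can be kept in one table and the result returned instead of printed. This is a sketch under the assumption that the thresholds match those in `interpret_results` (0.8 for faithfulness, 0.7 for the rest); the appropriate values depend on your use case, and `flag_metrics` is a hypothetical name.

```python
from typing import Dict

# Thresholds copied from interpret_results; tune them for your own system.
THRESHOLDS: Dict[str, float] = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.7,
    "context_precision": 0.7,
    "context_recall": 0.7,
}


def flag_metrics(results: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS) -> Dict[str, bool]:
    """Map each scored metric to True when it is strictly above its threshold."""
    return {
        metric: results[metric] > threshold
        for metric, threshold in thresholds.items()
        if metric in results  # metrics that were not evaluated are skipped
    }


# Placeholder scores, not real evaluation output.
flags = flag_metrics({"faithfulness": 0.6, "context_recall": 0.81})
needs_work = [metric for metric, ok in flags.items() if not ok]
print(flags)
print("Needs improvement:", needs_work)
```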

## 🛠️ Practical Tips for RAG Evaluation

Based on the evaluation results, here are practical steps to improve your RAG system:

In [None]:
def get_improvement_recommendations(results):
    """
    Provide recommendations based on evaluation results
    """
    print("Improvement Recommendations:")
    print("===========================")
    
    if results.get('faithfulness', 1.0) < 0.8:
        print("• Improve faithfulness:")
        print("  - Strengthen grounding mechanisms in prompts")
        print("  - Implement self-correction/validation steps")
        print("  - Use more constrained generation parameters")
    
    if results.get('context_precision', 1.0) < 0.7:
        print("• Improve context precision:")
        print("  - Enhance re-ranking mechanisms")
        print("  - Fine-tune embedding models for your domain")
        print("  - Implement better query expansion techniques")
    
    if results.get('context_recall', 1.0) < 0.7:
        print("• Improve context recall:")
        print("  - Implement hybrid search (vector + keyword)")
        print("  - Increase retrieval depth (top-k)")
        print("  - Improve query expansion")
    
    if results.get('answer_relevancy', 1.0) < 0.7:
        print("• Improve answer relevancy:")
        print("  - Refine prompt engineering")
        print("  - Implement query classification and routing")
        print("  - Use more appropriate LLM for your domain")

get_improvement_recommendations(example_results)

## 🧠 Key Takeaways

1. **Comprehensive Evaluation**: RAG systems need evaluation across multiple dimensions, not just answer accuracy
2. **Metric Selection**: Choose metrics that align with your specific use case and success criteria
3. **Iterative Improvement**: Use evaluation results to identify bottlenecks and guide improvements
4. **Baseline Establishment**: Establish baselines to measure improvement from changes
5. **Realistic Testing**: Use test data that represents real-world usage patterns
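
As one way to act on the baseline takeaway, the sketch below flags metrics that dropped by more than a tolerance relative to a stored baseline. The baseline reuses the placeholder values from `example_results`; the "current" scores, the tolerance of 0.02, and the helper name `find_regressions` are all assumptions made for illustration.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.02) -> List[str]:
    """Metrics whose score fell by more than `tolerance` compared with the baseline."""
    return sorted(
        metric
        for metric in baseline
        if metric in current and baseline[metric] - current[metric] > tolerance
    )


# Baseline: the placeholder values used earlier in this notebook.
baseline_scores = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.78,
    "context_precision": 0.72,
    "context_recall": 0.81,
}
# Hypothetical scores after a change to the system (invented for illustration).
current_scores = {
    "faithfulness": 0.86,
    "answer_relevancy": 0.70,
    "context_precision": 0.71,
    "context_recall": 0.81,
}

regressions = find_regressions(baseline_scores, current_scores)
print("Regressed metrics:", regressions)
```

A small tolerance keeps run-to-run noise from being reported as a regression; with only a handful of samples, a larger tolerance or repeated runs may be needed.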