🔧 **Setup Required**: Before running this notebook, follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys. **You must also run the Indexing pipeline before starting this exercise.**

# Systematic RAG Evaluation: Naive vs Hybrid Comparison

## 📋 Overview

This notebook demonstrates a **systematic evaluation workflow** for comparing two RAG (Retrieval-Augmented Generation) approaches using **Haystack custom components**. We'll create a reproducible pipeline to:

1. **Load evaluation datasets** from CSV files
2. **Process queries** through both Naive and Hybrid RAG SuperComponents 
3. **Generate comprehensive metrics** using the RAGAS framework
4. **Compare performance** systematically

## 🎯 Learning Objectives

By the end of this notebook, you will understand how to:
- Create reusable evaluation components for RAG systems
- Build scalable pipelines for systematic RAG comparison
- Interpret RAGAS metrics in comparative context
- Make data-driven decisions between RAG approaches

## 🔧 Evaluation Pipeline

Our pipeline consists of three main components, run in sequence:

```
CSV Data → CSVReaderComponent → RAGDataAugmenterComponent → RagasEvaluationComponent → Metrics & Results
```

**Key Benefits:**
- **Systematic**: Same evaluation conditions for both RAG systems
- **Reproducible**: Consistent evaluation across experiments
- **Scalable**: Easy to add new RAG implementations
- **Comprehensive**: Multiple metrics provide complete assessment

---

## Component 1: CSV Data Loader 📊

The **CSVReaderComponent** serves as the entry point for our evaluation pipeline. It handles loading synthetic evaluation datasets and ensures data quality before processing.

**Key Features:**
- **Robust Error Handling**: Validates file existence and data integrity
- **Pandas Integration**: Returns data as DataFrame for easy manipulation
- **Pipeline Compatible**: Designed to work seamlessly with Haystack pipelines

- **Input**: File path to a CSV containing evaluation queries and ground truth
- **Output**: Pandas DataFrame ready for RAG processing

In [1]:
import pandas as pd
from pathlib import Path
from haystack import component, Pipeline
from typing import List, Optional, Dict, Any, Union

@component
class CSVReaderComponent:
    """Reads a CSV file into a Pandas DataFrame."""

    @component.output_types(data_frame=pd.DataFrame)
    def run(self, source: Union[str, Path]):
        """
        Reads a single CSV file into a DataFrame.

        Args:
            source: Path to the CSV file to load.

        Returns:
            dict: Dictionary containing the loaded DataFrame under the 'data_frame' key.

        Raises:
            FileNotFoundError: If the file doesn't exist.
            ValueError: If no source is given, the file can't be parsed,
                or the DataFrame is empty after loading.
        """
        if not source:
            raise ValueError("No source provided")

        try:
            df = pd.read_csv(source)
        except FileNotFoundError as e:
            # Chain the original exception so the full traceback is preserved
            raise FileNotFoundError(f"File not found at {source}") from e
        except Exception as e:
            raise ValueError(f"Error reading CSV file {source}: {e}") from e

        # Check if DataFrame is empty using proper pandas method
        if df.empty:
            raise ValueError(f"DataFrame is empty after loading from {source}")

        print(f"Loaded DataFrame with {len(df)} rows from {source}.")
        return {"data_frame": df}

## Component 2: RAG Data Augmentation 🔄

The **RAGDataAugmenterComponent** is the core of our evaluation workflow. It takes each query from our evaluation dataset and processes it through a RAG SuperComponent, collecting both the generated responses and retrieved contexts.

**🔑 Key Design Decisions:**

1. **SuperComponent Flexibility**: Accepts any pre-configured RAG SuperComponent (Naive, Hybrid, or custom)
2. **Batch Processing**: Processes the entire evaluation dataset in one call, running the queries one at a time
3. **Data Augmentation**: Enriches the original dataset with RAG outputs for evaluation
4. **Context Extraction**: Captures retrieved documents for context-based metrics

**Pipeline Integration:**
- **Input**: DataFrame with queries from CSVReaderComponent  
- **Process**: Runs each query through the specified RAG SuperComponent
- **Output**: Augmented DataFrame with responses and retrieved contexts

**💡 Why This Approach?**
By separating RAG execution from evaluation, we can:
- **Swap RAG systems** without changing evaluation logic
- **Cache RAG results** for multiple evaluation runs  
- **Debug RAG performance** independently of metrics calculation
- **Scale evaluation** across different datasets and configurations
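
The caching benefit can be sketched in a few lines. This is a minimal illustration rather than part of the notebook's pipeline: `CachedRAG` and `fake_rag` are hypothetical names, and the wrapped callable stands in for a call such as `rag_supercomponent.run(query=...)`.

```python
from typing import Any, Callable, Dict


class CachedRAG:
    """Hypothetical wrapper that memoizes RAG outputs by query string."""

    def __init__(self, run_rag: Callable[[str], Dict[str, Any]]):
        self._run_rag = run_rag
        self._cache: Dict[str, Dict[str, Any]] = {}
        self.calls = 0  # how many times the underlying RAG system was invoked

    def run(self, query: str) -> Dict[str, Any]:
        # Only call the (expensive) RAG system the first time a query is seen
        if query not in self._cache:
            self.calls += 1
            self._cache[query] = self._run_rag(query)
        return self._cache[query]


def fake_rag(query: str) -> Dict[str, Any]:
    """Stand-in for a real RAG call; returns the keys the augmenter reads."""
    return {"replies": [f"answer to: {query}"], "documents": []}


cached = CachedRAG(fake_rag)
first = cached.run("What is Alexa?")
second = cached.run("What is Alexa?")  # served from the cache, no second RAG call
```

Reusing cached outputs is only appropriate when the RAG system and its index have not changed between evaluation runs; otherwise the cache should be cleared.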

In [2]:
from haystack import SuperComponent

@component
class RAGDataAugmenterComponent:
    """
    Applies a RAG SuperComponent to each query in a DataFrame and 
    augments the data with the generated answer and retrieved contexts.
    """

    def __init__(self, rag_supercomponent: SuperComponent):
        # We store the pre-initialized SuperComponent
        self.rag_supercomponent = rag_supercomponent
        self.output_names = ["augmented_data_frame"]

    @component.output_types(augmented_data_frame=pd.DataFrame)
    def run(self, data_frame: pd.DataFrame):

        # New columns to store RAG results
        answers: List[str] = []
        contexts: List[List[str]] = []

        print(f"Running RAG SuperComponent on {len(data_frame)} queries...")

        # Iterate through the queries (user_input column)
        for query in data_frame["user_input"]:

            # 1. Run the RAG SuperComponent
            # It expects 'query' as input and returns a dictionary.
            rag_output = self.rag_supercomponent.run(query=query)

            # 2. Extract answer and contexts
            # Based on the naive_rag_sc/hybrid_rag_sc structure.
            # `or [""]` also covers an empty 'replies' list, which would
            # otherwise raise IndexError on [0].
            replies = rag_output.get("replies") or [""]
            answer = replies[0]

            # Extract content from the Document objects, skipping empty ones
            retrieved_docs = rag_output.get("documents") or []
            retrieved_contexts = [doc.content for doc in retrieved_docs if doc.content]

            answers.append(answer)
            contexts.append(retrieved_contexts)

        # 3. Augment a copy so the caller's DataFrame is left untouched
        augmented = data_frame.copy()
        augmented["response"] = answers
        augmented["retrieved_contexts"] = contexts

        print("RAG processing complete.")
        return {"augmented_data_frame": augmented}

## Component 3: RAGAS Evaluation Engine 📈

The **RagasEvaluationComponent** integrates the RAGAS framework into our Haystack pipeline, providing systematic evaluation metrics for RAG systems.

**🎯 Core Evaluation Metrics:**

| Metric | Purpose | What It Measures |
|--------|---------|------------------|
| **Faithfulness** | Response Quality | Factual consistency with retrieved context |
| **ResponseRelevancy** | Relevance | How well responses answer the questions |
| **LLMContextRecall** | Retrieval Quality | How well retrieval captures relevant information |
| **FactualCorrectness** | Accuracy | Correctness of factual claims in responses |
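
As a rough guide to how these scores arise, the sketch below reproduces the arithmetic behind three of the metrics as the Ragas documentation describes them: each is a ratio over claims that the judge LLM extracts and verifies. The claim counts are made-up inputs for illustration (in a real run the LLM produces them), and the helper names are hypothetical. ResponseRelevancy is left out because it is based on embedding similarity between the original question and questions generated from the response, not on claim counts.

```python
def faithfulness(supported_claims: int, total_response_claims: int) -> float:
    """Share of claims in the response that the retrieved context supports."""
    return supported_claims / total_response_claims


def context_recall(attributable_claims: int, total_reference_claims: int) -> float:
    """Share of claims in the reference answer found in the retrieved context."""
    return attributable_claims / total_reference_claims


def factual_correctness_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over claims.

    tp: response claims backed by the reference
    fp: response claims absent from the reference
    fn: reference claims the response missed
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative claim counts only
print(round(faithfulness(4, 7), 6))
print(context_recall(1, 2))
print(round(factual_correctness_f1(tp=3, fp=1, fn=3), 2))
```

Because every score is a fraction of a small number of claims, a single unsupported claim can move a per-query score noticeably.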

**🔧 Technical Implementation:**
- **Focused Metrics**: Core metrics for reliable comparison
- **LLM Integration**: Uses OpenAI GPT models for evaluation judgments  
- **Data Format Handling**: Automatically formats data for RAGAS requirements
- **Comprehensive Output**: Returns both aggregated metrics and detailed per-query results

In [3]:
import os

from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
from ragas import EvaluationDataset, evaluate
from ragas.llms import HaystackLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy

@component
class RagasEvaluationComponent:
    """
    Prepares data for Ragas, runs the evaluation, and returns the metrics.
    Simplified for core metrics comparison.
    """
    
    def __init__(self, 
                 metrics: Optional[List[Any]] = None,
                 llm_model: str = "gpt-4o-mini",
                 openai_api_key: Optional[str] = None):
        """
        Initialize the RagasEvaluationComponent.
        
        Args:
            metrics: List of RAGAS metrics to evaluate (defaults to core metrics)
            llm_model (str): OpenAI model for evaluation. Defaults to "gpt-4o-mini".
            openai_api_key (Optional[str]): OpenAI API key. If None, will use environment variable.
        """
        
        # Default to core metrics for systematic comparison
        if metrics is None:
            self.metrics = [
                Faithfulness(), 
                ResponseRelevancy(),
                LLMContextRecall(),
                FactualCorrectness()
            ]
        else:
            self.metrics = metrics
            
        self.llm_model = llm_model
        self.openai_api_key = openai_api_key or os.getenv('OPENAI_API_KEY')
        
        if not self.openai_api_key:
            raise ValueError("OpenAI API key not found. Please set OPENAI_API_KEY environment variable or pass openai_api_key parameter.")
        
        # Configure RAGAS LLM for evaluation
        self.ragas_llm = HaystackLLMWrapper(
            OpenAIGenerator(
                model=self.llm_model,
                api_key=Secret.from_token(self.openai_api_key)
            )
        )

    @component.output_types(metrics=Dict[str, float], evaluation_df=pd.DataFrame)
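    def run(self, augmented_data_frame: pd.DataFrame):
        import ast  # local import; literal_eval parses list literals without executing code

        # 1. Map columns to Ragas requirements.
        # 'reference_contexts' is stored in the CSV as a stringified list, so parse it
        # with ast.literal_eval rather than eval, which would run arbitrary code.
        ragas_data = pd.DataFrame({
            'user_input': augmented_data_frame['user_input'],
            'response': augmented_data_frame['response'], 
            'retrieved_contexts': augmented_data_frame['retrieved_contexts'],
            'reference': augmented_data_frame['reference'],
            'reference_contexts': augmented_data_frame['reference_contexts'].apply(
                lambda v: ast.literal_eval(v) if isinstance(v, str) else v
            )
        })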

        print("Creating Ragas EvaluationDataset...")
        # 2. Create EvaluationDataset
        dataset = EvaluationDataset.from_pandas(ragas_data)

        print("Starting Ragas evaluation...")
        
        # 3. Run Ragas Evaluation
        results = evaluate(
            dataset=dataset,
            metrics=self.metrics,
            llm=self.ragas_llm
        )
        
        results_df = results.to_pandas()
        
        print("Ragas evaluation complete.")
        print(f"Overall metrics: {results}")
        
        return {"metrics": results, "evaluation_df": results_df}
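
Because the CSV stores `reference_contexts` as the text of a Python list, the column has to be parsed back into real lists before Ragas sees it. Below is a minimal sketch of one safe way to do that with `ast.literal_eval`, which accepts only Python literals and does not execute arbitrary expressions. `parse_contexts` is a hypothetical helper name, and the sample cell is invented for illustration.

```python
import ast
from typing import List, Union


def parse_contexts(value: Union[str, List[str]]) -> List[str]:
    """Turn a stringified list such as "['a', 'b']" into a real list of strings."""
    if isinstance(value, list):
        return value  # already parsed, e.g. when the frame was built in memory
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, list):
        raise ValueError(f"Expected a list of contexts, got {type(parsed).__name__}")
    return [str(item) for item in parsed]


# Invented sample cell, shaped like a stringified list read from a CSV
raw_cell = "['<1-hop>\\n\\nfirst context', 'second context']"
contexts = parse_contexts(raw_cell)
```

With pandas this could be applied as `df["reference_contexts"].apply(parse_contexts)`. Input that is not a plain literal, such as a function call, raises `ValueError` instead of being executed.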

---

# 🧪 Systematic RAG Comparison: Naive vs Hybrid

Now we'll systematically evaluate both RAG SuperComponents using the same evaluation pipeline. This ensures fair comparison with identical evaluation conditions.

## 🎯 Comparison Strategy

Our approach enables **systematic comparison**:

1. **Same Dataset**: Both systems evaluated on identical test queries
2. **Same Metrics**: Consistent evaluation criteria across both approaches
3. **Same Pipeline**: An identical processing workflow removes differences in data handling as a source of bias
4. **Reproducible Results**: Pipeline ensures consistent evaluation conditions

## 📊 Dataset Information

We'll use a synthetic evaluation dataset:
- **`synthetic_tests_advanced_branching_2.csv`**: Focused dataset for comparison

**Dataset Structure:**
- `user_input`: Questions to ask the RAG system
- `reference`: Ground truth answers for comparison
- `reference_contexts`: Expected retrieved contexts
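
Since the evaluation step reads these three columns by name, it can help to check the CSV header before running anything expensive. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper and the sample rows are invented for illustration.

```python
import csv
import io
from typing import List, Sequence

REQUIRED_COLUMNS = ("user_input", "reference", "reference_contexts")


def missing_columns(csv_text: str, required: Sequence[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in required if name not in header]


# Invented sample data with the expected layout
sample = (
    "user_input,reference,reference_contexts\n"
    "What is Alexa?,A voice assistant.,\"['Alexa is a voice assistant.']\"\n"
)
complete = missing_columns(sample)
incomplete = missing_columns("user_input,reference\nq,a\n")
```

An empty result means the file has every column the pipeline expects; any names returned identify what is missing.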

### Setup: Initialize Both RAG SuperComponents

First, we'll initialize both RAG SuperComponents with consistent parameters for fair comparison.

**🔧 Configuration:**
- **Same base parameters**: Both systems use identical core settings
- **Document store**: Shared Elasticsearch document store
- **Models**: Consistent LLM and embedding models for both systems

In [4]:
# --- Setup Environment & Dependencies ---
from scripts.rag.hybridrag import HybridRAGSuperComponent
from scripts.rag.naiverag import NaiveRAGSuperComponent
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

# Initialize document store
document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

# Create both RAG SuperComponents with base parameters for fair comparison
naive_rag_sc = NaiveRAGSuperComponent(
    document_store=document_store
)

hybrid_rag_sc = HybridRAGSuperComponent(
    document_store=document_store
)

print("✅ Both RAG SuperComponents initialized successfully!")
print(f"📊 Naive RAG: {naive_rag_sc.__class__.__name__}")
print(f"📊 Hybrid RAG: {hybrid_rag_sc.__class__.__name__}")

✅ Both RAG SuperComponents initialized successfully!
📊 Naive RAG: NaiveRAGSuperComponent
📊 Hybrid RAG: HybridRAGSuperComponent


In [5]:
# --- Create Evaluation Function for Systematic Comparison ---

def evaluate_rag_system(rag_supercomponent, system_name):
    """
    Evaluate a RAG SuperComponent using the evaluation pipeline
    
    Args:
        rag_supercomponent: The RAG system to evaluate
        system_name: Name for logging and identification
    
    Returns:
        dict: Evaluation results containing metrics and detailed dataframe
    """
    
    print(f"\n🔄 Evaluating {system_name}...")
    print("=" * 50)
    
    # Initialize pipeline components
    reader = CSVReaderComponent()
    augmenter = RAGDataAugmenterComponent(rag_supercomponent=rag_supercomponent)
    evaluator = RagasEvaluationComponent()  # Uses default core metrics
    
    # Build evaluation pipeline
    evaluation_pipeline = Pipeline()
    evaluation_pipeline.add_component("reader", reader)
    evaluation_pipeline.add_component("augmenter", augmenter)
    evaluation_pipeline.add_component("evaluator", evaluator)
    
    # Connect pipeline components
    evaluation_pipeline.connect("reader.data_frame", "augmenter.data_frame")
    evaluation_pipeline.connect("augmenter.augmented_data_frame", "evaluator.augmented_data_frame")
    
    # Run evaluation
    csv_file_path = "data_for_eval/synthetic_tests_advanced_branching_2.csv"
    results = evaluation_pipeline.run({"reader": {"source": csv_file_path}})
    
    print(f"✅ {system_name} evaluation complete!")
    return results

print("🔧 Evaluation function ready for systematic comparison")

🔧 Evaluation function ready for systematic comparison


## Experiment 1: Naive RAG Evaluation 🔬

Let's evaluate the Naive RAG SuperComponent first. This will establish our baseline performance metrics.

**🎯 What to Observe:**
- Processing time and efficiency
- Core metric scores (Faithfulness, Relevancy, Context Recall, Factual Correctness)
- Any errors or warnings during evaluation

In [6]:
# Evaluate Naive RAG SuperComponent
naive_results = evaluate_rag_system(naive_rag_sc, "Naive RAG SuperComponent")


🔄 Evaluating Naive RAG SuperComponent...
Loaded DataFrame with 3 rows from data_for_eval/synthetic_tests_advanced_branching_2.csv.
Running RAG SuperComponent on 3 queries...
RAG processing complete.
Creating Ragas EvaluationDataset...
Starting Ragas evaluation...


Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.


Ragas evaluation complete.
Overall metrics: {'faithfulness': 0.5238, 'answer_relevancy': 0.6277, 'context_recall': 0.8333, 'factual_correctness(mode=f1)': 0.3900}
✅ Naive RAG SuperComponent evaluation complete!


### Naive RAG Results Analysis 📊

Let's examine the detailed evaluation results from our Naive RAG system.

In [7]:
# Display Naive RAG detailed results
print("📊 Naive RAG - Detailed Results:")
naive_results['evaluator']['evaluation_df']

📊 Naive RAG - Detailed Results:

| | user_input | retrieved_contexts | reference_contexts | response | reference | faithfulness | answer_relevancy | context_recall | factual_correctness(mode=f1) |
|---|------------|--------------------|--------------------|----------|-----------|--------------|------------------|----------------|------------------------------|
| 0 | Wut is Alexa and how does it use AI? | [What is AI, how does it work and why are some... | [What is AI, how does it work and why are some... | Alexa is a voice-controlled virtual assistant ... | Alexa is a voice-controlled virtual assistant ... | 1.0 | 0.93069 | 1.0 | 0.57 |
| 1 | What trends in user engagement with ChatGPT we... | [” (truncated)\n[user]: “10 more”\nTable 2:Ill... | [<1-hop>\n\n4 The Growth of ChatGPT\nChatGPT w... | In 2023, the first cohort of ChatGPT users exh... | In 2023, the first cohort of ChatGPT users exp... | 0.571429 | 0.95237 | 1.0 | 0.6 |
| 2 | What are the implications of artificial intell... | [Corporate\nusers may also use ChatGPT Busines... | [<1-hop>\n\n27-28.\n- Christian Davenport, “ F... | I don't have enough information to answer. | The implications of artificial intelligence on... | 0.0 | 0.0 | 0.5 | 0.0 |


In [8]:
# Display Naive RAG summary metrics
print("📈 Naive RAG - Summary Metrics:")
naive_metrics = naive_results['evaluator']['metrics']
naive_metrics

📈 Naive RAG - Summary Metrics:


{'faithfulness': 0.5238, 'answer_relevancy': 0.6277, 'context_recall': 0.8333, 'factual_correctness(mode=f1)': 0.3900}
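
The summary values appear to be the plain average of the per-query scores shown in the detailed results above. The sketch below checks that arithmetic with the standard library; the per-query numbers are copied from the Naive RAG detailed results, and averaging is an assumption about how the summary is aggregated rather than something the notebook states.

```python
from statistics import mean

# Per-query scores copied from the Naive RAG detailed results
per_query = {
    "faithfulness": [1.0, 0.571429, 0.0],
    "answer_relevancy": [0.93069, 0.95237, 0.0],
    "context_recall": [1.0, 1.0, 0.5],
    "factual_correctness(mode=f1)": [0.57, 0.6, 0.0],
}

# Average each metric over the three queries and round to four decimals
summary = {name: round(mean(scores), 4) for name, scores in per_query.items()}
print(summary)
```

With only three queries, the one question the system declined to answer pulls every average down sharply, so the summary should be read alongside the per-query rows.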

## Experiment 2: Hybrid RAG Evaluation 🔬⚡

Now let's evaluate the **Hybrid RAG SuperComponent** using the identical evaluation pipeline. This systematic approach ensures fair comparison.

**🔄 Key Benefits of This Approach:**
- **Identical Conditions**: Same pipeline, metrics, and dataset
- **Systematic Comparison**: Eliminates evaluation bias
- **Reproducible Results**: Consistent methodology across both systems

**🎯 Expected Improvements:**
Hybrid RAG often outperforms Naive RAG because of:
- **Dense + Sparse Retrieval**: Combines semantic and keyword-based search
- **Enhanced Context Quality**: Better retrieval often leads to better responses
- **Improved Robustness**: Multiple retrieval methods reduce failure modes
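
To make "dense + sparse" concrete, the sketch below merges two ranked result lists with reciprocal rank fusion (RRF), one common way to combine retrievers. It is an illustration only: the notebook does not say how `HybridRAGSuperComponent` merges its results, and the document ids, the helper name, and the constant `k=60` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked id lists; each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from a semantic (dense) and a keyword (sparse) retriever
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still makes it into the merged results.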

In [9]:
# Evaluate Hybrid RAG SuperComponent
hybrid_results = evaluate_rag_system(hybrid_rag_sc, "Hybrid RAG SuperComponent")


🔄 Evaluating Hybrid RAG SuperComponent...
Loaded DataFrame with 3 rows from data_for_eval/synthetic_tests_advanced_branching_2.csv.
Running RAG SuperComponent on 3 queries...
RAG processing complete.
Creating Ragas EvaluationDataset...
Starting Ragas evaluation...


Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.


Ragas evaluation complete.
Overall metrics: {'faithfulness': 1.0000, 'answer_relevancy': 0.6342, 'context_recall': 0.8333, 'factual_correctness(mode=f1)': 0.6667}
✅ Hybrid RAG SuperComponent evaluation complete!


In [10]:
print("📊 Hybrid RAG - Detailed Results:")
hybrid_results['evaluator']['evaluation_df']

📊 Hybrid RAG - Detailed Results:

| | user_input | retrieved_contexts | reference_contexts | response | reference | faithfulness | answer_relevancy | context_recall | factual_correctness(mode=f1) |
|---|------------|--------------------|--------------------|----------|-----------|--------------|------------------|----------------|------------------------------|
| 0 | Wut is Alexa and how does it use AI? | [What is AI, how does it work and why are some... | [What is AI, how does it work and why are some... | Alexa is a voice-controlled virtual assistant ... | Alexa is a voice-controlled virtual assistant ... | 1.0 | 0.917761 | 1.0 | 1.0 |
| 1 | What trends in user engagement with ChatGPT we... | [The yellow line represents the first cohort o... | [<1-hop>\n\n4 The Growth of ChatGPT\nChatGPT w... | In 2023, user engagement with ChatGPT, particu... | In 2023, the first cohort of ChatGPT users exp... | 1.0 | 0.984896 | 1.0 | 1.0 |
| 2 | What are the implications of artificial intell... | [What is AI, how does it work and why are some... | [<1-hop>\n\n27-28.\n- Christian Davenport, “ F... | The provided information does not specifically... | The implications of artificial intelligence on... | 1.0 | 0.0 | 0.5 | 0.0 |


**Naive RAG summary metrics:**

```python
{'faithfulness': 0.5238,
 'answer_relevancy': 0.6277,
 'context_recall': 0.8333,
 'factual_correctness(mode=f1)': 0.3900}
```

**Hybrid RAG summary metrics:**

```python
{'faithfulness': 1.0000,
 'answer_relevancy': 0.6342,
 'context_recall': 0.8333,
 'factual_correctness(mode=f1)': 0.6667}
```
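
To read the two summaries side by side, the differences can be computed directly. This is a small sketch using the values printed above; the variable names are illustrative.

```python
# Summary metrics copied from the two evaluation runs above
naive = {
    "faithfulness": 0.5238,
    "answer_relevancy": 0.6277,
    "context_recall": 0.8333,
    "factual_correctness(mode=f1)": 0.3900,
}
hybrid = {
    "faithfulness": 1.0000,
    "answer_relevancy": 0.6342,
    "context_recall": 0.8333,
    "factual_correctness(mode=f1)": 0.6667,
}

# Positive delta means Hybrid RAG scored higher on that metric
deltas = {name: round(hybrid[name] - naive[name], 4) for name in naive}

for name, delta in deltas.items():
    print(f"{name:<30} naive={naive[name]:.4f}  hybrid={hybrid[name]:.4f}  delta={delta:+.4f}")
```

On this run Hybrid RAG scores higher on faithfulness and factual correctness, while answer relevancy and context recall are essentially unchanged. The dataset has only three queries, so these differences are indicative rather than conclusive.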