🔧 **Setup Required**: Before running this notebook, follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys. **You must also have run the indexing pipeline before starting this exercise.**

# RAG Evaluation Pipeline: Custom Components Approach

## 📋 Overview

This notebook demonstrates how to create a **reproducible evaluation workflow** for RAG (Retrieval-Augmented Generation) systems using **Haystack custom components**. Instead of manually evaluating RAG systems, we'll build a pipeline that can:

1. **Load evaluation datasets** from CSV files
2. **Process queries** through different RAG SuperComponents 
3. **Generate comprehensive metrics** using the RAGAS framework
4. **Compare multiple RAG configurations** systematically

## 🎯 Learning Objectives

By the end of this notebook, you will understand how to:
- Design modular evaluation components for RAG systems
- Create reusable pipelines for systematic RAG assessment 
- Switch between different RAG SuperComponents for comparative evaluation
- Interpret RAGAS metrics in the context of pipeline performance

## 🔧 Architecture Overview

Our evaluation pipeline consists of three main components:

```
CSV Data → RAGDataAugmenter → RagasEvaluation → Metrics & Results
    ↑              ↑                 ↑
CSVReader    SuperComponent    RAGAS Framework
```

**Key Benefits:**
- **Modularity**: Each component has a single responsibility
- **Reusability**: Swap RAG systems without changing evaluation logic  
- **Scalability**: Process multiple datasets and configurations systematically
- **Reproducibility**: Consistent evaluation across different experiments

---

## Component 1: CSV Data Loader 📊

The **CSVReaderComponent** serves as the entry point for our evaluation pipeline. It handles loading synthetic evaluation datasets and ensures data quality before processing.

**Key Features:**
- **Robust Error Handling**: Validates file existence and data integrity
- **Pandas Integration**: Returns data as DataFrame for easy manipulation
- **Pipeline Compatible**: Designed to work seamlessly with Haystack pipelines

**Input:** File path to CSV containing evaluation queries and ground truth
**Output:** Pandas DataFrame ready for RAG processing

In [1]:
import pandas as pd
from pathlib import Path
from haystack import component, Pipeline
from typing import List, Optional, Dict, Any, Union

@component
class CSVReaderComponent:
    """Reads a CSV file into a Pandas DataFrame."""

    @component.output_types(data_frame=pd.DataFrame)
    def run(self, source: Union[str, Path]):
        """
        Reads the CSV file at `source`.

        Args:
            source: Path to the CSV file to load.

        Returns:
            dict: Dictionary containing the loaded DataFrame under 'data_frame' key.

        Raises:
            FileNotFoundError: If the file doesn't exist.
            ValueError: If no source is given, the file can't be parsed,
                or the DataFrame is empty after loading.
        """
        if not source:
            raise ValueError("No source provided")

        try:
            df = pd.read_csv(source)
        except FileNotFoundError as e:
            # Chain the original exception so the traceback is preserved
            raise FileNotFoundError(f"File not found at {source}") from e
        except Exception as e:
            raise ValueError(f"Error reading CSV file {source}: {e}") from e

        # Check if DataFrame is empty using proper pandas method
        if df.empty:
            raise ValueError(f"DataFrame is empty after loading from {source}")

        print(f"Loaded DataFrame with {len(df)} rows from {source}.")
        return {"data_frame": df}

## Component 2: RAG Data Augmentation 🔄

The **RAGDataAugmenterComponent** is the core of our evaluation workflow. It takes each query from our evaluation dataset and processes it through a RAG SuperComponent, collecting both the generated responses and retrieved contexts.

**🔑 Key Design Decisions:**

1. **SuperComponent Flexibility**: Accepts any pre-configured RAG SuperComponent (Naive, Hybrid, or custom)
2. **Batch Processing**: Runs every query in the evaluation dataset in one pass, sequentially, one query at a time
3. **Data Augmentation**: Enriches the original dataset with RAG outputs for evaluation
4. **Context Extraction**: Captures retrieved documents for context-based metrics

**Pipeline Integration:**
- **Input**: DataFrame with queries from CSVReaderComponent  
- **Process**: Runs each query through the specified RAG SuperComponent
- **Output**: Augmented DataFrame with responses and retrieved contexts

**💡 Why This Approach?**
By separating RAG execution from evaluation, we can:
- **Swap RAG systems** without changing evaluation logic
- **Cache RAG results** for multiple evaluation runs  
- **Debug RAG performance** independently of metrics calculation
- **Scale evaluation** across different datasets and configurations
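Caching is the benefit that most directly saves time and API cost. The following is a minimal sketch of the idea using only the standard library; `cached_rag_outputs`, `rag_fn`, and the JSON cache file are hypothetical names introduced for illustration and are not part of the notebook's components. It assumes `rag_fn` returns a JSON-serializable dictionary.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def cached_rag_outputs(
    queries: List[str],
    rag_fn: Callable[[str], Dict],
    cache_path: Path,
) -> Dict[str, Dict]:
    """Return RAG outputs per query, reusing results already saved in cache_path."""
    cache: Dict[str, Dict] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))

    for query in queries:
        if query not in cache:
            # Only queries missing from the cache trigger a (potentially costly) RAG call.
            cache[query] = rag_fn(query)

    # Persist the merged cache so later evaluation runs can skip the RAG step.
    cache_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache
```

With a cache like this, changing only the metric list would not require re-running retrieval and generation for queries that were already processed.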

In [2]:
from haystack import SuperComponent

@component
class RAGDataAugmenterComponent:
    """
    Applies a RAG SuperComponent to each query in a DataFrame and 
    augments the data with the generated answer and retrieved contexts.
    """

    def __init__(self, rag_supercomponent: SuperComponent):
        # We store the pre-initialized SuperComponent
        self.rag_supercomponent = rag_supercomponent
        self.output_names = ["augmented_data_frame"]

    @component.output_types(augmented_data_frame=pd.DataFrame)
    def run(self, data_frame: pd.DataFrame):
        
        # New columns to store RAG results
        answers: List[str] = []
        contexts: List[List[str]] = []

        print(f"Running RAG SuperComponent on {len(data_frame)} queries...")

        # Iterate through the queries (user_input column)
        for _, row in data_frame.iterrows():
            query = row["user_input"]
            
            # 1. Run the RAG SuperComponent
            # It expects 'query' as input and returns a dictionary.
            rag_output = self.rag_supercomponent.run(query=query)
            
            # 2. Extract answer and contexts
            # Based on the naive_rag_sc/hybrid_rag_sc structure:
            # Fall back to an empty answer if 'replies' is missing or an empty list
            answer = (rag_output.get('replies') or [''])[0]
            
            # Extract content from the Document objects
            retrieved_docs = rag_output.get('documents', [])
            retrieved_contexts = [doc.content for doc in retrieved_docs]
            
            answers.append(answer)
            contexts.append(retrieved_contexts)
        
        # 3. Augment a copy so the caller's DataFrame is not modified in place
        augmented = data_frame.copy()
        augmented['response'] = answers
        augmented['retrieved_contexts'] = contexts
        
        print("RAG processing complete.")
        return {"augmented_data_frame": augmented}
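The extraction step can be exercised without Elasticsearch or an LLM by substituting a stand-in object that honours the same `run(query=...)` contract. The sketch below is an assumption-laden illustration: `FakeDocument`, `FakeRAG`, and `extract_answer_and_contexts` are hypothetical names, and the stand-in only mimics the `replies` and `documents` keys that the component reads.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a Haystack Document (only .content is used)."""
    content: str


class FakeRAG:
    """Hypothetical stand-in exposing the same run(query=...) contract."""

    def run(self, query: str) -> Dict[str, Any]:
        return {
            "replies": [f"Echo: {query}"],
            "documents": [FakeDocument("context A"), FakeDocument("context B")],
        }


def extract_answer_and_contexts(rag_output: Dict[str, Any]) -> Tuple[str, List[str]]:
    """Mirror the extraction step: first reply plus the text of each document."""
    replies = rag_output.get("replies") or [""]
    documents = rag_output.get("documents") or []
    return replies[0], [doc.content for doc in documents]


answer, contexts = extract_answer_and_contexts(FakeRAG().run(query="What is AI?"))
print(answer)    # Echo: What is AI?
print(contexts)  # ['context A', 'context B']
```

A stand-in like this makes it cheap to check edge cases, such as an empty `replies` list, before spending API calls on a real run.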

## Component 3: RAGAS Evaluation Engine 📈

The **RagasEvaluationComponent** integrates the RAGAS framework into our Haystack pipeline, providing comprehensive evaluation metrics for RAG systems.

**🎯 Evaluation Metrics Included:**

| Metric | Purpose | What It Measures |
|--------|---------|------------------|
| **LLMContextRecall** | Retrieval Quality | How well retrieval captures relevant information |
| **Faithfulness** | Response Quality | Factual consistency with retrieved context |
| **FactualCorrectness** | Accuracy | Correctness of factual claims in responses |
| **ResponseRelevancy** | Relevance | How well responses answer the questions |
| **ContextEntityRecall** | Entity Coverage | Retrieval of important entities (people, places, dates) |
| **NoiseSensitivity** | Robustness | How often noisy or irrelevant context leads to incorrect claims (lower is better) |

**🔧 Technical Implementation:**
- **Configurable Metrics**: Choose which RAGAS metrics to compute
- **LLM Integration**: Uses OpenAI GPT models for evaluation judgments  
- **Data Format Handling**: Automatically formats data for RAGAS requirements
- **Comprehensive Output**: Returns both aggregated metrics and detailed per-query results

**💡 Design Philosophy:**
This component abstracts away the complexity of RAGAS integration, allowing you to focus on comparing RAG system performance rather than wrestling with evaluation setup.

In [3]:
import ast

from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
from ragas import EvaluationDataset, evaluate
from ragas.llms import HaystackLLMWrapper
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)

# Note: Ensure ragas and its dependencies (like litellm or openai) are installed
@component
class RagasEvaluationComponent:
    """
    Prepares data for Ragas, runs the evaluation, and returns the metrics.
    """
    
    def __init__(self, 
                 metrics: Optional[List[Any]] = None,
                 ragas_llm: Optional[Any] = None):
        
        # Metrics to compute; the value (including None) is passed to ragas.evaluate as-is
        self.metrics = metrics
        
        # Ragas needs an LLM to act as the judge for its LLM-based metrics.
        # A capable model such as gpt-4o-mini or gpt-4 is a sensible choice.
        if ragas_llm is None:
            # Assumes OPENAI_API_KEY is set in the environment
            self.ragas_llm = HaystackLLMWrapper(OpenAIGenerator(model="gpt-4o-mini",
                                                               api_key=Secret.from_env_var("OPENAI_API_KEY")))
        else:
            self.ragas_llm = ragas_llm

    @component.output_types(metrics=Dict[str, float], evaluation_df=pd.DataFrame)
    def run(self, augmented_data_frame: pd.DataFrame):
        
        # 1. Map columns to the field names Ragas expects for a SingleTurnSample
        ragas_data = pd.DataFrame({
            'user_input': augmented_data_frame['user_input'],
            'response': augmented_data_frame['response'], 
            'retrieved_contexts': augmented_data_frame['retrieved_contexts'],
            'reference': augmented_data_frame['reference'],
            # The CSV stores each list as a string; literal_eval parses it without executing code
            'reference_contexts': augmented_data_frame['reference_contexts'].apply(ast.literal_eval)
        })

        print("Creating Ragas EvaluationDataset...")
        # 2. Create EvaluationDataset using from_pandas which handles the format correctly
        dataset = EvaluationDataset.from_pandas(ragas_data)

        print("Starting Ragas evaluation...")
        
        # 3. Run Ragas Evaluation
        # Pass the configured LLM to Ragas
        results = evaluate(
            dataset=dataset,
            metrics=self.metrics,
            llm=self.ragas_llm
        )
        

        results_df = results.to_pandas()
        
        print("Ragas evaluation complete.")
        print(f"Overall metrics: {results}")
        
        return {"metrics": results, "evaluation_df": results_df}
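Because the CSV stores each `reference_contexts` list as its string representation, the column has to be parsed back into real lists before RAGAS can use it. A minimal sketch of doing that with `ast.literal_eval`, which accepts only Python literals and never executes code; `parse_context_list` is a hypothetical helper and the sample strings are made up for illustration.

```python
import ast
from typing import List


def parse_context_list(raw: str) -> List[str]:
    """Parse a stringified list of strings, rejecting anything else."""
    value = ast.literal_eval(raw)  # accepts literals only, never executes code
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError("expected a list of strings")
    return value


print(parse_context_list("['first context', 'second context']"))
# ['first context', 'second context']
```

Unlike the built-in `eval`, a cell containing an expression such as a function call is rejected with a `ValueError` instead of being run.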



## Experiment 1: Naive RAG Evaluation 🔬

---

# 🧪 Experimental Setup: Systematic RAG Evaluation

Now we'll put our custom components together to create a **reproducible evaluation workflow**. This section demonstrates how to systematically evaluate different RAG SuperComponents using the same evaluation pipeline.

## 🎯 Evaluation Strategy

Our approach enables **systematic comparison** of RAG systems:

1. **Consistent Evaluation**: Same metrics and datasets across all RAG variants
2. **Modular Design**: Easy to swap between Naive RAG, Hybrid RAG, or custom implementations  
3. **Reproducible Results**: Pipeline ensures identical evaluation conditions
4. **Scalable Assessment**: Process multiple datasets with different complexity levels

## 📊 Dataset Information

We'll use synthetic evaluation datasets of varying sizes:
- **`synthetic_tests_advanced_branching_3.csv`**: Small dataset with 3 test cases, useful for smoke tests
- **`synthetic_tests_advanced_branching_10.csv`**: Medium dataset with 10 test cases (used in this notebook)
- **`synthetic_tests_advanced_branching_50.csv`**: Large dataset with 50 test cases

**Dataset Structure:**
- `user_input`: Questions to ask the RAG system
- `reference`: Ground truth answers for comparison
- `reference_contexts`: Expected retrieved contexts
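Since the evaluation component reads these three columns by name, it can help to check that they exist before spending time on RAG calls. The following sketch uses only the standard library; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the in-memory sample stands in for a real file.

```python
import csv
import io
from typing import Iterable, List

# Column names taken from the dataset structure described above.
REQUIRED_COLUMNS = ("user_input", "reference", "reference_contexts")


def missing_columns(header: Iterable[str],
                    required: Iterable[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent from the CSV header."""
    present = set(header)
    return [name for name in required if name not in present]


# Illustrative header only; a real file would be opened with open(path, newline="").
sample = io.StringIO("user_input,reference\nWhat is AI?,A technology...\n")
header = next(csv.reader(sample))
print(missing_columns(header))  # ['reference_contexts']
```

An empty result means the file has every column the pipeline expects; extra columns are ignored.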

### Pipeline Configuration for Naive RAG

Here we configure our evaluation pipeline for the **Naive RAG SuperComponent**. This demonstrates the **modular design principle**: we can easily swap different RAG implementations while keeping the evaluation logic identical.

**🔧 Configuration Steps:**
1. **Select RAG SuperComponent**: Choose `naive_rag_sc` for this experiment
2. **Configure RAGAS Metrics**: Set up comprehensive evaluation criteria
3. **Instantiate Components**: Create our three pipeline components
4. **Connect Pipeline**: Define the data flow between components

**💡 Key Insight:** Notice how the `rag_sc_to_test` variable makes it trivial to switch between different RAG implementations. This is the power of modular pipeline design!

In [4]:
# --- Setup Environment & Dependencies ---
# You need to ensure:
# 1. Elasticsearch is running (as NaiveRAG/HybridRAG rely on it, see files).
# 2. OPENAI_API_KEY is set in your environment.
# 3. The document store has been indexed with your data.

# --- 1. Import RAG SuperComponents ---
# Assuming naiverag.py and hybridrag.py are in your environment
from scripts.rag.naiverag import naive_rag_sc
from scripts.rag.hybridrag import hybrid_rag_sc
from pathlib import Path

# --- 2. Define Configurations to Test ---

# The RAG SuperComponent to test (change this to swap RAG configurations)
rag_sc_to_test = naive_rag_sc # OR hybrid_rag_sc

# If you want to test different internal configurations (e.g., chunk size, embedder model), 
# you should create and index new SuperComponents with those changes 
# and then choose the appropriate object here.

# --- 3. Instantiate Custom Components ---

metrics = [
    LLMContextRecall(),
    Faithfulness(),
    FactualCorrectness(),
    ResponseRelevancy(),
    ContextEntityRecall(),
    NoiseSensitivity(),
]


reader = CSVReaderComponent()
augmenter = RAGDataAugmenterComponent(rag_supercomponent=rag_sc_to_test)
evaluator = RagasEvaluationComponent(metrics=metrics)

# --- 4. Build the Evaluation Pipeline ---

evaluation_pipeline = Pipeline()

evaluation_pipeline.add_component("reader", reader)
evaluation_pipeline.add_component("augmenter", augmenter)
evaluation_pipeline.add_component("evaluator", evaluator)

# Connect the flow: CSV -> Augment -> Evaluate
evaluation_pipeline.connect("reader.data_frame", "augmenter.data_frame")
evaluation_pipeline.connect("augmenter.augmented_data_frame", "evaluator.augmented_data_frame")



<haystack.core.pipeline.pipeline.Pipeline object at 0x30ee1d520>
🚅 Components
  - reader: CSVReaderComponent
  - augmenter: RAGDataAugmenterComponent
  - evaluator: RagasEvaluationComponent
🛤️ Connections
  - reader.data_frame -> augmenter.data_frame (DataFrame)
  - augmenter.augmented_data_frame -> evaluator.augmented_data_frame (DataFrame)

### Running the Evaluation Pipeline 🚀

Now we execute our configured pipeline on the evaluation dataset. The pipeline will:

1. **Load Data**: Read the CSV file containing evaluation queries
2. **Process Queries**: Run each query through the Naive RAG SuperComponent  
3. **Generate Metrics**: Calculate RAGAS evaluation scores
4. **Return Results**: Provide both detailed and summary metrics

**🔍 What to Observe:**
- Processing time for the dataset
- Console output showing pipeline progress  
- Any errors or warnings during evaluation
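Processing time is easier to compare across runs when it is recorded rather than eyeballed. A small sketch using a context manager around the run; `timed` and the `timings` dictionary are hypothetical names, and the `time.sleep` call is only a placeholder for the pipeline invocation.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so failed runs are still timed.
        timings[label] = time.perf_counter() - start


timings: Dict[str, float] = {}
with timed("evaluation_run", timings):
    time.sleep(0.01)  # placeholder for evaluation_pipeline.run(...)

print(f"evaluation_run took {timings['evaluation_run']:.2f}s")
```

Using one label per RAG variant would leave a dictionary of durations that can be compared alongside the metric scores.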

In [5]:

# --- 5. Run the Evaluation Pipeline ---
csv_file_path = "data_for_eval/synthetic_tests_advanced_branching_10.csv"
print(f"Starting evaluation of {rag_sc_to_test.__class__.__name__}...")

results = evaluation_pipeline.run({"reader": {"source": csv_file_path}})


Starting evaluation of SuperComponent...
Loaded DataFrame with 10 rows from data_for_eval/synthetic_tests_advanced_branching_10.csv.
Running RAG SuperComponent on 10 queries...





RAG processing complete.
Creating Ragas EvaluationDataset...
Starting Ragas evaluation...


LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.

Ragas evaluation complete.
Overall metrics: {'context_recall': 0.9667, 'faithfulness': 0.6410, 'factual_correctness(mode=f1)': 0.5170, 'answer_relevancy': 0.5775, 'context_entity_recall': 0.1532, 'noise_sensitivity(mode=relevant)': 0.1972}


### Analyzing Naive RAG Results 📊

Let's examine the detailed evaluation results from our Naive RAG system. The pipeline returns two key outputs:

1. **Detailed DataFrame**: Per-query metrics showing individual performance  
2. **Summary Metrics**: Aggregated scores across all evaluation queries

**📈 Result Interpretation Guide (rough rules of thumb):**
- **High scores (>0.8)**: Strong performance on this metric
- **Medium scores (0.5-0.8)**: Acceptable performance with room for improvement  
- **Low scores (<0.5)**: Area needing significant attention
- **Exception**: `NoiseSensitivity` is inverted, so lower values are better

**🔍 What to Look For:**
- Which metrics show the strongest performance?
- Are there specific queries where the system struggles?  
- What patterns emerge in the retrieved contexts?

In [6]:
# --- 6. Access Metrics ---
final_metrics = results
final_metrics['evaluator']['evaluation_df']

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,factual_correctness(mode=f1),answer_relevancy,context_entity_recall,noise_sensitivity(mode=relevant)
0,What are the ethical implications and concerns...,"[What is AI, how does it work and why are some...","[What is AI, how does it work and why are some...",The ethical implications and concerns surround...,"The rise of Meta AI, like other generative AI ...",1.0,0.818182,0.77,0.999999,0.05,0.181818
1,What is the estimated energy consumption of th...,[NBER WORKING PAPER SERIES\nHOW PEOPLE USE CHA...,[How does AI effect the environment?\nIt is no...,"According to the information provided, some re...",Some researchers estimate that the AI industry...,1.0,1.0,1.0,0.0,0.0,0.333333
2,Wut is the significanse of Artificial Intellig...,"[What is AI, how does it work and why are some...",[This article was published in 2018. To read m...,Artificial intelligence (AI) plays a significa...,Artificial Intelligence (AI) is a technology t...,1.0,1.0,0.6,0.946639,0.333333,0.615385
3,What does Figure 22 illustrate about the varia...,[•Sampled from all ChatGPT users:a random samp...,[<1-hop>\n\n37% of messages are work-related\n...,I don't have enough information to answer.,Figure 22 illustrates the variation in ChatGPT...,0.666667,0.0,0.0,0.0,0.2,0.0
4,What does Figure 22 show about how ChatGPT is ...,[•Sampled from all ChatGPT users:a random samp...,[<1-hop>\n\nPanel A.Work Related\n Panel B1.As...,I don't have enough information to answer.,Figure 22 illustrates the classification of wo...,1.0,0.0,0.0,0.0,0.0,0.0
5,How does ChatGPT Business usage vary by occupa...,[•Sampled from all ChatGPT users:a random samp...,[<1-hop>\n\nCorporate users may also use ChatG...,I don't have enough information to answer.,ChatGPT Business usage varies significantly by...,1.0,0.0,0.0,0.0,0.1,0.0
6,How does the environmental impact of artificia...,"[What is AI, how does it work and why are some...","[<1-hop>\n\nWhat is AI, how does it work and w...",The environmental impact of artificial intelli...,The environmental impact of artificial intelli...,1.0,0.692308,0.75,0.963276,0.111111,0.0
7,How do privacy protections and de-identificati...,[•Sampled from all ChatGPT users:a random samp...,[<1-hop>\n\nWe describe the contents of each d...,The analysis of ChatGPT user messages incorpor...,Privacy protections in the analysis of ChatGPT...,1.0,1.0,0.52,0.952696,0.333333,0.541667
8,What trends can be observed in user cohort ana...,[•Sampled from all ChatGPT users:a random samp...,[<1-hop>\n\nThe yellow line represents the fir...,In the user cohort analysis regarding ChatGPT ...,User cohort analysis reveals that there has be...,1.0,1.0,0.73,0.958646,0.153846,0.2
9,What are the environmental concerns related to...,"[What is AI, how does it work and why are some...","[<1-hop>\n\nWhat is AI, how does it work and w...",The environmental concerns related to artifici...,The environmental concerns related to artifici...,1.0,0.9,0.8,0.953338,0.25,0.1


In [7]:
final_metrics['evaluator']['metrics']

{'context_recall': 0.9667, 'faithfulness': 0.6410, 'factual_correctness(mode=f1)': 0.5170, 'answer_relevancy': 0.5775, 'context_entity_recall': 0.1532, 'noise_sensitivity(mode=relevant)': 0.1972}
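The summary scores can be cross-checked against the per-query table. The sketch below copies the `faithfulness` column from the Naive RAG table above and assumes the summary is a plain mean over queries, which matches the reported value. It also recomputes the mean without the three rows where the system answered "I don't have enough information to answer.", to show how much refusals pull the average down.

```python
from statistics import mean

# Per-query faithfulness values copied from the Naive RAG table above (rows 0-9).
faithfulness = [0.818182, 1.0, 1.0, 0.0, 0.0, 0.0, 0.692308, 1.0, 1.0, 0.9]

# Rows 3, 4 and 5 are the refusals ("I don't have enough information to answer.").
refused = [False, False, False, True, True, True, False, False, False, False]

overall = mean(faithfulness)
answered_only = mean(s for s, r in zip(faithfulness, refused) if not r)

print(f"all queries:   {overall:.4f}")        # all queries:   0.6410
print(f"answered only: {answered_only:.4f}")  # answered only: 0.9158
```

The gap between the two numbers suggests that, for this run, the low summary score is driven mostly by refusals rather than by unfaithful answers.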

---

## Experiment 2: Hybrid RAG Evaluation 🔬⚡

Now let's evaluate the **Hybrid RAG SuperComponent** using the exact same pipeline. This demonstrates the **power of modular evaluation**: we can systematically compare different RAG approaches with identical evaluation conditions.

**🔄 What Changes:**
- **RAG SuperComponent**: Switch from `naive_rag_sc` to `hybrid_rag_sc`  
- **Everything Else**: Identical pipeline, metrics, and dataset

**🎯 Expected Improvements:**
Hybrid RAG often performs better than a single-retriever setup because of:
- **Dense + Sparse Retrieval**: Combines semantic and keyword-based search
- **Enhanced Context Quality**: Better retrieval often leads to better responses
- **Improved Robustness**: Multiple retrieval methods reduce failure modes

**📊 Comparative Analysis:**
After running both experiments, you'll be able to directly compare:
- Which approach handles different query types better
- Performance differences across RAGAS metrics  
- Trade-offs between complexity and performance

### Configuring Pipeline for Hybrid RAG ⚙️

Notice how **minimal** the configuration change is. Only the SuperComponent assignment differs from the Naive RAG cell; the components and pipeline are then rebuilt exactly as before:

**🔄 Single Line Change**: 
```python
rag_sc_to_test = hybrid_rag_sc  # Previously: naive_rag_sc
```

**🏗️ Architecture Benefits:**
- **Consistency**: Same evaluation methodology across all RAG variants
- **Efficiency**: No need to rewrite evaluation logic  
- **Reliability**: Eliminates configuration differences that could skew results
- **Scalability**: Easy to add new RAG SuperComponents to comparison

In [8]:
rag_sc_to_test = hybrid_rag_sc
metrics = [
    LLMContextRecall(),
    Faithfulness(),
    FactualCorrectness(),
    ResponseRelevancy(),
    ContextEntityRecall(),
    NoiseSensitivity(),
]


reader = CSVReaderComponent()
augmenter = RAGDataAugmenterComponent(rag_supercomponent=rag_sc_to_test)
evaluator = RagasEvaluationComponent(metrics=metrics)

# --- 4. Build the Evaluation Pipeline ---

evaluation_pipeline = Pipeline()

evaluation_pipeline.add_component("reader", reader)
evaluation_pipeline.add_component("augmenter", augmenter)
evaluation_pipeline.add_component("evaluator", evaluator)

# Connect the flow: CSV -> Augment -> Evaluate
evaluation_pipeline.connect("reader.data_frame", "augmenter.data_frame")
evaluation_pipeline.connect("augmenter.augmented_data_frame", "evaluator.augmented_data_frame")

<haystack.core.pipeline.pipeline.Pipeline object at 0x3d229c200>
🚅 Components
  - reader: CSVReaderComponent
  - augmenter: RAGDataAugmenterComponent
  - evaluator: RagasEvaluationComponent
🛤️ Connections
  - reader.data_frame -> augmenter.data_frame (DataFrame)
  - augmenter.augmented_data_frame -> evaluator.augmented_data_frame (DataFrame)

### Executing Hybrid RAG Evaluation 🚀

Running the same evaluation pipeline with the Hybrid RAG SuperComponent. Compare the processing characteristics with the previous Naive RAG run:

**🔍 Observations to Make:**
- **Processing Time**: May be longer due to multiple retrieval methods
- **Console Output**: Look for differences in component execution
- **Error Patterns**: Note any changes in system robustness

In [9]:
# --- 5. Run the Evaluation Pipeline ---
csv_file_path = "data_for_eval/synthetic_tests_advanced_branching_10.csv"
print(f"Starting evaluation of {rag_sc_to_test.__class__.__name__}...")

results = evaluation_pipeline.run({"reader": {"source": csv_file_path}})


Starting evaluation of SuperComponent...
Loaded DataFrame with 10 rows from data_for_eval/synthetic_tests_advanced_branching_10.csv.
Running RAG SuperComponent on 10 queries...





RAG processing complete.
Creating Ragas EvaluationDataset...
Starting Ragas evaluation...


LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.

Ragas evaluation complete.
Overall metrics: {'context_recall': 1.0000, 'faithfulness': 0.7900, 'factual_correctness(mode=f1)': 0.5730, 'answer_relevancy': 0.6663, 'context_entity_recall': 0.2503, 'noise_sensitivity(mode=relevant)': 0.2537}


In [10]:
finally_metrics_hybrid = results

In [11]:
finally_metrics_hybrid['evaluator']['evaluation_df']

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,factual_correctness(mode=f1),answer_relevancy,context_entity_recall,noise_sensitivity(mode=relevant)
0,What are the ethical implications and concerns...,"[What is AI, how does it work and why are some...","[What is AI, how does it work and why are some...",I don't have enough information to answer.,"The rise of Meta AI, like other generative AI ...",1.0,0.0,0.0,0.0,0.1,0.0
1,What is the estimated energy consumption of th...,"[What is AI, how does it work and why are some...",[How does AI effect the environment?\nIt is no...,It is estimated that the AI industry as a whol...,Some researchers estimate that the AI industry...,1.0,1.0,1.0,0.0,0.5,0.0
2,Wut is the significanse of Artificial Intellig...,[This article was published in 2018. To read m...,[This article was published in 2018. To read m...,Artificial Intelligence (AI) holds significant...,Artificial Intelligence (AI) is a technology t...,1.0,1.0,0.55,0.933296,0.666667,0.642857
3,What does Figure 22 illustrate about the varia...,[The prompts for each of these automated class...,[<1-hop>\n\n37% of messages are work-related\n...,Figure 22 illustrates that there is a variatio...,Figure 22 illustrates the variation in ChatGPT...,1.0,0.9,0.75,0.964758,0.1,0.4
4,What does Figure 22 show about how ChatGPT is ...,[X’s indicate that the ranking is\nunavailable...,[<1-hop>\n\nPanel A.Work Related\n Panel B1.As...,Figure 22 presents data on how ChatGPT is used...,Figure 22 illustrates the classification of wo...,1.0,1.0,0.0,0.959466,0.176471,1.0
5,How does ChatGPT Business usage vary by occupa...,[The prompts for each of these automated class...,[<1-hop>\n\nCorporate users may also use ChatG...,I don't have enough information to answer.,ChatGPT Business usage varies significantly by...,1.0,0.0,0.0,0.0,0.111111,0.0
6,How does the environmental impact of artificia...,"[What is AI, how does it work and why are some...","[<1-hop>\n\nWhat is AI, how does it work and w...",The environmental impact of artificial intelli...,The environmental impact of artificial intelli...,1.0,1.0,0.93,0.949417,0.105263,0.0
7,How do privacy protections and de-identificati...,[•Sampled from all ChatGPT users:a random samp...,[<1-hop>\n\nWe describe the contents of each d...,The analysis of ChatGPT user messages employs ...,Privacy protections in the analysis of ChatGPT...,1.0,1.0,0.77,0.952696,0.333333,0.222222
8,What trends can be observed in user cohort ana...,[X’s indicate that the ranking is\nunavailable...,[<1-hop>\n\nThe yellow line represents the fir...,The user cohort analysis regarding ChatGPT que...,User cohort analysis reveals that there has be...,1.0,1.0,0.83,0.950398,0.076923,0.166667
9,What are the environmental concerns related to...,"[What is AI, how does it work and why are some...","[<1-hop>\n\nWhat is AI, how does it work and w...",The environmental concerns related to artifici...,The environmental concerns related to artifici...,1.0,1.0,0.9,0.953324,0.333333,0.105263


In [12]:
finally_metrics_hybrid['evaluator']['metrics']

{'context_recall': 1.0000, 'faithfulness': 0.7900, 'factual_correctness(mode=f1)': 0.5730, 'answer_relevancy': 0.6663, 'context_entity_recall': 0.2503, 'noise_sensitivity(mode=relevant)': 0.2537}

### Comparative Analysis: Hybrid vs Naive RAG 📊🔬

Naive: 

```python
{'context_recall': 0.9667,
'faithfulness': 0.6410, 
'factual_correctness(mode=f1)': 0.5170, 
'answer_relevancy': 0.5775, 
'context_entity_recall': 0.1532, 
'noise_sensitivity(mode=relevant)': 0.1972}
```

Hybrid with reranking:

```python
{'context_recall': 1.0000,
'faithfulness': 0.7900,
'factual_correctness(mode=f1)': 0.5730, 
'answer_relevancy': 0.6663, 
'context_entity_recall': 0.2503, 
'noise_sensitivity(mode=relevant)': 0.2537}
```

Now you have evaluation results from both RAG systems! Let's compare their performance across all RAGAS metrics.

**🔍 Comparison Framework:**

| Metric | Naive RAG Score | Hybrid RAG Score | Winner | Insights |
|--------|----------------|------------------|---------|----------|
| **LLMContextRecall** | 0.9667 | 1.0000 | Hybrid with reranking | Retrieval effectiveness |
| **Faithfulness** | 0.6410 | 0.7900 | Hybrid with reranking | Consistency with retrieved context |
| **FactualCorrectness** | 0.5170 | 0.5730 | Hybrid with reranking | Factual reliability |
| **ResponseRelevancy** | 0.5775 | 0.6663 | Hybrid with reranking | Answer relevance |
| **ContextEntityRecall** | 0.1532 | 0.2503 | Hybrid with reranking | Entity coverage |
| **NoiseSensitivity** | 0.1972 | 0.2537 | Naive | Robustness to noise (lower is better) |
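The comparison can also be computed rather than filled in by hand, which avoids transcription slips. A minimal sketch using the two summary dictionaries reported above; `compare` and `LOWER_IS_BETTER` are hypothetical names, and treating noise sensitivity as the only lower-is-better metric is an assumption stated in the code.

```python
from typing import Dict, List, Tuple

naive = {
    "context_recall": 0.9667,
    "faithfulness": 0.6410,
    "factual_correctness(mode=f1)": 0.5170,
    "answer_relevancy": 0.5775,
    "context_entity_recall": 0.1532,
    "noise_sensitivity(mode=relevant)": 0.1972,
}
hybrid = {
    "context_recall": 1.0000,
    "faithfulness": 0.7900,
    "factual_correctness(mode=f1)": 0.5730,
    "answer_relevancy": 0.6663,
    "context_entity_recall": 0.2503,
    "noise_sensitivity(mode=relevant)": 0.2537,
}

# Assumption: noise sensitivity is the only metric here where lower is better.
LOWER_IS_BETTER = {"noise_sensitivity(mode=relevant)"}


def compare(naive_scores: Dict[str, float],
            hybrid_scores: Dict[str, float]) -> List[Tuple[str, float, str]]:
    """Return (metric, hybrid minus naive, winner) for every metric."""
    rows = []
    for metric, naive_score in naive_scores.items():
        delta = round(hybrid_scores[metric] - naive_score, 4)
        if delta == 0:
            winner = "tie"
        else:
            hybrid_wins = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
            winner = "hybrid" if hybrid_wins else "naive"
        rows.append((metric, delta, winner))
    return rows


for metric, delta, winner in compare(naive, hybrid):
    print(f"{metric:35s} {delta:+.4f}  {winner}")
```

On these numbers the hybrid system wins five of the six metrics, and the naive system wins only on noise sensitivity.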

**💡 Analysis Questions:**
1. **Which system performs better overall?**
2. **Are there specific metrics where one system significantly outperforms?**  
3. **What trade-offs do you observe between the approaches?**
4. **How do the retrieved contexts differ between systems?**

**🎯 Next Steps:**
Based on these results, you can:
- **Choose the better performing system** for your use case
- **Identify areas for improvement** in both approaches  
- **Design hybrid approaches** that combine the best of both
- **Scale evaluation** to larger datasets for more robust conclusions

---

# 🎓 Summary: Reproducible RAG Evaluation Workflows

## 🎯 What You've Accomplished

Congratulations! You've successfully built and executed a **reproducible evaluation pipeline** for RAG systems using Haystack custom components. Here's what you've learned:

### ✅ **Technical Skills Developed:**
- **Modular Component Design**: Created reusable evaluation components
- **Pipeline Architecture**: Built scalable evaluation workflows  
- **RAGAS Integration**: Integrated comprehensive RAG metrics
- **Systematic Comparison**: Evaluated multiple RAG approaches consistently

### ✅ **Methodological Insights:**
- **Reproducibility**: Same evaluation conditions across all experiments
- **Modularity**: Easy to swap RAG systems and evaluation datasets
- **Scalability**: Pipeline handles datasets of varying sizes
- **Comprehensive Assessment**: Multiple metrics provide holistic view

## 🚀 Next Steps & Extensions

### **Immediate Applications:**
1. **Scale Up Evaluation**: Test with larger datasets (50+ queries)
2. **Add More RAG Variants**: Evaluate custom SuperComponents  
3. **Parameter Tuning**: Test different chunk sizes, embedding models
4. **Domain Testing**: Use domain-specific evaluation datasets

### **Advanced Extensions:**
1. **Automated Comparison**: Build comparison dashboards
2. **Statistical Significance**: Add significance testing between systems
3. **Cost Analysis**: Track API usage and processing time
4. **A/B Testing**: Deploy evaluation pipeline for production monitoring
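As an illustration of the significance-testing extension, the sketch below runs a paired bootstrap on the per-query `faithfulness` scores from the two result tables in this notebook. This is one possible technique chosen for illustration, not something the pipeline provides; `paired_bootstrap_ci`, the resample count, and the seed are all assumptions.

```python
import random
from typing import List, Tuple

# Per-query faithfulness copied from the Naive and Hybrid result tables (rows 0-9).
naive_faithfulness = [0.818182, 1.0, 1.0, 0.0, 0.0, 0.0, 0.692308, 1.0, 1.0, 0.9]
hybrid_faithfulness = [0.0, 1.0, 1.0, 0.9, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]


def paired_bootstrap_ci(
    a: List[float],
    b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed mean of b - a, lower bound, upper bound) of a percentile CI."""
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(a, b)]  # paired: same query, two systems
    n = len(diffs)
    # Resample the per-query differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


observed, low, high = paired_bootstrap_ci(naive_faithfulness, hybrid_faithfulness)
print(f"mean improvement: {observed:.4f}, 95% CI: [{low:.4f}, {high:.4f}]")
```

If the interval includes zero, the observed improvement could plausibly be noise. With only ten queries the interval is likely to be wide, which is a concrete reason to scale up to the 50-query dataset before drawing conclusions.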

## 💡 **Key Design Principles Learned:**

### **🔧 Modularity**
- Each component has a single, well-defined responsibility
- Easy to swap implementations without changing evaluation logic
- Components are reusable across different experiments

### **📊 Reproducibility**  
- Consistent evaluation conditions remove configuration differences as a source of bias
- Pipeline ensures identical processing for all RAG variants
- Results are directly comparable across systems

### **⚡ Scalability**
- Architecture handles small experiments and large-scale evaluation
- Easy to add new RAG systems or evaluation metrics  
- Pipeline can be deployed for automated monitoring

---

**🎉 Congratulations!** You now have a reusable evaluation pipeline for RAG applications. It can serve as the foundation for systematic RAG system development and optimization.