🔧 **Setup Required**: Before running this notebook, follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys. **Make sure you have run the indexing pipeline before starting this exercise.**

# Systematic RAG Evaluation: Naive vs Hybrid Comparison

## 📋 Overview

This notebook demonstrates a **systematic evaluation workflow** for comparing two RAG (Retrieval-Augmented Generation) approaches using **Haystack custom components**. We'll create a reproducible pipeline to:

1. **Load evaluation datasets** from CSV files
2. **Process queries** through both Naive and Hybrid RAG SuperComponents 
3. **Generate comprehensive metrics** using the RAGAS framework
4. **Compare performance** systematically

## 🎯 Learning Objectives

By the end of this notebook, you will understand how to:
- Create reusable evaluation components for RAG systems
- Build scalable pipelines for systematic RAG comparison
- Interpret RAGAS metrics in comparative context
- Make data-driven decisions between RAG approaches

## 🔧 Evaluation Pipeline

Our pipeline consists of three main components:

```
CSV Data → RAGDataAugmenter → RagasEvaluation → Metrics & Results
```

**Key Benefits:**
- **Systematic**: Same evaluation conditions for both RAG systems
- **Reproducible**: Consistent evaluation across experiments
- **Scalable**: Easy to add new RAG implementations
- **Comprehensive**: Multiple metrics provide complete assessment

---

## Component 1: CSV Data Loader 📊

The **CSVReaderComponent** serves as the entry point for our evaluation pipeline. It handles loading synthetic evaluation datasets and ensures data quality before processing.

**Key Features:**
- **Robust Error Handling**: Validates file existence and data integrity
- **Pandas Integration**: Returns data as DataFrame for easy manipulation
- **Pipeline Compatible**: Designed to work seamlessly with Haystack pipelines

**Input:** File path to CSV containing evaluation queries and ground truth
**Output:** Pandas DataFrame ready for RAG processing

In [1]:
import pandas as pd
from pathlib import Path
from haystack import component, Pipeline
from typing import List, Optional, Dict, Any, Union

@component
class CSVReaderComponent:
    """Reads a CSV file into a Pandas DataFrame."""

    @component.output_types(data_frame=pd.DataFrame)
    def run(self, source: Union[str, Path]):
        """
        Reads a single CSV file into a DataFrame.
        
        Args:
            source: Path to the CSV file to load.
            
        Returns:
            dict: Dictionary containing the loaded DataFrame under 'data_frame' key.
            
        Raises:
            FileNotFoundError: If the file doesn't exist.
            ValueError: If no source is given, the file can't be parsed,
                or the DataFrame is empty after loading.
        """
        if not source:
            raise ValueError("No source provided")

        try:
            df = pd.read_csv(source)
        except FileNotFoundError as e:
            # Re-raise with a clearer message while keeping the original traceback
            raise FileNotFoundError(f"File not found at {source}") from e
        except Exception as e:
            raise ValueError(f"Error reading CSV file {source}: {e}") from e

        # Check if DataFrame is empty using proper pandas method
        if df.empty:
            raise ValueError(f"DataFrame is empty after loading from {source}")

        print(f"Loaded DataFrame with {len(df)} rows from {source}.")
        return {"data_frame": df}

## Component 2: RAG Data Augmentation 🔄

The **RAGDataAugmenterComponent** is the core of our evaluation workflow. It takes each query from our evaluation dataset and processes it through a RAG SuperComponent, collecting both the generated responses and retrieved contexts.

**🔑 Key Design Decisions:**

1. **SuperComponent Flexibility**: Accepts any pre-configured RAG SuperComponent (Naive, Hybrid, or custom)
2. **Batch Processing**: Processes the entire evaluation dataset in a single call, running the queries one at a time
3. **Data Augmentation**: Enriches the original dataset with RAG outputs for evaluation
4. **Context Extraction**: Captures retrieved documents for context-based metrics

**Pipeline Integration:**
- **Input**: DataFrame with queries from CSVReaderComponent  
- **Process**: Runs each query through the specified RAG SuperComponent
- **Output**: Augmented DataFrame with responses and retrieved contexts

**💡 Why This Approach?**
By separating RAG execution from evaluation, we can:
- **Swap RAG systems** without changing evaluation logic
- **Cache RAG results** for multiple evaluation runs  
- **Debug RAG performance** independently of metrics calculation
- **Scale evaluation** across different datasets and configurations
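
To make the separation concrete, here is a minimal sketch that uses only the standard library. `StubRAG` and `augment` are hypothetical names introduced for illustration; the stub returns plain strings where the real SuperComponents return `Document` objects, and the only assumption carried over from this notebook is that a RAG system exposes `run(query=...)` and returns `replies` and `documents`.

```python
from typing import Any, Dict, List


class StubRAG:
    """Hypothetical stand-in for a RAG SuperComponent (not part of this notebook)."""

    def __init__(self, label: str):
        self.label = label

    def run(self, query: str) -> Dict[str, Any]:
        # Mimic the two output keys the augmenter reads: 'replies' and 'documents'.
        return {
            "replies": [f"[{self.label}] answer to: {query}"],
            "documents": [f"[{self.label}] context for: {query}"],
        }


def augment(rows: List[Dict[str, Any]], rag: Any) -> List[Dict[str, Any]]:
    """Run every query through `rag` and return new rows with its outputs attached."""
    augmented = []
    for row in rows:
        output = rag.run(query=row["user_input"])
        replies = output.get("replies") or [""]
        augmented.append({
            **row,
            "response": replies[0],
            "retrieved_contexts": list(output.get("documents", [])),
        })
    return augmented


rows = [{"user_input": "What is hybrid retrieval?", "reference": "..."}]

# The same augmentation logic serves both systems; only the injected object changes.
naive_rows = augment(rows, StubRAG("naive"))
hybrid_rows = augment(rows, StubRAG("hybrid"))
print(naive_rows[0]["response"])
print(hybrid_rows[0]["response"])
```

Because `augment` returns new rows and leaves its input untouched, the same source rows can be reused for every system under test, and the augmented rows can be stored and re-scored later without calling the RAG system again.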

In [2]:
from haystack import SuperComponent, super_component

@component
class RAGDataAugmenterComponent:
    """
    Applies a RAG SuperComponent to each query in a DataFrame and 
    augments the data with the generated answer and retrieved contexts.
    """

    def __init__(self, rag_supercomponent: SuperComponent):
        # We store the pre-initialized SuperComponent
        self.rag_supercomponent = rag_supercomponent
        self.output_names = ["augmented_data_frame"]

    @component.output_types(augmented_data_frame=pd.DataFrame)
    def run(self, data_frame: pd.DataFrame):
        
        # New columns to store RAG results
        answers: List[str] = []
        contexts: List[List[str]] = []

        print(f"Running RAG SuperComponent on {len(data_frame)} queries...")

        # Iterate through the queries (user_input column)
        for _, row in data_frame.iterrows():
            query = row["user_input"]
            
            # 1. Run the RAG SuperComponent
            # It expects 'query' as input and returns a dictionary.
            rag_output = self.rag_supercomponent.run(query=query)
            
            # 2. Extract answer and contexts
            # Based on the naive_rag_sc/hybrid_rag_sc structure.
            # Fall back to an empty answer when 'replies' is missing or empty.
            replies = rag_output.get('replies') or ['']
            answer = replies[0]
            
            # Extract content from the Document objects
            retrieved_docs = rag_output.get('documents', [])
            retrieved_contexts = [doc.content for doc in retrieved_docs]
            
            answers.append(answer)
            contexts.append(retrieved_contexts)
        
        # 3. Augment a copy so the caller's DataFrame is not mutated
        augmented_df = data_frame.copy()
        augmented_df['response'] = answers
        augmented_df['retrieved_contexts'] = contexts
        
        print("RAG processing complete.")
        return {"augmented_data_frame": augmented_df}

## Component 3: RAGAS Evaluation Engine 📈

The **RagasEvaluationComponent** integrates the RAGAS framework into our Haystack pipeline, providing systematic evaluation metrics for RAG systems.

**🎯 Core Evaluation Metrics:**

| Metric | Purpose | What It Measures |
|--------|---------|------------------|
| **Faithfulness** | Response Quality | Factual consistency with retrieved context |
| **ResponseRelevancy** | Relevance | How well responses answer the questions |
| **LLMContextRecall** | Retrieval Quality | How well retrieval captures relevant information |
| **FactualCorrectness** | Accuracy | Correctness of factual claims in responses |

**🔧 Technical Implementation:**
- **Focused Metrics**: Four core metrics keep the comparison easy to read
- **LLM Integration**: Uses the supplied Haystack generator (an OpenAI GPT model in this notebook) for evaluation judgments
- **Data Format Handling**: Maps the augmented DataFrame columns to the fields RAGAS expects
- **Comprehensive Output**: Returns both aggregated metrics and detailed per-query results
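
The `factual_correctness(mode=f1)` column that appears in the results is an F1-style score. As a rough intuition only, the sketch below computes F1 over two sets of claims using exact set membership. This is a deliberately simplified stand-in: RAGAS relies on an LLM to extract and verify claims, and `claim_f1` is a hypothetical helper, not a RAGAS function.

```python
def claim_f1(response_claims: set, reference_claims: set) -> float:
    """F1 over claim sets: a toy stand-in for LLM-based claim matching."""
    true_pos = len(response_claims & reference_claims)   # claims the reference supports
    false_pos = len(response_claims - reference_claims)  # claims the reference does not support
    false_neg = len(reference_claims - response_claims)  # reference claims the response missed
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical claims: the response gets two of four reference claims right
# and adds one unsupported claim.
score = claim_f1({"a", "b", "c"}, {"a", "b", "d", "e"})
print(round(score, 4))
```

A response can therefore score low either by asserting things the reference does not support (low precision) or by omitting things the reference contains (low recall).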

In [3]:
import os

from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
from ragas import EvaluationDataset, evaluate
from ragas.llms import HaystackLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy


@component
class RagasEvaluationComponent:
    """
    Integrates the RAGAS framework into Haystack pipeline for systematic evaluation.
    
    This component provides systematic evaluation metrics for RAG systems using
    the RAGAS framework with focus on core metrics for reliable comparison.
    
    Core Evaluation Metrics:
    - Faithfulness: Factual consistency with retrieved context
    - ResponseRelevancy: How well responses answer the questions  
    - LLMContextRecall: How well retrieval captures relevant information
    - FactualCorrectness: Correctness of factual claims in responses
    
    Technical Features:
    - Focused Metrics: Core metrics for reliable comparison
    - LLM Integration: Uses provided generator for evaluation judgments
    - Data Format Handling: Automatically formats data for RAGAS requirements
    - Comprehensive Output: Returns both aggregated metrics and detailed per-query results
    """
    
    def __init__(self, 
                 generator: Any,
                 metrics: Optional[List[Any]] = None):
        """
        Initialize the RAGAS Evaluation Component.
        
        Args:
            generator: LLM generator instance (e.g., OpenAIGenerator or OllamaGenerator).
            metrics: List of RAGAS metrics to evaluate (defaults to core metrics)
        """
        
        # Default to core metrics for systematic comparison
        if metrics is None:
            self.metrics = [
                Faithfulness(), 
                ResponseRelevancy(),
                LLMContextRecall(),
                FactualCorrectness()
            ]
        else:
            self.metrics = metrics
        
        # Configure RAGAS LLM for evaluation
        self.ragas_llm = HaystackLLMWrapper(generator)

    # 'metrics' carries the RAGAS result object returned by evaluate(), not a plain dict
    @component.output_types(metrics=Any, evaluation_df=pd.DataFrame)
    def run(self, augmented_data_frame: pd.DataFrame):
        """
        Run RAGAS evaluation on augmented dataset.
        
        Args:
            augmented_data_frame: DataFrame with RAG responses and retrieved contexts
            
        Returns:
            dict: 'metrics' holds the RAGAS result object (aggregated scores),
                'evaluation_df' holds the per-query results as a DataFrame
        """
        
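        # 1. Map columns to Ragas requirements
        # reference_contexts is stored in the CSV as the string form of a list.
        # ast.literal_eval parses literals only, so unlike eval it cannot run arbitrary code.
        import ast

        ragas_data = pd.DataFrame({
            'user_input': augmented_data_frame['user_input'],
            'response': augmented_data_frame['response'], 
            'retrieved_contexts': augmented_data_frame['retrieved_contexts'],
            'reference': augmented_data_frame['reference'],
            'reference_contexts': augmented_data_frame['reference_contexts'].apply(ast.literal_eval)
        })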

        print("Creating Ragas EvaluationDataset...")
        # 2. Create EvaluationDataset
        dataset = EvaluationDataset.from_pandas(ragas_data)

        print("Starting Ragas evaluation...")
        
        # 3. Run Ragas Evaluation
        results = evaluate(
            dataset=dataset,
            metrics=self.metrics,
            llm=self.ragas_llm
        )
        
        results_df = results.to_pandas()
        
        print("Ragas evaluation complete.")
        print(f"Overall metrics: {results}")
        
        return {"metrics": results, "evaluation_df": results_df}

---

# 🧪 Systematic RAG Comparison: Naive vs Hybrid

Now we'll systematically evaluate both RAG SuperComponents using the same evaluation pipeline. This ensures fair comparison with identical evaluation conditions.

## 🎯 Comparison Strategy

Our approach enables **systematic comparison**:

1. **Same Dataset**: Both systems evaluated on identical test queries
2. **Same Metrics**: Consistent evaluation criteria across both approaches
3. **Same Pipeline**: Identical processing workflow eliminates bias
4. **Reproducible Results**: Pipeline ensures consistent evaluation conditions

## 📊 Dataset Information

We'll use a synthetic evaluation dataset:
- **`synthetic_tests_advanced_branching_10.csv`**: 10-query dataset used for both evaluation runs

**Dataset Structure:**
- `user_input`: Questions to ask the RAG system
- `reference`: Ground truth answers for comparison
- `reference_contexts`: Expected retrieved contexts
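
One detail of this structure is easy to miss: a list stored in a CSV cell comes back as a string and has to be parsed before use. The sketch below shows the round trip with the standard library only. The row contents are hypothetical, and it assumes `reference_contexts` cells hold the string form of a Python list, which is what the parsing step in the evaluation component suggests.

```python
import ast
import csv
import io

# Hypothetical one-row dataset with the three columns listed above.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user_input", "reference", "reference_contexts"])
writer.writeheader()
writer.writerow({
    "user_input": "What is RAG?",
    "reference": "Retrieval-Augmented Generation.",
    # A Python list ends up in the CSV as its string representation.
    "reference_contexts": str(["RAG combines retrieval with generation.",
                               "It grounds answers in documents."]),
})

buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["reference_contexts"]).__name__)  # the cell is read back as a string

# literal_eval accepts Python literals only, so it cannot execute arbitrary code.
contexts = ast.literal_eval(row["reference_contexts"])
print(len(contexts))
```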

### Setup: Initialize Both RAG SuperComponents

First, we'll initialize both RAG SuperComponents with consistent parameters for fair comparison.

**🔧 Configuration:**
- **Same base parameters**: Both systems use identical core settings
- **Document store**: Shared Elasticsearch document store
- **Models**: Consistent LLM and embedding models for both systems

In [4]:
# --- Setup Environment & Dependencies ---
from scripts.rag.hybridrag import HybridRAGSuperComponent
from scripts.rag.naiverag import NaiveRAGSuperComponent
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
import os

# Initialize document store
document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

# Create both RAG SuperComponents with base parameters for fair comparison
naive_rag_sc = NaiveRAGSuperComponent(
    document_store=document_store
)

hybrid_rag_sc = HybridRAGSuperComponent(
    document_store=document_store
)

print("✅ Both RAG SuperComponents initialized successfully!")
print(f"📊 Naive RAG: {naive_rag_sc.__class__.__name__}")
print(f"📊 Hybrid RAG: {hybrid_rag_sc.__class__.__name__}")

✅ Both RAG SuperComponents initialized successfully!
📊 Naive RAG: NaiveRAGSuperComponent
📊 Hybrid RAG: HybridRAGSuperComponent


In [5]:
# --- Create RAG Evaluation SuperComponent for Systematic Comparison ---

@super_component
class RAGEvaluationSuperComponent:
    """
    Complete RAG evaluation pipeline for systematic comparison of RAG systems.
    
    This SuperComponent provides a systematic evaluation workflow for comparing
    RAG approaches with consistent evaluation conditions and comprehensive metrics.
    
    Pipeline Flow:
    CSV Data → RAGDataAugmenter → RagasEvaluation → Metrics & Results
    
    Key Benefits:
    - Systematic: Same evaluation conditions for all RAG systems
    - Reproducible: Consistent evaluation across experiments
    - Scalable: Easy to add new RAG implementations
    - Comprehensive: Multiple metrics provide complete assessment
    """
    
    def __init__(self, 
                 rag_supercomponent, 
                 system_name: str,
                 generator: Any,
                 openai_api_key: Optional[str] = None):
        """
        Initialize the RAG Evaluation SuperComponent.
        
        Args:
            rag_supercomponent: The RAG system to evaluate
            system_name (str): Name for logging and identification
            generator: LLM generator instance (e.g., OpenAIGenerator or OllamaGenerator).
            openai_api_key (Optional[str]): OpenAI API key. If None, will use environment variable.
        """
        self.rag_supercomponent = rag_supercomponent
        self.system_name = system_name
        self.generator = generator
        self.openai_api_key = openai_api_key or os.getenv('OPENAI_API_KEY')
        
        if not self.openai_api_key:
            raise ValueError("OpenAI API key not found. Please set OPENAI_API_KEY environment variable or pass openai_api_key parameter.")
        
        self._build_pipeline()
    
    def _build_pipeline(self):
        """Build the RAG evaluation pipeline with initialized components."""
        
        print(f"\n🔄 Building evaluation pipeline for {self.system_name}...")
        print("=" * 50)
        
        # --- 1. Initialize Evaluation Pipeline Components ---
        
        # CSV Reader: Loads evaluation dataset
        reader = CSVReaderComponent()
        
        # RAG Data Augmenter: Processes queries through the RAG system
        augmenter = RAGDataAugmenterComponent(rag_supercomponent=self.rag_supercomponent)
        
        # RAGAS Evaluator: Computes evaluation metrics
        evaluator = RagasEvaluationComponent(
            generator=self.generator
        )
        
        # --- 2. Build the Evaluation Pipeline ---
        self.pipeline = Pipeline()
        
        # Add all components to the pipeline
        self.pipeline.add_component("reader", reader)
        self.pipeline.add_component("augmenter", augmenter)
        self.pipeline.add_component("evaluator", evaluator)
        
        # --- 3. Connect the Components in a Graph ---
        
        # CSV Data -> RAG Augmentation -> RAGAS Evaluation
        self.pipeline.connect("reader.data_frame", "augmenter.data_frame")
        self.pipeline.connect("augmenter.augmented_data_frame", "evaluator.augmented_data_frame")
        
        # --- 4. Define Input and Output Mappings ---
        self.input_mapping = {
            "csv_source": ["reader.source"]
        }

        self.output_mapping = {
            "evaluator.metrics": "metrics",
            "evaluator.evaluation_df": "evaluation_df"
        }
        
        print(f"✅ Evaluation pipeline for {self.system_name} built successfully!")


## Experiment 1: Naive RAG Evaluation 🔬

Let's evaluate the Naive RAG SuperComponent first. This will establish our baseline performance metrics.

**🎯 What to Observe:**
- Processing time and efficiency
- Core metric scores (Faithfulness, Relevancy, Context Recall, Factual Correctness)
- Any errors or warnings during evaluation
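
Processing time is not reported by the pipeline itself, so it has to be measured around the call. A minimal sketch, assuming wall-clock time is sufficient: `timed` is a hypothetical helper, and the `time.sleep` line stands in for the evaluation call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, timings: dict):
    """Record wall-clock seconds for the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the duration even if the enclosed block raises.
        timings[label] = time.perf_counter() - start


timings = {}
with timed("Naive RAG", timings):
    time.sleep(0.01)  # placeholder for the evaluation call

print({name: round(seconds, 3) for name, seconds in timings.items()})
```

Collecting both runs in the same `timings` dictionary makes it straightforward to report duration next to the metric scores.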

In [6]:
# Create generator for evaluation
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
import os

# Secret.from_env_var resolves OPENAI_API_KEY when the key is needed
# and keeps the raw key out of any serialized pipeline
eval_generator = OpenAIGenerator(
    model="gpt-4o-mini",
    api_key=Secret.from_env_var("OPENAI_API_KEY")
)

# Evaluate Naive RAG SuperComponent
csv_file_path = "data_for_eval/synthetic_tests_advanced_branching_10.csv"

# Create evaluation SuperComponent (the CSV path is supplied later, when calling run)
evaluation_sc = RAGEvaluationSuperComponent(
    rag_supercomponent=naive_rag_sc,
    system_name="Naive RAG",
    generator=eval_generator
)

# Run evaluation with CSV source passed to run method
naive_results = evaluation_sc.run(csv_source=csv_file_path)

print("✅ Naive RAG evaluation complete!")


🔄 Building evaluation pipeline for Naive RAG...
✅ Evaluation pipeline for Naive RAG built successfully!
Loaded DataFrame with 10 rows from data_for_eval/synthetic_tests_advanced_branching_10.csv.
Running RAG SuperComponent on 10 queries...
RAG processing complete.
Creating Ragas EvaluationDataset...
Starting Ragas evaluation...


Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.

Ragas evaluation complete.
Overall metrics: {'faithfulness': 0.9167, 'answer_relevancy': 0.7699, 'context_recall': 0.8967, 'factual_correctness(mode=f1)': 0.5260}
✅ Naive RAG evaluation complete!


### Naive RAG Results Analysis 📊

Let's examine the detailed evaluation results from our Naive RAG system.

In [7]:
# Display Naive RAG detailed results
print("📊 Naive RAG - Detailed Results:")
naive_results['evaluation_df']

📊 Naive RAG - Detailed Results:


| | user_input | retrieved_contexts | reference_contexts | response | reference | faithfulness | answer_relevancy | context_recall | factual_correctness(mode=f1) |
|---|------------|--------------------|--------------------|----------|-----------|--------------|------------------|----------------|------------------------------|
| 0 | What Alexa do in AI? | [What is AI, how does it work and why are some... | [What is AI, how does it work and why are some... | Alexa, as a voice-controlled virtual assistant... | Alexa is a voice-controlled virtual assistant ... | 0.5 | 0.912168 | 1.0 | 0.33 |
| 1 | What happened to the UnitedHealthcare CEO Bria... | [- James Kurose, “Testimony before the House C... | [Why is AI controversial?\nWhile acknowledging... | I don't have enough information to answer. | The BBC reported that Apple's AI falsely told ... | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | What is the current stance of the UK governmen... | [What is AI, how does it work and why are some... | [Are there laws governing AI?\nSome government... | The UK government has stated that it will "tes... | In the UK, Prime Minister Sir Keir Starmer has... | 1.0 | 0.0 | 1.0 | 0.71 |
| 3 | How does the GDPR impact the transparency and ... | [Smart cities\nMetropolitan governments are us... | [<1-hop>\n\nAI ethics and transparency\nAlgori... | The GDPR (General Data Protection Regulation) ... | The GDPR impacts the transparency and accounta... | 0.833333 | 0.97165 | 1.0 | 0.37 |
| 4 | What are the trends in the usage of ChatGPT fo... | [The yellow line represents the first cohort o... | [<1-hop>\n\n37% of messages are work-related f... | The usage of ChatGPT for Technical Help has sh... | The usage of ChatGPT for Technical Help has sh... | 0.833333 | 0.972123 | 1.0 | 0.87 |
| 5 | What trends in user interactions and gender re... | [Overall, the majority of ChatGPT usage\nat wo... | [<1-hop>\n\nHowever, in the first half of 2025... | By June 2025, trends in user interactions with... | By June 2025, the share of active users with t... | 1.0 | 0.974209 | 0.666667 | 0.56 |
| 6 | What are the main conversation topics users en... | [The yellow line represents the first cohort o... | [<1-hop>\n\nFigure 9 disaggregates four of the... | The main conversation topics users engage with... | Users engage with ChatGPT primarily through co... | 1.0 | 0.954396 | 1.0 | 0.5 |
| 7 | How does the quality of interactions with Chat... | [Corporate\nusers may also use ChatGPT Busines... | [<1-hop>\n\nOverall, the majority of ChatGPT u... | The quality of interactions with ChatGPT at wo... | The quality of interactions with ChatGPT at wo... | 1.0 | 0.98584 | 0.5 | 0.56 |
| 8 | What are the trends in ChatGPT message counts ... | [Overall, the majority of ChatGPT usage\nat wo... | [<1-hop>\n\nTeams, Enterprise, Education), whi... | The trends in ChatGPT message counts highlight... | The trends in ChatGPT message counts indicate ... | 1.0 | 0.966834 | 0.8 | 0.46 |
| 9 | What are the implications of artificial intell... | [What is AI, how does it work and why are some... | [<1-hop>\n\nWhat is AI, how does it work and w... | Artificial intelligence (AI) has significant i... | The implications of artificial intelligence (A... | 1.0 | 0.961463 | 1.0 | 0.9 |


In [8]:
# Display Naive RAG summary metrics
print("📈 Naive RAG - Summary Metrics:")
naive_metrics = naive_results['metrics']
naive_metrics

📈 Naive RAG - Summary Metrics:


{'faithfulness': 0.9167, 'answer_relevancy': 0.7699, 'context_recall': 0.8967, 'factual_correctness(mode=f1)': 0.5260}
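
The summary values are consistent with an unweighted mean of the per-query scores. The sketch below checks this for three of the metrics, using the per-query numbers copied from the detailed Naive RAG results above; treating the summary as a plain mean is an assumption that these numbers support but the output does not state.

```python
from statistics import mean

# Per-query scores copied from the detailed Naive RAG results.
per_query = {
    "faithfulness": [0.5, 1.0, 1.0, 0.833333, 0.833333, 1.0, 1.0, 1.0, 1.0, 1.0],
    "context_recall": [1.0, 1.0, 1.0, 1.0, 1.0, 0.666667, 1.0, 0.5, 0.8, 1.0],
    "factual_correctness(mode=f1)": [0.33, 0.0, 0.71, 0.37, 0.87, 0.56, 0.5, 0.56, 0.46, 0.9],
}

# Round to four decimals to match the precision of the printed summary.
summary = {name: round(mean(scores), 4) for name, scores in per_query.items()}
print(summary)
```

With only ten queries, single rows move the averages noticeably: in the Naive RAG run, queries 1 and 2 both score 0.0 on `answer_relevancy`, which accounts for most of the gap between its 0.7699 average and the scores above 0.9 on the remaining rows.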

## Experiment 2: Hybrid RAG Evaluation 🔬⚡

Now let's evaluate the **Hybrid RAG SuperComponent** using the identical evaluation pipeline. This systematic approach ensures fair comparison.

**🔄 Key Benefits of This Approach:**
- **Identical Conditions**: Same pipeline, metrics, and dataset
- **Systematic Comparison**: Score differences reflect the RAG systems rather than the evaluation setup
- **Reproducible Results**: Consistent methodology across both systems

**🎯 Expected Improvements:**
Hybrid RAG typically shows better performance due to:
- **Dense + Sparse Retrieval**: Combines semantic and keyword-based search
- **Enhanced Context Quality**: Better retrieval often leads to better responses
- **Improved Robustness**: Multiple retrieval methods reduce failure modes
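
To illustrate how dense and sparse results can be combined, here is a minimal sketch of reciprocal rank fusion (RRF), one common merging strategy. Whether `HybridRAGSuperComponent` uses RRF is not stated in this notebook, so treat the technique, the document IDs, and the constant `k = 60` as illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document IDs; a higher fused score ranks first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical embedding-retriever ranking
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical keyword (BM25) ranking
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Documents found by both retrievers accumulate two contributions and rise to the top, while a document found by only one retriever still makes the merged list. That is the mechanism behind the robustness claim: a miss by one retriever can be covered by the other.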


In [12]:
# Evaluate Hybrid RAG SuperComponent

# Create a new generator for this evaluation (a component instance can't be shared between pipelines)
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# Secret.from_env_var resolves OPENAI_API_KEY when the key is needed
eval_generator_hybrid = OpenAIGenerator(
    model="gpt-4o-mini",
    api_key=Secret.from_env_var("OPENAI_API_KEY")
)

# Create evaluation SuperComponent (the CSV path is supplied later, when calling run)
hybrid_evaluation_sc = RAGEvaluationSuperComponent(
    rag_supercomponent=hybrid_rag_sc,
    system_name="Hybrid RAG",
    generator=eval_generator_hybrid
)

# Run evaluation with CSV source passed to run method
hybrid_results = hybrid_evaluation_sc.run(csv_source=csv_file_path)

print("✅ Hybrid RAG evaluation complete!")


🔄 Building evaluation pipeline for Hybrid RAG...
✅ Evaluation pipeline for Hybrid RAG built successfully!
Loaded DataFrame with 10 rows from data_for_eval/synthetic_tests_advanced_branching_10.csv.
Running RAG SuperComponent on 10 queries...
RAG processing complete.
Creating Ragas EvaluationDataset...
Starting Ragas evaluation...


Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.

Ragas evaluation complete.
Overall metrics: {'faithfulness': 0.8080, 'answer_relevancy': 0.8716, 'context_recall': 0.9467, 'factual_correctness(mode=f1)': 0.5720}
✅ Hybrid RAG evaluation complete!


In [13]:
# Display Hybrid RAG detailed results
print("📊 Hybrid RAG - Detailed Results:")
hybrid_results['evaluation_df']

📊 Hybrid RAG - Detailed Results:


| | user_input | retrieved_contexts | reference_contexts | response | reference | faithfulness | answer_relevancy | context_recall | factual_correctness(mode=f1) |
|---|------------|--------------------|--------------------|----------|-----------|--------------|------------------|----------------|------------------------------|
| 0 | What Alexa do in AI? | [What is AI, how does it work and why are some... | [What is AI, how does it work and why are some... | Amazon's Alexa is a voice-controlled virtual a... | Alexa is a voice-controlled virtual assistant ... | 0.5 | 0.908875 | 1.0 | 0.37 |
| 1 | What happened to the UnitedHealthcare CEO Bria... | [AI will reconfigure how society and the econo... | [Why is AI controversial?\nWhile acknowledging... | I don't have enough information to answer. | The BBC reported that Apple's AI falsely told ... | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | What is the current stance of the UK governmen... | [AI will reconfigure how society and the econo... | [Are there laws governing AI?\nSome government... | The current stance of the UK government regard... | In the UK, Prime Minister Sir Keir Starmer has... | 1.0 | 0.989034 | 1.0 | 0.5 |
| 3 | How does the GDPR impact the transparency and ... | [- James Kurose, “Testimony before the House C... | [<1-hop>\n\nAI ethics and transparency\nAlgori... | The GDPR (General Data Protection Regulation) ... | The GDPR impacts the transparency and accounta... | 0.714286 | 0.97165 | 1.0 | 0.48 |
| 4 | What are the trends in the usage of ChatGPT fo... | [37% of messages are work-related\nfor users w... | [<1-hop>\n\n37% of messages are work-related f... | The usage of ChatGPT for Technical Help has sh... | The usage of ChatGPT for Technical Help has sh... | 0.928571 | 0.973908 | 1.0 | 0.96 |
| 5 | What trends in user interactions and gender re... | [However, in the first half of 2025, we see th... | [<1-hop>\n\nHowever, in the first half of 2025... | By June 2025, trends in user interactions and ... | By June 2025, the share of active users with t... | 1.0 | 0.996117 | 0.666667 | 0.77 |
| 6 | What are the main conversation topics users en... | [Figure 9 disaggregates four of the seven Conv... | [<1-hop>\n\nFigure 9 disaggregates four of the... | The main conversation topics users engage with... | Users engage with ChatGPT primarily through co... | 1.0 | 0.954396 | 1.0 | 0.8 |
| 7 | How does the quality of interactions with Chat... | [Overall, the majority of ChatGPT usage\nat wo... | [<1-hop>\n\nOverall, the majority of ChatGPT u... | The quality of interactions with ChatGPT at wo... | The quality of interactions with ChatGPT at wo... | 1.0 | 0.993349 | 1.0 | 0.5 |
| 8 | What are the trends in ChatGPT message counts ... | [Teams, Enterprise, Education), which we do no... | [<1-hop>\n\nTeams, Enterprise, Education), whi... | ChatGPT message counts have shown a significan... | The trends in ChatGPT message counts indicate ... | 0.9375 | 0.966806 | 0.8 | 0.57 |
| 9 | What are the implications of artificial intell... | [What is AI, how does it work and why are some... | [<1-hop>\n\nWhat is AI, how does it work and w... | The implications of artificial intelligence (A... | The implications of artificial intelligence (A... | 1.0 | 0.9615 | 1.0 | 0.77 |


In [14]:
# Display Hybrid RAG summary metrics
print("📈 Hybrid RAG - Summary Metrics:")
hybrid_metrics = hybrid_results['metrics']
hybrid_metrics

📈 Hybrid RAG - Summary Metrics:


{'faithfulness': 0.8080, 'answer_relevancy': 0.8716, 'context_recall': 0.9467, 'factual_correctness(mode=f1)': 0.5720}

## 📊 Side-by-Side Performance Comparison

Let's create a comprehensive comparison of both RAG systems to understand their relative performance across all metrics.
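
The central quantity in such a comparison is the relative change of the Hybrid RAG score with respect to the Naive RAG baseline. A minimal sketch of that calculation follows, using the unrounded mean scores from the two runs in this notebook; `relative_change_pct` is a hypothetical helper name.

```python
def relative_change_pct(baseline: float, candidate: float) -> float:
    """Percentage change of `candidate` relative to `baseline`, rounded to two decimals."""
    if baseline == 0:
        raise ValueError("Relative change is undefined for a zero baseline")
    return round((candidate - baseline) / baseline * 100, 2)


# Unrounded mean scores from the Naive and Hybrid evaluation runs.
naive_means = {
    "faithfulness": 0.9166666666666667,
    "answer_relevancy": 0.769868266906707,
    "context_recall": 0.8966666666666665,
}
hybrid_means = {
    "faithfulness": 0.8080357142857142,
    "answer_relevancy": 0.8715633776901648,
    "context_recall": 0.9466666666666667,
}

for metric, baseline in naive_means.items():
    print(metric, relative_change_pct(baseline, hybrid_means[metric]))
```

A negative value means Hybrid RAG scored lower than the baseline on that metric. Relative change is undefined for a zero baseline, which is why the helper raises instead of dividing.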

In [15]:
# --- Create Side-by-Side Performance Comparison ---
import pandas as pd
import numpy as np

# Extract average metrics from both evaluations using proper RAGAS data structure methods
# The EvaluationResult object contains individual scores for each query
naive_scores = naive_metrics.scores
hybrid_scores = hybrid_metrics.scores

print("🔍 Processing individual query scores...")
print(f"Naive RAG: {len(naive_scores)} queries evaluated")
print(f"Hybrid RAG: {len(hybrid_scores)} queries evaluated")

# Compute averages for each metric using pandas for easier calculation
naive_df = pd.DataFrame(naive_scores)
hybrid_df = pd.DataFrame(hybrid_scores)

# Calculate mean values for each metric
naive_averages = naive_df.mean()
hybrid_averages = hybrid_df.mean()

print("\n📊 Average Scores Computed:")
print(f"Naive RAG averages: {dict(naive_averages)}")
print(f"Hybrid RAG averages: {dict(hybrid_averages)}")

# Create comparison DataFrame
comparison_data = {
    'Metric': list(naive_averages.index),
    'Naive RAG': list(naive_averages.values),
    'Hybrid RAG': list(hybrid_averages.values)
}

# Calculate improvement percentages
comparison_df = pd.DataFrame(comparison_data)
comparison_df['Improvement (%)'] = ((comparison_df['Hybrid RAG'] - comparison_df['Naive RAG']) / comparison_df['Naive RAG'] * 100).round(2)
comparison_df['Better System'] = comparison_df.apply(
    lambda row: '🏆 Hybrid RAG' if row['Hybrid RAG'] > row['Naive RAG']
    else '🏆 Naive RAG' if row['Naive RAG'] > row['Hybrid RAG']
    else '🤝 Tie', axis=1
)

# Round scores for better display
comparison_df['Naive RAG'] = comparison_df['Naive RAG'].round(4)
comparison_df['Hybrid RAG'] = comparison_df['Hybrid RAG'].round(4)

print("\n" + "=" * 80)
print("🔍 COMPREHENSIVE RAG SYSTEM COMPARISON")
print("=" * 80)
print(comparison_df.to_string(index=False))
print("\n" + "=" * 80)

# Calculate overall winner
hybrid_wins = sum(comparison_df['Hybrid RAG'] > comparison_df['Naive RAG'])
naive_wins = sum(comparison_df['Naive RAG'] > comparison_df['Hybrid RAG'])
ties = sum(comparison_df['Naive RAG'] == comparison_df['Hybrid RAG'])

print("\n🏁 FINAL SCORECARD:")
print(f"🏆 Hybrid RAG wins: {hybrid_wins} metrics")
print(f"🏆 Naive RAG wins: {naive_wins} metrics")
print(f"🤝 Ties: {ties} metrics")

if hybrid_wins > naive_wins:
    print("\n🎉 OVERALL WINNER: Hybrid RAG SuperComponent!")
    print(f"   Better performance in {hybrid_wins}/{len(comparison_df)} metrics")
elif naive_wins > hybrid_wins:
    print("\n🎉 OVERALL WINNER: Naive RAG SuperComponent!")
    print(f"   Better performance in {naive_wins}/{len(comparison_df)} metrics")
else:
    print("\n🤝 RESULT: It's a tie between both systems!")

avg_improvement = comparison_df['Improvement (%)'].mean()
print(f"\n📈 Average improvement by Hybrid RAG: {avg_improvement:.2f}%")

# Show standard deviations to understand score consistency
naive_stds = naive_df.std()
hybrid_stds = hybrid_df.std()

print("\n📊 Score Consistency (Standard Deviation):")
for metric in naive_averages.index:
    print(f"   {metric}:")
    print(f"     Naive RAG: {naive_stds[metric]:.4f}")
    print(f"     Hybrid RAG: {hybrid_stds[metric]:.4f}")
    stability_winner = "Hybrid RAG" if hybrid_stds[metric] < naive_stds[metric] else "Naive RAG"
    print(f"     More consistent: {stability_winner}")

🔍 Processing individual query scores...
Naive RAG: 10 queries evaluated
Hybrid RAG: 10 queries evaluated

📊 Average Scores Computed:
Naive RAG averages: {'faithfulness': np.float64(0.9166666666666667), 'answer_relevancy': np.float64(0.769868266906707), 'context_recall': np.float64(0.8966666666666665), 'factual_correctness(mode=f1)': np.float64(0.526)}
Hybrid RAG averages: {'faithfulness': np.float64(0.8080357142857142), 'answer_relevancy': np.float64(0.8715633776901648), 'context_recall': np.float64(0.9466666666666667), 'factual_correctness(mode=f1)': np.float64(0.5720000000000001)}

🔍 COMPREHENSIVE RAG SYSTEM COMPARISON
                      Metric  Naive RAG  Hybrid RAG  Improvement (%) Better System
                faithfulness     0.9167      0.8080           -11.85   🏆 Naive RAG
            answer_relevancy     0.7699      0.8716            13.21  🏆 Hybrid RAG
              context_recall     0.8967      0.9467             5.58  🏆 Hybrid RAG
factual_correctness(m