🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys. **You will need to ensure you've executed the Indexing pipeline before completing this exercise.**

# Systematic RAG Evaluation: Naive vs Hybrid Comparison

## Overview

This notebook demonstrates a **systematic evaluation workflow** for comparing two RAG (Retrieval-Augmented Generation) approaches using **Haystack custom components**. We'll create a reproducible pipeline to:

1. **Load evaluation datasets** from CSV files
2. **Process queries** through both Naive and Hybrid RAG SuperComponents 
3. **Generate comprehensive metrics** using the RAGAS framework
4. **Compare performance** systematically

## Learning Objectives

By the end of this notebook, you will understand how to:
- Create reusable evaluation components for RAG systems
- Build scalable pipelines for systematic RAG comparison
- Interpret RAGAS metrics in comparative context
- Make data-driven decisions between RAG approaches

## Evaluation Pipeline

Our pipeline consists of three main components:

```
CSV Data → RAGDataAugmenter → RagasEvaluation → Metrics & Results
```

**Key Benefits:**
- **Systematic**: Same evaluation conditions for both RAG systems
- **Reproducible**: Consistent evaluation across experiments
- **Scalable**: Easy to add new RAG implementations
- **Comprehensive**: Multiple metrics provide complete assessment

---

## Component 1: CSV Data Loader

The **CSVReaderComponent** serves as the entry point for our evaluation pipeline. It handles loading synthetic evaluation datasets and ensures data quality before processing.

**Key Features:**
- **Robust Error Handling**: Validates file existence and data integrity
- **Pandas Integration**: Returns data as DataFrame for easy manipulation
- **Pipeline Compatible**: Designed to work seamlessly with Haystack pipelines

**Input:** File path to CSV containing evaluation queries and ground truth
**Output:** Pandas DataFrame ready for RAG processing

In [1]:
import pandas as pd
from pathlib import Path
from haystack import component
from typing import Optional, Dict, Any, Union

@component
class CSVReaderComponent:
    """Reads a CSV file into a Pandas DataFrame."""

    @component.output_types(data_frame=pd.DataFrame)
    def run(self, source: Union[str, Path]):
        """
        Reads a single CSV file into a DataFrame.

        Args:
            source: Path to the CSV file to load.

        Returns:
            dict: Dictionary containing the loaded DataFrame under 'data_frame' key.

        Raises:
            FileNotFoundError: If the file doesn't exist.
            ValueError: If no source is given, the file can't be parsed,
                or the DataFrame is empty after loading.
        """
        if not source:
            raise ValueError("No source provided")

        try:
            df = pd.read_csv(source)
        except FileNotFoundError as e:
            # Chain the original exception so the full traceback is preserved
            raise FileNotFoundError(f"File not found at {source}") from e
        except Exception as e:
            raise ValueError(f"Error reading CSV file {source}: {e}") from e

        # Check if DataFrame is empty using proper pandas method
        if df.empty:
            raise ValueError(f"DataFrame is empty after loading from {source}")

        print(f"Loaded DataFrame with {len(df)} rows from {source}.")
        return {"data_frame": df}

## Component 2: Async RAG Data Augmentation

The **AsyncRAGDataAugmenterComponent** is the core of our evaluation workflow. It processes queries through a RAG SuperComponent with concurrent execution for optimal performance.

**Key Features:**

1. **Concurrent Processing**: Processes multiple queries in parallel batches
2. **SuperComponent Flexibility**: Accepts any pre-configured RAG SuperComponent (Naive, Hybrid, or custom)
3. **Configurable Batch Size**: Control concurrency based on API rate limits
4. **Progress Tracking**: Real-time visibility into processing status
5. **Error Handling**: Gracefully handles failures without stopping the entire evaluation

**Performance Benefits:**
- **Speed**: Up to N× faster than sequential processing (where N = batch_size)
- **Scalability**: Efficiently handles large evaluation datasets
- **Resource Optimization**: Maximizes API call efficiency

**Pipeline Integration:**
- **Input**: DataFrame with queries from CSVReaderComponent  
- **Process**: Runs queries through RAG SuperComponent in concurrent batches
- **Output**: Augmented DataFrame with responses and retrieved contexts

**Why Async?**
- **Faster Iteration**: Quickly evaluate large datasets
- **Better Resource Utilization**: Maximize throughput without overwhelming APIs
- **Production Ready**: Scalable approach for continuous evaluation
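
The batching pattern described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the notebook's component: `fake_rag` is a hypothetical stand-in for a blocking `rag_supercomponent.run` call, and `run_in_batches` is a name invented for this example.

```python
import asyncio
import time


def fake_rag(query: str) -> str:
    """Hypothetical stand-in for a blocking RAG call."""
    time.sleep(0.05)  # simulate network / model latency
    return f"answer to: {query}"


async def run_in_batches(queries: list[str], batch_size: int) -> list[str]:
    """Run queries in consecutive batches; calls within a batch run concurrently."""
    results: list[str] = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]
        # asyncio.to_thread moves each blocking call onto a worker thread;
        # gather returns results in the same order the awaitables were passed.
        batch_results = await asyncio.gather(
            *(asyncio.to_thread(fake_rag, q) for q in batch)
        )
        results.extend(batch_results)
    return results


queries = [f"q{i}" for i in range(6)]
# In a notebook cell, use `answers = await run_in_batches(queries, 3)` instead.
answers = asyncio.run(run_in_batches(queries, batch_size=3))
```

With `batch_size=3`, the six queries finish in roughly two call latencies instead of six. The speedup is bounded by `batch_size` and assumes the backend is not rate-limiting or serializing the calls.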

## Implementation: Concurrent Query Processing

Below is the implementation of the async component that enables concurrent query processing.

### ⚠️ Important Thread-Safety Considerations

**RAG SuperComponent Sharing:**
- The `AsyncRAGDataAugmenterComponent` accepts a single `rag_supercomponent` instance
- **Haystack SuperComponents are NOT guaranteed to be thread-safe**
- When using `batch_size > 1`, multiple queries may access the same component instance concurrently

**Current Safety Approach:**
- **Default `batch_size=1`**: Processes queries sequentially within each system to avoid conflicts
- **Concurrent System Evaluation**: Multiple RAG systems can still be evaluated in parallel
- **Trade-off**: Sacrifices within-system concurrency for stability

**For Higher Concurrency:**

If you need `batch_size > 1`, ensure:
1. Each RAG SuperComponent has **separate model instances** (not shared)
2. Test thoroughly for race conditions
3. Monitor for embedding model conflicts

**Alternative Approaches:**
```python
# Option 1: Create separate RAG instances per query (memory intensive)
# Option 2: Use a connection pool pattern with multiple RAG instances
# Option 3: Implement proper locking mechanisms (reduces concurrency benefits)
```

**Current Implementation:**
- Safe for concurrent multi-system evaluation
- Conservative within-system processing
- Prioritizes stability over maximum speed
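
Option 2 above (a pool of instances) could look roughly like the sketch below. It is an assumption-laden illustration rather than part of the notebook: `CountingComponent` is a hypothetical non-thread-safe component that records overlapping calls, and `ComponentPool` is an invented helper that lends each instance to one caller at a time.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class CountingComponent:
    """Hypothetical non-thread-safe component that detects overlapping calls."""

    def __init__(self) -> None:
        self.in_use = False
        self.overlaps = 0

    def run(self, query: str) -> str:
        if self.in_use:
            self.overlaps += 1  # two threads entered the same instance
        self.in_use = True
        result = query.upper()
        self.in_use = False
        return result


class ComponentPool:
    """Hands out each instance to at most one caller at a time."""

    def __init__(self, instances) -> None:
        self._free = queue.Queue()
        for inst in instances:
            self._free.put(inst)

    def run(self, query: str) -> str:
        inst = self._free.get()  # blocks until an instance is free
        try:
            return inst.run(query)
        finally:
            self._free.put(inst)  # always return the instance to the pool


instances = [CountingComponent() for _ in range(2)]
pool = ComponentPool(instances)
with ThreadPoolExecutor(max_workers=8) as executor:
    outputs = list(executor.map(pool.run, [f"query {i}" for i in range(20)]))
```

Concurrency is then capped by the number of pooled instances rather than by `batch_size`, which trades memory (one model set per instance) for isolation.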

In [2]:
import asyncio
from typing import List
from haystack import SuperComponent
import traceback


@component
class AsyncRAGDataAugmenterComponent:
    """
    Async version of RAGDataAugmenterComponent that processes queries concurrently.
    
    Key Improvements:
    - Processes multiple queries in parallel batches
    - Configurable concurrency limits to avoid rate limits
    - Significant performance improvement for large datasets
    - Progress tracking for long-running evaluations
    
    Thread-Safety Considerations:
    - Haystack SuperComponents are NOT guaranteed to be thread-safe
    - Default batch_size=1 to avoid concurrent access to shared component state
    - Each batch processes sequentially, but multiple systems can run in parallel
    - For batch_size > 1: Ensure each RAG system has separate model instances
    
    Performance Trade-offs:
    - batch_size=1: Safe, sequential processing within each system
    - batch_size>1: Faster, but requires thread-safe component design
    """

    def __init__(self, rag_supercomponent: SuperComponent, batch_size: int = 1):
        """
        Initialize the Async RAG Data Augmenter.
        
        Args:
            rag_supercomponent: Pre-initialized RAG SuperComponent
            batch_size (int): Number of queries to process concurrently. Defaults to 1.
                             
                             batch_size=1: Safe for all configurations (sequential)
                             batch_size>1: Only safe if rag_supercomponent is thread-safe
                                          and has no shared mutable state
        
        Warning:
            Increasing batch_size beyond 1 may cause race conditions if the
            rag_supercomponent shares embedding models or other stateful components
            across concurrent calls. Always test thoroughly with batch_size > 1.
        """
        self.rag_supercomponent = rag_supercomponent
        self.batch_size = batch_size
        self.output_names = ["augmented_data_frame"]
        
        # Warn if batch_size > 1
        if batch_size > 1:
            print(f"⚠️  Warning: batch_size={batch_size} may cause threading issues")
            print("   Ensure your RAG SuperComponent is thread-safe")
            print("   (separate embedding model instances, no shared state)")

    async def _process_single_query(self, query: str, index: int) -> tuple:
        """
        Process a single query through the RAG SuperComponent.
        
        Thread Safety:
        - Uses asyncio.to_thread() to run synchronous RAG code in a worker thread
        - With batch_size=1, each batch holds a single query and is awaited before
          the next one starts, so calls to the component are sequential
        - With batch_size>1, concurrent calls may access shared component state
        
        Args:
            query: The query string to process
            index: The query index for tracking
            
        Returns:
            tuple: (index, answer, contexts)
        """
        try:
            # Run the RAG SuperComponent in a thread pool
            # asyncio.to_thread automatically handles thread pool execution
            rag_output = await asyncio.to_thread(
                self.rag_supercomponent.run, 
                query=query
            )
            
            # Extract answer and contexts
            answer = rag_output.get('replies', [''])[0]
            retrieved_docs = rag_output.get('documents', [])
            retrieved_contexts = [doc.content for doc in retrieved_docs]
            
            return (index, answer, retrieved_contexts)
        except Exception as e:
            query_preview = query[:50] + ('...' if len(query) > 50 else '')
            print(f"Error processing query {index} ('{query_preview}'): {str(e)}")
            
            traceback.print_exc()
            return (index, "", [])

    async def _process_batch(self, queries_with_indices: List[tuple]) -> List[tuple]:
        """
        Process a batch of queries concurrently.
        
        Thread Safety Note:
        - When batch_size=1, only one query processes at a time (safe)
        - When batch_size>1, queries run concurrently (potential race conditions)
        
        Args:
            queries_with_indices: List of (index, query) tuples
            
        Returns:
            List of (index, answer, contexts) tuples
        """
        tasks = [
            self._process_single_query(query, idx) 
            for idx, query in queries_with_indices
        ]
        return await asyncio.gather(*tasks)

    @component.output_types(augmented_data_frame=pd.DataFrame)
    def run(self, data_frame: pd.DataFrame):
        """
        Process all queries in the DataFrame with concurrent execution.
        
        Data Safety:
        - Creates a copy of input DataFrame to avoid mutation
        - Results stored in new columns on the copied DataFrame
        - Original data_frame parameter remains unchanged
        
        Thread Safety:
        - Component instance shared across all batch calls
        - batch_size=1 ensures sequential processing (safe)
        - batch_size>1 requires thread-safe rag_supercomponent
        
        Args:
            data_frame: DataFrame with 'user_input' column containing queries
            
        Returns:
            dict: Dictionary with augmented DataFrame containing responses and contexts
        """
        async def _async_process():
            total_queries = len(data_frame)
            print(f"Running Async RAG on {total_queries} queries (batch size: {self.batch_size})...")
            
            # Prepare queries with their indices
            queries_with_indices = list(enumerate(data_frame["user_input"].tolist()))
            
            # Initialize results storage
            results = [None] * total_queries
            
            # Process in batches
            for batch_start in range(0, total_queries, self.batch_size):
                batch_end = min(batch_start + self.batch_size, total_queries)
                batch = queries_with_indices[batch_start:batch_end]
                
                print(f"Processing batch {batch_start//self.batch_size + 1} "
                      f"(queries {batch_start+1}-{batch_end} of {total_queries})...")
                
                # Process batch concurrently
                batch_results = await self._process_batch(batch)
                
                # Store results in correct order
                for idx, answer, contexts in batch_results:
                    results[idx] = (answer, contexts)
            
            # Extract answers and contexts from results
            answers = [r[0] for r in results]
            contexts = [r[1] for r in results]
            
            # Create a copy to avoid modifying the original DataFrame
            # This ensures the component doesn't have side effects on input data
            data_frame_copy = data_frame.copy()
            data_frame_copy['response'] = answers
            data_frame_copy['retrieved_contexts'] = contexts
            
            print(f"✓ Async RAG processing complete for {total_queries} queries!")
            return {"augmented_data_frame": data_frame_copy}
        
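        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No event loop is running in this thread, so we can start one
            return asyncio.run(_async_process())

        # A loop is already running (e.g. a direct call from a notebook cell).
        # Returning a Task here would break the declared output type, so run
        # the coroutine to completion in a worker thread that owns its own loop.
        from concurrent.futures import ThreadPoolExecutor
        with ThreadPoolExecutor(max_workers=1) as executor:
            return executor.submit(asyncio.run, _async_process()).result()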

## Component 3: RAGAS Evaluation Engine

The **RagasEvaluationComponent** integrates the RAGAS framework into our Haystack pipeline, providing systematic evaluation metrics for RAG systems.

**Core Evaluation Metrics:**

| Metric | Purpose | What It Measures |
|--------|---------|------------------|
| **Faithfulness** | Response Quality | Factual consistency with retrieved context |
| **ResponseRelevancy** | Relevance | How well responses answer the questions |
| **LLMContextRecall** | Retrieval Quality | How well retrieval captures relevant information |
| **FactualCorrectness** | Accuracy | Correctness of factual claims in responses |
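
As a rough aid for reading the Faithfulness column: RAGAS describes this score as the fraction of claims in the response that are supported by the retrieved context. The sketch below shows only that final ratio. Claim extraction and verification are done by the judge LLM inside RAGAS and are represented here by a hand-written list of hypothetical verdicts; `faithfulness_ratio` is a name made up for this example.

```python
def faithfulness_ratio(claim_supported: list[bool]) -> float:
    """Fraction of response claims judged as supported by the retrieved context."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for a response that makes four claims,
# three of which the judge found support for in the context.
verdicts = [True, True, False, True]
score = faithfulness_ratio(verdicts)
```

A response with few claims moves in large steps on this scale, which is worth keeping in mind when comparing per-query scores.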

**Technical Implementation:**
- **Focused Metrics**: Core metrics for reliable comparison
- **LLM Integration**: Uses the supplied Haystack generator (an OpenAI GPT model in this notebook) for evaluation judgments  
- **Data Format Handling**: Automatically formats data for RAGAS requirements
- **Comprehensive Output**: Returns both aggregated metrics and detailed per-query results
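
The data format handling mostly concerns list-valued columns: after a CSV round trip, `reference_contexts` arrives as the string form of a Python list, while RAGAS expects an actual list of strings. Here is a minimal sketch of that conversion, assuming the cells were written as Python list literals. The helper name `parse_context_cell` and the sample string are made up for illustration.

```python
import ast


def parse_context_cell(cell: str) -> list[str]:
    """Convert a stringified list such as "['a', 'b']" back into a list of strings."""
    value = ast.literal_eval(cell)  # parses literals only, never runs code
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"expected a list of strings, got: {cell!r}")
    return value


raw_cell = "['What is AI?', 'Are there laws governing AI?']"  # illustrative value
contexts = parse_context_cell(raw_cell)
```

`ast.literal_eval` accepts only Python literals, so a malformed or malicious cell raises an error instead of being executed the way it could be with `eval`.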

In [3]:
from ragas import EvaluationDataset, evaluate
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy
from ragas.llms import HaystackLLMWrapper


@component
class RagasEvaluationComponent:
    """
    Integrates the RAGAS framework into Haystack pipeline for systematic evaluation.
    
    This component provides systematic evaluation metrics for RAG systems using
    the RAGAS framework with focus on core metrics for reliable comparison.
    
    Core Evaluation Metrics:
    - Faithfulness: Factual consistency with retrieved context
    - ResponseRelevancy: How well responses answer the questions  
    - LLMContextRecall: How well retrieval captures relevant information
    - FactualCorrectness: Correctness of factual claims in responses
    
    Technical Features:
    - Focused Metrics: Core metrics for reliable comparison
    - LLM Integration: Uses provided generator for evaluation judgments
    - Data Format Handling: Automatically formats data for RAGAS requirements
    - Comprehensive Output: Returns both aggregated metrics and detailed per-query results
    """
    
    def __init__(self, 
                 generator: Any,
                 metrics: Optional[List[Any]] = None):
        """
        Initialize the RAGAS Evaluation Component.
        
        Args:
            generator: LLM generator instance (e.g., OpenAIGenerator or OllamaGenerator).
            metrics: List of RAGAS metrics to evaluate (defaults to core metrics)
        """
        
        # Default to core metrics for systematic comparison
        if metrics is None:
            self.metrics = [
                Faithfulness(), 
                ResponseRelevancy(),
                LLMContextRecall(),
                FactualCorrectness()
            ]
        else:
            self.metrics = metrics
        
        # Configure RAGAS LLM for evaluation
        self.ragas_llm = HaystackLLMWrapper(generator)

    @component.output_types(metrics=Dict[str, float], evaluation_df=pd.DataFrame)
    def run(self, augmented_data_frame: pd.DataFrame):
        """
        Run RAGAS evaluation on augmented dataset.
        
        Args:
            augmented_data_frame: DataFrame with RAG responses and retrieved contexts
            
        Returns:
            dict: Dictionary containing evaluation metrics and detailed results DataFrame
        """
        
        # 1. Map columns to Ragas requirements
        # reference_contexts is stored in the CSV as a stringified list;
        # ast.literal_eval parses it without executing arbitrary code (unlike eval)
        import ast
        ragas_data = pd.DataFrame({
            'user_input': augmented_data_frame['user_input'],
            'response': augmented_data_frame['response'], 
            'retrieved_contexts': augmented_data_frame['retrieved_contexts'],
            'reference': augmented_data_frame['reference'],
            'reference_contexts': augmented_data_frame['reference_contexts'].apply(ast.literal_eval)
        })

        print("Creating Ragas EvaluationDataset...")
        # 2. Create EvaluationDataset
        dataset = EvaluationDataset.from_pandas(ragas_data)

        print("Starting Ragas evaluation...")
        
        # 3. Run Ragas Evaluation
        results = evaluate(
            dataset=dataset,
            metrics=self.metrics,
            llm=self.ragas_llm
        )
        
        results_df = results.to_pandas()
        
        print("Ragas evaluation complete.")
        print(f"Overall metrics: {results}")
        
        return {"metrics": results, "evaluation_df": results_df}

---

# Systematic RAG Comparison: Naive vs Hybrid

Now we'll systematically evaluate both RAG SuperComponents using the same evaluation pipeline. This ensures fair comparison with identical evaluation conditions.

## Comparison Strategy

Our approach enables **systematic comparison**:

1. **Same Dataset**: Both systems evaluated on identical test queries
2. **Same Metrics**: Consistent evaluation criteria across both approaches
3. **Same Pipeline**: Identical processing workflow eliminates bias
4. **Reproducible Results**: Pipeline ensures consistent evaluation conditions

## Dataset Information

We'll use a synthetic evaluation dataset:
- **`synthetic_tests_advanced_branching_10.csv`**: Focused dataset for comparison

**Dataset Structure:**
- `user_input`: Questions to ask the RAG system
- `reference`: Ground truth answers for comparison
- `reference_contexts`: Expected retrieved contexts
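
Before running a long evaluation it can help to confirm that a dataset actually has these three columns. The following is a small standard-library sketch, not part of the notebook's pipeline: `missing_columns` is a hypothetical helper and the two-line CSV sample is invented for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = {"user_input", "reference", "reference_contexts"}


def missing_columns(csv_text: str) -> set[str]:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header


# Made-up sample; a real file would be read from disk instead.
sample = 'user_input,reference\n"What is AI?","A field of computer science."\n'
missing = missing_columns(sample)
```

An empty result means the file has everything the evaluation component reads; here the sample deliberately lacks `reference_contexts`.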

### Setup: Initialize Both RAG SuperComponents

First, we'll initialize both RAG SuperComponents with consistent parameters for fair comparison.

**Configuration:**
- **Same base parameters**: Both systems use identical core settings
- **Document store**: Shared Elasticsearch document store
- **Models**: Consistent LLM and embedding models for both systems

In [4]:
# --- Setup Environment & Dependencies ---
from scripts.rag.hybridrag import HybridRAGSuperComponent
from scripts.rag.naiverag import NaiveRAGSuperComponent
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
import os

# Initialize document store
document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

# Create both RAG SuperComponents with base parameters for fair comparison
naive_rag_sc = NaiveRAGSuperComponent(
    document_store=document_store
)

hybrid_rag_sc = HybridRAGSuperComponent(
    document_store=document_store
)

print("Both RAG SuperComponents initialized successfully!")
print(f"Naive RAG: {naive_rag_sc.__class__.__name__}")
print(f"Hybrid RAG: {hybrid_rag_sc.__class__.__name__}")

Both RAG SuperComponents initialized successfully!
Naive RAG: NaiveRAGSuperComponent
Hybrid RAG: HybridRAGSuperComponent


## Evaluation Pipeline: Using Haystack AsyncPipeline

Now we'll create an evaluation pipeline using Haystack's `AsyncPipeline` for proper concurrent execution. This provides a cleaner, more maintainable approach than manually coordinating async components.

In [None]:
from haystack import AsyncPipeline
import time

def create_evaluation_pipeline(
    rag_supercomponent,
    generator: Any,
    batch_size: int = 5
) -> AsyncPipeline:
    """
    Create an evaluation pipeline using Haystack's AsyncPipeline.
    
    AsyncPipeline handles concurrent execution of components automatically.
    Components themselves use synchronous run() methods, and AsyncPipeline
    orchestrates their async execution.
    
    Args:
        rag_supercomponent: The RAG system to evaluate
        generator: LLM generator for RAGAS evaluation
        batch_size: Number of queries to process concurrently
        
    Returns:
        AsyncPipeline: Configured evaluation pipeline
    """
    pipeline = AsyncPipeline()
    
    # Add components to pipeline
    pipeline.add_component("reader", CSVReaderComponent())
    pipeline.add_component(
        "augmenter",
        AsyncRAGDataAugmenterComponent(
            rag_supercomponent=rag_supercomponent,
            batch_size=batch_size
        )
    )
    pipeline.add_component("evaluator", RagasEvaluationComponent(generator=generator))
    
    # Connect components
    pipeline.connect("reader.data_frame", "augmenter.data_frame")
    pipeline.connect("augmenter.augmented_data_frame", "evaluator.augmented_data_frame")
    
    return pipeline


async def evaluate_rag_system_async(
    rag_supercomponent,
    system_name: str,
    csv_file_path: str,
    generator: Any,
    batch_size: int = 5
) -> dict:
    """
    Asynchronously evaluate a single RAG system using AsyncPipeline.
    
    Args:
        rag_supercomponent: The RAG system to evaluate
        system_name: Name for logging
        csv_file_path: Path to evaluation dataset
        generator: LLM generator for RAGAS evaluation
        batch_size: Number of queries to process concurrently
        
    Returns:
        dict: Evaluation results with metrics and detailed DataFrame
    """
    print(f"\n{'='*80}")
    print(f"Starting async evaluation of {system_name}")
    print(f"{'='*80}\n")
    
    # Create evaluation pipeline
    eval_pipeline = create_evaluation_pipeline(
        rag_supercomponent=rag_supercomponent,
        generator=generator,
        batch_size=batch_size
    )
    
    # Run pipeline asynchronously
    results = await eval_pipeline.run_async(data={"reader": {"source": csv_file_path}})
    
    print(f"\n✓ {system_name} evaluation complete!")
    return {
        "system_name": system_name,
        "metrics": results["evaluator"]["metrics"],
        "evaluation_df": results["evaluator"]["evaluation_df"]
    }


async def evaluate_multiple_rag_systems_async(
    rag_systems: List[tuple],
    csv_file_path: str,
    batch_size: int = 5
) -> List[dict]:
    """
    Evaluate multiple RAG systems concurrently using AsyncPipeline.
    
    Args:
        rag_systems: List of (rag_supercomponent, system_name, generator) tuples
        csv_file_path: Path to evaluation dataset
        batch_size: Number of queries to process concurrently per system
        
    Returns:
        List of evaluation results for each system
    """
    print(f"\n{'='*80}")
    print(f"CONCURRENT EVALUATION OF {len(rag_systems)} RAG SYSTEMS")
    print(f"{'='*80}")
    print(f"Dataset: {csv_file_path}")
    print(f"Batch size: {batch_size}")
    print(f"Systems: {[name for _, name, _ in rag_systems]}")
    print(f"{'='*80}\n")
    
    # Create evaluation tasks for each system
    tasks = [
        evaluate_rag_system_async(
            rag_supercomponent=rag_sc,
            system_name=name,
            csv_file_path=csv_file_path,
            generator=generator,
            batch_size=batch_size
        )
        for rag_sc, name, generator in rag_systems
    ]
    
    # Run all evaluations concurrently
    start_time = time.time()
    results = await asyncio.gather(*tasks)
    elapsed = time.time() - start_time
    
    print(f"\n{'='*80}")
    print(f"✓ ALL EVALUATIONS COMPLETE")
    print(f"⏱️  Total time: {elapsed:.2f} seconds")
    print(f"⚡ Average time per system: {elapsed/len(rag_systems):.2f} seconds")
    print(f"{'='*80}\n")
    
    return results

## Run Concurrent Evaluation

Now let's evaluate both Naive and Hybrid RAG systems **simultaneously** using concurrent evaluation.

**Performance Benefits:**
- **Up to 2× Faster**: Both systems are evaluated at the same time, so total time approaches that of the slower run
- **Identical Conditions**: Both evaluations start simultaneously
- **Efficient Resource Use**: Maximizes computational efficiency

**Configuration:**
- `batch_size`: Controls concurrent queries per system (adjust based on API limits)
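
The timing behaviour of concurrent evaluation can be illustrated in isolation. In this sketch `fake_evaluation` is a hypothetical stand-in for one system's full evaluation run, and the sleep durations are arbitrary. It only demonstrates how `asyncio.gather` overlaps waiting time, under the assumption that the work is I/O-bound.

```python
import asyncio
import time


async def fake_evaluation(name: str, seconds: float) -> dict:
    """Hypothetical stand-in for one system's evaluation run."""
    await asyncio.sleep(seconds)  # simulates I/O-bound waiting on API calls
    return {"system_name": name, "seconds": seconds}


async def evaluate_all() -> tuple[list[dict], float]:
    start = time.perf_counter()
    # gather starts both coroutines and returns results in argument order
    results = await asyncio.gather(
        fake_evaluation("Naive RAG", 0.2),
        fake_evaluation("Hybrid RAG", 0.3),
    )
    return list(results), time.perf_counter() - start


# In a notebook cell, use `results, elapsed = await evaluate_all()` instead.
results, elapsed = asyncio.run(evaluate_all())
```

The elapsed time lands near the slower run (0.3 s) rather than the sum (0.5 s). For the same reason, dividing total elapsed time by the number of systems understates how long each individual system took.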

In [6]:
# --- Setup for Concurrent Evaluation ---

# Create separate generators for each evaluation (components can't be shared)
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# Secret.from_env_var reads OPENAI_API_KEY from the environment and
# raises a clear error if the variable is unset
eval_generator_naive = OpenAIGenerator(
    model="gpt-4o-mini",
    api_key=Secret.from_env_var("OPENAI_API_KEY")
)

eval_generator_hybrid = OpenAIGenerator(
    model="gpt-4o-mini",
    api_key=Secret.from_env_var("OPENAI_API_KEY")
)

# Define evaluation dataset
csv_file_path = "data_for_eval/synthetic_tests_advanced_branching_10.csv"

# Configure concurrent evaluation
# batch_size: Number of queries to process in parallel per RAG system
# Set to 1 to avoid embedding model conflicts (still gets 2x speedup from parallel systems)
batch_size = 1  # Process 1 query at a time per system (safe for shared models)

# Prepare RAG systems for concurrent evaluation
rag_systems = [
    (naive_rag_sc, "Naive RAG", eval_generator_naive),
    (hybrid_rag_sc, "Hybrid RAG", eval_generator_hybrid)
]

# Run concurrent evaluation
all_results = await evaluate_multiple_rag_systems_async(
    rag_systems=rag_systems,
    csv_file_path=csv_file_path,
    batch_size=batch_size
)

# Extract results for each system
naive_results_async = all_results[0]
hybrid_results_async = all_results[1]

print("\n✓ Concurrent evaluation complete!")
print(f"Naive RAG: {len(naive_results_async['evaluation_df'])} queries evaluated")
print(f"Hybrid RAG: {len(hybrid_results_async['evaluation_df'])} queries evaluated")


🚀 CONCURRENT EVALUATION OF 2 RAG SYSTEMS
Dataset: data_for_eval/synthetic_tests_advanced_branching_10.csv
Batch size: 1
Systems: ['Naive RAG', 'Hybrid RAG']


Starting async evaluation of Naive RAG


Starting async evaluation of Hybrid RAG

Loaded DataFrame with 10 rows from data_for_eval/synthetic_tests_advanced_branching_10.csv.
Loaded DataFrame with 10 rows from data_for_eval/synthetic_tests_advanced_branching_10.csv.
Running Async RAG on 10 queries (batch size: 1)...
Processing batch 1 (queries 1-1 of 10)...
Running Async RAG on 10 queries (batch size: 1)...
Processing batch 1 (queries 1-1 of 10)...
Processing batch 2 (queries 2-2 of 10)...
Processing batch 2 (queries 2-2 of 10)...
Processing batch 3 (queries 3-3 of 10)...
Processing batch 3 (queries 3-3 of 10)...
Processing batch 2 (queries 2-2 of 10)...
Processing batch 2 (queries 2-2 of 10)...
Processing batch 3 (queries 3-3 of 10)...
Processing batch 3 (queries 3-3 of 10)...
Processing batch 4 (queries 4-4 of 10)...
Process

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.


✓ Async RAG processing complete for 10 queries!
Creating Ragas EvaluationDataset...
Starting Ragas evaluation...


Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.

Ragas evaluation complete.
Overall metrics: {'faithfulness': 0.8700, 'answer_relevancy': 0.8655, 'context_recall': 0.9600, 'factual_correctness(mode=f1)': 0.5730}

✓ Hybrid RAG evaluation complete!
Ragas evaluation complete.
Overall metrics: {'faithfulness': 0.8523, 'answer_relevancy': 0.7670, 'context_recall': 0.9300, 'factual_correctness(mode=f1)': 0.5510}

✓ Naive RAG evaluation complete!

✓ ALL EVALUATIONS COMPLETE
⏱️  Total time: 132.97 seconds
⚡ Average time per system: 66.49 seconds


✓ Concurrent evaluation complete!
Naive RAG: 10 queries evaluated
Hybrid RAG: 10 queries evaluated

### View Evaluation Results

Let's examine the results from our concurrent evaluation of both systems.

In [9]:
# Display results from concurrent evaluation
print("Naive RAG - Async Evaluation Results:")
print(f"Metrics: {naive_results_async['metrics']}\n")
display(naive_results_async['evaluation_df'].head())

print("\n" + "="*80 + "\n")

print("Hybrid RAG - Async Evaluation Results:")
print(f"Metrics: {hybrid_results_async['metrics']}\n")
display(hybrid_results_async['evaluation_df'].head())

Naive RAG - Async Evaluation Results:
Metrics: {'faithfulness': 0.8523, 'answer_relevancy': 0.7670, 'context_recall': 0.9300, 'factual_correctness(mode=f1)': 0.5510}



| Unnamed: 0 | user_input | retrieved_contexts | reference_contexts | response | reference | faithfulness | answer_relevancy | context_recall | factual_correctness(mode=f1) |
|------------|------------|--------------------|--------------------|----------|-----------|--------------|------------------|----------------|------------------------------|
| 0 | What Alexa do in AI? | [What is AI, how does it work and why are some... | [What is AI, how does it work and why are some... | Alexa, developed by Amazon, is an AI-powered v... | Alexa is a voice-controlled virtual assistant ... | 0.230769 | 0.917891 | 1.0 | 0.53 |
| 1 | What happened to the UnitedHealthcare CEO Bria... | [- James Kurose, “Testimony before the House C... | [Why is AI controversial?\nWhile acknowledging... | I don't have enough information to answer. | The BBC reported that Apple's AI falsely told ... | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | What is the current stance of the UK governmen... | [What is AI, how does it work and why are some... | [Are there laws governing AI?\nSome government... | The UK government's current stance on the regu... | In the UK, Prime Minister Sir Keir Starmer has... | 0.75 | 0.0 | 1.0 | 0.57 |
| 3 | How does the GDPR impact the transparency and ... | [Smart cities\nMetropolitan governments are us... | [<1-hop>\n\nAI ethics and transparency\nAlgori... | The General Data Protection Regulation (GDPR) ... | The GDPR impacts the transparency and accounta... | 1.0 | 0.965147 | 1.0 | 0.56 |
| 4 | What are the trends in the usage of ChatGPT fo... | [The yellow line represents the first cohort o... | [<1-hop>\n\n37% of messages are work-related f... | The usage of ChatGPT for Technical Help has se... | The usage of ChatGPT for Technical Help has sh... | 0.923077 | 0.946387 | 1.0 | 0.88 |




Hybrid RAG - Async Evaluation Results:
Metrics: {'faithfulness': 0.8700, 'answer_relevancy': 0.8655, 'context_recall': 0.9600, 'factual_correctness(mode=f1)': 0.5730}



| | user_input | retrieved_contexts | reference_contexts | response | reference | faithfulness | answer_relevancy | context_recall | factual_correctness(mode=f1) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What Alexa do in AI? | [What is AI, how does it work and why are some... | [What is AI, how does it work and why are some... | Alexa, as a voice-controlled virtual assistant... | Alexa is a voice-controlled virtual assistant ... | 1.0 | 0.909179 | 1.0 | 0.46 |
| 1 | What happened to the UnitedHealthcare CEO Bria... | [AI will reconfigure how society and the econo... | [Why is AI controversial?\nWhile acknowledging... | I don't have enough information to answer. | The BBC reported that Apple's AI falsely told ... | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | What is the current stance of the UK governmen... | [AI will reconfigure how society and the econo... | [Are there laws governing AI?\nSome government... | The UK government has stated that it will "tes... | In the UK, Prime Minister Sir Keir Starmer has... | 1.0 | 0.943726 | 1.0 | 0.58 |
| 3 | How does the GDPR impact the transparency and ... | [- James Kurose, “Testimony before the House C... | [<1-hop>\n\nAI ethics and transparency\nAlgori... | The GDPR (General Data Protection Regulation) ... | The GDPR impacts the transparency and accounta... | 1.0 | 0.97165 | 1.0 | 0.61 |
| 4 | What are the trends in the usage of ChatGPT fo... | [37% of messages are work-related\nfor users w... | [<1-hop>\n\n37% of messages are work-related f... | The trends in the usage of ChatGPT for Technic... | The usage of ChatGPT for Technical Help has sh... | 0.7 | 0.979072 | 0.8 | 0.82 |
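
The two previews list the same five queries in the same order, so the per-query scores can be subtracted row by row to see where the systems diverge. The following is a minimal sketch that hard-codes the scores from the five displayed rows (not the full 10-query run). The names `naive_head`, `hybrid_head`, and the 0.5 threshold are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Metric columns as they appear in the evaluation DataFrames above
metrics = [
    "faithfulness",
    "answer_relevancy",
    "context_recall",
    "factual_correctness(mode=f1)",
]

# Scores copied from the first five rows of each preview (hypothetical variable names)
naive_head = pd.DataFrame(
    [
        [0.230769, 0.917891, 1.0, 0.53],
        [1.0, 0.0, 1.0, 0.0],
        [0.75, 0.0, 1.0, 0.57],
        [1.0, 0.965147, 1.0, 0.56],
        [0.923077, 0.946387, 1.0, 0.88],
    ],
    columns=metrics,
)
hybrid_head = pd.DataFrame(
    [
        [1.0, 0.909179, 1.0, 0.46],
        [0.0, 0.0, 1.0, 0.0],
        [1.0, 0.943726, 1.0, 0.58],
        [1.0, 0.97165, 1.0, 0.61],
        [0.7, 0.979072, 0.8, 0.82],
    ],
    columns=metrics,
)

# Positive values mean Hybrid scored higher on that query
per_query_delta = (hybrid_head - naive_head).round(4)
print(per_query_delta)

# Queries where the systems differ by more than 0.5 on at least one metric
# (0.5 is an arbitrary illustrative threshold)
large_gap = per_query_delta[(per_query_delta.abs() > 0.5).any(axis=1)]
print("Queries with a large gap:", large_gap.index.tolist())
```

In the notebook itself, the same subtraction could presumably be applied to `hybrid_results_async['evaluation_df'][metrics]` and `naive_results_async['evaluation_df'][metrics]`, assuming both DataFrames keep the queries in the same order. Large swings on a single query, such as the faithfulness scores for query 1, are worth inspecting by hand before reading too much into the averages.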


## Side-by-Side Performance Comparison

The next cell builds a side-by-side table of the average score per metric for both RAG systems, the relative improvement of Hybrid over Naive, and which system scores higher on each metric.

In [10]:
# Build the comparison from the async evaluation results
import pandas as pd

# Extract average metrics from async evaluations
naive_scores_async = naive_results_async['metrics'].scores
hybrid_scores_async = hybrid_results_async['metrics'].scores

# Compute averages for each metric
naive_df_async = pd.DataFrame(naive_scores_async)
hybrid_df_async = pd.DataFrame(hybrid_scores_async)

naive_averages_async = naive_df_async.mean()
hybrid_averages_async = hybrid_df_async.mean()

# Create comparison DataFrame
comparison_data_async = {
    'Metric': list(naive_averages_async.index),
    'Naive RAG': list(naive_averages_async.values),
    'Hybrid RAG': list(hybrid_averages_async.values)
}

comparison_df_async = pd.DataFrame(comparison_data_async)
comparison_df_async['Improvement (%)'] = (
    (comparison_df_async['Hybrid RAG'] - comparison_df_async['Naive RAG']) / 
    comparison_df_async['Naive RAG'] * 100
).round(2)
comparison_df_async['Better System'] = comparison_df_async.apply(
    lambda row: 'Hybrid RAG' if row['Hybrid RAG'] > row['Naive RAG'] 
    else 'Naive RAG' if row['Naive RAG'] > row['Hybrid RAG'] 
    else 'Tie', axis=1
)

# Round scores
comparison_df_async['Naive RAG'] = comparison_df_async['Naive RAG'].round(4)
comparison_df_async['Hybrid RAG'] = comparison_df_async['Hybrid RAG'].round(4)

print("\n" + "=" * 80)
print("COMPREHENSIVE RAG SYSTEM COMPARISON (ASYNC EVALUATION)")
print("=" * 80)
print(comparison_df_async.to_string(index=False))
print("\n" + "=" * 80)

# Calculate overall winner from the 'Better System' column, which was computed
# before rounding, so the scorecard always agrees with the table above
hybrid_wins = int((comparison_df_async['Better System'] == 'Hybrid RAG').sum())
naive_wins = int((comparison_df_async['Better System'] == 'Naive RAG').sum())
ties = int((comparison_df_async['Better System'] == 'Tie').sum())

print(f"\nFINAL SCORECARD:")
print(f"Hybrid RAG wins: {hybrid_wins} metrics")
print(f"Naive RAG wins: {naive_wins} metrics") 
print(f"Ties: {ties} metrics")

if hybrid_wins > naive_wins:
    print(f"\nOVERALL WINNER: Hybrid RAG SuperComponent!")
    print(f"   Better performance in {hybrid_wins}/{len(comparison_df_async)} metrics")
elif naive_wins > hybrid_wins:
    print(f"\nOVERALL WINNER: Naive RAG SuperComponent!")  
    print(f"   Better performance in {naive_wins}/{len(comparison_df_async)} metrics")
else:
    print(f"\nRESULT: It's a tie between both systems!")
    
avg_improvement = comparison_df_async['Improvement (%)'].mean()
print(f"\nAverage improvement by Hybrid RAG: {avg_improvement:.2f}%")


COMPREHENSIVE RAG SYSTEM COMPARISON (ASYNC EVALUATION)
                      Metric  Naive RAG  Hybrid RAG  Improvement (%) Better System
                faithfulness     0.8523      0.8700             2.08    Hybrid RAG
            answer_relevancy     0.7670      0.8655            12.85    Hybrid RAG
              context_recall     0.9300      0.9600             3.23    Hybrid RAG
factual_correctness(mode=f1)     0.5510      0.5730             3.99    Hybrid RAG


FINAL SCORECARD:
Hybrid RAG wins: 4 metrics
Naive RAG wins: 0 metrics
Ties: 0 metrics

OVERALL WINNER: Hybrid RAG SuperComponent!
   Better performance in 4/4 metrics

Average improvement by Hybrid RAG: 5.54%
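
The `Improvement (%)` column is the relative change of Hybrid over Naive: `(hybrid - naive) / naive * 100`. As a quick sanity check, the sketch below recomputes it from the rounded averages printed in the table. The helper `relative_improvement` and the dictionaries are hypothetical names introduced for illustration; the zero-baseline guard is an added assumption that the notebook cell does not include.

```python
# Averages copied from the comparison table above (already rounded to 4 decimals)
naive = {
    "faithfulness": 0.8523,
    "answer_relevancy": 0.7670,
    "context_recall": 0.9300,
    "factual_correctness(mode=f1)": 0.5510,
}
hybrid = {
    "faithfulness": 0.8700,
    "answer_relevancy": 0.8655,
    "context_recall": 0.9600,
    "factual_correctness(mode=f1)": 0.5730,
}


def relative_improvement(baseline, candidate):
    """Percent change of candidate over baseline; None when the baseline is 0."""
    if baseline == 0:
        return None  # relative change is undefined for a zero baseline
    return round((candidate - baseline) / baseline * 100, 2)


improvements = {m: relative_improvement(naive[m], hybrid[m]) for m in naive}
absolute_gains = {m: round(hybrid[m] - naive[m], 4) for m in naive}

for metric in naive:
    print(f"{metric}: {improvements[metric]}% relative, "
          f"{absolute_gains[metric]:+.4f} absolute")
```

With these rounded inputs, `answer_relevancy` comes out at 12.84 rather than the 12.85 shown in the table, because the notebook computes the percentage before rounding the averages. Printing the absolute gain next to the relative one also helps keep small differences in perspective, since every score here is an average over only 10 queries.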
