🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys. **You will need to ensure you've executed the Indexing pipeline before completing this exercise.**

# Systematic RAG Evaluation: Naive vs Hybrid Comparison

## Overview

This notebook demonstrates a **systematic evaluation workflow** for comparing two RAG (Retrieval-Augmented Generation) approaches using **Haystack custom components**. We'll create a reproducible pipeline to:

1. **Load evaluation datasets** from CSV files
2. **Process queries** through both Naive and Hybrid RAG SuperComponents 
3. **Generate comprehensive metrics** using the RAGAS framework
4. **Compare performance** systematically

## Learning Objectives

By the end of this notebook, you will understand how to:
- Create reusable evaluation components for RAG systems
- Build scalable pipelines for systematic RAG comparison
- Interpret RAGAS metrics in comparative context
- Make data-driven decisions between RAG approaches

## Evaluation Pipeline

Our pipeline consists of three main components:

```
CSV Data → RAGDataAugmenter → RagasEvaluation → Metrics & Results
```

**Key Benefits:**
- **Systematic**: Same evaluation conditions for both RAG systems
- **Reproducible**: Consistent evaluation across experiments
- **Scalable**: Easy to add new RAG implementations
- **Comprehensive**: Multiple metrics provide complete assessment

---

## Component 1: CSV Data Loader

The **CSVReaderComponent** serves as the entry point for our evaluation pipeline. It handles loading synthetic evaluation datasets and ensures data quality before processing.

**Key Features:**
- **Robust Error Handling**: Validates file existence and data integrity
- **Pandas Integration**: Returns data as DataFrame for easy manipulation
- **Pipeline Compatible**: Designed to work seamlessly with Haystack pipelines

**Input:** File path to CSV containing evaluation queries and ground truth
**Output:** Pandas DataFrame ready for RAG processing

In [None]:
import pandas as pd
from pathlib import Path
from haystack import component, Pipeline
from typing import List, Optional, Dict, Any, Union

@component
class CSVReaderComponent:
    """Reads a CSV file into a Pandas DataFrame."""

    @component.output_types(data_frame=pd.DataFrame)
    def run(self, source: Union[str, Path]):
        """
        Reads the CSV file at the given path.
        
        Args:
            source: Path to the CSV file to load.
            
        Returns:
            dict: Dictionary containing the loaded DataFrame under 'data_frame' key.
            
        Raises:
            FileNotFoundError: If the file doesn't exist.
            ValueError: If no source is given, the file can't be parsed, or the DataFrame is empty.
        """
        if not source:
            raise ValueError("No source provided")
            

        try:
            df = pd.read_csv(source)
        except FileNotFoundError as e:
            # Chain the original exception so the full traceback is preserved
            raise FileNotFoundError(f"File not found at {source}") from e
        except Exception as e:
            raise ValueError(f"Error reading CSV file {source}: {e}") from e

        # Check if DataFrame is empty using proper pandas method
        if df.empty:
            raise ValueError(f"DataFrame is empty after loading from {source}")

        print(f"Loaded DataFrame with {len(df)} rows from {source}.")
        return {"data_frame": df}

## Component 2: Async RAG Data Augmentation

The **AsyncRAGDataAugmenterComponent** is the core of our evaluation workflow. It processes queries through a RAG SuperComponent with concurrent execution for optimal performance.

**Key Features:**

1. **Concurrent Processing**: Processes multiple queries in parallel batches
2. **SuperComponent Flexibility**: Accepts any pre-configured RAG SuperComponent (Naive, Hybrid, or custom)
3. **Configurable Batch Size**: Control concurrency based on API rate limits
4. **Progress Tracking**: Real-time visibility into processing status
5. **Error Handling**: Gracefully handles failures without stopping the entire evaluation

**Performance Benefits:**
- **Speed**: Up to N× faster than sequential processing (where N = batch_size)
- **Scalability**: Efficiently handles large evaluation datasets
- **Resource Optimization**: Maximizes API call efficiency

**Pipeline Integration:**
- **Input**: DataFrame with queries from CSVReaderComponent  
- **Process**: Runs queries through RAG SuperComponent in concurrent batches
- **Output**: Augmented DataFrame with responses and retrieved contexts

**Why Async?**
- **Faster Iteration**: Quickly evaluate large datasets
- **Better Resource Utilization**: Maximize throughput without overwhelming APIs
- **Production Ready**: Scalable approach for continuous evaluation
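
The batching idea described above can be sketched with the standard library alone. This is a minimal illustration, not the component itself: `fake_rag_run` is a hypothetical stand-in for a blocking `rag_supercomponent.run(query=...)` call, and `asyncio.to_thread` is used on the assumption that the real call is synchronous and has to be moved off the event loop for the queries in a batch to overlap.

```python
import asyncio
import time


def fake_rag_run(query: str) -> dict:
    """Hypothetical stand-in for a blocking RAG call."""
    time.sleep(0.05)  # simulate network / LLM latency
    return {"replies": [f"answer to {query}"], "documents": []}


async def process_in_batches(queries, batch_size=5):
    answers = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]
        # Each blocking call runs in a worker thread, so the calls
        # within one batch overlap instead of running back to back.
        outputs = await asyncio.gather(
            *(asyncio.to_thread(fake_rag_run, q) for q in batch)
        )
        # gather returns results in the order the awaitables were passed in
        answers.extend(out["replies"][0] for out in outputs)
    return answers


queries = [f"q{i}" for i in range(10)]
answers = asyncio.run(process_in_batches(queries, batch_size=5))
print(answers[:2])
```

Inside a notebook, where an event loop is already running, `await process_in_batches(queries)` would replace the `asyncio.run(...)` call.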

## Implementation: Concurrent Query Processing

Below is the implementation of the async component that enables concurrent query processing.

In [None]:
import asyncio
from typing import List
from haystack import SuperComponent

@component
class AsyncRAGDataAugmenterComponent:
    """
    Async version of RAGDataAugmenterComponent that processes queries concurrently.
    
    Key Improvements:
    - Processes multiple queries in parallel batches
    - Configurable concurrency limits to avoid rate limits
    - Significant performance improvement for large datasets
    - Progress tracking for long-running evaluations
    """

    def __init__(self, rag_supercomponent: SuperComponent, batch_size: int = 5):
        """
        Initialize the Async RAG Data Augmenter.
        
        Args:
            rag_supercomponent: Pre-initialized RAG SuperComponent
            batch_size (int): Number of queries to process concurrently. Defaults to 5.
                             Adjust based on API rate limits and system resources.
        """
        self.rag_supercomponent = rag_supercomponent
        self.batch_size = batch_size
        self.output_names = ["augmented_data_frame"]

    async def _process_single_query(self, query: str, index: int) -> tuple:
        """
        Process a single query through the RAG SuperComponent.
        
        Args:
            query: The query string to process
            index: The query index for tracking
            
        Returns:
            tuple: (index, answer, contexts)
        """
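        try:
            # Run the blocking SuperComponent call in a worker thread so that
            # queries in the same batch overlap instead of executing one by one
            rag_output = await asyncio.to_thread(self.rag_supercomponent.run, query=query)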
            
            # Extract answer and contexts
            answer = rag_output.get('replies', [''])[0]
            retrieved_docs = rag_output.get('documents', [])
            retrieved_contexts = [doc.content for doc in retrieved_docs]
            
            return (index, answer, retrieved_contexts)
        except Exception as e:
            print(f"Error processing query {index}: {str(e)}")
            return (index, "", [])

    async def _process_batch(self, queries_with_indices: List[tuple]) -> List[tuple]:
        """
        Process a batch of queries concurrently.
        
        Args:
            queries_with_indices: List of (index, query) tuples
            
        Returns:
            List of (index, answer, contexts) tuples
        """
        tasks = [
            self._process_single_query(query, idx) 
            for idx, query in queries_with_indices
        ]
        return await asyncio.gather(*tasks)

    @component.output_types(augmented_data_frame=pd.DataFrame)
    async def run(self, data_frame: pd.DataFrame):
        """
        Process all queries in the DataFrame with concurrent execution.
        
        Args:
            data_frame: DataFrame with 'user_input' column containing queries
            
        Returns:
            dict: Dictionary with augmented DataFrame containing responses and contexts
        """
        total_queries = len(data_frame)
        print(f"Running Async RAG on {total_queries} queries (batch size: {self.batch_size})...")
        
        # Prepare queries with their indices
        queries_with_indices = list(enumerate(data_frame["user_input"].tolist()))
        
        # Initialize results storage
        results = [None] * total_queries
        
        # Process in batches
        for batch_start in range(0, total_queries, self.batch_size):
            batch_end = min(batch_start + self.batch_size, total_queries)
            batch = queries_with_indices[batch_start:batch_end]
            
            print(f"Processing batch {batch_start//self.batch_size + 1} "
                  f"(queries {batch_start+1}-{batch_end} of {total_queries})...")
            
            # Process batch concurrently
            batch_results = await self._process_batch(batch)
            
            # Store results in correct order
            for idx, answer, contexts in batch_results:
                results[idx] = (answer, contexts)
        
        # Extract answers and contexts from results
        answers = [r[0] for r in results]
        contexts = [r[1] for r in results]
        
        # Augment the DataFrame
        data_frame = data_frame.copy()
        data_frame['response'] = answers
        data_frame['retrieved_contexts'] = contexts
        
        print(f"✓ Async RAG processing complete for {total_queries} queries!")
        return {"augmented_data_frame": data_frame}

## Component 3: RAGAS Evaluation Engine

The **RagasEvaluationComponent** integrates the RAGAS framework into our Haystack pipeline, providing systematic evaluation metrics for RAG systems.

**Core Evaluation Metrics:**

| Metric | Purpose | What It Measures |
|--------|---------|------------------|
| **Faithfulness** | Response Quality | Factual consistency with retrieved context |
| **ResponseRelevancy** | Relevance | How well responses answer the questions |
| **LLMContextRecall** | Retrieval Quality | How well retrieval captures relevant information |
| **FactualCorrectness** | Accuracy | Correctness of factual claims in responses |

**Technical Implementation:**
- **Focused Metrics**: Core metrics for reliable comparison
- **LLM Integration**: Uses OpenAI GPT models for evaluation judgments  
- **Data Format Handling**: Automatically formats data for RAGAS requirements
- **Comprehensive Output**: Returns both aggregated metrics and detailed per-query results
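
To make the first and third metrics more concrete, here is a simplified sketch of the ratios behind them: faithfulness is the share of claims in the response that the retrieved context supports, and context recall is the share of claims in the reference answer that can be found in the retrieved context. RAGAS uses an LLM to extract and judge the claims; the verdict lists below are hypothetical and hand-labelled purely for illustration.

```python
from typing import List


def fraction_true(verdicts: List[bool]) -> float:
    """Share of True verdicts; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Hypothetical verdicts (RAGAS would obtain these from an LLM judge).
# One entry per claim extracted from the generated response:
response_claims_supported = [True, True, True, False]
# One entry per claim in the reference answer:
reference_claims_retrieved = [True, True, True, True, False]

faithfulness = fraction_true(response_claims_supported)
context_recall = fraction_true(reference_claims_retrieved)
print(f"faithfulness={faithfulness:.2f}, context_recall={context_recall:.2f}")
```

Both scores fall between 0 and 1, which is why they can be averaged per metric and compared across systems.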

In [None]:
from ragas import EvaluationDataset, evaluate
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy

from ragas.llms import llm_factory
from haystack.utils import Secret
import os
from ragas.llms import HaystackLLMWrapper
from haystack.components.generators import OpenAIGenerator


@component
class RagasEvaluationComponent:
    """
    Integrates the RAGAS framework into Haystack pipeline for systematic evaluation.
    
    This component provides systematic evaluation metrics for RAG systems using
    the RAGAS framework with focus on core metrics for reliable comparison.
    
    Core Evaluation Metrics:
    - Faithfulness: Factual consistency with retrieved context
    - ResponseRelevancy: How well responses answer the questions  
    - LLMContextRecall: How well retrieval captures relevant information
    - FactualCorrectness: Correctness of factual claims in responses
    
    Technical Features:
    - Focused Metrics: Core metrics for reliable comparison
    - LLM Integration: Uses provided generator for evaluation judgments
    - Data Format Handling: Automatically formats data for RAGAS requirements
    - Comprehensive Output: Returns both aggregated metrics and detailed per-query results
    """
    
    def __init__(self, 
                 generator: Any,
                 metrics: Optional[List[Any]] = None):
        """
        Initialize the RAGAS Evaluation Component.
        
        Args:
            generator: LLM generator instance (e.g., OpenAIGenerator or OllamaGenerator).
            metrics: List of RAGAS metrics to evaluate (defaults to core metrics)
        """
        
        # Default to core metrics for systematic comparison
        if metrics is None:
            self.metrics = [
                Faithfulness(), 
                ResponseRelevancy(),
                LLMContextRecall(),
                FactualCorrectness()
            ]
        else:
            self.metrics = metrics
        
        # Configure RAGAS LLM for evaluation
        self.ragas_llm = HaystackLLMWrapper(generator)

    # 'metrics' carries the RAGAS result object returned by evaluate(), not a plain dict
    @component.output_types(metrics=Any, evaluation_df=pd.DataFrame)
    def run(self, augmented_data_frame: pd.DataFrame):
        """
        Run RAGAS evaluation on augmented dataset.
        
        Args:
            augmented_data_frame: DataFrame with RAG responses and retrieved contexts
            
        Returns:
            dict: Dictionary containing evaluation metrics and detailed results DataFrame
        """
        
        # 1. Map columns to Ragas requirements.
        # reference_contexts is read from the CSV as a stringified list; parse it with
        # ast.literal_eval, which only accepts Python literals (unlike eval).
        import ast

        ragas_data = pd.DataFrame({
            'user_input': augmented_data_frame['user_input'],
            'response': augmented_data_frame['response'], 
            'retrieved_contexts': augmented_data_frame['retrieved_contexts'],
            'reference': augmented_data_frame['reference'],
            'reference_contexts': augmented_data_frame['reference_contexts'].apply(ast.literal_eval)
        })

        print("Creating Ragas EvaluationDataset...")
        # 2. Create EvaluationDataset
        dataset = EvaluationDataset.from_pandas(ragas_data)

        print("Starting Ragas evaluation...")
        
        # 3. Run Ragas Evaluation
        results = evaluate(
            dataset=dataset,
            metrics=self.metrics,
            llm=self.ragas_llm
        )
        
        results_df = results.to_pandas()
        
        print("Ragas evaluation complete.")
        print(f"Overall metrics: {results}")
        
        return {"metrics": results, "evaluation_df": results_df}

---

# Systematic RAG Comparison: Naive vs Hybrid

Now we'll systematically evaluate both RAG SuperComponents using the same evaluation pipeline. This ensures fair comparison with identical evaluation conditions.

## Comparison Strategy

Our approach enables **systematic comparison**:

1. **Same Dataset**: Both systems evaluated on identical test queries
2. **Same Metrics**: Consistent evaluation criteria across both approaches
3. **Same Pipeline**: Identical processing workflow eliminates bias
4. **Reproducible Results**: Pipeline ensures consistent evaluation conditions

## Dataset Information

We'll use a synthetic evaluation dataset:
- **`synthetic_tests_advanced_branching_50.csv`** (in `data_for_eval/`): Focused dataset for comparison

**Dataset Structure:**
- `user_input`: Questions to ask the RAG system
- `reference`: Ground truth answers for comparison
- `reference_contexts`: Expected retrieved contexts
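
As a rough illustration of this structure, the sketch below builds a one-row dataset in memory and parses it. The row content is hypothetical; the only assumption taken from the pipeline is that `reference_contexts` is stored as a stringified Python list, which therefore has to be converted back into a real list before evaluation.

```python
import ast
import csv
import io

# Hypothetical one-row dataset; a real file contains many rows.
csv_text = (
    "user_input,reference,reference_contexts\n"
    '"Example question?","Example reference answer.",'
    "\"['Example context passage.']\"\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # literal_eval only accepts Python literals, so it is safer than eval
    row["reference_contexts"] = ast.literal_eval(row["reference_contexts"])

print(rows[0]["user_input"], rows[0]["reference_contexts"])
```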

### Setup: Initialize Both RAG SuperComponents

First, we'll initialize both RAG SuperComponents with consistent parameters for fair comparison.

**Configuration:**
- **Same base parameters**: Both systems use identical core settings
- **Document store**: Shared Elasticsearch document store
- **Models**: Consistent LLM and embedding models for both systems

In [None]:
# --- Setup Environment & Dependencies ---
from scripts.rag.hybridrag import HybridRAGSuperComponent
from scripts.rag.naiverag import NaiveRAGSuperComponent
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
import os

# Initialize document store
document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

# Create both RAG SuperComponents with base parameters for fair comparison
naive_rag_sc = NaiveRAGSuperComponent(
    document_store=document_store
)

hybrid_rag_sc = HybridRAGSuperComponent(
    document_store=document_store
)

print("Both RAG SuperComponents initialized successfully!")
print(f"Naive RAG: {naive_rag_sc.__class__.__name__}")
print(f"Hybrid RAG: {hybrid_rag_sc.__class__.__name__}")

## Evaluation Pipeline: Using Haystack AsyncPipeline

Now we'll create an evaluation pipeline using Haystack's `AsyncPipeline` for proper concurrent execution. This provides a cleaner, more maintainable approach than manually coordinating async components.

In [None]:
from haystack import AsyncPipeline

def create_evaluation_pipeline(
    rag_supercomponent,
    generator: Any,
    batch_size: int = 5
) -> AsyncPipeline:
    """
    Create an evaluation pipeline using Haystack's AsyncPipeline.
    
    Args:
        rag_supercomponent: The RAG system to evaluate
        generator: LLM generator for RAGAS evaluation
        batch_size: Number of queries to process concurrently
        
    Returns:
        AsyncPipeline: Configured evaluation pipeline
    """
    pipeline = AsyncPipeline()
    
    # Add components
    pipeline.add_component("reader", CSVReaderComponent())
    pipeline.add_component(
        "augmenter",
        AsyncRAGDataAugmenterComponent(
            rag_supercomponent=rag_supercomponent,
            batch_size=batch_size
        )
    )
    pipeline.add_component("evaluator", RagasEvaluationComponent(generator=generator))
    
    # Connect components
    pipeline.connect("reader.data_frame", "augmenter.data_frame")
    pipeline.connect("augmenter.augmented_data_frame", "evaluator.augmented_data_frame")
    
    return pipeline


async def evaluate_rag_system_async(
    rag_supercomponent,
    system_name: str,
    csv_file_path: str,
    generator: Any,
    batch_size: int = 5
) -> dict:
    """
    Asynchronously evaluate a single RAG system using AsyncPipeline.
    
    Args:
        rag_supercomponent: The RAG system to evaluate
        system_name: Name for logging
        csv_file_path: Path to evaluation dataset
        generator: LLM generator for RAGAS evaluation
        batch_size: Number of queries to process concurrently
        
    Returns:
        dict: Evaluation results with metrics and detailed DataFrame
    """
    print(f"\n{'='*80}")
    print(f"Starting async evaluation of {system_name}")
    print(f"{'='*80}\n")
    
    # Create evaluation pipeline
    eval_pipeline = create_evaluation_pipeline(
        rag_supercomponent=rag_supercomponent,
        generator=generator,
        batch_size=batch_size
    )
    
    # Run pipeline asynchronously
    results = await eval_pipeline.run_async(data={"reader": {"source": csv_file_path}})
    
    print(f"\n✓ {system_name} evaluation complete!")
    return {
        "system_name": system_name,
        "metrics": results["evaluator"]["metrics"],
        "evaluation_df": results["evaluator"]["evaluation_df"]
    }


async def evaluate_multiple_rag_systems_async(
    rag_systems: List[tuple],
    csv_file_path: str,
    batch_size: int = 5
) -> List[dict]:
    """
    Evaluate multiple RAG systems concurrently using AsyncPipeline.
    
    Args:
        rag_systems: List of (rag_supercomponent, system_name, generator) tuples
        csv_file_path: Path to evaluation dataset
        batch_size: Number of queries to process concurrently per system
        
    Returns:
        List of evaluation results for each system
    """
    print(f"\n{'='*80}")
    print(f"🚀 CONCURRENT EVALUATION OF {len(rag_systems)} RAG SYSTEMS")
    print(f"{'='*80}")
    print(f"Dataset: {csv_file_path}")
    print(f"Batch size: {batch_size}")
    print(f"Systems: {[name for _, name, _ in rag_systems]}")
    print(f"{'='*80}\n")
    
    # Create evaluation tasks for each system
    tasks = [
        evaluate_rag_system_async(
            rag_supercomponent=rag_sc,
            system_name=name,
            csv_file_path=csv_file_path,
            generator=generator,
            batch_size=batch_size
        )
        for rag_sc, name, generator in rag_systems
    ]
    
    # Run all evaluations concurrently
    import time
    start_time = time.time()
    results = await asyncio.gather(*tasks)
    elapsed = time.time() - start_time
    
    print(f"\n{'='*80}")
    print(f"✓ ALL EVALUATIONS COMPLETE")
    print(f"⏱️  Total time: {elapsed:.2f} seconds")
    print(f"⚡ Average time per system: {elapsed/len(rag_systems):.2f} seconds")
    print(f"{'='*80}\n")
    
    return results

## Run Concurrent Evaluation

Now let's evaluate both Naive and Hybrid RAG systems **simultaneously** using concurrent evaluation.

**Performance Benefits:**
- **Up to 2× Faster**: Both systems are evaluated at the same time
- **Identical Conditions**: Both evaluations start simultaneously
- **Efficient Resource Use**: Maximizes computational efficiency

**Configuration:**
- `batch_size`: Controls concurrent queries per system (adjust based on API limits)
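
Notebook cells allow a top-level `await`, but a plain Python script does not. A minimal sketch of how the same concurrent evaluation could be started from a script is shown here; `evaluate_all` and `evaluate_one` are hypothetical stand-ins for the evaluation coroutines, so nothing below calls a real RAG system.

```python
import asyncio


async def evaluate_one(system_name: str) -> dict:
    """Hypothetical stand-in for evaluating a single RAG system."""
    await asyncio.sleep(0.01)  # simulate evaluation work
    return {"system_name": system_name}


async def evaluate_all(system_names):
    # Start every evaluation at once and wait for all of them
    return await asyncio.gather(*(evaluate_one(name) for name in system_names))


def main():
    # asyncio.run creates the event loop that a notebook provides implicitly
    return asyncio.run(evaluate_all(["Naive RAG", "Hybrid RAG"]))


results = main()
print([r["system_name"] for r in results])
```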

In [None]:
# --- Setup for Concurrent Evaluation ---

# Create separate generators for each evaluation (components can't be shared)
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
import os

eval_generator_naive = OpenAIGenerator(
    model="gpt-4o-mini",
    # Resolved from the environment at runtime; fails clearly if the variable is unset
    api_key=Secret.from_env_var("OPENAI_API_KEY")
)

eval_generator_hybrid = OpenAIGenerator(
    model="gpt-4o-mini",
    api_key=Secret.from_env_var("OPENAI_API_KEY")
)

# Define evaluation dataset
csv_file_path = "data_for_eval/synthetic_tests_advanced_branching_50.csv"

# Configure concurrent evaluation
# batch_size: Number of queries to process in parallel (adjust based on API limits)
batch_size = 5  # Process 5 queries at a time

# Prepare RAG systems for concurrent evaluation
rag_systems = [
    (naive_rag_sc, "Naive RAG", eval_generator_naive),
    (hybrid_rag_sc, "Hybrid RAG", eval_generator_hybrid)
]

# Run concurrent evaluation
all_results = await evaluate_multiple_rag_systems_async(
    rag_systems=rag_systems,
    csv_file_path=csv_file_path,
    batch_size=batch_size
)

# Extract results for each system
naive_results_async = all_results[0]
hybrid_results_async = all_results[1]

print("\n✓ Concurrent evaluation complete!")
print(f"Naive RAG: {len(naive_results_async['evaluation_df'])} queries evaluated")
print(f"Hybrid RAG: {len(hybrid_results_async['evaluation_df'])} queries evaluated")

### View Evaluation Results

Let's examine the results from our concurrent evaluation of both systems.

In [None]:
# Display results from concurrent evaluation
print("Naive RAG - Async Evaluation Results:")
print(f"Metrics: {naive_results_async['metrics']}\n")
display(naive_results_async['evaluation_df'].head())

print("\n" + "="*80 + "\n")

print("Hybrid RAG - Async Evaluation Results:")
print(f"Metrics: {hybrid_results_async['metrics']}\n")
display(hybrid_results_async['evaluation_df'].head())

## Side-by-Side Performance Comparison

Let's create a comprehensive comparison of both RAG systems to understand their relative performance across all metrics.
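
The relative improvement used in this comparison is the percentage change of the Hybrid score over the Naive score. A small worked sketch follows; the scores are hypothetical, and the zero-baseline guard is an added assumption to avoid a division by zero.

```python
import math


def relative_improvement(naive: float, hybrid: float) -> float:
    """Percent change of hybrid relative to naive; NaN when naive is 0."""
    if naive == 0:
        return math.nan
    return round((hybrid - naive) / naive * 100, 2)


# Hypothetical average scores: metric -> (naive, hybrid)
scores = {"faithfulness": (0.80, 0.88), "context_recall": (0.50, 0.45)}

improvements = {
    metric: relative_improvement(naive, hybrid)
    for metric, (naive, hybrid) in scores.items()
}
print(improvements)
```

A positive value means Hybrid RAG scored higher on that metric, a negative value means Naive RAG did.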

In [None]:
# Update the comparison to use async results
import pandas as pd

# Extract average metrics from async evaluations
naive_scores_async = naive_results_async['metrics'].scores
hybrid_scores_async = hybrid_results_async['metrics'].scores

# Compute averages for each metric
naive_df_async = pd.DataFrame(naive_scores_async)
hybrid_df_async = pd.DataFrame(hybrid_scores_async)

naive_averages_async = naive_df_async.mean()
hybrid_averages_async = hybrid_df_async.mean()

# Create comparison DataFrame
comparison_data_async = {
    'Metric': list(naive_averages_async.index),
    'Naive RAG': list(naive_averages_async.values),
    'Hybrid RAG': list(hybrid_averages_async.values)
}

comparison_df_async = pd.DataFrame(comparison_data_async)
comparison_df_async['Improvement (%)'] = (
    (comparison_df_async['Hybrid RAG'] - comparison_df_async['Naive RAG']) / 
    comparison_df_async['Naive RAG'] * 100
).round(2)
comparison_df_async['Better System'] = comparison_df_async.apply(
    lambda row: 'Hybrid RAG' if row['Hybrid RAG'] > row['Naive RAG'] 
    else 'Naive RAG' if row['Naive RAG'] > row['Hybrid RAG'] 
    else 'Tie', axis=1
)

# Round scores
comparison_df_async['Naive RAG'] = comparison_df_async['Naive RAG'].round(4)
comparison_df_async['Hybrid RAG'] = comparison_df_async['Hybrid RAG'].round(4)

print("\n" + "=" * 80)
print("COMPREHENSIVE RAG SYSTEM COMPARISON (ASYNC EVALUATION)")
print("=" * 80)
print(comparison_df_async.to_string(index=False))
print("\n" + "=" * 80)

# Calculate overall winner
hybrid_wins = sum(comparison_df_async['Hybrid RAG'] > comparison_df_async['Naive RAG'])
naive_wins = sum(comparison_df_async['Naive RAG'] > comparison_df_async['Hybrid RAG'])
ties = sum(comparison_df_async['Naive RAG'] == comparison_df_async['Hybrid RAG'])

print(f"\nFINAL SCORECARD:")
print(f"Hybrid RAG wins: {hybrid_wins} metrics")
print(f"Naive RAG wins: {naive_wins} metrics") 
print(f"Ties: {ties} metrics")

if hybrid_wins > naive_wins:
    print(f"\nOVERALL WINNER: Hybrid RAG SuperComponent!")
    print(f"   Better performance in {hybrid_wins}/{len(comparison_df_async)} metrics")
elif naive_wins > hybrid_wins:
    print(f"\nOVERALL WINNER: Naive RAG SuperComponent!")  
    print(f"   Better performance in {naive_wins}/{len(comparison_df_async)} metrics")
else:
    print(f"\nRESULT: It's a tie between both systems!")
    
avg_improvement = comparison_df_async['Improvement (%)'].mean()
print(f"\nAverage improvement by Hybrid RAG: {avg_improvement:.2f}%")

---

## 🎯 Summary: Async Optimization Benefits

### Key Performance Improvements

We've optimized the RAGAS evaluation pipeline with async capabilities in three key areas:

#### 1. **Concurrent Query Processing** (`AsyncRAGDataAugmenterComponent`)
- **Sequential**: Processes queries one at a time
- **Async**: Processes multiple queries in parallel batches
- **Speed Improvement**: Up to N× faster (where N = batch_size)
- **Configurable**: Adjust `batch_size` based on API rate limits

#### 2. **Parallel System Evaluation** (`evaluate_multiple_rag_systems_async`)
- **Sequential**: Evaluate Naive RAG → wait → Evaluate Hybrid RAG
- **Async**: Both systems evaluated simultaneously
- **Speed Improvement**: ~2× faster total evaluation time
- **Scalable**: Easy to add more RAG systems for comparison

#### 3. **Progress Monitoring**
- Real-time progress tracking during batch processing
- Visibility into which queries are being processed
- Better debugging and monitoring capabilities

### When to Use Each Approach

| Scenario | Sequential | Async |
|----------|-----------|-------|
| Small datasets (<10 queries) | ✅ Simpler | ⚠️ Overkill |
| Large datasets (>50 queries) | ⚠️ Slow | ✅ Much faster |
| Single RAG system | ✅ Adequate | ⚠️ Minor benefit |
| Multiple RAG systems | ⚠️ Slow | ✅ 2-N× faster |
| API rate limits | ✅ Safe | ⚠️ Need batch tuning |
| Production evaluations | ⚠️ Time-consuming | ✅ Recommended |

### Configuration Tips

**Batch Size Selection:**
```python
# Conservative (API rate limit safe)
batch_size = 3

# Balanced (good for most cases)
batch_size = 5

# Aggressive (requires high API limits)
batch_size = 10
```

**For OpenAI API:**
- Free tier: `batch_size = 2-3`
- Pay-as-you-go: `batch_size = 5-10`
- Enterprise: `batch_size = 10-20`

### Code Structure Comparison

**Original Sequential Approach:**
```python
# Process queries one by one
for query in queries:
    result = rag_sc.run(query=query)
```

**Optimized Async Approach:**
```python
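import asyncio

# Process queries in batches concurrently.
# rag_sc.run is a blocking call, so each one is handed to a worker thread;
# passing its return value straight to gather would fail because it is not awaitable.
async def process_batch(queries):
    tasks = [asyncio.to_thread(rag_sc.run, query=q) for q in queries]
    return await asyncio.gather(*tasks)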
```

### Performance Metrics Example

Illustrative estimate for an evaluation with 50 queries, assuming about 5s per query and ideal parallelism (these are not measured results):

| Approach | Time per Query | Total Time | Speedup |
|----------|---------------|------------|---------|
| Sequential | 5s | 250s (~4min) | 1× |
| Async (batch=5) | 5s | 50s | **5×** |
| Async (2 systems) | 5s | 50s | **5×** |
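
The arithmetic behind these figures can be reproduced with an idealized model: every batch is assumed to take as long as a single query, and rate limits, retries and the RAGAS scoring step are ignored. The function name below is hypothetical.

```python
import math


def estimated_seconds(num_queries: int, seconds_per_query: float, batch_size: int = 1) -> float:
    """Idealized wall-clock time: number of batches times the latency of one query."""
    return math.ceil(num_queries / batch_size) * seconds_per_query


sequential = estimated_seconds(50, 5)            # batch_size=1 means one query at a time
batched = estimated_seconds(50, 5, batch_size=5)
speedup = sequential / batched
print(sequential, batched, speedup)
```

Real speedups are usually lower than this upper bound, because batches finish only when their slowest query does.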

### Next Steps

To further optimize your evaluation pipeline:

1. **Increase batch size** if you have higher API rate limits
2. **Use local models** (Ollama) to eliminate rate limit concerns
3. **Cache RAG results** to avoid re-processing identical queries
4. **Implement retry logic** for handling transient API failures
5. **Add monitoring** to track API usage and costs
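
As a starting point for step 4, retry logic with exponential backoff could look roughly like the sketch below. The helper `with_retries` and the failing `flaky_call` are hypothetical; in a real pipeline the wrapped coroutine would be the RAG or evaluation call, and the caught exception type should be narrowed to errors that are actually transient.

```python
import asyncio
import random


async def with_retries(make_call, max_attempts=3, base_delay=0.01):
    """Await make_call(), retrying failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception:  # assumption: narrow this to transient errors in practice
            if attempt == max_attempts:
                raise
            # Delay doubles with each attempt; jitter spreads out simultaneous retries
            delay = base_delay * 2 ** (attempt - 1) * (1 + random.random())
            await asyncio.sleep(delay)


attempts = {"count": 0}


async def flaky_call():
    """Hypothetical call that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = asyncio.run(with_retries(flaky_call))
print(result, attempts["count"])
```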

---