# Hybrid RAG Embedding Model Comparison with Weights & Biases 📊🔬

## 📋 Overview

This notebook focuses on using **Weights & Biases (W&B)** to systematically compare **embedding models** in our **Hybrid RAG** system. We will run two experiments comparing `text-embedding-3-small` vs `text-embedding-3-large` to understand the **quality vs cost tradeoff**.

### Hybrid RAG System Configuration (Embedding Model Comparison):

| Parameter | Small Model Experiment | Large Model Experiment |
| :--- | :--- | :--- |
| **`embedder_model`** | **`text-embedding-3-small`** | **`text-embedding-3-large`** |
| `llm_model` | `gpt-4o-mini` | `gpt-4o-mini` |
| `retriever_top_k` | `5` | `5` |
| `rag_type` | `hybrid` | `hybrid` |
| `reranker_model` | `BAAI/bge-reranker-base` | `BAAI/bge-reranker-base` |
| `bm25_enabled` | `True` | `True` |

### 🔬 Experimental Hypothesis:
- **Large embedding model** may provide better semantic understanding → higher faithfulness/context recall
- **Small embedding model** will be more cost-effective → better cost-per-query metrics
- The documents are re-indexed from scratch for each model to ensure a fair comparison
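
A fair comparison requires that the two run configurations differ only in `embedder_model`, and that can be checked programmatically. The sketch below assumes the configurations are plain dictionaries like the ones in the table above; `config_diff` is a hypothetical helper, not part of W&B or Haystack.

```python
from typing import Any, Dict, Tuple


def config_diff(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    missing = object()  # sentinel for keys present in only one config
    diff = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key, missing), b.get(key, missing)
        if left != right:
            diff[key] = (
                None if left is missing else left,
                None if right is missing else right,
            )
    return diff


# Illustrative configs mirroring the table above
small_cfg = {
    "embedder_model": "text-embedding-3-small",
    "llm_model": "gpt-4o-mini",
    "retriever_top_k": 5,
    "rag_type": "hybrid",
    "reranker_model": "BAAI/bge-reranker-base",
}
large_cfg = {**small_cfg, "embedder_model": "text-embedding-3-large"}

changed = config_diff(small_cfg, large_cfg)
assert set(changed) == {"embedder_model"}, f"Unexpected differences: {changed}"
```

Running a check like this before each pair of experiments guards against accidentally changing a second parameter along with the embedder.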

---

In [1]:
import os
import sys
import wandb
import pandas as pd
import numpy as np
import json
import tiktoken
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any

# Add the current directory to Python path to ensure imports work
current_dir = Path.cwd()
if str(current_dir) not in sys.path:
    sys.path.insert(0, str(current_dir))

# Import Haystack/Ragas components
from haystack import Pipeline
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

# Import custom components (assuming these paths exist relative to the notebook)
try:
    from scripts.rag.hybridrag import HybridRAGSuperComponent
    from scripts.ragas_evaluation.ragasevalsupercomponent import RAGEvaluationSuperComponent
    from scripts.wandb_experiments.rag_analytics import RAGAnalytics
    from scripts.rag.indexing import IndexingPipelineSuperComponent
    print("✅ All custom components imported successfully")
except ImportError as e:
    print(f"WARNING: Custom components could not be imported: {e}")
    print("Ensure all required components are available.")

# Environment setup (reduced logging)
os.environ["HAYSTACK_CONTENT_TRACING_ENABLED"] = "false"
print("Setup: Imports and Environment variables loaded.")

✅ All custom components imported successfully
Setup: Imports and Environment variables loaded.


In [2]:
import wandb
from pathlib import Path
from datetime import datetime

class RAGEvaluationExperiment:
    """Enhanced RAG evaluation workflow with streamlined W&B integration using RAGEvaluationSuperComponent."""
    
    def __init__(self, project_name: str, experiment_name: str):
        self.project_name = project_name
        self.experiment_name = experiment_name
        self.run = None
        self.evaluation_supercomponent = None
    
    def setup_pipeline(self, rag_supercomponent, metrics_list: list, config: dict = None):
        """Set up the evaluation pipeline with W&B tracking using RAGEvaluationSuperComponent."""
        self.run = wandb.init(
            project=self.project_name,
            name=self.experiment_name,
            config=config,
            reinit=True
        )
        print(f"W&B STARTED: {self.experiment_name} | URL: {self.run.url}")
        
        # Initialize the RAGEvaluationSuperComponent with the RAG system to evaluate
        self.evaluation_supercomponent = RAGEvaluationSuperComponent(
            rag_supercomponent=rag_supercomponent,
            system_name=self.experiment_name,
            llm_model=config.get('llm_model', 'gpt-4o-mini') if config else 'gpt-4o-mini'
        )
        
        # Override the metrics in the evaluator component if custom metrics are provided
        if metrics_list:
            # Access the evaluator component and update its metrics
            evaluator = self.evaluation_supercomponent.pipeline.get_component("evaluator")
            evaluator.metrics = metrics_list
        
        return self.evaluation_supercomponent
    
    def run_evaluation(self, csv_file_path: str):
        """Execute the pipeline using RAGEvaluationSuperComponent, log high-level metrics, and return results."""
        if not self.evaluation_supercomponent:
            raise ValueError("Pipeline not set up. Call setup_pipeline() first.")
        
        start_time = datetime.now()
        print(f"\nRunning RAGEvaluationSuperComponent on {csv_file_path}...")
        
        # Run the supercomponent with the CSV file
        results = self.evaluation_supercomponent.run(csv_source=csv_file_path)
        end_time = datetime.now()
        
        execution_time = (end_time - start_time).total_seconds()
        metrics = results["metrics"]
        evaluation_df = results["evaluation_df"].rename(columns={
            'factual_correctness(mode=f1)': 'factual_correctness_f1'
        })
        
        # Log dataset artifact
        dataset_artifact = wandb.Artifact(name=f"evaluation-dataset-{Path(csv_file_path).stem}", type="dataset")
        dataset_artifact.add_file(csv_file_path)
        self.run.log_artifact(dataset_artifact)
        
        # Extract and log summary metrics
        wandb_metrics = {
            "execution_time_seconds": execution_time,
            "num_queries_evaluated": len(evaluation_df),
        }
        # Simple conversion of Ragas EvaluationResult metrics to flat dictionary
        if hasattr(metrics, 'to_dict'):
            metrics_dict = metrics.to_dict()
            for metric_name, metric_value in metrics_dict.items():
                if isinstance(metric_value, (int, float)):
                    # Standardize metric names for W&B comparison
                    clean_name = metric_name.replace('(mode=f1)', '').replace('ragas_', '').strip()
                    wandb_metrics[f"ragas_{clean_name}"] = metric_value
        
        self.run.log(wandb_metrics)
        print(f"Evaluation Complete: Logged {len(evaluation_df)} queries and {len(wandb_metrics)} metrics.")
        
        return {
            "metrics": metrics, # Full EvaluationResult object
            "evaluation_df": evaluation_df,
            "execution_time": execution_time,
            "wandb_url": self.run.url
        }
    
    def finish_experiment(self):
        """Finish the W&B run."""
        if self.run:
            url = self.run.url
            self.run.finish()
            print(f"\nW&B COMPLETED: {self.experiment_name} | View Results: {url}")
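

`run_evaluation` only logs RAGAS scores when the result object exposes `to_dict()`. If it does not, the per-query `evaluation_df` can serve as a fallback source for the summary metrics. This is a minimal sketch, assuming `evaluation_df` holds one numeric column per metric. `mean_metric_scores` is a hypothetical helper, and the sample values are placeholders, not results from this notebook.

```python
import pandas as pd


def mean_metric_scores(evaluation_df: pd.DataFrame, prefix: str = "ragas_") -> dict:
    """Average every numeric column, skipping NaN scores, and prefix the names."""
    numeric = evaluation_df.select_dtypes(include="number")
    # DataFrame.mean skips NaN by default (skipna=True)
    return {f"{prefix}{name}": float(value) for name, value in numeric.mean().items()}


# Placeholder per-query scores for illustration only
demo_df = pd.DataFrame({
    "user_input": ["q1", "q2", "q3"],
    "faithfulness": [1.0, 0.5, float("nan")],
    "factual_correctness_f1": [0.2, 0.4, 0.6],
})
demo_metrics = mean_metric_scores(demo_df)
```

The resulting dictionary could be merged into `wandb_metrics` before the call to `self.run.log(...)`.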

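## 🗂️ Document Store Setup and Indexing

Before running the experiments, each embedding model needs its own document store. By default `text-embedding-3-small` produces 1536-dimensional vectors and `text-embedding-3-large` produces 3072-dimensional vectors, so the two sets of embeddings cannot share one index. The cell below connects to one Elasticsearch instance per model. Indexing itself is not shown here: each store must already contain the corpus embedded with the matching model (for example via `IndexingPipelineSuperComponent`).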

In [3]:
document_store_small = ElasticsearchDocumentStore(hosts="http://localhost:9200")
document_store_large = ElasticsearchDocumentStore(hosts="http://localhost:9201")
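

Since the indexing step is not part of this cell, it can help to confirm that each store actually contains documents before evaluating. The sketch below relies on `count_documents()`, which Haystack document stores provide. `check_store_populated` and `StubStore` are hypothetical stand-ins so the example runs without Elasticsearch.

```python
class StubStore:
    """Stand-in for a document store; only mimics count_documents()."""

    def __init__(self, n_docs: int):
        self._n_docs = n_docs

    def count_documents(self) -> int:
        return self._n_docs


def check_store_populated(store, label: str, min_docs: int = 1) -> int:
    """Raise if the store holds fewer than min_docs documents; return the count."""
    count = store.count_documents()
    if count < min_docs:
        raise RuntimeError(
            f"{label}: found {count} documents, expected at least {min_docs}. "
            "Run the indexing pipeline for this embedder first."
        )
    return count


populated_count = check_store_populated(StubStore(120), "small-embedding store")
```

In the notebook, the same check would be applied to `document_store_small` and `document_store_large` in place of the stub.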


## 🔬 Experiment 1: Hybrid RAG with Small Embedding Model (`text-embedding-3-small`)

We establish a performance baseline for the Hybrid RAG system using the smaller, more cost-effective embedding model.

In [None]:
# 1. Define evaluation metrics (Focusing on core RAGAS metrics)
evaluation_metrics = [LLMContextRecall(), \
                Faithfulness(), \
                FactualCorrectness(), \
                ResponseRelevancy(), \
                ContextEntityRecall(), \
                NoiseSensitivity()]
csv_file_path = "data_for_eval/synthetic_tests_advanced_branching_3.csv"

# 2. Configuration for the Small Embedding Model Experiment
small_embedding_config = {
    "embedder_model": "text-embedding-3-small",
    "llm_model": "gpt-4o-mini",
    "retriever_top_k": 5,  # Fixed value
    "rag_type": "hybrid",
    "document_store": "elasticsearch",
    "reranker_model": "BAAI/bge-reranker-base",
}



# 3. Initialize RAG component with small embedder
small_embedding_rag_sc = HybridRAGSuperComponent(
    document_store=document_store_small,
    embedder_model=small_embedding_config["embedder_model"]
)

# 4. Initialize and set up the small embedding experiment
small_embedding_experiment = RAGEvaluationExperiment(
    project_name="embedding-model-comparison",
    experiment_name="hybrid-rag-small-embedding"
)

small_embedding_pipeline = small_embedding_experiment.setup_pipeline(
    rag_supercomponent=small_embedding_rag_sc,
    metrics_list=evaluation_metrics,
    config=small_embedding_config
)

# 5. Run the evaluation and store results
small_embedding_results = small_embedding_experiment.run_evaluation(
    csv_file_path=csv_file_path
)

# 6. Run analytics and log to W&B
small_embedding_analytics = RAGAnalytics(small_embedding_results, model_name=small_embedding_config['llm_model'])
small_embedding_summary = small_embedding_analytics.log_to_wandb(small_embedding_experiment.run)

# 7. Finish the experiment run
small_embedding_experiment.finish_experiment()

[34m[1mwandb[0m: Currently logged in as: [33mlgutierrwr[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: Detected [instructor, openai] in use.
[34m[1mwandb[0m: Use W&B Weave for improved LLM call tracing. Weave is installed but not imported. Add `import weave` to the top of your script.
[34m[1mwandb[0m: For more information, check out the docs at: https://weave-docs.wandb.ai/


W&B STARTED: hybrid-rag-small-embedding | URL: https://wandb.ai/lgutierrwr/embedding-model-comparison/runs/kzm3h5zb

🔄 Building evaluation pipeline for hybrid-rag-small-embedding...
✅ Evaluation pipeline for hybrid-rag-small-embedding built successfully!

Running RAGEvaluationSuperComponent on data_for_eval/synthetic_tests_advanced_branching_3.csv...
Loaded DataFrame with 4 rows from data_for_eval/synthetic_tests_advanced_branching_3.csv.
Running RAG SuperComponent on 4 queries...
RAG processing complete.
Creating Ragas EvaluationDataset...
Starting Ragas evaluation...


Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.


## 🚀 Experiment 2: Hybrid RAG with Large Embedding Model (`text-embedding-3-large`)

To potentially improve semantic understanding and retrieval quality, we'll use the larger, more powerful embedding model. This tests the tradeoff between **embedding quality vs. cost/latency**.

In [None]:
# 1. Configuration for the Large Embedding Model Experiment
large_embedding_config = {
    "embedder_model": "text-embedding-3-large",
    "llm_model": "gpt-4o-mini",
    "retriever_top_k": 5,  # Same as small embedding experiment
    "rag_type": "hybrid",
    "document_store": "elasticsearch",
    "reranker_model": "BAAI/bge-reranker-base",
}

# 2. Initialize RAG component with large embedder
large_embedding_rag_sc = HybridRAGSuperComponent(
    document_store=document_store_large,
    embedder_model=large_embedding_config["embedder_model"]
)

# 3. Initialize and set up the large embedding experiment
large_embedding_experiment = RAGEvaluationExperiment(
    project_name="embedding-model-comparison",
    experiment_name="hybrid-rag-large-embedding"
)

large_embedding_pipeline = large_embedding_experiment.setup_pipeline(
    rag_supercomponent=large_embedding_rag_sc,
    metrics_list=evaluation_metrics,
    config=large_embedding_config
)

# 4. Run the evaluation and store results
large_embedding_results = large_embedding_experiment.run_evaluation(
    csv_file_path=csv_file_path
)

# 5. Run analytics and log to W&B
large_embedding_analytics = RAGAnalytics(large_embedding_results, model_name=large_embedding_config['llm_model'])
large_embedding_summary = large_embedding_analytics.log_to_wandb(large_embedding_experiment.run)

# 6. Finish the experiment run
large_embedding_experiment.finish_experiment()

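## 📊 Comparative Analysis & Key Insights

Now we can programmatically compare the key metrics between the two runs. The full comparison is available in the W&B dashboard; the summary table below puts quality, cost, and execution time side by side. With only 4 evaluation queries, treat any differences as indicative rather than conclusive.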

In [None]:
import numpy as np
def extract_ragas_metrics(metrics_obj):
    """Extract and flatten RAGAS metrics from the result object for comparison."""
    metrics_dict = {} 
    if hasattr(metrics_obj, 'to_dict'):
        raw_metrics = metrics_obj.to_dict()
        for k, v in raw_metrics.items():
            if isinstance(v, (float, int)):
                # Clean metric name for the output table
                clean_name = k.replace('(mode=f1)', '').strip()
                metrics_dict[clean_name] = v
    return metrics_dict

# 1. Extract and combine summary data
# Table column label -> key used by the Ragas result object
RAGAS_METRIC_KEYS = {
    'Faithfulness': 'faithfulness',
    'Context Recall': 'context_recall',
    'Factual Correctness': 'factual_correctness(mode=f1)',
    'Response Relevancy': 'answer_relevancy',
    'Noise Sensitivity': 'noise_sensitivity(mode=relevant)',
    'Context Entity Recall': 'context_entity_recall',
}

def build_summary_row(label, embedder_model, results, summary):
    """Combine timing, cost, and mean RAGAS scores for one run into a flat dict."""
    row = {
        'System': label,
        'embedder_model': embedder_model,
        'Execution Time (s)': results['execution_time'],
        'Avg Cost (USD)': summary['average_cost_per_query_usd'],
        'Avg Tokens/Query': summary['average_tokens_per_query'],
    }
    for column, key in RAGAS_METRIC_KEYS.items():
        # nanmean skips queries for which a score could not be computed (NaN)
        row[column] = np.nanmean(np.array(results['metrics'][key], dtype=float))
    return row

small_embedding_data = build_summary_row(
    'Small Embedding (text-embedding-3-small)', 'text-embedding-3-small',
    small_embedding_results, small_embedding_summary
)
large_embedding_data = build_summary_row(
    'Large Embedding (text-embedding-3-large)', 'text-embedding-3-large',
    large_embedding_results, large_embedding_summary
)

comparison_df = pd.DataFrame([small_embedding_data, large_embedding_data])
comparison_df = comparison_df.set_index('System')

# Display the comparison table
print("📊 EMBEDDING MODEL COMPARISON RESULTS:")
print("=" * 80)
print(comparison_df.round(4))

# Log final summary table to W&B for easy comparison
final_run = wandb.init(project="embedding-model-comparison", name="final-comparison", reinit=True)
final_run.log({"embedding_comparison_table": wandb.Table(dataframe=comparison_df.reset_index())})
final_run.finish()
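

To read the tradeoff off the table directly, the relative change of each numeric column can be computed from the two rows. This sketch assumes a frame shaped like `comparison_df`. The numbers are placeholders chosen for illustration, not measured results, and `relative_change` is a hypothetical helper.

```python
import pandas as pd


def relative_change(df: pd.DataFrame, baseline: str, candidate: str) -> pd.Series:
    """Percent change of each numeric column from the baseline row to the candidate row."""
    numeric = df.select_dtypes(include="number")
    base, cand = numeric.loc[baseline], numeric.loc[candidate]
    # A zero baseline yields inf or NaN for that column
    return ((cand - base) / base * 100).round(2)


# Placeholder values for illustration only
demo = pd.DataFrame(
    {
        "embedder_model": ["text-embedding-3-small", "text-embedding-3-large"],
        "Avg Cost (USD)": [0.0010, 0.0012],
        "Faithfulness": [0.80, 0.84],
    },
    index=["small", "large"],
)
delta_pct = relative_change(demo, baseline="small", candidate="large")
```

A positive value means the candidate run scored higher on that column. Whether that is good depends on the column: higher is better for faithfulness, worse for cost.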

## 🎓 Summary and Next Steps

You have successfully executed an **embedding model comparison experiment** for the Hybrid RAG system and logged all results to a single W&B project: `embedding-model-comparison`.

### Key Accomplishments:
1.  **Document Store Management:** Implemented proper document wiping and re-indexing workflow to ensure fair comparison between embedding models.
2.  **Small vs Large Embedding Comparison:** Tested `text-embedding-3-small` vs `text-embedding-3-large` to understand the **quality vs cost tradeoff**.
3.  **Controlled Variables:** Fixed all other parameters (`retriever_top_k=5`, `llm_model=gpt-4o-mini`) to isolate the impact of embedding model choice.
4.  **Comprehensive Logging:** Logged both runs with their configurations, metrics, and cost analysis to W&B for detailed comparison.

### Expected Insights:
- **Performance:** Large embedding model may show improved semantic understanding (higher context recall, faithfulness)
- **Cost:** The large embedding model will likely incur higher embedding costs, while LLM costs should stay the same because both runs use `gpt-4o-mini`
- **Latency:** The large embedding model may take slightly longer to process
- **Use Case Guidance:** Data will help decide if quality improvement justifies additional cost
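
The embedding side of the cost tradeoff is simple arithmetic: tokens embedded times the per-token price. The sketch below takes the price as a parameter. The example prices (USD 0.02 and USD 0.13 per million tokens) are assumptions based on OpenAI's list prices when these models were released and should be checked against current pricing; the token count is a placeholder.

```python
def embedding_cost_usd(num_tokens: int, price_per_million_usd: float) -> float:
    """Cost of embedding num_tokens at a given price per one million tokens."""
    if num_tokens < 0 or price_per_million_usd < 0:
        raise ValueError("token count and price must be non-negative")
    return num_tokens / 1_000_000 * price_per_million_usd


# Assumed list prices per 1M tokens; verify before relying on them
ASSUMED_PRICES = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

corpus_tokens = 2_000_000  # placeholder corpus size
costs = {
    model: embedding_cost_usd(corpus_tokens, price)
    for model, price in ASSUMED_PRICES.items()
}
cost_ratio = costs["text-embedding-3-large"] / costs["text-embedding-3-small"]
```

Under these assumed prices the ratio between the two models is fixed regardless of corpus size, so the absolute difference only matters once the embedded volume is large.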

### Next Steps in W&B:
1.  **View Comparison:** Navigate to the **`embedding-model-comparison`** project dashboard on W&B. Compare the **faithfulness vs. embedding cost** and **quality metrics vs. model size** visualizations.
2.  **Cost Analysis:** Analyze the embedding cost difference and determine ROI for your specific use case.
3.  **Deeper Optimization:** Use W&B Sweeps to test additional embedding models or explore hybrid approaches using both models for different query types.
4.  **Production Decision:** Use these results to make an informed choice for your production RAG system based on your quality requirements and budget constraints.
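
A grid sweep over embedding models could look roughly like the following. The configuration keys (`method`, `metric`, `parameters`) and the `wandb.sweep` / `wandb.agent` calls are part of the W&B API. The metric name `ragas_faithfulness`, the sweep name, and the `train_fn` callback are assumptions about how the runs would be wired up.

```python
sweep_config = {
    "name": "embedding-model-sweep",  # hypothetical sweep name
    "method": "grid",
    "metric": {"name": "ragas_faithfulness", "goal": "maximize"},
    "parameters": {
        "embedder_model": {
            "values": ["text-embedding-3-small", "text-embedding-3-large"]
        },
        "retriever_top_k": {"value": 5},  # held constant, as in this notebook
    },
}


def launch_sweep(train_fn, project: str = "embedding-model-comparison") -> str:
    """Register the sweep with W&B and run one agent over the grid."""
    import wandb  # imported lazily so the config can be inspected offline

    sweep_id = wandb.sweep(sweep=sweep_config, project=project)
    wandb.agent(sweep_id, function=train_fn)
    return sweep_id
```

Each value of `embedder_model` would still need a document store indexed with that model, so `train_fn` has to select the matching store from the run's config.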