# Hybrid RAG Embedding Model Comparison with Weights & Biases 📊🔬

## 📋 Overview

This notebook focuses on using **Weights & Biases (W&B)** to systematically compare **embedding models** in our **Hybrid RAG** system. We will run two experiments comparing `text-embedding-3-small` vs `text-embedding-3-large` to understand the **quality vs cost tradeoff**.

### Hybrid RAG System Configuration (Embedding Model Comparison):

| Parameter | Small Model Experiment | Large Model Experiment |
| :--- | :--- | :--- |
| **`embedder_model`** | **`text-embedding-3-small`** | **`text-embedding-3-large`** |
| `llm_model` | `gpt-4o-mini` | `gpt-4o-mini` |
| `retriever_top_k` | `5` | `5` |
| `rag_type` | `hybrid` | `hybrid` |
| `reranker_model` | `BAAI/bge-reranker-base` | `BAAI/bge-reranker-base` |
| `bm25_enabled` | `True` | `True` |
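
The table above varies a single parameter. A minimal sketch of how that controlled-variable setup could be expressed in code follows. `BASE_CONFIG`, `make_config`, and `differing_keys` are hypothetical names introduced for illustration; they are not part of the notebook's own helpers.

```python
# Illustrative sketch: build both experiment configs from one shared base so
# that only the embedding model can differ. All names here are hypothetical.
BASE_CONFIG = {
    "llm_model": "gpt-4o-mini",
    "retriever_top_k": 5,
    "rag_type": "hybrid",
    "reranker_model": "BAAI/bge-reranker-base",
    "bm25_enabled": True,
}


def make_config(embedder_model: str) -> dict:
    """Return a copy of the shared base with the embedder filled in."""
    return {**BASE_CONFIG, "embedder_model": embedder_model}


def differing_keys(a: dict, b: dict) -> set:
    """Return the keys whose values differ between two configs."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}


small_cfg = make_config("text-embedding-3-small")
large_cfg = make_config("text-embedding-3-large")

# The comparison is only fair if the embedder is the sole difference.
assert differing_keys(small_cfg, large_cfg) == {"embedder_model"}
```

Deriving both configurations from one base makes an accidental second difference (for example a changed `retriever_top_k`) fail loudly instead of silently skewing the comparison.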

### 🔬 Experimental Hypothesis:
- **Large embedding model** may provide better semantic understanding → higher faithfulness/context recall
- **Small embedding model** will be more cost-effective → better cost-per-query metrics
- The documents will be re-indexed from scratch for each model to ensure a fair comparison

---

In [1]:
import os
import sys
import wandb
import pandas as pd
import numpy as np
import json
import tiktoken
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any

# Add the current directory to Python path to ensure imports work
current_dir = Path.cwd()
if str(current_dir) not in sys.path:
    sys.path.insert(0, str(current_dir))

# Import Haystack/Ragas components
from haystack import Pipeline
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

# Import custom components (assuming these paths exist relative to the notebook)
try:
    from scripts.rag.hybridrag import HybridRAGSuperComponent
    from scripts.ragas_evaluation.ragasevalsupercomponent import RAGEvaluationSuperComponent
    from scripts.wandb_experiments.rag_analytics import RAGAnalytics
    from scripts.rag.indexing import IndexingPipelineSuperComponent
    print("✅ All custom components imported successfully")
except ImportError as e:
    print(f"WARNING: Custom components could not be imported: {e}")
    print("Ensure all required components are available.")

# Environment setup (reduced logging)
os.environ["HAYSTACK_CONTENT_TRACING_ENABLED"] = "false"
print("Setup: Imports and Environment variables loaded.")

✅ All custom components imported successfully
Setup: Imports and Environment variables loaded.


In [2]:
import wandb
from pathlib import Path
from datetime import datetime

class RAGEvaluationExperiment:
    """Enhanced RAG evaluation workflow with streamlined W&B integration using RAGEvaluationSuperComponent."""
    
    def __init__(self, project_name: str, experiment_name: str):
        self.project_name = project_name
        self.experiment_name = experiment_name
        self.run = None
        self.evaluation_supercomponent = None
    
    def setup_pipeline(self, rag_supercomponent, metrics_list: list, config: dict = None):
        """Set up the evaluation pipeline with W&B tracking using RAGEvaluationSuperComponent."""
        self.run = wandb.init(
            project=self.project_name,
            name=self.experiment_name,
            config=config,
            reinit=True
        )
        print(f"W&B STARTED: {self.experiment_name} | URL: {self.run.url}")
        
        # Initialize the RAGEvaluationSuperComponent with the RAG system to evaluate
        self.evaluation_supercomponent = RAGEvaluationSuperComponent(
            rag_supercomponent=rag_supercomponent,
            system_name=self.experiment_name,
            llm_model=config.get('llm_model', 'gpt-4o-mini') if config else 'gpt-4o-mini'
        )
        
        # Override the metrics in the evaluator component if custom metrics are provided
        if metrics_list:
            # Access the evaluator component and update its metrics
            evaluator = self.evaluation_supercomponent.pipeline.get_component("evaluator")
            evaluator.metrics = metrics_list
        
        return self.evaluation_supercomponent
    
    def run_evaluation(self, csv_file_path: str):
        """Execute the pipeline using RAGEvaluationSuperComponent, log high-level metrics, and return results."""
        if not self.evaluation_supercomponent:
            raise ValueError("Pipeline not set up. Call setup_pipeline() first.")
        
        start_time = datetime.now()
        print(f"\nRunning RAGEvaluationSuperComponent on {csv_file_path}...")
        
        # Run the supercomponent with the CSV file
        results = self.evaluation_supercomponent.run(csv_source=csv_file_path)
        end_time = datetime.now()
        
        execution_time = (end_time - start_time).total_seconds()
        metrics = results["metrics"]
        evaluation_df = results["evaluation_df"].rename(columns={
            'factual_correctness(mode=f1)': 'factual_correctness_f1'
        })
        
        # Log dataset artifact
        dataset_artifact = wandb.Artifact(name=f"evaluation-dataset-{Path(csv_file_path).stem}", type="dataset")
        dataset_artifact.add_file(csv_file_path)
        self.run.log_artifact(dataset_artifact)
        
        # Extract and log summary metrics
        wandb_metrics = {
            "execution_time_seconds": execution_time,
            "num_queries_evaluated": len(evaluation_df),
        }
        # Flatten the Ragas result into scalar metrics for W&B.
        # NOTE: this branch only runs when `metrics` exposes `to_dict()`. If it does
        # not, only the two timing/count metrics above are logged, which matches the
        # "Logged 10 queries and 2 metrics" lines in the recorded outputs.
        if hasattr(metrics, 'to_dict'):
            metrics_dict = metrics.to_dict()
            for metric_name, metric_value in metrics_dict.items():
                if isinstance(metric_value, (int, float)):
                    # Standardize metric names for W&B comparison
                    clean_name = metric_name.replace('(mode=f1)', '').replace('ragas_', '').strip()
                    wandb_metrics[f"ragas_{clean_name}"] = metric_value
        
        self.run.log(wandb_metrics)
        print(f"Evaluation Complete: Logged {len(evaluation_df)} queries and {len(wandb_metrics)} metrics.")
        
        return {
            "metrics": metrics, # Full EvaluationResult object
            "evaluation_df": evaluation_df,
            "execution_time": execution_time,
            "wandb_url": self.run.url
        }
    
    def finish_experiment(self):
        """Finish the W&B run."""
        if self.run:
            url = self.run.url
            self.run.finish()
            print(f"\nW&B COMPLETED: {self.experiment_name} | View Results: {url}")
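
The name clean-up in `run_evaluation` strips only the `(mode=f1)` suffix, while the Ragas output recorded further down also contains `noise_sensitivity(mode=relevant)`. A minimal sketch of a more general flattening step is shown here. It assumes the per-sample scores are available as a plain dict of lists, which is a hypothetical input shape rather than a documented Ragas interface, and `flatten_scores` is an illustrative name.

```python
import re


def flatten_scores(per_sample: dict) -> dict:
    """Average per-sample scores into flat scalar metrics.

    `per_sample` maps a metric name to a list of floats (hypothetical shape).
    """
    flat = {}
    for name, values in per_sample.items():
        # Drop any "(mode=...)" suffix, e.g. "factual_correctness(mode=f1)".
        clean = re.sub(r"\(mode=[^)]*\)", "", name).strip()
        flat[f"ragas_{clean}"] = sum(values) / len(values)
    return flat


# Made-up scores, used only to exercise the helper.
example = {
    "faithfulness": [1.0, 0.5],
    "factual_correctness(mode=f1)": [0.25, 0.75],
    "noise_sensitivity(mode=relevant)": [0.25, 0.5],
}
flat = flatten_scores(example)
print(sorted(flat))
```

The resulting dict has the same flat `name -> number` shape as `wandb_metrics`, so it could be merged into that dict before the `self.run.log(...)` call.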

## 🗂️ Document Store Setup and Indexing

Before running the experiments, we set up one document store per embedding model so that each model's vectors are indexed and queried in isolation.

In [3]:
document_store_small = ElasticsearchDocumentStore(hosts="http://localhost:9200")
document_store_large = ElasticsearchDocumentStore(hosts="http://localhost:9201", index="large_embeddings")
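
One reason for keeping the stores separate is that the two models produce vectors of different sizes by default (1536 dimensions for `text-embedding-3-small`, 3072 for `text-embedding-3-large`), so their embeddings cannot share one dense-vector field. The sketch below shows a guard one might run before writing to or querying an index; `EMBEDDING_DIMS` and `check_dimension` are hypothetical names, not part of the notebook's components.

```python
# Default output sizes of the two OpenAI embedding models.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_dimension(model: str, vector: list) -> None:
    """Raise ValueError if `vector` does not have the size expected for `model`."""
    expected = EMBEDDING_DIMS[model]
    if len(vector) != expected:
        raise ValueError(
            f"{model} expects {expected}-dim vectors, got {len(vector)}"
        )


# A correctly sized vector passes silently.
check_dimension("text-embedding-3-small", [0.0] * 1536)

# A small-model vector sent to the large-model index is rejected.
try:
    check_dimension("text-embedding-3-large", [0.0] * 1536)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```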


## 🔬 Experiment 1: Hybrid RAG with Small Embedding Model (`text-embedding-3-small`)

We establish a performance baseline for the Hybrid RAG system using the smaller, more cost-effective embedding model.

In [None]:
# 1. Define evaluation metrics (Focusing on core RAGAS metrics)
evaluation_metrics = [
    LLMContextRecall(),
    Faithfulness(),
    FactualCorrectness(),
    ResponseRelevancy(),
    ContextEntityRecall(),
    NoiseSensitivity(),
]
csv_file_path = "data_for_eval/synthetic_tests_advanced_branching_10.csv"

# 2. Configuration for the Small Embedding Model Experiment
small_embedding_config = {
    "embedder_model": "text-embedding-3-small",
    "llm_model": "gpt-4o-mini",
    "retriever_top_k": 5,  # Fixed value
    "rag_type": "hybrid",
    "document_store": "elasticsearch",
    "reranker_model": "BAAI/bge-reranker-base",
}

# 3. Initialize RAG component with small embedder
small_embedding_rag_sc = HybridRAGSuperComponent(
    document_store=document_store_small,
    embedder_model=small_embedding_config["embedder_model"]
)

# 4. Initialize and Setup Small Embedding Experiment
small_embedding_experiment = RAGEvaluationExperiment(
    project_name="embedding-model-comparison",
    experiment_name="hybrid-rag-small-embedding"
)

small_embedding_pipeline = small_embedding_experiment.setup_pipeline(
    rag_supercomponent=small_embedding_rag_sc,
    metrics_list=evaluation_metrics,
    config=small_embedding_config
)

# 5. Run the evaluation and store results
small_embedding_results = small_embedding_experiment.run_evaluation(
    csv_file_path=csv_file_path
)

# 6. Run Analytics and log to W&B with updated RAGAnalytics
# Pass the specific embedding model being used as a list
small_embedding_analytics = RAGAnalytics(
    results=small_embedding_results,
    model_name=small_embedding_config['llm_model'],
    embedding_models=[small_embedding_config['embedder_model']]  # Pass as list
)
small_embedding_summary = small_embedding_analytics.log_to_wandb(small_embedding_experiment.run)

# 7. Finish the experiment run
small_embedding_experiment.finish_experiment()

wandb: Currently logged in as: lgutierrwr to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


wandb: Detected [instructor, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Weave is installed but not imported. Add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/


W&B STARTED: hybrid-rag-small-embedding | URL: https://wandb.ai/lgutierrwr/embedding-model-comparison/runs/hnyo2tbe

🔄 Building evaluation pipeline for hybrid-rag-small-embedding...
✅ Evaluation pipeline for hybrid-rag-small-embedding built successfully!

Running RAGEvaluationSuperComponent on data_for_eval/synthetic_tests_advanced_branching_10.csv...
Loaded DataFrame with 10 rows from data_for_eval/synthetic_tests_advanced_branching_10.csv.
Running RAG SuperComponent on 10 queries...
RAG processing complete.
Creating Ragas EvaluationDataset...
Starting Ragas evaluation...


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations. [message repeated 10 times]
Exception raised in Job[59]: TimeoutError()


Ragas evaluation complete.
Overall metrics: {'context_recall': 0.9400, 'faithfulness': 0.8612, 'factual_correctness(mode=f1)': 0.4770, 'answer_relevancy': 0.7712, 'context_entity_recall': 0.3943, 'noise_sensitivity(mode=relevant)': 0.3532}


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Evaluation Complete: Logged 10 queries and 2 metrics.
Analytics: Logged comprehensive analysis for 10 queries.


W&B run history: a single logged point for each of the metrics summarized below.

W&B run summary:

| Metric | Value |
| :--- | :--- |
| average_cost_per_query_usd | 0.00084 |
| average_tokens_per_query | 4981.2 |
| execution_time_seconds | 282.73325 |
| num_queries_evaluated | 10.0 |
| token_efficiency_tps_per_dollar | 5928623.71235 |
| total_cost_usd | 0.0084 |



W&B COMPLETED: hybrid-rag-small-embedding | View Results: https://wandb.ai/lgutierrwr/embedding-model-comparison/runs/hnyo2tbe


## 🚀 Experiment 2: Hybrid RAG with Large Embedding Model (`text-embedding-3-large`)

To potentially improve semantic understanding and retrieval quality, we'll use the larger, more powerful embedding model. This tests the tradeoff between **embedding quality vs. cost/latency**.

In [None]:
# 1. Configuration for the Large Embedding Model Experiment
large_embedding_config = {
    "embedder_model": "text-embedding-3-large",
    "llm_model": "gpt-4o-mini",
    "retriever_top_k": 5,  # Same as small embedding experiment
    "rag_type": "hybrid",
    "document_store": "elasticsearch",
    "reranker_model": "BAAI/bge-reranker-base",
}

# 2. Initialize RAG component with large embedder
large_embedding_rag_sc = HybridRAGSuperComponent(
    document_store=document_store_large,
    embedder_model=large_embedding_config["embedder_model"]
)

# 3. Initialize and Setup Large Embedding Experiment
large_embedding_experiment = RAGEvaluationExperiment(
    project_name="embedding-model-comparison",
    experiment_name="hybrid-rag-large-embedding"
)

large_embedding_pipeline = large_embedding_experiment.setup_pipeline(
    rag_supercomponent=large_embedding_rag_sc,
    metrics_list=evaluation_metrics,
    config=large_embedding_config
)

# 4. Run the evaluation and store results
large_embedding_results = large_embedding_experiment.run_evaluation(
    csv_file_path=csv_file_path
)

# 5. Run Analytics and log to W&B with updated RAGAnalytics
# Pass the specific embedding model being used as a list
large_embedding_analytics = RAGAnalytics(
    results=large_embedding_results,
    model_name=large_embedding_config['llm_model'],
    embedding_models=[large_embedding_config['embedder_model']]  # Pass as list
)
large_embedding_summary = large_embedding_analytics.log_to_wandb(large_embedding_experiment.run)

# 6. Finish the experiment run
large_embedding_experiment.finish_experiment()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


wandb: Detected [huggingface_hub.inference, instructor, openai] in use.


W&B STARTED: hybrid-rag-large-embedding | URL: https://wandb.ai/lgutierrwr/embedding-model-comparison/runs/bimrym87

🔄 Building evaluation pipeline for hybrid-rag-large-embedding...
✅ Evaluation pipeline for hybrid-rag-large-embedding built successfully!

Running RAGEvaluationSuperComponent on data_for_eval/synthetic_tests_advanced_branching_10.csv...
Loaded DataFrame with 10 rows from data_for_eval/synthetic_tests_advanced_branching_10.csv.
Running RAG SuperComponent on 10 queries...
RAG processing complete.
Creating Ragas EvaluationDataset...
Starting Ragas evaluation...


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations. [message repeated 10 times]


Ragas evaluation complete.
Overall metrics: {'context_recall': 0.9600, 'faithfulness': 0.8000, 'factual_correctness(mode=f1)': 0.5280, 'answer_relevancy': 0.8642, 'context_entity_recall': 0.3376, 'noise_sensitivity(mode=relevant)': 0.2989}
Evaluation Complete: Logged 10 queries and 2 metrics.
Analytics: Logged comprehensive analysis for 10 queries.


W&B run history: a single logged point for each of the metrics summarized below.

W&B run summary:

| Metric | Value |
| :--- | :--- |
| average_cost_per_query_usd | 0.00082 |
| average_tokens_per_query | 4865.0 |
| execution_time_seconds | 242.19803 |
| num_queries_evaluated | 10.0 |
| token_efficiency_tps_per_dollar | 5906850.24647 |
| total_cost_usd | 0.00824 |



W&B COMPLETED: hybrid-rag-large-embedding | View Results: https://wandb.ai/lgutierrwr/embedding-model-comparison/runs/bimrym87


## 📊 Comparative Analysis & Key Insights

Now we can programmatically compare the key metrics between the two runs. The full comparison is available in the W&B dashboard; the summary table built here puts the headline numbers from both runs side by side.

In [7]:
import numpy as np
def extract_ragas_metrics(metrics_obj):
    """Extract and flatten RAGAS metrics from the result object for comparison."""
    metrics_dict = {} 
    if hasattr(metrics_obj, 'to_dict'):
        raw_metrics = metrics_obj.to_dict()
        for k, v in raw_metrics.items():
            if isinstance(v, (float, int)):
                # Clean metric name for the output table
                clean_name = k.replace('(mode=f1)', '').strip()
                metrics_dict[clean_name] = v
    return metrics_dict

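# 1. Extract and combine summary data
# Note: `.mean()` propagates NaN, so a metric with a failed (NaN) sample shows up as NaN
# in the table. Swap in `np.nanmean` to average over the samples that succeeded.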
small_embedding_data = {
    'System': 'Small Embedding (text-embedding-3-small)',
    'embedder_model': 'text-embedding-3-small',
    'Execution Time (s)': small_embedding_results['execution_time'],
    'Avg Cost (USD)': small_embedding_summary['average_cost_per_query_usd'],
    'Avg Tokens/Query': small_embedding_summary['average_tokens_per_query'],
    'Faithfulness': np.array(small_embedding_results['metrics']['faithfulness']).mean(),
    'Context Recall': np.array(small_embedding_results['metrics']['context_recall']).mean(),
    'Factual Correctness': np.array(small_embedding_results['metrics']['factual_correctness(mode=f1)']).mean(),
    'Response Relevancy': np.array(small_embedding_results['metrics']['answer_relevancy']).mean(),
    'Noise Sensitivity': np.array(small_embedding_results['metrics']['noise_sensitivity(mode=relevant)']).mean(),
    'Context Entity Recall': np.array(small_embedding_results['metrics']['context_entity_recall']).mean()
}

large_embedding_data = {
    'System': 'Large Embedding (text-embedding-3-large)',
    'embedder_model': 'text-embedding-3-large',
    'Execution Time (s)': large_embedding_results['execution_time'],
    'Avg Cost (USD)': large_embedding_summary['average_cost_per_query_usd'],
    'Avg Tokens/Query': large_embedding_summary['average_tokens_per_query'],
    'Faithfulness': np.array(large_embedding_results['metrics']['faithfulness']).mean(),
    'Context Recall': np.array(large_embedding_results['metrics']['context_recall']).mean(),
    'Factual Correctness': np.array(large_embedding_results['metrics']['factual_correctness(mode=f1)']).mean(),
    'Response Relevancy': np.array(large_embedding_results['metrics']['answer_relevancy']).mean(),
    'Noise Sensitivity': np.array(large_embedding_results['metrics']['noise_sensitivity(mode=relevant)']).mean(),
    'Context Entity Recall': np.array(large_embedding_results['metrics']['context_entity_recall']).mean()
}

comparison_df = pd.DataFrame([small_embedding_data, large_embedding_data])
comparison_df = comparison_df.set_index('System')



# Log final summary table to W&B for easy comparison
final_run = wandb.init(project="embedding-model-comparison", name="final-comparison", reinit=True)
final_run.log({"embedding_comparison_table": wandb.Table(dataframe=comparison_df.reset_index())})
final_run.finish()



## 💰 Enhanced Cost Analysis with Updated RAGAnalytics

The updated `RAGAnalytics` class provides comprehensive cost tracking including embedding costs, LLM costs, and detailed breakdowns. Let's create a final analysis that demonstrates the full capabilities.

In [None]:
# Create comprehensive analytics comparison using both models
print("🔍 Creating Enhanced Cost Analysis with Updated RAGAnalytics...")
print("=" * 70)

# 1. Create a dual-embedding analytics for comparison purposes
# This simulates analyzing a system that uses both embedding models
dual_embedding_analytics = RAGAnalytics(
    results=small_embedding_results,  # Use one set of results as base
    model_name="gpt-4o-mini",
    embedding_models=["text-embedding-3-small", "text-embedding-3-large"]  # Both models
)

# 2. Extract detailed cost breakdowns from individual experiments
print("\n📊 Individual Experiment Cost Analysis:")
print("-" * 50)

print("🔹 Small Embedding Experiment:")
print(f"   Total Cost: ${small_embedding_summary['total_cost_usd']:.6f}")
print(f"   LLM Cost: ${small_embedding_summary['llm_cost_usd']:.6f}")
print(f"   Embedding Cost: ${small_embedding_summary['embedding_cost_usd']:.6f}")
print(f"   Cost per Query: ${small_embedding_summary['average_cost_per_query_usd']:.6f}")

print("\n🔹 Large Embedding Experiment:")
print(f"   Total Cost: ${large_embedding_summary['total_cost_usd']:.6f}")
print(f"   LLM Cost: ${large_embedding_summary['llm_cost_usd']:.6f}")
print(f"   Embedding Cost: ${large_embedding_summary['embedding_cost_usd']:.6f}")
print(f"   Cost per Query: ${large_embedding_summary['average_cost_per_query_usd']:.6f}")

# 3. Calculate cost differences and efficiency metrics
cost_difference = large_embedding_summary['total_cost_usd'] - small_embedding_summary['total_cost_usd']
embedding_cost_difference = large_embedding_summary['embedding_cost_usd'] - small_embedding_summary['embedding_cost_usd']
cost_increase_percentage = (cost_difference / small_embedding_summary['total_cost_usd']) * 100

print("\n📈 Cost Comparison Analysis:")
print(f"   Total Cost Difference: ${cost_difference:.6f} ({cost_increase_percentage:+.1f}% change)")
print(f"   Embedding Cost Difference: ${embedding_cost_difference:.6f}")
print(f"   Cost Difference per Query: ${cost_difference / len(small_embedding_results['evaluation_df']):.6f}")

# 4. Performance vs Cost Analysis
faithfulness_improvement = (large_embedding_data['Faithfulness'] - small_embedding_data['Faithfulness'])
context_recall_improvement = (large_embedding_data['Context Recall'] - small_embedding_data['Context Recall'])

print("\n⚖️ Performance vs Cost Trade-off:")
print(f"   Faithfulness Improvement: {faithfulness_improvement:.4f} ({faithfulness_improvement/small_embedding_data['Faithfulness']*100:.1f}%)")
print(f"   Context Recall Improvement: {context_recall_improvement:.4f} ({context_recall_improvement/small_embedding_data['Context Recall']*100:.1f}%)")
print(f"   Cost per 1% Faithfulness Improvement: ${cost_difference/(faithfulness_improvement*100):.6f}" if faithfulness_improvement > 0 else "   Faithfulness: No improvement")

# 5. Create enhanced comparison DataFrame with cost details
enhanced_comparison = pd.DataFrame({
    'System': ['Small Embedding', 'Large Embedding'],
    'Total Cost ($)': [small_embedding_summary['total_cost_usd'], large_embedding_summary['total_cost_usd']],
    'LLM Cost ($)': [small_embedding_summary['llm_cost_usd'], large_embedding_summary['llm_cost_usd']],
    'Embedding Cost ($)': [small_embedding_summary['embedding_cost_usd'], large_embedding_summary['embedding_cost_usd']],
    'Avg Cost/Query ($)': [small_embedding_summary['average_cost_per_query_usd'], large_embedding_summary['average_cost_per_query_usd']],
    'Faithfulness': [small_embedding_data['Faithfulness'], large_embedding_data['Faithfulness']],
    'Context Recall': [small_embedding_data['Context Recall'], large_embedding_data['Context Recall']],
    'Embedding Model': ['text-embedding-3-small', 'text-embedding-3-large']
})

print("\n📋 Enhanced Comparison Table:")
print(enhanced_comparison.round(6).to_string(index=False))

# 6. Log the enhanced comparison to W&B
enhanced_run = wandb.init(project="embedding-model-comparison", name="enhanced-cost-analysis", reinit=True)

# Log the enhanced comparison table
enhanced_run.log({"enhanced_cost_comparison": wandb.Table(dataframe=enhanced_comparison)})

# Log key insights as metrics
enhanced_run.log({
    "cost_increase_percentage": cost_increase_percentage,
    "cost_difference_usd": cost_difference,
    "embedding_cost_difference_usd": embedding_cost_difference,
    "faithfulness_improvement": faithfulness_improvement,
    "context_recall_improvement": context_recall_improvement,
    "cost_per_faithfulness_improvement": cost_difference/(faithfulness_improvement*100) if faithfulness_improvement > 0 else 0
})

enhanced_run.finish()

print("\n✅ Enhanced cost analysis complete and logged to W&B!")
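
The deltas computed in this cell can be negative as well as positive, so it helps to report them as signed changes. The sketch below applies a small helper (`relative_change_pct`, a hypothetical name) to figures copied from the recorded run summaries and Ragas output earlier in the notebook.

```python
def relative_change_pct(baseline: float, candidate: float) -> float:
    """Signed percentage change of `candidate` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100


# Values copied from the recorded run summaries and "Overall metrics" lines.
small = {"total_cost_usd": 0.0084, "faithfulness": 0.8612, "context_recall": 0.94}
large = {"total_cost_usd": 0.00824, "faithfulness": 0.8000, "context_recall": 0.96}

for key in small:
    delta = relative_change_pct(small[key], large[key])
    print(f"{key}: {delta:+.1f}%")
```

On these ten queries the large-embedding run was not more expensive end to end, and its faithfulness was lower while context recall was slightly higher. With a sample this small, differences of this size should be read as indicative rather than conclusive.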

In [8]:
comparison_df

| System | embedder_model | Execution Time (s) | Avg Cost (USD) | Avg Tokens/Query | Faithfulness | Context Recall | Factual Correctness | Response Relevancy | Noise Sensitivity | Context Entity Recall |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Small Embedding (text-embedding-3-small) | text-embedding-3-small | 282.733249 | 0.00084 | 4981.2 | 0.861166 | 0.94 | 0.477 | 0.771222 | NaN | 0.394305 |
| Large Embedding (text-embedding-3-large) | text-embedding-3-large | 242.198028 | 0.000824 | 4865.0 | 0.8 | 0.96 | 0.528 | 0.864172 | 0.298932 | 0.337561 |
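
The missing Noise Sensitivity value for the small-embedding run is consistent with the `TimeoutError` raised in `Job[59]` during that evaluation. Assuming the failed job leaves a NaN among the per-sample scores, a plain mean propagates it to the table. A minimal sketch with made-up scores:

```python
import numpy as np

# Made-up per-sample scores; the last entry stands in for a timed-out job.
scores = [0.2, 0.4, 0.6, float("nan")]

plain_mean = np.array(scores).mean()  # NaN: one bad sample poisons the mean
robust_mean = np.nanmean(scores)      # averages only the valid samples

print(plain_mean, robust_mean)
```

Using `np.nanmean` keeps the metric comparable across runs, at the price of averaging over fewer samples for the run that had a failure, so it is worth reporting the number of valid samples alongside it.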


## 🎓 Summary and Next Steps

You have successfully executed an **embedding model comparison experiment** with **enhanced cost analytics** for the Hybrid RAG system and logged all results to W&B project: `embedding-model-comparison`.

### Key Accomplishments:
1. **Document Store Management:** Used separate Elasticsearch instances (small: port 9200, large: port 9201) for proper comparison between embedding models.
2. **Small vs Large Embedding Comparison:** Tested `text-embedding-3-small` vs `text-embedding-3-large` with **current OpenAI pricing** (November 2024).
3. **Enhanced Cost Analytics:** Implemented comprehensive cost tracking including:
   - **LLM costs** (input/output tokens)
   - **Embedding costs** (indexing and retrieval operations)
   - **Cost breakdowns** and efficiency metrics
   - **Performance vs cost trade-offs**
4. **Controlled Variables:** Fixed all parameters except embedding model to isolate impact.
5. **Comprehensive W&B Logging:** Enhanced logging with cost breakdowns, embedding comparisons, and efficiency metrics.

### Current OpenAI Pricing Integration:
- **text-embedding-3-small**: $0.02 per 1M tokens
- **text-embedding-3-large**: $0.13 per 1M tokens  
- **gpt-4o-mini**: $0.15/$0.60 per 1M input/output tokens
- **Automatic cost calculation** for both LLM and embedding operations
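
The cost arithmetic implied by these prices can be sketched as follows. The function name and the token counts in the example are hypothetical. They are not taken from `RAGAnalytics`, whose implementation is not shown in this notebook.

```python
# USD per 1M tokens, as listed above.
EMBEDDING_PRICE_PER_M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}
LLM_PRICE_PER_M = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}


def query_cost_usd(embed_model, embed_tokens, llm_model, input_tokens, output_tokens):
    """Estimate the cost of one query from its token counts."""
    embed = embed_tokens * EMBEDDING_PRICE_PER_M[embed_model] / 1_000_000
    prices = LLM_PRICE_PER_M[llm_model]
    llm = (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000
    return {
        "embedding_cost_usd": embed,
        "llm_cost_usd": llm,
        "total_cost_usd": embed + llm,
    }


# Hypothetical token counts for a single query.
for model in EMBEDDING_PRICE_PER_M:
    cost = query_cost_usd(model, 50, "gpt-4o-mini", 4500, 300)
    print(model, cost)
```

With token counts in this range the embedding share is well under one percent of the per-query total for either model, which may help explain why the recorded per-query costs of the two runs are so close.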

### Expected Insights from Enhanced Analytics:
- **Cost Breakdown**: Percentage split between LLM vs embedding costs
- **ROI Analysis**: Cost per performance improvement metrics
- **Efficiency Metrics**: Tokens per dollar and cost per query optimization
- **Model Comparison**: Direct cost and performance comparison between embedding models

### Next Steps in W&B:
1. **Enhanced Dashboard**: View the `embedding-model-comparison` project with new cost breakdown visualizations
2. **Performance ROI**: Analyze cost per improvement in faithfulness and context recall
3. **Production Planning**: Use cost vs performance data for budget planning
4. **Optimization Strategy**: Consider hybrid approaches or query-based model selection based on cost/performance profiles
5. **Scaling Analysis**: Project costs for production volumes using the detailed metrics
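
The scaling analysis in item 5 amounts to multiplying the measured per-query cost by an expected query volume. A minimal sketch, using the per-query averages from the recorded W&B run summaries and a hypothetical monthly volume:

```python
def projected_monthly_cost_usd(avg_cost_per_query_usd: float, queries_per_month: int) -> float:
    """Linear projection of query cost to a monthly volume."""
    return avg_cost_per_query_usd * queries_per_month


# Per-query averages from the recorded W&B run summaries.
measured = {
    "text-embedding-3-small": 0.00084,
    "text-embedding-3-large": 0.00082,
}
QUERIES_PER_MONTH = 100_000  # hypothetical volume

for model, avg in measured.items():
    monthly = projected_monthly_cost_usd(avg, QUERIES_PER_MONTH)
    print(f"{model}: ${monthly:,.2f}/month")
```

This projection is linear and covers only what the per-query average already includes. One-off costs such as embedding the corpus at indexing time would need to be added separately if they are not part of that average.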

### Key Files Updated:
- ✅ **RAGAnalytics**: Enhanced with current OpenAI pricing and embedding cost tracking
- ✅ **Notebook**: Compatible with updated analytics class and dual Elasticsearch setup
- ✅ **Cost Analysis**: Comprehensive cost breakdown and efficiency metrics