# Hybrid RAG Optimization with Weights & Biases 📊🔬

## 📋 Overview

This notebook focuses on using **Weights & Biases (W&B)** to systematically evaluate and optimize our **Hybrid RAG** system. We will run two experiments: a baseline and an optimized configuration, with the goal of improving the `faithfulness` metric while monitoring cost.
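As a rough intuition for the target metric: faithfulness is the fraction of claims in a generated answer that the retrieved context supports. RAGAS uses an LLM to extract and verify those claims. The sketch below skips that step and takes hand-labelled claims as input; `faithfulness_score` is a hypothetical helper, not part of RAGAS.

```python
def faithfulness_score(claim_supported):
    """Fraction of response claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

# Hand-labelled toy example: 3 of the 4 claims in an answer are backed by the context
print(faithfulness_score([True, True, False, True]))  # 0.75
```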

### Hybrid RAG System Configuration (to be optimized):

| Parameter | Baseline Value | Optimization Target |
| :--- | :--- | :--- |
| `embedder_model` | `sentence-transformers/all-MiniLM-L6-v2` | Fixed |
| `llm_model` | `gpt-4o-mini` | Fixed |
| **`retriever_top_k`** | **3** | **7** |
| `rag_type` | `hybrid` | Fixed |
| `reranker_model` | `BAAI/bge-reranker-base` | Fixed |
| `bm25_enabled` | `True` | Fixed |

---

In [1]:
import os
import wandb
import pandas as pd
import numpy as np
import json
import tiktoken
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any

# Import Haystack/Ragas components
from haystack import Pipeline
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)

# Import custom components (these paths are assumed to exist relative to the notebook)
try:
    from scripts.ragasevaluation import CSVReaderComponent, RAGDataAugmenterComponent, RagasEvaluationComponent
    from scripts.rag.hybridrag import hybrid_rag_sc
    from scripts.wandb_experiments.rag_analytics import RAGAnalytics
except ImportError as exc:
    # Later cells depend on these names, so surface the underlying error clearly
    print(
        f"WARNING: Custom components could not be imported ({exc}). "
        "Ensure 'scripts/ragasevaluation.py', 'scripts/rag/hybridrag.py', and "
        "'scripts/wandb_experiments/rag_analytics.py' are available."
    )

# Environment setup (reduced logging)
os.environ["HAYSTACK_CONTENT_TRACING_ENABLED"] = "false"
print("Setup: Imports and Environment variables loaded.")

Setup: Imports and Environment variables loaded.


In [2]:
import wandb
from pathlib import Path
from datetime import datetime

class RAGEvaluationExperiment:
    """Enhanced RAG evaluation workflow with streamlined W&B integration."""
    
    def __init__(self, project_name: str, experiment_name: str):
        self.project_name = project_name
        self.experiment_name = experiment_name
        self.run = None
        self.pipeline = None
    
    def setup_pipeline(self, rag_supercomponent, metrics_list: list, config: dict = None):
        """Set up the evaluation pipeline with W&B tracking."""
        self.run = wandb.init(
            project=self.project_name,
            name=self.experiment_name,
            config=config,
            reinit=True
        )
        print(f"W&B STARTED: {self.experiment_name} | URL: {self.run.url}")
        
        # Initialize components
        reader = CSVReaderComponent()
        # NOTE: `config` is only recorded in the W&B run; it is not passed to
        # `rag_supercomponent`. We assume the supercomponent has already been built
        # with the same settings (in particular `retriever_top_k`), for example
        # inside `hybrid_rag_sc` or via environment variables. Ideally that logic
        # lives in RAGDataAugmenterComponent / hybrid_rag_sc itself.
        augmenter = RAGDataAugmenterComponent(rag_supercomponent=rag_supercomponent)
        evaluator = RagasEvaluationComponent(metrics=metrics_list)
        
        self.pipeline = Pipeline()
        self.pipeline.add_component("reader", reader)
        self.pipeline.add_component("augmenter", augmenter)
        self.pipeline.add_component("evaluator", evaluator)
        
        self.pipeline.connect("reader.data_frame", "augmenter.data_frame")
        self.pipeline.connect("augmenter.augmented_data_frame", "evaluator.augmented_data_frame")
        
        return self.pipeline
    
    def run_evaluation(self, csv_file_path: str):
        """Execute the pipeline, log high-level metrics, and return results."""
        if not self.pipeline:
            raise ValueError("Pipeline not set up. Call setup_pipeline() first.")
        
        start_time = datetime.now()
        print(f"\nRunning pipeline on {csv_file_path}...")
        results = self.pipeline.run({"reader": {"source": csv_file_path}})
        end_time = datetime.now()
        
        execution_time = (end_time - start_time).total_seconds()
        metrics = results["evaluator"]["metrics"]
        evaluation_df = results["evaluator"]["evaluation_df"].rename(columns={
            'factual_correctness(mode=f1)': 'factual_correctness_f1'
        })
        
        # Log dataset artifact
        dataset_artifact = wandb.Artifact(name=f"evaluation-dataset-{Path(csv_file_path).stem}", type="dataset")
        dataset_artifact.add_file(csv_file_path)
        self.run.log_artifact(dataset_artifact)
        
        # Extract and log summary metrics
        wandb_metrics = {
            "execution_time_seconds": execution_time,
            "num_queries_evaluated": len(evaluation_df),
        }
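        # Flatten the Ragas result into scalar summary metrics
        if hasattr(metrics, 'to_dict'):
            metrics_dict = metrics.to_dict()
            for metric_name, metric_value in metrics_dict.items():
                if isinstance(metric_value, (int, float)):
                    # Standardize metric names for W&B comparison
                    clean_name = metric_name.replace('(mode=f1)', '').replace('ragas_', '').strip()
                    wandb_metrics[f"ragas_{clean_name}"] = metric_value
        else:
            # Fallback when the result object has no to_dict(): average the numeric
            # columns of evaluation_df (assumed here to hold per-query metric scores)
            for metric_name, metric_value in evaluation_df.select_dtypes(include='number').mean().items():
                wandb_metrics[f"ragas_{metric_name}"] = float(metric_value)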
        
        self.run.log(wandb_metrics)
        print(f"Evaluation Complete: Logged {len(evaluation_df)} queries and {len(wandb_metrics)} metrics.")
        
        return {
            "metrics": metrics, # Full EvaluationResult object
            "evaluation_df": evaluation_df,
            "execution_time": execution_time,
            "wandb_url": self.run.url
        }
    
    def finish_experiment(self):
        """Finish the W&B run."""
        if self.run:
            url = self.run.url
            self.run.finish()
            print(f"\nW&B COMPLETED: {self.experiment_name} | View Results: {url}")

## 🔬 Experiment 1: Hybrid RAG Baseline (`retriever_top_k=3`)

We establish a performance baseline for the Hybrid RAG system using a retrieval limit of `top_k=3` documents.

In [None]:
# 1. Define evaluation metrics (Focusing on core RAGAS metrics)
evaluation_metrics = [
    LLMContextRecall(),
    Faithfulness(),
    FactualCorrectness(),
    ResponseRelevancy(),
    ContextEntityRecall(),
    NoiseSensitivity(),
]
csv_file_path = "data_for_eval/synthetic_tests_advanced_branching_3.csv"

# 2. Configuration for the Baseline Experiment
baseline_config = {
    "embedder_model": "text-embedding-3-small",
    "llm_model": "gpt-4o-mini",
    "retriever_top_k": 3,  # Baseline value
    "rag_type": "hybrid",
    "document_store": "elasticsearch",
    "reranker_model": "BAAI/bge-reranker-base",
}

# 3. Initialize and Setup Baseline Experiment
baseline_experiment = RAGEvaluationExperiment(
    project_name="hybrid-rag-optimization",
    experiment_name="hybrid-rag-baseline-k3"
)

baseline_pipeline = baseline_experiment.setup_pipeline(
    rag_supercomponent=hybrid_rag_sc, 
    metrics_list=evaluation_metrics,
    config=baseline_config
)

# 4. Run the evaluation and store results
baseline_results = baseline_experiment.run_evaluation(
    csv_file_path=csv_file_path
)

# 5. Run Analytics and log to W&B
baseline_analytics = RAGAnalytics(baseline_results, model_name=baseline_config['llm_model'])
baseline_summary = baseline_analytics.log_to_wandb(baseline_experiment.run)

# 6. Finish the experiment run
baseline_experiment.finish_experiment()

## 🚀 Experiment 2: Hybrid RAG Optimization (`retriever_top_k=7`)

To potentially improve context recall and faithfulness, we increase the number of retrieved documents from 3 to 7. Retrieving more documents gives the LLM more context to ground its answer in, but it also lengthens the prompt. This is a common **hyperparameter tuning** step for checking the tradeoff between answer quality and cost/latency.
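The tradeoff can be illustrated with a toy ranking, independent of the actual retriever. The sketch below computes recall@k and the context size for `k=3` and `k=7`. The `RankedDoc` class, the relevance labels, and the token counts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RankedDoc:
    relevant: bool  # hypothetical ground-truth relevance label
    tokens: int     # hypothetical token count of the chunk


def recall_and_tokens_at_k(ranked, k):
    """Return (recall@k, context tokens) for the first k ranked documents."""
    top = ranked[:k]
    total_relevant = sum(d.relevant for d in ranked)
    found = sum(d.relevant for d in top)
    recall = found / total_relevant if total_relevant else 0.0
    return recall, sum(d.tokens for d in top)


# Made-up ranking: the relevant documents sit at ranks 1, 4, and 6
ranking = [RankedDoc(r, 200) for r in (True, False, False, True, False, True, False, False)]

for k in (3, 7):
    recall, tokens = recall_and_tokens_at_k(ranking, k)
    print(f"k={k}: recall={recall:.2f}, context_tokens={tokens}")
```

In this toy case the larger `k` recovers all relevant documents but more than doubles the context passed to the LLM, which is the cost side the experiment monitors.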

In [None]:
# 1. Configuration for the Optimized Experiment (change top_k)
optimized_config = baseline_config.copy()
optimized_config['retriever_top_k'] = 7  # Optimized value

# 2. Initialize and Setup Optimized Experiment
optimized_experiment = RAGEvaluationExperiment(
    project_name="hybrid-rag-optimization",
    experiment_name="hybrid-rag-optimized-k7"
)

optimized_pipeline = optimized_experiment.setup_pipeline(
    rag_supercomponent=hybrid_rag_sc, 
    metrics_list=evaluation_metrics,
    config=optimized_config
)

# 3. Run the evaluation and store results
optimized_results = optimized_experiment.run_evaluation(
    csv_file_path=csv_file_path
)

# 4. Run Analytics and log to W&B
optimized_analytics = RAGAnalytics(optimized_results, model_name=optimized_config['llm_model'])
optimized_summary = optimized_analytics.log_to_wandb(optimized_experiment.run)

# 5. Finish the experiment run
optimized_experiment.finish_experiment()

## 📊 Comparative Analysis & Key Insights

Now we can programmatically compare the key metrics of the two runs. The full comparison is available in the W&B dashboard; the summary table built below puts quality, cost, and latency side by side.
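One simple way to read such a comparison is as absolute and relative change per metric. The sketch below uses plain dictionaries and placeholder numbers; `compare_runs` is a hypothetical helper and the values are not results from this notebook.

```python
def compare_runs(baseline, candidate):
    """Return {metric: (absolute change, percent change)} for metrics present in both runs."""
    changes = {}
    for name, base_value in baseline.items():
        if name not in candidate:
            continue
        delta = candidate[name] - base_value
        # Percent change is undefined when the baseline value is zero
        pct = delta / base_value * 100 if base_value else float("nan")
        changes[name] = (delta, pct)
    return changes


# Placeholder numbers for illustration only
baseline_row = {"Faithfulness": 0.80, "Avg Tokens/Query": 1000.0}
optimized_row = {"Faithfulness": 0.88, "Avg Tokens/Query": 2000.0}

for name, (delta, pct) in compare_runs(baseline_row, optimized_row).items():
    print(f"{name}: {delta:+.2f} ({pct:+.1f}%)")
```

With a DataFrame indexed by system name, the absolute change is simply the difference between the two rows selected with `.loc`.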

In [None]:
import numpy as np
def extract_ragas_metrics(metrics_obj):
    """Extract and flatten RAGAS metrics from the result object for comparison."""
    metrics_dict = {} 
    if hasattr(metrics_obj, 'to_dict'):
        raw_metrics = metrics_obj.to_dict()
        for k, v in raw_metrics.items():
            if isinstance(v, (float, int)):
                # Clean metric name for the output table
                clean_name = k.replace('(mode=f1)', '').strip()
                metrics_dict[clean_name] = v
    return metrics_dict

# Map table column names to the metric keys in the RAGAS result
METRIC_KEYS = {
    'Faithfulness': 'faithfulness',
    'Context Recall': 'context_recall',
    'Factual Correctness': 'factual_correctness(mode=f1)',
    'Response Relevancy': 'answer_relevancy',
    'Noise Sensitivity': 'noise_sensitivity(mode=relevant)',
    'Context Entity Recall': 'context_entity_recall',
}

def summarize_run(label, config, results, summary):
    """Build one comparison-table row from a run's results and analytics summary."""
    row = {
        'System': label,
        'retriever_top_k': config['retriever_top_k'],
        'Execution Time (s)': results['execution_time'],
        'Avg Cost (USD)': summary['average_cost_per_query_usd'],
        'Avg Tokens/Query': summary['average_tokens_per_query'],
    }
    for column, key in METRIC_KEYS.items():
        # Average the per-query scores into a single value per metric
        row[column] = np.array(results['metrics'][key]).mean()
    return row

# 1. Extract and combine summary data
baseline_data = summarize_run('Baseline (k=3)', baseline_config, baseline_results, baseline_summary)
optimized_data = summarize_run('Optimized (k=7)', optimized_config, optimized_results, optimized_summary)

comparison_df = pd.DataFrame([baseline_data, optimized_data])
comparison_df = comparison_df.set_index('System')


# Log final summary table to W&B for easy comparison
final_run = wandb.init(project="hybrid-rag-optimization", name="final-comparison", reinit=True)
final_run.log({"final_comparison_table": wandb.Table(dataframe=comparison_df.reset_index())})
final_run.finish()

# Display the comparison table in the notebook
comparison_df

In [None]:
optimized_results['metrics']

## 🎓 Summary and Next Steps

You have successfully executed a targeted optimization experiment for the Hybrid RAG system and logged all results to a single W&B project: `hybrid-rag-optimization`.

### Key Accomplishments:
1.  **Baseline Established:** Created a baseline measurement of Hybrid RAG performance (`top_k=3`) across RAGAS metrics, cost, and latency.
2.  **Parameter Tuning:** Successfully ran an experiment with an updated hyperparameter (`top_k=7`).
3.  **Comprehensive Logging:** Logged both runs, their configurations, the evaluation dataset artifact, and the analytics visualizations to W&B.

### Next Steps in W&B:
1.  **View Comparison:** Navigate to the **`hybrid-rag-optimization`** project dashboard on W&B. Use the **Compare Runs** feature to analyze the **faithfulness vs. cost scatter plot** and the **cost vs. input tokens scatter plot** to see how increasing `top_k` affects quality and cost.
2.  **Deeper Optimization:** Utilize W&B Sweeps to automate the search for the optimal `retriever_top_k` value, or try tuning other parameters like `reranker_model` or document chunking strategies.
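3.  **Reproduce:** Each run stores its configuration and the evaluation dataset as a versioned W&B artifact, so the same setup can be rerun later. Because the RAGAS metrics are judged by an LLM, scores may still vary slightly between reruns.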
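As a starting point for the Sweeps step, a grid sweep over `retriever_top_k` could be configured roughly as follows. This is a sketch under assumptions: the candidate values are arbitrary, `ragas_faithfulness` is assumed to be the key under which faithfulness is logged, and `trial` is a hypothetical function whose body would need to rebuild and evaluate the RAG system with the sampled value.

```python
# Sweep definition in the dictionary format accepted by wandb.sweep()
sweep_config = {
    "method": "grid",
    "metric": {"name": "ragas_faithfulness", "goal": "maximize"},  # assumed metric key
    "parameters": {
        "retriever_top_k": {"values": [3, 5, 7, 10]},  # arbitrary candidate values
    },
}


def trial():
    """Hypothetical trial body: W&B injects the sampled value into the run config."""
    import wandb  # imported here so the config above can be inspected without wandb

    with wandb.init() as run:
        top_k = run.config["retriever_top_k"]
        # Placeholder: rebuild the RAG system with top_k, run the evaluation,
        # then log the score, e.g. run.log({"ragas_faithfulness": score})


# To launch (requires a W&B login):
# import wandb
# sweep_id = wandb.sweep(sweep=sweep_config, project="hybrid-rag-optimization")
# wandb.agent(sweep_id, function=trial, count=4)
```

A grid search is used here because the search space is a handful of integers; for larger spaces the `method` field also accepts `random` and `bayes`.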