# A/B Testing Model Configurations

This notebook shows you how to systematically compare AI configurations across different customer segments to optimize quality-cost tradeoffs. Learn when model upgrades, prompt changes, or parameter adjustments are worth the cost.

**What You'll Learn:**
- Set up tier configurations to test different models side-by-side
- Run A/B tests comparing model performance on consistent datasets
- Measure quality with LLM-as-Judge and code evaluators
- Analyze cost vs. quality tradeoffs to make data-driven decisions
- Compare configurations in the Netra dashboard

**Prerequisites:**
- Python 3.9+
- OpenAI API key
- Netra API key ([Get started here](https://docs.getnetra.ai/quick-start/Overview))
- A test dataset with expected outputs

## Step 0: Install Packages

In [None]:
pip install netra-sdk openai

## Step 1: Set Environment Variables

In [None]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key:")
os.environ["NETRA_API_KEY"] = getpass("Enter your Netra API Key:")
os.environ["NETRA_OTLP_ENDPOINT"] = getpass("Enter your Netra OTLP Endpoint:")

print("API keys configured!")


## Step 2: Initialize Netra

In [None]:
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

Netra.init(
    app_name="ab-testing",
    headers=f"x-api-key={os.getenv('NETRA_API_KEY')}",
    environment="testing",
    trace_content=True,
    instruments={InstrumentSet.OPENAI},
)

print("Netra initialized for A/B testing!")

## Step 3: Define Tier Configurations

Set up different models and configurations to test.

In [None]:
from dataclasses import dataclass

@dataclass
class TierConfig:
    """Configuration for an A/B test tier."""
    name: str
    model: str
    temperature: float
    description: str

# Define tiers to compare
TIERS = {
    "baseline": TierConfig(
        name="Baseline (GPT-4o-mini)",
        model="gpt-4o-mini",
        temperature=0.7,
        description="Current production configuration"
    ),
    "experimental_low_temp": TierConfig(
        name="Experimental (Lower Temperature)",
        model="gpt-4o-mini",
        temperature=0.3,
        description="Lower temperature for consistency"
    ),
    "premium": TierConfig(
        name="Premium (GPT-4o)",
        model="gpt-4o",
        temperature=0.7,
        description="Higher quality model (more expensive)"
    ),
}

print("Tier configurations defined:")
for tier_id, config in TIERS.items():
    print(f"\n{tier_id}:")
    print(f"  Name: {config.name}")
    print(f"  Model: {config.model}")
    print(f"  Temperature: {config.temperature}")
    print(f"  Description: {config.description}")

## Step 4: Create Test Dataset

Define consistent test cases with expected outputs.

In [None]:
# Test dataset for A/B testing
TEST_CASES = [
    {
        "id": "email-1",
        "input": "Write a professional email asking for a meeting to discuss Q4 strategy.",
        "expected_qualities": ["professional", "clear", "concise", "actionable"]
    },
    {
        "id": "summary-1",
        "input": "Summarize the key points: Machine learning is transforming industries by enabling systems to learn from data without explicit programming. Applications span healthcare, finance, transportation, and more. Key challenges include data quality, model interpretability, and regulatory compliance.",
        "expected_qualities": ["accurate", "concise", "complete", "organized"]
    },
    {
        "id": "creative-1",
        "input": "Write a creative headline for a blog post about AI trends in 2024.",
        "expected_qualities": ["engaging", "relevant", "unique", "readable"]
    },
    {
        "id": "technical-1",
        "input": "Explain how neural networks work in simple terms.",
        "expected_qualities": ["accurate", "clear", "accessible", "thorough"]
    },
]

print(f"Test dataset: {len(TEST_CASES)} cases")
for case in TEST_CASES:
    print(f"  - {case['id']}: {case['input'][:40]}...")
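
Results are later stored under a key built from the tier ID and the test case ID, so a duplicate `id` would silently overwrite an earlier result. The check below is a minimal sketch of a dataset sanity pass; `validate_test_cases`, `REQUIRED_KEYS`, and `sample_cases` are hypothetical names introduced here for illustration and are not part of the Netra SDK.

```python
from collections import Counter

# Keys every test case in this notebook is expected to carry
REQUIRED_KEYS = {"id", "input", "expected_qualities"}

def validate_test_cases(test_cases):
    """Return a list of human-readable problems; an empty list means no issues were found."""
    problems = []
    for index, case in enumerate(test_cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
    # Duplicate IDs would collide in the results dictionary
    id_counts = Counter(case.get("id") for case in test_cases)
    for case_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id {case_id!r} appears {count} times")
    return problems

# Deliberately broken sample data to show both kinds of problem
sample_cases = [
    {"id": "email-1", "input": "Write an email.", "expected_qualities": ["clear"]},
    {"id": "email-1", "input": "Write another email."},
]
problems = validate_test_cases(sample_cases)
for problem in problems:
    print(problem)
```

Running `validate_test_cases(TEST_CASES)` before the test loop should return an empty list for the dataset defined above.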

## Step 5: Implement Evaluation Function

Create a test runner that calls each tier's model, records latency and token usage, and scores the output with a simple LLM-as-judge evaluator.

In [None]:
from openai import OpenAI
import time

class ABTestRunner:
    """Run A/B tests comparing different configurations."""

    def __init__(self):
        self.openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.results = {}

    def run_test_case(self, tier_id: str, test_case: dict) -> dict:
        """Run a single test case with a specific configuration."""
        config = TIERS[tier_id]
        Netra.set_custom_attributes(key="tier_id", value=tier_id)
        Netra.set_custom_attributes(key="test_case_id", value=test_case["id"])

        start_time = time.time()
        response = self.openai_client.chat.completions.create(
            model=config.model,
            messages=[{"role": "user", "content": test_case["input"]}],
            temperature=config.temperature,
            max_tokens=500
        )
        latency_ms = (time.time() - start_time) * 1000

        output = response.choices[0].message.content
        tokens = response.usage.total_tokens

        # Evaluate quality using a judge
        quality_score = self.evaluate_output(
            output,
            test_case["expected_qualities"]
        )

        return {
            "tier_id": tier_id,
            "tier_name": config.name,
            "test_case_id": test_case["id"],
            "output": output,
            "tokens": tokens,
            "latency_ms": latency_ms,
            "quality_score": quality_score,
        }

    def evaluate_output(self, output: str, expected_qualities: list) -> float:
        """Evaluate output quality using LLM as a judge."""
        qualities_text = ", ".join(expected_qualities)
        
        judge_response = self.openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert evaluator. Rate the output on a scale of 1-10."
                },
                {
                    "role": "user",
                    "content": f"Rate this output for the following qualities: {qualities_text}\n\nOutput: {output[:500]}\n\nProvide only a number 1-10."
                }
            ],
            max_tokens=10,
            temperature=0
        )

        try:
            score = float(judge_response.choices[0].message.content.strip())
            return min(max(score, 1), 10)  # Clamp between 1-10
        except (ValueError, TypeError, AttributeError):
            # Judge replied with non-numeric text or empty content
            return 5.0  # Default if parsing fails

    def store_result(self, result: dict):
        """Store test result."""
        key = f"{result['tier_id']}_{result['test_case_id']}"
        self.results[key] = result

    def print_summary(self):
        """Print summary of A/B test results."""
        print("\n" + "="*70)
        print("A/B Test Results Summary")
        print("="*70)

        # Aggregate by tier
        tier_stats = {}
        for result in self.results.values():
            tier_id = result["tier_id"]
            if tier_id not in tier_stats:
                tier_stats[tier_id] = {
                    "tier_name": result["tier_name"],
                    "quality_scores": [],
                    "latencies": [],
                    "tokens": [],
                }
            tier_stats[tier_id]["quality_scores"].append(result["quality_score"])
            tier_stats[tier_id]["latencies"].append(result["latency_ms"])
            tier_stats[tier_id]["tokens"].append(result["tokens"])

        for tier_id, stats in sorted(tier_stats.items()):
            avg_quality = sum(stats["quality_scores"]) / len(stats["quality_scores"])
            avg_latency = sum(stats["latencies"]) / len(stats["latencies"])
            avg_tokens = sum(stats["tokens"]) / len(stats["tokens"])

            print(f"\n{stats['tier_name']}:")
            print(f"  Avg Quality Score: {avg_quality:.2f}/10")
            print(f"  Avg Latency: {avg_latency:.0f}ms")
            print(f"  Avg Tokens/Call: {avg_tokens:.0f}")


print("A/B test runner class defined!")
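
Judges do not always reply with a bare number; replies such as "Score: 8/10" make a plain `float()` conversion fail, and a silent fallback to 5.0 can then skew the averages. The helper below is one possible, more tolerant parser. `parse_judge_score` is a hypothetical name introduced here, and taking the first number in the reply is an assumption that will misread replies that mention the scale before the score.

```python
import re
from typing import Optional

# Matches the first integer or decimal number in the reply
_SCORE_PATTERN = re.compile(r"\d+(?:\.\d+)?")

def parse_judge_score(text: Optional[str], low: float = 1.0, high: float = 10.0) -> Optional[float]:
    """Extract the first number from a judge reply and clamp it to [low, high].

    Returns None when no number is present, so callers can tell a failed
    parse apart from a genuine mid-range score.
    """
    if not text:
        return None
    match = _SCORE_PATTERN.search(text)
    if match is None:
        return None
    return min(max(float(match.group()), low), high)

for reply in ["8", "Score: 7.5/10", "15", "no idea"]:
    print(repr(reply), "->", parse_judge_score(reply))
```

Returning `None` instead of a default lets the summary step skip or count unparseable replies rather than averaging them in as 5.0.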

## Step 6: Run the A/B Test

Execute the test across all tiers.

In [None]:
# Initialize test runner
runner = ABTestRunner()

print("Starting A/B test execution...\n")

# Run each tier on each test case
for tier_id in TIERS.keys():
    print(f"\nTesting tier: {TIERS[tier_id].name}")
    print("-" * 50)
    
    for test_case in TEST_CASES:
        print(f"  Running {test_case['id']}...", end="", flush=True)
        result = runner.run_test_case(tier_id, test_case)
        runner.store_result(result)
        print(f" ✓ (Quality: {result['quality_score']:.1f}/10)")

print("\n" + "="*50)
print("A/B test execution complete!")
print("="*50)
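
With a temperature above zero, a single run per test case can be noisy, so differences between tiers may reflect sampling variation rather than the configuration. One option, sketched below, is to repeat each case several times and look at the spread. `run_repeated` is a hypothetical helper, and the scores come from a stand-in iterator rather than real model calls.

```python
import statistics
from typing import Callable

def run_repeated(score_once: Callable[[], float], repeats: int = 3) -> dict:
    """Call score_once `repeats` times and summarize the resulting scores."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    scores = [score_once() for _ in range(repeats)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two data points
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }

# Stand-in for a real call such as:
#   lambda: runner.run_test_case(tier_id, test_case)["quality_score"]
fake_scores = iter([7.0, 8.0, 9.0])
summary = run_repeated(lambda: next(fake_scores), repeats=3)
print(summary)
```

Repeating runs multiplies API cost by `repeats`, so a small value such as 3 is a reasonable starting point for a dataset of this size.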

## Step 7: Analyze Results

In [None]:
# Print summary
runner.print_summary()

# Print detailed results by test case
print("\n" + "="*70)
print("Detailed Results by Test Case")
print("="*70)

for test_case in TEST_CASES:
    print(f"\n{test_case['id']}: {test_case['input']}")
    print("-" * 70)
    
    for tier_id in TIERS.keys():
        key = f"{tier_id}_{test_case['id']}"
        if key in runner.results:
            result = runner.results[key]
            print(f"\n  {result['tier_name']}:")
            print(f"    Quality: {result['quality_score']:.1f}/10")
            print(f"    Latency: {result['latency_ms']:.0f}ms")
            print(f"    Tokens: {result['tokens']}")
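
Because every tier runs on the same test cases, the per-case difference between two tiers is often more informative than comparing tier averages. The sketch below assumes the `{tier_id}_{test_case_id}` key format used by `store_result`; `paired_differences` is a hypothetical helper, and the scores in `demo_results` are made up for illustration.

```python
import statistics

def paired_differences(results: dict, tier_a: str, tier_b: str, case_ids: list) -> list:
    """Return quality(tier_b) - quality(tier_a) for each case both tiers completed."""
    diffs = []
    for case_id in case_ids:
        a = results.get(f"{tier_a}_{case_id}")
        b = results.get(f"{tier_b}_{case_id}")
        # Skip cases that are missing for either tier
        if a is not None and b is not None:
            diffs.append(b["quality_score"] - a["quality_score"])
    return diffs

# Illustrative scores only, not real measurements
demo_results = {
    "baseline_email-1": {"quality_score": 7.0},
    "baseline_summary-1": {"quality_score": 8.0},
    "premium_email-1": {"quality_score": 8.0},
    "premium_summary-1": {"quality_score": 8.0},
}
diffs = paired_differences(
    demo_results, "baseline", "premium", ["email-1", "summary-1", "creative-1"]
)
print("Per-case differences:", diffs)
print("Mean difference:", statistics.mean(diffs))
```

With only four test cases, a mean difference is a rough signal at best; a larger dataset is needed before treating a gap between tiers as reliable.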

## Step 8: Cost vs. Quality Analysis

Determine which configuration offers the best cost-quality tradeoff.

In [None]:
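# Estimate costs (rough list pricing for illustration; check current OpenAI pricing)
MODEL_COSTS = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},  # USD per 1M tokens
    "gpt-4o": {"input": 2.50, "output": 10.00},  # USD per 1M tokens
}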

print("\n" + "="*70)
print("Cost-Quality Analysis")
print("="*70)

# Aggregate stats by tier
tier_stats = {}
for result in runner.results.values():
    tier_id = result["tier_id"]
    if tier_id not in tier_stats:
        tier_stats[tier_id] = {
            "tier_name": result["tier_name"],
            "model": TIERS[tier_id].model,
            "quality_scores": [],
            "tokens": [],
        }
    tier_stats[tier_id]["quality_scores"].append(result["quality_score"])
    tier_stats[tier_id]["tokens"].append(result["tokens"])

# Calculate and display cost-quality analysis
for tier_id, stats in sorted(tier_stats.items()):
    avg_quality = sum(stats["quality_scores"]) / len(stats["quality_scores"])
    avg_tokens = sum(stats["tokens"]) / len(stats["tokens"])
    
    model = stats["model"]
    if model in MODEL_COSTS:
        pricing = MODEL_COSTS[model]
        # Only total tokens are recorded, so assume a 70/30 input/output split
        cost_per_call = (avg_tokens * 0.7 * pricing["input"] + avg_tokens * 0.3 * pricing["output"]) / 1_000_000
    else:
        cost_per_call = 0
    
    # Per-call costs are far below $0.001, so a 0.001 floor would flatten every tier
    # to the same denominator; guard only against division by zero instead
    quality_per_dollar = avg_quality / cost_per_call if cost_per_call > 0 else float("nan")
    
    print(f"\n{stats['tier_name']}:")
    print(f"  Model: {model}")
    print(f"  Avg Quality: {avg_quality:.2f}/10")
    print(f"  Avg Tokens: {avg_tokens:.0f}")
    print(f"  Est. Cost/Call: ${cost_per_call:.6f}")
    print(f"  Quality per Dollar: {quality_per_dollar:,.0f}")

print("\n" + "="*70)
print("Recommendation: Choose the configuration with the highest quality/cost ratio")
print("="*70)
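
The analysis above works from total tokens with an assumed input/output split. The OpenAI response also reports `response.usage.prompt_tokens` and `response.usage.completion_tokens`, so a per-call cost can be computed from the actual counts. The sketch below shows that arithmetic; `estimate_cost_usd` is a hypothetical helper, and the token counts and pricing are illustrative values.

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int, pricing: dict) -> float:
    """Cost of one call in USD; pricing holds USD per 1M tokens under 'input' and 'output'."""
    return (
        prompt_tokens * pricing["input"] + completion_tokens * pricing["output"]
    ) / 1_000_000

# Illustrative pricing and token counts, not measured values
pricing = {"input": 0.15, "output": 0.60}
cost = estimate_cost_usd(prompt_tokens=40, completion_tokens=300, pricing=pricing)
print(f"Estimated cost per call: ${cost:.6f}")
```

To use this in the runner, `run_test_case` would need to return the prompt and completion token counts separately instead of only `total_tokens`.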

---

## Dashboard Comparison

In the Netra dashboard, you can:

1. **Filter by tier_id** to see results per configuration
2. **Compare latency** between tiers for the same test case
3. **Track token usage** to calculate actual costs
4. **Analyze quality scores** from your evaluators

## Documentation Links

- [Netra Documentation](https://docs.getnetra.ai)
- [Evaluation Framework](https://docs.getnetra.ai/Evaluation)
- [Evaluators Guide](https://docs.getnetra.ai/Evaluation/Evaluators)
- [Test Runs](https://docs.getnetra.ai/Evaluation/Test-Runs)

## See Also

- [Evaluating RAG Quality](/Cookbooks/evaluation/evaluating-rag-quality) - Quality evaluation for RAG systems
- [Custom Evaluator Patterns](/Cookbooks/evaluation/custom-evaluator-patterns) - Build domain-specific evaluators
- [Evaluating Agent Decisions](/Cookbooks/evaluation/evaluating-agent-decisions) - Evaluate agent behavior