# Evaluation & Results

**Paper Metric**: Deferral curves, which plot accuracy against average cost as the cost penalty λ varies

**Key Result**: The router can route to newly added models without being retrained

In [None]:
# Load router from previous notebook
%run 02_model_characterization.ipynb

## Test Set Evaluation

The test set is independent of the validation set used for characterization: 50 shuffled examples from the MMLU `test` split.

In [None]:
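# Create the test set from the MMLU test split
from datasets import load_dataset  # harmless if already imported by the %run above

test_dataset = load_dataset("cais/mmlu", "all", split="test")
# Fixed seed keeps the 50-example sample reproducible
test_dataset = test_dataset.shuffle(seed=123).select(range(50))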

test_set = []
for item in test_dataset:
    prompt = f"{item['question']}\nA) {item['choices'][0]}\nB) {item['choices'][1]}\nC) {item['choices'][2]}\nD) {item['choices'][3]}\nAnswer:"
    test_set.append({
        'prompt': prompt,
        'answer': ['A', 'B', 'C', 'D'][item['answer']]
    })

print(f"✅ Test set: {len(test_set)} examples")

## Deferral Curve Analysis

**Core Experiment**: How does accuracy change with cost as we vary λ?

In [None]:
lambda_values = [0.001, 0.01, 0.1, 1.0, 10.0]
results = []

for λ in lambda_values:
    correct, total_cost = 0, 0
    
    for example in test_set:
        # Route and predict
        routing = router.route(example['prompt'], lambda_cost=λ)
        selected = routing['model']
        total_cost += routing['cost']
        
        response = call_llm(
            selected,
            router.model_db[selected]['provider'],
            example['prompt']
        )
        
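        # Ignore surrounding whitespace and letter case when comparing;
        # assumes call_llm returns the answer letter (or something str() can wrap)
        if str(response).strip().upper() == example['answer']:
            correct += 1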
    
    accuracy = correct / len(test_set)
    avg_cost = total_cost / len(test_set)
    
    results.append({'lambda': λ, 'accuracy': accuracy, 'cost': avg_cost})
    print(f"λ={λ:5.3f}: {accuracy:.1%} accuracy, ${avg_cost:.3f} avg cost")
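
The sweep above depends on how `router.route` trades predicted error against cost, which is defined in the earlier notebook and not shown here. As a minimal sketch, assuming the router picks the model that minimizes predicted error plus `lambda_cost` times cost, the selection rule could look like the following. The function name `route_by_cost_adjusted_error` and the toy numbers are hypothetical and are not measured results.

```python
from typing import Dict


def route_by_cost_adjusted_error(
    predicted_error: Dict[str, float],
    cost: Dict[str, float],
    lambda_cost: float,
) -> str:
    """Pick the model minimizing predicted_error + lambda_cost * cost."""
    return min(
        predicted_error,
        key=lambda m: predicted_error[m] + lambda_cost * cost[m],
    )


# Toy numbers for illustration only (assumed, not measured)
predicted_error = {"small": 0.40, "large": 0.10}
cost = {"small": 0.1, "large": 1.0}

for lam in [0.001, 0.01, 0.1, 1.0, 10.0]:
    choice = route_by_cost_adjusted_error(predicted_error, cost, lam)
    print(f"lambda={lam}: {choice}")
```

With these toy values the larger model wins while λ is small, and the cheaper model takes over once the cost penalty outweighs the accuracy gap, which is the behavior a deferral curve traces out.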

## Deferral Curve Visualization

In [None]:
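import matplotlib.pyplot as plt  # harmless if already imported by the %run above

# One (cost, accuracy) point per λ value
costs = [r['cost'] for r in results]
accuracies = [r['accuracy'] for r in results]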

plt.figure(figsize=(10, 6))
plt.plot(costs, accuracies, 'bo-', linewidth=2, markersize=8)

for r in results:
    plt.annotate(f"λ={r['lambda']}", (r['cost'], r['accuracy']), 
                xytext=(5, 5), textcoords='offset points')

plt.xlabel('Average Cost ($)')
plt.ylabel('Accuracy')
plt.title('UniRoute Deferral Curve\n(Jitkrittum et al., 2025)')
plt.grid(True, alpha=0.3)
plt.show()

print("\n✅ Experiment complete!")
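
A deferral curve is often condensed into a single number by taking the area under it. The sketch below computes a trapezoidal area over a list shaped like `results`, normalized by the cost range. It is an illustrative summary under that assumption, not necessarily the exact computation used in the paper, and `toy_results` contains made-up values rather than outputs of this notebook.

```python
from typing import Dict, List


def area_under_deferral_curve(results: List[Dict[str, float]]) -> float:
    """Trapezoidal area under accuracy vs. cost, divided by the cost range."""
    # Sort by cost so the curve is integrated left to right
    points = sorted((r["cost"], r["accuracy"]) for r in results)
    area = 0.0
    for (c0, a0), (c1, a1) in zip(points, points[1:]):
        area += 0.5 * (a0 + a1) * (c1 - c0)
    cost_range = points[-1][0] - points[0][0]
    # If every point has the same cost, fall back to that point's accuracy
    return area / cost_range if cost_range > 0 else points[0][1]


# Made-up values for illustration only
toy_results = [
    {"lambda": 10.0, "accuracy": 0.5, "cost": 0.0},
    {"lambda": 0.1, "accuracy": 0.7, "cost": 0.5},
    {"lambda": 0.001, "accuracy": 0.8, "cost": 1.0},
]
toy_area = area_under_deferral_curve(toy_results)
print(f"Normalized area: {toy_area:.3f}")
```

Passing the notebook's own `results` list to `area_under_deferral_curve` would give a comparable number for the measured curve.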

## Adding New Models (Zero Retraining)

**Paper's Key Advantage**: Add new models by just computing their Ψ(m) profile

In [None]:
def add_new_model(model_name: str, provider: str, cost: float):
    """Add model without retraining router"""
    
    # Compute error profile on existing clusters
    psi_vector = characterize_model(model_name, provider)
    
    # Add to database
    router.model_db[model_name] = {
        'psi_vector': psi_vector,
        'provider': provider,
        'cost': cost
    }
    
    router._normalize_costs()
    print(f"✅ {model_name} added!")

# Example: Add better model
# add_new_model('llama-3.3-70b-versatile', 'groq', 0.69)

print("💡 New models can be added dynamically!")
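
`characterize_model` comes from the earlier notebook and is not shown here. As a rough sketch of what an error profile over existing clusters might involve, assuming each validation prompt already has a cluster id and a correctness flag for the new model, Ψ(m) could be assembled as a per-cluster error rate. The helper name `per_cluster_error`, the toy inputs, and the 0.5 placeholder for empty clusters are all assumptions made for illustration.

```python
from typing import List


def per_cluster_error(
    cluster_ids: List[int],
    is_correct: List[bool],
    num_clusters: int,
) -> List[float]:
    """Return a K-dimensional vector: fraction of wrong answers per cluster."""
    wrong = [0] * num_clusters
    total = [0] * num_clusters
    for k, ok in zip(cluster_ids, is_correct):
        total[k] += 1
        wrong[k] += 0 if ok else 1
    # Empty clusters get 0.5 as a neutral placeholder (an arbitrary choice here)
    return [wrong[k] / total[k] if total[k] else 0.5 for k in range(num_clusters)]


# Toy inputs: six validation prompts spread over four clusters
cluster_ids = [0, 0, 1, 1, 1, 2]
is_correct = [True, False, True, True, False, True]
psi = per_cluster_error(cluster_ids, is_correct, num_clusters=4)
print(psi)
```

Because the clusters are fixed, only the new model's answers on the validation prompts are needed to build its vector, and the router itself is left untouched.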

## Key Insights (Jitkrittum et al., 2025)

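- ✅ **Universal**: Any new LLM can be routed to once its Ψ(m) profile is computed
- ✅ **Efficient**: No router retraining needed
- ✅ **Practical**: λ directly controls the cost-quality tradeoff
- ✅ **Scalable**: A new model is described by a single K-dimensional Ψ(m) vector, where K is the number of prompt clusters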