# DKW Controller Evaluation

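This notebook contains an evaluation script for the DKW Controller. It compares the baseline and proposed methods on:
- Decision rates (fusion vs. fission)
- Error rates
- API call efficiency (total calls and average calls per example)
- Improvement of the proposed method over the baseline (API call reduction and change in error rate)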

The notebook is self-contained with inline sample data for demonstration.

In [None]:
"""Evaluation script for DKW Controller."""
import json

## Sample Data

Instead of reading from external JSON files, this notebook builds inline sample data: 200 test examples for each method, constructed to represent the evaluation results.

In [None]:
# Create sample data that matches the expected evaluation results
# 200 examples per method to produce the metrics shown in eval_out.json

# Baseline method: 100% fission decisions, 8% error rate
baseline_data = []
for i in range(200):
    baseline_data.append({
        "decision": "fission",  # All baseline decisions are fission
        "error": i < 16  # First 16 examples have errors (8% error rate)
    })

# Proposed method: 65% fusion, 35% fission, 9% error rate  
proposed_data = []
for i in range(200):
    if i < 130:  # First 130 examples use fusion (65%)
        decision = "fusion"
    else:  # Remaining 70 examples use fission (35%)
        decision = "fission"
    
    proposed_data.append({
        "decision": decision,
        "error": i < 18  # First 18 examples have errors (9% error rate)
    })

# Combine into the expected data structure
results = {
    "baseline": baseline_data,
    "proposed": proposed_data
}

print("Created sample data:")
print(f"- Baseline: {len(results['baseline'])} examples")
print(f"- Proposed: {len(results['proposed'])} examples")
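
The two loops above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch: `make_examples` and its parameters are hypothetical names that do not appear in the original script, and it assumes, as the cell above does, that fusion decisions and errors occupy the first positions of each list.

```python
def make_examples(n: int, n_fusion: int, n_errors: int) -> list:
    """Build n synthetic records (hypothetical helper, not in the original script).

    The first n_fusion records are labelled "fusion" and the rest "fission";
    the first n_errors records are flagged as errors.
    """
    if not (0 <= n_fusion <= n and 0 <= n_errors <= n):
        raise ValueError("n_fusion and n_errors must lie between 0 and n")
    return [
        {"decision": "fusion" if i < n_fusion else "fission", "error": i < n_errors}
        for i in range(n)
    ]


# Same proportions as the inline sample data: 200 examples per method.
results = {
    "baseline": make_examples(200, n_fusion=0, n_errors=16),
    "proposed": make_examples(200, n_fusion=130, n_errors=18),
}
```

Taking counts rather than rates as arguments keeps the sketch free of rounding questions when `n` is changed.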

## Evaluation Function

The `compute_metrics` function analyzes the results and calculates key performance indicators for both methods.

In [None]:
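def compute_metrics(results: dict) -> dict:
    """Compute decision rates, error rates, and API-call metrics per method.

    `results` maps "baseline" and "proposed" to lists of records, each with a
    "decision" ("fusion" or "fission") and a boolean "error" flag. The returned
    dict adds an "improvement" entry comparing proposed against baseline.
    """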
    metrics = {}

    for method in ["baseline", "proposed"]:
        preds = results[method]

        # Count decisions
        fusion_count = sum(1 for p in preds if p["decision"] == "fusion")
        fission_count = sum(1 for p in preds if p["decision"] == "fission")

        # Compute error rate
        errors = sum(1 for p in preds if p["error"])
        error_rate = errors / len(preds)

        # API calls: each fusion decision costs 1 call, each fission decision costs 2
        api_calls = fusion_count + 2 * fission_count

        metrics[method] = {
            "fusion_rate": fusion_count / len(preds),
            "fission_rate": fission_count / len(preds),
            "error_rate": error_rate,
            "api_calls": api_calls,
            "avg_calls_per_example": api_calls / len(preds),
        }

    # Compute improvement
    baseline_calls = metrics["baseline"]["avg_calls_per_example"]
    proposed_calls = metrics["proposed"]["avg_calls_per_example"]
    metrics["improvement"] = {
        "api_reduction_pct": (baseline_calls - proposed_calls) / baseline_calls * 100,
        "error_rate_diff": metrics["proposed"]["error_rate"] - metrics["baseline"]["error_rate"],
    }

    return metrics
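
As a check on the formulas in `compute_metrics`, the values expected for the sample data can be worked out by hand. The sketch below repeats that arithmetic directly. It assumes the counts used in this notebook (200 examples per method; 130 fusion decisions and 18 errors for the proposed method; 16 errors for the baseline) and the cost model from the function (1 call per fusion, 2 per fission).

```python
n = 200  # examples per method

# Total API calls: fusion costs 1 call, fission costs 2.
baseline_calls = 0 * 1 + 200 * 2   # all fission -> 400
proposed_calls = 130 * 1 + 70 * 2  # 130 fusion, 70 fission -> 270

baseline_avg = baseline_calls / n  # 2.0 calls per example
proposed_avg = proposed_calls / n  # 1.35 calls per example

# Relative reduction in average calls, as a percentage of the baseline.
api_reduction_pct = (baseline_avg - proposed_avg) / baseline_avg * 100

# Proposed error rate minus baseline error rate (positive means more errors).
error_rate_diff = 18 / n - 16 / n

print(f"{api_reduction_pct:.1f}% fewer calls, error rate {error_rate_diff:+.3f}")
# prints: 32.5% fewer calls, error rate +0.010
```

On this data the proposed method trades a one-percentage-point increase in error rate for roughly a third fewer API calls.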

## Run Evaluation

Compute the metrics and display the results.

In [None]:
# Compute metrics using our sample data
metrics = compute_metrics(results)

# Display the key improvement metrics
print(f"API reduction: {metrics['improvement']['api_reduction_pct']:.1f}%")
print(f"Error rate change: {metrics['improvement']['error_rate_diff']:+.3f}")
print()

## Detailed Results

Let's examine the complete metrics breakdown for both methods.

In [None]:
# Display detailed metrics in a readable format
print("=" * 60)
print("DETAILED EVALUATION RESULTS")
print("=" * 60)

for method in ["baseline", "proposed"]:
    print(f"\n{method.upper()} METHOD:")
    print(f"  Fusion rate:        {metrics[method]['fusion_rate']:.2%}")
    print(f"  Fission rate:       {metrics[method]['fission_rate']:.2%}")
    print(f"  Error rate:         {metrics[method]['error_rate']:.2%}")
    print(f"  Total API calls:    {metrics[method]['api_calls']:,}")
    print(f"  Avg calls/example:  {metrics[method]['avg_calls_per_example']:.2f}")

print("\nIMPROVEMENT SUMMARY:")
print(f"  API reduction:      {metrics['improvement']['api_reduction_pct']:.1f}%")
print(f"  Error rate change:  {metrics['improvement']['error_rate_diff']:+.3f}")

# The original script saved these metrics to 'eval_out.json'; this notebook
# prints the equivalent JSON in the next cell instead of writing a file.
print("\nResults computed successfully!")
print("The next cell prints the JSON that would be saved to 'eval_out.json'.")

In [None]:
# Show the JSON output that would be saved to eval_out.json
print(json.dumps(metrics, indent=2))
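
If a file on disk is wanted, the printed JSON can be written out with the standard library. This is a sketch only: `save_metrics` is a hypothetical helper, the `metrics` dictionary below is a small stand-in with the same shape as the real "improvement" entry, and the output goes to a temporary directory rather than wherever the original script wrote `eval_out.json`.

```python
import json
import tempfile
from pathlib import Path


def save_metrics(metrics: dict, path: Path) -> None:
    """Write metrics as indented JSON (hypothetical helper)."""
    with path.open("w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2)


# Stand-in values for illustration; in the notebook, pass the real `metrics`.
metrics = {"improvement": {"api_reduction_pct": 32.5, "error_rate_diff": 0.01}}

# Assumed location: a fresh temporary directory, to avoid overwriting anything.
out_path = Path(tempfile.mkdtemp()) / "eval_out.json"
save_metrics(metrics, out_path)

# Reading the file back should reproduce the dictionary.
with out_path.open(encoding="utf-8") as f:
    reloaded = json.load(f)
```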

## Customization

To modify this evaluation:

1. **Change the sample data**: Edit the data generation cell to use your own experimental results
2. **Adjust metrics**: Modify the `compute_metrics` function to add new evaluation criteria
3. **Add visualizations**: Use matplotlib/seaborn to create charts from the metrics
4. **Scale the analysis**: Increase the number of test examples or add new methods

This notebook is completely self-contained and doesn't require any external files.
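
When the sample data is replaced with real experimental results (the first customization above), it can help to check the records before computing metrics, since `compute_metrics` divides by the number of examples and only counts the strings "fusion" and "fission". The following is a minimal sketch; `validate_results` and `VALID_DECISIONS` are hypothetical names, not part of the original script.

```python
VALID_DECISIONS = {"fusion", "fission"}


def validate_results(results: dict) -> None:
    """Raise ValueError if a record lacks the fields compute_metrics reads.

    Hypothetical helper; not part of the original script.
    """
    for method, preds in results.items():
        if not preds:
            # An empty list would cause a division by zero in compute_metrics.
            raise ValueError(f"{method}: no examples")
        for i, p in enumerate(preds):
            if p.get("decision") not in VALID_DECISIONS:
                raise ValueError(f"{method}[{i}]: bad decision {p.get('decision')!r}")
            if not isinstance(p.get("error"), bool):
                raise ValueError(f"{method}[{i}]: 'error' must be a bool")


# A well-formed record passes silently.
good = {"baseline": [{"decision": "fission", "error": False}]}
validate_results(good)

# An unknown decision label is reported with its method and index.
bad = {"proposed": [{"decision": "merge", "error": False}]}
try:
    validate_results(bad)
except ValueError as exc:
    message = str(exc)
```

Calling such a check immediately before `compute_metrics(results)` would surface malformed input as a clear error instead of silently skewed rates.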