## Customization

You can modify the sample data to test different scenarios. For example:
- Change the fusion/fission ratio for the proposed method
- Adjust the error rates for both methods
- Experiment with different sample sizes

After editing the data generation code, re-run the evaluation cells to see how each configuration affects the performance metrics.
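One way to make these experiments easier is to wrap the data generation in a small function. The sketch below is only an illustration: `make_predictions` is a hypothetical helper, not part of the original notebook, and it assumes the same deterministic layout as the sample data (fusion decisions first, errors assigned to the first examples).

```python
def make_predictions(n: int, fusion_rate: float, error_rate: float) -> list:
    """Build n synthetic predictions with the given fusion and error rates.

    Hypothetical helper (an assumption, not from the original notebook).
    Fusion decisions fill the first positions and errors are assigned to
    the first examples, matching the layout of the sample data.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not (0.0 <= fusion_rate <= 1.0 and 0.0 <= error_rate <= 1.0):
        raise ValueError("rates must be in [0, 1]")

    # Convert the requested rates to whole-example counts.
    n_fusion = round(n * fusion_rate)
    n_errors = round(n * error_rate)

    return [
        {
            "decision": "fusion" if i < n_fusion else "fission",
            "error": i < n_errors,
        }
        for i in range(n)
    ]


# Same configuration as the sample data: 200 examples per method.
results = {
    "baseline": make_predictions(200, fusion_rate=0.0, error_rate=0.08),
    "proposed": make_predictions(200, fusion_rate=0.65, error_rate=0.09),
}
```

Changing the arguments (for example `n=1000` or `fusion_rate=0.5`) and re-running `compute_metrics(results)` should then show how the API reduction responds.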

In [None]:
# Display detailed results
print("=" * 60)
print("DKW CONTROLLER EVALUATION RESULTS")
print("=" * 60)

print("\nBASELINE METHOD:")
print("-" * 20)
baseline = metrics["baseline"]
print(f"  Fusion Rate:        {baseline['fusion_rate']:.1%}")
print(f"  Fission Rate:       {baseline['fission_rate']:.1%}")
print(f"  Error Rate:         {baseline['error_rate']:.1%}")
print(f"  Total API Calls:    {baseline['api_calls']:,}")
print(f"  Avg Calls/Example:  {baseline['avg_calls_per_example']:.2f}")

print("\nPROPOSED METHOD:")
print("-" * 20)
proposed = metrics["proposed"]
print(f"  Fusion Rate:        {proposed['fusion_rate']:.1%}")
print(f"  Fission Rate:       {proposed['fission_rate']:.1%}")
print(f"  Error Rate:         {proposed['error_rate']:.1%}")
print(f"  Total API Calls:    {proposed['api_calls']:,}")
print(f"  Avg Calls/Example:  {proposed['avg_calls_per_example']:.2f}")

print("\nIMPROVEMENT ANALYSIS:")
print("-" * 20)
improvement = metrics["improvement"]
print(f"  API Reduction:      {improvement['api_reduction_pct']:.1f}%")
print(f"  Error Rate Change:  {improvement['error_rate_diff']:+.1%}")

# Calculate absolute improvements
calls_saved = baseline['api_calls'] - proposed['api_calls']
print(f"  Total Calls Saved:  {calls_saved:,}")
print("=" * 60)

## Detailed Results Display

Let's examine the detailed metrics to better understand the performance differences between the two methods.

In [None]:
# Compute evaluation metrics
metrics = compute_metrics(results)

# Display the main improvement result (as in original script)
print(f"API reduction: {metrics['improvement']['api_reduction_pct']:.1f}%")

# Save results (equivalent to writing eval_out.json)
with open("eval_out.json", "w") as f:
    json.dump(metrics, f, indent=2)

print("\nResults saved to eval_out.json")

## Run Evaluation

Now let's compute the metrics for both methods and display the results.

In [None]:
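def compute_metrics(results: dict) -> dict:
    """Compute per-method rates and API call counts, plus the relative improvement.

    `results` maps "baseline" and "proposed" to non-empty lists of predictions,
    each a dict with a "decision" ("fusion" or "fission") and a boolean "error".
    """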
    metrics = {}

    for method in ["baseline", "proposed"]:
        preds = results[method]

        # Count decisions
        fusion_count = sum(1 for p in preds if p["decision"] == "fusion")
        fission_count = sum(1 for p in preds if p["decision"] == "fission")

        # Compute error rate
        errors = sum(1 for p in preds if p["error"])
        error_rate = errors / len(preds)

        # API calls (fusion=1, fission=2)
        api_calls = fusion_count + 2 * fission_count

        metrics[method] = {
            "fusion_rate": fusion_count / len(preds),
            "fission_rate": fission_count / len(preds),
            "error_rate": error_rate,
            "api_calls": api_calls,
            "avg_calls_per_example": api_calls / len(preds),
        }

    # Compute improvement
    baseline_calls = metrics["baseline"]["avg_calls_per_example"]
    proposed_calls = metrics["proposed"]["avg_calls_per_example"]
    metrics["improvement"] = {
        "api_reduction_pct": (baseline_calls - proposed_calls) / baseline_calls * 100,
        "error_rate_diff": metrics["proposed"]["error_rate"] - metrics["baseline"]["error_rate"],
    }

    return metrics

## Evaluation Metrics Function

The `compute_metrics` function calculates key performance indicators for each method:
- **Fusion/Fission rates**: Fraction of predictions with each decision type
- **Error rate**: Fraction of predictions flagged as errors
- **API calls**: Total calls needed (fusion = 1 call, fission = 2 calls), plus the average per example
- **Improvement**: Relative reduction in average API calls per example, and the error rate difference (proposed minus baseline)
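To make the call-counting rule concrete, here is the arithmetic worked by hand for the notebook's sample configuration (200 examples; baseline all fission; proposed 130 fusion and 70 fission). The variable names are illustrative only; the formulas follow `compute_metrics`.

```python
n_examples = 200

# Cost model from the notebook: fusion = 1 call, fission = 2 calls.
baseline_calls = 0 * 1 + 200 * 2   # all 200 examples are fission -> 400 calls
proposed_calls = 130 * 1 + 70 * 2  # 130 fusion + 70 fission      -> 270 calls

baseline_avg = baseline_calls / n_examples  # 2.00 calls per example
proposed_avg = proposed_calls / n_examples  # 1.35 calls per example

# Relative reduction in average calls per example, as a percentage.
api_reduction_pct = (baseline_avg - proposed_avg) / baseline_avg * 100

# Error rate difference: proposed (18/200) minus baseline (16/200).
error_rate_diff = 18 / n_examples - 16 / n_examples

print(f"API reduction: {api_reduction_pct:.1f}%")
print(f"Error rate change: {error_rate_diff:+.1%}")
```

With these inputs the proposed method saves 130 calls in total, roughly a third of the baseline cost, at the price of a one percentage point higher error rate.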

In [None]:
# Create sample evaluation results data
# This simulates the data that would be read from ../experiment_001/method_out.json

# Baseline: 100% fission decisions, 8% error rate
baseline_predictions = []
for i in range(200):
    error = i < 16  # First 16 have errors (8%)
    baseline_predictions.append({
        "decision": "fission",
        "error": error
    })

# Proposed: 65% fusion, 35% fission, 9% error rate  
proposed_predictions = []
for i in range(200):
    if i < 130:  # First 130 are fusion (65%)
        decision = "fusion"
    else:  # Remaining 70 are fission (35%)
        decision = "fission"
    
    error = i < 18  # First 18 have errors (9%)
    proposed_predictions.append({
        "decision": decision,
        "error": error
    })

# Combined results data structure
results = {
    "baseline": baseline_predictions,
    "proposed": proposed_predictions
}

print(f"Created {len(results['baseline'])} baseline predictions")
print(f"Created {len(results['proposed'])} proposed predictions")

## Sample Data

The sample data stands in for evaluation results that would normally be read from external JSON files. It contains predictions from both the baseline and proposed methods, each with a decision (fusion or fission) and an error flag.
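If you do have a results file on disk, a loader along these lines could replace the inline data. This is a minimal sketch under stated assumptions: `load_results` is a hypothetical helper, and the file is assumed to use the same structure as the sample data (a top-level object with `"baseline"` and `"proposed"` lists). The actual schema of the experiment output is not shown in this notebook.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict:
    """Load prediction results and validate their shape.

    Assumption: the file holds {"baseline": [...], "proposed": [...]}, where
    each item has a "decision" ("fusion" or "fission") and a boolean "error".
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    for method in ("baseline", "proposed"):
        preds = data.get(method)
        if not preds:
            raise ValueError(f"missing or empty predictions for {method!r}")
        for i, pred in enumerate(preds):
            if pred.get("decision") not in ("fusion", "fission"):
                raise ValueError(f"{method}[{i}]: unexpected decision")
            if not isinstance(pred.get("error"), bool):
                raise ValueError(f"{method}[{i}]: 'error' must be a boolean")
    return data


# Example usage (the path is the one mentioned in the data cell's comment):
# results = load_results("../experiment_001/method_out.json")
```

Validating up front means a malformed file fails with a clear message instead of producing silently wrong rates later.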

In [None]:
import json
import numpy as np

## Setup and Imports

First, let's import the necessary libraries for our evaluation.

# DKW Controller Evaluation

This notebook provides an interactive evaluation of the DKW Controller, comparing baseline and proposed methods for API call optimization. The evaluation measures fusion/fission decision rates, error rates, and API call efficiency.