## Customization and Usage

This notebook is completely self-contained and ready to run. To customize it:

1. **Modify the data**: Edit the sample data generation in the "Sample Data" cell to reflect your actual prediction results
2. **Add visualizations**: Consider adding matplotlib/seaborn plots to visualize the results
3. **Extend metrics**: Add additional evaluation metrics in the `compute_metrics` function
4. **Save results**: Uncomment the file-saving code in the JSON output cell to write the metrics to `eval_out.json`
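
As one possible way to extend the metrics (step 3), error rates can be broken down by decision type. This is only an illustrative sketch: `error_rate_by_decision` is a hypothetical helper that is not part of the original script, and it assumes each prediction is a dict with `decision` and `error` keys, as in the sample data.

```python
from collections import defaultdict


def error_rate_by_decision(preds: list) -> dict:
    """Return the error rate for each decision type present in preds.

    Hypothetical helper (not in the original eval.py). Assumes every
    prediction is a dict with a "decision" string and a boolean "error".
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for p in preds:
        totals[p["decision"]] += 1
        if p["error"]:
            errors[p["decision"]] += 1
    # Only decision types that actually occur are reported,
    # so the division below never sees a zero denominator.
    return {d: errors[d] / totals[d] for d in totals}


# Tiny illustrative input: 3 fusion (1 error), 1 fission (0 errors)
example = [
    {"decision": "fusion", "error": True},
    {"decision": "fusion", "error": False},
    {"decision": "fusion", "error": False},
    {"decision": "fission", "error": False},
]
rates = error_rate_by_decision(example)
```

The returned dict could be stored under a new key in `metrics[method]` inside `compute_metrics`, so it also appears in the JSON output.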

### Key Insights from Current Results
- The proposed method achieved a **32.5% reduction** in API calls compared to baseline (average calls per example fall from 2.00 to 1.35)
- This was accomplished by shifting from 100% fission decisions to 65% fusion + 35% fission
- Trade-off: the error rate rises from 8% to 9% (+1 percentage point) in exchange for the efficiency gain
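
The 32.5% figure follows directly from the decision mix and the per-decision call costs (fusion = 1 call, fission = 2 calls). A quick check using only the counts from the sample data:

```python
# Per-decision API cost, as defined in this notebook
FUSION_COST, FISSION_COST = 1, 2
N_EXAMPLES = 200

# Baseline: all 200 examples are fission
baseline_calls = 0 * FUSION_COST + 200 * FISSION_COST   # 400 calls

# Proposed: 130 fusion (65%) and 70 fission (35%)
proposed_calls = 130 * FUSION_COST + 70 * FISSION_COST  # 270 calls

baseline_avg = baseline_calls / N_EXAMPLES  # 2.00 calls per example
proposed_avg = proposed_calls / N_EXAMPLES  # 1.35 calls per example

reduction_pct = (baseline_calls - proposed_calls) / baseline_calls * 100
print(f"API reduction: {reduction_pct:.1f}%")
```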

In [None]:
# Display metrics in JSON format (equivalent to eval_out.json)
import json

print("Metrics (JSON format):")
print(json.dumps(metrics, indent=2))

# Optional: Save to file if desired (uncomment the lines below)
# with open("eval_out.json", "w") as f:
#     json.dump(metrics, f, indent=2)
# print("\nMetrics saved to eval_out.json")

## Metrics Output (JSON Format)

The cell in this section prints the complete metrics as JSON, equivalent to what the original script wrote to `eval_out.json`.

In [None]:
# Compute the metrics
metrics = compute_metrics(results)

# Display the main result (equivalent to the original script's print statement)
print(f"API reduction: {metrics['improvement']['api_reduction_pct']:.1f}%")

# Display all computed metrics in a formatted way
print("\n" + "="*50)
print("DETAILED EVALUATION RESULTS")
print("="*50)

for method in ["baseline", "proposed"]:
    print(f"\n{method.upper()} METHOD:")
    print(f"  Fusion rate:     {metrics[method]['fusion_rate']:.3f} ({metrics[method]['fusion_rate']*100:.1f}%)")
    print(f"  Fission rate:    {metrics[method]['fission_rate']:.3f} ({metrics[method]['fission_rate']*100:.1f}%)")
    print(f"  Error rate:      {metrics[method]['error_rate']:.3f} ({metrics[method]['error_rate']*100:.1f}%)")
    print(f"  Total API calls: {metrics[method]['api_calls']}")
    print(f"  Avg calls/example: {metrics[method]['avg_calls_per_example']:.2f}")

print("\nIMPROVEMENT:")
print(f"  API reduction:   {metrics['improvement']['api_reduction_pct']:.1f}%")
print(f"  Error rate diff: {metrics['improvement']['error_rate_diff']:+.3f}")

## Compute Metrics and Results

Execute the evaluation and display the computed metrics:

In [None]:
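def compute_metrics(results: dict) -> dict:
    """Compute per-method rates and API-call counts, plus the improvement.

    Expects ``results`` to map "baseline" and "proposed" to lists of
    prediction dicts with a "decision" ("fusion" or "fission") and a
    boolean "error". Returns one metrics dict per method and an
    "improvement" entry comparing proposed against baseline.
    """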
    metrics = {}

    for method in ["baseline", "proposed"]:
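        preds = results[method]
        if not preds:
            # Avoid a ZeroDivisionError in the rate calculations below
            raise ValueError(f"No predictions found for method '{method}'")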

        # Count decisions
        fusion_count = sum(1 for p in preds if p["decision"] == "fusion")
        fission_count = sum(1 for p in preds if p["decision"] == "fission")

        # Compute error rate
        errors = sum(1 for p in preds if p["error"])
        error_rate = errors / len(preds)

        # API calls (fusion=1, fission=2)
        api_calls = fusion_count + 2 * fission_count

        metrics[method] = {
            "fusion_rate": fusion_count / len(preds),
            "fission_rate": fission_count / len(preds),
            "error_rate": error_rate,
            "api_calls": api_calls,
            "avg_calls_per_example": api_calls / len(preds),
        }

    # Compute improvement
    baseline_calls = metrics["baseline"]["avg_calls_per_example"]
    proposed_calls = metrics["proposed"]["avg_calls_per_example"]
    metrics["improvement"] = {
        "api_reduction_pct": (baseline_calls - proposed_calls) / baseline_calls * 100,
        "error_rate_diff": metrics["proposed"]["error_rate"] - metrics["baseline"]["error_rate"],
    }

    return metrics

## Evaluation Function

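The `compute_metrics` function calculates, for both the baseline and proposed methods, the fusion and fission rates, the error rate, and the total and average API calls. It also reports the relative API-call reduction and the error-rate difference between the two methods: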

In [None]:
# Sample prediction data (originally from ../experiment_001/method_out.json)
# This data is structured to produce the exact metrics shown in eval_out.json

# Generate baseline predictions: 200 examples, all fission, 8% error rate
baseline_predictions = []
for i in range(200):
    baseline_predictions.append({
        "decision": "fission",
        "error": i < 16  # First 16 examples have errors (8% of 200)
    })

# Generate proposed predictions: 200 examples, 65% fusion/35% fission, 9% error rate
proposed_predictions = []
for i in range(200):
    if i < 130:  # First 130 are fusion (65% of 200)
        decision = "fusion"
    else:  # Remaining 70 are fission (35% of 200)
        decision = "fission"
    
    proposed_predictions.append({
        "decision": decision,
        "error": i < 18  # First 18 examples have errors (9% of 200)
    })

# Combine into results structure
results = {
    "baseline": baseline_predictions,
    "proposed": proposed_predictions
}

print(f"Baseline predictions: {len(results['baseline'])} examples")
print(f"Proposed predictions: {len(results['proposed'])} examples")

## Sample Data

The sample data cell contains prediction data that reproduces the scenario from the original script. The original code loaded this data from `../experiment_001/method_out.json`; here it is inlined so the notebook runs self-contained.
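
To run the evaluation on real predictions instead of the inlined sample, the data could be loaded from disk. This is a minimal sketch under stated assumptions: `load_results` is a hypothetical helper, and it assumes the JSON file holds a top-level object with `baseline` and `proposed` lists whose items carry `decision` and `error` fields, mirroring the structure used in this notebook. The actual layout of `method_out.json` is not shown here, so the checks may need adjusting.

```python
import json
import tempfile
from pathlib import Path


def load_results(path: str) -> dict:
    """Load prediction results from a JSON file and check their shape.

    Hypothetical helper; the expected layout mirrors the inlined sample
    data and is an assumption about the real file.
    """
    results = json.loads(Path(path).read_text())
    for method in ("baseline", "proposed"):
        if method not in results:
            raise KeyError(f"missing '{method}' predictions in {path}")
        for p in results[method]:
            if "decision" not in p or "error" not in p:
                raise ValueError(f"malformed prediction in '{method}': {p}")
    return results


# Round-trip demo with a throwaway file (illustrative data only)
demo = {
    "baseline": [{"decision": "fission", "error": False}],
    "proposed": [{"decision": "fusion", "error": True}],
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "method_out.json"
    demo_path.write_text(json.dumps(demo))
    loaded = load_results(str(demo_path))
```

With a real file in place, `results = load_results("../experiment_001/method_out.json")` would replace the inlined data, and the rest of the notebook would run unchanged.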

In [None]:
"""Evaluation script for DKW Controller."""
import json
import numpy as np

# DKW Controller Evaluation

**Artifact ID:** evaluation_001  
**Original File:** eval.py

This notebook evaluates the performance of a DKW Controller system, comparing baseline and proposed methods in terms of API call efficiency and error rates.

## Overview
- **Fusion decisions**: Single API call
- **Fission decisions**: Two API calls  
- **Metrics**: Fusion/fission rates, error rates, API call efficiency