# DKW Controller Evaluation

This notebook evaluates the DKW Controller by comparing a baseline method and a proposed method on fusion/fission decision rates, error rates, and API call efficiency.

**Artifact ID:** evaluation_001  
**Original File:** eval.py

## Imports and Setup

Import required libraries for the evaluation:

In [None]:
"""Evaluation script for DKW Controller."""
import json
import numpy as np

## Sample Data (Inlined for Self-Contained Demo)

The original script reads from `../experiment_001/method_out.json`. This self-contained notebook instead inlines data that reproduces the results shown in `eval_out.json`.

The evaluation compares two methods:
- **Baseline**: Uses fission-only strategy (2 API calls per example)
- **Proposed**: Uses adaptive fusion/fission strategy (1-2 API calls per example)

Each prediction contains:
- `decision`: Either "fusion" (1 API call) or "fission" (2 API calls)
- `error`: Boolean indicating if the prediction was incorrect
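The record format above can be checked before any metrics are computed. The following is a minimal sketch, not part of the original script: the `validate_predictions` helper and its error messages are hypothetical, and it assumes each method maps to a list of dicts carrying the two fields described above.

```python
VALID_DECISIONS = {"fusion", "fission"}  # the two decision types described above


def validate_predictions(results: dict) -> None:
    """Raise ValueError if any prediction record is malformed (hypothetical helper)."""
    for method, preds in results.items():
        if not preds:
            raise ValueError(f"{method}: no predictions")
        for i, p in enumerate(preds):
            if p.get("decision") not in VALID_DECISIONS:
                raise ValueError(f"{method}[{i}]: bad decision {p.get('decision')!r}")
            # Strict check: values such as 0/1 or "true" are rejected
            if not isinstance(p.get("error"), bool):
                raise ValueError(f"{method}[{i}]: 'error' must be a bool")


sample = {
    "baseline": [{"decision": "fission", "error": False}],
    "proposed": [{"decision": "fusion", "error": True}],
}
validate_predictions(sample)  # passes silently for well-formed records
```

Rejecting empty lists up front also avoids a division by zero later, since the rates are computed by dividing by the number of predictions.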

In [None]:
# Inline sample data (replaces reading from external JSON files)
# This data produces the exact metrics shown in eval_out.json

# Create sample data for 200 examples each
results = {
    "baseline": [
        # All fission decisions, 8% error rate (16 errors out of 200)
        *[{"decision": "fission", "error": True} for _ in range(16)],   # 16 errors
        *[{"decision": "fission", "error": False} for _ in range(184)]  # 184 correct
    ],
    "proposed": [
        # 65% fusion (130), 35% fission (70), 9% error rate (18 errors out of 200)
        *[{"decision": "fusion", "error": True} for _ in range(12)],    # 12 fusion errors  
        *[{"decision": "fusion", "error": False} for _ in range(118)],  # 118 fusion correct
        *[{"decision": "fission", "error": True} for _ in range(6)],    # 6 fission errors
        *[{"decision": "fission", "error": False} for _ in range(64)]   # 64 fission correct
    ]
}

print(f"Baseline examples: {len(results['baseline'])}")
print(f"Proposed examples: {len(results['proposed'])}")
print(f"\nBaseline fusion decisions: {sum(1 for p in results['baseline'] if p['decision'] == 'fusion')}")
print(f"Proposed fusion decisions: {sum(1 for p in results['proposed'] if p['decision'] == 'fusion')}")
print(f"Proposed fission decisions: {sum(1 for p in results['proposed'] if p['decision'] == 'fission')}")

## Metrics Computation Function

The `compute_metrics` function calculates:
- **Fusion/Fission rates**: Proportion of each decision type
- **Error rate**: Proportion of incorrect predictions  
- **API calls**: Total calls (fusion=1, fission=2 per example)
- **API efficiency**: Average calls per example
- **Improvement metrics**: Percentage reduction in average API calls relative to the baseline, and the error rate difference (proposed minus baseline)
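To make the API call arithmetic concrete, here is a small worked sketch that uses only the decision counts from the inlined sample data. The `avg_calls` helper is a hypothetical name introduced for illustration; it is not part of the original script.

```python
# Decision counts taken from the inlined sample data (200 examples per method)
baseline_fusion, baseline_fission = 0, 200
proposed_fusion, proposed_fission = 130, 70


def avg_calls(fusion: int, fission: int) -> float:
    """Average API calls per example: fusion costs 1 call, fission costs 2."""
    return (fusion * 1 + fission * 2) / (fusion + fission)


baseline_avg = avg_calls(baseline_fusion, baseline_fission)  # 400 calls / 200 examples
proposed_avg = avg_calls(proposed_fusion, proposed_fission)  # 270 calls / 200 examples
reduction_pct = (baseline_avg - proposed_avg) / baseline_avg * 100

print(f"baseline avg calls: {baseline_avg:.2f}")
print(f"proposed avg calls: {proposed_avg:.2f}")
print(f"API reduction: {reduction_pct:.1f}%")
```

Because the baseline always uses 2 calls, a proposed fusion rate of f gives an average of 2 - f calls, so the reduction is f / 2. With f = 65% this gives the 32.5% reported for this data.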

In [None]:
def compute_metrics(results: dict) -> dict:
    """Compute evaluation metrics."""
    metrics = {}

    for method in ["baseline", "proposed"]:
        preds = results[method]

        # Count decisions
        fusion_count = sum(1 for p in preds if p["decision"] == "fusion")
        fission_count = sum(1 for p in preds if p["decision"] == "fission")

        # Compute error rate
        errors = sum(1 for p in preds if p["error"])
        error_rate = errors / len(preds)

        # API calls (fusion=1, fission=2)
        api_calls = fusion_count + 2 * fission_count

        metrics[method] = {
            "fusion_rate": fusion_count / len(preds),
            "fission_rate": fission_count / len(preds),
            "error_rate": error_rate,
            "api_calls": api_calls,
            "avg_calls_per_example": api_calls / len(preds),
        }

    # Compute improvement
    baseline_calls = metrics["baseline"]["avg_calls_per_example"]
    proposed_calls = metrics["proposed"]["avg_calls_per_example"]
    metrics["improvement"] = {
        "api_reduction_pct": (baseline_calls - proposed_calls) / baseline_calls * 100,
        "error_rate_diff": metrics["proposed"]["error_rate"] - metrics["baseline"]["error_rate"],
    }

    return metrics

## Run Evaluation

Execute the metrics computation and display results (replicating the original script output):

In [None]:
# Compute metrics for both methods from the inlined results
metrics = compute_metrics(results)

# Display main result (matching original script output)
print(f"API reduction: {metrics['improvement']['api_reduction_pct']:.1f}%")

## Save Results

Display the complete evaluation results in JSON format (equivalent to what would be saved to `eval_out.json`):

In [None]:
# Display the metrics in formatted JSON (replicating file output)
print("Contents of eval_out.json:")
print(json.dumps(metrics, indent=2))

# Optionally save to file (uncomment to use)
# with open("eval_out.json", "w") as f:
#     json.dump(metrics, f, indent=2)
# print("\nMetrics saved to eval_out.json")

## Analysis Summary

Interactive analysis of the evaluation results:

In [None]:
# Extract key findings for analysis
baseline = metrics["baseline"]
proposed = metrics["proposed"] 
improvement = metrics["improvement"]

print("📊 DKW CONTROLLER EVALUATION SUMMARY")
print("=" * 50)
print(f"🎯 API Call Reduction: {improvement['api_reduction_pct']:.1f}%")
print(f"📈 Baseline avg calls/example: {baseline['avg_calls_per_example']:.2f}")
print(f"📉 Proposed avg calls/example: {proposed['avg_calls_per_example']:.2f}")
print()
print("🔀 Decision Strategy Comparison:")
print(f"   Baseline: {baseline['fusion_rate']:.0%} fusion, {baseline['fission_rate']:.0%} fission")
print(f"   Proposed: {proposed['fusion_rate']:.0%} fusion, {proposed['fission_rate']:.0%} fission")
print()
print("⚠️ Error Rate Analysis:")
print(f"   Baseline: {baseline['error_rate']:.1%}")
print(f"   Proposed: {proposed['error_rate']:.1%}")
print(f"   Difference: {improvement['error_rate_diff']:+.1%}")
print()
print("💡 Key Insight:")
print(f"   The proposed method achieves a {improvement['api_reduction_pct']:.1f}% reduction in API calls")
print(f"   by using fusion {proposed['fusion_rate']:.0%} of the time, with only a")
print(f"   {improvement['error_rate_diff'] * 100:+.1f} percentage-point change in error rate.")

## How to Modify This Notebook

This notebook is completely self-contained and runnable. To customize it:

1. **Change the data**: Modify the `results` dictionary to test different scenarios
2. **Add new metrics**: Extend the `compute_metrics()` function 
3. **Export results**: Uncomment the file saving code to write JSON output
4. **Add visualization**: Use matplotlib/seaborn to create charts
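As an illustration of item 2, the sketch below adds a per-decision error rate. The `error_rate_by_decision` helper is hypothetical and does not appear in the original script; it assumes the same prediction records used throughout this notebook.

```python
def error_rate_by_decision(preds: list) -> dict:
    """Error rate within each decision type (hypothetical extra metric)."""
    rates = {}
    for decision in ("fusion", "fission"):
        subset = [p for p in preds if p["decision"] == decision]
        # None when a method never makes this decision (e.g. baseline fusion)
        rates[decision] = (
            sum(1 for p in subset if p["error"]) / len(subset) if subset else None
        )
    return rates


# Same counts as the inlined "proposed" data: 12 of 130 fusion and 6 of 70 fission are errors
proposed_preds = (
    [{"decision": "fusion", "error": True}] * 12
    + [{"decision": "fusion", "error": False}] * 118
    + [{"decision": "fission", "error": True}] * 6
    + [{"decision": "fission", "error": False}] * 64
)
print(error_rate_by_decision(proposed_preds))
```

One way to wire this in would be to store the returned dict under a new key in `metrics[method]` inside `compute_metrics()`. Returning `None` rather than `0.0` for an unused decision type keeps "no data" distinct from "no errors".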

### Key Changes from Original Script:
- **No external file dependencies**: JSON data is inlined as Python dictionaries
- **Interactive exploration**: Added detailed analysis and formatted output
- **Self-contained**: Can be run without any additional files

### Expected Output Verification:
This notebook produces the same results as the provided `eval_out.json`:
- API reduction: 32.5%
- Baseline: 0% fusion, 100% fission, 8% error rate
- Proposed: 65% fusion, 35% fission, 9% error rate
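The expected values listed above could also be checked programmatically. This is a minimal sketch under stated assumptions: `check_metrics` and `EXPECTED` are hypothetical names, and `example_metrics` is a hand-written stand-in with the same layout that `compute_metrics()` returns, so the block runs on its own.

```python
import math

# Expected values from the list above, keyed by (section, metric name)
EXPECTED = {
    ("improvement", "api_reduction_pct"): 32.5,
    ("baseline", "fusion_rate"): 0.0,
    ("baseline", "fission_rate"): 1.0,
    ("baseline", "error_rate"): 0.08,
    ("proposed", "fusion_rate"): 0.65,
    ("proposed", "fission_rate"): 0.35,
    ("proposed", "error_rate"): 0.09,
}


def check_metrics(metrics: dict, tol: float = 1e-9) -> list:
    """Return descriptions of mismatches; an empty list means every value matches."""
    mismatches = []
    for (section, key), want in EXPECTED.items():
        got = metrics[section][key]
        # Tolerant comparison, since the rates are floating-point quotients
        if not math.isclose(got, want, abs_tol=tol):
            mismatches.append(f"{section}.{key}: got {got}, expected {want}")
    return mismatches


# Stand-in for the dict returned by compute_metrics(results)
example_metrics = {
    "baseline": {"fusion_rate": 0.0, "fission_rate": 1.0, "error_rate": 16 / 200},
    "proposed": {"fusion_rate": 130 / 200, "fission_rate": 70 / 200, "error_rate": 18 / 200},
    "improvement": {"api_reduction_pct": (2.0 - 1.35) / 2.0 * 100},
}
print(check_metrics(example_metrics) or "all expected values match")
```

In the notebook itself, passing the real `metrics` dict in place of `example_metrics` would perform the same check against the computed values.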