# DKW Controller Evaluation

This notebook evaluates the DKW Controller by comparing a baseline method with a proposed method. Each prediction records whether the method chose a fusion or a fission operation and whether it resulted in an error.

**Artifact ID:** evaluation_001  
**Original file:** eval.py

**Key Metrics:**
- **Fusion Rate**: Proportion of decisions that chose fusion (1 API call)
- **Fission Rate**: Proportion of decisions that chose fission (2 API calls)  
- **Error Rate**: Proportion of predictions that resulted in errors
- **API Efficiency**: Average API calls per example, and the percentage reduction in that average for the proposed method relative to the baseline

## Setup and Imports

In [None]:
"""Evaluation script for DKW Controller."""
import json

## Sample Data

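Instead of reading from external JSON files, this notebook builds synthetic sample data inline, constructed to reproduce the expected rates for both methods. Errors are assigned to the first records in each list, so their position carries no meaning. Each prediction contains: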
- `decision`: Either "fusion" (1 API call) or "fission" (2 API calls)
- `error`: Boolean indicating if the prediction resulted in an error
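As a minimal sketch of the record schema described above, the check below rejects records with an unknown `decision` or a non-boolean `error`. The names `VALID_DECISIONS` and `validate_predictions` are hypothetical helpers introduced here for illustration; they are not part of the original `eval.py`.

```python
# Hypothetical helper (not in the original script): checks the record schema.
VALID_DECISIONS = {"fusion", "fission"}


def validate_predictions(preds):
    """Raise ValueError on a malformed record; return the number of records checked."""
    for i, p in enumerate(preds):
        if p.get("decision") not in VALID_DECISIONS:
            raise ValueError(f"record {i}: unexpected decision {p.get('decision')!r}")
        if not isinstance(p.get("error"), bool):
            raise ValueError(f"record {i}: 'error' must be a bool")
    return len(preds)


example = [
    {"decision": "fusion", "error": False},   # 1 API call, no error
    {"decision": "fission", "error": True},   # 2 API calls, error
]
print(validate_predictions(example))
```

Running a check like this before computing metrics would surface malformed records early, since a decision that is neither `"fusion"` nor `"fission"` would otherwise be silently left out of both counts.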

In [None]:
# Create sample data that matches the expected metrics
# Baseline: 200 examples, all fission, 8% error rate
baseline_data = []
for i in range(200):
    error = i < 16  # First 16 examples have errors (8% error rate)
    baseline_data.append({
        "decision": "fission",
        "error": error
    })

# Proposed: 200 examples, 65% fusion/35% fission, 9% error rate  
proposed_data = []
for i in range(200):
    decision = "fusion" if i < 130 else "fission"  # First 130 are fusion (65%)
    error = i < 18  # First 18 examples have errors (9% error rate)
    proposed_data.append({
        "decision": decision,
        "error": error
    })

# Combine into the expected results structure
results = {
    "baseline": baseline_data,
    "proposed": proposed_data
}

print(f"Created {len(results['baseline'])} baseline examples")
print(f"Created {len(results['proposed'])} proposed examples")

## Metrics Computation

The `compute_metrics` function analyzes the prediction results and calculates key performance indicators for both methods. It computes rates, API call efficiency, and improvement metrics.
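Written out, the API-call quantities reduce to three formulas. The symbols $n_{\text{fusion}}$, $n_{\text{fission}}$, $N$, and $\bar{c}$ are notation introduced here for illustration and do not appear in the original script:

$$
\text{calls} = n_{\text{fusion}} + 2\,n_{\text{fission}}, \qquad
\bar{c} = \frac{\text{calls}}{N}, \qquad
\text{reduction (\%)} = \frac{\bar{c}_{\text{baseline}} - \bar{c}_{\text{proposed}}}{\bar{c}_{\text{baseline}}} \times 100
$$

For the sample data in this notebook ($N = 200$ for both methods):

$$
\begin{aligned}
\text{baseline:}\quad & 0 + 2 \cdot 200 = 400 \text{ calls}, & \bar{c}_{\text{baseline}} &= 400 / 200 = 2.00 \\
\text{proposed:}\quad & 130 + 2 \cdot 70 = 270 \text{ calls}, & \bar{c}_{\text{proposed}} &= 270 / 200 = 1.35 \\
\text{reduction:}\quad & \frac{2.00 - 1.35}{2.00} \times 100 = 32.5\% \\
\text{error rate change:}\quad & 0.09 - 0.08 = +0.01 \ \text{(one percentage point)}
\end{aligned}
$$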

In [None]:
def compute_metrics(results: dict) -> dict:
    """Compute per-method rates and API-call counts, plus the proposed-vs-baseline improvement."""
    metrics = {}

    for method in ["baseline", "proposed"]:
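        preds = results[method]
        if not preds:
            # Guard against division by zero in the rate computations below
            raise ValueError(f"no predictions for method {method!r}")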

        # Count decisions
        fusion_count = sum(1 for p in preds if p["decision"] == "fusion")
        fission_count = sum(1 for p in preds if p["decision"] == "fission")

        # Compute error rate
        errors = sum(1 for p in preds if p["error"])
        error_rate = errors / len(preds)

        # API calls (fusion=1, fission=2)
        api_calls = fusion_count + 2 * fission_count

        metrics[method] = {
            "fusion_rate": fusion_count / len(preds),
            "fission_rate": fission_count / len(preds),
            "error_rate": error_rate,
            "api_calls": api_calls,
            "avg_calls_per_example": api_calls / len(preds),
        }

    # Compute improvement
    baseline_calls = metrics["baseline"]["avg_calls_per_example"]
    proposed_calls = metrics["proposed"]["avg_calls_per_example"]
    metrics["improvement"] = {
        "api_reduction_pct": (baseline_calls - proposed_calls) / baseline_calls * 100,
        "error_rate_diff": metrics["proposed"]["error_rate"] - metrics["baseline"]["error_rate"],
    }

    return metrics

## Run Evaluation

Now we'll compute the metrics and display the results in a formatted way.

In [None]:
# Compute metrics
metrics = compute_metrics(results)

# Display results in a formatted way
print("="*50)
print("DKW CONTROLLER EVALUATION RESULTS")
print("="*50)

# Print the same block of per-method metrics for each method
for method in ("baseline", "proposed"):
    m = metrics[method]
    print(f"\n{method.upper()} METHOD:")
    print(f"  Fusion Rate:       {m['fusion_rate']:.1%}")
    print(f"  Fission Rate:      {m['fission_rate']:.1%}")
    print(f"  Error Rate:        {m['error_rate']:.1%}")
    print(f"  Total API Calls:   {m['api_calls']:,}")
    print(f"  Avg Calls/Example: {m['avg_calls_per_example']:.2f}")

print("\nIMPROVEMENT:")
print(f"  API Reduction:     {metrics['improvement']['api_reduction_pct']:.1f}%")
print(f"  Error Rate Change: {metrics['improvement']['error_rate_diff']:+.1%}")

# Show key result
print(f"\n🎉 API reduction: {metrics['improvement']['api_reduction_pct']:.1f}%")

## Optional: Export Results

The original script saved results to a JSON file. Here's the equivalent output for reference:

In [None]:
# Display the metrics as JSON (equivalent to what would be saved to eval_out.json)
print("Metrics JSON Output:")
print("=" * 20)
print(json.dumps(metrics, indent=2))

# If you want to save to a file, uncomment the following:
# with open("eval_out.json", "w") as f:
#     json.dump(metrics, f, indent=2)

## Usage Instructions

This notebook is **completely self-contained** and ready to run! Here are some ways you can customize it:

### Modify Sample Data:
- Change the number of examples: adjust the `range(200)` values
- Modify error rates: change the threshold values (e.g., `i < 16` for 8% error rate)
- Adjust fusion/fission ratios: modify the decision logic
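The three adjustments above could be folded into one parameterized generator. This is a sketch only: `make_sample_data` and its parameters are hypothetical names, and it keeps the notebook's convention of assigning fusion decisions and errors to the first records.

```python
from typing import Dict, List


def make_sample_data(n: int, fusion_rate: float, error_rate: float) -> List[Dict[str, object]]:
    """Build n synthetic records; the first records get the fusion decisions and the errors."""
    n_fusion = round(n * fusion_rate)   # number of "fusion" decisions
    n_errors = round(n * error_rate)    # number of records flagged as errors
    return [
        {
            "decision": "fusion" if i < n_fusion else "fission",
            "error": i < n_errors,
        }
        for i in range(n)
    ]


# Same shape as the notebook's sample data: 200 examples per method
results_sketch = {
    "baseline": make_sample_data(200, fusion_rate=0.0, error_rate=0.08),
    "proposed": make_sample_data(200, fusion_rate=0.65, error_rate=0.09),
}
print(len(results_sketch["baseline"]), len(results_sketch["proposed"]))
```

A dictionary built this way has the same structure as `results`, so it could be passed to `compute_metrics` unchanged.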

### Add Real Data:
- Replace the sample data generation with real prediction results
- Load data from your own JSON files
- Connect to your evaluation pipeline
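Loading real predictions from a JSON file might look like the sketch below. It assumes the file uses the same layout as the `results` dictionary (top-level `"baseline"` and `"proposed"` lists of records). The function name `load_results` and the file name `predictions.json` are hypothetical.

```python
import json


def load_results(path):
    """Load a results dict from JSON and check that both method keys are present."""
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)
    for method in ("baseline", "proposed"):
        if method not in loaded:
            raise KeyError(f"missing top-level key {method!r} in {path}")
    return loaded


# Example usage (assumes a file named predictions.json exists):
# results = load_results("predictions.json")
# metrics = compute_metrics(results)
```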

### Extend Analysis:
- Add visualization (matplotlib/seaborn)
- Include confidence intervals
- Add statistical significance testing
- Create comparison charts

### Running the Notebook:
1. No additional packages needed beyond standard Python libraries
2. Run all cells in order
3. No external files required - everything is self-contained!

---
*Original script: eval.py | Converted to interactive Jupyter notebook*