## Analysis Summary

**Key Findings:**

1. **API Efficiency**: The proposed method achieves a **32.5% reduction** in API calls compared to the baseline
   - Baseline: 2.0 calls per example (all fission decisions)
   - Proposed: 1.35 calls per example (mix of fusion/fission)

2. **Decision Strategy**: The proposed method uses a mixed strategy:
   - 65% fusion decisions (more efficient, 1 API call each)
   - 35% fission decisions (less efficient, 2 API calls each) 

3. **Accuracy Trade-off**: The error rate rises by 1 percentage point (9% vs. 8% for the baseline) in exchange for the API savings

4. **Overall**: On this sample data, the proposed method cuts API calls by about a third at a small accuracy cost, making it practical for production use where API costs are a concern.
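
The headline figures follow directly from the decision mix. The snippet below is a minimal check, assuming the call costs stated above (1 per fusion, 2 per fission); the variable names are illustrative and are not part of the original script.

```python
# Call costs and decision mix taken from the findings above.
FUSION_COST, FISSION_COST = 1, 2
fusion_rate, fission_rate = 0.65, 0.35

# The baseline makes only fission decisions, so it always pays the fission cost.
baseline_calls = 1.0 * FISSION_COST
proposed_calls = fusion_rate * FUSION_COST + fission_rate * FISSION_COST

# Relative reduction in average calls per example.
reduction_pct = (baseline_calls - proposed_calls) / baseline_calls * 100

print(f"Proposed calls per example: {proposed_calls:.2f}")
print(f"API reduction: {reduction_pct:.1f}%")
```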

**To modify this notebook**: Update the data generation section to experiment with different decision strategies and error rates.

In [None]:
# Display full metrics in formatted JSON (equivalent to eval_out.json)
print("Complete evaluation metrics:")
print("=" * 40)
print(json.dumps(metrics, indent=2))

# Also save to file (optional, as in original script)
with open("eval_out.json", "w") as f:
    json.dump(metrics, f, indent=2)
    
print("\nMetrics also saved to 'eval_out.json'")

## Detailed Results

Here are the complete evaluation metrics (equivalent to the original `eval_out.json` output):

In [None]:
# Compute evaluation metrics
metrics = compute_metrics(results)

# Display the key result that the original script printed
print(f"API reduction: {metrics['improvement']['api_reduction_pct']:.1f}%")
print(f"Error rate difference: {metrics['improvement']['error_rate_diff']:.2f}")
print()

## Run Analysis

Now we'll compute the evaluation metrics using our sample data and display the results. This replaces the original script's file I/O operations with direct computation.

In [None]:
def compute_metrics(results: dict) -> dict:
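    """Compute per-method metrics and the proposed-vs-baseline improvement.

    `results` maps "baseline" and "proposed" to lists of prediction dicts,
    each with a "decision" ("fusion" or "fission") and a boolean "error".
    """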
    metrics = {}

    for method in ["baseline", "proposed"]:
        preds = results[method]

        # Count decisions
        fusion_count = sum(1 for p in preds if p["decision"] == "fusion")
        fission_count = sum(1 for p in preds if p["decision"] == "fission")

        # Compute error rate
        errors = sum(1 for p in preds if p["error"])
        error_rate = errors / len(preds)

        # API calls (fusion=1, fission=2)
        api_calls = fusion_count + 2 * fission_count

        metrics[method] = {
            "fusion_rate": fusion_count / len(preds),
            "fission_rate": fission_count / len(preds),
            "error_rate": error_rate,
            "api_calls": api_calls,
            "avg_calls_per_example": api_calls / len(preds),
        }

    # Compute improvement
    baseline_calls = metrics["baseline"]["avg_calls_per_example"]
    proposed_calls = metrics["proposed"]["avg_calls_per_example"]
    metrics["improvement"] = {
        "api_reduction_pct": (baseline_calls - proposed_calls) / baseline_calls * 100,
        "error_rate_diff": metrics["proposed"]["error_rate"] - metrics["baseline"]["error_rate"],
    }

    return metrics

## Evaluation Metrics Function

The `compute_metrics` function analyzes prediction results and calculates key performance indicators:

**For each method (baseline/proposed):**
- **Fusion rate**: Proportion of decisions that were "fusion" (1 API call each)
- **Fission rate**: Proportion of decisions that were "fission" (2 API calls each) 
- **Error rate**: Proportion of predictions that were incorrect
- **Total API calls**: fusion_count × 1 + fission_count × 2
- **Average calls per example**: Total API calls divided by number of examples

**Overall improvement metrics:**
- **API reduction percentage**: How much the proposed method reduces API usage vs baseline
- **Error rate difference**: Change in error rate (proposed - baseline)
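
To make the definitions above concrete, here is a small worked example on a toy input of three predictions. It is only a sketch of the same arithmetic; the toy data is invented for illustration and does not come from the experiment.

```python
# Toy input: three predictions, two fusion and one fission, one of them wrong.
preds = [
    {"decision": "fusion", "error": False},
    {"decision": "fusion", "error": True},
    {"decision": "fission", "error": False},
]

n = len(preds)
fusion_count = sum(p["decision"] == "fusion" for p in preds)
fission_count = sum(p["decision"] == "fission" for p in preds)

# Total API calls: fusion_count x 1 + fission_count x 2.
total_api_calls = fusion_count * 1 + fission_count * 2
avg_calls = total_api_calls / n
error_rate = sum(p["error"] for p in preds) / n

print(f"fusion rate:   {fusion_count / n:.3f}")
print(f"fission rate:  {fission_count / n:.3f}")
print(f"error rate:    {error_rate:.3f}")
print(f"total calls:   {total_api_calls}")
print(f"avg calls:     {avg_calls:.3f}")
```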

In [None]:
# Create sample data that matches the original eval_out.json results
# Baseline: 200 examples, all fission, 8% error rate
baseline_predictions = []
for i in range(200):
    error = i < 16  # First 16 examples have errors (8% of 200)
    baseline_predictions.append({
        "decision": "fission",
        "error": error
    })

# Proposed: 200 examples, 65% fusion (130), 35% fission (70), 9% error rate
proposed_predictions = []
error_indices = set(range(18))  # First 18 examples have errors (9% of 200)

# Add fusion decisions (130 examples)
for i in range(130):
    proposed_predictions.append({
        "decision": "fusion",
        "error": i in error_indices
    })

# Add fission decisions (70 examples)  
for i in range(130, 200):
    proposed_predictions.append({
        "decision": "fission",
        "error": i in error_indices
    })

# Combined results data
results = {
    "baseline": baseline_predictions,
    "proposed": proposed_predictions
}

print("Created sample data:")
print(f"- Baseline: {len(results['baseline'])} predictions")
print(f"- Proposed: {len(results['proposed'])} predictions")

## Experimental Data

The original script read from `../experiment_001/method_out.json`. For this self-contained notebook, we'll inline sample data that produces the same results as shown in the original `eval_out.json`.

The data contains prediction results for both baseline and proposed methods, where each prediction has:
- `decision`: Either "fusion" (1 API call) or "fission" (2 API calls)  
- `error`: Boolean indicating if the prediction was incorrect
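
When swapping in your own data, it can help to check each record against this schema before computing metrics. The helper below is a hypothetical sketch (`validate_predictions` is not part of the original script) that assumes records are plain dicts with exactly the two fields described above.

```python
def validate_predictions(preds):
    """Hypothetical helper: raise ValueError if a record breaks the schema.

    Returns the number of records checked.
    """
    allowed = {"fusion", "fission"}
    for i, p in enumerate(preds):
        if p.get("decision") not in allowed:
            raise ValueError(f"record {i}: unexpected decision {p.get('decision')!r}")
        if not isinstance(p.get("error"), bool):
            raise ValueError(f"record {i}: 'error' must be a bool")
    return len(preds)


# Illustrative records, not taken from the experiment.
sample = [
    {"decision": "fusion", "error": False},
    {"decision": "fission", "error": True},
]
print(f"Validated {validate_predictions(sample)} records")
```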

In [None]:
# Import required libraries
import json

# DKW Controller Evaluation

This notebook evaluates the performance of DKW Controller methods, comparing baseline and proposed approaches in terms of API efficiency and error rates.

The original script analyzed results from experiments and computed key performance metrics including:
- Fusion/fission decision rates
- Error rates
- API call efficiency
- Performance improvements