## How to Use with Your Own Data

To use this notebook with your own evaluation data:

1. **Replace the sample data generation**: Modify the second code cell to load your actual data instead of generating sample data.

2. **Expected data format**: Your data should be a dictionary with this structure:
   ```python
   {
       "baseline": [
           {"decision": "fission", "error": False},  # decision: "fusion" or "fission"; error: bool
           # ... more examples
       ],
       "proposed": [
           {"decision": "fusion", "error": True},
           # ... more examples
       ]
   }
   ```

3. **Loading from files**: If you have JSON files, replace the sample data section with:
   ```python
   import json

   with open("your_results_file.json") as f:
       results = json.load(f)
   ```

4. **Customizing metrics**: Modify the `compute_metrics` function if you need different evaluation metrics or have different cost models for fusion/fission operations.

The remaining cells depend only on `results` and the `metrics` computed from it, so they run unchanged once your data has this structure.
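Before swapping in real data, it can help to check that it matches the structure in step 2. The following is a minimal sketch; `validate_results` is a hypothetical helper and is not part of the original notebook.

```python
def validate_results(results: dict) -> None:
    """Raise ValueError if `results` does not match the expected structure."""
    for method in ("baseline", "proposed"):
        if method not in results:
            raise ValueError(f"missing key: {method!r}")
        if not results[method]:
            raise ValueError(f"{method!r} has no examples")
        for i, row in enumerate(results[method]):
            if row.get("decision") not in ("fusion", "fission"):
                raise ValueError(f"{method}[{i}]: decision must be 'fusion' or 'fission'")
            if not isinstance(row.get("error"), bool):
                raise ValueError(f"{method}[{i}]: error must be a bool")


# Tiny illustrative input; real data would come from your own results file
example = {
    "baseline": [{"decision": "fission", "error": False}],
    "proposed": [{"decision": "fusion", "error": True}],
}
validate_results(example)  # passes silently when the structure is valid
```

Running the check right after loading makes format problems surface before `compute_metrics` is called.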

In [None]:
import matplotlib.pyplot as plt

# Create visualizations
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('DKW Controller Evaluation Results', fontsize=16, fontweight='bold')

# 1. Decision Types Comparison
methods = ['Baseline', 'Proposed']
fusion_rates = [metrics['baseline']['fusion_rate'], metrics['proposed']['fusion_rate']]
fission_rates = [metrics['baseline']['fission_rate'], metrics['proposed']['fission_rate']]

x = range(len(methods))
width = 0.35

ax1.bar([i - width/2 for i in x], fusion_rates, width, label='Fusion', color='skyblue')
ax1.bar([i + width/2 for i in x], fission_rates, width, label='Fission', color='lightcoral')
ax1.set_ylabel('Rate')
ax1.set_title('Decision Type Distribution')
ax1.set_xticks(x)
ax1.set_xticklabels(methods)
ax1.legend()

# 2. Error Rates
error_rates = [metrics['baseline']['error_rate'], metrics['proposed']['error_rate']]
ax2.bar(methods, error_rates, color=['orange', 'red'], alpha=0.7)
ax2.set_ylabel('Error Rate')
ax2.set_title('Error Rate Comparison')
ax2.set_ylim(0, max(error_rates) * 1.2)

# 3. API Calls per Example
api_calls = [metrics['baseline']['avg_calls_per_example'], metrics['proposed']['avg_calls_per_example']]
bars = ax3.bar(methods, api_calls, color=['lightblue', 'lightgreen'])
ax3.set_ylabel('Average API Calls per Example')
ax3.set_title('API Efficiency')

# Add value labels on bars
for bar, value in zip(bars, api_calls):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height,
             f'{value:.2f}', ha='center', va='bottom')

# 4. Summary metrics
summary_labels = ['API Reduction\n(%)', 'Error Rate\nDifference (%)']
summary_values = [metrics['improvement']['api_reduction_pct'], 
                 metrics['improvement']['error_rate_diff'] * 100]
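# Green marks an improvement: a positive API reduction, or a non-positive error-rate difference
colors = ['green' if summary_values[0] > 0 else 'red',
          'green' if summary_values[1] <= 0 else 'red']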

bars = ax4.bar(summary_labels, summary_values, color=colors, alpha=0.7)
ax4.set_title('Improvement Summary')
ax4.axhline(y=0, color='black', linestyle='-', linewidth=0.5)

# Add value labels
for bar, value in zip(bars, summary_values):
    height = bar.get_height()
    ax4.text(bar.get_x() + bar.get_width()/2., height + (1 if height >= 0 else -1),
             f'{value:.1f}%', ha='center', va='bottom' if height >= 0 else 'top')

plt.tight_layout()
plt.show()

# Print key insights
print("KEY INSIGHTS:")
print(f"✓ The proposed method achieves {metrics['improvement']['api_reduction_pct']:.1f}% reduction in API calls")
print(f"✓ Fusion rate changed from {metrics['baseline']['fusion_rate']:.0%} to {metrics['proposed']['fusion_rate']:.0%}")
error_diff = metrics['improvement']['error_rate_diff']
direction = "increase" if error_diff > 0 else "decrease" if error_diff < 0 else "no change"
print(f"✓ Error rate changed by {error_diff:+.1%} ({direction})")

## Visualization

Let's create some plots to visualize the controller's behavior over time.
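One quantity worth plotting over time is the controller's upper bound on the error rate. The sketch below assumes a list of 0/1 error observations and uses the same bound as `DKWController`, but computed over all samples seen so far rather than a sliding window. `running_upper_bound` is a hypothetical helper.

```python
import math


def running_upper_bound(errors, delta=0.05):
    """Return the upper bound (running mean + DKW epsilon) after each observation."""
    bounds = []
    total = 0.0
    for n, err in enumerate(errors, start=1):
        total += err
        # Mirror the controller: epsilon is fixed at 1.0 until there are 2 samples
        epsilon = 1.0 if n < 2 else math.sqrt(math.log(2 / delta) / (2 * n))
        bounds.append(total / n + epsilon)
    return bounds


bounds = running_upper_bound([0, 0, 1, 0, 0, 0, 0, 0])
```

The resulting series can be passed to `plt.plot(bounds)`, with a horizontal line at the target error rate for reference.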

## Usage Notes & Customization

### How to use this notebook:
1. **Self-contained**: This notebook runs without any external files or dependencies
2. **Customizable**: Modify the `simulated_dataset` to test with your own questions
3. **Extensible**: Add new fields to the output format by modifying the `collect_data()` function

### Original vs Notebook differences:
- **Original**: Loads data from HuggingFace datasets library
- **Notebook**: Uses inline sample data for demonstration
- **Original**: Saves output to `data_out.json` file  
- **Notebook**: Displays output directly in cells

### To restore original functionality:
1. Install dependencies: `pip install datasets`
2. Uncomment the HuggingFace dataset loading code
3. Add file writing functionality back if needed
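Step 3 can be sketched as follows, assuming `data` is the list returned by `collect_data()`. The single record here is a placeholder so the snippet runs on its own, and `save_data` is a hypothetical helper.

```python
import json
from pathlib import Path


def save_data(data, path="data_out.json"):
    """Write the processed examples to a JSON file and return the path."""
    out = Path(path)
    out.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return out


# Placeholder record in the notebook's output format
data = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15},
]
saved = save_data(data)
```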

In [None]:
# Analyze results
baseline_errors = sum(1 for r in results["baseline"] if r["error"])
proposed_errors = sum(1 for r in results["proposed"] if r["error"])

baseline_error_rate = baseline_errors / len(results["baseline"])
proposed_error_rate = proposed_errors / len(results["proposed"])

# Count fusion vs fission decisions for proposed method
fusion_count = sum(1 for r in results["proposed"] if r["decision"] == "fusion")
fission_count = sum(1 for r in results["proposed"] if r["decision"] == "fission")

print("📈 EXPERIMENT RESULTS")
print("=" * 50)
print(f"Total examples processed: {len(results['baseline'])}")
print()
print("🔸 BASELINE (always fission):")
print(f"   Error rate: {baseline_error_rate:.1%} ({baseline_errors}/{len(results['baseline'])})")
print(f"   Fusion decisions: 0/{len(results['baseline'])} (0%)")
print()
print("🔹 PROPOSED (DKW controller):")
print(f"   Error rate: {proposed_error_rate:.1%} ({proposed_errors}/{len(results['proposed'])})")
print(f"   Fusion decisions: {fusion_count}/{len(results['proposed'])} ({100*fusion_count/len(results['proposed']):.1f}%)")
print(f"   Fission decisions: {fission_count}/{len(results['proposed'])} ({100*fission_count/len(results['proposed']):.1f}%)")
print()
print("🔄 DECISION CHANGES:")
print(f"   Mode switches: {len(decision_changes)}")
for change in decision_changes:
    print(f"   Step {change['step']}: {change['from']} → {change['to']} (upper bound: {change['stats']['upper_bound']:.3f})")

## Visualization

Create visualizations to better understand the performance comparison between baseline and proposed methods.

In [None]:
# Expected output format (from original data_out.json)
expected_output = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001", 
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print("Expected output format:")
print(json.dumps(expected_output, indent=2))

## Results Analysis

Let's analyze the performance of our DKW controller compared to the baseline approach.

In [None]:
# Save results to JSON file (equivalent to the original script)
with open("eval_out.json", "w") as f:
    json.dump(metrics, f, indent=2)

print("Results saved to eval_out.json")

# Display the JSON content for verification
print("\nSaved JSON content:")
print(json.dumps(metrics, indent=2))

In [None]:
def run_experiment(data, verbose=False):
    """Run DKW controller experiment with inline data."""
    controller = DKWController()
    results = {"baseline": [], "proposed": []}
    
    if verbose:
        print("🚀 Starting experiment...")
        print(f"Controller settings: target={controller.epsilon_target}, min_samples={controller.min_samples}")
    
    decision_changes = []
    
    for i, example in enumerate(data):
        # Simulate error occurrence based on difficulty
        # (cast to a plain bool so the results stay JSON-serializable)
        error = bool(np.random.random() < example["difficulty"])
        controller.add_observation(float(error))
        decision = controller.decide()
        
        # Track decision changes for analysis
        if i > 0 and decision != results["proposed"][-1]["decision"]:
            stats = controller.get_stats()
            decision_changes.append({
                "step": i,
                "from": results["proposed"][-1]["decision"],
                "to": decision,
                "stats": stats
            })

        results["proposed"].append({
            "id": example["id"],
            "decision": decision,
            "error": error,
            "difficulty": example["difficulty"]
        })
        results["baseline"].append({
            "id": example["id"],
            "decision": "fission",  # Always conservative
            "error": error,
            "difficulty": example["difficulty"]
        })
        
        if verbose and i % 50 == 0:
            stats = controller.get_stats()
            print(f"  Step {i}: {stats['samples']} samples, error rate: {stats['empirical_error']:.3f}, mode: {stats['current_state']}")

    if verbose:
        print(f"✅ Experiment complete! Decision changes: {len(decision_changes)}")
    
    return results, decision_changes

# Run the experiment
print("🔬 Running experiment...")
results, decision_changes = run_experiment(sample_data, verbose=True)

## Expected Output Reference

For comparison, here's the expected output structure that was provided in the original specification:

## Save Results

Save the evaluation metrics to a JSON file (as in the original script).

In [None]:
# Display the complete processed dataset
print("Complete processed dataset:")
print(json.dumps(data, indent=2))

# In the original script, this would be saved to a file:
# with open("data_out.json", "w") as f:
#     json.dump(data, f, indent=2)

In [None]:
# Compute the evaluation metrics
metrics = compute_metrics(results)

# Display the key result (as in the original script)
print(f"API reduction: {metrics['improvement']['api_reduction_pct']:.1f}%")
print()

# Display detailed metrics in a nice format
print("Detailed Evaluation Results:")
print("=" * 50)

for method in ["baseline", "proposed"]:
    print(f"\n{method.upper()} METHOD:")
    m = metrics[method]
    print(f"  Fusion rate:           {m['fusion_rate']:.1%}")
    print(f"  Fission rate:          {m['fission_rate']:.1%}")
    print(f"  Error rate:            {m['error_rate']:.1%}")
    print(f"  Total API calls:       {m['api_calls']}")
    print(f"  Avg calls per example: {m['avg_calls_per_example']:.2f}")

print(f"\nIMPROVEMENT:")
imp = metrics['improvement']
print(f"  API reduction:         {imp['api_reduction_pct']:.1f}%")
print(f"  Error rate difference: {imp['error_rate_diff']:+.1%}")

## View Complete Dataset

Let's examine the complete processed dataset structure that would normally be saved to `data_out.json`.

## Experiment Function

The experiment simulates running both the DKW controller (proposed method) and a baseline that always uses fission mode. Errors occur probabilistically based on each example's difficulty level.
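The error model can be sketched in isolation: each example fails with probability equal to its `difficulty`. This standalone version uses the standard library's `random` module rather than NumPy, and `simulate_errors` is a hypothetical name.

```python
import random


def simulate_errors(difficulties, seed=0):
    """Draw one Bernoulli error per example with P(error) = difficulty."""
    rng = random.Random(seed)  # seeded generator for reproducible draws
    return [rng.random() < d for d in difficulties]


errors = simulate_errors([0.1] * 1000)
observed_rate = sum(errors) / len(errors)  # close to 0.1 for a large sample
```

With many examples of the same difficulty, the observed error rate settles near that difficulty, which is the signal the controller estimates.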

In [None]:
# Execute the data collection
data = collect_data()

print(f"Collected {len(data)} examples")
print("\nFirst few examples:")
for item in data[:3]:
    print(f"- ID: {item['id']}")
    print(f"  Question: {item['question']}")
    print(f"  Answer: {item['answer']}")
    print(f"  Difficulty: {item['difficulty']:.2f}")
    print()

## Compute Metrics

Run the evaluation on our sample data and display the results.

In [None]:
# Sample dataset - inlined instead of reading from file
# This replaces the original: with open("../dataset_001/data_out.json") as f: data = json.load(f)

sample_data = [
    {"id": f"example_{i:03d}", "difficulty": 0.05 + 0.15 * np.random.random()}
    for i in range(200)
]

# Add some high-difficulty examples to test mode switching
for i in range(50):
    sample_data.append({
        "id": f"hard_example_{i:03d}", 
        "difficulty": 0.3 + 0.4 * np.random.random()
    })

print(f"📊 Created {len(sample_data)} sample examples")
print(f"📈 Difficulty range: {min(ex['difficulty'] for ex in sample_data):.3f} - {max(ex['difficulty'] for ex in sample_data):.3f}")

# Show first few examples
print("\n🔍 First 5 examples:")
for ex in sample_data[:5]:
    print(f"  {ex['id']}: difficulty = {ex['difficulty']:.3f}")

## Execute Data Collection

Let's run the data collection function and examine the results.

In [None]:
def compute_metrics(results: dict) -> dict:
    """Compute evaluation metrics."""
    metrics = {}

    for method in ["baseline", "proposed"]:
        preds = results[method]

        # Count decisions
        fusion_count = sum(1 for p in preds if p["decision"] == "fusion")
        fission_count = sum(1 for p in preds if p["decision"] == "fission")

        # Compute error rate
        errors = sum(1 for p in preds if p["error"])
        error_rate = errors / len(preds)

        # API calls (fusion=1, fission=2)
        api_calls = fusion_count + 2 * fission_count

        metrics[method] = {
            "fusion_rate": fusion_count / len(preds),
            "fission_rate": fission_count / len(preds),
            "error_rate": error_rate,
            "api_calls": api_calls,
            "avg_calls_per_example": api_calls / len(preds),
        }

    # Compute improvement
    baseline_calls = metrics["baseline"]["avg_calls_per_example"]
    proposed_calls = metrics["proposed"]["avg_calls_per_example"]
    metrics["improvement"] = {
        "api_reduction_pct": (baseline_calls - proposed_calls) / baseline_calls * 100,
        "error_rate_diff": metrics["proposed"]["error_rate"] - metrics["baseline"]["error_rate"],
    }

    return metrics

print("Evaluation function defined successfully!")

In [None]:
def collect_data() -> List[Dict[str, Any]]:
    """Collect benchmark data for DKW controller evaluation."""
    
    # In the original script, this would be:
    # ds = load_dataset("gsm8k", "main", split="test[:200]")
    # Here we use our simulated dataset instead
    ds = simulated_dataset

    data = []
    for i, example in enumerate(ds):
        processed_item = {
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy for difficulty
        }
        data.append(processed_item)

    return data

## Sample Data

Instead of reading from external files, we'll create sample data inline. The data represents examples with varying difficulty levels that influence error probability.

## Evaluation Function

The `compute_metrics` function calculates performance metrics for both baseline and proposed methods, including fusion/fission rates, error rates, API call counts, and improvement percentages.
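As a worked example of the cost model (1 call per fusion, 2 per fission), the API reduction depends only on the proposed method's fusion rate when the baseline always uses fission. The sketch below exposes the costs as parameters, which is an assumption for illustration; `compute_metrics` itself hard-codes 1 and 2, and `api_reduction_pct` is a hypothetical helper.

```python
def api_reduction_pct(fusion_rate, fusion_cost=1.0, fission_cost=2.0):
    """Percent reduction in average calls versus an always-fission baseline."""
    proposed = fusion_rate * fusion_cost + (1 - fusion_rate) * fission_cost
    baseline = fission_cost
    return (baseline - proposed) / baseline * 100


# 65% fusion: 0.65 * 1 + 0.35 * 2 = 1.35 calls per example, versus 2.0 for the baseline
reduction = api_reduction_pct(0.65)  # about 32.5
```

Under this cost model the reduction is capped at 50%, reached only when every decision is fusion.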

## Data Processing Function

The `collect_data()` function processes the raw dataset and adds additional metadata like difficulty scoring based on question length.

In [None]:
@dataclass
class DKWController:
    """DKW-guided fusion/fission controller."""
    epsilon_target: float = 0.10
    delta: float = 0.05
    min_samples: int = 100
    hysteresis: float = 0.05

    samples: list = field(default_factory=list)
    current_state: str = "fission"

    def dkw_epsilon(self, n: int) -> float:
        """Compute DKW epsilon for n samples."""
        if n < 2:
            return 1.0
        return np.sqrt(np.log(2 / self.delta) / (2 * n))

    def add_observation(self, error: float) -> None:
        """Add error observation for calibration."""
        self.samples.append(error)

    def decide(self) -> str:
        """Make fusion/fission decision with DKW guarantee."""
        n = len(self.samples)
        if n < self.min_samples:
            return self.current_state

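        # Note: epsilon is computed from all n samples, while the empirical
        # error uses only the most recent min_samples observations.
        epsilon = self.dkw_epsilon(n)
        empirical_error = np.mean(self.samples[-self.min_samples:])
        error_upper_bound = empirical_error + epsilon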

        if self.current_state == "fusion":
            if error_upper_bound > self.epsilon_target + self.hysteresis:
                self.current_state = "fission"
        else:
            if error_upper_bound < self.epsilon_target - self.hysteresis:
                self.current_state = "fusion"

        return self.current_state
    
    def get_stats(self):
        """Get current controller statistics."""
        n = len(self.samples)
        if n == 0:
            return {"samples": 0, "empirical_error": 0, "epsilon": 1.0, "upper_bound": 1.0,
                    "current_state": self.current_state}
        
        empirical_error = np.mean(self.samples[-self.min_samples:]) if n >= self.min_samples else np.mean(self.samples)
        epsilon = self.dkw_epsilon(n)
        
        return {
            "samples": n,
            "empirical_error": empirical_error,
            "epsilon": epsilon,
            "upper_bound": empirical_error + epsilon,
            "current_state": self.current_state
        }

print("✅ DKWController class defined!")

In [None]:
# Simulated GSM8K dataset samples (normally loaded from HuggingFace)
simulated_dataset = [
    {
        "question": "What is 2+2?",
        "answer": "4"
    },
    {
        "question": "If x=5, what is 2x?", 
        "answer": "10"
    },
    {
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3"
    },
    {
        "question": "A store sells apples for $3 per pound. How much do 4 pounds cost?",
        "answer": "$12"
    },
    {
        "question": "If a rectangle has length 8 and width 6, what is its area?",
        "answer": "48"
    }
]

print(f"Loaded {len(simulated_dataset)} sample questions")

In [None]:
import json
import numpy as np

# Sample data that produces the expected evaluation metrics
# This replaces reading from "../experiment_001/method_out.json"

# Generate sample baseline data: 100% fission, 8% error rate
baseline_data = []
for i in range(200):
    error = i < 16  # First 16 examples have errors (8% error rate)
    baseline_data.append({
        "decision": "fission",  # 100% fission rate
        "error": error
    })

# Generate sample proposed data: 65% fusion, 35% fission, 9% error rate
proposed_data = []
for i in range(200):
    decision = "fusion" if i < 130 else "fission"  # 65% fusion, 35% fission
    error = i < 18  # First 18 examples have errors (9% error rate)
    proposed_data.append({
        "decision": decision,
        "error": error
    })

# Combine into the expected format
results = {
    "baseline": baseline_data,
    "proposed": proposed_data
}

print(f"Generated sample data:")
print(f"- Baseline: {len(results['baseline'])} examples")
print(f"- Proposed: {len(results['proposed'])} examples")

## DKW Controller Class

The `DKWController` uses the Dvoretzky-Kiefer-Wolfowitz inequality to provide statistical guarantees on error rate estimates. 

### Key Parameters:
- **`epsilon_target`**: Target error rate threshold (default: 0.10)
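- **`delta`**: Failure probability for the DKW bound, which holds with probability at least 1 - delta (default: 0.05)
- **`min_samples`**: Minimum samples before making decisions; also the size of the sliding window used for the empirical error (default: 100)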
- **`hysteresis`**: Prevents rapid mode switching (default: 0.05)

### DKW Inequality:
For n samples, the true error rate is within `empirical_error ± epsilon` with probability ≥ 1-δ, where:
```
epsilon = sqrt(log(2/δ) / (2*n))
```
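To get a feel for the numbers, the formula can be evaluated for a few sample sizes and inverted to find how many samples a given epsilon requires. This is a standalone sketch using only the standard library; `samples_needed` is a hypothetical helper derived by solving the formula for n.

```python
import math


def dkw_epsilon(n, delta=0.05):
    """DKW half-width for n samples at failure probability delta."""
    return math.sqrt(math.log(2 / delta) / (2 * n))


def samples_needed(epsilon, delta=0.05):
    """Smallest n for which dkw_epsilon(n, delta) <= epsilon."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


for n in (100, 250, 1000):
    print(n, round(dkw_epsilon(n), 3))  # 0.136, 0.086, 0.043
print(samples_needed(0.05))  # 738
```

With the default `epsilon_target` of 0.10 and `hysteresis` of 0.05, the controller switches to fusion only when the upper bound falls below 0.05, so with these defaults a switch needs at least 738 observations even if no errors are observed.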

## Sample Dataset

Since this is a self-contained notebook, we'll simulate the GSM8K dataset with sample data instead of loading from HuggingFace. In the original script, this would be loaded using `load_dataset("gsm8k", "main", split="test[:200]")`.

## Sample Data

The original script reads from `../experiment_001/method_out.json`. For this self-contained notebook, we'll inline the sample data that would produce the expected evaluation results.

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from typing import List, Dict, Any

# Note: In the original script, this would be: from datasets import load_dataset
# For this self-contained notebook, we'll use inline data instead

In [None]:
"""DKW Controller Implementation - Imports and Setup"""
import json
import numpy as np
from dataclasses import dataclass, field
import matplotlib.pyplot as plt
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

print("📦 All packages imported successfully!")

# DKW Controller Evaluation

This notebook contains the evaluation script for the DKW Controller, converted from `eval.py` into an interactive format. The notebook analyzes the performance of two methods (baseline and proposed) by computing various metrics including API call reduction and error rates.

## Import Dependencies

First, let's import the required libraries for data processing.

# DKW Controller Implementation - Interactive Demo

This notebook demonstrates a **DKW-guided fusion/fission controller** implementation. The controller uses the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality to make statistically guaranteed decisions between fusion and fission modes based on observed error rates.

## Overview
- **Fusion mode**: Aggressive strategy that uses fewer API calls but may have higher error rates
- **Fission mode**: Conservative strategy with lower error rates at the cost of more API calls
- **DKW guarantee**: Statistical bound ensuring our error estimates are reliable

The controller switches between modes based on observed error rates with statistical confidence bounds.
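The switching rule can be sketched as a small pure function. It follows the hysteresis logic of the controller in this notebook with its default thresholds; `next_state` is a hypothetical name, and the upper bound is taken as an input rather than computed.

```python
def next_state(state, upper_bound, target=0.10, hysteresis=0.05):
    """Return the next mode given the current mode and the error upper bound."""
    if state == "fusion" and upper_bound > target + hysteresis:
        return "fission"  # bound too high: fall back to the conservative mode
    if state == "fission" and upper_bound < target - hysteresis:
        return "fusion"  # bound comfortably low: switch to the aggressive mode
    return state  # inside the hysteresis band: keep the current mode


state = "fission"
for bound in (0.12, 0.04, 0.12, 0.16):
    state = next_state(state, bound)
```

The band between `target - hysteresis` and `target + hysteresis` is what keeps a bound hovering near the target from flipping the mode on every step.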

# Dataset Collection for DKW Benchmark

This notebook demonstrates the dataset collection script for DKW controller evaluation. The script processes benchmark data from the GSM8K dataset and formats it for evaluation purposes.

**Original Artifact:** data.py  
**Purpose:** Collect and format benchmark data for mathematical reasoning tasks

In [None]:
# Interactive parameter exploration
# Modify these parameters to see different scenarios

def create_custom_scenario(n_examples=200, 
                          proposed_fusion_rate=0.65, 
                          baseline_error_rate=0.08,
                          proposed_error_rate=0.09):
    """Create a custom evaluation scenario."""
    
    # Baseline: always fission
    baseline_data = []
    for i in range(n_examples):
        error = i < round(n_examples * baseline_error_rate)  # round() avoids float truncation
        baseline_data.append({
            "decision": "fission",
            "error": error
        })
    
    # Proposed: mix of fusion and fission  
    proposed_data = []
    fusion_count = round(n_examples * proposed_fusion_rate)  # round() avoids float truncation
    for i in range(n_examples):
        decision = "fusion" if i < fusion_count else "fission"
        error = i < round(n_examples * proposed_error_rate)  # round() avoids float truncation
        proposed_data.append({
            "decision": decision,
            "error": error
        })
    
    custom_results = {
        "baseline": baseline_data,
        "proposed": proposed_data
    }
    
    return compute_metrics(custom_results)

# Try different scenarios
print("=== SCENARIO 1: Higher Fusion Rate ===")
scenario1 = create_custom_scenario(proposed_fusion_rate=0.80)
print(f"API Reduction: {scenario1['improvement']['api_reduction_pct']:.1f}%")

print("\n=== SCENARIO 2: Lower Error Rate ===")  
scenario2 = create_custom_scenario(proposed_error_rate=0.05)
print(f"API Reduction: {scenario2['improvement']['api_reduction_pct']:.1f}%")
print(f"Error Rate Diff: {scenario2['improvement']['error_rate_diff']:+.1%}")

print("\n=== SCENARIO 3: Conservative Approach ===")
scenario3 = create_custom_scenario(proposed_fusion_rate=0.40, proposed_error_rate=0.06)
print(f"API Reduction: {scenario3['improvement']['api_reduction_pct']:.1f}%")
print(f"Error Rate Diff: {scenario3['improvement']['error_rate_diff']:+.1%}")

## Customization & Usage

### Modifying the Sample Data
You can easily modify the `sample_dataset` variable above to include your own questions and answers. Just maintain the format:

```python
sample_dataset = [
    {
        "question": "Your question here",
        "answer": "Your answer here"
    }
    # Add more examples...
]
```

### Using Real HuggingFace Data
To use the actual GSM8k dataset from HuggingFace:

1. Install the datasets library: `pip install datasets`
2. Uncomment the import: `from datasets import load_dataset`
3. Use the `collect_data_from_huggingface()` function

### Difficulty Metric
The difficulty score is calculated as `question_length / 100`. You can modify this calculation in the `collect_data()` function to use more sophisticated metrics.
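As one possible refinement, the length-based score can be clamped to the range [0, 1] so that questions longer than 100 characters do not produce values above 1. This is a sketch, not the original metric; `clamped_difficulty` and its `scale` parameter are hypothetical.

```python
def clamped_difficulty(question, scale=100):
    """Length-based difficulty proxy, clamped to [0.0, 1.0]."""
    return min(len(question) / scale, 1.0)


short = clamped_difficulty("What is 2+2?")  # 12 characters -> 0.12
long_ = clamped_difficulty("x" * 250)       # would be 2.5 unclamped -> 1.0
```

Clamping matters if the score is later interpreted as a probability, as it is when errors are simulated from difficulty.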

### Next Steps
This processed data can now be used for:
- DKW benchmark evaluation
- Mathematical reasoning model testing
- Performance analysis and comparison

## Interactive Exploration

Try modifying the parameters below to see how different scenarios affect the results.

In [None]:
# Display the formatted data (equivalent to what would be saved in data_out.json)
formatted_output = json.dumps(collected_data, indent=2)
print(formatted_output)

print("\n💾 In the original script, this data would be saved to 'data_out.json'")
print("🎯 The data is now ready for DKW benchmark evaluation!")

In [None]:
# Display detailed metrics in a formatted way
import json

print("=== BASELINE METHOD ===")
baseline = metrics["baseline"]
print(f"Fusion Rate:     {baseline['fusion_rate']:.1%}")
print(f"Fission Rate:    {baseline['fission_rate']:.1%}")
print(f"Error Rate:      {baseline['error_rate']:.1%}")
print(f"Total API Calls: {baseline['api_calls']}")
print(f"Avg Calls/Example: {baseline['avg_calls_per_example']:.2f}")

print("\n=== PROPOSED METHOD ===")
proposed = metrics["proposed"]
print(f"Fusion Rate:     {proposed['fusion_rate']:.1%}")
print(f"Fission Rate:    {proposed['fission_rate']:.1%}")
print(f"Error Rate:      {proposed['error_rate']:.1%}")
print(f"Total API Calls: {proposed['api_calls']}")
print(f"Avg Calls/Example: {proposed['avg_calls_per_example']:.2f}")

print("\n=== IMPROVEMENT ===")
improvement = metrics["improvement"]
print(f"API Reduction:   {improvement['api_reduction_pct']:.1f}%")
print(f"Error Rate Diff: {improvement['error_rate_diff']:+.1%}")

print("\n=== COMPLETE METRICS (JSON) ===")
print(json.dumps(metrics, indent=2))

In [None]:
# Execute the data collection
print("🚀 Starting data collection process...\n")

# Collect and process the data
collected_data = collect_data()

# Display the results
print("\n📈 Collection Summary:")
print(f"   • Total examples: {len(collected_data)}")
print(f"   • Average difficulty: {sum(item['difficulty'] for item in collected_data) / len(collected_data):.3f}")

# Instead of writing to file, we'll display the data inline
print("\n📋 Collected Data Structure:")
print("-" * 50)

## Detailed Results

Let's examine the detailed metrics for both methods.

## Execute Data Collection

Now let's run the data collection process and see the results. The function will process our sample data and format it for benchmark evaluation.

In [None]:
# Compute metrics using our sample data
metrics = compute_metrics(results)

# Display the main result
print(f"API reduction: {metrics['improvement']['api_reduction_pct']:.1f}%")

# Save results (optional - replaces writing to JSON file)
eval_output = metrics
print("\nEvaluation completed successfully!")

In [None]:
def collect_data(dataset=None):
    """Collect benchmark data for DKW controller evaluation."""
    
    # Use inline sample data if no dataset provided (self-contained mode)
    if dataset is None:
        dataset = sample_dataset
        print("🔄 Using inline sample data for self-contained execution")
    
    # Process the dataset
    data = []
    for i, example in enumerate(dataset):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy based on question length
        })
    
    print(f"✅ Processed {len(data)} examples successfully")
    return data

# Alternative function that loads from HuggingFace (requires: pip install datasets)
def collect_data_from_huggingface():
    """Load the first 200 GSM8K test examples and process them with collect_data()."""
    # Imported lazily so the rest of the notebook runs without the 'datasets' package
    from datasets import load_dataset

    ds = load_dataset("gsm8k", "main", split="test[:200]")
    return collect_data(ds)

print("🔧 Data collection functions defined successfully!")

## Run Evaluation

Now let's compute the metrics and display the results.

In [None]:
def compute_metrics(results: dict) -> dict:
    """Compute evaluation metrics."""
    metrics = {}

    for method in ["baseline", "proposed"]:
        preds = results[method]

        # Count decisions
        fusion_count = sum(1 for p in preds if p["decision"] == "fusion")
        fission_count = sum(1 for p in preds if p["decision"] == "fission")

        # Compute error rate
        errors = sum(1 for p in preds if p["error"])
        error_rate = errors / len(preds)

        # API calls (fusion=1, fission=2)
        api_calls = fusion_count + 2 * fission_count

        metrics[method] = {
            "fusion_rate": fusion_count / len(preds),
            "fission_rate": fission_count / len(preds),
            "error_rate": error_rate,
            "api_calls": api_calls,
            "avg_calls_per_example": api_calls / len(preds),
        }

    # Compute improvement
    baseline_calls = metrics["baseline"]["avg_calls_per_example"]
    proposed_calls = metrics["proposed"]["avg_calls_per_example"]
    metrics["improvement"] = {
        "api_reduction_pct": (baseline_calls - proposed_calls) / baseline_calls * 100,
        "error_rate_diff": metrics["proposed"]["error_rate"] - metrics["baseline"]["error_rate"],
    }

    return metrics

## Data Collection Function

The `collect_data()` function processes the raw dataset and formats it for DKW benchmark evaluation. It:

1. Takes mathematical questions and answers
2. Assigns unique IDs to each example
3. Calculates a difficulty metric based on question length
4. Returns structured data ready for benchmark testing

In [None]:
# Sample data that mimics HuggingFace GSM8k dataset format
# This represents what would be loaded from: load_dataset("gsm8k", "main", split="test[:200]")
sample_dataset = [
    {
        "question": "What is 2+2?",
        "answer": "4"
    },
    {
        "question": "If x=5, what is 2x?", 
        "answer": "10"
    },
    {
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3"
    }
]

print(f"📊 Sample dataset loaded with {len(sample_dataset)} examples")
print("\n🔍 Preview of first example:")
print(f"Question: {sample_dataset[0]['question']}")
print(f"Answer: {sample_dataset[0]['answer']}")

## Metrics Computation Function

This function analyzes the results and computes key performance metrics for both methods.

In [None]:
# Create sample data that matches the expected evaluation results
# 200 examples total for each method

# Baseline: 100% fission, 8% error rate  
baseline_data = []
for i in range(200):
    error = i < 16  # First 16 examples have errors (8% error rate)
    baseline_data.append({
        "decision": "fission",
        "error": error
    })

# Proposed: 65% fusion, 35% fission, 9% error rate
proposed_data = []
for i in range(200):
    if i < 130:  # First 130 examples use fusion (65%)
        decision = "fusion"
    else:  # Last 70 examples use fission (35%)
        decision = "fission"
    
    error = i < 18  # First 18 examples have errors (9% error rate)
    proposed_data.append({
        "decision": decision,
        "error": error
    })

# Combined results dictionary (replaces reading from JSON file)
results = {
    "baseline": baseline_data,
    "proposed": proposed_data
}

print(f"Created sample data:")
print(f"- Baseline: {len(results['baseline'])} examples")
print(f"- Proposed: {len(results['proposed'])} examples")

## Sample Data (Inline)

For self-contained execution, we'll use sample data that represents what would normally be loaded from the HuggingFace GSM8k dataset. This data includes mathematical reasoning questions with their answers.

## Sample Data

Instead of reading from external JSON files, we'll define the sample data inline. This represents the results from 200 test examples for both baseline and proposed methods.

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
# Note: In a real environment, you would need: pip install datasets
# from datasets import load_dataset

# For this self-contained demo, we'll use inline sample data
print("✅ Imports loaded successfully!")
print("📝 Note: This notebook uses inline sample data for self-contained execution")

# DKW Benchmark Dataset Collection

**Artifact:** dataset_001 - data.py

This notebook demonstrates the dataset collection script for DKW benchmark evaluation. It processes mathematical reasoning questions from the GSM8k dataset and formats them for benchmark testing.

## Features
- Loads data from HuggingFace GSM8k dataset
- Processes and formats questions with answers
- Calculates difficulty metrics
- Self-contained execution with sample data

In [None]:
"""Evaluation script for DKW Controller."""
import json
import numpy as np

# DKW Controller Evaluation

This notebook evaluates the performance of a proposed method against a baseline for the DKW Controller system. 

The evaluation compares two approaches:
- **Baseline**: Always uses fission (2 API calls per example)
- **Proposed**: Intelligently chooses between fusion (1 API call) and fission (2 API calls)

Key metrics computed:
- Fusion/Fission rates
- Error rates 
- API call efficiency
- Performance improvement