## Conclusion and Next Steps

This notebook demonstrates how the DKW controller adapts its decisions based on observed error rates while providing statistical guarantees. Key observations:

1. **Adaptive Behavior**: The controller switches between fusion and fission based on error observations
2. **Statistical Guarantees**: Uses DKW inequality to bound estimation error
3. **Hysteresis**: Prevents oscillation between states

### Experiment with Parameters

You can modify the controller parameters to see different behaviors:

```python
# Create a new controller with different parameters
custom_controller = DKWController(
    epsilon_target=0.05,    # Tighter error tolerance
    delta=0.01,            # Higher confidence
    min_samples=50,        # Faster adaptation
    hysteresis=0.02        # Less hysteresis
)
```

### Modify the Data

Change the `sample_data` generation to test different scenarios:
- Different difficulty patterns
- More or fewer examples
- Varying error rates over time

### Save Results

Uncomment the save block in the previous cell to save results to a JSON file.

In [None]:
# Display sample results (first 5 examples)
sample_results = {
    "baseline": results["baseline"][:5],
    "proposed": results["proposed"][:5]
}

print("Sample Results (first 5 examples):")
print(json.dumps(sample_results, indent=2))

# Optionally save full results to file
# with open("method_out.json", "w") as f:
#     json.dump(results, f, indent=2)
# print(f"\nFull results saved to method_out.json")

print(f"\nTotal examples in full results: {len(results['baseline'])}")

## Sample Output

Here's a sample of the results in the same format as the original `method_out.json` file:

In [None]:
# Create visualization of decision patterns over time
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 10))

# Extract decision sequences
proposed_decisions = [1 if r['decision'] == 'fusion' else 0 for r in results['proposed']]
errors = [1 if r['error'] else 0 for r in results['proposed']]
example_ids = list(range(len(results['proposed'])))

# Plot 1: Decision patterns over time
ax1.plot(example_ids, proposed_decisions, 'b-', alpha=0.7, linewidth=2, label='DKW Controller (1=Fusion, 0=Fission)')
ax1.fill_between(example_ids, proposed_decisions, alpha=0.3, color='blue')
ax1.set_ylabel('Decision')
ax1.set_title('DKW Controller Decisions Over Time')
ax1.set_ylim(-0.1, 1.1)
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Error occurrences
ax2.scatter(example_ids, errors, alpha=0.6, c='red', s=20)
ax2.set_ylabel('Error Occurred')
ax2.set_title('Error Occurrences Over Time')
ax2.set_ylim(-0.1, 1.1)
ax2.grid(True, alpha=0.3)

# Plot 3: Running error rate
window_size = 20
running_error_rate = []
for i in range(len(errors)):
    start_idx = max(0, i - window_size + 1)
    window_errors = errors[start_idx:i+1]
    running_error_rate.append(np.mean(window_errors))

ax3.plot(example_ids, running_error_rate, 'g-', linewidth=2, label='Running Error Rate (20-sample window)')
ax3.axhline(y=0.10, color='red', linestyle='--', alpha=0.7, label='Target Error Rate (0.10)')
ax3.set_xlabel('Example Index')
ax3.set_ylabel('Error Rate')
ax3.set_title('Running Error Rate vs Target')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Experiment: What if we had a different proposed method?
# Modify these parameters to see different scenarios

def create_experimental_results(n_examples=200, fusion_rate=0.8, error_rate=0.05):
    """Create experimental results with custom parameters."""
    results = []
    n_fusion = int(n_examples * fusion_rate)
    n_errors = int(n_examples * error_rate)
    
    for i in range(n_examples):
        decision = "fusion" if i < n_fusion else "fission"
        error = i < n_errors
        results.append({"decision": decision, "error": error})
    
    return results

# Try different scenarios
experimental_scenarios = {
    "baseline": baseline_results,  # Keep baseline the same
    "high_fusion_low_error": create_experimental_results(fusion_rate=0.9, error_rate=0.03),
    "medium_fusion": create_experimental_results(fusion_rate=0.5, error_rate=0.06),
    "original_proposed": proposed_results
}

print("Comparing different scenarios:\n")
for scenario_name, scenario_results in experimental_scenarios.items():
    if scenario_name == "baseline":
        continue
    
    scenario_data = {"baseline": baseline_results, "proposed": scenario_results}
    scenario_metrics = compute_metrics(scenario_data)
    
    print(f"{scenario_name.upper()}:")
    print(f"  API reduction: {scenario_metrics['improvement']['api_reduction_pct']:.1f}%")
    print(f"  Error rate change: {scenario_metrics['improvement']['error_rate_diff']:+.1%}")
    print(f"  Fusion rate: {scenario_metrics['proposed']['fusion_rate']:.1%}")
    print()

## Results Analysis

Let's analyze the behavior of both methods and visualize how the DKW controller adapts over time.

## Experiment with Different Scenarios

You can modify the parameters below to see how different controller behaviors would affect the metrics:

In [None]:
# Run the experiment
results = run_experiment(sample_data)

# Display basic statistics
print("Experiment Results:")
print(f"Total examples processed: {len(results['baseline'])}")

# Count decisions for each method
baseline_fission = sum(1 for r in results['baseline'] if r['decision'] == 'fission')
baseline_fusion = sum(1 for r in results['baseline'] if r['decision'] == 'fusion')

proposed_fission = sum(1 for r in results['proposed'] if r['decision'] == 'fission')
proposed_fusion = sum(1 for r in results['proposed'] if r['decision'] == 'fusion')

print(f"\nBaseline method:")
print(f"  Fission decisions: {baseline_fission}")
print(f"  Fusion decisions: {baseline_fusion}")

print(f"\nProposed DKW method:")
print(f"  Fission decisions: {proposed_fission}")
print(f"  Fusion decisions: {proposed_fusion}")

# Count errors for each method
baseline_errors = sum(1 for r in results['baseline'] if r['error'])
proposed_errors = sum(1 for r in results['proposed'] if r['error'])

print(f"\nError counts:")
print(f"  Baseline: {baseline_errors} errors")
print(f"  Proposed: {proposed_errors} errors")

In [None]:
# Compute metrics
metrics = compute_metrics(results)

# Display results in a formatted way
print("="*50)
print("DKW CONTROLLER EVALUATION RESULTS")
print("="*50)

for method in ["baseline", "proposed"]:
    print(f"\n{method.upper()} METHOD:")
    m = metrics[method]
    print(f"  Fusion rate:     {m['fusion_rate']:.1%}")
    print(f"  Fission rate:    {m['fission_rate']:.1%}") 
    print(f"  Error rate:      {m['error_rate']:.1%}")
    print(f"  Total API calls: {m['api_calls']}")
    print(f"  Avg calls/example: {m['avg_calls_per_example']:.2f}")

print(f"\nIMPROVEMENT:")
print(f"  API reduction: {metrics['improvement']['api_reduction_pct']:.1f}%")
print(f"  Error rate change: {metrics['improvement']['error_rate_diff']:+.1%}")

# Also save as JSON (inline output)
print(f"\nFull metrics as JSON:")
print(json.dumps(metrics, indent=2))

## Run Experiment

Let's execute the experiment and collect results from both the proposed DKW controller and the baseline always-fission approach.

In [None]:
def run_experiment(data):
    """Run DKW controller experiment."""
    controller = DKWController()
    results = {"baseline": [], "proposed": []}

    for example in data:
        # Simulate error occurrence based on difficulty
        error = np.random.random() < example["difficulty"]
        controller.add_observation(float(error))
        decision = controller.decide()

        results["proposed"].append({
            "id": example["id"],
            "decision": decision,
            "error": error,
        })
        results["baseline"].append({
            "id": example["id"],
            "decision": "fission",  # Always conservative
            "error": error,
        })

    return results

## Run Evaluation

Now let's compute the metrics and display the results:

In [None]:
def compute_metrics(results: dict) -> dict:
    """Compute evaluation metrics."""
    metrics = {}

    for method in ["baseline", "proposed"]:
        preds = results[method]

        # Count decisions
        fusion_count = sum(1 for p in preds if p["decision"] == "fusion")
        fission_count = sum(1 for p in preds if p["decision"] == "fission")

        # Compute error rate
        errors = sum(1 for p in preds if p["error"])
        error_rate = errors / len(preds)

        # API calls (fusion=1, fission=2)
        api_calls = fusion_count + 2 * fission_count

        metrics[method] = {
            "fusion_rate": fusion_count / len(preds),
            "fission_rate": fission_count / len(preds),
            "error_rate": error_rate,
            "api_calls": api_calls,
            "avg_calls_per_example": api_calls / len(preds),
        }

    # Compute improvement
    baseline_calls = metrics["baseline"]["avg_calls_per_example"]
    proposed_calls = metrics["proposed"]["avg_calls_per_example"]
    metrics["improvement"] = {
        "api_reduction_pct": (baseline_calls - proposed_calls) / baseline_calls * 100,
        "error_rate_diff": metrics["proposed"]["error_rate"] - metrics["baseline"]["error_rate"],
    }

    return metrics

## Experiment Function

The `run_experiment` function simulates the controller's behavior over a sequence of examples, comparing:
- **Proposed method**: Uses DKW controller for adaptive decisions
- **Baseline method**: Always uses conservative "fission" mode

For each example, we simulate whether an error occurs based on the difficulty level.

In [None]:
# Sample data - normally read from "../dataset_001/data_out.json"
# Create synthetic data with varying difficulty levels
sample_data = []

# Generate 300 examples with varying difficulty
for i in range(300):
    # Create examples with different difficulty patterns
    if i < 100:
        difficulty = 0.05  # Easy examples (low error rate)
    elif i < 200:
        difficulty = 0.15  # Medium examples (moderate error rate)
    else:
        difficulty = 0.08  # Harder examples (but still manageable)
    
    sample_data.append({
        "id": f"example_{i:03d}",
        "difficulty": difficulty
    })

print(f"Created {len(sample_data)} sample examples")
print(f"Sample data preview: {sample_data[:3]}")

## Evaluation Metrics Function

The `compute_metrics` function calculates several key performance indicators:

- **Fusion/Fission rates**: Proportion of decisions for each strategy
- **Error rate**: Percentage of incorrect decisions
- **API calls**: Total API usage (fusion = 1 call, fission = 2 calls)
- **Efficiency improvements**: Comparison between methods

## Sample Data

We'll create sample data to simulate the input that would normally be read from a JSON file. Each example has:
- `id`: Unique identifier
- `difficulty`: Probability of error occurring (0.0 to 1.0)

In [None]:
# Create sample experimental results
# This data represents the decisions made by baseline vs proposed methods

# Baseline method: always chooses fission, 8% error rate
baseline_results = []
for i in range(200):
    baseline_results.append({
        "decision": "fission",
        "error": i < 16  # First 16 examples have errors (8% of 200)
    })

# Proposed method: 65% fusion, 35% fission, 9% error rate  
proposed_results = []
for i in range(200):
    if i < 130:  # First 130 examples use fusion (65% of 200)
        decision = "fusion"
    else:  # Remaining 70 examples use fission (35% of 200)
        decision = "fission"
    
    proposed_results.append({
        "decision": decision,
        "error": i < 18  # First 18 examples have errors (9% of 200)
    })

# Combine into results structure
results = {
    "baseline": baseline_results,
    "proposed": proposed_results
}

print(f"Generated {len(baseline_results)} baseline results and {len(proposed_results)} proposed results")
print(f"Baseline decisions: {sum(1 for r in baseline_results if r['decision'] == 'fusion')} fusion, {sum(1 for r in baseline_results if r['decision'] == 'fission')} fission")
print(f"Proposed decisions: {sum(1 for r in proposed_results if r['decision'] == 'fusion')} fusion, {sum(1 for r in proposed_results if r['decision'] == 'fission')} fission")

In [None]:
@dataclass
class DKWController:
    """DKW-guided fusion/fission controller."""
    epsilon_target: float = 0.10
    delta: float = 0.05
    min_samples: int = 100
    hysteresis: float = 0.05

    samples: list = field(default_factory=list)
    current_state: str = "fission"

    def dkw_epsilon(self, n: int) -> float:
        """Compute DKW epsilon for n samples."""
        if n < 2:
            return 1.0
        return np.sqrt(np.log(2 / self.delta) / (2 * n))

    def add_observation(self, error: float) -> None:
        """Add error observation for calibration."""
        self.samples.append(error)

    def decide(self) -> str:
        """Make fusion/fission decision with DKW guarantee."""
        n = len(self.samples)
        if n < self.min_samples:
            return self.current_state

        epsilon = self.dkw_epsilon(n)
        empirical_error = np.mean(self.samples[-self.min_samples:])
        error_upper_bound = empirical_error + epsilon

        if self.current_state == "fusion":
            if error_upper_bound > self.epsilon_target + self.hysteresis:
                self.current_state = "fission"
        else:
            if error_upper_bound < self.epsilon_target - self.hysteresis:
                self.current_state = "fusion"

        return self.current_state

## Sample Data

Instead of reading from external JSON files, we'll create sample data inline that represents the experimental results from both baseline and proposed methods.

## DKW Controller Class

The core controller that implements the DKW-guided decision making algorithm.

**Parameters:**
- `epsilon_target`: Target error threshold (0.10)
- `delta`: Confidence parameter for DKW bound (0.05)
- `min_samples`: Minimum samples before making decisions (100)
- `hysteresis`: Buffer to prevent oscillation (0.05)

**Key Methods:**
- `dkw_epsilon()`: Computes the DKW confidence interval width
- `add_observation()`: Records error observations
- `decide()`: Makes fusion/fission decision with statistical guarantees

In [None]:
import json
import numpy as np

# DKW Controller Evaluation

This notebook evaluates the performance of a DKW Controller, comparing baseline and proposed methods. The evaluation focuses on:
- **Fusion vs Fission decisions**: The controller can choose to fuse or split operations
- **API efficiency**: Fusion requires 1 API call, fission requires 2 API calls
- **Error rates**: How often the controller makes incorrect decisions

The goal is to measure the improvement in API efficiency while maintaining acceptable error rates.

In [None]:
"""DKW Controller Implementation."""
import json
import numpy as np
from dataclasses import dataclass, field
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# DKW Controller Implementation Demo

This notebook demonstrates a **DKW-guided fusion/fission controller** that makes adaptive decisions based on error observations with statistical guarantees.

## Overview
The DKW (Dvoretzky-Kiefer-Wolfowitz) inequality provides a way to bound the difference between empirical and true error rates, enabling principled decision-making between fusion and fission states.

**Key Features:**
- Statistical guarantees via DKW inequality
- Adaptive switching between fusion/fission modes
- Hysteresis to prevent oscillation
- Real-time error calibration

# Dataset Collection for DKW Benchmark

This notebook demonstrates dataset collection for DKW controller evaluation using the GSM8K dataset from HuggingFace. The script processes mathematical word problems and creates structured benchmark data with difficulty estimates.

**Artifact ID:** dataset_001  
**Original file:** data.py

## Import Required Libraries

We'll need the `datasets` library to load data from HuggingFace and `json` for data serialization.

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from datasets import load_dataset

## Data Collection Function

The `collect_data()` function loads the GSM8K dataset from HuggingFace and processes it into a structured format suitable for benchmarking. Each example includes:
- **id**: Unique identifier for the example
- **question**: The mathematical word problem
- **answer**: The correct answer
- **difficulty**: A simple difficulty estimate based on question length

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Load HuggingFace dataset
    ds = load_dataset("gsm8k", "main", split="test[:200]")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy
        })

    return data

## Execute Data Collection

Run the data collection function and display the results. This will download 200 examples from the GSM8K test set and process them into our benchmark format.

**Note:** This is completely self-contained - no external files are needed!

In [None]:
# Collect the data
data = collect_data()

# Display results
print(f"Collected {len(data)} examples")
print(f"\nFirst 3 examples:")
for i in range(min(3, len(data))):
    print(f"\nExample {i+1}:")
    print(f"  ID: {data[i]['id']}")
    print(f"  Question: {data[i]['question'][:100]}...")
    print(f"  Answer: {data[i]['answer']}")
    print(f"  Difficulty: {data[i]['difficulty']:.2f}")

## Save Data to JSON File

Optionally save the collected data to a JSON file for later use. This mimics the original script's behavior.

In [None]:
# Save data to JSON file (optional)
with open("data_out.json", "w") as f:
    json.dump(data, f, indent=2)

print("Data saved to 'data_out.json'")

# Show the JSON structure
print(f"\nJSON file contains {len(data)} examples")
print("Sample JSON structure:")
print(json.dumps(data[:2], indent=2))

## Example Output Format

Here's an example of what the processed data looks like. This demonstrates the expected structure and format:

In [None]:
# Example of processed data structure (for reference)
sample_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print("Sample data structure:")
print(json.dumps(sample_data, indent=2))

## Usage and Customization

This notebook is completely self-contained and ready to run! Here are some ways you can customize it:

### Modify Dataset Parameters:
- Change the number of examples: `split="test[:200]"` â†’ `split="test[:500]"`
- Use different splits: `split="test"` or `split="train"`
- Use validation set: `split="validation"`

### Adjust Difficulty Calculation:
The current difficulty is based on question length. You could modify it to use:
- Number of mathematical operations
- Presence of certain keywords
- Complexity scoring algorithms

### Data Processing:
- Add additional fields (e.g., topic classification, solution steps)
- Filter examples by certain criteria
- Apply text preprocessing

### Running the Notebook:
1. Make sure you have the required packages: `pip install datasets`
2. Run all cells in order
3. The data will be collected from HuggingFace automatically
4. No external files needed!