## 🏃‍♂️ Running the Experiment

Let's execute the experiment and see how the DKW controller performs compared to the baseline approach.

## Conclusion

This notebook converts the original `data.py` script into an interactive Jupyter format. Key features:

✅ **Self-contained**: All data is inlined, no external file dependencies  
✅ **Interactive**: Each step can be run and modified independently  
✅ **Educational**: Clear explanations for each section  
✅ **Flexible**: Can work with sample data or real GSM8K dataset  

### Next Steps
- Modify the `collect_data()` function to change the dataset size or processing logic
- Add additional data analysis or visualization
- Integrate with your DKW controller evaluation pipeline

### Original Script Equivalent
The cells above replicate the functionality of the original script:
```python
if __name__ == "__main__":
    data = collect_data()
    with open("data_out.json", "w") as f:
        json.dump(data, f, indent=2)
    print(f"Collected {len(data)} examples")
```

In [None]:
def run_experiment(data, verbose=True):
    """
    Run DKW controller experiment comparing baseline vs proposed approach.
    
    Args:
        data: List of examples with 'id' and 'difficulty' fields
        verbose: Whether to print progress updates
        
    Returns:
        Dictionary with 'baseline' and 'proposed' results
    """
    
    controller = DKWController()
    results = {"baseline": [], "proposed": []}
    
    if verbose:
        print("🚀 Starting experiment...")
        print(f"📊 Processing {len(data)} examples...")
        print("=" * 50)
    
    for i, example in enumerate(data):
        # Simulate error occurrence based on difficulty
        # Higher difficulty = higher chance of error
        error = np.random.random() < example["difficulty"]
        
        # Feed error observation to controller
        controller.add_observation(float(error))
        
        # Get DKW controller decision
        decision = controller.decide()
        
        # Record results for proposed method
        results["proposed"].append({
            "id": example["id"],
            "decision": decision,
            "error": error,
            "difficulty": example["difficulty"]
        })
        
        # Baseline always chooses conservative "fission" mode
        results["baseline"].append({
            "id": example["id"], 
            "decision": "fission",  # Always conservative
            "error": error,
            "difficulty": example["difficulty"]
        })
        
        # Progress updates
        if verbose and i % 5 == 0:
            stats = controller.get_stats()
            print(f"📈 Step {i:2d}: {example['id']} | "
                  f"Error: {error} | Decision: {decision} | "
                  f"Samples: {stats['samples']:3d} | "
                  f"Error bound: {stats['upper_bound']:.3f}")
    
    if verbose:
        print("=" * 50)
        print("✅ Experiment completed!")
        
        # Summary statistics
        proposed_fusion_count = sum(1 for r in results["proposed"] if r["decision"] == "fusion")
        total_errors = sum(1 for r in results["proposed"] if r["error"])
        
        print("📊 Summary:")
        print(f"   Total examples: {len(data)}")
        print(f"   Total errors: {total_errors}")
        print(f"   Proposed fusion decisions: {proposed_fusion_count}")
        print("   Baseline fusion decisions: 0 (always fission)")
    
    return results

print("✅ Experiment function ready!")
print("🎯 Ready to run comparative analysis")

In [None]:
# Save sample data to file (replace 'sample_data' with 'data' if you collected real data)
output_file = "data_out.json"

with open(output_file, "w") as f:
    json.dump(sample_data, f, indent=2)

print(f"Data saved to {output_file}")
print(f"File contains {len(sample_data)} examples")

## Step 6: Save Data to File (Optional)

If you want to save the collected data to a JSON file, run the following cell:

## 🧪 Experiment Runner

The experiment compares two strategies:
1. **Baseline**: Always uses conservative "fission" mode
2. **Proposed**: Uses DKW controller to adaptively choose modes

For each example, we:
- Simulate error occurrence based on difficulty 
- Feed errors to the DKW controller
- Record decisions and outcomes
- Compare performance between strategies
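
Once both result lists exist, the comparison in the last step can be reduced to a few counts per strategy. The sketch below assumes the record layout used in this notebook (`decision` and `error` keys); the `summarize` helper and the hand-written records are illustrative only and are not output from a real run.

```python
def summarize(records):
    """Count decisions and errors for one strategy's list of result records."""
    total = len(records)
    fusion = sum(1 for r in records if r["decision"] == "fusion")
    errors = sum(1 for r in records if r["error"])
    return {
        "total": total,
        "fusion_rate": fusion / total if total else 0.0,
        "error_rate": errors / total if total else 0.0,
    }


# Hand-written placeholder records, not experimental output
example_results = {
    "baseline": [
        {"id": "example_000", "decision": "fission", "error": False},
        {"id": "example_001", "decision": "fission", "error": True},
    ],
    "proposed": [
        {"id": "example_000", "decision": "fusion", "error": False},
        {"id": "example_001", "decision": "fission", "error": True},
    ],
}

for name, records in example_results.items():
    print(name, summarize(records))
```

Because the same simulated error is recorded for both strategies, the error rates will match; only the decision counts differ between baseline and proposed.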

In [None]:
# Analyze the sample data
print("=== Data Analysis ===")
print(f"Total examples: {len(sample_data)}")

# Calculate difficulty statistics
difficulties = [item['difficulty'] for item in sample_data]
avg_difficulty = sum(difficulties) / len(difficulties)
min_difficulty = min(difficulties)
max_difficulty = max(difficulties)

print(f"\nDifficulty Statistics:")
print(f"  Average: {avg_difficulty:.3f}")
print(f"  Range: {min_difficulty:.3f} - {max_difficulty:.3f}")

# Show all examples in a formatted table
print(f"\n=== All Examples ===")
for item in sample_data:
    print(f"ID: {item['id']}")
    print(f"Question: {item['question']}")
    print(f"Answer: {item['answer']}")
    print(f"Difficulty: {item['difficulty']:.3f}")
    print("-" * 50)

In [None]:
@dataclass
class DKWController:
    """DKW-guided fusion/fission controller with statistical guarantees."""
    
    # Configuration parameters
    epsilon_target: float = 0.10      # Target error threshold (10%)
    delta: float = 0.05              # Confidence parameter (5% risk)
    min_samples: int = 100           # Minimum samples before mode switching
    hysteresis: float = 0.05         # Hysteresis to prevent oscillation

    # State tracking
    samples: list = field(default_factory=list)
    current_state: str = "fission"   # Start in conservative mode

    def dkw_epsilon(self, n: int) -> float:
        """
        Compute DKW epsilon bound for n samples.
        
        The DKW inequality provides: P(sup_x |F_n(x) - F(x)| > ε) ≤ 2e^(-2nε²)
        Setting the right-hand side to δ and solving for ε: ε = √(ln(2/δ) / (2n))
        """
        if n < 2:
            return 1.0  # Conservative bound for very few samples
        return np.sqrt(np.log(2 / self.delta) / (2 * n))

    def add_observation(self, error: float) -> None:
        """Add error observation for calibration."""
        self.samples.append(error)

    def decide(self) -> str:
        """
        Make fusion/fission decision with DKW statistical guarantee.
        
        Returns:
            "fusion" for aggressive mode or "fission" for conservative mode
        """
        n = len(self.samples)
        
        # Need sufficient samples before making decisions
        if n < self.min_samples:
            return self.current_state

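        # Compute DKW bound and error estimate.
        # Note: epsilon is computed from all n samples, while the empirical error
        # uses only the most recent min_samples observations.
        epsilon = self.dkw_epsilon(n)
        empirical_error = np.mean(self.samples[-self.min_samples:])  # Use recent samples
        error_upper_bound = empirical_error + epsilon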
        
        # State transition logic with hysteresis
        if self.current_state == "fusion":
            # Switch to conservative if error bound exceeds target + hysteresis
            if error_upper_bound > self.epsilon_target + self.hysteresis:
                self.current_state = "fission"
                print(f"🔄 Switching to FISSION: error_bound={error_upper_bound:.3f} > target={self.epsilon_target + self.hysteresis:.3f}")
        else:  # current_state == "fission"
            # Switch to aggressive if error bound is below target - hysteresis  
            if error_upper_bound < self.epsilon_target - self.hysteresis:
                self.current_state = "fusion"
                print(f"🔄 Switching to FUSION: error_bound={error_upper_bound:.3f} < target={self.epsilon_target - self.hysteresis:.3f}")

        return self.current_state
    
    def get_stats(self) -> dict:
        """Get current controller statistics."""
        n = len(self.samples)
        if n == 0:
            return {"samples": 0, "empirical_error": 0, "epsilon": 1.0, "upper_bound": 1.0}
            
        epsilon = self.dkw_epsilon(n)
        empirical_error = np.mean(self.samples[-self.min_samples:]) if n >= self.min_samples else np.mean(self.samples)
        upper_bound = empirical_error + epsilon
        
        return {
            "samples": n,
            "empirical_error": empirical_error,
            "epsilon": epsilon,
            "upper_bound": upper_bound,
            "current_state": self.current_state
        }

print("✅ DKWController class defined successfully!")
print("🎛️ Ready to create controller instances")

## Step 5: Data Analysis and Exploration

Let's explore the sample data structure and analyze some basic statistics:

In [None]:
# Uncomment the lines below to collect real data from GSM8K
# data = collect_data()
# print(f"Collected {len(data)} examples")
# 
# # Display first few examples
# for i in range(min(3, len(data))):
#     print(f"\nExample {i}:")
#     print(json.dumps(data[i], indent=2))

## Step 4: Collect Real Data (Optional)

Uncomment and run the following cell if you want to collect data from the actual GSM8K dataset. 
**Note:** This requires an internet connection and the `datasets` library to download the GSM8K dataset.

In [None]:
# Sample data showing the expected output structure
# This represents what collect_data() would return
sample_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print(f"Sample contains {len(sample_data)} examples")
print("\nFirst example:")
print(json.dumps(sample_data[0], indent=2))

## 🧮 DKW Controller Implementation

The DKW Controller uses the **Dvoretzky-Kiefer-Wolfowitz inequality** to bound, at a chosen confidence level, how far the empirical error rate can deviate from the true error rate.

### Key Parameters:
- **epsilon_target**: Target error threshold (10%)
- **delta**: Failure probability of the bound (5%), i.e. the bound holds with 95% confidence
- **min_samples**: Minimum observations before switching modes
- **hysteresis**: Prevents oscillation between modes
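
The hysteresis parameter produces two separate switching thresholds rather than one. Below is a minimal sketch of that rule, assuming the defaults listed above (target 0.10, hysteresis 0.05). The function `next_state` is a hypothetical stand-in for the controller's transition logic, not the class itself.

```python
def next_state(state: str, upper_bound: float,
               target: float = 0.10, hysteresis: float = 0.05) -> str:
    """Return the new mode given the current mode and the error upper bound."""
    if state == "fusion" and upper_bound > target + hysteresis:
        return "fission"   # bound too high: fall back to conservative mode
    if state == "fission" and upper_bound < target - hysteresis:
        return "fusion"    # bound comfortably low: allow aggressive mode
    return state           # inside the dead band: keep the current mode


state = "fission"
for bound in (0.12, 0.04, 0.12, 0.16, 0.08):
    state = next_state(state, bound)
    print(f"upper_bound={bound:.2f} -> {state}")
```

A bound of 0.12 leaves either mode unchanged, which is what keeps the controller from flipping back and forth when the bound hovers near the target.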

### Algorithm:
1. Collect error observations over time
2. Compute DKW epsilon bound: `ε = √(ln(2/δ) / (2n))`
3. Calculate error upper bound: `empirical_error + ε`
4. Make fusion/fission decision based on bounds
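
To make step 2 concrete, the following sketch evaluates the bound for a few sample sizes and inverts it to find how many samples are needed before ε falls below a given width. It uses the notebook's default δ = 0.05; the helper names `dkw_epsilon` and `samples_needed` are illustrative and separate from the controller class.

```python
import math


def dkw_epsilon(n: int, delta: float = 0.05) -> float:
    """Width of the DKW band for n samples at failure probability delta."""
    return math.sqrt(math.log(2 / delta) / (2 * n))


def samples_needed(epsilon: float, delta: float = 0.05) -> int:
    """Smallest n for which dkw_epsilon(n, delta) <= epsilon."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


for n in (15, 100, 1000):
    print(f"n={n:5d}  epsilon={dkw_epsilon(n):.3f}")

print("samples needed for epsilon <= 0.05:", samples_needed(0.05))
```

Under these defaults ε is about 0.136 at n = 100, so the upper bound `empirical_error + ε` cannot drop below the fusion threshold of 0.05 (`epsilon_target - hysteresis`) until at least 738 observations have accumulated, even if no errors are observed.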

## Step 3: Demonstration with Sample Data

For demonstration purposes, here's what the collected data would look like. This example shows the structure without requiring the full dataset download:

In [None]:
# Inline dataset - replaces reading from external JSON files
# This simulates the data that would be in "../dataset_001/data_out.json"

experimental_data = [
    {"id": "example_000", "difficulty": 0.1},
    {"id": "example_001", "difficulty": 0.05}, 
    {"id": "example_002", "difficulty": 0.3},
    {"id": "example_003", "difficulty": 0.15},
    {"id": "example_004", "difficulty": 0.08},
    {"id": "example_005", "difficulty": 0.25},
    {"id": "example_006", "difficulty": 0.12},
    {"id": "example_007", "difficulty": 0.18},
    {"id": "example_008", "difficulty": 0.06},
    {"id": "example_009", "difficulty": 0.22},
    {"id": "example_010", "difficulty": 0.04},
    {"id": "example_011", "difficulty": 0.28},
    {"id": "example_012", "difficulty": 0.14},
    {"id": "example_013", "difficulty": 0.09},
    {"id": "example_014", "difficulty": 0.31},
]

print(f"📈 Dataset loaded with {len(experimental_data)} examples")
print(f"🎯 Difficulty range: {min(ex['difficulty'] for ex in experimental_data):.2f} - {max(ex['difficulty'] for ex in experimental_data):.2f}")

# Display first few examples
print("\n📋 Sample data:")
for ex in experimental_data[:5]:
    print(f"  {ex['id']}: difficulty = {ex['difficulty']}")

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Load HuggingFace dataset
    ds = load_dataset("gsm8k", "main", split="test[:200]")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy; unbounded, so long questions exceed 1.0
        })

    return data

## Step 2: Define Data Collection Function

The `collect_data()` function loads the GSM8K dataset and processes it into our desired format. Each example gets:
- A unique ID
- The original question
- The answer
- A difficulty score (based on question length as a simple proxy)
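
Question length divided by 100 is unbounded, so questions longer than 100 characters score above 1.0. If the score is later treated as an error probability, one option is to clip it. A hedged sketch follows; `length_difficulty` and its `scale` parameter are hypothetical and not part of the original script.

```python
def length_difficulty(question: str, scale: int = 100) -> float:
    """Length-based difficulty proxy, clipped to the range [0.0, 1.0]."""
    return min(len(question) / scale, 1.0)


print(length_difficulty("What is 2+2?"))   # 12 characters -> 0.12
print(length_difficulty("x" * 250))        # clipped to 1.0
```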

## üìä Dataset Configuration

The experimental dataset contains examples with varying difficulty levels. Each example has:
- **ID**: Unique identifier 
- **Difficulty**: Probability of error occurrence (0.0 to 1.0)

In real scenarios, this data would come from external files, but here we inline it for self-containment.
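
For reference, reading the same records from a file might look like the sketch below. The loader name `load_examples` and its validation rule are assumptions; the demo writes a temporary file so the block runs without any external data.

```python
import json
import os
import tempfile


def load_examples(path):
    """Load example records from a JSON file and check the required fields."""
    with open(path) as f:
        examples = json.load(f)
    for ex in examples:
        # A missing difficulty is treated as invalid via the -1.0 default
        if "id" not in ex or not 0.0 <= ex.get("difficulty", -1.0) <= 1.0:
            raise ValueError(f"Invalid example record: {ex!r}")
    return examples


# Round-trip two records through a temporary file to show the expected layout
records = [{"id": "example_000", "difficulty": 0.1},
           {"id": "example_001", "difficulty": 0.05}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_out.json")
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
    loaded = load_examples(path)

print(f"Loaded {len(loaded)} examples")
```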

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from datasets import load_dataset

In [None]:
# Import required libraries
import json
import numpy as np
from dataclasses import dataclass, field
import matplotlib.pyplot as plt
import pandas as pd

# Set random seed for reproducible results
np.random.seed(42)

print("✅ Libraries imported successfully!")
print("📊 Ready to run DKW Controller experiments")

## Step 1: Import Required Libraries

We'll import the necessary libraries for data processing and dataset loading.

# Dataset Collection for DKW Benchmark

This notebook demonstrates the dataset collection process for DKW controller evaluation using the GSM8K benchmark dataset.

**Artifact Information:**
- ID: dataset_001
- Name: data.py
- Converted to interactive Jupyter notebook format

## Overview
This notebook collects and processes benchmark data from the GSM8K dataset for evaluation purposes. The original script has been converted into an interactive format with all dependencies inlined for easy use.

# DKW Controller Implementation - Interactive Demo

This notebook demonstrates the **DKW (Dvoretzky-Kiefer-Wolfowitz) Controller** for making fusion/fission decisions with statistical guarantees.

## Overview
The DKW Controller uses statistical bounds to decide between two operating modes:
- **Fusion**: Aggressive mode for better performance 
- **Fission**: Conservative mode for better reliability

The controller provides theoretical guarantees on error rates using the DKW inequality.