In [None]:
# Uncomment the following lines to save data to file
# with open("data_out.json", "w") as f:
#     json.dump(data, f, indent=2)
# print("Data saved to data_out.json")

## Optional: Save Data to File

To save the processed data to a JSON file, as the original script did, uncomment the `json.dump` lines in the save cell and run it.
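
Separately from the save cell, the following is a minimal sketch of a write-and-read-back check, useful for confirming that the saved file reproduces the in-memory records. It assumes a single hand-written record in the shape `collect_data()` returns, and it writes to a temporary directory rather than the working directory; the variable name `restored` is illustrative only.

```python
import json
import os
import tempfile

# One record in the shape produced by collect_data() (values are illustrative).
data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.12,
    }
]

# Write to a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_out.json")
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
    with open(path) as f:
        restored = json.load(f)

# True when the file contents match the original records.
print(restored == data)
```

Pointing `path` at `data_out.json` in the working directory instead would keep the file, matching the commented-out save cell.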

In [None]:
# Execute the data collection
data = collect_data()

# Display results
print(f"Collected {len(data)} examples")
print("\nProcessed data:")
print(json.dumps(data, indent=2))

## Execute Data Collection

Run the data collection and display the results:

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Use inlined sample data instead of loading from HuggingFace
    ds = sample_dataset

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Proxy: question length in characters / 100 (not capped at 1.0)
        })

    return data

## Data Processing Function

The `collect_data()` function iterates over the raw dataset and, for each example, adds a sequential `id` and a length-based `difficulty` score:
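
To make the difficulty score concrete, here is a small sketch that isolates the same length-based calculation. The helper name `difficulty_proxy` is hypothetical and does not appear in the original script; it only mirrors the expression `len(question) / 100`.

```python
def difficulty_proxy(question: str) -> float:
    """Length-based proxy: number of characters divided by 100."""
    return len(question) / 100


short_question = "What is 2+2?"  # 12 characters
long_question = "x" * 130        # placeholder text longer than 100 characters

print(difficulty_proxy(short_question))
print(difficulty_proxy(long_question))
```

Because the score is a raw ratio, questions longer than 100 characters score above 1.0. If a bounded score is needed downstream, it would have to be clamped or normalized separately.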

In [None]:
# Inlined sample data (replaces HuggingFace dataset loading)
# This simulates the structure that would come from load_dataset("gsm8k", "main", split="test[:200]")
sample_dataset = [
    {
        "question": "What is 2+2?",
        "answer": "4"
    },
    {
        "question": "If x=5, what is 2x?", 
        "answer": "10"
    },
    {
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3"
    },
    {
        "question": "A store sells apples for $2 each and oranges for $3 each. If John buys 4 apples and 2 oranges, how much does he spend in total?",
        "answer": "$14"
    },
    {
        "question": "What is 15% of 80?",
        "answer": "12"
    }
]

print(f"Sample dataset loaded with {len(sample_dataset)} examples")

## Inlined Sample Data

Instead of loading from HuggingFace, we'll use this sample dataset that simulates the `gsm8k` format:
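
Since the rest of the notebook relies only on each record having `question` and `answer` fields, a quick structural check can catch malformed entries before processing. The sketch below is an assumption about what such a check might look like. `REQUIRED_KEYS` and `validate_records` are hypothetical names, not part of the original script.

```python
REQUIRED_KEYS = ("question", "answer")  # fields read by the processing step


def validate_records(records):
    """Return the indices of records that lack a required string field."""
    bad_indices = []
    for i, record in enumerate(records):
        if not all(isinstance(record.get(key), str) for key in REQUIRED_KEYS):
            bad_indices.append(i)
    return bad_indices


records = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 15% of 80?"},  # deliberately missing "answer"
]

# Lists the positions of malformed records; an empty list means all passed.
print(validate_records(records))
```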

In [None]:
"""Dataset collection script for DKW benchmark."""
import json

## Overview

This notebook replaces the original HuggingFace dataset loading with inlined sample data, so it runs without network access or external files.

**Key changes from original script:**
- Inlined sample dataset instead of loading from HuggingFace `gsm8k` 
- All data processing logic preserved
- No external file dependencies
- Demonstrates the same data transformation pipeline

# Dataset Collection for DKW Benchmark

**Artifact:** data.py  
**Purpose:** Dataset collection script for DKW controller evaluation

This notebook demonstrates data collection and processing for the DKW benchmark system. The original script has been converted to a self-contained format with inlined data for easy execution.