# Dataset Collection for DKW Benchmark

This notebook is an interactive version of the `data.py` script. It collects benchmark data for DKW controller evaluation from the GSM8K dataset.

## Setup and Imports

First, let's import the required libraries. This notebook uses the HuggingFace `datasets` library to load the GSM8K dataset.

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from datasets import load_dataset

## Data Collection Function

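The `collect_data()` function loads the first 200 examples of the GSM8K test split from HuggingFace and converts each one into a flat record with four fields:
- **id**: Unique identifier of the form `example_000`, `example_001`, ...
- **question**: The math problem text
- **answer**: The reference answer as stored in GSM8K, which is the worked solution followed by a final line of the form `#### <number>`
- **difficulty**: A simple proxy, computed as the question length in characters divided by 100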

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
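    # Load the first 200 examples of the GSM8K test split from the HuggingFace Hub
    ds = load_dataset("gsm8k", "main", split="test[:200]")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",  # Zero-padded to three digits
            "question": example["question"],
            "answer": example["answer"],  # Worked solution ending in "#### <number>"
            "difficulty": len(example["question"]) / 100,  # Proxy: length in characters / 100
        })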

    return data
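
Because GSM8K stores the worked solution and the final result in the same `answer` string, an evaluation step may want only the final value. The following is a minimal sketch of how that could be done; `extract_final_answer` is a hypothetical helper that is not part of the original `data.py`, and the sample string is made up for illustration rather than taken from the dataset.

```python
import re

# Hypothetical helper (an assumption, not part of the original script)
def extract_final_answer(answer_text):
    """Return the text after the '####' marker on the final line, or None if absent."""
    match = re.search(r"####\s*(.+?)\s*$", answer_text.strip())
    if match is None:
        return None
    # Drop thousands separators so "1,200" becomes "1200"
    return match.group(1).replace(",", "")

# Made-up string in the GSM8K answer layout: reasoning lines, then "#### <number>"
sample_answer = "Tom has 3 boxes of 4 apples, so 3 * 4 = 12 apples.\n#### 12"
final = extract_final_answer(sample_answer)
print(final)
```

The helper returns a string rather than a number, leaving any numeric conversion to the caller.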

## Collect and Process Data

Now let's run the data collection function and see how many examples we get:

In [None]:
# Collect the data
data = collect_data()
print(f"Collected {len(data)} examples")

# Display the first few examples
print("\nFirst 3 examples:")
for i, record in enumerate(data[:3], start=1):
    print(f"\nExample {i}:")
    print(f"  ID: {record['id']}")
    print(f"  Question: {record['question'][:100]}...")  # Truncate for display
    print(f"  Answer: {record['answer']}")
    print(f"  Difficulty: {record['difficulty']:.2f}")
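
Beyond eyeballing the first few records, it can help to check how the `difficulty` proxy is distributed across the collection. The sketch below is one possible way to do that; `summarize_difficulty` and the stand-in records are hypothetical and only illustrate the idea, so that the cell runs without downloading anything.

```python
import statistics

# Hypothetical helper (an assumption, not part of the original script)
def summarize_difficulty(records):
    """Return the minimum, median, and maximum of the 'difficulty' field."""
    values = [record["difficulty"] for record in records]
    return {
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }

# Stand-in records; replace with `data` after running collect_data()
demo_records = [
    {"difficulty": 0.5},
    {"difficulty": 1.5},
    {"difficulty": 2.5},
]
summary = summarize_difficulty(demo_records)
print(summary)
```

With the real data, calling `summarize_difficulty(data)` would show the range of question lengths in units of 100 characters.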

## Sample Output Data

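For reference, the cell below shows the record structure that the original script wrote to `data_out.json`. The three records are small illustrative placeholders, not actual GSM8K entries, and their `difficulty` values are likewise illustrative rather than computed from the question length. They are inlined so that the notebook does not depend on that file:

In [None]:
# Illustrative records in the format written to data_out.json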
sample_output_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print("Sample output data structure:")
print(json.dumps(sample_output_data, indent=2))
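
Since downstream code relies on every record having the same four fields, a quick structural check can catch malformed records early. The following is a minimal sketch under the assumption that the fields and types are exactly those shown above; `REQUIRED_FIELDS` and `validate_record` are hypothetical names introduced here for illustration.

```python
# Expected field names and types, inferred from the record format shown above
REQUIRED_FIELDS = {
    "id": str,
    "question": str,
    "answer": str,
    "difficulty": (int, float),
}

# Hypothetical helper (an assumption, not part of the original script)
def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

good = {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15}
bad = {"id": 7, "question": "What is 2+2?", "answer": "4"}

print(validate_record(good))
print(validate_record(bad))
```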

## Optional: Save Data to File

If you want to save the collected data to a JSON file (as the original script did), you can run the following cell:

In [None]:
# Optional: Save the collected data to a JSON file
# Uncomment the lines below if you want to save the data

# with open("data_out.json", "w") as f:
#     json.dump(data, f, indent=2)
# print(f"Saved {len(data)} examples to data_out.json")

print("To save data, uncomment the lines above and run this cell")
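
If you do save the data, it may be worth confirming that the file reads back unchanged. The sketch below writes to a temporary directory instead of the working directory so it leaves nothing behind; `save_and_reload` and the demo record are hypothetical and stand in for the real `data` list.

```python
import json
import os
import tempfile

# Hypothetical helper (an assumption, not part of the original script)
def save_and_reload(records, path):
    """Write records to a JSON file, then read the file back and return its contents."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Stand-in record; replace with `data` after running collect_data()
demo_records = [
    {"id": "example_000", "question": "How many legs do 3 cats have?",
     "answer": "3 * 4 = 12\n#### 12", "difficulty": 0.29},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(demo_records, os.path.join(tmp_dir, "data_out.json"))

round_trip_ok = reloaded == demo_records
print(f"Round trip preserved the data: {round_trip_ok}")
```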

## Usage Notes

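- **Self-contained**: This notebook requires no local data files, since all sample data is inlined. Running `collect_data()` still needs the `datasets` package and network access (or a cached copy) to download GSM8K
- **Customizable**: You can modify the data collection parameters (e.g., change `test[:200]` to get more or fewer examples). The GSM8K test split has 1,319 examples, so slices beyond 1,000 produce IDs wider than the three digits that `{i:03d}` pads to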
- **Interactive**: Run cells individually to explore the data step by step
- **Original functionality**: The core logic matches the original `data.py` script
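
One way to make the customization mentioned above easier is to separate record building from dataset loading, so that the example count and ID width become parameters. The following is a sketch of that idea, not the original script's design; `build_records`, `limit`, and `id_width` are hypothetical names, and the demo examples are made up so the cell runs without a download.

```python
# Hypothetical variant of the record-building loop (an assumption, not the original code)
def build_records(examples, limit=None, id_width=3):
    """Convert an iterable of {'question', 'answer'} dicts into benchmark records."""
    records = []
    for i, example in enumerate(examples):
        if limit is not None and i >= limit:
            break
        records.append({
            "id": f"example_{i:0{id_width}d}",  # Zero-padded to id_width digits
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Same proxy as collect_data()
        })
    return records

# Made-up examples in the same shape as GSM8K rows
demo_examples = [
    {"question": "How many legs do 3 cats have?", "answer": "3 * 4 = 12\n#### 12"},
    {"question": "How many wheels do 2 cars have?", "answer": "2 * 4 = 8\n#### 8"},
]
records = build_records(demo_examples, limit=1, id_width=4)
print(records)
```

With the real dataset, the same function could be called as `build_records(load_dataset("gsm8k", "main", split="test"), limit=200)`; passing `id_width=4` would keep IDs uniform across the full test split.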

To get started, simply run all cells from top to bottom!