In [None]:
# Save data to JSON file (equivalent to original script functionality)
with open("data_out.json", "w") as f:
    json.dump(data, f, indent=2)

print(f"✅ Saved {len(data)} examples to data_out.json")

## Optional: Export Data to JSON File

If you want to save the data to a file (as the original script did), run the export cell, which writes `data` to `data_out.json`.
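
Separately from the export cell, one way to confirm an export is to read the file back and compare it with the in-memory list. The sketch below assumes the records are plain JSON-serializable dicts. `roundtrip_json` is a hypothetical helper name, not part of the original script, and it writes to a temporary directory so that an existing `data_out.json` is not overwritten.

```python
import json
import tempfile
from pathlib import Path


def roundtrip_json(records):
    """Write records to a temporary JSON file and read them back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "data_out.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(records, f, indent=2)
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)


# Hypothetical record, shaped like the examples in this notebook.
records = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15}
]
restored = roundtrip_json(records)
print("Round trip preserved the data:", restored == records)
```

If the comparison fails, the records probably contain a type that JSON does not preserve, such as a tuple or a non-string dictionary key.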

In [None]:
# Convert to pandas DataFrame for easier analysis
df = pd.DataFrame(data)

print("Dataset Overview:")
print(f"Total examples: {len(df)}")
print(f"Average difficulty: {df['difficulty'].mean():.3f}")
print(f"Difficulty range: {df['difficulty'].min():.3f} - {df['difficulty'].max():.3f}")

print("\nFirst few examples:")
print(df.head())

## Data Exploration and Analysis

Load the collected data into a pandas DataFrame to summarize the difficulty scores and preview the first few rows:

In [None]:
# Choose your data source (uncomment one option):

# Option 1: Live data from HuggingFace (requires internet)
# data = collect_data()

# Option 2: Use sample data (completely self-contained) 
data = sample_data

print(f"Collected {len(data)} examples")
print("First example:")
print(f"ID: {data[0]['id']}")
print(f"Question: {data[0]['question']}")
print(f"Answer: {data[0]['answer']}")
print(f"Difficulty: {data[0]['difficulty']}")

## Choose Your Data Source

You can either:
1. Use the live data collection function (requires an internet connection and the `datasets` library)
2. Use the sample data (completely self-contained)

Uncomment the option you prefer in the data source cell.
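
As an alternative to switching options by hand, the choice can be made automatically: try the live loader first and fall back to the sample data if it fails. This is a minimal sketch, not part of the original script. `load_with_fallback` and `failing_loader` are hypothetical names, and the loader is passed in as a callable so the example runs without a network connection.

```python
def load_with_fallback(loader, fallback):
    """Return loader() if it succeeds and is non-empty, else the fallback list."""
    try:
        data = loader()
    except Exception as exc:  # broad on purpose: network, import, or cache errors
        print(f"Live loading failed ({exc!r}); using sample data instead.")
        return fallback
    return data if data else fallback


def failing_loader():
    # Stand-in for a live loader when there is no network access.
    raise ConnectionError("no network")


# Hypothetical sample record, shaped like the examples in this notebook.
sample = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15}
]
data = load_with_fallback(failing_loader, sample)
print(f"Collected {len(data)} examples")
```

In this notebook the call would be `load_with_fallback(collect_data, sample_data)`, assuming both names have already been defined by running their cells.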

In [None]:
# Inline sample data (replaces reading from data_out.json)
sample_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001", 
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3", 
        "difficulty": 0.28
    }
]

print(f"Sample data contains {len(sample_data)} examples")

## Option 2: Self-Contained Sample Data

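For demonstration purposes, or when you don't want to download the full dataset, the notebook defines a small sample dataset of three examples inline. The `difficulty` values in the sample are illustrative and are not computed from question length: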

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Load HuggingFace dataset
    ds = load_dataset("gsm8k", "main", split="test[:200]")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy
        })

    return data

## Option 1: Live Data Collection from HuggingFace

This function loads data directly from the GSM8K dataset on HuggingFace. It processes the first 200 test examples and assigns each one a simple difficulty proxy: the length of the question in characters divided by 100.
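
The difficulty proxy can be isolated and tried on a couple of strings without loading the dataset. In the sketch below, `length_difficulty` and its `scale` parameter are hypothetical names introduced for illustration; the arithmetic mirrors the `len(example["question"]) / 100` expression in `collect_data`.

```python
def length_difficulty(question, scale=100):
    """Difficulty proxy: question length in characters divided by scale."""
    return len(question) / scale


questions = ["What is 2+2?", "Solve: 3y + 6 = 15"]
for q in questions:
    print(f"{length_difficulty(q):.2f}  {q}")
```

Because the score depends only on character count, it has no upper bound (a question longer than 100 characters scores above 1.0), and a long but easy question will score higher than a short but hard one.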

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from datasets import load_dataset
import pandas as pd  # For better data display

# Dataset Collection for DKW Benchmark

This notebook contains a dataset collection script for DKW controller evaluation. Originally called `data.py`, this interactive version allows you to explore and modify the data collection process.

## Overview
- Collects benchmark data from the GSM8K dataset
- Builds a record for each example with an ID, question, answer, and difficulty score
- Provides both live data loading and pre-loaded sample data options
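
The record layout used throughout the notebook can be written down as a `TypedDict`, with a small check that a record matches it. This is a sketch under assumptions: `BenchmarkExample` and `is_valid_example` are hypothetical names that do not appear in the original script, and the field names are taken from the notebook's records.

```python
from typing import TypedDict


class BenchmarkExample(TypedDict):
    id: str
    question: str
    answer: str
    difficulty: float


def is_valid_example(record):
    """Check that a record has exactly the expected keys with matching types."""
    expected = {"id": str, "question": str, "answer": str, "difficulty": (int, float)}
    if set(record) != set(expected):
        return False
    return all(isinstance(record[key], kind) for key, kind in expected.items())


example: BenchmarkExample = {
    "id": "example_000",
    "question": "What is 2+2?",
    "answer": "4",
    "difficulty": 0.15,
}
print(is_valid_example(example))
```

A check like this could be run over the collected list before exporting it, so that a missing or mistyped field is caught early.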