# DKW Benchmark Dataset Collection

This notebook demonstrates dataset collection for DKW controller evaluation using the GSM8K benchmark dataset.

**Original Artifact:** `data.py` (dataset_001)

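This notebook is self-contained: it runs on inline sample data and reads no external files. The only third-party dependency is the `datasets` package, which is imported below and used by `collect_data()`. That function is defined but never called, so nothing is downloaded.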

In [None]:
# Required imports
import json
from datasets import load_dataset

## Data Collection Function

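The `collect_data()` function loads the first 200 examples of the GSM8K test split (`main` configuration) from HuggingFace and converts each one into a record with four fields:
- `id`: a unique, zero-padded identifier such as `example_000`
- `question` and `answer`: copied unchanged from the dataset
- `difficulty`: a simple proxy, the question length in characters divided by 100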

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Load HuggingFace dataset
    ds = load_dataset("gsm8k", "main", split="test[:200]")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy
        })

    return data
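
The per-example transformation inside the loop can be tried offline on a plain dictionary. The sketch below factors it into a hypothetical helper, `format_example` (not part of the original script), and applies it to a made-up record that has the same `question` and `answer` fields:

```python
def format_example(index, example):
    """Apply the same field mapping as collect_data() to one record.

    `format_example` is a hypothetical helper introduced for illustration.
    """
    return {
        "id": f"example_{index:03d}",
        "question": example["question"],
        "answer": example["answer"],
        # Difficulty proxy: question length in characters divided by 100
        "difficulty": len(example["question"]) / 100,
    }


# A made-up record standing in for one dataset row
record = {"question": "What is 2+2?", "answer": "4"}
formatted = format_example(7, record)
print(formatted)
```

Keeping the mapping in its own function makes it possible to check the output format without installing `datasets` or going online.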

## Self-Contained Data

For demonstration purposes, we use inline sample data instead of downloading from HuggingFace, so the notebook runs without internet access.

The data below follows the output format of the original script. The questions and difficulty values are illustrative placeholders: they are not GSM8K entries, and the difficulty values are not computed from question length.

In [None]:
# Inline sample data (same record format as data_out.json)
sample_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print(f"Loaded {len(sample_data)} sample examples")

## Main Execution

This section mirrors the original script's main block. Instead of calling `collect_data()`, which needs network access to HuggingFace, it uses the inline sample data and prints the JSON that the script would write to `data_out.json`.

In [None]:
# Main execution (equivalent to the original script's if __name__ == "__main__" block)
data = sample_data  # Using inline data instead of collect_data()

# Simulate writing to JSON file (but display content instead)
json_output = json.dumps(data, indent=2)
print(f"Collected {len(data)} examples")
print("\nJSON output (would be written to 'data_out.json'):")
print("=" * 50)
print(json_output)
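
For comparison, actually writing the file could look like the sketch below. It assumes the original script used `json.dump` with the same `indent=2`, which this notebook does not show. The `records` list is a stand-in for `data`, and the file goes into a temporary directory so nothing is left on disk:

```python
import json
import tempfile
from pathlib import Path

# Stand-in records; in the notebook this would be the `data` list
records = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15},
]

# Write into a temporary directory so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data_out.json"
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

    # Read the file back to confirm the round trip preserves the records
    with out_path.open(encoding="utf-8") as f:
        reloaded = json.load(f)

print(f"Round trip OK: {reloaded == records}")
```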

## Data Exploration and Analysis

The cell below prints each record and then summarizes the difficulty values (average, minimum, and maximum):

In [None]:
# Explore the data structure
print("Data structure analysis:")
print(f"Number of examples: {len(data)}")
print(f"Keys per example: {list(data[0].keys())}")

print("\nSample questions and their difficulties:")
for item in data:
    print(f"ID: {item['id']}")
    print(f"Question: {item['question']}")
    print(f"Answer: {item['answer']}")
    print(f"Difficulty: {item['difficulty']:.2f}")
    print("-" * 40)

# Calculate difficulty statistics
difficulties = [item['difficulty'] for item in data]
print("\nDifficulty Statistics:")
print(f"Average difficulty: {sum(difficulties) / len(difficulties):.2f}")
print(f"Min difficulty: {min(difficulties):.2f}")
print(f"Max difficulty: {max(difficulties):.2f}")
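
The same summary can also be computed with Python's standard `statistics` module, which adds the median and standard deviation at little cost. This is a sketch for illustration, not part of the original script. The values are copied from the inline sample data:

```python
import statistics

# Difficulty values copied from the inline sample data
difficulties = [0.15, 0.22, 0.28]

mean_difficulty = statistics.mean(difficulties)
median_difficulty = statistics.median(difficulties)
# Sample standard deviation; needs at least two values
stdev_difficulty = statistics.stdev(difficulties)

print(f"Mean: {mean_difficulty:.2f}")
print(f"Median: {median_difficulty:.2f}")
print(f"Std dev: {stdev_difficulty:.2f}")
```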

## How to Modify This Notebook

### To use the original HuggingFace dataset:
1. Install the required dependency: `pip install datasets`
2. Make sure the machine has internet access, since the dataset is downloaded from HuggingFace
3. In the main execution cell, change `data = sample_data` to `data = collect_data()`
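
If the download might fail, for example with no network, one option is to fall back to the inline data. The sketch below uses a hypothetical helper, `load_with_fallback`, that is not part of the original script. The `failing_loader` function stands in for `collect_data()` on a machine that cannot reach HuggingFace:

```python
def load_with_fallback(loader, fallback):
    """Return loader() if it succeeds, otherwise the fallback records.

    Hypothetical helper for illustration; not part of the original script.
    """
    try:
        return loader()
    except Exception as exc:  # e.g. missing package or no network
        print(f"Falling back to inline data: {exc}")
        return fallback


def failing_loader():
    # Stand-in for collect_data() when the download is unavailable
    raise ConnectionError("dataset hub unreachable")


fallback_records = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15},
]
result = load_with_fallback(failing_loader, fallback_records)
print(f"Loaded {len(result)} records")
```

In this notebook the call would be `data = load_with_fallback(collect_data, sample_data)`.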

### To add more sample data:
1. Modify the `sample_data` list above
2. Follow the same structure: `{"id": "...", "question": "...", "answer": "...", "difficulty": float}`
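
A quick check can catch records that stray from this structure. The sketch below is one possible approach. `REQUIRED_FIELDS` and `validate_record` are hypothetical names introduced here, and both example records are made up:

```python
# Expected fields and their types, taken from the record structure above
REQUIRED_FIELDS = {"id": str, "question": str, "answer": str, "difficulty": float}


def validate_record(record):
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


good = {"id": "example_003", "question": "What is 3*3?", "answer": "9", "difficulty": 0.12}
# Missing "answer", and "difficulty" is a string instead of a float
bad = {"id": "example_004", "question": "What is 4*4?", "difficulty": "0.12"}

print(validate_record(good))
print(validate_record(bad))
```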

### To change the difficulty calculation:
1. Modify the `difficulty` calculation in the `collect_data()` function
2. The current method divides the question length in characters by 100 as a simple proxy
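
As an illustration, the sketch below places the current proxy next to one hypothetical alternative based on word count. The alternative and its divisor of 10 are assumptions made for this example and do not come from the original script:

```python
def difficulty_by_length(question):
    """Current proxy: question length in characters divided by 100."""
    return len(question) / 100


def difficulty_by_word_count(question):
    """Hypothetical alternative: whitespace-separated word count divided by 10."""
    return len(question.split()) / 10


question = "Solve: 3y + 6 = 15"
length_score = difficulty_by_length(question)
word_score = difficulty_by_word_count(question)
print(f"Length proxy: {length_score:.2f}")
print(f"Word-count proxy: {word_score:.2f}")
```

To switch proxies, replace the `len(example["question"]) / 100` expression in `collect_data()` with a call to the chosen function.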