## How to Modify and Extend

### To use with real HuggingFace data:
1. Install the required package: `pip install datasets`
2. Uncomment the line `data = collect_data()` in the data collection section
3. Comment out the `sample_data` definition, or point the later cells at `data` instead of `sample_data`
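
The same switch can be made with a single flag instead of commenting lines in and out. This is a minimal sketch: `USE_REAL_DATA` and `choose_data` are hypothetical names that do not appear in the original notebook, and `collect_fn` stands in for the notebook's `collect_data()`.

```python
# Hypothetical flag: set to True once `datasets` is installed and network access is available.
USE_REAL_DATA = False


def choose_data(use_real_data, collect_fn, sample):
    """Return freshly collected data if requested, otherwise the in-line sample."""
    if use_real_data:
        # collect_fn is only called here, so the sample path never touches the network
        return collect_fn()
    return sample


# Stand-ins so this sketch runs on its own; in the notebook, pass
# collect_data and sample_data instead.
fake_sample = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15}
]
data = choose_data(USE_REAL_DATA, collect_fn=lambda: [], sample=fake_sample)
print(f"Using {len(data)} examples")
```

Passing the collector as an argument keeps the download lazy: nothing is fetched unless the flag is set.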

### To customize the dataset:
- **Change dataset source:** Modify the `load_dataset()` call in `collect_data()`
- **Adjust sample size:** Change the `"test[:200]"` slice in the `split` argument, for example to `"test[:50]"` or `"train"`
- **Modify difficulty calculation:** Update the difficulty formula in the data processing loop
- **Add new fields:** Extend the data dictionary with additional metadata

### Example modifications:

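```python
from datetime import datetime

# Use a different dataset (its field names may differ from GSM8K's;
# SQuAD, for example, stores answers under "answers", not "answer")
ds = load_dataset("squad", split="validation[:100]")

# Inside the loop in collect_data(), where `i` and `example` are defined:

# Difficulty proxy that also accounts for answer length
difficulty = (len(example["question"]) + len(example["answer"])) / 200

# Add additional metadata
data.append({
    "id": f"example_{i:03d}",
    "question": example["question"],
    "answer": example["answer"],
    "difficulty": difficulty,
    "word_count": len(example["question"].split()),
    "created_at": datetime.now().isoformat(),
})
```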
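
The fragments above can be assembled into one reusable function. The sketch below is one possible arrangement, not the original script's code: `length_difficulty` and `build_records` are hypothetical names, and the function accepts any iterable of dicts with `question` and `answer` keys so that it can be tried without downloading a dataset.

```python
from datetime import datetime, timezone


def length_difficulty(example, scale=200):
    """Difficulty proxy: combined question and answer length, divided by `scale`."""
    return (len(example["question"]) + len(example["answer"])) / scale


def build_records(examples, difficulty_fn=length_difficulty):
    """Turn an iterable of {"question", "answer"} dicts into benchmark records."""
    created_at = datetime.now(timezone.utc).isoformat()  # one timestamp per run
    records = []
    for i, example in enumerate(examples):
        records.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": difficulty_fn(example),
            "word_count": len(example["question"].split()),
            "created_at": created_at,
        })
    return records


# Works on plain dicts, so no dataset download is needed to try it
records = build_records([{"question": "What is 2+2?", "answer": "4"}])
print(records[0]["difficulty"])  # (12 + 1) / 200 = 0.065
print(records[0]["word_count"])  # 3
```

Passing the difficulty calculation as `difficulty_fn` means a different formula can be swapped in without touching the loop.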

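With the in-line sample data, the analysis and export cells run without downloading anything. Only the imports cell needs the `datasets` package installed, and only a call to `collect_data()` needs network access.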

In [None]:
# Save the data to JSON file (optional)
# Uncomment the lines below if you want to save the data

# with open("data_out.json", "w") as f:
#     json.dump(sample_data, f, indent=2)
# print("Data saved to data_out.json")

# For demonstration, let's show what the JSON would look like
print("JSON representation of the data:")
print(json.dumps(sample_data, indent=2))

## Data Export (Optional)

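The original script saved data to `data_out.json`. The export cell does the same with `json.dump`, but the write is commented out so that nothing is saved by default; the cell only prints the JSON representation.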
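
A save-and-reload round trip is a quick way to confirm that the exported file contains what was intended. The sketch below assumes records shaped like the sample data; `save_records` and `load_records` are hypothetical helper names, and the file is written to a temporary directory rather than to the working directory.

```python
import json
import tempfile
from pathlib import Path


def save_records(records, path):
    """Write records to `path` as indented UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


def load_records(path):
    """Read records back from a JSON file written by save_records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


records = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15}
]

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data_out.json"
    save_records(records, out_path)
    restored = load_records(out_path)

print(restored == records)  # True: dicts, strings, and floats round-trip through JSON
```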

In [None]:
# Analyze the dataset characteristics
total_examples = len(sample_data)
difficulties = [example["difficulty"] for example in sample_data]
avg_difficulty = sum(difficulties) / len(difficulties)
min_difficulty = min(difficulties)
max_difficulty = max(difficulties)

print("Dataset Statistics:")
print(f"Total examples: {total_examples}")
print(f"Average difficulty: {avg_difficulty:.2f}")
print(f"Difficulty range: {min_difficulty:.2f} - {max_difficulty:.2f}")

print("\nAll examples:")
for example in sample_data:
    print(f"ID: {example['id']}")
    print(f"Question: {example['question']}")
    print(f"Answer: {example['answer']}")
    print(f"Difficulty: {example['difficulty']:.2f}")
    print("-" * 50)

## Data Analysis

The analysis cell reports the number of examples, the mean and range of the difficulty scores, and then prints every example in full.
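
The same statistics can be gathered with the standard-library `statistics` module. This is a sketch under the assumption that every record has a numeric `difficulty` field; `summarize_difficulty` is a hypothetical helper, and it returns early on an empty list so that no division by zero occurs.

```python
from statistics import fmean, median


def summarize_difficulty(records):
    """Return count, mean, median, min, and max of the "difficulty" field."""
    values = [r["difficulty"] for r in records]
    if not values:
        # An empty dataset would otherwise raise in fmean(), min(), and max()
        return {"count": 0}
    return {
        "count": len(values),
        "mean": fmean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


summary = summarize_difficulty([
    {"difficulty": 0.15},
    {"difficulty": 0.22},
    {"difficulty": 0.28},
])
print(f"mean={summary['mean']:.2f} median={summary['median']:.2f}")
```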

In [None]:
# For demonstration purposes, we'll use sample data instead of calling HuggingFace
# This makes the notebook completely self-contained

# Simulate the data collection with sample data
# (In a real scenario, uncomment the line below to collect from HuggingFace)
# data = collect_data()

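# Hand-written sample records in the same shape that collect_data() returns.
# The difficulty values are illustrative; they are not computed from question length.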
sample_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4", 
        "difficulty": 0.15
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002", 
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print(f"Collected {len(sample_data)} examples")
print("Sample data structure:")
for i, example in enumerate(sample_data[:2]):
    print(f"Example {i+1}: {example}")

## Data Collection and Processing

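The next step would normally be to call `collect_data()` and examine the results. To keep the notebook runnable offline, the cell uses a small hand-written sample instead and leaves the call commented out. In the original script, the collected data would be saved to `data_out.json`.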

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Load HuggingFace dataset
    ds = load_dataset("gsm8k", "main", split="test[:200]")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            # Simple proxy: question length in characters / 100 (not capped at 1)
            "difficulty": len(example["question"]) / 100,
        })

    return data

## Data Collection Function

`collect_data()` loads examples from the HuggingFace GSM8K dataset and converts them into records for benchmark evaluation.

**Key features:**
- Loads the first 200 examples from the test split
- Adds a unique, zero-padded ID (`example_000`, `example_001`, ...) to each example
- Calculates a simple difficulty proxy: question length in characters divided by 100
- Returns structured data ready for evaluation
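
To make the difficulty proxy concrete, the sketch below isolates it as a function and applies it to a short and a long question. `question_length_difficulty` mirrors the expression used in `collect_data()`; `capped_difficulty` is a hypothetical variant, not part of the original script, shown only because the raw proxy has no upper bound.

```python
def question_length_difficulty(question, scale=100):
    """Proxy used in collect_data(): characters in the question divided by `scale`."""
    return len(question) / scale


def capped_difficulty(question, scale=100):
    """Hypothetical variant that clamps the proxy to the range [0, 1]."""
    return min(1.0, question_length_difficulty(question, scale))


print(question_length_difficulty("What is 2+2?"))  # 12 characters -> 0.12
print(question_length_difficulty("x" * 250))       # 2.5: the raw proxy is unbounded
print(capped_difficulty("x" * 250))                # 1.0
```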

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from datasets import load_dataset

## Setup and Imports

First, let's import the necessary libraries for data collection and processing.
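
Because the rest of the notebook works on in-line sample data, the `datasets` import can be made optional. The following sketch is an assumption about how one might do that, not what the original `data.py` does: it falls back to `None` when the package is missing, so only `collect_data()` becomes unavailable.

```python
import json

try:
    from datasets import load_dataset
except ImportError:
    # `datasets` is not installed; the sample-data cells still work
    load_dataset = None

if load_dataset is None:
    print("datasets not installed: using in-line sample data only")
else:
    print("datasets available: collect_data() can be used")
```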

# Dataset Collection for DKW Benchmark

This notebook demonstrates dataset collection and processing for the DKW controller evaluation benchmark. 

**Original script:** `data.py`  
**Purpose:** Collect and process benchmark data from HuggingFace datasets