# Dataset Collection for DKW Benchmark

This notebook demonstrates dataset collection for DKW controller evaluation using the GSM8K dataset from HuggingFace. The script processes mathematical word problems and creates structured benchmark data with difficulty estimates.

**Artifact ID:** dataset_001  
**Original file:** data.py

## Import Required Libraries

We'll need the `datasets` library to load data from HuggingFace and `json` for data serialization.

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from datasets import load_dataset

## Data Collection Function

The `collect_data()` function loads the GSM8K dataset from HuggingFace and processes it into a structured format suitable for benchmarking. Each example includes:
- **id**: Unique identifier for the example
- **question**: The mathematical word problem
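- **answer**: The reference solution; in GSM8K this is the worked reasoning followed by the final answer after a `####` marker
- **difficulty**: A simple difficulty estimate: the question length in characters divided by 100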

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Load HuggingFace dataset
    ds = load_dataset("gsm8k", "main", split="test[:200]")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy
        })

    return data
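GSM8K stores the final result at the end of the `answer` string, after a `####` marker. If the benchmark only needs that final value, it can be split off as in the sketch below. `extract_final_answer` is a hypothetical helper that is not part of the original script, and the example string is made up for illustration.

```python
def extract_final_answer(answer: str) -> str:
    """Return the text after the last '####' marker, or the whole string if absent."""
    marker = "####"
    if marker not in answer:
        return answer.strip()
    final = answer.rsplit(marker, 1)[1].strip()
    # Drop thousands separators so "1,000" compares equal to "1000".
    return final.replace(",", "")


# Made-up answer string in the GSM8K layout (reasoning, then '#### <result>').
example_answer = "She sells 16 - 3 - 4 = 9 eggs.\nShe makes 9 * 2 = 18 dollars.\n#### 18"
print(extract_final_answer(example_answer))
```

The helper could be applied inside `collect_data()` to store an extra field (for example `final_answer`) next to the full solution text.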

## Execute Data Collection

Run the data collection function and display the results. This loads the first 200 examples of the GSM8K test set and converts them into our benchmark format.

**Note:** No local input files are needed, but the first run requires network access to download the dataset from HuggingFace.

In [None]:
# Collect the data
data = collect_data()

# Display results
print(f"Collected {len(data)} examples")
print("\nFirst 3 examples:")
for i in range(min(3, len(data))):
    print(f"\nExample {i+1}:")
    print(f"  ID: {data[i]['id']}")
    print(f"  Question: {data[i]['question'][:100]}...")
    print(f"  Answer: {data[i]['answer']}")
    print(f"  Difficulty: {data[i]['difficulty']:.2f}")

## Save Data to JSON File

Optionally save the collected data to a JSON file for later use. This mimics the original script's behavior.

In [None]:
# Save data to JSON file (optional)
with open("data_out.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)

print("Data saved to 'data_out.json'")

# Show the JSON structure
print(f"\nJSON file contains {len(data)} examples")
print("Sample JSON structure:")
print(json.dumps(data[:2], indent=2))
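When the file is read back later, a quick structural check can catch truncated or hand-edited files. The following is a minimal sketch, assuming the four-field record layout used in this notebook; `load_and_validate` and `REQUIRED_FIELDS` are hypothetical names, and the demo writes a made-up record to a temporary directory so it does not touch `data_out.json`.

```python
import json
import os
import tempfile

# Expected fields and types, taken from the record layout used in this notebook.
REQUIRED_FIELDS = {
    "id": str,
    "question": str,
    "answer": str,
    "difficulty": (int, float),
}


def load_and_validate(path):
    """Load a benchmark JSON file and check every record's fields and types."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for index, record in enumerate(records):
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(record.get(field), expected_type):
                raise ValueError(f"record {index}: bad or missing field {field!r}")
    return records


# Round-trip demo on a temporary file with one made-up record.
demo = [{"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.12}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "data_out.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo, f, indent=2)
    reloaded = load_and_validate(demo_path)

print(f"Validated {len(reloaded)} record(s)")
```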

## Example Output Format

The records below are hand-written placeholders, not actual GSM8K entries; they only illustrate the field names and types. Each `difficulty` value follows the same rule as `collect_data()`: question length in characters divided by 100.

In [None]:
# Example of processed data structure (for reference)
sample_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.12
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.19
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.18
    }
]

print("Sample data structure:")
print(json.dumps(sample_data, indent=2))

## Usage and Customization

This notebook is completely self-contained and ready to run! Here are some ways you can customize it:

### Modify Dataset Parameters:
- Change the number of examples: `split="test[:200]"` → `split="test[:500]"`
- Use a different split: `split="test"` or `split="train"`
- GSM8K provides no validation split; to get one, hold out part of the training set, e.g. `split="train[:10%]"`
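The split strings above can be built from parameters instead of being edited by hand. This is a minimal sketch: `split_spec` and `GSM8K_SPLITS` are hypothetical names, and the check assumes the `main` configuration of GSM8K, which provides `train` and `test` splits.

```python
from typing import Optional

# Splits available in the "main" configuration of GSM8K.
GSM8K_SPLITS = ("train", "test")


def split_spec(name: str, limit: Optional[int] = None) -> str:
    """Build a split string such as 'test[:500]' for load_dataset."""
    if name not in GSM8K_SPLITS:
        raise ValueError(f"unknown split {name!r}; expected one of {GSM8K_SPLITS}")
    if limit is None:
        return name  # the whole split
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"{name}[:{limit}]"


print(split_spec("test", 500))  # test[:500]
print(split_spec("train"))      # train
```

The result can be passed directly as the `split` argument, e.g. `load_dataset("gsm8k", "main", split=split_spec("test", 500))`.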

### Adjust Difficulty Calculation:
The current difficulty is based on question length. You could modify it to use:
- Number of mathematical operations
- Presence of certain keywords
- Complexity scoring algorithms
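As one possible starting point, the sketch below combines question length with the count of numbers and a few operation-related keywords. The keyword list and the weights are arbitrary illustration values, not calibrated against any data, and `estimate_difficulty` is a hypothetical name.

```python
import re

# Hypothetical keyword list; tune it for your own data.
OPERATION_KEYWORDS = ("total", "each", "per", "twice", "half", "remaining", "percent")


def estimate_difficulty(question: str) -> float:
    """Combine length, count of numbers, and keyword hits into one score."""
    text = question.lower()
    length_score = len(question) / 100
    # Count integers and decimals mentioned in the question.
    number_count = len(re.findall(r"\d+(?:\.\d+)?", question))
    # Count how many distinct keywords appear as whole words.
    keyword_hits = sum(1 for kw in OPERATION_KEYWORDS if re.search(rf"\b{kw}\b", text))
    # Weights are arbitrary illustration values.
    return length_score + 0.5 * number_count + 0.25 * keyword_hits


question = "Tom has 3 apples and buys 4 more. How many does he have in total?"
print(f"{estimate_difficulty(question):.2f}")
```

To use it, replace the `len(example["question"]) / 100` expression in `collect_data()` with a call to this function.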

### Data Processing:
- Add additional fields (e.g., topic classification, solution steps)
- Filter examples by certain criteria
- Apply text preprocessing
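The three ideas above can be combined in a single pass over the collected records. This is a minimal sketch, assuming the record layout produced by `collect_data()`; `normalize_text`, `filter_by_difficulty`, and the `num_words` field are hypothetical additions, and the two records are made up.

```python
def normalize_text(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def filter_by_difficulty(records, min_difficulty=0.0, max_difficulty=float("inf")):
    """Keep records whose difficulty lies in [min_difficulty, max_difficulty]."""
    kept = []
    for record in records:
        if min_difficulty <= record["difficulty"] <= max_difficulty:
            # Copy the record so the input list is not modified.
            cleaned = dict(record, question=normalize_text(record["question"]))
            # Example of an added field: word count of the cleaned question.
            cleaned["num_words"] = len(cleaned["question"].split())
            kept.append(cleaned)
    return kept


# Made-up records for illustration.
records = [
    {"id": "example_000", "question": "What  is\n2+2?", "answer": "4", "difficulty": 0.13},
    {"id": "example_001", "question": "A much longer question ...", "answer": "7", "difficulty": 2.5},
]
easy = filter_by_difficulty(records, max_difficulty=1.0)
print(easy)
```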

### Running the Notebook:
1. Make sure you have the required packages: `pip install datasets`
2. Run all cells in order
3. The data will be collected from HuggingFace automatically
4. No external files needed!