## 6. Usage Notes and Customization

### Key Features of This Notebook:

1. **Self-Contained**: No local file dependencies; all sample data is inlined, and the collection step downloads GSM8K directly from HuggingFace
2. **Interactive**: You can modify parameters and see results immediately
3. **Educational**: Each step is clearly documented and explained

### Customization Options:

- **Dataset Size**: Change `split="test[:200]"` to adjust how many examples to load
- **Difficulty Calculation**: Modify the `len(example["question"]) / 100` formula to use different difficulty metrics
- **Output Format**: Add or modify fields in the dictionary that `collect_data()` builds for each example
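
As a rough illustration of the difficulty customization, the sketch below contrasts the notebook's length-based proxy with one possible alternative. The helper names `length_difficulty` and `word_count_difficulty` and the `scale` parameter are hypothetical; they are not part of the original `data.py`.

```python
def length_difficulty(question: str) -> float:
    """The proxy used in this notebook: character count divided by 100."""
    return len(question) / 100


def word_count_difficulty(question: str, scale: int = 20) -> float:
    """Hypothetical alternative: whitespace-separated word count divided by `scale`."""
    return len(question.split()) / scale


question = "If x=5, what is 2x?"
print(length_difficulty(question))      # based on characters
print(word_count_difficulty(question))  # based on words
```

Either function could be swapped in for the `difficulty` expression inside `collect_data()`; which proxy is appropriate depends on the evaluation, and neither measures actual problem difficulty.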

### Original Script Equivalent:

This notebook replicates the functionality of the original `data.py` script but in an interactive, educational format. The original script would run as:

```bash
python data.py
```

It produces the same `data_out.json` file whose format is demonstrated here with sample data.

**Ready to run!** 🚀 This notebook can be executed from top to bottom without any local data files. The data collection step needs the `datasets` package and network access to HuggingFace.

In [None]:
# Optional: Save data to JSON file (uncomment to enable)
# This replicates the original script's functionality

# with open("data_out.json", "w") as f:
#     json.dump(data, f, indent=2)
# print(f"Saved {len(data)} examples to data_out.json")

# For demonstration, let's save the sample data instead
with open("sample_data_out.json", "w") as f:
    json.dump(sample_data, f, indent=2)
print(f"Saved {len(sample_data)} sample examples to sample_data_out.json")

## 5. Save Data to File (Optional)

If you want to save the collected data to a JSON file (as in the original script), you can run the following cell. This is optional since the notebook is designed to work without external files.
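
To check that a saved file can be read back unchanged, a round-trip test along these lines may help. This is a minimal sketch: the `records` list is a stand-in for `data` or `sample_data`, and it writes to a temporary directory so nothing is left on disk.

```python
import json
import tempfile
from pathlib import Path

# Stand-in record in the notebook's format (not real GSM8K data)
question = "What is 2+2?"
records = [
    {
        "id": "example_000",
        "question": question,
        "answer": "4",
        "difficulty": len(question) / 100,
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data_out.json"

    # Write the records the same way the save cell does
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

    # Read them back and compare with the in-memory version
    with open(out_path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == records)
```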

In [None]:
# Sample data that would be saved to data_out.json
# This is inlined to make the notebook self-contained
sample_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.12  # len(question) / 100
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.19
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.18
    }
]

print("Sample data format:")
print(json.dumps(sample_data, indent=2))

## 4. Sample Output Data (Self-Contained)

Since this notebook is designed to be self-contained, here is illustrative sample data in the format that the original script writes to `data_out.json`. It shows the expected record structure without requiring external files.
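
One way to make the expected format explicit is a small structural check. The sketch below is an assumption about what "valid" means (four fields with the types seen in the sample data); `REQUIRED_FIELDS` and `validate_record` are hypothetical names, not part of the original script.

```python
# Field names and types as they appear in the sample records
REQUIRED_FIELDS = {"id": str, "question": str, "answer": str, "difficulty": float}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


question = "What is 2+2?"
good = {"id": "example_000", "question": question, "answer": "4",
        "difficulty": len(question) / 100}
bad = {"id": "example_001", "question": "If x=5, what is 2x?"}

print(validate_record(good))
print(validate_record(bad))
```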

In [None]:
# Collect the data
data = collect_data()
print(f"Collected {len(data)} examples")

# Display the first few examples
print("\nFirst 3 examples:")
for i, example in enumerate(data[:3]):
    question = example["question"]
    # Truncate long questions for display; add "..." only when text was cut
    preview = question[:100] + ("..." if len(question) > 100 else "")
    print(f"\nExample {i}:")
    print(f"  ID: {example['id']}")
    print(f"  Question: {preview}")
    print(f"  Answer: {example['answer']}")
    print(f"  Difficulty: {example['difficulty']:.2f}")

## 3. Collect Data

Let's run the data collection function to see how it works:

**Note:** This downloads the first 200 GSM8K test examples from HuggingFace, so it needs network access. For demonstration purposes, we'll also show you what the expected output looks like.

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Load HuggingFace dataset
    ds = load_dataset("gsm8k", "main", split="test[:200]")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy
        })

    return data

## 2. Data Collection Function

The `collect_data()` function loads the first 200 test examples of the GSM8K dataset from HuggingFace and converts each one into a dictionary with:
- A unique ID (`example_000`, `example_001`, ...)
- The original question
- The answer
- A difficulty score (question length in characters divided by 100, as a simple proxy)
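
The per-example conversion can be tried without downloading anything. The sketch below applies the same field mapping to a couple of stand-in rows; `build_records` and `fake_rows` are hypothetical names used only for illustration, and the rows are not real GSM8K data.

```python
def build_records(examples):
    """Convert raw question/answer dicts into the notebook's record format."""
    records = []
    for i, example in enumerate(examples):
        records.append({
            "id": f"example_{i:03d}",           # zero-padded running index
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # length-based proxy
        })
    return records


# Stand-in rows with the two fields the conversion reads
fake_rows = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "Solve: 3y + 6 = 15", "answer": "y=3"},
]

for record in build_records(fake_rows):
    print(record)
```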

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from datasets import load_dataset

## 1. Import Required Libraries

First, let's import the necessary libraries for data collection and JSON handling.

# Dataset Collection Script for DKW Benchmark

**Artifact ID:** dataset_001  
**Name:** data.py

This notebook contains a dataset collection script for DKW benchmark evaluation. It demonstrates how to collect and process benchmark data from the GSM8K dataset and format it for evaluation purposes.

The notebook needs no local data files: sample output is inlined, and the collection step downloads GSM8K directly from HuggingFace.