# DKW Benchmark Dataset Collection

This notebook demonstrates the dataset collection process for DKW controller evaluation using the GSM8K benchmark dataset.

**Artifact:** dataset_001 (data.py)  
**Purpose:** Collect and process benchmark data for evaluation

## Import Required Libraries

We need `json` to serialize the collected examples and `load_dataset` from the `datasets` library to fetch GSM8K.

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from datasets import load_dataset

## Data Collection Function

The `collect_data()` function loads the first 200 examples of the GSM8K test split from HuggingFace and converts each one into our required format.

Each example includes:
- `id`: Unique identifier for the example
- `question`: The math problem question
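- `answer`: The reference answer exactly as stored in GSM8K (the worked solution text, with the final result after a `####` marker)
- `difficulty`: A simple proxy: the question length in characters divided by 100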

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Load HuggingFace dataset
    ds = load_dataset("gsm8k", "main", split="test[:200]")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy
        })

    return data
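
GSM8K reference answers hold the worked solution followed by the final result after a `####` marker, so an evaluation step often needs only that last part. A minimal sketch of pulling it out, assuming that layout; `extract_final_answer` and the example string are hypothetical and not part of the original script:

```python
def extract_final_answer(answer_text: str) -> str:
    """Return the text after the last '####' marker, or the stripped input if absent."""
    marker = "####"
    if marker not in answer_text:
        # No marker: treat the whole string as the answer
        return answer_text.strip()
    # Keep only what follows the last marker and drop thousands separators
    return answer_text.rsplit(marker, 1)[1].strip().replace(",", "")


# Hypothetical answer string in the GSM8K style (not taken from the dataset)
example_answer = "She has 3 + 4 = 7 apples.\n#### 7"
print(extract_final_answer(example_answer))
```

Whether to store the extracted value alongside the full `answer` text or in its place depends on what the evaluation compares against.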

## Sample Data (Inlined for Self-Contained Demo)

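For demonstration purposes, we'll use pre-collected sample data instead of loading from HuggingFace, so the remaining cells run without network access. The import cell at the top still requires the `datasets` library to be installed.

The data below follows the output format of `collect_data()`. The `difficulty` values are illustrative and were not computed with the length-based formula (for example, "What is 2+2?" has 12 characters, which would give 0.12).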

In [None]:
# Sample data inlined for self-contained demo
sample_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print(f"Loaded {len(sample_data)} sample examples")
print("\nFirst example:")
print(json.dumps(sample_data[0], indent=2))
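
Because the sample records are written by hand, a quick structural check can catch a missing or mistyped field before the data is used. A minimal sketch, assuming the four fields listed earlier with `difficulty` stored as a float; `REQUIRED_FIELDS` and `validate_record` are hypothetical helpers, not part of the original script:

```python
# Expected field names and types (assumption based on the record format above)
REQUIRED_FIELDS = {"id": str, "question": str, "answer": str, "difficulty": float}


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


good_record = {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15}
bad_record = {"id": "example_001", "question": "If x=5, what is 2x?", "difficulty": "0.22"}

print(validate_record(good_record))
print(validate_record(bad_record))
```

In the notebook, the same check could be applied to every item with `[validate_record(item) for item in sample_data]`.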

## Data Analysis and Exploration

Let's explore the dataset structure and characteristics of our benchmark data.

In [None]:
# Analyze the dataset
print("Dataset Statistics:")
print(f"Total examples: {len(sample_data)}")

# Calculate difficulty statistics
difficulties = [item["difficulty"] for item in sample_data]
print(f"Difficulty range: {min(difficulties):.2f} - {max(difficulties):.2f}")
print(f"Average difficulty: {sum(difficulties) / len(difficulties):.2f}")

# Display all examples
print("\nAll examples:")
for i, example in enumerate(sample_data):
    print(f"\n{i+1}. {example['question']}")
    print(f"   Answer: {example['answer']}")
    print(f"   Difficulty: {example['difficulty']:.2f}")
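
The hand-rolled average above can also be computed with the standard-library `statistics` module, which adds the median and spread at no extra cost. A small sketch under the assumption that the difficulties are a plain list of floats; `summarize_difficulties` and `demo_difficulties` are hypothetical names:

```python
import statistics


def summarize_difficulties(values):
    """Return mean, median, and sample standard deviation for a list of floats."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # stdev needs at least two data points; report 0.0 for a single value
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Difficulty values copied from the inlined sample data
demo_difficulties = [0.15, 0.22, 0.28]
for name, value in summarize_difficulties(demo_difficulties).items():
    print(f"{name}: {value:.3f}")
```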

## Original Script Functionality

The original script saves the collected data to a JSON file (`data_out.json`). Here we print the JSON that would be written, using our sample data, instead of creating the file.

In [None]:
# Simulate the original script's main functionality
if __name__ == "__main__":
    # Use our sample data instead of collecting from HuggingFace
    data = sample_data
    
    # Original script would save to file - here we just display the JSON
    print("JSON output that would be saved to 'data_out.json':")
    print("=" * 50)
    print(json.dumps(data, indent=2))
    print("=" * 50)
    print(f"Collected {len(data)} examples")

# For interactive use, you can also work with individual examples:
print("\nExample access patterns:")
print(f"First question: {sample_data[0]['question']}")
print(f"All IDs: {[item['id'] for item in sample_data]}")
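
The cell above only prints the JSON. For reference, actually writing the file and reading it back might look like the sketch below. It writes into a temporary directory so nothing is left on disk; `save_records`, `load_records`, and `demo_records` are hypothetical names, and only the file name `data_out.json` comes from the original script:

```python
import json
import tempfile
from pathlib import Path


def save_records(records, path):
    """Write records to `path` as indented JSON and return the path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return path


def load_records(path):
    """Read a JSON file written by save_records."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


demo_records = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15},
]

# Write to a throwaway directory and confirm the data survives the round trip
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = save_records(demo_records, Path(tmp_dir) / "data_out.json")
    round_trip_ok = load_records(out_path) == demo_records

print(f"Round trip OK: {round_trip_ok}")
```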

## Usage and Modification Notes

This notebook needs no external data files or network access. Its only third-party dependency is the `datasets` library, which the import cell loads and which `collect_data()` uses to fetch GSM8K.

### Key Changes from Original Script:
- **Inlined JSON data**: The sample data is now embedded as a Python list instead of being read from an external JSON file
- **Interactive exploration**: Added summary statistics and a printed listing of the examples
- **Self-contained**: No external file dependencies for the demo

### To Modify:
1. **Use real data**: Replace `data = sample_data` with `data = collect_data()` to fetch from HuggingFace (requires network access and the `datasets` library)
2. **Add more examples**: Extend the `sample_data` list with additional examples
3. **Change difficulty calculation**: Modify the difficulty formula in the `collect_data()` function
4. **Export results**: Save `sample_data` to a file using `json.dump()` if needed
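
As an illustration of step 3, the difficulty formula can be factored into a function so that alternatives are easy to swap in. This is only a sketch: `length_difficulty` mirrors the expression used in `collect_data()`, while `word_count_difficulty` and its divisor of 20 are hypothetical choices, not something the original script defines:

```python
def length_difficulty(question: str) -> float:
    """Proxy used in collect_data(): character count divided by 100."""
    return len(question) / 100


def word_count_difficulty(question: str) -> float:
    """Hypothetical alternative: whitespace-separated word count divided by 20."""
    return len(question.split()) / 20


demo_question = "Solve: 3y + 6 = 15"
print(f"Length-based: {length_difficulty(demo_question):.2f}")
print(f"Word-count-based: {word_count_difficulty(demo_question):.2f}")
```

Either function could replace the inline expression in `collect_data()` by calling it on `example["question"]`.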

### Original Artifact:
- **ID**: dataset_001
- **Name**: data.py
- **Purpose**: DKW benchmark dataset collection