# DKW Benchmark Dataset Collection

**Artifact:** dataset_001 (data.py)

This notebook demonstrates dataset collection for the DKW benchmark. It processes math questions in the style of the GSM8K dataset (a small inline sample stands in for the real data) and formats them as JSON records for controller evaluation.

## Overview
- Loads inline sample data (the original script loads GSM8K from HuggingFace)
- Adds an ID and a length-based difficulty score to each question
- Outputs structured JSON for benchmarking

## 1. Import Dependencies

We'll import the necessary libraries for data processing and JSON handling.

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
# Note: In a real scenario, you would use: from datasets import load_dataset

print("Dependencies imported successfully!")

## 2. Sample Dataset (Self-Contained)

Instead of loading from external sources, we'll use sample data that represents what would typically come from the GSM8K dataset. This makes the notebook completely self-contained.

In [None]:
# Sample data that simulates what we'd get from load_dataset("gsm8k", "main", split="test[:200]")
sample_dataset = [
    {
        "question": "What is 2+2?",
        "answer": "2+2=4"
    },
    {
        "question": "If x=5, what is 2x?",
        "answer": "If x=5, then 2x = 2*5 = 10"
    },
    {
        "question": "Solve: 3y + 6 = 15",
        "answer": "3y + 6 = 15\n3y = 15 - 6\n3y = 9\ny = 3"
    },
    {
        "question": "A store has 24 apples. If they sell 8 apples in the morning and 6 apples in the afternoon, how many apples are left?",
        "answer": "The store started with 24 apples.\nThey sold 8 + 6 = 14 apples.\n24 - 14 = 10 apples are left."
    },
    {
        "question": "What is 15% of 80?",
        "answer": "15% of 80 = 0.15 * 80 = 12"
    }
]

print(f"Sample dataset loaded with {len(sample_dataset)} examples")

## 3. Data Collection Function

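The core function, `collect_data`, iterates over the dataset and builds one record per example: a zero-padded ID, the question, the answer, and a difficulty score (the question's length in characters divided by 100).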

In [None]:
def collect_data(dataset=None):
    """Collect benchmark data for DKW controller evaluation."""
    # Use provided dataset or default sample dataset
    if dataset is None:
        # In the original script, this would be:
        # ds = load_dataset("gsm8k", "main", split="test[:200]")
        dataset = sample_dataset
    
    data = []
    for i, example in enumerate(dataset):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy based on question length
        })

    return data

# Test the function
test_data = collect_data()
print(f"Processed {len(test_data)} examples")
print("\nFirst example:")
print(json.dumps(test_data[0], indent=2))

## 4. Process Complete Dataset

Run the collection process on our sample data and display the results.

In [None]:
# Main execution - equivalent to the original script's __main__ section
data = collect_data()

# Instead of writing to a file, we'll store the data in memory and display it
collected_data = data

print(f"Collected {len(collected_data)} examples")
print("\nComplete dataset:")
print(json.dumps(collected_data, indent=2))

## 5. Expected Output Format

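This section shows the output format that the original script would have saved to `data_out.json`. The inlined records below illustrate the structure only: their `answer` and `difficulty` values differ from what `collect_data` produces on the sample data (for example, `"What is 2+2?"` has 12 characters, so its computed difficulty is 0.12, not 0.15). The comparison at the end of the cell therefore checks field names, not values.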
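
A stricter check than key presence is to compare the exact key set and the value types of each record. The sketch below assumes the schema implied by the records in this notebook (`id`, `question`, `answer` as strings, `difficulty` as a float); `EXPECTED_SCHEMA` and `matches_schema` are hypothetical names, not part of the original script.

```python
# Assumed schema, inferred from the records shown in this notebook.
EXPECTED_SCHEMA = {"id": str, "question": str, "answer": str, "difficulty": float}

def matches_schema(record, schema=EXPECTED_SCHEMA):
    """Return True when the record has exactly the schema's keys and value types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[key], expected) for key, expected in schema.items())

good_record = {
    "id": "example_000",
    "question": "What is 2+2?",
    "answer": "2+2=4",
    "difficulty": len("What is 2+2?") / 100,
}
bad_record = {"id": "example_001", "question": "If x=5, what is 2x?"}

print(matches_schema(good_record))
print(matches_schema(bad_record))
```

Applied to the whole collection, `all(matches_schema(r) for r in collected_data)` would flag any record that drifts from the expected shape.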

In [None]:
# Expected output format (from data_out.json - inlined for self-contained demo)
expected_output = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print("Expected output format:")
print(json.dumps(expected_output, indent=2))

# Compare with our generated data structure
print(f"\nOur data has {len(collected_data)} examples")
print(f"Expected format has {len(expected_output)} examples")
print(f"Structure matches: {all(key in collected_data[0] for key in expected_output[0].keys())}")

## 6. File Output (Optional)

The original script saves data to a JSON file. Here we show how to do that, though it's commented out to keep the demo self-contained.
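
To try the file output without leaving `data_out.json` behind, one option is to write into a temporary directory and read the file back. This is a minimal sketch, not the original script's code: `save_json`, `load_json`, and the single inline record are hypothetical stand-ins for `collected_data`.

```python
import json
import os
import tempfile

# Hypothetical record in the same shape that collect_data() returns.
records = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "2+2=4",
        "difficulty": 0.12,
    },
]

def save_json(records, path):
    """Write records as pretty-printed UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

def load_json(path):
    """Read records back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# The temporary directory and its contents are removed when the block exits.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "data_out.json")
    save_json(records, out_path)
    reloaded = load_json(out_path)

print(f"Round trip preserved the data: {reloaded == records}")
```

Passing `ensure_ascii=False` together with an explicit `encoding="utf-8"` keeps symbols such as `×` readable in the file instead of writing them as `\u` escapes.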

In [None]:
# Equivalent to the original script's file output
# Uncomment the lines below if you want to save to a file:

# with open("data_out.json", "w") as f:
#     json.dump(collected_data, f, indent=2)
# print(f"Saved {len(collected_data)} examples to data_out.json")

print("💡 To save the data to a file, uncomment the lines above and run this cell.")
print(f"📊 Successfully processed {len(collected_data)} examples in memory!")

## 7. Customization and Extensions

This notebook is fully self-contained and ready to use. Here are some ways you can extend it:

### Modifications you can make:
1. **Add more sample data**: Extend the `sample_dataset` list with additional questions
2. **Improve difficulty scoring**: Replace the length-based proxy with a richer metric, such as the number of solution steps in the answer or the number of arithmetic operations in the question
3. **Add data validation**: Include checks for required fields and data quality
4. **Export options**: Add cells to save data to different formats (CSV, JSONL, etc.)
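
The validation, difficulty, and export ideas above could be sketched as follows. Everything here is an illustrative assumption, not part of the original script: the helper names, the required-field list, and in particular the weights in `heuristic_difficulty`, which are arbitrary and would need tuning against real data.

```python
import json
import re

REQUIRED_FIELDS = ("question", "answer")  # assumed minimum schema for a raw example

def validate_example(example):
    """Return a list of problems found in one raw example (empty if it is valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = example.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems

def heuristic_difficulty(question):
    """Hypothetical score blending length with counts of numbers and operators, capped at 1.0."""
    numbers = len(re.findall(r"\d+(?:\.\d+)?", question))
    operators = len(re.findall(r"[+\-*/=%]", question))
    raw = len(question) / 200 + 0.1 * numbers + 0.05 * operators
    return round(min(raw, 1.0), 3)

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)

examples = [
    {"question": "What is 15% of 80?", "answer": "15% of 80 = 0.15 * 80 = 12"},
    {"question": "   ", "answer": "no question given"},
]

# Keep only the examples that pass validation, then score and export them.
valid = [ex for ex in examples if not validate_example(ex)]
scored = [{**ex, "difficulty": heuristic_difficulty(ex["question"])} for ex in valid]
print(to_jsonl(scored))
```

Because `json.dumps` escapes newlines inside strings, multi-line answers still occupy a single line each in the JSONL output.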

### To use with real HuggingFace data:
1. Install the required package: `pip install datasets`
2. Add the import `from datasets import load_dataset` (shown as a note in the Section 1 cell)
3. Replace `sample_dataset` with `load_dataset("gsm8k", "main", split="test[:200]")`, or pass the loaded dataset directly to `collect_data`
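
One thing to keep in mind with the real data: GSM8K answers contain the full worked solution, with the final result on the last line after a `####` marker. If only the short final answer is wanted (as in the expected output format in Section 5), it has to be extracted. A minimal sketch follows; `extract_final_answer` is a hypothetical helper, and the sample string is made up in the GSM8K style, not taken from the dataset.

```python
def extract_final_answer(answer_text):
    """Return the text after the last '####' marker, or None if there is no marker."""
    marker = "####"
    if marker not in answer_text:
        return None
    return answer_text.rsplit(marker, 1)[1].strip()

# Made-up answer in the GSM8K style: worked steps, then '#### <final answer>'.
gsm8k_style_answer = (
    "They sold 8 + 6 = <<8+6=14>>14 apples.\n"
    "24 - 14 = <<24-14=10>>10 apples are left.\n"
    "#### 10"
)

print(extract_final_answer(gsm8k_style_answer))
print(extract_final_answer("2+2=4"))
```

The helper returns `None` for answers without a marker, such as the inline samples in this notebook, so callers can fall back to the full answer text.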

## 8. Interactive Experimentation

Try modifying the function or adding your own data below:

In [None]:
# Try adding your own questions here!
custom_data = [
    {
        "question": "What is the area of a rectangle with length 8 and width 5?",
        "answer": "Area = length × width = 8 × 5 = 40 square units"
    },
    {
        "question": "Calculate 25% of 200",
        "answer": "25% of 200 = 0.25 × 200 = 50"
    }
    }
    # Add more questions here...
]

# Process your custom data
if custom_data:
    custom_processed = collect_data(custom_data)
    print("Your custom data processed:")
    print(json.dumps(custom_processed, indent=2))
else:
    print("Add some questions to custom_data to see them processed!")