# Dataset Collection for DKW Benchmark

This notebook demonstrates the dataset collection script for DKW controller evaluation. The script processes benchmark data from the GSM8K dataset and formats it for evaluation purposes.

**Original Artifact:** data.py  
**Purpose:** Collect and format benchmark data for mathematical reasoning tasks

## Import Dependencies

First, let's import the required libraries for data processing.

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from typing import List, Dict, Any

# Note: In the original script, this would be: from datasets import load_dataset
# For this self-contained notebook, we'll use inline data instead

## Sample Dataset

Since this is a self-contained notebook, we'll simulate the GSM8K dataset with sample data instead of loading from HuggingFace. In the original script, this would be loaded using `load_dataset("gsm8k", "main", split="test[:200]")`.  

Here we include some mathematical reasoning questions that represent the type of data from GSM8K:

In [None]:
# Simulated GSM8K dataset samples (normally loaded from HuggingFace)
simulated_dataset = [
    {
        "question": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        "answer": "Janet sells 16 - 3 - 4 = 9 duck eggs a day. She makes 9 * 2 = $18 every day at the farmer's market."
    },
    {
        "question": "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts are needed?", 
        "answer": "It takes 2 bolts of blue fiber and 2/2 = 1 bolt of white fiber. So the total is 2 + 1 = 3 bolts."
    },
    {
        "question": "Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?",
        "answer": "The cost of the house and repairs came out to 80,000+50,000=$130,000. He increased the value of the house by 80,000*1.5=120,000. So the new value of the house is 120,000+80,000=$200,000. So he made a profit of 200,000-130,000=$70,000."
    },
    {
        "question": "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?",
        "answer": "He runs 3*3=9 sprints a week. So he runs 9*60=540 meters a week."
    },
    {
        "question": "Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables. She gives the chickens their feed in three separate meals. In the morning, she gives each chicken 15 grams of seeds. In the afternoon, she gives each chicken 25 grams of mealworms. How many grams of vegetables does she give each chicken in the evening?",
        "answer": "Each chicken gets 3 cups of feed daily, and this is split into 3 meals. The weight of feed per meal is not specified, but we know the seeds are 15 grams and mealworms are 25 grams. If we assume the total daily feed per chicken weighs 150 grams (50 grams per meal), then vegetables would be 150 - 15 - 25 = 110 grams. However, without the total weight specified, we can't determine the exact vegetable amount."
    }
]

print(f"Loaded {len(simulated_dataset)} sample questions")
print(f"\nFirst question preview: {simulated_dataset[0]['question'][:100]}...")

## Data Processing Function

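The `collect_data()` function walks the raw dataset and builds one record per example. Each record keeps the original question and answer and adds two derived fields: a zero-padded `id` (`example_000`, `example_001`, ...) and a `difficulty` score equal to the question's character count divided by 100.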

In [None]:
def collect_data() -> List[Dict[str, Any]]:
    """Collect benchmark data for DKW controller evaluation."""
    
    # In the original script, this would be:
    # ds = load_dataset("gsm8k", "main", split="test[:200]")
    # Here we use our simulated dataset instead
    ds = simulated_dataset

    data = []
    for i, example in enumerate(ds):
        processed_item = {
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy for difficulty
        }
        data.append(processed_item)

    return data

print("✅ Data processing function defined successfully!")
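
As a quick worked example of the two derived fields, the sketch below applies the same ID formatting and length-based difficulty to a short, made-up question. `make_record` is a hypothetical helper introduced here for illustration only; it is not part of the original script.

```python
from typing import Any, Dict


def make_record(index: int, question: str, answer: str) -> Dict[str, Any]:
    """Build one record the same way the loop in collect_data() does."""
    return {
        "id": f"example_{index:03d}",        # zero-padded to three digits
        "question": question,
        "answer": answer,
        "difficulty": len(question) / 100,   # counts characters, not words
    }


# Made-up input, chosen so the arithmetic is easy to check by hand.
record = make_record(7, "What is 2+2?", "4")
print(record["id"])          # example_007
print(record["difficulty"])  # 0.12 (12 characters / 100)
```

Because the score counts characters, punctuation and spaces contribute to it as much as words and numbers do.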

## Execute Data Collection

Let's run the data collection function and examine the results.

In [None]:
# Execute the data collection
data = collect_data()

print(f"Collected {len(data)} examples")
print("\nFirst few examples:")
for item in data[:3]:
    print(f"- ID: {item['id']}")
    print(f"  Question: {item['question'][:60]}...")
    print(f"  Answer: {item['answer'][:60]}...")
    print(f"  Difficulty: {item['difficulty']:.2f}")
    print()

## View Complete Dataset

Let's examine the complete processed dataset structure that would normally be saved to `data_out.json`.

In [None]:
# Display the complete processed dataset
print("Complete processed dataset:")
print(json.dumps(data, indent=2))

# In the original script, this would be saved to a file:
# with open("data_out.json", "w") as f:
#     json.dump(data, f, indent=2)
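
If you do want to exercise the file-writing step, one low-risk option is to write to a temporary directory and read the file back. The sketch below assumes a small stand-in list named `records` (made-up values, not output of the original script) and only uses the standard library.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the list returned by collect_data(); the values are made up.
records = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.12},
]

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data_out.json"

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

    # Read the file back to confirm the round trip preserves the records.
    with open(out_path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == records)  # True
```

To keep the file, replace the temporary directory with a path of your choice.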

## Expected Output Reference

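For comparison, here's the expected output structure that was provided in the original specification. Treat it as a reference for the record schema only: its `difficulty` values do not follow the `len(question) / 100` formula used in `collect_data()` (for example, "What is 2+2?" has 12 characters, which would give 0.12 rather than 0.15).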

In [None]:
# Expected output format (from original data_out.json)
expected_output = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001", 
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print("Expected output format:")
print(json.dumps(expected_output, indent=2))
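
One way to compare processed records against this reference is to check field names and types rather than values. The sketch below is a minimal version of that idea; `EXPECTED_FIELDS` and `schema_errors` are hypothetical names introduced here, and the field types are inferred from the expected output above.

```python
from typing import Any, Dict, List

# Field names and types inferred from the expected output format (an assumption).
EXPECTED_FIELDS = {"id": str, "question": str, "answer": str, "difficulty": float}


def schema_errors(records: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the schema matches."""
    errors = []
    for i, record in enumerate(records):
        if set(record) != set(EXPECTED_FIELDS):
            errors.append(f"record {i}: unexpected keys {sorted(record)}")
            continue  # skip type checks when the keys are already wrong
        for name, expected_type in EXPECTED_FIELDS.items():
            if not isinstance(record[name], expected_type):
                errors.append(f"record {i}: {name} is not {expected_type.__name__}")
    return errors


good = [{"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15}]
bad = [{"id": "example_000", "question": "What is 2+2?", "answer": 4}]  # missing field

print(schema_errors(good))  # []
print(schema_errors(bad))
```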

## File Saving (Original Script Behavior)

In the original script, the final step was to save the data to a JSON file and print a summary. The cell below simulates that step: it prints the summary but does not write a file.

In [None]:
# Simulate the original script's main execution
if __name__ == "__main__":
    # This replicates the original script's main block
    final_data = collect_data()
    
    # In original script: with open("data_out.json", "w") as f: json.dump(data, f, indent=2)
    print("💾 [Simulated] Data would be saved to 'data_out.json'")
    
    # Original script's final print statement
    print(f"Collected {len(final_data)} examples")
    
    print("\n🎯 Dataset collection completed successfully!")
    print("📊 Data is now ready for DKW benchmark evaluation.")

## Usage Notes & Customization

### How to use this notebook:
1. **Self-contained**: This notebook runs without any external files and needs only the Python standard library
2. **Customizable**: Modify the `simulated_dataset` to test with your own questions
3. **Extensible**: Add new fields to the output format by modifying the `collect_data()` function

### Original vs Notebook differences:
- **Original**: Loads data from HuggingFace datasets library
- **Notebook**: Uses inline sample data for demonstration
- **Original**: Saves output to `data_out.json` file  
- **Notebook**: Displays output directly in cells

### To restore original functionality:
1. Install dependencies: `pip install datasets`
2. Uncomment `from datasets import load_dataset` in the import cell, then replace `simulated_dataset` in `collect_data()` with: `load_dataset("gsm8k", "main", split="test[:200]")`
3. Uncomment the file writing code to save to `data_out.json`
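
The steps above can be sketched as a single loader that uses the Hub only when asked and otherwise falls back to inline samples. `load_examples` and its `use_hub` flag are hypothetical names introduced for this sketch. The `load_dataset` call is the one quoted from the original script, and it needs network access the first time it runs.

```python
from typing import Dict, List


def load_examples(
    fallback: List[Dict[str, str]],
    use_hub: bool = False,
    split: str = "test[:200]",
) -> List[Dict[str, str]]:
    """Return GSM8K examples from the Hub, or the inline fallback samples."""
    if not use_hub:
        return list(fallback)
    try:
        # Imported lazily so the notebook still runs without `datasets` installed.
        from datasets import load_dataset
    except ImportError:
        return list(fallback)
    ds = load_dataset("gsm8k", "main", split=split)
    # Keep only the two fields that collect_data() reads.
    return [{"question": ex["question"], "answer": ex["answer"]} for ex in ds]


# Made-up sample standing in for simulated_dataset.
samples = [{"question": "What is 2+2?", "answer": "4"}]
examples = load_examples(samples)  # use_hub defaults to False, so no download
print(len(examples))  # 1
```

Passing `use_hub=True` switches to the real dataset without changing the rest of the notebook.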

### Difficulty Metric:
The current difficulty calculation (`len(question) / 100`) is a simple proxy. You could enhance this with:
- Word count-based metrics
- Mathematical complexity analysis
- Machine learning-based difficulty prediction
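
As a starting point for the first two ideas, the sketch below shows a word-count proxy and a rough "how many numbers appear" proxy. Both function names and both scaling constants (20 and 5) are arbitrary assumptions made for illustration. Neither is a validated difficulty measure.

```python
import re


def word_count_difficulty(question: str) -> float:
    """Difficulty proxy based on whitespace-separated words."""
    return len(question.split()) / 20  # 20 is an arbitrary scale factor


def numeric_token_difficulty(question: str) -> float:
    """Difficulty proxy based on how many numbers appear in the question."""
    # Treat digit groups joined by ',' or '.' (e.g. 80,000) as a single number.
    numbers = re.findall(r"\d+(?:[.,]\d+)*", question)
    return len(numbers) / 5  # 5 is an arbitrary scale factor


question = "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint."
print(word_count_difficulty(question))     # 0.8 (16 words / 20)
print(numeric_token_difficulty(question))  # 0.6 (3 numbers / 5)
```

Either function could replace the `len(example["question"]) / 100` expression in `collect_data()`.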