In [None]:
# Uncomment to save data to file
# with open("data_out.json", "w") as f:
#     json.dump(processed_data, f, indent=2)
# print("Data saved to data_out.json")

# Or create a more detailed export
def export_data(filename="benchmark_data.json"):
    """Export processed data with metadata."""
    total = len(processed_data)
    # Guard against an empty dataset so the average does not raise ZeroDivisionError
    avg_difficulty = (
        sum(item['difficulty'] for item in processed_data) / total if total else 0.0
    )
    export_dict = {
        "metadata": {
            "source": "DKW Benchmark Dataset",
            "total_examples": total,
            "avg_difficulty": avg_difficulty
        },
        "data": processed_data
    }
    
    with open(filename, "w") as f:
        json.dump(export_dict, f, indent=2)
    print(f"Enhanced data exported to {filename}")

# Uncomment to run the enhanced export
# export_data()

## Optional: Save to File

Uncomment the code below if you want to save the processed data to a JSON file:
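
Once a file has been written, it can be read back with `json.load` to confirm that it round-trips. The following is a minimal sketch, not part of the original script: it assumes the `metadata`/`data` layout produced by `export_data()`, uses a hand-written one-record `sample_export` in place of real output, and writes to a temporary directory so nothing is left on disk.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the dictionary that export_data() writes
sample_export = {
    "metadata": {
        "source": "DKW Benchmark Dataset",
        "total_examples": 1,
        "avg_difficulty": 0.12,
    },
    "data": [
        {
            "id": "example_000",
            "question": "What is 2+2?",
            "answer": "4",
            "difficulty": 0.12,
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "benchmark_data.json")

    # Write the export, then read it back from the same path
    with open(path, "w") as f:
        json.dump(sample_export, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

print(f"Reloaded {len(loaded['data'])} example(s) from {loaded['metadata']['source']}")
```

Because every value here is a string, number, list, or dictionary, the reloaded object compares equal to the one that was written.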

In [None]:
# Display the processed data
print("Processed Data:")
print(json.dumps(processed_data, indent=2))

print("\n" + "="*50)
print("Data Summary:")
for item in processed_data:
    print(f"ID: {item['id']}")
    print(f"Question: {item['question']}")
    print(f"Answer: {item['answer']}")
    print(f"Difficulty: {item['difficulty']:.2f}")
    print("-" * 30)

## Results Display

Let's examine the processed data structure and optionally save it to JSON format for use in other applications.

In [None]:
# Execute the main data collection process
processed_data = collect_data()
print(f"Collected {len(processed_data)} examples")

# Display basic statistics
total_length = sum(len(item['question']) for item in processed_data)
avg_length = total_length / len(processed_data)
print(f"Average question length: {avg_length:.1f} characters")
difficulties = [item['difficulty'] for item in processed_data]
print(f"Difficulty range: {min(difficulties):.2f} - {max(difficulties):.2f}")

## Execute Data Collection

Now let's run the data collection process. Instead of writing the results to a file, we'll keep them in memory for inspection.

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Process our inlined sample dataset instead of loading from HuggingFace
    data = []
    for i, example in enumerate(sample_dataset):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy metric
        })
    
    return data

## Data Processing Function

The `collect_data()` function processes the raw dataset and adds:
- Unique identifiers for each example
- Difficulty scores computed as the question length in characters divided by 100 (a simple proxy metric)
- Structured formatting for downstream processing
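
To make the length-based proxy concrete, here is a small sketch that applies the same formula to the three sample questions. The helper name `length_difficulty` is hypothetical and used only for illustration; `collect_data()` computes the value inline.

```python
def length_difficulty(question: str) -> float:
    """Hypothetical helper: question length in characters divided by 100."""
    return len(question) / 100


questions = [
    "What is 2+2?",
    "If x=5, what is 2x?",
    "Solve: 3y + 6 = 15",
]

# Map each question to its proxy difficulty score
scores = {question: length_difficulty(question) for question in questions}

for question, score in scores.items():
    print(f"{len(question):>3} chars -> difficulty {score:.2f} | {question}")
```

The questions are 12, 19, and 18 characters long, giving scores of 0.12, 0.19, and 0.18. The metric rewards length only, so a long but trivial question scores higher than a short but hard one.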

In [None]:
# Sample dataset representing GSM8K questions and answers
# This replaces the original load_dataset("gsm8k", "main", split="test[:200]") call
sample_dataset = [
    {
        "question": "What is 2+2?",
        "answer": "4"
    },
    {
        "question": "If x=5, what is 2x?", 
        "answer": "10"
    },
    {
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3"
    }
]

print(f"Loaded {len(sample_dataset)} sample questions")

## Inlined Dataset

Instead of loading from external sources, this notebook inlines a small sample dataset so that it is completely self-contained. The records stand in for the question/answer pairs that the original script loaded from the Hugging Face GSM8K dataset.
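
Because the inlined records are written by hand, a quick shape check can catch a missing field before processing. The sketch below is an illustrative addition, not part of the original script: `REQUIRED_KEYS` and `find_invalid_records` are hypothetical names, and the schema (string `question` and `answer` fields) is assumed from the sample records.

```python
# Assumed schema, inferred from the inlined sample records
REQUIRED_KEYS = {"question", "answer"}


def find_invalid_records(records):
    """Return indices of records missing a required key or holding a non-string value."""
    invalid = []
    for index, record in enumerate(records):
        has_keys = REQUIRED_KEYS.issubset(record)
        # Only inspect values when every required key is present
        if not has_keys or not all(isinstance(record[key], str) for key in REQUIRED_KEYS):
            invalid.append(index)
    return invalid


records = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "If x=5, what is 2x?"},  # deliberately missing "answer"
]

invalid_indices = find_invalid_records(records)
print(f"Invalid record indices: {invalid_indices}")
```

Run against the three records in `sample_dataset`, the same check would be expected to return an empty list.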

In [None]:
"""Dataset collection script for DKW benchmark."""
import json

## Imports

We'll use the standard `json` library for data handling. The original script used `datasets` from HuggingFace, but we've inlined the data to make this notebook self-contained.

# Dataset Collection Script for DKW Benchmark

This notebook demonstrates dataset collection and processing for DKW controller evaluation. The original script has been converted to be completely self-contained with inlined data.

**Original Artifact**: `dataset_001` - `data.py`

## Overview
- Loads and processes benchmark data
- Calculates difficulty metrics based on question length
- Outputs structured data for evaluation purposes