## Summary

This notebook converts the original Python script into an interactive, cell-by-cell format with the following features:

✅ **Self-contained**: No dependency on the HuggingFace `datasets` library or on external files  
✅ **Interactive**: Easy to run and modify each step individually  
✅ **Well-documented**: Clear explanations for each section  
✅ **Equivalent functionality**: Reproduces the core logic of the original script  

**Key Modifications Made:**
- Replaced HuggingFace dataset loading with inline sample data
- Added detailed explanations and documentation
- Broke the code into logical, executable cells
- Added basic dataset statistics (example count, average question length, average difficulty)
- Maintained the original data processing and saving functionality

You can now run each cell individually to understand how the dataset processing works, or modify the sample data to test with your own examples!

In [None]:
# Save to JSON file (equivalent to the original script's main execution)
output_filename = "data_out.json"

with open(output_filename, "w", encoding="utf-8") as f:
    json.dump(processed_data, f, indent=2)

print(f"Saved {len(processed_data)} examples to {output_filename}")

# Verify the file was created and show its size
import os
if os.path.exists(output_filename):
    file_size = os.path.getsize(output_filename)
    print(f"Output file size: {file_size} bytes")
else:
    print("Error: Output file was not created")

## Saving the Data

The original script saved the processed data to a JSON file. Here's how you can save the results (optional):
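As a quick sanity check on the saved file, the sketch below writes a small list of records to JSON and reads it back to confirm nothing changed. The `records` list and the temporary path are illustrative stand-ins for `processed_data` and `data_out.json`; this is one possible way to verify the output, not part of the original script.

```python
import json
import os
import tempfile

# Illustrative stand-in for `processed_data` (same keys as the notebook's records).
records = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.12},
]

# Write to a temporary directory so the check does not overwrite real output.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_out.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

    # Read the file back and compare it with what was written.
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

round_trip_ok = reloaded == records
print(f"Round trip preserved {len(reloaded)} record(s): {round_trip_ok}")
```

The same comparison could be run against `processed_data` and `data_out.json` after the save cell has executed.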

In [None]:
# Display the complete processed dataset
print("Complete processed dataset:")
print(json.dumps(processed_data, indent=2))
print("\n" + "="*50 + "\n")

# Basic analysis (guard against an empty dataset to avoid division by zero)
print("Dataset Statistics:")
n_examples = len(processed_data)
print(f"Total examples: {n_examples}")
if n_examples:
    avg_question_len = sum(len(item["question"]) for item in processed_data) / n_examples
    avg_difficulty = sum(item["difficulty"] for item in processed_data) / n_examples
    print(f"Average question length: {avg_question_len:.1f} characters")
    print(f"Average difficulty: {avg_difficulty:.3f}")

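# Show per-example difficulty (question preview truncated to 30 characters)
print("\nDifficulty by example:")
for item in processed_data:
    question = item["question"]
    preview = question[:30] + ("..." if len(question) > 30 else "")
    print(f"  {item['id']}: {item['difficulty']:.3f} ('{preview}')")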

## Results and Analysis

Let's examine the complete processed dataset and perform some basic analysis:
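Beyond the averages, a few order statistics give a rough picture of how the difficulty scores are spread. The sketch below uses Python's `statistics` module on scores computed with the notebook's length/100 proxy from the three sample questions; `summarize` is a hypothetical helper name introduced here for illustration.

```python
import statistics

# The three sample questions, scored with the same proxy as the notebook:
# difficulty = question length in characters / 100.
questions = ["What is 2+2?", "If x=5, what is 2x?", "Solve: 3y + 6 = 15"]
difficulties = [len(q) / 100 for q in questions]


def summarize(values):
    """Return min, median, max, and mean of a non-empty list of numbers."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }


summary = summarize(difficulties)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With only three examples these numbers say little, but the same helper applies unchanged to a larger dataset.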

In [None]:
def collect_data(dataset_examples):
    """Collect benchmark data for DKW controller evaluation.
    
    Args:
        dataset_examples: List of examples with 'question' and 'answer' fields
        
    Returns:
        List of processed examples with id, question, answer, and difficulty
    """
    # Process the dataset examples
    data = []
    for i, example in enumerate(dataset_examples):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy based on question length
        })
    
    return data

# Test the function with our sample data
processed_data = collect_data(sample_dataset)
print(f"Processed {len(processed_data)} examples")
print("First processed example:")
print(json.dumps(processed_data[0], indent=2))

## Data Collection Function

Here's the main data processing function that converts raw dataset examples into the format needed for DKW benchmark evaluation:
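`collect_data` indexes `example["question"]` and `example["answer"]` directly, so an example missing either key raises a `KeyError`. A minimal sketch of a pre-check that reports such examples up front follows; `find_invalid_examples` and `REQUIRED_FIELDS` are hypothetical names, not part of the original script.

```python
REQUIRED_FIELDS = ("question", "answer")


def find_invalid_examples(examples):
    """Return (index, missing_fields) pairs for examples lacking required keys."""
    problems = []
    for i, example in enumerate(examples):
        missing = [field for field in REQUIRED_FIELDS if field not in example]
        if missing:
            problems.append((i, missing))
    return problems


examples = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "If x=5, what is 2x?"},  # deliberately missing "answer"
]

invalid = find_invalid_examples(examples)
for index, missing in invalid:
    print(f"Example {index} is missing: {', '.join(missing)}")
```

Running a check like this before processing makes it easier to tell which input record is malformed.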

In [None]:
# Sample data standing in for examples loaded from the GSM8K dataset
# This removes the need to load from external sources
sample_dataset = [
    {
        "question": "What is 2+2?",
        "answer": "4"
    },
    {
        "question": "If x=5, what is 2x?",
        "answer": "10"
    },
    {
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3"
    }
]

print(f"Loaded {len(sample_dataset)} sample examples")
print("Sample question:", sample_dataset[0]["question"])

## Sample Data

In the original script, data would be loaded from HuggingFace's GSM8K dataset. For this self-contained demo, we'll use a few inline examples with the same `question` and `answer` fields:
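For readers who want to switch between the inline sample and the real dataset, one possible arrangement is sketched below. `load_examples` is a hypothetical helper, and the dataset ID, config name, and split passed to `load_dataset` are assumptions, since the original script's exact arguments are not shown here.

```python
def load_examples(use_hub=False):
    """Return a list of dicts with 'question' and 'answer' keys.

    With use_hub=True this tries the Hugging Face `datasets` library; the
    dataset ID, config, and split below are assumptions, not values taken
    from the original script.
    """
    if use_hub:
        # Imported lazily so the notebook still runs without `datasets` installed.
        from datasets import load_dataset

        dataset = load_dataset("openai/gsm8k", "main", split="test")  # assumed arguments
        return [{"question": row["question"], "answer": row["answer"]} for row in dataset]

    # Offline fallback: a tiny inline sample with the same two fields.
    return [
        {"question": "What is 2+2?", "answer": "4"},
        {"question": "If x=5, what is 2x?", "answer": "10"},
    ]


examples = load_examples()  # pass use_hub=True to attempt the download
print(f"Loaded {len(examples)} examples")
```

Keeping the import inside the function means the offline path has no dependency on `datasets` or on network access.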

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
# Note: In the original script, we would import datasets from HuggingFace
# from datasets import load_dataset

print("Libraries imported successfully!")

## Setup and Imports

First, let's import the necessary libraries for data processing:

# Dataset Collection for DKW Benchmark

**Artifact ID:** dataset_001  
**Name:** data.py

This notebook demonstrates a dataset collection script for DKW controller evaluation. The original script loads data from HuggingFace's GSM8K dataset, processes it, and saves it to JSON format. This self-contained version includes sample data and shows the complete workflow.