# Dataset Collection Script for DKW Benchmark

**Artifact Name:** data.py  
**Description:** Dataset collection script for DKW benchmark evaluation

This notebook demonstrates how to collect and process benchmark data for evaluating DKW controllers. The original script loads data from HuggingFace datasets, but this version is completely self-contained with inline sample data.

## Overview

The pipeline below runs entirely on inline sample data, so it needs no network access and no external dataset. Each step of the original script (loading, processing, scoring, output) has a matching cell here.

### What this notebook does:
1. Defines a data collection function
2. Processes sample benchmark questions and answers
3. Calculates difficulty metrics
4. Outputs structured data for DKW evaluation

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from typing import List, Dict, Any

# Note: Original script used 'datasets' library to load from HuggingFace
# This version is self-contained with inline sample data

## Sample Dataset

Instead of loading from HuggingFace, we use inline sample data: short math problems in the style of GSM8K, each with a `question` and an `answer` field. This keeps the notebook self-contained.

In [None]:
# Inline sample data (replaces HuggingFace dataset loading)
# This represents a subset of GSM8K-style math problems
sample_dataset = [
    {
        "question": "What is 2+2?",
        "answer": "4"
    },
    {
        "question": "If x=5, what is 2x?", 
        "answer": "10"
    },
    {
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3"
    },
    {
        "question": "A store has 45 apples. If they sell 18 apples in the morning and 12 apples in the afternoon, how many apples do they have left?",
        "answer": "45 - 18 - 12 = 15 apples"
    },
    {
        "question": "Sarah has $20. She buys 3 pencils for $2 each and 2 notebooks for $4 each. How much money does she have left?",
        "answer": "20 - (3 * 2) - (2 * 4) = 20 - 6 - 8 = $6"
    }
]

print(f"Loaded {len(sample_dataset)} sample examples")
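
Before processing, it can help to check that every record has the two fields the pipeline reads. The sketch below is one way to do that; `validate_examples` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration and are not part of the original script.

```python
from typing import Any, Dict, List

# Field names assumed from the sample_dataset records above
REQUIRED_KEYS = ("question", "answer")


def validate_examples(examples: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found; an empty list means every record looks usable."""
    problems = []
    for i, example in enumerate(examples):
        for key in REQUIRED_KEYS:
            value = example.get(key)
            # A field is usable only if it is a non-blank string
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{key}'")
    return problems


demo = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "", "answer": "10"},  # empty question
    {"answer": "y=3"},                 # no question key at all
]
print(validate_examples(demo))
```

Returning a list of messages rather than raising on the first problem makes it easier to see every bad record in one pass.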

## Data Collection Function

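This function walks the dataset and emits one structured record per example. Each record keeps the original `question` and `answer` and adds two metadata fields: an `id` (`example_000`, `example_001`, ...) and a `difficulty` score, computed as the question length in characters divided by 100.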

In [None]:
def collect_data(dataset: List[Dict[str, str]] = None) -> List[Dict[str, Any]]:
    """
    Collect benchmark data for DKW controller evaluation.
    
    Args:
        dataset: Optional dataset to process. If None, uses global sample_dataset.
    
    Returns:
        List of dictionaries containing processed benchmark data.
    """
    # Use provided dataset or fall back to global sample_dataset
    if dataset is None:
        dataset = sample_dataset
    
    data = []
    for i, example in enumerate(dataset):
        # Process each example and add metadata
        processed_example = {
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy for difficulty
        }
        data.append(processed_example)

    return data

# Confirm the definition ran; the function itself is called in the next cell
print("Data collection function defined successfully!")

## Execute Data Collection

Let's run the data collection function and examine the results.

In [None]:
# Run the data collection function
collected_data = collect_data()

print(f"Collected {len(collected_data)} examples")
print("\nFirst few examples:")
for i, example in enumerate(collected_data[:3]):
    print(f"\nExample {i+1}:")
    print(f"  ID: {example['id']}")
    print(f"  Question: {example['question']}")
    print(f"  Answer: {example['answer']}")
    print(f"  Difficulty: {example['difficulty']:.2f}")

## Data Analysis

Let's analyze the collected data to understand the difficulty distribution and other characteristics.

In [None]:
# Analyze the collected data
difficulties = [example['difficulty'] for example in collected_data]
question_lengths = [len(example['question']) for example in collected_data]

print("=== Data Analysis ===")
print(f"Total examples: {len(collected_data)}")
print(f"Average difficulty: {sum(difficulties) / len(difficulties):.3f}")
print(f"Difficulty range: {min(difficulties):.3f} - {max(difficulties):.3f}")
print(f"Average question length: {sum(question_lengths) / len(question_lengths):.1f} characters")

print("\n=== Difficulty Distribution ===")
for example in collected_data:
    difficulty_bar = "█" * int(example['difficulty'] * 50)  # Scale for visualization
    print(f"{example['id']}: {difficulty_bar} ({example['difficulty']:.3f})")

# Show complete data structure for reference
print("\n=== Complete Data Structure ===")
print(json.dumps(collected_data, indent=2))
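
The averages above are computed by hand. As a rough alternative, the standard-library `statistics` module can produce the same summary plus a median and a spread. The helper name `summarize` and the input values below are illustrative assumptions, not output from this notebook.

```python
import statistics
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Return basic descriptive statistics for a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Population standard deviation; also defined for a single value
        "stdev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


# Made-up difficulty scores, used only to exercise the helper
summary = summarize([0.1, 0.2, 0.3, 0.6])
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In the notebook, the same call would take the `difficulties` list built in the analysis cell.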

## Save Results (Optional)

Optionally save the processed data to a JSON file for external use.

In [None]:
# Save the collected data to a JSON file (equivalent to original script behavior)
output_filename = "data_out.json"

# Uncomment the following lines to save to file:
# with open(output_filename, "w") as f:
#     json.dump(collected_data, f, indent=2)
# print(f"Data saved to {output_filename}")

# For demonstration, let's show what would be saved:
print("=== Data that would be saved to JSON file ===")
json_output = json.dumps(collected_data, indent=2)
print(json_output)

print("\n✅ Notebook execution complete!")
print(f"📊 Processed {len(collected_data)} examples for DKW benchmark evaluation")
print("💾 Data is ready for use (uncomment save lines to write to file)")
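
If you do enable saving, a quick round trip confirms that the file holds what you expect. The following is a minimal sketch that writes to a temporary directory so it leaves nothing behind; `save_records` and `load_records` are hypothetical helper names, and the single record is a stand-in for `collected_data`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def save_records(records: List[Dict[str, Any]], path: Path) -> None:
    """Write records as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


def load_records(path: Path) -> List[Dict[str, Any]]:
    """Read records back from a JSON file written by save_records."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Stand-in for collected_data, shaped like the records collect_data() returns
records = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.12}
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data_out.json"
    save_records(records, out_path)
    reloaded = load_records(out_path)

print(reloaded == records)
```

Setting the encoding explicitly avoids platform-dependent defaults when questions contain currency symbols or other non-ASCII text.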

## Usage and Modification Notes

This notebook is self-contained: it runs on the Python standard library alone, with no external files. The `datasets` library is needed only if you replace the inline samples with data loaded from HuggingFace.

### Key Changes from Original Script:
- **Inlined sample data**: The sample data is now embedded as a Python list instead of being loaded from HuggingFace
- **Interactive exploration**: Added analysis and visualization of the dataset
- **Self-contained**: No external file dependencies for the demo

### To Modify:
1. **Use real data**: Load a dataset (for example with the `datasets` library) and pass it to `collect_data()` through the `dataset` argument
2. **Add more examples**: Extend the `sample_dataset` list with additional examples
3. **Change difficulty calculation**: Modify the difficulty formula in the `collect_data()` function
4. **Export results**: Save `collected_data` to a file using `json.dump()` if needed
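
As an illustration of item 3, here is one possible replacement for the length-based difficulty proxy. It is only a sketch: the function name `difficulty_score` and the weights are arbitrary assumptions chosen for the example, not something the original script defines.

```python
import re


def difficulty_score(question: str) -> float:
    """Hypothetical difficulty proxy: word count plus a bonus per number, capped at 1.0."""
    words = question.split()
    # Match integers and simple decimals such as 45 or 2.5
    numbers = re.findall(r"\d+(?:\.\d+)?", question)
    raw = len(words) / 50 + len(numbers) / 10  # arbitrary illustrative weights
    return min(raw, 1.0)


short = difficulty_score("What is 2+2?")
long = difficulty_score(
    "A store has 45 apples. If they sell 18 apples in the morning "
    "and 12 apples in the afternoon, how many apples do they have left?"
)
print(short < long)
```

To try it in the notebook, the `"difficulty"` entry inside `collect_data()` would call this function on `example["question"]` instead of dividing the character count by 100.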

### Original Artifact:
- **ID**: dataset_001
- **Name**: data.py
- **Purpose**: DKW benchmark dataset collection