## 🎯 Usage Instructions

This notebook is self-contained and ready to use. Here's how to customize it:

### 🔧 Configuration Options:

1. **Data Source** (Cell 6):
   - Set `USE_LIVE_DATA = True` to fetch real data from HuggingFace GSM8K dataset
   - Set `USE_LIVE_DATA = False` to use the inlined sample data

2. **Export Format** (Cell 8):
   - `"display"` - Show JSON output in notebook (default)
   - `"json"` - Save to `data_out.json` file
   - `"csv"` - Save to `data_out.csv` file

### 🚀 Next Steps:

- Modify the difficulty calculation in the `collect_data()` function
- Add more data processing steps
- Integrate with your DKW benchmark evaluation pipeline
- Extend the sample data with more examples
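
As one possible starting point for the first item, the length-based proxy from `collect_data()` could be blended with a count of the numbers that appear in the question. This is only a sketch: the function name `estimate_difficulty`, the weights, the "10 numbers" scale, and the cap at 1.0 are assumptions made for illustration, not part of the original script.

```python
import re


def estimate_difficulty(question, length_weight=0.5, number_weight=0.5):
    """Hypothetical difficulty proxy blending question length and number count."""
    # Same length proxy as collect_data(): characters / 100.
    length_score = len(question) / 100
    # Count integer or decimal literals in the question text.
    number_count = len(re.findall(r"\d+(?:\.\d+)?", question))
    number_score = number_count / 10  # assumed scale: 10 numbers -> 1.0
    score = length_weight * length_score + number_weight * number_score
    return min(score, 1.0)  # assumed cap so scores stay in [0, 1]


print(estimate_difficulty("What is 2+2?"))
```

Swapping this in would mean replacing the `len(example["question"]) / 100` expression in `collect_data()` with a call to the new function.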

### 📝 Notes:

- No external file dependencies required
- All data is either fetched from HuggingFace or inlined as Python objects
- Fully interactive with pandas DataFrame display
- Error handling included for network issues

**Enjoy your self-contained dataset collection notebook!** 🎉

In [None]:
# Export options
def export_data(data, format_type="json"):
    """Export data in various formats."""
    if format_type == "json":
        # Original functionality - save as JSON
        with open("data_out.json", "w") as f:
            json.dump(data, f, indent=2)
        print(f"✅ Exported {len(data)} examples to data_out.json")

    elif format_type == "csv":
        # Export as CSV; build the DataFrame from `data` instead of relying on a global `df`
        pd.DataFrame(data).to_csv("data_out.csv", index=False)
        print(f"✅ Exported {len(data)} examples to data_out.csv")

    elif format_type == "display":
        # Just display the JSON in notebook
        print("📄 JSON Output:")
        print(json.dumps(data, indent=2))

# Choose export format
EXPORT_FORMAT = "display"  # Options: "json", "csv", "display"

if EXPORT_FORMAT in ["json", "csv"]:
    export_data(data, EXPORT_FORMAT)
else:
    export_data(data, "display")

print(f"\n🎉 Data collection complete! Processed {len(data)} examples.")

## 💾 Data Export

Export the processed data to various formats. This replaces the original file writing functionality with more flexible export options.
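
To illustrate the export idea without depending on notebook globals, here is a standard-library sketch that writes records into a caller-supplied directory and returns the file path. The helper name `export_records` and its `out_dir` parameter are hypothetical; the notebook's own `export_data()` writes to the working directory instead.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_records(records, format_type, out_dir):
    """Write records as JSON or CSV into out_dir and return the file path."""
    out_dir = Path(out_dir)
    if format_type == "json":
        path = out_dir / "data_out.json"
        path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    elif format_type == "csv":
        path = out_dir / "data_out.csv"
        with path.open("w", newline="", encoding="utf-8") as f:
            # Assumes a non-empty list whose records all share the same keys.
            writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
            writer.writeheader()
            writer.writerows(records)
    else:
        raise ValueError(f"Unsupported format: {format_type!r}")
    return path


records = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15},
]
with tempfile.TemporaryDirectory() as tmp:
    json_path = export_records(records, "json", tmp)
    # Round-trip check: the reloaded JSON should equal the original records.
    reloaded = json.loads(json_path.read_text(encoding="utf-8"))
    print(reloaded == records)
```

Raising on an unknown format, rather than silently doing nothing, is a design choice of this sketch.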

In [None]:
# Convert to pandas DataFrame for better display
df = pd.DataFrame(data)

# Display basic statistics
print("📈 Data Statistics:")
print(f"   Questions range from {df['difficulty'].min():.3f} to {df['difficulty'].max():.3f} difficulty")
print(f"   Question lengths: {df['question'].str.len().min()} - {df['question'].str.len().max()} characters")

# Display the data table
print(f"\n🗂️ Complete Dataset ({len(df)} rows):")
display(df)

# Show difficulty distribution
print("\n📊 Difficulty Distribution:")
difficulty_bins = pd.cut(df['difficulty'], bins=3, labels=['Easy', 'Medium', 'Hard'])
print(difficulty_bins.value_counts())

## 📋 Interactive Data Display

View the collected data as a pandas DataFrame table, together with summary statistics and a three-bin difficulty distribution.
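
The display cell for this section buckets difficulty scores into three equal-width bins with `pd.cut(..., bins=3)`. The pure-Python sketch below approximates that idea so the binning is easier to follow. The function name `bin_difficulty` is hypothetical, and edge handling differs slightly from pandas (which uses right-closed intervals), so treat it as an illustration rather than a drop-in replacement.

```python
def bin_difficulty(scores, labels=("Easy", "Medium", "Hard")):
    """Assign each score to one of len(labels) equal-width bins."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / len(labels)
    if width == 0:
        # All scores identical: assumed choice, place everything in the first bin.
        return [labels[0] for _ in scores]
    binned = []
    for score in scores:
        # Clamp the index so the maximum score lands in the last bin.
        index = min(int((score - lo) / width), len(labels) - 1)
        binned.append(labels[index])
    return binned


# Difficulty values taken from the inlined sample data.
print(bin_difficulty([0.15, 0.22, 0.28]))
```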

In [None]:
# Choose your data source
USE_LIVE_DATA = False  # Set to True to fetch from HuggingFace, False to use sample data

if USE_LIVE_DATA:
    print("🌐 Collecting live data from HuggingFace GSM8K dataset...")
    try:
        data = collect_data()
        print(f"✅ Successfully collected {len(data)} examples from HuggingFace")
    except Exception as e:
        print(f"❌ Error loading live data: {e}")
        print("📋 Falling back to sample data...")
        data = sample_data
else:
    print("📋 Using inlined sample data...")
    data = sample_data

print("\nDataset Summary:")
print(f"📊 Total examples: {len(data)}")
print(f"💡 Average difficulty: {sum(item['difficulty'] for item in data) / len(data):.3f}")
print(f"📏 Average question length: {sum(len(item['question']) for item in data) / len(data):.1f} characters")

## 🚀 Run Data Collection

Execute the data collection function and display the results. You can choose to either:
1. Collect live data from HuggingFace (requires internet connection)
2. Use the inlined sample data for demonstration
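
The choice between the two options, with a fallback when the live fetch fails, can be sketched as a small helper. The names `load_with_fallback` and `failing_fetch` are hypothetical; `failing_fetch` stands in for `collect_data()` on a machine without network access.

```python
def load_with_fallback(fetch, fallback):
    """Return (records, source): fetch() if it succeeds, else the fallback."""
    try:
        return fetch(), "live"
    except Exception as exc:  # broad on purpose: network errors, missing packages, etc.
        print(f"Live fetch failed ({exc}); using fallback data.")
        return fallback, "sample"


def failing_fetch():
    # Stand-in for collect_data() when no network is available.
    raise ConnectionError("no network")


sample = [
    {"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15},
]
data, source = load_with_fallback(failing_fetch, sample)
print(source, len(data))
```

Returning the source label alongside the records makes it easy to tell afterwards which path was taken.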

In [None]:
# Sample data inlined from data_out.json
sample_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print(f"Sample dataset contains {len(sample_data)} examples")
print("Sample data structure:")
for item in sample_data[:1]:  # Show structure of first item
    for key, value in item.items():
        print(f"  {key}: {value} ({type(value).__name__})")

## 📊 Sample Data (Inlined)

Instead of reading from external JSON files, here's the sample data inlined for demonstration purposes. This makes the notebook completely self-contained.
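
Because the sample records are plain Python dictionaries, a quick structural check can catch typos when the list is extended. The sketch below is an assumption about what such a check might look like: `REQUIRED_FIELDS` and `validate_records` are hypothetical names, and the field types are inferred from the sample data shown in this notebook.

```python
# Field names and types inferred from the inlined sample records.
REQUIRED_FIELDS = {"id": str, "question": str, "answer": str, "difficulty": float}


def validate_records(records):
    """Return a list of problems found; an empty list means every record passes."""
    problems = []
    for index, record in enumerate(records):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in record:
                problems.append(f"record {index}: missing '{field}'")
            elif not isinstance(record[field], expected_type):
                # Note: an integer difficulty such as 1 is reported; write 1.0 instead.
                problems.append(
                    f"record {index}: '{field}' should be {expected_type.__name__}"
                )
    return problems


good = [{"id": "example_000", "question": "What is 2+2?", "answer": "4", "difficulty": 0.15}]
bad = [{"id": "example_001", "question": "If x=5, what is 2x?", "answer": "10"}]
print(validate_records(good))
print(validate_records(bad))
```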

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Load HuggingFace dataset
    print("Loading GSM8K dataset from HuggingFace...")
    ds = load_dataset("gsm8k", "main", split="test[:200]")
    print(f"Loaded {len(ds)} examples")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy based on question length
        })

    return data

## 🔧 Data Collection Function

The main function that processes the GSM8K dataset and creates structured benchmark data with difficulty scoring.
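
To see the record structure and the difficulty score without downloading anything, the loop inside `collect_data()` can be mirrored on an in-memory list. This is a sketch: `build_records` is a hypothetical helper, and `fake_rows` are illustrative stand-ins, not real GSM8K entries.

```python
def build_records(examples):
    """Mirror the loop in collect_data() on any iterable of question/answer dicts."""
    records = []
    for i, example in enumerate(examples):
        records.append({
            "id": f"example_{i:03d}",  # zero-padded: example_000, example_001, ...
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # characters / 100
        })
    return records


# Illustrative stand-in rows, not real GSM8K entries.
fake_rows = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "If x=5, what is 2x?", "answer": "10"},
]
for record in build_records(fake_rows):
    print(record["id"], record["difficulty"])
```

Since the score is just the character count divided by 100, questions longer than 100 characters score above 1.0.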

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from datasets import load_dataset
import pandas as pd
from IPython.display import display, HTML

## 📦 Imports and Setup

# Dataset Collection Script for DKW Benchmark

**Artifact:** dataset_001 - data.py

This notebook contains a self-contained version of the dataset collection script for DKW benchmark evaluation. It loads and processes the GSM8K dataset to create benchmark data for controller evaluation.

## Features:
- Loads HuggingFace GSM8K dataset
- Processes examples with metadata (ID, difficulty score)
- Displays collected data in an interactive format
- Completely self-contained with inlined sample data

## Usage Notes

### Running this notebook:
1. **Self-contained mode**: Run all cells as-is to see sample data processing
2. **Live mode**: Uncomment the live data collection lines to fetch real GSM8K data

### Modifications you can make:
- Change the dataset split in `collect_data()` (e.g., `"test[:500]"` for more examples)
- Modify the difficulty calculation logic
- Add additional data processing steps
- Export results to different formats
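
For the first modification, the split string passed to `load_dataset` can be built from a parameter instead of being hard-coded. The helper `make_split` below is a hypothetical convenience, not part of the original script; it only formats the string and does not touch the network.

```python
def make_split(split="test", limit=None):
    """Build a split string such as "test[:500]"; limit=None selects the full split."""
    if limit is None:
        return split
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"{split}[:{limit}]"


print(make_split("test", 500))

# Possible use inside collect_data() (requires the `datasets` package and network access):
# ds = load_dataset("gsm8k", "main", split=make_split("test", 500))
```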

### Original vs. Notebook differences:
- ✅ **Inlined data**: No dependency on `data_out.json` file
- ✅ **Interactive**: Can run sections independently
- ✅ **Documented**: Clear explanations for each step
- ✅ **Flexible**: Easy to modify and experiment with

**This notebook is completely self-contained and ready to run!**

In [None]:
# Display the collected data in a readable format
for i, example in enumerate(data):
    print(f"\n--- Example {i+1} ---")
    print(f"ID: {example['id']}")
    print(f"Question: {example['question']}")
    print(f"Answer: {example['answer']}")
    print(f"Difficulty: {example['difficulty']:.2f}")

print(f"\n✅ Successfully processed {len(data)} examples")
print("📝 Data structure matches the original script output")
print("🔄 Notebook is completely self-contained - no external files needed!")

In [None]:
# Option 1: Live data collection (uncomment to run with internet access)
# print("Collecting live data from HuggingFace...")
# live_data = collect_data()
# print(f"Collected {len(live_data)} examples from GSM8K dataset")

# Option 2: Use sample data for demonstration (self-contained)
print("Using sample data for demonstration:")
data = sample_output_data

# Original script would save to file - here we'll just display
# with open("data_out.json", "w") as f:
#     json.dump(data, f, indent=2)

print(f"Collected {len(data)} examples")
print("\nFirst few examples:")

## Execute Data Collection

Now let's run the data collection function. In the original script, this would load from HuggingFace and save to a file. Here we'll demonstrate both approaches:

1. **Live data collection** (requires internet and HuggingFace datasets)
2. **Display of sample data** (self-contained)

In [None]:
# Inlined sample data (replaces data_out.json for self-contained execution)
sample_output_data = [
    {
        "id": "example_000",
        "question": "What is 2+2?",
        "answer": "4",
        "difficulty": 0.15
    },
    {
        "id": "example_001",
        "question": "If x=5, what is 2x?",
        "answer": "10",
        "difficulty": 0.22
    },
    {
        "id": "example_002",
        "question": "Solve: 3y + 6 = 15",
        "answer": "y=3",
        "difficulty": 0.28
    }
]

print(f"Sample data contains {len(sample_output_data)} examples")

## Self-Contained Data (No External Dependencies)

Instead of reading from `data_out.json`, we'll inline the sample data here to make this notebook completely self-contained. This demonstrates what the output would look like without requiring external files.

In [None]:
def collect_data():
    """Collect benchmark data for DKW controller evaluation."""
    # Load HuggingFace dataset
    ds = load_dataset("gsm8k", "main", split="test[:200]")

    data = []
    for i, example in enumerate(ds):
        data.append({
            "id": f"example_{i:03d}",
            "question": example["question"],
            "answer": example["answer"],
            "difficulty": len(example["question"]) / 100,  # Simple proxy
        })

    return data

In [None]:
"""Dataset collection script for DKW benchmark."""
import json
from datasets import load_dataset

## Overview

This notebook demonstrates:
1. **Data Collection**: Loading benchmark data from HuggingFace's GSM8K dataset
2. **Data Processing**: Formatting examples with IDs, questions, answers, and difficulty scores
3. **Self-contained execution**: All data is inlined to eliminate external dependencies

The original script would save data to `data_out.json` - here we'll display the results directly and provide the data inline for demonstration.

# Dataset Collection for DKW Benchmark

**Artifact:** dataset_001 (data.py)

This notebook contains a self-contained version of the dataset collection script for DKW benchmark evaluation. The original script has been converted to run entirely within this notebook without any external file dependencies.