## Export Processed Data

Finally, let's create the standardized output format as done in the original script.
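
The export cell itself is not shown here, so the following is only a minimal sketch of what such a step could look like, assuming the output is a single JSON file. The helper name `export_examples`, the file name `processed_examples.json`, and the `demo_examples` stand-in are hypothetical and do not come from the original script.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def export_examples(examples: List[Dict[str, Any]], output_path: Path) -> Path:
    """Write examples to a UTF-8 JSON file and return the path (hypothetical helper)."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as f:
        json.dump(examples, f, indent=2, ensure_ascii=False)
    return output_path


# Tiny stand-in for processed_examples so the sketch runs on its own.
demo_examples = [
    {
        "input": "Objective: demo\n\nCurrent State Observation:\n...",
        "context": {"example_id": 0, "num_models_compared": 3},
        "output": "demo next state",
        "dataset": "LangAGI-Lab/human_eval-next_state_prediction_w_gpt4o",
        "split": "train",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = export_examples(demo_examples, Path(tmp_dir) / "processed_examples.json")
    reloaded = json.loads(out_file.read_text(encoding="utf-8"))
    print(f"Exported {len(reloaded)} example(s) to {out_file.name}")
```

In the notebook, `processed_examples` would take the place of `demo_examples`, and the path would point somewhere persistent instead of a temporary directory.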

In [None]:
# Analyze dataset statistics
print("=" * 60)
print("DATASET STATISTICS")
print("=" * 60)

total_examples = len(processed_examples)
original_examples = sum(1 for ex in processed_examples if not ex['context'].get('is_duplicate', False))
duplicated_examples = total_examples - original_examples

print(f"Total examples: {total_examples}")
print(f"Original examples: {original_examples}")
print(f"Duplicated examples: {duplicated_examples}")
print()

# Analyze example characteristics
has_objective = sum(1 for ex in processed_examples if ex['context']['has_objective'])
has_gold_action = sum(1 for ex in processed_examples if ex['context']['has_gold_action'])

print(f"Examples with objectives: {has_objective}/{total_examples} ({has_objective/total_examples*100:.1f}%)")
print(f"Examples with gold actions: {has_gold_action}/{total_examples} ({has_gold_action/total_examples*100:.1f}%)")
print()

# Analyze text lengths
input_lengths = [len(ex['input']) for ex in processed_examples]
output_lengths = [len(ex['output']) for ex in processed_examples]

print("Input text lengths:")
print(f"  - Average: {sum(input_lengths)/len(input_lengths):.1f} characters")
print(f"  - Min: {min(input_lengths)} characters")
print(f"  - Max: {max(input_lengths)} characters")
print()

print("Output text lengths:")
print(f"  - Average: {sum(output_lengths)/len(output_lengths):.1f} characters")
print(f"  - Min: {min(output_lengths)} characters")
print(f"  - Max: {max(output_lengths)} characters")
print()

# Count unique domains/URLs
urls = [ex['context'].get('url', '') for ex in processed_examples if 'url' in ex['context']]
unique_domains = set()
for url in urls:
    if url:
        # Extract domain from URL
        domain = url.split('://')[1].split('/')[0] if '://' in url else url.split('/')[0]
        unique_domains.add(domain)

print(f"Unique domains: {len(unique_domains)}")
for domain in sorted(unique_domains):
    print(f"  - {domain}")
    
print()
print("All examples have 3 model predictions (proprietary model, GPT-4o-Mini, GPT-4o)")
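
The domain extraction in the cell above splits the URL string by hand. As an alternative sketch, the standard library's `urllib.parse.urlparse` can do the same job; the helper name `extract_domain` and the `sample_urls` list are illustrative assumptions. Like the manual version, it keeps the port as part of the result.

```python
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    """Return the host[:port] part of a URL, or '' for an empty string."""
    if not url:
        return ""
    # urlparse only fills netloc when a '//' separator is present,
    # so scheme-less inputs get a '//' prefix first.
    parsed = urlparse(url if "://" in url else f"//{url}")
    return parsed.netloc


# Illustrative inputs; the first URL appears in the sample data.
sample_urls = [
    "http://ec2-3-139-65-91.us-east-2.compute.amazonaws.com:7770",
    "http://localhost:3000/submit",
    "example.com/path",
]
domains = sorted({extract_domain(u) for u in sample_urls})
print(domains)
```

Note that `netloc` would also include any `user:password@` prefix, which the sample URLs do not have.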

## Data Analysis and Statistics

Let's analyze the processed dataset to understand its characteristics and distribution.

In [None]:
# Display the second processed example  
print("=" * 60)
print("EXAMPLE 2: GitLab Repository Search Task")
print("=" * 60)

example = processed_examples[1]
print(f"OBJECTIVE: {example['context'].get('objective', 'N/A')}")
print()
print("INPUT:")
print((example['input'][:300] + "...") if len(example['input']) > 300 else example['input'])
print()
print("GROUND TRUTH OUTPUT:")
print((example['output'][:300] + "...") if len(example['output']) > 300 else example['output'])
print()
print("CONTEXT SUMMARY:")
print(f"  - Has objective: {example['context']['has_objective']}")
print(f"  - Has gold action: {example['context']['has_gold_action']}")
print(f"  - Number of models compared: {example['context']['num_models_compared']}")
print(f"  - Example ID: {example['context']['example_id']}")
if 'url' in example['context']:
    print(f"  - URL: {example['context']['url']}")
if 'gold_action' in example['context']:
    print(f"  - Gold action: {example['context']['gold_action'][:100]}...")

print()
print("MULTI-LLM PREDICTIONS:")
print("  [Ours]:", example['context']['predicted_next_state'][:150] + "...")
print("  [GPT-4o-Mini]:", example['context']['predicted_gpt4o_mini'][:150] + "...")  
print("  [GPT-4o]:", example['context']['predicted_gpt4o'][:150] + "...")

In [None]:
# Display the first processed example
print("=" * 60)
print("EXAMPLE 1: E-commerce Search Task")
print("=" * 60)

example = processed_examples[0]
print(f"OBJECTIVE: {example['context'].get('objective', 'N/A')}")
print()
print("INPUT:")
print((example['input'][:300] + "...") if len(example['input']) > 300 else example['input'])
print()
print("GROUND TRUTH OUTPUT:")
print((example['output'][:300] + "...") if len(example['output']) > 300 else example['output'])
print()
print("CONTEXT SUMMARY:")
print(f"  - Has objective: {example['context']['has_objective']}")
print(f"  - Has gold action: {example['context']['has_gold_action']}")
print(f"  - Number of models compared: {example['context']['num_models_compared']}")
print(f"  - Example ID: {example['context']['example_id']}")
if 'url' in example['context']:
    print(f"  - URL: {example['context']['url']}")
if 'gold_action' in example['context']:
    print(f"  - Gold action: {example['context']['gold_action'][:100]}...")

print()
print("MULTI-LLM PREDICTIONS:")
print("  [Ours]:", example['context']['predicted_next_state'][:150] + "...")
print("  [GPT-4o-Mini]:", example['context']['predicted_gpt4o_mini'][:150] + "...")  
print("  [GPT-4o]:", example['context']['predicted_gpt4o'][:150] + "...")
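
Both example cells repeat the same truncate-and-append-ellipsis expression. A small helper, sketched here under the hypothetical name `truncate`, would express that once and add the ellipsis only when text was actually cut.

```python
def truncate(text: str, limit: int) -> str:
    """Return text unchanged if it fits, else its first `limit` characters plus '...'."""
    return text if len(text) <= limit else text[:limit] + "..."


# Short strings pass through untouched; long ones are cut and marked.
print(truncate("click [53]", 100))
print(truncate("x" * 400, 300)[-5:])
```

With this in place, a line such as `print(truncate(example['input'], 300))` could replace the inline conditional.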

## Examine Processed Examples

Let's look at the structure and content of the processed examples to understand how the multi-LLM predictions are organized.

In [None]:
# Process the sample data
SELECTED_DATASET = "LangAGI-Lab/human_eval-next_state_prediction_w_gpt4o"
TARGET_EXAMPLES = 10  # Using smaller number for demo

print("=" * 60)
print("Dataset Processing for Multi-LLM Hidden-State Research")
print("=" * 60)
print(f"Selected Dataset: {SELECTED_DATASET}")
print(f"Target Examples: {TARGET_EXAMPLES}")
print()

# Process the sample LangAGI data
processed_examples = process_state_prediction(
    sample_langagi_data, 
    SELECTED_DATASET,
    TARGET_EXAMPLES
)

print(f"\n✓ Successfully processed {len(processed_examples)} examples")
print("\nFirst example structure:")
print(f"  - Input length: {len(processed_examples[0]['input'])} characters")
print(f"  - Output length: {len(processed_examples[0]['output'])} characters") 
print(f"  - Context keys: {list(processed_examples[0]['context'].keys())}")
print(f"  - Dataset: {processed_examples[0]['dataset']}")
print(f"  - Split: {processed_examples[0]['split']}")

## Process Sample Data

Now let's process our sample data using the function above. We'll use a smaller target (10 examples) for demonstration purposes.

In [None]:
def process_state_prediction(data: List[Dict], dataset_name: str, target_examples: int = 200) -> List[Dict[str, Any]]:
    """
    Process LangAGI state prediction dataset with GPT-4o.
    
    Args:
        data: List of raw data examples
        dataset_name: Name of the dataset
        target_examples: Target number of examples (duplicates if needed)
    
    Returns:
        List of processed examples in standardized format
    """
    examples = []
    original_count = len(data)
    
    print(f"Processing {original_count} original examples...")
    
    for idx, row in enumerate(data):
        web_state = row.get('web_state', {})
        ground_truth = row.get('next_state(Ground_Truth)', '')
        predicted_ours = row.get('predicted_next_state(Ours)', '')
        predicted_gpt4o_mini = row.get('predicted_next_state(GPT-4o-Mini)', '')
        predicted_gpt4o = row.get('predicted_next_state(GPT-4o)', '')

        # Extract fields from web_state
        current_obs = web_state.get('current_observation', '')
        objective = web_state.get('objective', '')
        gold_action = web_state.get('gold_action', '')
        url = web_state.get('url', '')
        previous_actions = web_state.get('previous_actions', [])

        # Build input prompt
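        # Truncate long observations; add the ellipsis only when text was cut
        obs_excerpt = current_obs[:500] + "..." if len(current_obs) > 500 else current_obs
        input_text = f"Current State Observation:\n{obs_excerpt}"
        if objective:
            input_text = f"Objective: {objective}\n\n{input_text}"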

        # Build context with all available info
        context = {
            "current_state": current_obs[:200] + "..." if len(current_obs) > 200 else current_obs,
            "ground_truth_next_state": ground_truth[:200] + "..." if len(ground_truth) > 200 else ground_truth,
            "predicted_next_state": predicted_ours[:200] + "..." if len(predicted_ours) > 200 else predicted_ours,
            "predicted_gpt4o_mini": predicted_gpt4o_mini[:200] + "..." if len(predicted_gpt4o_mini) > 200 else predicted_gpt4o_mini,
            "predicted_gpt4o": predicted_gpt4o[:200] + "..." if len(predicted_gpt4o) > 200 else predicted_gpt4o,
            "has_objective": bool(objective),
            "has_gold_action": bool(gold_action),
            "num_models_compared": 3,
            "example_id": idx
        }

        if objective:
            context["objective"] = objective
        if gold_action:
            context["gold_action"] = gold_action
        if url:
            context["url"] = url
        if previous_actions:
            context["num_previous_actions"] = len(previous_actions)

        example = {
            "input": input_text,
            "context": context,
            "output": ground_truth,
            "dataset": dataset_name,
            "split": "train"
        }
        examples.append(example)

    # If we have fewer than target examples, duplicate them cyclically and mark
    # each copy in its context (skipped for empty input to avoid modulo by zero)
    if examples and len(examples) < target_examples:
        print(f"  Note: Original dataset has {len(examples)} examples")
        print(f"  Duplicating examples to reach {target_examples}...")

        while len(examples) < target_examples:
            # Take examples cyclically
            idx_to_duplicate = len(examples) % original_count
            duplicate = examples[idx_to_duplicate].copy()

            # Mark as duplicate in context
            duplicate["context"] = duplicate["context"].copy()
            duplicate["context"]["is_duplicate"] = True
            duplicate["context"]["original_example_id"] = idx_to_duplicate
            duplicate["context"]["duplicate_number"] = len(examples) // original_count

            examples.append(duplicate)

    # Return exactly target_examples
    return examples[:target_examples]

print("Data processing function defined successfully!")
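
The cyclic duplication step at the end of `process_state_prediction` can be looked at in isolation. The sketch below reproduces only that step on toy data; the helper name `pad_by_cycling` and the `originals` list are hypothetical and exist purely for illustration.

```python
from typing import Any, Dict, List


def pad_by_cycling(examples: List[Dict[str, Any]], target: int) -> List[Dict[str, Any]]:
    """Cycle through the originals until `target` items exist (hypothetical helper)."""
    original_count = len(examples)
    if original_count == 0:
        return []  # nothing to cycle through
    padded = list(examples)
    while len(padded) < target:
        source_idx = len(padded) % original_count
        duplicate = dict(padded[source_idx])
        # Copy the nested context so marking the duplicate leaves the original untouched.
        duplicate["context"] = dict(duplicate["context"])
        duplicate["context"]["is_duplicate"] = True
        duplicate["context"]["original_example_id"] = source_idx
        duplicate["context"]["duplicate_number"] = len(padded) // original_count
        padded.append(duplicate)
    return padded[:target]


originals = [{"context": {"example_id": i}} for i in range(3)]
padded = pad_by_cycling(originals, 7)
source_ids = [ex["context"].get("original_example_id", ex["context"]["example_id"]) for ex in padded]
print(source_ids)  # → [0, 1, 2, 0, 1, 2, 0]
```

Because the index is taken modulo the original count, every duplicate is copied from one of the first three entries, never from another duplicate, and `duplicate_number` counts which pass through the originals produced it.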

## Data Processing Functions

The following functions replicate the data processing logic from the original script. They transform the raw LangAGI data into a standardized format suitable for multi-LLM analysis.
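
As a rough sketch of the standardized format, the five top-level keys that `process_state_prediction` emits can be written down as a `TypedDict`. The class name `ProcessedExample` and the `missing_keys` helper are assumptions made for illustration; the original script may not define any such type.

```python
from typing import Any, Dict, Set, TypedDict


class ProcessedExample(TypedDict):
    """Top-level keys of one processed example (illustrative type, not from the script)."""
    input: str                # prompt built from the objective and observation
    context: Dict[str, Any]   # truncated predictions, flags, and metadata
    output: str               # ground-truth next state
    dataset: str              # dataset name
    split: str                # always "train" in this notebook


REQUIRED_KEYS: Set[str] = set(ProcessedExample.__annotations__)


def missing_keys(example: Dict[str, Any]) -> Set[str]:
    """Return the required top-level keys that are absent from `example`."""
    return REQUIRED_KEYS - set(example)


print(missing_keys({"input": "", "context": {}, "output": "", "dataset": "d", "split": "train"}))
```

A check like `missing_keys(ex)` could be run over the processed list before exporting to catch malformed entries early.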

In [None]:
# Sample data from the LangAGI dataset (inline for self-contained notebook)
sample_langagi_data = [
    {
        'web_state': {
            'current_observation': '''Tab 0 (current): One Stop Market

[1] RootWebArea 'One Stop Market' focused: True
	[95] link 'My Account'
	[96] link 'My Wish List'
	[97] link 'Sign Out'
	[3838] StaticText 'Welcome to One Stop Market'
	[39] link 'Skip to Content'
	[23] link 'store logo'
		[41] img 'one_stop_market_logo'
	[42] link '\\ue611 My Cart'
	[281] StaticText 'Search'
	[172] combobox '\\ue615 Search' autocomplete: both hasPopup: listbox required: False expanded: False
	[284] link 'Advanced Search'
	[139] button 'Search' disabled: False''',
            'objective': 'Search for "batteries for iphone 13"',
            'gold_action': 'type [172] [iphone 13] where [172] is combobox \'\\ue615 Search\' autocomplete: both hasPopup: listbox required: False expanded: False',
            'url': 'http://ec2-3-139-65-91.us-east-2.compute.amazonaws.com:7770',
            'previous_actions': []
        },
        'next_state(Ground_Truth)': 'The expected effect is that: \\n1. The webpage will now display a search results page tailored specifically for \'iphone 13\', showcasing relevant products.\\n2. Users will see a variety of items, including accessories like cases and protectors for the iPhone 13, complete with prices and ratings, allowing them to make informed choices.\\n3. The interface will allow users to interact with these products through options to add to cart or wish list, enhancing their shopping experience.\\n4. The layout will be more organized and visually focused on the products relevant to the search query, making it easier to navigate through the available options.\\n5. The transition from the general homepage to a focused search results page indicates that the user\'s input has successfully triggered a relevant and comprehensive display of products related to their query.',
        'predicted_next_state(Ours)': '1. The webpage will be updated to a new RootWebArea titled "Search results for: \'iphone 13\'," focusing entirely on the search outcomes.\\n2. A heading corresponding to the search term used will be prominently displayed, indicating the successful search execution.\\n3. The search results will include a list of relevant products such as iPhone 13 accessories, cases, and other related items, formatted in a user-friendly layout.\\n4. Each product listed will have interactive elements such as links or buttons that allow users to view details, add items to cart, or add them to their wish list.\\n5. Navigation options will be available to refine the search or return to the previous page, providing a comprehensive shopping experience tailored to the user\'s query.',
        'predicted_next_state(GPT-4o-Mini)': 'The webpage will update to show search results for "batteries for iphone 13." This includes:\\n1. The RootWebArea will now be focused on displaying the results relevant to the search query.\\n2. A new heading will indicate that results are being shown for "iphone 13" or similar.\\n3. A list of products will be presented, likely including iPhone 13 batteries and related accessories.\\n4. Each product entry will include options to view more details, add to cart, or add to the wish list.\\n5. The search interface will remain accessible for further queries, and users will have easy navigation back to the main store or other sections.',
        'predicted_next_state(GPT-4o)': '1. The combobox [172] will show the text "iphone 13".\\n2. The search button [139] might become enabled if the input validation criteria are met, allowing the user to initiate the search.\\n\\n[Next State] After typing "iphone 13" in the search box:\\n\\nTab 0 (current): One Stop Market\\n\\n[1] RootWebArea \'One Stop Market\' focused: True\\n\\t[95] link \'My Account\'\\n\\t[96] link \'My Wish List\'\\n\\t[97] link \'Sign Out\'\\n\\t[3838] StaticText \'Welcome to One Stop Market\'\\n\\t[39] link \'Skip to Content\'\\n\\t[23] link \'store logo\'\\n\\t\\t[41] img \'one_stop_market_logo\'\\n\\t[42] link \'\\\\ue611 My Cart\'\\n\\t[281] StaticText \'Search\'\\n\\t[172] combobox \'\\\\ue615 Search\' autocomplete: both hasPopup: listbox required: False expanded: False value: "iphone 13"\\n\\t[284] link \'Advanced Search\'\\n\\t[139] button \'Search\' disabled: False'
    },
    {
        'web_state': {
            'current_observation': '''Tab 0 (current): Projects · Dashboard · GitLab

[1] RootWebArea 'Projects · Dashboard · GitLab' focused: True
	[5] link 'Skip to content'
	[18] link 'Dashboard'
	[2720] button '' hasPopup: menu expanded: False
	[64] textbox 'Search GitLab' required: False
	[56] generic 'Use the shortcut key <kbd>/</kbd> to start a search'
	[34] link 'Create new...'
	[105] link 'Issues'
		[136] generic '13 assigned issues'
	[106] link 'Merge requests'
		[138] generic '8 merge requests'
	[108] link 'To-Do List\'''',
            'objective': 'Tell me who has made the most contributions, in terms of number of commits, to the AndroidSlidingUpPanel project',
            'gold_action': 'type [64] [AndroidSlidingUpPanel] where [64] is textbox \'Search GitLab\' required: False',
            'url': 'http://ec2-3-139-65-91.us-east-2.compute.amazonaws.com:8023',
            'previous_actions': []
        },
        'next_state(Ground_Truth)': 'The expected effect is that: \\n1. The webpage will now display the search results specific to the AndroidSlidingUpPanel project.\\n2. The project will be highlighted with its description, owner, and information about contributions.\\n3. Users will be able to see details about the contributors, including the number of commits made to the project, facilitating the identification of the person with the most contributions.\\n4. The interface will provide navigational links to further explore the project\'s issues, merge requests, and additional statistics related to contributions, allowing for a comprehensive understanding of the project\'s activity and contributors.',
        'predicted_next_state(Ours)': 'The web page will be updated to a new RootWebArea titled "Search results for: \'AndroidSlidingUpPanel\'," focusing entirely on the search outcomes.\\n1. A heading corresponding to the search term will be prominently displayed.\\n2. The search results will include a list of projects related to "AndroidSlidingUpPanel," with the most relevant result likely being the specific project repository.\\n3. Each search result will provide options to access the project details, including repository information, contributors, and commit history.\\n4. The interface will allow users to click on the relevant project to explore further details about contributions and project activity.',
        'predicted_next_state(GPT-4o-Mini)': 'The webpage will transition to display search results for the term "AndroidSlidingUpPanel." This includes:\\n1. The RootWebArea will be updated to indicate it is now showing results for the search term "AndroidSlidingUpPanel."\\n2. A list of search results will be presented, likely including repositories, projects, or issues related to "AndroidSlidingUpPanel."\\n3. Each result will have links that allow the user to access detailed information about the project, including contributor data and commit history.\\n4. The search interface will remain functional for additional queries, and navigation options will be available to return to the dashboard or explore specific projects.',
        'predicted_next_state(GPT-4o)': '1. The webpage will be updated to a new RootWebArea titled "Search results for: \'AndroidSlidingUpPanel\'," focusing on the search outcomes.\\n2. A heading corresponding to the search term used will be prominently displayed at the top of the page.\\n3. The search results will show a list of projects, repositories, or related content matching "AndroidSlidingUpPanel."\\n4. Each search result item will have clickable links that allow the user to access detailed information about the specific project, including contributor statistics and commit history.\\n5. Navigation elements will be present to allow the user to refine the search or return to the main dashboard.'
    },
    {
        'web_state': {
            'current_observation': '''Tab 0 (current): Postmill

[1] RootWebArea 'Postmill' focused: True
	[17] HeaderAsNonLandmark ''
		[27] link 'Home'
	[69] link 'Forums'
	[70] link 'Wiki'
	[51] searchbox 'Search query'
	[52] link 'Notifications (0)'
	[53] link 'Submit'
	[210] button 'MarvelsGrantMan136' hasPopup: menu expanded: False
	[12] main ''
		[101] link 'Submissions'
		[102] link 'Comments'
		[211] button 'Filter on: Featured' hasPopup: menu expanded: False
		[212] button 'Sort by: Hot' hasPopup: menu expanded: False''',
            'objective': 'Post my question, "is car necessary in NYC", in a subreddit where I\'m likely to get an answer',
            'gold_action': 'click [53] where [53] is link \'Submit\'',
            'url': 'http://localhost:3000',
            'previous_actions': []
        },
        'next_state(Ground_Truth)': 'The expected effect is that:\\n1. The user will be presented with a submission form to enter a title and body for their question, "is car necessary in NYC".\\n2. The user will input the title into the \'Title *\' field and may also provide additional context in the \'Body\' section.\\n3. The user will utilize the \'Choose one…\' combobox to select a relevant subreddit that is likely to yield responses to their query regarding car necessity in NYC, potentially selecting a forum like r/AskNYC or r/NYC.\\n4. Once all required fields are filled out, the user will be able to submit their question by clicking the \'Create submission\' button, successfully posting their inquiry to the chosen subreddit for community feedback.\\n5. The interface will facilitate the user\'s posting experience, ensuring they have all needed tools and options for effective submission.',
        'predicted_next_state(Ours)': 'The webpage will transition to a submission form for creating a new post, allowing the user to input their question: \\n1. A heading will confirm that the user is in the submission form, providing clear guidance for posting.\\n2. Input fields will be available for the post title and body, where the user can enter "is car necessary in NYC" and any additional context.\\n3. A dropdown or selection menu will be present to choose the appropriate subreddit or forum for the post.\\n4. A submit button will be prominently displayed, enabling the user to finalize and publish their question to the selected community.\\n5. The interface will be user-friendly, ensuring the posting process is straightforward and accessible.',
        'predicted_next_state(GPT-4o-Mini)': 'The webpage will transition to a new RootWebArea dedicated to creating a post. This will include:\\n1. A prominent heading indicating that the user is in the "Create Post" or "Submit a Question" section.\\n2. Input fields will be displayed for entering the post title and body, where the user can type their question "is car necessary in NYC."\\n3. A dropdown menu or selection option will be available to choose the most appropriate subreddit or forum for posting the question.\\n4. A "Submit" or "Create Post" button will be present to finalize the submission.\\n5. The overall layout will be clean and focused on facilitating the post creation process.',
        'predicted_next_state(GPT-4o)': 'The expected effect is that:\\n1. The RootWebArea will now be focused on a page titled "Submit a new post" or similar.\\n2. The main section will include input fields for the post title and body, allowing the user to enter their question "is car necessary in NYC".\\n3. There will be a selection dropdown or input field to choose the appropriate subreddit or forum where the question should be posted.\\n4. A submission button will be available to complete the posting process.\\n5. The interface will provide guidance on posting rules and help the user navigate the submission process effectively.'
    }
]

print(f"Sample data loaded: {len(sample_langagi_data)} examples")
print("Each example contains web state observations and predictions from 3 different LLMs")

## Sample Data

Below is sample data from the LangAGI-Lab/human_eval-next_state_prediction_w_gpt4o dataset. The three samples cover an e-commerce store (One Stop Market), GitLab, and a Postmill forum. In each one, the task is to predict the next page state after the web agent performs an action.

In [None]:
import json
from typing import Dict, List, Any
import pandas as pd
from pathlib import Path

print("Multi-LLM State Prediction Dataset Processing")
print("=" * 50)

# Multi-LLM State Prediction Reference Dataset

This notebook demonstrates the processing of web agent state transition data for multi-LLM hidden-state trajectory research. The dataset contains examples of web agent interactions with state predictions from multiple language models (GPT-4o-Mini, GPT-4o, and a proprietary model).

## Overview

The original script processes the LangAGI-Lab/human_eval-next_state_prediction_w_gpt4o dataset, which contains web agent state transition examples. Each example includes:
- Current state observations from web pages
- Objectives for the web agent
- Predicted next states from 3 different LLMs
- Ground truth outcomes
- Gold actions (correct actions to take)

This notebook provides a self-contained version that demonstrates the data processing pipeline without requiring external files.