# ComputerAgent HUD Integration for OSWorld

This notebook demonstrates how to use ComputerAgent with HUD for OSWorld benchmarking.
The ComputerAgent integration provides the same interface as OperatorAgent but is not tied to one provider: it works with Claude and OpenAI models, among others.

In [None]:
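# Install dependencies if needed. Run these in a terminal rather than in the notebook:
# each `!` command runs in its own subshell, so `!source` would not persist.
# uv venv
# source .venv/bin/activate
# uv sync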

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# Required environment variables:
# - HUD_API_KEY (for HUD access)
# - ANTHROPIC_API_KEY (for Claude models)
# - OPENAI_API_KEY (for OpenAI models)

from pprint import pprint
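
Before starting a run, it can help to confirm that the variables listed above are actually set. The following is a minimal sketch, not part of the integration: `missing_env_vars` and `EXPECTED_ENV_VARS` are hypothetical names introduced here, and the check only looks for unset or empty values. Depending on which models you run, you may not need every model key.

```python
import os

# Names and purposes are copied from the setup cell above.
EXPECTED_ENV_VARS = {
    "HUD_API_KEY": "HUD access",
    "ANTHROPIC_API_KEY": "Claude models",
    "OPENAI_API_KEY": "OpenAI models",
}


def missing_env_vars(expected, env=None):
    """Hypothetical helper: return the names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in expected if not env.get(name)]


# Report anything missing from the current process environment.
for name in missing_env_vars(EXPECTED_ENV_VARS):
    print(f"{name} is not set (needed for {EXPECTED_ENV_VARS[name]})")
```

The optional `env` argument exists only so the helper can be exercised against a plain dict instead of the real environment.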

## Quick single-task smoke test on OSWorld-Verified

The integration accepts Claude, OpenAI, UI-TARS, or composed models, just as ComputerAgent itself does:

In [1]:
from agent.integrations.hud import run_single_task

# Quick single-task smoke test on OSWorld-Verified-XLang
# You can swap "hud-evals/OSWorld-Verified-XLang" -> "hud-evals/SheetBench-V2" to test SheetBench.
await run_single_task(
    dataset="hud-evals/OSWorld-Verified-XLang",
    model="openai/computer-use-preview",  # or any supported model string
    task_id=155,  # open last tab task (easy)
)

🚀 See your agent live at:
https://app.hud.so/trace/426ed182-564d-4b12-b950-c551caeeb8a8

Running: Can you make my computer bring back the last tab I shut down?


2025-08-27 13:36:03,660 - agent.ComputerAgent - INFO - LLM processing started with 2 messages
2025-08-27 13:36:21,971 - agent.ComputerAgent - INFO - LLM processing started with 6 messages
Tool execution failed: Tool evaluate has an output schema but did not return structured content
Evaluation phase failed: [MCPToolResult(meta=None, content=[TextContent(type='text', text='Tool evaluate has an output schema but did not return structured content', annotations=None, meta=None)], structuredContent=None, isError=True)]


✅ Reward: 0.0

✓ Trace complete! View at: https://app.hud.so/trace/426ed182-564d-4b12-b950-c551caeeb8a8


## Run OSWorld-Verified in parallel

In [None]:
import uuid
from agent.integrations.hud import run_full_dataset

# Full dataset evaluation (runs via HUD's run_dataset under the hood)
job_name = f"osworld-test-{str(uuid.uuid4())[:4]}"

results = await run_full_dataset(
    dataset="hud-evals/OSWorld-Verified-XLang",  # You can also pass a Dataset or a list[dict]
    job_name=job_name,  # Optional; defaults to a timestamp for custom datasets
    model="openai/computer-use-preview",  # Or any supported model string
    max_concurrent=20,  # Tune to your infra
    max_steps=50,  # Safety cap per task
    split="train[:3]",  # Limit to just 3 tasks
)

# results is a list from hud.datasets.run_dataset; inspect/aggregate as needed
print(f"Job: {job_name}")
print(f"Total results: {len(results)}")
pprint(results[:3])  # preview
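
One way to aggregate the results is to collect a reward per task and summarize. The sketch below is an assumption-heavy illustration: the notebook does not show the shape of the items returned by `run_full_dataset`, so `summarize_rewards` (a hypothetical helper) assumes each item exposes a numeric `reward` either as an attribute or as a dict key, and treats a missing or `None` reward as `0.0`. The stand-in data is made up purely to exercise the helper; inspect a real result with `pprint` first and adjust the field name if it differs.

```python
from statistics import mean
from types import SimpleNamespace


def summarize_rewards(results, default=0.0):
    """Hypothetical helper: assumes each result has a numeric `reward` attribute or key."""
    rewards = []
    for item in results:
        if isinstance(item, dict):
            value = item.get("reward", default)
        else:
            value = getattr(item, "reward", default)
        # Treat a missing or None reward as the default.
        rewards.append(float(default if value is None else value))
    if not rewards:
        return {"count": 0, "mean_reward": None, "nonzero": 0}
    return {
        "count": len(rewards),
        "mean_reward": mean(rewards),
        "nonzero": sum(1 for r in rewards if r > 0),
    }


# Stand-in objects for illustration only; these are not real HUD results.
fake_results = [SimpleNamespace(reward=1.0), {"reward": 0.0}, SimpleNamespace(reward=None)]
summary = summarize_rewards(fake_results)
print(summary)
```

With real data, the call would be `summarize_rewards(results)`, provided the assumed `reward` field matches what the run returns.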