This repo is the official implementation of our paper, "Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to the User's Digital World", and its follow-ups.
Important
We believe the next leap for always-on LLM agents lies in scaling agent context: expanding the slice of the user's digital world an assistant can continuously perceive, reason over, and act on.
Claw-Anything operationalizes this view, evaluating always-on LLM agents across three axes of real-world context: long-horizon event streams, various interconnected services, and cross-device interaction (e.g., GUI and CLI). Even the strongest model, GPT-5.5, reaches only 34.5% pass@1, revealing substantial capability gaps. Alongside the benchmark, we release an automated data-generation pipeline that produces 2,000 training environments and improves the base model's pass@1 by 23.7 points.
Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to the User's Digital World
Yusong Lin, Xinyuan Liang, Haiyang Wang†, Qipeng Gu, Siqi Cheng
Jiangui Chen, Shuzhe Wu, Feiyang Pan, Lue Fan, Sanyuan Zhao†, Dandan Tu†
† Corresponding authors.
Primary contact: Yusong Lin (linyusong4@huawei.com), Haiyang Wang (haiyang.wang@huawei.com)
- [2026-05-27] TODO: One-click evaluation for easier use. It's not good enough yet; stay tuned. :)
- [2026-05-26] The arXiv preprint has been released.
- [2026-05-26] Data pipeline has been released: the two-stage `build-persona` → `gen-eval` flow scales to 2,000 training environments and powers the benchmark's data generation.
- [2026-05-26] Benchmark and Training Environments have been released.
Claw-Anything is an end-to-end framework that does two things with one codebase:
- Benchmarks AI agents on realistic, always-on personal-assistant tasks: long-horizon activity histories, dozens of interdependent backend services, and integrated GUI+CLI interaction across devices.
- Generates those tasks automatically from a persona seed: months of simulated user activity, persistent fixtures, executable graders, and noise (irrelevant or conflicting events) included.
| Module | Role |
|---|---|
| `benchmark/` | Evaluate: 200 human-verified tasks split into `skill/` (the agent dynamically loads tools on demand), `tool/` (the agent is pre-loaded with the full tool set), and `gui/` (Android GUI tasks) |
| `gen/` | Build data: `build-persona` + `gen-eval` two-phase pipeline; 2,000 training environments at scale |
| `runner/` | Execute: Think → Act → Observe loop, OpenAI-compatible model backend, per-trial Docker sandbox with port isolation |
| `graders/` | Score: multi-dimensional grading (completion · robustness · communication · safety) + LLM-as-judge + Pass^k aggregation |
| `mock_services/` | Simulate: 35 FastAPI mocked services (Gmail, Calendar, Slack, Notion, Feishu, WeChat, Zotero, ...) all sharing a frozen-time fixture base |
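The Think → Act → Observe loop named in the `runner/` row can be pictured with a minimal sketch. This is not the project's runner: `run_loop`, `Step`, the `policy` callable, and the `tools` mapping are hypothetical names chosen for illustration, and the model is replaced by a plain Python function.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One decision from the (hypothetical) policy: call a tool or finish."""
    tool: Optional[str] = None      # tool name to call, or None when finished
    argument: str = ""              # single string argument, for simplicity
    answer: Optional[str] = None    # final answer when the episode ends


def run_loop(policy: Callable[[list[str]], Step],
             tools: dict[str, Callable[[str], str]],
             max_turns: int = 8) -> tuple[Optional[str], list[str]]:
    """Alternate Think (policy), Act (tool call), Observe (append the result)."""
    observations: list[str] = []
    for _ in range(max_turns):
        step = policy(observations)                    # Think
        if step.tool is None:
            return step.answer, observations           # episode finished
        if step.tool not in tools:
            observations.append(f"error: unknown tool {step.tool}")
            continue
        result = tools[step.tool](step.argument)       # Act
        observations.append(result)                    # Observe
    return None, observations                          # turn budget exhausted


# Toy usage: a scripted policy that looks up one event, then answers.
def scripted_policy(observations: list[str]) -> Step:
    if not observations:
        return Step(tool="calendar", argument="today")
    return Step(answer=f"Next event: {observations[-1]}")


toy_tools = {"calendar": lambda day: f"standup at 09:30 ({day})"}
final_answer, seen = run_loop(scripted_policy, toy_tools)
```

The `max_turns` budget is a design choice of the sketch: it guarantees termination even when the policy never produces a final answer.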
Existing agent benchmarks expose only narrow, static slices of user state. Claw-Anything expands agent context along three axes simultaneously:
- Long-horizon event streams: months of fine-grained user activity linking past and present, forcing agents to reason over an evolving timeline.
- Interconnected services: information is scattered across multiple stateful backends and signals from different services may conflict, demanding cross-service reconciliation and coordinated actions rather than single-API tool-use.
- Cross-device interaction (GUI + CLI): devices fragment the user's digital world into silos; a truly attentive assistant must weave them together across heterogeneous GUI and CLI surfaces, acting as a connector across the user's daily life.
This expanded scope also unlocks evaluation of proactive assistance: tasks that reward acting before an explicit user request.
Left: environment. The environment comprises connected devices with system event streams and multiple services with persistent states and service-specific histories.
Right: automated data pipeline. From a persona-grounded initial state, the pipeline iteratively samples task or noise templates and uses an LLM-based simulator to adapt events and update the world state. A final simulation produces the task query, reference solution, and grader; automatic filtering yields task instances, with optional human verification for benchmark cases.
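The iterative sampling loop in the pipeline description can be sketched as follows. This is an illustration only: `simulate_world`, `Template`, `World`, and `fake_simulator` are hypothetical names, the LLM-based simulator is replaced by a deterministic stub, and reading a noise ratio as "noise events per task event" is an assumption rather than the pipeline's documented behaviour.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Template:
    name: str
    kind: str  # "task" or "noise"


@dataclass
class World:
    """Persona-grounded state: here just an ordered event log."""
    persona: str
    events: list[str] = field(default_factory=list)


def fake_simulator(world: World, template: Template) -> str:
    """Stand-in for the LLM-based simulator: adapts a template to the persona."""
    return f"[{template.kind}] {template.name} for {world.persona}"


def simulate_world(persona: str, tasks: list[Template], noise: list[Template],
                   rounds: int, noise_ratio: int, seed: int = 0) -> World:
    """Each round adds one task event plus `noise_ratio` noise events (assumed)."""
    rng = random.Random(seed)          # seeded so the sketch is reproducible
    world = World(persona=persona)
    for _ in range(rounds):
        picks = [rng.choice(tasks)] + [rng.choice(noise) for _ in range(noise_ratio)]
        rng.shuffle(picks)             # interleave signal and noise
        for template in picks:
            world.events.append(fake_simulator(world, template))
    return world


world = simulate_world(
    persona="sarah_chen_pm",
    tasks=[Template("reschedule_meeting", "task"), Template("pay_invoice", "task")],
    noise=[Template("newsletter", "noise"), Template("spam_call", "noise")],
    rounds=3, noise_ratio=2,
)
```

In this toy run the event log ends up with three task events and six noise events; the real pipeline additionally derives the task query, reference solution, and grader from a final simulation, which the sketch omits.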
| Benchmark | Event Stream | Device Interfaces | # Services (avg. / max.) | Proactive | # Context Length (words) | # Ins (Eval) | # Ins (Train) |
|---|---|---|---|---|---|---|---|
| ClawBench | β | CLI | 1.6 / 5 | β | 2.2k | 313 | 0 |
| WildClawBench | β | CLI | 0.5 / 3 | β | 2.6k | 60 | 0 |
| PinchBench | β | CLI | 0.1 / 3 | β | 1.7k | 53 | 0 |
| ClawMark | β | CLI | 3.9 / 5 | β | 2.0k | 100 | 0 |
| QwenClawBench | β | CLI | 0.3 / 6 | β | 12.1k | 100 | 0 |
| Claw-Eval | β | CLI | 1.3 / 6 | β | 5.3k | 300 | 0 |
| Claw-Anything (ours) | ✓ | CLI + GUI | 10.1 / 18 | ✓ | 191.7k | 200 | 2000 |
- 200 human-verified evaluation tasks spanning patrol, decision-making, and multi-service coordination.
- 2,000 training environments generated by the pipeline for downstream training.
We evaluate state-of-the-art open- and closed-source models under a unified OpenHarness framework for fair comparison. Bold marks the best result in each column within each subgroup.
| Model | # Params | Score | Pass@1 | Pass@3 | Pass^3 | # Tokens (I / O) |
|---|---|---|---|---|---|---|
| Open-Source | | | | | | |
| Qwen3.5-27B | 27B | 0.50 | 9.8 | 19.0 | 2.0 | 83.8M / 0.9M |
| MiniMax-M2.7 | 229B | 0.52 | 13.5 | 28.5 | 3.5 | 79.0M / 1.1M |
| Qwen3.6-27B | 27B | 0.58 | 22.5 | 42.0 | 6.0 | 99.4M / 2.0M |
| Kimi-K2.6 | 1.1T | 0.57 | 22.8 | 44.0 | 6.5 | 178.1M / 2.3M |
| GLM-5.1 | 754B | 0.59 | 31.7 | 47.0 | 17.0 | 125.0M / 2.2M |
| Claw-Anything-Qwen3.5-27B (ours) | 27B | 0.61 | 33.5 | 52.0 | 15.5 | 117.8M / 1.1M |
| Gain over Qwen3.5-27B | - | +0.11 | +23.7 | +33.0 | +13.5 | - |
| Closed-Source | | | | | | |
| Claude Sonnet 4.5 | - | 0.59 | 28.0 | 45.0 | 12.0 | 149.0M / 1.5M |
| Claude Opus 4.7 | - | 0.62 | 31.8 | 48.0 | 13.5 | 123.5M / 1.5M |
| GPT-5.5 | - | 0.65 | 34.5 | 53.5 | 20.0 | 77.7M / 0.9M |
- State-of-the-art frontier models still leave significant headroom on always-on personal-assistant tasks.
- Our generated training environments are effective: fine-tuning Qwen3.5-27B on 2,000 of them yields Claw-Anything-Qwen3.5-27B, which posts the best open-source Score, Pass@1, and Pass@3 in this comparison (+23.7 Pass@1 over the base model) and is competitive with leading closed-source systems.
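For readers unfamiliar with the column names, the sketch below shows one common way to compute Pass@1, Pass@k, and Pass^k from per-task trial outcomes. It assumes k equals the number of trials per task, that Pass@1 is the mean per-trial success rate, that Pass@k counts tasks solved in at least one trial, and that Pass^k counts tasks solved in every trial; the benchmark's exact aggregation may differ. `aggregate` and the toy outcomes are illustrative and not taken from the repository.

```python
def aggregate(outcomes: dict[str, list[bool]]) -> dict[str, float]:
    """Return percentages; every task must have the same, non-zero trial count."""
    trial_counts = {len(trials) for trials in outcomes.values()}
    if len(trial_counts) != 1 or 0 in trial_counts:
        raise ValueError("expected a non-empty, uniform number of trials per task")
    n_tasks = len(outcomes)
    k = trial_counts.pop()
    total_successes = sum(sum(trials) for trials in outcomes.values())
    return {
        # share of all individual trials that succeeded
        "pass@1": 100 * total_successes / (n_tasks * k),
        # share of tasks solved in at least one of the k trials
        "pass@k": 100 * sum(any(t) for t in outcomes.values()) / n_tasks,
        # share of tasks solved in all k trials (a consistency measure)
        "pass^k": 100 * sum(all(t) for t in outcomes.values()) / n_tasks,
    }


toy = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
    "task_d": [False, True, True],
}
metrics = aggregate(toy)
```

Under these definitions Pass^k can never exceed Pass@1, and Pass@1 can never exceed Pass@k, which matches the ordering of the three columns in the table above.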
Requires Python 3.11+ and (optionally) Docker for the trial-in-container sandbox. This project uses uv for dependency management.
```bash
# 1. Install uv once (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone the repo and enter the package directory
git clone https://github.com/LiberCoders/CLaw-Anything.git
cd CLaw-Anything

# 3. Create the venv and install the package
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[mock,sandbox]"

# 4. Configure the model endpoint
cp config.example.yaml config.yaml
# edit config.yaml: api_key / base_url / model_id

# 5. Build a trial-in-container image (one-time; pick the agent backend you'll use)
claw-anything build-image                       # default: --agent openharness-ext (image claw-anything-oh-ext)
claw-anything build-image --agent loop          # smallest image: claw-anything-loop
claw-anything build-image --agent openharness   # vanilla OH: claw-anything-oh
```

The OH-Ext build needs an `adb` binary and the OpenHarnessExtended source. Either let the script clone OH-Ext into `vendor/` and supply `ADB_PATH`, or set both:

```bash
OH_EXT_DIR=$HOME/code/OpenHarnessExtended \
ADB_PATH=$HOME/android-sdk/platform-tools/adb \
scripts/build_oh_ext_image.sh
```

The image expects the OH-Ext working copy to be on branch `main-clawgui`; the build script prints a warning otherwise. Sample OH settings file: `examples/oh-settings.example.json` (copy and fill in `api_key`, `base_url`, etc.).
Available extras (declared in pyproject.toml):
| Extra | When to install | Pulls in |
|---|---|---|
| `mock` | Required: needed by all `run` / `batch` / `gen-*` commands | `fastapi`, `uvicorn`, `pypdf`, `trafilatura`, `requests` |
| `sandbox` | Recommended: required for `--trial-in-container` | `docker` |
| `web` | Optional: only if you exercise the `web_real` mock service | `trafilatura`, `requests` |
| `openharness` | Optional: only if `agent_type: openharness` or `openharness-ext` in `config.yaml` | `openharness-ai` |
| `dev` | Optional: only if you run `pytest tests/` | `pytest` |
So the typical install is `uv pip install -e ".[mock,sandbox]"`. Add `,dev` if you'll run the test suite, `,openharness` if you'll use the OH agent backend.
After install you can either `source .venv/bin/activate` and call `claw-anything ...` directly, or use `uv run claw-anything ...` to let uv manage the environment for you.
The benchmark is split into three subsets. `claw-anything batch` without `--tasks-dir` runs the full 200-task suite:
- `skill` (100, CLI, `prompt.skill_mode = true`)
- `tool` (50, CLI, `prompt.skill_mode = false`)
- `gui` (50, Android GUI, forced to `openharness-ext`; needs an emulator + `--oh-settings`; see Run mobile GUI / Android tasks)

Each subset writes to its own trace subdirectory. Pass `--cli-only` to run only the CLI subsets (150 tasks). Note that `batch` always runs trials in containers: there is no `--trial-in-container` flag (only `run` exposes it).
```bash
# Full benchmark (200 tasks: skill + tool + gui)
claw-anything batch \
  --config config.yaml \
  --oh-settings /path/to/oh-settings.json \
  --trials 3 \
  --parallel 10

# CLI subsets only (150 tasks: skill + tool)
claw-anything batch \
  --config config.yaml \
  --cli-only \
  --trials 3 \
  --parallel 10
```

If `--cli-only` is omitted and the gui subset's prerequisites aren't met (empty `android.emulator_pool`, or no `--oh-settings`), the suite fails fast at second 0 with a clear message, so you don't burn 150 CLI tasks before discovering the gui phase can't start.
Output:
```
traces/loop_<model>_<ts>/
├── skill/                  # benchmark/skill, prompt.skill_mode = true
│   ├── batch_results.json
│   └── batch_summary.json
├── tool/                   # benchmark/tool, prompt.skill_mode = false
│   ├── batch_results.json
│   └── batch_summary.json
└── gui/                    # benchmark/gui, agent forced to openharness-ext (skipped with --cli-only)
    ├── batch_results.json
    └── batch_summary.json
```
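Given the layout above, the per-subset summaries can be collected with a few lines of Python. The sketch below is a convenience illustration, not part of the CLI: `load_summaries` is a hypothetical helper, and it makes no assumption about the keys inside each `batch_summary.json` beyond the file being valid JSON.

```python
import json
from pathlib import Path
from typing import Any


def load_summaries(run_dir: Path) -> dict[str, Any]:
    """Map each subset directory name to its parsed batch_summary.json."""
    summaries: dict[str, Any] = {}
    for subset in ("skill", "tool", "gui"):
        summary_path = run_dir / subset / "batch_summary.json"
        if summary_path.is_file():            # gui/ is absent with --cli-only
            summaries[subset] = json.loads(summary_path.read_text(encoding="utf-8"))
    return summaries
```

Calling `load_summaries(Path("traces/<run_dir>"))` on a CLI-only run would return entries for `skill` and `tool` only, since the `gui/` directory is skipped in that mode.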
Or run only one subset:
```bash
claw-anything batch --tasks-dir benchmark/skill --config config.yaml --trials 3 --parallel 10
claw-anything batch --tasks-dir benchmark/tool  --config config.yaml --trials 3 --parallel 10
claw-anything batch --tasks-dir benchmark/gui   --config config.yaml --agent openharness-ext --oh-settings /path/to/oh-settings.json --trials 3 --parallel 10
```

To resume or repair a previous batch run, point at its trace dir with one of:

```bash
claw-anything batch --tasks-dir benchmark/skill --trace-dir traces/<prev_run>/ --continue       # skip completed
claw-anything batch --tasks-dir benchmark/skill --trace-dir traces/<prev_run>/ --rerun-errors   # only failed
```

To run an agent on a single task:

```bash
# Loop agent, no sandbox (mock services started locally)
claw-anything run --task examples/ready_to_run/T001_demo --config config.yaml

# Loop agent, inside Docker (trial-in-container)
claw-anything run --task examples/ready_to_run/T001_demo --config config.yaml --trial-in-container

# OpenHarness agent (vanilla, trial-in-container)
# Requires: scripts/build_oh_image.sh (one-time)
claw-anything run \
  --task examples/ready_to_run/T001_demo \
  --config config.yaml \
  --agent openharness \
  --trial-in-container \
  --oh-settings /path/to/oh-settings.json

# OpenHarness-Ext agent (GUI/mobile tasks, trial-in-container)
# Requires: scripts/build_oh_ext_image.sh (one-time)
claw-anything run \
  --task examples/ready_to_run/T001_demo \
  --config config.yaml \
  --agent openharness-ext \
  --trial-in-container \
  --oh-settings /path/to/oh-settings.json

# Re-grade an existing trace
claw-anything grade --trace traces/<dir>/<trace>.jsonl --task examples/ready_to_run/T001_demo
```

The two-phase pipeline turns a single persona YAML into a fully populated digital world plus eval tasks with executable graders.
```bash
# Phase 1: build a gold environment from a persona
claw-anything build-persona \
  --persona personas/sarah_chen_pm_persona.yaml \
  --seed-tasks seed_tasks/ \
  --rounds 30 \
  --seed-noise seed_noise/ \
  --noise-ratio 2 \
  --output gold_envs/sarah_chen_pm/ \
  --config config.yaml

# Phase 2: generate eval tasks from the gold environment
claw-anything gen-eval \
  --env gold_envs/sarah_chen_pm/ \
  --seed-tasks seed_tasks/ \
  --output gen_tasks/sarah_chen_pm_simple/ \
  --max-tasks 20 \
  --difficulty simple \
  --execution-date 2026-04-03 \
  --config config.yaml

# Then evaluate the generated tasks
claw-anything batch \
  --tasks-dir gen_tasks/sarah_chen_pm_simple/ \
  --config config.yaml \
  --trials 3 --parallel 10
```

Tasks whose `task.yaml` declares `task_env: [mobile_gui]` drive an Android emulator via `adb`. They require the OH-Ext agent and image:
```bash
# In config.yaml, list the available emulator serials:
#   android:
#     emulator_pool:
#       - emulator-5554
#       - 127.0.0.1:5555   # TCP-shaped serials trigger `adb connect` before each trial
claw-anything run \
  --task gen_tasks/<mobile_gui_task>/ \
  --config config.yaml \
  --agent openharness-ext \
  --trial-in-container \
  --oh-settings /path/to/oh-settings.json
```

The host calls `init_gui_task()` to inject calendar events, contacts, etc. into the emulator before the agent starts; the trial container then runs the OH-Ext agent against that prepared device.
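The distinction between local serials such as `emulator-5554` and TCP-shaped serials such as `127.0.0.1:5555` can be illustrated with a small sketch. How the project actually detects TCP-shaped serials is not shown here, so the `host:port` pattern, `needs_adb_connect`, and `adb_commands` are assumptions made for illustration; the commands are only built as argument lists and never executed.

```python
import re

# Assumption: a "TCP-shaped" serial looks like host:port, e.g. 127.0.0.1:5555.
_TCP_SERIAL = re.compile(r"^[A-Za-z0-9.\-]+:\d{1,5}$")


def needs_adb_connect(serial: str) -> bool:
    """True for host:port serials, False for local ones such as emulator-5554."""
    return bool(_TCP_SERIAL.match(serial))


def adb_commands(serial: str) -> list[list[str]]:
    """Commands to prepare a device before a trial (returned, not executed)."""
    commands: list[list[str]] = []
    if needs_adb_connect(serial):
        commands.append(["adb", "connect", serial])        # attach over TCP first
    commands.append(["adb", "-s", serial, "wait-for-device"])
    return commands


local_plan = adb_commands("emulator-5554")
remote_plan = adb_commands("127.0.0.1:5555")
```

Returning argument lists instead of running them keeps the sketch side-effect free; each list could be passed to `subprocess.run` on a host where `adb` is installed.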
| Group | Command / Script | Purpose |
|---|---|---|
| Run | `run` | Run an agent on a single task (loop: `--trial-in-container`; OH: `--agent openharness[-ext] --trial-in-container --oh-settings`) |
| Run | `batch` | Run all tasks under `--tasks-dir` in parallel, N trials each (always in containers; no `--trial-in-container` flag). Defaults to the full 200-task suite (skill + tool + gui) when `--tasks-dir` is omitted; pass `--cli-only` to run just the CLI subsets (150 tasks). Supports `--continue` and `--rerun-errors` against an existing `--trace-dir`. |
| Run | `grade` | Re-grade an existing trace JSONL against a task |
| Run | `list` | List task ids under `--tasks-dir` |
| Images | `build-image` | Build the trial-in-container image for the selected agent (`--agent loop\|openharness\|openharness-ext`, default: `openharness-ext`) |
| Images | `scripts/build_{loop,oh,oh_ext}_image.sh` | Lower-level shell builders. `build_oh_ext_image.sh` needs `OH_EXT_DIR` and `ADB_PATH`. |
| Sandbox | `cleanup` | Remove all claw-anything trial containers (label `app=claw-anything`) |
| Generate | `build-persona` | Phase 1: adapt seed tasks to a persona, build a gold environment |
| Generate | `gen-eval` | Phase 2: generate evaluation tasks from a gold environment |
Common run flags: `--agent {loop, openai-compat, openharness, openharness-ext}` · `--trial-in-container` · `--docker-image` (override image name) · `--oh-settings PATH` (OH-only) · `--oh-disable-builtin-tools` (only expose claw-anything tools, deny all OH builtins) · `--proxy URL` (for model / judge API traffic) · `--judge-model` / `--no-judge`.
`claw-anything <cmd> --help` shows full options for each command.
```
src/claw_anything/          # core package
├─ cli.py                   # all CLI subcommands
├─ runner/                  # container_launcher, ServiceManager, dispatchers, OH plugin gen
├─ agents/                  # agent backends (loop · openharness · openharness-ext)
├─ task/mobile_gui/         # Android GUI init + adb inject helpers (calendar / contacts / …)
├─ graders/                 # grading framework (rule + LLM judge)
├─ gen/                     # build-persona + gen-eval pipeline
├─ models/                  # pydantic models (task, message, trace, scoring)
└─ trace/                   # JSONL trace reader/writer
mock_services/              # FastAPI mock services (CLI + GUI app shadows)
docker/oh/                  # patch_*.py: build-time patches baked into the OH image
                            #   patch_print_mode_usage.py: surface per-turn `usage` in stream-json
                            #   patch_openai_client.py: keep `stream_options.include_usage` with tools
                            #   patch_environment_date.py: honour CLAW_TASK_EXECUTION_DATE env var
scripts/                    # build_{loop,oh,oh_ext}_image.sh
Dockerfile.{loop,oh,oh_ext} # one Dockerfile per agent backend
benchmark/                  # 200 human-verified tasks
├─ skill/                   # 100 skill-mode CLI tasks (agent loads tools dynamically on demand)
├─ tool/                    # 50 tool-mode CLI tasks (agent is pre-loaded with the full tool set)
└─ gui/                     # 50 CLI + GUI tasks
personas/                   # hand-written persona YAMLs (input to build-persona)
seed_tasks/                 # abstract task templates (M000-Mxxx)
seed_noise/                 # noise templates injected during persona build
gold_envs/                  # outputs of build-persona (persona + fixtures)
gen_tasks/                  # outputs of gen-eval
examples/                   # minimal runnable examples + oh-settings.example.json (OH settings template)
template/                   # task.yaml / grader.py templates for authors
docs/                       # task authoring guides
```
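The `CLAW_TASK_EXECUTION_DATE` variable mentioned for `patch_environment_date.py` suggests that the task's execution date overrides the wall clock. A minimal sketch of that idea follows; `resolve_execution_date` is a hypothetical helper, and the ISO `YYYY-MM-DD` format is an assumption based on the `--execution-date 2026-04-03` example, not a description of what the patch actually does.

```python
import os
from datetime import date
from typing import Mapping, Optional


def resolve_execution_date(env: Optional[Mapping[str, str]] = None) -> date:
    """Prefer the frozen task date from the environment, else fall back to today."""
    env = os.environ if env is None else env
    raw = env.get("CLAW_TASK_EXECUTION_DATE")
    if not raw:
        return date.today()              # no override: use the real clock
    return date.fromisoformat(raw)       # raises ValueError on malformed input


frozen = resolve_execution_date({"CLAW_TASK_EXECUTION_DATE": "2026-04-03"})
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.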
- Hand-written tasks: copy `template/task_template.yaml` + `template/grader_template.py` and adapt them.
- Generated tasks: use the two-phase pipeline instead of writing tasks by hand.
See CONTRIBUTING.md for the full workflow. Bug fixes, new mock services, additional seed tasks, and persona templates are all welcome.
Claw-Anything is built on top of Claw-Eval: we reuse its task abstraction, mock-service scaffolding, and grader conventions as the starting point of this work, and extend them along three context-scaling axes (long-horizon event streams, interconnected services, and cross-device GUI + CLI) with an automated data-generation pipeline. We thank the Claw-Eval authors for open-sourcing a clean foundation to build on.
We also thank the broader community behind the open-source LLMs, agent harnesses, and mock-service inspirations that made this benchmark possible.
```bibtex
@article{lin2026clawanything,
  title   = {Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World},
  author  = {Lin, Yusong and Liang, Xinyuan and Wang, Haiyang and Gu, Qipeng and Cheng, Siqi and Chen, Jiangui and Wu, Shuzhe and Pan, Feiyang and Fan, Lue and Zhao, Sanyuan and Tu, Dandan},
  year    = {2026},
  journal = {arXiv preprint arXiv:2605.26086}
}
```

This project is licensed under the MIT License.



