AceBench

Empowering Edge Agents: A Multi-Dimensional Benchmark for Agentic Edge-Cloud Collaboration.

128 executable tasks · 100 privacy-annotated · 6 strategies · Utility · Cost · Privacy.

AceBench is a benchmark for edge-cloud collaboration in LLM agents. Cloud models reason best but see all your data; on-device edge models keep data local but are weaker — collaboration promises the best of both, if you organize it well. AceBench measures exactly that, but in a setting prior edge-cloud studies skip: real agent execution, where agents work over live workspaces (files, tools, commands, APIs, app states) and every cloud call mid-trajectory can expose accumulated local context.

We evaluate six execution strategies — pure edge, pure cloud, and four edge-cloud collaboration patterns — across 128 executable tasks (100 with fine-grained privacy annotations) on an OpenClaw harness, scoring every run on three axes at once: task utility, resource cost, and privacy exposure. The result exposes how when the cloud is invoked and what context is sent trade capability against cost and leakage.

Design Highlights

	What we test	Why it matters
🦞 OpenClaw-native	The real OpenClaw agent loop — bash, browser, file ops, APIs, and reusable `SKILL.md` skills — driving a live local workspace	Tasks need long-horizon planning, state tracking, and error recovery; cloud calls land mid-trajectory over accumulated workspace context, not on a static prompt
🔐 Privacy-aware	100 tasks annotated with sensitivity units (PII + org secrets)	Every cloud invocation is a potential leakage channel — we audit what crosses the boundary
⚖️ Multi-dimensional	Utility · Cost · Privacy, reported jointly	No single number hides the trade-off; you see the whole Pareto picture
🔀 Strategy-centric	6 edge / cloud / edge-cloud strategies, one task suite	Isolates how collaboration is organized from which models are used
📦 Reproducible	Each task runs in its own Docker container	Graders are injected only after the agent finishes — never visible during execution

Tasks

128 executable tasks across 6 categories (Chinese & English); 100 carry fine-grained privacy annotations. Each is a self-contained Markdown file under tasks/ACE_Bench/ — a prompt, an inline grade() verifier, and a workspace path.

Category	#	Example tasks	Core challenges
Office & Daily Tasks	36	ambiguous contact email, meeting notes, expense report, daily summary	Multi-source aggregation, clarification, structured output
Information Search & Gathering	34	email search, competitive intelligence, paper affiliation lookup, CRM bug hunt	Web + local data reconciliation, source verification
Safety & Security	21	leaked API-key detection, prompt injection, malicious skill, HIPAA/PHI referral	Adversarial robustness, credential awareness, refusal
Data Analysis	14	order-profit analysis, month-end reconciliation, quarterly business insight	Spreadsheet reasoning, state verification
Development & Operations	13	system health check, automation-failure recovery, LLM API gateway skill	Undocumented setups, debugging, skill creation
Automation	10	flight booking, n8n workflow report, scheduled-briefing skill	Long-horizon orchestration, recovery

Scoring. Every run is graded on three dimensions at once:

Utility — completion score + Pass³ (3-trial consistency), from each task's own verifier.
Cost — cloud tokens & USD, plus edge-side FLOPs.
Privacy — how much annotated sensitive context (PII / org secrets) reaches the cloud.

Leaderboard

Edge = Qwen3.5-9B / 27B, Cloud = GPT-5.4, judge = GPT-5.4-mini, averaged over 3 runs. Cloud Tok. = raw / cache / output (millions); Cost in USD; Edge FLOPs in PetaFLOPs; Utility & Privacy in %.

Edge-cloud collaboration beats both single-side extremes on the utility–privacy trade-off; Sketch-Guided keeps privacy at 100%, Task-Routing is the most balanced, and Adaptive Assistance gets the best Pass³ at <10% of Cloud-only cost.

Quick Start

AceBench runs each task in an isolated Docker container bundling the OpenClaw harness and per-task mock services.

1. Install dependencies & load the image

The prebuilt container image is hosted on Hugging Face as a docker save tarball (Hugging Face is not a Docker registry, so download + docker load instead of docker pull):

cd AceBench
conda create -n acebench python=3.13 -y
pip install -r requirements.txt
# download the image
hf download chengpingan/AceBench \
    Images/acebench-openclaw-v1.0.tar.gz --repo-type dataset --local-dir .

# download the workspaces (single tarball) and extract
hf download chengpingan/AceBench \
    workspace/ACE_Bench.tar.gz --repo-type dataset --local-dir .
tar -xzf workspace/ACE_Bench.tar.gz      # extracts into workspace/ACE_Bench/

docker load -i Images/acebench-openclaw-v1.0.tar.gz   # loads acebench-openclaw:v1.0 (must match DOCKER_IMAGE in .env)

2. Configure keys — copy .env.example to .env and fill in:

OPENROUTER_API_KEY=...                 # cloud collaborator (any OpenAI-compatible provider)
JUDGE_API_KEY=...                      # LLM-as-a-judge for utility & privacy
JUDGE_MODEL=gpt-5.4-mini

Edge / cloud model endpoints live in my_api.json (e.g. a local vLLM server) and are passed via --models-config.

3. Prepare task assets

bash script/prepare.sh

4. Run — pick a strategy via --run-mode. Six strategies share one suite (full commands in script/run.sh):

Strategy	`--run-mode`	Cloud use	Idea
Edge-only	`local-only`	none	All steps on the edge model
Cloud-only	`cloud-only`	every step	Capability upper bound; highest exposure
Sketch-Guided	`pipeline-plan-executor`	once, upfront	Cloud drafts a high-level sketch; edge executes
Task-Routing	`query-router`	once, offline	RouteLLM routes the whole task to edge or cloud
Step-Routing	`step-router`	per uncertain step	Edge-first; escalate to cloud on high token entropy
Adaptive Assistance	`advisor`	on demand	Edge asks the cloud for a plan/hint when stuck

# Edge-only baseline
python3 eval/run_batch.py --category ACE_Bench --parallel 16 --repeat 3 \
  --edge-model vllm/Qwen/Qwen3.5-27B --models-config my_api.json \
  --output-dir output/edge_only/qwen3.5-27b

# Edge-cloud (e.g. adaptive cloud assistance)
python3 eval/run_batch.py --category ACE_Bench --parallel 8 --repeat 3 \
  --run-mode advisor \
  --edge-model vllm/Qwen/Qwen3.5-27B --cloud-model your-provider/gpt-5.4 \
  --models-config my_api.json \
  --output-dir output/adaptive-assistant/qwen3.5-27b_to_gpt5.4

Single-task runs (--task tasks/ACE_Bench/ACE_Bench_task_44_ambiguous_contact_email.md), task filters (--task-filter), and privacy-judge timing (--privacy-judge-mode {inline,deferred,off}) are also supported.

Check the Results

Per-task outputs land under output/<run>/<task_id>/... (scores, token/cost usage, agent trace, produced files), with a per-category and global summary generated automatically once the run finishes.

Acknowledgements

AceBench stands on the shoulders of a remarkable open-source agent community, and we are deeply grateful for it. The OpenClaw harness gives us a real, full-featured agent runtime — tools, skills, and a live workspace — to build on. Our tasks and evaluation design draw inspiration and adapted material from a series of outstanding agent benchmarks: Claw-Eval, WildClawBench, QwenClawBench, LiveClawBench, PinchBench, and ClawBench. Their meticulous task curation, rigorous grading, and reproducible harness design set the bar for trustworthy agent evaluation, and made the privacy-aware, edge-cloud extension in AceBench possible.

License

Released under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AceBench

Design Highlights

Tasks

Leaderboard

Quick Start

Check the Results

Acknowledgements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
assets		assets
eval		eval
script		script
skills		skills
tasks/ACE_Bench		tasks/ACE_Bench
utils		utils
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
my_api.json		my_api.json
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

AceBench

Design Highlights

Tasks

Leaderboard

Quick Start

Check the Results

Acknowledgements

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages