macbench


The first publicly published macOS-native computer-use benchmark for autonomous agents. As far as we know.

Paper: macbench: A macOS-Native Computer-Use Benchmark for Autonomous Agents — concept DOI 10.5281/zenodo.20094244 (CC-BY-4.0, single PDF bundles EN + 中文; also rendered at localkin.dev/papers/macbench).

369 task slots defined        ← matches OSWorld's task count exactly
171 implemented (v0.1.2)       ← runnable today (+11 finder stubs filled 2026-05-10)
198 stubbed (v0.1.2)           ← real prompts + categories; no setup.sh / eval.sh yet
                                 → filled in progressively v0.2 → v1.0
                                 (notes 31/31 ✓ finder 50/50 ✓ remaining: mail, settings, etc.)

15 categories            Finder · Safari · Mail · Notes · Calendar ·
                         Reminders · Settings · Terminal · Pages ·
                         Numbers · Keynote · Music · Photos · Maps ·
                         Multi-app
3 difficulty tiers       T1 single-step  ·  T2 multi-step  ·  T3 cross-app
agent-agnostic           any binary that takes a prompt → drives macOS

Reference scores

v0.2 (2026-05-11) — paper #11: grep-routed agents

v0.2 adds 10 web tasks (380-389), bringing the total to 379, and its companion release ships the grep-router stack (kinclaw + kinthink + a 478-action cerebellum skill library — see paper #11):

kinclaw + kinthink + cerebellum + Kimi-K2.6(cloud) on macbench v0.2
  STRICT:       182 / 379  =  48.0%   (incl. web subcategory at 80%)
  Total time:   76 minutes  (avg 12.0 s / task)
  LLM tokens:   ZERO on Layer-0-routed tasks (244 of 379)

The cross-OS comparison table updates accordingly:

Agent + Stack                             Benchmark                  Score            LLM tokens
kinclaw + kinthink + Kimi-K2.6            macbench v0.2 (macOS)      48.0% (182/379)  0 on hit path
kinclaw + Kimi-K2.5 (LLM-only, v0.1)      macbench v0.1 (macOS)      30.4% (112/369)  Full
Reference verifier (no LLM, ceiling)      macbench v0.1 (185 cov.)   84.3% (156/185)  0
Anthropic Computer Use (Claude Sonnet 4)  OSWorld-Verified (Ubuntu)  ~38%             Full
GPT-4o + Set-of-Mark                      OSWorld (Ubuntu)           ~12-15%          Full

Web subcategory (paper #11's marquee comparison vs OpenAI's Codex Chrome Extension): 8/10 PASS, 750 ms average, 0 LLM tokens.

v0.1 (2026-05-08) — first reference run

kinclaw v1.15.0 + Kimi-K2.5(cloud) on macbench v0.1
  IMPLEMENTED:  101 / 150  =  67.3%
  STRICT:       101 / 369  =  27.4%
  Total time:   ~95 minutes

kinclaw's strongest v0.1 categories: Finder (28/39 ≈ 72%), Reminders (75%), Settings (~68%), Calendar (>50% post-isolation). Notes / Mail / Pages / Numbers / Keynote were weakest — partly real agent limitations, partly v0.1 task-design choices (some tasks required infrastructure beyond bash + AppleScript that's now in v0.2).

Two scores, both honest

When you run the suite, the runner reports two separate pass rates:

  • IMPLEMENTED: P / I (X.X%) — passed P of I tasks that have setup.sh + eval.sh, ignoring stubs. The "interesting" score: what the agent did against tasks it could actually try.
  • STRICT: P / 369 (Y.Y%) — passed P of 369 total task slots, stubs counted as fail. The "long-game" score: progress against the full benchmark including unimplemented work.

Both numbers go in run.json. Leaderboards and blog posts must report both — STRICT alone hides agent capability behind v0.1's incomplete implementation; IMPLEMENTED alone hides how much of the benchmark is missing.
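
Both rates are easy to recompute from run.json. A minimal jq sketch, assuming a flat tasks array whose entries carry a status of pass / fail / stub — illustrative field names, not necessarily the shipped schema:

# Recompute IMPLEMENTED and STRICT from a run.json (field names assumed).
jq -r '
  (.tasks | map(select(.status == "pass")) | length) as $p
  | (.tasks | map(select(.status != "stub")) | length) as $i
  | "IMPLEMENTED: \($p)/\($i)   STRICT: \($p)/\(.tasks | length)"
' results/20260508-141530/run.json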

Per-task isolation

The runner snapshots PIDs of bench-touched apps (Safari, Mail, Notes, Calendar, Reminders, Music, Photos, Maps, TextEdit, Pages, Numbers, Keynote, System Settings) at startup. Between every task, it kills only the PIDs the bench itself spawned, leaving any pre-existing user instance untouched. This:

  1. Prevents the "agent does 5 tasks worth of work in one prompt" pollution we observed in the first run (root cause was actually pilot-soul memory + reuse of the same Safari window across tasks).
  2. Stops Notes / Calendar / Reminders from accumulating AppleScript hangs after ~5-10 invocations.
  3. Doesn't nuke the user's pre-existing app windows — if you're running bench while Safari is open with your work, that Safari stays alive.

The startup line (isolation: N pre-existing PIDs across 14 apps will be preserved) confirms the snapshot was captured.
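
In shell terms, the snapshot/diff logic amounts to roughly the following (a sketch only — the real runner does this in Go, and the app list is abbreviated):

BENCH_APPS=(Safari Mail Notes)   # real list: all bench-touched apps above

# Startup: record every pre-existing PID of bench-touched apps.
snapshot=$(for app in "${BENCH_APPS[@]}"; do pgrep -x "$app"; done | sort)

# Between tasks: kill only the PIDs that appeared after the snapshot.
current=$(for app in "${BENCH_APPS[@]}"; do pgrep -x "$app"; done | sort)
comm -13 <(printf '%s\n' "$snapshot") <(printf '%s\n' "$current") |
  xargs kill 2>/dev/null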

Warmup

./warmup.sh (or make warmup) does four things before bench:

  1. Force-quit every app the bench touches (so PID snapshot starts empty — strongest isolation).
  2. Wipe the bench sandbox (~/Desktop/kinbench/, ~/.kinbench/).
  3. Clean any KinBench-prefix data in app data stores (Notes / Reminders / Calendar / Mail / Photos / Music — leftover from prior runs that crashed before teardown).
  4. Probe each app via osascript with a 5-second timeout. Reports ✓ healthy / ⚠ HUNG / ✗ TCC denied per app.
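
The step-4 probe boils down to an Apple event wrapped in AppleScript's own timeout. A sketch — the real warmup also distinguishes a hang from a TCC denial, which this version collapses into one branch:

for app in Notes Reminders Calendar; do   # abbreviated app list
  if osascript -e "with timeout of 5 seconds" \
               -e "tell application \"$app\" to get name" \
               -e "end timeout" >/dev/null 2>&1; then
    echo "✓ $app healthy"
  else
    echo "⚠/✗ $app hung or TCC denied"
  fi
done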

make bench auto-runs warmup. Set SKIP_WARMUP=1 to skip (useful during eval-script iteration when you don't want to nuke state).

Reference verifier — platform ceiling, no agent

tools/reference_verifier.sh runs every notes-category task with a canonical osascript / shell solution INSTEAD of an agent. Total runtime: ~100 seconds for all 31 notes tasks. The PASS rate this produces is the platform ceiling — what's achievable on this Mac + iCloud combination via direct AppleScript, regardless of how good the agent is.

tools/reference_verifier.sh
  → 21/31 PASS (67.7%) — first reference baseline (2026-05-10)
  → 10 fails are platform timing / AS-dictionary limits (e.g.
    AppleScript `pinned` property removed in macOS 14+,
    Mail draft creation racing with iCloud sync, Notes UI
    keystroke timing variance).

Use this to (1) verify eval correctness without burning agent inference time, (2) distinguish "agent capability gaps" from "platform-locked tests", (3) iterate eval.sh changes in seconds not minutes.
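
For flavor, a canonical notes solution is typically a one-liner of this shape (illustrative, not a literal excerpt from tools/):

# Create the note directly — no agent, no LLM. eval.sh then checks exactly
# the same state it would check after an agent run.
osascript -e 'tell application "Notes" to make new note with properties {body:"KinBench reference note"}'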

OSWorld (NeurIPS 2024) is the de facto standard for desktop computer-use agents — but it benchmarks inside an Ubuntu/Windows VM. Nobody has published a comparable benchmark for macOS native apps. macbench fills that gap.

Why this exists

  • Apple Intelligence is rolling out, but Apple hasn't published a benchmark for measuring agent capability on Mac.
  • The OSWorld leaderboard tells you Claude Sonnet 4 hits ~38% on Linux desktop tasks. It tells you nothing about how the same model drives Finder, Mail, Calendar, Notes, System Settings — apps people actually use on Macs.
  • macbench measures that. Same three-file pattern as OSWorld (task.json + setup.sh + eval.sh), same evaluator-script philosophy. Different OS surface.

What it tests

Each task is a natural-language prompt — exactly what a user would type into a chat box — that the agent must complete by driving real macOS apps. The eval is deterministic: filesystem state, defaults read, sqlite queries, AppleScript Automation queries. No "ask another LLM if it looks right".

Sample tasks (all 50 in tasks/):

ID   Category    Difficulty   What
001  finder      T1           Rename a file
005  finder      T2           Compress 3 files into a zip
011  safari      T1           Search Google for a phrase
016  mail        T1           Compose a draft (don't send)
018  calendar    T1           Create event tomorrow at 12:30 PM
021  settings    T2           Turn on Do Not Disturb
029  multi-app   T3           Take screenshot → attach to draft email
048  multi-app   T3           Find file in Finder → email it as attachment
050  multi-app   T3           Pages doc → export as PDF

See tasks/ for the full set, and AUTHOR_GUIDE.md for the schema if you want to write more.
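
To make the "deterministic eval" claim concrete: an eval for the T1 rename task could be as small as the sketch below (assumed paths, not the shipped script).

#!/bin/bash
# Hypothetical eval.sh for 001-finder-rename. setup.sh would have created
# original.txt in the sandbox; the prompt asks the agent to rename it.
SANDBOX="$HOME/Desktop/kinbench/001-finder-rename"   # illustrative path
[ -f "$SANDBOX/renamed.txt" ]    || exit 1   # new name must exist
[ ! -e "$SANDBOX/original.txt" ] || exit 1   # old name must be gone
exit 0                                       # exit 0 == PASS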

Quickstart

You need:

  • macOS 14+ (Sonoma or newer) — macbench itself is intentionally macOS-only (its eval scripts use AppleScript + Mac apps). The agent you point at it can be anything; e.g. kinclaw is now cross-platform (macOS / Linux / Windows as of 2026-05-12), but macbench's tasks measure macOS Finder, Mail, Calendar, Notes, etc., so a Linux build of kinclaw won't have anything to drive here. Use the macOS build.
  • Go 1.22+ (only to compile the runner; tasks themselves are shell + AppleScript)
  • An agent binary that takes a prompt and drives macOS — e.g. kinclaw, or a wrapper script around any vision-LLM-driven framework
# 1. Clone + build
git clone https://github.com/LocalKinAI/macbench
cd macbench
make build

# 2. One-time AppleScript Automation TCC priming (Mail / Notes /
#    Calendar / Reminders / Safari) — click "Allow" on each popup
make warmup

# 3. Run the benchmark with your agent
make bench AGENT=/path/to/kinclaw \
           AGENT_ARGS='-soul pilot.soul.md -exec {prompt}'

# Or: just N tasks
make bench AGENT=./kinclaw AGENT_ARGS='-exec {prompt}' TASKS=001,005,016

# Or: with screen recording (mp4 per task)
make bench-record AGENT=./kinclaw AGENT_ARGS='-exec {prompt}'

Output:

macbench: 50 task(s), agent=kinclaw, args="-soul pilot.soul.md -exec {prompt}", record=false
──────────────────────────────────────────────────────────────────────
  ✓ 001-finder-rename              T1   1284ms
  ✓ 002-safari-open-url            T1   3122ms
  ✗ 005-finder-zip                 T2   8941ms [eval] expected ≥3 files in archive
  ...
──────────────────────────────────────────────────────────────────────
PASSED: 19 / 50  (38.0%)
→ wrote results/20260508-141530/run.json

Per-task report saved to results/<timestamp>/run.json. If -record was on, mp4s land in results/<timestamp>/recordings/.

How the agent is invoked

-agent-args is a template. The literal string {prompt} gets replaced with the task's natural-language prompt at run time. Tokens are split on whitespace (no shell quoting), so each {prompt} becomes exactly one argv slot.

# kinclaw (LocalKinAI agent)
-agent kinclaw \
-agent-args "-soul pilot.soul.md -exec {prompt}"

# Anthropic Computer Use wrapper (hypothetical)
-agent anthropic-cua \
-agent-args "--task {prompt} --max-tokens 4096"

# Any custom shell wrapper
-agent ./my-wrapper.sh \
-agent-args "{prompt}"

If your agent's CLI shape isn't expressible as a single template, write a 3-line shell wrapper. macbench treats the agent as a black box.
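
For example, a hypothetical wrapper for an agent that wants JSON on its CLI (agent name and flag are made up; jq does the quoting):

#!/bin/bash
# my-wrapper.sh — macbench delivers the whole prompt in one argv slot ($1);
# reshape it into whatever the underlying agent expects.
exec some-agent --json "$(jq -n --arg p "$1" '{instruction: $p}')"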

Permissions

macbench tasks drive real macOS apps, so your agent binary needs:

  1. Accessibility — System Settings → Privacy & Security → Accessibility → toggle on.
  2. Screen Recording — same screen, Screen Recording → on.
  3. AppleScript Automation for Mail / Notes / Calendar / Reminders / Safari / System Events. make warmup triggers any missing dialogs in one batch.
  4. (optional) Screen Recording for kinrec — only if you use make bench-record. The kinrec binary needs its own grant.

All grants are one-time; macOS remembers per code-signing identity.
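
If a grant ends up in a bad state (say you clicked "Don't Allow" by accident), the built-in tccutil can clear it so the dialogs fire again on the next warmup. Note these resets are system-wide per service, not per app:

tccutil reset AppleEvents     # AppleScript Automation approvals
tccutil reset Accessibility   # Accessibility approvals
tccutil reset ScreenCapture   # Screen Recording approvals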

Comparison with related benchmarks

                      macbench                    OSWorld                 WebArena            AndroidWorld           WindowsAgentArena
Platform              macOS native                Ubuntu / Windows VM     Browser only        Android emulator       Windows VM
Total task slots      369                         369                     ~800                ~116                   ~150
Implemented (v0.1)    50                          369                     800+                116                    150
Eval style            filesystem + defaults +     filesystem + ROS +      DOM + side-effect   UI tree + side-effect  filesystem + registry
                      AppleScript + sqlite        screenshot match
Agent contract        exec(prompt) → drive macOS  VNC channel into VM     browser DOM API     adb-style              RDP-style
First public release  2026                        2024                    2023                2024                   2024

We borrow OSWorld's three-file pattern and difficulty taxonomy, but the implementation is original.

What's missing in v0.1

  • 319 stub tasks need setup.sh + eval.sh — every stub already has a real, specific prompt and a well-thought category/difficulty assignment; what's missing is the deterministic eval. See ROADMAP.md for the per-category implementation schedule across v0.2 → v1.0.
  • Anthropic / OpenAI / open-source agent backends. Right now we've only run the suite with kinclaw; contributions wiring up other agents are first-class welcome.
  • CI — needs a Mac runner with TCC pre-granted. TCC grants can't be scripted on GitHub Actions macOS runners, so this requires a self-hosted Mac mini. Probably v0.3 or v1.0.
  • Token / cost tracking — most agents don't expose this uniformly. Punted to v0.2.
  • Leaderboard site. The README will start carrying numbers as we collect them.

Implementing a stub

# 1. Pick a stub from tasks/ — its task.json has the spec
cat tasks/051-finder-show-hidden-files/task.json

# 2. Write setup.sh + eval.sh per AUTHOR_GUIDE.md
$EDITOR tasks/051-finder-show-hidden-files/setup.sh
$EDITOR tasks/051-finder-show-hidden-files/eval.sh
chmod +x tasks/051-finder-show-hidden-files/{setup,eval}.sh

# 3. Remove the "stub" status field from task.json
$EDITOR tasks/051-finder-show-hidden-files/task.json

# 4. Verify standalone (per AUTHOR_GUIDE.md self-check)
cd tasks/051-finder-show-hidden-files
bash setup.sh
# (manually do what the prompt says)
bash eval.sh && echo PASS

# 5. Open a PR
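
For 051 specifically, the eval likely reduces to a one-line defaults read. A sketch, assuming the prompt means Finder's global hidden-files toggle:

#!/bin/bash
# Hypothetical eval.sh for 051-finder-show-hidden-files. Finder persists the
# toggle in its defaults domain; it reads back as 1/TRUE/YES when enabled.
v=$(defaults read com.apple.finder AppleShowAllFiles 2>/dev/null | tr '[:upper:]' '[:lower:]')
case "$v" in 1|true|yes) exit 0 ;; *) exit 1 ;; esac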

License

MIT. See LICENSE.

The three-file evaluator pattern + difficulty taxonomy are inspired by OSWorld (Apache-2.0); all task content + runner implementation here are original.

See also

  • kinclaw — Pure-Go macOS computer-use agent. The reference implementation that drives this benchmark.
  • OSWorld — the inspiration. If you're benchmarking on Linux/Windows, use that.
  • AUTHOR_GUIDE.md — how to write a new task.
  • ROADMAP.md — path from 50 → 369.
