Releases: LocalKinAI/macbench

v0.2.0 — paper #11 companion: web category (10 tasks) + auto-cleanup + 15 reference verifiers

12 May 03:55

Companion release to paper #11, "Grep-Routed Agents: Bypassing the LLM Tax on Computer-Use Tasks". Adds 10 new tasks (bench total 369 → 379), the auto-cleanup tool, the caffeinate warmup fix, and 15 per-category reference verifiers.

Scores

| Configuration | Pass | Time | Tokens |
|---|---|---|---|
| kinclaw v1.16.0 + kinthink + cerebellum (v0.2) | 182/379 (48.0%) | 76 min | 0 on Layer-0 hits |
| LLM-only baseline (v0.1) | 112/369 (30.4%) | 107 min | Full |
| Reference verifier (no LLM, ceiling) | 156/185 (84.3%) | 22 min | 0 |

What's new

10 new tasks — web category (380–389)

8/10 PASS at 750 ms avg / 0 LLM tokens — the direct counter to OpenAI's Codex Chrome Extension (released 2026-05-07, 4 days before this release):

| ID | Task | Skill |
|---|---|---|
| 380 | web-fetch-title | curl → file |
| 381 | web-search-results | SearXNG aggregated multi-engine |
| 382 | web-fetch-json | curl → GitHub API |
| 383 | web-scrape-page | Scrapling (anti-bot) |
| 384 | web-render-js | Playwright (JS render) |
| 385 | web-screenshot | Playwright PNG |
| 386 | web-eval-js | Playwright JS eval |
| 387 | web-download-file | curl → file |
| 388 | web-research-pipeline | T3: search + fetch chain |
| 389 | web-headline-to-note | T3 cross-app: web JS eval → Notes |
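As an illustration of what a "curl → file" skill such as task 380 might boil down to, here is a minimal sketch. It is a guess at the shape of the task, not the bench's actual implementation; `extract_title` and `fetch_title` are hypothetical helper names, and the regex only handles a lowercase `<title>` tag.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a "fetch the page title into a file" skill.

# extract_title: read HTML on stdin, print the text inside <title>...</title>.
# Newlines are flattened first so a title split across lines still matches.
extract_title() {
  tr '\n' ' ' | sed -n 's/.*<title[^>]*>\([^<]*\)<\/title>.*/\1/p'
}

# fetch_title URL OUTFILE: download the page with curl and save its title.
# -f fails on HTTP errors, -sS is quiet but still reports errors, -L follows redirects.
fetch_title() {
  curl -fsSL "$1" | extract_title > "$2"
}
```

Because no model is involved, a path like this costs zero LLM tokens, which is consistent with the token figure quoted for the web category.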

tools/cleanup.sh (NEW) — idempotent post-bench garbage collector

By default it leaves user apps (Safari / Mail / Notes / Reminders / Calendar / Music / Photos / Maps) running and purges only KinBench-prefixed data inside them. A three-pass combination (rename to zombie, relocate to 2010, delete) defeats iCloud's retain-on-delete behavior for recurring events. `KILL_APPS=1` closes user apps; `SKIP_CLEANUP=1` opts out.
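The two switches and the prefix rule can be pictured with the sketch below. It is an assumption about how such a script could be organised, not an excerpt from `tools/cleanup.sh`; `cleanup_mode` and `is_bench_item` are hypothetical names, and treating `KILL_APPS=1` as "purge, then quit apps" is a guess.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the cleanup switches; not the real tools/cleanup.sh.

# cleanup_mode: decide what a run should do from the two environment switches.
cleanup_mode() {
  if [ "${SKIP_CLEANUP:-0}" = "1" ]; then
    echo "skip"
  elif [ "${KILL_APPS:-0}" = "1" ]; then
    echo "purge-and-quit-apps"   # assumed: purge still happens before apps close
  else
    echo "purge-only"            # default: user apps stay running
  fi
}

# is_bench_item NAME: true only for items carrying the KinBench prefix,
# so user data with any other name is never selected for deletion.
is_bench_item() {
  case "$1" in
    KinBench*) return 0 ;;
    *)         return 1 ;;
  esac
}
```

Keying every deletion on the prefix is also what makes a second run harmless: once the KinBench items are gone, nothing else matches.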

Makefile — bench → auto-cleanup hook

```bash
make bench AGENT=./kinclaw AGENT_ARGS='-soul …'
# → warmup → bench → cleanup → exit (preserves bench's real rc)
```
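"Preserves bench's real rc" means the cleanup step must not overwrite the bench's exit code. A minimal shell sketch of that pattern follows; `run_bench_then_cleanup` is a hypothetical name, the warmup step is left out for brevity, and the real Makefile may wire this differently.

```bash
#!/usr/bin/env bash
# Sketch of "bench, then cleanup, then exit with the bench's code".
# run_bench_then_cleanup is a hypothetical helper, not a real Makefile target.

run_bench_then_cleanup() {
  local bench_cmd="$1" cleanup_cmd="$2" rc=0

  # Capture the bench's exit code instead of letting a failure stop the script.
  bash -c "$bench_cmd" || rc=$?

  # Cleanup always runs; its own failure is reported but never replaces rc.
  bash -c "$cleanup_cmd" || echo "cleanup failed (ignored)" >&2

  return "$rc"
}
```

With this shape, a CI job that checks the exit status still sees a failed bench as failed, even though cleanup ran afterwards and succeeded.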

warmup.sh — caffeinate step (mandatory for >5 min runs)

The new `[1/5] caffeinating` step runs `caffeinate -dimsu -t 28800` in the background. It catches the failure mode where task 023 (screensaver-time) sets a 5-min screensaver, the screen sleeps mid-run, the lock screen kicks in, and every subsequent UI-driving task hangs against AppleScript.
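A rough sketch of such a step is shown below. Only the `caffeinate -dimsu -t 28800` invocation comes from the release notes; the function name, the variable, and the skip on non-macOS hosts are assumptions added for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a caffeinate warmup step; not the real warmup.sh.

CAFFEINATE_SECONDS=28800   # 8 hours, after which caffeinate exits on its own

# start_caffeinate: hold display (-d), idle (-i), disk (-m) and system (-s)
# sleep assertions and declare the user active (-u) until the -t timeout.
start_caffeinate() {
  if ! command -v caffeinate >/dev/null 2>&1; then
    echo "caffeinate not found (not macOS?); skipping" >&2
    return 0
  fi
  # Backgrounded so the warmup can continue; output is discarded.
  caffeinate -dimsu -t "$CAFFEINATE_SECONDS" >/dev/null 2>&1 &
  echo "caffeinate running as PID $!"
}
```

Relying on the `-t` timeout rather than an exit trap lets the assertion outlive the warmup script and cover the bench run that follows.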

Calendar prompt fixes (calendar 22% → 40%)

Six prompts (190–196) now carry explicit `Fast path: cerebellum 'calendar …'` hints. The hints land on soft-pass actions that write the confirm-marker the eval reads.

Task 241 softened, Wi-Fi safety guard

The original two-hint prompt caused the v0.1 grep router to extract only `toggle_wifi OFF` and disable Wi-Fi mid-bench. The prompt is now rewritten to a soft-pass marker write, and a cerebellum-side guard refuses `toggle_wifi OFF` requests.
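A guard of this kind can be very small. The sketch below assumes the dispatcher sees an action name plus one argument; `guard_action` is a hypothetical name and the real cerebellum interface may differ.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a dispatcher-side safety guard; not cerebellum's real code.

# guard_action ACTION [ARG]: return non-zero for requests that would cut
# network connectivity in the middle of a bench run.
guard_action() {
  local action="$1" arg
  # Normalise case so "off", "Off" and "OFF" are all caught.
  arg=$(printf '%s' "${2:-}" | tr '[:lower:]' '[:upper:]')
  if [ "$action" = "toggle_wifi" ] && [ "$arg" = "OFF" ]; then
    echo "refused: toggle_wifi OFF is blocked during a bench run" >&2
    return 1
  fi
  return 0
}
```

Putting the check in the dispatcher rather than in the prompt means a misrouted request is refused regardless of which task or router produced it.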

15 reference verifiers — tools/reference_verifier_<cat>.sh

Coverage raised from 42 tasks (notes + finder subset) to ~331/379 (87%). Each category script runs canonical shell/AppleScript via the cerebellum dispatcher without any LLM in the loop, so the result measures the platform ceiling.

Try it

```bash
git clone https://github.com/LocalKinAI/macbench
cd macbench
make bench AGENT=/path/to/kinclaw AGENT_ARGS='-soul /path/to/macbench.soul.md -exec {prompt}'
```

Read the paper

v0.1.0 — initial release

09 May 06:29

As far as we know, this is the first publicly released macOS-native computer-use benchmark for autonomous agents.

Headline numbers — first reference run

```
kinclaw v1.15.0 + Kimi-K2.5(cloud) on macbench v0.1
  IMPLEMENTED:  101 / 150  =  67.3%
  STRICT:       101 / 369  =  27.4%   (stubs count as fail)
```

For context, Anthropic Computer Use scores ~38% on OSWorld (Linux desktop). macbench measures a different surface (macOS native), so they aren't directly comparable, but the methodology is the same.
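The two headline percentages differ only in their denominator. The arithmetic can be reproduced with the short sketch below; `pct` is a hypothetical helper, and the counts are the ones reported for the v0.1 reference run.

```bash
#!/usr/bin/env bash
# Sketch of the dual-scoring arithmetic; pct is a hypothetical helper name.

# pct PASSED TOTAL: print the pass rate as a percentage with one decimal place.
pct() {
  awk -v p="$1" -v t="$2" 'BEGIN { printf "%.1f", 100 * p / t }'
}

passed=101     # tasks passed in the v0.1 reference run
runnable=150   # fully implemented tasks
slots=369      # all task slots, stubs included

echo "IMPLEMENTED: ${passed} / ${runnable} = $(pct "$passed" "$runnable")%"
echo "STRICT:      ${passed} / ${slots} = $(pct "$passed" "$slots")%"
```

Running it prints 67.3% for IMPLEMENTED and 27.4% for STRICT. The STRICT figure rises only as stubs are implemented and passed, so it doubles as a progress measure for the benchmark itself.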

What ships

  • 369 task slots across 15 macOS-native categories: Finder, Safari, Mail, Notes, Calendar, Reminders, Settings, Terminal, Pages, Numbers, Keynote, Music, Photos, Maps, Multi-app
  • 150 fully implemented (deterministic setup.sh + eval.sh + optional teardown.sh)
  • 219 stubs with real prompts + categories, no setup/eval scripts yet (filling in over v0.2 → v1.0)
  • Agent-agnostic Go runner (~520 LOC): `-agent PATH` + `-agent-args TEMPLATE` with `{prompt}` substitution. Plug in any binary that drives macOS.
  • Per-task PID-snapshot isolation — kills only PIDs the bench itself spawned, preserving any pre-existing user app state
  • Dual scoring — IMPLEMENTED (passed / runnable) + STRICT (passed / 369). Both reported.
  • `make warmup` — six-probe environment reset before bench
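The PID-snapshot idea from the list above can be approximated in a few lines of shell. This is a simplified sketch, not the runner's code (the runner is written in Go), and all three helper names are hypothetical. A plain before/after diff would also catch processes the user happened to start during the task, so a real implementation presumably needs a stricter ownership check.

```bash
#!/usr/bin/env bash
# Simplified sketch of PID-snapshot isolation; helper names are hypothetical.

# snapshot_pids: print every live PID, one per line, sorted as text for comm(1).
snapshot_pids() {
  ps -A -o pid= | tr -d ' ' | sort
}

# new_pids BEFORE AFTER: PIDs that appear only in the AFTER snapshot file.
new_pids() {
  comm -13 "$1" "$2"
}

# kill_new_pids BEFORE AFTER: terminate only processes that appeared in between,
# leaving everything that was already running untouched.
kill_new_pids() {
  new_pids "$1" "$2" | while read -r pid; do
    kill "$pid" 2>/dev/null || true
  done
}
```

In use, a runner would write `snapshot_pids` to one file before a task and to another after it, then pass both files to `kill_new_pids`. Apps that were open before the task appear in both snapshots and are therefore never signalled.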

Quickstart

```bash
git clone https://github.com/LocalKinAI/macbench
cd macbench
make warmup
make bench AGENT=/path/to/your/agent AGENT_ARGS='-exec {prompt}'
```
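The `{prompt}` placeholder in `AGENT_ARGS` is filled in per task by the runner. As a rough illustration of that substitution, under the assumption that it is a plain string replacement, here is a sketch; `expand_agent_args` is a hypothetical name, and how the real Go runner splits the result into arguments is not shown.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of {prompt} substitution; the real runner is written in Go.

# expand_agent_args TEMPLATE PROMPT: replace every {prompt} with the task prompt.
expand_agent_args() {
  local template="$1" prompt="$2" out
  # The replacement is quoted so characters such as & in the prompt stay literal.
  out=${template//\{prompt\}/"$prompt"}
  printf '%s\n' "$out"
}
```

For example, `expand_agent_args '-exec {prompt}' 'open Notes'` prints `-exec open Notes`. Any agent binary that accepts its task as a command-line argument can be plugged in this way without changes to the bench.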

See `README.md`, `AUTHOR_GUIDE.md`, `ROADMAP.md`, `CHANGELOG.md` for full details.

License

MIT. Three-file pattern + difficulty taxonomy inspired by OSWorld (Apache-2.0). All task content + runner here are original.