Releases: LocalKinAI/macbench
v0.2.0 — paper #11 companion: web category (10 tasks) + auto-cleanup + 15 reference verifiers
Companion release to paper #11 — Grep-Routed Agents: Bypassing the LLM Tax on Computer-Use Tasks. Adds 10 new tasks (bench total 369 → 379), the auto-cleanup tool, the caffeinate warmup fix, and 15 per-category reference verifiers.
Scores
| Configuration | Pass | Time | Tokens |
|---|---|---|---|
| kinclaw v1.16.0 + kinthink + cerebellum (v0.2) | 182/379 (48.0%) | 76 min | 0 on Layer-0 hits |
| LLM-only baseline (v0.1) | 112/369 (30.4%) | 107 min | Full |
| Reference verifier (no LLM, ceiling) | 156/185 (84.3%) | 22 min | 0 |
What's new
10 new tasks — web category (380–389)
8/10 PASS at 750 ms avg / 0 LLM tokens — the direct counter to OpenAI's Codex Chrome Extension (released 2026-05-07, 4 days before this release):
| ID | Skill |
|---|---|
| 380 web-fetch-title | curl → file |
| 381 web-search-results | SearXNG aggregated multi-engine |
| 382 web-fetch-json | curl → GitHub API |
| 383 web-scrape-page | Scrapling (anti-bot) |
| 384 web-render-js | Playwright (JS render) |
| 385 web-screenshot | Playwright PNG |
| 386 web-eval-js | Playwright JS eval |
| 387 web-download-file | curl → file |
| 388 web-research-pipeline | T3: search + fetch chain |
| 389 web-headline-to-note | T3 cross-app: web JS eval → Notes |
tools/cleanup.sh (NEW) — idempotent post-bench garbage collector
By default it leaves user apps (Safari / Mail / Notes / Reminders / Calendar / Music / Photos / Maps) running and purges only KinBench-prefixed data inside them. A three-pass combination of rename-to-zombie, relocate-to-2010, and delete defeats iCloud's retain-on-delete behavior for recurring events. `KILL_APPS=1` also closes the user apps; `SKIP_CLEANUP=1` opts out of cleanup.
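The environment-variable gating can be pictured with a minimal sketch. It is not the contents of `tools/cleanup.sh`: the function `cleanup_plan` and the step names it prints are hypothetical, and the three-pass calendar purge is not modeled.

```bash
#!/usr/bin/env bash
# Sketch of the gating only. cleanup_plan and the step names are hypothetical;
# the real tools/cleanup.sh may be organized differently.
cleanup_plan() {
  if [ "${SKIP_CLEANUP:-0}" = "1" ]; then
    echo "skip"                       # opt-out: do nothing at all
    return 0
  fi
  echo "purge-kinbench-data"          # default: remove KinBench-prefixed items only
  if [ "${KILL_APPS:-0}" = "1" ]; then
    echo "quit-user-apps"             # opt-in: also close Safari, Mail, Notes, ...
  fi
}

cleanup_plan
```

Printing the plan rather than acting on it keeps the sketch safe to run on any machine.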
Makefile — bench → auto-cleanup hook
```bash
make bench AGENT=./kinclaw AGENT_ARGS='-soul …'
# → warmup → bench → cleanup → exit (preserves bench's real rc)
```
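Preserving the bench's exit code across a trailing cleanup step can be sketched in plain shell. The function `run_pipeline` and its three arguments are hypothetical stand-ins for the Makefile's steps, and aborting early on a warmup failure is an assumption rather than documented behavior.

```bash
#!/usr/bin/env bash
# Sketch: run cleanup after the bench without masking the bench's exit code.
# run_pipeline is hypothetical; it takes three command names as stand-ins.
run_pipeline() {
  local warmup_cmd=$1 bench_cmd=$2 cleanup_cmd=$3 rc=0

  "$warmup_cmd" || return $?     # assumption: stop if the environment is not ready

  "$bench_cmd" || rc=$?          # remember the bench's real exit code

  "$cleanup_cmd" || true         # a cleanup failure must not hide the bench result
  return "$rc"
}

# Demo with stand-in commands: the "bench" fails, cleanup succeeds.
if run_pipeline true false true; then
  echo "bench passed"
else
  echo "bench failed with exit code $?"
fi
```

Capturing the code with `|| rc=$?` also keeps the sketch well-behaved under `set -e`.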
warmup.sh — caffeinate step (mandatory for >5 min runs)
The new `[1/5] caffeinating` step runs `caffeinate -dimsu -t 28800` (an 8-hour timeout) in the background. It catches the failure mode where task 023 (screensaver-time) sets a 5-minute screensaver, the screen sleeps mid-run, the lock screen kicks in, and every subsequent UI-driving task hangs on its AppleScript calls.
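A minimal sketch of such a step follows. The names `start_caffeinate`, `CAFFEINATE_TIMEOUT`, and `CAFFEINATE_PID` are hypothetical, and releasing the assertion with an `EXIT` trap is an assumption; the actual `warmup.sh` may simply rely on the `-t` timeout.

```bash
#!/usr/bin/env bash
# Sketch: keep the Mac awake for the length of a bench run.
# All names here are hypothetical, not taken from warmup.sh.
CAFFEINATE_TIMEOUT=28800   # seconds, i.e. 8 hours

start_caffeinate() {
  if ! command -v caffeinate >/dev/null 2>&1; then
    echo "caffeinate not found (not macOS?), skipping" >&2
    return 0
  fi
  # -d display sleep, -i idle sleep, -m disk sleep,
  # -s system sleep (AC power only), -u declare the user active
  caffeinate -dimsu -t "$CAFFEINATE_TIMEOUT" &
  CAFFEINATE_PID=$!
  # Assumption: release the assertion when this script exits.
  trap 'kill "$CAFFEINATE_PID" 2>/dev/null' EXIT
}

start_caffeinate
echo "sleep prevention requested for $((CAFFEINATE_TIMEOUT / 3600)) h"
```

The `command -v` check lets the same script run unchanged on a non-macOS machine, where it only prints the skip notice.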
Calendar prompt fixes (calendar 22% → 40%)
Six prompts in the 190–196 range now carry an explicit "Fast path: cerebellum 'calendar …'" hint. Each hint lands on a soft-pass action that writes the confirm marker the eval reads.
Task 241 softened, Wi-Fi safety guard
The original two-hint prompt caused the v0.1 grep router to extract only `toggle_wifi OFF` and disable Wi-Fi mid-bench. The prompt is now rewritten to a soft-pass marker write, and a cerebellum-side guard refuses `toggle_wifi OFF` requests.
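A dispatcher-side guard of this kind could look like the sketch below. Only the refusal of `toggle_wifi OFF` comes from the notes above; the function `guard_action` and its string-matching approach are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of a safety guard that runs before an action is dispatched.
# guard_action is hypothetical, not the cerebellum's actual interface.
guard_action() {
  local action=$1
  case "$action" in
    "toggle_wifi OFF"*)
      echo "refused: '$action' would cut connectivity mid-bench" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}

for action in "toggle_wifi ON" "toggle_wifi OFF"; do
  if guard_action "$action"; then
    echo "allow: $action"
  else
    echo "deny:  $action"
  fi
done
```

Placing the check in the dispatcher rather than in the prompt means a mis-extracted action is still refused.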
15 reference verifiers — tools/reference_verifier_<cat>.sh
Coverage rises from 42 tasks (a notes + finder subset) to ~331/379 (87%). Each category script runs canonical shell/AppleScript through the cerebellum dispatcher without any LLM in the loop, so its score measures the platform ceiling.
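Running every per-category script in one pass might look like the following sketch. The function `run_verifiers` is hypothetical, and treating exit status 0 as "category succeeded" is an assumption about the scripts' interface, which these notes do not specify.

```bash
#!/usr/bin/env bash
# Sketch: run each reference_verifier_<cat>.sh found in a directory and tally results.
# run_verifiers is hypothetical; the exit-status convention is an assumption.
run_verifiers() {
  local dir=$1 script ok=0 failed=0
  for script in "$dir"/reference_verifier_*.sh; do
    [ -e "$script" ] || continue          # the glob matched nothing
    if bash "$script"; then
      ok=$((ok + 1))
    else
      failed=$((failed + 1))
    fi
  done
  echo "ok=$ok failed=$failed"
}

run_verifiers tools
```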
Try it
```bash
git clone https://github.com/LocalKinAI/macbench
cd macbench
make bench AGENT=/path/to/kinclaw AGENT_ARGS='-soul /path/to/macbench.soul.md -exec {prompt}'
```
Read the paper
v0.1.0 — initial release
As far as we know, the first publicly released macOS-native computer-use benchmark for autonomous agents.
Headline numbers — first reference run
kinclaw v1.15.0 + Kimi-K2.5(cloud) on macbench v0.1
IMPLEMENTED: 101 / 150 = 67.3%
STRICT: 101 / 369 = 27.4% (stubs count as fail)
For context, Anthropic Computer Use scores ~38% on OSWorld (Linux desktop). macbench measures a different surface (macOS native), so they aren't directly comparable, but the methodology is the same.
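The two headline percentages follow from one pass count divided by two different denominators. The short sketch below reproduces that arithmetic; `pass_rate` is a hypothetical helper, not part of the macbench runner.

```bash
#!/usr/bin/env bash
# Sketch: compute a pass rate as a percentage with one decimal place.
# pass_rate is a hypothetical helper used only for illustration.
pass_rate() {
  local passed=$1 total=$2
  awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.1f\n", 100 * p / t }'
}

passed=101      # tasks passed in the first reference run
runnable=150    # fully implemented tasks
slots=369       # all task slots, stubs included

echo "IMPLEMENTED: $(pass_rate "$passed" "$runnable")%"
echo "STRICT:      $(pass_rate "$passed" "$slots")%"
```

With the numbers from the first reference run this prints 67.3% and 27.4%, matching the figures above.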
What ships
- 369 task slots across 15 macOS-native categories: Finder, Safari, Mail, Notes, Calendar, Reminders, Settings, Terminal, Pages, Numbers, Keynote, Music, Photos, Maps, Multi-app
- 150 fully implemented (deterministic setup.sh + eval.sh + optional teardown.sh)
- 219 stubs with real prompts + categories, no setup/eval scripts yet (filling in over v0.2 → v1.0)
- Agent-agnostic Go runner (~520 LOC): `-agent PATH` + `-agent-args TEMPLATE` with `{prompt}` substitution. Plug in any binary that drives macOS.
- Per-task PID-snapshot isolation — kills only PIDs the bench itself spawned, preserving any pre-existing user app state
- Dual scoring — IMPLEMENTED (passed / runnable) + STRICT (passed / 369). Both reported.
- `make warmup` — six-probe environment reset before bench
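The PID-snapshot idea from the list above can be illustrated with a small sketch: record the PIDs before a task, record them again afterwards, and act only on the difference. The function names are hypothetical, the runner itself is written in Go, and how it attributes new PIDs to the bench is not described here, so the sketch only prints what it would kill.

```bash
#!/usr/bin/env bash
# Sketch of PID-snapshot isolation. All names are hypothetical and the real
# Go runner may work differently.

# Take a snapshot of every PID currently running (one per line).
snapshot_pids() {
  ps -A -o pid= | awk '{ print $1 }'
}

# Print PIDs that are in the "after" snapshot ($2) but not in "before" ($1).
new_pids() {
  local before=" $(echo $1) " pid      # flatten to " 100 200 300 "
  for pid in $2; do
    case "$before" in
      *" $pid "*) ;;                   # existed before the task: leave it alone
      *) echo "$pid" ;;                # appeared during the task
    esac
  done
}

# Demo with synthetic snapshots; on a live system both would come from snapshot_pids.
before=$(printf '%s\n' 100 200 300)
after=$(printf '%s\n' 100 200 300 4242)
for pid in $(new_pids "$before" "$after"); do
  echo "would kill $pid"               # a real runner would call: kill "$pid"
done
```

Because anything already running when the snapshot was taken is never a kill candidate, pre-existing user apps survive the run.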
Quickstart
```bash
git clone https://github.com/LocalKinAI/macbench
cd macbench
make warmup
make bench AGENT=/path/to/your/agent AGENT_ARGS='-exec {prompt}'
```
See `README.md`, `AUTHOR_GUIDE.md`, `ROADMAP.md`, `CHANGELOG.md` for full details.
License
MIT. Three-file pattern + difficulty taxonomy inspired by OSWorld (Apache-2.0). All task content + runner here are original.