Releases · LocalKinAI/macbench

Companion release to paper #11 — Grep-Routed Agents: Bypassing the LLM Tax on Computer-Use Tasks. Adds 10 new tasks (bench total 369 → 379), the auto-cleanup tool, the caffeinate warmup fix, and 15 per-category reference verifiers.

Scores

| Configuration | Pass | Time | Tokens |
| --- | --- | --- | --- |
| kinclaw v1.16.0 + kinthink + cerebellum (v0.2) | 182/379 (48.0%) | 76 min | 0 on Layer-0 hits |
| LLM-only baseline (v0.1) | 112/369 (30.4%) | 107 min | Full |
| Reference verifier (no LLM, ceiling) | 156/185 (84.3%) | 22 min | 0 |

What's new

10 new tasks — `web` category (380–389)

8 of 10 tasks PASS at a 750 ms average with 0 LLM tokens. This is the direct counter to OpenAI's Codex Chrome Extension (released 2026-05-07, 4 days before this release):

| ID | Skill |
| --- | --- |
| 380 web-fetch-title | curl → file |
| 381 web-search-results | SearXNG aggregated multi-engine |
| 382 web-fetch-json | curl → GitHub API |
| 383 web-scrape-page | Scrapling (anti-bot) |
| 384 web-render-js | Playwright (JS render) |
| 385 web-screenshot | Playwright PNG |
| 386 web-eval-js | Playwright JS eval |
| 387 web-download-file | curl → file |
| 388 web-research-pipeline | T3: search + fetch chain |
| 389 web-headline-to-note | T3 cross-app: web JS eval → Notes |

`tools/cleanup.sh` (NEW) — idempotent post-bench garbage collector

By default the script leaves user apps (Safari / Mail / Notes / Reminders / Calendar / Music / Photos / Maps) running and purges only KinBench-prefixed data inside them. A 3-pass combination (rename-to-zombie, relocate-to-2010, delete) defeats iCloud's retain-on-delete behavior for recurring events. Setting `KILL_APPS=1` also closes the user apps; setting `SKIP_CLEANUP=1` opts out of cleanup.
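The two switches above could be wired roughly as follows. This is a minimal sketch, not the contents of `tools/cleanup.sh`: `cleanup_plan` and the step names it prints are hypothetical, and the function only reports which steps would run.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print the cleanup steps selected by the two
# environment switches named in the release notes. Touches no app or data.
cleanup_plan() {
  # SKIP_CLEANUP=1 opts out entirely.
  if [ "${SKIP_CLEANUP:-0}" = "1" ]; then
    echo "skip"
    return 0
  fi
  # Default: purge only KinBench-prefixed data, leave user apps running.
  echo "purge-kinbench-data"
  # KILL_APPS=1 additionally closes the user apps.
  if [ "${KILL_APPS:-0}" = "1" ]; then
    echo "close-user-apps"
  fi
}
```

Because the default path only adds work when a switch is set, running the plan twice with the same environment selects the same steps, which is one ingredient of the idempotence the script advertises.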

Makefile — bench → auto-cleanup hook

make bench AGENT=./kinclaw AGENT_ARGS='-soul …'
# → warmup → bench → cleanup → exit (preserves bench's real rc)
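The notes do not show the Makefile recipe itself. One way to get the "cleanup always runs, bench's exit code survives" behavior is sketched below; `run_with_cleanup` is a hypothetical helper, the three commands are passed in as arguments so no real script path is assumed, and aborting on a failed warmup is an assumption of this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: warmup, then bench, then cleanup, returning the
# bench's own exit code. Each argument is a simple command; it is expanded
# unquoted, so it is word-split (no embedded quoting support).
run_with_cleanup() {
  local warmup_cmd=$1 bench_cmd=$2 cleanup_cmd=$3
  local rc

  $warmup_cmd || return $?   # assumption: a failed warmup aborts the run
  $bench_cmd
  rc=$?                      # capture the bench's real exit code first
  $cleanup_cmd || true       # a cleanup failure must not mask the bench result
  return "$rc"
}
```

The key design point is capturing `$?` immediately after the bench command, before any other command has a chance to overwrite it.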

`warmup.sh` — caffeinate step (mandatory for >5 min runs)

A new `[1/5] caffeinating` step runs `caffeinate -dimsu -t 28800` in the background, holding the machine awake for up to 8 hours. It catches the failure mode where task 023 (screensaver-time) sets a 5-min screensaver, the screen sleeps mid-run, the lock screen kicks in, and every subsequent UI-driving task hangs against AppleScript.
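A background keep-awake step of this kind might look like the sketch below. It is not the code in `warmup.sh`: the function names and the `KEEP_AWAKE_CMD` override (which lets the functions be exercised on a machine without `caffeinate`) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: hold power assertions in the background for the length of a run.
# caffeinate flags: -d display, -i idle, -m disk, -s system (AC power only),
# -u declare user active; -t is a timeout in seconds.
start_keep_awake() {
  local seconds=${1:-28800}   # 28800 s = 8 h, the value from the release notes
  # KEEP_AWAKE_CMD is a hypothetical override; the default is the real command.
  ${KEEP_AWAKE_CMD:-caffeinate -dimsu -t} "$seconds" &
  KEEP_AWAKE_PID=$!
}

stop_keep_awake() {
  if [ -n "${KEEP_AWAKE_PID:-}" ]; then
    kill "$KEEP_AWAKE_PID" 2>/dev/null || true
    wait "$KEEP_AWAKE_PID" 2>/dev/null || true   # reap the background job
    KEEP_AWAKE_PID=
  fi
}
```

A caller would typically pair the two with `trap stop_keep_awake EXIT` so the assertion is released even when the run aborts early; the `-t` timeout acts as a backstop if the trap never fires.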

Calendar prompt fixes (calendar 22% → 40%)

Six prompts (190–196) now carry explicit `Fast path: cerebellum 'calendar …'` hints. The hints land on soft-pass actions that write the confirm-marker the eval reads.

Task 241 softened, Wi-Fi safety guard

The original two-hint prompt caused the v0.1 grep router to extract only `toggle_wifi OFF`, which disabled Wi-Fi mid-bench. The prompt is rewritten to a soft-pass marker write, and a cerebellum-side guard now refuses `toggle_wifi OFF` requests.
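The guard's logic can be illustrated in a few lines. This is a sketch under assumptions: the notes name only the `toggle_wifi OFF` request, so the `guard_action` function, its two-argument shape, and the case-insensitive match are all hypothetical and not taken from cerebellum.

```shell
#!/usr/bin/env bash
# Hypothetical guard: refuse any request that would turn Wi-Fi off during a
# bench run and let every other action through.
guard_action() {
  local action=$1 arg=${2:-}
  local upper
  # Normalize the argument so "off" and "OFF" are both refused.
  upper=$(printf '%s' "$arg" | tr '[:lower:]' '[:upper:]')
  if [ "$action" = "toggle_wifi" ] && [ "$upper" = "OFF" ]; then
    echo "refused: toggle_wifi OFF is blocked during a bench run" >&2
    return 1
  fi
  return 0
}
```

Placing the check on the dispatcher side means a misrouted prompt cannot take the network down, regardless of how the router extracted the action.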

15 reference verifiers — `tools/reference_verifier_<cat>.sh`

Coverage is raised from 42 tasks (notes + finder subset) to ~331/379 (87%). Each category script runs canonical shell/AppleScript via the cerebellum dispatcher without any LLM in the loop, so its pass rate measures the platform ceiling.

Try it

git clone https://github.com/LocalKinAI/macbench
cd macbench
make bench AGENT=/path/to/kinclaw AGENT_ARGS='-soul /path/to/macbench.soul.md -exec {prompt}'

Read the paper

Headline numbers — first reference run

kinclaw v1.15.0 + Kimi-K2.5(cloud) on macbench v0.1
  IMPLEMENTED:  101 / 150  =  67.3%
  STRICT:       101 / 369  =  27.4%   (stubs count as fail)
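The two percentages follow directly from the counts. A small sketch that reproduces the arithmetic (`score` is a hypothetical helper, not part of the bench runner):

```shell
#!/usr/bin/env bash
# Hypothetical helper: compute both scores from raw counts.
#   $1 = tasks passed, $2 = runnable (implemented) tasks, $3 = total task slots
score() {
  local passed=$1 runnable=$2 total=$3
  awk -v p="$passed" -v r="$runnable" -v t="$total" 'BEGIN {
    printf "IMPLEMENTED: %d / %d = %.1f%%\n", p, r, 100 * p / r
    printf "STRICT: %d / %d = %.1f%%\n", p, t, 100 * p / t
  }'
}
```

Calling `score 101 150 369` yields 67.3% and 27.4%, matching the figures above.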

For context, Anthropic Computer Use scores ~38% on OSWorld (Linux desktop). macbench measures a different surface (macOS native), so they aren't directly comparable, but the methodology is the same.

What ships

369 task slots across 15 macOS-native categories: Finder, Safari, Mail, Notes, Calendar, Reminders, Settings, Terminal, Pages, Numbers, Keynote, Music, Photos, Maps, Multi-app

150 fully implemented (deterministic setup.sh + eval.sh + optional teardown.sh)

219 stubs with real prompts + categories, no setup/eval scripts yet (filling in over v0.2 → v1.0)

Agent-agnostic Go runner (~520 LOC): `-agent PATH` + `-agent-args TEMPLATE` with `{prompt}` substitution. Plug in any binary that drives macOS.

Per-task PID-snapshot isolation — kills only PIDs the bench itself spawned, preserving any pre-existing user app state

Dual scoring — IMPLEMENTED (passed / runnable) + STRICT (passed / 369). Both reported.

`make warmup` — six-probe environment reset before bench
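The PID-snapshot isolation listed above can be illustrated in shell, although the actual runner is written in Go and its implementation is not shown here. In this sketch `snapshot_pids` and `new_pids` are hypothetical names: one snapshot is taken before a task and one after, and only PIDs that appear in the second but not the first are candidates for killing.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of snapshot-and-diff process isolation.

# Print every current PID, one per line, sorted for use with comm.
snapshot_pids() {
  ps -A -o pid= | tr -d ' ' | sort
}

# Args: before-file after-file. Print PIDs present only in the "after"
# snapshot, i.e. processes started since the first snapshot was taken.
new_pids() {
  comm -13 "$1" "$2"
}

# Usage (commented out so that sourcing this file has no side effects):
#   snapshot_pids > /tmp/before.txt
#   ... run one task ...
#   snapshot_pids > /tmp/after.txt
#   new_pids /tmp/before.txt /tmp/after.txt | xargs kill 2>/dev/null
```

Short-lived helpers such as the `ps` call itself can show up in a snapshot; killing an already-exited PID simply fails, which is why the usage line discards errors. Anything that was running before the first snapshot never appears in the diff, so pre-existing user apps are left alone.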


Releases: LocalKinAI/macbench

v0.2.0 — paper #11 companion: web category (10 tasks) + auto-cleanup + 15 reference verifiers


v0.1.0 — initial release
