MacAgentBench: Benchmark agents where they actually work — on macOS.

A macOS benchmark for evaluating AI agents on real desktop tasks.
Reproducible tasks, rule-based evaluation, and native app coverage across everyday work scenarios.
Fully configured OpenClaw macOS image for Linux and Windows

🔎TL;DR

We provide a fully configured macOS-based OpenClaw environment that runs out-of-the-box on both Linux and Windows via Docker. In addition, we address the complex permission issues in macOS, ensuring that OpenClaw can execute tasks without failing due to permission restrictions. Everyone is welcome to try it out and contribute! 👏
We introduce a benchmark designed to evaluate OpenClaw in real-world usage scenarios. It covers both the default tasks supported by OpenClaw and tasks commonly encountered in real working environments.
Our evaluation relies on manually designed rule-based evaluators, avoiding the uncertainty introduced by using LLMs as judges. Each task is executed in an independent container, ensuring that tasks do not interfere with each other during evaluation.

🚀Quick Start

Want to try the image first? Start with Part 1. Want to run benchmark evaluation? Jump to Part 2.

1. Try OpenClaw in macOS on Linux or Windows

Download the required image files from Hugging Face (~40GB):

pip install huggingface_hub

huggingface-cli download JetLM/OpenClaw-macOS --local-dir .

Start the macOS Docker container:

bash launcher/docker/simple_start.sh

Connect to macOS:

vncviewer localhost:5901

Next, open OpenClaw. You should see the OpenClaw icon in the top menu bar, which means you are ready to start using OpenClaw on macOS.

2. Run the evaluation benchmark

Download the required image files from Hugging Face (~50GB):

pip install huggingface_hub

huggingface-cli download JetLM/OpenClaw-macOS --local-dir .

Fill in your API key and base URL in openclaw.json.
Update the required local paths in eval.sh, especially the image paths and your local WORK_DIR.
Start the evaluation:

bash eval.sh

📋Benchmark feature

Compared to existing benchmarks for evaluating OpenClaw (e.g., PinchBench and Terminal Bench), our benchmark focuses primarily on real-world user scenarios. All tasks are executed on macOS rather than Linux, bridging the gap between evaluation environments and actual usage scenarios. This provides a more realistic assessment of OpenClaw’s capabilities in everyday workflows.
Our benchmark supports testing all skills provided by OpenClaw by default, ensuring that even newly deployed lobsters 🦞 can be evaluated fairly and comprehensively.
The benchmark includes a wide range of macOS-specific scenarios and capabilities, such as productivity workflows using Apple Numbers, Apple Pages, and Apple Keynote, as well as native macOS applications like Notes, Reminders, and other GUI-dependent tasks. This allows us to evaluate OpenClaw’s capabilities from multiple perspectives across realistic desktop workflows.

Our benchmark currently covers 110 tasks across 18 macOS app and tool categories, reflecting real-world OpenClaw usage scenarios. Each task is designed to test specific capabilities, from productivity apps to terminal utilities.

App	Task Focus	Example Instruction	# of Tasks
Apple Notes	Note management	`In Apple Notes, edit the note titled "Project Ideas" and append the line "Reviewed on Tuesday.".`	8
BlogWatcher	Subscribe to blogs	`In blogwatcher, scan all tracked blogs for updates.`	7
🦞 ClawHub	Control OpenClaw SKILLs	`From ClawHub, install "incident-triage-playbook" at version "1.0.0" into "$HOME/Desktop/clawhub_skills_1_1".`	2
GifGrep	Search & share GIFs	`Find the first Tenor GIF result for "cat typing on keyboard", then download that GIF and save it as "$HOME/Desktop/gifgrep_download_2_1.gif".`	4
GitHub	Store & share code	`Find the latest open issue in "openclaw/openclaw" with label "bug", and save it as "#<number> \| <labels> \| <title>" to "$HOME/Desktop/gh_open_issues_2_1.txt".`	5
Himalaya	Send & receive emails	`Check the latest OTP email in Inbox (search across all pages) and save only one line to "$HOME/Desktop/mail_otp_1_1.txt" as "otp_code: <6-digit-code>". If no OTP email exists, write "otp_code: none".`	8
Obsidian	Manage Markdown files	`In Obsidian, create a new note called "Inbox/Vendor Call Notes.md" and set its content exactly to the provided meeting notes.`	8
Peekaboo	Launch and inspect apps	`Capture a PNG screenshot of the TextEdit window and save it to "$HOME/Desktop/peekaboo_front_window_1.png".`	8
Reminders	Task reminders	`In the Reminders app, create a new reminder titled "Dentist follow-up" due "2026-04-15 09:30".`	10
Sherpa-ONNX-TTS	Text to speech	`In Terminal, create a spoken reminder that says "Reminder: submit the travel reimbursement form today.", and save it as a WAV file to "$HOME/Desktop/sherpa_tts_2_1.wav".`	2
SongSee	Audio to spectrogram	`In Terminal, use songsee to generate a multi-panel visualization for "$HOME/Desktop/Ode to Joy.mp3" with spectrogram, mel, chroma, and mfcc, then save it to "$HOME/Desktop/songsee_panel_1.png".`	9
tmux	Terminal session management	`In Terminal, send "status" to the tmux session "tmux-status-send-c1", then save the reply to "$HOME/Desktop/tmux_result_4_1.json".`	6
Video Frames	Extract frames	`In Terminal, extract a frame at 00:00:01 from "$HOME/Desktop/benchmark_source.mp4", resize it to 640x720, and save it as "$HOME/Desktop/vf_thumb_1.png".`	3
Weather	Check weather	`Can you check tomorrow's weather in "San Francisco" and tell me whether it will be rainy? Save only one yes/no line to "$HOME/Desktop/weather_tomorrow_rain_2_1.txt".`	4
Whisper	Speech to text	`In Terminal, use Whisper's tiny model to make an SRT subtitle file for "$HOME/Documents/whisper_audio_en_2.ogg" and save it to "$HOME/Desktop/whisper_subtitles_2_1.srt".`	4
Numbers	Create & manage tables	`In the Numbers app, create a new blank document, insert a table into the first sheet if needed, then enter the value "test" into cell A1.`	5
Pages	Writing & reading	`In the Pages app, create a new document using the "Blank Black" template and write: Passwd: 111111.`	8
Keynote	Build slides	`In the Keynote app, create a new document, export it to PowerPoint, and save it to the Documents folder.`	9

🗺️RoadMap

What we have done

Upcoming Plans(coming soon...)

🙌 Contribution Guide

We warmly welcome contributions to improve the OpenClaw Benchmark for macOS! Here’s how you can help:

1️⃣ Add Support for New Models

Integrate support for new language or agent models.
Test existing tasks on new models and report results.
Ensure compatibility with our Docker-based OpenClaw environment.

2️⃣ Add New Tasks

Submit new macOS tasks that reflect real-world user scenarios.
Provide clear instructions, expected outputs, and, if possible, configuration screenshots or screen recordings.
Include tasks across different apps (productivity, terminal utilities, GUI-based apps) to enrich the benchmark coverage.

3️⃣ Verification & Automation

Write verification scripts for tasks to ensure consistency and correctness.
Automate detection of inactive platforms or failed task execution.

4️⃣ Translations & Localization

Translate task instructions or documentation into English, Japanese, Korean, or other languages.
Help make the benchmark accessible to a broader community.

How to Contribute

Fork the repository.
Make your changes in a separate branch.
Submit a Pull Request with a clear description of your contribution.
If unsure, open an Issue first to discuss your idea—we respond quickly!

💡 Tip: When adding tasks, try to follow the existing format in tasks/ and include example instructions similar to the ones in our benchmark table.

❤ Acknowledgments

We would like to thank the following projects and communities for their invaluable contributions, which helped make this benchmark possible:

Claw-Eval - https://github.com/claw-eval/claw-eval

OpenClaw - https://github.com/openclaw/openclaw

OpenGVLab – https://github.com/OpenGVLab

Docker-OSX – https://github.com/sickcodes/Docker-OSX

PinchBench - https://github.com/pinchbench/

📬 Contact

If you have questions or would like to collaborate, please contact us at:

Yikun Fu, Shanghai AI Laboratory 📧 fuyikun123456@163.com
Bowen Fu, XJTU 📧 HappyBug@stu.xjtu.edu.cn
zhenyu wu 📧 zywu01@sjtu.edu.cn
kaiyan zhang 📧 zhang-ky22@mails.tsinghua.edu.cn
Biqing Qi, Shanghai AI Laboratory 📧 qibiqing@pjlab.org.cn

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
assets		assets
config		config
controllers		controllers
evaluators		evaluators
launcher		launcher
mm_agents		mm_agents
task		task
utils		utils
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
batch_run.py		batch_run.py
eval.sh		eval.sh
openclaw.json		openclaw.json
requirements.txt		requirements.txt
single_run.py		single_run.py
single_run.sh		single_run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MacAgentBench: Benchmark agents where they actually work — on macOS.

🔎TL;DR

🚀Quick Start

1. Try OpenClaw in macOS on Linux or Windows

2. Run the evaluation benchmark

📋Benchmark feature

🗺️RoadMap

What we have done

Upcoming Plans(coming soon...)

🙌 Contribution Guide

How to Contribute

❤ Acknowledgments

📬 Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

MacAgentBench: Benchmark agents where they actually work — on macOS.

🔎TL;DR

🚀Quick Start

1. Try OpenClaw in macOS on Linux or Windows

2. Run the evaluation benchmark

📋Benchmark feature

🗺️RoadMap

What we have done

Upcoming Plans(coming soon...)

🙌 Contribution Guide

How to Contribute

❤ Acknowledgments

📬 Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages