🦾 A mixed-action orchestration harness and benchmark for phone agents across CLI, GUI, and MCP tools.
✅ Evaluate phone agents by verifiable side effects, not only by the next tap.
🏠 Homepage • 🤗 HF Dataset • 🚀 Quick Start
PhoneHarness is a phone-agent evaluation stack for workflows that cannot be represented as pure GUI navigation. Agents run against Android emulators, operate through device-side tools and host-side proxies, and are graded by verifiable evidence such as files, system settings, app state, and safety side-effect checks.
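For intuition, a grader of this kind checks device state after the run rather than replaying the agent's taps. The sketch below is illustrative only (the real verifiers live in the runtime package); it reads an Android global setting over `adb` as the kind of evidence a task might require. The serial number and setting key are placeholders.

```python
# Illustrative only: PhoneHarness's real verifiers are part of the runtime
# package. This sketch shows the general idea of grading by a side effect
# (a system setting) rather than by the agent's last UI action.
import subprocess

def read_global_setting(serial: str, key: str) -> str:
    """Read an Android global setting from the emulator over adb."""
    out = subprocess.run(
        ["adb", "-s", serial, "shell", "settings", "get", "global", key],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def verify_airplane_mode_enabled(serial: str) -> bool:
    # "Did the agent actually toggle the setting?" is checked on-device,
    # independent of which tool path (CLI, GUI, MCP) the agent used.
    return read_global_setting(serial, "airplane_mode_on") == "1"

if __name__ == "__main__":
    print(verify_airplane_mode_enabled("emulator-5554"))
```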
| ⚡ CLI-native status checks | 🧭 Hybrid GUI + tool workflow | 📱 Virtual-display control |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
- 🧰 Mixed action surface: `shell_exec`, `python_exec`, `load_skill`, and `run_seed_gui_subtask` coexist in one phone-agent loop.
- 👀 Delegated GUI control: the outer orchestration model plans and calls tools, while a dedicated GUI worker handles screenshot-grounded app interaction.
- ⚙️ Deterministic-first routing: routing cards prefer CLI or MCP completion when a task has an exact executable path, and fall back to GUI only when needed.
- 🔍 Trace-backed grading: JSONL traces and HTML viewers make failures auditable as model reasoning errors, GUI grounding errors, environment faults, tool failures, or verifier mismatches.
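The failure taxonomy above can be tallied straight from the traces. A minimal sketch, assuming each JSONL event carries a `failure_category` field; that field name and the `runs/latest` path are placeholders, not the harness's actual schema:

```python
# Rough sketch of auditing runs from their JSONL traces. The real trace schema
# is whatever phoneharness writes; "failure_category" is an illustrative name.
import json
from collections import Counter
from pathlib import Path

def tally_failures(trace_dir: str) -> Counter:
    counts: Counter = Counter()
    for path in Path(trace_dir).glob("*.jsonl"):
        with path.open() as f:
            for line in f:
                if not line.strip():
                    continue
                event = json.loads(line)
                # e.g. model reasoning, GUI grounding, environment fault,
                # tool failure, verifier mismatch
                if "failure_category" in event:
                    counts[event["failure_category"]] += 1
    return counts

if __name__ == "__main__":
    print(tally_failures("runs/latest"))
```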
PhoneHarness Bench is released as a Hugging Face dataset:
https://huggingface.co/datasets/PhoneHarness/phoneharness-bench
The dataset contains the task definitions and metadata used by the paper. Keep generated traces and local run outputs out of git unless you intentionally publish an artifact snapshot.
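One way to pull the task definitions locally is via the Hugging Face `datasets` library; the snippet assumes the dataset's default configuration and whatever split names it publishes:

```python
# Load the PhoneHarness Bench task definitions from the Hugging Face Hub.
# Assumes `pip install datasets`; split and column names come from the
# published dataset itself.
from datasets import load_dataset

bench = load_dataset("PhoneHarness/phoneharness-bench")
print(bench)                      # available splits and row counts
first_split = next(iter(bench))
print(bench[first_split][0])      # peek at one task definition record
```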
PhoneHarness is the public project name, `phoneharness` is the runtime Python package, and `PHONEHARNESS_*` is the standard environment-variable prefix.
```
Host (macOS/Linux)                        Android Emulator + Termux
├── OpenAI-compatible model endpoint      ├── phoneharness server :8920
├── gui_proxy :8919 + slot*10             ├── shell_exec / python_exec
│     screenshot, tap, swipe, type        ├── load_skill -> host tool proxy
└── trace viewers                         └── run_seed_gui_subtask -> GUI worker
```
The default mode is delegated:
```
orchestration model (--model)
├── CLI and device operations
├── MCP / skill-backed host tools
└── run_seed_gui_subtask(...)
      └── GUI model (--gui-model)
            └── screenshot-grounded app actions
```
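Concretely, the GUI worker is reachable only through that one tool call, so everything screenshot-grounded stays behind it. The payload below is a rough illustration of the hand-off; the actual `run_seed_gui_subtask` argument schema is defined by the runtime package, and the argument names here are assumptions:

```python
# Hypothetical shape of a delegated GUI subtask, expressed as an OpenAI-style
# tool call. The real run_seed_gui_subtask signature belongs to phoneharness;
# the argument names below are illustrative assumptions only.
delegated_call = {
    "name": "run_seed_gui_subtask",
    "arguments": {
        # Natural-language goal for the GUI worker (--gui-model) to complete
        # via screenshot-grounded taps, swipes, and typing.
        "instruction": "Open Settings and enable airplane mode",
        # Budget hint so a stuck GUI worker returns control to the orchestrator.
        "max_steps": 15,
    },
}
```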
PhoneHarness expects OpenAI-compatible chat-completions endpoints. Export credentials in your shell or secret manager.
```bash
export OPENAI_BASE_URL="<openai-compatible-base-url>"
export OPENAI_API_KEY="<api-key>"
export PHONEHARNESS_GUI_API_URL="<optional-gui-model-base-url>"
export PHONEHARNESS_GUI_API_KEY="<optional-gui-model-api-key>"
```

Run an interactive console session:

```bash
python3 -m phoneharness console \
--model "<orchestration-model>" \
--gui-model "<gui-model>" \
--base-url "$OPENAI_BASE_URL" \
--api-key "$OPENAI_API_KEY"
```
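If the console cannot reach the model, first confirm the endpoint really answers chat-completions requests. A minimal check, assuming the `openai` Python client is installed (it is not implied to be a PhoneHarness dependency):

```python
# Quick sanity check that the configured endpoint speaks the OpenAI
# chat-completions protocol. Replace the model name with the orchestration
# model you intend to pass via --model.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)
resp = client.chat.completions.create(
    model="<orchestration-model>",
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(resp.choices[0].message.content)
```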
Or run the evaluation server, loading routing cards and skill files:

```bash
python3 -m phoneharness server \
--port 8920 \
--model "<orchestration-model>" \
--gui-model "<gui-model>" \
--base-url "$OPENAI_BASE_URL" \
--api-key "$OPENAI_API_KEY" \
--skill-file skills/routing.yaml \
--skill-file skills/index.yaml \
--skill-file skills/file_output_paths.yaml
```

Render JSONL traces to HTML for inspection:

```bash
python3 scripts/trace2html.py path/to/trace.jsonl
python3 scripts/trace2html_all.py path/to/trace-directory
```

Repository layout:

```
phoneharness/
├── phoneharness/       # Runtime package for the server, agent loop, tools, and GUI controllers
├── skills/             # Runtime routing cards and progressive skill-disclosure YAMLs
├── scripts/            # Emulator, GUI proxy, trace viewer, and helper scripts
├── tests/              # Unit tests for adapters and harness behavior
└── vdisplay-helper/    # Virtual-display helper source
```