🦾 A mixed-action orchestration harness and benchmark for phone agents across CLI, GUI, and MCP tools.
✅ Evaluate phone agents by verifiable side effects, not only by the next tap.
🏠 Homepage • 🤗 HF Dataset • 🚀 Quick Start
PhoneHarness is a phone-agent evaluation stack for workflows that cannot be represented as pure GUI navigation. Agents run against Android emulators, operate through device-side tools and host-side proxies, and are graded by verifiable evidence such as files, system settings, app state, and safety side-effect checks.
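For intuition, a grader of this kind checks device state after the run rather than replaying the agent's taps. The sketch below is illustrative only (the real verifiers live in the runtime package); it reads an Android global setting over `adb` as the kind of evidence a task might require. The serial number and setting key are placeholders.

```python
# Illustrative only: PhoneHarness's real verifiers are part of the runtime
# package. This sketch shows the general idea of grading by a side effect
# (a system setting) rather than by the agent's last UI action.
import subprocess

def read_global_setting(serial: str, key: str) -> str:
    """Read an Android global setting from the emulator over adb."""
    out = subprocess.run(
        ["adb", "-s", serial, "shell", "settings", "get", "global", key],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def verify_airplane_mode_enabled(serial: str) -> bool:
    # "Did the agent actually toggle the setting?" is checked on-device,
    # independent of which tool path (CLI, GUI, MCP) the agent used.
    return read_global_setting(serial, "airplane_mode_on") == "1"

if __name__ == "__main__":
    print(verify_airplane_mode_enabled("emulator-5554"))
```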
| ⚡ CLI-native status checks | 🧭 Hybrid GUI + tool workflow | 📱 Virtual-display control |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
- 🧰 Mixed action surface: `shell_exec`, `python_exec`, `load_skill`, and `run_seed_gui_subtask` coexist in one phone-agent loop.
- 👀 Delegated GUI control: the outer orchestration model plans and calls tools, while a dedicated GUI worker handles screenshot-grounded app interaction.
- ⚙️ Deterministic-first routing: routing cards prefer CLI or MCP completion when a task has an exact executable path, and fall back to GUI only when needed.
- 🔍 Trace-backed grading: JSONL traces and HTML viewers make failures auditable as model reasoning errors, GUI grounding errors, environment faults, tool failures, or verifier mismatches.
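The failure taxonomy above can be tallied straight from the traces. A minimal sketch, assuming each JSONL event carries a `failure_category` field; that field name and the `runs/latest` path are placeholders, not the harness's actual schema:

```python
# Rough sketch of auditing runs from their JSONL traces. The real trace schema
# is whatever phoneharness writes; "failure_category" is an illustrative name.
import json
from collections import Counter
from pathlib import Path

def tally_failures(trace_dir: str) -> Counter:
    counts: Counter = Counter()
    for path in Path(trace_dir).glob("*.jsonl"):
        with path.open() as f:
            for line in f:
                if not line.strip():
                    continue
                event = json.loads(line)
                # e.g. model reasoning, GUI grounding, environment fault,
                # tool failure, verifier mismatch
                if "failure_category" in event:
                    counts[event["failure_category"]] += 1
    return counts

if __name__ == "__main__":
    print(tally_failures("runs/latest"))
```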
PhoneHarness Bench is released as a Hugging Face dataset:
https://huggingface.co/datasets/PhoneHarness/phoneharness-bench
The dataset contains the task definitions and metadata used by the paper. Keep generated traces and local run outputs out of git unless you intentionally publish an artifact snapshot.
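One way to pull the task definitions locally is via the Hugging Face `datasets` library; the snippet assumes the dataset's default configuration and whatever split names it publishes:

```python
# Load the PhoneHarness Bench task definitions from the Hugging Face Hub.
# Assumes `pip install datasets`; split and column names come from the
# published dataset itself.
from datasets import load_dataset

bench = load_dataset("PhoneHarness/phoneharness-bench")
print(bench)                      # available splits and row counts
first_split = next(iter(bench))
print(bench[first_split][0])      # peek at one task definition record
```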
PhoneHarness is the public project name, `phoneharness` is the runtime Python package, and `PHONEHARNESS_*` is the standard environment-variable prefix.
```
Host (macOS/Linux)                        Android Emulator + Termux
├── OpenAI-compatible model endpoint      ├── phoneharness server :8920
├── gui_proxy :8919 + slot*10             ├── shell_exec / python_exec
│     screenshot, tap, swipe, type        ├── load_skill -> host tool proxy
└── trace viewers                         └── run_seed_gui_subtask -> GUI worker
```
The default mode is delegated:
```
orchestration model (--model)
├── CLI and device operations
├── MCP / skill-backed host tools
└── run_seed_gui_subtask(...)
      └── GUI model (--gui-model)
            └── screenshot-grounded app actions
```
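Concretely, the GUI worker is reachable only through that one tool call, so everything screenshot-grounded stays behind it. The payload below is a rough illustration of the hand-off; the actual `run_seed_gui_subtask` argument schema is defined by the runtime package, and the argument names here are assumptions:

```python
# Hypothetical shape of a delegated GUI subtask, expressed as an OpenAI-style
# tool call. The real run_seed_gui_subtask signature belongs to phoneharness;
# the argument names below are illustrative assumptions only.
delegated_call = {
    "name": "run_seed_gui_subtask",
    "arguments": {
        # Natural-language goal for the GUI worker (--gui-model) to complete
        # via screenshot-grounded taps, swipes, and typing.
        "instruction": "Open Settings and enable airplane mode",
        # Budget hint so a stuck GUI worker returns control to the orchestrator.
        "max_steps": 15,
    },
}
```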
PhoneHarness expects OpenAI-compatible chat-completions endpoints. Export credentials in your shell or secret manager.
```bash
export OPENAI_BASE_URL="<openai-compatible-base-url>"
export OPENAI_API_KEY="<api-key>"
export PHONEHARNESS_GUI_API_URL="<optional-gui-model-base-url>"
export PHONEHARNESS_GUI_API_KEY="<optional-gui-model-api-key>"
```

Run an interactive console session:

```bash
python3 -m phoneharness console \
--model "<orchestration-model>" \
--gui-model "<gui-model>" \
--base-url "$OPENAI_BASE_URL" \
--api-key "$OPENAI_API_KEY"
```
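If the console cannot reach the model, first confirm the endpoint really answers chat-completions requests. A minimal check, assuming the `openai` Python client is installed (it is not implied to be a PhoneHarness dependency):

```python
# Quick sanity check that the configured endpoint speaks the OpenAI
# chat-completions protocol. Replace the model name with the orchestration
# model you intend to pass via --model.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)
resp = client.chat.completions.create(
    model="<orchestration-model>",
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(resp.choices[0].message.content)
```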
Or run the evaluation server, loading routing cards and skill files:

```bash
python3 -m phoneharness server \
--port 8920 \
--model "<orchestration-model>" \
--gui-model "<gui-model>" \
--base-url "$OPENAI_BASE_URL" \
--api-key "$OPENAI_API_KEY" \
--skill-file skills/routing.yaml \
--skill-file skills/index.yaml \
--skill-file skills/file_output_paths.yaml
```

Render JSONL traces to HTML for inspection:

```bash
python3 scripts/trace2html.py path/to/trace.jsonl
python3 scripts/trace2html_all.py path/to/trace-directory
```

Repository layout:

```
phoneharness/
├── phoneharness/       # Runtime package for the server, agent loop, tools, and GUI controllers
├── skills/             # Runtime routing cards and progressive skill-disclosure YAMLs
├── scripts/            # Emulator, GUI proxy, trace viewer, and helper scripts
├── tests/              # Unit tests for adapters and harness behavior
└── vdisplay-helper/    # Virtual-display helper source
```