PhoneHarness

🦾 A mixed-action orchestration harness and benchmark for phone agents across CLI, GUI, and MCP tools.

✅ Evaluate phone agents by verifiable side effects, not only by the next tap.

🏠 Homepage | 🤗 HF Dataset | 🚀 Quick Start


PhoneHarness is a phone-agent evaluation stack for workflows that cannot be represented as pure GUI navigation. Agents run against Android emulators and operate through device-side tools and host-side proxies. They are graded on verifiable evidence such as files, system settings, and app state, together with safety checks for unintended side effects.
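
To make "graded on verifiable evidence" concrete, here is a minimal sketch of a file-based check. It is not the PhoneHarness verifier API: `FileCheck` and `grade` are hypothetical names, and a temporary directory stands in for device storage.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class FileCheck:
    """Hypothetical verifier: passes when a file exists and contains some text."""
    path: Path
    must_contain: str

    def verify(self) -> bool:
        # Grade the side effect (the file), not the actions that produced it.
        return self.path.is_file() and self.must_contain in self.path.read_text()


def grade(checks: List[FileCheck]) -> bool:
    # A task passes only when every piece of evidence is present.
    return all(check.verify() for check in checks)


# Usage: simulate an agent that was asked to write a note.
with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.txt"
    check = FileCheck(path=note, must_contain="milk")
    before = grade([check])        # the file does not exist yet
    note.write_text("buy milk\n")  # the agent's side effect
    after = grade([check])         # the evidence is now present
```

The same shape extends to other evidence types (a system setting, an app database row) by swapping the body of `verify`.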

🎬 Demos

  • ⚡ CLI-native status checks
  • 🧭 Hybrid GUI + tool workflow
  • 📱 Virtual-display control

✨ Features

  • 🧰 Mixed action surface: shell_exec, python_exec, load_skill, and run_seed_gui_subtask coexist in one phone-agent loop.
  • 👀 Delegated GUI control: the outer orchestration model plans and calls tools, while a dedicated GUI worker handles screenshot-grounded app interaction.
  • ⚙️ Deterministic-first routing: routing cards prefer CLI or MCP completion when a task has an exact executable path, and fall back to GUI only when needed.
  • 🔍 Trace-backed grading: JSONL traces and HTML viewers make failures auditable as model reasoning errors, GUI grounding errors, environment faults, tool failures, or verifier mismatches.
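
The deterministic-first routing described above can be sketched as a simple preference order. This is an illustration only: `RoutingCard` and its fields are hypothetical (the real routing-card YAML schema is not shown in this README), and mapping MCP completion to `load_skill` is an assumption based on the architecture diagram.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoutingCard:
    """Hypothetical routing card; field names are assumptions."""
    task: str
    cli_command: Optional[str] = None  # exact shell path, if one exists
    mcp_tool: Optional[str] = None     # host tool name, if one exists


def choose_route(card: RoutingCard) -> str:
    # Prefer exact, executable paths; GUI is the fallback.
    if card.cli_command is not None:
        return "shell_exec"
    if card.mcp_tool is not None:
        return "load_skill"
    return "run_seed_gui_subtask"


cli_route = choose_route(RoutingCard("read battery level", cli_command="dumpsys battery"))
mcp_route = choose_route(RoutingCard("look up a contact", mcp_tool="contacts"))
gui_route = choose_route(RoutingCard("like a post in an app"))
```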

📦 Benchmark

PhoneHarness Bench is released as a Hugging Face dataset:

https://huggingface.co/datasets/PhoneHarness/phoneharness-bench

The dataset contains the task definitions and metadata used by the paper. Keep generated traces and local run outputs out of git unless you intentionally publish an artifact snapshot.

🧩 Architecture

PhoneHarness is the public project name, phoneharness is the runtime Python package, and PHONEHARNESS_* is the standard environment-variable prefix.

Host (macOS/Linux)                            Android Emulator + Termux
├── OpenAI-compatible model endpoint          ├── phoneharness server :8920
├── gui_proxy :8919 + slot*10                 ├── shell_exec / python_exec
│   screenshot, tap, swipe, type              ├── load_skill -> host tool proxy
└── trace viewers                             └── run_seed_gui_subtask -> GUI worker
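
Reading `:8919 + slot*10` in the diagram as a per-emulator port formula, the GUI proxy port for a given slot could be computed as below. The function name and the interpretation of `slot` as a zero-based emulator index are assumptions.

```python
GUI_PROXY_BASE_PORT = 8919  # base port from the diagram
PORT_STRIDE = 10            # "slot*10" from the diagram
SERVER_PORT = 8920          # phoneharness server port from the diagram


def gui_proxy_port(slot: int) -> int:
    """Return the host-side gui_proxy port for an emulator slot (assumed zero-based)."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return GUI_PROXY_BASE_PORT + slot * PORT_STRIDE


ports = [gui_proxy_port(slot) for slot in range(3)]
```

Under this reading, slot 0 maps to 8919, slot 1 to 8929, and slot 2 to 8939.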

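In the default delegated mode, the orchestration model calls CLI, device, and host tools directly and hands GUI subtasks to a separate GUI model: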

orchestration model (--model)
  ├── CLI and device operations
  ├── MCP / skill-backed host tools
  └── run_seed_gui_subtask(...)
        └── GUI model (--gui-model)
            └── screenshot-grounded app actions
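
The delegation above can be sketched as a tool table in which the GUI entry forwards to a separate worker. The handlers here are stubs, not the phoneharness runtime: `orchestrate`, `gui_worker`, and the plan format are hypothetical, and only the tool names come from this README.

```python
from typing import Callable, Dict, List, Tuple

trace: List[str] = []  # records which component handled each step


def shell_exec(command: str) -> str:
    # Stub for a device-side shell call made by the orchestration model.
    trace.append(f"orchestrator:shell_exec:{command}")
    return f"ran {command}"


def gui_worker(subtask: str) -> str:
    # Stub for the GUI model (--gui-model) doing screenshot-grounded actions.
    trace.append(f"gui_worker:{subtask}")
    return f"gui finished {subtask}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "shell_exec": shell_exec,
    "run_seed_gui_subtask": gui_worker,  # delegation point
}


def orchestrate(plan: List[Tuple[str, str]]) -> List[str]:
    # The outer loop only picks tools; it never touches the screen itself.
    return [TOOLS[tool](argument) for tool, argument in plan]


results = orchestrate([
    ("shell_exec", "settings get global airplane_mode_on"),
    ("run_seed_gui_subtask", "open the Clock app and add an alarm"),
])
```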

🚀 Quick Start

1. 🔐 Configure model credentials

PhoneHarness expects OpenAI-compatible chat-completions endpoints. Export credentials in your shell or secret manager.

export OPENAI_BASE_URL="<openai-compatible-base-url>"
export OPENAI_API_KEY="<api-key>"
export PHONEHARNESS_GUI_API_URL="<optional-gui-model-base-url>"
export PHONEHARNESS_GUI_API_KEY="<optional-gui-model-api-key>"
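
As a sanity check before launching, the variables above could be resolved in Python as sketched below. The fallback of the GUI settings to the main endpoint when the optional variables are unset is an assumption, not documented behavior, and `resolve_endpoints` is a hypothetical helper.

```python
import os
from typing import Dict, Mapping


def resolve_endpoints(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    # Required variables: a missing one raises KeyError with its name.
    base_url = env["OPENAI_BASE_URL"]
    api_key = env["OPENAI_API_KEY"]
    # Assumption: the optional GUI variables fall back to the main endpoint.
    return {
        "base_url": base_url,
        "api_key": api_key,
        "gui_base_url": env.get("PHONEHARNESS_GUI_API_URL") or base_url,
        "gui_api_key": env.get("PHONEHARNESS_GUI_API_KEY") or api_key,
    }


# Usage with an explicit mapping, so the example does not depend on your shell.
example = resolve_endpoints({
    "OPENAI_BASE_URL": "https://example.invalid/v1",
    "OPENAI_API_KEY": "placeholder-key",
})
```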

2. 💻 Start a local console

python3 -m phoneharness console \
  --model "<orchestration-model>" \
  --gui-model "<gui-model>" \
  --base-url "$OPENAI_BASE_URL" \
  --api-key "$OPENAI_API_KEY"

3. 📱 Start an on-device server

python3 -m phoneharness server \
  --port 8920 \
  --model "<orchestration-model>" \
  --gui-model "<gui-model>" \
  --base-url "$OPENAI_BASE_URL" \
  --api-key "$OPENAI_API_KEY" \
  --skill-file skills/routing.yaml \
  --skill-file skills/index.yaml \
  --skill-file skills/file_output_paths.yaml
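
When launching the server from a script, the command above can be assembled as an argument list. This sketch uses only the flags shown in this README; `server_argv` is a hypothetical helper, and the list is built but not executed.

```python
from typing import List, Sequence


def server_argv(
    model: str,
    gui_model: str,
    base_url: str,
    api_key: str,
    skill_files: Sequence[str],
    port: int = 8920,
) -> List[str]:
    argv = [
        "python3", "-m", "phoneharness", "server",
        "--port", str(port),
        "--model", model,
        "--gui-model", gui_model,
        "--base-url", base_url,
        "--api-key", api_key,
    ]
    # --skill-file is repeated once per YAML file, as in the command above.
    for path in skill_files:
        argv += ["--skill-file", path]
    return argv


argv = server_argv(
    model="<orchestration-model>",
    gui_model="<gui-model>",
    base_url="<openai-compatible-base-url>",
    api_key="<api-key>",
    skill_files=[
        "skills/routing.yaml",
        "skills/index.yaml",
        "skills/file_output_paths.yaml",
    ],
)
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once the placeholders are replaced with real values.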

4. 🧾 Inspect traces

python3 scripts/trace2html.py path/to/trace.jsonl
python3 scripts/trace2html_all.py path/to/trace-directory
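
For a quick text summary alongside the HTML viewers, a JSONL trace can be tallied by failure category. The trace schema is not documented in this README, so the `failure_category` field and the sample records below are hypothetical; adjust the key to match your trace files.

```python
import json
from collections import Counter
from typing import Iterable


def tally(lines: Iterable[str], key: str = "failure_category") -> Counter:
    """Count JSONL records by a category field (field name is an assumption)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        counts[record.get(key, "unlabeled")] += 1
    return counts


# Hypothetical records; a real run would iterate over open("path/to/trace.jsonl").
sample = [
    '{"step": 1, "failure_category": "gui_grounding"}',
    '{"step": 2, "failure_category": "tool_failure"}',
    '',
    '{"step": 3, "failure_category": "gui_grounding"}',
    '{"step": 4}',
]
summary = tally(sample)
```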

🗂️ Repository Layout

phoneharness/
├── phoneharness/            # Runtime package for the server, agent loop, tools, and GUI controllers
├── skills/                  # Runtime routing cards and progressive skill-disclosure YAMLs
├── scripts/                 # Emulator, GUI proxy, trace viewer, and helper scripts
├── tests/                   # Unit tests for adapters and harness behavior
└── vdisplay-helper/         # Virtual-display helper source
