π-BENCH is a benchmark for proactive personal assistant agents in
long-horizon workflows, where users start with underspecified requests and
important requirements emerge over the course of the interaction. It contains 100 multi-turn
tasks across 5 domain-specific personas (researcher, marketer,
pharmacist, law_trainee, financier) and organizes them as multi-session
episodes in persistent workspaces.
The benchmark jointly measures Proactivity (PROC) and Completeness (COMP). PROC evaluates whether an agent resolves hidden intents early (through inference or focused elicitation) to reduce avoidable user burden, while COMP evaluates whether final deliverables satisfy checklist requirements and artifact-level obligations. Scoring combines rubric-based hidden-intent judgment and checklist validation, and audit results show low judge disagreement (<4%), which supports evaluation reliability.
Compared with benchmarks focused mainly on short-horizon tasks, GUI/mobile
interactions, or memory retrieval alone, π-BENCH emphasizes persistent,
artifact-centric workflows with hidden intents, inter-task
dependencies, and cross-session continuity, enabling clearer separation
between reactive task completion and proactive assistance quality.
Overall results for Proc / Comp (%). Results are averaged over three runs and
reported as mean ± standard deviation. Each persona column lists Proc / Comp.

| Model | Average Proc | Average Comp | Researcher | Marketer | Pharmacist | Law Trainee | Financier |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 67.0 ± 2.1 | 65.6 ± 1.8 | 46.0 / 66.4 | 78.2 / 67.1 | 75.9 / 71.5 | 56.9 / 61.9 | 78.1 / 61.2 |
| Gemini 3.1 Pro | 57.1 ± 0.9 | 60.0 ± 0.8 | 41.1 / 59.2 | 65.0 / 62.1 | 71.0 / 72.1 | 50.0 / 55.3 | 58.6 / 51.1 |
| Claude Opus 4.6 | 65.5 ± 1.4 | 67.6 ± 1.5 | 50.3 / 74.5 | 75.0 / 74.6 | 82.8 / 68.6 | 45.7 / 57.2 | 73.8 / 63.2 |
| DeepSeek V3.2 | 53.3 ± 1.9 | 57.8 ± 3.0 | 29.0 / 66.9 | 69.1 / 59.4 | 75.9 / 62.6 | 33.2 / 51.1 | 59.1 / 48.9 |
| MiniMax M2.7 | 55.6 ± 3.2 | 60.0 ± 1.8 | 33.4 / 63.9 | 71.9 / 61.9 | 77.1 / 63.6 | 38.6 / 52.5 | 57.2 / 58.1 |
| Kimi K2.5 | 43.1 ± 0.2 | 61.6 ± 1.9 | 28.9 / 63.5 | 41.2 / 62.3 | 70.1 / 74.8 | 34.8 / 54.4 | 40.4 / 52.9 |
| Seed2.0 Pro | 58.4 ± 0.9 | 52.1 ± 3.8 | 38.9 / 59.6 | 71.4 / 44.2 | 77.0 / 67.6 | 46.0 / 44.7 | 58.7 / 44.5 |
| GLM-5.1 | 58.4 ± 0.8 | 63.6 ± 2.9 | 41.8 / 61.6 | 62.6 / 69.1 | 75.2 / 70.3 | 45.5 / 57.3 | 66.7 / 59.8 |
| Qwen3.6 Plus | 64.0 ± 1.1 | 64.1 ± 0.6 | 40.1 / 70.0 | 77.5 / 66.6 | 79.7 / 70.2 | 45.7 / 60.2 | 77.1 / 53.6 |
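The Average columns appear to be the unweighted mean of the five persona scores. This is an observation from the numbers in the table, not a definition stated by the benchmark. A minimal sketch that checks it for the GPT-5.4 row (the helper name `persona_average` is illustrative only):

```python
from statistics import mean

# Per-persona (Proc, Comp) scores for GPT-5.4, copied from the results table.
gpt_5_4 = {
    "researcher": (46.0, 66.4),
    "marketer": (78.2, 67.1),
    "pharmacist": (75.9, 71.5),
    "law_trainee": (56.9, 61.9),
    "financier": (78.1, 61.2),
}


def persona_average(scores):
    """Return the unweighted mean (Proc, Comp) across personas, rounded to 0.1.

    Assumption: every persona carries equal weight in the reported average.
    """
    proc = mean(p for p, _ in scores.values())
    comp = mean(c for _, c in scores.values())
    return round(proc, 1), round(comp, 1)


print(persona_average(gpt_5_4))  # (67.0, 65.6)
```

The result matches the 67.0 / 65.6 reported for GPT-5.4 in the Average Proc and Average Comp columns.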
- Create and activate a Python environment:

      conda create -n pi-bench python=3.11
      conda activate pi-bench

- Install local dependencies and prepare AppWorld data:

      pip install -e .
      pip install -e third_party/nanobot
      bash scripts/setup_appworld.sh

- Create a local environment file and fill in the provider credentials:

      cp env.example.sh env.sh

  The template leaves all values empty. Edit `env.sh` with your local credentials,
  then source it in every shell where you run the benchmark:

      source env.sh

  `env.sh` is ignored by git. The default model configs read credentials from
  `MODEL_BASE_URL`, `MODEL_API_KEY`, `USER_BASE_URL`, `USER_API_KEY`,
  `JUDGER_BASE_URL`, `JUDGER_API_KEY`, and `BRAVE_SEARCH_API_KEY`. These values
  fill the placeholders in `config/models/*.yaml`, such as
  `config/models/example.full.yaml`.

- Pull the benchmark Docker image:

      docker pull zzzhr97/pi-bench:latest

- Optionally edit the target model file under `config/models/`.

  Most users only need to configure `env.sh`. Edit `config/models/<model-id>.yaml`
  only when you need to change model names, endpoints, proxy settings, timeouts,
  or other per-model overrides. The YAML filename stem is the model id passed to
  `pibench`; see `config/models/example.full.yaml` for the complete schema.
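Before starting a run, it can help to confirm that the credentials from the setup steps are visible to the current shell. The sketch below is not part of `pibench`; the helper `missing_credentials` is hypothetical, and it assumes that all seven variables are needed and that `env.sh` exports them so child processes can read them.

```python
import os

# Variable names taken from the setup notes; whether every model config
# needs all of them is an assumption.
REQUIRED_VARS = [
    "MODEL_BASE_URL", "MODEL_API_KEY",
    "USER_BASE_URL", "USER_API_KEY",
    "JUDGER_BASE_URL", "JUDGER_API_KEY",
    "BRAVE_SEARCH_API_KEY",
]


def missing_credentials(environ=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All credentials are set.")
```

Run it in the same shell where you sourced `env.sh`; a variable that is set but not exported will still be reported as missing.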
Run from the repository root. For benchmark reporting, use three repeated trials as the default run pattern:

    pibench --model-id deepseek-v3.2 --run 3

Each repeat is written to a separate output directory with a `__runNN` suffix.
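To collect repeated trials for aggregation, directory names can be grouped by stripping the run suffix. The sketch below assumes the suffix is two underscores, `run`, then digits (for example `__run01`); the exact directory names and which path component carries the suffix are not specified here, so the sample names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: suffix format is "__run" followed by one or more digits.
RUN_SUFFIX = re.compile(r"^(?P<base>.+)__run(?P<index>\d+)$")


def group_runs(dir_names):
    """Map each base name to the sorted list of run indices found for it."""
    groups = defaultdict(list)
    for name in dir_names:
        match = RUN_SUFFIX.match(name)
        if match:  # names without a run suffix are ignored
            groups[match["base"]].append(int(match["index"]))
    return {base: sorted(indices) for base, indices in groups.items()}


# Hypothetical directory names, for illustration only.
names = [
    "deepseek-v3.2__run02",
    "deepseek-v3.2__run01",
    "deepseek-v3.2__run03",
    "notes",
]
print(group_runs(names))  # {'deepseek-v3.2': [1, 2, 3]}
```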
Additional examples:
| Goal | Command |
|---|---|
| Single trial | `pibench --model-id deepseek-v3.2` |
| Specific user | `pibench --user-id law_trainee --model-id deepseek-v3.2` |
| Multiple models | `pibench --model-id deepseek-v3.2,MiniMax-M2.5` |
| Multiple users and models | `pibench --user-id researcher,law_trainee --model-id deepseek-v3.2,MiniMax-M2.5` |
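When both `--user-id` and `--model-id` take comma-separated lists, a natural reading is that every user is paired with every model. That behavior is an assumption here, not something this README confirms. A small sketch of that expansion, with `expand_runs` as a hypothetical helper that is not part of `pibench`:

```python
from itertools import product


def expand_runs(user_ids, model_ids):
    """Split comma-separated id lists and pair every user with every model.

    Assumption: pibench evaluates the full cross product of the two lists.
    """
    users = [u.strip() for u in user_ids.split(",") if u.strip()]
    models = [m.strip() for m in model_ids.split(",") if m.strip()]
    return list(product(users, models))


pairs = expand_runs("researcher,law_trainee", "deepseek-v3.2,MiniMax-M2.5")
for user, model in pairs:
    print(f"user={user} model={model}")
```

Under this assumption, the last command in the table would cover four user/model pairs.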
Results and logs are written under:

    outputs/<model-id>/<user-id>/

Runtime logs for each container run are under:

    outputs/<model-id>/<user-id>/run/<timestamp>-runtime/
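The layout above can be walked with `pathlib` to find the runtime logs for one model and user. This is a sketch under the stated layout only: `runtime_log_dirs` is a hypothetical helper, sorting by name assumes the timestamp prefix sorts chronologically, and repeated trials may add a `__runNN` suffix that this helper does not handle.

```python
from pathlib import Path


def runtime_log_dirs(model_id, user_id, root="outputs"):
    """List runtime log directories for one model/user pair, sorted by name."""
    run_dir = Path(root) / model_id / user_id / "run"
    if not run_dir.is_dir():
        return []  # nothing has been written for this pair yet
    return sorted(
        p for p in run_dir.iterdir()
        if p.is_dir() and p.name.endswith("-runtime")
    )


# Prints nothing if no run has been made for this pair.
for path in runtime_log_dirs("deepseek-v3.2", "law_trainee"):
    print(path)
```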
