Simplified-Reasoning/Pi-Bench

Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflow


🧭 Introduction

π-BENCH is a benchmark for proactive personal assistant agents in long-horizon workflows, where users start with underspecified requests and important requirements emerge across interaction. It contains 100 multi-turn tasks across 5 domain-specific personas (researcher, marketer, pharmacist, law_trainee, financier) and organizes them as multi-session episodes in persistent workspaces.

The benchmark jointly measures Proactivity (PROC) and Completeness (COMP). PROC evaluates whether an agent resolves hidden intents early (through inference or focused elicitation) to reduce avoidable user burden, while COMP evaluates whether final deliverables satisfy checklist requirements and artifact-level obligations. Scoring combines rubric-based hidden-intent judgment and checklist validation, and audit results show low judge disagreement (<4%), which supports evaluation reliability.

Compared with benchmarks focused mainly on short-horizon tasks, GUI/mobile interactions, or memory retrieval alone, π-BENCH emphasizes persistent, artifact-centric workflows with hidden intents, inter-task dependencies, and cross-session continuity, enabling clearer separation between reactive task completion and proactive assistance quality.

🏆 Leaderboard

Overall results for Proc / Comp (%). Results are averaged over three runs; the value after ± is the standard deviation. Per-persona cells list Proc / Comp.

| Model | Average Proc | Average Comp | Researcher | Marketer | Pharmacist | Law Trainee | Financier |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 67.0 ± 2.1 | 65.6 ± 1.8 | 46.0 / 66.4 | 78.2 / 67.1 | 75.9 / 71.5 | 56.9 / 61.9 | 78.1 / 61.2 |
| Gemini 3.1 Pro | 57.1 ± 0.9 | 60.0 ± 0.8 | 41.1 / 59.2 | 65.0 / 62.1 | 71.0 / 72.1 | 50.0 / 55.3 | 58.6 / 51.1 |
| Claude Opus 4.6 | 65.5 ± 1.4 | 67.6 ± 1.5 | 50.3 / 74.5 | 75.0 / 74.6 | 82.8 / 68.6 | 45.7 / 57.2 | 73.8 / 63.2 |
| DeepSeek V3.2 | 53.3 ± 1.9 | 57.8 ± 3.0 | 29.0 / 66.9 | 69.1 / 59.4 | 75.9 / 62.6 | 33.2 / 51.1 | 59.1 / 48.9 |
| MiniMax M2.7 | 55.6 ± 3.2 | 60.0 ± 1.8 | 33.4 / 63.9 | 71.9 / 61.9 | 77.1 / 63.6 | 38.6 / 52.5 | 57.2 / 58.1 |
| Kimi K2.5 | 43.1 ± 0.2 | 61.6 ± 1.9 | 28.9 / 63.5 | 41.2 / 62.3 | 70.1 / 74.8 | 34.8 / 54.4 | 40.4 / 52.9 |
| Seed2.0 Pro | 58.4 ± 0.9 | 52.1 ± 3.8 | 38.9 / 59.6 | 71.4 / 44.2 | 77.0 / 67.6 | 46.0 / 44.7 | 58.7 / 44.5 |
| GLM-5.1 | 58.4 ± 0.8 | 63.6 ± 2.9 | 41.8 / 61.6 | 62.6 / 69.1 | 75.2 / 70.3 | 45.5 / 57.3 | 66.7 / 59.8 |
| Qwen3.6 Plus | 64.0 ± 1.1 | 64.1 ± 0.6 | 40.1 / 70.0 | 77.5 / 66.6 | 79.7 / 70.2 | 45.7 / 60.2 | 77.1 / 53.6 |
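
To reproduce a leaderboard-style entry from your own three runs, something like the following may help. It is a minimal sketch, not the repository's aggregation code: the function name `summarize_runs` and the example scores are hypothetical, and the README does not say whether the reported standard deviation is the sample or population form, so both are offered.

```python
from statistics import mean, pstdev, stdev


def summarize_runs(scores, sample=True):
    """Return (mean, std) of per-run scores, each rounded to one decimal.

    sample=True uses the sample standard deviation (n - 1 denominator);
    sample=False uses the population form. Which one the benchmark
    reports is an assumption left to the caller.
    """
    spread = stdev(scores) if sample else pstdev(scores)
    return round(mean(scores), 1), round(spread, 1)


# Hypothetical Proc scores from three repeated runs of one model.
proc_runs = [65.0, 67.0, 69.0]
avg, std = summarize_runs(proc_runs)
print(f"{avg} ± {std}")
```

With only three runs the two forms can differ noticeably, so it is worth stating which one you used when comparing against the table.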

🧰 Setup

  1. Create and activate a Python environment:
     conda create -n pi-bench python=3.11
     conda activate pi-bench
  2. Install local dependencies and prepare AppWorld data:
     pip install -e .
     pip install -e third_party/nanobot
     bash scripts/setup_appworld.sh
  3. Create a local environment file and fill in the provider credentials:
     cp env.example.sh env.sh

     The template leaves all values empty. Edit env.sh with your local credentials, then source it in every shell where you run the benchmark:

     source env.sh

     env.sh is ignored by git. The default model configs read credentials from MODEL_BASE_URL, MODEL_API_KEY, USER_BASE_URL, USER_API_KEY, JUDGER_BASE_URL, JUDGER_API_KEY, and BRAVE_SEARCH_API_KEY. These values fill the placeholders in config/models/*.yaml, such as config/models/example.full.yaml.

  4. Pull the benchmark Docker image:
     docker pull zzzhr97/pi-bench:latest
  5. Optionally edit the target model file under config/models/.

Most users only need to configure env.sh. Edit config/models/<model-id>.yaml only when you need to change model names, endpoints, proxy settings, timeouts, or other per-model overrides. The YAML filename stem is the model id passed to pibench; see config/models/example.full.yaml for the complete schema.
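
Because the template ships with empty values, a quick pre-flight check can catch a forgotten `source env.sh` before a long run starts. The sketch below is an illustration only: `missing_credentials` is a hypothetical helper, not part of pibench, and it assumes that env.sh exports the variables and that all seven names listed above are needed for your configuration (some may be optional depending on the model config).

```python
import os

# Variable names taken from the setup notes above.
REQUIRED_VARS = (
    "MODEL_BASE_URL", "MODEL_API_KEY",
    "USER_BASE_URL", "USER_API_KEY",
    "JUDGER_BASE_URL", "JUDGER_API_KEY",
    "BRAVE_SEARCH_API_KEY",
)


def missing_credentials(environ=None):
    """Return the names in REQUIRED_VARS that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Unset or empty:", ", ".join(missing))
    else:
        print("All credential variables are set.")
```

Run it in the same shell where you sourced env.sh; a fresh terminal will not inherit the values.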

▶️ Run

Run from the repository root. For benchmark reporting, use three repeated trials as the default run pattern:

pibench --model-id deepseek-v3.2 --run 3

Each repeat is written to a separate output directory with a __runNN suffix.

Additional examples:

| Goal | Command |
|---|---|
| Single trial | pibench --model-id deepseek-v3.2 |
| Specific user | pibench --user-id law_trainee --model-id deepseek-v3.2 |
| Multiple models | pibench --model-id deepseek-v3.2,MiniMax-M2.5 |
| Multiple users and models | pibench --user-id researcher,law_trainee --model-id deepseek-v3.2,MiniMax-M2.5 |
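
When launching runs from a script, the comma-separated flag values shown above can be assembled from lists. This is a hedged sketch: `build_pibench_command` is a hypothetical helper, it uses only the `--model-id`, `--user-id`, and `--run` flags that appear in this README, and combining `--run` with several users and models is an assumption that the examples here do not show.

```python
import shlex


def build_pibench_command(model_ids, user_ids=None, runs=None):
    """Build a pibench argument list from lists of model and user ids."""
    cmd = ["pibench"]
    if user_ids:
        # Several users are passed as a single comma-separated value.
        cmd += ["--user-id", ",".join(user_ids)]
    cmd += ["--model-id", ",".join(model_ids)]
    if runs is not None:
        cmd += ["--run", str(runs)]
    return cmd


cmd = build_pibench_command(
    ["deepseek-v3.2", "MiniMax-M2.5"],
    user_ids=["researcher", "law_trainee"],
    runs=3,
)
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues.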

📦 Outputs

Results and logs are written under:

outputs/<model-id>/<user-id>/

Runtime logs for each container run are under:

outputs/<model-id>/<user-id>/run/<timestamp>-runtime/
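
The layout above can be walked with a few lines of Python when collecting logs after a run. This is a sketch under stated assumptions: `result_dir` and `runtime_log_dirs` are hypothetical helpers, the timestamp format is not documented here (so name order may or may not match chronological order), and the `__runNN` suffix used for repeated trials is not handled because this README does not say which path component carries it.

```python
from pathlib import Path


def result_dir(model_id, user_id, root="outputs"):
    """Path of the results directory for one model and user."""
    return Path(root) / model_id / user_id


def runtime_log_dirs(model_id, user_id, root="outputs"):
    """List '<timestamp>-runtime' directories under run/, sorted by name."""
    run_dir = result_dir(model_id, user_id, root) / "run"
    if not run_dir.is_dir():
        return []
    return sorted(p for p in run_dir.glob("*-runtime") if p.is_dir())


if __name__ == "__main__":
    for log_dir in runtime_log_dirs("deepseek-v3.2", "law_trainee"):
        print(log_dir)
```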
