Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflow

🧭 Introduction

π-BENCH is a benchmark for proactive personal assistant agents in long-horizon workflows, where users start with underspecified requests and important requirements emerge over the course of the interaction. It contains 100 multi-turn tasks across 5 domain-specific personas (researcher, marketer, pharmacist, law_trainee, financier), organized as multi-session episodes in persistent workspaces.

The benchmark jointly measures Proactivity (PROC) and Completeness (COMP). PROC evaluates whether an agent resolves hidden intents early (through inference or focused elicitation) to reduce avoidable user burden, while COMP evaluates whether final deliverables satisfy checklist requirements and artifact-level obligations. Scoring combines rubric-based hidden-intent judgment and checklist validation, and audit results show low judge disagreement (<4%), which supports evaluation reliability.

Compared with benchmarks focused mainly on short-horizon tasks, GUI/mobile interactions, or memory retrieval alone, π-BENCH emphasizes persistent, artifact-centric workflows with hidden intents, inter-task dependencies, and cross-session continuity, enabling clearer separation between reactive task completion and proactive assistance quality.

🏆 Leaderboard

Overall results for Proc / Comp (%). Results are averaged over three runs. The Average columns report mean ± standard deviation, and each persona column reports Proc / Comp.

| Model | Average Proc | Average Comp | Researcher | Marketer | Pharmacist | Law Trainee | Financier |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.4 | 67.0 ± 2.1 | 65.6 ± 1.8 | 46.0 / 66.4 | 78.2 / 67.1 | 75.9 / 71.5 | 56.9 / 61.9 | 78.1 / 61.2 |
| Gemini 3.1 Pro | 57.1 ± 0.9 | 60.0 ± 0.8 | 41.1 / 59.2 | 65.0 / 62.1 | 71.0 / 72.1 | 50.0 / 55.3 | 58.6 / 51.1 |
| Claude Opus 4.6 | 65.5 ± 1.4 | 67.6 ± 1.5 | 50.3 / 74.5 | 75.0 / 74.6 | 82.8 / 68.6 | 45.7 / 57.2 | 73.8 / 63.2 |
| DeepSeek V3.2 | 53.3 ± 1.9 | 57.8 ± 3.0 | 29.0 / 66.9 | 69.1 / 59.4 | 75.9 / 62.6 | 33.2 / 51.1 | 59.1 / 48.9 |
| MiniMax M2.7 | 55.6 ± 3.2 | 60.0 ± 1.8 | 33.4 / 63.9 | 71.9 / 61.9 | 77.1 / 63.6 | 38.6 / 52.5 | 57.2 / 58.1 |
| Kimi K2.5 | 43.1 ± 0.2 | 61.6 ± 1.9 | 28.9 / 63.5 | 41.2 / 62.3 | 70.1 / 74.8 | 34.8 / 54.4 | 40.4 / 52.9 |
| Seed2.0 Pro | 58.4 ± 0.9 | 52.1 ± 3.8 | 38.9 / 59.6 | 71.4 / 44.2 | 77.0 / 67.6 | 46.0 / 44.7 | 58.7 / 44.5 |
| GLM-5.1 | 58.4 ± 0.8 | 63.6 ± 2.9 | 41.8 / 61.6 | 62.6 / 69.1 | 75.2 / 70.3 | 45.5 / 57.3 | 66.7 / 59.8 |
| Qwen3.6 Plus | 64.0 ± 1.1 | 64.1 ± 0.6 | 40.1 / 70.0 | 77.5 / 66.6 | 79.7 / 70.2 | 45.7 / 60.2 | 77.1 / 53.6 |
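
Each averaged entry in the leaderboard combines three run-level scores into a mean and a standard deviation. The snippet below is a minimal sketch of that aggregation, not the benchmark's own scoring code. It assumes the sample standard deviation (the README does not say whether sample or population deviation is used), and both `summarize_runs` and the example scores are hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation), each rounded to one decimal."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return round(mean(scores), 1), round(stdev(scores), 1)


# Hypothetical per-run Proc scores (%), not taken from the leaderboard.
example_runs = [52.0, 54.0, 56.0]
avg, sd = summarize_runs(example_runs)
print(f"{avg} ± {sd}")
```

Swapping `stdev` for `statistics.pstdev` gives the population variant if that turns out to match the reported numbers better.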

🧰 Setup

Create and activate a Python environment:

conda create -n pi-bench python=3.11
conda activate pi-bench

Install local dependencies and prepare AppWorld data:

pip install -e .
pip install -e third_party/nanobot
bash scripts/setup_appworld.sh

Create a local environment file and fill in the provider credentials:

cp env.example.sh env.sh

The template leaves all values empty. Edit env.sh with your local credentials, then source it in every shell where you run the benchmark:

source env.sh

env.sh is ignored by git. The default model configs read credentials from MODEL_BASE_URL, MODEL_API_KEY, USER_BASE_URL, USER_API_KEY, JUDGER_BASE_URL, JUDGER_API_KEY, and BRAVE_SEARCH_API_KEY. These values fill the placeholders in config/models/*.yaml, such as config/models/example.full.yaml.
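
Because a forgotten `source env.sh` is easy to miss, a small preflight check can help. The sketch below is not part of the repository. It assumes that all seven variables listed above must be non-empty, which may be stricter than a given model config actually needs, and `missing_vars` is a hypothetical helper name.

```python
import os

# Variable names as listed in this README.
REQUIRED_VARS = [
    "MODEL_BASE_URL", "MODEL_API_KEY",
    "USER_BASE_URL", "USER_API_KEY",
    "JUDGER_BASE_URL", "JUDGER_API_KEY",
    "BRAVE_SEARCH_API_KEY",
]


def missing_vars(env=None):
    """Return the names from REQUIRED_VARS that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All credentials are set.")
```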

Pull the benchmark Docker image:

docker pull zzzhr97/pi-bench:latest

Optionally edit the target model file under config/models/.

Most users only need to configure env.sh. Edit config/models/<model-id>.yaml only when you need to change model names, endpoints, proxy settings, timeouts, or other per-model overrides. The YAML filename stem is the model id passed to pibench; see config/models/example.full.yaml for the complete schema.
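
Since the model id is the YAML filename stem, the ids available locally can be listed by scanning `config/models/`. This is a minimal sketch under that assumption only. `list_model_ids` is a hypothetical helper, and it does not check whether `pibench` accepts every file it finds (for instance, the schema example `example.full.yaml` yields the stem `example.full`).

```python
from pathlib import Path


def list_model_ids(config_dir="config/models"):
    """Return the sorted filename stems of all *.yaml files in config_dir."""
    return sorted(path.stem for path in Path(config_dir).glob("*.yaml"))


if __name__ == "__main__":
    # Run from the repository root so the relative path resolves.
    for model_id in list_model_ids():
        print(model_id)
```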

▶️ Run

Run from the repository root. For benchmark reporting, use three repeated trials as the default run pattern:

pibench --model-id deepseek-v3.2 --run 3

Each repeat is written to a separate output directory with a __runNN suffix.
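
When post-processing repeated trials, the run index can be recovered from the suffix. The sketch below assumes the suffix is `__run` followed by digits at the very end of a directory name. The README does not specify which path component carries the suffix, so `run_index` and the names in the usage lines are hypothetical.

```python
import re

# Matches a trailing "__run" followed by one or more digits.
RUN_SUFFIX = re.compile(r"__run(\d+)$")


def run_index(dir_name):
    """Return the integer after a trailing __run suffix, or None if absent."""
    match = RUN_SUFFIX.search(dir_name)
    return int(match.group(1)) if match else None


# Hypothetical directory names, for illustration only.
print(run_index("researcher__run02"))
print(run_index("researcher"))
```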

Additional examples:

| Goal | Command |
| --- | --- |
| Single trial | `pibench --model-id deepseek-v3.2` |
| Specific user | `pibench --user-id law_trainee --model-id deepseek-v3.2` |
| Multiple models | `pibench --model-id deepseek-v3.2,MiniMax-M2.5` |
| Multiple users and models | `pibench --user-id researcher,law_trainee --model-id deepseek-v3.2,MiniMax-M2.5` |
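
For scripted sweeps, the same commands can be assembled programmatically. The sketch below uses only the flags shown in this README (`--user-id`, `--model-id`, `--run`) and joins multiple ids with commas as in the examples. `build_pibench_command` is a hypothetical helper. Requiring at least one model id and combining `--run` with multiple ids are assumptions, since the README shows `--run` only with a single model.

```python
import shlex


def build_pibench_command(model_ids, user_ids=None, runs=None):
    """Assemble a pibench argument list using only flags shown in this README."""
    if not model_ids:
        # Assumption: every documented example passes --model-id.
        raise ValueError("at least one model id is required")
    argv = ["pibench"]
    if user_ids:
        argv += ["--user-id", ",".join(user_ids)]
    argv += ["--model-id", ",".join(model_ids)]
    if runs is not None:
        argv += ["--run", str(runs)]
    return argv


cmd = build_pibench_command(
    ["deepseek-v3.2", "MiniMax-M2.5"],
    user_ids=["researcher", "law_trainee"],
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.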

📦 Outputs

Results and logs are written under:

outputs/<model-id>/<user-id>/

Runtime logs for each container run are under:

outputs/<model-id>/<user-id>/run/<timestamp>-runtime/
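
To locate runtime logs after a run, the layout above can be walked with `pathlib`. This is a minimal sketch that assumes the directory pattern exactly as written here. `runtime_log_dirs` is a hypothetical helper, the timestamp format is not documented, and the lexicographic sort matches chronological order only if the timestamps happen to sort that way. Any `__runNN` suffix from repeated trials is not handled.

```python
from pathlib import Path


def runtime_log_dirs(model_id, user_id, root="outputs"):
    """Return sorted '*-runtime' directories for one model/user pair."""
    run_dir = Path(root) / model_id / user_id / "run"
    return sorted(p for p in run_dir.glob("*-runtime") if p.is_dir())


if __name__ == "__main__":
    # Run from the repository root so the relative path resolves.
    for log_dir in runtime_log_dirs("deepseek-v3.2", "researcher"):
        print(log_dir)
```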
