feat(agentic): add Hidden Intent proactivity tracking framework (pi-Bench) #846
Open
harryfan1985 wants to merge 6 commits into
Based on the pi-Bench Hidden Intent framework (arXiv 2605.14678), this
introduces infrastructure for tracking proactive assistance quality in
long-horizon agent workflows.
Paper reference:
pi-Bench: Evaluating Proactive Personal Assistant Agents in
Long-Horizon Workflows
Zhang et al., arXiv 2605.14678, May 2026
What this adds:
- Hidden Intent types: IntentTerminalStatus (Completed/Inferred/Provided),
HiddenIntent, PersistentIntent, SessionIntentTracking,
ProactivityScore, CompletenessScore in services-core
- IntentEvidenceCollector and IntentTurnEvidence in the ExecutionEngine
for lightweight per-turn signal collection
- Proactivity behavior guidance in agentic_mode.md and claw_mode.md
system prompts
- Extended facet_extraction.md with proactivity/completeness
analysis dimensions
- SessionUsageReport extensions with ProactivityReport and
CompletenessReport
Owner
This PR involves significant changes and affects the Agentic agent; it will be considered for merging after verification.
Contributor
Author
sure!
- round_executor: detect AskUserQuestion even when no topic headers are
extractable, so the call is no longer silently dropped
- execution_engine/session_manager: drop unused turn_id param; warn on
poisoned intent evidence mutex instead of silent skip
- hidden_intent_types: centralize proactivity level thresholds in
ProactivityLevel::{from_score,as_str}; add explicit IntentAssignment
is_proxy flag so proxy detection no longer relies solely on a fragile
intent_id string heuristic (heuristic kept as legacy fallback)
- session_usage: use is_proxy flag first; document the single-provided
suppression rationale
- add regression tests for AskUserQuestion detection and proxy filtering
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
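The commit above replaces a silent skip on a poisoned intent-evidence mutex with a warning. The following is a minimal, std-only sketch of that pattern, not the PR's actual code: `Collector` and `record_round` are hypothetical names, and `eprintln!` stands in for the `warn!` macro so the example has no dependencies.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the PR's evidence collector.
#[derive(Default)]
struct Collector {
    rounds: u32,
}

/// Records one round. Returns false (after emitting a warning) when the
/// lock is poisoned, instead of skipping without any trace.
fn record_round(collector: &Arc<Mutex<Collector>>) -> bool {
    match collector.lock() {
        Ok(mut guard) => {
            guard.rounds += 1;
            true
        }
        Err(poisoned) => {
            // The PR logs through warn!; eprintln! keeps this sketch dependency-free.
            eprintln!("intent evidence mutex poisoned, evidence skipped: {poisoned}");
            false
        }
    }
}
```

A mutex becomes poisoned when a thread panics while holding its guard, so the warning path is only reached after an earlier failure elsewhere in the turn; surfacing it makes that earlier failure easier to trace.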
Contributor
Author
Follow-up: code review fixes (commit …)
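| Issue | File | Fix |
|---|---|---|
| AskUserQuestion calls missed | round_executor.rs | detect_ask_user_question previously returned (false, []) when questions[].header was missing, so the tool call was silently dropped. The call itself is now recorded with a separate called flag, and topic extraction stays best-effort. |
| Unused parameter | session_manager.rs / execution_engine.rs | Removed the never-used _turn_id parameter from record_intent_evidence (the turn is located through evidence.turn_index) and updated the only caller. |
| Silent loss on mutex poisoning | execution_engine.rs | When the evidence collector lock is poisoned, a warn! log is now emitted instead of silently skipping, which makes the failure easier to trace. |
| Thresholds duplicated in three places | hidden_intent_types.rs / intent_evidence.rs / service.rs | The 0.8/0.5/0.2 level thresholds are consolidated in ProactivityLevel::from_score() and as_str(); the other two sites now delegate to them. |
| Fragile proxy-assignment detection | hidden_intent_types.rs / service.rs | IntentAssignment gains an explicit is_proxy: bool field (serde default false, backward compatible). is_legacy_proxy_intent_assignment reads this field first; the original intent_id.starts_with("turn-") string heuristic is kept as a fallback for old data, so real intents are not misclassified. |
| Single-Provided filter undocumented | service.rs | Added a comment explaining why a single Provided (total=1) does not make a meaningful report and is therefore suppressed. |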
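To illustrate the threshold consolidation listed in the table, here is a small sketch of what a single source of truth for the 0.8/0.5/0.2 cut-offs could look like. Only the type name, the two method names, and the three threshold values come from this PR; the variant names, the returned strings, the inclusive comparisons, and the `level_label` caller are assumptions made for the example.

```rust
/// Variant names are hypothetical; only the 0.8 / 0.5 / 0.2 cut-offs come from the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProactivityLevel {
    High,
    Medium,
    Low,
    Minimal,
}

impl ProactivityLevel {
    /// The only place the thresholds are written down.
    /// Inclusive lower bounds are an assumption of this sketch.
    fn from_score(score: f64) -> Self {
        if score >= 0.8 {
            Self::High
        } else if score >= 0.5 {
            Self::Medium
        } else if score >= 0.2 {
            Self::Low
        } else {
            Self::Minimal
        }
    }

    /// String form used by reports (labels are illustrative).
    fn as_str(self) -> &'static str {
        match self {
            Self::High => "high",
            Self::Medium => "medium",
            Self::Low => "low",
            Self::Minimal => "minimal",
        }
    }
}

/// A call site that used to repeat the thresholds now just delegates.
fn level_label(score: f64) -> &'static str {
    ProactivityLevel::from_score(score).as_str()
}
```

With this shape, changing a cut-off touches one function, and every caller picks up the new value.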
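New tests
- round_executor.rs: 6 detect_ask_user_question cases covering header present / header absent / empty array / missing key / tool not called / mixed tool calls.
- session_usage/service.rs: 2 proxy-filter cases: is_proxy=true must be excluded (regardless of intent_id), and a real intent with a turn- prefix must not be filtered out when is_proxy=false.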
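The detection behaviour those round_executor.rs cases exercise can be sketched as follows. This is an illustration under simplifying assumptions, not the PR's implementation: the real function inspects JSON tool-call arguments, whereas here a hypothetical `ToolCall` struct models the `questions[].header` shape directly, and the `call` helper exists only to build inputs.

```rust
/// Hypothetical, simplified view of a tool call.
struct ToolCall {
    name: String,
    /// `None` models a missing `questions` key; each entry is one
    /// question's optional `header`.
    question_headers: Option<Vec<Option<String>>>,
}

/// Returns (called, topics). `called` reflects the tool call itself;
/// topics are collected best-effort and may be empty.
fn detect_ask_user_question(calls: &[ToolCall]) -> (bool, Vec<String>) {
    let mut called = false;
    let mut topics = Vec::new();
    for call in calls.iter().filter(|c| c.name == "AskUserQuestion") {
        // Record the call before looking at headers, so a missing
        // header can no longer hide the call.
        called = true;
        let headers = call.question_headers.iter().flatten().flatten();
        topics.extend(headers.cloned());
    }
    (called, topics)
}

/// Convenience constructor for the examples (not part of the PR).
fn call(name: &str, headers: Option<Vec<Option<&str>>>) -> ToolCall {
    ToolCall {
        name: name.to_string(),
        question_headers: headers
            .map(|hs| hs.into_iter().map(|h| h.map(str::to_string)).collect()),
    }
}
```

Keeping `called` separate from `topics` means a future change in the argument shape degrades topic extraction only, rather than dropping the signal that a clarifying question was asked.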
Verification
cargo test -p bitfun-services-core hidden_intent # 10 passed
cargo test -p bitfun-core intent_evidence # 12 passed
cargo test -p bitfun-core report_ # 22 passed (including 2 new)
cargo test -p bitfun-core detect_ask_user_question # 6 passed (new)
cargo check --tests -p bitfun-services-core / bitfun-core / bitfun-desktop # all pass
Note: issue #2 (parsing proactive_tools= out of the trigger_description free text) is a design-level concern. It is best replaced with a dedicated field together with the follow-up structured evaluator, so it is left unchanged here to keep the scope contained.
Mirror the Rust IntentAssignment is_proxy field so the frontend can read and filter proxy assignments. Optional to stay backward compatible.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
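For context, the Rust-side filtering that this field supports might look roughly like the sketch below. It is an assumption-laden illustration: the PR describes `is_proxy` as a `bool` with a serde default of `false` and does not show how legacy rows are told apart, so this sketch uses `Option<bool>` to make the three cases explicit (explicit proxy, explicit non-proxy, legacy record with no flag). The struct is reduced to two fields, and `is_proxy_assignment` and `real_assignments` are hypothetical names.

```rust
/// Simplified stand-in for IntentAssignment; only the fields used here.
struct IntentAssignment {
    intent_id: String,
    /// `None` models a legacy record written before the flag existed
    /// (an assumption of this sketch, not the PR's field type).
    is_proxy: Option<bool>,
}

/// Explicit flag first; the `turn-` prefix heuristic is only a fallback
/// for records that carry no flag.
fn is_proxy_assignment(a: &IntentAssignment) -> bool {
    match a.is_proxy {
        Some(flag) => flag,
        None => a.intent_id.starts_with("turn-"),
    }
}

/// Keeps only assignments that should count toward a proactivity report.
fn real_assignments(all: &[IntentAssignment]) -> Vec<&IntentAssignment> {
    all.iter().filter(|a| !is_proxy_assignment(a)).collect()
}
```

Under these assumptions, a real intent whose id happens to start with `turn-` survives the filter as long as it is explicitly marked non-proxy, which matches the regression cases described in the follow-up comment.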
Overview
This PR adds the platform-neutral groundwork for Hidden Intent / proactivity tracking in BitFun agent sessions, inspired by the pi-Bench paper (arXiv 2605.14678).
The current implementation does not claim to fully reproduce pi-Bench's hidden-intent evaluator. Instead, it introduces the session/config/data contracts, prompt guidance, runtime evidence capture, and report fields needed to evaluate whether an agent proactively handles latent requirements once real hidden-intent assignments are available.
Paper Source
pi-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
Zhang et al., arXiv 2605.14678, May 2026
https://arxiv.org/abs/2605.14678
The paper evaluates proactive personal assistants through hidden intents with three terminal states:
- Completed: the agent directly satisfies the hidden intent without the user explicitly providing it.
- Inferred: the agent asks a targeted clarification and the user reveals the hidden intent.
- Provided: the user must proactively supply the hidden intent.
Proactivity Score = (Completed + Inferred) / Total Hidden Intents.
Task completeness is a separate final-output/task-requirement judgment in the paper, not something that should be inferred from the hidden-intent terminal states.
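As a worked illustration of the score defined above, the following sketch computes it from a list of terminal statuses. The enum mirrors the three states named in this PR; the `proactivity_score` function and its choice to return `None` when there are no hidden intents are assumptions of the example, not something the PR or the paper specifies.

```rust
/// The three terminal states described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IntentTerminalStatus {
    Completed,
    Inferred,
    Provided,
}

/// (Completed + Inferred) / total hidden intents.
/// Returns None for an empty slice to avoid dividing by zero
/// (this handling is an assumption of the sketch).
fn proactivity_score(statuses: &[IntentTerminalStatus]) -> Option<f64> {
    if statuses.is_empty() {
        return None;
    }
    let proactive = statuses
        .iter()
        .filter(|s| {
            matches!(
                s,
                IntentTerminalStatus::Completed | IntentTerminalStatus::Inferred
            )
        })
        .count();
    Some(proactive as f64 / statuses.len() as f64)
}
```

For example, one Completed, one Inferred, and two Provided intents give (1 + 1) / 4 = 0.5. Task completeness does not appear anywhere in this calculation, consistent with it being a separate judgment.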
What This PR Changes
1. Prompt Guidance
- agentic_mode.md: adds guidance for coding agents to infer likely requirements from workspace context and ask targeted questions when information is missing.
- claw_mode.md: strengthens personal-assistant proactivity guidance, including preference/context recovery across longer workflows.
- facet_extraction.md: extends the session-insights extraction prompt with proactivity and completeness fields. This is prompt/schema guidance only; it is not yet wired as the authoritative hidden-intent assignment grader.
2. Platform-Neutral Data Contracts
- hidden_intent_types.rs: adds DTOs/enums for hidden intents, persistent intents, terminal statuses, session-level tracking, raw turn evidence, and score/report value types.
- session/types.rs: adds persisted intent_assignments and intent_evidence on dialog turns, plus session metadata fields for intent tracking and optional score snapshots.
- core/session.rs: adds enable_intent_tracking to SessionConfig, defaulting to false.
- services-core owns the shared contracts so the logic stays platform-agnostic and can be exposed through desktop/web/server adapters.
3. Runtime Evidence Collection
- intent_evidence.rs: collects lightweight per-turn trajectory signals such as targeted user-question usage, question topics, proactive tool calls, output production, and round count.
- round_executor.rs: detects AskUserQuestion tool usage and extracts question-topic hints from tool-call arguments.
- execution_engine.rs: accumulates evidence during the turn and persists a snapshot after the dialog loop completes.
- coordinator.rs: creates an IntentEvidenceCollector only when enable_intent_tracking=true.
- session_manager.rs: persists raw evidence to both session metadata and the dialog-turn file without converting it into hidden-intent terminal assignments.
4. Session Usage Report Surface
- session_usage/types.rs: adds optional proactivity and completeness report fields.
- session_usage/service.rs: aggregates real hidden-intent assignments into a proactivity report when such assignments exist.
- turn-* assignments generated from raw evidence are ignored so old heuristic data is not reported as real hidden-intent evaluation.
5. Frontend / Adapter Plumbing
- […] enable_intent_tracking and pass it into SessionConfig.
- […] enableIntentTracking.
- […] intentEvidence, so future UI/report features can inspect raw evidence separately from hidden-intent assignments.
Incremental Refactor / pi-Bench Alignment
A follow-up refactor tightened the implementation against the paper's functional model:
- […] IntentTurnEvidence, not synthetic IntentAssignment rows.
- […] Completed/Inferred/Provided; it is reserved for an independent final-task grader.
- […] services-core, execution evidence collection lives in bitfun-core, and UI code consumes typed API/session-history data.
Current Limitations / Follow-ups
- AskUserQuestion topic extraction depends on the current tool-call argument shape and should be revisited if the tool schema changes.
Follow-up TODO: Validation and Optimization
Validation TODO
- […] Completed, Inferred, and Provided classification semantics independently from runtime evidence collection.
- […] enable_intent_tracking=true, covering metadata persistence, turn-file intentEvidence, and usage report aggregation after reload.
- […] turn-* assignments.
- […] Inferred, and user-provided hidden intent.
Optimization TODO
- […] AskUserQuestion tool contract or versioned parser to avoid silent drift when the tool schema changes.
- […] not evaluated, partially evaluated, and fully evaluated states.
Risk Assessment
Low Risk
- enable_intent_tracking defaults to false, so evidence collection is opt-in.
- […] Option, Vec, serde defaults, and aliases for backward-compatible deserialization.
Medium Risk
- […] Inferred) from generic questions or passive waiting.
Verification
cargo test -p bitfun-services-core hidden_intent -- --nocapture
cargo test -p bitfun-core intent_evidence -- --nocapture
cargo test -p bitfun-core report_ -- --nocapture
cargo check --tests -p bitfun-services-core
cargo check --tests -p bitfun-core
cargo check --tests -p bitfun-desktop
pnpm run type-check:web
pnpm run lint:web
pnpm --dir src/web-ui run test:run (139 files / 744 tests passed)
Generated with BitFun