
feat(diag-phase3): Land the IDM interactive diagnosis pipeline and harden gateway/diagnosis stability #559

Merged
phantom5099 merged 6 commits into 1024XEngineer:main from pionxe:feature/diag-phase3-idm
May 6, 2026


@pionxe
Collaborator

@pionxe pionxe commented May 5, 2026

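Background

This PR aligns strictly with the Phase 3 rollout plan: it delivers the IDM interactive troubleshooting capability behind neocode diag -i and fixes the instability of automatic and manual diagnosis in shell scenarios.
It also completes the gateway session protocol and the permission-decision path, to avoid problems such as "false entry", hangs, and error messages that cannot be traced to their source.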

Closes #530

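Main changes

1) Gateway / Runtime Bridge capability completion

  • Added and wired up the protocol, dispatch, and bridge implementation for gateway.createSession.
  • Completed the resolve_permission routing and validation path so that permission events can be handled end to end.
  • Hardened the gateway RPC client and expanded its test coverage (connection, retry, error mapping, and so on).
  • Added tests for bridge startup and session bootstrap to make session initialization more consistent.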

2) neocode shell + diag -i (IDM) main path

  • Added an IDM controller and an input interceptor, providing an independent interactive session and REPL routing.
  • Supports input semantics such as @ai, \@ai, exit, and pass-through of native commands.
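  • Added discovery and routing logic for a dedicated IDM socket so that CLI diag -i is routed correctly.
  • Refined the three-state Ctrl+C behavior, streaming interruption, recovery on exit, and RingBuffer growth, shrinking, and cleanup.
  • Adjusted the shell diagnosis path and added degrade/exit strategies and error prompts for automatic diagnosis.
  • Updated the diagnosis section of the README with a description of how diag -i actually behaves interactively.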
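
For readers unfamiliar with the RingBuffer item above, here is a minimal sketch of a byte ring that can grow, shrink, and be cleared. It is not the code from this PR: the type ringBuffer and its methods are hypothetical names chosen for illustration, and the policy of keeping the newest bytes on shrink is an assumption.

```go
package main

import "fmt"

// ringBuffer is a hypothetical byte ring that keeps the most recent writes.
// It only illustrates resize and cleanup; it is not the PR's RingBuffer.
type ringBuffer struct {
	buf   []byte
	start int // index of the oldest byte
	size  int // number of valid bytes
}

func newRingBuffer(capacity int) *ringBuffer {
	if capacity < 0 {
		capacity = 0
	}
	return &ringBuffer{buf: make([]byte, capacity)}
}

// Write appends p, overwriting the oldest bytes once the buffer is full.
func (r *ringBuffer) Write(p []byte) {
	if len(r.buf) == 0 {
		return
	}
	for _, b := range p {
		end := (r.start + r.size) % len(r.buf)
		r.buf[end] = b
		if r.size < len(r.buf) {
			r.size++
		} else {
			r.start = (r.start + 1) % len(r.buf)
		}
	}
}

// Bytes returns a copy of the contents, oldest byte first.
func (r *ringBuffer) Bytes() []byte {
	out := make([]byte, r.size)
	for i := 0; i < r.size; i++ {
		out[i] = r.buf[(r.start+i)%len(r.buf)]
	}
	return out
}

// Resize changes the capacity, keeping the newest bytes that still fit
// (assumed policy).
func (r *ringBuffer) Resize(capacity int) {
	if capacity < 0 {
		capacity = 0
	}
	data := r.Bytes()
	if len(data) > capacity {
		data = data[len(data)-capacity:]
	}
	r.buf = make([]byte, capacity)
	r.start = 0
	r.size = copy(r.buf, data)
}

// Reset drops all content but keeps the current capacity.
func (r *ringBuffer) Reset() { r.start, r.size = 0, 0 }

func main() {
	rb := newRingBuffer(4)
	rb.Write([]byte("abcdef"))
	fmt.Printf("%s\n", rb.Bytes()) // cdef
	rb.Resize(2)
	fmt.Printf("%s\n", rb.Bytes()) // ef
	rb.Resize(8)
	rb.Write([]byte("gh"))
	fmt.Printf("%s\n", rb.Bytes()) // efgh
}
```

Linearizing on every resize keeps the index arithmetic simple at the cost of one copy, which seems acceptable if resizes are rare.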

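3) Diagnose sub-agent stability fixes (tools disabled + contract fallback)

  • The diagnose sub-agent now switches explicitly to TaskTypeReview and forces ToolUseModeDisabled.
  • Narrowed the allowed tools so the diagnosis path cannot trigger the tool-call chain.
  • Aligned the output-contract prompt with the review contract (report/findings).
  • The parsing layer falls back to report when summary is empty, reducing failures caused by format drift.
  • Normalized errors of the "sub-agent output contract mismatch" kind into a friendly degraded message.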

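Testing and verification

  • Added and updated the related unit tests:
    • gateway/bridge/createSession/resolvePermission tests
    • shell diag command parsing and socket routing tests
    • IDM controller/interceptor/session/signal/snapshot tests
    • diagnose sub-agent tool-disabling and contract-fallback tests
  • Tests pass for the key packages: gateway, ptyproxy, runtime, subagent, tools/diagnose
  • The full regression suite passes on this branch (at the source level).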

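Compatibility and risks

  • The Phase 2 diag / diag auto entry-point semantics are unchanged; Phase 3 is an incremental enhancement.
  • IDM handles permission requests through a controlled path, avoiding interactive hangs caused by the lack of an approval channel.
  • With tools disabled, the diagnose sub-agent relies more on logs and contextual inference; stability takes priority over active probing.
  • Automatic and manual diagnosis are both slow. This comes from the heavyweight mechanisms of the backend Runtime; if the Runtime later offers a lightweight fast-answer mode, the problem can be resolved.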

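Commit breakdown

  1. bb433196 feat(gateway): wire up createSession and the permission-decision bridge, and harden RPC robustness
  2. 2630c919 feat(ptyproxy): implement the diag -i interactive sandbox and switch the shell diagnosis path
  3. 98895801 fix(diagnose): disable tools for the sub-agent and fix the output-contract degrade fallback

Current behavior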

0ca5a967-bfd1-4213-a52a-d46a775f812e

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.


@fennoai fennoai Bot left a comment


Found 2 noteworthy regressions after reviewing the diff and validating the changed packages with targeted tests.

Comment thread internal/ptyproxy/proxy_unix.go Outdated
	_ = listener.Close()
	_ = os.Remove(socketPath)
}()
idmListener, idmSocketPath, err := listenIDMSocket()

medium neocode shell --socket no longer round-trips cleanly with the new IDM entry path. RunManualShell honors ManualShellOptions.SocketPath for the regular diagnose socket, but IDM is always bound via ResolveDefaultIDMDiagSocketPath() instead of deriving from the configured override. As a result, a shell started with a custom socket will accept neocode diag on the override while neocode diag -i --socket <same path> will miss the listener entirely. Either derive the IDM socket from the configured override or introduce a separate explicit IDM socket option so the routing stays predictable.
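
One way to read the first suggestion above (derive the IDM socket from the configured override) is sketched below. The helper deriveIDMSocketPath, the "-idm" suffix, and the sample paths are hypothetical; the only assumption carried over from the review is that the shell and the CLI would both apply the same derivation so that diag -i --socket finds the listener.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// deriveIDMSocketPath is a hypothetical helper: with no override it returns
// the default IDM path, otherwise it places the IDM socket next to the
// override by inserting "-idm" before the file extension.
func deriveIDMSocketPath(override, defaultIDMPath string) string {
	if strings.TrimSpace(override) == "" {
		return defaultIDMPath
	}
	ext := filepath.Ext(override)
	base := strings.TrimSuffix(override, ext)
	return base + "-idm" + ext
}

func main() {
	// The default path below is a placeholder, not the project's real default.
	const defaultIDM = "/run/example/idm.sock"

	fmt.Println(deriveIDMSocketPath("", defaultIDM))
	fmt.Println(deriveIDMSocketPath("/tmp/custom.sock", defaultIDM))
	fmt.Println(deriveIDMSocketPath("/tmp/custom", defaultIDM))
}
```

A pure function like this can be called from both the listener side and the CLI side, so the two cannot drift apart; a separate explicit IDM socket option, the reviewer's other suggestion, would trade that convenience for more control.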

if timeout <= 0 {
	timeout = diagnoseSubAgentTimeout
}
runCtx, cancel := context.WithTimeout(parent, timeout+diagnoseSubAgentGrace)

medium The new grace window is effectively bypassed on the real runtime path. runDiagnoseSubAgent creates timeout + diagnoseSubAgentGrace, but runtimeSubAgentInvoker.Run now wraps the same request in context.WithTimeout(ctx, input.Timeout) before calling RunSubAgentTask, so diagnose still fails at the inner 25s deadline. That means this change does not actually provide the extra cleanup/degrade buffer it claims to add. If the grace period is intentional, the inner timeout needs to stop re-applying the shorter deadline.

@pionxe
Collaborator Author

pionxe commented May 6, 2026

@xgopilot raise the incremental coverage to 95 percent

@fennoai

fennoai Bot commented May 6, 2026

Result

Pushed one round of coverage improvements to the current PR branch as commit a99b0c44.

The new tests mainly cover:

  • internal/gateway/create_session_coverage_test.go
  • internal/gateway/protocol/create_session_test.go
  • internal/ptyproxy/idm_controller_branches_test.go

Local verification

Ran:

  • go test ./internal/cli ./internal/gateway ./internal/gateway/protocol ./internal/ptyproxy ./internal/runtime ./internal/subagent ./internal/tools/diagnose -coverprofile=/tmp/pr559-cover.out -count=1 -timeout=180s

A local diff-coverage estimate over the new executable Go lines in origin/main...HEAD currently comes to about 75.58% (1232/1630).

The main gaps remain in:

  • internal/ptyproxy/idm_controller.go: about 74.62%
  • internal/ptyproxy/proxy_unix.go: about 63.33%
  • internal/cli/gateway_runtime_bridge.go: about 50.00%

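Conclusion

This round substantially filled in create_session and a batch of IDM controller branches, but it did not bring incremental coverage up to 95%. If the work continues, the next round should focus on:

  • the shell/socket/diag helper branches in proxy_unix.go
  • the remaining branches around extractIDMRuntimeEnvelope, renderIDMAnswer, and ensureSessionReady in idm_controller.go
  • the default workspace assembly path in gateway_runtime_bridge.go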
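
To put the remaining gap in concrete terms, the arithmetic can be checked with a few lines of Go. The figures 1232 and 1630 are the ones reported above; the helper linesNeeded is a hypothetical name, and it assumes the set of new executable lines stays at 1630.

```go
package main

import "fmt"

// linesNeeded returns how many more lines must be covered to reach
// targetPercent, rounding the required count up (integer arithmetic).
func linesNeeded(covered, total, targetPercent int) int {
	required := (total*targetPercent + 99) / 100
	if required <= covered {
		return 0
	}
	return required - covered
}

func main() {
	covered, total := 1232, 1630
	fmt.Printf("current: %.2f%%\n", float64(covered)*100/float64(total)) // current: 75.58%
	fmt.Println("additional lines for 95%:", linesNeeded(covered, total, 95)) // 317
}
```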


Generated with [codeagent](https://github.com/qbox/codeagent)
Co-authored-by: pionxe <148670367+pionxe@users.noreply.github.com>
@phantom5099 phantom5099 merged commit 69eae28 into 1024XEngineer:main May 6, 2026
2 of 3 checks passed


Development

Successfully merging this pull request may close these issues.

Phase 3: IDM interactive troubleshooting sandbox

3 participants