Recover stalled Slack active turns #69

Open
pengx17 wants to merge 1 commit into HOOLC:main from pengx17:codex/active-turn-watchdog-recovery

pengx17 (Collaborator) commented May 25, 2026

Context

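When a user followed up in a Slack thread asking "Where's the PR?", the broker had already traced the problem to active_turn_id being stuck on an old turn: subsequent messages kept being routed into the zombie turn, and the session did not recover automatically. The original fix branch had already been pushed to codex/active-turn-watchdog-recovery, but the GitHub identity in use at the time did not have permission to open a PR. This PR was opened after syncing that branch to the pengx17 fork.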

Collaboration process

sequenceDiagram
  participant User as Slack user
  participant Bot as broker bot
  participant Broker as slack-codex-broker
  participant PR as GitHub PR

  User->>Bot: Follows up on the stuck thread / where is the PR?
  Bot->>Broker: Inspect session / active_turn_id
  Broker-->>Bot: Active turn inProgress for a long time with no runtime activity
  Bot->>Broker: Implement watchdog + manual repair endpoint
  Bot-->>User: Branch pushed, but no permission to open a PR
  User->>PR: Asks for the PR to be opened
  PR-->>User: This PR is opened from the fork

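Note: The key turning point was re-diagnosing the problem from "fixed 5-hour cleanup" to "no automatic recovery after a terminal event / stale active turn". This PR therefore does not clean up whole sessions; it recovers a single stalled active turn.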

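Options considered

  • Rejected: fixed idle cleanup. The original thread already confirmed that session cleanup is triggered only under disk pressure; deleting sessions after a fixed duration would lose Slack history and agent session state.
  • Rejected: manual full reset. It can unstick a session temporarily, but it cannot prevent another hang the next time a terminal event is lost.
  • Adopted: active-turn stall watchdog. It uses the most recent agent-runtime trace activity as evidence; once the threshold is exceeded, it does a best-effort interrupt, resets inflight messages, clears the active turn, and lets pending Slack messages be dispatched again.
  • Added: a single-session repair endpoint. It gives operators a low-impact manual recovery entry point that does not require deleting the whole session.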
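The recovery steps listed for the watchdog can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Session`, `QueuedMessage`, and `recoverStalledTurn` names are hypothetical, and the interrupt call is modeled as a synchronous callback for brevity, whereas the real runtime call is presumably asynchronous.

```typescript
// Hypothetical shapes: the broker's real session model is not shown in this PR description.
type MessageStatus = "pending" | "inflight" | "done";

interface QueuedMessage {
  id: string;
  status: MessageStatus;
}

interface Session {
  activeTurnId: string | null;
  messages: QueuedMessage[];
}

interface RecoveryResult {
  recovered: boolean;
  interruptSucceeded: boolean;
  requeued: string[]; // ids of messages moved from inflight back to pending
}

// Assumed callback that asks the runtime to interrupt a turn; it may throw.
type InterruptFn = (turnId: string) => void;

function recoverStalledTurn(session: Session, interrupt: InterruptFn): RecoveryResult {
  const turnId = session.activeTurnId;
  if (turnId === null) {
    // Nothing to recover: no active turn is recorded on the session.
    return { recovered: false, interruptSucceeded: false, requeued: [] };
  }

  // Step 1: best-effort interrupt. A failure here must not block recovery.
  let interruptSucceeded = true;
  try {
    interrupt(turnId);
  } catch (err) {
    interruptSucceeded = false;
  }

  // Step 2: reset inflight messages to pending so they can be dispatched again.
  const requeued: string[] = [];
  for (const message of session.messages) {
    if (message.status === "inflight") {
      message.status = "pending";
      requeued.push(message.id);
    }
  }

  // Step 3: clear the active turn so the next dispatch starts a fresh turn.
  session.activeTurnId = null;

  return { recovered: true, interruptSucceeded, requeued };
}
```

Swallowing the interrupt failure is deliberate in this sketch: a stalled turn is exactly the case where the runtime may not respond, so the session state is repaired regardless of whether the interrupt went through.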

Final approach

flowchart TD
  A[Slack follow-up message] --> B{session has activeTurnId?}
  B -- no --> C[normal pending dispatch]
  B -- yes --> D[read runtime thread state]
  D --> E{terminal / missing?}
  E -- yes --> F[clear active turn and resume]
  E -- no --> G{runtime activity older than timeout?}
  G -- no --> H[keep active turn]
  G -- yes --> I[interrupt stale turn best-effort]
  I --> J[reset inflight messages to pending]
  J --> F
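The flowchart above reduces to a small pure decision function. The sketch below is illustrative only: the `TurnSnapshot` shape, the `RuntimeTurnState` values, and the decision labels are assumptions chosen to mirror the flowchart nodes, not the actual slack-turn-reconciler API.

```typescript
// Hypothetical runtime states; "terminal" and "missing" mirror node E of the flowchart.
type RuntimeTurnState = "inProgress" | "terminal" | "missing";

type Decision =
  | "dispatch-pending" // C: normal pending dispatch
  | "clear-and-resume" // F: clear active turn and resume
  | "keep-active-turn" // H: keep active turn
  | "interrupt-reset-then-resume"; // I -> J -> F

interface TurnSnapshot {
  activeTurnId: string | null;
  runtimeState: RuntimeTurnState;
  lastActivityAtMs: number; // timestamp of the most recent runtime trace activity
}

function decide(snapshot: TurnSnapshot, nowMs: number, stallTimeoutMs: number): Decision {
  // B: no active turn recorded, so pending messages dispatch normally.
  if (snapshot.activeTurnId === null) {
    return "dispatch-pending";
  }
  // E: the runtime says the turn is already over or unknown.
  if (snapshot.runtimeState !== "inProgress") {
    return "clear-and-resume";
  }
  // G: the turn is still inProgress; check how long it has been silent.
  const idleMs = nowMs - snapshot.lastActivityAtMs;
  return idleMs > stallTimeoutMs ? "interrupt-reset-then-resume" : "keep-active-turn";
}
```

Keeping the decision free of I/O means every branch of the flowchart can be unit tested with plain values, with the interrupt and reset side effects applied by the caller.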
  • Adds the SLACK_ACTIVE_TURN_STALL_TIMEOUT_MS config option, defaulting to 10 minutes.
  • slack-turn-reconciler detects a stale active turn from the runtime state plus the last trace activity.
  • slack-conversation-service supports force reset plus resuming pending dispatch.
  • slack-routes adds a single-session repair route.
  • Adds e2e, route, service, config, and reconciler tests covering the recovery path.
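As an illustration of the new config option, the timeout could be read along these lines. Only the variable name SLACK_ACTIVE_TURN_STALL_TIMEOUT_MS and the 10-minute default come from this PR; the `readStallTimeoutMs` helper and the choice to fall back to the default on invalid values are assumptions, and the real config module may validate differently (for example by rejecting bad values at startup).

```typescript
// Default taken from the PR description: 10 minutes.
const DEFAULT_STALL_TIMEOUT_MS = 10 * 60 * 1000;

// Hypothetical helper. The environment is passed in explicitly so the function stays testable.
function readStallTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["SLACK_ACTIVE_TURN_STALL_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_STALL_TIMEOUT_MS;
  }
  const parsed = Number(raw);
  // Assumption: non-numeric, non-finite, or non-positive values fall back to the default.
  if (!isFinite(parsed) || parsed <= 0) {
    return DEFAULT_STALL_TIMEOUT_MS;
  }
  return Math.floor(parsed);
}
```

In a Node.js process this would typically be called as `readStallTimeoutMs(process.env)`.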

Verification

  • pnpm install --frozen-lockfile

  • pnpm exec tsc -p tsconfig.json --noEmit

  • pnpm exec vitest run test/config.test.ts test/e2e-broker.test.ts test/slack-conversation-service.test.ts test/slack-routes.test.ts test/slack-turn-reconciler.test.ts test/slack-turn-runner.test.ts

  • pnpm build

  • Codex review verification: pnpm exec tsc -p tsconfig.json --noEmit

  • Codex review verification: pnpm exec vitest run test/config.test.ts test/e2e-broker.test.ts test/slack-conversation-service.test.ts test/slack-routes.test.ts test/slack-turn-reconciler.test.ts test/slack-turn-runner.test.ts

  • GitHub CI Build and test passed on head ca5a582.

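Known limitations / follow-up work

  • This PR only recovers a stale active turn; it does not change the session cleanup policy, which is triggered by disk pressure.
  • The watchdog depends on agent-runtime trace activity; if the trace schema changes in the future, the activity check must be updated to match.
  • The manual repair endpoint is a low-impact recovery entry point; it does not replace full session deletion or janitor-style capabilities.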
