fix(feishu): await lark-cli subprocess exit in stop() #24

Merged
LinekForge merged 3 commits into LinekForge:main from AmberCXX:fix/feishu-stop-await-subprocess on May 8, 2026

@AmberCXX (Contributor) commented May 8, 2026

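Motivation

After a Hub restart, the Feishu channel is repeatedly skipped, and the log reports that "a lark-cli subscriber is already occupying the Feishu event stream" (original message: 「已有 lark-cli subscriber 正在占用飞书事件流」).

stop() only sent SIGTERM and returned immediately, without waiting for lark-cli to actually exit. The shutdown flow in hub.ts correctly awaits stopAllChannels(), but the Feishu stop() was async in name only. When the Hub exits, lark-cli may still be alive; launchd KeepAlive immediately starts a new Hub, whose start() finds the leftover process via pgrep, treats it as a subscriber the user started deliberately, and skips Feishu (stoppedReason="config", so the watchdog does not retry automatically). This repeats on every restart, leaving Feishu permanently offline.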

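Summary of changes

stop() now genuinely awaits the subprocess exit: it sends SIGTERM and listens for the close event, sends SIGKILL if the process has not responded within 3s, and returns only after the process has exited.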
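
For readers without the diff at hand, the pattern described above can be sketched roughly as follows. This is not the merged code: the module-level `subscribeProc` handle follows the name quoted later in the review, while `KILL_TIMEOUT_MS` and the overall shape of the function are assumptions for illustration, using only the standard `node:child_process` API.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Assumed module-level handle to the lark-cli subscriber (name taken from the review).
let subscribeProc: ChildProcess | null = null;

// Hypothetical constant: grace period before escalating to SIGKILL.
const KILL_TIMEOUT_MS = 3000;

async function stop(): Promise<void> {
  // Copy the handle and clear the shared variable first, so a concurrent
  // start() or stop() never observes a half-stopped process.
  const proc = subscribeProc;
  subscribeProc = null;

  // Nothing to do if there is no process or it has already exited.
  if (!proc || proc.exitCode !== null || proc.signalCode !== null) return;

  await new Promise<void>((resolve) => {
    // Escalate to SIGKILL if the process ignores SIGTERM for too long.
    const timer = setTimeout(() => proc.kill("SIGKILL"), KILL_TIMEOUT_MS);

    // "close" fires once the process has ended and its stdio streams are closed.
    proc.once("close", () => {
      clearTimeout(timer);
      resolve();
    });

    proc.kill("SIGTERM");
  });
}
```

Registering the `close` listener before sending SIGTERM means the exit cannot be missed, and clearing the timer on `close` keeps a fast exit from leaving a pending SIGKILL behind.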

Scope of impact

The stop() method in hub-server/channels/feishu.ts. Does not affect start(), send(), the message-handling path, or other channels.

Self-test results

(cd hub-server && bun install && bunx tsc --noEmit)  ✅
bun hub-test-harness/harness.ts  →  8/8 passed     ✅
fh hub self-test                 →  8/8 passed     ✅

Manual verification: after restarting with launchctl kickstart -k gui/$(id -u)/com.forge-hub, the Feishu channel loads normally and the stale-subscriber log no longer appears.

🤖 Generated with Claude Code

Xuxian Chen and others added 3 commits May 8, 2026 08:08
chatId.startsWith("oc_") was duplicated at replyTo; now uses the
isGroupMessage variable (already fixed to use chat_type === "group").

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stop() was sending SIGTERM but returning immediately without waiting
for the process to die. On Hub restart (launchd kickstart -k), the
new instance starts before lark-cli exits, pgrep detects it as a
stale subscriber, and feishu skips itself.

Fix: await process close event, SIGKILL after 3s timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@LinekForge LinekForge merged commit 34b14d1 into LinekForge:main May 8, 2026
@LinekForge (Owner)

Merged — thanks @AmberCXX!

This is a real issue we hadn't caught. The root cause chain is subtle:

  1. stop() sends SIGTERM but returns immediately
  2. Hub process exits, launchd KeepAlive restarts it
  3. New Hub's start() runs pgrep -f "lark-cli event +subscribe" — finds the old lark-cli still alive
  4. Assumes it's a user-started subscriber → skips feishu with stoppedReason="config"
  5. Watchdog doesn't retry "config" stops → feishu permanently offline
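
The chain above can be modelled in a few lines. The sketch below is illustrative only: `startFeishu`, `watchdogShouldRetry`, the `ChannelState` type, and the `"error"` reason are hypothetical names, and only `stoppedReason="config"` and the "no retry on config" rule come from this thread. The PID list stands in for whatever the pgrep check returns.

```typescript
// "config" comes from the thread; "error" is a hypothetical retryable reason.
type StoppedReason = "config" | "error";

type ChannelState =
  | { running: true }
  | { running: false; stoppedReason: StoppedReason };

// Models start(): any lark-cli subscriber that already exists is assumed to
// have been launched by the user, so the channel steps aside.
function startFeishu(existingSubscriberPids: number[]): ChannelState {
  if (existingSubscriberPids.length > 0) {
    return { running: false, stoppedReason: "config" };
  }
  return { running: true };
}

// Models the watchdog: "config" stops are treated as deliberate and never retried.
function watchdogShouldRetry(state: ChannelState): boolean {
  return !state.running && state.stoppedReason !== "config";
}

// Restart race: the old lark-cli (hypothetical PID 4242) is still alive.
const raced = startFeishu([4242]);
console.log(raced, watchdogShouldRetry(raced));
```

In this model the leftover process is indistinguishable from a user-started one, which is why waiting for it to exit before the Hub itself exits removes the problem at its source.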

Your fix correctly awaits the subprocess exit (SIGTERM → 3s timeout → SIGKILL), breaking this cycle. The local-variable trick (const proc = subscribeProc; subscribeProc = null;) to avoid race conditions is clean.

One note for future reference: there's a pre-existing issue where setTimeout(startSubscription, delay) timers from the health state machine's onFailure handler aren't cancelled by stop(). Not introduced by this PR, but worth being aware of — a stale timer could re-trigger startSubscription after stop() returns. We'll track that separately.
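
One possible way to close that gap, sketched under assumptions rather than taken from the repository: keep the timer handle, clear it during stop(), and re-check a stop flag when the timer fires. The names `scheduleRestart`, `cancelPendingRestart`, `restartTimer`, and `stopping` are hypothetical; `startSubscription` is passed in as a callback so the sketch stays self-contained.

```typescript
// Hypothetical module state: the pending restart timer and a stop flag.
let restartTimer: ReturnType<typeof setTimeout> | null = null;
let stopping = false;

// Would replace a bare setTimeout(startSubscription, delay) in the failure handler.
function scheduleRestart(startSubscription: () => void, delayMs: number): void {
  if (stopping) return; // never schedule once stop() has begun
  if (restartTimer !== null) clearTimeout(restartTimer);
  restartTimer = setTimeout(() => {
    restartTimer = null;
    // Re-check the flag in case stop() ran between scheduling and firing.
    if (!stopping) startSubscription();
  }, delayMs);
}

// Would be called at the top of stop(), before the subprocess is signalled.
function cancelPendingRestart(): void {
  stopping = true;
  if (restartTimer !== null) {
    clearTimeout(restartTimer);
    restartTimer = null;
  }
}
```

A matching start() would need to reset `stopping` to false; that part is omitted here.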

— Forge (maintainer)

LinekForge added a commit that referenced this pull request May 8, 2026
Co-authored-by: Forge <270260515+ForgeLinek@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>