
fix(recorder): watchdog rechecks stop_flag after waking from sleep, eliminating false alarms on stop #379

Merged
H-Chris233 merged 1 commit into beta from fix/recorder-watchdog-stop-race on May 9, 2026

@appergb (Collaborator) commented May 9, 2026

User description

Problem

During the Listening → Processing transition, the recorder's liveness watchdog misreports a user-initiated stop as EngineFailed. The coordinator then terminates the session and the capsule pops up an `audio engine failed` error.

Reproduction log

`~/Library/Logs/OpenLess/openless.log`:

```
27.011 [recorder] cb#4250 ... <- last callback
27.054 [asr] server JSON ... <- ASR still finishing up
~27.5 user releases hotkey → end_session calls rec.stop() → stop_flag=true + pause cpal
31.667 [ERROR] [recorder] watchdog: 录音回调已停止 4 秒,触发错误恢复 <- "recording callback stopped for 4 s, triggering error recovery"
31.667 [ERROR] [coord] recorder runtime error: audio engine failed
36.244 [ERROR] [asr] error frame code=45000081 [Timeout waiting next packet]
```

The same pattern appears at 05:51:33 in the same log, and each time it fires 4 to 5 seconds after the user stops speaking.

Root cause (race)

Original structure of the watchdog loop:

```rust
while !stop_flag.load(...) {                      // checked only at the loop top
    thread::sleep(Duration::from_secs(1));        // long sleep
    let last = state.last_callback_time.lock();   // stop_flag is NOT rechecked here
    if elapsed > Duration::from_secs(3) {
        // fire EngineFailed
        break;
    }
}
```

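Race timeline:

  1. The user releases the hotkey → `end_session` → `rec.stop()`: first `stop_flag.store(true)`, then `handle.join()`
  2. The audio thread wakes from its 50 ms sleep → sees stop_flag → `stream.pause()` (cpal stops delivering callbacks)
  3. Meanwhile, the watchdog is still inside its 1 s sleep
  4. The watchdog wakes: the old logic does not recheck stop_flag and reads `last_callback_time` directly, finds it has not been updated for more than 3 s (because cpal is paused) → false EngineFailed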

`Recorder::stop` is called at step 3 of `end_session`, so the stop itself is not slow; but the watchdog's 1 s sleep plus its 3 s tolerance threshold inherently leave a ~4 s window for the race.
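The ~4 s figure follows from the two loop parameters. A minimal sketch, assuming the watchdog wakes every `sleep` interval and fires on the first wake where the time since the last callback strictly exceeds `threshold` (the `elapsed > 3s` check). `detection_elapsed` is a hypothetical helper, not part of `recorder.rs`, and it ignores the time spent between sleeps:

```rust
use std::time::Duration;

/// Time since the last callback at the moment the watchdog fires.
///
/// Hypothetical model: `first_wake` is how long after the last callback the
/// watchdog's first wake-up happens; later wake-ups are spaced `sleep` apart,
/// and the watchdog fires on the first wake where elapsed > threshold.
fn detection_elapsed(first_wake: Duration, sleep: Duration, threshold: Duration) -> Duration {
    assert!(!sleep.is_zero(), "sleep interval must be non-zero");
    let mut elapsed = first_wake;
    while elapsed <= threshold {
        elapsed += sleep; // the watchdog sleeps again and re-reads the timestamp
    }
    elapsed
}

fn main() {
    let sleep = Duration::from_secs(1);
    let threshold = Duration::from_secs(3);
    // Vary where the first wake-up falls relative to the last callback.
    for first_wake_ms in [1u64, 500, 1000] {
        let e = detection_elapsed(Duration::from_millis(first_wake_ms), sleep, threshold);
        println!("first wake at {first_wake_ms} ms -> fires at {e:?}");
    }
}
```

With a 1 s sleep and a 3 s threshold, the alarm fires somewhere in (3 s, 4 s] after the last callback, depending on where the first wake-up falls. A stop_flag set during the last sleep before that wake-up is exactly what the old loop missed.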

Fix

As soon as the watchdog wakes from its sleep, load stop_flag once more and break if it is set:

```rust
while !stop_flag.load(...) {
    thread::sleep(Duration::from_secs(1));

    // Key: stop_flag must be rechecked after waking from sleep
    if stop_flag.load(...) {
        break;
    }

    // ... original elapsed check
}
```

An 18-line comment in the code explains the root cause, so that future refactors do not unknowingly revert the fix.
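The fixed loop can be exercised end to end with a small self-contained sketch. It assumes `last_callback_time` is an `Arc<Mutex<Instant>>` and uses `Ordering::SeqCst` (the PR elides the ordering as `...`). `spawn_watchdog` and `WatchdogExit` are hypothetical names, and the 100 ms interval and 20 ms threshold are scaled down from 1 s / 3 s so the demo finishes quickly. The threshold is deliberately shorter than the interval so the first wake-up already sees a stale timestamp, which is the state in step 4 of the race. Nothing ever updates `last_callback_time`, modelling the paused cpal stream:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Why the watchdog loop ended (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum WatchdogExit {
    /// stop_flag was set: intentional stop, nothing to report.
    Stopped,
    /// Callbacks went silent while recording was still active.
    EngineFailed(Duration),
}

fn spawn_watchdog(
    stop_flag: Arc<AtomicBool>,
    last_callback_time: Arc<Mutex<Instant>>,
    interval: Duration,
    threshold: Duration,
) -> thread::JoinHandle<WatchdogExit> {
    thread::spawn(move || {
        while !stop_flag.load(Ordering::SeqCst) {
            thread::sleep(interval);

            // The fix: recheck stop_flag after waking. A stop requested during
            // the sleep means the silence is expected, not an engine failure.
            if stop_flag.load(Ordering::SeqCst) {
                break;
            }

            let elapsed = last_callback_time.lock().unwrap().elapsed();
            if elapsed > threshold {
                return WatchdogExit::EngineFailed(elapsed);
            }
        }
        WatchdogExit::Stopped
    })
}

fn main() {
    let interval = Duration::from_millis(100);
    let threshold = Duration::from_millis(20);

    // Stop path: no callbacks arrive, and stop_flag is set while the
    // watchdog is (most likely) inside its sleep.
    let stop = Arc::new(AtomicBool::new(false));
    let last = Arc::new(Mutex::new(Instant::now()));
    let wd = spawn_watchdog(Arc::clone(&stop), last, interval, threshold);
    thread::sleep(Duration::from_millis(30));
    stop.store(true, Ordering::SeqCst);
    println!("stop path:  {:?}", wd.join().unwrap());

    // Fault path: no callbacks arrive and nobody sets stop_flag.
    let stop = Arc::new(AtomicBool::new(false));
    let last = Arc::new(Mutex::new(Instant::now()));
    let wd = spawn_watchdog(stop, last, interval, threshold);
    println!("fault path: {:?}", wd.join().unwrap());
}
```

The stop path returns `Stopped` even though the callback timestamp is already stale when the watchdog wakes; without the recheck, this timing would typically produce `EngineFailed` instead. The fault path still reports `EngineFailed`, so detection of real silence is unaffected.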

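Behavior comparison

| Scenario | Before fix | After fix |
| --- | --- | --- |
| User-initiated stop (hotkey-release / cancel) | watchdog falsely reports EngineFailed | watchdog exits silently |
| Real fault (CoreAudio device disconnect, OS audio reset) | EngineFailed fires | EngineFailed fires (unchanged) |
| Slow-starting device (first-callback timeout) | first-callback timeout fires | first-callback timeout fires (unchanged) |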

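The fix preserves the watchdog's ability to catch real faults while recording is active; it only lets the watchdog exit silently during the stop flow.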

Test plan

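  • `cargo check --manifest-path openless-all/app/src-tauri/Cargo.toml` passes
  • `cargo test --manifest-path openless-all/app/src-tauri/Cargo.toml --lib recorder::`: 8/8 pass (existing recorder unit tests)
  • Manual verification: on macOS, press the hotkey → speak → release the hotkey; repeat several times and confirm that `~/Library/Logs/OpenLess/openless.log` no longer contains the `watchdog: 录音回调已停止 N 秒` ("recording callback stopped for N seconds") log line and the capsule no longer pops up `audio engine failed`
  • Manual verification: unplug or switch the microphone to trigger the real-fault path, and confirm the watchdog still reports EngineFailed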

PR Type

Bug fix


Description

  • Recheck stop_flag after watchdog sleep

  • Prevent false EngineFailed on manual stop

  • Preserve real audio failure detection

  • Add race-condition explanation comments


Diagram Walkthrough

```mermaid
flowchart LR
  A["hotkey release / stop"] -- "sets" --> B["stop_flag = true"]
  C["watchdog sleep"] -- "wake up" --> D["recheck stop_flag"]
  D -- "true" --> E["exit silently"]
  D -- "false" --> F["check last callback time"]
  F -- "silent too long" --> G["EngineFailed"]
```
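The flowchart boils down to a decision taken on every wake-up. A minimal pure-function sketch with hypothetical names (`on_wake`, `on_wake_before_fix`, `WakeAction`), comparing the decision before and after the fix for the state described in step 4 of the race: stop_flag already set and callbacks silent for longer than the threshold.

```rust
use std::time::Duration;

/// What the watchdog does right after a wake-up (hypothetical names).
#[derive(Debug, PartialEq)]
enum WakeAction {
    ExitSilently,
    KeepWatching,
    EngineFailed,
}

/// After the fix: stop_flag is consulted first; silence only counts
/// as a failure while recording is still supposed to be running.
fn on_wake(stop_flag: bool, since_last_callback: Duration, threshold: Duration) -> WakeAction {
    if stop_flag {
        WakeAction::ExitSilently
    } else if since_last_callback > threshold {
        WakeAction::EngineFailed
    } else {
        WakeAction::KeepWatching
    }
}

/// Before the fix: silence is judged without looking at stop_flag
/// (the flag was only checked at the top of the next iteration).
fn on_wake_before_fix(since_last_callback: Duration, threshold: Duration) -> WakeAction {
    if since_last_callback > threshold {
        WakeAction::EngineFailed
    } else {
        WakeAction::KeepWatching
    }
}

fn main() {
    let threshold = Duration::from_secs(3);
    let silent = Duration::from_secs(4);
    // Step 4 of the race: stop already requested, callbacks silent > 3 s.
    println!("before fix: {:?}", on_wake_before_fix(silent, threshold));
    println!("after fix:  {:?}", on_wake(true, silent, threshold));
    // Real fault: no stop requested, callbacks silent > 3 s.
    println!("real fault: {:?}", on_wake(false, silent, threshold));
}
```

Keeping the decision in a pure function like this would also make the race testable without threads or real sleeps, though the PR itself does not structure the code this way.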

File Walkthrough

Relevant files

Bug fix

recorder.rs (+17/-0): Guard watchdog against stop-race false alarms
openless-all/app/src-tauri/src/recorder.rs

  • Added an immediate stop_flag recheck after each watchdog sleep.
  • Prevented the watchdog from misclassifying intentional stops as engine failures.
  • Kept the existing silence-based failure path for real audio issues.
  • Documented the race condition with detailed inline comments.

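When the user releases the hotkey and the session enters the Listening → Processing
transition, the coordinator calls rec.stop(), which sets stop_flag and pauses the
cpal Stream. But the watchdog thread may at that moment be stuck in its 1-second
sleep: when the sleep ends, the old logic reads last_callback_time directly, sees
"not updated for >3 seconds", and fires EngineFailed ("录音回调静默停止 N 秒", i.e.
"recording callback silently stopped for N seconds"). The coordinator then treats a
normal stop as an engine crash, and the capsule pops up "audio engine failed".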

Reproduction log (~/Library/Logs/OpenLess/openless.log):

  27.011  [recorder] cb#4250 ...                <- last callback
  27.054  [asr] server JSON ...                 <- ASR still finishing up
  ~27.5   end_session calls rec.stop(): sets stop_flag + pauses cpal
  31.667  [ERROR] [recorder] watchdog: 录音回调已停止 4 秒,触发错误恢复   <- "recording callback stopped for 4 s, triggering error recovery"
  31.667  [ERROR] [coord] recorder runtime error: audio engine failed

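Fix: right after the watchdog wakes from sleep, load stop_flag once more; if the
flag is set, break out immediately and skip the elapsed check. As a result:

- User-initiated stop path (hotkey-release / cancel): the watchdog exits silently, with no false alarm
- Real fault path (CoreAudio device disconnect, OS audio reset): stop_flag is still false,
  so the watchdog fires EngineFailed on the 3-second silence threshold as before; behavior is unchanged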

Includes an 18-line comment explaining the race, so that future refactors do not unknowingly revert the fix.
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@github-actions

github-actions Bot commented May 9, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 1 🔵⚪⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ No major issues detected

@H-Chris233 H-Chris233 merged commit 2198f0c into beta May 9, 2026
5 checks passed
appergb pushed a commit that referenced this pull request May 9, 2026
Merges included in Beta-6:
- #379 fix(recorder): watchdog rechecks stop_flag after waking from sleep, eliminating false alarms on stop
- #380 fix(commands): wrap tray refresh in run_on_main_thread to fix a main-thread deadlock
- #381 feat(ui): consolidate footer/nav, sliding indicator, hover cues, top-right saved toast
- #378 split the coordinator sub-state-machine modules (merged indirectly)

Tag: v1.2.24-6-beta-tauri, once pushed to main, triggers the release-tauri.yml Beta pipeline.
appergb pushed a commit that referenced this pull request May 9, 2026
@appergb appergb deleted the fix/recorder-watchdog-stop-race branch May 10, 2026 10:14