Observed
After merging #937 (which fixed the _notify_thread_change call site), the same crash pattern surfaced from a different call site:
21:49:34 CRITICAL [-] uncaught exception in thread reorder-home
File "/workspace/src/fido/tasks.py", line 578, in reorder_tasks
_on_done()
File "/workspace/src/fido/events.py", line 1534, in on_done
rewrite_fn(...)
File "/workspace/src/fido/events.py", line 1481, in _rewrite_pr_description
_write_pr_description(...)
File "/workspace/src/fido/worker.py", line 866, in _write_pr_description
raise ValueError(
"_write_pr_description: provider returned no <body> content for PR #934 (raw='')"
)
Same root cause: the voice turn was preempted by an incoming webhook (result_len=0, cancelled=True), caller couldn't distinguish that from a real empty-Opus-reply, and raise ValueError killed the daemon thread.
What to audit
Every raise ValueError / RuntimeError inside an agent call path whose caller runs on a thread without surrounding try/except. Known instances so far:
Probable cousins (worth checking):
_rewrite_pr_description (events.py ~1481)
- PR-body composers that call
agent.run_turn
- Reply generators (
generate_ACT_reply, generate_DO_reply, etc.) if any raise on empty
- The
generate_persona_status / generate_persona_emoji pair in gh_status.py (already raises ValueError in the code we read earlier — same shape, probably same bug; falls back to _FALLBACK_MESSAGES via the caller's except Exception, but that's hidden behind a broad catch)
Fix shape
Same as #937:
- Pass
retry_on_preempt=True to run_turn at each site.
- On genuine empty-after-retries, fall back to a plain-text default (or skip + log, depending on call-site semantics — PR-description rewrite can legitimately no-op if the rewrite would have been a no-op anyway).
- Update the corresponding
raise ValueError sites to log + fallback, not raise, since the callers run on daemon threads without a top-level handler.
Alternative design consideration: a module-level safe_voice_turn helper that wraps run_turn with retry_on_preempt=True + empty-result → fallback + log. Every call site that currently raises on empty gets switched to use this helper. One place to fix, one place to audit.
Related
Observed
After merging #937 (which fixed the
_notify_thread_changecall site), the same crash pattern surfaced from a different call site:Same root cause: the voice turn was preempted by an incoming webhook (
result_len=0, cancelled=True), caller couldn't distinguish that from a real empty-Opus-reply, andraise ValueErrorkilled the daemon thread.What to audit
Every
raise ValueError / RuntimeErrorinside an agent call path whose caller runs on a thread without surrounding try/except. Known instances so far:_notify_thread_change(events.py ~1378) — fixed in Don't treat preempted voice turn as Opus failure (fixes #935) #937_write_pr_description(worker.py ~866) — unfixed, hit 21:49Probable cousins (worth checking):
_rewrite_pr_description(events.py ~1481)agent.run_turngenerate_ACT_reply,generate_DO_reply, etc.) if any raise on emptygenerate_persona_status/generate_persona_emojipair ingh_status.py(already raisesValueErrorin the code we read earlier — same shape, probably same bug; falls back to_FALLBACK_MESSAGESvia the caller'sexcept Exception, but that's hidden behind a broad catch)Fix shape
Same as #937:
retry_on_preempt=Truetorun_turnat each site.raise ValueErrorsites to log + fallback, not raise, since the callers run on daemon threads without a top-level handler.Alternative design consideration: a module-level
safe_voice_turnhelper that wrapsrun_turnwithretry_on_preempt=True+ empty-result → fallback + log. Every call site that currently raises on empty gets switched to use this helper. One place to fix, one place to audit.Related
EOF
)