set_status is intentionally transient: a fresh one-shot claude --print subprocess every few minutes, independent of the persistent ClaudeSession. The one-shot is the right design (status depends on workers being honest about what they're doing — the persistent session is for work, not introspection).
The bug is that the subprocess hangs past its own timeout.
Observed
Home worker at 19:22:26 logged:
19:22:26 INFO [home] checking: tasks
19:22:26 INFO [home] task: Gate draft PR promote on CI green in handle_promote_merge
Then nothing from home for 4+ minutes. kennel status showed:
FidoCanCode/home: fido running → claude pid 2520356 (running 2m, session idle)
Worker: task 1/2 — Gate draft PR promote on CI green in handle_promote_merge
The worker thread was blocked on a futex. The persistent session was untouched (expected, since status doesn't use it). The only explanation that fits: generate_status_with_session's subprocess.run(..., capture_output=True, timeout=15) blocked and the timeout never fired.
Likely cause
One or more of:
- subprocess.run with capture_output=True drains pipes before returning, but if the child blocks on a write, and the parent is waiting for the child to exit, the timeout's signal/wait logic may race with a full-pipe deadlock.
- claude --print on the happy path emits stream-json; some code path may be emitting more output than fits in the 64 KB pipe buffer, blocking the child while the parent waits past the timeout.
- The subprocess.run timeout check is 15s wall-time but the kill-on-timeout may not actually tear down a wedged child cleanly under the free-threaded runtime.
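The pipe-fill mechanism behind the second hypothesis can be reproduced in isolation. The sketch below is an illustration only, not the kennel code: child_blocks_on_full_pipe and CHILD_CODE are hypothetical names, and the child is a plain Python one-liner standing in for claude --print. It uses wait() without reading the pipe, which is not what subprocess.run does (run goes through communicate(), which reads while it waits), so it shows how a child wedges on a full pipe rather than proving that this is what happened here.

```python
import subprocess
import sys

# Hypothetical stand-in for a chatty child: writes ~1 MiB to stdout,
# far more than a typical 64 KB pipe buffer can hold.
CHILD_CODE = "import sys; sys.stdout.write('x' * (1 << 20)); sys.stdout.flush()"


def child_blocks_on_full_pipe(wait_seconds: float = 2.0) -> bool:
    """Return True if the child is still alive after wait_seconds
    when nobody reads its stdout."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
    )
    try:
        # wait() never reads the pipe, so the child stalls in write()
        # once the pipe buffer is full and cannot exit on its own.
        proc.wait(timeout=wait_seconds)
        return False
    except subprocess.TimeoutExpired:
        return True
    finally:
        # Tear down explicitly: kill (no-op if already reaped),
        # close our end of the pipe, then reap the child.
        proc.kill()
        proc.stdout.close()
        proc.wait()
```

If the child's output were drained concurrently (as communicate() does), the same child would exit promptly, which is why the first and third hypotheses remain worth checking as well.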
Fix directions
- Use a Popen + explicit communicate(timeout=15) with try/except TimeoutExpired: proc.kill(); proc.wait(), which gives us control over the tear-down.
- Or use the same idle-timeout + select pattern that _run_streaming uses; it's already proven to wake and kill a wedged process.
- Or run the status-generation helpers with stdin=DEVNULL + stdout=DEVNULL after capturing the single result line, which reduces pipe-fill surface.
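A minimal sketch of the first direction, assuming the status helper only needs the child's stdout as text. run_one_shot is a hypothetical name, not an existing kennel function, and the command used in the usage example is a placeholder rather than the real claude --print invocation.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_one_shot(cmd: Sequence[str], timeout: float = 15.0) -> Optional[str]:
    """Run cmd and return its stdout, or None if it was killed on timeout."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,   # the child never waits on our stdin
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        # communicate() reads both pipes while waiting, so a chatty
        # child cannot wedge on a full pipe buffer.
        out, _err = proc.communicate(timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        # Explicit tear-down: kill the child, then reap it.
        proc.kill()
        proc.wait()
        return None
    finally:
        # Closing an already-closed pipe is a no-op, so this is safe
        # on both the success and the timeout path.
        for pipe in (proc.stdout, proc.stderr):
            if pipe is not None:
                pipe.close()


if __name__ == "__main__":
    # Placeholder command; the real caller would pass its own argv.
    print(run_one_shot([sys.executable, "-c", "print('ok')"], timeout=15.0))
```

Returning None on timeout lets the caller skip the status update for that cycle instead of blocking the worker thread. This sketch kills only the direct child; if the child spawns its own subprocesses, they would need separate handling.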
Do not route these through the persistent session: status is a separate, transient, low-frequency enrichment path by design.