set_status subprocess hangs past timeout — worker frozen 4+ min with no progress #489

@FidoCanCode

Description

set_status is intentionally transient: a fresh one-shot claude --print subprocess every few minutes, independent of the persistent ClaudeSession. The one-shot is the right design (status depends on workers being honest about what they're doing — the persistent session is for work, not introspection).

The bug is that the subprocess hangs past its own timeout.

Observed

Home worker at 19:22:26 logged:

19:22:26 INFO  [home] checking: tasks
19:22:26 INFO  [home] task: Gate draft PR promote on CI green in handle_promote_merge

Then nothing from home for 4+ minutes. kennel status showed:

FidoCanCode/home: fido running → claude pid 2520356 (running 2m, session idle)
  Worker: task 1/2 — Gate draft PR promote on CI green in handle_promote_merge

The worker thread was blocked on a futex. The persistent session was untouched (expected, since status doesn't use it). The only explanation that fits: generate_status_with_session's subprocess.run(..., capture_output=True, timeout=15) blocked and the timeout never fired.

Likely cause

One or more of:

  • subprocess.run with capture_output=True drains pipes before returning — but if the child blocks on a write, and the parent is waiting for the child to exit, the timeout's signal/wait logic may race with a full-pipe deadlock.
  • claude --print on the happy path emits stream-json; some code path may be emitting more output than fits in the 64 KB pipe buffer, blocking the child while the parent waits past the timeout.
  • The subprocess.run timeout check is 15s wall-time but the kill-on-timeout may not actually tear down a wedged child cleanly under the free-threaded runtime.
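To make the pipe-fill mechanism in the first two bullets concrete, here is a minimal sketch. It does not reproduce the hang in generate_status_with_session; it only shows how a child that writes more than the pipe buffer stalls until the parent reads. The child command and the name demo_pipe_fill are made up for illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for a chatty child: writes ~1 MB to stdout,
# far more than a typical pipe buffer holds.
CHILD = "import sys; sys.stdout.write('x' * 1_000_000)"


def demo_pipe_fill():
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    try:
        # wait() never reads the pipe, so the child blocks on write
        # once the buffer is full and cannot exit.
        proc.wait(timeout=2)
        blocked = False
    except subprocess.TimeoutExpired:
        blocked = True
    # communicate() drains the pipe, which unblocks the child and lets it exit.
    out, _ = proc.communicate(timeout=10)
    return blocked, len(out), proc.returncode
```

On a typical Linux box the wait() step should time out and the later communicate() should return the full payload. Since subprocess.run already goes through communicate(), this by itself would not explain the hang; it is only the deadlock shape the bullets are worried about.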

Fix directions

  • Use a Popen + explicit communicate(timeout=15) with try/except TimeoutExpired: proc.kill(); proc.wait() — gives us control over the tear-down.
  • Or use the same idle-timeout + select pattern that _run_streaming uses; it's already proven to wake and kill a wedged process.
  • Or run the status-generation helpers with stdin=DEVNULL + stdout=DEVNULL after capturing the single result line — reduces pipe-fill surface.
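A rough sketch of the first direction (Popen plus explicit communicate(timeout=...)), assuming nothing about the real helper beyond what is described above. The name run_one_shot and the grace parameter are hypothetical; the 15 s default mirrors the current timeout.

```python
import subprocess
import sys


def run_one_shot(cmd, timeout=15.0, grace=2.0):
    """Run cmd once; return its stdout, or None if it had to be killed."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,  # child can never block waiting on stdin
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        out, _err = proc.communicate(timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            # Bounded reap: drain what is left and collect the exit status,
            # but never wait longer than `grace` seconds.
            proc.communicate(timeout=grace)
        except subprocess.TimeoutExpired:
            pass  # still wedged; give up rather than block the worker
        return None
    finally:
        # Closing already-closed pipes is a no-op, so this is safe on both paths.
        for pipe in (proc.stdout, proc.stderr):
            if pipe is not None:
                pipe.close()
```

For example, `run_one_shot([sys.executable, "-c", "import time; time.sleep(30)"], timeout=0.5)` should come back with None after roughly half a second instead of blocking. The teardown uses a second bounded communicate() rather than a bare proc.wait() so that the cleanup path cannot itself hang the worker thread; whether that trade-off is right here is a judgment call.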

Do not route these through the persistent session: status is a separate, transient, low-frequency enrichment path by design.
