Summary
When the `claude` headless subprocess exits cleanly (`return_code == 0`) but its stdout reader thread leaks before capturing the final `{"type":"result"}` block, agent-server logs the task as "completed successfully" and returns 200 to the backend — even though no usable result was captured (cost is `None`, duration is `None`, no response body). The execution then sits in `running` until the orphan watchdog reaps it minutes later. Operators see a misleading "Execution completed on agent but status not reported — recovered by watchdog" error against an execution that the agent itself believed was successful.
This is a sibling to #516 (which I filed earlier) — same code file, same root condition (orphaned reader thread leaks the result block), different exit path. #516/#517 covered the signal-killed case (`return_code < 0`); this is the clean-exit case (`return_code == 0`).
Component
Agent Runtime (`agent_server.services.claude_code`, `agent_server.routers.chat`)
Priority
P2 — masks real failures with a false "success" status. The watchdog provides eventual cleanup, so it is not silent data loss, but the diagnostic UX is broken (the agent log says success, the backend log says orphan-recovered) and ~26 min of work is silently lost.
Error
Inside the agent container (typical sequence around the failure):

```
[Headless Task] Starting task <task_id>: claude --print --output-format stream-json --verbose...
[ProcessRegistry] Registered execution <task_id>
... agent does real work for ~20 min, 7+ tool calls, 22+ raw_messages stream through ...
[Subprocess] Reader thread(s) stuck after process exit (pid=NNNNN, stuck_count=1) — killing process group and closing pipes to unwind
[Subprocess] 1 reader thread(s) leaked for pid=NNNNN after close+killpg; continuing anyway
[Headless Task] Error reading stdout: I/O operation on closed file.
[Headless Task] Task <task_id> completed: cost=$None, duration=Nonems, tools=7, raw_messages=22
[ProcessRegistry] Unregistered execution <task_id>
[Task] Task <task_id> completed successfully   ← misleading: there is no result
```
Meanwhile in the backend / scheduler (same execution, same window):

```
[Slots] Agent '<agent>' acquired slot 1/3 for execution <task_id> (TTL=3900s)
... ~26 min of polling ...
scheduler.service: Execution <task_id> still running (1500s elapsed, poll #150)
scheduler.service: Execution <task_id> still running (1560s elapsed, poll #156)
scheduler.service: Execution <task_id> completed: status=failed (polled 161 times)
Background poll: execution <task_id> failed: Execution completed on agent but status not reported — recovered by watchdog
[Watchdog] orphan_recovered: execution <task_id> on agent '<agent>'
```
Location
- File: `docker/base-image/agent_server/services/claude_code.py`
- Function: `_run_headless_task` / the task completion branch (after the `return_code != 0` checks finish)
- Co-conspirator: `docker/base-image/agent_server/routers/chat.py` — `[Task] ... completed successfully` is logged unconditionally on the success branch, without inspecting whether the metadata is empty
Root Cause
After #517's signal-kill special case lands and the auth-failure heuristics gate on `return_code != 0`, anything with `return_code == 0` falls through to the success path. The success path doesn't validate that the metadata it's about to return is actually populated.
Two contributing facts make the empty-result case real:
- `Reader thread(s) stuck after process exit` happens on `return_code == 0` too — not just on signal kills. When an MCP tool spawns a subprocess that inherits stdout (or otherwise keeps the pipe open after `claude` exits), the reader thread blocks waiting for an EOF that never arrives. agent-server's pgroup unwind kills it, but by then the final `{"type":"result"}` JSON line has already been dropped on the floor.
- The "completed successfully" log fires from `routers/chat.py` based on the absence of a raised exception — not on the presence of a populated result. So `cost=None, duration=None, tools=7, raw_messages=22` returns to the router, the router sees no exception, and prints "completed successfully".
The downstream effect is that the HTTP response back to the backend is either (a) a 200 with an empty body that the backend can't act on, or (b) lost in transit (likely #408 territory — a long-running uvicorn worker recycle drops the response). Either way, the backend's DB stays at `running` and the watchdog has to clean it up.
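For intuition, here is a minimal standalone sketch (hypothetical demo code, not from this repo; POSIX-only) of why a clean exit does not guarantee EOF on the stdout pipe:

```python
# Standalone demo: a clean parent exit does NOT guarantee EOF on its stdout
# pipe if a child process inherited the fd.
import subprocess
import threading

# The parent shell exits immediately with rc=0; the backgrounded `sleep`
# inherits its stdout and keeps the pipe's write end open for 60s.
proc = subprocess.Popen(
    ["sh", "-c", "echo partial-stream; sleep 60 & exit 0"],
    stdout=subprocess.PIPE,
    text=True,
)

def reader():
    # Blocks until EOF, i.e. until EVERY writer fd is closed, including the
    # one inherited by `sleep`. This is the leaked reader thread.
    for line in proc.stdout:
        print("read:", line.rstrip())
    print("EOF reached")  # only prints once `sleep` dies

t = threading.Thread(target=reader, daemon=True)
t.start()

proc.wait()        # returns almost immediately: returncode == 0
t.join(timeout=2)  # times out; the reader is still blocked after exit
print(f"rc={proc.returncode}, reader_alive={t.is_alive()}")  # rc=0, reader_alive=True
```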
Reproduction Steps
This is hard to repro deterministically because it depends on an MCP tool (or some other subprocess) inheriting stdout. The cleanest synthetic trigger:
- Stand up an agent with at least one MCP server that spawns long-lived child processes.
- Schedule a task that calls into that MCP server enough to provoke the pipe-leak — in practice, any heavy skill that fires inner HTTP / async fan-out (`mcp__trinity__chat_with_agent(parallel:true, async:true)` is a known trigger pattern).
- Wait for an execution where you see `Subprocess: Reader thread(s) stuck after process exit` followed by `Task ... completed: cost=$None, duration=Nonems` in the agent log.
- Observe the same execution sit in `running` in the backend until the watchdog reaps it ~5–10 min after the actual completion.
In a real production cron, the symptom is: scheduled tasks that "always fail" with the watchdog message, while the agent's own log claims they succeeded.
Suggested Fix
Validate that the headless task actually produced a result before declaring success. Mirror the shape of the #517 fix — special-case the empty-result condition and raise a meaningful HTTP error.
```python
# In claude_code.py, AFTER the existing return_code != 0 / max_turns / signal-kill
# checks, BEFORE returning the result on the return_code == 0 path:

from fastapi import HTTPException  # at module top if not already imported

# Empty-result detection: subprocess exited cleanly, but the final result message
# was never captured (typically because a child process inherited stdout and the
# reader thread leaked). Returning "success" here is actively misleading — agent
# logs "completed successfully", backend never gets a usable response, watchdog
# orphan-recovers minutes later.
if metadata.cost is None and metadata.duration_ms is None:
    logger.error(
        f"[Headless Task] Subprocess exited 0 but result message missing "
        f"(tools={metadata.tool_use_count}, raw_messages={metadata.raw_message_count}). "
        f"Likely cause: a tool/subprocess inherited stdout and prevented EOF. "
        f"See preceding 'Reader thread(s) stuck' / 'I/O operation on closed file' logs."
    )
    raise HTTPException(
        status_code=502,  # Bad Gateway — upstream (claude) gave us a malformed response
        detail=(
            f"Execution completed without a result message after "
            f"{metadata.tool_use_count} tool calls / "
            f"{metadata.raw_message_count} stream messages. "
            f"Likely a subprocess kept stdout open past claude exit. "
            f"Check agent-server logs for 'Reader thread(s) stuck after process exit'."
        ),
    )
```
Then in `routers/chat.py`, the "completed successfully" log line should be guarded so it can never claim success when the result is empty — but a clean fix at the `claude_code.py` layer means the router-side code naturally takes the failure branch.
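As a defensive sketch of that guard (variable names `result`, `task_id`, and `logger` are assumptions, not the actual `routers/chat.py` code):

```python
# Hypothetical guard in routers/chat.py: claim success only when the result
# message was actually captured, not merely because no exception was raised.
if result.cost is None and result.duration_ms is None:
    logger.error(f"[Task] Task {task_id} finished without a result message")
else:
    logger.info(f"[Task] Task {task_id} completed successfully")
```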
A unit test fixture would pin down the contract: spawn a fake `claude` that emits valid stream-json prefix messages, then exits 0 without writing the final result block. The contract should be HTTP 502 / 504 / 422 (any 4xx/5xx), never HTTP 200 with `cost=null`. A sketch follows.
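Sketch of that fixture, assuming pytest; `run_headless_task` and the exact stream-json prefix shapes are assumptions for illustration, not verified project API:

```python
# Proposed test: fake `claude` emits prefix messages, exits 0, never writes
# the final {"type":"result"} line. Service must NOT report success.
import os
import stat
import textwrap

import pytest
from fastapi import HTTPException

FAKE_CLAUDE = textwrap.dedent("""\
    #!/usr/bin/env python3
    import json, sys
    # Valid stream-json prefix messages...
    print(json.dumps({"type": "system", "subtype": "init"}))
    print(json.dumps({"type": "assistant", "message": {"content": []}}))
    sys.exit(0)  # ...then a clean exit with NO {"type":"result"} line
""")

@pytest.fixture
def fake_claude(tmp_path, monkeypatch):
    exe = tmp_path / "claude"
    exe.write_text(FAKE_CLAUDE)
    exe.chmod(exe.stat().st_mode | stat.S_IXUSR)
    # Shadow the real binary so the service under test execs the fake.
    monkeypatch.setenv("PATH", f"{tmp_path}{os.pathsep}{os.environ['PATH']}")

def test_clean_exit_without_result_is_an_error(fake_claude):
    # Contract: any 4xx/5xx is acceptable; HTTP 200 with cost=null is not.
    with pytest.raises(HTTPException) as exc:
        run_headless_task(prompt="noop")  # hypothetical entry point
    assert exc.value.status_code >= 400
```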
Environment
- Commit: `42f75be` (the commit that landed #517's signal-kill fix, "fix(agent): classify signal-killed claude exits as 504, not fake auth failure (#516)"). The bug pre-existed; #517 just made it more visible because it was previously masked by the auth-fallback heuristic catching this same case as "auth failure".
Related
- #516 / #517 — sibling issue (the `return_code != 0` path; this report covers the `return_code == 0` path)
- `docker/base-image/agent_server/services/claude_code.py`
- `docker/base-image/agent_server/routers/chat.py`