
bug: agent_server reports task as 'completed successfully' when claude exits 0 with no result message #520

@vybe

Description


Summary

When the claude headless subprocess exits cleanly (return_code == 0) but its stdout reader thread leaks before capturing the final {"type":"result"} block, agent-server logs the task as "completed successfully" and returns 200 to the backend — even though no usable result was captured (cost is None, duration is None, no response body). The execution then sits in running until the orphan watchdog reaps it minutes later. Operators see a misleading "Execution completed on agent but status not reported — recovered by watchdog" error against an execution that the agent itself believed was successful.

This is a sibling to #516 (which I filed earlier) — same code file, same root condition (orphaned reader thread leaks the result block), different exit path. #516/#517 covered the signal-killed case (return_code < 0); this is the clean-exit case (return_code == 0).

Component

Agent Runtime (agent_server.services.claude_code, agent_server.routers.chat)

Priority

P2 — masks real failures with a false "success" status. The watchdog provides eventual cleanup, so this is not silent data loss, but the diagnostic UX is broken (the agent log says success while the backend log says orphan-recovered), and the ~26 min of lost work disappears silently.

Error

Inside the agent container (typical sequence around the failure):

[Headless Task] Starting task <task_id>: claude --print --output-format stream-json --verbose...
[ProcessRegistry] Registered execution <task_id>
... agent does real work for ~20 min, 7+ tool calls, 22+ raw_messages stream through ...
[Subprocess] Reader thread(s) stuck after process exit (pid=NNNNN, stuck_count=1) — killing process group and closing pipes to unwind
[Subprocess] 1 reader thread(s) leaked for pid=NNNNN after close+killpg; continuing anyway
[Headless Task] Error reading stdout: I/O operation on closed file.
[Headless Task] Task <task_id> completed: cost=$None, duration=Nonems, tools=7, raw_messages=22
[ProcessRegistry] Unregistered execution <task_id>
[Task] Task <task_id> completed successfully     ← misleading: there is no result

Meanwhile in the backend / scheduler (same execution, same window):

[Slots] Agent '<agent>' acquired slot 1/3 for execution <task_id> (TTL=3900s)
... ~26 min of polling ...
scheduler.service: Execution <task_id> still running (1500s elapsed, poll #150)
scheduler.service: Execution <task_id> still running (1560s elapsed, poll #156)
scheduler.service: Execution <task_id> completed: status=failed (polled 161 times)
Background poll: execution <task_id> failed: Execution completed on agent but status not reported — recovered by watchdog
[Watchdog] orphan_recovered: execution <task_id> on agent '<agent>'

Location

  • File: docker/base-image/agent_server/services/claude_code.py
  • Function: _run_headless_task / the task completion branch (after the return_code != 0 checks finish)
  • Co-conspirator: docker/base-image/agent_server/routers/chat.py — the "[Task] ... completed successfully" line is logged unconditionally on the success branch, without checking whether the returned metadata is actually populated

Root Cause

After #517's signal-kill special case lands and the auth-failure heuristics gate on return_code != 0, anything with return_code == 0 falls through to the success path. The success path doesn't validate that the metadata it's about to return is actually populated.

Two contributing facts make the empty-result case real:

  1. Reader thread(s) stuck after process exit happens on return_code == 0 too — not just on signal kills. When an MCP tool spawns a subprocess that inherits stdout (or otherwise keeps the pipe open after claude exits), the reader thread blocks waiting for EOF that never arrives. agent-server's pgroup unwind kills it, but by then the final {"type":"result"} JSON line has already been dropped on the floor.
  2. The "completed successfully" log fires from routers/chat.py based on the absence of a raised exception — not on the presence of a populated result. So cost=None, duration=None, tools=7, raw_messages=22 returns to the router, the router sees no exception, and prints "completed successfully".
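
The stdout-inheritance failure in point 1 can be reproduced in miniature. The sketch below is illustrative only (none of these names come from the agent-server codebase): the outer process exits 0 almost immediately, but a backgrounded grandchild inherits the stdout pipe, so the reader thread blocks past process exit waiting for an EOF that never comes.

```python
import subprocess
import sys
import threading

# Synthetic stand-in for claude: prints one message, spawns a grandchild
# that inherits stdout (the pipe write end), then exits 0.
CHILD = (
    "import subprocess, sys; "
    "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(3)']); "
    "print('{\"type\": \"assistant\"}'); "
    "sys.exit(0)"
)

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdout=subprocess.PIPE, text=True)

lines = []

def reader():
    # Blocks past proc's exit: the sleeping grandchild still holds the
    # write end of the pipe, so iteration never reaches EOF.
    for line in proc.stdout:
        lines.append(line)

t = threading.Thread(target=reader, daemon=True)
t.start()
proc.wait()           # returns quickly with return_code == 0
t.join(timeout=1.0)
still_stuck = t.is_alive()
print(f"return_code={proc.returncode} reader_stuck={still_stuck}")
```

The outer process has already exited cleanly, yet the reader thread is still alive — the same state agent-server's pgroup unwind has to break out of, dropping whatever final line was never read.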

The downstream effect is that the HTTP response back to the backend is either (a) a 200 with an empty body that backend can't act on, or (b) lost in transit (likely #408 territory — long-running uvicorn worker recycle drops the response). Either way, backend's DB stays at running and the watchdog has to clean it up.

Reproduction Steps

This is hard to repro deterministically because it depends on an MCP tool (or some other subprocess) inheriting stdout. The cleanest synthetic trigger:

  1. Stand up an agent with at least one MCP server that spawns long-lived child processes.
  2. Schedule a task that calls into that MCP server enough to provoke the pipe-leak — in practice, any heavy skill that fires inner HTTP / async fan-out (mcp__trinity__chat_with_agent(parallel:true, async:true) is a known trigger pattern).
  3. Wait for an execution where you see Subprocess: Reader thread(s) stuck after process exit followed by Task ... completed: cost=$None, duration=Nonems in the agent log.
  4. Observe the same execution sit in running in the backend until the watchdog reaps it ~5–10 min after the actual completion.

In a real production cron, the symptom is: scheduled tasks that "always fail" with the watchdog message, while the agent's own log claims they succeeded.

Suggested Fix

Validate that the headless task actually produced a result before declaring success. Mirror the shape of the #517 fix — special-case the empty-result condition and raise a meaningful HTTP error.

# In claude_code.py, AFTER the existing return_code != 0 / max_turns / signal-kill checks,
# BEFORE returning the result on the return_code == 0 path:

# Empty-result detection: subprocess exited cleanly, but the final result message
# was never captured (typically because a child process inherited stdout and the
# reader thread leaked). Returning "success" here is actively misleading — agent
# logs "completed successfully", backend never gets a usable response, watchdog
# orphan-recovers minutes later.
if metadata.cost is None and metadata.duration_ms is None:
    logger.error(
        f"[Headless Task] Subprocess exited 0 but result message missing "
        f"(tools={metadata.tool_use_count}, raw_messages={metadata.raw_message_count}). "
        f"Likely cause: a tool/subprocess inherited stdout and prevented EOF. "
        f"See preceding 'Reader thread(s) stuck' / 'I/O operation on closed file' logs."
    )
    raise HTTPException(
        status_code=502,  # Bad Gateway — upstream (claude) gave us a malformed response
        detail=(
            f"Execution completed without a result message after "
            f"{metadata.tool_use_count} tool calls / "
            f"{metadata.raw_message_count} stream messages. "
            f"Likely a subprocess kept stdout open past claude exit. "
            f"Check agent-server logs for 'Reader thread(s) stuck after process exit'."
        ),
    )

Then in routers/chat.py, the "completed successfully" log line should be guarded so it can never claim success when the result is empty — but a clean fix at the claude_code.py layer means router-side code naturally takes the failure branch.
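
A sketch of such a guard, for belt-and-suspenders at the router layer (the dataclass and field names below are assumptions mirroring the log line in this report, not the actual agent-server types):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape of the metadata handed back to the router.
@dataclass
class TaskMetadata:
    cost: Optional[float]
    duration_ms: Optional[int]
    tool_use_count: int = 0

def completion_log_line(task_id: str, meta: TaskMetadata) -> str:
    # Success must require a populated result, not merely "no exception".
    if meta.cost is None and meta.duration_ms is None:
        return (f"[Task] Task {task_id} finished WITHOUT a result message "
                f"(tools={meta.tool_use_count}); treating as failed")
    return f"[Task] Task {task_id} completed successfully"
```

With the claude_code.py fix in place this branch should be unreachable, but guarding the log keeps it from ever lying again if a new clean-exit path slips through.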

A unit test fixture: spawn a fake claude that emits valid stream-json prefix messages and then exits 0 without writing the final result block. The contract should be HTTP 502 / 504 / 422 (any 4xx/5xx), never HTTP 200 with cost=null.
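
One way to sketch that fixture (the fake claude below is synthetic, and the stream-json message shapes are approximations, not the documented format):

```python
import json
import subprocess
import sys

# Fake `claude`: emits plausible stream-json prefix messages, then exits 0
# WITHOUT the final {"type": "result"} block -- the exact shape of this bug.
FAKE_CLAUDE = (
    "import json, sys; "
    "print(json.dumps({'type': 'system', 'subtype': 'init'})); "
    "print(json.dumps({'type': 'assistant', 'message': {'content': []}})); "
    "sys.exit(0)"
)

proc = subprocess.run([sys.executable, "-c", FAKE_CLAUDE],
                      capture_output=True, text=True)
messages = [json.loads(line) for line in proc.stdout.splitlines()]
result_seen = any(m.get("type") == "result" for m in messages)

# The contract under test: return_code == 0 with result_seen == False must
# map to an HTTP 5xx from agent-server, never a 200 with cost=null.
print(f"return_code={proc.returncode} result_seen={result_seen}")
```

Pointing the headless runner at this script instead of the real binary exercises the empty-result branch deterministically, without needing a leaky MCP subprocess.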
