Summary
When the `claude` headless subprocess exits cleanly (`return_code == 0`) but its stdout reader thread leaks before capturing the final `{"type":"result"}` block, agent-server logs the task as "completed successfully" and returns 200 to the backend — even though no usable result was captured (cost is `None`, duration is `None`, no response body). The execution then sits in `running` until the orphan watchdog reaps it minutes later. Operators see a misleading "Execution completed on agent but status not reported — recovered by watchdog" error against an execution that the agent itself believed was successful.
This is a sibling to #516 (which I filed earlier) — same code file, same root condition (orphaned reader thread leaks the result block), different exit path. #516/#517 covered the signal-killed case (`return_code < 0`); this is the clean-exit case (`return_code == 0`).
Component
Agent Runtime (`agent_server.services.claude_code`, `agent_server.routers.chat`)
Priority
P2 — masks real failures with a false "success" status. The watchdog provides eventual cleanup, so it is not silent data loss, but the diagnostic UX is broken (the agent log says success, the backend log says orphan-recovered) and ~26 min of work is silently lost.
Error
Inside the agent container (typical sequence around the failure):

```
[Headless Task] Starting task <task_id>: claude --print --output-format stream-json --verbose...
[ProcessRegistry] Registered execution <task_id>
... agent does real work for ~20 min, 7+ tool calls, 22+ raw_messages stream through ...
[Subprocess] Reader thread(s) stuck after process exit (pid=NNNNN, stuck_count=1) — killing process group and closing pipes to unwind
[Subprocess] 1 reader thread(s) leaked for pid=NNNNN after close+killpg; continuing anyway
[Headless Task] Error reading stdout: I/O operation on closed file.
[Headless Task] Task <task_id> completed: cost=$None, duration=Nonems, tools=7, raw_messages=22
[ProcessRegistry] Unregistered execution <task_id>
[Task] Task <task_id> completed successfully   ← misleading: there is no result
```
Meanwhile in the backend / scheduler (same execution, same window):

```
[Slots] Agent '<agent>' acquired slot 1/3 for execution <task_id> (TTL=3900s)
... ~26 min of polling ...
scheduler.service: Execution <task_id> still running (1500s elapsed, poll #150)
scheduler.service: Execution <task_id> still running (1560s elapsed, poll #156)
scheduler.service: Execution <task_id> completed: status=failed (polled 161 times)
Background poll: execution <task_id> failed: Execution completed on agent but status not reported — recovered by watchdog
[Watchdog] orphan_recovered: execution <task_id> on agent '<agent>'
```
Location
- File: `docker/base-image/agent_server/services/claude_code.py`
- Function: `_run_headless_task` / the task completion branch (after the `return_code != 0` checks finish)
- Co-conspirator: `docker/base-image/agent_server/routers/chat.py` — `[Task] ... completed successfully` is logged unconditionally on the success branch, without inspecting whether the metadata is empty
Root Cause
After #517's signal-kill special case lands and the auth-failure heuristics gate on `return_code != 0`, anything with `return_code == 0` falls through to the success path. The success path doesn't validate that the metadata it's about to return is actually populated.
Two contributing facts make the empty-result case real:
- `Reader thread(s) stuck after process exit` happens on `return_code == 0` too — not just on signal kills. When an MCP tool spawns a subprocess that inherits stdout (or otherwise keeps the pipe open after `claude` exits), the reader thread blocks waiting for an EOF that never arrives. agent-server's pgroup unwind kills it, but by then the final `{"type":"result"}` JSON line has already been dropped on the floor.
- The "completed successfully" log fires from `routers/chat.py` based on the absence of a raised exception — not on the presence of a populated result. So `cost=None, duration=None, tools=7, raw_messages=22` returns to the router, the router sees no exception, and prints "completed successfully".
The downstream effect is that the HTTP response back to the backend is either (a) a 200 with an empty body that the backend can't act on, or (b) lost in transit (likely #408 territory — a long-running uvicorn worker recycle drops the response). Either way, the backend's DB stays at `running` and the watchdog has to clean it up.
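For intuition, here is a minimal standalone sketch (hypothetical demo code, not from this repo; POSIX-only) of why a clean exit does not guarantee EOF on the stdout pipe:

```python
# Standalone demo: a clean parent exit does NOT guarantee EOF on its stdout
# pipe if a child process inherited the fd.
import subprocess
import threading

# The parent shell exits immediately with rc=0; the backgrounded `sleep`
# inherits its stdout and keeps the pipe's write end open for 60s.
proc = subprocess.Popen(
    ["sh", "-c", "echo partial-stream; sleep 60 & exit 0"],
    stdout=subprocess.PIPE,
    text=True,
)

def reader():
    # Blocks until EOF, i.e. until EVERY writer fd is closed, including the
    # one inherited by `sleep`. This is the leaked reader thread.
    for line in proc.stdout:
        print("read:", line.rstrip())
    print("EOF reached")  # only prints once `sleep` dies

t = threading.Thread(target=reader, daemon=True)
t.start()

proc.wait()        # returns almost immediately: returncode == 0
t.join(timeout=2)  # times out; the reader is still blocked after exit
print(f"rc={proc.returncode}, reader_alive={t.is_alive()}")  # rc=0, reader_alive=True
```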
Reproduction Steps
This is hard to repro deterministically because it depends on an MCP tool (or some other subprocess) inheriting stdout. The cleanest synthetic trigger:
- Stand up an agent with at least one MCP server that spawns long-lived child processes.
- Schedule a task that calls into that MCP server enough to provoke the pipe-leak — in practice, any heavy skill that fires inner HTTP / async fan-out (`mcp__trinity__chat_with_agent(parallel:true, async:true)` is a known trigger pattern).
- Wait for an execution where you see `Subprocess: Reader thread(s) stuck after process exit` followed by `Task ... completed: cost=$None, duration=Nonems` in the agent log.
- Observe the same execution sit in `running` in the backend until the watchdog reaps it ~5–10 min after the actual completion.
In a real production cron, the symptom is: scheduled tasks that "always fail" with the watchdog message, while the agent's own log claims they succeeded.
Suggested Fix
Validate that the headless task actually produced a result before declaring success. Mirror the shape of the #517 fix — special-case the empty-result condition and raise a meaningful HTTP error.
```python
# In claude_code.py, AFTER the existing return_code != 0 / max_turns / signal-kill
# checks, BEFORE returning the result on the return_code == 0 path:

from fastapi import HTTPException  # at module top if not already imported

# Empty-result detection: subprocess exited cleanly, but the final result message
# was never captured (typically because a child process inherited stdout and the
# reader thread leaked). Returning "success" here is actively misleading — agent
# logs "completed successfully", backend never gets a usable response, watchdog
# orphan-recovers minutes later.
if metadata.cost is None and metadata.duration_ms is None:
    logger.error(
        f"[Headless Task] Subprocess exited 0 but result message missing "
        f"(tools={metadata.tool_use_count}, raw_messages={metadata.raw_message_count}). "
        f"Likely cause: a tool/subprocess inherited stdout and prevented EOF. "
        f"See preceding 'Reader thread(s) stuck' / 'I/O operation on closed file' logs."
    )
    raise HTTPException(
        status_code=502,  # Bad Gateway — upstream (claude) gave us a malformed response
        detail=(
            f"Execution completed without a result message after "
            f"{metadata.tool_use_count} tool calls / "
            f"{metadata.raw_message_count} stream messages. "
            f"Likely a subprocess kept stdout open past claude exit. "
            f"Check agent-server logs for 'Reader thread(s) stuck after process exit'."
        ),
    )
```
Then in `routers/chat.py`, the "completed successfully" log line should be guarded so it can never claim success when the result is empty — but a clean fix at the `claude_code.py` layer means the router-side code naturally takes the failure branch.
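As a defensive sketch of that guard (variable names `result`, `task_id`, and `logger` are assumptions, not the actual `routers/chat.py` code):

```python
# Hypothetical guard in routers/chat.py: claim success only when the result
# message was actually captured, not merely because no exception was raised.
if result.cost is None and result.duration_ms is None:
    logger.error(f"[Task] Task {task_id} finished without a result message")
else:
    logger.info(f"[Task] Task {task_id} completed successfully")
```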
A unit test fixture would pin down the contract: spawn a fake `claude` that emits valid stream-json prefix messages, then exits 0 without writing the final result block. The contract should be HTTP 502 / 504 / 422 (any 4xx/5xx), never HTTP 200 with `cost=null`. A sketch follows.
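Sketch of that fixture, assuming pytest; `run_headless_task` and the exact stream-json prefix shapes are assumptions for illustration, not verified project API:

```python
# Proposed test: fake `claude` emits prefix messages, exits 0, never writes
# the final {"type":"result"} line. Service must NOT report success.
import os
import stat
import textwrap

import pytest
from fastapi import HTTPException

FAKE_CLAUDE = textwrap.dedent("""\
    #!/usr/bin/env python3
    import json, sys
    # Valid stream-json prefix messages...
    print(json.dumps({"type": "system", "subtype": "init"}))
    print(json.dumps({"type": "assistant", "message": {"content": []}}))
    sys.exit(0)  # ...then a clean exit with NO {"type":"result"} line
""")

@pytest.fixture
def fake_claude(tmp_path, monkeypatch):
    exe = tmp_path / "claude"
    exe.write_text(FAKE_CLAUDE)
    exe.chmod(exe.stat().st_mode | stat.S_IXUSR)
    # Shadow the real binary so the service under test execs the fake.
    monkeypatch.setenv("PATH", f"{tmp_path}{os.pathsep}{os.environ['PATH']}")

def test_clean_exit_without_result_is_an_error(fake_claude):
    # Contract: any 4xx/5xx is acceptable; HTTP 200 with cost=null is not.
    with pytest.raises(HTTPException) as exc:
        run_headless_task(prompt="noop")  # hypothetical entry point
    assert exc.value.status_code >= 400
```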
Environment
- Commit: `42f75be` (the commit that landed #517's signal-kill fix, "fix(agent): classify signal-killed claude exits as 504, not fake auth failure (#516)"). The bug pre-existed; #517 just made it more visible because it was previously masked by the auth-fallback heuristic catching this same case as "auth failure".
Related
- #516 / #517 — sibling issue (the `return_code != 0` path; this report covers the `return_code == 0` path)
- `docker/base-image/agent_server/services/claude_code.py`
- `docker/base-image/agent_server/routers/chat.py`