fix(agent-server): bound accumulators + /health diagnostics (#333 hardening) #807
Merged
…dening)
Three speculative-but-defensible reductions in agent-server accumulator surface, plus the observability hook that turns the next reproduction into a one-curl diagnosis instead of an strace expedition.
1. gemini_runtime: replace per-call ThreadPoolExecutor(max_workers=1) at execute() and execute_task() with a module-level singleton, mirroring claude_code.py:63. Per-call executors rely on CPython's weakref-callback cleanup of worker threads, which is not deterministic under load; this is a plausible (though unproven) contributor to the pthread_cond_timedwait pattern in the strace.
2. state.AgentState.add_message: FIFO-trim conversation_history once it exceeds history_limit (default 1000, overridable via AGENT_HISTORY_LIMIT). Persistent history is in the backend DB; the in-memory list is only used by /api/chat/history + session-info counts. Without the cap it grew unbounded across days-long uptime.
3. /health: add a `diagnostics` block exposing thread_count, asyncio_task_count, running_executions, conversation_history_size, and conversation_history_limit. A future repro can now be triaged with `curl /health` instead of strace -c -p.
Does NOT claim to fix the symptom: only a multi-day soak can verify that, and the root cause in #333 is still unproven. Adjacent subprocess teardown fixes (#531, #618, #649, #657, #728/#730) have already shipped fixes for the most likely culprits; this PR locks in the remaining leak-surface reductions and the diagnostic hook so the issue stays falsifiable.
Tests: tests/unit/test_agent_server_hardening.py (13 cases, all green).
Relates to #333.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
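For readers who want to see the shape of change 1, here is a minimal sketch of a module-level singleton executor shared by execute() and execute_task(). It is not the code from gemini_runtime.py or claude_code.py:63: the `_EXECUTOR` name, the thread-name prefix, the `_run_blocking` stand-in, and the use of `run_in_executor` are all assumptions made for illustration.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical module-level singleton. The single worker thread is created
# lazily on first use and then reused, instead of building a fresh
# ThreadPoolExecutor(max_workers=1) inside every call.
_EXECUTOR = ThreadPoolExecutor(max_workers=1, thread_name_prefix="gemini-runtime")


def _run_blocking(prompt: str) -> str:
    """Stand-in for the blocking runtime call (hypothetical)."""
    return f"{prompt} handled on {threading.current_thread().name}"


async def execute(prompt: str) -> str:
    # Offload the blocking call to the shared executor.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_EXECUTOR, _run_blocking, prompt)


async def execute_task(prompt: str) -> str:
    # Same executor as execute(): no per-call thread pool is created.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_EXECUTOR, _run_blocking, prompt)
```

With this layout every call lands on the same worker thread, so the number of live threads does not depend on how promptly abandoned executors are cleaned up.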
c3e84be to 844dc2c
tests/lint_sys_modules.py flags bare `sys.modules.pop` and `sys.modules[...] = ...` at test-module scope. tests/unit/conftest.py already registers docker/base-image/agent_server as a namespace package in sys.modules via _preload_real_agent_server(), so the file-loader shim in this test was redundant — plain `from agent_server.<sub> import X` resolves through the conftest-registered __path__. 13 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vybe
approved these changes
May 12, 2026
Contributor
vybe
left a comment
Clean, focused hardening. Three targeted accumulator-leak reductions with good test coverage. Executor singleton correctly mirrors claude_code.py:63; history FIFO trim is well-guarded; diagnostics block is graceful under sync/async contexts. Leaving #333 open is the right call — root cause still needs a multi-day soak to verify.
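A minimal sketch of the history FIFO trim mentioned in the review, under stated assumptions: the dataclass layout, the message dict shape, and the fallback to the default when AGENT_HISTORY_LIMIT is missing, non-numeric, or non-positive are illustrative guesses, not the actual state.AgentState implementation.

```python
import os
from dataclasses import dataclass, field
from typing import Any

DEFAULT_HISTORY_LIMIT = 1000  # default stated in the PR description


def _history_limit_from_env() -> int:
    """Read AGENT_HISTORY_LIMIT; fall back to the default on bad input (assumed behavior)."""
    raw = os.environ.get("AGENT_HISTORY_LIMIT", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_HISTORY_LIMIT
    return value if value > 0 else DEFAULT_HISTORY_LIMIT


@dataclass
class AgentState:
    history_limit: int = field(default_factory=_history_limit_from_env)
    conversation_history: list[dict[str, Any]] = field(default_factory=list)

    def add_message(self, role: str, content: str) -> None:
        self.conversation_history.append({"role": role, "content": content})
        # FIFO trim: drop the oldest entries once the list exceeds the limit.
        overflow = len(self.conversation_history) - self.history_limit
        if overflow > 0:
            del self.conversation_history[:overflow]
```

Trimming in place keeps the same list object, so any code holding a reference to conversation_history keeps seeing the bounded list.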
dolho
added a commit
that referenced
this pull request
May 12, 2026
Same fix as the #807 follow-up: tests/lint_sys_modules.py flags bare sys.modules mutations at module scope. tests/unit/conftest.py's _preload_real_agent_server() already registers the namespace package, so plain `from agent_server.<sub> import X` resolves. 5 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vybe
pushed a commit
that referenced
this pull request
May 12, 2026
…rk skills (#160) (#813)
* fix(headless): accept clean exit + empty parent stream as success for context:fork skills (#160)
Headless tasks that invoke skills with `context: fork` in their frontmatter were failing with HTTP 500 "Task returned empty response", silently failing every scheduled invocation. The issue documented 8 consecutive daily failures on one agent.
Root cause: the fork mechanism in Claude Code runs the skill's work in a sub-context whose output never reaches the parent stdout. The parent process exits cleanly with `return_code == 0` and a populated `result` line (`cost_usd` and `duration_ms` are set, which is why `_classify_empty_result` returns None and execution falls through to the "build response_text" block). Pre-#160 the unconditional `if not response_text: raise 500` at the bottom of `_finalize_headless_result` had no notion of fork-style skills.
Fix: when the parent process reports clean completion (`return_code == 0` AND `metadata.cost_usd is not None`), trust it. Synthesize a short placeholder response instead of raising 500. Real plumbing failures (lost result line, dropped stdout pipe, child held stdout) are already caught by `_classify_empty_result` earlier in the function and never reach this branch; the new code path is exclusively for the "claude completed cleanly with no parent-stream output" case.
Tests: tests/unit/test_headless_context_fork_empty_response.py (5 cases). Pins that the placeholder fires only when both guards hold, the real-output case is untouched, non-zero exits still error, and the #520 lost-result case still returns 502 via `_classify_empty_result`.
Relates to #160.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(tests): drop bare sys.modules patching, rely on conftest preload
Same fix as the #807 follow-up: tests/lint_sys_modules.py flags bare sys.modules mutations at module scope. tests/unit/conftest.py's _preload_real_agent_server() already registers the namespace package, so plain `from agent_server.<sub> import X` resolves. 5 tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
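The two-guard rule in the headless fix above can be sketched as follows. This is an illustration only: `finalize_response_text`, `HeadlessTaskError`, `Metadata`, and the placeholder wording are hypothetical names, and the earlier `_classify_empty_result` step (which handles the 502 lost-result case) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder text; the real wording is not given in the commit message.
PLACEHOLDER = "Task completed with no output in the parent stream."


class HeadlessTaskError(Exception):
    """Hypothetical stand-in for the HTTP error raised by the server."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


@dataclass
class Metadata:
    cost_usd: Optional[float] = None
    duration_ms: Optional[int] = None


def finalize_response_text(response_text: str, return_code: int, metadata: Metadata) -> str:
    # Real output is passed through untouched.
    if response_text:
        return response_text
    # Both guards must hold: clean exit AND a populated result line.
    if return_code == 0 and metadata.cost_usd is not None:
        return PLACEHOLDER
    # Anything else keeps the pre-existing behavior.
    raise HeadlessTaskError(500, "Task returned empty response")
```

The placeholder path only triggers on an empty response, so the existing behavior for tasks that do produce parent-stream output is unchanged.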
Summary
Speculative-but-defensible hardening for the long-uptime futex-spin
symptom reported in #333. Does not claim to fix the symptom (root
cause in #333 is still unproven, and verification needs a multi-day
soak). It locks in the remaining accumulator-leak surface reductions and
adds the observability hook that turns the next reproduction into a
one-curl diagnosis instead of an strace expedition.
What changed
1. gemini_runtime.py: replace per-call ThreadPoolExecutor(max_workers=1) at execute() and execute_task() with a module-level singleton, mirroring claude_code.py:63. Per-call executors rely on CPython's weakref-callback cleanup of worker threads, which is not deterministic under load.
2. state.AgentState.add_message: FIFO-trim conversation_history once it exceeds history_limit (default 1000, overridable via AGENT_HISTORY_LIMIT). Persistent history is in the backend DB; the in-memory list is only used by /api/chat/history + session-info counts.
3. /health: add a `diagnostics` block exposing thread_count, asyncio_task_count, running_executions, conversation_history_size, and conversation_history_limit. A future repro can now be triaged with `curl /health` instead of strace -c -p.
What this does NOT do
reproducing host.
shipped since the report (#531, #618, #649, #657, #728/#730) may have already addressed the most likely culprits, but that remains unverified.
#333 stays open until reproduced on current main.
Test plan
tests/unit/test_agent_server_hardening.py: 13 cases covering history bound, gemini executor singleton, and /health diagnostics.
/health diagnostics block scraped at regular intervals, looking for thread_count, asyncio_task_count, or running_executions trending up over time.
Relates to #333.
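The soak check in the test plan could be sketched roughly as below. Both helpers are hypothetical: `collect_diagnostics` only mirrors the field names listed for the `diagnostics` block (the real /health handler is not shown in this PR page), and `rising_metrics` is a deliberately crude trend heuristic, not part of the PR.

```python
import asyncio
import threading
from typing import Sequence


def collect_diagnostics(running_executions: int, history_size: int, history_limit: int) -> dict:
    """Build a diagnostics snapshot that works with or without a running event loop."""
    try:
        task_count = len(asyncio.all_tasks())
    except RuntimeError:
        # asyncio.all_tasks() raises when called outside a running loop (sync context).
        task_count = 0
    return {
        "thread_count": threading.active_count(),
        "asyncio_task_count": task_count,
        "running_executions": running_executions,
        "conversation_history_size": history_size,
        "conversation_history_limit": history_limit,
    }


def rising_metrics(
    samples: Sequence[dict],
    keys: Sequence[str] = ("thread_count", "asyncio_task_count", "running_executions"),
) -> list:
    """Return the metrics that never decreased and ended higher than they started."""
    rising = []
    for key in keys:
        values = [sample[key] for sample in samples]
        never_decreased = all(b >= a for a, b in zip(values, values[1:]))
        if len(values) >= 2 and never_decreased and values[-1] > values[0]:
            rising.append(key)
    return rising
```

Feeding `rising_metrics` a list of snapshots scraped from /health over several days gives a first-pass answer to "is anything accumulating?"; a metric that fluctuates or stays flat is not reported.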
🤖 Generated with Claude Code