
fix(agent-server): bound accumulators + /health diagnostics (#333 hardening)#807

Merged
vybe merged 2 commits into dev from feature/333-agent-server-hardening on May 12, 2026
Conversation


@dolho dolho commented May 12, 2026

Summary

Speculative-but-defensible hardening for the long-uptime futex-spin
symptom reported in #333. This PR does not claim to fix that symptom
(the root cause in #333 is still unproven, and verification needs a
multi-day soak). It locks in the remaining accumulator-leak surface
reductions and adds the observability hook that turns the next
reproduction into a one-curl diagnosis instead of an strace expedition.

What changed

  • gemini_runtime.py: replace per-call ThreadPoolExecutor(max_workers=1)
    at execute() and execute_task() with a module-level singleton, mirroring
    claude_code.py:63. Per-call executors rely on CPython's weakref-callback
    cleanup of worker threads, which is not deterministic under load.
  • state.AgentState.add_message: FIFO-trim conversation_history once
    it exceeds history_limit (default 1000, overridable via
    AGENT_HISTORY_LIMIT). Persistent history is in the backend DB; the
    in-memory list is only used by /api/chat/history + session-info counts.
  • /health: add a diagnostics block exposing thread_count,
    asyncio_task_count, running_executions, conversation_history_size,
    and conversation_history_limit. A future repro can now be triaged with
    curl /health instead of strace -c -p.
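The executor change in the first bullet can be sketched as follows; the function and variable names here are illustrative, not the actual gemini_runtime.py identifiers:

```python
from concurrent.futures import ThreadPoolExecutor

# Module-level singleton: created once at import time and reused by every
# call, instead of constructing a fresh per-call pool whose worker thread
# is only reclaimed by CPython's non-deterministic weakref-callback cleanup.
_EXECUTOR = ThreadPoolExecutor(max_workers=1, thread_name_prefix="gemini")

def run_blocking(fn, *args):
    """Submit blocking work to the shared executor and wait for the result."""
    return _EXECUTOR.submit(fn, *args).result()
```

Under load this keeps the thread count flat: the process carries one long-lived worker thread rather than one short-lived thread per call.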

What this does NOT do

#333 stays open until reproduced on current main.

Test plan

  • tests/unit/test_agent_server_hardening.py — 13 cases covering
    history bound, gemini executor singleton, /health diagnostics.
  • Multi-day soak with the new /health diagnostics block scraped at
    regular intervals — looking for thread_count, asyncio_task_count,
    or running_executions trending up over time.

Relates to #333.

🤖 Generated with Claude Code

…dening)

Three speculative-but-defensible reductions in agent-server accumulator
surface, plus the observability hook that turns the next reproduction
into a one-curl diagnosis instead of an strace expedition.

1. gemini_runtime: replace per-call ThreadPoolExecutor(max_workers=1) at
   execute() and execute_task() with a module-level singleton, mirroring
   claude_code.py:63. Per-call executors rely on CPython's weakref-callback
   cleanup of worker threads, which is not deterministic under load — a
   plausible (though unproven) contributor to the pthread_cond_timedwait
   pattern in the strace.

2. state.AgentState.add_message: FIFO-trim conversation_history once it
   exceeds history_limit (default 1000, overridable via AGENT_HISTORY_LIMIT).
   Persistent history is in the backend DB; the in-memory list is only
   used by /api/chat/history + session-info counts. Without the cap it
   grew unbounded across days-long uptime.
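The FIFO trim can be sketched like this; the field and environment-variable names follow the description above, everything else is illustrative:

```python
import os

class AgentState:
    """Minimal sketch of the bounded in-memory history."""

    def __init__(self):
        # Default 1000, overridable via AGENT_HISTORY_LIMIT.
        self.history_limit = int(os.environ.get("AGENT_HISTORY_LIMIT", "1000"))
        self.conversation_history = []

    def add_message(self, message):
        self.conversation_history.append(message)
        # FIFO trim: drop the oldest entries once the cap is exceeded.
        excess = len(self.conversation_history) - self.history_limit
        if excess > 0:
            del self.conversation_history[:excess]
```

Dropping from the front preserves the most recent messages, which is what `/api/chat/history` readers care about; the backend DB remains the durable record.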

3. /health: add a `diagnostics` block exposing thread_count,
   asyncio_task_count, running_executions, conversation_history_size,
   and conversation_history_limit. A future repro can now be triaged
   with `curl /health` instead of strace -c -p.
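A hedged sketch of what such a diagnostics payload might look like; the helper name and its parameters are assumptions, and only the five field names come from this PR:

```python
import asyncio
import threading

def build_diagnostics(running_executions, history, history_limit):
    """Assemble the /health `diagnostics` block (illustrative shape)."""
    try:
        # Only valid from inside a running event loop.
        task_count = len(asyncio.all_tasks())
    except RuntimeError:
        # Called from a sync context with no running loop.
        task_count = 0
    return {
        "thread_count": threading.active_count(),
        "asyncio_task_count": task_count,
        "running_executions": running_executions,
        "conversation_history_size": len(history),
        "conversation_history_limit": history_limit,
    }
```

Scraped at intervals during a soak, a monotonically rising `thread_count` or `asyncio_task_count` would localize the leak without attaching strace to the process.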

Does NOT claim to fix the symptom — only multi-day soak can verify that,
and the root cause in #333 is still unproven. Adjacent subprocess
teardown fixes (#531, #618, #649, #657, #728/#730) have already shipped
the most likely culprits; this PR locks in the remaining leak-surface
reductions and the diagnostic hook so the issue stays falsifiable.

Tests: tests/unit/test_agent_server_hardening.py (13 cases, all green).

Relates to #333.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dolho dolho force-pushed the feature/333-agent-server-hardening branch from c3e84be to 844dc2c on May 12, 2026 at 09:06
tests/lint_sys_modules.py flags bare `sys.modules.pop` and
`sys.modules[...] = ...` at test-module scope. tests/unit/conftest.py
already registers docker/base-image/agent_server as a namespace package
in sys.modules via _preload_real_agent_server(), so the file-loader shim
in this test was redundant — plain `from agent_server.<sub> import X`
resolves through the conftest-registered __path__.
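The preload pattern can be sketched generically; the function below is an illustration of the technique, not the actual _preload_real_agent_server implementation:

```python
import sys
import types

def preload_namespace_package(name, path):
    """Register a namespace package once, in one sanctioned place, so test
    modules can use plain `from <name>.<sub> import X` without mutating
    sys.modules themselves (which the lint flags at module scope)."""
    if name in sys.modules:
        return sys.modules[name]
    pkg = types.ModuleType(name)
    pkg.__path__ = [path]  # submodules resolve through this search path
    sys.modules[name] = pkg
    return pkg
```

Centralizing the mutation in conftest keeps each test file import-clean and gives the linter a single allow-listed site.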

13 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@vybe vybe left a comment


Clean, focused hardening. Three targeted accumulator-leak reductions with good test coverage. Executor singleton correctly mirrors claude_code.py:63; history FIFO trim is well-guarded; diagnostics block is graceful under sync/async contexts. Leaving #333 open is the right call — root cause still needs a multi-day soak to verify.

@vybe vybe merged commit af5e408 into dev May 12, 2026
9 checks passed
dolho added a commit that referenced this pull request May 12, 2026
Same fix as the #807 follow-up: tests/lint_sys_modules.py flags bare
sys.modules mutations at module scope. tests/unit/conftest.py's
_preload_real_agent_server() already registers the namespace package,
so plain `from agent_server.<sub> import X` resolves.

5 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vybe pushed a commit that referenced this pull request May 12, 2026
…rk skills (#160) (#813)

* fix(headless): accept clean exit + empty parent stream as success for context:fork skills (#160)

Headless tasks that invoke skills with `context: fork` in their frontmatter
were failing with HTTP 500 "Task returned empty response", so every
scheduled invocation failed silently. The issue documented 8 consecutive
daily failures on one agent.
Root cause: the fork mechanism in Claude Code runs the skill's work in
a sub-context whose output never reaches the parent stdout. The parent
process exits cleanly with `return_code == 0` and a populated `result`
line (`cost_usd` and `duration_ms` set — that's why
`_classify_empty_result` returns None and execution falls through to
the "build response_text" block). Pre-#160 the unconditional
`if not response_text: raise 500` at the bottom of
`_finalize_headless_result` had no notion of fork-style skills.

Fix: when the parent process reports clean completion (`return_code == 0`
AND `metadata.cost_usd is not None`), trust it. Synthesize a short
placeholder response instead of raising 500. Real plumbing failures
(lost result line, dropped stdout pipe, child held stdout) are already
caught by `_classify_empty_result` earlier in the function and never
reach this branch — the new code path is exclusively for the
"claude completed cleanly with no parent-stream output" case.
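The guard described above can be sketched as follows; names and signature are simplified, since the real `_finalize_headless_result` takes richer metadata:

```python
def finalize_result(return_code, cost_usd, response_text):
    """Illustrative sketch: trust a clean parent exit even when the
    parent stream produced no text (the context: fork case)."""
    if response_text:
        return response_text  # real-output path: unchanged
    if return_code == 0 and cost_usd is not None:
        # Clean completion with a populated result line but no
        # parent-stream output: the forked sub-context did the work.
        # Synthesize a placeholder instead of raising HTTP 500.
        return "(skill ran in a forked context; no parent-stream output)"
    # Real failures never reach here in the actual code path; they are
    # caught earlier by _classify_empty_result.
    raise RuntimeError("Task returned empty response")
```

Both guards must hold before the placeholder fires, which is exactly what the test suite pins down.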

Tests: tests/unit/test_headless_context_fork_empty_response.py (5 cases).
Pins the placeholder fires only when both guards hold, the real-output
case is untouched, non-zero exits still error, and the #520 lost-result
case still returns 502 via `_classify_empty_result`.

Relates to #160.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): drop bare sys.modules patching, rely on conftest preload

Same fix as the #807 follow-up: tests/lint_sys_modules.py flags bare
sys.modules mutations at module scope. tests/unit/conftest.py's
_preload_real_agent_server() already registers the namespace package,
so plain `from agent_server.<sub> import X` resolves.

5 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
