
fix(agent-server): bound accumulators + /health diagnostics (#333 hardening)#807

Merged
vybe merged 2 commits into dev from feature/333-agent-server-hardening on May 12, 2026
Conversation


@dolho dolho commented May 12, 2026

Summary

Speculative-but-defensible hardening for the long-uptime futex-spin
symptom reported in #333. This PR does not claim to fix that symptom
(the root cause in #333 is still unproven, and verification needs a
multi-day soak). It locks in the remaining accumulator-leak surface
reductions and adds the observability hook that turns the next
reproduction into a one-curl diagnosis instead of an strace expedition.

What changed

  • gemini_runtime.py: replace per-call ThreadPoolExecutor(max_workers=1)
    at execute() and execute_task() with a module-level singleton, mirroring
    claude_code.py:63. Per-call executors rely on CPython's weakref-callback
    cleanup of worker threads, which is not deterministic under load.
  • state.AgentState.add_message: FIFO-trim conversation_history once
    it exceeds history_limit (default 1000, overridable via
    AGENT_HISTORY_LIMIT). Persistent history is in the backend DB; the
    in-memory list is only used by /api/chat/history + session-info counts.
  • /health: add a diagnostics block exposing thread_count,
    asyncio_task_count, running_executions, conversation_history_size,
    and conversation_history_limit. A future repro can now be triaged with
    curl /health instead of strace -c -p.
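The executor change in the first bullet can be sketched as follows; the function and variable names here are illustrative, not the actual gemini_runtime.py identifiers:

```python
from concurrent.futures import ThreadPoolExecutor

# Module-level singleton: created once at import time and reused by every
# call, instead of constructing a fresh per-call pool whose worker thread
# is only reclaimed by CPython's non-deterministic weakref-callback cleanup.
_EXECUTOR = ThreadPoolExecutor(max_workers=1, thread_name_prefix="gemini")

def run_blocking(fn, *args):
    """Submit blocking work to the shared executor and wait for the result."""
    return _EXECUTOR.submit(fn, *args).result()
```

Under load this keeps the thread count flat: the process carries one long-lived worker thread rather than one short-lived thread per call.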

What this does NOT do

#333 stays open until reproduced on current main.

Test plan

  • tests/unit/test_agent_server_hardening.py — 13 cases covering
    history bound, gemini executor singleton, /health diagnostics.
  • Multi-day soak with the new /health diagnostics block scraped at
    regular intervals — looking for thread_count, asyncio_task_count,
    or running_executions trending up over time.

Relates to #333.

🤖 Generated with Claude Code

…dening)

Three speculative-but-defensible reductions in agent-server accumulator
surface, plus the observability hook that turns the next reproduction
into a one-curl diagnosis instead of an strace expedition.

1. gemini_runtime: replace per-call ThreadPoolExecutor(max_workers=1) at
   execute() and execute_task() with a module-level singleton, mirroring
   claude_code.py:63. Per-call executors rely on CPython's weakref-callback
   cleanup of worker threads, which is not deterministic under load — a
   plausible (though unproven) contributor to the pthread_cond_timedwait
   pattern in the strace.

2. state.AgentState.add_message: FIFO-trim conversation_history once it
   exceeds history_limit (default 1000, overridable via AGENT_HISTORY_LIMIT).
   Persistent history is in the backend DB; the in-memory list is only
   used by /api/chat/history + session-info counts. Without the cap it
   grew unbounded across days-long uptime.
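The FIFO trim can be sketched like this; the field and environment-variable names follow the description above, everything else is illustrative:

```python
import os

class AgentState:
    """Minimal sketch of the bounded in-memory history."""

    def __init__(self):
        # Default 1000, overridable via AGENT_HISTORY_LIMIT.
        self.history_limit = int(os.environ.get("AGENT_HISTORY_LIMIT", "1000"))
        self.conversation_history = []

    def add_message(self, message):
        self.conversation_history.append(message)
        # FIFO trim: drop the oldest entries once the cap is exceeded.
        excess = len(self.conversation_history) - self.history_limit
        if excess > 0:
            del self.conversation_history[:excess]
```

Dropping from the front preserves the most recent messages, which is what `/api/chat/history` readers care about; the backend DB remains the durable record.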

3. /health: add a `diagnostics` block exposing thread_count,
   asyncio_task_count, running_executions, conversation_history_size,
   and conversation_history_limit. A future repro can now be triaged
   with `curl /health` instead of strace -c -p.
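A hedged sketch of what such a diagnostics payload might look like; the helper name and its parameters are assumptions, and only the five field names come from this PR:

```python
import asyncio
import threading

def build_diagnostics(running_executions, history, history_limit):
    """Assemble the /health `diagnostics` block (illustrative shape)."""
    try:
        # Only valid from inside a running event loop.
        task_count = len(asyncio.all_tasks())
    except RuntimeError:
        # Called from a sync context with no running loop.
        task_count = 0
    return {
        "thread_count": threading.active_count(),
        "asyncio_task_count": task_count,
        "running_executions": running_executions,
        "conversation_history_size": len(history),
        "conversation_history_limit": history_limit,
    }
```

Scraped at intervals during a soak, a monotonically rising `thread_count` or `asyncio_task_count` would localize the leak without attaching strace to the process.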

Does NOT claim to fix the symptom — only multi-day soak can verify that,
and the root cause in #333 is still unproven. Adjacent subprocess
teardown fixes (#531, #618, #649, #657, #728/#730) have already shipped
the most likely culprits; this PR locks in the remaining leak-surface
reductions and the diagnostic hook so the issue stays falsifiable.

Tests: tests/unit/test_agent_server_hardening.py (13 cases, all green).

Relates to #333.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dolho dolho force-pushed the feature/333-agent-server-hardening branch from c3e84be to 844dc2c on May 12, 2026 at 09:06
tests/lint_sys_modules.py flags bare `sys.modules.pop` and
`sys.modules[...] = ...` at test-module scope. tests/unit/conftest.py
already registers docker/base-image/agent_server as a namespace package
in sys.modules via _preload_real_agent_server(), so the file-loader shim
in this test was redundant — plain `from agent_server.<sub> import X`
resolves through the conftest-registered __path__.
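The preload pattern can be sketched generically; the function below is an illustration of the technique, not the actual _preload_real_agent_server implementation:

```python
import sys
import types

def preload_namespace_package(name, path):
    """Register a namespace package once, in one sanctioned place, so test
    modules can use plain `from <name>.<sub> import X` without mutating
    sys.modules themselves (which the lint flags at module scope)."""
    if name in sys.modules:
        return sys.modules[name]
    pkg = types.ModuleType(name)
    pkg.__path__ = [path]  # submodules resolve through this search path
    sys.modules[name] = pkg
    return pkg
```

Centralizing the mutation in conftest keeps each test file import-clean and gives the linter a single allow-listed site.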

13 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@vybe vybe left a comment


Clean, focused hardening. Three targeted accumulator-leak reductions with good test coverage. Executor singleton correctly mirrors claude_code.py:63; history FIFO trim is well-guarded; diagnostics block is graceful under sync/async contexts. Leaving #333 open is the right call — root cause still needs a multi-day soak to verify.

@vybe vybe merged commit af5e408 into dev May 12, 2026
9 checks passed
dolho added a commit that referenced this pull request May 12, 2026
Same fix as the #807 follow-up: tests/lint_sys_modules.py flags bare
sys.modules mutations at module scope. tests/unit/conftest.py's
_preload_real_agent_server() already registers the namespace package,
so plain `from agent_server.<sub> import X` resolves.

5 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vybe pushed a commit that referenced this pull request May 12, 2026
…rk skills (#160) (#813)

* fix(headless): accept clean exit + empty parent stream as success for context:fork skills (#160)

Headless tasks that invoke skills with `context: fork` in their frontmatter
were failing with HTTP 500 "Task returned empty response", so every
scheduled invocation failed silently. The issue documented 8 consecutive
daily failures on one agent.
Root cause: the fork mechanism in Claude Code runs the skill's work in
a sub-context whose output never reaches the parent stdout. The parent
process exits cleanly with `return_code == 0` and a populated `result`
line (`cost_usd` and `duration_ms` set — that's why
`_classify_empty_result` returns None and execution falls through to
the "build response_text" block). Pre-#160 the unconditional
`if not response_text: raise 500` at the bottom of
`_finalize_headless_result` had no notion of fork-style skills.

Fix: when the parent process reports clean completion (`return_code == 0`
AND `metadata.cost_usd is not None`), trust it. Synthesize a short
placeholder response instead of raising 500. Real plumbing failures
(lost result line, dropped stdout pipe, child held stdout) are already
caught by `_classify_empty_result` earlier in the function and never
reach this branch — the new code path is exclusively for the
"claude completed cleanly with no parent-stream output" case.
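The guard described above can be sketched as follows; names and signature are simplified, since the real `_finalize_headless_result` takes richer metadata:

```python
def finalize_result(return_code, cost_usd, response_text):
    """Illustrative sketch: trust a clean parent exit even when the
    parent stream produced no text (the context: fork case)."""
    if response_text:
        return response_text  # real-output path: unchanged
    if return_code == 0 and cost_usd is not None:
        # Clean completion with a populated result line but no
        # parent-stream output: the forked sub-context did the work.
        # Synthesize a placeholder instead of raising HTTP 500.
        return "(skill ran in a forked context; no parent-stream output)"
    # Real failures never reach here in the actual code path; they are
    # caught earlier by _classify_empty_result.
    raise RuntimeError("Task returned empty response")
```

Both guards must hold before the placeholder fires, which is exactly what the test suite pins down.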

Tests: tests/unit/test_headless_context_fork_empty_response.py (5 cases).
Pins the placeholder fires only when both guards hold, the real-output
case is untouched, non-zero exits still error, and the #520 lost-result
case still returns 502 via `_classify_empty_result`.

Relates to #160.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): drop bare sys.modules patching, rely on conftest preload

Same fix as the #807 follow-up: tests/lint_sys_modules.py flags bare
sys.modules mutations at module scope. tests/unit/conftest.py's
_preload_real_agent_server() already registers the namespace package,
so plain `from agent_server.<sub> import X` resolves.

5 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
