Skip to content

fix(circuit-breaker): drop-grace + pipe-drop reclassification — #474 follow-up to #798#873

Merged
vybe merged 8 commits into
devfrom
AndriiPasternak31/issue-474-plan
May 18, 2026
Merged

fix(circuit-breaker): drop-grace + pipe-drop reclassification — #474 follow-up to #798#873
vybe merged 8 commits into
devfrom
AndriiPasternak31/issue-474-plan

Conversation

@AndriiPasternak31
Copy link
Copy Markdown
Contributor

@AndriiPasternak31 AndriiPasternak31 commented May 17, 2026

Summary

Follow-up to #798 (now in dev) which narrowed the circuit-breaker
classifier to ConnectError / ConnectTimeout only. That fix is
correct but doesn't address two adjacent failure modes that became
visible once the classifier got tightened:

  1. Sibling-collapse on concurrent transport drops — when one
    caller catches a pipe drop and evicts the pooled httpx client,
    sibling callers race to build a fresh client against a half-closed
    peer and see ConnectError / TimeoutException. Under fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798's
    classifier those do increment the circuit, so 9-of-10 concurrent
    drops on a healthy agent still trip the breaker.

  2. Pipe drop on subprocess writes — when the Claude/Gemini child
    process exits early, headless_executor / gemini_runtime returned
    500 with an ERROR-level [Errno 32] log. That's adjacent to the
    503 status that SUB-003 in task_execution_service interprets as
    auth-class failure and triggers subscription auto-switch.

Composition with #798

This branch is rebased onto current dev (post-#798). The handlers
compose rather than replace:

  • CIRCUIT_FAILURE_EXCEPTIONS / TRANSIENT_TRANSPORT_EXCEPTIONS /
    is_circuit_failure() from fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798 are preserved as the canonical
    classification source.
  • The drop-grace logic is layered on top: in _request(), an explicit
    (httpx.ReadError, WriteError, RemoteProtocolError, BrokenPipeError, ConnectionResetError) handler is placed BEFORE the tuple-based
    TRANSIENT_TRANSPORT_EXCEPTIONS handler so the stamp+evict side
    effects always run on the genuine drop signals (and raw OSError
    subclasses, which fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798 deliberately let propagate, are caught here
    for the drop-coordination).
  • ConnectError / ConnectTimeout keep fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798's record_failure()
    semantics by default, with an in-grace-window override that raises
    AgentConnectionDroppedError instead — that's the actual
    sibling-collapse fix.
  • monitoring_service.check_network_health() follows the same layered
    pattern: my /health-specific liveness handler
    (ReadError/WriteError/RemoteProtocolErrorrecord_failure())
    sits BEFORE the shared TRANSIENT_TRANSPORT_EXCEPTIONS handler, so
    Python's first-match enforces the divergent semantics. This is the
    documented /health-specific divergence from _request().

Changes

  • agent_client.py_recent_drops per-base_url grace map
    (2 s window) + AgentConnectionDroppedError (subclass of
    AgentNotReachableError so existing tenacity / catch chains keep
    working) + _acquire_client(base_url) -> (client, is_pooled) so the
    pool isn't repopulated with transient sockets during a burst.
    Explicit drop handler stamps the grace window, evicts the pooled
    client (identity-checked so siblings don't double-close), and skips
    record_failure(). finally closes non-pooled (fresh-during-grace)
    clients on every exit path to prevent connection-handle leaks.
  • monitoring_service.check_network_health() — explicit
    (BrokenPipeError, ConnectionResetError) (circuit-neutral) and
    (httpx.ReadError, WriteError, RemoteProtocolError) (record_failure
    because partial response on /health = agent crash) handlers
    layered above the shared TRANSIENT_TRANSPORT_EXCEPTIONS handler.
  • headless_executor.py + gemini_runtime.py — catch
    BrokenPipeError / ConnectionResetError from subprocess stdin
    writes, return 502 ("Bad Gateway to Claude subprocess") instead
    of 500. 502 is semantically correct and collision-free with the
    SUB-003 auto-switch path. Log demoted to INFO.
  • docker-compose.sibling.yml — minimal Redis-only sibling stack
    on port 6390 (trinity-sibling-* containers) for running the
    Redis-backed integration tests without colliding with the production
    stack on 6379.

Tests

  • tests/unit/test_circuit_breaker.py — adds drop-classification
    tests on top of fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798's classifier tests. Burst of 10 concurrent
    transport drops produces 0 record_failure calls (was 9 under
    fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798's classifier alone); sibling ConnectError inside the grace
    window is collateral, same ConnectError outside is a real failure;
    pool eviction is idempotent across siblings; non-pooled clients
    are aclose'd on every exit path. 29 tests total.
  • tests/integration/test_circuit_breaker.py — adds
    TestConcurrentTransportDrops next to fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798's
    TestFailureClassification. Real AgentClient + real Redis via
    the sibling compose, drives the drop-burst path. test_raw_broken_pipe
    and test_raw_connection_reset updated to expect
    AgentConnectionDroppedError since the drop handler now catches
    these explicitly (was: expected raw exception to propagate).
  • tests/unit/test_monitoring_health_check_classification.py
    pins the client-vs-agent disconnect split on /health.
  • tests/unit/test_headless_executor_pipe_drop.py — pipe drop
    returns 502 (not 500, not 503), log level INFO.
  • tests/unit/test_pipe_close_no_auto_switch.py — SUB-003
    auto-switch is not triggered on 502.

Security

docs/security-reports/cso-2026-05-13-474-diff.md — CSO --diff audit
of the working tree pre-PR. 0 critical / 0 high / 0 medium across
secrets, dependencies, auth, injection, and platform patterns.

Test plan

  • uv run pytest tests/unit/test_circuit_breaker.py tests/unit/test_monitoring_health_check_classification.py tests/unit/test_headless_executor_pipe_drop.py tests/unit/test_pipe_close_no_auto_switch.py
  • Sibling stack up, then
    uv run pytest tests/integration/test_circuit_breaker.py
  • Manual: kick a multi-poller MCP-sync drop and confirm the
    per-agent circuit stays closed
  • Manual: trigger a Claude subprocess early-exit, confirm 502 and
    no SUB-003 auto-switch

Refs #474, follows #798

🤖 Generated with Claude Code

AndriiPasternak31 and others added 5 commits May 17, 2026 17:39
When the Claude/Gemini child process exits early (auth abort,
permission-mode kill, upstream cancellation), the parent receives
BrokenPipeError / ConnectionResetError on stdin write. The previous
broad-except path logged [Errno 32] at ERROR and returned 500.

Two problems with that:
  1. SUB-003 in task_execution_service.py treats 503 from the agent as
     auth-class failure and triggers subscription auto-switch. 500 is
     adjacent and produces operator-noise; 502 ("Bad Gateway to Claude
     subprocess") is the semantically correct status here and is
     collision-free with the auto-switch path.
  2. The ERROR log line was misleading — the agent itself is not faulted;
     the child process exited and the OS surfaced the pipe close. INFO is
     the right level.

Adds parallel handlers in headless_executor.execute_headless_task and
GeminiRuntime headless path. Tests pin:
  - pipe-drop returns 502 (not 500, not 503)
  - SUB-003 auto-switch is NOT triggered on this status

Refs #474

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
 Layer 2)

check_network_health() now distinguishes two error classes that #798's
narrow classifier treated as one:

  - BrokenPipeError / ConnectionResetError on a /health probe means the
    client-side socket died mid-flight (e.g., upstream MCP-sync
    cancellation cascading into the pooled keepalive). The agent's
    health hasn't been observed at all, so we MUST NOT record_failure().
    Return reachable=False but stay circuit-neutral.

  - httpx.ReadError / WriteError / RemoteProtocolError on a /health
    probe ARE liveness signals — if the agent partially writes then
    drops (event-loop wedge, OOM mid-write, segfault), the agent IS
    unhealthy. record_failure() applies, distinct from the
    client-side pipe drop above.

Tests pin the split — same exception types, opposite circuit semantics
depending on whether the disconnect was client-side or agent-side.

Refs #474

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lapse (#474)

Follow-up to #798's narrow classifier. That fix correctly stopped
ReadError/WriteError/RemoteProtocolError from incrementing the circuit,
but it didn't address the eviction-then-fresh-client race during a
concurrent transport-drop burst: when one caller catches a pipe drop
and evicts the pool entry, sibling callers race to build a fresh client
against a half-closed peer and see ConnectError/TimeoutException. Under
the old classifier those got record_failure(), so 9-of-10 concurrent
drops still tripped the breaker on a healthy agent.

This patch adds:

  - `AgentConnectionDroppedError` (subclass of AgentNotReachableError) —
    distinct typed signal for "in-flight transport broke" vs "agent
    unreachable from the start". Inherits from AgentNotReachableError
    so existing tenacity `retry_if_exception_type` chains and callers
    catching AgentNotReachableError are unaffected.

  - `_recent_drops: Dict[base_url, monotonic_ts]` + `_DROP_GRACE_SEC=2.0`
    — first caller to catch a transport drop stamps the base_url;
    siblings whose fresh-client retry fails with ConnectError/Timeout
    within the grace window are classified as collateral drops and
    raise AgentConnectionDroppedError without record_failure().

  - `_acquire_client(base_url) -> (client, is_pooled)` — replaces
    `_get_http_client`. While a drop-grace window is active, returns a
    fresh single-use client (so the pool isn't repopulated with
    transient sockets during a burst); the caller's `finally` closes it.
    `_get_http_client` retained as backward-compat wrapper.

  - Explicit handlers for ReadError/WriteError/RemoteProtocolError/
    BrokenPipeError/ConnectionResetError that stamp the drop, evict the
    pooled client (with an `is client` identity check so siblings don't
    double-close), and raise AgentConnectionDroppedError.

Scope: both `_recent_drops` and `_client_pool` are process-local. Under
multi-worker uvicorn deployments each worker has its own grace map and
pool, so the burst-neutralisation is per-worker. The Redis-backed
circuit (`CircuitState`) remains the single fleet-wide source of truth,
so transport drops still never hit `record_failure()` in any worker.

Tests (both unit + integration) pin:
  - Burst of 10 concurrent transport drops produces 0 record_failure
    calls (was 9 under #798's classifier alone).
  - Sibling ConnectError inside the grace window is collateral
    (no record_failure); same ConnectError outside the window is a
    real failure.
  - Pool eviction is idempotent across siblings (no double-close).
  - Non-pooled clients are always aclose()d on every exit path.

docker-compose.sibling.yml: minimal Redis-only sibling override
(port 6390, project `trinity-sibling`) for running
test_circuit_breaker integration tests against a real Redis without
spinning the full production stack.

Refs #474

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-contained security audit of the uncommitted working tree on this
branch (HEAD == merge-base with origin/dev pre-merge) before opening the
PR. 0 critical / 0 high / 0 medium across secrets, deps, auth, injection,
and platform patterns; 1 low (configuration); 2 info.

Refs #474

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three flows updated to reflect the commits earlier in this branch:

- agent-monitoring.md: revision-history entry for the
  check_network_health() exception-classification split (commit
  d53a2d6) — BrokenPipeError/ConnectionResetError as client-side
  drops (no record_failure) vs httpx.ReadError/WriteError/
  RemoteProtocolError as agent liveness signals on /health
  (record_failure). Line range for check_network_health updated
  170-269.

- execution-queue.md: revision-history entry for the agent_client
  drop-grace coordination (commit c9d6a09) — _recent_drops map,
  AgentConnectionDroppedError, _acquire_client tuple API,
  pool-eviction identity check. agent_client.py file-stats line
  count refreshed to 1130. Response-data-class and parsing-logic
  line refs realigned to post-rewrite positions.

- parallel-headless-execution.md: revision-history entry for the
  subprocess pipe-drop reclassification (commit 1cdbc57) — 502 not
  500, log demotion to INFO, no SUB-003 503-auth-class collision.
  Updated >Updated< front-matter line.

Refs #474

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AndriiPasternak31 AndriiPasternak31 force-pushed the AndriiPasternak31/issue-474-plan branch from bd2f8b3 to 53214af Compare May 17, 2026 16:48
AndriiPasternak31 and others added 3 commits May 17, 2026 18:10
…ness (#474)

- agent_client: TRANSIENT_TRANSPORT_EXCEPTIONS now uses httpx.TimeoutException
  (parent class) instead of enumerating Read/Write/Pool — covers any future
  subclass without re-touching the tuple. ConnectTimeout stays in
  CIRCUIT_FAILURE_EXCEPTIONS above (first-match in _request() wins).
- monitoring_service: lift CIRCUIT_FAILURE_EXCEPTIONS + TRANSIENT_TRANSPORT_EXCEPTIONS
  imports to module-top (with ImportError fallback for stub-fixtures) so test
  patches replacing services.agent_client with a MagicMock don't turn them
  into non-exception values that fail `except` at runtime.
- monitoring_service: add /health-specific `except httpx.TimeoutException`
  ABOVE the transient handler — for /health a timeout is a liveness signal
  (event-loop wedged), so record_failure() applies, opposite contract to
  AgentClient._request().
- monitoring_service: stabilise user-facing error string to "Connection refused"
  / "HTTP timeout"; full classname+message stays in logger.debug for triage,
  no longer leaks into dashboards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ISSUE-001 — Found by /qa on 2026-05-17
Report: .gstack/qa-reports/qa-report-localhost-2026-05-17.md

The #474 follow-up added BrokenPipeError/ConnectionResetError handling
to GeminiRuntime.execute_headless (gemini_runtime.py:728-739), parallel
to the Claude path in headless_executor.py:856-872. The Claude path
ships with tests/unit/test_headless_executor_pipe_drop.py (3 tests);
the Gemini path had none.

This file mirrors the Claude regression suite:
  - BrokenPipeError → INFO log + HTTP 502 + descriptive detail
  - ConnectionResetError → INFO log + HTTP 502
  - RuntimeError → ERROR log + HTTP 500 (negative case: branch must
    not absorb non-pipe failures)
  - TimeoutError → HTTP 504 (negative case: branch must not steal
    timeout classification, which is layered above the pipe handler)

Fixture pattern: monkeypatch GeminiRuntime.is_available -> True and
plant the exception via gemini_runtime.subprocess.Popen so the outer
except block is reached.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the four commits added since the 2026-05-13 report
(`cso-2026-05-13-474-diff.md`): per-base_url drop-grace (c0599c2),
monitoring split (7831a81), TimeoutException broadening + /health
timeout liveness (7af831f), and regression coverage (58c2ec4).

Result: 0 CRITICAL / 0 HIGH / 0 MEDIUM / 0 LOW. Two INFO-level positive
notes — stable user-facing error strings narrow info-disclosure surface
in fleet-health UI; sibling Redis compose review is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@vybe vybe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid follow-up to #798. The drop-grace mechanism is correctly designed — per-base_url map bounded by fleet size, identity-checked pool eviction to prevent sibling double-close, aclose() in finally on every exit path. The pipe-drop 502 reclassification closes the SUB-003 false auto-switch path cleanly (502 is collision-free with the 503-or-auth-string predicate). Monitoring /health divergence from _request() semantics is correctly documented and pinned in tests.

Two non-blocking notes:

  1. architecture.md entry for agent_client.py doesn't mention AgentConnectionDroppedError, _recent_drops, or _acquire_client — worth a one-liner update in a follow-up or the next architecture pass.
  2. Worth commenting on #474 to note what this PR resolved vs. what (if anything) remains, so the issue status is clear.

Two CSO audits included in the PR — appreciated.

@vybe vybe merged commit 5765ccc into dev May 18, 2026
9 checks passed
vybe added a commit that referenced this pull request May 18, 2026
- email-authentication.md: add #890 revision entry (contextual subject/body, HTML template)
- public-agent-links.md: note agent_name now passed to verification email
- write-user-memory.md: new flow for write_user_memory MCP tool (#888)
- gemini-runtime.md: pipe-drop reclassification to 502 (#873)
- parallel-headless-execution.md: same pipe-drop fix in headless_executor, useProcessWebSocket.js deleted
- agent-monitoring.md: note 502 handled correctly by existing health check classification

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vybe added a commit that referenced this pull request May 18, 2026
…on email (#890) (#892)

* feat(email): add context, agent name, and HTML template to verification email (#890)

- extend send_verification_code() with optional agent_name and context_label params
- subject now reads e.g. 'Your Trinity access code for "Research Assistant"' or 'Your Trinity login verification code'
- plain-text body names the agent/context and explains why the code was sent
- HTML email added with clean layout and large prominent code block
- auth.py passes context_label="Trinity login"; public.py passes agent_name from link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(feature-flows): sync flows for #888, #890, #873

- email-authentication.md: add #890 revision entry (contextual subject/body, HTML template)
- public-agent-links.md: note agent_name now passed to verification email
- write-user-memory.md: new flow for write_user_memory MCP tool (#888)
- gemini-runtime.md: pipe-drop reclassification to 502 (#873)
- parallel-headless-execution.md: same pipe-drop fix in headless_executor, useProcessWebSocket.js deleted
- agent-monitoring.md: note 502 handled correctly by existing health check classification

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
vybe added a commit that referenced this pull request May 18, 2026
- docs/memory/feature-flows.md: add #887 entry to Recent Updates table
- .claude/agents/test-runner.md: add 6 new unit test files (49 tests)
  from commits #887, #890, #873 to categories + Recent Test Additions
  (2026-05-18); bump unit test count ~207→~256, total ~2300→~2349

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vybe added a commit that referenced this pull request May 18, 2026
…sed (#887) (#893)

* fix(read-only): bake guard into base image, cover MultiEdit, fail-closed (#887)

The read-only guard was stored in the agent-writable .trinity/hooks/ path
and injected dynamically into settings.local.json. An agent could overwrite
the guard script or the hook registration, and MultiEdit calls were never
checked (no top-level file_path).

- Move read-only-guard.py to /opt/trinity/hooks/ (root-owned 0555 in base image)
- Register hook permanently in ~/.claude/settings.json via claude-settings.json
  (matcher now includes MultiEdit)
- inject_read_only_hooks() writes ONE file only: ~/.trinity/read-only-config.json
- remove_read_only_hooks() writes {"enabled": false}; strips legacy settings.local.json
  hook entry via _remove_legacy_settings_hook() for pre-#887 agents
- lifecycle.py always syncs config on every agent start (both enable and disable
  paths) to prevent stale enabled:true config persisting on the volume
- Add path_deny and bash_deny in guardrails-baseline.json to protect config file
- Wrap main() in run_hook() for fail-closed behavior (uncaught exception → exit 2)
- Add 18 unit tests in tests/unit/test_read_only_guard.py (all passing)

Fixes #887

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(flows): sync feature-flows.md and test catalog for #887

- docs/memory/feature-flows.md: add #887 entry to Recent Updates table
- .claude/agents/test-runner.md: add 6 new unit test files (49 tests)
  from commits #887, #890, #873 to categories + Recent Test Additions
  (2026-05-18); bump unit test count ~207→~256, total ~2300→~2349

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(read-only): stub remove_read_only_hooks in readiness probe fixture

PR #893 added `remove_read_only_hooks` to lifecycle.py's import line
(`from .read_only import inject_read_only_hooks, remove_read_only_hooks`)
to support the always-sync-on-start behavior. The readiness-probe test
fixture stubs `services.agent_service.read_only` so lifecycle.py can be
loaded in isolation, but the stub only exposed `inject_read_only_hooks`.

Result: lifecycle.py module load raises ImportError during test
collection, so all 5 tests in test_agent_readiness_probe.py error out.
Caught by the regression-diff CI job.

Fix: add `remove_read_only_hooks=None` to the SimpleNamespace stub.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
AndriiPasternak31 added a commit that referenced this pull request May 20, 2026
Brings dev forward through 2026-05-20 head (which now includes #873's
pipe-drop carry-forward, #474's pipe-drop reclassification, the canary
B-tier invariants, public_memory + a2a routers, sleep-echo template,
read-only guard, and verification-email work).

Conflict resolved: docs/memory/feature-flows/parallel-headless-execution.md —
both sides added 2026-05-13 revision-history rows. Kept all three new
entries in reverse-chronological order: #873 (2026-05-18, from dev),
then #678 (2026-05-13, ours), then #474 (2026-05-13, from dev). The
file's "Updated" header already reads 2026-05-18 from dev's side; our
2026-05-13 #678 entry is older than that, so the header is preserved
as-is.

All other touched files (task_execution_service.py, headless_executor.py,
architecture.md, database.py, migrations.py, schema.py) auto-merged
cleanly — git's strategy picked up our additive changes against dev's
independent ones without semantic overlap.

Refs #678
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants