fix(circuit-breaker): drop-grace + pipe-drop reclassification — #474 follow-up to #798#873
Merged
Merged
Conversation
When the Claude/Gemini child process exits early (auth abort,
permission-mode kill, upstream cancellation), the parent receives
BrokenPipeError / ConnectionResetError on stdin write. The previous
broad-except path logged [Errno 32] at ERROR and returned 500.
Two problems with that:
1. SUB-003 in task_execution_service.py treats 503 from the agent as
auth-class failure and triggers subscription auto-switch. 500 is
adjacent and produces operator-noise; 502 ("Bad Gateway to Claude
subprocess") is the semantically correct status here and is
collision-free with the auto-switch path.
2. The ERROR log line was misleading — the agent itself is not faulted;
the child process exited and the OS surfaced the pipe close. INFO is
the right level.
Adds parallel handlers in headless_executor.execute_headless_task and
GeminiRuntime headless path. Tests pin:
- pipe-drop returns 502 (not 500, not 503)
- SUB-003 auto-switch is NOT triggered on this status
Refs #474
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layer 2) check_network_health() now distinguishes two error classes that #798's narrow classifier treated as one: - BrokenPipeError / ConnectionResetError on a /health probe means the client-side socket died mid-flight (e.g., upstream MCP-sync cancellation cascading into the pooled keepalive). The agent's health hasn't been observed at all, so we MUST NOT record_failure(). Return reachable=False but stay circuit-neutral. - httpx.ReadError / WriteError / RemoteProtocolError on a /health probe ARE liveness signals — if the agent partially writes then drops (event-loop wedge, OOM mid-write, segfault), the agent IS unhealthy. record_failure() applies, distinct from the client-side pipe drop above. Tests pin the split — same exception types, opposite circuit semantics depending on whether the disconnect was client-side or agent-side. Refs #474 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lapse (#474) Follow-up to #798's narrow classifier. That fix correctly stopped ReadError/WriteError/RemoteProtocolError from incrementing the circuit, but it didn't address the eviction-then-fresh-client race during a concurrent transport-drop burst: when one caller catches a pipe drop and evicts the pool entry, sibling callers race to build a fresh client against a half-closed peer and see ConnectError/TimeoutException. Under the old classifier those got record_failure(), so 9-of-10 concurrent drops still tripped the breaker on a healthy agent. This patch adds: - `AgentConnectionDroppedError` (subclass of AgentNotReachableError) — distinct typed signal for "in-flight transport broke" vs "agent unreachable from the start". Inherits from AgentNotReachableError so existing tenacity `retry_if_exception_type` chains and callers catching AgentNotReachableError are unaffected. - `_recent_drops: Dict[base_url, monotonic_ts]` + `_DROP_GRACE_SEC=2.0` — first caller to catch a transport drop stamps the base_url; siblings whose fresh-client retry fails with ConnectError/Timeout within the grace window are classified as collateral drops and raise AgentConnectionDroppedError without record_failure(). - `_acquire_client(base_url) -> (client, is_pooled)` — replaces `_get_http_client`. While a drop-grace window is active, returns a fresh single-use client (so the pool isn't repopulated with transient sockets during a burst); the caller's `finally` closes it. `_get_http_client` retained as backward-compat wrapper. - Explicit handlers for ReadError/WriteError/RemoteProtocolError/ BrokenPipeError/ConnectionResetError that stamp the drop, evict the pooled client (with an `is client` identity check so siblings don't double-close), and raise AgentConnectionDroppedError. Scope: both `_recent_drops` and `_client_pool` are process-local. Under multi-worker uvicorn deployments each worker has its own grace map and pool, so the burst-neutralisation is per-worker. The Redis-backed circuit (`CircuitState`) remains the single fleet-wide source of truth, so transport drops still never hit `record_failure()` in any worker. Tests (both unit + integration) pin: - Burst of 10 concurrent transport drops produces 0 record_failure calls (was 9 under #798's classifier alone). - Sibling ConnectError inside the grace window is collateral (no record_failure); same ConnectError outside the window is a real failure. - Pool eviction is idempotent across siblings (no double-close). - Non-pooled clients are always aclose()d on every exit path. docker-compose.sibling.yml: minimal Redis-only sibling override (port 6390, project `trinity-sibling`) for running test_circuit_breaker integration tests against a real Redis without spinning the full production stack. Refs #474 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-contained security audit of the uncommitted working tree on this branch (HEAD == merge-base with origin/dev pre-merge) before opening the PR. 0 critical / 0 high / 0 medium across secrets, deps, auth, injection, and platform patterns; 1 low (configuration); 2 info. Refs #474 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three flows updated to reflect the commits earlier in this branch: - agent-monitoring.md: revision-history entry for the check_network_health() exception-classification split (commit d53a2d6) — BrokenPipeError/ConnectionResetError as client-side drops (no record_failure) vs httpx.ReadError/WriteError/ RemoteProtocolError as agent liveness signals on /health (record_failure). Line range for check_network_health updated 170-269. - execution-queue.md: revision-history entry for the agent_client drop-grace coordination (commit c9d6a09) — _recent_drops map, AgentConnectionDroppedError, _acquire_client tuple API, pool-eviction identity check. agent_client.py file-stats line count refreshed to 1130. Response-data-class and parsing-logic line refs realigned to post-rewrite positions. - parallel-headless-execution.md: revision-history entry for the subprocess pipe-drop reclassification (commit 1cdbc57) — 502 not 500, log demotion to INFO, no SUB-003 503-auth-class collision. Updated >Updated< front-matter line. Refs #474 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bd2f8b3 to
53214af
Compare
…ness (#474) - agent_client: TRANSIENT_TRANSPORT_EXCEPTIONS now uses httpx.TimeoutException (parent class) instead of enumerating Read/Write/Pool — covers any future subclass without re-touching the tuple. ConnectTimeout stays in CIRCUIT_FAILURE_EXCEPTIONS above (first-match in _request() wins). - monitoring_service: lift CIRCUIT_FAILURE_EXCEPTIONS + TRANSIENT_TRANSPORT_EXCEPTIONS imports to module-top (with ImportError fallback for stub-fixtures) so test patches replacing services.agent_client with a MagicMock don't turn them into non-exception values that fail `except` at runtime. - monitoring_service: add /health-specific `except httpx.TimeoutException` ABOVE the transient handler — for /health a timeout is a liveness signal (event-loop wedged), so record_failure() applies, opposite contract to AgentClient._request(). - monitoring_service: stabilise user-facing error string to "Connection refused" / "HTTP timeout"; full classname+message stays in logger.debug for triage, no longer leaks into dashboards. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ISSUE-001 — Found by /qa on 2026-05-17 Report: .gstack/qa-reports/qa-report-localhost-2026-05-17.md The #474 follow-up added BrokenPipeError/ConnectionResetError handling to GeminiRuntime.execute_headless (gemini_runtime.py:728-739), parallel to the Claude path in headless_executor.py:856-872. The Claude path ships with tests/unit/test_headless_executor_pipe_drop.py (3 tests); the Gemini path had none. This file mirrors the Claude regression suite: - BrokenPipeError → INFO log + HTTP 502 + descriptive detail - ConnectionResetError → INFO log + HTTP 502 - RuntimeError → ERROR log + HTTP 500 (negative case: branch must not absorb non-pipe failures) - TimeoutError → HTTP 504 (negative case: branch must not steal timeout classification, which is layered above the pipe handler) Fixture pattern: monkeypatch GeminiRuntime.is_available -> True and plant the exception via gemini_runtime.subprocess.Popen so the outer except block is reached. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the four commits added since the 2026-05-13 report (`cso-2026-05-13-474-diff.md`): per-base_url drop-grace (c0599c2), monitoring split (7831a81), TimeoutException broadening + /health timeout liveness (7af831f), and regression coverage (58c2ec4). Result: 0 CRITICAL / 0 HIGH / 0 MEDIUM / 0 LOW. Two INFO-level positive notes — stable user-facing error strings narrow info-disclosure surface in fleet-health UI; sibling Redis compose review is clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vybe
approved these changes
May 18, 2026
Contributor
vybe
left a comment
There was a problem hiding this comment.
Solid follow-up to #798. The drop-grace mechanism is correctly designed — per-base_url map bounded by fleet size, identity-checked pool eviction to prevent sibling double-close, aclose() in finally on every exit path. The pipe-drop 502 reclassification closes the SUB-003 false auto-switch path cleanly (502 is collision-free with the 503-or-auth-string predicate). Monitoring /health divergence from _request() semantics is correctly documented and pinned in tests.
Two non-blocking notes:
architecture.mdentry foragent_client.pydoesn't mentionAgentConnectionDroppedError,_recent_drops, or_acquire_client— worth a one-liner update in a follow-up or the next architecture pass.- Worth commenting on #474 to note what this PR resolved vs. what (if anything) remains, so the issue status is clear.
Two CSO audits included in the PR — appreciated.
vybe
added a commit
that referenced
this pull request
May 18, 2026
- email-authentication.md: add #890 revision entry (contextual subject/body, HTML template) - public-agent-links.md: note agent_name now passed to verification email - write-user-memory.md: new flow for write_user_memory MCP tool (#888) - gemini-runtime.md: pipe-drop reclassification to 502 (#873) - parallel-headless-execution.md: same pipe-drop fix in headless_executor, useProcessWebSocket.js deleted - agent-monitoring.md: note 502 handled correctly by existing health check classification Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vybe
added a commit
that referenced
this pull request
May 18, 2026
…on email (#890) (#892) * feat(email): add context, agent name, and HTML template to verification email (#890) - extend send_verification_code() with optional agent_name and context_label params - subject now reads e.g. 'Your Trinity access code for "Research Assistant"' or 'Your Trinity login verification code' - plain-text body names the agent/context and explains why the code was sent - HTML email added with clean layout and large prominent code block - auth.py passes context_label="Trinity login"; public.py passes agent_name from link Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(feature-flows): sync flows for #888, #890, #873 - email-authentication.md: add #890 revision entry (contextual subject/body, HTML template) - public-agent-links.md: note agent_name now passed to verification email - write-user-memory.md: new flow for write_user_memory MCP tool (#888) - gemini-runtime.md: pipe-drop reclassification to 502 (#873) - parallel-headless-execution.md: same pipe-drop fix in headless_executor, useProcessWebSocket.js deleted - agent-monitoring.md: note 502 handled correctly by existing health check classification Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
vybe
added a commit
that referenced
this pull request
May 18, 2026
- docs/memory/feature-flows.md: add #887 entry to Recent Updates table - .claude/agents/test-runner.md: add 6 new unit test files (49 tests) from commits #887, #890, #873 to categories + Recent Test Additions (2026-05-18); bump unit test count ~207→~256, total ~2300→~2349 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vybe
added a commit
that referenced
this pull request
May 18, 2026
…sed (#887) (#893) * fix(read-only): bake guard into base image, cover MultiEdit, fail-closed (#887) The read-only guard was stored in the agent-writable .trinity/hooks/ path and injected dynamically into settings.local.json. An agent could overwrite the guard script or the hook registration, and MultiEdit calls were never checked (no top-level file_path). - Move read-only-guard.py to /opt/trinity/hooks/ (root-owned 0555 in base image) - Register hook permanently in ~/.claude/settings.json via claude-settings.json (matcher now includes MultiEdit) - inject_read_only_hooks() writes ONE file only: ~/.trinity/read-only-config.json - remove_read_only_hooks() writes {"enabled": false}; strips legacy settings.local.json hook entry via _remove_legacy_settings_hook() for pre-#887 agents - lifecycle.py always syncs config on every agent start (both enable and disable paths) to prevent stale enabled:true config persisting on the volume - Add path_deny and bash_deny in guardrails-baseline.json to protect config file - Wrap main() in run_hook() for fail-closed behavior (uncaught exception → exit 2) - Add 18 unit tests in tests/unit/test_read_only_guard.py (all passing) Fixes #887 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(flows): sync feature-flows.md and test catalog for #887 - docs/memory/feature-flows.md: add #887 entry to Recent Updates table - .claude/agents/test-runner.md: add 6 new unit test files (49 tests) from commits #887, #890, #873 to categories + Recent Test Additions (2026-05-18); bump unit test count ~207→~256, total ~2300→~2349 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * test(read-only): stub remove_read_only_hooks in readiness probe fixture PR #893 added `remove_read_only_hooks` to lifecycle.py's import line (`from .read_only import inject_read_only_hooks, remove_read_only_hooks`) to support the always-sync-on-start behavior. The readiness-probe test fixture stubs `services.agent_service.read_only` so lifecycle.py can be loaded in isolation, but the stub only exposed `inject_read_only_hooks`. Result: lifecycle.py module load raises ImportError during test collection, so all 5 tests in test_agent_readiness_probe.py error out. Caught by the regression-diff CI job. Fix: add `remove_read_only_hooks=None` to the SimpleNamespace stub. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
4 tasks
AndriiPasternak31
added a commit
that referenced
this pull request
May 20, 2026
Brings dev forward through 2026-05-20 head (which now includes #873's pipe-drop carry-forward, #474's pipe-drop reclassification, the canary B-tier invariants, public_memory + a2a routers, sleep-echo template, read-only guard, and verification-email work). Conflict resolved: docs/memory/feature-flows/parallel-headless-execution.md — both sides added 2026-05-13 revision-history rows. Kept all three new entries in reverse-chronological order: #873 (2026-05-18, from dev), then #678 (2026-05-13, ours), then #474 (2026-05-13, from dev). The file's "Updated" header already reads 2026-05-18 from dev's side; our 2026-05-13 #678 entry is older than that, so the header is preserved as-is. All other touched files (task_execution_service.py, headless_executor.py, architecture.md, database.py, migrations.py, schema.py) auto-merged cleanly — git's strategy picked up our additive changes against dev's independent ones without semantic overlap. Refs #678
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #798 (now in
dev) which narrowed the circuit-breakerclassifier to
ConnectError/ConnectTimeoutonly. That fix iscorrect but doesn't address two adjacent failure modes that became
visible once the classifier got tightened:
Sibling-collapse on concurrent transport drops — when one
caller catches a pipe drop and evicts the pooled httpx client,
sibling callers race to build a fresh client against a half-closed
peer and see
ConnectError/TimeoutException. Under fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798'sclassifier those do increment the circuit, so 9-of-10 concurrent
drops on a healthy agent still trip the breaker.
Pipe drop on subprocess writes — when the Claude/Gemini child
process exits early,
headless_executor/gemini_runtimereturned500 with an ERROR-level
[Errno 32]log. That's adjacent to the503 status that SUB-003 in
task_execution_serviceinterprets asauth-class failure and triggers subscription auto-switch.
Composition with #798
This branch is rebased onto current
dev(post-#798). The handlerscompose rather than replace:
CIRCUIT_FAILURE_EXCEPTIONS/TRANSIENT_TRANSPORT_EXCEPTIONS/is_circuit_failure()from fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798 are preserved as the canonicalclassification source.
_request(), an explicit(httpx.ReadError, WriteError, RemoteProtocolError, BrokenPipeError, ConnectionResetError)handler is placed BEFORE the tuple-basedTRANSIENT_TRANSPORT_EXCEPTIONShandler so the stamp+evict sideeffects always run on the genuine drop signals (and raw OSError
subclasses, which fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798 deliberately let propagate, are caught here
for the drop-coordination).
ConnectError/ConnectTimeoutkeep fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798'srecord_failure()semantics by default, with an in-grace-window override that raises
AgentConnectionDroppedErrorinstead — that's the actualsibling-collapse fix.
monitoring_service.check_network_health()follows the same layeredpattern: my
/health-specific liveness handler(
ReadError/WriteError/RemoteProtocolError→record_failure())sits BEFORE the shared
TRANSIENT_TRANSPORT_EXCEPTIONShandler, soPython's first-match enforces the divergent semantics. This is the
documented
/health-specific divergence from_request().Changes
agent_client.py—_recent_dropsper-base_url grace map(2 s window) +
AgentConnectionDroppedError(subclass ofAgentNotReachableErrorso existing tenacity / catch chains keepworking) +
_acquire_client(base_url) -> (client, is_pooled)so thepool isn't repopulated with transient sockets during a burst.
Explicit drop handler stamps the grace window, evicts the pooled
client (identity-checked so siblings don't double-close), and skips
record_failure().finallycloses non-pooled (fresh-during-grace)clients on every exit path to prevent connection-handle leaks.
monitoring_service.check_network_health()— explicit(BrokenPipeError, ConnectionResetError)(circuit-neutral) and(httpx.ReadError, WriteError, RemoteProtocolError)(record_failurebecause partial response on
/health= agent crash) handlerslayered above the shared
TRANSIENT_TRANSPORT_EXCEPTIONShandler.headless_executor.py+gemini_runtime.py— catchBrokenPipeError/ConnectionResetErrorfrom subprocess stdinwrites, return 502 ("Bad Gateway to Claude subprocess") instead
of 500. 502 is semantically correct and collision-free with the
SUB-003 auto-switch path. Log demoted to INFO.
docker-compose.sibling.yml— minimal Redis-only sibling stackon port 6390 (
trinity-sibling-*containers) for running theRedis-backed integration tests without colliding with the production
stack on 6379.
Tests
tests/unit/test_circuit_breaker.py— adds drop-classificationtests on top of fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798's classifier tests. Burst of 10 concurrent
transport drops produces 0
record_failurecalls (was 9 underfix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798's classifier alone); sibling
ConnectErrorinside the gracewindow is collateral, same
ConnectErroroutside is a real failure;pool eviction is idempotent across siblings; non-pooled clients
are aclose'd on every exit path. 29 tests total.
tests/integration/test_circuit_breaker.py— addsTestConcurrentTransportDropsnext to fix(circuit-breaker): only count TCP unreachability toward circuit (#474) #798'sTestFailureClassification. RealAgentClient+ real Redis viathe sibling compose, drives the drop-burst path.
test_raw_broken_pipeand
test_raw_connection_resetupdated to expectAgentConnectionDroppedErrorsince the drop handler now catchesthese explicitly (was: expected raw exception to propagate).
tests/unit/test_monitoring_health_check_classification.py—pins the client-vs-agent disconnect split on
/health.tests/unit/test_headless_executor_pipe_drop.py— pipe dropreturns 502 (not 500, not 503), log level INFO.
tests/unit/test_pipe_close_no_auto_switch.py— SUB-003auto-switch is not triggered on 502.
Security
docs/security-reports/cso-2026-05-13-474-diff.md— CSO--diffauditof the working tree pre-PR. 0 critical / 0 high / 0 medium across
secrets, dependencies, auth, injection, and platform patterns.
Test plan
uv run pytest tests/unit/test_circuit_breaker.py tests/unit/test_monitoring_health_check_classification.py tests/unit/test_headless_executor_pipe_drop.py tests/unit/test_pipe_close_no_auto_switch.pyuv run pytest tests/integration/test_circuit_breaker.pyper-agent circuit stays closed
no SUB-003 auto-switch
Refs #474, follows #798
🤖 Generated with Claude Code