fix: make interception proxy streaming resilient to upstream cuts#1194
Merged
rasdani added a commit that referenced this pull request on Apr 19, 2026:
The `except BaseException` at the post-loop `await response_future` was catching `asyncio.CancelledError` (which inherits from `BaseException`, not `Exception`, since Python 3.8) and funneling it into `_set_rollout_error` as a `StreamInterrupted`. `CancelledError` is raised during normal teardown when `InterceptionServer.unregister_rollout` calls `future.cancel()` on still-pending intercepts — this is expected cleanup, not a stream failure. Narrow the catch to `except Exception` and re-raise `CancelledError` explicitly so cancellation semantics stay intact. Real error paths (vLLM raised, backend disconnect mid-compute, etc.) are all `Exception` subclasses and still surface correctly. Addresses Cursor Bugbot review on #1194. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
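The narrowed catch described above can be sketched as follows. `StreamInterrupted` is the exception type from the PR; the function and parameter names (`await_response`, `set_rollout_error`) are illustrative stand-ins for the real handler code, not the actual API.

```python
import asyncio


class StreamInterrupted(Exception):
    """Raised when the agent-facing stream dies mid-rollout."""


async def await_response(response_future: asyncio.Future, set_rollout_error) -> None:
    try:
        await response_future
    except asyncio.CancelledError:
        # Normal teardown: unregister_rollout cancelled the pending intercept.
        # Re-raise so cancellation semantics stay intact.
        raise
    except Exception as exc:  # narrowed from BaseException
        # Real error paths (vLLM raised, backend disconnect, ...) are all
        # Exception subclasses and still surface here.
        set_rollout_error(StreamInterrupted(f"upstream error: {exc!r}"))
```

With `except BaseException`, the first branch would never fire and every cancellation would be misreported as a stream failure.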
rasdani added a commit that referenced this pull request on Apr 19, 2026:
Crash before the agent made a single LLM call means the interception server never saw the rollout, so it could not have populated `state["error"]` with anything. Dropping the AgentError assignment in this branch left `state["error"]` unset for real crashes — sandbox OOM, install/bootstrap failure, opencode binary crash — which broke retries and muted a whole class of legitimate agent-side failures. Keep the refactor's intent (don't clobber proxy-originated errors after the agent has started) by gating on `num_turns`: - `num_turns == 0`: agent crashed before any interception — set `state["error"] = AgentError(...)` as before. - `num_turns > 0`: agent crashed after at least one LLM call — trust whatever the interception server recorded (if anything). Log the exit details for diagnostics; do not clobber `state["error"]`. Addresses Cursor Bugbot review on #1194. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
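The `num_turns` gating can be sketched as below. `AgentError` and `state["error"]` are from the PR; `record_agent_exit` and its parameters are hypothetical names for illustration only.

```python
class AgentError(Exception):
    pass


def record_agent_exit(state: dict, num_turns: int, exit_code: int, detail: str) -> None:
    """Gate error assignment on num_turns (sketch of the branch described above)."""
    if num_turns == 0:
        # Crash before any LLM call: the interception server never saw the
        # rollout, so nothing else could have populated state["error"].
        state["error"] = AgentError(
            f"agent exited {exit_code} before first turn: {detail}"
        )
    else:
        # Agent made at least one call: trust whatever the interception
        # server recorded (if anything); log for diagnostics, never clobber.
        print(f"agent exit {exit_code} after {num_turns} turns; "
              f"error={state.get('error')!r}")
```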
rasdani added a commit that referenced this pull request on Apr 19, 2026:
Reuse a single ``chunk_queue.get()`` task across keepalive cycles instead of recreating it each iteration. Python 3.10/3.11's ``asyncio.wait_for`` has a race where a timeout cancels the inner task; if a producer put an item on the queue in the same event-loop step, the task may dequeue the item before being marked cancelled, silently dropping it. Python 3.12 fixed this by switching to ``asyncio.timeout`` internally, but ``verifiers`` supports 3.10+. ``asyncio.wait`` does not cancel its tasks on timeout, so a pending get() task carries forward safely across keepalive cycles. Ensure the task is cancelled on any handler exit via ``finally``. Addresses Cursor Bugbot review on #1194. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
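The task-reuse pattern can be sketched as a self-contained loop. `KEEPALIVE_INTERVAL_SECONDS` and the `: keepalive` bytes are from the PR; `stream_chunks`, `write`, and the `None` end-of-stream sentinel are illustrative assumptions, and the `interval` parameter exists only so the sketch is testable.

```python
import asyncio

KEEPALIVE_INTERVAL_SECONDS = 10.0


async def stream_chunks(chunk_queue, write, interval=KEEPALIVE_INTERVAL_SECONDS):
    """Drain chunk_queue to `write`, emitting a keepalive on idle gaps.

    A None item marks end-of-stream (sketch; names are illustrative).
    """
    get_task = None
    try:
        while True:
            if get_task is None:
                get_task = asyncio.create_task(chunk_queue.get())
            # asyncio.wait does NOT cancel get_task on timeout, so a pending
            # get() carries forward to the next iteration and no queued item
            # can be dropped -- unlike asyncio.wait_for on 3.10/3.11.
            done, _pending = await asyncio.wait({get_task}, timeout=interval)
            if not done:
                await write(b": keepalive\n\n")
                continue
            chunk = get_task.result()
            get_task = None
            if chunk is None:
                return
            await write(chunk)
    finally:
        # Ensure the dangling get() is cancelled on any handler exit.
        if get_task is not None:
            get_task.cancel()
```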
The interception proxy currently holds an idle HTTP/SSE connection to the agent for ~27s while CliAgentEnv.get_model_response awaits a non-streaming vLLM response. During that silence, an upstream idle-timeout (tunnel, LB, or kube-proxy — not yet precisely pinpointed) closes the connection; the proxy's next response.write(chunk) raises "Cannot write to closing transport". Opencode's Vercel-AI-SDK sees a truncated stream, treats it as clean completion, and exits with exit_code=0 and zero turns — bypassing every retry / non-zero-exit mitigation. Fix by emitting SSE keepalive comments (: keepalive\n\n) every 10s while waiting on chunk_queue. SSE comments are spec-compliant and every client ignores them, but the bytes keep the TCP connection warm so upstream idle-timeouts don't fire. Also attach elapsed-time info to every StreamInterrupted so logs yield a histogram of how long we were waiting before each cut — useful for confirming the fix landed and for catching any residual failure class. Builds on #1191 (StreamInterrupted + state plumbing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
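That SSE comments are invisible to clients follows directly from the event-stream format: any line starting with `:` is a comment and must be ignored. A minimal parser sketch (not the Vercel AI SDK's actual implementation) illustrates why the `: keepalive` bytes are safe to inject:

```python
def parse_sse_events(raw: bytes) -> list:
    """Minimal SSE parse sketch: ':'-prefixed lines are comments, ignored per spec."""
    events = []
    for block in raw.decode().split("\n\n"):
        data_lines = [
            line[len("data:"):].lstrip(" ")
            for line in block.split("\n")
            if line.startswith("data:")
        ]
        if data_lines:
            events.append("\n".join(data_lines))
    return events
```

A stream interleaved with keepalives yields exactly the same events as one without them, while the comment bytes keep the TCP connection warm.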
The two ``logger.error`` sites in ``_handle_streaming_response`` include the raw exception message from aiohttp's transport / the upstream model call. Those messages can contain URLs with tokens, Authorization headers in tracebacks, or other internals. At ERROR severity they propagate to centralized log aggregation where anyone with log access sees them verbatim. Drop both to ``debug`` — local triage still has access, but nothing sensitive ships to aggregators. Matches the treatment of the ``response_future`` error logs in #1196. The rollout itself still halts visibly: ``_set_rollout_error`` sets ``state["error"]`` and ``MultiTurnEnv.has_error`` fires at the next ``is_completed()`` check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f809d35.
``create_task`` is the modern idiom for scheduling a coroutine — explicit about intent. ``ensure_future`` is the older adapter form; it accepts Tasks/Futures/awaitables in addition to coroutines and is still useful in library code that handles arbitrary inputs, but here we know we have a coroutine from ``chunk_queue.get()``. Same Task, same behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
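The distinction is observable directly; both calls produce a `Task`, and `ensure_future` additionally passes an existing Task through unchanged:

```python
import asyncio


async def demo():
    q = asyncio.Queue()
    t1 = asyncio.create_task(q.get())    # modern idiom: coroutine -> Task
    t2 = asyncio.ensure_future(q.get())  # older adapter: Tasks/Futures/awaitables too
    passthrough = asyncio.ensure_future(t1) is t1  # a Task comes back unchanged
    kinds = (isinstance(t1, asyncio.Task), isinstance(t2, asyncio.Task))
    t1.cancel()
    t2.cancel()
    return passthrough, kinds


passthrough, kinds = asyncio.run(demo())
```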
…upted Extend the elapsed-time instrumentation with per-socket diagnostics: - peer=<ip>:<port> — which hop reset the connection (tunnel/LB/pod) - tcp_state=<name> — CLOSE_WAIT means peer sent FIN first; FIN_WAIT1/2 means we did; ESTABLISHED at error time points at an abort path - retx=<n> — nonzero retransmits at reset suggest packet loss / NIC issues rather than clean idle-timeout Pulls TCP_INFO state + retransmits from the underlying socket via SOL_TCP getsockopt. All lookups are best-effort — if socket is already closed or TCP_INFO unavailable, the diag string is simply empty. Peer address is cached at stream start because get_extra_info may return None once the socket closes. The 10s keepalive landed in #1194 prevents the long-wait idle-timeout class, but 5612 logs showed many cuts at 1-9s (before the first keepalive fires). TCP diag tells us whether those are FIN-from-upstream (CLOSE_WAIT) vs FIN-from-client (opencode agent closing its side) vs abortive RSTs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
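The best-effort TCP_INFO lookup can be sketched as below. This is Linux-specific: the option number and the leading byte layout of `struct tcp_info` (`u8 tcpi_state`, `u8 tcpi_ca_state`, `u8 tcpi_retransmits`, ...) come from the kernel headers, and `tcp_diag` is a hypothetical name, not the PR's exact `_tcp_diag` implementation.

```python
import socket

# Linux: TCP_INFO socket option; fall back to the numeric value if the
# socket module on this build doesn't export the constant.
TCP_INFO = getattr(socket, "TCP_INFO", 11)
SOL_TCP = getattr(socket, "SOL_TCP", socket.IPPROTO_TCP)

TCP_STATE_NAMES = {
    1: "ESTABLISHED", 2: "SYN_SENT", 3: "SYN_RECV", 4: "FIN_WAIT1",
    5: "FIN_WAIT2", 6: "TIME_WAIT", 7: "CLOSE", 8: "CLOSE_WAIT",
    9: "LAST_ACK", 10: "LISTEN", 11: "CLOSING",
}


def tcp_diag(sock: socket.socket) -> str:
    """Best-effort 'tcp_state=<name> retx=<n>' string; empty if unavailable."""
    try:
        info = sock.getsockopt(SOL_TCP, TCP_INFO, 104)
        # Leading bytes of struct tcp_info: state, ca_state, retransmits.
        state, retx = info[0], info[2]
        return f"tcp_state={TCP_STATE_NAMES.get(state, state)} retx={retx}"
    except OSError:
        # Socket already closed or TCP_INFO unsupported on this platform.
        return ""
```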
Adds a side-task that every 2s reads TCP_INFO.state on the aiohttp stream socket and logs state transitions. Lets us see the socket leave ESTABLISHED (e.g. CLOSE_WAIT = peer sent FIN first) BEFORE the write that RSTs, which our post-mortem _tcp_diag couldn't capture because the socket is already gone at error time. Sample interval tunable via VF_TCP_SAMPLE_INTERVAL_SECONDS env var; set to 0 to disable. Sampler is started alongside the write loop and cancelled in the same `finally` as the chunk-queue get task. Diagnostic for the residual loopback-side cuts (peer=127.0.0.1) that remain after the frpc pooled-work-conn keepalive fix. Expected output on a cut: [rollout_X] tcp_state ESTABLISHED -> CLOSE_WAIT at 62.3s (retx=0) which immediately distinguishes "opencode cancelled" (CLOSE_WAIT arrives from us) from "frpc crashed" (socket vanishes without state transition). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prior revision routed state transitions and Streaming-error logs through logger.debug(), but env-worker subprocesses don't always have verifiers.utils.interception_utils at DEBUG (the host config's [log] level=debug applies to orchestrator/trainer, not every submodule). Result: sampler ran but produced no visible output on the actual cut we saw. - Sampler now writes each observed state/retx/elapsed into a shared dict (state_ref) AND logs transitions at WARNING, so the signal appears regardless of per-module log level. - _tcp_diag reads the cached state as a fallback when the live getsockopt(TCP_INFO) fails at error time (socket already closed, which was exactly what happened on the 601s cut). - Stream-error log itself bumped DEBUG -> WARNING for visibility. Net effect: every cut now has tcp_state=<STATE> in its StreamInterrupted message, even when the socket is gone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes two silent-exit paths that the #1194 keepalive/StreamInterrupted instrumentation didn't cover: 1. response.prepare(): sends response headers. If the transport is already half-open when we enter the handler (peer closed between accept and our first write), prepare raises and the exception used to escape to aiohttp's middleware. Rollout ended with no state error set, 0 turns, and stop=agent_completed — the original silent empty-trajectory class. 2. response.write_eof(): flushes the transport after the main stream loop. Old path only caught ConnectionResetError and logged at DEBUG. Any other exception here (or a reset happening between last chunk write and write_eof) had the same silent-exit outcome. Both paths now set state["error"] = StreamInterrupted with TCP diag info, so the scheduler reschedules the rollout instead of accepting it as a clean zero-turn completion. On a recent run we saw ~59 "silent" empty-trajectory rollouts (stop=agent_completed, exit_code=0, turns=0, no error) despite the keepalive + TCP_INFO instrumentation. Those should convert to visible StreamInterrupted with this patch, letting us see whether they are infra cuts (would surface here) or genuine model empty-response cases (remain silent). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the sandbox-collected opencode.txt bytes were only surfaced on
collection failure (as a WARNING). For the silent-empty-trajectory case
(turns=0, stop=agent_completed, exit_code=0) the successfully collected
log was attached to state but never written anywhere a human could grep.
Add a best-effort per-rollout file write under VF_AGENT_LOG_DIR when the
env var is set. File path: ${VF_AGENT_LOG_DIR}/${rollout_id}.log. Writes
happen in ComposableEnv.post_rollout right before the "Finished
rollout_id=..." line so the log is on disk for every completed rollout.
Disk-full / permission errors never propagate — instrumentation must not
break rollouts. Old files are pruned once the directory exceeds 20k
entries.
Every chunk emitted to the agent via _handle_streaming_response now
flows through a best-effort NDJSON writer when VF_CHUNK_TRACE_DIR is
set. File path: ${VF_CHUNK_TRACE_DIR}/${rollout_id}.ndjson.
Each line is a JSON object with event, ts, and (for chunks)
delta_content, delta_tool_calls, delta_reasoning_content,
finish_reason, id. Sentinel events mark how the stream ended:
open / chunk / done / prepare_failed / keepalive_failed /
stream_interrupted / write_eof_failed / cancelled
Writes are synchronous with line buffering so the last line always
reflects the final observed state, even on crash. Disabled by default
to avoid paying the I/O cost when it isn't needed. Old files are
pruned once the directory exceeds 20k entries.
Adds two tests: one confirms the NDJSON file is written when the env
var is set, one confirms nothing is written when it isn't.
…TRACE_DIR" This reverts commit 7befdb5.
The NDJSON chunk-trace workaround (reverted in 4f900a2) suppressed silent `turns=0 agent_completed` rollouts by adding ~10-100us of synchronous work per SSE chunk. Empirically this rebalanced the event loop enough to avoid a truncation-as-clean-close race under high connection-storm contention during warmup. Replacing the filesystem write with asyncio.sleep(0) gets the event-loop yield alone, without the I/O cost or the 20k-file chunk directory. Goal: keep the suppression, drop everything else. Expected to work if the dominant mechanism is loop fairness or cadence smoothing. If silents return, the I/O itself was load-bearing and we'll need a different cheap perturbation (e.g., a time.perf_counter() call, or a small asyncio.sleep(0.0001)). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
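The event-loop-yield mechanism is easy to demonstrate in isolation: `asyncio.sleep(0)` suspends the current task for exactly one loop step, letting other ready tasks (in the proxy's case, transport flush callbacks and peer handlers) run between chunks. A minimal illustration, unrelated to the proxy code itself:

```python
import asyncio


async def worker(name, log, steps=3):
    for i in range(steps):
        log.append(f"{name}{i}")
        # Without this yield, the loop body would run to completion before
        # any other task gets scheduled; with it, tasks interleave per step.
        await asyncio.sleep(0)


async def main():
    log = []
    await asyncio.gather(worker("a", log), worker("b", log))
    return log


log = asyncio.run(main())
```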
…ep(0)) Undoes the behavior from commit 890a693 (SSE keepalive + elapsed-time instrumentation). Auto git-revert conflicted with later commits that layered TCP_INFO diagnostics and the sleep(0) yield on top of the keepalive loop; this is a manual equivalent that removes only the keepalive branch. Dropped: - KEEPALIVE_INTERVAL_SECONDS constant - get_task caching + asyncio.wait timeout pattern in streaming loop - ": keepalive\n\n" heartbeat write + its StreamInterrupted error path - Two keepalive-specific unit tests Kept: - TCP_INFO sampler + _tcp_diag (later commits) - asyncio.sleep(0) per-chunk yield (e2fc5d8) - StreamInterrupted plumbing, elapsed-time in error messages - prepare()/write_eof() defensive patches - agent log persistence Rationale: 5718 (with asyncio.sleep(0) + pristine frpc/prime-tunnel) showed 0 silent rollouts and only 2 SI cuts in 44 min across 1734 rollouts. Testing whether the SSE keepalive is still carrying any weight or has become redundant once the warmup-race is mitigated by sleep(0) at the proxy layer and the internet-facing leg is healthy on pristine frp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er + sleep(0))" This reverts commit 671254d.
…DIR" This reverts commit bab0a82.
This reverts commit 8fc08ac.
Surgical removal equivalent to reverting 5333c38 (periodic TCP_INFO sampler) and a73b4d2 (TCP diagnostics on StreamInterrupted error messages). Auto-revert conflicted against the asyncio.sleep(0) and keepalive restructuring; this is a hand-done equivalent. Dropped: - import socket, import struct - TCP_SAMPLE_INTERVAL_SECONDS env-gated constant - _TCP_INFO / _TCP_STATE_NAMES module constants - _read_tcp_state, _sample_tcp_state, _tcp_diag functions - tcp_sampler task creation + finally cleanup - tcp_state_ref dict - peer/state/retx substrings in StreamInterrupted messages - cached tcp state fallback in _tcp_diag (8fc08ac-derived) Kept runtime behavior: - SSE keepalive comment loop (890a693) - asyncio.sleep(0) per chunk (e2fc5d8) - prepare/write_eof defensive surface (384d044) - StreamInterrupted plumbing + elapsed-time in messages (9569b53, 890a693) Instrumentation was useful during the hunt. Now that asyncio.sleep(0) + SSE keepalive carry the fix, the observability is dead weight — removing to keep the intercept handler small. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses @mikasenghaas's review comment on #1194: replace hand-rolled `{waited_s:.1f}s` with `print_time(waited_s)` from verifiers.utils.logging_utils across all three error paths (keepalive write, stream write, write_eof). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes narrative/history references and WHAT-the-code-does explanations; keeps only the WHY for each non-obvious block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikasenghaas approved these changes on Apr 22, 2026.

Makes the interception proxy's streaming response resilient to upstream cuts: 10s SSE keepalive comments keep idle streams warm, a per-chunk asyncio.sleep(0) forces an event-loop yield so content+close can't race the transport flush under warmup-burst contention, and transport exceptions at prepare/write/write_eof are surfaced as StreamInterrupted into state["error"] (builds on #1191) so rollouts reschedule instead of looking like clean zero-turn completions.
Validated on job 5814 (opencode-swe, tasks_per_minute=256, pristine frpc+prime-tunnel): 0 silent turns=0-agent_completed and 0 StreamInterrupted in 19 min / 771 rollouts; ablations confirm both mechanisms are load-bearing.