fix(session): surface retry exhaustion as non-zero exit with stderr dump #3
Merged
When the session-level retry budget is exhausted (commit 7ddc22c "limit retrying"), opencode previously exited 0 with no evidence the agent gave up. Every LLM call failed, every retry failed, and the process treated that as a successful completion. In hosted RL training, this silently loses ~60% of rollouts on certain problems.

Two tightly-coupled changes:

1. Structured stderr dump in session/retry.ts - SessionRetry.dumpRetryExhaust writes a grep-friendly `[retry-exhaust]` JSON line to process.stderr with the HTTP status, URL, body snippet (capped at 500 chars), attempt/retryLimit, and underlying error name. Secrets (api_key, authorization, bearer, x-api-key, token, access_token, userinfo) are scrubbed from both the URL and the body snippet before emission.
2. Tagged-error propagation via a new MessageV2.TerminalRetryExhaustedError added to the Assistant.error discriminated union. The session processor wraps the final underlying error into this type when retries exhaust; cli/cmd/run.ts detects the tag on the session.error event and sets process.exitCode = 1, which the existing top-level process.exit() in index.ts then honors.

The existing isTerminalRolloutConflict (409 on /v1/rollouts/) path is left alone per review guidance.

Tests: redactUrl/redactBody secret scrubbing (including the ?api_key= query-string case), dumpRetryExhaust stderr format + redaction, and TerminalRetryExhaustedError discriminator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under tsgo, the assignment of process.stderr.write with a plain arrow function narrows cleanly via the `as typeof process.stderr.write` cast, so the preceding @ts-expect-error directive is flagged as unused. Replace with the cast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
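For illustration, the cast described in this commit could be used in a test helper along the following lines. This is a minimal sketch, not the code from the PR: `captureStderr` is a hypothetical name, and it assumes a Node- or Bun-style `process` global.

```typescript
// Hypothetical test helper (not from the PR): temporarily replaces
// process.stderr.write so a test can inspect what was written.
function captureStderr(fn: () => void): string {
  const original = process.stderr.write
  let output = ""
  // A plain arrow function does not match every overload of write(),
  // so it is cast to the method's type instead of using @ts-expect-error.
  process.stderr.write = ((chunk: string | Uint8Array): boolean => {
    output += typeof chunk === "string" ? chunk : Buffer.from(chunk).toString("utf8")
    return true
  }) as typeof process.stderr.write
  try {
    fn()
  } finally {
    // Always restore the real writer, even if fn throws.
    process.stderr.write = original
  }
  return output
}
```

Restoring the original writer in `finally` keeps a failing test from swallowing stderr for the rest of the run.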
What does this PR do?
Problem. Commit `7ddc22c4a` (limit retrying) caps session-level LLM retries in `packages/opencode/src/session/processor.ts`. When `attempt >= retryLimit`, the retry is skipped and the error bubbles up through the session handler, but opencode exits cleanly (exit code 0) with no evidence the agent gave up. Every LLM call failed, every retry failed, and the process treats that as a successful completion.

Impact. In our hosted RL training on this fork, this produces rollouts with `trajectory=[], exit_code=0, turns=0, duration≈26s` (two retries of ~10s request + 2s+4s backoff). Downstream verifiers treat these as "empty trajectory" because no error state is set; on certain problems this silently loses ~60% of rollouts. Completely invisible in orchestrator logs.

Base branch. `daniel/rl` does not exist on `prime` (verified with `git ls-remote`), so this PR targets `rl-fork` per the fallback rule.

The fix
Two tightly-coupled changes inside `packages/opencode/src/session/`:

1. Structured stderr dump at retry exhaustion. New `SessionRetry.dumpRetryExhaust` writes a grep-friendly `[retry-exhaust] {...}` JSON line directly to `process.stderr` (not `log.error`, which routes to a file unless `--print-logs`). Includes HTTP status code, URL, response-body snippet (≤500 chars), attempt/retryLimit, underlying error name, and message. Secrets (`api_key`, `authorization`, `bearer`, `x-api-key`, `token`, `access_token`, URL userinfo) are scrubbed from both the URL and the body snippet before emission.
2. Non-zero exit via tagged error. New `MessageV2.TerminalRetryExhaustedError` is added to the `Assistant.error` discriminated union. When the retry budget is exhausted, `session/processor.ts` wraps the underlying error into this type, publishes it via the existing `session.error` event, and `cli/cmd/run.ts` detects the tag and sets `process.exitCode = 1`. The existing top-level `process.exit()` in `index.ts` then honors that. This is the low-risk approach: no uncaught exceptions, no changes to the session-ended flow.

The existing `SessionRetry.isTerminalRolloutConflict` short-circuit (409 on `/v1/rollouts/*`) is left alone per the review guidance. Provider timeouts and `AbortSignal` wiring are untouched.

Downstream benefit
`agent_stderr` and wraps non-zero exits as `AgentError(exit_code=1, stderr_snippet=<the dump>)`.

No verifiers-side change is needed for this PR to produce visible failures end-to-end.
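The tagged-error path described under "The fix" can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: the union is a plain stand-in for `Assistant.error`, and `wrapRetryExhausted`, `exitCodeFor`, `onSessionError`, and the `data` field names are hypothetical. Only the discriminator name `TerminalRetryExhaustedError` and the `process.exitCode = 1` behaviour come from the description above.

```typescript
// Simplified, hypothetical stand-in for the Assistant.error discriminated union.
type UnknownError = { name: "UnknownError"; data: { message: string } }
type TerminalRetryExhaustedError = {
  name: "TerminalRetryExhaustedError"
  data: { message: string; attempt: number; retryLimit: number }
}
type AssistantError = UnknownError | TerminalRetryExhaustedError

// Processor side (assumed shape): wrap the final underlying error once
// the retry budget is spent, i.e. when attempt >= retryLimit.
function wrapRetryExhausted(cause: Error, attempt: number, retryLimit: number): TerminalRetryExhaustedError {
  return {
    name: "TerminalRetryExhaustedError",
    data: { message: cause.message, attempt, retryLimit },
  }
}

// CLI side (assumed shape): the discriminator alone decides the exit code.
function exitCodeFor(error: AssistantError | undefined): number {
  return error !== undefined && error.name === "TerminalRetryExhaustedError" ? 1 : 0
}

// Sketch of a session.error handler: only the exit code is set here,
// leaving the actual exit to the existing top-level process.exit().
function onSessionError(error: AssistantError | undefined): void {
  if (exitCodeFor(error) !== 0) process.exitCode = 1
}
```

Setting `process.exitCode` rather than calling `process.exit()` in the handler matches the low-risk approach described above: the session-ended flow runs unchanged and only the final status differs.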
How did you verify your code works?
Unit tests (13 new, all passing) in `packages/opencode/test/session/retry.test.ts`:

- `redactUrl`: scrubs `api_key`, `authorization`, `token`, `x-api-key`, `access_token`, `bearer` query params, and URL userinfo. Specifically covers the required `?api_key=...` case.
- `redactBody`: scrubs JSON-ish `"api_key": "..."`, `Bearer <token>`, key=value headers; caps snippet at 500 chars.
- `dumpRetryExhaust`: writes grep-friendly `[retry-exhaust] {...}` line to `process.stderr` with correct metadata; redacted secrets never appear in the output.
- `TerminalRetryExhaustedError`: discriminator name is stable and `isInstance` round-trips through `toObject()`.

- `bun test test/session/retry.test.ts` → 30 pass, 0 fail.
- `bun test test/session/` (full session suite) → 124 pass, 4 skip, 0 fail.
- `bun run typecheck` (via the pre-push hook workspace typecheck) → 12/12 successful across the monorepo.

Don'ts observed

- `RETRY_MAX_DELAY` constants unchanged.
- `provider.ts` timeout / `BunFetchRequestInit` wiring unchanged.
- `?api_key=...` path).
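As a rough illustration of the redaction and dump behaviour the tests above exercise, the following sketch shows one way such helpers could look. The function names mirror the PR description, but the bodies, the regular expressions, the `RetryExhaustInfo` field names, and the `REDACTED` placeholder are assumptions and may differ from the actual `session/retry.ts`.

```typescript
// Hypothetical sketch; the real SessionRetry helpers may differ.
const SECRET_KEYS = ["api_key", "authorization", "bearer", "x-api-key", "token", "access_token"]
const REDACTED = "REDACTED"

function redactBody(body: string, limit = 500): string {
  const keys = SECRET_KEYS.join("|")
  return body
    // "Bearer <token>" in header-like text.
    .replace(/\b(bearer)\s+[A-Za-z0-9._~+\/=-]+/gi, `$1 ${REDACTED}`)
    // JSON-ish pairs such as "api_key": "value".
    .replace(new RegExp(`("(?:${keys})"\\s*:\\s*)"[^"]*"`, "gi"), `$1"${REDACTED}"`)
    // key=value or key: value pairs.
    .replace(new RegExp(`\\b(${keys})(\\s*[=:]\\s*)[^\\s&",;]+`, "gi"), `$1$2${REDACTED}`)
    // Redact first, then cap the snippet length.
    .slice(0, limit)
}

function redactUrl(raw: string): string {
  let url: URL
  try {
    url = new URL(raw)
  } catch {
    // Not a parseable URL: fall back to text scrubbing.
    return redactBody(raw)
  }
  // Replace URL userinfo (user:password@host).
  if (url.username || url.password) {
    url.username = REDACTED
    url.password = ""
  }
  // Collect names first so the params are not mutated while iterating.
  const names: string[] = []
  url.searchParams.forEach((_value, name) => names.push(name))
  for (const name of names) {
    if (SECRET_KEYS.includes(name.toLowerCase())) url.searchParams.set(name, REDACTED)
  }
  return url.toString()
}

interface RetryExhaustInfo {
  status?: number
  url?: string
  body?: string
  attempt: number
  retryLimit: number
  name: string
  message: string
}

// Writes one grep-friendly line straight to stderr, after scrubbing.
function dumpRetryExhaust(info: RetryExhaustInfo): void {
  const payload = {
    ...info,
    url: info.url === undefined ? undefined : redactUrl(info.url),
    body: info.body === undefined ? undefined : redactBody(info.body),
    message: redactBody(info.message),
  }
  process.stderr.write(`[retry-exhaust] ${JSON.stringify(payload)}\n`)
}
```

In this sketch the body is scrubbed before it is truncated, so a secret that straddles the 500-character boundary cannot survive as a partial value. Keeping the payload on a single line means `grep '\[retry-exhaust\]'` followed by a JSON parse is enough to recover it from captured stderr.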