Skip to content

fix(session): surface retry exhaustion as non-zero exit with stderr dump#3

Merged
rasdani merged 2 commits into
rl-forkfrom
daniel/rl-retry-exhaust-exits-nonzero
Apr 19, 2026
Merged

fix(session): surface retry exhaustion as non-zero exit with stderr dump#3
rasdani merged 2 commits into
rl-forkfrom
daniel/rl-retry-exhaust-exits-nonzero

Conversation

@rasdani
Copy link
Copy Markdown
Collaborator

@rasdani rasdani commented Apr 19, 2026

What does this PR do?

Problem. Commit 7ddc22c4a (limit retrying) caps session-level LLM retries in packages/opencode/src/session/processor.ts. When attempt >= retryLimit, the retry is skipped and the error bubbles up through the session handler, but opencode exits cleanly (exit code 0) with no evidence the agent gave up. Every LLM call failed, every retry failed, and the process treats that as a successful completion.

Impact. In our hosted RL training on this fork, this produces rollouts with trajectory=[], exit_code=0, turns=0, duration≈26s (two retries of ~10s request + 2s+4s backoff). Downstream verifiers treat these as "empty trajectory" because no error state is set; on certain problems this silently loses ~60% of rollouts. Completely invisible in orchestrator logs.

Base branch. daniel/rl does not exist on prime (verified with git ls-remote), so this PR targets rl-fork per the fallback rule.

The fix

Two tightly-coupled changes inside packages/opencode/src/session/:

  1. Structured stderr dump at retry exhaustion. New SessionRetry.dumpRetryExhaust writes a grep-friendly [retry-exhaust] {...} JSON line directly to process.stderr (not log.error, which routes to a file unless --print-logs). Includes HTTP status code, URL, response-body snippet (≤500 chars), attempt/retryLimit, underlying error name, and message. Secrets (api_key, authorization, bearer, x-api-key, token, access_token, URL userinfo) are scrubbed from both the URL and the body snippet before emission.

  2. Non-zero exit via tagged error. New MessageV2.TerminalRetryExhaustedError is added to the Assistant.error discriminated union. When the retry budget is exhausted, session/processor.ts wraps the underlying error into this type, publishes it via the existing session.error event, and cli/cmd/run.ts detects the tag and sets process.exitCode = 1. The existing top-level process.exit() in index.ts then honors that. This is the low-risk approach: no uncaught exceptions, no changes to the session-ended flow.

The existing SessionRetry.isTerminalRolloutConflict short-circuit (409 on /v1/rollouts/*) is left alone per the review guidance. Provider timeouts and AbortSignal wiring are untouched.

Downstream benefit

No verifiers-side change is needed for this PR to produce visible failures end-to-end.

How did you verify your code works?

  • Unit tests (13 new, all passing) in packages/opencode/test/session/retry.test.ts:

    • redactUrl: scrubs api_key, authorization, token, x-api-key, access_token, bearer query params, and URL userinfo. Specifically covers the required ?api_key=... case.
    • redactBody: scrubs JSON-ish "api_key": "...", Bearer <token>, key=value headers; caps snippet at 500 chars.
    • dumpRetryExhaust: writes grep-friendly [retry-exhaust] {...} line to process.stderr with correct metadata; redacted secrets never appear in the output.
    • TerminalRetryExhaustedError: discriminator name is stable and isInstance round-trips through toObject().
  • bun test test/session/retry.test.ts → 30 pass, 0 fail.

  • bun test test/session/ (full session suite) → 124 pass, 4 skip, 0 fail.

  • bun run typecheck (via the pre-push hook workspace typecheck) → 12/12 successful across the monorepo.

Don'ts observed

  • No force-push; no hooks skipped (pre-push typecheck ran clean).
  • Retry delays and RETRY_MAX_DELAY constants unchanged.
  • provider.ts timeout / BunFetchRequestInit wiring unchanged.
  • No secrets leaked in the stderr dump (explicit test coverage for the ?api_key=... path).
  • Not merged; left for review.

rasdani and others added 2 commits April 19, 2026 00:51
When the session-level retry budget is exhausted (commit 7ddc22c
"limit retrying"), opencode previously exited 0 with no evidence the
agent gave up. Every LLM call failed, every retry failed, and the
process treated that as a successful completion. In hosted RL training,
this silently loses ~60% of rollouts on certain problems.

Two tightly-coupled changes:

1. Structured stderr dump in session/retry.ts - SessionRetry.dumpRetryExhaust
   writes a grep-friendly `[retry-exhaust]` JSON line to process.stderr
   with the HTTP status, URL, body snippet (capped at 500 chars),
   attempt/retryLimit, and underlying error name. Secrets (api_key,
   authorization, bearer, x-api-key, token, access_token, userinfo) are
   scrubbed from both the URL and the body snippet before emission.

2. Tagged-error propagation via a new MessageV2.TerminalRetryExhaustedError
   added to the Assistant.error discriminated union. The session
   processor wraps the final underlying error into this type when
   retries exhaust; cli/cmd/run.ts detects the tag on the session.error
   event and sets process.exitCode = 1, which the existing top-level
   process.exit() in index.ts then honors.

The existing isTerminalRolloutConflict (409 on /v1/rollouts/) path is
left alone per review guidance.

Tests: redactUrl/redactBody secret scrubbing (including the ?api_key=
query-string case), dumpRetryExhaust stderr format + redaction, and
TerminalRetryExhaustedError discriminator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under tsgo, the assignment of process.stderr.write with a plain arrow
function narrows cleanly via the `as typeof process.stderr.write`
cast, so the preceding @ts-expect-error directive is flagged as unused.
Replace with the cast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant