Skip to content

fix: retry sandbox capacity 429 before surfacing to caller#1431

Merged
joelorzet merged 4 commits into
stagingfrom
fix/sandbox-429-retry
Jun 1, 2026
Merged

fix: retry sandbox capacity 429 before surfacing to caller#1431
joelorzet merged 4 commits into
stagingfrom
fix/sandbox-429-retry

Conversation

@joelorzet
Copy link
Copy Markdown

Problem

The code sandbox sheds load with 429 sandbox at capacity plus Retry-After: 1 once a pod hits its per-pod concurrency cap. The main-app client treated any non-200 as terminal, so a momentary overshoot above the cluster ceiling surfaced directly to the customer as a failed Code node, even though the server explicitly signals "back off and retry." Paid and enterprise workloads hit this on brief bursts.

Change

runRemote now honors the Retry-After the sandbox already sends:

  • Bounded retries (default 3, tunable via SANDBOX_MAX_RETRIES) with jitter so concurrent callers do not resynchronize into a thundering herd.
  • Abort-aware backoff: a client disconnect short-circuits the wait instead of pinning the workflow step.
  • Only a capacity 429 is retried, since the run never started and resending is safe. Every other failure (5xx, malformed, network) stays terminal and fast.

A SandboxHttpError carries the status and parsed Retry-After so the retry decision does not string-match the message.

This absorbs sub-second micro-bursts at zero infra cost. It does not add capacity; that is the separate pod-sizing work captured in specs/sandbox-scaling.md, included here for the infra discussion.

Tests

Added coverage in tests/unit/sandbox-client.test.ts: retry-then-succeed honoring Retry-After, give-up after the bounded retries on sustained 429, and no-retry on a non-429 error. Full unit suite green, type-check passes.

The sandbox sheds load with 429 + Retry-After when its per-pod concurrency
cap is reached, but the client treated any non-200 as terminal, so a
momentary overshoot surfaced as a hard error on the Code node.

Honor the Retry-After with bounded, jittered, abort-aware retries (default
3, via SANDBOX_MAX_RETRIES). Only capacity 429s are retried since the run
never started; all other failures stay terminal. Also adds a scaling notes
spec for the separate pod-sizing discussion.
Comment thread lib/sandbox/client.ts Fixed
Retry-After is read off an HTTP response, so feeding it straight into
setTimeout let a buggy or hostile value pin a workflow step on a long
timer (CodeQL: resource exhaustion). Clamp the backoff to a 5s ceiling at
both the parse site and the timer sink.
Comment thread lib/sandbox/client.ts Fixed
CodeQL flagged feeding the Retry-After header into setTimeout as a
resource-exhaustion sink, and a Math.min clamp did not break the taint.
Drop the header-derived duration entirely and use a fixed, capped,
exponential backoff schedule. The 429 is still the retry trigger; only the
wait magnitude changes (the sandbox always signals a 1s backoff anyway).
@joelorzet joelorzet requested review from a team, OleksandrUA, eskp and suisuss and removed request for a team June 1, 2026 17:21
@joelorzet joelorzet merged commit 254d57d into staging Jun 1, 2026
42 checks passed
@joelorzet joelorzet deleted the fix/sandbox-429-retry branch June 1, 2026 17:25
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1431
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

  • The PR never had the deploy-pr-environment label
  • The environment was already cleaned up
  • The deployment never completed successfully

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants