fix: retry sandbox capacity 429 before surfacing to caller by joelorzet · Pull Request #1431 · KeeperHub/keeperhub

joelorzet · 2026-06-01T16:10:07Z

Problem

The code sandbox sheds load with 429 sandbox at capacity plus Retry-After: 1 once a pod hits its per-pod concurrency cap. The main-app client treated any non-200 as terminal, so a momentary overshoot above the cluster ceiling surfaced directly to the customer as a failed Code node, even though the server explicitly signals "back off and retry." Paid and enterprise workloads hit this on brief bursts.

Change

runRemote now honors the Retry-After the sandbox already sends:

Bounded retries (default 3, tunable via SANDBOX_MAX_RETRIES) with jitter so concurrent callers do not resynchronize into a thundering herd.
Abort-aware backoff: a client disconnect short-circuits the wait instead of pinning the workflow step.
Only a capacity 429 is retried, since the run never started and resending is safe. Every other failure (5xx, malformed, network) stays terminal and fast.

A SandboxHttpError carries the status and parsed Retry-After so the retry decision does not string-match the message.

This absorbs sub-second micro-bursts at zero infra cost. It does not add capacity; that is the separate pod-sizing work captured in specs/sandbox-scaling.md, included here for the infra discussion.

Tests

Added coverage in tests/unit/sandbox-client.test.ts: retry-then-succeed honoring Retry-After, give-up after the bounded retries on sustained 429, and no-retry on a non-429 error. Full unit suite green, type-check passes.

The sandbox sheds load with 429 + Retry-After when its per-pod concurrency cap is reached, but the client treated any non-200 as terminal, so a momentary overshoot surfaced as a hard error on the Code node. Honor the Retry-After with bounded, jittered, abort-aware retries (default 3, via SANDBOX_MAX_RETRIES). Only capacity 429s are retried since the run never started; all other failures stay terminal. Also adds a scaling notes spec for the separate pod-sizing discussion.

Retry-After is read off an HTTP response, so feeding it straight into setTimeout let a buggy or hostile value pin a workflow step on a long timer (CodeQL: resource exhaustion). Clamp the backoff to a 5s ceiling at both the parse site and the timer sink.

CodeQL flagged feeding the Retry-After header into setTimeout as a resource-exhaustion sink, and a Math.min clamp did not break the taint. Drop the header-derived duration entirely and use a fixed, capped, exponential backoff schedule. The 429 is still the retry trigger; only the wait magnitude changes (the sandbox always signals a 1s backoff anyway).

github-actions · 2026-06-01T17:26:14Z

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

Namespace: pr-1431
All Helm releases (Keeperhub, Scheduler, Event services)
PostgreSQL Database (including data)
LocalStack, Redis
All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

github-actions · 2026-06-01T17:26:15Z

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

The PR never had the deploy-pr-environment label
The environment was already cleaned up
The deployment never completed successfully

joelorzet temporarily deployed to staging June 1, 2026 16:10 — with GitHub Actions Inactive

docs: simplify sandbox scaling notes for readability

b48ab0b

joelorzet temporarily deployed to staging June 1, 2026 16:11 — with GitHub Actions Inactive

github-advanced-security AI found potential problems Jun 1, 2026

View reviewed changes

Comment thread lib/sandbox/client.ts Fixed

joelorzet temporarily deployed to staging June 1, 2026 16:16 — with GitHub Actions Inactive

github-advanced-security AI found potential problems Jun 1, 2026

View reviewed changes

Comment thread lib/sandbox/client.ts Fixed

joelorzet temporarily deployed to staging June 1, 2026 16:32 — with GitHub Actions Inactive

joelorzet requested review from a team, OleksandrUA, eskp and suisuss and removed request for a team June 1, 2026 17:21

joelorzet merged commit 254d57d into staging Jun 1, 2026
42 checks passed

joelorzet deleted the fix/sandbox-429-retry branch June 1, 2026 17:25

joelorzet temporarily deployed to staging June 1, 2026 17:25 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: retry sandbox capacity 429 before surfacing to caller#1431

fix: retry sandbox capacity 429 before surfacing to caller#1431
joelorzet merged 4 commits into
stagingfrom
fix/sandbox-429-retry

joelorzet commented Jun 1, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joelorzet commented Jun 1, 2026

Problem

Change

Tests

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 1, 2026

🧹 PR Environment Cleaned Up

Uh oh!

github-actions Bot commented Jun 1, 2026

ℹ️ No PR Environment to Clean Up

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants