fix(warm-pool): warm-spawn deadline must outlast pod-ready (exclusive pool refill after restart)#671

Merged: fuziontech merged 1 commit into main from fix/warm-spawn-deadline-exclusive on Jun 4, 2026

@fuziontech
Member

Problem (found on mw-prod-us)

After the CP was promoted to #670, the default/exclusive shared-warm pool sat stuck at ~1 of 6 workers, churning on "context deadline exceeded" errors, while the colocated 8/48 + 4/16 pools refilled fine. Customer query connections use the default profile, and profile matching is strict (a default request can't borrow a colocated worker), so they had no warm worker and would cold-start or stall.

Root cause

Every warm-pool replenish/reconcile spawn ran with a 30s context deadline, but a spawn's waitForPodReady is 5 min — sized for Karpenter to provision a fresh node. An exclusive 46/360 worker needs its own large instance (r6gd.16xlarge, ~2-3 min). So the 30s ctx cancelled the spawn every time before the node could come up. The periodic janitor reconcile (reconcileWarmCapacityImageTargets) is what maintains the default pool at SHARED_WARM_TARGET, and it hit this 30s deadline → the default pool could never refill after a restart. Small, fast, bin-packed colocated shapes fit inside 30s, so they recovered — masking the bug.

Four sites all used 30 * time.Second:

  • k8s_pool.go triggerPerImageReplenish + triggerColocatedWarmReplenish (fire-and-forget)
  • multitenant.go janitor reconcileColocatedWarm + reconcileWarmCapacityImageTargets (periodic)
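
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the code in k8s_pool.go or multitenant.go: `waitForReady` and `spawnWithDeadline` are hypothetical stand-ins, and the durations are scaled down from seconds to milliseconds (30s becomes 30ms, a ~2.5 min node bring-up becomes 150ms, 6m becomes 360ms) so it runs quickly.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// Scaled-down durations (1s in production is 1ms here). These are
// illustrative values, not constants from the real code base.
const (
	scaledOldDeadline = 30 * time.Millisecond  // old 30s spawn context
	scaledNodeBringUp = 150 * time.Millisecond // ~2.5 min for a fresh node
	scaledNewDeadline = 360 * time.Millisecond // new 6m spawn context
)

// waitForReady is a hypothetical stand-in for a pod-ready wait: it returns
// nil once readyAfter has elapsed, or the context error if ctx ends first.
func waitForReady(ctx context.Context, readyAfter time.Duration) error {
	timer := time.NewTimer(readyAfter)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// spawnWithDeadline runs the wait under an outer context deadline, the way
// a warm-spawn call site wraps its spawn.
func spawnWithDeadline(outer, readyAfter time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), outer)
	defer cancel()
	return waitForReady(ctx, readyAfter)
}

// isDeadlineExceeded reports whether err is "context deadline exceeded".
func isDeadlineExceeded(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}
```

Calling `spawnWithDeadline(scaledOldDeadline, scaledNodeBringUp)` returns `context.DeadlineExceeded` on every attempt, because the outer context always expires before the node is up; with `scaledNewDeadline` the same wait completes. The inner wait's own 5 min budget never matters while the outer deadline is shorter.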

Fix

Two named constants:

  • workerPodReadyTimeout = 5m (the existing pod-ready wait, now named + used at the spawn site)
  • warmSpawnReconcileTimeout = 6m (must exceed the pod-ready wait; used at all four warm-spawn sites)

TestWarmSpawnTimeoutExceedsPodReady guards the invariant.
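
For illustration, the two constants and the invariant they must satisfy might look roughly like this. The constant names and values come from this PR; the exact declarations, the helper `warmSpawnOutlastsPodReady`, and `oldWarmSpawnTimeout` are assumptions made for the sketch, and the body of TestWarmSpawnTimeoutExceedsPodReady is not shown here.

```go
package main

import "time"

const (
	// workerPodReadyTimeout bounds how long a spawn waits for its pod to
	// become ready (sized for Karpenter to provision a fresh node).
	workerPodReadyTimeout = 5 * time.Minute

	// warmSpawnReconcileTimeout is the context deadline for warm-pool
	// spawns. It must exceed workerPodReadyTimeout.
	warmSpawnReconcileTimeout = 6 * time.Minute

	// oldWarmSpawnTimeout is the previous 30s value, kept here only to
	// show that it violated the invariant (hypothetical name).
	oldWarmSpawnTimeout = 30 * time.Second
)

// warmSpawnOutlastsPodReady is a hypothetical helper expressing the
// invariant: the spawn deadline must be strictly longer than the
// pod-ready wait it wraps.
func warmSpawnOutlastsPodReady(spawn, podReady time.Duration) bool {
	return spawn > podReady
}
```

A guard test in this spirit would fail the build if someone later lowered the spawn deadline to or below the pod-ready wait, for example by calling `warmSpawnOutlastsPodReady(warmSpawnReconcileTimeout, workerPodReadyTimeout)` and reporting an error when it returns false.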

Note on janitor blocking

The janitor reconcile is synchronous (wg.Wait). This change does not make blocking worse — it makes it better: the old 30s blocked the janitor every tick forever (perpetual deficit + churn, pool never filled); the new 6m blocks once for the duration of a cold fill, after which the pool is at target and the reconcile returns instantly. In-flight spawns are counted at the DB layer (CreateNeutralWarmWorkerSlotForImage), so no double-spawn across ticks.
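
The "no double-spawn across ticks" argument can be sketched with a small in-memory model. This is an assumption-laden illustration: the `pool` type and its methods are hypothetical, and the real accounting happens at the DB layer (CreateNeutralWarmWorkerSlotForImage), whose schema and logic are not shown in this PR.

```go
package main

// pool is a hypothetical in-memory stand-in for the DB-backed warm-slot
// accounting. inFlight counts spawns that have reserved a slot but whose
// pods are not ready yet.
type pool struct {
	target   int
	ready    int
	inFlight int
}

// deficit is how many more spawns are needed once in-flight spawns are
// counted toward the target.
func (p *pool) deficit() int {
	d := p.target - p.ready - p.inFlight
	if d < 0 {
		return 0
	}
	return d
}

// tick models one reconcile pass: it reserves a slot for every missing
// worker and returns how many spawns it started.
func (p *pool) tick() int {
	n := p.deficit()
	p.inFlight += n
	return n
}

// markReady moves up to n in-flight spawns to ready.
func (p *pool) markReady(n int) {
	if n > p.inFlight {
		n = p.inFlight
	}
	p.inFlight -= n
	p.ready += n
}
```

Starting from the state seen in production (target 6, 1 ready), the first tick starts 5 spawns. A second tick that runs while those spawns are still waiting on nodes starts none, and once they become ready the pool sits at target, so later ticks have nothing to do.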

Test

Full controlplane suite green under -tags kubernetes; new invariant test passes.

Rollout

New CP image → prod promote → CP restart. With this fix, that restart's warm pools (incl. the exclusive default pool) actually refill to target instead of stalling.

🤖 Generated with Claude Code

… refill)

Warm-pool replenish/reconcile spawns ran with a 30s context deadline, but a
spawn's waitForPodReady is 5min — long enough for Karpenter to provision a
fresh node. A cold spawn that needs a new node (notably a single-tenant
exclusive 46/360 worker, which gets its own large instance) was therefore
cancelled at 30s, every time, with "context deadline exceeded".

Consequence: after a control-plane restart (i.e. every deploy/promote), the
default/exclusive shared-warm pool could not refill — it sat near-empty while
the janitor churned failed spawns every tick, and only the small, fast,
bin-packed colocated shapes recovered. Default-profile connections (customer
query traffic, which can't use colocated workers — profiles match strictly)
then had no warm worker and paid a cold spawn or stalled.

Fix: introduce workerPodReadyTimeout (5m) and warmSpawnReconcileTimeout (6m,
must exceed it) and use the latter at all four warm-spawn sites — the two
fire-and-forget replenishers (per-image, colocated) and the two janitor
reconciles (per-image, colocated). The janitor now blocks once for the
duration of a cold fill instead of failing at 30s every tick forever, so the
pool actually reaches target and then returns instantly.

TestWarmSpawnTimeoutExceedsPodReady guards the invariant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@fuziontech fuziontech requested a review from a team June 4, 2026 00:32
@fuziontech fuziontech merged commit 2ec6b2d into main Jun 4, 2026
24 checks passed
@fuziontech fuziontech deleted the fix/warm-spawn-deadline-exclusive branch June 4, 2026 00:40

3 participants