fix(warm-pool): warm-spawn deadline must outlast pod-ready (exclusive pool refill after restart)#671
Merged
… refill)

Warm-pool replenish/reconcile spawns ran with a 30s context deadline, but a spawn's waitForPodReady is 5min, long enough for Karpenter to provision a fresh node. A cold spawn that needs a new node (notably a single-tenant exclusive 46/360 worker, which gets its own large instance) was therefore cancelled at 30s, every time, with "context deadline exceeded".

Consequence: after a control-plane restart (i.e. every deploy/promote), the default/exclusive shared-warm pool could not refill. It sat near-empty while the janitor churned failed spawns every tick, and only the small, fast, bin-packed colocated shapes recovered. Default-profile connections (customer query traffic, which can't use colocated workers because profiles match strictly) then had no warm worker and paid a cold spawn or stalled.

Fix: introduce workerPodReadyTimeout (5m) and warmSpawnReconcileTimeout (6m, must exceed it) and use the latter at all four warm-spawn sites: the two fire-and-forget replenishers (per-image, colocated) and the two janitor reconciles (per-image, colocated). The janitor now blocks once for the duration of a cold fill instead of failing at 30s every tick forever, so the pool actually reaches target and then returns instantly. TestWarmSpawnTimeoutExceedsPodReady guards the invariant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jghoman
approved these changes
Jun 4, 2026
EDsCODE
approved these changes
Jun 4, 2026
Problem (found on mw-prod-us)
After the CP was promoted to #670, the default/exclusive shared-warm pool sat stuck at ~1 of 6 workers, churning context deadline exceeded errors, while the colocated 8/48 and 4/16 pools refilled fine. Customer query connections use the default profile, and profile matching is strict (a default request can't borrow a colocated worker), so they had no warm worker and would cold-start or stall.
Root cause
Every warm-pool replenish/reconcile spawn ran with a 30s context deadline, but a spawn's waitForPodReady is 5 min, sized for Karpenter to provision a fresh node. An exclusive 46/360 worker needs its own large instance (r6gd.16xlarge, ~2-3 min), so the 30s ctx cancelled the spawn every time before the node could come up. The periodic janitor reconcile (reconcileWarmCapacityImageTargets) is what maintains the default pool at SHARED_WARM_TARGET, and it hit this 30s deadline, so the default pool could never refill after a restart. Small, fast, bin-packed colocated shapes fit inside 30s, so they recovered, masking the bug.
Four sites all used 30 * time.Second:
- k8s_pool.go: triggerPerImageReplenish + triggerColocatedWarmReplenish (fire-and-forget)
- multitenant.go janitor: reconcileColocatedWarm + reconcileWarmCapacityImageTargets (periodic)
Fix
Two named constants:
- workerPodReadyTimeout = 5m (the existing pod-ready wait, now named and used at the spawn site)
- warmSpawnReconcileTimeout = 6m: must exceed the pod-ready wait; used at all four warm-spawn sites.
TestWarmSpawnTimeoutExceedsPodReady guards the invariant.
Note on janitor blocking
The janitor reconcile is synchronous (wg.Wait). This change does not make blocking worse; it makes it better. The old 30s deadline blocked the janitor every tick forever (perpetual deficit + churn, pool never filled); the new 6m deadline blocks once for the duration of a cold fill, after which the pool is at target and the reconcile returns instantly. In-flight spawns are counted at the DB layer (CreateNeutralWarmWorkerSlotForImage), so there is no double-spawn across ticks.
Test
Full controlplane suite green under -tags kubernetes; new invariant test passes.
Rollout
New CP image → prod promote → CP restart. With this fix, that restart's warm pools (incl. the exclusive default pool) actually refill to target instead of stalling.
🤖 Generated with Claude Code