fix(warm-pool): warm-spawn deadline must outlast pod-ready (exclusive pool refill after restart) by fuziontech · Pull Request #671 · PostHog/duckgres

fuziontech · 2026-06-04T00:31:26Z

Problem (found on mw-prod-us)

After the CP was promoted to #670, the default/exclusive shared-warm pool sat stuck at ~1 of 6 workers, churning on "context deadline exceeded" errors, while the colocated 8/48 and 4/16 pools refilled fine. Customer query connections use the default profile, and profile matching is strict (a default request can't borrow a colocated worker), so they had no warm worker and would cold-start or stall.

Root cause

Every warm-pool replenish/reconcile spawn ran with a 30s context deadline, but a spawn's waitForPodReady is 5 min, sized for Karpenter to provision a fresh node. An exclusive 46/360 worker needs its own large instance (r6gd.16xlarge, ~2-3 min to come up), so the 30s ctx cancelled the spawn every time before the node was ready. The periodic janitor reconcile (reconcileWarmCapacityImageTargets) is what maintains the default pool at SHARED_WARM_TARGET, and it hit this 30s deadline, so the default pool could never refill after a restart. Small, fast, bin-packed colocated shapes fit inside 30s, so they recovered, which masked the bug.

Four sites all used 30 * time.Second:

k8s_pool.go: triggerPerImageReplenish and triggerColocatedWarmReplenish (fire-and-forget)
multitenant.go: janitor reconcileColocatedWarm and reconcileWarmCapacityImageTargets (periodic)

Fix

Two named constants:

workerPodReadyTimeout = 5m: the existing pod-ready wait, now named and used at the spawn site.
warmSpawnReconcileTimeout = 6m: must exceed the pod-ready wait; used at all four warm-spawn sites.

TestWarmSpawnTimeoutExceedsPodReady guards the invariant.

Note on janitor blocking

The janitor reconcile is synchronous (wg.Wait). This change makes blocking better, not worse: the old 30s deadline blocked the janitor on every tick, forever (perpetual deficit and churn; the pool never filled), whereas the new 6m deadline blocks once for the duration of a cold fill, after which the pool is at target and the reconcile returns instantly. In-flight spawns are counted at the DB layer (CreateNeutralWarmWorkerSlotForImage), so there is no double-spawn across ticks.
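The blocking argument can be illustrated with a small model. This is a sketch under stated assumptions: the pool type, reserve, finish, and reconcile are hypothetical in-memory stand-ins for the DB-layer slot accounting and the janitor tick, not the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical in-memory stand-in for DB-layer slot accounting.
type pool struct {
	mu       sync.Mutex
	target   int // desired number of warm workers
	ready    int // workers that are up
	inFlight int // spawns started but not yet finished
}

// reserve claims the current deficit. In-flight spawns count against the
// target, so a second caller cannot spawn the same slots again.
func (p *pool) reserve() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	deficit := p.target - p.ready - p.inFlight
	if deficit < 0 {
		deficit = 0
	}
	p.inFlight += deficit
	return deficit
}

// finish releases one in-flight slot and records whether the spawn succeeded.
func (p *pool) finish(ok bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.inFlight--
	if ok {
		p.ready++
	}
}

// reconcile models one synchronous janitor tick: start a spawn per missing
// worker, then block on wg.Wait until all of them have finished.
func reconcile(p *pool, spawn func() bool) int {
	n := p.reserve()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.finish(spawn())
		}()
	}
	wg.Wait()
	return n
}

func main() {
	p := &pool{target: 6, ready: 1}
	timesOut := func() bool { return false } // old behaviour: cancelled at 30s
	succeeds := func() bool { return true }  // new behaviour: node comes up in time

	fmt.Println(reconcile(p, timesOut), p.ready) // deficit persists tick after tick
	fmt.Println(reconcile(p, timesOut), p.ready)
	fmt.Println(reconcile(p, succeeds), p.ready) // one cold fill reaches target
	fmt.Println(reconcile(p, succeeds), p.ready) // nothing left to spawn
}
```

In this model a tick whose spawns all fail leaves the deficit unchanged, so every later tick blocks again, while a tick that succeeds leaves nothing to spawn and later ticks return immediately.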

Test

Full controlplane suite green under -tags kubernetes; new invariant test passes.

Rollout

New CP image → prod promote → CP restart. With this fix, that restart's warm pools (incl. the exclusive default pool) actually refill to target instead of stalling.

🤖 Generated with Claude Code

… refill)

Warm-pool replenish/reconcile spawns ran with a 30s context deadline, but a spawn's waitForPodReady is 5min, long enough for Karpenter to provision a fresh node. A cold spawn that needs a new node (notably a single-tenant exclusive 46/360 worker, which gets its own large instance) was therefore cancelled at 30s, every time, with "context deadline exceeded".

Consequence: after a control-plane restart (i.e. every deploy/promote), the default/exclusive shared-warm pool could not refill. It sat near-empty while the janitor churned failed spawns every tick, and only the small, fast, bin-packed colocated shapes recovered. Default-profile connections (customer query traffic, which can't use colocated workers because profiles match strictly) then had no warm worker and paid a cold spawn or stalled.

Fix: introduce workerPodReadyTimeout (5m) and warmSpawnReconcileTimeout (6m, must exceed it) and use the latter at all four warm-spawn sites: the two fire-and-forget replenishers (per-image, colocated) and the two janitor reconciles (per-image, colocated). The janitor now blocks once for the duration of a cold fill instead of failing at 30s every tick forever, so the pool actually reaches target and then returns instantly. TestWarmSpawnTimeoutExceedsPodReady guards the invariant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fuziontech requested a review from a team June 4, 2026 00:32

jghoman approved these changes Jun 4, 2026

EDsCODE approved these changes Jun 4, 2026

fuziontech merged commit 2ec6b2d into main Jun 4, 2026
24 checks passed

fuziontech deleted the fix/warm-spawn-deadline-exclusive branch June 4, 2026 00:40

fuziontech merged 1 commit into main from fix/warm-spawn-deadline-exclusive
