feat(controlplane): unbounded K8s worker scaling — remove memory-budget-derived MaxWorkers cap by benben · Pull Request #597 · PostHog/duckgres

benben · 2026-05-21T08:28:10Z

Summary

For K8s / remote-worker mode, drop the k8s.max_workers = memory_budget / 256MB derivation. If cfg.K8s.MaxWorkers is unset (0), the worker pool is unbounded — the cluster's NodePool / autoscaler is the natural ceiling.
Skip the "k8s.shared_warm_target exceeds k8s.max_workers" capping when MaxWorkers is unbounded; the warm target stands on its own.
Replace the misleading "Derived k8s.max_workers from memory budget" startup log with one that names what is actually happening.
Process / local mode keeps the existing derivation — process-mode workers share the CP process's memory, so the derivation is still meaningful there.

No other code paths needed changes: OrgReservedPool, K8sWorkerPool.canSpawn, ConfigStore.ClaimIdleWorker, and CreateSpawningWorkerSlot all already treat MaxWorkers == 0 as "no cap".
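The mode-dependent behaviour summarised above can be sketched as follows. This is a minimal illustration, not the code in controlplane/control.go: the helper names `resolveMaxWorkers` and `clampWarmTarget`, the `k8sMode` flag, and the rule that an explicit setting wins in both modes are assumptions made for the example. Only the 256MB divisor and the "0 means unbounded" convention come from the PR.

```go
package main

import "fmt"

// perWorkerMB mirrors the 256MB divisor quoted in the PR description.
const perWorkerMB = 256

// resolveMaxWorkers is a hypothetical helper (not the real duckgres function).
// It returns the effective worker cap, where 0 means "unbounded".
func resolveMaxWorkers(k8sMode bool, configured, memoryBudgetMB int) int {
	if configured > 0 {
		return configured // assumption: an explicit operator setting always wins
	}
	if k8sMode {
		return 0 // unset in K8s mode: no cap, the cluster autoscaler is the ceiling
	}
	return memoryBudgetMB / perWorkerMB // process mode keeps the derivation
}

// clampWarmTarget (also hypothetical) caps the warm target only when a real
// cap exists; with maxWorkers == 0 the warm target stands on its own.
func clampWarmTarget(warmTarget, maxWorkers int) int {
	if maxWorkers > 0 && warmTarget > maxWorkers {
		return maxWorkers
	}
	return warmTarget
}

func main() {
	fmt.Println(resolveMaxWorkers(true, 0, 2856))  // 0 (unbounded)
	fmt.Println(resolveMaxWorkers(false, 0, 2856)) // 11 (integer division)
	fmt.Println(clampWarmTarget(30, 0))            // 30 (no cap to clamp against)
	fmt.Println(clampWarmTarget(30, 11))           // 11
}
```

The second call reproduces the old ceiling from the load test: a 2856 MB budget divided by 256 truncates to 11.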

Why

A 30-tenant / 30-qps-per-tenant (900 qps target) load test stabilised at 11 workers / ~96 qps and refused to scale further despite Karpenter having spun up 12 worker nodes ready to accept pods. cp_errors_1m had dropped to 0 — the control plane simply decided 11 workers was enough.

Root cause: in controlplane/control.go the CP was using a temporary MemoryRebalancer to derive k8s.max_workers from the CP's own memory budget (75% of CP host RAM by default). On the dev CP nodes that yielded ~2856 MB → 2856 / 256 = 11 workers. This derivation is wrong for K8s mode: workers run as separate pods on separate nodes, so the CP's RAM tells us nothing about how many worker pods the cluster can host. The startup log even said as much:

Derived k8s.max_workers from memory budget. k8s_max_workers=11 memory_budget=2856MB

…which is exactly what was capping the test.

Downstream audit

MaxWorkers == 0 is consistently treated as "unbounded" downstream:

controlplane/org_reserved_pool.go:62: if p.maxWorkers == 0 || assignedCount < p.maxWorkers — 0 falls through.
controlplane/k8s_pool.go:1022, 2660: canSpawn := p.maxWorkers == 0 || liveCount < p.maxWorkers — 0 falls through.
controlplane/configstore/store.go:908, 1322, 1332: org / global cap checks are all guarded with maxOrgWorkers > 0 / maxGlobalWorkers > 0 — 0 is skipped.

So once the CP stops deriving a fake cap, the entire chain Just Works.
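The two guard styles found in the audit are equivalent for non-negative values, which is why a zero cap passes through every layer. A small sketch, with the expressions lifted from the lines quoted above and wrapped in hypothetical standalone functions (`canSpawn` here is a free function, not the real method, and `underCap` is an invented name for the `> 0` style used in the config store):

```go
package main

import "fmt"

// canSpawn mirrors the k8s_pool.go style: a zero cap short-circuits the
// comparison, so the pool is never considered full.
func canSpawn(maxWorkers, liveCount int) bool {
	return maxWorkers == 0 || liveCount < maxWorkers
}

// underCap mirrors the configstore style: the cap check only runs when the
// cap is positive, so a zero cap is skipped entirely.
func underCap(maxWorkers, liveCount int) bool {
	if maxWorkers > 0 && liveCount >= maxWorkers {
		return false
	}
	return true
}

func main() {
	fmt.Println(canSpawn(11, 11), underCap(11, 11))   // false false
	fmt.Println(canSpawn(0, 11), underCap(0, 11))     // true true
	fmt.Println(canSpawn(0, 1000), underCap(0, 1000)) // true true
}
```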

Trade-off

Operators who relied on the derivation as an implicit safety net now need to set K8s.MaxWorkers explicitly if they want a global cap. The new startup log makes this visible:

k8s.max_workers unset; worker pool is unbounded — cluster autoscaler (e.g. Karpenter) is the ceiling.

In production we want exactly this: the NodePool sizing / Karpenter limits become the single source of truth for "how big can the worker fleet get", instead of being silently clamped by an unrelated CP-host RAM heuristic.

Test plan

go build -tags kubernetes ./... clean
go test -tags kubernetes ./controlplane/... ./controlplane/configstore/... ./controlplane/provisioner/... clean (existing admin failures are unrelated — they require docker-compose for the postgres-container integration tests and fail identically on main).
New TestOrgReservedPoolAcquireUnboundedWhenMaxWorkersZero drives OrgReservedPool with maxWorkers=0, pre-seeds 30 warm workers in the shared pool, and asserts AcquireWorker can hand all 30 out without rejecting on max-workers grounds.
After deploy: re-run the 30-tenant / 900 qps load test and confirm worker count climbs past 11 to whatever Karpenter / NodePool actually allows.
Verify the new startup log line appears on K8s CPs that don't set k8s.max_workers explicitly.

Update — fast scale-up (commit 2)

Re-running the 30-tenant / 30-qps cold ramp after the first commit (now with the worker pool unbounded) exposed a second throttle: the pool was technically allowed to grow past 11, but it was still creeping up at ~2-3 workers per minute instead of closing the gap to ~30 in one reconcile cycle.

Observed timeline:

t=0     5 workers   (initial warm)
t=60s   8           (+3)
t=120s  10          (+2)
t=150s  11          (+1)
t=600s  still 11    never reached the desired ~30

Karpenter was already provisioning nodes within ~30s; the bottleneck was on the CP side. Two throttles compounded:

reconcileWarmCapacity only ever filled to the static target. The janitor called SpawnMinWorkers(target) every 5s where target = K8s.SharedWarmTarget — a static config value. It never reacted to bursts of queued WarmCapacityExhausted retries. The visible pool growth came entirely from triggerPerImageReplenish refilling exactly one slot per consumed warm worker, which scales with successful activations — not with queued demand.
spawnSem was sized at 3. Even when SpawnMinWorkers parallelised its WaitGroup fan-out, the semaphore serialised pod creates down to 3 at a time. The K8s client was already configured for QPS=50 / Burst=100, so the semaphore was the binding constraint.
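To see why the semaphore, rather than the WaitGroup fan-out, was the binding constraint, here is a self-contained sketch of a buffered-channel semaphore in front of simulated pod creates. The function `peakConcurrency` and the sleep standing in for the K8s API call are assumptions for illustration; the real `spawnSem` may be implemented differently.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency launches n simulated pod creates behind a semaphore of the
// given size and reports the highest number that were in flight at once.
func peakConcurrency(n, semSize, workMs int) int64 {
	sem := make(chan struct{}, semSize)
	var inFlight, peak atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a spawn slot
			defer func() { <-sem }() // release it when the create finishes
			cur := inFlight.Add(1)
			for { // record a new peak if this goroutine raised it
				p := peak.Load()
				if cur <= p || peak.CompareAndSwap(p, cur) {
					break
				}
			}
			time.Sleep(time.Duration(workMs) * time.Millisecond) // stand-in for the pod create
			inFlight.Add(-1)
		}()
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	// All 30 goroutines start immediately, but the semaphore bounds the overlap.
	fmt.Println("sem=3  peak:", peakConcurrency(30, 3, 5))  // never more than 3
	fmt.Println("sem=50 peak:", peakConcurrency(30, 50, 5)) // up to 30
}
```

With a size of 3, thirty requested creates drain in roughly ten sequential batches regardless of how many goroutines the fan-out starts.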

The second commit threads observed demand through the reconciler:

K8sWorkerPool.warmCapacityMisses (atomic int64) is incremented every time ReserveSharedWorker returns WarmCapacityExhausted for any reason except OrgCap (per-org caps are not a shared-pool shortage).
New ConsumeWarmCapacityDemand() returns and atomically resets the counter.
The janitor's reconcileWarmCapacity closure now computes effectiveTarget = staticTarget + observedDemand and calls SpawnMinWorkers(effectiveTarget) — scaling to demand in one tick rather than creeping up at the static floor.
spawnSem raised from 3 → 50 so the WaitGroup fan-out inside SpawnMinWorkers actually runs in parallel.
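The counter-and-consume pattern in the list above can be sketched with `sync/atomic`. The type `demandTracker`, the method `recordMiss`, and the reason constants are hypothetical names for this example; only `warmCapacityMisses`, `ConsumeWarmCapacityDemand`, the `OrgCap` exclusion, and the `staticTarget + observedDemand` formula come from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type missReason int

const (
	reasonPoolEmpty missReason = iota // hypothetical shared-pool shortage reason
	reasonOrgCap                      // per-org cap hit: not a shared-pool shortage
)

// demandTracker is a stand-in for the relevant part of K8sWorkerPool.
type demandTracker struct {
	warmCapacityMisses atomic.Int64
}

// recordMiss counts every miss except OrgCap, since adding neutral warm pods
// does not help an org that has hit its own cap.
func (d *demandTracker) recordMiss(r missReason) {
	if r == reasonOrgCap {
		return
	}
	d.warmCapacityMisses.Add(1)
}

// ConsumeWarmCapacityDemand returns the count and resets it in one atomic step,
// so a miss recorded concurrently is either returned now or kept for next tick.
func (d *demandTracker) ConsumeWarmCapacityDemand() int64 {
	return d.warmCapacityMisses.Swap(0)
}

// effectiveTarget is what one janitor tick would pass to SpawnMinWorkers.
func effectiveTarget(staticTarget int, d *demandTracker) int {
	return staticTarget + int(d.ConsumeWarmCapacityDemand())
}

func main() {
	d := &demandTracker{}
	var wg sync.WaitGroup
	for i := 0; i < 30; i++ { // 30 concurrent reservations that all miss
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.recordMiss(reasonPoolEmpty)
		}()
	}
	d.recordMiss(reasonOrgCap) // ignored
	wg.Wait()
	fmt.Println(effectiveTarget(5, d)) // 35: static floor plus observed demand
	fmt.Println(effectiveTarget(5, d)) // 5: the counter was reset by the first tick
}
```

Using a single `Swap(0)` instead of a separate load and store is what keeps misses from being lost between the read and the reset.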

Scale-DOWN is intentionally untouched — the idle reaper keeps its slower cadence so steady-state idle dips don't thrash the pool.

Where scale-up decisions live

For posterity / reviewers:

K8sWorkerPool.SpawnMinWorkers(count) — claims up to count - idleCount neutral warm slots from ConfigStore (sequential under an advisory lock; DB ops only) and then fans out parallel pod creates via sync.WaitGroup. Already parallel; bottlenecked only by spawnSem.
K8sWorkerPool.SpawnMinWorkersForImage(ctx, image, count) — same pattern for per-image floors. Same parallel fan-out.
K8sWorkerPool.triggerPerImageReplenish(image) — fire-and-forget spawn of one pod after a warm worker is consumed by ReserveSharedWorker. Replaces consumed warm pods 1:1 — doesn't react to queued demand.
reconcileWarmCapacity (multitenant.go) — janitor's 5-second tick. Previously called SpawnMinWorkers(staticTarget). Now calls SpawnMinWorkers(staticTarget + demand) where demand = ConsumeWarmCapacityDemand().
shouldReplenishWarmCapacityLocked (k8s_pool.go) — only used by the non-runtime-store path (single-CP mode). Returns a bool — spawns at most one replacement per consumed worker. Left as-is; cluster mode (which the load test runs in) doesn't go through this path.
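As a rough model of the SpawnMinWorkers shape described in the first entry, the sketch below claims `count - idleCount` slots sequentially and then creates them in parallel. `slotsToClaim`, `spawnMin`, and the `create` callback are hypothetical; the real method claims slots from ConfigStore under an advisory lock and creates pods through the K8s client.

```go
package main

import (
	"fmt"
	"sync"
)

// slotsToClaim returns how many new warm slots a fill-to-target pass needs.
// It never goes negative when the pool already has enough idle workers.
func slotsToClaim(target, idleCount int) int {
	if idleCount >= target {
		return 0
	}
	return target - idleCount
}

// spawnMin claims the deficit one slot at a time, then fans the creates out
// in parallel and waits for all of them. It returns the number of slots claimed.
func spawnMin(target, idleCount int, create func(slot int)) int {
	n := slotsToClaim(target, idleCount)
	slots := make([]int, 0, n)
	for i := 0; i < n; i++ { // phase 1: sequential claims (DB-only work in the PR)
		slots = append(slots, i)
	}
	var wg sync.WaitGroup
	for _, s := range slots { // phase 2: parallel creates
		wg.Add(1)
		go func(slot int) {
			defer wg.Done()
			create(slot)
		}(s)
	}
	wg.Wait()
	return n
}

func main() {
	created := make(chan int, 64)
	n := spawnMin(35, 5, func(slot int) { created <- slot })
	fmt.Println(n, len(created)) // 30 30
	fmt.Println(spawnMin(5, 8, func(int) {})) // 0: already above the floor
}
```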

Test plan — fast scale-up

go build -tags kubernetes ./... clean
go test -tags kubernetes ./controlplane/... clean (same pre-existing admin postgres-container failures as before; race detector also flags a pre-existing race in TestK8sPoolRetireWorkerUsesTrackedPodName that reproduces on main too).
New TestK8sPoolWarmCapacityDemandScalesPoolInOneTick drives 30 concurrent ReserveSharedWorker calls against a store that always misses, runs one janitor tick, and asserts exactly 35 (staticTarget=5 + demand=30) spawn slots are allocated in that single pass — pinning the new behaviour against a regression to per-tick increments.
After deploy: re-run the 30-tenant / 900 qps load test and confirm the pool now reaches ~30 workers within one or two janitor ticks instead of 10+ minutes.

🤖 Generated with Claude Code

…et-derived MaxWorkers cap

In K8s mode, workers run as separate pods on separate nodes, so the control plane's own memory budget tells us nothing about how many worker pods the cluster can host. Yet the CP was deriving `k8s.max_workers = memory_budget / 256MB` whenever the operator left the cap unset, and a 30-tenant / 900-qps load test stabilised at 11 workers / ~96 qps because the CP host had ~2.8 GB free — even though Karpenter had already spun up 12 worker nodes ready to accept pods.

This change:

* Drops the memory-budget derivation for `k8s.max_workers`. If `cfg.K8s.MaxWorkers` is 0, the pool is unbounded and the cluster's NodePool / autoscaler is the natural ceiling. Downstream call sites (`OrgReservedPool`, `K8sWorkerPool.canSpawn`, `ConfigStore.ClaimIdleWorker`, `CreateSpawningWorkerSlot`) already treat `MaxWorkers == 0` as "no cap", so no other code paths need changes.
* Skips the "k8s.shared_warm_target exceeds k8s.max_workers" capping when MaxWorkers is unbounded — the warm target stands on its own.
* Replaces the misleading "Derived k8s.max_workers from memory budget" startup log with one that names what's actually happening — `k8s.max_workers unset; worker pool is unbounded`.
* Updates the `K8sConfig.MaxWorkers` field doc to match.

Process / local mode keeps the existing derivation since process-mode workers share the CP process's memory.

Adds `TestOrgReservedPoolAcquireUnboundedWhenMaxWorkersZero` driving `OrgReservedPool` with `maxWorkers=0` and asserting it can acquire many more workers than the previous cap would have allowed (30 in the test) without rejecting on max-workers grounds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

A 30-tenant cold ramp at 30 qps revealed the warm-pool reconciler creeping up at ~2-3 workers per minute instead of closing the gap in one cycle:

    t=0     5 workers   (initial warm)
    t=60s   8           (+3)
    t=120s  10          (+2)
    t=150s  11          (+1)
    t=600s  still 11    never reached the desired 30

Karpenter was already provisioning nodes within ~30s; the bottleneck was on the CP side. Two throttles compounded:

1. `reconcileWarmCapacity` called `SpawnMinWorkers(target)` where `target = K8s.SharedWarmTarget` — a static configuration value. Every tick it filled the warm pool only up to that floor and never reacted to bursts of `WarmCapacityExhausted` retries. The observed pool growth came entirely from `triggerPerImageReplenish` refilling exactly one slot per consumed warm worker, which scales with successful activations — not with queued demand.
2. `K8sWorkerPool.spawnSem` was sized at 3, serialising even the parallel WaitGroup fan-out in `SpawnMinWorkers` down to 3 concurrent pod creates. The K8s client was already configured for QPS=50 / Burst=100, so the semaphore was the binding constraint.

This commit threads observed demand through the reconciler:

* `K8sWorkerPool.warmCapacityMisses` (atomic int64) is incremented every time `ReserveSharedWorker` returns `WarmCapacityExhausted` for any non-`OrgCap` reason. `OrgCap` is excluded because adding neutral warm pods doesn't help an org that has hit its own cap.
* `ConsumeWarmCapacityDemand()` returns and atomically resets the counter.
* Janitor's `reconcileWarmCapacity` now computes `effectiveTarget = staticTarget + observedDemand` and calls `SpawnMinWorkers(effectiveTarget)`, scaling to absorbed demand in a single tick.

Pod creation already fans out via WaitGroup inside `SpawnMinWorkers`; with `spawnSem` raised to 50, the K8s API calls actually run in parallel.

Scale-DOWN is intentionally untouched — the idle reaper keeps its slower cadence so steady-state idle dips don't thrash the pool.

`TestK8sPoolWarmCapacityDemandScalesPoolInOneTick` drives 30 concurrent `ReserveSharedWorker` calls against a store that always misses, simulates one janitor tick, and asserts exactly 35 (`staticTarget=5 + demand=30`) spawn slots are allocated in a single pass — pinning the new behaviour against a regression to per-tick increments.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

benben added 2 commits May 21, 2026 10:27

benben merged commit 4bbb7a4 into main May 21, 2026
22 checks passed

benben deleted the ben/remove-cp-max-workers-cap branch May 21, 2026 08:49

This was referenced May 21, 2026

[codex] Add central warm capacity miss buckets #602

Merged

Add dynamic warm capacity target computation #605

Merged

Wire dynamic warm capacity reconciliation #608

Merged
