feat(controlplane): unbounded K8s worker scaling — remove memory-budget-derived MaxWorkers cap #597
Merged
…et-derived MaxWorkers cap
In K8s mode, workers run as separate pods on separate nodes, so the
control plane's own memory budget tells us nothing about how many worker
pods the cluster can host. Yet the CP was deriving
`k8s.max_workers = memory_budget / 256MB` whenever the operator left
the cap unset, and a 30-tenant / 900-qps load test stabilised at 11
workers / ~96 qps because the CP host had ~2.8 GB free — even though
Karpenter had already spun up 12 worker nodes ready to accept pods.
This change:
* Drops the memory-budget derivation for `k8s.max_workers`. If
`cfg.K8s.MaxWorkers` is 0, the pool is unbounded and the cluster's
NodePool / autoscaler is the natural ceiling. Downstream call sites
(`OrgReservedPool`, `K8sWorkerPool.canSpawn`,
`ConfigStore.ClaimIdleWorker`, `CreateSpawningWorkerSlot`) already
treat `MaxWorkers == 0` as "no cap", so no other code paths need
changes.
* Skips the "k8s.shared_warm_target exceeds k8s.max_workers" capping
when MaxWorkers is unbounded — the warm target stands on its own.
* Replaces the misleading "Derived k8s.max_workers from memory
budget" startup log with one that names what's actually
happening — `k8s.max_workers unset; worker pool is unbounded`.
* Updates the `K8sConfig.MaxWorkers` field doc to match.
Process / local mode keeps the existing derivation since process-mode
workers share the CP process's memory.
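The warm-target rule above can be sketched as follows. This is a minimal illustration, not the project's code: `effectiveWarmTarget` is a hypothetical helper name, and the only behaviour it assumes is the one described in this commit message (a cap of 0 means "no cap", so the warm target is left alone).

```go
package main

import "fmt"

// effectiveWarmTarget is a hypothetical helper for illustration only.
// maxWorkers == 0 means "no cap", so the warm target is returned unchanged;
// otherwise the warm target is clamped to the configured cap.
func effectiveWarmTarget(sharedWarmTarget, maxWorkers int) int {
	if maxWorkers == 0 {
		return sharedWarmTarget // unbounded pool: nothing to clamp against
	}
	if sharedWarmTarget > maxWorkers {
		return maxWorkers // warm target exceeds the explicit cap
	}
	return sharedWarmTarget
}

func main() {
	fmt.Println(effectiveWarmTarget(30, 0))  // cap unset: prints 30
	fmt.Println(effectiveWarmTarget(30, 11)) // explicit cap: prints 11
	fmt.Println(effectiveWarmTarget(5, 11))  // under the cap: prints 5
}
```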
Adds `TestOrgReservedPoolAcquireUnboundedWhenMaxWorkersZero` driving
`OrgReservedPool` with `maxWorkers=0` and asserting it can acquire
many more workers than the previous cap would have allowed (30 in
the test) without rejecting on max-workers grounds.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
A 30-tenant cold ramp at 30 qps per tenant revealed the warm-pool
reconciler creeping up at ~2-3 workers per minute instead of closing
the gap in one cycle:
t=0     5 workers (initial warm)
t=60s   8 (+3)
t=120s  10 (+2)
t=150s  11 (+1)
t=600s  still 11; never reached the desired 30
Karpenter was already provisioning nodes within ~30s; the bottleneck
was on the CP side. Two throttles compounded:
1. `reconcileWarmCapacity` called `SpawnMinWorkers(target)` where
`target = K8s.SharedWarmTarget` — a static configuration value.
Every tick it filled the warm pool only up to that floor and
never reacted to bursts of `WarmCapacityExhausted` retries. The
observed pool growth came entirely from `triggerPerImageReplenish`
refilling exactly one slot per consumed warm worker, which scales
with successful activations — not with queued demand.
2. `K8sWorkerPool.spawnSem` was sized at 3, serialising even the
parallel WaitGroup fan-out in `SpawnMinWorkers` down to 3
concurrent pod creates. The K8s client was already configured for
QPS=50 / Burst=100, so the semaphore was the binding constraint.
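To illustrate the second throttle: a buffered-channel semaphore caps how many goroutines of a WaitGroup fan-out run at once, whatever the fan-out width. The sketch below is a standalone approximation, not the pool's code; `peakConcurrency` is a hypothetical name and the sleep stands in for a pod-create API call.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency launches n simulated pod creates gated by a semaphore of
// size semSize and returns the highest number observed in flight at once.
func peakConcurrency(n, semSize int) int64 {
	sem := make(chan struct{}, semSize)
	var wg sync.WaitGroup
	var inFlight, peak atomic.Int64

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks when full)
			defer func() { <-sem }() // release the slot on exit

			cur := inFlight.Add(1)
			// Record a new peak with a compare-and-swap loop.
			for {
				p := peak.Load()
				if cur <= p || peak.CompareAndSwap(p, cur) {
					break
				}
			}
			time.Sleep(5 * time.Millisecond) // stand-in for the API call
			inFlight.Add(-1)
		}()
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	fmt.Println(peakConcurrency(30, 3))  // never more than 3
	fmt.Println(peakConcurrency(30, 50)) // limited only by n (at most 30)
}
```

With the semaphore at 3, a 30-wide fan-out still proceeds roughly three creates at a time; raising it above the fan-out width removes the semaphore as the limiting factor.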
This commit threads observed demand through the reconciler:
* `K8sWorkerPool.warmCapacityMisses` (atomic int64) is incremented
every time `ReserveSharedWorker` returns `WarmCapacityExhausted`
for any non-`OrgCap` reason. `OrgCap` is excluded because adding
neutral warm pods doesn't help an org that has hit its own cap.
* `ConsumeWarmCapacityDemand()` returns and atomically resets the
counter.
* Janitor's `reconcileWarmCapacity` now computes
`effectiveTarget = staticTarget + observedDemand` and calls
`SpawnMinWorkers(effectiveTarget)`, scaling to absorb the observed
demand in a single tick. Pod creation already fans out via WaitGroup
inside `SpawnMinWorkers`; with `spawnSem` raised to 50, the K8s
API calls actually run in parallel.
Scale-DOWN is intentionally untouched — the idle reaper keeps its
slower cadence so steady-state idle dips don't thrash the pool.
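The counter-and-consume pattern described above can be sketched with `sync/atomic`. This is an illustrative approximation under stated assumptions: `demandCounter`, `recordMiss`, `consume`, and `effectiveTarget` are hypothetical names, and the `orgCap` flag stands in for whatever reason value the real `ReserveSharedWorker` path reports.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// demandCounter is a hypothetical stand-in for the pool's miss counter.
type demandCounter struct {
	misses atomic.Int64
}

// recordMiss counts one warm-capacity miss. Per-org cap misses are skipped
// because extra neutral warm pods would not help that org.
func (d *demandCounter) recordMiss(orgCap bool) {
	if orgCap {
		return
	}
	d.misses.Add(1)
}

// consume returns the accumulated misses and resets the counter in one
// atomic step, so concurrent misses are never lost or double counted.
func (d *demandCounter) consume() int64 {
	return d.misses.Swap(0)
}

// effectiveTarget adds the consumed demand to the static warm floor.
func effectiveTarget(staticTarget int, d *demandCounter) int {
	return staticTarget + int(d.consume())
}

func main() {
	var d demandCounter
	var wg sync.WaitGroup
	for i := 0; i < 30; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.recordMiss(false) // shared-pool miss
		}()
	}
	wg.Wait()
	d.recordMiss(true) // org-cap miss: ignored

	fmt.Println(effectiveTarget(5, &d)) // prints 35
	fmt.Println(effectiveTarget(5, &d)) // counter was reset: prints 5
}
```

Using a swap rather than a separate load and store means a miss recorded between the read and the reset is carried into the next tick instead of being dropped.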
`TestK8sPoolWarmCapacityDemandScalesPoolInOneTick` drives 30
concurrent `ReserveSharedWorker` calls against a store that always
misses, simulates one janitor tick, and asserts exactly 35
(`staticTarget=5 + demand=30`) spawn slots are allocated in a single
pass — pinning the new behaviour against a regression to per-tick
increments.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
This was referenced May 21, 2026
Summary
* Drops the `k8s.max_workers = memory_budget / 256MB` derivation. If `cfg.K8s.MaxWorkers` is unset (0), the worker pool is unbounded; the cluster's NodePool / autoscaler is the natural ceiling.
* Skips the "`k8s.shared_warm_target` exceeds `k8s.max_workers`" capping when MaxWorkers is unbounded; the warm target stands on its own.
No other code paths needed changes: `OrgReservedPool`, `K8sWorkerPool.canSpawn`, `ConfigStore.ClaimIdleWorker`, and `CreateSpawningWorkerSlot` all already treat `MaxWorkers == 0` as "no cap".
Why
A 30-tenant / 30-qps-per-tenant (900 qps target) load test stabilised at 11 workers / ~96 qps and refused to scale further despite Karpenter having spun up 12 worker nodes ready to accept pods. `cp_errors_1m` had dropped to 0; the control plane simply decided 11 workers was enough.
Root cause: in `controlplane/control.go` the CP was using a temporary `MemoryRebalancer` to derive `k8s.max_workers` from the CP's own memory budget (75% of CP host RAM by default). On the dev CP nodes that yielded ~2856 MB → `2856 / 256 = 11` workers. This derivation is wrong for K8s mode: workers run as separate pods on separate nodes, so the CP's RAM tells us nothing about how many worker pods the cluster can host. The startup log even said as much:
…which is exactly what was capping the test.
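The arithmetic behind the observed cap can be reproduced in a few lines. This is only a sketch of the division described above: `derivedMaxWorkers` and `perWorkerMB` are hypothetical names, and the 256 MB divisor and 2856 MB budget are the figures quoted in this description.

```go
package main

import "fmt"

// perWorkerMB is the per-worker memory figure quoted in the PR description.
const perWorkerMB = 256

// derivedMaxWorkers is a hypothetical reconstruction of the old derivation:
// integer division of the CP memory budget by the per-worker figure.
func derivedMaxWorkers(memoryBudgetMB int) int {
	return memoryBudgetMB / perWorkerMB
}

func main() {
	// The result depends only on the CP host's budget, not on how many
	// worker nodes the cluster has available.
	fmt.Println(derivedMaxWorkers(2856)) // prints 11
}
```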
Downstream audit
`MaxWorkers == 0` is consistently treated as "unbounded" downstream:
* `controlplane/org_reserved_pool.go:62`: `if p.maxWorkers == 0 || assignedCount < p.maxWorkers` (0 falls through).
* `controlplane/k8s_pool.go:1022, 2660`: `canSpawn := p.maxWorkers == 0 || liveCount < p.maxWorkers` (0 falls through).
* `controlplane/configstore/store.go:908, 1322, 1332`: org / global cap checks are all guarded with `maxOrgWorkers > 0` / `maxGlobalWorkers > 0` (0 is skipped).
So once the CP stops deriving a fake cap, the entire chain Just Works.
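The guard shape quoted in the audit is small enough to show in isolation. The function below is a standalone sketch that takes its inputs as parameters rather than reading pool state, so the signature is an assumption made for illustration only.

```go
package main

import "fmt"

// canSpawn mirrors the guard expression quoted in the audit:
// a cap of 0 short-circuits to true, so the count is never consulted.
func canSpawn(maxWorkers, liveCount int) bool {
	return maxWorkers == 0 || liveCount < maxWorkers
}

func main() {
	fmt.Println(canSpawn(0, 1000)) // no cap: prints true
	fmt.Println(canSpawn(11, 10))  // below the cap: prints true
	fmt.Println(canSpawn(11, 11))  // at the cap: prints false
}
```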
Trade-off
Operators who relied on the derivation as an implicit safety net now need to set `K8s.MaxWorkers` explicitly if they want a global cap. The new startup log (`k8s.max_workers unset; worker pool is unbounded`) makes this visible.
In production we want exactly this: the NodePool sizing / Karpenter limits become the single source of truth for "how big can the worker fleet get", instead of being silently clamped by an unrelated CP-host RAM heuristic.
Test plan
* `go build -tags kubernetes ./...` clean
* `go test -tags kubernetes ./controlplane/... ./controlplane/configstore/... ./controlplane/provisioner/...` clean (existing `admin` failures are unrelated: they require `docker-compose` for the postgres-container integration tests and fail identically on main).
* `TestOrgReservedPoolAcquireUnboundedWhenMaxWorkersZero` drives `OrgReservedPool` with `maxWorkers=0`, pre-seeds 30 warm workers in the shared pool, and asserts `AcquireWorker` can hand all 30 out without rejecting on max-workers grounds.
* … `k8s.max_workers` explicitly.
Update: fast scale-up (commit 2)
Re-running the 30-tenant / 30-qps cold ramp after the first commit (now with the worker pool unbounded) exposed a second throttle: the pool was technically allowed to grow past 11, but it was still creeping up at ~2-3 workers per minute instead of closing the gap to ~30 in one reconcile cycle.
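Observed timeline:
t=0     5 workers (initial warm)
t=60s   8 (+3)
t=120s  10 (+2)
t=150s  11 (+1)
t=600s  still 11; never reached the desired 30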
Karpenter was already provisioning nodes within ~30s; the bottleneck was on the CP side. Two throttles compounded:
1. `reconcileWarmCapacity` only ever filled to the static target. The janitor called `SpawnMinWorkers(target)` every 5s where `target = K8s.SharedWarmTarget`, a static config value. It never reacted to bursts of queued `WarmCapacityExhausted` retries. The visible pool growth came entirely from `triggerPerImageReplenish` refilling exactly one slot per consumed warm worker, which scales with successful activations, not with queued demand.
2. `spawnSem` was sized at 3. Even when `SpawnMinWorkers` parallelised its WaitGroup fan-out, the semaphore serialised pod creates down to 3 at a time. The K8s client was already configured for QPS=50 / Burst=100, so the semaphore was the binding constraint.
The second commit threads observed demand through the reconciler:
* `K8sWorkerPool.warmCapacityMisses` (atomic int64) is incremented every time `ReserveSharedWorker` returns `WarmCapacityExhausted` for any reason except `OrgCap` (per-org caps are not a shared-pool shortage).
* `ConsumeWarmCapacityDemand()` returns and atomically resets the counter.
* Janitor's `reconcileWarmCapacity` closure now computes `effectiveTarget = staticTarget + observedDemand` and calls `SpawnMinWorkers(effectiveTarget)`, scaling to demand in one tick rather than creeping up at the static floor.
* `spawnSem` raised from 3 → 50 so the WaitGroup fan-out inside `SpawnMinWorkers` actually runs in parallel.
Scale-DOWN is intentionally untouched; the idle reaper keeps its slower cadence so steady-state idle dips don't thrash the pool.
Where scale-up decisions live
For posterity / reviewers:
* `K8sWorkerPool.SpawnMinWorkers(count)`: claims up to `count - idleCount` neutral warm slots from `ConfigStore` (sequential under an advisory lock; DB ops only) and then fans out parallel pod creates via `sync.WaitGroup`. Already parallel; bottlenecked only by `spawnSem`.
* `K8sWorkerPool.SpawnMinWorkersForImage(ctx, image, count)`: same pattern for per-image floors. Same parallel fan-out.
* `K8sWorkerPool.triggerPerImageReplenish(image)`: fire-and-forget spawn of one pod after a warm worker is consumed by `ReserveSharedWorker`. Replaces consumed warm pods 1:1; doesn't react to queued demand.
* `reconcileWarmCapacity` (`multitenant.go`): janitor's 5-second tick. Previously called `SpawnMinWorkers(staticTarget)`. Now calls `SpawnMinWorkers(staticTarget + demand)` where `demand = ConsumeWarmCapacityDemand()`.
* `shouldReplenishWarmCapacityLocked` (`k8s_pool.go`): only used by the non-runtime-store path (single-CP mode). Returns a bool; spawns at most one replacement per consumed worker. Left as-is; cluster mode (which the load test runs in) doesn't go through this path.
Test plan: fast scale-up
* `go build -tags kubernetes ./...` clean
* `go test -tags kubernetes ./controlplane/...` clean (same pre-existing `admin` postgres-container failures as before; race detector also flags a pre-existing race in `TestK8sPoolRetireWorkerUsesTrackedPodName` that reproduces on `main` too).
* `TestK8sPoolWarmCapacityDemandScalesPoolInOneTick` drives 30 concurrent `ReserveSharedWorker` calls against a store that always misses, runs one janitor tick, and asserts exactly 35 (`staticTarget=5 + demand=30`) spawn slots are allocated in that single pass, pinning the new behaviour against a regression to per-tick increments.
🤖 Generated with Claude Code