
feat(controlplane): unbounded K8s worker scaling — remove memory-budget-derived MaxWorkers cap #597

Merged
benben merged 2 commits into main from ben/remove-cp-max-workers-cap
May 21, 2026

Conversation

@benben
Member

@benben benben commented May 21, 2026

Summary

  • For K8s / remote-worker mode, drop the k8s.max_workers = memory_budget / 256MB derivation. If cfg.K8s.MaxWorkers is unset (0), the worker pool is unbounded — the cluster's NodePool / autoscaler is the natural ceiling.
  • Skip the "k8s.shared_warm_target exceeds k8s.max_workers" capping when MaxWorkers is unbounded; the warm target stands on its own.
  • Replace the misleading "Derived k8s.max_workers from memory budget" startup log with one that names what is actually happening.
  • Process / local mode keeps the existing derivation — process-mode workers share the CP process's memory, so the derivation is still meaningful there.

No other code paths needed changes: OrgReservedPool, K8sWorkerPool.canSpawn, ConfigStore.ClaimIdleWorker, and CreateSpawningWorkerSlot all already treat MaxWorkers == 0 as "no cap".

Why

A 30-tenant / 30-qps-per-tenant (900 qps target) load test stabilised at 11 workers / ~96 qps and refused to scale further despite Karpenter having spun up 12 worker nodes ready to accept pods. cp_errors_1m had dropped to 0 — the control plane simply decided 11 workers was enough.

Root cause: in controlplane/control.go the CP was using a temporary MemoryRebalancer to derive k8s.max_workers from the CP's own memory budget (75% of CP host RAM by default). On the dev CP nodes that yielded ~2856 MB → 2856 / 256 ≈ 11.2, truncated to 11 workers. This derivation is wrong for K8s mode: workers run as separate pods on separate nodes, so the CP's RAM tells us nothing about how many worker pods the cluster can host. The startup log recorded the derived cap:

Derived k8s.max_workers from memory budget. k8s_max_workers=11 memory_budget=2856MB

…which is exactly what was capping the test.
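The arithmetic behind the cap, and the rule that replaces it, can be sketched as below. This is a minimal illustration, not the code in controlplane/control.go: the helper name resolveMaxWorkers, the remoteWorkers flag, and the perWorkerMB constant are assumptions made for the example. Only the 256 MB divisor, the 2856 MB budget, and the "unset (0) means unbounded" rule come from the description above.

```go
package main

import "fmt"

// perWorkerMB mirrors the 256 MB divisor from the old derivation.
// The constant name is hypothetical.
const perWorkerMB = 256

// resolveMaxWorkers is a hypothetical helper sketching the new rule.
// configured stands in for cfg.K8s.MaxWorkers, where 0 means "unset".
// A return value of 0 means "no cap".
func resolveMaxWorkers(configured, memoryBudgetMB int, remoteWorkers bool) int {
	if configured > 0 {
		return configured // an explicit operator setting always wins
	}
	if remoteWorkers {
		return 0 // K8s mode: unbounded, the cluster autoscaler is the ceiling
	}
	// Process mode: workers share the CP's memory, so keep the derivation.
	return memoryBudgetMB / perWorkerMB
}

func main() {
	fmt.Println(resolveMaxWorkers(0, 2856, false)) // derivation: 2856 / 256 truncates to 11
	fmt.Println(resolveMaxWorkers(0, 2856, true))  // K8s mode, unset: 0 (no cap)
	fmt.Println(resolveMaxWorkers(40, 2856, true)) // explicit cap: 40
}
```

The first call reproduces the cap of 11 that the load test hit. The second shows the same inputs yielding no cap once the mode is taken into account.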

Downstream audit

MaxWorkers == 0 is consistently treated as "unbounded" downstream:

  • controlplane/org_reserved_pool.go:62: if p.maxWorkers == 0 || assignedCount < p.maxWorkers — 0 falls through.
  • controlplane/k8s_pool.go:1022, 2660: canSpawn := p.maxWorkers == 0 || liveCount < p.maxWorkers — 0 falls through.
  • controlplane/configstore/store.go:908, 1322, 1332: org / global cap checks are all guarded with maxOrgWorkers > 0 / maxGlobalWorkers > 0 — 0 is skipped.

So once the CP stops deriving a fake cap, the entire chain Just Works.
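The guard shape quoted in the audit can be reduced to the sketch below. The function names underCap and clampWarmTarget are hypothetical and written for this example; they are not the functions in org_reserved_pool.go, k8s_pool.go, or store.go. The second helper illustrates, under the same assumption, how the shared_warm_target capping can be skipped when no cap is set.

```go
package main

import "fmt"

// underCap reports whether one more worker may be added.
// max == 0 means "no cap", matching the pattern
// `maxWorkers == 0 || count < maxWorkers` quoted above.
func underCap(max, current int) bool {
	return max == 0 || current < max
}

// clampWarmTarget caps the warm target at max only when a cap is set.
// With max == 0 the warm target is returned unchanged.
func clampWarmTarget(warm, max int) int {
	if max > 0 && warm > max {
		return max
	}
	return warm
}

func main() {
	fmt.Println(underCap(11, 11))        // false: the derived cap blocks the 12th worker
	fmt.Println(underCap(0, 11))         // true: zero falls through
	fmt.Println(clampWarmTarget(20, 11)) // 11: capped
	fmt.Println(clampWarmTarget(20, 0))  // 20: unbounded, target stands on its own
}
```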

Trade-off

Operators who relied on the derivation as an implicit safety net now need to set K8s.MaxWorkers explicitly if they want a global cap. The new startup log makes this visible:

k8s.max_workers unset; worker pool is unbounded — cluster autoscaler (e.g. Karpenter) is the ceiling.

In production we want exactly this: the NodePool sizing / Karpenter limits become the single source of truth for "how big can the worker fleet get", instead of being silently clamped by an unrelated CP-host RAM heuristic.

Test plan

  • go build -tags kubernetes ./... clean
  • go test -tags kubernetes ./controlplane/... ./controlplane/configstore/... ./controlplane/provisioner/... clean (the existing admin failures are unrelated: they require docker-compose for the postgres-container integration tests and fail identically on main).
  • New TestOrgReservedPoolAcquireUnboundedWhenMaxWorkersZero drives OrgReservedPool with maxWorkers=0, pre-seeds 30 warm workers in the shared pool, and asserts AcquireWorker can hand all 30 out without rejecting on max-workers grounds.
  • After deploy: re-run the 30-tenant / 900 qps load test and confirm worker count climbs past 11 to whatever Karpenter / NodePool actually allows.
  • Verify the new startup log line appears on K8s CPs that don't set k8s.max_workers explicitly.

Update — fast scale-up (commit 2)

Re-running the 30-tenant / 30-qps cold ramp after the first commit (now with the worker pool unbounded) exposed a second throttle: the pool was technically allowed to grow past 11, but it was still creeping up at ~2-3 workers per minute instead of closing the gap to ~30 in one reconcile cycle.

Observed timeline:

t=0     5 workers   (initial warm)
t=60s   8           (+3)
t=120s  10          (+2)
t=150s  11          (+1)
t=600s  still 11    never reached the desired ~30

Karpenter was already provisioning nodes within ~30s; the bottleneck was on the CP side. Two throttles compounded:

  1. reconcileWarmCapacity only ever filled to the static target. The janitor called SpawnMinWorkers(target) every 5s where target = K8s.SharedWarmTarget — a static config value. It never reacted to bursts of queued WarmCapacityExhausted retries. The visible pool growth came entirely from triggerPerImageReplenish refilling exactly one slot per consumed warm worker, which scales with successful activations — not with queued demand.

  2. spawnSem was sized at 3. Even when SpawnMinWorkers parallelised its WaitGroup fan-out, the semaphore limited pod creates to 3 at a time. The K8s client was already configured for QPS=50 / Burst=100, so the semaphore was the binding constraint.
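The second throttle is easy to reproduce in isolation. The sketch below is not the pool's code: peakConcurrency is a hypothetical helper, the buffered channel stands in for spawnSem, and a short sleep stands in for a pod create. It shows that a WaitGroup fan-out never runs more jobs at once than the semaphore has permits.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency launches jobs goroutines that each hold one permit from a
// semaphore of size permits while doing simulated work, and returns the
// highest number of jobs observed running at the same time.
func peakConcurrency(jobs, permits int) int64 {
	sem := make(chan struct{}, permits) // buffered channel used as a counting semaphore
	var wg sync.WaitGroup
	var running, peak atomic.Int64

	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a permit (blocks when all are taken)
			defer func() { <-sem }() // release the permit

			n := running.Add(1)
			for { // record a new peak if this job raised it
				p := peak.Load()
				if n <= p || peak.CompareAndSwap(p, n) {
					break
				}
			}
			time.Sleep(5 * time.Millisecond) // stand-in for the pod create call
			running.Add(-1)
		}()
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	fmt.Println(peakConcurrency(30, 3) <= 3)   // true: 30 jobs, never more than 3 in flight
	fmt.Println(peakConcurrency(30, 50) <= 30) // true: permits no longer bind, jobs do
}
```

With 3 permits the fan-out is parallel in structure but capped at 3 in practice, which matches the behaviour described for spawnSem.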

The second commit threads observed demand through the reconciler:

  • K8sWorkerPool.warmCapacityMisses (atomic int64) is incremented every time ReserveSharedWorker returns WarmCapacityExhausted for any reason except OrgCap (per-org caps are not a shared-pool shortage).
  • New ConsumeWarmCapacityDemand() returns and atomically resets the counter.
  • The janitor's reconcileWarmCapacity closure now computes effectiveTarget = staticTarget + observedDemand and calls SpawnMinWorkers(effectiveTarget) — scaling to demand in one tick rather than creeping up at the static floor.
  • spawnSem raised from 3 → 50 so the WaitGroup fan-out inside SpawnMinWorkers actually runs in parallel.

Scale-DOWN is intentionally untouched — the idle reaper keeps its slower cadence so steady-state idle dips don't thrash the pool.
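The demand-threading mechanism can be sketched as follows. This is an illustration under assumptions rather than the merged code: the demandCounter type and its RecordMiss / Consume methods are hypothetical stand-ins for warmCapacityMisses and ConsumeWarmCapacityDemand(), and the orgCap flag stands in for the OrgCap exclusion described above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// demandCounter is a hypothetical stand-in for the pool's miss counter.
type demandCounter struct {
	misses atomic.Int64
}

// RecordMiss counts one warm-capacity miss. Misses caused by a per-org cap
// are ignored, since extra shared warm pods would not help that org.
func (d *demandCounter) RecordMiss(orgCap bool) {
	if orgCap {
		return
	}
	d.misses.Add(1)
}

// Consume returns the misses recorded since the last call and resets the
// counter to zero in a single atomic step.
func (d *demandCounter) Consume() int64 {
	return d.misses.Swap(0)
}

// effectiveTarget is what one reconcile tick would ask for:
// the static floor plus demand observed since the previous tick.
func effectiveTarget(staticTarget int, d *demandCounter) int {
	return staticTarget + int(d.Consume())
}

func main() {
	var d demandCounter
	var wg sync.WaitGroup
	for i := 0; i < 30; i++ { // 30 concurrent reservations that all miss
		wg.Add(1)
		go func() {
			defer wg.Done()
			d.RecordMiss(false)
		}()
	}
	d.RecordMiss(true) // an OrgCap miss, not counted
	wg.Wait()

	fmt.Println(effectiveTarget(5, &d)) // 35: static 5 + demand 30
	fmt.Println(effectiveTarget(5, &d)) // 5: counter was reset by the first tick
}
```

Using Swap rather than a Load followed by a Store means misses recorded between the read and the reset are not lost; they are picked up by the next tick.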

Where scale-up decisions live

For posterity / reviewers:

  • K8sWorkerPool.SpawnMinWorkers(count) — claims up to count - idleCount neutral warm slots from ConfigStore (sequential under an advisory lock; DB ops only) and then fans out parallel pod creates via sync.WaitGroup. Already parallel; bottlenecked only by spawnSem.
  • K8sWorkerPool.SpawnMinWorkersForImage(ctx, image, count) — same pattern for per-image floors. Same parallel fan-out.
  • K8sWorkerPool.triggerPerImageReplenish(image) — fire-and-forget spawn of one pod after a warm worker is consumed by ReserveSharedWorker. Replaces consumed warm pods 1:1 — doesn't react to queued demand.
  • reconcileWarmCapacity (multitenant.go) — janitor's 5-second tick. Previously called SpawnMinWorkers(staticTarget). Now calls SpawnMinWorkers(staticTarget + demand) where demand = ConsumeWarmCapacityDemand().
  • shouldReplenishWarmCapacityLocked (k8s_pool.go) — only used by the non-runtime-store path (single-CP mode). Returns a bool — spawns at most one replacement per consumed worker. Left as-is; cluster mode (which the load test runs in) doesn't go through this path.
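As a worked example of the "count - idleCount" claim in SpawnMinWorkers, the sketch below computes how many new slots one call would request. The helper slotsToClaim is hypothetical and ignores everything else the real method does (the advisory lock, the ConfigStore calls, the pod creates). It only illustrates the arithmetic.

```go
package main

import "fmt"

// slotsToClaim returns how many new warm slots a SpawnMinWorkers-style call
// would request: the target minus the idle workers already present, never
// negative.
func slotsToClaim(count, idleCount int) int {
	if n := count - idleCount; n > 0 {
		return n
	}
	return 0
}

func main() {
	fmt.Println(slotsToClaim(5, 5))    // 0: static target already met, nothing spawns
	fmt.Println(slotsToClaim(5+30, 0)) // 35: static 5 + demand 30, no idle workers
	fmt.Println(slotsToClaim(5+30, 5)) // 30: five idle workers already cover the floor
}
```

The first line is the pre-change steady state: once the static target was met, the janitor tick had nothing to do regardless of queued demand.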

Test plan — fast scale-up

  • go build -tags kubernetes ./... clean
  • go test -tags kubernetes ./controlplane/... clean (same pre-existing admin postgres-container failures as before; race detector also flags a pre-existing race in TestK8sPoolRetireWorkerUsesTrackedPodName that reproduces on main too).
  • New TestK8sPoolWarmCapacityDemandScalesPoolInOneTick drives 30 concurrent ReserveSharedWorker calls against a store that always misses, runs one janitor tick, and asserts exactly 35 (staticTarget=5 + demand=30) spawn slots are allocated in that single pass — pinning the new behaviour against a regression to per-tick increments.
  • After deploy: re-run the 30-tenant / 900 qps load test and confirm the pool now reaches ~30 workers within one or two janitor ticks instead of 10+ minutes.

🤖 Generated with Claude Code

benben added 2 commits May 21, 2026 10:27
…et-derived MaxWorkers cap

In K8s mode, workers run as separate pods on separate nodes, so the
control plane's own memory budget tells us nothing about how many worker
pods the cluster can host. Yet the CP was deriving
`k8s.max_workers = memory_budget / 256MB` whenever the operator left
the cap unset, and a 30-tenant / 900-qps load test stabilised at 11
workers / ~96 qps because the CP's memory budget was ~2.8 GB, even though
Karpenter had already spun up 12 worker nodes ready to accept pods.

This change:

  * Drops the memory-budget derivation for `k8s.max_workers`. If
    `cfg.K8s.MaxWorkers` is 0, the pool is unbounded and the cluster's
    NodePool / autoscaler is the natural ceiling. Downstream call sites
    (`OrgReservedPool`, `K8sWorkerPool.canSpawn`,
    `ConfigStore.ClaimIdleWorker`, `CreateSpawningWorkerSlot`) already
    treat `MaxWorkers == 0` as "no cap", so no other code paths need
    changes.

  * Skips the "k8s.shared_warm_target exceeds k8s.max_workers" capping
    when MaxWorkers is unbounded — the warm target stands on its own.

  * Replaces the misleading "Derived k8s.max_workers from memory
    budget" startup log with one that names what's actually
    happening — `k8s.max_workers unset; worker pool is unbounded`.

  * Updates the `K8sConfig.MaxWorkers` field doc to match.

Process / local mode keeps the existing derivation since process-mode
workers share the CP process's memory.

Adds `TestOrgReservedPoolAcquireUnboundedWhenMaxWorkersZero` driving
`OrgReservedPool` with `maxWorkers=0` and asserting it can acquire
many more workers than the previous cap would have allowed (30 in
the test) without rejecting on max-workers grounds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
A 30-tenant cold ramp at 30 qps revealed the warm-pool reconciler
creeping up at ~2-3 workers per minute instead of closing the gap in
one cycle:

    t=0    5 workers  (initial warm)
    t=60s  8          (+3)
    t=120s 10         (+2)
    t=150s 11         (+1)
    t=600s still 11   never reached the desired 30

Karpenter was already provisioning nodes within ~30s; the bottleneck
was on the CP side. Two throttles compounded:

  1. `reconcileWarmCapacity` called `SpawnMinWorkers(target)` where
     `target = K8s.SharedWarmTarget` — a static configuration value.
     Every tick it filled the warm pool only up to that floor and
     never reacted to bursts of `WarmCapacityExhausted` retries. The
     observed pool growth came entirely from `triggerPerImageReplenish`
     refilling exactly one slot per consumed warm worker, which scales
     with successful activations — not with queued demand.

  2. `K8sWorkerPool.spawnSem` was sized at 3, serialising even the
     parallel WaitGroup fan-out in `SpawnMinWorkers` down to 3
     concurrent pod creates. The K8s client was already configured for
     QPS=50 / Burst=100, so the semaphore was the binding constraint.

This commit threads observed demand through the reconciler:

  * `K8sWorkerPool.warmCapacityMisses` (atomic int64) is incremented
    every time `ReserveSharedWorker` returns `WarmCapacityExhausted`
    for any non-`OrgCap` reason. `OrgCap` is excluded because adding
    neutral warm pods doesn't help an org that has hit its own cap.
  * `ConsumeWarmCapacityDemand()` returns and atomically resets the
    counter.
  * Janitor's `reconcileWarmCapacity` now computes
    `effectiveTarget = staticTarget + observedDemand` and calls
    `SpawnMinWorkers(effectiveTarget)`, scaling to observed demand in
    a single tick. Pod creation already fans out via WaitGroup
    inside `SpawnMinWorkers`; with `spawnSem` raised to 50, the K8s
    API calls actually run in parallel.

Scale-DOWN is intentionally untouched — the idle reaper keeps its
slower cadence so steady-state idle dips don't thrash the pool.

`TestK8sPoolWarmCapacityDemandScalesPoolInOneTick` drives 30
concurrent `ReserveSharedWorker` calls against a store that always
misses, simulates one janitor tick, and asserts exactly 35
(`staticTarget=5 + demand=30`) spawn slots are allocated in a single
pass — pinning the new behaviour against a regression to per-tick
increments.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@benben benben merged commit 4bbb7a4 into main May 21, 2026
22 checks passed
@benben benben deleted the ben/remove-cp-max-workers-cap branch May 21, 2026 08:49