fix(worker-pool): drain-aware worker eviction (do-not-disrupt + grace + health guard) #682
bill-ph wants to merge 5 commits into
… + health guard)

On 2026-06-04 a Karpenter Drift roll (Bottlerocket AMI 1.61→1.62) tainted worker nodes out from under running queries: 31 in-flight queries were killed, each a median ~5s after its node was tainted while the node still had ~108s of life left. The control plane itself canceled them (3 failed health checks → "worker unresponsive"), even though the workers were draining.

The CP drains its own sessions on a roll (900s grace, unbounded; control.go:46); workers had none of that. This gives a busy worker the same protection.

1a (controlplane/worker_disruption_guard.go): the CP stamps karpenter.sh/do-not-disrupt on a worker while it serves a session and clears it when idle, so Karpenter skips a node running a query. Reconciled every 5s on the shared K8sWorkerPool, covering both the flat pool and OrgReservedPool (which share its worker map). Patches only on busy<->idle transitions. Requires the new pods:patch RBAC.

1c (k8s_pool.go): worker pods set terminationGracePeriodSeconds=600 (was unset → 30s default), a real drain window for in-flight queries on SIGTERM.

2a (k8s_pool.go HealthCheckLoop): a worker failing health checks while its pod is already Terminating (planned node drain) is no longer marked Lost / canceled; the loop defers to the informer-driven pod-terminated path so the worker drains. Terminating is read from the pod informer cache: no API call, no new RBAC.

Protects against voluntary disruption only. Involuntary loss (spot reclaim, node failure) is unchanged; that residual tail needs transparent statement retry for commit-safe statements (follow-up). Disruption budgets are intentionally not touched: workers already had Drift budget=1 and it did not prevent the kills.

Tests:
- unit: reconciler set/clear/idempotent/skip-exiting; pod-Terminating cache check; pod-spec grace assertion (TestK8sPool_SpawnWorkerCreatesCorrectPod).
- manifest: k8s/rbac.yaml now grants pods:patch.
- e2e-mw-dev/harness.sh: a busy worker carries do-not-disrupt and an idle worker clears it; worker terminationGracePeriodSeconds=600.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
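As an illustration of the set/clear contract in 1a, here is a minimal sketch of the patch body a reconciler could send on a busy<->idle transition. It assumes a JSON merge patch (the commit message does not say which patch type the annotation uses), and `guardPatch` is a hypothetical helper name, not a function from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// doNotDisruptKey is the Karpenter annotation named in the commit message.
const doNotDisruptKey = "karpenter.sh/do-not-disrupt"

// guardPatch is a hypothetical helper. It builds a JSON merge patch that sets
// the annotation to "true" for a busy worker, or removes it for an idle one.
func guardPatch(busy bool) ([]byte, error) {
	// A nil interface marshals to JSON null, which a merge patch
	// interprets as "delete this key".
	var value any
	if busy {
		value = "true"
	}
	patch := map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]any{doNotDisruptKey: value},
		},
	}
	return json.Marshal(patch)
}

func main() {
	for _, busy := range []bool{true, false} {
		body, err := guardPatch(busy)
		if err != nil {
			panic(err)
		}
		fmt.Printf("busy=%v patch=%s\n", busy, body)
	}
}
```

With a merge patch the annotation key needs no escaping even though it contains `/`, which is one reason this sketch prefers it over a JSON Patch path.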
… grace dependency

Holistic review against the live Karpenter config (v1.9.0, karpenter.sh/v1) surfaced two issues:

1. 2a regression: deferring "mark Lost" whenever a worker's pod was Terminating kept the worker in the pool until the pod was actually deleted. findIdle/leastLoaded only skip workers whose done channel is closed (pod gone), so a Terminating-but-not-deleted IDLE worker stayed acquirable: a new session could be routed to a shutting-down pod, for no benefit (no query to protect). Fix: gate the deferral on activeSessions>0. Idle workers are marked Lost promptly as before; only busy workers (a query to drain) defer.

2. The duckgres-workers / -colocated / -cp NodePools all have terminationGracePeriod: None. In Karpenter v1, do-not-disrupt then blocks Drift/expiration INDEFINITELY (no forced removal), so a long/stuck/idle-held session could wedge a security AMI roll. The do-not-disrupt change (1a) MUST ship with a NodePool terminationGracePeriod ceiling. Documented as a blocking companion infra change in the PR; not fixable in this repo (ArgoCD-managed karpenter-config).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
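The gating rule from issue 1 can be sketched as a small decision function. This is an assumption-laden illustration, not the code in k8s_pool.go: `workerState`, `shouldMarkLost`, and `maxFailedChecks` are hypothetical names, and the threshold of 3 is taken from the incident description.

```go
package main

import "fmt"

// workerState is a hypothetical snapshot of what the health loop knows.
type workerState struct {
	failedChecks   int  // consecutive failed health checks
	podTerminating bool // as read from the pod informer cache
	activeSessions int  // sessions currently served by the worker
}

// maxFailedChecks mirrors the "3 failed health checks" from the incident.
const maxFailedChecks = 3

// shouldMarkLost reports whether the health loop should mark the worker Lost.
// It defers only when the pod is Terminating AND there is a query to drain.
func shouldMarkLost(w workerState) bool {
	if w.failedChecks < maxFailedChecks {
		return false // still considered healthy
	}
	if w.podTerminating && w.activeSessions > 0 {
		return false // planned drain with work in flight: let it finish
	}
	return true // idle-and-terminating, or failing with no drain in progress
}

func main() {
	cases := []workerState{
		{failedChecks: 3, podTerminating: true, activeSessions: 1},
		{failedChecks: 3, podTerminating: true, activeSessions: 0},
		{failedChecks: 3, podTerminating: false, activeSessions: 1},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> markLost=%v\n", c, shouldMarkLost(c))
	}
}
```

The second case is the regression described above: an idle Terminating worker is marked Lost promptly so it cannot be acquired by a new session.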
Holistic validation against the live Karpenter/k8s config
Checked the fix against the actual prod Karpenter (
✅ Confirmed correct
Blocking companion infra PR is now open: PostHog/charts#11756, which adds …
… failover)
The reconciler tracked applied-state in an in-memory ManagedWorker field. That
loses the orphan-clear case on CP death/failover: when a CP dies, its sessions
die with it (worker goes idle) but the worker pod keeps the do-not-disrupt
annotation the dead CP stamped. A surviving/replacement CP adopts the worker
with applied=false and activeSessions=0, so applied==busy==false and the
reconciler skips it forever — leaving an orphaned annotation that suppresses
Karpenter consolidation until the idle reaper happens to delete the pod.
Reconcile against the pod's ACTUAL annotation read from the pod informer cache
(new podHasDoNotDisrupt, no API call) instead of an in-memory flag. Any CP now
self-corrects: it sees desired=idle vs current=annotated and clears the orphan.
Removes the ManagedWorker field entirely (worker_mgr.go back to unchanged).
Also gates the 2a health-check deferral on activeSessions>0 (prior commit), so a
Terminating *idle* worker is still marked Lost promptly rather than lingering in
the acquirable pool.
New regression test TestReconcileDisruptionGuardsClearsStaleAnnotationAfterFailover
covers the orphan path; steady-state still issues zero patches.
Note (no code change): verified against Karpenter v1 docs that
spec.template.spec.terminationGracePeriod DOES make Drift bypass do-not-disrupt
after the grace ("a node may be disrupted via drift even if there are pods with
... the karpenter.sh/do-not-disrupt annotation"), so the charts#11756 ceiling
correctly bounds the hold for AMI/CVE drift — not just expiration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
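The desired-versus-actual comparison described in this commit can be sketched as a pure function. The names `action` and `reconcileGuard` are hypothetical; in the PR the actual state comes from podHasDoNotDisrupt reading the informer cache, which this sketch replaces with a plain boolean.

```go
package main

import "fmt"

// action is what the reconciler should do for one worker.
type action int

const (
	noop action = iota
	setAnnotation
	clearAnnotation
)

// reconcileGuard compares the desired state (busy means annotated) with the
// annotation actually present on the pod, rather than an in-memory flag.
func reconcileGuard(activeSessions int, podAnnotated bool) action {
	desired := activeSessions > 0
	switch {
	case desired && !podAnnotated:
		return setAnnotation
	case !desired && podAnnotated:
		return clearAnnotation // includes the orphan left by a dead CP
	default:
		return noop // steady state issues no patch
	}
}

func main() {
	// A replacement CP adopts an idle worker whose pod is still annotated.
	fmt.Println(reconcileGuard(0, true) == clearAnnotation)
	// Steady state: busy and already annotated.
	fmt.Println(reconcileGuard(2, true) == noop)
}
```

Because the function depends only on observable state, any CP that adopts the worker reaches the same decision, which is the self-correcting property the commit message describes.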
Follow-up from review (commit 5d99d79):
CP-failover orphan (real gap, fixed). The reconciler tracked applied-state in an in-memory field. On CP death the worker's sessions die (worker goes idle) but its pod keeps the annotation; a replacement CP adopted it with …
Karpenter semantics confirmed. Per the v1 disruption docs, …
Loops reconcileDisruptionGuards + workerPodTerminating against 16 goroutines flipping activeSessions under the pool lock (as Acquire/ReleaseWorker do). Asserts the snapshot-under-RLock / patch-without-lock discipline is race-free.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
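The snapshot-under-RLock / patch-without-lock discipline named above might look roughly like the following. All types and method names here (`pool`, `worker`, `snapshotWorkers`, `setSessions`) are hypothetical stand-ins, and the patch callback stands in for the Kubernetes API call; running it with `go run -race` is one way to exercise the same property the test asserts.

```go
package main

import (
	"fmt"
	"sync"
)

type worker struct {
	id             int
	activeSessions int
}

type pool struct {
	mu      sync.RWMutex
	workers map[int]*worker
}

// snapshot is an immutable copy of the fields the reconciler needs.
type snapshot struct {
	id   int
	busy bool
}

// snapshotWorkers copies worker state while holding only the read lock.
func (p *pool) snapshotWorkers() []snapshot {
	p.mu.RLock()
	defer p.mu.RUnlock()
	out := make([]snapshot, 0, len(p.workers))
	for _, w := range p.workers {
		out = append(out, snapshot{id: w.id, busy: w.activeSessions > 0})
	}
	return out
}

// setSessions mutates a worker under the write lock, as acquire/release would.
func (p *pool) setSessions(id, n int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.workers[id].activeSessions = n
}

// reconcile calls patch for every worker with no lock held, so slow I/O
// never blocks session acquisition.
func (p *pool) reconcile(patch func(id int, busy bool)) {
	for _, s := range p.snapshotWorkers() {
		patch(s.id, s.busy)
	}
}

func main() {
	p := &pool{workers: map[int]*worker{1: {id: 1}, 2: {id: 2}}}
	var wg sync.WaitGroup
	for g := 0; g < 16; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				p.setSessions(1+g%2, i%2)
			}
		}(g)
	}
	patched := 0
	for i := 0; i < 100; i++ {
		p.reconcile(func(int, bool) { patched++ })
	}
	wg.Wait()
	fmt.Println("patch calls:", patched)
}
```

The trade-off is that a snapshot can be stale by the time the patch is sent; the next reconcile pass corrects it.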
…lity + scheduling)

Follow-up to the dual-model review of #682.

BLOCKING: relabel adopted pods so the informer sees them

The pod informer is label-scoped to duckgres/control-plane=<cpID> (cpID = os.Hostname(), per CP pod). adoptClaimedWorker did not relabel, so after a CP restart/rollout an adopted worker kept the dead CP's label and was invisible to the new CP's informer. That defeated podHasDoNotDisrupt (stale annotation never cleared; busy adopted worker patched every 5s forever), workerPodTerminating (2a never protected adopted workers), and informer-driven cleanup. Now adoptClaimedWorker merge-patches duckgres/control-plane to this CP.

should-fix: don't hand a new session to a Terminating worker

isGenericSessionSchedulableWorkerLocked and OrgReservedPool.workerReadyForSchedulingLocked now exclude workers whose pod is Terminating, so the 2a deferral can't leave a shutting-down pod schedulable once its session releases. Codex P1 follow-up: the runtime-store claim path bypasses those predicates, so adoptClaimedWorker also rejects a claim whose pod is already Terminating (routes into the existing claim-retire/fallback).

should-fix: shrink the reconcile race window 5s -> 2s

(Eager-stamp on the 0->1 session transition deferred: it touches the hot acquire path on both pools, and with the terminationGracePeriod ceiling a disruption inside the window is non-fatal anyway.)

test: harness assert_drain_aware_eviction attributes the annotation to its OWN held session via a baseline diff (not "first annotated pod"), and drops $(seq) for a POSIX while-loop (#!/bin/sh + set -eu).

New unit test TestRelabelAdoptedPodToThisCP; reconcile/terminating/-race tests updated. Charts rollout-ordering + 1h-semantics documented in PostHog/charts#11756.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dual-model review: findings resolved (commit 295e6ef)
Ran a second independent review (Claude + Codex). Net: 1 blocking + several should-fix, all addressed.
🔴 BLOCKING: adopted workers were invisible to the informer (Codex-found, verified)
🟡 should-fix
charts#11756 got a rollout-ordering note (NodeClaims snapshot …
Verified: …
Why
On 2026-06-04 a Karpenter Drift roll (Bottlerocket AMI 1.61→1.62) tainted worker nodes out from under running queries. 31 in-flight queries were killed, each a median ~5s after its node was tainted while the node still had ~108s of life left, and the control plane itself canceled them (3 failed health checks → "K8s worker unresponsive, deleting pod"), even though the workers were draining. The dominant casualty was posthog_data_import_2's DELETE FROM ducklake.posthog.events (24 of them, 21–37 min each, 0 ultimately succeeded).

The CP drains its own sessions on a roll (900s grace, unbounded; control.go:46); workers had none of that: no terminationGracePeriodSeconds (→ 30s default), no do-not-disrupt, and the health loop canceled the query the moment it couldn't reach the (draining) worker. This PR gives a busy worker the same protection the control plane already gives itself.

What
- karpenter.sh/do-not-disrupt on busy workers (controlplane/worker_disruption_guard.go): the CP stamps the annotation on a worker while it serves a session and clears it when idle, so Karpenter skips a node running a query. Reconciled every 5s on the shared K8sWorkerPool, covering both the flat pool and OrgReservedPool (they share its worker map). Patches only on busy↔idle transitions. Requires the new pods: patch RBAC.
- terminationGracePeriodSeconds=600 (k8s_pool.go, was unset → 30s): a real drain window on SIGTERM.
- Health guard (k8s_pool.go HealthCheckLoop): a worker failing health checks while its pod is already Terminating (planned node drain) is no longer marked Lost / canceled; the loop defers to the informer-driven pod-terminated path so the worker drains. Terminating is read from the existing pod informer cache: no API call, no new RBAC.

Scope / non-goals
- The doNotDisruptApplied dedup flag resets on CP restart; the failure direction is safe (a still-busy worker is re-patched; a worker idled across the restart leaves a stale annotation that merely defers consolidation until the idle reaper retires it; it never clears a busy worker). Documented in-code with a follow-up.

Tests
- Unit (controlplane/worker_disruption_guard_test.go): reconciler set/clear/idempotent/skip-exiting; workerPodTerminating cache check. Plus a grace-period assertion in TestK8sPool_SpawnWorkerCreatesCorrectPod.
- Manifest (tests/manifests/manifests_test.go): k8s/rbac.yaml grants pods: patch.
- E2E (tests/e2e-mw-dev/harness.sh): regression guard. A busy worker carries do-not-disrupt and an idle worker clears it; worker terminationGracePeriodSeconds=600. (Asserting Karpenter actually defers needs a real node drain, out of scope for the in-Job harness; the annotation contract is what gates Karpenter, so that is what we assert.)

🤖 Generated with Claude Code