Add warm-worker pool observability, stuck-worker reaper, and runbooks #344
Merged
… runbooks

Add Prometheus metrics for the shared warm-worker lifecycle (idle, reserved, activating, hot, draining gauges), activation latency histogram, activation failure counter, retirement counter with reason labels, and hot-worker session histogram. Instrument k8s_pool.go and org_reserved_pool.go to emit metrics on state transitions. Add automatic stuck-worker reaper that retires workers stuck in reserved/activating state >2 minutes and replenishes the pool. Extend idleReaper to always run for stuck-worker detection. Track reservedAt and peakSessions on ManagedWorker. Include 3 operational runbooks (drain hot workers, recover stuck activating workers, replenish capacity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
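The lifecycle gauges described in this commit amount to counting live workers per state. The following is a minimal sketch of that counting step only, not the code in k8s_pool.go: the `worker` struct, the `workerState` constants, and `countLifecycle` are hypothetical names, and the real implementation presumably feeds the counts into Prometheus gauges rather than returning a map.

```go
package main

import "fmt"

// workerState is a hypothetical stand-in for the pool's lifecycle states.
type workerState string

const (
	stateIdle       workerState = "idle"
	stateReserved   workerState = "reserved"
	stateActivating workerState = "activating"
	stateHot        workerState = "hot"
	stateDraining   workerState = "draining"
)

// allStates lists every lifecycle state so that empty states report zero.
var allStates = []workerState{stateIdle, stateReserved, stateActivating, stateHot, stateDraining}

// worker is a simplified, assumed model of a managed worker.
type worker struct {
	id    int
	state workerState
	dead  bool // process has exited; must not be counted
}

// countLifecycle returns the number of live workers in each state.
// Dead workers are skipped so gauges never report capacity that is gone.
func countLifecycle(workers []worker) map[workerState]int {
	counts := make(map[workerState]int, len(allStates))
	for _, s := range allStates {
		counts[s] = 0
	}
	for _, w := range workers {
		if w.dead {
			continue
		}
		counts[w.state]++
	}
	return counts
}

func main() {
	workers := []worker{
		{id: 1, state: stateIdle},
		{id: 2, state: stateHot},
		{id: 3, state: stateHot, dead: true},
		{id: 4, state: stateActivating},
	}
	counts := countLifecycle(workers)
	for _, s := range allStates {
		fmt.Printf("%s=%d\n", s, counts[s])
	}
}
```

Recomputing every gauge from a full scan, rather than incrementing and decrementing on each transition, keeps the gauges from drifting if a transition is missed. Whether the PR does it this way is an assumption.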
efaf485 to 16f6b13
peakSessions is now tracked in FlightWorkerPool's AcquireWorker too. reservedAt only applies to the k8s warm-pool reservation flow so it gets a targeted nolint:unused.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
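A high-water mark such as peakSessions is typically updated whenever the active session count rises. The sketch below illustrates that pattern under assumptions: `managedWorker`, `acquire`, and `release` are hypothetical names, and locking is omitted, whereas the real pool presumably updates these fields while holding its mutex.

```go
package main

import "fmt"

// managedWorker is a simplified, hypothetical stand-in for ManagedWorker.
type managedWorker struct {
	activeSessions int
	peakSessions   int // highest value activeSessions has ever reached
}

// acquire records a new session and raises the high-water mark if needed.
func (w *managedWorker) acquire() {
	w.activeSessions++
	if w.activeSessions > w.peakSessions {
		w.peakSessions = w.activeSessions
	}
}

// release ends a session; the peak is intentionally left untouched.
func (w *managedWorker) release() {
	if w.activeSessions > 0 {
		w.activeSessions--
	}
}

func main() {
	w := &managedWorker{}
	w.acquire()
	w.acquire()
	w.release()
	w.release()
	w.acquire()
	fmt.Println(w.activeSessions, w.peakSessions) // prints "1 2"
}
```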
The multitenant-seed-kind recipe races with the control plane's config store migration. The deployment becomes "available" before the CP has finished creating the duckgres_orgs table via GORM auto-migrate. Add a retry loop (up to 30s) so the seed waits for the schema to exist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
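The fix is a bounded retry: keep probing for the table until it exists or the time budget runs out. The actual recipe is not shown here and is likely a shell loop, so the following is only a Go illustration of the same idea. `waitFor`, `failNTimes`, and the demo constants are hypothetical, and the check function stands in for a query against duckgres_orgs.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Demo values are kept short so the example finishes quickly.
// The commit describes a budget of up to 30s.
const (
	demoTimeout  = 200 * time.Millisecond
	demoInterval = 5 * time.Millisecond
)

// waitFor calls check until it succeeds or the timeout elapses.
// It returns nil on success, or the last error wrapped with the budget.
func waitFor(check func() error, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := check()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("gave up after %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// failNTimes returns a check that fails n times and then succeeds,
// simulating a schema that appears once the migration finishes.
func failNTimes(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("table duckgres_orgs does not exist yet")
		}
		return nil
	}
}

func main() {
	err := waitFor(failNTimes(3), demoTimeout, demoInterval)
	fmt.Println("seed can proceed:", err == nil)
}
```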
…ys-on

The reserve → activate → hot lifecycle is now the only path for shared warm workers. Previously, a --k8s-shared-warm-workers feature flag allowed workers to serve sessions in the reserved state without activation, but this was scaffolding from initial development that is no longer needed.

This also fixes a bug where the stuck-worker reaper would have force-retired healthy reserved workers after 2 minutes when the flag was off, since reserved was the steady-state serving lifecycle in that mode.

Removed:
- --k8s-shared-warm-workers CLI flag
- DUCKGRES_K8S_SHARED_WARM_WORKERS env var
- k8s.shared_warm_workers YAML config
- sharedWarmActivation field on K8sWorkerPool
- sharedWarmWorkers field on OrgReservedPool
- EnableSharedWarmActivation method

DUCKGRES_SHARED_WARM_WORKER=true is now set unconditionally on all worker pods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EDsCODE added a commit that referenced this pull request Mar 24, 2026
Resolve conflicts from warm-pool observability PR (#344) which removed SharedWarmWorkers flag. Keep AWSAccountID/AWSRegion and stsBroker additions. Fix NewOrgReservedPool call sites in new warm_pool_metrics_test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- 9 Prometheus metrics for the shared warm-worker lifecycle:
  - duckgres_warm_workers: idle (unassigned) workers gauge
  - duckgres_reserved_workers: reserved workers gauge
  - duckgres_activating_workers: activating workers gauge
  - duckgres_hot_workers: hot (tenant-bound) workers gauge
  - duckgres_draining_workers: draining workers gauge
  - duckgres_activation_duration_seconds: reservation-to-hot latency histogram
  - duckgres_activation_failures_total{reason}: activation failure counter
  - duckgres_worker_retirements_total{reason}: retirement counter with reason labels (normal, activation_failure, crash, shutdown, idle_timeout, stuck_activating)
  - duckgres_hot_worker_sessions_total: sessions served per hot worker at retirement
- Stuck-worker reaper: auto-retires workers stuck in reserved/activating state for more than 2 minutes, with automatic pool replenishment
- idleReaper always runs: no longer exits early when idleTimeout=0, enabling stuck-worker detection even without idle reaping
- reservedAt/peakSessions tracking on ManagedWorker for latency and session histograms
- 3 operational runbooks: drain hot workers, recover stuck activating workers, replenish capacity
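The stuck-worker check can be read as a filter over the pool: a worker qualifies if it is still in the reserved or activating state and its reservedAt timestamp is more than 2 minutes old. The sketch below shows only that selection step under assumptions. The `worker` struct, `stuckWorkerIDs`, and `demo` are hypothetical, and the real reaper additionally retires the selected workers and spawns replacements.

```go
package main

import (
	"fmt"
	"time"
)

type workerState string

const (
	stateReserved   workerState = "reserved"
	stateActivating workerState = "activating"
	stateHot        workerState = "hot"
)

// stuckThreshold mirrors the 2-minute limit stated in the summary.
const stuckThreshold = 2 * time.Minute

// worker is a simplified, assumed model of a managed worker.
type worker struct {
	id         int
	state      workerState
	reservedAt time.Time // zero if the worker was never reserved
}

// stuckWorkerIDs returns the IDs of workers that have sat in the
// reserved or activating state for longer than stuckThreshold.
func stuckWorkerIDs(workers []worker, now time.Time) []int {
	var ids []int
	for _, w := range workers {
		if w.state != stateReserved && w.state != stateActivating {
			continue // hot, idle, and draining workers are never "stuck"
		}
		if w.reservedAt.IsZero() {
			continue // no reservation timestamp to measure against
		}
		if now.Sub(w.reservedAt) > stuckThreshold {
			ids = append(ids, w.id)
		}
	}
	return ids
}

// demo builds a small pool relative to the current time and runs the check.
func demo() []int {
	now := time.Now()
	workers := []worker{
		{id: 1, state: stateActivating, reservedAt: now.Add(-30 * time.Second)}, // recently reserved
		{id: 2, state: stateActivating, reservedAt: now.Add(-5 * time.Minute)},  // stuck
		{id: 3, state: stateHot, reservedAt: now.Add(-10 * time.Minute)},        // healthy, already hot
	}
	return stuckWorkerIDs(workers, now)
}

func main() {
	fmt.Println(demo()) // prints "[2]"
}
```

Because the check depends only on state and reservedAt, it can run on every reaper tick regardless of whether idle reaping is enabled, which matches the change that keeps idleReaper running when idleTimeout=0.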
Test plan
- TestObserveWarmPoolLifecycleGauges: lifecycle gauge counting
- TestObserveWarmPoolLifecycleGauges_SkipsDeadWorkers: dead worker exclusion
- TestMarkWorkerRetiredLocked_RecordsRetirementMetric: retirement counter with reason
- TestMarkWorkerRetiredLocked_RecordsHotWorkerSessions: hot worker session histogram
- TestReservedAtTracking: reservedAt set during ReserveSharedWorker
- TestPeakSessionsTracking: peakSessions high-water mark
- TestReapStuckActivatingWorkers: stuck worker reaped + replacement spawned
- TestReapStuckActivatingWorkers_RecentlyReservedNotReaped: recently reserved protected
- controlplane test suite passes (66 seconds, all green)

🤖 Generated with Claude Code