Add warm-worker pool observability, stuck-worker reaper, and runbooks by bill-ph · Pull Request #344 · PostHog/duckgres

bill-ph · 2026-03-23T20:42:15Z

Summary

9 Prometheus metrics for the shared warm-worker lifecycle:
- duckgres_warm_workers — idle (unassigned) workers gauge
- duckgres_reserved_workers — reserved workers gauge
- duckgres_activating_workers — activating workers gauge
- duckgres_hot_workers — hot (tenant-bound) workers gauge
- duckgres_draining_workers — draining workers gauge
- duckgres_activation_duration_seconds — reservation-to-hot latency histogram
- duckgres_activation_failures_total{reason} — activation failure counter
- duckgres_worker_retirements_total{reason} — retirement counter with reason labels (normal, activation_failure, crash, shutdown, idle_timeout, stuck_activating)
- duckgres_hot_worker_sessions_total — sessions served per hot worker at retirement

- Stuck-worker reaper: auto-retires workers stuck in reserved/activating state >2 minutes, with automatic pool replenishment
- idleReaper always runs: no longer exits early when idleTimeout=0, enabling stuck-worker detection even without idle reaping
- reservedAt / peakSessions tracking on ManagedWorker for latency and session histograms
- 3 operational runbooks: drain hot workers, recover stuck activating workers, replenish capacity

Test plan

TestObserveWarmPoolLifecycleGauges — lifecycle gauge counting
TestObserveWarmPoolLifecycleGauges_SkipsDeadWorkers — dead worker exclusion
TestMarkWorkerRetiredLocked_RecordsRetirementMetric — retirement counter with reason
TestMarkWorkerRetiredLocked_RecordsHotWorkerSessions — hot worker session histogram
TestReservedAtTracking — reservedAt set during ReserveSharedWorker
TestPeakSessionsTracking — peakSessions high-water mark
TestReapStuckActivatingWorkers — stuck worker reaped + replacement spawned
TestReapStuckActivatingWorkers_RecentlyReservedNotReaped — recently reserved protected
Full controlplane test suite passes (66 seconds, all green)
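
For readers unfamiliar with the retirement metrics that the tests above exercise, here is a hedged sketch of the bookkeeping. It uses a plain map and slice instead of Prometheus collectors so that it runs without dependencies. `RetirementMetrics` and `MarkRetired` are hypothetical names, and the assumption that the session histogram is fed from `peakSessions` is inferred from the summary, not confirmed by the code.

```go
package main

import "fmt"

// RetirementMetrics is a dependency-free stand-in for a counter with a
// "reason" label plus a histogram of sessions per hot worker.
type RetirementMetrics struct {
	byReason          map[string]int
	hotWorkerSessions []int
}

// NewRetirementMetrics returns an empty metrics holder.
func NewRetirementMetrics() *RetirementMetrics {
	return &RetirementMetrics{byReason: make(map[string]int)}
}

// MarkRetired records one retirement. The session observation is only
// recorded for workers that reached the hot state (assumption).
func (m *RetirementMetrics) MarkRetired(reason string, wasHot bool, peakSessions int) {
	m.byReason[reason]++
	if wasHot {
		m.hotWorkerSessions = append(m.hotWorkerSessions, peakSessions)
	}
}

func main() {
	m := NewRetirementMetrics()
	m.MarkRetired("normal", true, 4)            // hot worker that served sessions
	m.MarkRetired("stuck_activating", false, 0) // never became hot
	fmt.Println(m.byReason, m.hotWorkerSessions)
}
```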

🤖 Generated with Claude Code

… runbooks

Add Prometheus metrics for the shared warm-worker lifecycle (idle, reserved, activating, hot, draining gauges), activation latency histogram, activation failure counter, retirement counter with reason labels, and hot-worker session histogram. Instrument k8s_pool.go and org_reserved_pool.go to emit metrics on state transitions.

Add automatic stuck-worker reaper that retires workers stuck in reserved/activating state >2 minutes and replenishes the pool. Extend idleReaper to always run for stuck-worker detection. Track reservedAt and peakSessions on ManagedWorker.

Include 3 operational runbooks (drain hot workers, recover stuck activating workers, replenish capacity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
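
The lifecycle gauges described in this commit amount to counting live workers per state. The sketch below shows that counting step only, under stated assumptions: `Worker`, its `Dead` flag, and `countLifecycle` are hypothetical, and a real implementation would set Prometheus gauges from these counts rather than return a map.

```go
package main

import "fmt"

// Worker is a simplified stand-in for a pool entry (hypothetical fields).
type Worker struct {
	State string
	Dead  bool // assumption: dead workers are tracked but must not be counted
}

// lifecycleStates lists the five gauge states named in the PR.
var lifecycleStates = []string{"idle", "reserved", "activating", "hot", "draining"}

// countLifecycle returns one count per lifecycle state. Every state is
// present in the result, so a gauge can be reset to 0 when no worker is in it.
func countLifecycle(workers []Worker) map[string]int {
	counts := make(map[string]int, len(lifecycleStates))
	for _, s := range lifecycleStates {
		counts[s] = 0
	}
	for _, w := range workers {
		if w.Dead {
			continue // skip dead workers
		}
		if _, ok := counts[w.State]; ok {
			counts[w.State]++
		}
	}
	return counts
}

func main() {
	pool := []Worker{
		{State: "idle"},
		{State: "hot"},
		{State: "hot", Dead: true},
		{State: "activating"},
	}
	counts := countLifecycle(pool)
	for _, s := range lifecycleStates {
		fmt.Printf("%s=%d\n", s, counts[s])
	}
}
```

Pre-seeding every state with zero matters for gauges: without it, a state that empties out would keep reporting its last non-zero value.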

peakSessions is now tracked in FlightWorkerPool's AcquireWorker too. reservedAt only applies to the k8s warm-pool reservation flow so it gets a targeted nolint:unused.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
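
A high-water mark such as peakSessions is typically updated at acquire time. The following is a minimal sketch of that pattern; the `Worker` type and its `Acquire`, `Release`, and `Peak` methods are hypothetical and are not the signatures of FlightWorkerPool.

```go
package main

import (
	"fmt"
	"sync"
)

// Worker tracks concurrent sessions and the highest concurrency seen so far.
type Worker struct {
	mu             sync.Mutex
	activeSessions int
	peakSessions   int
}

// Acquire registers a new session and raises the high-water mark if needed.
func (w *Worker) Acquire() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.activeSessions++
	if w.activeSessions > w.peakSessions {
		w.peakSessions = w.activeSessions
	}
}

// Release ends a session. The peak is intentionally left unchanged.
func (w *Worker) Release() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.activeSessions > 0 {
		w.activeSessions--
	}
}

// Peak returns the highest number of concurrent sessions observed.
func (w *Worker) Peak() int {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.peakSessions
}

func main() {
	w := &Worker{}
	w.Acquire()
	w.Acquire()
	w.Release()
	w.Acquire()
	fmt.Println("peak sessions:", w.Peak())
}
```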

The multitenant-seed-kind recipe races with the control plane's config store migration. The deployment becomes "available" before the CP has finished creating the duckgres_orgs table via GORM auto-migrate. Add a retry loop (up to 30s) so the seed waits for the schema to exist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
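
The retry idea in this commit is a bounded poll: keep checking for the schema until it exists or a deadline passes. The actual fix lives in the recipe itself and is not shown here; the Go sketch below only illustrates the pattern. `waitUntil` and `demoWait` are hypothetical helpers, and the check function stands in for a query that tests whether the duckgres_orgs table exists.

```go
package main

import (
	"fmt"
	"time"
)

// waitUntil polls check every interval until it returns true or the
// timeout would be exceeded by another sleep. It reports whether check succeeded.
func waitUntil(check func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return true
		}
		if time.Now().Add(interval).After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// demoWait simulates a schema that becomes available on the third check.
// It returns the outcome and the number of checks performed.
func demoWait() (bool, int) {
	attempts := 0
	schemaReady := func() bool {
		attempts++
		return attempts >= 3 // stand-in for "does the table exist yet?"
	}
	ok := waitUntil(schemaReady, time.Second, 10*time.Millisecond)
	return ok, attempts
}

func main() {
	// With the commit's numbers this would be a 30-second timeout;
	// the demo uses shorter values so it finishes quickly.
	ok, attempts := demoWait()
	fmt.Printf("ready=%v after %d checks\n", ok, attempts)
}
```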

…ys-on

The reserve → activate → hot lifecycle is now the only path for shared warm workers. Previously, a --k8s-shared-warm-workers feature flag allowed workers to serve sessions in the reserved state without activation, but this was scaffolding from initial development that is no longer needed.

This also fixes a bug where the stuck-worker reaper would have force-retired healthy reserved workers after 2 minutes when the flag was off, since reserved was the steady-state serving lifecycle in that mode.

Removed: --k8s-shared-warm-workers CLI flag, DUCKGRES_K8S_SHARED_WARM_WORKERS env var, k8s.shared_warm_workers YAML config, sharedWarmActivation field on K8sWorkerPool, sharedWarmWorkers field on OrgReservedPool, and EnableSharedWarmActivation method. DUCKGRES_SHARED_WARM_WORKER=true is now set unconditionally on all worker pods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
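
One way to picture the single remaining lifecycle is as a small transition table. The table below is an assumption pieced together from the state names in this PR, not the control plane's actual logic: only the reserve → activate → hot ordering is stated in the commit, while the draining and retired edges are guesses, and `canTransition` and `acceptsSessions` are hypothetical helpers.

```go
package main

import "fmt"

// State is a hypothetical lifecycle state for a shared warm worker.
type State string

const (
	Idle       State = "idle"
	Reserved   State = "reserved"
	Activating State = "activating"
	Hot        State = "hot"
	Draining   State = "draining"
	Retired    State = "retired"
)

// next lists the transitions assumed to be legal. Reserved can no longer
// skip activation, which is the behavior change this commit describes.
var next = map[State][]State{
	Idle:       {Reserved, Retired},
	Reserved:   {Activating, Retired},
	Activating: {Hot, Retired},
	Hot:        {Draining, Retired},
	Draining:   {Retired},
}

// canTransition reports whether moving from one state to another is allowed.
func canTransition(from, to State) bool {
	for _, s := range next[from] {
		if s == to {
			return true
		}
	}
	return false
}

// acceptsSessions reports whether a worker may take new sessions.
// With the feature flag removed, only hot workers qualify.
func acceptsSessions(s State) bool {
	return s == Hot
}

func main() {
	fmt.Println("reserved -> hot allowed:", canTransition(Reserved, Hot))
	fmt.Println("reserved accepts sessions:", acceptsSessions(Reserved))
	fmt.Println("hot accepts sessions:", acceptsSessions(Hot))
}
```

Under this model, a worker that sits in reserved or activating for a long time is never in a steady serving state, which is why reaping it after a timeout is safe once the flag is gone.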

Resolve conflicts from warm-pool observability PR (#344) which removed SharedWarmWorkers flag. Keep AWSAccountID/AWSRegion and stsBroker additions. Fix NewOrgReservedPool call sites in new warm_pool_metrics_test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bill-ph changed the title ~~Major architecture refactor: multi-tenant control plane and K8s support~~ Add warm-worker pool observability, stuck-worker reaper, and runbooks Mar 23, 2026

bill-ph force-pushed the claude/goofy-meitner branch from efaf485 to 16f6b13 March 23, 2026 21:46

bill-ph and others added 6 commits March 23, 2026 18:20

Fix activation failure metric cardinality

83149d0

Fix warm pool metric reporting

f377968

Harden flaky k8s integration tests

5a0c9cc

bill-ph merged commit 0ffcdc6 into main Mar 24, 2026
23 of 25 checks passed

bill-ph deleted the claude/goofy-meitner branch March 24, 2026 19:56

bill-ph merged 7 commits into main from claude/goofy-meitner
