Fence worker lifecycle cleanup transitions by bill-ph · Pull Request #615 · PostHog/duckgres

bill-ph · 2026-05-23T01:18:23Z

Summary:

add owner/epoch/snapshot fences to worker lifecycle cleanup transitions
prevent stale janitor/reaper/shutdown paths from deleting pods after CAS misses
enforce hot-idle org-cap checks and add regression coverage for stale snapshots and cleanup handoff

Testing:

just test-configstore-integration
just test-controlplane
just test-controlplane-k8s
just lint

- persistWorkerRecord: downgrade ErrWorkerRecordUpsertFenceMiss to Debug so expected lease-fence rejections don't pollute Warn-level alerts. - retireCurrentRuntimeWorker: bail when GetWorkerRecord returns a missing row — the downstream CAS would miss anyway and we shouldn't issue a pod delete with no runtime row authorizing it. - workerLostEligibleStates: document why draining is excluded so the shared terminal CAS's silent no-op on draining→lost is intentional.

* Add opaque WorkerSnapshot, WorkerLease, TransitionOutcome types These three types are the load-bearing API distinction for the lifecycle-service refactor. Their unexported internals mean callers outside controlplane/configstore cannot fabricate either: - WorkerSnapshot: a point-in-time observation returned only by store reads. The only argument accepted by cleanup-on-observed-row paths (orphan retire, hot-idle reaper, mismatched-version reaper). - WorkerLease: proof of current ownership at a specific epoch, returned only by claim/takeover/refresh paths. The only argument accepted by owned-op paths (Drain on shutdown, MarkLost from health check, the upcoming RefreshLease for cred refresh). - TransitionOutcome: a typed result with stable reason labels so PR 6 can hang per-image metrics off CAS-miss causes without having to grep log strings. No callers yet — this commit is pure addition. The next commits add store methods that produce these types and a WorkerLifecycle service that consumes them. * Add snapshot/lease-producing store methods Wraps the existing WorkerRecord-returning store methods in snapshot-typed counterparts (ObserveWorker, ListOrphanedWorkerSnapshots, ListExpiredHotIdleSnapshots, ListStuckWorkerSnapshots, ListWorkerRecordSnapshotsByStatesBefore) plus two lease constructors: NewWorkerLease for callers with explicit ownership info (e.g. the K8sWorkerPool's in-memory ManagedWorker), and LeaseFromClaimedRecord for the claim/takeover paths that have a freshly-returned record in hand. The legacy WorkerRecord-returning methods are deliberately preserved through PR 3 so each subsequent migration commit is independently revertible. PR 4 deletes them once every caller is on the new types. * Add WorkerLifecycle service shell The central typed lifecycle-transition API. Three snapshot-fenced methods (RetireFromSnapshot, RetireOrphanFromSnapshot, RetireIdleVariantFromSnapshot) plus four lease-fenced ones (Drain, RetireDrained, MarkLostFromLease, RefreshLease). On a successful CAS the service schedules pod/secret cleanup via a WorkerPhysicalCleanup interface; on a CAS miss it intentionally skips cleanup because we have no proof of ownership over the pod. The service has no callers yet — that's PR 3's commits 4-6. Wired against a narrow workerLifecycleStore subset so unit tests can mock the durable layer without spelling out the full RuntimeWorkerStore surface; the same surface will also let PR 4 prune unused store methods safely. Includes a NewWorkerSnapshotForTesting helper in configstore so external-package tests can drive lifecycle code with specific observed fields. The 'ForTesting' suffix is the signal that any production import of it is a bug. * Migrate mismatched-version reaper to WorkerLifecycle First caller migration. K8sWorkerPool gains a lifecycle service field (wired via NewWorkerLifecycle in the constructor, lazily wired in test fixtures) and the RuntimeWorkerStore interface gains ObserveWorker plus BumpWorkerEpoch so the typed seam is uniform across CAS paths. RetireOneMismatchedVersionWorker now flows through lifecycle.RetireIdleVariantFromSnapshot: the durable CAS is fenced by a fresh ObserveWorker snapshot, and on success the lifecycle service schedules pod + secret cleanup via DeleteWorkerArtifacts (the typed entry point that wraps the existing deleteRetiredRuntimeWorker plumbing). The reaper additionally issues a synchronous pod Delete to preserve its one-per-call contract — the second async delete is idempotent. Refactored NewWorkerSnapshotForTesting to take a full WorkerRecord so tests can preserve image/org_id alongside the other identity fields without listing each through a wide parameter list. * Migrate janitor hot-idle path to WorkerLifecycle When the janitor is wired with a *WorkerLifecycle (which multitenant.go now does, taking the shared K8s pool's instance), the hot-idle reaper flows through lifecycle.RetireFromSnapshot: ListExpiredHotIdleSnapshots -> RetireFromSnapshot (CAS + cleanup) instead of the older ListExpiredHotIdleWorkers -> RetireHotIdleWorker -> [ deleteRetiredWorker | retireLocalWorker fallback | retireWorker fallback ] triple-fallback. Same fences as before — the snapshot carries state/owner/epoch/updated_at, MarkWorkerTerminalIfCurrent enforces them — but the choreography is now in one place and the post-CAS pod cleanup is the lifecycle service's WorkerPhysicalCleanup callback. The legacy three-lambda path is retained for deployments that don't wire a lifecycle (no current callers, but it keeps the janitor buildable in isolation). PR 4 prunes it. * Migrate janitor orphan path to WorkerLifecycle When a lifecycle service is wired into the janitor, the orphan-cleanup loop now goes through: ListOrphanedWorkerSnapshots -> RetireOrphanFromSnapshot (CAS + cleanup) instead of the older ListOrphanedWorkers -> retireOrphanWorker lambda -> [pool's RetireOrphanWorker + retireWorkerPod] The lifecycle's RetireOrphanFromSnapshot already carries the orphan- specific CP-revival fence (via configstore.RetireOrphanWorker's NOT EXISTS clause) so the snapshot path keeps the same correctness guarantees while consolidating the choreography. Legacy lambda path retained behind 'if j.lifecycle == nil' until PR 4. * Migrate ReserveSharedWorker + SpawnMinWorkers retire paths to WorkerLifecycle retireClaimedWorker (the post-claim-failed retire used by image-mismatch, reserve-failure, and the SpawnMin* paths) now flows through lifecycle.RetireFromSnapshot when a lifecycle service is wired: NewWorkerSnapshot(claimed) -> RetireFromSnapshot (CAS + cleanup) The legacy direct-MarkWorkerTerminalIfCurrent path is retained behind a nil-lifecycle guard so the process backend keeps building, but the K8s pool always has a lifecycle (set in the constructor). retireCurrentRuntimeWorker — the SpawnMin failure refresh-then-retire — also moves to ObserveWorker (snapshot-typed) for the refresh step, keeping the type-safe seam consistent across all snapshot-fenced paths. Renamed NewWorkerSnapshotForTesting to NewWorkerSnapshot. The original name signaled 'test-only', but the production claim/reserve callers need to wrap freshly-issued records into snapshots too — the safety property they care about (Snapshot != Lease at the type level) is preserved, while paying for an extra ObserveWorker round-trip on a row we just CAS'd would be pure overhead. * Migrate ShutdownAll to lease-based Drain + RetireDrained The 3-step shutdown chain now goes through the typed lifecycle: lease := configstore.NewWorkerLease(w.ID, cpInstanceID, w.OwnerEpoch()) lifecycle.Drain(lease) -> [pod delete] -> lifecycle.RetireDrained(lease) The pod delete stays caller-orchestrated between the two CAS steps so a delete failure leaves the row in draining for the orphan sweep to reconcile — same property the original chain provided, now expressed in terms that can't accidentally swap a Snapshot in where a Lease belongs. Legacy direct-store calls retained behind 'if p.lifecycle == nil' for builds without a lifecycle service. PR 4 prunes. * Migrate cred-refresh to lifecycle.RefreshLease The credential-refresh path's two-step BumpWorkerEpoch + SetOwnerEpoch now goes through the typed seam: currentLease := configstore.NewWorkerLease(...) newLease, _ := lifecycle.RefreshLease(currentLease) worker.SetOwnerEpoch(newLease.OwnerEpoch()) Same underlying CAS (BumpWorkerEpoch); same in-memory propagation afterwards. What's gained is the typed lease — the cred-refresh path now produces a WorkerLease that downstream lifecycle calls could consume directly (no current callers, but the API is now composition-friendly). Note: this commit does not close the BumpWorkerEpoch ↔ ShutdownAll race I flagged in the #615 review. The race window between the durable bump and the in-memory SetOwnerEpoch persists, because no typed API can change the fact that the durable CAS returns *before* the in-memory copy can update. PR 5 (recovery) is the natural place to address it, either with a shared lock between the two operations or by making ShutdownAll observe a 'refresh in progress' flag. * Round 1 review fixes - Wire pool.lifecycle in shutdownTestPool so the full TestShutdownAll_* suite exercises the lifecycle-based Drain → pod-delete → RetireDrained chain (commit f1afa61) instead of only the legacy direct-store fallback. - Add TestControlPlaneJanitorOrphanGoesThroughLifecycleWhenWired, symmetric to the hot-idle lifecycle test, so the orphan migration (commit e9bca02) is no longer untested at the janitor level. - RetireOrphanFromSnapshot no longer mis-classifies any owned-snapshot CAS miss as 'fence_miss_cp_revived'. Without an extra GetWorkerRecord we can't distinguish state/owner/epoch/updated_at/CP-revival; report the generic fence-miss label and let PR 6 add a real classifier. Pinned with TestRetireOrphanFromSnapshotReportsGenericMissWithoutExtraRead. - Renamed lifecycle_testing.go → lifecycle_constructors.go now that NewWorkerSnapshot is used from production retireClaimedWorker as well as tests. - Dropped the unreferenced (WorkerSnapshot).recordPtr() accessor. - Honest comment on MarkLostFromLease about why the health-checker migration is deferred (the existing 'remove + replenish' threading needs prep work first). * Round 2 review fixes - Restore the nil-guard on p.runtimeStore in the mismatched-version reaper. Pre-PR code wrapped the runtime-store read in 'if p.runtimeStore != nil'; commit 7e0aa4a accidentally dropped the guard. Latent (production always wires the store) but easy to invert later if a test pool builder ever omits the store. - ShutdownAll now re-mints the WorkerLease at Step 3 (RetireDrained) from the worker's current in-memory OwnerEpoch, matching the legacy per-step w.OwnerEpoch() lookup. Without this, a cred-refresh bump landing between Drain and RetireDrained would CAS-miss the cached lease and leave the row in draining for the orphan sweep — the legacy path's late w.OwnerEpoch() read avoided that. - NewWorkerSnapshot no longer rewrites a zero UpdatedAt to time.Now(). The downstream SQL fence only adds the 'updated_at <= ?' predicate when UpdatedAt is non-zero (see store.MarkWorkerTerminalIfCurrent), so preserving zero correctly maps to 'no updated_at fence' instead of producing a forward-dated snapshot that any current row would fence-pass against. - RefreshLease docstring is honest about not closing the BumpWorkerEpoch + SetOwnerEpoch race. The typed lease is the value; the race is PR 5 territory (already noted in the PR description). * Round 3 review fixes - RetireIdleVariantFromSnapshot docstring updated: today's only caller is the mismatched-version reaper. The hot-idle TTL janitor was promoted to RetireFromSnapshot in commit fd136c8 since the list query already narrows candidates to state=hot_idle. - Removed unreferenced LeaseFromClaimedRecord. It was exported for a hypothetical claim-then-wrap call site that ended up using NewWorkerLease directly. PR 4 would have culled it anyway. - Preserved the legacy 'bump owner epoch for refresh' error wrapper inside the lifecycle branch so log-grep / dashboard filters that match on the previous wrapper string keep matching after the migration. ErrWorkerOwnerEpochMismatch is still threaded via %w. - Added TestRetireFromSnapshotTargetingLostSchedulesCleanup so the reason=RetireReasonCrash → target=Lost branch (used by retireClaimedWorker) is covered separately from the Retired target. * Round 4 review fixes - Deleted unreferenced newWorkerLeaseFromRecord helper. It was the package-internal companion to the now-deleted LeaseFromClaimedRecord; no callers remain. - Tightened TestRetireIdleVariantSkipsWhenSnapshotNotEligible's comment. The previous wording described a 'store says true → service overrides' scenario that the code doesn't actually produce — the state override only fires on !transitioned. The new wording matches what the test asserts.

bill-ph added 3 commits May 22, 2026 21:18

Fence worker lifecycle cleanup transitions

45b3c8b

Tighten worker lifecycle regression coverage

511a24b

bill-ph merged commit eb10d06 into main May 23, 2026
22 checks passed

bill-ph deleted the codex/pr2-worker-lifecycle-fencing branch May 23, 2026 22:54

bill-ph mentioned this pull request May 24, 2026

PR 3: central WorkerLifecycle transition API #616

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fence worker lifecycle cleanup transitions#615

Fence worker lifecycle cleanup transitions#615
bill-ph merged 3 commits into
mainfrom
codex/pr2-worker-lifecycle-fencing

bill-ph commented May 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bill-ph commented May 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant