Fence worker lifecycle cleanup transitions#615
Merged
Conversation
- persistWorkerRecord: downgrade ErrWorkerRecordUpsertFenceMiss to Debug so expected lease-fence rejections don't pollute Warn-level alerts. - retireCurrentRuntimeWorker: bail when GetWorkerRecord returns a missing row — the downstream CAS would miss anyway and we shouldn't issue a pod delete with no runtime row authorizing it. - workerLostEligibleStates: document why draining is excluded so the shared terminal CAS's silent no-op on draining→lost is intentional.
4 tasks
bill-ph
added a commit
that referenced
this pull request
May 24, 2026
* Add opaque WorkerSnapshot, WorkerLease, TransitionOutcome types
These three types are the load-bearing API distinction for the
lifecycle-service refactor. Their unexported internals mean callers
outside controlplane/configstore cannot fabricate either:
- WorkerSnapshot: a point-in-time observation returned only by store
reads. The only argument accepted by cleanup-on-observed-row paths
(orphan retire, hot-idle reaper, mismatched-version reaper).
- WorkerLease: proof of current ownership at a specific epoch, returned
only by claim/takeover/refresh paths. The only argument accepted by
owned-op paths (Drain on shutdown, MarkLost from health check, the
upcoming RefreshLease for cred refresh).
- TransitionOutcome: a typed result with stable reason labels so PR 6
can hang per-image metrics off CAS-miss causes without having to grep
log strings.
No callers yet — this commit is pure addition. The next commits add
store methods that produce these types and a WorkerLifecycle service
that consumes them.
* Add snapshot/lease-producing store methods
Wraps the existing WorkerRecord-returning store methods in
snapshot-typed counterparts (ObserveWorker, ListOrphanedWorkerSnapshots,
ListExpiredHotIdleSnapshots, ListStuckWorkerSnapshots,
ListWorkerRecordSnapshotsByStatesBefore) plus two lease constructors:
NewWorkerLease for callers with explicit ownership info (e.g. the
K8sWorkerPool's in-memory ManagedWorker), and LeaseFromClaimedRecord for
the claim/takeover paths that have a freshly-returned record in hand.
The legacy WorkerRecord-returning methods are deliberately preserved
through PR 3 so each subsequent migration commit is independently
revertible. PR 4 deletes them once every caller is on the new types.
* Add WorkerLifecycle service shell
The central typed lifecycle-transition API. Three snapshot-fenced
methods (RetireFromSnapshot, RetireOrphanFromSnapshot,
RetireIdleVariantFromSnapshot) plus four lease-fenced ones (Drain,
RetireDrained, MarkLostFromLease, RefreshLease). On a successful CAS
the service schedules pod/secret cleanup via a WorkerPhysicalCleanup
interface; on a CAS miss it intentionally skips cleanup because we
have no proof of ownership over the pod.
The service has no callers yet — that's PR 3's commits 4-6. Wired
against a narrow workerLifecycleStore subset so unit tests can mock
the durable layer without spelling out the full RuntimeWorkerStore
surface; the same surface will also let PR 4 prune unused store
methods safely.
Includes a NewWorkerSnapshotForTesting helper in configstore so
external-package tests can drive lifecycle code with specific
observed fields. The 'ForTesting' suffix is the signal that any
production import of it is a bug.
* Migrate mismatched-version reaper to WorkerLifecycle
First caller migration. K8sWorkerPool gains a lifecycle service field
(wired via NewWorkerLifecycle in the constructor, lazily wired in
test fixtures) and the RuntimeWorkerStore interface gains ObserveWorker
plus BumpWorkerEpoch so the typed seam is uniform across CAS paths.
RetireOneMismatchedVersionWorker now flows through
lifecycle.RetireIdleVariantFromSnapshot: the durable CAS is fenced by
a fresh ObserveWorker snapshot, and on success the lifecycle service
schedules pod + secret cleanup via DeleteWorkerArtifacts (the typed
entry point that wraps the existing deleteRetiredRuntimeWorker
plumbing). The reaper additionally issues a synchronous pod Delete to
preserve its one-per-call contract — the second async delete is
idempotent.
Refactored NewWorkerSnapshotForTesting to take a full WorkerRecord
so tests can preserve image/org_id alongside the other identity
fields without listing each through a wide parameter list.
* Migrate janitor hot-idle path to WorkerLifecycle
When the janitor is wired with a *WorkerLifecycle (which multitenant.go
now does, taking the shared K8s pool's instance), the hot-idle reaper
flows through lifecycle.RetireFromSnapshot:
ListExpiredHotIdleSnapshots -> RetireFromSnapshot (CAS + cleanup)
instead of the older
ListExpiredHotIdleWorkers -> RetireHotIdleWorker -> [
deleteRetiredWorker | retireLocalWorker fallback | retireWorker fallback
]
triple-fallback. Same fences as before — the snapshot carries
state/owner/epoch/updated_at, MarkWorkerTerminalIfCurrent enforces them
— but the choreography is now in one place and the post-CAS pod cleanup
is the lifecycle service's WorkerPhysicalCleanup callback.
The legacy three-lambda path is retained for deployments that don't
wire a lifecycle (no current callers, but it keeps the janitor
buildable in isolation). PR 4 prunes it.
* Migrate janitor orphan path to WorkerLifecycle
When a lifecycle service is wired into the janitor, the orphan-cleanup
loop now goes through:
ListOrphanedWorkerSnapshots -> RetireOrphanFromSnapshot (CAS + cleanup)
instead of the older
ListOrphanedWorkers -> retireOrphanWorker lambda -> [pool's RetireOrphanWorker + retireWorkerPod]
The lifecycle's RetireOrphanFromSnapshot already carries the orphan-
specific CP-revival fence (via configstore.RetireOrphanWorker's
NOT EXISTS clause) so the snapshot path keeps the same correctness
guarantees while consolidating the choreography.
Legacy lambda path retained behind 'if j.lifecycle == nil' until PR 4.
* Migrate ReserveSharedWorker + SpawnMinWorkers retire paths to WorkerLifecycle
retireClaimedWorker (the post-claim-failed retire used by image-mismatch,
reserve-failure, and the SpawnMin* paths) now flows through
lifecycle.RetireFromSnapshot when a lifecycle service is wired:
NewWorkerSnapshot(claimed) -> RetireFromSnapshot (CAS + cleanup)
The legacy direct-MarkWorkerTerminalIfCurrent path is retained behind a
nil-lifecycle guard so the process backend keeps building, but the K8s
pool always has a lifecycle (set in the constructor).
retireCurrentRuntimeWorker — the SpawnMin failure refresh-then-retire —
also moves to ObserveWorker (snapshot-typed) for the refresh step,
keeping the type-safe seam consistent across all snapshot-fenced paths.
Renamed NewWorkerSnapshotForTesting to NewWorkerSnapshot. The original
name signaled 'test-only', but the production claim/reserve callers
need to wrap freshly-issued records into snapshots too — the safety
property they care about (Snapshot != Lease at the type level) is
preserved, while paying for an extra ObserveWorker round-trip on a row
we just CAS'd would be pure overhead.
* Migrate ShutdownAll to lease-based Drain + RetireDrained
The 3-step shutdown chain now goes through the typed lifecycle:
lease := configstore.NewWorkerLease(w.ID, cpInstanceID, w.OwnerEpoch())
lifecycle.Drain(lease) -> [pod delete] -> lifecycle.RetireDrained(lease)
The pod delete stays caller-orchestrated between the two CAS steps so a
delete failure leaves the row in draining for the orphan sweep to
reconcile — same property the original chain provided, now expressed in
terms that can't accidentally swap a Snapshot in where a Lease belongs.
Legacy direct-store calls retained behind 'if p.lifecycle == nil' for
builds without a lifecycle service. PR 4 prunes.
* Migrate cred-refresh to lifecycle.RefreshLease
The credential-refresh path's two-step BumpWorkerEpoch + SetOwnerEpoch
now goes through the typed seam:
currentLease := configstore.NewWorkerLease(...)
newLease, _ := lifecycle.RefreshLease(currentLease)
worker.SetOwnerEpoch(newLease.OwnerEpoch())
Same underlying CAS (BumpWorkerEpoch); same in-memory propagation
afterwards. What's gained is the typed lease — the cred-refresh path
now produces a WorkerLease that downstream lifecycle calls could
consume directly (no current callers, but the API is now
composition-friendly).
Note: this commit does not close the BumpWorkerEpoch ↔ ShutdownAll
race I flagged in the #615 review. The race window between the
durable bump and the in-memory SetOwnerEpoch persists, because no
typed API can change the fact that the durable CAS returns *before*
the in-memory copy can update. PR 5 (recovery) is the natural place
to address it, either with a shared lock between the two operations
or by making ShutdownAll observe a 'refresh in progress' flag.
* Round 1 review fixes
- Wire pool.lifecycle in shutdownTestPool so the full TestShutdownAll_*
suite exercises the lifecycle-based Drain → pod-delete → RetireDrained
chain (commit f1afa61) instead of only the legacy direct-store fallback.
- Add TestControlPlaneJanitorOrphanGoesThroughLifecycleWhenWired,
symmetric to the hot-idle lifecycle test, so the orphan migration
(commit e9bca02) is no longer untested at the janitor level.
- RetireOrphanFromSnapshot no longer mis-classifies any owned-snapshot
CAS miss as 'fence_miss_cp_revived'. Without an extra GetWorkerRecord
we can't distinguish state/owner/epoch/updated_at/CP-revival; report
the generic fence-miss label and let PR 6 add a real classifier.
Pinned with TestRetireOrphanFromSnapshotReportsGenericMissWithoutExtraRead.
- Renamed lifecycle_testing.go → lifecycle_constructors.go now that
NewWorkerSnapshot is used from production retireClaimedWorker as well
as tests.
- Dropped the unreferenced (WorkerSnapshot).recordPtr() accessor.
- Honest comment on MarkLostFromLease about why the health-checker
migration is deferred (the existing 'remove + replenish' threading
needs prep work first).
* Round 2 review fixes
- Restore the nil-guard on p.runtimeStore in the mismatched-version
reaper. Pre-PR code wrapped the runtime-store read in
'if p.runtimeStore != nil'; commit 7e0aa4a accidentally dropped the
guard. Latent (production always wires the store) but easy to invert
later if a test pool builder ever omits the store.
- ShutdownAll now re-mints the WorkerLease at Step 3 (RetireDrained)
from the worker's current in-memory OwnerEpoch, matching the legacy
per-step w.OwnerEpoch() lookup. Without this, a cred-refresh bump
landing between Drain and RetireDrained would CAS-miss the cached
lease and leave the row in draining for the orphan sweep — the legacy
path's late w.OwnerEpoch() read avoided that.
- NewWorkerSnapshot no longer rewrites a zero UpdatedAt to time.Now().
The downstream SQL fence only adds the 'updated_at <= ?' predicate
when UpdatedAt is non-zero (see store.MarkWorkerTerminalIfCurrent),
so preserving zero correctly maps to 'no updated_at fence' instead
of producing a forward-dated snapshot that any current row would
fence-pass against.
- RefreshLease docstring is honest about not closing the
BumpWorkerEpoch + SetOwnerEpoch race. The typed lease is the value;
the race is PR 5 territory (already noted in the PR description).
* Round 3 review fixes
- RetireIdleVariantFromSnapshot docstring updated: today's only caller
is the mismatched-version reaper. The hot-idle TTL janitor was
promoted to RetireFromSnapshot in commit fd136c8 since the list
query already narrows candidates to state=hot_idle.
- Removed unreferenced LeaseFromClaimedRecord. It was exported for a
hypothetical claim-then-wrap call site that ended up using
NewWorkerLease directly. PR 4 would have culled it anyway.
- Preserved the legacy 'bump owner epoch for refresh' error wrapper
inside the lifecycle branch so log-grep / dashboard filters that
match on the previous wrapper string keep matching after the
migration. ErrWorkerOwnerEpochMismatch is still threaded via %w.
- Added TestRetireFromSnapshotTargetingLostSchedulesCleanup so the
reason=RetireReasonCrash → target=Lost branch (used by
retireClaimedWorker) is covered separately from the Retired target.
* Round 4 review fixes
- Deleted unreferenced newWorkerLeaseFromRecord helper. It was the
package-internal companion to the now-deleted LeaseFromClaimedRecord;
no callers remain.
- Tightened TestRetireIdleVariantSkipsWhenSnapshotNotEligible's comment.
The previous wording described a 'store says true → service overrides'
scenario that the code doesn't actually produce — the state override
only fires on !transitioned. The new wording matches what the test
asserts.
This was referenced May 24, 2026
Draft
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Testing: