Skip to content

Fence worker lifecycle cleanup transitions#615

Merged
bill-ph merged 3 commits into
mainfrom
codex/pr2-worker-lifecycle-fencing
May 23, 2026
Merged

Fence worker lifecycle cleanup transitions#615
bill-ph merged 3 commits into
mainfrom
codex/pr2-worker-lifecycle-fencing

Conversation

@bill-ph
Copy link
Copy Markdown
Collaborator

@bill-ph bill-ph commented May 23, 2026

Summary:

  • add owner/epoch/snapshot fences to worker lifecycle cleanup transitions
  • prevent stale janitor/reaper/shutdown paths from deleting pods after CAS misses
  • enforce hot-idle org-cap checks and add regression coverage for stale snapshots and cleanup handoff

Testing:

  • just test-configstore-integration
  • just test-controlplane
  • just test-controlplane-k8s
  • just lint

bill-ph added 3 commits May 22, 2026 21:18
- persistWorkerRecord: downgrade ErrWorkerRecordUpsertFenceMiss to Debug
  so expected lease-fence rejections don't pollute Warn-level alerts.
- retireCurrentRuntimeWorker: bail when GetWorkerRecord returns a missing
  row — the downstream CAS would miss anyway and we shouldn't issue a
  pod delete with no runtime row authorizing it.
- workerLostEligibleStates: document why draining is excluded so the
  shared terminal CAS's silent no-op on draining→lost is intentional.
@bill-ph bill-ph merged commit eb10d06 into main May 23, 2026
22 checks passed
@bill-ph bill-ph deleted the codex/pr2-worker-lifecycle-fencing branch May 23, 2026 22:54
bill-ph added a commit that referenced this pull request May 24, 2026
* Add opaque WorkerSnapshot, WorkerLease, TransitionOutcome types

These three types are the load-bearing API distinction for the
lifecycle-service refactor. Their unexported internals mean callers
outside controlplane/configstore cannot fabricate either:

- WorkerSnapshot: a point-in-time observation returned only by store
  reads. The only argument accepted by cleanup-on-observed-row paths
  (orphan retire, hot-idle reaper, mismatched-version reaper).

- WorkerLease: proof of current ownership at a specific epoch, returned
  only by claim/takeover/refresh paths. The only argument accepted by
  owned-op paths (Drain on shutdown, MarkLost from health check, the
  upcoming RefreshLease for cred refresh).

- TransitionOutcome: a typed result with stable reason labels so PR 6
  can hang per-image metrics off CAS-miss causes without having to grep
  log strings.

No callers yet — this commit is pure addition. The next commits add
store methods that produce these types and a WorkerLifecycle service
that consumes them.

* Add snapshot/lease-producing store methods

Wraps the existing WorkerRecord-returning store methods in
snapshot-typed counterparts (ObserveWorker, ListOrphanedWorkerSnapshots,
ListExpiredHotIdleSnapshots, ListStuckWorkerSnapshots,
ListWorkerRecordSnapshotsByStatesBefore) plus two lease constructors:
NewWorkerLease for callers with explicit ownership info (e.g. the
K8sWorkerPool's in-memory ManagedWorker), and LeaseFromClaimedRecord for
the claim/takeover paths that have a freshly-returned record in hand.

The legacy WorkerRecord-returning methods are deliberately preserved
through PR 3 so each subsequent migration commit is independently
revertible. PR 4 deletes them once every caller is on the new types.

* Add WorkerLifecycle service shell

The central typed lifecycle-transition API. Three snapshot-fenced
methods (RetireFromSnapshot, RetireOrphanFromSnapshot,
RetireIdleVariantFromSnapshot) plus four lease-fenced ones (Drain,
RetireDrained, MarkLostFromLease, RefreshLease). On a successful CAS
the service schedules pod/secret cleanup via a WorkerPhysicalCleanup
interface; on a CAS miss it intentionally skips cleanup because we
have no proof of ownership over the pod.

The service has no callers yet — that's PR 3's commits 4-6. Wired
against a narrow workerLifecycleStore subset so unit tests can mock
the durable layer without spelling out the full RuntimeWorkerStore
surface; the same surface will also let PR 4 prune unused store
methods safely.

Includes a NewWorkerSnapshotForTesting helper in configstore so
external-package tests can drive lifecycle code with specific
observed fields. The 'ForTesting' suffix is the signal that any
production import of it is a bug.

* Migrate mismatched-version reaper to WorkerLifecycle

First caller migration. K8sWorkerPool gains a lifecycle service field
(wired via NewWorkerLifecycle in the constructor, lazily wired in
test fixtures) and the RuntimeWorkerStore interface gains ObserveWorker
plus BumpWorkerEpoch so the typed seam is uniform across CAS paths.

RetireOneMismatchedVersionWorker now flows through
lifecycle.RetireIdleVariantFromSnapshot: the durable CAS is fenced by
a fresh ObserveWorker snapshot, and on success the lifecycle service
schedules pod + secret cleanup via DeleteWorkerArtifacts (the typed
entry point that wraps the existing deleteRetiredRuntimeWorker
plumbing). The reaper additionally issues a synchronous pod Delete to
preserve its one-per-call contract — the second async delete is
idempotent.

Refactored NewWorkerSnapshotForTesting to take a full WorkerRecord
so tests can preserve image/org_id alongside the other identity
fields without listing each through a wide parameter list.

* Migrate janitor hot-idle path to WorkerLifecycle

When the janitor is wired with a *WorkerLifecycle (which multitenant.go
now does, taking the shared K8s pool's instance), the hot-idle reaper
flows through lifecycle.RetireFromSnapshot:

    ListExpiredHotIdleSnapshots -> RetireFromSnapshot (CAS + cleanup)

instead of the older

    ListExpiredHotIdleWorkers -> RetireHotIdleWorker -> [
        deleteRetiredWorker | retireLocalWorker fallback | retireWorker fallback
    ]

triple-fallback. Same fences as before — the snapshot carries
state/owner/epoch/updated_at, MarkWorkerTerminalIfCurrent enforces them
— but the choreography is now in one place and the post-CAS pod cleanup
is the lifecycle service's WorkerPhysicalCleanup callback.

The legacy three-lambda path is retained for deployments that don't
wire a lifecycle (no current callers, but it keeps the janitor
buildable in isolation). PR 4 prunes it.

* Migrate janitor orphan path to WorkerLifecycle

When a lifecycle service is wired into the janitor, the orphan-cleanup
loop now goes through:

    ListOrphanedWorkerSnapshots -> RetireOrphanFromSnapshot (CAS + cleanup)

instead of the older

    ListOrphanedWorkers -> retireOrphanWorker lambda -> [pool's RetireOrphanWorker + retireWorkerPod]

The lifecycle's RetireOrphanFromSnapshot already carries the orphan-
specific CP-revival fence (via configstore.RetireOrphanWorker's
NOT EXISTS clause) so the snapshot path keeps the same correctness
guarantees while consolidating the choreography.

Legacy lambda path retained behind 'if j.lifecycle == nil' until PR 4.

* Migrate ReserveSharedWorker + SpawnMinWorkers retire paths to WorkerLifecycle

retireClaimedWorker (the post-claim-failed retire used by image-mismatch,
reserve-failure, and the SpawnMin* paths) now flows through
lifecycle.RetireFromSnapshot when a lifecycle service is wired:

    NewWorkerSnapshot(claimed) -> RetireFromSnapshot (CAS + cleanup)

The legacy direct-MarkWorkerTerminalIfCurrent path is retained behind a
nil-lifecycle guard so the process backend keeps building, but the K8s
pool always has a lifecycle (set in the constructor).

retireCurrentRuntimeWorker — the SpawnMin failure refresh-then-retire —
also moves to ObserveWorker (snapshot-typed) for the refresh step,
keeping the type-safe seam consistent across all snapshot-fenced paths.

Renamed NewWorkerSnapshotForTesting to NewWorkerSnapshot. The original
name signaled 'test-only', but the production claim/reserve callers
need to wrap freshly-issued records into snapshots too — the safety
property they care about (Snapshot != Lease at the type level) is
preserved, while paying for an extra ObserveWorker round-trip on a row
we just CAS'd would be pure overhead.

* Migrate ShutdownAll to lease-based Drain + RetireDrained

The 3-step shutdown chain now goes through the typed lifecycle:

    lease := configstore.NewWorkerLease(w.ID, cpInstanceID, w.OwnerEpoch())
    lifecycle.Drain(lease) -> [pod delete] -> lifecycle.RetireDrained(lease)

The pod delete stays caller-orchestrated between the two CAS steps so a
delete failure leaves the row in draining for the orphan sweep to
reconcile — same property the original chain provided, now expressed in
terms that can't accidentally swap a Snapshot in where a Lease belongs.

Legacy direct-store calls retained behind 'if p.lifecycle == nil' for
builds without a lifecycle service. PR 4 prunes.

* Migrate cred-refresh to lifecycle.RefreshLease

The credential-refresh path's two-step BumpWorkerEpoch + SetOwnerEpoch
now goes through the typed seam:

    currentLease := configstore.NewWorkerLease(...)
    newLease, _ := lifecycle.RefreshLease(currentLease)
    worker.SetOwnerEpoch(newLease.OwnerEpoch())

Same underlying CAS (BumpWorkerEpoch); same in-memory propagation
afterwards. What's gained is the typed lease — the cred-refresh path
now produces a WorkerLease that downstream lifecycle calls could
consume directly (no current callers, but the API is now
composition-friendly).

Note: this commit does not close the BumpWorkerEpoch ↔ ShutdownAll
race I flagged in the #615 review. The race window between the
durable bump and the in-memory SetOwnerEpoch persists, because no
typed API can change the fact that the durable CAS returns *before*
the in-memory copy can update. PR 5 (recovery) is the natural place
to address it, either with a shared lock between the two operations
or by making ShutdownAll observe a 'refresh in progress' flag.

* Round 1 review fixes

- Wire pool.lifecycle in shutdownTestPool so the full TestShutdownAll_*
  suite exercises the lifecycle-based Drain → pod-delete → RetireDrained
  chain (commit f1afa61) instead of only the legacy direct-store fallback.
- Add TestControlPlaneJanitorOrphanGoesThroughLifecycleWhenWired,
  symmetric to the hot-idle lifecycle test, so the orphan migration
  (commit e9bca02) is no longer untested at the janitor level.
- RetireOrphanFromSnapshot no longer mis-classifies any owned-snapshot
  CAS miss as 'fence_miss_cp_revived'. Without an extra GetWorkerRecord
  we can't distinguish state/owner/epoch/updated_at/CP-revival; report
  the generic fence-miss label and let PR 6 add a real classifier.
  Pinned with TestRetireOrphanFromSnapshotReportsGenericMissWithoutExtraRead.
- Renamed lifecycle_testing.go → lifecycle_constructors.go now that
  NewWorkerSnapshot is used from production retireClaimedWorker as well
  as tests.
- Dropped the unreferenced (WorkerSnapshot).recordPtr() accessor.
- Honest comment on MarkLostFromLease about why the health-checker
  migration is deferred (the existing 'remove + replenish' threading
  needs prep work first).

* Round 2 review fixes

- Restore the nil-guard on p.runtimeStore in the mismatched-version
  reaper. Pre-PR code wrapped the runtime-store read in
  'if p.runtimeStore != nil'; commit 7e0aa4a accidentally dropped the
  guard. Latent (production always wires the store) but easy to invert
  later if a test pool builder ever omits the store.
- ShutdownAll now re-mints the WorkerLease at Step 3 (RetireDrained)
  from the worker's current in-memory OwnerEpoch, matching the legacy
  per-step w.OwnerEpoch() lookup. Without this, a cred-refresh bump
  landing between Drain and RetireDrained would CAS-miss the cached
  lease and leave the row in draining for the orphan sweep — the legacy
  path's late w.OwnerEpoch() read avoided that.
- NewWorkerSnapshot no longer rewrites a zero UpdatedAt to time.Now().
  The downstream SQL fence only adds the 'updated_at <= ?' predicate
  when UpdatedAt is non-zero (see store.MarkWorkerTerminalIfCurrent),
  so preserving zero correctly maps to 'no updated_at fence' instead
  of producing a forward-dated snapshot that any current row would
  fence-pass against.
- RefreshLease docstring is honest about not closing the
  BumpWorkerEpoch + SetOwnerEpoch race. The typed lease is the value;
  the race is PR 5 territory (already noted in the PR description).

* Round 3 review fixes

- RetireIdleVariantFromSnapshot docstring updated: today's only caller
  is the mismatched-version reaper. The hot-idle TTL janitor was
  promoted to RetireFromSnapshot in commit fd136c8 since the list
  query already narrows candidates to state=hot_idle.
- Removed unreferenced LeaseFromClaimedRecord. It was exported for a
  hypothetical claim-then-wrap call site that ended up using
  NewWorkerLease directly. PR 4 would have culled it anyway.
- Preserved the legacy 'bump owner epoch for refresh' error wrapper
  inside the lifecycle branch so log-grep / dashboard filters that
  match on the previous wrapper string keep matching after the
  migration. ErrWorkerOwnerEpochMismatch is still threaded via %w.
- Added TestRetireFromSnapshotTargetingLostSchedulesCleanup so the
  reason=RetireReasonCrash → target=Lost branch (used by
  retireClaimedWorker) is covered separately from the Retired target.

* Round 4 review fixes

- Deleted unreferenced newWorkerLeaseFromRecord helper. It was the
  package-internal companion to the now-deleted LeaseFromClaimedRecord;
  no callers remain.
- Tightened TestRetireIdleVariantSkipsWhenSnapshotNotEligible's comment.
  The previous wording described a 'store says true → service overrides'
  scenario that the code doesn't actually produce — the state override
  only fires on !transitioned. The new wording matches what the test
  asserts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant