PR 4: privatize unsafe lifecycle transition paths#617
Merged
Conversation
PR 3 left every migrated caller with a 'if lifecycle == nil { legacy }'
fallback so each migration commit was independently revertible. With PR
3 merged the fallbacks are dead in production — this commit deletes
them, plus the orphan janitor lambdas they used to call into:
- k8s_pool.go retireClaimedWorker: drop direct MarkWorkerTerminalIfCurrent
+ deleteRetiredRuntimeWorker branch; lifecycle is the only path.
- k8s_pool.go ShutdownAll: drop the direct MarkWorkerDraining and
RetireDrainingWorker fallbacks; only the lease-based Drain →
RetireDrained chain remains.
- shared_worker_activator.go RefreshWorkerCredentials: drop direct
BumpWorkerEpoch fallback; the typed RefreshLease is the only path.
- janitor.go: orphan, stuck-spawning/activating, and hot-idle reapers
all flow through the lifecycle. Stuck-worker path migrates from the
retireWorker lambda to lifecycle.RetireFromSnapshot, which lets us
drop retireRuntimeWorker / retireWorker / retireOrphanWorker /
retireLocalWorker / deleteRetiredWorker entirely.
- multitenant.go: corresponding lambda wires deleted; only the
lifecycle handoff remains.
- janitor's controlPlaneExpiryStore: drop ListOrphanedWorkers,
ListStuckWorkers, ListExpiredHotIdleWorkers, RetireHotIdleWorker —
the snapshot-typed counterparts are the only callers.
- K8sWorkerPool.ensureLifecycle(): lazy-init for test fixtures that
bypass the constructor and set runtimeStore directly. Production
always wires lifecycle eagerly in newK8sWorkerPool.
- janitor tests: TestControlPlaneJanitorDeletesHotIdlePodAfterStoreRetire
deleted (covered by the lifecycle-path test);
TestControlPlaneJanitorSkipsHotIdlePodDeleteAfterStoreMiss rewritten
to assert via lifecycle outcome. Other janitor tests migrate from
retireWorker callbacks to lifecycle-store assertions.
Worker-id-only lifecycle CAS methods are now reachable only through the WorkerLifecycle service. Three methods that were on the public RuntimeWorkerStore interface but only called via lifecycle are moved off the interface: - MarkWorkerDraining (workerID, cpID, epoch) - RetireDrainingWorker (workerID, cpID, epoch, reason) - BumpWorkerEpoch (workerID, cpID, epoch) They stay on *configstore.ConfigStore as concrete methods and are extracted via a type assertion to workerLifecycleStore when wiring the lifecycle in newK8sWorkerPool / ensureLifecycle. Tests that construct the pool struct directly satisfy workerLifecycleStore on their mock and the same assertion picks them up. MarkWorkerLostIfCurrentLease stays on RuntimeWorkerStore for one more cycle — the health-checker still calls it directly; PR 5 will migrate that path to lifecycle.MarkLostFromLease. Also removed RetireIdleWorker entirely. It was on the interface and implemented on ConfigStore + the mock, but had no callers anywhere in the tree — RetireIdleOrHotIdleWorker covered every site that used to hit it.
- ensureLifecycle() now guards lazy initialization with sync.Once to prevent the data-race the round-1 reviewer flagged. Production paths go through the eager constructor; tests that set runtimeStore after the fact now safely have multiple goroutines hit a lifecycle-using method concurrently. - TestK8sPool_ShutdownAll now wires a runtime store so the lifecycle path actually executes its 3-step CAS chain + pod deletes. Previous test passed vacuously on the trailing 'p.workers = preserved' line without verifying any of the new behavior. - TestControlPlaneJanitorRunRetiresOrphanedAndStuckWorkers now asserts that the snapshot the janitor passes to RetireOrphanFromSnapshot / RetireFromSnapshot carries the listed row's observed state — a regression that drops state from the snapshot in transit (the SQL fence keys off it) gets caught here. - janitor.go nil-lifecycle guard upgraded from silent skip with a misleading 'older non-K8s deployments' comment to a slog.Error log. The only janitor constructor wires lifecycle unconditionally; nil here is a wiring bug. - Deleted dead K8sWorkerPool.retireOrphanWorker method and removed BumpWorkerEpoch from credentialExpiryStore — neither has live callers after the round-1 ShutdownAll test fix lands.
- NewSharedWorkerActivator now borrows the pool's lifecycle via ensureLifecycle() instead of snapshotting the raw field. Tests that wire runtimeStore after construction (which the K8sWorkerPool handles via lazy init) get the same wiring discipline on the activator. - RuntimeWorkerStore narrowed further: RetireIdleOrHotIdleWorker, RetireOrphanWorker, and MarkWorkerTerminalIfCurrent are off the public interface. They had zero callers via that interface — only through workerLifecycleStore — so leaving them on RuntimeWorkerStore invited future regressions that bypass the typed lease seam. - Janitor's nil-lifecycle slog.Error is now once-per-process via sync.Once. A static wiring bug shouldn't log at the janitor tick rate forever; loud once is sufficient. - Stale 'RetireIdleWorker' comment in the k8s integration test retargeted to RetireIdleOrHotIdleWorker.
Guard MarkCredentialsRefreshed against a nil runtimeStore. The NewSharedWorkerActivator constructor independently asserts credentialExpiryStore (sets a.runtimeStore) and workerLifecycleStore (via shared.ensureLifecycle() for a.lifecycle). Today's production ConfigStore satisfies both, but the two assertions are independent — a future partial store implementation could leave a.runtimeStore nil while a.lifecycle is set, and RefreshCredentials would nil-deref at the MarkCredentialsRefreshed call. The expiry stamp was already best-effort (skipping it means the next refresh tick sees a stale expiry and refreshes again), so guarding it is a clean fix that preserves intent.
This was referenced May 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PR 4 of the worker-lifecycle redesign. Net: +163/-434 LOC across 8 files — mostly deletions, no new behavior.
PR 3 introduced the typed lifecycle service but left every migrated call site with a `if lifecycle == nil { legacy }` fallback so each migration commit was independently revertible. With PR 3 merged those fallbacks are dead in production. This PR removes them and narrows the public store interface so the typed seam is the obvious path.
What lands in this PR
1. Delete legacy `lifecycle == nil` fallback branches
2. Delete unused janitor lambda plumbing
All four janitor lambdas (`retireWorker`, `retireOrphanWorker`, `retireLocalWorker`, `deleteRetiredWorker`) and their corresponding `multitenant.go` wires are gone. The `retireRuntimeWorker` helper that wrapped them is gone. The `controlPlaneExpiryStore` interface sheds `ListOrphanedWorkers`, `ListStuckWorkers`, `ListExpiredHotIdleWorkers`, `RetireHotIdleWorker` — the snapshot-typed variants are the only callers now.
3. Narrow `RuntimeWorkerStore` interface
Three worker-id-only CAS methods are moved off the public interface. They stay on `*configstore.ConfigStore` and are extracted via type assertion to the package-private `workerLifecycleStore` when wiring the lifecycle service:
A caller with only a `RuntimeWorkerStore` reference cannot reach these methods directly — they have to go through `WorkerLifecycle`. That's the PR 4 hardening goal stated in the roadmap.
`MarkWorkerLostIfCurrentLease` stays on `RuntimeWorkerStore` because the health-checker (`markWorkerLostForHealthLease`) still calls it directly; PR 5 will migrate that path to `lifecycle.MarkLostFromLease`.
`RetireIdleWorker` is removed entirely — it had no callers anywhere in the tree.
4. `K8sWorkerPool.ensureLifecycle()`
Lazy initialization for tests that bypass `newK8sWorkerPool` and assign `pool.runtimeStore` directly. Production always wires the lifecycle eagerly in the constructor. The 42 existing tests that pattern `pool.runtimeStore = store` continue to work unchanged; the lifecycle is created on first lifecycle-using call.
5. Test cleanup
What stays
Testing
Out of scope
🤖 Generated with Claude Code