Skip to content

PR 4: privatize unsafe lifecycle transition paths#617

Merged
bill-ph merged 5 commits into
mainfrom
codex/pr4-deletion
May 24, 2026
Merged

PR 4: privatize unsafe lifecycle transition paths#617
bill-ph merged 5 commits into
mainfrom
codex/pr4-deletion

Conversation

@bill-ph
Copy link
Copy Markdown
Collaborator

@bill-ph bill-ph commented May 24, 2026

Summary

PR 4 of the worker-lifecycle redesign. Net: +163/-434 LOC across 8 files — mostly deletions, no new behavior.

PR 3 introduced the typed lifecycle service but left every migrated call site with a `if lifecycle == nil { legacy }` fallback so each migration commit was independently revertible. With PR 3 merged those fallbacks are dead in production. This PR removes them and narrows the public store interface so the typed seam is the obvious path.

What lands in this PR

1. Delete legacy `lifecycle == nil` fallback branches

  • `k8s_pool.go::retireClaimedWorker`: the direct `MarkWorkerTerminalIfCurrent` + `deleteRetiredRuntimeWorker` branch is gone; `lifecycle.RetireFromSnapshot` is the only path.
  • `k8s_pool.go::ShutdownAll`: the direct `MarkWorkerDraining` / `RetireDrainingWorker` legacy branches are gone; the lease-based `Drain` → pod delete → `RetireDrained` chain is the only path.
  • `shared_worker_activator.go::RefreshWorkerCredentials`: the direct `BumpWorkerEpoch` branch is gone; `lifecycle.RefreshLease` is the only path. Returns an explicit error if the activator is constructed without a lifecycle.
  • `janitor.go`: orphan, stuck-spawning/activating, and hot-idle reapers all flow through the lifecycle. The stuck-worker path migrates from the now-deleted `retireWorker` lambda to `lifecycle.RetireFromSnapshot`.

2. Delete unused janitor lambda plumbing

All four janitor lambdas (`retireWorker`, `retireOrphanWorker`, `retireLocalWorker`, `deleteRetiredWorker`) and their corresponding `multitenant.go` wires are gone. The `retireRuntimeWorker` helper that wrapped them is gone. The `controlPlaneExpiryStore` interface sheds `ListOrphanedWorkers`, `ListStuckWorkers`, `ListExpiredHotIdleWorkers`, `RetireHotIdleWorker` — the snapshot-typed variants are the only callers now.

3. Narrow `RuntimeWorkerStore` interface

Three worker-id-only CAS methods are moved off the public interface. They stay on `*configstore.ConfigStore` and are extracted via type assertion to the package-private `workerLifecycleStore` when wiring the lifecycle service:

  • `MarkWorkerDraining(workerID, cpID, epoch)`
  • `RetireDrainingWorker(workerID, cpID, epoch, reason)`
  • `BumpWorkerEpoch(workerID, cpID, epoch)`

A caller with only a `RuntimeWorkerStore` reference cannot reach these methods directly — they have to go through `WorkerLifecycle`. That's the PR 4 hardening goal stated in the roadmap.

`MarkWorkerLostIfCurrentLease` stays on `RuntimeWorkerStore` because the health-checker (`markWorkerLostForHealthLease`) still calls it directly; PR 5 will migrate that path to `lifecycle.MarkLostFromLease`.

`RetireIdleWorker` is removed entirely — it had no callers anywhere in the tree.

4. `K8sWorkerPool.ensureLifecycle()`

Lazy initialization for tests that bypass `newK8sWorkerPool` and assign `pool.runtimeStore` directly. Production always wires the lifecycle eagerly in the constructor. The 42 existing tests that pattern `pool.runtimeStore = store` continue to work unchanged; the lifecycle is created on first lifecycle-using call.

5. Test cleanup

  • `TestControlPlaneJanitorDeletesHotIdlePodAfterStoreRetire`: deleted, covered by the lifecycle-path test.
  • `TestControlPlaneJanitorSkipsHotIdlePodDeleteAfterStoreMiss`: rewritten to assert on lifecycle outcome.
  • `TestControlPlaneJanitorRunRetiresOrphanedAndStuckWorkers` and `TestControlPlaneJanitorRunOnceContinuesAfterExpireError`: migrated from `retireWorker` callback assertions to lifecycle-store assertions.
  • The lifecycle-path tests added in PR 3 (`TestControlPlaneJanitorHotIdleGoesThroughLifecycleWhenWired`, `TestControlPlaneJanitorOrphanGoesThroughLifecycleWhenWired`) shed the now-impossible "legacy lambda must not fire" assertions; the fields they referenced are gone.
  • `captureRuntimeWorkerStore.RetireIdleWorker` mock impl and unused `janitorObservedWorkerRecordMatchesCurrent` helper deleted.

What stays

  • `MarkWorkerLostIfCurrentLease` on the public interface (health-check path migration → PR 5).
  • The configstore `List*` legacy methods (`ListOrphanedWorkers`, etc.) — postgres integration tests in `tests/configstore/` exercise them directly, and the snapshot-typed wrappers are thin layers over them.

Testing

  • `go build -tags kubernetes ./...` clean
  • `go vet -tags kubernetes ./controlplane/...` clean
  • `go test ./controlplane/...` and `go test -tags kubernetes ./controlplane/...` — all 5 packages pass

Out of scope

  • `BumpWorkerEpoch ↔ ShutdownAll` in-memory-epoch race → PR 5
  • Health-checker migration to `lifecycle.MarkLostFromLease` → PR 5 (the activator-style legacy code path it would simplify is fundamentally similar to the one PR 3 left for that reason)
  • Per-image / fine-grained CAS-miss classification metrics → PR 6

⚠️ Two commits in this stack are unsigned because the Secretive agent has been refusing prompts during the autonomous session. Easy fix before merge: `git rebase --exec 'git commit --amend --no-edit -S' origin/main`, or rely on squash-merge.

🤖 Generated with Claude Code

bill-ph added 5 commits May 23, 2026 22:57
PR 3 left every migrated caller with a 'if lifecycle == nil { legacy }'
fallback so each migration commit was independently revertible. With PR
3 merged the fallbacks are dead in production — this commit deletes
them, plus the orphan janitor lambdas they used to call into:

- k8s_pool.go retireClaimedWorker: drop direct MarkWorkerTerminalIfCurrent
  + deleteRetiredRuntimeWorker branch; lifecycle is the only path.
- k8s_pool.go ShutdownAll: drop the direct MarkWorkerDraining and
  RetireDrainingWorker fallbacks; only the lease-based Drain →
  RetireDrained chain remains.
- shared_worker_activator.go RefreshWorkerCredentials: drop direct
  BumpWorkerEpoch fallback; the typed RefreshLease is the only path.
- janitor.go: orphan, stuck-spawning/activating, and hot-idle reapers
  all flow through the lifecycle. Stuck-worker path migrates from the
  retireWorker lambda to lifecycle.RetireFromSnapshot, which lets us
  drop retireRuntimeWorker / retireWorker / retireOrphanWorker /
  retireLocalWorker / deleteRetiredWorker entirely.
- multitenant.go: corresponding lambda wires deleted; only the
  lifecycle handoff remains.
- janitor's controlPlaneExpiryStore: drop ListOrphanedWorkers,
  ListStuckWorkers, ListExpiredHotIdleWorkers, RetireHotIdleWorker —
  the snapshot-typed counterparts are the only callers.
- K8sWorkerPool.ensureLifecycle(): lazy-init for test fixtures that
  bypass the constructor and set runtimeStore directly. Production
  always wires lifecycle eagerly in newK8sWorkerPool.
- janitor tests: TestControlPlaneJanitorDeletesHotIdlePodAfterStoreRetire
  deleted (covered by the lifecycle-path test);
  TestControlPlaneJanitorSkipsHotIdlePodDeleteAfterStoreMiss rewritten
  to assert via lifecycle outcome. Other janitor tests migrate from
  retireWorker callbacks to lifecycle-store assertions.
Worker-id-only lifecycle CAS methods are now reachable only through
the WorkerLifecycle service. Three methods that were on the public
RuntimeWorkerStore interface but only called via lifecycle are moved
off the interface:

- MarkWorkerDraining (workerID, cpID, epoch)
- RetireDrainingWorker (workerID, cpID, epoch, reason)
- BumpWorkerEpoch (workerID, cpID, epoch)

They stay on *configstore.ConfigStore as concrete methods and are
extracted via a type assertion to workerLifecycleStore when wiring
the lifecycle in newK8sWorkerPool / ensureLifecycle. Tests that
construct the pool struct directly satisfy workerLifecycleStore on
their mock and the same assertion picks them up.

MarkWorkerLostIfCurrentLease stays on RuntimeWorkerStore for one more
cycle — the health-checker still calls it directly; PR 5 will migrate
that path to lifecycle.MarkLostFromLease.

Also removed RetireIdleWorker entirely. It was on the interface and
implemented on ConfigStore + the mock, but had no callers anywhere in
the tree — RetireIdleOrHotIdleWorker covered every site that used to
hit it.
- ensureLifecycle() now guards lazy initialization with sync.Once to
  prevent the data-race the round-1 reviewer flagged. Production paths
  go through the eager constructor; tests that set runtimeStore after
  the fact now safely have multiple goroutines hit a lifecycle-using
  method concurrently.
- TestK8sPool_ShutdownAll now wires a runtime store so the lifecycle
  path actually executes its 3-step CAS chain + pod deletes. Previous
  test passed vacuously on the trailing 'p.workers = preserved' line
  without verifying any of the new behavior.
- TestControlPlaneJanitorRunRetiresOrphanedAndStuckWorkers now asserts
  that the snapshot the janitor passes to RetireOrphanFromSnapshot /
  RetireFromSnapshot carries the listed row's observed state — a
  regression that drops state from the snapshot in transit (the SQL
  fence keys off it) gets caught here.
- janitor.go nil-lifecycle guard upgraded from silent skip with a
  misleading 'older non-K8s deployments' comment to a slog.Error log.
  The only janitor constructor wires lifecycle unconditionally; nil
  here is a wiring bug.
- Deleted dead K8sWorkerPool.retireOrphanWorker method and removed
  BumpWorkerEpoch from credentialExpiryStore — neither has live
  callers after the round-1 ShutdownAll test fix lands.
- NewSharedWorkerActivator now borrows the pool's lifecycle via
  ensureLifecycle() instead of snapshotting the raw field. Tests that
  wire runtimeStore after construction (which the K8sWorkerPool handles
  via lazy init) get the same wiring discipline on the activator.
- RuntimeWorkerStore narrowed further: RetireIdleOrHotIdleWorker,
  RetireOrphanWorker, and MarkWorkerTerminalIfCurrent are off the
  public interface. They had zero callers via that interface — only
  through workerLifecycleStore — so leaving them on RuntimeWorkerStore
  invited future regressions that bypass the typed lease seam.
- Janitor's nil-lifecycle slog.Error is now once-per-process via
  sync.Once. A static wiring bug shouldn't log at the janitor tick
  rate forever; loud once is sufficient.
- Stale 'RetireIdleWorker' comment in the k8s integration test
  retargeted to RetireIdleOrHotIdleWorker.
Guard MarkCredentialsRefreshed against a nil runtimeStore. The
NewSharedWorkerActivator constructor independently asserts
credentialExpiryStore (sets a.runtimeStore) and workerLifecycleStore
(via shared.ensureLifecycle() for a.lifecycle). Today's production
ConfigStore satisfies both, but the two assertions are independent —
a future partial store implementation could leave a.runtimeStore nil
while a.lifecycle is set, and RefreshCredentials would nil-deref at
the MarkCredentialsRefreshed call.

The expiry stamp was already best-effort (skipping it means the next
refresh tick sees a stale expiry and refreshes again), so guarding
it is a clean fix that preserves intent.
@bill-ph bill-ph marked this pull request as ready for review May 24, 2026 12:16
@bill-ph bill-ph merged commit 25f5168 into main May 24, 2026
22 checks passed
@bill-ph bill-ph deleted the codex/pr4-deletion branch May 24, 2026 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant