feat(worker): audit-only orphan-customer-DB / redis-namespace sweep (flag-gated OFF) #102
Merged
…flag-gated OFF)
New River periodic job orphan_db_sweep (hourly, reconcile queue, UniqueOpts)
that addresses the ~25 orphaned customer DB / redis namespace drain-backlog —
in DETECTION / DRY-RUN mode only. It lists instant-customer-* namespaces, flags
the ones whose token has NO non-terminal (pending/active/paused/suspended)
resources row and is past the provisioning grace window, then LOGS each
candidate (token masked via logsafe.Token) and emits the candidate metrics.
It DROPS NOTHING in audit-only mode.
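
The detection rule above can be sketched as a single predicate. This is a minimal illustration, not the job's actual code: the `namespace` struct, the assumption that the token is the suffix after `instant-customer-`, and the 30-minute grace value are all hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const nsPrefix = "instant-customer-"

// exampleGrace is an assumed provisioning grace window; the real value is not
// stated in this PR.
const exampleGrace = 30 * time.Minute

// exampleNow is a fixed clock so the example is deterministic.
var exampleNow = time.Date(2026, 6, 4, 12, 0, 0, 0, time.UTC)

// namespace is a hypothetical stand-in for the two fields the sweep needs
// from a listed namespace.
type namespace struct {
	Name      string
	CreatedAt time.Time
}

// isOrphanCandidate reports whether ns would be logged as an orphan candidate.
// liveTokens holds every token that still has a non-terminal resources row.
func isOrphanCandidate(ns namespace, liveTokens map[string]bool, grace time.Duration, now time.Time) bool {
	token, ok := strings.CutPrefix(ns.Name, nsPrefix)
	if !ok || token == "" {
		return false // not a customer namespace: never a candidate
	}
	if liveTokens[token] {
		return false // a pending/active/paused/suspended row still owns it
	}
	// Only flag namespaces that are past the provisioning grace window.
	return now.Sub(ns.CreatedAt) > grace
}

func main() {
	live := map[string]bool{"tok-live": true}
	namespaces := []namespace{
		{Name: nsPrefix + "tok-live", CreatedAt: exampleNow.Add(-48 * time.Hour)},
		{Name: nsPrefix + "tok-gone", CreatedAt: exampleNow.Add(-48 * time.Hour)},
		{Name: nsPrefix + "tok-new", CreatedAt: exampleNow.Add(-time.Minute)},
		{Name: "kube-system", CreatedAt: exampleNow.Add(-48 * time.Hour)},
	}
	for _, ns := range namespaces {
		fmt.Printf("%-28s candidate=%v\n", ns.Name,
			isOrphanCandidate(ns, live, exampleGrace, exampleNow))
	}
}
```

Keeping the rule a pure function of its inputs makes the four test cases named later (orphan, live, pending, within-grace) straightforward to table-drive.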
truehomie-2026-06-03 safety: there is NO manual / raw DROP anywhere in this
job. The destructive teardown sits behind a SECOND flag and, when (and only
when) enabled, routes through the AUDITED provisioner DeprovisionResource
chokepoint — the same path the TTL reaper (expire.go) uses. For this PR that
path is intentionally unreachable-by-default.
Two flag gates, BOTH default OFF / fail-closed:
- ORPHAN_DB_SWEEP_ENABLED: master flag; off → Work is a DEBUG no-op
  (no namespace List, no DB read, no metric).
- ORPHAN_DB_SWEEP_DESTRUCTIVE_ENABLED: destructive flag; meaningless unless
  the master is also on AND a provisioner is wired. Routes through the audited
  chokepoint only.
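
As a rough sketch of how the two gates compose, the following assumes the flags are read from the environment and parsed fail-closed. The `sweeper`, `sweepConfig`, `envFlag` and `mode` names are hypothetical; only the three-way AND in `destructiveArmed` is taken from the PR description.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// deprovisioner is a hypothetical stand-in for the audited chokepoint.
type deprovisioner interface {
	DeprovisionResource(token string) error
}

// noopDeprovisioner exists only so the example can wire a non-nil provisioner.
type noopDeprovisioner struct{}

func (noopDeprovisioner) DeprovisionResource(string) error { return nil }

type sweepConfig struct {
	Enabled            bool
	DestructiveEnabled bool
}

type sweeper struct {
	cfg  sweepConfig
	prov deprovisioner
}

// envFlag parses a boolean flag fail-closed: unset or unparsable means false.
// The parsing rule is an assumption, not taken from the PR.
func envFlag(name string, lookup func(string) (string, bool)) bool {
	raw, ok := lookup(name)
	if !ok {
		return false
	}
	v, err := strconv.ParseBool(raw)
	return err == nil && v
}

// destructiveArmed requires both flags AND a wired provisioner.
func (s sweeper) destructiveArmed() bool {
	return s.cfg.Enabled && s.cfg.DestructiveEnabled && s.prov != nil
}

// mode names the behaviour a tick would take under the current gates.
func (s sweeper) mode() string {
	switch {
	case !s.cfg.Enabled:
		return "noop"
	case s.destructiveArmed():
		return "destructive"
	default:
		return "audit"
	}
}

func main() {
	s := sweeper{cfg: sweepConfig{
		Enabled:            envFlag("ORPHAN_DB_SWEEP_ENABLED", os.LookupEnv),
		DestructiveEnabled: envFlag("ORPHAN_DB_SWEEP_DESTRUCTIVE_ENABLED", os.LookupEnv),
	}}
	fmt.Println("mode:", s.mode())
}
```

Note that setting only the destructive flag still yields a no-op, because the master flag is checked first.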
Fail-safe / fail-open: a namespace-List error or a live-token DB-read error
degrades to ZERO candidates (never an empty-set that a destructive caller could
read as "drop everything"); a candidate whose token reappears live at the
destructive re-confirm is SKIPPED; a generic (unmapped-kind) orphan is SKIPPED
(no proven backing type → no guessed DROP). When in doubt, skip + log.
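
The fail-safe rules above could look roughly like this. The function names and the callback-based seams are hypothetical, and the set of mapped kinds is assumed to match the metric `kind` labels.

```go
package main

import (
	"errors"
	"fmt"
)

// errBoom is a sample failure used by the demo in main.
var errBoom = errors.New("boom")

// findCandidates returns tokens that have a namespace but no live resources row.
// Any lookup error yields zero candidates. In particular, a failed live-token
// read must not be treated as "no tokens are live", because that would make
// every namespace look orphaned.
func findCandidates(listTokens func() ([]string, error), readLive func() (map[string]bool, error)) []string {
	tokens, err := listTokens()
	if err != nil {
		return nil
	}
	live, err := readLive()
	if err != nil {
		return nil
	}
	var out []string
	for _, t := range tokens {
		if !live[t] {
			out = append(out, t)
		}
	}
	return out
}

// shouldReclaim is the destructive-side re-confirm: skip unmapped kinds, skip
// tokens that are live again, and skip when the re-check itself fails.
func shouldReclaim(token, kind string, isLive func(string) (bool, error)) bool {
	if kind != "customer_namespace" && kind != "redis_namespace" {
		return false // no proven backing type, so no guessed DROP
	}
	live, err := isLive(token)
	if err != nil || live {
		return false // when in doubt, skip
	}
	return true
}

func main() {
	list := func() ([]string, error) { return []string{"a", "b"}, nil }
	brokenDB := func() (map[string]bool, error) { return nil, errBoom }
	fmt.Println("candidates on DB error:", len(findCandidates(list, brokenDB)))
}
```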
Metrics (lazy *Vec, both labels primed in metrics_test):
instant_orphan_db_sweep_candidates_total{kind} — counter, kind in
{customer_namespace, redis_namespace}
instant_orphan_db_sweep_candidates_current{kind} — gauge, current backlog
Alert + dashboard + catalog live in the infra repo (not owned here) — see PR
body for the exact metric names + suggested alerts (rule 25 follow-up).
Tests: candidate detection (orphan vs live vs pending vs within-grace), kind
classification, both flag gates (off → no-op), masking, fail-open paths, and
that the destructive deprovisioner is NEVER called in audit-only mode. New file
at 100% statement coverage. make gate green (build + vet + go test -short).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ner helper
diff-cover flagged the `if provClient != nil` branch in the StartWorkers wiring (integration-only, not unit-reachable). Extract the typed-nil-safe conversion into orphanDBSweepDeprovisionerFor (mirrors NewExpireAnonymousWorker's handling) and unit-test both arms directly, so the wiring call site is a single non-branching expression covered by TestStartWorkers_FullBoot and the branch logic is covered by the new unit test. New code back to 100% patch coverage.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
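
For readers unfamiliar with the typed-nil pitfall this helper guards against, here is a minimal sketch. The `provClient` type, the interface shape and the `deprovisionerFor` signature are hypothetical and only mirror the idea behind `orphanDBSweepDeprovisionerFor`.

```go
package main

import "fmt"

// resourceDeprovisioner is a hypothetical interface for the audited chokepoint.
type resourceDeprovisioner interface {
	DeprovisionResource(token string) error
}

// provClient is a hypothetical concrete client; StartWorkers may hold a nil one.
type provClient struct{}

func (c *provClient) DeprovisionResource(token string) error { return nil }

// naive converts the pointer directly. A nil *provClient becomes a non-nil
// interface value (it carries a type but a nil pointer), so a later
// `provisioner != nil` check would wrongly pass.
func naive(c *provClient) resourceDeprovisioner {
	return c
}

// deprovisionerFor is the typed-nil-safe conversion: a nil pointer maps to a
// true nil interface, so `provisioner != nil` means what it says.
func deprovisionerFor(c *provClient) resourceDeprovisioner {
	if c == nil {
		return nil
	}
	return c
}

func main() {
	var c *provClient // not wired
	fmt.Println(naive(c) == nil)            // false: typed nil inside an interface
	fmt.Println(deprovisionerFor(c) == nil) // true
}
```

Moving the branch into a helper also gives unit tests two directly reachable arms, which is the coverage point the commit describes.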
What
A new River periodic job `orphan_db_sweep` (`internal/jobs/orphan_db_sweep.go`) that addresses the ~25 orphaned customer DB / redis-namespace drain-backlog, in DETECTION / DRY-RUN mode only. It lists `instant-customer-*` namespaces, flags the ones whose token has no non-terminal (pending/active/paused/suspended) `resources` row and is past the provisioning grace window, then LOGS each candidate (token masked via `logsafe.Token`) and emits candidate metrics. It drops nothing in audit-only mode.
truehomie-2026-06-03 safety: NO destructive drop runs by default
On 2026-06-03 an active Pro customer's DB + role were dropped by an unidentified, unaudited path. Accordingly:
- There is NO manual / raw `DROP` anywhere in this job.
- The destructive teardown sits behind a SECOND flag and, when (and only when) enabled, routes through the AUDITED provisioner `DeprovisionResource` chokepoint, the same path the TTL reaper (`expire.go`) uses.
How it differs from `orphan_sweep_reconciler.go` PASS 4
PASS 4 already lists `instant-customer-*` namespaces and deletes the orphans immediately as a side-effect of a large reconciler. This job is a separate, conservative, observability-first surface: hourly, audit-only by default, flag-gated, with per-kind candidate metrics so we can measure the backlog and review the dry-run list before any reclamation.
Flag gates: both default OFF / fail-closed
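| Flag | Default | Effect |
| --- | --- | --- |
| `ORPHAN_DB_SWEEP_ENABLED` | `false` | When off, `Work` is a DEBUG no-op: no namespace List, no DB read, no metric. |
| `ORPHAN_DB_SWEEP_DESTRUCTIVE_ENABLED` | `false` | Meaningless unless the master flag is also on and a provisioner is wired; routes through the audited `DeprovisionResource` chokepoint only. |
`destructiveArmed()` = `Enabled && DestructiveEnabled && provisioner != nil` (defense-in-depth).
Fail-safe / fail-open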
- A namespace-List error or a live-token DB-read error degrades to ZERO candidates (never an empty set that a destructive caller could read as "drop everything").
- A candidate whose token reappears live at the destructive re-confirm is SKIPPED.
- A generic (unmapped-kind) orphan is SKIPPED: no proven backing type, so no guessed DROP.
- When in doubt, skip + log.
Metrics (rule 25 — infra follow-up needed)
This repo does not own `infra/`, so the alert + dashboard tile + `METRICS-CATALOG.md` row are an explicit follow-up. Exact metric names:
- `instant_orphan_db_sweep_candidates_total{kind}`: counter; `kind` ∈ {`customer_namespace`, `redis_namespace`}. One increment per detected orphan candidate.
- `instant_orphan_db_sweep_candidates_current{kind}`: gauge; the orphan-candidate count observed by the most recent tick (falls to 0 when the backlog drains).
Both are lazy `*Vec`; both `kind` label values are primed in `metrics_test.go` so the tiles render from process start.
Suggested alerts for the infra PR:
- `sum(instant_orphan_db_sweep_candidates_total) by (kind) > 0` for 1h → P2 (standing orphan backlog = real cost: live pod, no owner; review the dry-run log before enabling destructive reclamation).
- `max(instant_orphan_db_sweep_candidates_current) by (kind) > 25` → P2 (the documented drain-backlog from the task brief; above it, accumulation is outpacing reclamation).
- Dashboard tile: `instant_orphan_db_sweep_candidates_current` per `kind` on `infra/newrelic/dashboards/instanode-reliability.json`.
Wiring
- `StartWorkers`: reuses the same seams as the orphan-sweep reconciler: `K8sNamespaceLister` (namespace List + age check) and the audited `ResourceDeprovisioner` (= `provClient`, only for the flag-gated destructive arm).
- `buildPeriodicJobs`: hourly, `reconcileInsertOpts` (carries `UniqueOpts` so replicas:2 doesn't double-run; passes `TestPeriodicJobs_AllCarryUniqueOpts`), `RunOnStart=false`.
Coverage block (rule 17)
Gate
- `make gate` green (the EXACT CI `deploy.yml` test step: build + vet + `go test ./... -short -count=1`).
- `golangci-lint run` on touched packages: 0 issues.
Explicit statement
No destructive drop runs by default. Both flags default OFF / fail-closed; the destructive arm is unreachable on merge and, even when armed, never issues a raw DROP — it only routes through the existing audited provisioner deprovision chokepoint.
🤖 Generated with Claude Code