Skip to content

test(jobs): assert reconciler periodic-jobs bind to the reconcile queue#3

Merged
mastermanas805 merged 1 commit into
masterfrom
test/worker-queue-behavior
May 11, 2026
Merged

test(jobs): assert reconciler periodic-jobs bind to the reconcile queue#3
mastermanas805 merged 1 commit into
masterfrom
test/worker-queue-behavior

Conversation

@mastermanas805
Copy link
Copy Markdown
Member

Summary

`TestQueueReconcileConst` guarded the const string but not the actual InsertOpts binding. A typo in the periodic-job closures could route reconcilers back to the default queue and `TestQueueReconcileConst` would still pass.

Extracts `reconcileInsertOpts()` as the single call site every reconciler closure uses, and adds a test that exercises it directly.

Test plan

  • `TestQueueReconcileConst` — guards the const string (unchanged)
  • `TestReconcileInsertOpts_BindsToReconcileQueue` — guards the helper return value (new)
  • `go build ./...` passes

🤖 Generated with Claude Code

The previous TestQueueReconcileConst only guarded the const string
"reconcile" — but a typo could still route the periodic-job closures
to a different queue if the InsertOpts struct literal was edited in
isolation.

This PR extracts reconcileInsertOpts() — a single helper that returns
the InsertOpts every reconciler closure must use. Both periodic-job
builders (custom-domain, deploy-status) now call the helper, and a
new test exercises it directly:

  TestReconcileInsertOpts_BindsToReconcileQueue verifies:
    - returns non-nil (a nil opts falls back to QueueDefault, which
      is exactly the starvation bug fix/reconcile-queue prevented)
    - Queue equals queueReconcile
    - Queue resolves to the literal "reconcile" (matches the
      QueueConfig key in the river.Config)

Closure code path is identical to before — production behavior
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 994e9db into master May 11, 2026
@mastermanas805 mastermanas805 deleted the test/worker-queue-behavior branch May 11, 2026 09:48
mastermanas805 added a commit that referenced this pull request May 20, 2026
PROBLEM. The prod orphan instant-deploy-04dc0b31 (2026-05-14) sat in
ImagePullBackOff for 9h+. PASS 3 left it alone because the deployments
row was status='deploying', not 'deleted', and its team was active.
deploy_status_reconcile flips a row to 'failed' only when k8s reports
DeploymentReplicaFailure=True — a stuck pod never trips that.

PASS 3 ENHANCEMENTS

- Per-namespace reason labels (team_tombstoned, no_db_row,
  failed_old_deployment) drive new Prometheus metric
  instant_orphan_sweep_reaped_total{reason}.
- no_db_row now applies a 1h grace via GetNamespaceAge to avoid racing
  with in-flight provisions.
- failed_old_deployment reaps instant-deploy-* whose row is status='failed'
  AND created_at > 6h ago (autopsy stays in deployment_events).
- Proposed-reap structured log lands BEFORE every delete with full
  evidence (constraint #3: operator must see what is about to happen).

PASS 6 (NEW) — STUCK-BUILD DETECTION

Catches deployments stuck in 'building'/'deploying' for >30min whose
only pod is in ImagePullBackOff/ErrImagePull/CrashLoopBackOff.
Flips the row to 'failed' + sets error_message. The autopsy is captured
by the next deploy_status_reconcile tick (one source of truth).

SAFETY

- Whitelist on the three prefixes (instant-deploy-*, instant-customer-*,
  instant-stack-*) enforced by the ListNamespacesWithPrefix seam.
- All grace thresholds (1h/6h/30min) conservative on purpose.
- Per-namespace fail-open posture matches the existing passes.
- Pure classifier extracted (classifyDeployOrphan) for direct table-driven testing.

TESTS

- TestOrphanSweep_NamespaceWithoutDBRow_ReapsAfterGrace
- TestOrphanSweep_FailedDeployment_ReapedAfter6h
- TestOrphanSweep_Pass6_StuckBuild_FlipsToFailed
- TestOrphanSweep_Pass6_RunningPod_DoesNotFlip
- TestOrphanSweep_PrefixWhitelist_RefusesUnknownNamespace
- TestOrphanSweep_ClassifyDeployOrphan_TableDriven (10 row shapes)
- TestOrphanSweep_StuckBuildWaitingReasons_Registry (registry-iterating
  per CLAUDE.md rule 18)

METRICS

- instant_orphan_sweep_reaped_total{reason} — PASS 3/4/5/6 reaps
- instant_orphan_sweep_reap_failed_total{reason} — k8s/DB failures

Companion infra PR adds the Prom alerts (no_db_row > 0 over 1h → P0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant