feat(jobs): k8s-aware deploy + custom-domain reconcilers#1

Merged
mastermanas805 merged 1 commit into master from feat/k8s-reconciler-jobs
May 11, 2026
@mastermanas805
Member

Summary

  • deploy_status_reconcile.go: poll k8s for actual ReplicaSet state and update DB status. Replaces the optimistic "trust the compute provider" model that flagged `healthy` before pods were Ready.
  • custom_domain_reconcile.go: drive Ingress + cert-manager Certificate creation for custom-domain rows; mark verified when cert is Ready.

Adds `k8s.io/client-go` to deps.

Test plan

  • Add reconcile-loop unit tests (follow-up)
  • Verified live behaviour via the deploy-status reporting in PR-A's smoke tests

🤖 Generated with Claude Code

Two new background reconcilers wired into the River queue:

- deploy_status_reconcile.go: polls the k8s deploy namespace for each
  active deployment and updates the platform DB status (building →
  deploying → healthy/failed) based on actual ReplicaSet rollout
  state. Replaces the previous "compute provider returns and we trust
  it" model that was reporting healthy before pods were Ready.

- custom_domain_reconcile.go: checks pending custom-domain rows,
  creates/updates the corresponding Ingress + cert-manager
  Certificate resources, marks the row verified when the cert is
  Ready. Tolerates transient ACME failures.

Adds k8s.io/client-go (and transitive deps) to go.mod. Jobs registered
in internal/jobs/workers.go; expire.go gains a small change to skip
deploys that the reconciler will catch.
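
The status transition the deploy reconciler performs could be sketched roughly as below. This is a minimal illustration, not the code in deploy_status_reconcile.go: the `rolloutState` type, its field names, and `deriveStatus` are hypothetical, and the rule used for `failed` (a progress-deadline signal) is an assumption about how a stalled rollout might be detected.

```go
package main

import "fmt"

// rolloutState is a hypothetical, trimmed-down view of what a reconciler
// might read from the cluster for one deployment. Field names are illustrative.
type rolloutState struct {
	DesiredReplicas  int32
	UpdatedReplicas  int32
	ReadyReplicas    int32
	DeadlineExceeded bool // assumption: set when the rollout stopped progressing
}

// deriveStatus maps observed rollout state to a platform status string.
// "healthy" is only returned once every desired pod is updated AND Ready,
// which is the property the optimistic model lacked.
func deriveStatus(s rolloutState) string {
	switch {
	case s.DeadlineExceeded:
		return "failed"
	case s.DesiredReplicas > 0 &&
		s.UpdatedReplicas >= s.DesiredReplicas &&
		s.ReadyReplicas >= s.DesiredReplicas:
		return "healthy"
	default:
		return "deploying"
	}
}

func main() {
	// Pods created but one is not Ready yet: still deploying.
	fmt.Println(deriveStatus(rolloutState{DesiredReplicas: 2, UpdatedReplicas: 2, ReadyReplicas: 1}))
	// All pods Ready: healthy.
	fmt.Println(deriveStatus(rolloutState{DesiredReplicas: 2, UpdatedReplicas: 2, ReadyReplicas: 2}))
}
```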

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 60f1254 into master May 11, 2026
@mastermanas805 mastermanas805 deleted the feat/k8s-reconciler-jobs branch May 11, 2026 08:00
mastermanas805 added a commit that referenced this pull request May 17, 2026
P2 worker-side fixes from BUGHUNT-REPORT-2026-05-17.md:

1. GeoLite2 refresh corrupted the .mmdb after every run — MaxMind serves a
   gzipped tarball (suffix=tar.gz) but the job io.Copy'd the raw bytes
   straight to the .mmdb path. Now gunzip + untar and extract only the
   *.mmdb member (geodb.go: extractGeoLite2MMDB).

2. razorpay_webhook_events dedup table was never pruned (migration 033
   envisioned a prune job; none shipped). Added RazorpayWebhookPruneWorker, a
   daily DELETE of rows older than 30 days, registered alongside uptime_retention.

3. Billing reconciler free-downgrade stranded paid resources: a terminal
   Razorpay status with PaidCount==0 set plan_tier='free' (the 24h-TTL
   ephemeral tier) while resources kept a paid tier + expires_at=NULL.
   Now every terminal status downgrades to 'hobby' (lowest paid tier),
   matching the non-zero-paid branch — keeps team-tier and resource-tier
   coherent.

4. ExpireStacksWorker hard-deleted the stacks row even when namespace
   teardown was skipped (k8sClient==nil) — orphaning a live namespace
   with no DB pointer. Now skips the DELETE too, leaving the row for a
   later in-cluster run.

5. Storage suspend/unsuspend flapped at the boundary (suspend and
   unsuspend both at limitBytes). Added a hysteresis band: unsuspend only
   below 90% of the limit; the unsuspend loop now also skips resources the
   suspend loop flipped in the same Work() tick.

6. Expiry worker ignored paused/suspended anonymous resources past TTL —
   query was status='active' only, orphaning them. Now expires
   status IN ('active','paused','suspended') past TTL (TTL wins over
   lifecycle state).

7. Confirmed worker storage limits are MiB throughout (*1024*1024); no
   decimal-MB sites — no change needed.
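
For item 1, the gunzip + untar + "extract only the *.mmdb member" step can be sketched with the Go standard library alone. This is an illustrative sketch, not the body of extractGeoLite2MMDB: the function name `extractMMDB`, the in-memory helper `buildArchive`, and the archive member names are all made up for the example.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
	"strings"
)

// extractMMDB reads a gzipped tarball and returns the bytes of the first
// regular file whose name ends in ".mmdb". Every other member is skipped.
func extractMMDB(r io.Reader) ([]byte, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, fmt.Errorf("gunzip: %w", err)
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return nil, errors.New("no .mmdb member in archive")
		}
		if err != nil {
			return nil, fmt.Errorf("untar: %w", err)
		}
		if hdr.Typeflag == tar.TypeReg && strings.HasSuffix(hdr.Name, ".mmdb") {
			return io.ReadAll(tr)
		}
	}
}

// buildArchive creates a small in-memory tar.gz for demonstration only.
func buildArchive(names, contents []string) *bytes.Buffer {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	for i, name := range names {
		body := []byte(contents[i])
		_ = tw.WriteHeader(&tar.Header{
			Name: name, Mode: 0o644, Size: int64(len(body)), Typeflag: tar.TypeReg,
		})
		_, _ = tw.Write(body)
	}
	_ = tw.Close()
	_ = gz.Close()
	return &buf
}

func main() {
	archive := buildArchive(
		[]string{"GeoLite2-City_example/LICENSE.txt", "GeoLite2-City_example/GeoLite2-City.mmdb"},
		[]string{"license text", "fake-mmdb-bytes"},
	)
	data, err := extractMMDB(archive)
	if err != nil {
		fmt.Println("extract failed:", err)
		return
	}
	fmt.Printf("extracted %d bytes\n", len(data))
}
```

Writing the returned bytes to a temporary file and renaming it over the .mmdb path would additionally avoid readers seeing a half-written database; whether the job does that is not stated here.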

Tests: extraction unit tests for #1, hysteresis dead-band test for #5,
paused/suspended expiry test for #6, prune-job tests for #2. go build /
go vet / go test ./... -short all green.
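
The hysteresis band from item 5 can be illustrated with a small state function. This is a sketch under assumptions: `nextSuspended` and `unsuspendPercent` are hypothetical names, and the use of `>=` at the suspend boundary is a guess, since the message only says both transitions previously sat at limitBytes.

```go
package main

import "fmt"

// unsuspendPercent is the hysteresis threshold described above:
// a suspended resource is released only below 90% of its limit.
const unsuspendPercent = 90

// nextSuspended returns the suspension state after one quota tick.
// Integer math avoids float rounding at the band edges.
func nextSuspended(suspended bool, usedBytes, limitBytes int64) bool {
	if !suspended {
		// Assumption: suspension triggers at or above the limit.
		return usedBytes >= limitBytes
	}
	// Already suspended: stay suspended until usage drops below 90%.
	return usedBytes*100 >= limitBytes*unsuspendPercent
}

func main() {
	const limit = 1000
	suspended := false
	for _, used := range []int64{990, 1000, 995, 950, 899} {
		suspended = nextSuspended(suspended, used, limit)
		fmt.Printf("used=%d suspended=%v\n", used, suspended)
	}
}
```

Between 90% and 100% of the limit the state simply holds, which is what stops the flapping at the boundary.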

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 17, 2026
The worker's storage-quota Redis suspension built the ACL username as
usr_<token[:8]> via redisUsernameForToken. Wave-2 changed both the
provisioner shared backend (provisioner/.../redis/local.go) and the api
cache provider (api/.../cache/redis.go) to use the FULL token —
usr_<full-token> — and the dedicated backend
(provisioner/.../redis/dedicated.go) uses ded_<token[:8]>. The worker
matched neither, so `ACL SETUSER <user> off` targeted a non-existent
user: a silent no-op. The resource row flipped to 'suspended' but the
customer kept full Redis access. Token-truncation class (recurring
pattern #1 in BUGHUNT-REPORT-2026-05-17-round2.md).

Fix: redisUsernameForToken now takes the resource tier and returns the
exact provision-time username — usr_<full-token> for shared-backend
tiers (anonymous/free, via isSharedRedisTier) and ded_<token[:8]> for
dedicated paid tiers. The ResourceInfraRevoker interface's unused
connectionURL parameter is repurposed to tier; both call sites in
quota.go (suspend + unsuspend loops) thread the tier through.

Tests: quota_infra_redis_username_test.go asserts both username schemes
byte-for-byte and that the tier classifier matches the eviction loop's;
the suspend test now asserts tier is passed to the revoker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 17, 2026
…_resource_id

Token-truncation class (P1, BUGHUNT-REPORT-2026-05-17-round2 recurring
pattern #1). The worker's quota-suspend re-derived the dedicated-Redis ACL
username as ded_<token[:8]> and the storage scanner re-derived the object
prefix as token[:8] — both collide for two tokens sharing 8 hex chars.

Store-at-provision, never re-derive (mirrors provisioner dedident.go and
api prefixident.go):
- ResourceInfraRevoker.RevokeAccess/GrantAccess gain a providerResourceID
  param. redisUsernameForToken resolves: stored provider_resource_id when
  present (the exact provisioned name), else shared usr_<full-token>, else
  LEGACY dedicated ded_<token[:8]> for pre-fix rows.
- The quota suspend/unsuspend SELECTs now read provider_resource_id and
  thread it to the revoker.
- storage_minio.minioObjectPrefix already preferred provider_resource_id;
  the token[:8] branch is documented as a legacy fallback with a named
  constant (legacyStorageObjectPrefixTokenLen).
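
The resolution order described in the bullets above (stored identifier first, then shared scheme, then legacy truncation) might look roughly like this. It is a hedged sketch: the signature, the `isSharedTier` helper, the `legacyDedicatedTokenLen` constant, and the example tier and identifier values are hypothetical, not the worker's actual redisUsernameForToken.

```go
package main

import "fmt"

// legacyDedicatedTokenLen is a hypothetical name for the 8-char legacy prefix length.
const legacyDedicatedTokenLen = 8

// isSharedTier stands in for the tier classifier; the tier names come from the
// commit message (anonymous/free use the shared backend).
func isSharedTier(tier string) bool {
	return tier == "anonymous" || tier == "free"
}

// redisUsername resolves the ACL username without re-deriving it when a stored
// provider_resource_id is available.
func redisUsername(token, tier, providerResourceID string) string {
	if providerResourceID != "" {
		return providerResourceID // exact name recorded at provision time
	}
	if isSharedTier(tier) {
		return "usr_" + token // shared backend uses the full token
	}
	// Legacy fallback for rows provisioned before the identifier was stored.
	n := legacyDedicatedTokenLen
	if len(token) < n {
		n = len(token)
	}
	return "ded_" + token[:n]
}

func main() {
	// Two made-up tokens sharing their first 8 characters.
	a, b := "abcdef0111111111", "abcdef0122222222"
	fmt.Println(redisUsername(a, "paid", "") == redisUsername(b, "paid", ""))               // legacy path collides
	fmt.Println(redisUsername(a, "paid", "ded_"+a) == redisUsername(b, "paid", "ded_"+b)) // stored path does not
}
```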

Coverage tests: quota_infra_redis_username_test.go and the new
storage_minio_prefix_test.go assert the stored-PRID path is used verbatim,
the legacy token[:8] fallback still resolves old-form identifiers, and two
8-char-prefix-sharing tokens no longer collide.

Deploy-order-safe: a brief api/worker deploy skew only weakens quota-
suspend (fail-safe — no worse than today's silent no-op); no isolation
boundary depends on cross-service version parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 20, 2026
…ation APPLIED (#34)

Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go
lines 756–771):

  handleTierElevation treated `(Applied=false, SkipReason=<any string
  not in the allowed-skip whitelist>)` as success. A WARN log fired,
  firstErr stayed nil, the runner stamped applied_at on the row, and
  the entitlement_reconciler (5-min backstop) saw no drift to correct
  because applied_at was set. A paying customer's tier-elevation
  regrade never landed — no retry, no dead-letter, no alert.

  Real prod trigger: customer's postgres pod missing postgres-admin
  Secret (legacy free-tier pods, mid-deprovisioning races). The chaos
  drill confirmed the failure mode end-to-end.

Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr
(implements errors.Is on errPropagationUnexpectedSkipSentinel). The
runner's markRetry path detects the sentinel and emits a distinct
propagation.unexpected_skip audit row (NOT propagation.applied). The
row retries per the existing backoff schedule (1m, 5m, 15m, ...) and
dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going
through the standard markDeadLettered path with the canonical
propagation.dead_lettered audit kind that operators already alert on.
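
The sentinel-plus-typed-error pattern the fix relies on can be sketched as follows. The identifiers here (`errUnexpectedSkip`, `unexpectedSkipErr`, `checkRegradeResult`, and the single allow-list entry) are simplified stand-ins chosen for the example, not the names or contents in propagation_runner.go.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errUnexpectedSkip is the sentinel that callers match with errors.Is.
var errUnexpectedSkip = errors.New("propagation: unexpected skip")

// unexpectedSkipErr carries the raw skip reason while still matching the sentinel.
type unexpectedSkipErr struct{ SkipReason string }

func (e *unexpectedSkipErr) Error() string { return "unexpected skip: " + e.SkipReason }

// Is makes errors.Is(err, errUnexpectedSkip) succeed for any wrapped *unexpectedSkipErr.
func (e *unexpectedSkipErr) Is(target error) bool { return target == errUnexpectedSkip }

// allowedSkipSubstrings is a made-up stand-in for the real allow-list.
var allowedSkipSubstrings = []string{"already at target tier"}

func isAllowedSkip(reason string) bool {
	for _, s := range allowedSkipSubstrings {
		if strings.Contains(reason, s) {
			return true
		}
	}
	return false
}

// checkRegradeResult returns nil only for a real apply or an allow-listed skip.
// Anything else becomes an error, so the caller retries instead of marking applied.
func checkRegradeResult(applied bool, skipReason string) error {
	if applied || isAllowedSkip(skipReason) {
		return nil
	}
	return fmt.Errorf("tier_elevation: %w", &unexpectedSkipErr{SkipReason: skipReason})
}

// isUnexpectedSkip is what a retry path could use to pick the distinct audit kind.
func isUnexpectedSkip(err error) bool { return errors.Is(err, errUnexpectedSkip) }

func main() {
	err := checkRegradeResult(false, "postgres-admin secret missing")
	fmt.Println(err)
	fmt.Println(isUnexpectedSkip(err))
}
```

Keeping the raw reason on the error type while matching on a sentinel lets the retry path branch on the class of failure without parsing strings.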

New Prometheus counter:

  instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason}

with bounded skip_reason cardinality via bucketSkipReason() —
postgres_admin_secret_missing, redis_auth_secret_missing,
namespace_not_found, pod_not_found, resource_not_reachable,
legacy_resource, other. Leading indicator for the dead-letter alert
that already exists.
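
Bounding label cardinality usually means mapping free-form text onto a fixed set of values before it reaches the metric. A minimal sketch of that idea, using the bucket names listed above: the matching substrings are assumptions invented for the example, and the real bucketSkipReason() may match differently.

```go
package main

import (
	"fmt"
	"strings"
)

// skipReasonBuckets maps a substring to one of the fixed label values.
// Bucket names come from the commit message; the needles are illustrative guesses.
var skipReasonBuckets = []struct{ needle, bucket string }{
	{"postgres-admin", "postgres_admin_secret_missing"},
	{"redis-auth", "redis_auth_secret_missing"},
	{"namespace not found", "namespace_not_found"},
	{"pod not found", "pod_not_found"},
	{"not reachable", "resource_not_reachable"},
	{"legacy", "legacy_resource"},
}

// bucketSkipReason collapses an arbitrary skip reason into a bounded label set.
// Anything unrecognised falls into "other", so new strings cannot add series.
func bucketSkipReason(reason string) string {
	r := strings.ToLower(reason)
	for _, b := range skipReasonBuckets {
		if strings.Contains(r, b.needle) {
			return b.bucket
		}
	}
	return "other"
}

func main() {
	fmt.Println(bucketSkipReason("Secret postgres-admin missing in namespace x"))
	fmt.Println(bucketSkipReason("something nobody anticipated: id=12345"))
}
```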

Audit kinds the runner now emits (mirrors api/models/audit_kinds.go):
  - propagation.applied         (success; unchanged)
  - propagation.retrying        (routine retry; unchanged)
  - propagation.dead_lettered   (terminal failure; unchanged)
  - propagation.unexpected_skip (NEW: F1 retry signal)

Coverage block (CLAUDE.md rule 17):

  Symptom:        propagation.applied audit row + applied_at stamp on a
                  row whose regrade never landed
  Enumeration:    rg -F 'unexpected_skip'
                  (worker, provisioner, api repos)
  Sites found:    1 emit site (handleTierElevation only)
  Sites touched:  1
  Coverage test:  TestIsPropagationAllowedSkip_Coverage iterates
                  propagationAllowedSkipSubstrings + a known-failure
                  string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied
                  fails the second a future PR re-routes unexpected_skip
                  through markApplied
  Live verified:  pending — will verify post-deploy via synthetic
                  pending_propagations row pointing at non-existent
                  team_id with kind=tier_elevation

Tests pass:
  TestPropagation_UnexpectedSkip_DoesNotMarkApplied             PASS
  TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts       PASS
  TestIsPropagationAllowedSkip_Coverage                         PASS
  TestPropagationUnexpectedSkipErr_IsMatches                    PASS
  TestBucketSkipReason_BoundsCardinality                        PASS

make gate green (build + vet + go test ./... -short -count=1).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 20, 2026
…integration test layer (#35)

* fix(jobs): CHAOS F1 — unexpected_skip no longer silently marks propagation APPLIED

Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go
lines 756–771):

  handleTierElevation treated `(Applied=false, SkipReason=<any string
  not in the allowed-skip whitelist>)` as success. A WARN log fired,
  firstErr stayed nil, the runner stamped applied_at on the row, and
  the entitlement_reconciler (5-min backstop) saw no drift to correct
  because applied_at was set. A paying customer's tier-elevation
  regrade never landed — no retry, no dead-letter, no alert.

  Real prod trigger: customer's postgres pod missing postgres-admin
  Secret (legacy free-tier pods, mid-deprovisioning races). The chaos
  drill confirmed the failure mode end-to-end.

Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr
(implements errors.Is on errPropagationUnexpectedSkipSentinel). The
runner's markRetry path detects the sentinel and emits a distinct
propagation.unexpected_skip audit row (NOT propagation.applied). The
row retries per the existing backoff schedule (1m, 5m, 15m, ...) and
dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going
through the standard markDeadLettered path with the canonical
propagation.dead_lettered audit kind that operators already alert on.

New Prometheus counter:

  instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason}

with bounded skip_reason cardinality via bucketSkipReason() —
postgres_admin_secret_missing, redis_auth_secret_missing,
namespace_not_found, pod_not_found, resource_not_reachable,
legacy_resource, other. Leading indicator for the dead-letter alert
that already exists.

Audit kinds the runner now emits (mirrors api/models/audit_kinds.go):
  - propagation.applied         (success; unchanged)
  - propagation.retrying        (routine retry; unchanged)
  - propagation.dead_lettered   (terminal failure; unchanged)
  - propagation.unexpected_skip (NEW: F1 retry signal)

Coverage block (CLAUDE.md rule 17):

  Symptom:        propagation.applied audit row + applied_at stamp on a
                  row whose regrade never landed
  Enumeration:    rg -F 'unexpected_skip'
                  (worker, provisioner, api repos)
  Sites found:    1 emit site (handleTierElevation only)
  Sites touched:  1
  Coverage test:  TestIsPropagationAllowedSkip_Coverage iterates
                  propagationAllowedSkipSubstrings + a known-failure
                  string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied
                  fails the second a future PR re-routes unexpected_skip
                  through markApplied
  Live verified:  pending — will verify post-deploy via synthetic
                  pending_propagations row pointing at non-existent
                  team_id with kind=tier_elevation

Tests pass:
  TestPropagation_UnexpectedSkip_DoesNotMarkApplied             PASS
  TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts       PASS
  TestIsPropagationAllowedSkip_Coverage                         PASS
  TestPropagationUnexpectedSkipErr_IsMatches                    PASS
  TestBucketSkipReason_BoundsCardinality                        PASS

make gate green (build + vet + go test ./... -short -count=1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobs): CHAOS F2/F3/F4 — bound unknown_kind retries, add dead-letter counter, pin River RescueStuckJobsAfter

Ships the three follow-ups from CHAOS-DRILL-2026-05-20 on top of F1's
unexpected_skip fix. F1 already pulled in the F2/F3 helpers (the
propagation_runner.go markUnknownKindDeadLettered path + the
PropagationDeadLetteredTotal / PropagationUnknownKindTotal counters);
this commit adds the F4 River config pin and the registry-iterating tests
that lock the three fixes in.

F2 (P1, CHAOS-DRILL-2026-05-20):
  pending_propagations rows whose kind has no handler now respect the same
  propagationMaxAttempts ceiling as a real-failure row. Pre-fix they retried
  forever — confirmed live during the drill (chaos_test_unknown_kind reached
  attempts=10 in 4 minutes without ever transitioning to failed_at). New
  markUnknownKindDeadLettered path emits a distinct
  propagation.unknown_kind_dead_lettered audit row + bumps
  instant_propagation_dead_lettered_total{reason="unknown_kind",kind="unknown_kind"}
  (the second label is a bounded BUCKET, NOT the raw row.kind, so an
  attacker-controlled enqueue cannot blow up Prom cardinality).

F3 (P2, CHAOS-DRILL-2026-05-20):
  Adds instant_propagation_dead_lettered_total{reason,kind} counter incremented
  on every transition to failed_at. reason="max_attempts" covers the modal
  path (real RegradeResource failures, F1's unexpected_skip-as-failure, and
  markApplied DB failures once they exhaust the backoff schedule);
  reason="unknown_kind" covers F2's image-skew path. Also adds the per-tick
  instant_propagation_unknown_kind_total{kind} counter as a leading indicator
  so the operator sees "worker is older than api" in seconds rather than
  waiting ~24h for the dead-letter to land.

F4 (P1, CHAOS-DRILL-2026-05-20):
  River's default RescueStuckJobsAfter = JobTimeout + JobRescuerRescueAfterDefault
  = 20m + 1h = 1h20m. That is an 80-minute RTO ceiling on any catastrophic
  worker death (OOMKill / pod eviction / segfault) where River's client never
  gets to mark the job back to 'available' itself. Pin it explicitly to
  25 minutes — JobTimeout (20m) + 5m of jitter headroom. Every job in this
  worker is idempotent, so a duplicate rescue is a no-op rather than a
  double-effect. The rescue_stuck_jobs_after value is now also stamped into
  the jobs.workers.started log line so a kubectl-logs grep after a roll
  confirms the pinned RTO is live.
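
The F4 arithmetic, and the three invariants the pin is meant to satisfy, can be written out with plain `time.Duration` values. The constant and function names below are illustrative; only the durations (20m job timeout, 1h River default, 5m headroom) come from the text above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	globalJobTimeout        = 20 * time.Minute // JobTimeout as stated above
	riverDefaultRescueAfter = time.Hour        // JobRescuerRescueAfterDefault as stated above
	rescueHeadroom          = 5 * time.Minute  // jitter headroom added by the pin
)

// defaultRescueStuckJobsAfter is the value River would use without the pin.
func defaultRescueStuckJobsAfter() time.Duration {
	return globalJobTimeout + riverDefaultRescueAfter
}

// pinnedRescueStuckJobsAfter is the explicit value described in F4.
func pinnedRescueStuckJobsAfter() time.Duration {
	return globalJobTimeout + rescueHeadroom
}

// pinIsSane checks the invariants: above the job timeout (so the rescuer does
// not race the timeout) and below the default (so the pin is a real reduction).
func pinIsSane() bool {
	p := pinnedRescueStuckJobsAfter()
	return p > globalJobTimeout && p < defaultRescueStuckJobsAfter()
}

func main() {
	fmt.Println(defaultRescueStuckJobsAfter()) // 1h20m0s
	fmt.Println(pinnedRescueStuckJobsAfter())  // 25m0s
	fmt.Println(pinIsSane())                   // true
}
```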

TESTS (registry-iterating per CLAUDE.md rule 18):
  TestPropagation_UnknownKind_DeadLettersAtMaxAttempts:
    synthesises a guaranteed-not-in-registry kind (chaos_unknown_kind_<unix_nano>),
    drives a row at propagationMaxAttempts-1 through Work(), asserts (a)
    failed_at-stamping UPDATE landed and (b) PropagationDeadLetteredTotal
    {reason=unknown_kind,kind=unknown_kind} delta == 1. A future PR adding the
    synthetic kind to propagationHandlers cannot turn this test into a no-op
    because the kind is freshly generated each run.

  TestPropagation_UnknownKind_RetriesBelowMaxAttempts:
    companion guard so the F2 fix can't regress into the opposite bug
    (immediate dead-letter at attempts=0).

  TestPropagation_DeadLetter_IncrementsMetric:
    pins the F3 contract on the modal max_attempts path (tier_elevation kind,
    persistent gRPC failure at attempts=propagationMaxAttempts-1, asserts the
    Prom counter incremented).

  TestWorker_RiverConfig_RescueStuckJobsAfterIs25Min:
    pins rescueStuckJobsAfter == 25m exactly, AND > globalJobTimeout (so the
    rescuer doesn't race River's own timeout), AND < River's default 1h20m
    (so the explicit pin remains an actual reduction).
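
The two unknown_kind tests pin both edges of one decision: retry while below the ceiling, dead-letter once the ceiling is reached. A rough sketch of that decision and of the bounded metric label follows. The function names and the exact off-by-one rule (dead-letter when the next attempt would reach the maximum) are assumptions inferred from the test descriptions, not taken from the runner.

```go
package main

import "fmt"

// maxAttempts mirrors propagationMaxAttempts (10) from the commit message.
const maxAttempts = 10

// unknownKindOutcome decides what happens to a row whose kind has no handler.
// attemptsSoFar is the counter stored on the row before this tick.
func unknownKindOutcome(attemptsSoFar int) string {
	if attemptsSoFar+1 >= maxAttempts {
		return "dead_letter"
	}
	return "retry"
}

// deadLetterKindLabel returns the metric label for a dead-lettered row.
// Unhandled kinds collapse to one bucket so arbitrary row.kind values
// cannot create new time series.
func deadLetterKindLabel(kind string, hasHandler bool) string {
	if !hasHandler {
		return "unknown_kind"
	}
	return kind
}

func main() {
	fmt.Println(unknownKindOutcome(0))                                    // retry
	fmt.Println(unknownKindOutcome(maxAttempts - 1))                      // dead_letter
	fmt.Println(deadLetterKindLabel("chaos_unknown_kind_1716200000", false)) // unknown_kind
}
```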

GATES: make gate green (build + vet + go test ./... -short -count=1 all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(integration): propagation_runner integration test layer (Track 3)

Adds the next layer up from worker/internal/jobs/propagation_runner_test.go
(sqlmock unit drift guards). No build tag — runs under regular make gate.

  - TestPropagation_BackoffIntegration_ExactScheduleViaMarkRetry
    Drives w.markRetry directly with a deterministic clock and pins
    the persisted next_attempt_at SQL UPDATE arg for every position
    in propagationBackoffSchedule + the clamp arm. Catches a refactor
    that changes the next-attempt formula without updating the
    schedule.

  - TestPropagation_DeadLetterIntegration_AtMaxAttempts
    Drives w.markDeadLettered directly. Pins the SQL UPDATE setting
    failed_at AND the propagation.dead_lettered audit row emission.
    Catches a conditional skip of the audit emission.

  - TestPropagation_UnknownKindIntegration_BoundedRetries (F2 P1 guard)
    A pending_propagation with kind='garbage_kind_nobody_handles'
    must flow through markRetry (attempts++), not a silent skip.
    Catches a refactor that bypasses attempts++ in the unknown_kind
    branch.

  - TestPropagation_ForUpdateSkipLockedIntegration
    Live-DB concurrent picker test. Two workers pickEligible the
    same row; total picks must be <= 1. Gated on TEST_DATABASE_URL
    + absence of -short. Catches a SKIP LOCKED removal that lets
    sibling pods double-dispatch.

  - TestPropagation_RegistryWalkIntegration_EnumVsHandlerMap
    Rule 18 registry walk against the pending_propagations.kind PG
    enum. Catches a migration that adds an enum value without a
    matching handler in propagationHandlers.

CLAUDE.md rule 17 coverage block per test, rule 18 registry walk
in two of the five tests.
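
The core of a registry walk is a set difference between the values the database allows and the keys the handler map knows. A minimal sketch, assuming the enum values have already been loaded into a slice (against Postgres that could be done with `enum_range`); `missingHandlers` and the `new_kind` value are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// missingHandlers returns every enum value that has no entry in the handler map.
// A test would fail if the returned slice is non-empty.
func missingHandlers(enumValues []string, handlers map[string]func() error) []string {
	var missing []string
	for _, v := range enumValues {
		if _, ok := handlers[v]; !ok {
			missing = append(missing, v)
		}
	}
	sort.Strings(missing) // stable output for readable failures
	return missing
}

func main() {
	handlers := map[string]func() error{
		"tier_elevation": func() error { return nil },
	}
	// "new_kind" simulates a migration that added an enum value without a handler.
	fmt.Println(missingHandlers([]string{"tier_elevation", "new_kind"}, handlers))
}
```

Iterating the source of truth rather than a hand-written list means the check keeps working as kinds are added.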

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>