feat(jobs): k8s-aware deploy + custom-domain reconcilers #1
Merged
Two new background reconcilers wired into the River queue:

- deploy_status_reconcile.go: polls the k8s deploy namespace for each active deployment and updates the platform DB status (building → deploying → healthy/failed) based on actual ReplicaSet rollout state. Replaces the previous "compute provider returns and we trust it" model that was reporting healthy before pods were Ready.
- custom_domain_reconcile.go: checks pending custom-domain rows, creates/updates the corresponding Ingress + cert-manager Certificate resources, marks the row verified when the cert is Ready. Tolerates transient ACME failures.

Adds k8s.io/client-go (and transitive deps) to go.mod. Jobs registered in internal/jobs/workers.go; expire.go gains a small change to skip deploys that the reconciler will catch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
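A minimal sketch of the status mapping the deploy reconciler describes, assuming the rollout state has already been read from the cluster into a small struct. `RolloutSnapshot`, its fields, and `statusFor` are hypothetical names for illustration; the real reconciler reads this state through k8s.io/client-go and may weigh it differently.

```go
package main

import "fmt"

// RolloutSnapshot is a hypothetical, trimmed-down view of what the
// reconciler observes for one deployment. All field names are assumptions.
type RolloutSnapshot struct {
	BuildDone bool  // image build finished (assumed flag)
	Failed    bool  // rollout gave up, e.g. a progress deadline passed (assumed flag)
	Desired   int32 // replicas requested
	Updated   int32 // replicas running the newest pod template
	Ready     int32 // replicas passing readiness probes
}

// statusFor maps observed rollout state to the platform status values
// named in the PR description: building, deploying, healthy, failed.
func statusFor(s RolloutSnapshot) string {
	switch {
	case s.Failed:
		return "failed"
	case !s.BuildDone:
		return "building"
	case s.Desired > 0 && s.Updated == s.Desired && s.Ready == s.Desired:
		// Only report healthy once every desired pod is actually Ready.
		return "healthy"
	default:
		return "deploying"
	}
}

func main() {
	fmt.Println(statusFor(RolloutSnapshot{BuildDone: true, Desired: 2, Updated: 2, Ready: 1}))
	fmt.Println(statusFor(RolloutSnapshot{BuildDone: true, Desired: 2, Updated: 2, Ready: 2}))
}
```

The point of the sketch is the ordering: "healthy" is the last state to be granted, which is what the old trust-the-provider model got wrong.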
mastermanas805 added a commit that referenced this pull request on May 17, 2026
P2 worker-side fixes from BUGHUNT-REPORT-2026-05-17.md:
1. GeoLite2 refresh corrupted the .mmdb after every run — MaxMind serves a
gzipped tarball (suffix=tar.gz) but the job io.Copy'd the raw bytes
straight to the .mmdb path. Now gunzip + untar and extract only the
*.mmdb member (geodb.go: extractGeoLite2MMDB).
2. razorpay_webhook_events dedup table was never pruned (migration 033
envisioned it; no job shipped). Added RazorpayWebhookPruneWorker — a
daily DELETE of rows > 30 days, registered alongside uptime_retention.
3. Billing reconciler free-downgrade stranded paid resources: a terminal
Razorpay status with PaidCount==0 set plan_tier='free' (the 24h-TTL
ephemeral tier) while resources kept a paid tier + expires_at=NULL.
Now every terminal status downgrades to 'hobby' (lowest paid tier),
matching the non-zero-paid branch — keeps team-tier and resource-tier
coherent.
4. ExpireStacksWorker hard-deleted the stacks row even when namespace
teardown was skipped (k8sClient==nil) — orphaning a live namespace
with no DB pointer. Now skips the DELETE too, leaving the row for a
later in-cluster run.
5. Storage suspend/unsuspend flapped at the boundary (suspend and
unsuspend both at limitBytes). Added a hysteresis band: unsuspend only
below 90% of the limit; the unsuspend loop now also skips resources the
suspend loop flipped in the same Work() tick.
6. Expiry worker ignored paused/suspended anonymous resources past TTL —
query was status='active' only, orphaning them. Now expires
status IN ('active','paused','suspended') past TTL (TTL wins over
lifecycle state).
7. Confirmed worker storage limits are MiB throughout (*1024*1024); no
decimal-MB sites — no change needed.
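The hysteresis band in item 5 can be sketched as a single state-transition function. This is an illustration under stated assumptions: `nextSuspended` is a hypothetical name, suspension is assumed to trigger at or above the limit, and the 90% threshold is expressed as 9/10 to stay in integer arithmetic.

```go
package main

import "fmt"

// The 90% unsuspend threshold comes from the commit text; writing it as
// 9/10 is a choice made for this sketch.
const (
	unsuspendNum = 9
	unsuspendDen = 10
)

// nextSuspended returns the suspension state after one quota tick.
func nextSuspended(suspended bool, usedBytes, limitBytes int64) bool {
	if !suspended {
		// Assumption: a resource is suspended at or above its limit.
		return usedBytes >= limitBytes
	}
	// Already suspended: stay suspended until usage drops below 90% of
	// the limit, so a resource hovering at the boundary does not flap.
	unsuspendBelow := limitBytes / unsuspendDen * unsuspendNum
	return usedBytes >= unsuspendBelow
}

func main() {
	const mib = int64(1024 * 1024)
	limit := 100 * mib
	for _, used := range []int64{89 * mib, 95 * mib, 100 * mib} {
		fmt.Printf("used=%dMiB active->suspended=%v suspended->suspended=%v\n",
			used/mib, nextSuspended(false, used, limit), nextSuspended(true, used, limit))
	}
}
```

Between 90% and 100% of the limit the function returns whatever state the resource was already in, which is the dead band the new test exercises.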
Tests: extraction unit tests for #1, hysteresis dead-band test for #5,
paused/suspended expiry test for #6, prune-job tests for #2. go build /
go vet / go test ./... -short all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
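For the GeoLite2 fix in item 1, a minimal sketch of gunzip + untar that keeps only the *.mmdb member, using the standard library. This is not the body of extractGeoLite2MMDB; `extractMMDB`, `extractMMDBFromBytes`, `buildArchive`, and the archive member names are hypothetical and exist only to make the example self-contained.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
	"strings"
)

// extractMMDB reads a gzipped tarball and returns the contents of the first
// regular-file member whose name ends in ".mmdb".
func extractMMDB(r io.Reader) ([]byte, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, fmt.Errorf("gunzip: %w", err)
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return nil, errors.New("no .mmdb member in archive")
		}
		if err != nil {
			return nil, fmt.Errorf("untar: %w", err)
		}
		if hdr.Typeflag == tar.TypeReg && strings.HasSuffix(hdr.Name, ".mmdb") {
			return io.ReadAll(tr)
		}
	}
}

// extractMMDBFromBytes is a convenience wrapper for in-memory archives.
func extractMMDBFromBytes(b []byte) ([]byte, error) {
	return extractMMDB(bytes.NewReader(b))
}

// buildArchive creates a small in-memory tar.gz for the demo.
func buildArchive(files map[string]string) []byte {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	for name, body := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(body)), Typeflag: tar.TypeReg}
		if err := tw.WriteHeader(hdr); err != nil {
			panic(err)
		}
		if _, err := tw.Write([]byte(body)); err != nil {
			panic(err)
		}
	}
	if err := tw.Close(); err != nil {
		panic(err)
	}
	if err := gz.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	archive := buildArchive(map[string]string{
		"GeoLite2-City_example/LICENSE.txt":        "license text",
		"GeoLite2-City_example/GeoLite2-City.mmdb": "fake-mmdb-bytes",
	})
	data, err := extractMMDBFromBytes(archive)
	fmt.Println(string(data), err)
}
```

A production version would likely stream the member to a temporary file and rename it into place rather than buffering it in memory, so a failed download never replaces a good database.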
mastermanas805 added a commit that referenced this pull request on May 17, 2026
The worker's storage-quota Redis suspension built the ACL username as usr_<token[:8]> via redisUsernameForToken. Wave-2 changed both the provisioner shared backend (provisioner/.../redis/local.go) and the api cache provider (api/.../cache/redis.go) to use the FULL token (usr_<full-token>), and the dedicated backend (provisioner/.../redis/dedicated.go) uses ded_<token[:8]>. The worker matched neither, so `ACL SETUSER <user> off` targeted a non-existent user: a silent no-op. The resource row flipped to 'suspended' but the customer kept full Redis access. Token-truncation class (recurring pattern #1 in BUGHUNT-REPORT-2026-05-17-round2.md).

Fix: redisUsernameForToken now takes the resource tier and returns the exact provision-time username: usr_<full-token> for shared-backend tiers (anonymous/free, via isSharedRedisTier) and ded_<token[:8]> for dedicated paid tiers. The ResourceInfraRevoker interface's unused connectionURL parameter is repurposed to tier; both call sites in quota.go (suspend + unsuspend loops) thread the tier through.

Tests: quota_infra_redis_username_test.go asserts both username schemes byte-for-byte and that the tier classifier matches the eviction loop's; the suspend test now asserts tier is passed to the revoker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request on May 17, 2026
…_resource_id

Token-truncation class (P1, BUGHUNT-REPORT-2026-05-17-round2 recurring pattern #1). The worker's quota-suspend re-derived the dedicated-Redis ACL username as ded_<token[:8]> and the storage scanner re-derived the object prefix as token[:8]; both collide for two tokens sharing 8 hex chars.

Store-at-provision, never re-derive (mirrors provisioner dedident.go and api prefixident.go):

- ResourceInfraRevoker.RevokeAccess/GrantAccess gain a providerResourceID param. redisUsernameForToken resolves: stored provider_resource_id when present (the exact provisioned name), else shared usr_<full-token>, else LEGACY dedicated ded_<token[:8]> for pre-fix rows.
- The quota suspend/unsuspend SELECTs now read provider_resource_id and thread it to the revoker.
- storage_minio.minioObjectPrefix already preferred provider_resource_id; the token[:8] branch is documented as a legacy fallback with a named constant (legacyStorageObjectPrefixTokenLen).

Coverage tests: quota_infra_redis_username_test.go and the new storage_minio_prefix_test.go assert the stored-PRID path is used verbatim, the legacy token[:8] fallback still resolves old-form identifiers, and two 8-char-prefix-sharing tokens no longer collide.

Deploy-order-safe: a brief api/worker deploy skew only weakens quota-suspend (fail-safe, no worse than today's silent no-op); no isolation boundary depends on cross-service version parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
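The resolution order described above (stored ID, then shared full-token name, then legacy truncated name) can be sketched as follows. The function names match the commit text, but the signature, the tier strings, the `legacyDedicatedTokenLen` constant, and the format of the stored ID are assumptions made for this example.

```go
package main

import "fmt"

// legacyDedicatedTokenLen is a hypothetical constant for the pre-fix
// ded_<token[:8]> form described in the commit message.
const legacyDedicatedTokenLen = 8

// isSharedRedisTier reports whether a tier uses the shared Redis backend.
// The commit names anonymous/free as shared; the exact strings are assumed.
func isSharedRedisTier(tier string) bool {
	return tier == "anonymous" || tier == "free"
}

// redisUsernameForToken resolves the ACL username in the order the commit
// describes. The parameter order here is an assumption.
func redisUsernameForToken(token, tier, providerResourceID string) string {
	if providerResourceID != "" {
		// Stored at provision time: use it verbatim, never re-derive.
		return providerResourceID
	}
	if isSharedRedisTier(tier) {
		return "usr_" + token
	}
	// Legacy fallback for rows provisioned before the ID was stored.
	n := legacyDedicatedTokenLen
	if len(token) < n {
		n = len(token)
	}
	return "ded_" + token[:n]
}

func main() {
	a, b := "deadbeef11111111", "deadbeef22222222"
	// Legacy rows sharing an 8-char prefix resolve to the same name.
	fmt.Println(redisUsernameForToken(a, "hobby", ""), redisUsernameForToken(b, "hobby", ""))
	// With a stored ID (format assumed here) the names stay distinct.
	fmt.Println(redisUsernameForToken(a, "hobby", "ded_"+a), redisUsernameForToken(b, "hobby", "ded_"+b))
	fmt.Println(redisUsernameForToken(a, "free", ""))
}
```

The first line of output shows the collision that motivated the change; the stored-ID path avoids it because nothing is truncated at lookup time.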
mastermanas805 added a commit that referenced this pull request on May 20, 2026
…ation APPLIED (#34)

Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go lines 756–771): handleTierElevation treated `(Applied=false, SkipReason=<any string not in the allowed-skip whitelist>)` as success. A WARN log fired, firstErr stayed nil, the runner stamped applied_at on the row, and the entitlement_reconciler (5-min backstop) saw no drift to correct because applied_at was set. A paying customer's tier-elevation regrade never landed: no retry, no dead-letter, no alert.

Real prod trigger: customer's postgres pod missing postgres-admin Secret (legacy free-tier pods, mid-deprovisioning races). The chaos drill confirmed the failure mode end-to-end.

Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr (implements errors.Is on errPropagationUnexpectedSkipSentinel). The runner's markRetry path detects the sentinel and emits a distinct propagation.unexpected_skip audit row (NOT propagation.applied). The row retries per the existing backoff schedule (1m, 5m, 15m, ...) and dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going through the standard markDeadLettered path with the canonical propagation.dead_lettered audit kind that operators already alert on.

New Prometheus counter: instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason} with bounded skip_reason cardinality via bucketSkipReason(): postgres_admin_secret_missing, redis_auth_secret_missing, namespace_not_found, pod_not_found, resource_not_reachable, legacy_resource, other. Leading indicator for the dead-letter alert that already exists.

Audit kinds the runner now emits (mirrors api/models/audit_kinds.go):
- propagation.applied (success; unchanged)
- propagation.retrying (routine retry; unchanged)
- propagation.dead_lettered (terminal failure; unchanged)
- propagation.unexpected_skip (NEW: F1 retry signal)

Coverage block (CLAUDE.md rule 17):
  Symptom: propagation.applied audit row + applied_at stamp on a row whose regrade never landed
  Enumeration: rg -F 'unexpected_skip' (worker, provisioner, api repos)
  Sites found: 1 emit site (handleTierElevation only)
  Sites touched: 1
  Coverage test: TestIsPropagationAllowedSkip_Coverage iterates propagationAllowedSkipSubstrings + a known-failure string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied fails the second a future PR re-routes unexpected_skip through markApplied
  Live verified: pending; will verify post-deploy via synthetic pending_propagations row pointing at non-existent team_id with kind=tier_elevation
  Tests pass:
    TestPropagation_UnexpectedSkip_DoesNotMarkApplied PASS
    TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts PASS
    TestIsPropagationAllowedSkip_Coverage PASS
    TestPropagationUnexpectedSkipErr_IsMatches PASS
    TestBucketSkipReason_BoundsCardinality PASS

make gate green (build + vet + go test ./... -short -count=1).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
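A minimal sketch of the sentinel-error pattern the F1 fix relies on: a typed error that matches a sentinel through an `Is` method, so the retry path can pick a distinct audit kind even when the error is wrapped. The type and sentinel names follow the commit text; their bodies, the allowed-skip entry, and the `classify` and `auditKindFor` helpers are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Sentinel the retry path matches against with errors.Is.
var errPropagationUnexpectedSkipSentinel = errors.New("propagation: unexpected skip")

// propagationUnexpectedSkipErr carries the raw skip reason for logging.
type propagationUnexpectedSkipErr struct{ SkipReason string }

func (e *propagationUnexpectedSkipErr) Error() string {
	return "unexpected skip: " + e.SkipReason
}

// Is lets errors.Is(err, sentinel) succeed for any instance of this type.
func (e *propagationUnexpectedSkipErr) Is(target error) bool {
	return target == errPropagationUnexpectedSkipSentinel
}

// Hypothetical whitelist entry; the real list is not shown in the commit.
var propagationAllowedSkipSubstrings = []string{"already at target tier"}

func isAllowedSkip(reason string) bool {
	for _, s := range propagationAllowedSkipSubstrings {
		if strings.Contains(reason, s) {
			return true
		}
	}
	return false
}

// classify (hypothetical helper) turns a handler result into nil or an error.
func classify(applied bool, skipReason string) error {
	if applied || isAllowedSkip(skipReason) {
		return nil
	}
	return &propagationUnexpectedSkipErr{SkipReason: skipReason}
}

// auditKindFor (hypothetical helper) picks the audit kind for an outcome.
func auditKindFor(err error) string {
	switch {
	case err == nil:
		return "propagation.applied"
	case errors.Is(err, errPropagationUnexpectedSkipSentinel):
		return "propagation.unexpected_skip"
	default:
		return "propagation.retrying"
	}
}

func main() {
	wrapped := fmt.Errorf("tier_elevation: %w", classify(false, "postgres-admin secret missing"))
	fmt.Println(auditKindFor(wrapped))
	fmt.Println(auditKindFor(classify(false, "already at target tier")))
	fmt.Println(auditKindFor(errors.New("grpc unavailable")))
}
```

Matching on a sentinel rather than on the concrete type keeps the retry path decoupled from how the handler wraps or annotates the error.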
mastermanas805 added a commit that referenced this pull request on May 20, 2026
…integration test layer (#35)

* fix(jobs): CHAOS F1 — unexpected_skip no longer silently marks propagation APPLIED

* fix(jobs): CHAOS F2/F3/F4 — bound unknown_kind retries, add dead-letter counter, pin River RescueStuckJobsAfter

Ships the three follow-ups from CHAOS-DRILL-2026-05-20 on top of F1's unexpected_skip fix. F1 already pulled in the F2/F3 helpers (the propagation_runner.go markUnknownKindDeadLettered path + the PropagationDeadLetteredTotal / PropagationUnknownKindTotal counters); this commit adds the F4 River config pin and the registry-iterating tests that lock the three fixes in.

F2 (P1, CHAOS-DRILL-2026-05-20): pending_propagations rows whose kind has no handler now respect the same propagationMaxAttempts ceiling as a real-failure row. Pre-fix they retried forever, confirmed live during the drill (chaos_test_unknown_kind reached attempts=10 in 4 minutes without ever transitioning to failed_at). New markUnknownKindDeadLettered path emits a distinct propagation.unknown_kind_dead_lettered audit row + bumps instant_propagation_dead_lettered_total{reason="unknown_kind",kind="unknown_kind"} (the second label is a bounded BUCKET, NOT the raw row.kind, so an attacker-controlled enqueue cannot blow up Prom cardinality).

F3 (P2, CHAOS-DRILL-2026-05-20): Adds instant_propagation_dead_lettered_total{reason,kind} counter incremented on every transition to failed_at. reason="max_attempts" covers the modal path (real RegradeResource failures, F1's unexpected_skip-as-failure, and markApplied DB failures once they exhaust the backoff schedule); reason="unknown_kind" covers F2's image-skew path. Also adds the per-tick instant_propagation_unknown_kind_total{kind} counter as a leading indicator so the operator sees "worker is older than api" in seconds rather than waiting ~24h for the dead-letter to land.

F4 (P1, CHAOS-DRILL-2026-05-20): River's default RescueStuckJobsAfter = JobTimeout + JobRescuerRescueAfterDefault = 20m + 1h = 1h20m. That is an 80-minute RTO ceiling on any catastrophic worker death (OOMKill / pod eviction / segfault) where River's client never gets to mark the job back to 'available' itself. Pin it explicitly to 25 minutes: JobTimeout (20m) + 5m of jitter headroom. Every job in this worker is idempotent, so a duplicate rescue is a no-op rather than a double-effect. The rescue_stuck_jobs_after value is now also stamped into the jobs.workers.started log line so a kubectl-logs grep after a roll confirms the pinned RTO is live.

TESTS (registry-iterating per CLAUDE.md rule 18):

- TestPropagation_UnknownKind_DeadLettersAtMaxAttempts: synthesises a guaranteed-not-in-registry kind (chaos_unknown_kind_<unix_nano>), drives a row at propagationMaxAttempts-1 through Work(), asserts (a) failed_at-stamping UPDATE landed and (b) PropagationDeadLetteredTotal {reason=unknown_kind,kind=unknown_kind} delta == 1. A future PR adding the synthetic kind to propagationHandlers cannot turn this test into a no-op because the kind is freshly generated each run.
- TestPropagation_UnknownKind_RetriesBelowMaxAttempts: companion guard so the F2 fix can't regress into the opposite bug (immediate dead-letter at attempts=0).
- TestPropagation_DeadLetter_IncrementsMetric: pins the F3 contract on the modal max_attempts path (tier_elevation kind, persistent gRPC failure at attempts=propagationMaxAttempts-1, asserts the Prom counter incremented).
- TestWorker_RiverConfig_RescueStuckJobsAfterIs25Min: pins rescueStuckJobsAfter == 25m exactly, AND > globalJobTimeout (so the rescuer doesn't race River's own timeout), AND < River's default 1h20m (so the explicit pin remains an actual reduction).

GATES: make gate green (build + vet + go test ./... -short -count=1 all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(integration): propagation_runner integration test layer (Track 3)

Adds the next layer up from worker/internal/jobs/propagation_runner_test.go (sqlmock unit drift guards). No build tag; runs under regular make gate.

- TestPropagation_BackoffIntegration_ExactScheduleViaMarkRetry: Drives w.markRetry directly with a deterministic clock and pins the persisted next_attempt_at SQL UPDATE arg for every position in propagationBackoffSchedule + the clamp arm. Catches a refactor that changes the next-attempt formula without updating the schedule.
- TestPropagation_DeadLetterIntegration_AtMaxAttempts: Drives w.markDeadLettered directly. Pins the SQL UPDATE setting failed_at AND the propagation.dead_lettered audit row emission. Catches a conditional skip of the audit emission.
- TestPropagation_UnknownKindIntegration_BoundedRetries (F2 P1 guard): A pending_propagation with kind='garbage_kind_nobody_handles' must flow through markRetry (attempts++), not a silent skip. Catches a refactor that bypasses attempts++ in the unknown_kind branch.
- TestPropagation_ForUpdateSkipLockedIntegration: Live-DB concurrent picker test. Two workers pickEligible the same row; total picks must be <= 1. Gated on TEST_DATABASE_URL + absence of -short. Catches a SKIP LOCKED removal that lets sibling pods double-dispatch.
- TestPropagation_RegistryWalkIntegration_EnumVsHandlerMap: Rule 18 registry walk against the pending_propagations.kind PG enum. Catches a migration that adds an enum value without a matching handler in propagationHandlers.

CLAUDE.md rule 17 coverage block per test, rule 18 registry walk in two of the five tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
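The F4 arithmetic and the three invariants the pin test asserts can be written out as a small check. The durations come from the commit text; the constant names other than rescueStuckJobsAfter and globalJobTimeout, and the `checkRescuePin` helper, are hypothetical, and the sketch does not construct a real River client.

```go
package main

import (
	"fmt"
	"time"
)

const (
	globalJobTimeout        = 20 * time.Minute // JobTimeout per the commit text
	riverRescueAfterDefault = time.Hour        // JobRescuerRescueAfterDefault per the commit text
	rescueHeadroom          = 5 * time.Minute  // jitter headroom (hypothetical name)

	// The explicit pin: 20m + 5m = 25m.
	rescueStuckJobsAfter = globalJobTimeout + rescueHeadroom
	// What River would use if nothing were pinned: 20m + 1h = 1h20m.
	riverDefaultRescue = globalJobTimeout + riverRescueAfterDefault
)

// checkRescuePin mirrors the three assertions described for
// TestWorker_RiverConfig_RescueStuckJobsAfterIs25Min.
func checkRescuePin(pin, jobTimeout, riverDefault time.Duration) error {
	if pin != 25*time.Minute {
		return fmt.Errorf("pin is %s, want 25m", pin)
	}
	if pin <= jobTimeout {
		return fmt.Errorf("pin %s must exceed job timeout %s", pin, jobTimeout)
	}
	if pin >= riverDefault {
		return fmt.Errorf("pin %s must be below River default %s", pin, riverDefault)
	}
	return nil
}

func main() {
	fmt.Println("pinned:", rescueStuckJobsAfter, "default:", riverDefaultRescue)
	fmt.Println("check:", checkRescuePin(rescueStuckJobsAfter, globalJobTimeout, riverDefaultRescue))
}
```

Keeping the pin strictly above the job timeout matters because a rescuer that fires first would re-queue jobs that are still legitimately running.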
Summary
Adds `k8s.io/client-go` to deps.
Test plan
🤖 Generated with Claude Code