feat(jobs): k8s-aware deploy + custom-domain reconcilers by mastermanas805 · Pull Request #1 · InstaNode-dev/worker

mastermanas805 · 2026-05-11T07:58:33Z

Summary

- deploy_status_reconcile.go: poll k8s for actual ReplicaSet state and update DB status. Replaces the optimistic "trust the compute provider" model that flagged `healthy` before pods were Ready.
- custom_domain_reconcile.go: drive Ingress + cert-manager Certificate creation for custom-domain rows; mark verified when cert is Ready.

Adds `k8s.io/client-go` to deps.

Test plan

- Add reconcile-loop unit tests (follow-up)
- Verified live behaviour via the deploy-status reporting in PR-A's smoke tests

🤖 Generated with Claude Code

Two new background reconcilers wired into the River queue:

- deploy_status_reconcile.go: polls the k8s deploy namespace for each active deployment and updates the platform DB status (building → deploying → healthy/failed) based on actual ReplicaSet rollout state. Replaces the previous "compute provider returns and we trust it" model that was reporting healthy before pods were Ready.
- custom_domain_reconcile.go: checks pending custom-domain rows, creates/updates the corresponding Ingress + cert-manager Certificate resources, marks the row verified when the cert is Ready. Tolerates transient ACME failures.

Adds k8s.io/client-go (and transitive deps) to go.mod. Jobs registered in internal/jobs/workers.go; expire.go gains a small change to skip deploys that the reconciler will catch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
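
The status mapping described for deploy_status_reconcile.go can be sketched as a pure function. This is a minimal illustration, not the code in the PR: the `rolloutView` type, its fields, and `statusFor` are hypothetical names, and the exact conditions the real reconciler uses for each status are assumptions.

```go
package main

import "fmt"

// rolloutView is a hypothetical, trimmed-down snapshot of a rollout.
// A real reconciler would fill it from the Kubernetes API.
type rolloutView struct {
	Found            bool  // false until the workload exists in the namespace
	Desired          int32 // replicas requested by the spec
	Updated          int32 // replicas running the newest pod template
	Ready            int32 // replicas passing readiness checks
	DeadlineExceeded bool  // the rollout stopped making progress
}

// statusFor maps a snapshot to one of the DB statuses named in the commit
// message. "healthy" is only reported once every desired replica is both
// updated and Ready, which is the behaviour the optimistic model lacked.
func statusFor(v rolloutView) string {
	switch {
	case !v.Found:
		return "building" // assumption: nothing deployed yet means still building
	case v.DeadlineExceeded:
		return "failed"
	case v.Desired > 0 && v.Updated >= v.Desired && v.Ready >= v.Desired:
		return "healthy"
	default:
		return "deploying"
	}
}

func main() {
	// One of two replicas is not Ready yet, so the deploy is still rolling out.
	fmt.Println(statusFor(rolloutView{Found: true, Desired: 2, Updated: 2, Ready: 1}))
}
```

Keeping the mapping free of API calls makes it easy to unit-test separately from the polling loop.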

P2 worker-side fixes from BUGHUNT-REPORT-2026-05-17.md:

1. GeoLite2 refresh corrupted the .mmdb after every run: MaxMind serves a gzipped tarball (suffix=tar.gz) but the job io.Copy'd the raw bytes straight to the .mmdb path. Now gunzip + untar and extract only the *.mmdb member (geodb.go: extractGeoLite2MMDB).
2. razorpay_webhook_events dedup table was never pruned (migration 033 envisioned it; no job shipped). Added RazorpayWebhookPruneWorker, a daily DELETE of rows > 30 days, registered alongside uptime_retention.
3. Billing reconciler free-downgrade stranded paid resources: a terminal Razorpay status with PaidCount==0 set plan_tier='free' (the 24h-TTL ephemeral tier) while resources kept a paid tier + expires_at=NULL. Now every terminal status downgrades to 'hobby' (lowest paid tier), matching the non-zero-paid branch. This keeps team-tier and resource-tier coherent.
4. ExpireStacksWorker hard-deleted the stacks row even when namespace teardown was skipped (k8sClient==nil), orphaning a live namespace with no DB pointer. Now skips the DELETE too, leaving the row for a later in-cluster run.
5. Storage suspend/unsuspend flapped at the boundary (suspend and unsuspend both at limitBytes). Added a hysteresis band: unsuspend only below 90% of the limit; the unsuspend loop now also skips resources the suspend loop flipped in the same Work() tick.
6. Expiry worker ignored paused/suspended anonymous resources past TTL: query was status='active' only, orphaning them. Now expires status IN ('active','paused','suspended') past TTL (TTL wins over lifecycle state).
7. Confirmed worker storage limits are MiB throughout (*1024*1024); no decimal-MB sites, so no change needed.

Tests: extraction unit tests for #1, hysteresis dead-band test for #5, paused/suspended expiry test for #6, prune-job tests for #2. go build / go vet / go test ./... -short all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
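
The hysteresis band from item 5 can be illustrated with a small state function. This is a sketch under assumptions: `nextStatus` and `unsuspendThreshold` are hypothetical names, suspension is assumed to trigger at `usedBytes >= limitBytes`, and the same-tick skip mentioned in the commit message is not modelled.

```go
package main

import "fmt"

const (
	statusActive    = "active"
	statusSuspended = "suspended"
)

// unsuspendThreshold returns the lower edge of the band: the limit minus a
// tenth of it, i.e. roughly 90% of the limit under integer arithmetic.
func unsuspendThreshold(limitBytes int64) int64 {
	return limitBytes - limitBytes/10
}

// nextStatus applies the band. Usage between the threshold and the limit
// leaves the current status untouched, which is what stops the flapping.
func nextStatus(current string, usedBytes, limitBytes int64) string {
	switch current {
	case statusActive:
		if usedBytes >= limitBytes {
			return statusSuspended
		}
	case statusSuspended:
		if usedBytes < unsuspendThreshold(limitBytes) {
			return statusActive
		}
	}
	return current
}

func main() {
	limit := int64(100) * 1024 * 1024 // 100 MiB, matching the MiB convention in item 7
	used := int64(95) * 1024 * 1024   // inside the band
	fmt.Println(nextStatus(statusSuspended, used, limit))
	fmt.Println(nextStatus(statusActive, used, limit))
}
```

With a single threshold, a resource hovering at the limit would change state on every tick; with the band, a suspended resource has to drop clearly below the limit before it is re-enabled.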

The worker's storage-quota Redis suspension built the ACL username as usr_<token[:8]> via redisUsernameForToken. Wave-2 changed both the provisioner shared backend (provisioner/.../redis/local.go) and the api cache provider (api/.../cache/redis.go) to use the FULL token (usr_<full-token>), and the dedicated backend (provisioner/.../redis/dedicated.go) uses ded_<token[:8]>. The worker matched neither, so `ACL SETUSER <user> off` targeted a non-existent user: a silent no-op. The resource row flipped to 'suspended' but the customer kept full Redis access. Token-truncation class (recurring pattern #1 in BUGHUNT-REPORT-2026-05-17-round2.md).

Fix: redisUsernameForToken now takes the resource tier and returns the exact provision-time username: usr_<full-token> for shared-backend tiers (anonymous/free, via isSharedRedisTier) and ded_<token[:8]> for dedicated paid tiers. The ResourceInfraRevoker interface's unused connectionURL parameter is repurposed to tier; both call sites in quota.go (suspend + unsuspend loops) thread the tier through.

Tests: quota_infra_redis_username_test.go asserts both username schemes byte-for-byte and that the tier classifier matches the eviction loop's; the suspend test now asserts tier is passed to the revoker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…_resource_id

Token-truncation class (P1, BUGHUNT-REPORT-2026-05-17-round2 recurring pattern #1). The worker's quota-suspend re-derived the dedicated-Redis ACL username as ded_<token[:8]> and the storage scanner re-derived the object prefix as token[:8]; both collide for two tokens sharing 8 hex chars.

Store-at-provision, never re-derive (mirrors provisioner dedident.go and api prefixident.go):

- ResourceInfraRevoker.RevokeAccess/GrantAccess gain a providerResourceID param. redisUsernameForToken resolves: stored provider_resource_id when present (the exact provisioned name), else shared usr_<full-token>, else LEGACY dedicated ded_<token[:8]> for pre-fix rows.
- The quota suspend/unsuspend SELECTs now read provider_resource_id and thread it to the revoker.
- storage_minio.minioObjectPrefix already preferred provider_resource_id; the token[:8] branch is documented as a legacy fallback with a named constant (legacyStorageObjectPrefixTokenLen).

Coverage tests: quota_infra_redis_username_test.go and the new storage_minio_prefix_test.go assert the stored-PRID path is used verbatim, the legacy token[:8] fallback still resolves old-form identifiers, and two 8-char-prefix-sharing tokens no longer collide.

Deploy-order-safe: a brief api/worker deploy skew only weakens quota-suspend (fail-safe: no worse than today's silent no-op); no isolation boundary depends on cross-service version parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
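
The resolution order described for redisUsernameForToken (stored identifier first, then the shared scheme, then the legacy truncated scheme) might look roughly like the following. The function and constant names here are hypothetical stand-ins, and treating exactly `anonymous` and `free` as the shared tiers is taken from the earlier commit message rather than from the code.

```go
package main

import "fmt"

// legacyDedicatedTokenLen is a hypothetical constant name for the 8-char
// truncation used by rows provisioned before the fix.
const legacyDedicatedTokenLen = 8

// isSharedTier is an assumed stand-in for the tier classifier.
func isSharedTier(tier string) bool {
	return tier == "anonymous" || tier == "free"
}

// redisUsername prefers the identifier stored at provision time and only
// re-derives a name when no stored value exists.
func redisUsername(token, tier, providerResourceID string) string {
	switch {
	case providerResourceID != "":
		return providerResourceID // exact provisioned name, used verbatim
	case isSharedTier(tier):
		return "usr_" + token // shared backend: full token
	default:
		n := legacyDedicatedTokenLen
		if len(token) < n {
			n = len(token) // avoid slicing past a short token
		}
		return "ded_" + token[:n] // legacy dedicated fallback
	}
}

func main() {
	// Two tokens sharing their first 8 characters.
	a, b := "abcdef1234567890", "abcdef12ffffffff"
	fmt.Println(redisUsername(a, "hobby", ""), redisUsername(b, "hobby", ""))
	fmt.Println(redisUsername(a, "hobby", "ded_a-stored"), redisUsername(b, "hobby", "ded_b-stored"))
}
```

The first line of output shows why re-deriving is unsafe: both tokens map to the same legacy name. The second shows that stored identifiers keep them distinct.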

…ation APPLIED (#34)

Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go lines 756–771): handleTierElevation treated `(Applied=false, SkipReason=<any string not in the allowed-skip whitelist>)` as success. A WARN log fired, firstErr stayed nil, the runner stamped applied_at on the row, and the entitlement_reconciler (5-min backstop) saw no drift to correct because applied_at was set. A paying customer's tier-elevation regrade never landed: no retry, no dead-letter, no alert.

Real prod trigger: customer's postgres pod missing postgres-admin Secret (legacy free-tier pods, mid-deprovisioning races). The chaos drill confirmed the failure mode end-to-end.

Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr (implements errors.Is on errPropagationUnexpectedSkipSentinel). The runner's markRetry path detects the sentinel and emits a distinct propagation.unexpected_skip audit row (NOT propagation.applied). The row retries per the existing backoff schedule (1m, 5m, 15m, ...) and dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going through the standard markDeadLettered path with the canonical propagation.dead_lettered audit kind that operators already alert on.

New Prometheus counter: instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason} with bounded skip_reason cardinality via bucketSkipReason(): postgres_admin_secret_missing, redis_auth_secret_missing, namespace_not_found, pod_not_found, resource_not_reachable, legacy_resource, other. Leading indicator for the dead-letter alert that already exists.

Audit kinds the runner now emits (mirrors api/models/audit_kinds.go):

- propagation.applied (success; unchanged)
- propagation.retrying (routine retry; unchanged)
- propagation.dead_lettered (terminal failure; unchanged)
- propagation.unexpected_skip (NEW: F1 retry signal)

Coverage block (CLAUDE.md rule 17):

- Symptom: propagation.applied audit row + applied_at stamp on a row whose regrade never landed
- Enumeration: rg -F 'unexpected_skip' (worker, provisioner, api repos)
- Sites found: 1 emit site (handleTierElevation only)
- Sites touched: 1
- Coverage test: TestIsPropagationAllowedSkip_Coverage iterates propagationAllowedSkipSubstrings + a known-failure string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied fails the second a future PR re-routes unexpected_skip through markApplied
- Live verified: pending; will verify post-deploy via synthetic pending_propagations row pointing at non-existent team_id with kind=tier_elevation

Tests pass:

- TestPropagation_UnexpectedSkip_DoesNotMarkApplied PASS
- TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts PASS
- TestIsPropagationAllowedSkip_Coverage PASS
- TestPropagationUnexpectedSkipErr_IsMatches PASS
- TestBucketSkipReason_BoundsCardinality PASS

make gate green (build + vet + go test ./... -short -count=1).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…integration test layer (#35)

* fix(jobs): CHAOS F1 — unexpected_skip no longer silently marks propagation APPLIED

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(jobs): CHAOS F2/F3/F4 — bound unknown_kind retries, add dead-letter counter, pin River RescueStuckJobsAfter

Ships the three follow-ups from CHAOS-DRILL-2026-05-20 on top of F1's unexpected_skip fix. F1 already pulled in the F2/F3 helpers (the propagation_runner.go markUnknownKindDeadLettered path + the PropagationDeadLetteredTotal / PropagationUnknownKindTotal counters); this commit adds the F4 River config pin and the registry-iterating tests that lock the three fixes in.

F2 (P1, CHAOS-DRILL-2026-05-20): pending_propagations rows whose kind has no handler now respect the same propagationMaxAttempts ceiling as a real-failure row. Pre-fix they retried forever, confirmed live during the drill (chaos_test_unknown_kind reached attempts=10 in 4 minutes without ever transitioning to failed_at). New markUnknownKindDeadLettered path emits a distinct propagation.unknown_kind_dead_lettered audit row + bumps instant_propagation_dead_lettered_total{reason="unknown_kind",kind="unknown_kind"} (the second label is a bounded BUCKET, NOT the raw row.kind, so an attacker-controlled enqueue cannot blow up Prom cardinality).

F3 (P2, CHAOS-DRILL-2026-05-20): Adds instant_propagation_dead_lettered_total{reason,kind} counter incremented on every transition to failed_at. reason="max_attempts" covers the modal path (real RegradeResource failures, F1's unexpected_skip-as-failure, and markApplied DB failures once they exhaust the backoff schedule); reason="unknown_kind" covers F2's image-skew path. Also adds the per-tick instant_propagation_unknown_kind_total{kind} counter as a leading indicator so the operator sees "worker is older than api" in seconds rather than waiting ~24h for the dead-letter to land.

F4 (P1, CHAOS-DRILL-2026-05-20): River's default RescueStuckJobsAfter = JobTimeout + JobRescuerRescueAfterDefault = 20m + 1h = 1h20m. That is an 80-minute RTO ceiling on any catastrophic worker death (OOMKill / pod eviction / segfault) where River's client never gets to mark the job back to 'available' itself. Pin it explicitly to 25 minutes: JobTimeout (20m) + 5m of jitter headroom. Every job in this worker is idempotent, so a duplicate rescue is a no-op rather than a double-effect. The rescue_stuck_jobs_after value is now also stamped into the jobs.workers.started log line so a kubectl-logs grep after a roll confirms the pinned RTO is live.
TESTS (registry-iterating per CLAUDE.md rule 18):

- TestPropagation_UnknownKind_DeadLettersAtMaxAttempts: synthesises a guaranteed-not-in-registry kind (chaos_unknown_kind_<unix_nano>), drives a row at propagationMaxAttempts-1 through Work(), asserts (a) failed_at-stamping UPDATE landed and (b) PropagationDeadLetteredTotal {reason=unknown_kind,kind=unknown_kind} delta == 1. A future PR adding the synthetic kind to propagationHandlers cannot turn this test into a no-op because the kind is freshly generated each run.
- TestPropagation_UnknownKind_RetriesBelowMaxAttempts: companion guard so the F2 fix can't regress into the opposite bug (immediate dead-letter at attempts=0).
- TestPropagation_DeadLetter_IncrementsMetric: pins the F3 contract on the modal max_attempts path (tier_elevation kind, persistent gRPC failure at attempts=propagationMaxAttempts-1, asserts the Prom counter incremented).
- TestWorker_RiverConfig_RescueStuckJobsAfterIs25Min: pins rescueStuckJobsAfter == 25m exactly, AND > globalJobTimeout (so the rescuer doesn't race River's own timeout), AND < River's default 1h20m (so the explicit pin remains an actual reduction).

GATES: make gate green (build + vet + go test ./... -short -count=1 all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(integration): propagation_runner integration test layer (Track 3)

Adds the next layer up from worker/internal/jobs/propagation_runner_test.go (sqlmock unit drift guards). No build tag; runs under regular make gate.

- TestPropagation_BackoffIntegration_ExactScheduleViaMarkRetry: Drives w.markRetry directly with a deterministic clock and pins the persisted next_attempt_at SQL UPDATE arg for every position in propagationBackoffSchedule + the clamp arm. Catches a refactor that changes the next-attempt formula without updating the schedule.
- TestPropagation_DeadLetterIntegration_AtMaxAttempts: Drives w.markDeadLettered directly. Pins the SQL UPDATE setting failed_at AND the propagation.dead_lettered audit row emission. Catches a conditional skip of the audit emission.
- TestPropagation_UnknownKindIntegration_BoundedRetries (F2 P1 guard): A pending_propagation with kind='garbage_kind_nobody_handles' must flow through markRetry (attempts++), not a silent skip. Catches a refactor that bypasses attempts++ in the unknown_kind branch.
- TestPropagation_ForUpdateSkipLockedIntegration: Live-DB concurrent picker test. Two workers pickEligible the same row; total picks must be <= 1. Gated on TEST_DATABASE_URL + absence of -short. Catches a SKIP LOCKED removal that lets sibling pods double-dispatch.
- TestPropagation_RegistryWalkIntegration_EnumVsHandlerMap: Rule 18 registry walk against the pending_propagations.kind PG enum. Catches a migration that adds an enum value without a matching handler in propagationHandlers.

CLAUDE.md rule 17 coverage block per test, rule 18 registry walk in two of the five tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
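
The registry-walk idea (every enum value must have a handler) reduces to a set difference. The sketch below assumes the enum values have already been read into a slice; the real test is described as reading them from the PG enum. `missingHandlers`, the handler signature, and the `new_kind` value are hypothetical.

```go
package main

import "fmt"

// missingHandlers returns every enum value that has no entry in the handler
// map, preserving the order of enumValues. An empty result means the
// registry and the enum agree.
func missingHandlers(enumValues []string, handlers map[string]func() error) []string {
	var missing []string
	for _, kind := range enumValues {
		if _, ok := handlers[kind]; !ok {
			missing = append(missing, kind)
		}
	}
	return missing
}

func main() {
	handlers := map[string]func() error{
		"tier_elevation": func() error { return nil },
	}
	// "new_kind" stands in for an enum value added by a migration
	// without a matching handler.
	fmt.Println(missingHandlers([]string{"tier_elevation", "new_kind"}, handlers))
}
```

Because the check iterates the enum rather than a hand-written list, a newly added value fails the test without anyone having to remember to update it.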

mastermanas805 merged commit 60f1254 into master May 11, 2026

mastermanas805 deleted the feat/k8s-reconciler-jobs branch May 11, 2026 08:00

mastermanas805 mentioned this pull request May 20, 2026

fix(jobs): CHAOS F1 — unexpected_skip no longer silently marks propagation APPLIED #34

Merged

mastermanas805 merged 1 commit into master from feat/k8s-reconciler-jobs
