fix(chaos)+test(integration): F1-F4 chaos fixes + propagation runner integration test layer#35
Merged
Merged
Conversation
…ation APPLIED Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go lines 756–771): handleTierElevation treated `(Applied=false, SkipReason=<any string not in the allowed-skip whitelist>)` as success. A WARN log fired, firstErr stayed nil, the runner stamped applied_at on the row, and the entitlement_reconciler (5-min backstop) saw no drift to correct because applied_at was set. A paying customer's tier-elevation regrade never landed — no retry, no dead-letter, no alert. Real prod trigger: customer's postgres pod missing postgres-admin Secret (legacy free-tier pods, mid-deprovisioning races). The chaos drill confirmed the failure mode end-to-end. Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr (implements errors.Is on errPropagationUnexpectedSkipSentinel). The runner's markRetry path detects the sentinel and emits a distinct propagation.unexpected_skip audit row (NOT propagation.applied). The row retries per the existing backoff schedule (1m, 5m, 15m, ...) and dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going through the standard markDeadLettered path with the canonical propagation.dead_lettered audit kind that operators already alert on. New Prometheus counter: instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason} with bounded skip_reason cardinality via bucketSkipReason() — postgres_admin_secret_missing, redis_auth_secret_missing, namespace_not_found, pod_not_found, resource_not_reachable, legacy_resource, other. Leading indicator for the dead-letter alert that already exists. Audit kinds the runner now emits (mirrors api/models/audit_kinds.go): - propagation.applied (success; unchanged) - propagation.retrying (routine retry; unchanged) - propagation.dead_lettered (terminal failure; unchanged) - propagation.unexpected_skip (NEW: F1 retry signal) Coverage block (CLAUDE.md rule 17): Symptom: propagation.applied audit row + applied_at stamp on a row whose regrade never landed Enumeration: rg -F 'unexpected_skip' (worker, provisioner, api repos) Sites found: 1 emit site (handleTierElevation only) Sites touched: 1 Coverage test: TestIsPropagationAllowedSkip_Coverage iterates propagationAllowedSkipSubstrings + a known-failure string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied fails the second a future PR re-routes unexpected_skip through markApplied Live verified: pending — will verify post-deploy via synthetic pending_propagations row pointing at non-existent team_id with kind=tier_elevation Tests pass: TestPropagation_UnexpectedSkip_DoesNotMarkApplied PASS TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts PASS TestIsPropagationAllowedSkip_Coverage PASS TestPropagationUnexpectedSkipErr_IsMatches PASS TestBucketSkipReason_BoundsCardinality PASS make gate green (build + vet + go test ./... -short -count=1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er counter, pin River RescueStuckJobsAfter
Ships the three follow-ups from CHAOS-DRILL-2026-05-20 on top of F1's
unexpected_skip fix. F1 already pulled in the F2/F3 helpers (the
propagation_runner.go markUnknownKindDeadLettered path + the
PropagationDeadLetteredTotal / PropagationUnknownKindTotal counters);
this commit adds the F4 River config pin and the registry-iterating tests
that lock the three fixes in.
F2 (P1, CHAOS-DRILL-2026-05-20):
pending_propagations rows whose kind has no handler now respect the same
propagationMaxAttempts ceiling as a real-failure row. Pre-fix they retried
forever — confirmed live during the drill (chaos_test_unknown_kind reached
attempts=10 in 4 minutes without ever transitioning to failed_at). New
markUnknownKindDeadLettered path emits a distinct
propagation.unknown_kind_dead_lettered audit row + bumps
instant_propagation_dead_lettered_total{reason="unknown_kind",kind="unknown_kind"}
(the second label is a bounded BUCKET, NOT the raw row.kind, so an
attacker-controlled enqueue cannot blow up Prom cardinality).
F3 (P2, CHAOS-DRILL-2026-05-20):
Adds instant_propagation_dead_lettered_total{reason,kind} counter incremented
on every transition to failed_at. reason="max_attempts" covers the modal
path (real RegradeResource failures, F1's unexpected_skip-as-failure, and
markApplied DB failures once they exhaust the backoff schedule);
reason="unknown_kind" covers F2's image-skew path. Also adds the per-tick
instant_propagation_unknown_kind_total{kind} counter as a leading indicator
so the operator sees "worker is older than api" in seconds rather than
waiting ~24h for the dead-letter to land.
F4 (P1, CHAOS-DRILL-2026-05-20):
River's default RescueStuckJobsAfter = JobTimeout + JobRescuerRescueAfterDefault
= 20m + 1h = 1h20m. That is an 80-minute RTO ceiling on any catastrophic
worker death (OOMKill / pod eviction / segfault) where River's client never
gets to mark the job back to 'available' itself. Pin it explicitly to
25 minutes — JobTimeout (20m) + 5m of jitter headroom. Every job in this
worker is idempotent, so a duplicate rescue is a no-op rather than a
double-effect. The rescue_stuck_jobs_after value is now also stamped into
the jobs.workers.started log line so a kubectl-logs grep after a roll
confirms the pinned RTO is live.
TESTS (registry-iterating per CLAUDE.md rule 18):
TestPropagation_UnknownKind_DeadLettersAtMaxAttempts:
synthesises a guaranteed-not-in-registry kind (chaos_unknown_kind_<unix_nano>),
drives a row at propagationMaxAttempts-1 through Work(), asserts (a)
failed_at-stamping UPDATE landed and (b) PropagationDeadLetteredTotal
{reason=unknown_kind,kind=unknown_kind} delta == 1. A future PR adding the
synthetic kind to propagationHandlers cannot turn this test into a no-op
because the kind is freshly generated each run.
TestPropagation_UnknownKind_RetriesBelowMaxAttempts:
companion guard so the F2 fix can't regress into the opposite bug
(immediate dead-letter at attempts=0).
TestPropagation_DeadLetter_IncrementsMetric:
pins the F3 contract on the modal max_attempts path (tier_elevation kind,
persistent gRPC failure at attempts=propagationMaxAttempts-1, asserts the
Prom counter incremented).
TestWorker_RiverConfig_RescueStuckJobsAfterIs25Min:
pins rescueStuckJobsAfter == 25m exactly, AND > globalJobTimeout (so the
rescuer doesn't race River's own timeout), AND < River's default 1h20m
(so the explicit pin remains an actual reduction).
GATES: make gate green (build + vet + go test ./... -short -count=1 all green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the next layer up from worker/internal/jobs/propagation_runner_test.go
(sqlmock unit drift guards). No build tag — runs under regular make gate.
- TestPropagation_BackoffIntegration_ExactScheduleViaMarkRetry
Drives w.markRetry directly with a deterministic clock and pins
the persisted next_attempt_at SQL UPDATE arg for every position
in propagationBackoffSchedule + the clamp arm. Catches a refactor
that changes the next-attempt formula without updating the
schedule.
- TestPropagation_DeadLetterIntegration_AtMaxAttempts
Drives w.markDeadLettered directly. Pins the SQL UPDATE setting
failed_at AND the propagation.dead_lettered audit row emission.
Catches a conditional skip of the audit emission.
- TestPropagation_UnknownKindIntegration_BoundedRetries (F2 P1 guard)
A pending_propagation with kind='garbage_kind_nobody_handles'
must flow through markRetry (attempts++), not a silent skip.
Catches a refactor that bypasses attempts++ in the unknown_kind
branch.
- TestPropagation_ForUpdateSkipLockedIntegration
Live-DB concurrent picker test. Two workers pickEligible the
same row; total picks must be <= 1. Gated on TEST_DATABASE_URL
+ absence of -short. Catches a SKIP LOCKED removal that lets
sibling pods double-dispatch.
- TestPropagation_RegistryWalkIntegration_EnumVsHandlerMap
Rule 18 registry walk against the pending_propagations.kind PG
enum. Catches a migration that adds an enum value without a
matching handler in propagationHandlers.
CLAUDE.md rule 17 coverage block per test, rule 18 registry walk
in two of the five tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
CHAOS fixes (F1/F2/F3/F4, base commits)
unexpected_skipno longer silently marks propagation APPLIED — handler returns propagationUnexpectedSkipErr, runner takes the markRetry path.unknown_kindpropagation rows are bounded by maxAttempts + emit propagation.unknown_kind_dead_lettered audit kind.NEW: Track 3 — Propagation runner integration tests (this commit)
Adds the next layer up from propagation_runner_test.go (sqlmock unit drift guards). No build tag — runs under regular
make gate.worker/internal/jobs/propagation_runner_integration_test.go:TestPropagation_BackoffIntegration_ExactScheduleViaMarkRetry— markRetry end-to-end persistence for every backoff position + clamp arm.TestPropagation_DeadLetterIntegration_AtMaxAttempts— markDeadLettered direct + audit row emission pin.TestPropagation_UnknownKindIntegration_BoundedRetries— F2 P1 guard: unknown_kind hits attempts++ via markRetry, NOT a silent skip.TestPropagation_ForUpdateSkipLockedIntegration— live-DB concurrent picker test, TEST_DATABASE_URL-gated.TestPropagation_RegistryWalkIntegration_EnumVsHandlerMap— rule-18 enum-vs-handler-map walk, TEST_DATABASE_URL-gated.CLAUDE.md rule 17 coverage block per test, rule 18 registry walk in two of the five tests.
Local verified
🤖 Generated with Claude Code