fix(jobs): CHAOS F1 — unexpected_skip no longer silently marks propagation APPLIED #34
Merged
…ation APPLIED

Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go lines 756–771): handleTierElevation treated `(Applied=false, SkipReason=<any string not in the allowed-skip whitelist>)` as success. A WARN log fired, firstErr stayed nil, the runner stamped applied_at on the row, and the entitlement_reconciler (5-min backstop) saw no drift to correct because applied_at was set. A paying customer's tier-elevation regrade never landed: no retry, no dead-letter, no alert.

Real prod trigger: customer's postgres pod missing postgres-admin Secret (legacy free-tier pods, mid-deprovisioning races). The chaos drill confirmed the failure mode end-to-end.

Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr (implements errors.Is on errPropagationUnexpectedSkipSentinel). The runner's markRetry path detects the sentinel and emits a distinct propagation.unexpected_skip audit row (NOT propagation.applied). The row retries per the existing backoff schedule (1m, 5m, 15m, ...) and dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going through the standard markDeadLettered path with the canonical propagation.dead_lettered audit kind that operators already alert on.

New Prometheus counter: instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason} with bounded skip_reason cardinality via bucketSkipReason(). Buckets: postgres_admin_secret_missing, redis_auth_secret_missing, namespace_not_found, pod_not_found, resource_not_reachable, legacy_resource, other. Leading indicator for the dead-letter alert that already exists.
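As an illustration of how a free-form skip reason can be folded into the fixed label set listed above, here is a minimal sketch. Only the bucket names come from this commit message. The function name `bucketSkipReasonSketch`, the matching substrings, and the sample reason strings are assumptions made for the example and may not match the real `bucketSkipReason()`.

```go
package main

import (
	"fmt"
	"strings"
)

// skipReasonBuckets pairs a lowercase substring with a metric label value.
// The bucket labels are taken from the commit message; the substrings are
// hypothetical and exist only to make the sketch runnable.
var skipReasonBuckets = []struct {
	substr string
	bucket string
}{
	{"postgres-admin", "postgres_admin_secret_missing"},
	{"redis-auth", "redis_auth_secret_missing"},
	{"namespace", "namespace_not_found"},
	{"pod not found", "pod_not_found"},
	{"not reachable", "resource_not_reachable"},
	{"legacy", "legacy_resource"},
}

// bucketSkipReasonSketch maps an arbitrary skip reason to one of a fixed set
// of label values. Rules are checked in order, so more specific substrings
// are listed first. Anything unrecognised collapses into "other", which is
// what keeps the label cardinality bounded.
func bucketSkipReasonSketch(reason string) string {
	r := strings.ToLower(reason)
	for _, b := range skipReasonBuckets {
		if strings.Contains(r, b.substr) {
			return b.bucket
		}
	}
	return "other"
}

func main() {
	fmt.Println(bucketSkipReasonSketch(`secret "postgres-admin" not found in namespace cust-123`))
	fmt.Println(bucketSkipReasonSketch("some brand new failure 7f3a"))
}
```

Whatever the real matching rules are, the property that matters is that raw reason strings (which can embed IDs or namespaces) never become label values themselves.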
Audit kinds the runner now emits (mirrors api/models/audit_kinds.go):
- propagation.applied (success; unchanged)
- propagation.retrying (routine retry; unchanged)
- propagation.dead_lettered (terminal failure; unchanged)
- propagation.unexpected_skip (NEW: F1 retry signal)

Coverage block (CLAUDE.md rule 17):
Symptom: propagation.applied audit row + applied_at stamp on a row whose regrade never landed
Enumeration: rg -F 'unexpected_skip' (worker, provisioner, api repos)
Sites found: 1 emit site (handleTierElevation only)
Sites touched: 1
Coverage test: TestIsPropagationAllowedSkip_Coverage iterates propagationAllowedSkipSubstrings + a known-failure string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied fails the second a future PR re-routes unexpected_skip through markApplied
Live verified: pending; will verify post-deploy via synthetic pending_propagations row pointing at non-existent team_id with kind=tier_elevation

Tests pass:
TestPropagation_UnexpectedSkip_DoesNotMarkApplied PASS
TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts PASS
TestIsPropagationAllowedSkip_Coverage PASS
TestPropagationUnexpectedSkipErr_IsMatches PASS
TestBucketSkipReason_BoundsCardinality PASS

make gate green (build + vet + go test ./... -short -count=1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
CHAOS-DRILL-2026-05-20 P0 finding #1: `propagation_runner.go` `handleTierElevation` treated `(Applied=false, SkipReason=<not-in-allowed-list>)` as success, silently stamping `applied_at` on a propagation row whose regrade never landed. Real prod trigger: customer's postgres pod missing the `postgres-admin` Secret (legacy free-tier pods, mid-deprovisioning races, namespace teardown that beats the regrade).

A paying customer's tier-elevation regrade never landed: no retry, no dead-letter, no alert. Pre-fix, the only signal was a WARN log buried under thousands of routine messages.
Fix
Any non-allowed `SkipReason` now returns `propagationUnexpectedSkipErr` (`errors.Is` matches `errPropagationUnexpectedSkipSentinel`). The runner's `markRetry` path detects the sentinel and emits a distinct `propagation.unexpected_skip` audit row (NOT `propagation.applied`). The row retries per the existing backoff schedule and dead-letters at `propagationMaxAttempts` via the standard `markDeadLettered` path that operators already alert on.

New Prometheus counter
`instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason}`: bounded `skip_reason` cardinality via `bucketSkipReason()`. Leading indicator for the dead-letter alert.

Suggested NR alert:
Tests
All 5 F1 tests pass. `make gate` green (build + vet + go test ./... -short -count=1).

Coverage block (CLAUDE.md rule 17)
🤖 Generated with Claude Code