fix(jobs): CHAOS F1 — unexpected_skip no longer silently marks propagation APPLIED#34

Merged
mastermanas805 merged 1 commit into master from fix/chaos-f1-unexpected-skip
May 20, 2026
Conversation

@mastermanas805
Member

Summary

CHAOS-DRILL-2026-05-20 P0 finding #1: propagation_runner.go handleTierElevation treated (Applied=false, SkipReason=<not-in-allowed-list>) as success, silently stamping applied_at on a propagation row whose regrade never landed. Real prod trigger: customer's postgres pod missing the postgres-admin Secret (legacy free-tier pods, mid-deprovisioning races, namespace teardown that beats the regrade).

A paying customer's tier-elevation regrade never landed — no retry, no dead-letter, no alert. Pre-fix, the only signal was a WARN log buried under thousands of routine messages.

Fix

Any non-allowed SkipReason now returns propagationUnexpectedSkipErr (errors.Is matches errPropagationUnexpectedSkipSentinel). The runner's markRetry path detects the sentinel and emits a distinct propagation.unexpected_skip audit row (NOT propagation.applied). The row retries per the existing backoff schedule and dead-letters at propagationMaxAttempts via the standard markDeadLettered path that operators already alert on.
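
A minimal sketch of the sentinel pattern described above, not the code from this PR. The names `propagationUnexpectedSkipErr`, `errPropagationUnexpectedSkipSentinel`, `propagationAllowedSkipSubstrings`, and `isPropagationAllowedSkip` come from the description and test names; the `SkipReason` field, the registry entries, and the helpers `classifyRegradeResult` and `isUnexpectedSkip` are hypothetical stand-ins for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errPropagationUnexpectedSkipSentinel is the value callers match with errors.Is.
var errPropagationUnexpectedSkipSentinel = errors.New("propagation: unexpected skip")

// propagationUnexpectedSkipErr carries the raw SkipReason so it can reach logs
// and audit rows. The field layout is an assumption.
type propagationUnexpectedSkipErr struct {
	SkipReason string
}

func (e *propagationUnexpectedSkipErr) Error() string {
	return fmt.Sprintf("propagation: unexpected skip: %s", e.SkipReason)
}

// Is makes errors.Is(err, errPropagationUnexpectedSkipSentinel) succeed, even
// when the error has been wrapped with %w further up the call stack.
func (e *propagationUnexpectedSkipErr) Is(target error) bool {
	return target == errPropagationUnexpectedSkipSentinel
}

// Placeholder entries only; the real registry contents are not shown in the PR.
var propagationAllowedSkipSubstrings = []string{
	"already at target tier",
	"resource deleted",
}

// isPropagationAllowedSkip reports whether reason contains an allowed substring.
func isPropagationAllowedSkip(reason string) bool {
	for _, s := range propagationAllowedSkipSubstrings {
		if strings.Contains(reason, s) {
			return true
		}
	}
	return false
}

// classifyRegradeResult is a hypothetical stand-in for the decision inside
// handleTierElevation: only an applied regrade or an allowed skip counts as
// success; every other skip reason becomes a retryable error.
func classifyRegradeResult(applied bool, skipReason string) error {
	if applied || isPropagationAllowedSkip(skipReason) {
		return nil
	}
	return &propagationUnexpectedSkipErr{SkipReason: skipReason}
}

// isUnexpectedSkip is the check a retry path could use to pick the audit kind.
func isUnexpectedSkip(err error) bool {
	return errors.Is(err, errPropagationUnexpectedSkipSentinel)
}
```

Matching on a sentinel through `errors.Is` rather than on the concrete type keeps the check stable if intermediate layers wrap the error before it reaches the retry path.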

New Prometheus counter

instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason} — bounded skip_reason cardinality via bucketSkipReason(). Leading indicator for the dead-letter alert.
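
One way bounded-cardinality bucketing could look, as a sketch under stated assumptions. The bucket labels are the ones listed in the commit message; the substrings used to match a raw reason are invented for illustration and are not taken from `bucketSkipReason()` as merged.

```go
package main

import "strings"

// skipReasonBuckets pairs a lowercase substring with a bucket label.
// Labels come from the commit message; the substrings are hypothetical.
// Order matters: the first matching entry wins.
var skipReasonBuckets = []struct {
	substr string
	bucket string
}{
	{"postgres-admin", "postgres_admin_secret_missing"},
	{"redis-auth", "redis_auth_secret_missing"},
	{"namespace not found", "namespace_not_found"},
	{"pod not found", "pod_not_found"},
	{"unreachable", "resource_not_reachable"},
	{"legacy", "legacy_resource"},
}

// bucketSkipReason maps a free-form SkipReason to one of a fixed set of label
// values, so the skip_reason label cannot grow without bound when reasons
// embed IDs, namespaces, or other per-resource detail.
func bucketSkipReason(reason string) string {
	r := strings.ToLower(reason)
	for _, b := range skipReasonBuckets {
		if strings.Contains(r, b.substr) {
			return b.bucket
		}
	}
	return "other"
}
```

Anything unrecognised falls into `other`, so a new failure string still increments the counter without creating a new time series.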

Suggested NR alert:

sum(rate(instant_propagation_unexpected_skip_total[15m])) > 0 for 30m → P2 page

Tests

All 5 F1 tests pass. make gate green (build + vet + go test ./... -short -count=1).

TestPropagation_UnexpectedSkip_DoesNotMarkApplied             PASS
TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts       PASS
TestIsPropagationAllowedSkip_Coverage                         PASS
TestPropagationUnexpectedSkipErr_IsMatches                    PASS
TestBucketSkipReason_BoundsCardinality                        PASS

Coverage block (CLAUDE.md rule 17)

Symptom:        propagation.applied audit row + applied_at stamp on a
                row whose regrade never landed
Enumeration:    rg -F 'unexpected_skip'   (worker, provisioner, api)
Sites found:    1 emit site (handleTierElevation only)
Sites touched:  1
Coverage test:  TestIsPropagationAllowedSkip_Coverage iterates the
                propagationAllowedSkipSubstrings registry + asserts
                every known-failure SkipReason is treated as unexpected
Live verified:  pending — synthetic pending_propagations row pointing
                at non-existent team_id post-deploy

🤖 Generated with Claude Code

…ation APPLIED

Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go
lines 756–771):

  handleTierElevation treated `(Applied=false, SkipReason=<any string
  not in the allowed-skip whitelist>)` as success. A WARN log fired,
  firstErr stayed nil, the runner stamped applied_at on the row, and
  the entitlement_reconciler (5-min backstop) saw no drift to correct
  because applied_at was set. A paying customer's tier-elevation
  regrade never landed — no retry, no dead-letter, no alert.

  Real prod trigger: customer's postgres pod missing postgres-admin
  Secret (legacy free-tier pods, mid-deprovisioning races). The chaos
  drill confirmed the failure mode end-to-end.

Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr
(implements errors.Is on errPropagationUnexpectedSkipSentinel). The
runner's markRetry path detects the sentinel and emits a distinct
propagation.unexpected_skip audit row (NOT propagation.applied). The
row retries per the existing backoff schedule (1m, 5m, 15m, ...) and
dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going
through the standard markDeadLettered path with the canonical
propagation.dead_lettered audit kind that operators already alert on.

New Prometheus counter:

  instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason}

with bounded skip_reason cardinality via bucketSkipReason() —
postgres_admin_secret_missing, redis_auth_secret_missing,
namespace_not_found, pod_not_found, resource_not_reachable,
legacy_resource, other. Leading indicator for the dead-letter alert
that already exists.

Audit kinds the runner now emits (mirrors api/models/audit_kinds.go):
  - propagation.applied         (success; unchanged)
  - propagation.retrying        (routine retry; unchanged)
  - propagation.dead_lettered   (terminal failure; unchanged)
  - propagation.unexpected_skip (NEW: F1 retry signal)
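
A sketch of how a runner could choose among the four audit kinds above. It assumes the attempt count is checked before the sentinel, so a row that exhausts its attempts dead-letters regardless of why it failed. `auditKindFor`, `wrapUnexpectedSkip`, and `errTransient` are hypothetical names; the value 10 for `propagationMaxAttempts` is the figure quoted in this commit message. For brevity the sentinel is wrapped with `%w` here instead of using a dedicated error type.

```go
package main

import (
	"errors"
	"fmt"
)

// propagationMaxAttempts uses the figure quoted in the commit message.
const propagationMaxAttempts = 10

var errPropagationUnexpectedSkipSentinel = errors.New("propagation: unexpected skip")

// errTransient stands in for any ordinary retryable failure (hypothetical).
var errTransient = errors.New("transient failure")

// wrapUnexpectedSkip attaches the raw reason while keeping the sentinel
// matchable through errors.Is.
func wrapUnexpectedSkip(reason string) error {
	return fmt.Errorf("%w: %s", errPropagationUnexpectedSkipSentinel, reason)
}

// auditKindFor picks the audit kind for one processing attempt.
// attempts is assumed to count the attempt that just finished.
func auditKindFor(err error, attempts int) string {
	switch {
	case err == nil:
		return "propagation.applied"
	case attempts >= propagationMaxAttempts:
		return "propagation.dead_lettered"
	case errors.Is(err, errPropagationUnexpectedSkipSentinel):
		return "propagation.unexpected_skip"
	default:
		return "propagation.retrying"
	}
}
```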

Coverage block (CLAUDE.md rule 17):

  Symptom:        propagation.applied audit row + applied_at stamp on a
                  row whose regrade never landed
  Enumeration:    rg -F 'unexpected_skip'
                  (worker, provisioner, api repos)
  Sites found:    1 emit site (handleTierElevation only)
  Sites touched:  1
  Coverage test:  TestIsPropagationAllowedSkip_Coverage iterates
                  propagationAllowedSkipSubstrings + a known-failure
                  string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied
                  fails the second a future PR re-routes unexpected_skip
                  through markApplied
  Live verified:  pending — will verify post-deploy via synthetic
                  pending_propagations row pointing at non-existent
                  team_id with kind=tier_elevation

Tests pass:
  TestPropagation_UnexpectedSkip_DoesNotMarkApplied             PASS
  TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts       PASS
  TestIsPropagationAllowedSkip_Coverage                         PASS
  TestPropagationUnexpectedSkipErr_IsMatches                    PASS
  TestBucketSkipReason_BoundsCardinality                        PASS

make gate green (build + vet + go test ./... -short -count=1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 8ecab5c into master May 20, 2026
2 checks passed