feat(email): failover ESP provider — secondary on primary reject/error (#35)#93
Merged
Merged
Conversation
#35) Add a FailoverProvider that wraps an ordered chain of EmailProvider (primary first) and retries the same EventEmail through each secondary ESP when the primary errors or hard-rejects a send. A send is "successful" (forwarder claims the row, preserving worker convention 6 claim-after-2xx) iff ANY provider in the chain returns success. This directly hardens the standing P0 (Brevo sender unvalidated → every send rejected): wire EMAIL_PROVIDER_FALLBACK=ses and a Brevo account-level reject falls through to SES on the same payload. Inert by default: EMAIL_PROVIDER_FALLBACK unset → NewProvider returns the bare single provider, byte-identical to today (no wrapper, no metric). Proven by TestFactory_NoFallback_ReturnsBareProvider. - factory: single provider build factored into buildOne(name, cfg) so primary + fallbacks share ONE switch (no duplicated provider switch). Unknown fallback name → hard boot error; valid name with unset creds → skip-with-warning (fail-open, lets ops stage the flag ahead of the SES_* secret). If all fallbacks skip, boots with the primary only. - config: EMAIL_PROVIDER_FALLBACK comma-separated → ordered Fallbacks. - metric: instant_email_failover_total{outcome=primary_ok|fallback_ok| all_failed}. INFO on fallback engage (work performed), DEBUG on primary-ok (idle), ERROR on all_failed. Emails masked in all logs. Lazy *Vec — absent from /metrics until first emit (inert default). - tests: in-package failover_provider_test.go (fake recording provider): primary-ok-no-fallback-call, primary-fail→fallback-ok (transient/ permanent/skip), all-fail→last-error, all-skip→no all_failed metric, ordering, Name(), empty/nil guards, per-outcome metric increments, email-masking on every log path. Factory inert-default + with-fallback + unknown-name-errors + missing-creds-skipped. Config parse table. internal/email per-package coverage 96.4% (≥95 floor); all new funcs 100%. make gate green. Infra alert/tile/catalog ship in a separate infra PR (rule 25) — referenced in the PR body. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds a
FailoverProvider(worker/internal/email/failover_provider.go) that wraps an ordered chain ofEmailProvider(primary first) and retries the sameEventEmailthrough each secondary ESP when the primary errors or hard-rejects. A send is "successful" — and the forwarder claims the ledger row (worker convention 6 claim-after-2xx) — iff any provider in the chain returns success.Directly hardens the standing P0 (Brevo sender unvalidated → every send rejected): wire
EMAIL_PROVIDER_FALLBACK=sesand a Brevo account-level reject falls through to SES on the same payload.Inert by default (safety contract)
EMAIL_PROVIDER_FALLBACKunset →NewProviderreturns the bare single provider, byte-identical to today (no wrapper, no metric). Proven byTestFactory_NoFallback_ReturnsBareProvider.Decision rule
Any non-success result (Transient / Permanent / SkippedNoTemplate) advances to the next provider; first success wins; all-fail returns the last error. All-skipped (no ESP has a template for the kind) returns the skip verbatim so the forwarder advances silently and does not trip
all_failed.Factory
buildOne(name, cfg)— primary + every fallback share one switch (no duplicated switch).Metric (rule 25)
instant_email_failover_total{outcome=primary_ok|fallback_ok|all_failed}(lazy CounterVec — never emits in inert default). INFO on fallback engage, DEBUG on primary-ok, ERROR on all_failed. Emails masked in all logs.Infra companion PR (alert + tile + catalog): InstaNode-dev/infra#feat/email-failover-observability.
Tests / gate
go test ./internal/email/ -cover→ PASS, 96.4% (≥95 floor); all new funcs 100%.make gate(exact CI sequence) → green.Delivery itself can't be verified (Brevo sender-unvalidated P0) — this PR asserts the failover contract + metric via tests; delivery awaits operator Brevo/SES validation.
🤖 Generated with Claude Code