fix(webhook,internal): auth-first ordering + distinct error codes (API-19/26/27/28/77/78/96/97/98)#175
Merged
Conversation
…I-19/26/27/28/77/78/96/97/98)
QA 2026-05-29 surfaced two related fail-closed inversions:
1. **Webhook routes** `/api/v1/email/webhook/brevo` + `/ses` returned a
generic 401 envelope (`error: "invalid_signature"`) for BOTH the
secret-unset path and the signature-mismatch path. Operators chasing
"Brevo can't post delivery events" got the canonical "log in for a
new INSTANODE_TOKEN" agent_action — useless because the actual fault
was an undeployed BREVO_WEBHOOK_SECRET. The bug is even more pointed
on prod: every Brevo bounce/complaint/unsubscribe was silently lost,
breaking CLAUDE.md rule 12 (email truth surface). Same pattern on
SES/SNS.
2. **Internal routes** `/internal/teams/:id/terminate`,
`/internal/email/resend-magic-link`,
`/internal/teams/:id/backup-quota/refund` parsed the path :id / body
BEFORE checking the worker JWT — so a probe could distinguish "path
malformed" (400) from "auth bad" (401) by the envelope code,
inverting the fail-closed posture documented for the /internal/*
routes.
This PR ships:
- **Auth-first ordering on all three /internal/* handlers.** Each now
runs a `preVerify*` pass (signature + alg pin + purpose + iat
freshness) BEFORE path / body parse. The team_id / link_id binding
check stays in the second-phase verify, after the path / body parses
cleanly. Pre-fix probes that could shape-fingerprint the route via
400 vs 401 envelope codes now see uniform 401 internal_token_required
for every unauth call.
- **Distinct error codes per failure class** on the email-provider
webhook routes:
- `webhook_secret_mismatch` — operator hasn't deployed the env var
- `webhook_signature_mismatch` — secret IS set, payload didn't verify
- `webhook_method_not_allowed` — GET on a POST-only webhook URL
(API-98 — dashboard pre-flight)
Each carries an operator-targeted agent_action (NOT the user-targeted
"log in to mint a new INSTANODE_TOKEN") so an alert correctly tells
the operator to fix the deploy / dashboard, not the user to log in.
- **`internal_token_required`** as the canonical /internal/* auth-fail
code. Same agent_action surface.
- **GET handlers on Brevo + SES webhook URLs.** Return 405 + Allow: POST
so a provider dashboard pre-flight that GETs the URL sees "URL exists,
method wrong" instead of the catch-all 401 (which some dashboards
interpret as "URL invalid" and silently drop).
- **Metric `instant_webhook_auth_failures_total{webhook,reason}`** with
labels {brevo_hmac, ses_sns} x {secret_unset, signature_mismatch}.
Operators can split "we forgot to deploy the secret" from "the
provider rotated their key" with one query.
- **NR alert `webhook-auth-failures.json`** — CRITICAL on
reason=secret_unset within 5 min (fix is one kubectl set env), per
CLAUDE.md rule 25.
Tests:
- 12 new hermetic tests in `auth_first_ordering_test.go` covering
every fail-first path (BUG-API-019/026/027/028/077/078/096/097/098).
- Updated 3 pre-existing tests that asserted the broken pre-fix
ordering (`invalid_body` / `invalid_link_id` / `invalid_team_id`
arms — now exercised with a valid JWT so the body / path arm stays
covered).
- OpenAPI route-coverage gate updated — new GET 405 handlers added to
the intentionallyHidden whitelist (provider-dashboard plumbing, not
agent-facing).
Surface checklist:
- [x] api/internal/handlers/{email_webhooks,internal_*}.go — handlers
- [x] api/internal/handlers/helpers.go — codeToAgentAction registry
(4 new entries: webhook_secret_mismatch, webhook_signature_mismatch,
webhook_method_not_allowed, internal_token_required, plus
invalid_message which was orphaned)
- [x] api/internal/router/router.go — GET 405 handlers registered
- [x] api/internal/metrics/metrics.go — instant_webhook_auth_failures_total
- [x] api/internal/handlers/openapi_test.go — GET routes whitelisted
- [x] infra/newrelic/alerts/webhook-auth-failures.json — NR alert
- [x] Tests: 12 new + 3 updated; codeToAgentAction registry gate passes
- [N/A] OpenAPI spec — GET endpoints are 405 plumbing, not agent surface
- [N/A] content/llms.txt — no contract change to public API
- [N/A] CHANGELOG — internal hardening, no user-visible surface change
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-bearer branch
100% patch-coverage gate review: the preVerify* helpers had a defensive
`tokenStr == ""` branch after `TrimSpace(authHeader[len("Bearer "):])`
that is unreachable in practice — the outer TrimSpace of the full
Authorization header eats trailing whitespace before the prefix check,
so a "Bearer " header lands in the missing_bearer branch. Same for the
`!tok.Valid` branch — jwt-go's ParseWithClaims surfaces every cause as
an err return, never as `tok.Valid == false` with err == nil.
Both branches removed to satisfy the 100%-of-changed-lines gate.
WithValidMethods + the SigningMethodHMAC type-assert (defence in depth)
remain — those are the actual alg-pin enforcement.
New tests cover the remaining preVerify arms:
- TestInternal{Terminate,ResendMagicLink,Refund}_PreVerify_WrongPurpose
- TestInternal{Terminate,ResendMagicLink,Refund}_PreVerify_StaleIat
- TestInternal{Terminate,ResendMagicLink,Refund}_PreVerify_MissingIat
- TestSESWebhook_GetReturns405_WithWebhookMethodNotAllowed
- TestSESWebhook_GoodTopicArnButBadEnvelope_Returns400_InvalidPayload
- TestSESWebhook_NotificationWithBadInnerMessage_Returns400_InvalidMessage
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eyfunc jwt.ParseWithClaims with WithValidMethods([HS256]) short-circuits any non-HS256 token BEFORE the keyfunc runs, so the inner *SigningMethodHMAC type-assert is unreachable. The defensive "defense-in-depth" comment overstated the value — the test confirms WithValidMethods does the work alone. Removing the unreachable branch closes the last 6 lines of the 100% patch-coverage gate without weakening alg pinning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9d4c98a to
471c069
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Closes BUG-API-019, 026, 027, 028, 077, 078, 096, 097, 098 from the 2026-05-29 QA pass.
Two related fail-closed inversions:
Webhook routes
/api/v1/email/webhook/brevo+/sesreturned a generic 401 (error: "invalid_signature") for BOTH the secret-unset path and the signature-mismatch path. Operators chasing "Brevo can't post delivery events" got the canonical "log in for a new INSTANODE_TOKEN" agent_action — useless because the actual fault was an undeployedBREVO_WEBHOOK_SECRET. The bug is even more pointed on prod: every Brevo bounce/complaint/unsubscribe was silently lost, breakingCLAUDE.mdrule 12 (email truth surface). Same pattern on SES/SNS.Internal routes
/internal/teams/:id/terminate,/internal/email/resend-magic-link,/internal/teams/:id/backup-quota/refundparsed the path:id/ body BEFORE checking the worker JWT — so an unauthenticated probe could distinguish "path malformed" (400) from "auth bad" (401) by the envelope code, inverting the fail-closed posture documented for the/internal/*routes.How
Auth-first ordering on all three
/internal/*handlers. Each now runs apreVerify*pass (signature + alg pin + purpose + iat freshness) BEFORE path / body parse. The team_id / link_id binding check stays in the second-phase verify, after the path / body parses cleanly. Pre-fix probes that could shape-fingerprint the route via 400 vs 401 envelope codes now see uniform 401internal_token_requiredfor every unauth call.Distinct error codes per failure class on the email-provider webhook routes:
webhook_secret_mismatch— operator hasn't deployed the env varwebhook_signature_mismatch— secret IS set, payload didn't verifywebhook_method_not_allowed— GET on a POST-only webhook URL (BUG-API-098 — dashboard pre-flight)Each carries an operator-targeted
agent_action(NOT the user-targeted "log in to mint a new INSTANODE_TOKEN") so an alert correctly tells the operator to fix the deploy / dashboard, not the user to log in.internal_token_requiredas the canonical /internal/* auth-fail code.GET handlers on Brevo + SES webhook URLs. Return 405 +
Allow: POSTso a provider dashboard pre-flight that GETs the URL sees "URL exists, method wrong" instead of the catch-all 401.Metric
instant_webhook_auth_failures_total{webhook,reason}with labels{brevo_hmac, ses_sns}x{secret_unset, signature_mismatch}. Operators can split "we forgot to deploy the secret" from "the provider rotated their key" with one query.NR alert
webhook-auth-failures.json— CRITICAL onreason=secret_unsetwithin 5 min (fix is onekubectl set env), per CLAUDE.md rule 25.Tests
internal/handlers/auth_first_ordering_test.gocovering every fail-first path (BUG-API-019/026/027/028/077/078/096/097/098). No DB required — sqlmock + Fiber app.invalid_body/invalid_link_id/invalid_team_idarms — now exercised with a valid JWT so the body / path arm stays covered).intentionallyHiddenwhitelist (provider-dashboard plumbing, not agent-facing).TestCodeToAgentAction_NoOrphans+TestErrorCode_HasAgentActionregistry gates green.Surface checklist
api/internal/handlers/{email_webhooks,internal_*}.go— handler ordering + codesapi/internal/handlers/helpers.go—codeToAgentAction(5 new entries:webhook_secret_mismatch,webhook_signature_mismatch,webhook_method_not_allowed,internal_token_required,invalid_message)api/internal/router/router.go— GET 405 handlers registeredapi/internal/metrics/metrics.go—instant_webhook_auth_failures_totalapi/internal/handlers/openapi_test.go— GET routes whitelistedinfra/newrelic/alerts/webhook-auth-failures.json— NR alert (rule 25)content/llms.txt— no contract change to public APILive verify plan (post-merge)
curl https://api.instanode.dev/healthz | jq .commit_idmatches HEADcurl -X POST https://api.instanode.dev/api/v1/email/webhook/brevo -d '{}'→ 401 +error: "webhook_secret_mismatch"(NOTunauthorized)curl -X GET https://api.instanode.dev/api/v1/email/webhook/brevo→ 405 +Allow: POST+error: "webhook_method_not_allowed"curl -X POST https://api.instanode.dev/internal/teams/NOT-A-UUID/terminate→ 401 +error: "internal_token_required"(NOT 400invalid_team_id)curl -X POST https://api.instanode.dev/internal/email/resend-magic-link -d 'JUNK'→ 401 +error: "internal_token_required"(NOT 400invalid_body)Constraints respected
/webhooks/brevo/:secret(the transactional-delivery receiver — PR fix(onboarding,brevo): /start always 302 + Brevo 401 carries correct agent_action (API-5/6) #172) is untouched — worker rule 12 path still works.🤖 Generated with Claude Code