fix(webhook,internal): auth-first ordering + distinct error codes (API-19/26/27/28/77/78/96/97/98) #175

Merged
mastermanas805 merged 3 commits into master from fix/webhook-and-internal-auth-ordering
May 29, 2026

@mastermanas805

What

Closes BUG-API-019, 026, 027, 028, 077, 078, 096, 097, 098 from the 2026-05-29 QA pass.

Two related fail-closed inversions:

  1. Webhook routes /api/v1/email/webhook/brevo + /ses returned a generic 401 (error: "invalid_signature") for BOTH the secret-unset path and the signature-mismatch path. Operators chasing "Brevo can't post delivery events" got the canonical "log in for a new INSTANODE_TOKEN" agent_action, which was useless because the actual fault was an undeployed BREVO_WEBHOOK_SECRET. The impact on prod was worse: every Brevo bounce/complaint/unsubscribe was silently lost, breaking CLAUDE.md rule 12 (email truth surface). The SES/SNS route had the same pattern.

  2. Internal routes /internal/teams/:id/terminate, /internal/email/resend-magic-link, /internal/teams/:id/backup-quota/refund parsed the path :id / body BEFORE checking the worker JWT — so an unauthenticated probe could distinguish "path malformed" (400) from "auth bad" (401) by the envelope code, inverting the fail-closed posture documented for the /internal/* routes.

How

  • Auth-first ordering on all three /internal/* handlers. Each now runs a preVerify* pass (signature + alg pin + purpose + iat freshness) BEFORE path / body parse. The team_id / link_id binding check stays in the second-phase verify, after the path / body parses cleanly. Probes that could previously shape-fingerprint the route via 400 vs 401 envelope codes now see a uniform 401 internal_token_required for every unauthenticated call.

  • Distinct error codes per failure class on the email-provider webhook routes:

    • webhook_secret_mismatch — operator hasn't deployed the env var
    • webhook_signature_mismatch — secret IS set, payload didn't verify
    • webhook_method_not_allowed — GET on a POST-only webhook URL (BUG-API-098 — dashboard pre-flight)

    Each carries an operator-targeted agent_action (NOT the user-targeted "log in to mint a new INSTANODE_TOKEN") so an alert correctly tells the operator to fix the deploy / dashboard, not the user to log in.

  • internal_token_required as the canonical /internal/* auth-fail code.

  • GET handlers on Brevo + SES webhook URLs. Return 405 + Allow: POST so a provider dashboard pre-flight that GETs the URL sees "URL exists, method wrong" instead of the catch-all 401.

  • Metric instant_webhook_auth_failures_total{webhook,reason} with labels {brevo_hmac, ses_sns} x {secret_unset, signature_mismatch}. Operators can split "we forgot to deploy the secret" from "the provider rotated their key" with one query.

  • NR alert webhook-auth-failures.json — CRITICAL on reason=secret_unset within 5 min (fix is one kubectl set env), per CLAUDE.md rule 25.
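
The auth-first ordering from the first bullet can be sketched as follows. This is a minimal illustration built on net/http rather than the Fiber handlers this PR touches, and `preVerify`, `terminateHandler`, `probe`, `uuidRe`, and the shared-token comparison are all hypothetical stand-ins. The real preVerify* pass checks a JWT (signature, alg pin, purpose, iat freshness), and the second-phase binding check is omitted here.

```go
package main

import (
	"crypto/subtle"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"regexp"
	"strings"
)

// uuidRe is a loose UUID shape check, used only for this illustration.
var uuidRe = regexp.MustCompile(`^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$`)

// preVerify is a hypothetical stand-in for the preVerify* helpers. A
// constant-time shared-token comparison replaces the real JWT checks.
func preVerify(r *http.Request, want string) bool {
	h := strings.TrimSpace(r.Header.Get("Authorization"))
	if !strings.HasPrefix(h, "Bearer ") {
		return false
	}
	got := strings.TrimSpace(h[len("Bearer "):])
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

// writeErr emits a JSON envelope with the given status and error code.
func writeErr(w http.ResponseWriter, status int, code string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(map[string]string{"error": code})
}

// terminateHandler authenticates first and only then inspects the path, so
// an unauthenticated caller always gets 401 regardless of the path shape.
func terminateHandler(token string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !preVerify(r, token) {
			writeErr(w, http.StatusUnauthorized, "internal_token_required")
			return
		}
		// Expected path: /internal/teams/<id>/terminate
		parts := strings.Split(strings.Trim(r.URL.Path, "/"), "/")
		if len(parts) != 4 || !uuidRe.MatchString(strings.ToLower(parts[2])) {
			writeErr(w, http.StatusBadRequest, "invalid_team_id")
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

// probe sends a POST and returns the status plus the envelope error code.
func probe(h http.Handler, path, auth string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, path, nil)
	if auth != "" {
		req.Header.Set("Authorization", auth)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	var env map[string]string
	_ = json.Unmarshal(rec.Body.Bytes(), &env)
	return rec.Code, env["error"]
}

func main() {
	h := terminateHandler("demo-token")
	fmt.Println(probe(h, "/internal/teams/NOT-A-UUID/terminate", ""))
	fmt.Println(probe(h, "/internal/teams/NOT-A-UUID/terminate", "Bearer demo-token"))
}
```

In this sketch the malformed path only surfaces as 400 invalid_team_id once the caller has authenticated; without a valid token the same request yields 401 internal_token_required.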

Tests

  • 12 new hermetic tests in internal/handlers/auth_first_ordering_test.go covering every fail-first path (BUG-API-019/026/027/028/077/078/096/097/098). No DB required — sqlmock + Fiber app.
  • 3 pre-existing tests that asserted the broken pre-fix ordering were updated (invalid_body / invalid_link_id / invalid_team_id arms; these are now exercised with a valid JWT so the body / path arm stays covered).
  • OpenAPI route-coverage gate updated — new GET 405 handlers added to the intentionallyHidden whitelist (provider-dashboard plumbing, not agent-facing).
  • TestCodeToAgentAction_NoOrphans + TestErrorCode_HasAgentAction registry gates green.

Surface checklist

  • api/internal/handlers/{email_webhooks,internal_*}.go — handler ordering + codes
  • api/internal/handlers/helpers.go — codeToAgentAction (5 new entries: webhook_secret_mismatch, webhook_signature_mismatch, webhook_method_not_allowed, internal_token_required, invalid_message)
  • api/internal/router/router.go — GET 405 handlers registered
  • api/internal/metrics/metrics.go — instant_webhook_auth_failures_total
  • api/internal/handlers/openapi_test.go — GET routes whitelisted
  • infra/newrelic/alerts/webhook-auth-failures.json — NR alert (rule 25)
  • [N/A] OpenAPI spec — GET endpoints are 405 plumbing, not agent surface
  • [N/A] content/llms.txt — no contract change to public API
  • [N/A] CHANGELOG — internal hardening, no user-visible surface change

Live verify plan (post-merge)

  1. curl https://api.instanode.dev/healthz | jq .commit_id → matches HEAD
  2. curl -X POST https://api.instanode.dev/api/v1/email/webhook/brevo -d '{}' → 401 + error: "webhook_secret_mismatch" (NOT unauthorized)
  3. curl -X GET https://api.instanode.dev/api/v1/email/webhook/brevo → 405 + Allow: POST + error: "webhook_method_not_allowed"
  4. curl -X POST https://api.instanode.dev/internal/teams/NOT-A-UUID/terminate → 401 + error: "internal_token_required" (NOT 400 invalid_team_id)
  5. curl -X POST https://api.instanode.dev/internal/email/resend-magic-link -d 'JUNK' → 401 + error: "internal_token_required" (NOT 400 invalid_body)

Constraints respected

🤖 Generated with Claude Code

mastermanas805 and others added 3 commits May 29, 2026 17:43
…I-19/26/27/28/77/78/96/97/98)

QA 2026-05-29 surfaced two related fail-closed inversions:

1. **Webhook routes** `/api/v1/email/webhook/brevo` + `/ses` returned a
   generic 401 envelope (`error: "invalid_signature"`) for BOTH the
   secret-unset path and the signature-mismatch path. Operators chasing
   "Brevo can't post delivery events" got the canonical "log in for a
   new INSTANODE_TOKEN" agent_action — useless because the actual fault
   was an undeployed BREVO_WEBHOOK_SECRET. The bug is even more pointed
   on prod: every Brevo bounce/complaint/unsubscribe was silently lost,
   breaking CLAUDE.md rule 12 (email truth surface). Same pattern on
   SES/SNS.

2. **Internal routes** `/internal/teams/:id/terminate`,
   `/internal/email/resend-magic-link`,
   `/internal/teams/:id/backup-quota/refund` parsed the path :id / body
   BEFORE checking the worker JWT — so a probe could distinguish "path
   malformed" (400) from "auth bad" (401) by the envelope code,
   inverting the fail-closed posture documented for the /internal/*
   routes.

This PR ships:

- **Auth-first ordering on all three /internal/* handlers.** Each now
  runs a `preVerify*` pass (signature + alg pin + purpose + iat
  freshness) BEFORE path / body parse. The team_id / link_id binding
  check stays in the second-phase verify, after the path / body parses
  cleanly. Pre-fix probes that could shape-fingerprint the route via
  400 vs 401 envelope codes now see uniform 401 internal_token_required
  for every unauth call.

- **Distinct error codes per failure class** on the email-provider
  webhook routes:
    - `webhook_secret_mismatch` — operator hasn't deployed the env var
    - `webhook_signature_mismatch` — secret IS set, payload didn't verify
    - `webhook_method_not_allowed` — GET on a POST-only webhook URL
      (API-98 — dashboard pre-flight)
  Each carries an operator-targeted agent_action (NOT the user-targeted
  "log in to mint a new INSTANODE_TOKEN") so an alert correctly tells
  the operator to fix the deploy / dashboard, not the user to log in.

- **`internal_token_required`** as the canonical /internal/* auth-fail
  code. Same agent_action surface.

- **GET handlers on Brevo + SES webhook URLs.** Return 405 + Allow: POST
  so a provider dashboard pre-flight that GETs the URL sees "URL exists,
  method wrong" instead of the catch-all 401 (which some dashboards
  interpret as "URL invalid" and silently drop).

- **Metric `instant_webhook_auth_failures_total{webhook,reason}`** with
  labels {brevo_hmac, ses_sns} x {secret_unset, signature_mismatch}.
  Operators can split "we forgot to deploy the secret" from "the
  provider rotated their key" with one query.

- **NR alert `webhook-auth-failures.json`** — CRITICAL on
  reason=secret_unset within 5 min (fix is one kubectl set env), per
  CLAUDE.md rule 25.
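
The webhook failure classes listed above can be sketched in one handler. This is a rough illustration under stated assumptions, not the handler this commit ships: it uses net/http instead of Fiber, the signature scheme (hex HMAC-SHA256 of the body in an `X-Signature` header) is assumed, the `failures` map stands in for the `instant_webhook_auth_failures_total` counter, and `brevoWebhook`, `sign`, `call`, and `errCode` are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// failures is an in-memory stand-in for the auth-failure metric,
// keyed by "webhook/reason".
type failures struct {
	mu sync.Mutex
	n  map[string]int
}

func (f *failures) inc(webhook, reason string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.n[webhook+"/"+reason]++
}

func reply(w http.ResponseWriter, status int, code string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(map[string]string{"error": code})
}

// sign returns the hex HMAC-SHA256 of body (assumed scheme, see lead-in).
func sign(secret string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// brevoWebhook maps each failure class to its own code and metric label.
func brevoWebhook(secret string, f *failures) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost { // e.g. a dashboard pre-flight GET
			w.Header().Set("Allow", http.MethodPost)
			reply(w, http.StatusMethodNotAllowed, "webhook_method_not_allowed")
			return
		}
		if secret == "" { // env var never deployed
			f.inc("brevo_hmac", "secret_unset")
			reply(w, http.StatusUnauthorized, "webhook_secret_mismatch")
			return
		}
		body, err := io.ReadAll(r.Body)
		if err != nil {
			reply(w, http.StatusBadRequest, "invalid_payload") // illustrative
			return
		}
		if !hmac.Equal([]byte(sign(secret, body)), []byte(r.Header.Get("X-Signature"))) {
			f.inc("brevo_hmac", "signature_mismatch")
			reply(w, http.StatusUnauthorized, "webhook_signature_mismatch")
			return
		}
		w.WriteHeader(http.StatusNoContent)
	})
}

// call drives the handler in-process and returns the recorded response.
func call(h http.Handler, method, body, sig string) *httptest.ResponseRecorder {
	req := httptest.NewRequest(method, "/api/v1/email/webhook/brevo", strings.NewReader(body))
	req.Header.Set("X-Signature", sig)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec
}

// errCode extracts the envelope error code, or "" when there is no body.
func errCode(rec *httptest.ResponseRecorder) string {
	var env map[string]string
	_ = json.Unmarshal(rec.Body.Bytes(), &env)
	return env["error"]
}

func main() {
	f := &failures{n: map[string]int{}}
	rec := call(brevoWebhook("", f), http.MethodPost, "{}", "")
	fmt.Println(rec.Code, errCode(rec), f.n)
}
```

Keeping the unset-secret check ahead of signature verification is what lets the counter separate the two reasons with a single label.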

Tests:
- 12 new hermetic tests in `auth_first_ordering_test.go` covering
  every fail-first path (BUG-API-019/026/027/028/077/078/096/097/098).
- Updated 3 pre-existing tests that asserted the broken pre-fix
  ordering (`invalid_body` / `invalid_link_id` / `invalid_team_id`
  arms — now exercised with a valid JWT so the body / path arm stays
  covered).
- OpenAPI route-coverage gate updated — new GET 405 handlers added to
  the intentionallyHidden whitelist (provider-dashboard plumbing, not
  agent-facing).

Surface checklist:
- [x] api/internal/handlers/{email_webhooks,internal_*}.go — handlers
- [x] api/internal/handlers/helpers.go — codeToAgentAction registry
      (4 new entries: webhook_secret_mismatch, webhook_signature_mismatch,
      webhook_method_not_allowed, internal_token_required, plus
      invalid_message which was orphaned)
- [x] api/internal/router/router.go — GET 405 handlers registered
- [x] api/internal/metrics/metrics.go — instant_webhook_auth_failures_total
- [x] api/internal/handlers/openapi_test.go — GET routes whitelisted
- [x] infra/newrelic/alerts/webhook-auth-failures.json — NR alert
- [x] Tests: 12 new + 3 updated; codeToAgentAction registry gate passes
- [N/A] OpenAPI spec — GET endpoints are 405 plumbing, not agent surface
- [N/A] content/llms.txt — no contract change to public API
- [N/A] CHANGELOG — internal hardening, no user-visible surface change

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-bearer branch

100% patch-coverage gate review: the preVerify* helpers had a defensive
`tokenStr == ""` branch after `TrimSpace(authHeader[len("Bearer "):])`
that is unreachable in practice — the outer TrimSpace of the full
Authorization header eats trailing whitespace before the prefix check,
so a "Bearer " header lands in the missing_bearer branch. Same for the
`!tok.Valid` branch — jwt-go's ParseWithClaims surfaces every cause as
an err return, never as `tok.Valid == false` with err == nil.

Both branches removed to satisfy the 100%-of-changed-lines gate.
WithValidMethods + the SigningMethodHMAC type-assert (defence in depth)
remain — those are the actual alg-pin enforcement.
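
The unreachability argument for the empty-token branch can be checked with plain string handling. `classify` below is a hypothetical reduction of the header parsing described above, not the code from this commit, and the `empty_token` label is invented for the illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// classify mirrors the described parsing order: trim the whole header,
// check the "Bearer " prefix, then trim what follows the prefix.
func classify(authHeader string) string {
	h := strings.TrimSpace(authHeader)
	if !strings.HasPrefix(h, "Bearer ") {
		return "missing_bearer"
	}
	tokenStr := strings.TrimSpace(h[len("Bearer "):])
	if tokenStr == "" {
		// The branch under discussion. Because h has no trailing
		// whitespace and starts with "Bearer ", it must end in a
		// non-space character after the prefix, so this never fires.
		return "empty_token"
	}
	return "token:" + tokenStr
}

func main() {
	for _, h := range []string{"", "Bearer", "Bearer ", "Bearer \t ", "  Bearer abc  "} {
		fmt.Printf("%q -> %s\n", h, classify(h))
	}
}
```

Every whitespace-only variant after the scheme collapses to "Bearer" under the outer trim, fails the prefix check, and lands in the missing_bearer branch.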

New tests cover the remaining preVerify arms:
- TestInternal{Terminate,ResendMagicLink,Refund}_PreVerify_WrongPurpose
- TestInternal{Terminate,ResendMagicLink,Refund}_PreVerify_StaleIat
- TestInternal{Terminate,ResendMagicLink,Refund}_PreVerify_MissingIat
- TestSESWebhook_GetReturns405_WithWebhookMethodNotAllowed
- TestSESWebhook_GoodTopicArnButBadEnvelope_Returns400_InvalidPayload
- TestSESWebhook_NotificationWithBadInnerMessage_Returns400_InvalidMessage
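
The purpose and iat arms those tests exercise could look roughly like this. It is a sketch under assumptions: the `claims` struct, `checkClaims`, the error values, the five-minute window, the "terminate" purpose string, and the rejection of future-dated iat values are all hypothetical, and the real preVerify* helpers operate on parsed JWT claims.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// claims is a hypothetical subset of the worker JWT claims.
type claims struct {
	Purpose  string
	IssuedAt *time.Time // nil when the token carries no iat
}

var (
	errWrongPurpose = errors.New("wrong purpose")
	errMissingIat   = errors.New("missing iat")
	errStaleIat     = errors.New("stale iat")
)

// demoMaxAge and demoNow are illustrative values, not taken from the PR.
const demoMaxAge = 5 * time.Minute

var demoNow = time.Date(2026, 5, 29, 12, 0, 0, 0, time.UTC)

// checkClaims rejects a token whose purpose does not match, whose iat is
// absent, or whose iat falls outside the freshness window.
func checkClaims(c claims, wantPurpose string, now time.Time, maxAge time.Duration) error {
	if c.Purpose != wantPurpose {
		return errWrongPurpose
	}
	if c.IssuedAt == nil {
		return errMissingIat
	}
	age := now.Sub(*c.IssuedAt)
	if age < 0 || age > maxAge { // future-dated iat is rejected by assumption
		return errStaleIat
	}
	return nil
}

// demoClaims builds claims issued ageSeconds before demoNow; a negative
// ageSeconds omits iat entirely.
func demoClaims(purpose string, ageSeconds int) claims {
	if ageSeconds < 0 {
		return claims{Purpose: purpose}
	}
	t := demoNow.Add(-time.Duration(ageSeconds) * time.Second)
	return claims{Purpose: purpose, IssuedAt: &t}
}

func main() {
	cases := []claims{
		demoClaims("terminate", 10),
		demoClaims("refund", 10),
		demoClaims("terminate", -1),
		demoClaims("terminate", 3600),
	}
	for _, c := range cases {
		fmt.Println(checkClaims(c, "terminate", demoNow, demoMaxAge))
	}
}
```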

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eyfunc

jwt.ParseWithClaims with WithValidMethods([HS256]) short-circuits any
non-HS256 token BEFORE the keyfunc runs, so the inner
*SigningMethodHMAC type-assert is unreachable. The defensive
"defense-in-depth" comment overstated the value — the test confirms
WithValidMethods does the work alone.

Removing the unreachable branch closes the last 6 lines of the 100%
patch-coverage gate without weakening alg pinning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 force-pushed the fix/webhook-and-internal-auth-ordering branch from 9d4c98a to 471c069 Compare May 29, 2026 12:13
@mastermanas805 mastermanas805 merged commit f69f6f2 into master May 29, 2026
14 checks passed
@mastermanas805 mastermanas805 deleted the fix/webhook-and-internal-auth-ordering branch May 29, 2026 12:34