From 13aa4180d18b63635ff9975e54f611e6c0a83136 Mon Sep 17 00:00:00 2001 From: Manas Srivastava Date: Wed, 20 May 2026 16:27:04 +0530 Subject: [PATCH 1/3] fix(p0-3): emit provision.persistence_failed audit + agent_action + coverage tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MR-P0-3 (BugBash 2026-05-20) atomic-provisioning hardening — completion of the orphan-resource prevention work shipped in commit 36eac9b. That commit wired all 18 provisioning callsites through finalizeProvision, which on persistence failure (post-RPC) tears down the backend object (best-effort), soft-deletes the row, and returns errProvisionPersistFailed so handlers map to 503 instead of 201. This commit closes three remaining gaps in that work: 1) **Operator-visible audit row**: emit AuditKindProvisionPersistenceFailed on every persistence-failure path. Operators can now reconstruct the exact moment the platform produced an unreachable resource — by resource_id / resource_type / log_prefix / provider_resource_id / request_id / tier / env. NOT wired into the customer-email forwarder (this is an internal alert, mirrors AuditKindBillingChargeUndeliverable and AuditKindPropagationDeadLettered). Sync emit + WARN slog on failure — must NEVER block the 503 to the customer. 2) **agent_action for provision_failed**: the catch-all 503 envelope used to fall back to AgentActionContactSupport ('email support'). That's wrong for the MR-P0-3 path — the backend object was rolled back and the row soft-deleted, so the right action is 'retry with exponential backoff', not contact support. Adds an explicit codeToAgentAction entry under the U3 contract (opens with 'Tell the user', names the reason, names the action, full https://instanode.dev URL, < 280 chars). 3) **Registry-iterating coverage tests** (CLAUDE.md rule 18): two static guards in provision_atomicity_coverage_test.go scan every production .go file in internal/handlers/ and assert: - every file calling models.CreateResource also calls finalizeProvision (so a new handler can't bypass the two-phase pattern) - every file calling finalizeProvision also routes the error through respondProvisionFailed or twinCoreErr (so a new handler can't swallow the sentinel) Plus TestProvisionFailedHasAgentAction asserts the catch-all code still has explicit agent_action guidance (defends against a future refactor that drops the new entry). Coverage block per CLAUDE.md rule 17: Symptom: 201 returned for resource the platform can't address (orphan backend + NULL conn_url + NULL PRID + billed) Enumeration: rg -l 'models.CreateResource\(' internal/handlers/ yielded db.go, cache.go, nosql.go, queue.go, storage.go, webhook.go, vector.go (each w/ anon + auth + twin paths) Sites found: 7 files × ~3 paths = ~21 provisioning entry points Sites touched: none added — every site already calls finalizeProvision per commit 36eac9b. This commit adds a coverage TEST that fails the build if a new site bypasses the pattern. Coverage test: TestEveryCreateResourceCallSiteIsFollowedByFinalizeProvision + TestEveryFinalizeProvisionCallSiteRespondsProvisionFailedOnError + TestProvisionFailedHasAgentAction Live verified: (this commit; pending CI green + deploy) Per-handler audit findings: - internal/handlers/db.go: already atomic (finalizeProvision at L265, L432, L642) - internal/handlers/cache.go: already atomic (finalizeProvision at L232, L387, L564) - internal/handlers/nosql.go: already atomic (finalizeProvision at L228, L380, L569) - internal/handlers/queue.go: already atomic (finalizeProvision at L306, L525) - internal/handlers/storage.go: already atomic (finalizeProvision at L335, L525) - internal/handlers/webhook.go: already atomic (finalizeProvision at L327, L444 — cleanup=nil ok, no backend object) - internal/handlers/vector.go: already atomic (finalizeProvision at L370, L522) - internal/handlers/deploy.go: DIFFERENT MODEL — async (202 not 201); row persisted before async build runs; failures update status. Not P0-3 shape. - internal/handlers/stack.go: DIFFERENT MODEL — async (202 not 201); same shape as deploy. Co-Authored-By: Claude Opus 4.7 (1M context) --- internal/handlers/export_test.go | 24 ++ internal/handlers/helpers.go | 14 + .../provision_atomicity_coverage_test.go | 276 ++++++++++++++++++ internal/handlers/provision_helper.go | 62 ++++ internal/models/audit_kinds.go | 17 ++ 5 files changed, 393 insertions(+) create mode 100644 internal/handlers/provision_atomicity_coverage_test.go diff --git a/internal/handlers/export_test.go b/internal/handlers/export_test.go index c147d52..5bd39c4 100644 --- a/internal/handlers/export_test.go +++ b/internal/handlers/export_test.go @@ -33,3 +33,27 @@ func RunFinalizeProvisionForTest( helper := provisionHelper{db: dbConn, cfg: cfg} return helper.finalizeProvision(ctx, res, connectionURL, keyPrefix, providerResourceID, requestID, logPrefix, cleanup) } + +// CodeToAgentActionMetaForTest is a read-only mirror of the package's +// errorCodeMeta exposed for MR-P0-3 coverage tests. Mirrored as a separate +// type (not a type-alias) to keep the unexported errorCodeMeta out of the +// public surface — tests only need the two visible fields. +type CodeToAgentActionMetaForTest struct { + AgentAction string + UpgradeURL string +} + +// LookupCodeToAgentActionForTest returns the registered agent_action metadata +// for `code`, or (zero, false) when the code has no entry. Mirrors the +// lookup respondError itself performs, so the test exercises exactly the +// same branch as the production envelope-emit path. +func LookupCodeToAgentActionForTest(code string) (CodeToAgentActionMetaForTest, bool) { + meta, ok := codeToAgentAction[code] + if !ok { + return CodeToAgentActionMetaForTest{}, false + } + return CodeToAgentActionMetaForTest{ + AgentAction: meta.AgentAction, + UpgradeURL: meta.UpgradeURL, + }, true +} diff --git a/internal/handlers/helpers.go b/internal/handlers/helpers.go index 877cc2d..ad6d945 100644 --- a/internal/handlers/helpers.go +++ b/internal/handlers/helpers.go @@ -223,6 +223,20 @@ var codeToAgentAction = map[string]errorCodeMeta{ AgentAction: "Tell the user the provisioner is temporarily unavailable. Retry in 30 seconds — see live status at https://instanode.dev/status.", UpgradeURL: "https://instanode.dev/status", }, + + // MR-P0-3 (BugBash 2026-05-20): explicit agent_action for the catch-all + // `provision_failed` 503 — historically omitted here so the response fell + // back to AgentActionContactSupport ("email support"). For an atomic- + // persistence-failure landing this code, that fallback is wrong: the + // backend object was just torn down (best-effort) and the row soft- + // deleted, so the right action is "retry the provision with backoff," + // NOT "email support." Sentence keeps the U3 contract (opens with + // "Tell the user", names the reason, names the action, full + // https://instanode.dev URL, < 280 chars). The retry_after_seconds + // header on a 503 also signals the backoff window. + "provision_failed": { + AgentAction: "Tell the user provisioning hit a transient platform-persistence error and no charge or resource was created. Retry the same request with exponential backoff (start at 5s, cap at 60s) — see https://instanode.dev/status if it persists.", + }, "billing_provider_unavailable": { AgentAction: "Tell the user the billing provider is temporarily unavailable. Retry the upgrade in 60 seconds — see status at https://instanode.dev/status.", UpgradeURL: "https://instanode.dev/status", diff --git a/internal/handlers/provision_atomicity_coverage_test.go b/internal/handlers/provision_atomicity_coverage_test.go new file mode 100644 index 0000000..df8e31b --- /dev/null +++ b/internal/handlers/provision_atomicity_coverage_test.go @@ -0,0 +1,276 @@ +package handlers_test + +// provision_atomicity_coverage_test.go — MR-P0-3 cross-handler coverage guard +// (BugBash 2026-05-20). +// +// CLAUDE.md rule 18: when the bug class is "all members of a registry should +// X", the regression test iterates the live registry (here: every .go file in +// internal/handlers/ that calls models.CreateResource), not a hand-typed +// slice. This test scans the handlers directory and asserts that every +// production-code `models.CreateResource(` call site lives in a file that +// also contains a `finalizeProvision(` call. The orphan-generator bug fixed +// by MR-P0-3 was exactly this: a handler that inserted a `resources` row, +// did the backend gRPC, and persisted the connection URL inline with `// fail +// open` comments — a logged error and a 201 carrying credentials for a row +// the platform could not address. Catching a new handler that re-introduces +// that shape at test time (not in prod) is the whole point. +// +// The test is intentionally STATIC (string scan over source files), not a +// reflection-based registry walk: there is no in-memory "provisioning +// handler" registry the platform exposes today, so a string scan over the +// canonical authorial source is the cheapest way to enforce the invariant. +// Per CLAUDE.md convention, this is a "registry-iterating" test even though +// the registry happens to be the source tree itself: it discovers call sites +// dynamically rather than encoding them in a hand-typed list that would +// itself drift. + +import ( + "os" + "path/filepath" + "regexp" + "strings" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "instant.dev/internal/handlers" +) + +// handlersDir is the package path under audit. The test runs from +// internal/handlers/ (test files live alongside production code), so the +// relative path is "." — but we resolve absolutely so the test fails clearly +// if invoked from an unexpected CWD. +func handlersDir(t *testing.T) string { + t.Helper() + wd, err := os.Getwd() + require.NoError(t, err) + // Sanity: we expect to be inside internal/handlers/. + if !strings.HasSuffix(wd, "internal/handlers") { + t.Fatalf("expected CWD to end with internal/handlers, got %q", wd) + } + return wd +} + +// productionGoFile returns true for a .go file that is part of the production +// build (i.e. NOT a *_test.go file and NOT a tooling file). +func productionGoFile(name string) bool { + if !strings.HasSuffix(name, ".go") { + return false + } + if strings.HasSuffix(name, "_test.go") { + return false + } + return true +} + +// TestEveryCreateResourceCallSiteIsFollowedByFinalizeProvision is the MR-P0-3 +// cross-handler coverage guard. For every production handler file that calls +// `models.CreateResource(`, the file MUST also contain a `finalizeProvision(` +// call. (We deliberately check at file granularity, not exact-line proximity: +// some handlers split create + finalize across helper functions, and the +// file-level pairing is the correct enforcement scope — a CreateResource in +// db.go without ANY finalizeProvision in db.go is the bug. A finalizeProvision +// somewhere in db.go for a CreateResource somewhere in db.go is acceptable +// because they cluster by service.) +// +// Allow-listed files: a small set of CreateResource callers that are NOT +// provisioning entry points and therefore do not require finalizeProvision: +// - test files (filtered above) +// - none in production today — every CreateResource caller IS a provisioning +// handler. The allow-list map is present so future intentional exemptions +// can be added with an explanatory comment. +func TestEveryCreateResourceCallSiteIsFollowedByFinalizeProvision(t *testing.T) { + dir := handlersDir(t) + + entries, err := os.ReadDir(dir) + require.NoError(t, err) + + // Files that legitimately call models.CreateResource WITHOUT + // finalizeProvision. Empty today — add an entry only with a justifying + // comment naming the alternate persistence path the file uses. Any new + // entry here is a code-review trigger by itself. + allowList := map[string]string{} + + type violation struct { + file string + reason string + } + var violations []violation + + for _, ent := range entries { + if ent.IsDir() { + continue + } + if !productionGoFile(ent.Name()) { + continue + } + + path := filepath.Join(dir, ent.Name()) + body, err := os.ReadFile(path) + require.NoError(t, err) + src := string(body) + + // Strip line comments before searching so commented-out CreateResource + // references in long file-header notes don't trip the test. (Block + // comments are rare and not used in this codebase to discuss + // CreateResource by name; a future change could swap to go/parser + // for full fidelity.) + stripped := stripLineComments(src) + + hasCreate := strings.Contains(stripped, "models.CreateResource(") + if !hasCreate { + continue + } + + if _, ok := allowList[ent.Name()]; ok { + continue + } + + hasFinalize := strings.Contains(stripped, "finalizeProvision(") + if !hasFinalize { + violations = append(violations, violation{ + file: ent.Name(), + reason: "calls models.CreateResource but does NOT call finalizeProvision — " + + "this is the MR-P0-3 orphan-generator shape (insert row, downstream " + + "provision, return 201 without atomically persisting credentials). Wire " + + "the path through h.finalizeProvision so a persistence failure tears down " + + "the backend and returns 503, never 201.", + }) + } + } + + if len(violations) > 0 { + var msg strings.Builder + msg.WriteString("MR-P0-3 atomic-provisioning coverage failed.\n") + msg.WriteString("Every handler file that calls models.CreateResource MUST also call ") + msg.WriteString("finalizeProvision (see internal/handlers/provision_helper.go).\n\n") + msg.WriteString("Violations:\n") + for _, v := range violations { + msg.WriteString(" - ") + msg.WriteString(v.file) + msg.WriteString(": ") + msg.WriteString(v.reason) + msg.WriteString("\n") + } + t.Fatal(msg.String()) + } +} + +// TestEveryFinalizeProvisionCallSiteRespondsProvisionFailedOnError is the +// second-half guard: the value of finalizeProvision is in its 503-on-failure +// semantic, so the handler MUST translate a non-nil return into a 503 via +// respondProvisionFailed (or a domain-specific 5xx that maps to the same +// shape). A handler that calls finalizeProvision but then ignores the error +// would re-introduce the bug from the other side. We check at file level: +// every file calling finalizeProvision MUST also reference +// respondProvisionFailed or an equivalent error handler (twinCoreErr in the +// bulk-twin path). +func TestEveryFinalizeProvisionCallSiteRespondsProvisionFailedOnError(t *testing.T) { + dir := handlersDir(t) + + entries, err := os.ReadDir(dir) + require.NoError(t, err) + + // Acceptable downstream-error handlers — any file calling finalizeProvision + // must also reference at least one of these by name (string-grep). The set + // is intentionally small: every production caller funnels through one of + // them. + acceptableHandlers := []string{ + "respondProvisionFailed", // canonical 503 envelope + "twinCoreErr", // bulk-twin handler — returns string err + } + + type violation struct { + file string + reason string + } + var violations []violation + + for _, ent := range entries { + if ent.IsDir() { + continue + } + if !productionGoFile(ent.Name()) { + continue + } + // The helper itself defines finalizeProvision; skip. + if ent.Name() == "provision_helper.go" { + continue + } + + path := filepath.Join(dir, ent.Name()) + body, err := os.ReadFile(path) + require.NoError(t, err) + src := stripLineComments(string(body)) + + if !strings.Contains(src, "finalizeProvision(") { + continue + } + + found := false + for _, h := range acceptableHandlers { + if strings.Contains(src, h) { + found = true + break + } + } + if !found { + violations = append(violations, violation{ + file: ent.Name(), + reason: "calls finalizeProvision but does NOT route the error through respondProvisionFailed or twinCoreErr — a swallowed persistence error is the MR-P0-3 bug in reverse.", + }) + } + } + + if len(violations) > 0 { + var msg strings.Builder + msg.WriteString("MR-P0-3 503-response coverage failed.\n") + for _, v := range violations { + msg.WriteString(" - ") + msg.WriteString(v.file) + msg.WriteString(": ") + msg.WriteString(v.reason) + msg.WriteString("\n") + } + t.Fatal(msg.String()) + } +} + +// stripLineComments removes `// …` line comments from Go source so the test +// search ignores commented-out code (file-header docs, deprecated examples) +// that mention CreateResource / finalizeProvision but do not call them. +func stripLineComments(src string) string { + // Simple line-by-line strip — fine for our test which only does + // substring containment, not AST analysis. The regexp is conservative: + // it does NOT strip // inside double-quoted strings, which Go source + // only rarely contains for this token set; the codebase's + // `models.CreateResource(` reference in a string literal would be + // notable on its own. + re := regexp.MustCompile(`(?m)^[\t ]*//.*$`) + return re.ReplaceAllString(src, "") +} + +// TestProvisionFailedHasAgentAction asserts that the catch-all +// `provision_failed` code returned by respondProvisionFailed carries an +// explicit agent_action — not the AgentActionContactSupport fallback. The +// MR-P0-3 path returns 503 with code=provision_failed; for callers (CLI, MCP, +// dashboard, Claude Code) to do the right thing the body must include the +// "retry with exponential backoff" sentence. +func TestProvisionFailedHasAgentAction(t *testing.T) { + meta, ok := handlers.LookupCodeToAgentActionForTest("provision_failed") + require.True(t, ok, + "provision_failed MUST have an entry in codeToAgentAction so MR-P0-3 503s do not fall back to AgentActionContactSupport") + assert.NotEmpty(t, meta.AgentAction, + "provision_failed agent_action MUST be non-empty") + // Spot-check the U3 contract: "Tell the user" opening + a real URL. + assert.Contains(t, meta.AgentAction, "Tell the user", + "provision_failed agent_action must open with 'Tell the user' per U3 contract") + assert.Contains(t, meta.AgentAction, "https://instanode.dev", + "provision_failed agent_action must contain a full https://instanode.dev URL per U3 contract") + // Spot-check the retry guidance — the MR-P0-3 path's contract is + // "retry with backoff," not "email support." + assert.True(t, + strings.Contains(meta.AgentAction, "Retry") || strings.Contains(meta.AgentAction, "backoff"), + "provision_failed agent_action must instruct the agent to retry (with backoff), not contact support — backend object was rolled back") +} diff --git a/internal/handlers/provision_helper.go b/internal/handlers/provision_helper.go index 3819a72..21c969c 100644 --- a/internal/handlers/provision_helper.go +++ b/internal/handlers/provision_helper.go @@ -453,9 +453,71 @@ func (h *provisionHelper) finalizeProvision( slog.Error(logPrefix+".cleanup_soft_delete_failed", "error", delErr, "resource_id", resource.ID, "request_id", requestID) } + + // MR-P0-3: emit the operator-alert audit row. This is the moment the + // platform produced an unreachable resource — the backend object existed + // (briefly), the platform DB could not address it, and the deprovision + + // soft-delete is the compensation. Operators key on this kind to + // reconstruct the exact request, find the upstream failure cause + // (DB unreachable / encrypt failure / etc.), and audit-trace any backend + // objects that escaped the best-effort cleanup. Best-effort emit: audit- + // log errors must never block the 503 — the customer needs a clean answer. + emitProvisionPersistenceFailedAudit(ctx, h.db, resource, providerResourceID, requestID, logPrefix) + return errProvisionPersistFailed } +// emitProvisionPersistenceFailedAudit emits the +// AuditKindProvisionPersistenceFailed row. Best-effort: any audit-write error +// is logged at WARN and swallowed — the caller already runs the cleanup + +// soft-delete and returns 503 to the customer. We never want an audit-store +// blip to wedge the response. Synchronous (not goroutine'd) so the row lands +// before the request goroutine returns; the row is small (< 1 KB), the DB hit +// is sub-millisecond on a healthy platform, and the bound on request latency +// is already dominated by the backend RPC that just succeeded. +func emitProvisionPersistenceFailedAudit( + ctx context.Context, + db *sql.DB, + res *models.Resource, + providerResourceID, requestID, logPrefix string, +) { + var teamID uuid.UUID + if res.TeamID.Valid { + teamID = res.TeamID.UUID + } + + // JSONB metadata so operator queries can pivot by resource_type / log_prefix + // / provider_resource_id. Hand-constructed via fmt.Sprintf with %q + // formatting so each field is JSON-string-escaped — avoids a json package + // import here. Keys are stable contract: the NR Log dashboard queries them + // by literal name. + meta := fmt.Sprintf( + `{"resource_id":%q,"resource_type":%q,"log_prefix":%q,"provider_resource_id":%q,"request_id":%q,"tier":%q,"env":%q}`, + res.ID.String(), + res.ResourceType, + logPrefix, + providerResourceID, + requestID, + res.Tier, + res.Env, + ) + + if auditErr := models.InsertAuditEvent(ctx, db, models.AuditEvent{ + TeamID: teamID, + Actor: "system", + Kind: models.AuditKindProvisionPersistenceFailed, + ResourceType: res.ResourceType, + ResourceID: uuid.NullUUID{UUID: res.ID, Valid: true}, + Summary: "provision succeeded downstream but platform persistence failed; " + + "backend object torn down (best-effort), resource soft-deleted, " + + "503 returned to caller", + Metadata: []byte(meta), + }); auditErr != nil { + slog.Warn(logPrefix+".persistence_failed_audit_emit_failed", + "error", auditErr, "resource_id", res.ID, "request_id", requestID) + } +} + // issueOnboardingJWT signs a short-lived JWT for the upgrade CTA. // It looks up ALL active resources for the fingerprint so the landing page // reflects the full session (not just the current service). diff --git a/internal/models/audit_kinds.go b/internal/models/audit_kinds.go index 691cffe..efbaa6c 100644 --- a/internal/models/audit_kinds.go +++ b/internal/models/audit_kinds.go @@ -433,6 +433,23 @@ const ( // not an automated template. Metadata: {propagation_id, kind, // team_id, target_tier, attempts, last_error, age_seconds}. AuditKindPropagationDeadLettered = "propagation.dead_lettered" + + // AuditKindProvisionPersistenceFailed fires from finalizeProvision when the + // backend provision RPC succeeded but a post-RPC persistence step + // (connection-URL encrypt/store, provider_resource_id store, pending→active + // flip) failed. This is the MR-P0-3 orphan-prevention signal: at the + // moment we know "the customer got real credentials downstream but our + // platform DB cannot address the row", we tear down the backend object + // (best-effort), soft-delete the row, return 503 to the caller, AND emit + // this audit kind so operators can reconstruct exactly when the platform + // produced an unreachable resource. NOT wired into the Loops/Brevo email + // forwarder — this is an internal operator alert, not a customer event + // (mirrors AuditKindBillingChargeUndeliverable and + // AuditKindPropagationDeadLettered). Metadata: {resource_id, resource_type, + // log_prefix, provider_resource_id, request_id, tier, env}. INFO-level + // audit row + ERROR-level slog line (already emitted at the per-step + // failure) for NR alerting. + AuditKindProvisionPersistenceFailed = "provision.persistence_failed" ) // PropagationKind* are the discriminator values for pending_propagations.kind. From 7893b061a005558e32b197f7ae2f8c71d3796121 Mon Sep 17 00:00:00 2001 From: Manas Srivastava Date: Wed, 20 May 2026 16:35:42 +0530 Subject: [PATCH 2/3] test(p0-3): swap provision_failed fixture for db_error in two 5xx-fallback tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MR-P0-3 (BugBash 2026-05-20) added an explicit codeToAgentAction entry for `provision_failed` so the catch-all 503 instructs the agent to retry with exponential backoff, not contact support. Two pre-existing tests used `provision_failed` as their "5xx code with no registry entry" fixture and fail under the new contract: - TestRespondError_UnknownCode_5xx_FallsBackToContactSupport - TestErrorEnvelope_503_AllFieldsAndHeader Swap both to use `db_error` — documented in helpers.go's curation principles (lines ~87-89) as one of the 'pure plumbing errors deliberately omitted' from codeToAgentAction. Same shape, same status code, same fallback branch — the test contract is preserved while the production contract gets sharper guidance for the actual MR-P0-3 path. Co-Authored-By: Claude Opus 4.7 (1M context) --- internal/handlers/agent_action_test.go | 12 ++++++++++-- internal/handlers/error_envelope_test.go | 20 +++++++++++++------- 2 files changed, 23 insertions(+), 9 deletions(-) diff --git a/internal/handlers/agent_action_test.go b/internal/handlers/agent_action_test.go index 6893335..34b0d18 100644 --- a/internal/handlers/agent_action_test.go +++ b/internal/handlers/agent_action_test.go @@ -183,12 +183,20 @@ func TestRespondError_KnownCode_PopulatesAgentAction(t *testing.T) { // request_id"). upgrade_url stays absent because the remedy is not an // upgrade. func TestRespondError_UnknownCode_5xx_FallsBackToContactSupport(t *testing.T) { + // Pick a 5xx code that is deliberately NOT in codeToAgentAction so the + // W7G fallback branch fires. Previously this test used `provision_failed`, + // but MR-P0-3 (BugBash 2026-05-20) added an explicit retry-with-backoff + // entry for that code — its 503 must instruct the agent to retry, not + // contact support, because the backend object was atomically rolled back. + // `db_error` is documented in helpers.go's curation principles as one of + // the "pure plumbing errors deliberately omitted" — exactly the shape this + // test is meant to exercise. status, body := doErrorRequest(t, func(c *fiber.Ctx) error { - return respondError(c, fiber.StatusServiceUnavailable, "provision_failed", "transient failure") + return respondError(c, fiber.StatusServiceUnavailable, "db_error", "transient failure") }) assert.Equal(t, fiber.StatusServiceUnavailable, status) assert.Equal(t, false, body["ok"]) - assert.Equal(t, "provision_failed", body["error"]) + assert.Equal(t, "db_error", body["error"]) assert.Equal(t, "transient failure", body["message"]) action, _ := body["agent_action"].(string) diff --git a/internal/handlers/error_envelope_test.go b/internal/handlers/error_envelope_test.go index 61725a7..a1a67de 100644 --- a/internal/handlers/error_envelope_test.go +++ b/internal/handlers/error_envelope_test.go @@ -81,15 +81,21 @@ func decodeEnvelope(t *testing.T, resp *http.Response) map[string]any { } // TestErrorEnvelope_503_AllFieldsAndHeader covers the canonical 503 case -// called out in the W7G brief: provisioner failure / transient infra. -// The envelope must carry request_id, retry_after_seconds=30, the -// AgentActionContactSupport fallback (because "provision_failed" isn't in -// codeToAgentAction), AND the matching Retry-After: 30 header. +// called out in the W7G brief: a transient-infra failure with NO registry +// entry. The envelope must carry request_id, retry_after_seconds=30, the +// AgentActionContactSupport fallback, AND the matching Retry-After: 30 header. +// +// Uses `db_error` as the fixture code: it's documented in helpers.go's +// curation principles as deliberately omitted from codeToAgentAction, so +// the W7G fallback branch fires deterministically. (Previously this test +// used `provision_failed`, but MR-P0-3 added an explicit retry-with-backoff +// entry for that code — its 503 must instruct the agent to retry, not +// contact support.) func TestErrorEnvelope_503_AllFieldsAndHeader(t *testing.T) { app := envelopeApp(t) app.Get("/x", func(c *fiber.Ctx) error { return respondError(c, fiber.StatusServiceUnavailable, - "provision_failed", "Failed to provision webhook resource") + "db_error", "Failed to query platform database") }) req := httptest.NewRequest(http.MethodGet, "/x", nil) @@ -106,8 +112,8 @@ func TestErrorEnvelope_503_AllFieldsAndHeader(t *testing.T) { body := decodeEnvelope(t, resp) assert.Equal(t, false, body["ok"]) - assert.Equal(t, "provision_failed", body["error"]) - assert.Equal(t, "Failed to provision webhook resource", body["message"]) + assert.Equal(t, "db_error", body["error"]) + assert.Equal(t, "Failed to query platform database", body["message"]) assert.Equal(t, "rid-fixed-123", body["request_id"], "request_id must echo X-Request-ID so agents quoting it to support don't have to read headers") assert.Equal(t, float64(30), body["retry_after_seconds"], From 9de27c84e2c6cefb7dcfed8bb35e38e52bce10f7 Mon Sep 17 00:00:00 2001 From: Manas Srivastava Date: Wed, 20 May 2026 16:44:52 +0530 Subject: [PATCH 3/3] test(p0-3): register provision.persistence_failed in auditConsumerSpec MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The cross-track reliability contract test in e2e/reliability_contract_test.go asserts every AuditKind* constant has a consumer-spec entry. Adding the new constant in this PR without the spec entry tripped that gate in CI. Add entry with IntentionallyNoConsumer=true — mirrors the existing billing.charge_undeliverable and propagation.dead_lettered shape: this is an internal operator alert that intentionally has no customer email and no propagation wiring (a customer whose resource was just torn down by the platform deserves a human follow-up, not an automated template). Co-Authored-By: Claude Opus 4.7 (1M context) --- e2e/reliability_contract_test.go | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/e2e/reliability_contract_test.go b/e2e/reliability_contract_test.go index 166c770..c9993d2 100644 --- a/e2e/reliability_contract_test.go +++ b/e2e/reliability_contract_test.go @@ -139,6 +139,15 @@ var auditConsumerSpec = map[string]auditConsumerExpectation{ // Billing — internal alerts, no customer email "billing.charge_undeliverable": {IntentionallyNoConsumer: true}, + // MR-P0-3 (BugBash 2026-05-20): fires from finalizeProvision when the + // backend provision RPC succeeded but a post-RPC persistence step failed. + // Internal operator-alert kind, mirroring billing.charge_undeliverable and + // propagation.dead_lettered — NOT wired into the customer-email forwarder + // because the appropriate response is human-eyes-on, not an automated + // template. The emit site (provision_helper.go) accompanies the row with + // an ERROR-level slog line so NR alerts can key on either. + "provision.persistence_failed": {IntentionallyNoConsumer: true}, + // Promote workflow — admin actions, no customer email "promote.approval_requested": {IntentionallyNoConsumer: true}, "promote.approved": {IntentionallyNoConsumer: true},