fix(p0-3): emit provision.persistence_failed audit + agent_action + coverage tests#121
Merged
Merged
Conversation
…overage tests MR-P0-3 (BugBash 2026-05-20) atomic-provisioning hardening — completion of the orphan-resource prevention work shipped in commit 36eac9b. That commit wired all 18 provisioning callsites through finalizeProvision, which on persistence failure (post-RPC) tears down the backend object (best-effort), soft-deletes the row, and returns errProvisionPersistFailed so handlers map to 503 instead of 201. This commit closes three remaining gaps in that work: 1) **Operator-visible audit row**: emit AuditKindProvisionPersistenceFailed on every persistence-failure path. Operators can now reconstruct the exact moment the platform produced an unreachable resource — by resource_id / resource_type / log_prefix / provider_resource_id / request_id / tier / env. NOT wired into the customer-email forwarder (this is an internal alert, mirrors AuditKindBillingChargeUndeliverable and AuditKindPropagationDeadLettered). Sync emit + WARN slog on failure — must NEVER block the 503 to the customer. 2) **agent_action for provision_failed**: the catch-all 503 envelope used to fall back to AgentActionContactSupport ('email support'). That's wrong for the MR-P0-3 path — the backend object was rolled back and the row soft-deleted, so the right action is 'retry with exponential backoff', not contact support. Adds an explicit codeToAgentAction entry under the U3 contract (opens with 'Tell the user', names the reason, names the action, full https://instanode.dev URL, < 280 chars). 3) **Registry-iterating coverage tests** (CLAUDE.md rule 18): two static guards in provision_atomicity_coverage_test.go scan every production .go file in internal/handlers/ and assert: - every file calling models.CreateResource also calls finalizeProvision (so a new handler can't bypass the two-phase pattern) - every file calling finalizeProvision also routes the error through respondProvisionFailed or twinCoreErr (so a new handler can't swallow the sentinel) Plus TestProvisionFailedHasAgentAction asserts the catch-all code still has explicit agent_action guidance (defends against a future refactor that drops the new entry). Coverage block per CLAUDE.md rule 17: Symptom: 201 returned for resource the platform can't address (orphan backend + NULL conn_url + NULL PRID + billed) Enumeration: rg -l 'models.CreateResource\(' internal/handlers/ yielded db.go, cache.go, nosql.go, queue.go, storage.go, webhook.go, vector.go (each w/ anon + auth + twin paths) Sites found: 7 files × ~3 paths = ~21 provisioning entry points Sites touched: none added — every site already calls finalizeProvision per commit 36eac9b. This commit adds a coverage TEST that fails the build if a new site bypasses the pattern. Coverage test: TestEveryCreateResourceCallSiteIsFollowedByFinalizeProvision + TestEveryFinalizeProvisionCallSiteRespondsProvisionFailedOnError + TestProvisionFailedHasAgentAction Live verified: (this commit; pending CI green + deploy) Per-handler audit findings: - internal/handlers/db.go: already atomic (finalizeProvision at L265, L432, L642) - internal/handlers/cache.go: already atomic (finalizeProvision at L232, L387, L564) - internal/handlers/nosql.go: already atomic (finalizeProvision at L228, L380, L569) - internal/handlers/queue.go: already atomic (finalizeProvision at L306, L525) - internal/handlers/storage.go: already atomic (finalizeProvision at L335, L525) - internal/handlers/webhook.go: already atomic (finalizeProvision at L327, L444 — cleanup=nil ok, no backend object) - internal/handlers/vector.go: already atomic (finalizeProvision at L370, L522) - internal/handlers/deploy.go: DIFFERENT MODEL — async (202 not 201); row persisted before async build runs; failures update status. Not P0-3 shape. - internal/handlers/stack.go: DIFFERENT MODEL — async (202 not 201); same shape as deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lback tests MR-P0-3 (BugBash 2026-05-20) added an explicit codeToAgentAction entry for `provision_failed` so the catch-all 503 instructs the agent to retry with exponential backoff, not contact support. Two pre-existing tests used `provision_failed` as their "5xx code with no registry entry" fixture and fail under the new contract: - TestRespondError_UnknownCode_5xx_FallsBackToContactSupport - TestErrorEnvelope_503_AllFieldsAndHeader Swap both to use `db_error` — documented in helpers.go's curation principles (lines ~87-89) as one of the 'pure plumbing errors deliberately omitted' from codeToAgentAction. Same shape, same status code, same fallback branch — the test contract is preserved while the production contract gets sharper guidance for the actual MR-P0-3 path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cross-track reliability contract test in e2e/reliability_contract_test.go asserts every AuditKind* constant has a consumer-spec entry. Adding the new constant in this PR without the spec entry tripped that gate in CI. Add entry with IntentionallyNoConsumer=true — mirrors the existing billing.charge_undeliverable and propagation.dead_lettered shape: this is an internal operator alert that intentionally has no customer email and no propagation wiring (a customer whose resource was just torn down by the platform deserves a human follow-up, not an automated template). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
MR-P0-3 (BugBash 2026-05-20) atomic-provisioning hardening — completion of the orphan-resource prevention work shipped in commit 36eac9b.
That commit wired all 18 provisioning callsites through
finalizeProvision, which on persistence failure tears down the backend object (best-effort), soft-deletes the row, and returnserrProvisionPersistFailedso handlers map to 503 instead of 201. This PR closes three remaining gaps:Operator-visible audit row: emit
AuditKindProvisionPersistenceFailed(provision.persistence_failed) on every persistence-failure path. MirrorsAuditKindBillingChargeUndeliverableandAuditKindPropagationDeadLettered— internal operator alert, not customer-email.agent_action for
provision_failed: the catch-all 503 envelope used to fall back toAgentActionContactSupport. That's wrong for the MR-P0-3 path — the backend object was rolled back, so the right action is 'retry with exponential backoff', not 'email support'. NewcodeToAgentActionentry follows the U3 contract.Registry-iterating coverage tests (CLAUDE.md rule 18): two static guards in
provision_atomicity_coverage_test.goscan every production .go file ininternal/handlers/and assert:models.CreateResourcealso callsfinalizeProvisionfinalizeProvisionalso routes the error throughrespondProvisionFailedortwinCoreErrTestProvisionFailedHasAgentActionasserts the catch-all code carries explicit guidance (no fallback)Per-handler audit findings
db.go: already atomic (finalizeProvision at L265, L432, L642)cache.go: already atomic (L232, L387, L564)nosql.go: already atomic (L228, L380, L569)queue.go: already atomic (L306, L525)storage.go: already atomic (L335, L525)webhook.go: already atomic (L327, L444 — cleanup=nil OK, no backend object)vector.go: already atomic (L370, L522)deploy.go: DIFFERENT MODEL — async 202; row persisted before async build runs; failures update status. Not P0-3 shape.stack.go: DIFFERENT MODEL — async 202; same shape as deploy.Coverage block (CLAUDE.md rule 17)
Test output
Two pre-existing tests (
TestRespondError_UnknownCode_5xx_FallsBackToContactSupport,TestErrorEnvelope_503_AllFieldsAndHeader) usedprovision_failedas their 'unregistered 5xx code' fixture; swapped todb_error(per helpers.go curation principles).Six other failures (
TestDBNew_*,TestBulkTwin_*) are pre-existing on master — they require port-forward to customer-postgres at localhost:5434; not caused by this PR (verified by checkout master and reproducing).🤖 Generated with Claude Code