Skip to content

fix(p0-3): emit provision.persistence_failed audit + agent_action + coverage tests#121

Merged
mastermanas805 merged 3 commits into
masterfrom
fix/p0-3-atomic-provisioning-2026-05-20
May 20, 2026
Merged

fix(p0-3): emit provision.persistence_failed audit + agent_action + coverage tests#121
mastermanas805 merged 3 commits into
masterfrom
fix/p0-3-atomic-provisioning-2026-05-20

Conversation

@mastermanas805
Copy link
Copy Markdown
Member

Summary

MR-P0-3 (BugBash 2026-05-20) atomic-provisioning hardening — completion of the orphan-resource prevention work shipped in commit 36eac9b.

That commit wired all 18 provisioning callsites through finalizeProvision, which on persistence failure tears down the backend object (best-effort), soft-deletes the row, and returns errProvisionPersistFailed so handlers map to 503 instead of 201. This PR closes three remaining gaps:

  1. Operator-visible audit row: emit AuditKindProvisionPersistenceFailed (provision.persistence_failed) on every persistence-failure path. Mirrors AuditKindBillingChargeUndeliverable and AuditKindPropagationDeadLettered — internal operator alert, not customer-email.

  2. agent_action for provision_failed: the catch-all 503 envelope used to fall back to AgentActionContactSupport. That's wrong for the MR-P0-3 path — the backend object was rolled back, so the right action is 'retry with exponential backoff', not 'email support'. New codeToAgentAction entry follows the U3 contract.

  3. Registry-iterating coverage tests (CLAUDE.md rule 18): two static guards in provision_atomicity_coverage_test.go scan every production .go file in internal/handlers/ and assert:

    • every file calling models.CreateResource also calls finalizeProvision
    • every file calling finalizeProvision also routes the error through respondProvisionFailed or twinCoreErr
    • TestProvisionFailedHasAgentAction asserts the catch-all code carries explicit guidance (no fallback)

Per-handler audit findings

  • db.go: already atomic (finalizeProvision at L265, L432, L642)
  • cache.go: already atomic (L232, L387, L564)
  • nosql.go: already atomic (L228, L380, L569)
  • queue.go: already atomic (L306, L525)
  • storage.go: already atomic (L335, L525)
  • webhook.go: already atomic (L327, L444 — cleanup=nil OK, no backend object)
  • vector.go: already atomic (L370, L522)
  • deploy.go: DIFFERENT MODEL — async 202; row persisted before async build runs; failures update status. Not P0-3 shape.
  • stack.go: DIFFERENT MODEL — async 202; same shape as deploy.

Coverage block (CLAUDE.md rule 17)

Symptom:        201 returned for resource the platform can't address
                (orphan backend + NULL conn_url + NULL PRID + billed)
Enumeration:    rg -l 'models.CreateResource\(' internal/handlers/
                yielded db.go, cache.go, nosql.go, queue.go, storage.go,
                webhook.go, vector.go (each w/ anon + auth + twin paths)
Sites found:    7 files × ~3 paths = ~21 provisioning entry points
Sites touched:  none added — every site already calls finalizeProvision
                per commit 36eac9b. This PR adds a coverage TEST that
                fails the build if a new site bypasses the pattern.
Coverage test:  TestEveryCreateResourceCallSiteIsFollowedByFinalizeProvision
                + TestEveryFinalizeProvisionCallSiteRespondsProvisionFailedOnError
                + TestProvisionFailedHasAgentAction
Live verified:  pending CI green + master Deploy

Test output

=== RUN   TestEveryCreateResourceCallSiteIsFollowedByFinalizeProvision
--- PASS: TestEveryCreateResourceCallSiteIsFollowedByFinalizeProvision (0.73s)
=== RUN   TestEveryFinalizeProvisionCallSiteRespondsProvisionFailedOnError
--- PASS: TestEveryFinalizeProvisionCallSiteRespondsProvisionFailedOnError (0.70s)
=== RUN   TestProvisionFailedHasAgentAction
--- PASS: TestProvisionFailedHasAgentAction (0.00s)

=== RUN   TestFinalizeProvision_PersistenceFailure_ReturnsErrorAndRunsCleanup
INFO audit.event audit_kind=provision.persistence_failed actor=system resource_type=postgres
--- PASS: TestFinalizeProvision_PersistenceFailure_ReturnsErrorAndRunsCleanup (0.10s)
=== RUN   TestFinalizeProvision_Success_FlipsToActive
--- PASS: TestFinalizeProvision_Success_FlipsToActive (0.08s)

Two pre-existing tests (TestRespondError_UnknownCode_5xx_FallsBackToContactSupport, TestErrorEnvelope_503_AllFieldsAndHeader) used provision_failed as their 'unregistered 5xx code' fixture; swapped to db_error (per helpers.go curation principles).

Six other failures (TestDBNew_*, TestBulkTwin_*) are pre-existing on master — they require port-forward to customer-postgres at localhost:5434; not caused by this PR (verified by checkout master and reproducing).

🤖 Generated with Claude Code

mastermanas805 and others added 3 commits May 20, 2026 16:27
…overage tests

MR-P0-3 (BugBash 2026-05-20) atomic-provisioning hardening — completion
of the orphan-resource prevention work shipped in commit 36eac9b.

That commit wired all 18 provisioning callsites through finalizeProvision,
which on persistence failure (post-RPC) tears down the backend object
(best-effort), soft-deletes the row, and returns errProvisionPersistFailed
so handlers map to 503 instead of 201. This commit closes three remaining
gaps in that work:

1) **Operator-visible audit row**: emit AuditKindProvisionPersistenceFailed
   on every persistence-failure path. Operators can now reconstruct the
   exact moment the platform produced an unreachable resource — by
   resource_id / resource_type / log_prefix / provider_resource_id /
   request_id / tier / env. NOT wired into the customer-email forwarder
   (this is an internal alert, mirrors AuditKindBillingChargeUndeliverable
   and AuditKindPropagationDeadLettered). Sync emit + WARN slog on
   failure — must NEVER block the 503 to the customer.

2) **agent_action for provision_failed**: the catch-all 503 envelope used
   to fall back to AgentActionContactSupport ('email support'). That's
   wrong for the MR-P0-3 path — the backend object was rolled back and
   the row soft-deleted, so the right action is 'retry with exponential
   backoff', not contact support. Adds an explicit codeToAgentAction
   entry under the U3 contract (opens with 'Tell the user', names the
   reason, names the action, full https://instanode.dev URL, < 280 chars).

3) **Registry-iterating coverage tests** (CLAUDE.md rule 18): two static
   guards in provision_atomicity_coverage_test.go scan every production
   .go file in internal/handlers/ and assert:
   - every file calling models.CreateResource also calls finalizeProvision
     (so a new handler can't bypass the two-phase pattern)
   - every file calling finalizeProvision also routes the error through
     respondProvisionFailed or twinCoreErr (so a new handler can't
     swallow the sentinel)
   Plus TestProvisionFailedHasAgentAction asserts the catch-all code
   still has explicit agent_action guidance (defends against a future
   refactor that drops the new entry).

Coverage block per CLAUDE.md rule 17:

  Symptom:        201 returned for resource the platform can't address
                  (orphan backend + NULL conn_url + NULL PRID + billed)
  Enumeration:    rg -l 'models.CreateResource\(' internal/handlers/
                  yielded db.go, cache.go, nosql.go, queue.go, storage.go,
                  webhook.go, vector.go (each w/ anon + auth + twin paths)
  Sites found:    7 files × ~3 paths = ~21 provisioning entry points
  Sites touched:  none added — every site already calls finalizeProvision
                  per commit 36eac9b. This commit adds a coverage TEST
                  that fails the build if a new site bypasses the pattern.
  Coverage test:  TestEveryCreateResourceCallSiteIsFollowedByFinalizeProvision
                  + TestEveryFinalizeProvisionCallSiteRespondsProvisionFailedOnError
                  + TestProvisionFailedHasAgentAction
  Live verified:  (this commit; pending CI green + deploy)

Per-handler audit findings:
- internal/handlers/db.go:        already atomic (finalizeProvision at L265, L432, L642)
- internal/handlers/cache.go:     already atomic (finalizeProvision at L232, L387, L564)
- internal/handlers/nosql.go:     already atomic (finalizeProvision at L228, L380, L569)
- internal/handlers/queue.go:     already atomic (finalizeProvision at L306, L525)
- internal/handlers/storage.go:   already atomic (finalizeProvision at L335, L525)
- internal/handlers/webhook.go:   already atomic (finalizeProvision at L327, L444 — cleanup=nil ok, no backend object)
- internal/handlers/vector.go:    already atomic (finalizeProvision at L370, L522)
- internal/handlers/deploy.go:    DIFFERENT MODEL — async (202 not 201); row persisted before async build runs; failures update status. Not P0-3 shape.
- internal/handlers/stack.go:     DIFFERENT MODEL — async (202 not 201); same shape as deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lback tests

MR-P0-3 (BugBash 2026-05-20) added an explicit codeToAgentAction entry for
`provision_failed` so the catch-all 503 instructs the agent to retry with
exponential backoff, not contact support. Two pre-existing tests used
`provision_failed` as their "5xx code with no registry entry" fixture and
fail under the new contract:

  - TestRespondError_UnknownCode_5xx_FallsBackToContactSupport
  - TestErrorEnvelope_503_AllFieldsAndHeader

Swap both to use `db_error` — documented in helpers.go's curation
principles (lines ~87-89) as one of the 'pure plumbing errors deliberately
omitted' from codeToAgentAction. Same shape, same status code, same
fallback branch — the test contract is preserved while the production
contract gets sharper guidance for the actual MR-P0-3 path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cross-track reliability contract test in e2e/reliability_contract_test.go
asserts every AuditKind* constant has a consumer-spec entry. Adding the new
constant in this PR without the spec entry tripped that gate in CI.

Add entry with IntentionallyNoConsumer=true — mirrors the existing
billing.charge_undeliverable and propagation.dead_lettered shape: this is
an internal operator alert that intentionally has no customer email and
no propagation wiring (a customer whose resource was just torn down by the
platform deserves a human follow-up, not an automated template).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 88c3e75 into master May 20, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant