fix(purchases): recover executions stranded in 'approved' (closes #632) #635
Approving a purchase flips the execution to 'approved' before running the purchase synchronously inside the HTTP/Lambda request, and only finalizes to 'completed'/'failed' after executePurchase returns. An interruption between the two (Lambda timeout, cold-start eviction, panic) leaves the row persisted as 'approved' with no owner, no error, and purchased=false. There is no automatic recovery, so the row requires manual DB intervention.

Add a recovery sweep, RecoverStrandedApprovals, run at the top of the existing ProcessScheduledPurchases scheduled task (already advisory-locked). It finds executions stuck in 'approved' past a 15-minute threshold (the purchase Lambda timeout is 60s) via the new GetStaleApprovedExecutions store method and drives them into a terminal 'failed' state with a clear, operator-readable error so the row is visible in History (#623) and Retry-able (#47) instead of silently stuck.

The sweep deliberately does NOT re-run the purchase: commitment creation has no idempotency token (EC2 PurchaseReservedInstancesOffering sets no ClientToken; CreateSavingsPlan has none), so auto-re-driving a row interrupted after AWS created the commitment but before the row persisted would double-purchase. The 'approved' -> 'failed' transition is atomic (TransitionExecutionStatus only flips rows still in 'approved'), so a late-completing original run is never clobbered. Mirrors the existing RI-exchange stale-sweep pattern (GetStaleProcessingExchanges).

Adds a regression test that simulates an interrupted execution and asserts the row becomes failed, is not re-executed (no provider created), a fresh approved row is untouched, and a late-completing row is skipped.
Walkthrough: Adds a recovery sweep for purchase executions stuck in "approved": a new store interface method and Postgres query, Manager.RecoverStrandedApprovals with recovered-count tracking, a handler log metric, and test/mock updates plus unit tests for the recovery behaviors.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/purchase/manager.go`:
- Around line 190-194: The current loop treats every error returned from
TransitionExecutionStatus as a benign race and continues, which hides real
DB/query failures; change the handling in the block that checks txErr (around
the call to TransitionExecutionStatus for exec.ExecutionID) to distinguish
sentinel "already transitioned"/"not found" errors from other failures: use
errors.Is (or compare against your package sentinel errors like
ErrAlreadyTransitioned/ErrNotFound or sql.ErrNoRows) and only call logging.Warnf
and continue for those specific cases, but for any other txErr log it as an
error and return/propagate it to fail the sweep so stranded rows aren’t silently
skipped.
Files selected for processing (10):
- internal/analytics/collector_test.go
- internal/api/mocks_test.go
- internal/config/interfaces.go
- internal/config/store_postgres.go
- internal/purchase/manager.go
- internal/purchase/manager_test.go
- internal/purchase/mocks_test.go
- internal/scheduler/scheduler_test.go
- internal/server/handler.go
- internal/server/test_helpers_test.go
…s in stranded-approval recovery

Address CR finding on PR #635 (internal/purchase/manager.go:194).

Before: every TransitionExecutionStatus error in the recovery loop was treated as a skip-this-row. That swallows real store failures (DB unreachable, query syntax error, permission revoked mid-loop) the same way it swallows benign races (concurrent sweep handled the row, or the original run finished after the LIST snapshot). The result is silent under-recovery: a transient DB outage during a sweep makes every row look like a race and leaves them all stranded, while the operator sees "sweep finished, recovered=0" and assumes the system is healthy.

After: probe the current row state via GetExecutionByID. A clean read with Status != "approved" confirms the race and skips the row as before. Any other outcome (read error, still-approved row) is a real failure and the sweep returns with the error and the count of rows already recovered, so the operator can see exactly how far the sweep got before the failure.

The unrelated failure TestManager_ExecuteAndFinalize_HistorySaveFailure_StaysVisible (in execution_test.go:1234) predates this change and exercises a different code path (executeAndFinalize in execution.go).
…loses #636) (#638)

* feat(purchases): add deterministic idempotency token helper (#636)

Add DeriveIdempotencyToken (SHA-256 of execution_id:rec_index, 64-char hex fitting the AWS ClientToken limit) plus an IdempotencyToken field on PurchaseOptions and an IdempotencyTagKey constant. The token is stable across a strand-and-re-drive so a re-driven purchase reuses the same value: Savings Plans dedupe on it natively via ClientToken and EC2 RIs use it as a dedupe tag, making commitment creation idempotent.

* feat(purchases): make EC2 RI and Savings Plans creation idempotent (#636)

Thread a deterministic per-rec idempotency token (derived from execution_id + rec index) through the purchase fan-out so a re-driven stranded execution never double-buys. Savings Plans set the token as the native CreateSavingsPlan ClientToken, which AWS dedupes server-side. EC2 Reserved Instances have no ClientToken, so the client looks up an RI already tagged with the token before purchasing (short-circuiting on a re-drive) and tags the newly bought RI with it afterwards; a failed lookup refuses to purchase rather than risk a double-buy. The residual purchase-then-tag-fails window is irreducible given EC2's API and stays backstopped by #635 safe-fail.

Does not flip RecoverStrandedApprovals to re-drive; that remains the documented follow-up now that idempotency is safe to rely on.
…t idempotent

Extends the #636 IdempotencyToken mechanism to the five remaining AWS commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot double-purchase. Stacks on #638.

- RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically from the token instead of time.Now(). Pre-purchase Describe-by-ID guard short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a guard miss and is recovered to the existing commitment. Double-buy is impossible by two independent mechanisms.
- OpenSearch: derive ReservationName from the token; the name is unique per account+region so AWS rejects a duplicate with ResourceAlreadyExistsException (recovered by name). A pre-purchase by-name Describe guard short-circuits.
- Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token tag, with the token written post-purchase via CreateTags. Residual window (tagging support uncertain) documented and backstopped by #635's safe-fail. Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy.

Empty token preserves prior non-idempotent behaviour for non-execution callers. Adds common.IdempotentReservationID helper.

Per-provider regression tests: same token on re-drive does not create a second commitment (guard short-circuit, AlreadyExists recovery, fail-loud-on-lookup-error).

Refs #641.
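The RDS/ElastiCache/MemoryDB pattern above (deterministic ID, Describe guard, AlreadyExists backstop) might be sketched like this. Everything here is hypothetical: `reservationAPI`, `errAlreadyExists`, the `res-` ID format, and `fakeAPI` are illustrative stand-ins, and the real common.IdempotentReservationID may derive IDs differently. Empty-token handling is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the service's native AlreadyExists fault.
var errAlreadyExists = errors.New("reservation already exists")

// reservationAPI is a hypothetical abstraction over Describe and Purchase.
type reservationAPI interface {
	Exists(id string) (bool, error)
	Purchase(id string) error
}

// idempotentReservationID derives a stable ID from the token.
// The "res-" prefix and 32-character cut are assumptions.
func idempotentReservationID(token string) string {
	if len(token) > 32 {
		token = token[:32]
	}
	return "res-" + token
}

// purchaseReservation reports whether a new commitment was created.
func purchaseReservation(api reservationAPI, token string) (id string, created bool, err error) {
	id = idempotentReservationID(token)
	exists, err := api.Exists(id)
	if err != nil {
		return "", false, fmt.Errorf("lookup %s failed, refusing to purchase: %w", id, err)
	}
	if exists {
		return id, false, nil // guard short-circuit on re-drive
	}
	if err := api.Purchase(id); err != nil {
		if errors.Is(err, errAlreadyExists) {
			return id, false, nil // guard miss, recovered via the native fault
		}
		return "", false, fmt.Errorf("purchase %s: %w", id, err)
	}
	return id, true, nil
}

// fakeAPI is an in-memory fake; guardBlind simulates a Describe guard miss.
type fakeAPI struct {
	existing   map[string]bool
	guardBlind bool
	purchases  int
}

func (f *fakeAPI) Exists(id string) (bool, error) {
	if f.guardBlind {
		return false, nil
	}
	return f.existing[id], nil
}

func (f *fakeAPI) Purchase(id string) error {
	if f.existing[id] {
		return errAlreadyExists
	}
	f.existing[id] = true
	f.purchases++
	return nil
}

func main() {
	f := &fakeAPI{existing: map[string]bool{}}
	_, created, _ := purchaseReservation(f, "abc123")
	fmt.Println(created, f.purchases)
	f.guardBlind = true
	_, created, _ = purchaseReservation(f, "abc123")
	fmt.Println(created, f.purchases)
}
```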
…t idempotent (refs #641) (#652)

* feat(purchases): make AWS RDS/ElastiCache/MemoryDB/OpenSearch/Redshift idempotent

* fix(purchases): redact idempotency tokens in AWS re-drive skip logs

CodeRabbit (PR #652) flagged that the "already exists; skipping purchase" log lines emit the full caller-supplied idempotency token, a stable per-execution identifier that should not leak verbatim into persistent logs.

Add common.MaskToken (8-char prefix + ellipsis, "(none)" for empty) and use it across all five idempotent AWS executors (RDS, ElastiCache, MemoryDB, OpenSearch, Redshift). The masked prefix still correlates log lines for a single purchase. CR flagged three; OpenSearch and Redshift carried the identical pattern and are fixed here too for consistency.

Log-message-only change: the idempotency guard, derived reservation IDs, and AlreadyExists recovery are untouched, so the same-token-no-double-buy invariant is preserved per service.

Refs #641.
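A minimal sketch of what a masking helper like MaskToken might look like. The commit specifies only the 8-character prefix, the ellipsis, and "(none)" for an empty token; the ASCII "..." form and the handling of tokens of 8 characters or fewer (fully hidden here) are assumptions.

```go
package main

import "fmt"

// MaskToken returns a log-safe form of an idempotency token.
func MaskToken(token string) string {
	const prefixLen = 8
	if token == "" {
		return "(none)"
	}
	if len(token) <= prefixLen {
		// Assumption: a short token is hidden entirely, since showing the
		// prefix would reveal the whole value.
		return "..."
	}
	return token[:prefixLen] + "..."
}

func main() {
	fmt.Printf("reservation already exists; skipping purchase (token=%s)\n",
		MaskToken("0123456789abcdef0123456789abcdef"))
}
```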
Summary
Fixes the P1 where an approved purchase strands permanently in `approved` (purchase never runs, no error, `completed_at` NULL) when the synchronous execution is interrupted (Lambda timeout / cold-start eviction / panic); see #632.

Approach: recovery sweep + safe-fail (not auto-re-drive)

A new `RecoverStrandedApprovals` runs at the top of the existing advisory-locked `ProcessScheduledPurchases` scheduled task. It calls a new store method `GetStaleApprovedExecutions(olderThan)` (`WHERE status='approved' AND updated_at < NOW() - interval`, mirroring the existing `GetStaleProcessingExchanges` RI-exchange prior art) and drives each stranded row into terminal `failed` with a clear operator-readable error. Threshold is 15 min (the purchase Lambda timeout is 60s, so the sweep can never fail a run that is still in flight).

Why fail, not re-run (the key decision)

Commitment creation has no idempotency token: EC2 `PurchaseReservedInstancesOffering` sets no `ClientToken` and `CreateSavingsPlan` has no idempotency field. Auto-re-driving a row that was interrupted after AWS created the commitment but before the row persisted would double-purchase. So the sweep never re-runs the purchase (asserted via `mockFactory.AssertNotCalled`); it only fails the row. The `approved` -> `failed` transition uses atomic `TransitionExecutionStatus` (`WHERE status='approved'`), so a late-completing original run is never clobbered. Failed rows are visible in History (#623) and Retry-able after the operator confirms AWS state.

The idempotent auto-re-drive path is deferred (needs a ClientToken/dedupe prerequisite) and filed as a follow-up.
Test plan

- `approved` -> `failed`, purchase NOT re-executed, a fresh `approved` row (< threshold) untouched, a late-completed row skipped
- `go build` / `go vet` / `golangci-lint` clean; pre-commit passed

Closes #632.