Skip to content

fix(purchases): recover executions stranded in 'approved' (closes #632)#635

Merged
cristim merged 2 commits into
feat/multicloud-web-frontendfrom
fix/stranded-approval-recovery
May 21, 2026
Merged

fix(purchases): recover executions stranded in 'approved' (closes #632)#635
cristim merged 2 commits into
feat/multicloud-web-frontendfrom
fix/stranded-approval-recovery

Conversation

@cristim
Copy link
Copy Markdown
Member

@cristim cristim commented May 20, 2026

Summary

Fixes the P1 where an approved purchase strands permanently in approved (purchase never runs, no error, completed_at NULL) when the synchronous execution is interrupted (Lambda timeout / cold-start eviction / panic) — see #632.

Approach: recovery sweep + safe-fail (not auto-re-drive)

A new RecoverStrandedApprovals runs at the top of the existing advisory-locked ProcessScheduledPurchases scheduled task. It calls a new store method GetStaleApprovedExecutions(olderThan) (WHERE status='approved' AND updated_at < NOW() - interval, mirroring the existing GetStaleProcessingExchanges RI-exchange prior art) and drives each stranded row into terminal failed with a clear operator-readable error. Threshold is 15 min (the purchase Lambda timeout is 60s, so no in-flight run can ever be failed under itself).

Why fail, not re-run (the key decision)

Commitment creation has no idempotency token: EC2 PurchaseReservedInstancesOffering sets no ClientToken and CreateSavingsPlan has no idempotency field. Auto-re-driving a row that was interrupted after AWS created the commitment but before the row persisted would double-purchase. So the sweep never re-runs the purchase (asserted via mockFactory.AssertNotCalled); it only fails the row. The approved->failed transition uses atomic TransitionExecutionStatus (WHERE status='approved'), so a late-completing original run is never clobbered. Failed rows are visible in History (#623) and Retry-able after the operator confirms AWS state.

The idempotent auto-re-drive path is deferred (needs a ClientToken/dedupe prerequisite) and filed as a follow-up.

Test plan

  • Regression test: stranded approved -> failed, purchase NOT re-executed, a fresh approved row (< threshold) untouched, a late-completed row skipped
  • All 10 ProcessScheduled/Recover tests green; scheduler/server/api/analytics 1763 pass; config 497 pass; go build/go vet/golangci-lint clean; pre-commit passed

Closes #632.

Summary by CodeRabbit

  • Bug Fixes

    • Executions stuck in "approved" beyond the staleness threshold are now automatically moved to "failed".
  • New Features

    • Processing results now include a "recovered" metric counting approvals transitioned during recovery.
    • Scheduled purchases processing runs the recovery sweep before normal work, and recovery failures are logged without blocking processing.
  • Tests

    • Added regression tests covering recovery behavior and edge cases.

Review Change Stack

Approving a purchase flips the execution to 'approved' before running the
purchase synchronously inside the HTTP/Lambda request, and only finalizes to
'completed'/'failed' after executePurchase returns. An interruption between the
two (Lambda timeout, cold-start eviction, panic) leaves the row persisted as
'approved' with no owner, no error, and purchased=false — no automatic
recovery, requiring manual DB intervention.

Add a recovery sweep, RecoverStrandedApprovals, run at the top of the existing
ProcessScheduledPurchases scheduled task (already advisory-locked). It finds
executions stuck in 'approved' past a 15-minute threshold (the purchase Lambda
timeout is 60s) via the new GetStaleApprovedExecutions store method and drives
them into a terminal 'failed' state with a clear, operator-readable error so
the row is visible in History (#623) and Retry-able (#47) instead of silently
stuck.

The sweep deliberately does NOT re-run the purchase: commitment creation has no
idempotency token (EC2 PurchaseReservedInstancesOffering sets no ClientToken;
CreateSavingsPlan has none), so auto-re-driving a row interrupted after AWS
created the commitment but before the row persisted would double-purchase. The
'approved' -> 'failed' transition is atomic (TransitionExecutionStatus only
flips rows still in 'approved'), so a late-completing original run is never
clobbered.

Mirrors the existing RI-exchange stale-sweep pattern (GetStaleProcessingExchanges).
Adds a regression test that simulates an interrupted execution and asserts the
row becomes failed, is not re-executed (no provider created), a fresh approved
row is untouched, and a late-completing row is skipped.
@cristim cristim added bug Something isn't working triaged Item has been triaged priority/p1 Next up; this sprint severity/high Significant harm urgency/this-sprint Within the current sprint impact/all-users Affects every user effort/m Days type/bug Defect labels May 20, 2026
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 20, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 811701d2-88b1-4bd7-bc9c-e4bb83774e29

📥 Commits

Reviewing files that changed from the base of the PR and between d97e0e3 and c37467b.

📒 Files selected for processing (1)
  • internal/purchase/manager.go

📝 Walkthrough

Walkthrough

Adds a recovery sweep for purchase executions stuck in "approved": new store interface method and Postgres query, Manager.RecoverStrandedApprovals with recovered-count tracking, handler log metric, and test/mocks updates plus unit tests for recovery behaviors.

Changes

Stranded Approval Recovery

Layer / File(s) Summary
Interface Contract and Database Layer
internal/config/interfaces.go, internal/config/store_postgres.go
New GetStaleApprovedExecutions(ctx, olderThan) on StoreInterface; Postgres implementation selects purchase_executions with status='approved' and updated_at older than the provided duration and returns via existing query helper.
Manager Recovery Logic and Integration
internal/purchase/manager.go
Adds ProcessResult.Recovered and RecoverStrandedApprovals(ctx) which lists stale approvals, atomically transitions eligible rows to failed with a recovery message, counts recovered rows, and invokes this sweep at the start of ProcessScheduledPurchases (errors logged, not blocking).
Handler Logging Update
internal/server/handler.go
Updates handleProcessScheduledPurchases success log to include the recovered count from ProcessResult.
Test Mock Infrastructure
internal/api/mocks_test.go, internal/purchase/mocks_test.go, internal/scheduler/scheduler_test.go, internal/analytics/collector_test.go, internal/server/test_helpers_test.go
Adds GetStaleApprovedExecutions(ctx, olderThan) mock/stub methods across test doubles following existing mock patterns so tests can exercise the new store call.
Manager Recovery Tests and Existing Test Updates
internal/purchase/manager_test.go
Updates existing ProcessScheduledPurchases tests to expect the new store call and adds three regression tests for RecoverStrandedApprovals: successful recovery, fresh result no-op, and race where a late completion prevents clobbering.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Suggested labels

priority/p2, severity/medium, urgency/this-quarter, impact/many

Poem

🐰 I hop through queues where approvals stall,
Finding rows that time forgot at all.
With careful checks and an atomic sweep,
I mark them failed so pipelines sleep.
Recovery done — the system hums; I nap, content and spry. 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately describes the main change: adding recovery functionality for executions stuck in 'approved' status, and directly addresses the issue (#632) being fixed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/stranded-approval-recovery

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/purchase/manager.go`:
- Around line 190-194: The current loop treats every error returned from
TransitionExecutionStatus as a benign race and continues, which hides real
DB/query failures; change the handling in the block that checks txErr (around
the call to TransitionExecutionStatus for exec.ExecutionID) to distinguish
sentinel "already transitioned"/"not found" errors from other failures: use
errors.Is (or compare against your package sentinel errors like
ErrAlreadyTransitioned/ErrNotFound or sql.ErrNoRows) and only call logging.Warnf
and continue for those specific cases, but for any other txErr log it as an
error and return/propagate it to fail the sweep so stranded rows aren’t silently
skipped.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bde7204b-d1c9-4d0b-935b-4a6377222ea9

📥 Commits

Reviewing files that changed from the base of the PR and between 9ecab71 and d97e0e3.

📒 Files selected for processing (10)
  • internal/analytics/collector_test.go
  • internal/api/mocks_test.go
  • internal/config/interfaces.go
  • internal/config/store_postgres.go
  • internal/purchase/manager.go
  • internal/purchase/manager_test.go
  • internal/purchase/mocks_test.go
  • internal/scheduler/scheduler_test.go
  • internal/server/handler.go
  • internal/server/test_helpers_test.go

Comment thread internal/purchase/manager.go Outdated
…s in stranded-approval recovery

Address CR finding on PR #635 (internal/purchase/manager.go:194):

Before: every TransitionExecutionStatus error in the recovery loop was
treated as a skip-this-row. That swallows real store failures (DB
unreachable, query syntax error, permission revoked mid-loop) the same
way it swallows benign races (concurrent sweep handled the row, or the
original run finished after the LIST snapshot). The result is silent
under-recovery: a transient DB outage during a sweep makes every row
look like a race and leaves them all stranded, while the operator sees
"sweep finished, recovered=0" and assumes the system is healthy.

After: probe the current row state via GetExecutionByID. A clean read
with Status != "approved" confirms the race and skips the row as
before. Any other outcome (read error, still-approved row) is a real
failure and the sweep returns with the error and the count of rows
already recovered, so the operator can see exactly how far the sweep
got before the failure.

The unrelated failure
TestManager_ExecuteAndFinalize_HistorySaveFailure_StaysVisible (in
execution_test.go:1234) predates this change and exercises a different
code path (executeAndFinalize in execution.go).
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 21, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cristim added a commit that referenced this pull request May 21, 2026
…loses #636) (#638)

* feat(purchases): add deterministic idempotency token helper (#636)

Add DeriveIdempotencyToken (SHA-256 of execution_id:rec_index, 64-char
hex fitting the AWS ClientToken limit) plus an IdempotencyToken field on
PurchaseOptions and an IdempotencyTagKey constant. The token is stable
across a strand-and-re-drive so a re-driven purchase reuses the same
value: Savings Plans dedupe on it natively via ClientToken and EC2 RIs
use it as a dedupe tag, making commitment creation idempotent.

* feat(purchases): make EC2 RI and Savings Plans creation idempotent (#636)

Thread a deterministic per-rec idempotency token (derived from
execution_id + rec index) through the purchase fan-out so a re-driven
stranded execution never double-buys.

Savings Plans set the token as the native CreateSavingsPlan ClientToken,
which AWS dedupes server-side. EC2 Reserved Instances have no
ClientToken, so the client looks up an RI already tagged with the token
before purchasing (short-circuiting on a re-drive) and tags the newly
bought RI with it afterwards; a failed lookup refuses to purchase rather
than risk a double-buy. The residual purchase-then-tag-fails window is
irreducible given EC2's API and stays backstopped by #635 safe-fail.

Does not flip RecoverStrandedApprovals to re-drive; that remains the
documented follow-up now that idempotency is safe to rely on.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 21, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim cristim merged commit 08747bc into feat/multicloud-web-frontend May 21, 2026
4 checks passed
@cristim cristim deleted the fix/stranded-approval-recovery branch May 21, 2026 11:58
cristim added a commit that referenced this pull request May 21, 2026
…t idempotent

Extends the #636 IdempotencyToken mechanism to the five remaining AWS
commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot
double-purchase. Stacks on #638.

- RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID
  (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically
  from the token instead of time.Now(). Pre-purchase Describe-by-ID guard
  short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a
  guard miss and is recovered to the existing commitment. Double-buy is
  impossible by two independent mechanisms.
- OpenSearch: derive ReservationName from the token; the name is unique per
  account+region so AWS rejects a duplicate with ResourceAlreadyExistsException
  (recovered by name). A pre-purchase by-name Describe guard short-circuits.
- Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style
  tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token
  tag, with the token written post-purchase via CreateTags. Residual window
  (tagging support uncertain) documented and backstopped by #635's safe-fail.

Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy.
Empty token preserves prior non-idempotent behaviour for non-execution callers.

Adds common.IdempotentReservationID helper. Per-provider regression tests:
same token on re-drive does not create a second commitment (guard short-circuit,
AlreadyExists recovery, fail-loud-on-lookup-error).

Refs #641.
cristim added a commit that referenced this pull request May 22, 2026
…t idempotent (refs #641) (#652)

* feat(purchases): make AWS RDS/ElastiCache/MemoryDB/OpenSearch/Redshift idempotent

Extends the #636 IdempotencyToken mechanism to the five remaining AWS
commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot
double-purchase. Stacks on #638.

- RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID
  (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically
  from the token instead of time.Now(). Pre-purchase Describe-by-ID guard
  short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a
  guard miss and is recovered to the existing commitment. Double-buy is
  impossible by two independent mechanisms.
- OpenSearch: derive ReservationName from the token; the name is unique per
  account+region so AWS rejects a duplicate with ResourceAlreadyExistsException
  (recovered by name). A pre-purchase by-name Describe guard short-circuits.
- Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style
  tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token
  tag, with the token written post-purchase via CreateTags. Residual window
  (tagging support uncertain) documented and backstopped by #635's safe-fail.

Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy.
Empty token preserves prior non-idempotent behaviour for non-execution callers.

Adds common.IdempotentReservationID helper. Per-provider regression tests:
same token on re-drive does not create a second commitment (guard short-circuit,
AlreadyExists recovery, fail-loud-on-lookup-error).

Refs #641.

* fix(purchases): redact idempotency tokens in AWS re-drive skip logs

CodeRabbit (PR #652) flagged that the "already exists; skipping purchase"
log lines emit the full caller-supplied idempotency token, a stable
per-execution identifier that should not leak verbatim into persistent logs.

Add common.MaskToken (8-char prefix + ellipsis, "(none)" for empty) and use
it across all five idempotent AWS executors (RDS, ElastiCache, MemoryDB,
OpenSearch, Redshift). The masked prefix still correlates log lines for a
single purchase. CR flagged three; OpenSearch and Redshift carried the
identical pattern and are fixed here too for consistency.

Log-message-only change: the idempotency guard, derived reservation IDs, and
AlreadyExists recovery are untouched, so the same-token-no-double-buy
invariant is preserved per service.

Refs #641.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working effort/m Days impact/all-users Affects every user priority/p1 Next up; this sprint severity/high Significant harm triaged Item has been triaged type/bug Defect urgency/this-sprint Within the current sprint

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant