fix(purchases): recover executions stranded in 'approved' (closes #632) by cristim · Pull Request #635 · LeanerCloud/CUDly

cristim · 2026-05-20T23:56:44Z

Summary

Fixes the P1 where an approved purchase strands permanently in approved (purchase never runs, no error, completed_at NULL) when the synchronous execution is interrupted (Lambda timeout / cold-start eviction / panic) — see #632.

Approach: recovery sweep + safe-fail (not auto-re-drive)

A new RecoverStrandedApprovals runs at the top of the existing advisory-locked ProcessScheduledPurchases scheduled task. It calls a new store method GetStaleApprovedExecutions(olderThan) (WHERE status='approved' AND updated_at < NOW() - interval, mirroring the existing GetStaleProcessingExchanges RI-exchange prior art) and drives each stranded row into terminal failed with a clear operator-readable error. Threshold is 15 min (the purchase Lambda timeout is 60s, so no in-flight run can ever be failed under itself).

Why fail, not re-run (the key decision)

Commitment creation has no idempotency token: EC2 PurchaseReservedInstancesOffering sets no ClientToken and CreateSavingsPlan has no idempotency field. Auto-re-driving a row that was interrupted after AWS created the commitment but before the row persisted would double-purchase. So the sweep never re-runs the purchase (asserted via mockFactory.AssertNotCalled); it only fails the row. The approved->failed transition uses atomic TransitionExecutionStatus (WHERE status='approved'), so a late-completing original run is never clobbered. Failed rows are visible in History (#623) and Retry-able after the operator confirms AWS state.

The idempotent auto-re-drive path is deferred (needs a ClientToken/dedupe prerequisite) and filed as a follow-up.

Test plan

Regression test: stranded approved -> failed, purchase NOT re-executed, a fresh approved row (< threshold) untouched, a late-completed row skipped
All 10 ProcessScheduled/Recover tests green; scheduler/server/api/analytics 1763 pass; config 497 pass; go build/go vet/golangci-lint clean; pre-commit passed

Closes #632.

Summary by CodeRabbit

Bug Fixes
- Executions stuck in "approved" beyond the staleness threshold are now automatically moved to "failed".
New Features
- Processing results now include a "recovered" metric counting approvals transitioned during recovery.
- Scheduled purchases processing runs the recovery sweep before normal work, and recovery failures are logged without blocking processing.
Tests
- Added regression tests covering recovery behavior and edge cases.

Approving a purchase flips the execution to 'approved' before running the purchase synchronously inside the HTTP/Lambda request, and only finalizes to 'completed'/'failed' after executePurchase returns. An interruption between the two (Lambda timeout, cold-start eviction, panic) leaves the row persisted as 'approved' with no owner, no error, and purchased=false — no automatic recovery, requiring manual DB intervention. Add a recovery sweep, RecoverStrandedApprovals, run at the top of the existing ProcessScheduledPurchases scheduled task (already advisory-locked). It finds executions stuck in 'approved' past a 15-minute threshold (the purchase Lambda timeout is 60s) via the new GetStaleApprovedExecutions store method and drives them into a terminal 'failed' state with a clear, operator-readable error so the row is visible in History (#623) and Retry-able (#47) instead of silently stuck. The sweep deliberately does NOT re-run the purchase: commitment creation has no idempotency token (EC2 PurchaseReservedInstancesOffering sets no ClientToken; CreateSavingsPlan has none), so auto-re-driving a row interrupted after AWS created the commitment but before the row persisted would double-purchase. The 'approved' -> 'failed' transition is atomic (TransitionExecutionStatus only flips rows still in 'approved'), so a late-completing original run is never clobbered. Mirrors the existing RI-exchange stale-sweep pattern (GetStaleProcessingExchanges). Adds a regression test that simulates an interrupted execution and asserts the row becomes failed, is not re-executed (no provider created), a fresh approved row is untouched, and a late-completing row is skipped.

cristim · 2026-05-20T23:56:50Z

@coderabbitai review

coderabbitai · 2026-05-20T23:56:55Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 811701d2-88b1-4bd7-bc9c-e4bb83774e29

📥 Commits

Reviewing files that changed from the base of the PR and between d97e0e3 and c37467b.

📒 Files selected for processing (1)

internal/purchase/manager.go

📝 Walkthrough

Walkthrough

Adds a recovery sweep for purchase executions stuck in "approved": new store interface method and Postgres query, Manager.RecoverStrandedApprovals with recovered-count tracking, handler log metric, and test/mocks updates plus unit tests for recovery behaviors.

Changes

Stranded Approval Recovery

Layer / File(s)	Summary
Interface Contract and Database Layer `internal/config/interfaces.go`, `internal/config/store_postgres.go`	New `GetStaleApprovedExecutions(ctx, olderThan)` on `StoreInterface`; Postgres implementation selects `purchase_executions` with `status='approved'` and `updated_at` older than the provided duration and returns via existing query helper.
Manager Recovery Logic and Integration `internal/purchase/manager.go`	Adds `ProcessResult.Recovered` and `RecoverStrandedApprovals(ctx)` which lists stale approvals, atomically transitions eligible rows to `failed` with a recovery message, counts recovered rows, and invokes this sweep at the start of `ProcessScheduledPurchases` (errors logged, not blocking).
Handler Logging Update `internal/server/handler.go`	Updates `handleProcessScheduledPurchases` success log to include the `recovered` count from `ProcessResult`.
Test Mock Infrastructure `internal/api/mocks_test.go`, `internal/purchase/mocks_test.go`, `internal/scheduler/scheduler_test.go`, `internal/analytics/collector_test.go`, `internal/server/test_helpers_test.go`	Adds `GetStaleApprovedExecutions(ctx, olderThan)` mock/stub methods across test doubles following existing mock patterns so tests can exercise the new store call.
Manager Recovery Tests and Existing Test Updates `internal/purchase/manager_test.go`	Updates existing `ProcessScheduledPurchases` tests to expect the new store call and adds three regression tests for `RecoverStrandedApprovals`: successful recovery, fresh result no-op, and race where a late completion prevents clobbering.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

feat(purchases): idempotent commitment creation (ClientToken/dedupe) to enable auto-re-drive of stranded approvals #636: Related conceptually — proposes idempotency/dedupe changes to allow re-driving recoveries instead of failing; touches the broader recovery/idempotency design.
bug(purchases): approve strands execution in 'approved' (purchase never runs, no error) on interrupted sync execution #632: Directly related — this PR implements the recovery sweep for stranded "approved" executions described in the issue.

Suggested labels

priority/p2, severity/medium, urgency/this-quarter, impact/many

Poem

🐰 I hop through queues where approvals stall,
Finding rows that time forgot at all.
With careful checks and an atomic sweep,
I mark them failed so pipelines sleep.
Recovery done — the system hums; I nap, content and spry. 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and accurately describes the main change: adding recovery functionality for executions stuck in 'approved' status, and directly addresses the issue (`#632`) being fixed.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/stranded-approval-recovery

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai · 2026-05-20T23:56:55Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/purchase/manager.go`:
- Around line 190-194: The current loop treats every error returned from
TransitionExecutionStatus as a benign race and continues, which hides real
DB/query failures; change the handling in the block that checks txErr (around
the call to TransitionExecutionStatus for exec.ExecutionID) to distinguish
sentinel "already transitioned"/"not found" errors from other failures: use
errors.Is (or compare against your package sentinel errors like
ErrAlreadyTransitioned/ErrNotFound or sql.ErrNoRows) and only call logging.Warnf
and continue for those specific cases, but for any other txErr log it as an
error and return/propagate it to fail the sweep so stranded rows aren’t silently
skipped.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bde7204b-d1c9-4d0b-935b-4a6377222ea9

📥 Commits

Reviewing files that changed from the base of the PR and between 9ecab71 and d97e0e3.

📒 Files selected for processing (10)

internal/analytics/collector_test.go
internal/api/mocks_test.go
internal/config/interfaces.go
internal/config/store_postgres.go
internal/purchase/manager.go
internal/purchase/manager_test.go
internal/purchase/mocks_test.go
internal/scheduler/scheduler_test.go
internal/server/handler.go
internal/server/test_helpers_test.go

…s in stranded-approval recovery Address CR finding on PR #635 (internal/purchase/manager.go:194): Before: every TransitionExecutionStatus error in the recovery loop was treated as a skip-this-row. That swallows real store failures (DB unreachable, query syntax error, permission revoked mid-loop) the same way it swallows benign races (concurrent sweep handled the row, or the original run finished after the LIST snapshot). The result is silent under-recovery: a transient DB outage during a sweep makes every row look like a race and leaves them all stranded, while the operator sees "sweep finished, recovered=0" and assumes the system is healthy. After: probe the current row state via GetExecutionByID. A clean read with Status != "approved" confirms the race and skips the row as before. Any other outcome (read error, still-approved row) is a real failure and the sweep returns with the error and the count of rows already recovered, so the operator can see exactly how far the sweep got before the failure. The unrelated failure TestManager_ExecuteAndFinalize_HistorySaveFailure_StaysVisible (in execution_test.go:1234) predates this change and exercises a different code path (executeAndFinalize in execution.go).

cristim · 2026-05-21T05:23:12Z

@coderabbitai review

coderabbitai · 2026-05-21T05:23:17Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

…loses #636) (#638) * feat(purchases): add deterministic idempotency token helper (#636) Add DeriveIdempotencyToken (SHA-256 of execution_id:rec_index, 64-char hex fitting the AWS ClientToken limit) plus an IdempotencyToken field on PurchaseOptions and an IdempotencyTagKey constant. The token is stable across a strand-and-re-drive so a re-driven purchase reuses the same value: Savings Plans dedupe on it natively via ClientToken and EC2 RIs use it as a dedupe tag, making commitment creation idempotent. * feat(purchases): make EC2 RI and Savings Plans creation idempotent (#636) Thread a deterministic per-rec idempotency token (derived from execution_id + rec index) through the purchase fan-out so a re-driven stranded execution never double-buys. Savings Plans set the token as the native CreateSavingsPlan ClientToken, which AWS dedupes server-side. EC2 Reserved Instances have no ClientToken, so the client looks up an RI already tagged with the token before purchasing (short-circuiting on a re-drive) and tags the newly bought RI with it afterwards; a failed lookup refuses to purchase rather than risk a double-buy. The residual purchase-then-tag-fails window is irreducible given EC2's API and stays backstopped by #635 safe-fail. Does not flip RecoverStrandedApprovals to re-drive; that remains the documented follow-up now that idempotency is safe to rely on.

cristim · 2026-05-21T11:55:31Z

@coderabbitai review

coderabbitai · 2026-05-21T11:55:37Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

…t idempotent Extends the #636 IdempotencyToken mechanism to the five remaining AWS commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot double-purchase. Stacks on #638. - RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically from the token instead of time.Now(). Pre-purchase Describe-by-ID guard short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a guard miss and is recovered to the existing commitment. Double-buy is impossible by two independent mechanisms. - OpenSearch: derive ReservationName from the token; the name is unique per account+region so AWS rejects a duplicate with ResourceAlreadyExistsException (recovered by name). A pre-purchase by-name Describe guard short-circuits. - Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token tag, with the token written post-purchase via CreateTags. Residual window (tagging support uncertain) documented and backstopped by #635's safe-fail. Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy. Empty token preserves prior non-idempotent behaviour for non-execution callers. Adds common.IdempotentReservationID helper. Per-provider regression tests: same token on re-drive does not create a second commitment (guard short-circuit, AlreadyExists recovery, fail-loud-on-lookup-error). Refs #641.

…t idempotent (refs #641) (#652) * feat(purchases): make AWS RDS/ElastiCache/MemoryDB/OpenSearch/Redshift idempotent Extends the #636 IdempotencyToken mechanism to the five remaining AWS commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot double-purchase. Stacks on #638. - RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically from the token instead of time.Now(). Pre-purchase Describe-by-ID guard short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a guard miss and is recovered to the existing commitment. Double-buy is impossible by two independent mechanisms. - OpenSearch: derive ReservationName from the token; the name is unique per account+region so AWS rejects a duplicate with ResourceAlreadyExistsException (recovered by name). A pre-purchase by-name Describe guard short-circuits. - Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token tag, with the token written post-purchase via CreateTags. Residual window (tagging support uncertain) documented and backstopped by #635's safe-fail. Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy. Empty token preserves prior non-idempotent behaviour for non-execution callers. Adds common.IdempotentReservationID helper. Per-provider regression tests: same token on re-drive does not create a second commitment (guard short-circuit, AlreadyExists recovery, fail-loud-on-lookup-error). Refs #641. * fix(purchases): redact idempotency tokens in AWS re-drive skip logs CodeRabbit (PR #652) flagged that the "already exists; skipping purchase" log lines emit the full caller-supplied idempotency token, a stable per-execution identifier that should not leak verbatim into persistent logs. Add common.MaskToken (8-char prefix + ellipsis, "(none)" for empty) and use it across all five idempotent AWS executors (RDS, ElastiCache, MemoryDB, OpenSearch, Redshift). The masked prefix still correlates log lines for a single purchase. CR flagged three; OpenSearch and Redshift carried the identical pattern and are fixed here too for consistency. Log-message-only change: the idempotency guard, derived reservation IDs, and AlreadyExists recovery are untouched, so the same-token-no-double-buy invariant is preserved per service. Refs #641.

cristim added bug Something isn't working triaged Item has been triaged priority/p1 Next up; this sprint severity/high Significant harm urgency/this-sprint Within the current sprint impact/all-users Affects every user effort/m Days type/bug Defect labels May 20, 2026

cristim mentioned this pull request May 20, 2026

feat(purchases): idempotent commitment creation (ClientToken/dedupe) to enable auto-re-drive of stranded approvals #636

Closed

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread internal/purchase/manager.go Outdated

This was referenced May 21, 2026

feat(purchases): idempotent commitment creation (ClientToken/dedupe) (closes #636) #638

Merged

feat(purchases): flip RecoverStrandedApprovals to idempotent re-drive (now that #636 makes it safe) #639

Open

cristim merged commit 08747bc into feat/multicloud-web-frontend May 21, 2026
4 checks passed

cristim deleted the fix/stranded-approval-recovery branch May 21, 2026 11:58

cristim mentioned this pull request May 21, 2026

bug(test): TestManager_RecoverStrandedApprovals_LateCompletionNotClobbered fails on base feat/multicloud-web-frontend #651

Closed

cristim mentioned this pull request May 21, 2026

feat(purchases): make AWS RDS/ElastiCache/MemoryDB/OpenSearch/Redshift idempotent (refs #641) #652

Merged

2 tasks

coderabbitai Bot mentioned this pull request May 22, 2026

test(purchase): stub RecoverStrandedApprovals late-completion mock to fix panic #663

Merged

cristim mentioned this pull request May 25, 2026

bug(purchases): approve strands execution in 'approved' (purchase never runs, no error) on interrupted sync execution #632

Closed

coderabbitai Bot mentioned this pull request May 26, 2026

fix(purchases): AWS-scoped auto-re-drive in RecoverStrandedApprovals (partial #632) #728

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(purchases): recover executions stranded in 'approved' (closes #632)#635

fix(purchases): recover executions stranded in 'approved' (closes #632)#635
cristim merged 2 commits into
feat/multicloud-web-frontendfrom
fix/stranded-approval-recovery

cristim commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

cristim commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot commented May 20, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

cristim commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026

Uh oh!

cristim commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cristim commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Approach: recovery sweep + safe-fail (not auto-re-drive)

Why fail, not re-run (the key decision)

Test plan

Summary by CodeRabbit

Uh oh!

cristim commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot commented May 20, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cristim commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026

Uh oh!

cristim commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cristim commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading