feat(purchases): idempotent commitment creation (ClientToken/dedupe) (closes #636)#638
Conversation
Add DeriveIdempotencyToken (SHA-256 of execution_id:rec_index, 64-char hex fitting the AWS ClientToken limit) plus an IdempotencyToken field on PurchaseOptions and an IdempotencyTagKey constant. The token is stable across a strand-and-re-drive so a re-driven purchase reuses the same value: Savings Plans dedupe on it natively via ClientToken and EC2 RIs use it as a dedupe tag, making commitment creation idempotent.
) Thread a deterministic per-rec idempotency token (derived from execution_id + rec index) through the purchase fan-out so a re-driven stranded execution never double-buys. Savings Plans set the token as the native CreateSavingsPlan ClientToken, which AWS dedupes server-side. EC2 Reserved Instances have no ClientToken, so the client looks up an RI already tagged with the token before purchasing (short-circuiting on a re-drive) and tags the newly bought RI with it afterwards; a failed lookup refuses to purchase rather than risk a double-buy. The residual purchase-then-tag-fails window is irreducible given EC2's API and stays backstopped by #635 safe-fail. Does not flip RecoverStrandedApprovals to re-drive; that remains the documented follow-up now that idempotency is safe to rely on.
|
@coderabbitai review |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughThe PR implements deterministic per-recommendation idempotency tokens to enable idempotent commitment creation for both EC2 Reserved Instances and Savings Plans. Tokens are derived from execution ID and recommendation index, ensuring re-drives reuse the same token. EC2 uses a dedupe guard and tag-based lookup; Savings Plans leverages AWS-native ClientToken idempotency. ChangesIdempotent Commitment Creation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
✅ Actions performedReview triggered.
|
…t idempotent Extends the #636 IdempotencyToken mechanism to the five remaining AWS commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot double-purchase. Stacks on #638. - RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically from the token instead of time.Now(). Pre-purchase Describe-by-ID guard short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a guard miss and is recovered to the existing commitment. Double-buy is impossible by two independent mechanisms. - OpenSearch: derive ReservationName from the token; the name is unique per account+region so AWS rejects a duplicate with ResourceAlreadyExistsException (recovered by name). A pre-purchase by-name Describe guard short-circuits. - Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token tag, with the token written post-purchase via CreateTags. Residual window (tagging support uncertain) documented and backstopped by #635's safe-fail. Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy. Empty token preserves prior non-idempotent behaviour for non-execution callers. Adds common.IdempotentReservationID helper. Per-provider regression tests: same token on re-drive does not create a second commitment (guard short-circuit, AlreadyExists recovery, fail-loud-on-lookup-error). Refs #641.
Extends the #636 IdempotencyToken mechanism to all five Azure reservation executors (cache, compute, cosmosdb, database, search) so a re-drive (and #639's auto-re-drive sweep) cannot double-purchase. Stacks on #638. Each executor previously minted a fresh reservationOrderID per call (uuid.New(), or a time.Now() timestamp in search), so a retry created a second reservation. The Azure Reservations API PUTs to reservationOrders/{id} and is idempotent on a stable order ID, so the order ID is now derived deterministically from the IdempotencyToken via common.IdempotencyGUID: the same token re-PUTs the same order and returns the existing reservation rather than creating a second. An empty/short token falls back to the prior non-idempotent ID, preserving behaviour for non-execution callers. Adds common.IdempotencyGUID (canonical 8-4-4-4-12 GUID from the first 128 bits of the SHA-256 token) and common.ReservationOrderID, a single helper that returns the derived GUID or the caller's fallback so each executor stays a one-statement assignment (keeps gocyclo under threshold and avoids repeating the empty-token guard across services). Per-provider regression tests on compute and search assert that a same-token re-drive PUTs to the identical reservationOrders/{id} URL (no second reservation) while a distinct token targets a distinct order. Refs #641.
…t idempotent (refs #641) (#652) * feat(purchases): make AWS RDS/ElastiCache/MemoryDB/OpenSearch/Redshift idempotent Extends the #636 IdempotencyToken mechanism to the five remaining AWS commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot double-purchase. Stacks on #638. - RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically from the token instead of time.Now(). Pre-purchase Describe-by-ID guard short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a guard miss and is recovered to the existing commitment. Double-buy is impossible by two independent mechanisms. - OpenSearch: derive ReservationName from the token; the name is unique per account+region so AWS rejects a duplicate with ResourceAlreadyExistsException (recovered by name). A pre-purchase by-name Describe guard short-circuits. - Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token tag, with the token written post-purchase via CreateTags. Residual window (tagging support uncertain) documented and backstopped by #635's safe-fail. Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy. Empty token preserves prior non-idempotent behaviour for non-execution callers. Adds common.IdempotentReservationID helper. Per-provider regression tests: same token on re-drive does not create a second commitment (guard short-circuit, AlreadyExists recovery, fail-loud-on-lookup-error). Refs #641. * fix(purchases): redact idempotency tokens in AWS re-drive skip logs CodeRabbit (PR #652) flagged that the "already exists; skipping purchase" log lines emit the full caller-supplied idempotency token, a stable per-execution identifier that should not leak verbatim into persistent logs. Add common.MaskToken (8-char prefix + ellipsis, "(none)" for empty) and use it across all five idempotent AWS executors (RDS, ElastiCache, MemoryDB, OpenSearch, Redshift). The masked prefix still correlates log lines for a single purchase. CR flagged three; OpenSearch and Redshift carried the identical pattern and are fixed here too for consistency. Log-message-only change: the idempotency guard, derived reservation IDs, and AlreadyExists recovery are untouched, so the same-token-no-double-buy invariant is preserved per service. Refs #641.
New Manager.ReapStuckExecutions(ctx, reapAfter): one sweep that finds executions stuck in approved/running longer than reapAfter and atomically transitions them to "failed" via the existing TransitionExecutionStatus CAS. Each successful transition also writes a canonical, human-readable error string so the History UI (#621) shows operators why the row was reaped and confirms it's safe to retry. Safety properties: - Local-status-only: never touches provider commitments. If the real executor did manage to create a commitment before dying, the retry hits the idempotency path (#636/#638/#652) which surfaces a duplicate-reservation error cleanly. - CAS-protected: if the real executor wakes up and finishes between the SELECT and the transition, the CAS rejects; logged at INFO, not surfaced as an error. The real executor wins the race. - Per-row error-isolation: a failure on row N never blocks N+1..K. Adds ParseReapAfterFromEnv to read PURCHASE_APPROVED_REAP_AFTER and fall back to a 10m default on missing-or-malformed env (with a WARN log so ops can spot a typo). Never panics — a misconfigured env var must not crash the other scheduled tasks sharing the Lambda. Tests cover: - stale approved row → flipped to failed with canonical message - stale running row → same - younger-than-threshold rows are filtered out by the SELECT (no TransitionExecutionStatus / SavePurchaseExecution calls happen) - terminal-status rows are never passed to the SELECT (regression guard against accidentally widening stuckStatuses) - CAS race lost → logged + counted, not surfaced as sweep error - CAS (nil, nil) defensive path treated as race-lost - 3-stuck-row integration with per-row CAS expectations - SELECT error → propagated as sweep error - per-row save error after successful CAS → Reaped++ AND Errored++ - env-var: unset, valid, invalid, non-default unit (2h30m)
closes #678) (#681) * feat(config): add ListStuckExecutions store method Selects purchase_executions rows whose status is in a caller-supplied set AND whose updated_at is older than a caller-supplied interval. Newest-stuck-first, capped at MaxListLimit per call. Mirrors the existing GetStaleProcessingExchanges pattern for RI exchanges. Used by the reaper sweep (issue #678) to find executions stuck in approved/running because the synchronous executor failed mid-flight (Lambda timeout, OOM, network hang) without flipping the row to a terminal state. Local SELECT only — no provider-side mutation. Adds the method to StoreInterface and to every test-side mock store that satisfies it (purchase, api, analytics, scheduler, server health). Adds pgxmock coverage for the happy path, the empty-statuses short-circuit, and a query-error path. * feat(purchase): add ReapStuckExecutions sweep + per-row CAS to failed New Manager.ReapStuckExecutions(ctx, reapAfter): one sweep that finds executions stuck in approved/running longer than reapAfter and atomically transitions them to "failed" via the existing TransitionExecutionStatus CAS. Each successful transition also writes a canonical, human-readable error string so the History UI (#621) shows operators why the row was reaped and confirms it's safe to retry. Safety properties: - Local-status-only: never touches provider commitments. If the real executor did manage to create a commitment before dying, the retry hits the idempotency path (#636/#638/#652) which surfaces a duplicate-reservation error cleanly. - CAS-protected: if the real executor wakes up and finishes between the SELECT and the transition, the CAS rejects; logged at INFO, not surfaced as an error. The real executor wins the race. - Per-row error-isolation: a failure on row N never blocks N+1..K. Adds ParseReapAfterFromEnv to read PURCHASE_APPROVED_REAP_AFTER and fall back to a 10m default on missing-or-malformed env (with a WARN log so ops can spot a typo). Never panics — a misconfigured env var must not crash the other scheduled tasks sharing the Lambda. Tests cover: - stale approved row → flipped to failed with canonical message - stale running row → same - younger-than-threshold rows are filtered out by the SELECT (no TransitionExecutionStatus / SavePurchaseExecution calls happen) - terminal-status rows are never passed to the SELECT (regression guard against accidentally widening stuckStatuses) - CAS race lost → logged + counted, not surfaced as sweep error - CAS (nil, nil) defensive path treated as race-lost - 3-stuck-row integration with per-row CAS expectations - SELECT error → propagated as sweep error - per-row save error after successful CAS → Reaped++ AND Errored++ - env-var: unset, valid, invalid, non-default unit (2h30m) * feat(purchase): wire periodic reaper invocation (closes #678) Wires the ReapStuckExecutions sweep (added in the previous commit) into the existing scheduled-task pipeline: - New ScheduledTaskType "reap_stuck_purchases" registered alongside the other periodic tasks (cleanup, analytics_refresh, ri_exchange_reshape). Adds a dedicated handleReapStuckPurchases dispatcher that reads the threshold via purchase.ParseReapAfterFromEnv on every invocation so an ops-side env-var tune via PURCHASE_APPROVED_REAP_AFTER takes effect on the next sweep without a redeploy. - Adds ReapStuckExecutions to PurchaseManagerInterface so the handler contract is symmetric with the Manager + the testutil mock can satisfy it. MockPurchaseManager gains ReapStuckExecutionsFunc following the existing per-method-Func mock convention. - New EventBridge rule "${stack_name}-reap-stuck-purchases" on rate(5 minutes), targeting the main Lambda with {"action":"reap_stuck_purchases"}. Cadence is intentionally more frequent than the 10m default threshold so a stuck row is reaped within ~1 threshold-window. The reaper itself is CAS-protected (TransitionExecutionStatus from approved/running → failed) so an over-run is safe — the real executor wins the race. - New terraform vars: enable_reap_stuck_purchases_schedule (bool, default true), reap_stuck_purchases_schedule (string, default "rate(5 minutes)"), purchase_approved_reap_after (string, default "" → use in-code DefaultReapAfter). Empty-string default for the threshold avoids pinning a Lambda env value when ops hasn't taken a position. - ParseScheduledEvent learns the "reap_stuck_purchases" action so the EventBridge → Lambda payload round-trips through the dispatcher. Tests cover the dispatcher success path (asserts the default 10m threshold is passed when the env var is unset), the store-error propagation path, the lock-ID uniqueness guard, and the ParseScheduledEvent action mapping. * fix(purchase/reaper): distinguish CAS race-loss from real DB errors The reaper's per-row error handler bucketed every failure from TransitionExecutionStatus into RaceLost, including real DB outages (connection refused, query timeout, etc.). That masked persistent store failures: a downed DB would surface only as a quiet count inflation on RaceLost, with no errored signal for ops to alert on. Wrap the two legitimate race-loss outcomes from TransitionExecutionStatus in sentinel errors so callers can use errors.Is rather than brittle string matching: - ErrExecutionNotInExpectedStatus: row exists but its status moved out of the allowed set before the CAS (the real executor finished between SELECT and UPDATE — race lost). - ErrNotFound (reused for the "row vanished" case): row was deleted between SELECT and UPDATE — also a race outcome, nothing for the reaper to do. In the reaper, classify errors.Is(err, ErrExecutionNotInExpectedStatus) or errors.Is(err, ErrNotFound) as RaceLost (INFO log); everything else bumps Errored with an ERROR log so the ops signal is visible. Addresses A1 from CodeRabbit round-1 review on #681. Regression tests: - TestReapStuckExecutions_CASRaceLostNoError now wraps the sentinel (was raw errors.New); guards the "race-loss is INFO-only" path. - TestReapStuckExecutions_RowVanishedTreatedAsRaceLost covers the ErrNotFound branch (manual DELETE between SELECT and CAS). - TestReapStuckExecutions_HardDBErrorClassifiedAsErrored asserts a non-sentinel error bumps Errored, not RaceLost — the regression guard for the original A1 finding. The wrapped errors keep the original message substrings ("not found", "cannot transition from ...") so existing call sites that match on substrings (e.g. internal/purchase/manager.go's RecoverStrandedApprovals, internal/purchase/approvals_test.go) are unaffected. * fix(purchase/reaper): reject non-positive reap durations + stabilize env tests ParseReapAfterFromEnv accepted "0s" and negative durations such as "-5m" because time.ParseDuration treats them as syntactically valid. Feeding either into ListStuckExecutions would be catastrophic: - 0s: the cutoff "updated_at < NOW() - 0s" matches every approved/running row regardless of age — the reaper would flip fresh, in-flight executions to failed. - -5m: the cutoff "updated_at < NOW() - (-5m)" == "updated_at < NOW() + 5m" matches rows from 5 minutes in the future, i.e. effectively every row. Same outcome. The store-side guard added in the original PR (ListStuckExecutions rejects olderThan <= 0) prevents the broken SELECT from executing, but a misconfigured env value would still cause every sweep to fail silently with a confusing store error rather than a WARN at the config-parse boundary. Reject non-positive durations at the env-parse layer with a WARN log + fallback to DefaultReapAfter so the misconfig is visible in the Lambda's startup logs. Addresses A2 + A3 (defense-in-depth) + A4 + A5 from CodeRabbit round-1 review on #681. Regression tests (A4): - TestParseReapAfterFromEnv_ZeroFallsBackToDefault — "0s" - TestParseReapAfterFromEnv_NegativeFallsBackToDefault — "-5m" - TestParseReapAfterFromEnv_GarbageFallsBackToDefault — explicit "garbage" case complementing the existing _Invalid* test, named per the CR finding so the regression intent is searchable. handler_test.go (A5): - TestHandleScheduledTask's table-driven loop now calls t.Setenv("PURCHASE_APPROVED_REAP_AFTER", "") inside the per-case sub-Run. The two reap_stuck_purchases cases assert reapAfter == 10*time.Minute (the default); without an explicit env clear, an ambient PURCHASE_APPROVED_REAP_AFTER in CI/dev would silently make them flaky. t.Setenv auto-restores the prior value at cleanup. The store-side olderThan <= 0 guard (A3) was already folded into the earlier rebased commit "feat(config): add ListStuckExecutions store method" so no additional store-side change is needed here — A3 is covered by the existing defense-in-depth, and this commit completes the env-side validation A2 calls for.
Summary
Adds idempotent commitment creation so a re-driven purchase can never double-buy (#636) — the prerequisite for eventually auto-re-driving stranded approvals (#635's recovery sweep currently safe-fails).
A deterministic per-rec token
DeriveIdempotencyToken(executionID, recIndex)is threaded throughPurchaseOptions.IdempotencyTokenand the execution fan-out, so a re-drive of the same execution/rec produces the identical token.Per-provider double-buy guarantee
CreateSavingsPlan.ClientTokenis AWS's documented native idempotency identifier (the SDK auto-injects a UUID when nil; confirmed in savingsplans v1.31.0). Passing our deterministic token means a re-drive sends the identical token and AWS dedupes server-side atomically. (Note: this corrects the premise in feat(purchases): idempotent commitment creation (ClientToken/dedupe) to enable auto-re-drive of stranded approvals #636's body / bug(purchases): approve strands execution in 'approved' (purchase never runs, no error) on interrupted sync execution #632 that SP has no native idempotency — it does, and we use it rather than a weaker lookup guard.)ClientTokenexists forPurchaseReservedInstancesOffering. The guard queriesDescribeReservedInstancesfiltered bytag:cudly-idempotency-token+ stateactive|payment-pending(matchingGetExistingCommitments) before buying, short-circuits if found, fails-loud (refuses to buy) on lookup error, and tags after purchase. The only residual double-buy window is purchase-succeeds-then-tag-write-fails, which is irreducible given EC2's API and stays backstopped by fix(purchases): recover executions stranded in 'approved' (closes #632) #635's safe-fail.Scope
Closes #636 (both EC2 + SP idempotency primitives). The
RecoverStrandedApprovalsre-drive flip is not done here — it remains the explicitly-deferred follow-up now that idempotency is safe.Test plan
Closes #636.
Summary by CodeRabbit