Skip to content

feat(purchases): idempotent commitment creation (ClientToken/dedupe) (closes #636)#638

Merged
cristim merged 2 commits into
feat/multicloud-web-frontendfrom
feat/idempotent-commitment-creation
May 21, 2026
Merged

feat(purchases): idempotent commitment creation (ClientToken/dedupe) (closes #636)#638
cristim merged 2 commits into
feat/multicloud-web-frontendfrom
feat/idempotent-commitment-creation

Conversation

@cristim
Copy link
Copy Markdown
Member

@cristim cristim commented May 21, 2026

Summary

Adds idempotent commitment creation so a re-driven purchase can never double-buy (#636) — the prerequisite for eventually auto-re-driving stranded approvals (#635's recovery sweep currently safe-fails).

A deterministic per-rec token DeriveIdempotencyToken(executionID, recIndex) is threaded through PurchaseOptions.IdempotencyToken and the execution fan-out, so a re-drive of the same execution/rec produces the identical token.

Per-provider double-buy guarantee

Scope

Closes #636 (both EC2 + SP idempotency primitives). The RecoverStrandedApprovals re-drive flip is not done here — it remains the explicitly-deferred follow-up now that idempotency is safe.

Test plan

  • 371 tests pass (pkg/common 130, aws ec2+sp 100, internal/purchase 141)
  • Asserts the same token on re-drive; SP native-ClientToken passthrough; EC2 tag-guard short-circuit + fail-loud-on-lookup-error
  • go vet + gofmt + gosec (new token const) clean

Closes #636.

Summary by CodeRabbit

  • New Features
    • Implemented purchase deduplication for cloud commitments. Each recommendation now receives a unique idempotency token to prevent duplicate purchases when operations are retried for EC2 Reserved Instances and Savings Plans.

Review Change Stack

cristim added 2 commits May 21, 2026 07:58
Add DeriveIdempotencyToken (SHA-256 of execution_id:rec_index, 64-char
hex fitting the AWS ClientToken limit) plus an IdempotencyToken field on
PurchaseOptions and an IdempotencyTagKey constant. The token is stable
across a strand-and-re-drive so a re-driven purchase reuses the same
value: Savings Plans dedupe on it natively via ClientToken and EC2 RIs
use it as a dedupe tag, making commitment creation idempotent.
)

Thread a deterministic per-rec idempotency token (derived from
execution_id + rec index) through the purchase fan-out so a re-driven
stranded execution never double-buys.

Savings Plans set the token as the native CreateSavingsPlan ClientToken,
which AWS dedupes server-side. EC2 Reserved Instances have no
ClientToken, so the client looks up an RI already tagged with the token
before purchasing (short-circuiting on a re-drive) and tags the newly
bought RI with it afterwards; a failed lookup refuses to purchase rather
than risk a double-buy. The residual purchase-then-tag-fails window is
irreducible given EC2's API and stays backstopped by #635 safe-fail.

Does not flip RecoverStrandedApprovals to re-drive; that remains the
documented follow-up now that idempotency is safe to rely on.
@cristim cristim added enhancement New feature or request triaged Item has been triaged priority/p2 Backlog-worthy severity/medium Moderate harm urgency/this-quarter Within the quarter impact/all-users Affects every user effort/l Weeks type/feat New capability labels May 21, 2026
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 21, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3ae99ce9-3d8f-44ef-b845-8b15b9db3568

📥 Commits

Reviewing files that changed from the base of the PR and between 5baf1f6 and 3eb85c5.

📒 Files selected for processing (8)
  • internal/purchase/execution.go
  • internal/purchase/execution_test.go
  • pkg/common/tokens.go
  • pkg/common/tokens_test.go
  • pkg/common/types.go
  • providers/aws/services/ec2/client.go
  • providers/aws/services/savingsplans/client.go
  • providers/aws/services/savingsplans/client_test.go

📝 Walkthrough

Walkthrough

The PR implements deterministic per-recommendation idempotency tokens to enable idempotent commitment creation for both EC2 Reserved Instances and Savings Plans. Tokens are derived from execution ID and recommendation index, ensuring re-drives reuse the same token. EC2 uses a dedupe guard and tag-based lookup; Savings Plans leverages AWS-native ClientToken idempotency.

Changes

Idempotent Commitment Creation

Layer / File(s) Summary
Idempotency Token Contract and Derivation
pkg/common/tokens.go, pkg/common/types.go, pkg/common/tokens_test.go
Introduces DeriveIdempotencyToken(executionID, recIndex) function that produces a deterministic 64-character hex SHA-256 token; adds IdempotencyTagKey constant and IdempotencyToken field to PurchaseOptions; validates token stability, collision resistance, and format in unit tests.
Per-Recommendation Token Threading
internal/purchase/execution.go, internal/purchase/execution_test.go
Per-recommendation fan-out now derives a unique idempotency token for each recommendation using execution ID and index, copies it into per-rec recOpts, and passes it to executeSinglePurchase; test mocks updated to validate token derivation across web and fallback source paths.
EC2 Reserved Instance Idempotency Implementation
providers/aws/services/ec2/client.go
Adds dedupe guard: before purchase, checks for existing active/pending RI tagged with idempotency token and short-circuits or fails early to prevent double-buy; introduces findRIByIdempotencyToken helper to query by tag; post-purchase tagging includes idempotency token alongside CUDly tags.
Savings Plans ClientToken Integration
providers/aws/services/savingsplans/client.go, providers/aws/services/savingsplans/client_test.go
CreateSavingsPlanInput.ClientToken is populated from opts.IdempotencyToken when provided, enabling AWS-native idempotency for repeated calls; token remains unset for CLI path to preserve prior non-idempotent behavior; tests validate ClientToken is passed through consistently and remains nil when unset.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • LeanerCloud/CUDly#373: Both PRs modify the per-recommendation purchase fan-out/aggregation flow in internal/purchase/execution.go (same processPurchaseRecommendations/executeSinglePurchase call path), with the main PR adding deterministic per-rec idempotency tokens on top of the retrieved PR's synchronous/parallel execution changes.

Suggested labels

urgency/this-sprint, impact/many, effort/m

Poem

🐰 A token born from execution's seed,
Derived with index, all that we need.
EC2 tags it, Savings Plans hold fast,
No double-purchase on re-drive blast!
Idempotence hops through the cloud so bright. 🌩️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main feature: deterministic per-record idempotency tokens for commitment creation to enable safe re-drives, directly aligned with the changeset.
Linked Issues check ✅ Passed The PR fully implements objectives from #636: (1) derives deterministic tokens via DeriveIdempotencyToken, (2) threads through EC2 RI and Savings Plans with appropriate dedupe guards (ClientToken for SP, tag-based lookup for EC2), and (3) supports idempotent re-drives.
Out of Scope Changes check ✅ Passed All changes are in scope: token generation utilities, PurchaseOptions extensions, EC2/Savings Plans idempotency logic, and comprehensive tests. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/idempotent-commitment-creation

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim cristim merged commit edbb913 into feat/multicloud-web-frontend May 21, 2026
5 checks passed
@cristim cristim deleted the feat/idempotent-commitment-creation branch May 21, 2026 11:54
cristim added a commit that referenced this pull request May 21, 2026
…t idempotent

Extends the #636 IdempotencyToken mechanism to the five remaining AWS
commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot
double-purchase. Stacks on #638.

- RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID
  (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically
  from the token instead of time.Now(). Pre-purchase Describe-by-ID guard
  short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a
  guard miss and is recovered to the existing commitment. Double-buy is
  impossible by two independent mechanisms.
- OpenSearch: derive ReservationName from the token; the name is unique per
  account+region so AWS rejects a duplicate with ResourceAlreadyExistsException
  (recovered by name). A pre-purchase by-name Describe guard short-circuits.
- Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style
  tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token
  tag, with the token written post-purchase via CreateTags. Residual window
  (tagging support uncertain) documented and backstopped by #635's safe-fail.

Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy.
Empty token preserves prior non-idempotent behaviour for non-execution callers.

Adds common.IdempotentReservationID helper. Per-provider regression tests:
same token on re-drive does not create a second commitment (guard short-circuit,
AlreadyExists recovery, fail-loud-on-lookup-error).

Refs #641.
cristim added a commit that referenced this pull request May 21, 2026
Extends the #636 IdempotencyToken mechanism to all five Azure reservation
executors (cache, compute, cosmosdb, database, search) so a re-drive (and
#639's auto-re-drive sweep) cannot double-purchase. Stacks on #638.

Each executor previously minted a fresh reservationOrderID per call
(uuid.New(), or a time.Now() timestamp in search), so a retry created a
second reservation. The Azure Reservations API PUTs to
reservationOrders/{id} and is idempotent on a stable order ID, so the
order ID is now derived deterministically from the IdempotencyToken via
common.IdempotencyGUID: the same token re-PUTs the same order and returns
the existing reservation rather than creating a second. An empty/short
token falls back to the prior non-idempotent ID, preserving behaviour for
non-execution callers.

Adds common.IdempotencyGUID (canonical 8-4-4-4-12 GUID from the first 128
bits of the SHA-256 token) and common.ReservationOrderID, a single helper
that returns the derived GUID or the caller's fallback so each executor
stays a one-statement assignment (keeps gocyclo under threshold and avoids
repeating the empty-token guard across services). Per-provider regression
tests on compute and search assert that a same-token re-drive PUTs to the
identical reservationOrders/{id} URL (no second reservation) while a
distinct token targets a distinct order.

Refs #641.
cristim added a commit that referenced this pull request May 22, 2026
…t idempotent (refs #641) (#652)

* feat(purchases): make AWS RDS/ElastiCache/MemoryDB/OpenSearch/Redshift idempotent

Extends the #636 IdempotencyToken mechanism to the five remaining AWS
commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot
double-purchase. Stacks on #638.

- RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID
  (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically
  from the token instead of time.Now(). Pre-purchase Describe-by-ID guard
  short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a
  guard miss and is recovered to the existing commitment. Double-buy is
  impossible by two independent mechanisms.
- OpenSearch: derive ReservationName from the token; the name is unique per
  account+region so AWS rejects a duplicate with ResourceAlreadyExistsException
  (recovered by name). A pre-purchase by-name Describe guard short-circuits.
- Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style
  tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token
  tag, with the token written post-purchase via CreateTags. Residual window
  (tagging support uncertain) documented and backstopped by #635's safe-fail.

Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy.
Empty token preserves prior non-idempotent behaviour for non-execution callers.

Adds common.IdempotentReservationID helper. Per-provider regression tests:
same token on re-drive does not create a second commitment (guard short-circuit,
AlreadyExists recovery, fail-loud-on-lookup-error).

Refs #641.

* fix(purchases): redact idempotency tokens in AWS re-drive skip logs

CodeRabbit (PR #652) flagged that the "already exists; skipping purchase"
log lines emit the full caller-supplied idempotency token, a stable
per-execution identifier that should not leak verbatim into persistent logs.

Add common.MaskToken (8-char prefix + ellipsis, "(none)" for empty) and use
it across all five idempotent AWS executors (RDS, ElastiCache, MemoryDB,
OpenSearch, Redshift). The masked prefix still correlates log lines for a
single purchase. CR flagged three; OpenSearch and Redshift carried the
identical pattern and are fixed here too for consistency.

Log-message-only change: the idempotency guard, derived reservation IDs, and
AlreadyExists recovery are untouched, so the same-token-no-double-buy
invariant is preserved per service.

Refs #641.
cristim added a commit that referenced this pull request May 22, 2026
New Manager.ReapStuckExecutions(ctx, reapAfter): one sweep that finds
executions stuck in approved/running longer than reapAfter and
atomically transitions them to "failed" via the existing
TransitionExecutionStatus CAS. Each successful transition also writes
a canonical, human-readable error string so the History UI (#621)
shows operators why the row was reaped and confirms it's safe to retry.

Safety properties:
- Local-status-only: never touches provider commitments. If the real
  executor did manage to create a commitment before dying, the
  retry hits the idempotency path (#636/#638/#652) which surfaces a
  duplicate-reservation error cleanly.
- CAS-protected: if the real executor wakes up and finishes between
  the SELECT and the transition, the CAS rejects; logged at INFO,
  not surfaced as an error. The real executor wins the race.
- Per-row error-isolation: a failure on row N never blocks N+1..K.

Adds ParseReapAfterFromEnv to read PURCHASE_APPROVED_REAP_AFTER and
fall back to a 10m default on missing-or-malformed env (with a WARN
log so ops can spot a typo). Never panics — a misconfigured env var
must not crash the other scheduled tasks sharing the Lambda.

Tests cover:
- stale approved row → flipped to failed with canonical message
- stale running row → same
- younger-than-threshold rows are filtered out by the SELECT (no
  TransitionExecutionStatus / SavePurchaseExecution calls happen)
- terminal-status rows are never passed to the SELECT (regression
  guard against accidentally widening stuckStatuses)
- CAS race lost → logged + counted, not surfaced as sweep error
- CAS (nil, nil) defensive path treated as race-lost
- 3-stuck-row integration with per-row CAS expectations
- SELECT error → propagated as sweep error
- per-row save error after successful CAS → Reaped++ AND Errored++
- env-var: unset, valid, invalid, non-default unit (2h30m)
cristim added a commit that referenced this pull request May 22, 2026
closes #678) (#681)

* feat(config): add ListStuckExecutions store method

Selects purchase_executions rows whose status is in a caller-supplied
set AND whose updated_at is older than a caller-supplied interval.
Newest-stuck-first, capped at MaxListLimit per call. Mirrors the
existing GetStaleProcessingExchanges pattern for RI exchanges.

Used by the reaper sweep (issue #678) to find executions stuck in
approved/running because the synchronous executor failed mid-flight
(Lambda timeout, OOM, network hang) without flipping the row to a
terminal state. Local SELECT only — no provider-side mutation.

Adds the method to StoreInterface and to every test-side mock store
that satisfies it (purchase, api, analytics, scheduler, server
health). Adds pgxmock coverage for the happy path, the
empty-statuses short-circuit, and a query-error path.

* feat(purchase): add ReapStuckExecutions sweep + per-row CAS to failed

New Manager.ReapStuckExecutions(ctx, reapAfter): one sweep that finds
executions stuck in approved/running longer than reapAfter and
atomically transitions them to "failed" via the existing
TransitionExecutionStatus CAS. Each successful transition also writes
a canonical, human-readable error string so the History UI (#621)
shows operators why the row was reaped and confirms it's safe to retry.

Safety properties:
- Local-status-only: never touches provider commitments. If the real
  executor did manage to create a commitment before dying, the
  retry hits the idempotency path (#636/#638/#652) which surfaces a
  duplicate-reservation error cleanly.
- CAS-protected: if the real executor wakes up and finishes between
  the SELECT and the transition, the CAS rejects; logged at INFO,
  not surfaced as an error. The real executor wins the race.
- Per-row error-isolation: a failure on row N never blocks N+1..K.

Adds ParseReapAfterFromEnv to read PURCHASE_APPROVED_REAP_AFTER and
fall back to a 10m default on missing-or-malformed env (with a WARN
log so ops can spot a typo). Never panics — a misconfigured env var
must not crash the other scheduled tasks sharing the Lambda.

Tests cover:
- stale approved row → flipped to failed with canonical message
- stale running row → same
- younger-than-threshold rows are filtered out by the SELECT (no
  TransitionExecutionStatus / SavePurchaseExecution calls happen)
- terminal-status rows are never passed to the SELECT (regression
  guard against accidentally widening stuckStatuses)
- CAS race lost → logged + counted, not surfaced as sweep error
- CAS (nil, nil) defensive path treated as race-lost
- 3-stuck-row integration with per-row CAS expectations
- SELECT error → propagated as sweep error
- per-row save error after successful CAS → Reaped++ AND Errored++
- env-var: unset, valid, invalid, non-default unit (2h30m)

* feat(purchase): wire periodic reaper invocation (closes #678)

Wires the ReapStuckExecutions sweep (added in the previous commit) into
the existing scheduled-task pipeline:

- New ScheduledTaskType "reap_stuck_purchases" registered alongside the
  other periodic tasks (cleanup, analytics_refresh, ri_exchange_reshape).
  Adds a dedicated handleReapStuckPurchases dispatcher that reads the
  threshold via purchase.ParseReapAfterFromEnv on every invocation so an
  ops-side env-var tune via PURCHASE_APPROVED_REAP_AFTER takes effect on
  the next sweep without a redeploy.
- Adds ReapStuckExecutions to PurchaseManagerInterface so the handler
  contract is symmetric with the Manager + the testutil mock can satisfy
  it. MockPurchaseManager gains ReapStuckExecutionsFunc following the
  existing per-method-Func mock convention.
- New EventBridge rule "${stack_name}-reap-stuck-purchases" on
  rate(5 minutes), targeting the main Lambda with
  {"action":"reap_stuck_purchases"}. Cadence is intentionally more
  frequent than the 10m default threshold so a stuck row is reaped
  within ~1 threshold-window. The reaper itself is CAS-protected
  (TransitionExecutionStatus from approved/running → failed) so an
  over-run is safe — the real executor wins the race.
- New terraform vars: enable_reap_stuck_purchases_schedule (bool,
  default true), reap_stuck_purchases_schedule (string,
  default "rate(5 minutes)"), purchase_approved_reap_after (string,
  default "" → use in-code DefaultReapAfter). Empty-string default for
  the threshold avoids pinning a Lambda env value when ops hasn't taken
  a position.
- ParseScheduledEvent learns the "reap_stuck_purchases" action so the
  EventBridge → Lambda payload round-trips through the dispatcher.

Tests cover the dispatcher success path (asserts the default 10m
threshold is passed when the env var is unset), the store-error
propagation path, the lock-ID uniqueness guard, and the
ParseScheduledEvent action mapping.

* fix(purchase/reaper): distinguish CAS race-loss from real DB errors

The reaper's per-row error handler bucketed every failure from
TransitionExecutionStatus into RaceLost, including real DB outages
(connection refused, query timeout, etc.). That masked persistent
store failures: a downed DB would surface only as a quiet count
inflation on RaceLost, with no errored signal for ops to alert on.

Wrap the two legitimate race-loss outcomes from
TransitionExecutionStatus in sentinel errors so callers can use
errors.Is rather than brittle string matching:

  - ErrExecutionNotInExpectedStatus: row exists but its status moved
    out of the allowed set before the CAS (the real executor finished
    between SELECT and UPDATE — race lost).
  - ErrNotFound (reused for the "row vanished" case): row was deleted
    between SELECT and UPDATE — also a race outcome, nothing for the
    reaper to do.

In the reaper, classify errors.Is(err, ErrExecutionNotInExpectedStatus)
or errors.Is(err, ErrNotFound) as RaceLost (INFO log); everything else
bumps Errored with an ERROR log so the ops signal is visible.

Addresses A1 from CodeRabbit round-1 review on #681.

Regression tests:
  - TestReapStuckExecutions_CASRaceLostNoError now wraps the sentinel
    (was raw errors.New); guards the "race-loss is INFO-only" path.
  - TestReapStuckExecutions_RowVanishedTreatedAsRaceLost covers the
    ErrNotFound branch (manual DELETE between SELECT and CAS).
  - TestReapStuckExecutions_HardDBErrorClassifiedAsErrored asserts a
    non-sentinel error bumps Errored, not RaceLost — the regression
    guard for the original A1 finding.

The wrapped errors keep the original message substrings ("not found",
"cannot transition from ...") so existing call sites that match on
substrings (e.g. internal/purchase/manager.go's
RecoverStrandedApprovals, internal/purchase/approvals_test.go) are
unaffected.

* fix(purchase/reaper): reject non-positive reap durations + stabilize env tests

ParseReapAfterFromEnv accepted "0s" and negative durations such as
"-5m" because time.ParseDuration treats them as syntactically valid.
Feeding either into ListStuckExecutions would be catastrophic:

  - 0s: the cutoff "updated_at < NOW() - 0s" matches every
    approved/running row regardless of age — the reaper would flip
    fresh, in-flight executions to failed.
  - -5m: the cutoff "updated_at < NOW() - (-5m)" == "updated_at <
    NOW() + 5m" matches rows from 5 minutes in the future, i.e.
    effectively every row. Same outcome.

The store-side guard added in the original PR (ListStuckExecutions
rejects olderThan <= 0) prevents the broken SELECT from executing,
but a misconfigured env value would still cause every sweep to fail
silently with a confusing store error rather than a WARN at the
config-parse boundary. Reject non-positive durations at the env-parse
layer with a WARN log + fallback to DefaultReapAfter so the misconfig
is visible in the Lambda's startup logs.

Addresses A2 + A3 (defense-in-depth) + A4 + A5 from CodeRabbit
round-1 review on #681.

Regression tests (A4):
  - TestParseReapAfterFromEnv_ZeroFallsBackToDefault — "0s"
  - TestParseReapAfterFromEnv_NegativeFallsBackToDefault — "-5m"
  - TestParseReapAfterFromEnv_GarbageFallsBackToDefault — explicit
    "garbage" case complementing the existing _Invalid* test, named
    per the CR finding so the regression intent is searchable.

handler_test.go (A5):
  - TestHandleScheduledTask's table-driven loop now calls
    t.Setenv("PURCHASE_APPROVED_REAP_AFTER", "") inside the per-case
    sub-Run. The two reap_stuck_purchases cases assert reapAfter ==
    10*time.Minute (the default); without an explicit env clear, an
    ambient PURCHASE_APPROVED_REAP_AFTER in CI/dev would silently
    make them flaky. t.Setenv auto-restores the prior value at
    cleanup.

The store-side olderThan <= 0 guard (A3) was already folded into the
earlier rebased commit "feat(config): add ListStuckExecutions store
method" so no additional store-side change is needed here — A3 is
covered by the existing defense-in-depth, and this commit completes
the env-side validation A2 calls for.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

effort/l Weeks enhancement New feature or request impact/all-users Affects every user priority/p2 Backlog-worthy severity/medium Moderate harm triaged Item has been triaged type/feat New capability urgency/this-quarter Within the quarter

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant