feat(purchases): make AWS RDS/ElastiCache/MemoryDB/OpenSearch/Redshift idempotent (refs #641)#652
Conversation
…t idempotent Extends the #636 IdempotencyToken mechanism to the five remaining AWS commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot double-purchase. Stacks on #638. - RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically from the token instead of time.Now(). Pre-purchase Describe-by-ID guard short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a guard miss and is recovered to the existing commitment. Double-buy is impossible by two independent mechanisms. - OpenSearch: derive ReservationName from the token; the name is unique per account+region so AWS rejects a duplicate with ResourceAlreadyExistsException (recovered by name). A pre-purchase by-name Describe guard short-circuits. - Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token tag, with the token written post-purchase via CreateTags. Residual window (tagging support uncertain) documented and backstopped by #635's safe-fail. Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy. Empty token preserves prior non-idempotent behaviour for non-execution callers. Adds common.IdempotentReservationID helper. Per-provider regression tests: same token on re-drive does not create a second commitment (guard short-circuit, AlreadyExists recovery, fail-loud-on-lookup-error). Refs #641.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughThis PR implements deterministic, idempotent reservation purchasing across five AWS services (RDS, ElastiCache, MemoryDB, OpenSearch, Redshift) by introducing a common idempotency token utility and applying consistent guard-and-recovery patterns. Each service derives stable reservation identifiers from tokens, checks for duplicates before purchasing, and recovers gracefully from server-side duplication errors. ChangesIdempotent AWS commitment purchasing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@providers/aws/services/elasticache/client.go`:
- Line 225: The log currently prints the full idempotency token (variable token)
which can leak caller identifiers; update the log in the ElastiCache reservation
path so it does not emit the raw token — either redact it (e.g., replace with
fixed mask), show only a safe suffix/prefix, or log a hashed version of token,
and keep existingID visible; modify the log.Printf call that includes token and
existingID so it outputs the sanitized token representation instead of the raw
token.
In `@providers/aws/services/memorydb/client.go`:
- Line 219: The log currently prints the raw idempotency token via the
log.Printf call (variables token and existingID); change it to avoid emitting
the full token by omitting or redacting it (e.g., replace token with a fixed
placeholder like "<redacted>" or a masked version showing only the last 4 chars)
and keep existingID if needed; update the existing log.Printf invocation where
token is used so it no longer logs the raw token and ensure any helper you add
(e.g., maskToken) is used by the same call.
In `@providers/aws/services/rds/client.go`:
- Line 221: The log message prints the raw caller-supplied idempotency token
(variable token) which may leak identifiers; change the log in the RDS purchase
flow that calls log.Printf(...) to avoid logging the raw token by either
omitting it or emitting a redacted form (e.g., mask all but last N chars or log
a stable hash of token) and keep existingID as-is; update the log.Printf
invocation that currently references token and existingID so it uses the
redacted/hashed token variable instead of the raw token.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: caf805b0-dbd4-44da-9e0c-1218bdb34578
📒 Files selected for processing (12)
pkg/common/identifiers.gopkg/common/identifiers_test.goproviders/aws/services/elasticache/client.goproviders/aws/services/elasticache/client_test.goproviders/aws/services/memorydb/client.goproviders/aws/services/memorydb/client_test.goproviders/aws/services/opensearch/client.goproviders/aws/services/opensearch/client_test.goproviders/aws/services/rds/client.goproviders/aws/services/rds/client_test.goproviders/aws/services/redshift/client.goproviders/aws/services/redshift/client_test.go
CodeRabbit (PR #652) flagged that the "already exists; skipping purchase" log lines emit the full caller-supplied idempotency token, a stable per-execution identifier that should not leak verbatim into persistent logs. Add common.MaskToken (8-char prefix + ellipsis, "(none)" for empty) and use it across all five idempotent AWS executors (RDS, ElastiCache, MemoryDB, OpenSearch, Redshift). The masked prefix still correlates log lines for a single purchase. CR flagged three; OpenSearch and Redshift carried the identical pattern and are fixed here too for consistency. Log-message-only change: the idempotency guard, derived reservation IDs, and AlreadyExists recovery are untouched, so the same-token-no-double-buy invariant is preserved per service. Refs #641.
CodeRabbit pass 1 addressed (commit d2d5013)All 3 actionable findings were the same issue: the "already exists; skipping purchase" log lines emitted the raw idempotency token. Fixed (all addressed, not dismissed):
Idempotency invariant preserved: this is a log-message-only change. The idempotency guard, derived reservation IDs, and AlreadyExists recovery are untouched, so a re-drive with the same token still cannot create a second commitment, per service. Out of scope, follow-up filed: the EC2 RI executor ( @coderabbitai review |
|
Triggering a review of the new commit now. ✅ Actions performedReview triggered.
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
GCP Compute CUD creation named commitments cud-<unix-second> and ignored opts.IdempotencyToken, so a re-drive of the same execution (issue #639's recovery sweep) would create a SECOND committed-use discount: a financial double-purchase. Thread opts.IdempotencyToken into RegionCommitments.Insert via two deterministic mechanisms so the same token can never buy twice: - RequestId: GCP's native server-side idempotency key on Insert, which the API documents as preventing duplicate commitments. It must be a valid non-zero UUID, so we format the SHA-256 token into a canonical UUID with the existing common.IdempotencyGUID helper (the same derivation PR #653 used for the Azure reservationOrderID). The same token always yields the same RequestId, so the second Insert is a server-side no-op. - Name: also derived from the token (cud-<first-32-hex>, RFC1035-valid) as defense in depth. Commitment names are unique per project+region, so a re-drive that somehow reached Insert collides on the name and GCP rejects it with ALREADY_EXISTS rather than creating a duplicate. An empty token preserves the prior non-idempotent timestamp-based name (the CLI path, which has no owning execution). The token is masked in logs via common.MaskToken (never logged verbatim), matching PR #652. Adds a re-drive regression test mirroring the AWS/Azure ones: the same token on a second PurchaseCommitment yields an identical RequestId and commitment name, plus an empty-token test confirming the CLI path stays non-idempotent. Scope is computeengine only; CloudSQL/Storage/Memorystore were made advisory-only in #649 and are untouched. Closes #654 refs #641
New Manager.ReapStuckExecutions(ctx, reapAfter): one sweep that finds executions stuck in approved/running longer than reapAfter and atomically transitions them to "failed" via the existing TransitionExecutionStatus CAS. Each successful transition also writes a canonical, human-readable error string so the History UI (#621) shows operators why the row was reaped and confirms it's safe to retry. Safety properties: - Local-status-only: never touches provider commitments. If the real executor did manage to create a commitment before dying, the retry hits the idempotency path (#636/#638/#652) which surfaces a duplicate-reservation error cleanly. - CAS-protected: if the real executor wakes up and finishes between the SELECT and the transition, the CAS rejects; logged at INFO, not surfaced as an error. The real executor wins the race. - Per-row error-isolation: a failure on row N never blocks N+1..K. Adds ParseReapAfterFromEnv to read PURCHASE_APPROVED_REAP_AFTER and fall back to a 10m default on missing-or-malformed env (with a WARN log so ops can spot a typo). Never panics — a misconfigured env var must not crash the other scheduled tasks sharing the Lambda. Tests cover: - stale approved row → flipped to failed with canonical message - stale running row → same - younger-than-threshold rows are filtered out by the SELECT (no TransitionExecutionStatus / SavePurchaseExecution calls happen) - terminal-status rows are never passed to the SELECT (regression guard against accidentally widening stuckStatuses) - CAS race lost → logged + counted, not surfaced as sweep error - CAS (nil, nil) defensive path treated as race-lost - 3-stuck-row integration with per-row CAS expectations - SELECT error → propagated as sweep error - per-row save error after successful CAS → Reaped++ AND Errored++ - env-var: unset, valid, invalid, non-default unit (2h30m)
closes #678) (#681) * feat(config): add ListStuckExecutions store method Selects purchase_executions rows whose status is in a caller-supplied set AND whose updated_at is older than a caller-supplied interval. Newest-stuck-first, capped at MaxListLimit per call. Mirrors the existing GetStaleProcessingExchanges pattern for RI exchanges. Used by the reaper sweep (issue #678) to find executions stuck in approved/running because the synchronous executor failed mid-flight (Lambda timeout, OOM, network hang) without flipping the row to a terminal state. Local SELECT only — no provider-side mutation. Adds the method to StoreInterface and to every test-side mock store that satisfies it (purchase, api, analytics, scheduler, server health). Adds pgxmock coverage for the happy path, the empty-statuses short-circuit, and a query-error path. * feat(purchase): add ReapStuckExecutions sweep + per-row CAS to failed New Manager.ReapStuckExecutions(ctx, reapAfter): one sweep that finds executions stuck in approved/running longer than reapAfter and atomically transitions them to "failed" via the existing TransitionExecutionStatus CAS. Each successful transition also writes a canonical, human-readable error string so the History UI (#621) shows operators why the row was reaped and confirms it's safe to retry. Safety properties: - Local-status-only: never touches provider commitments. If the real executor did manage to create a commitment before dying, the retry hits the idempotency path (#636/#638/#652) which surfaces a duplicate-reservation error cleanly. - CAS-protected: if the real executor wakes up and finishes between the SELECT and the transition, the CAS rejects; logged at INFO, not surfaced as an error. The real executor wins the race. - Per-row error-isolation: a failure on row N never blocks N+1..K. Adds ParseReapAfterFromEnv to read PURCHASE_APPROVED_REAP_AFTER and fall back to a 10m default on missing-or-malformed env (with a WARN log so ops can spot a typo). Never panics — a misconfigured env var must not crash the other scheduled tasks sharing the Lambda. Tests cover: - stale approved row → flipped to failed with canonical message - stale running row → same - younger-than-threshold rows are filtered out by the SELECT (no TransitionExecutionStatus / SavePurchaseExecution calls happen) - terminal-status rows are never passed to the SELECT (regression guard against accidentally widening stuckStatuses) - CAS race lost → logged + counted, not surfaced as sweep error - CAS (nil, nil) defensive path treated as race-lost - 3-stuck-row integration with per-row CAS expectations - SELECT error → propagated as sweep error - per-row save error after successful CAS → Reaped++ AND Errored++ - env-var: unset, valid, invalid, non-default unit (2h30m) * feat(purchase): wire periodic reaper invocation (closes #678) Wires the ReapStuckExecutions sweep (added in the previous commit) into the existing scheduled-task pipeline: - New ScheduledTaskType "reap_stuck_purchases" registered alongside the other periodic tasks (cleanup, analytics_refresh, ri_exchange_reshape). Adds a dedicated handleReapStuckPurchases dispatcher that reads the threshold via purchase.ParseReapAfterFromEnv on every invocation so an ops-side env-var tune via PURCHASE_APPROVED_REAP_AFTER takes effect on the next sweep without a redeploy. - Adds ReapStuckExecutions to PurchaseManagerInterface so the handler contract is symmetric with the Manager + the testutil mock can satisfy it. MockPurchaseManager gains ReapStuckExecutionsFunc following the existing per-method-Func mock convention. - New EventBridge rule "${stack_name}-reap-stuck-purchases" on rate(5 minutes), targeting the main Lambda with {"action":"reap_stuck_purchases"}. Cadence is intentionally more frequent than the 10m default threshold so a stuck row is reaped within ~1 threshold-window. The reaper itself is CAS-protected (TransitionExecutionStatus from approved/running → failed) so an over-run is safe — the real executor wins the race. - New terraform vars: enable_reap_stuck_purchases_schedule (bool, default true), reap_stuck_purchases_schedule (string, default "rate(5 minutes)"), purchase_approved_reap_after (string, default "" → use in-code DefaultReapAfter). Empty-string default for the threshold avoids pinning a Lambda env value when ops hasn't taken a position. - ParseScheduledEvent learns the "reap_stuck_purchases" action so the EventBridge → Lambda payload round-trips through the dispatcher. Tests cover the dispatcher success path (asserts the default 10m threshold is passed when the env var is unset), the store-error propagation path, the lock-ID uniqueness guard, and the ParseScheduledEvent action mapping. * fix(purchase/reaper): distinguish CAS race-loss from real DB errors The reaper's per-row error handler bucketed every failure from TransitionExecutionStatus into RaceLost, including real DB outages (connection refused, query timeout, etc.). That masked persistent store failures: a downed DB would surface only as a quiet count inflation on RaceLost, with no errored signal for ops to alert on. Wrap the two legitimate race-loss outcomes from TransitionExecutionStatus in sentinel errors so callers can use errors.Is rather than brittle string matching: - ErrExecutionNotInExpectedStatus: row exists but its status moved out of the allowed set before the CAS (the real executor finished between SELECT and UPDATE — race lost). - ErrNotFound (reused for the "row vanished" case): row was deleted between SELECT and UPDATE — also a race outcome, nothing for the reaper to do. In the reaper, classify errors.Is(err, ErrExecutionNotInExpectedStatus) or errors.Is(err, ErrNotFound) as RaceLost (INFO log); everything else bumps Errored with an ERROR log so the ops signal is visible. Addresses A1 from CodeRabbit round-1 review on #681. Regression tests: - TestReapStuckExecutions_CASRaceLostNoError now wraps the sentinel (was raw errors.New); guards the "race-loss is INFO-only" path. - TestReapStuckExecutions_RowVanishedTreatedAsRaceLost covers the ErrNotFound branch (manual DELETE between SELECT and CAS). - TestReapStuckExecutions_HardDBErrorClassifiedAsErrored asserts a non-sentinel error bumps Errored, not RaceLost — the regression guard for the original A1 finding. The wrapped errors keep the original message substrings ("not found", "cannot transition from ...") so existing call sites that match on substrings (e.g. internal/purchase/manager.go's RecoverStrandedApprovals, internal/purchase/approvals_test.go) are unaffected. * fix(purchase/reaper): reject non-positive reap durations + stabilize env tests ParseReapAfterFromEnv accepted "0s" and negative durations such as "-5m" because time.ParseDuration treats them as syntactically valid. Feeding either into ListStuckExecutions would be catastrophic: - 0s: the cutoff "updated_at < NOW() - 0s" matches every approved/running row regardless of age — the reaper would flip fresh, in-flight executions to failed. - -5m: the cutoff "updated_at < NOW() - (-5m)" == "updated_at < NOW() + 5m" matches rows from 5 minutes in the future, i.e. effectively every row. Same outcome. The store-side guard added in the original PR (ListStuckExecutions rejects olderThan <= 0) prevents the broken SELECT from executing, but a misconfigured env value would still cause every sweep to fail silently with a confusing store error rather than a WARN at the config-parse boundary. Reject non-positive durations at the env-parse layer with a WARN log + fallback to DefaultReapAfter so the misconfig is visible in the Lambda's startup logs. Addresses A2 + A3 (defense-in-depth) + A4 + A5 from CodeRabbit round-1 review on #681. Regression tests (A4): - TestParseReapAfterFromEnv_ZeroFallsBackToDefault — "0s" - TestParseReapAfterFromEnv_NegativeFallsBackToDefault — "-5m" - TestParseReapAfterFromEnv_GarbageFallsBackToDefault — explicit "garbage" case complementing the existing _Invalid* test, named per the CR finding so the regression intent is searchable. handler_test.go (A5): - TestHandleScheduledTask's table-driven loop now calls t.Setenv("PURCHASE_APPROVED_REAP_AFTER", "") inside the per-case sub-Run. The two reap_stuck_purchases cases assert reapAfter == 10*time.Minute (the default); without an explicit env clear, an ambient PURCHASE_APPROVED_REAP_AFTER in CI/dev would silently make them flaky. t.Setenv auto-restores the prior value at cleanup. The store-side olderThan <= 0 guard (A3) was already folded into the earlier rebased commit "feat(config): add ListStuckExecutions store method" so no additional store-side change is needed here — A3 is covered by the existing defense-in-depth, and this commit completes the env-side validation A2 calls for.
Summary
Extends idempotent commitment creation (started in #636/#638 for EC2 RI + Savings Plans) to the remaining AWS commitment services, so a re-drive / scheduler retry cannot double-purchase. Partially addresses #641 (AWS-other slice).
Stacks on #638 (
feat/idempotent-commitment-creation) — merge #638 first.Per-service mechanism + double-buy argument
opts.IdempotencyToken; a pre-purchase Describe-by-ID short-circuits a re-drive, and AWS's native*AlreadyExistsfault backstops a guard miss (recovers to the existing commitment). Lookup errors fail loud.ResourceAlreadyExistsException.DescribeReservedNodes+ per-nodeDescribeTags, tag post-purchase); the irreducible tag-write window is backstopped by fix(purchases): recover executions stranded in 'approved' (closes #632) #635 safe-fail.Test plan
Partially addresses #641 (Azure slice + GCP Compute tracked separately).
Summary by CodeRabbit
New Features
Tests