feat(purchases): make AWS RDS/ElastiCache/MemoryDB/OpenSearch/Redshift idempotent (refs #641) #652

Merged
cristim merged 2 commits into feat/multicloud-web-frontend from fix/641-aws-other-idem
May 22, 2026

Conversation

@cristim
Member

@cristim cristim commented May 21, 2026

Summary

Extends idempotent commitment creation (started in #636/#638 for EC2 RI + Savings Plans) to the remaining AWS commitment services, so a re-drive / scheduler retry cannot double-purchase. Partially addresses #641 (AWS-other slice).

Stacks on #638 (feat/idempotent-commitment-creation) — merge #638 first.

Per-service mechanism + double-buy argument

  • RDS / ElastiCache / MemoryDB: customer reservation ID derived deterministically from opts.IdempotencyToken; a pre-purchase Describe-by-ID short-circuits a re-drive, and AWS's native *AlreadyExists fault backstops a guard miss (recovers to the existing commitment). Lookup errors fail loud.
  • OpenSearch: ReservationName derived from the token (unique per account+region); a duplicate is rejected with ResourceAlreadyExistsException.
  • Redshift: EC2-style tag-guard (DescribeReservedNodes + per-node DescribeTags, with the token tag written post-purchase); the irreducible tag-write window is backstopped by the safe-fail from #635 (fix(purchases): recover executions stranded in 'approved', closes #632).
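
The guard-then-purchase-then-recover flow described for RDS, ElastiCache, and MemoryDB can be sketched roughly as follows. This is a minimal illustration and not the PR's code: the `reservationAPI` interface, the `errNotFound` and `errAlreadyExists` sentinels, `purchaseIdempotent`, and the in-memory `fakeAPI` are all hypothetical stand-ins for the per-service SDK calls and faults.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the service's not-found result
// and its native *AlreadyExists fault.
var (
	errNotFound      = errors.New("reservation not found")
	errAlreadyExists = errors.New("reservation already exists")
)

// reservationAPI is a hypothetical abstraction over Describe-by-ID and Purchase.
type reservationAPI interface {
	Describe(id string) error // nil if the reservation exists, errNotFound if not
	Purchase(id string) error
}

// purchaseIdempotent returns the reservation ID and whether a new purchase was made.
func purchaseIdempotent(api reservationAPI, id string) (string, bool, error) {
	// 1. Pre-purchase guard: an existing reservation short-circuits the re-drive.
	switch err := api.Describe(id); {
	case err == nil:
		return id, false, nil
	case !errors.Is(err, errNotFound):
		// Fail loud: an unknown lookup result must never lead to a purchase.
		return "", false, fmt.Errorf("idempotency lookup failed, refusing to purchase: %w", err)
	}

	// 2. Purchase with the deterministic ID.
	err := api.Purchase(id)
	if errors.Is(err, errAlreadyExists) {
		// 3. Backstop: the guard missed, so recover to the existing commitment.
		return id, false, nil
	}
	if err != nil {
		return "", false, err
	}
	return id, true, nil
}

// fakeAPI is an in-memory stand-in used only to exercise the sketch.
type fakeAPI struct {
	existing  map[string]bool
	purchases int
}

func (f *fakeAPI) Describe(id string) error {
	if f.existing[id] {
		return nil
	}
	return errNotFound
}

func (f *fakeAPI) Purchase(id string) error {
	if f.existing[id] {
		return errAlreadyExists
	}
	f.existing[id] = true
	f.purchases++
	return nil
}

func main() {
	api := &fakeAPI{existing: map[string]bool{}}
	for i := 0; i < 2; i++ {
		id, created, err := purchaseIdempotent(api, "cudly-abc123")
		fmt.Println(id, created, err)
	}
	fmt.Println("purchases:", api.purchases)
}
```

Calling the function twice with the same ID results in a single purchase; the second call returns the existing ID with `created == false`.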

Test plan

  • 133 pkg/common + 208 AWS-service tests pass; build + go vet clean
  • Per-service re-drive regression: same token does not create a second commitment

Partially addresses #641 (Azure slice + GCP Compute tracked separately).

Summary by CodeRabbit

  • New Features

    • Added idempotent reservation purchase support across AWS ElastiCache, MemoryDB, OpenSearch, RDS, and Redshift services, enabling safe retries without risk of duplicate purchases.
    • Implemented automatic recovery from server-side duplication errors during reservation purchases.
  • Tests

    • Added comprehensive test coverage for idempotent purchase behavior, including guard logic, error recovery, and lookup validation.


…t idempotent

Extends the #636 IdempotencyToken mechanism to the five remaining AWS
commitment executors so a re-drive (and #639's auto-re-drive sweep) cannot
double-purchase. Stacks on #638.

- RDS/ElastiCache/MemoryDB: derive the customer-supplied reservation ID
  (ReservedDBInstanceId/ReservedCacheNodeId/ReservationId) deterministically
  from the token instead of time.Now(). Pre-purchase Describe-by-ID guard
  short-circuits a re-drive; AWS's native *AlreadyExists* fault backstops a
  guard miss and is recovered to the existing commitment. Double-buy is
  impossible by two independent mechanisms.
- OpenSearch: derive ReservationName from the token; the name is unique per
  account+region so AWS rejects a duplicate with ResourceAlreadyExistsException
  (recovered by name). A pre-purchase by-name Describe guard short-circuits.
- Redshift: no customer ID, no native dedupe, no tag filter, so an EC2-style
  tag-guard: DescribeReservedNodes + per-node DescribeTags matches the token
  tag, with the token written post-purchase via CreateTags. Residual window
  (tagging support uncertain) documented and backstopped by #635's safe-fail.

Lookup errors fail loud (refuse to purchase) to avoid a guard-miss double-buy.
Empty token preserves prior non-idempotent behaviour for non-execution callers.

Adds common.IdempotentReservationID helper. Per-provider regression tests:
same token on re-drive does not create a second commitment (guard short-circuit,
AlreadyExists recovery, fail-loud-on-lookup-error).

Refs #641.
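
A helper along the lines of `common.IdempotentReservationID` might look like the sketch below. It only illustrates the idea of a deterministic, token-derived ID: the lower-case name `idempotentReservationID`, the example prefix, the choice to hash the token with SHA-256 before truncating to 40 hex characters, and returning an empty string for an empty token are assumptions of this sketch, not details confirmed by the PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// idempotentReservationID is a hypothetical re-creation of the helper's idea:
// the same (prefix, token) pair always yields the same reservation ID.
func idempotentReservationID(prefix, token string) string {
	if token == "" {
		// Assumption: an empty result tells the caller to keep its legacy,
		// non-idempotent naming.
		return ""
	}
	// Assumption: hash first so that arbitrary token input becomes hex,
	// then keep 40 hex characters.
	sum := sha256.Sum256([]byte(token))
	return prefix + hex.EncodeToString(sum[:])[:40]
}

func main() {
	a := idempotentReservationID("cudly-", "exec-42")
	b := idempotentReservationID("cudly-", "exec-42")
	c := idempotentReservationID("cudly-", "exec-43")
	fmt.Println(a == b, a != c, len(a))
}
```

Because the ID no longer depends on `time.Now()`, a re-drive presents the same customer-supplied ID to the service, which is what lets the Describe-by-ID guard and the AlreadyExists fault detect the duplicate.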
@cristim cristim added labels on May 21, 2026: bug (Something isn't working), triaged (Item has been triaged), priority/p1 (Next up; this sprint), severity/high (Significant harm), urgency/this-sprint (Within the current sprint), impact/all-users (Affects every user), effort/l (Weeks), type/bug (Defect)
@coderabbitai

coderabbitai Bot commented May 21, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3c2e46cb-2fca-4d33-aa51-d1f069435692

📥 Commits

Reviewing files that changed from the base of the PR and between 5995f20 and d2d5013.

📒 Files selected for processing (7)
  • pkg/common/tokens.go
  • pkg/common/tokens_test.go
  • providers/aws/services/elasticache/client.go
  • providers/aws/services/memorydb/client.go
  • providers/aws/services/opensearch/client.go
  • providers/aws/services/rds/client.go
  • providers/aws/services/redshift/client.go
📝 Walkthrough

Walkthrough

This PR implements deterministic, idempotent reservation purchasing across five AWS services (RDS, ElastiCache, MemoryDB, OpenSearch, Redshift) by introducing a common idempotency token utility and applying consistent guard-and-recovery patterns. Each service derives stable reservation identifiers from tokens, checks for duplicates before purchasing, and recovers gracefully from server-side duplication errors.

Changes

Idempotent AWS commitment purchasing

| Layer / File(s) | Summary |
|---|---|
| Idempotent reservation ID foundation: pkg/common/identifiers.go, pkg/common/identifiers_test.go | New exported IdempotentReservationID(prefix, token) deterministically derives AWS-safe reservation IDs by truncating tokens to 40 hex characters and concatenating with prefix. Tests validate determinism, distinct tokens, empty-token behavior, and formatting constraints. |
| RDS idempotent reservation purchasing: providers/aws/services/rds/client.go, providers/aws/services/rds/client_test.go | RDS PurchaseCommitment derives ReservedDBInstanceId from token, guards against existing active/payment-pending reservations, and recovers from ReservedDBInstanceAlreadyExistsFault. Added deriveReservationID, idempotencyGuard, recoverAlreadyExists, and findReservationByID helpers with four test scenarios. |
| ElastiCache idempotent reservation purchasing: providers/aws/services/elasticache/client.go, providers/aws/services/elasticache/client_test.go | ElastiCache PurchaseCommitment derives token-based ReservedCacheNodeId, performs pre-purchase guard lookup, and recovers from ReservedCacheNodeAlreadyExistsFault. Added findReservationByID, idempotencyGuard, and recoverAlreadyExists helpers with test cases covering guard short-circuit, not-found proceed, already-exists recovery, and lookup error paths. |
| MemoryDB idempotent reservation purchasing: providers/aws/services/memorydb/client.go, providers/aws/services/memorydb/client_test.go | MemoryDB PurchaseCommitment derives deterministic ReservationId from token, guards with pre-purchase active/payment-pending lookup, and recovers from ReservedNodeAlreadyExistsFault. Added helper methods for by-ID lookup, guard short-circuiting, and already-exists recovery with test coverage. |
| OpenSearch idempotent reservation purchasing: providers/aws/services/opensearch/client.go, providers/aws/services/opensearch/client_test.go | OpenSearch PurchaseCommitment derives ReservationName deterministically from token, uses findReservationByName to page and filter reservations, guards against duplicates, and recovers from ResourceAlreadyExistsException. Tests validate guard short-circuiting, not-found proceed, already-exists recovery, and lookup error handling. |
| Redshift idempotent reservation purchasing via tags: providers/aws/services/redshift/client.go, providers/aws/services/redshift/client_test.go | Redshift implements idempotency via tag-based guard: RedshiftAPI interface gains DescribeTags method, pre-purchase lookup scans nodes for existing idempotency token tags and short-circuits if found, post-purchase tags with both source and idempotency token, and fails loudly on lookup errors. Tests cover tag guard short-circuiting, non-tagged first-time purchase with tagging, and lookup error refusal. |
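
The Redshift tag-guard differs from the other services because there is no customer-supplied ID to look up, so the guard has to scan nodes and compare tags. A rough sketch of that shape, under stated assumptions: the `nodeAPI` interface, the `tokenTagKey` value, `findByTokenTag`, `purchaseWithTagGuard`, and the `fakeNodes` stand-in are hypothetical, and empty-token handling is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// tokenTagKey is a hypothetical tag key; the real key is not stated in the PR text.
const tokenTagKey = "idempotency-token"

// nodeAPI is a hypothetical abstraction over DescribeReservedNodes, DescribeTags,
// the purchase call, and CreateTags.
type nodeAPI interface {
	ListReservedNodeIDs() ([]string, error)
	TagsFor(nodeID string) (map[string]string, error)
	Purchase() (string, error)
	Tag(nodeID, key, value string) error
}

// findByTokenTag scans every reserved node and returns the ID of the first one
// whose token tag matches, or "" if none does.
func findByTokenTag(api nodeAPI, token string) (string, error) {
	ids, err := api.ListReservedNodeIDs()
	if err != nil {
		return "", err
	}
	for _, id := range ids {
		tags, err := api.TagsFor(id)
		if err != nil {
			return "", err
		}
		if v, ok := tags[tokenTagKey]; ok && v == token {
			return id, nil
		}
	}
	return "", nil
}

// purchaseWithTagGuard returns the node ID and whether a new purchase was made.
func purchaseWithTagGuard(api nodeAPI, token string) (string, bool, error) {
	existing, err := findByTokenTag(api, token)
	if err != nil {
		// Fail loud: an unknown lookup result must never lead to a purchase.
		return "", false, fmt.Errorf("tag lookup failed, refusing to purchase: %w", err)
	}
	if existing != "" {
		return existing, false, nil
	}
	id, err := api.Purchase()
	if err != nil {
		return "", false, err
	}
	// Residual window: a crash between Purchase and Tag leaves an untagged node.
	if err := api.Tag(id, tokenTagKey, token); err != nil {
		return id, true, fmt.Errorf("purchased %s but tagging failed: %w", id, err)
	}
	return id, true, nil
}

// fakeNodes is an in-memory stand-in used only to exercise the sketch.
type fakeNodes struct {
	tags     map[string]map[string]string
	order    []string
	failList bool
}

func (f *fakeNodes) ListReservedNodeIDs() ([]string, error) {
	if f.failList {
		return nil, errors.New("describe failed")
	}
	return f.order, nil
}

func (f *fakeNodes) TagsFor(id string) (map[string]string, error) { return f.tags[id], nil }

func (f *fakeNodes) Purchase() (string, error) {
	id := fmt.Sprintf("node-%d", len(f.order)+1)
	f.order = append(f.order, id)
	f.tags[id] = map[string]string{}
	return id, nil
}

func (f *fakeNodes) Tag(id, key, value string) error {
	f.tags[id][key] = value
	return nil
}

func main() {
	api := &fakeNodes{tags: map[string]map[string]string{}}
	for i := 0; i < 2; i++ {
		id, created, err := purchaseWithTagGuard(api, "tok-1")
		fmt.Println(id, created, err)
	}
}
```

The comment on the tagging step marks the window the PR describes as irreducible: the guard can only find a node once its tag has been written.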

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • LeanerCloud/CUDly#641: PR directly implements the requested IdempotencyToken support for AWS commitment executors across RDS, ElastiCache, MemoryDB, OpenSearch, and Redshift.
  • LeanerCloud/CUDly#617: Related to RDS ReservedDBInstanceId derivation and idempotent purchase flow updates in providers/aws/services/rds/client.go.

Possibly related PRs

  • LeanerCloud/CUDly#638: Both PRs implement commitment idempotency by threading IdempotencyToken through PurchaseCommitment options across multiple AWS services to create deterministic, duplicate-resistant purchase flows.

Suggested labels

priority/p2, severity/medium, urgency/this-quarter, type/feat

Poem

🐰 Five AWS services, now idempotent and kind,
Tokens derived, no duplicates left behind,
Guards standing watch, recovery paths wide,
Reservation purchases, safe and allied!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.81% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding idempotent purchase support to five AWS services (RDS, ElastiCache, MemoryDB, OpenSearch, Redshift), which is the primary focus of this changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@cristim
Member Author

cristim commented May 21, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@providers/aws/services/elasticache/client.go`:
- Line 225: The log currently prints the full idempotency token (variable token)
which can leak caller identifiers; update the log in the ElastiCache reservation
path so it does not emit the raw token — either redact it (e.g., replace with
fixed mask), show only a safe suffix/prefix, or log a hashed version of token,
and keep existingID visible; modify the log.Printf call that includes token and
existingID so it outputs the sanitized token representation instead of the raw
token.

In `@providers/aws/services/memorydb/client.go`:
- Line 219: The log currently prints the raw idempotency token via the
log.Printf call (variables token and existingID); change it to avoid emitting
the full token by omitting or redacting it (e.g., replace token with a fixed
placeholder like "<redacted>" or a masked version showing only the last 4 chars)
and keep existingID if needed; update the existing log.Printf invocation where
token is used so it no longer logs the raw token and ensure any helper you add
(e.g., maskToken) is used by the same call.

In `@providers/aws/services/rds/client.go`:
- Line 221: The log message prints the raw caller-supplied idempotency token
(variable token) which may leak identifiers; change the log in the RDS purchase
flow that calls log.Printf(...) to avoid logging the raw token by either
omitting it or emitting a redacted form (e.g., mask all but last N chars or log
a stable hash of token) and keep existingID as-is; update the log.Printf
invocation that currently references token and existingID so it uses the
redacted/hashed token variable instead of the raw token.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: caf805b0-dbd4-44da-9e0c-1218bdb34578

📥 Commits

Reviewing files that changed from the base of the PR and between 45a92e7 and 5995f20.

📒 Files selected for processing (12)
  • pkg/common/identifiers.go
  • pkg/common/identifiers_test.go
  • providers/aws/services/elasticache/client.go
  • providers/aws/services/elasticache/client_test.go
  • providers/aws/services/memorydb/client.go
  • providers/aws/services/memorydb/client_test.go
  • providers/aws/services/opensearch/client.go
  • providers/aws/services/opensearch/client_test.go
  • providers/aws/services/rds/client.go
  • providers/aws/services/rds/client_test.go
  • providers/aws/services/redshift/client.go
  • providers/aws/services/redshift/client_test.go

Comment thread providers/aws/services/elasticache/client.go Outdated
Comment thread providers/aws/services/memorydb/client.go Outdated
Comment thread providers/aws/services/rds/client.go Outdated
CodeRabbit (PR #652) flagged that the "already exists; skipping purchase"
log lines emit the full caller-supplied idempotency token, a stable
per-execution identifier that should not leak verbatim into persistent logs.

Add common.MaskToken (8-char prefix + ellipsis, "(none)" for empty) and use
it across all five idempotent AWS executors (RDS, ElastiCache, MemoryDB,
OpenSearch, Redshift). The masked prefix still correlates log lines for a
single purchase. CR flagged three; OpenSearch and Redshift carried the
identical pattern and are fixed here too for consistency.

Log-message-only change: the idempotency guard, derived reservation IDs, and
AlreadyExists recovery are untouched, so the same-token-no-double-buy
invariant is preserved per service.

Refs #641.
@cristim
Member Author

cristim commented May 22, 2026

CodeRabbit pass 1 addressed (commit d2d5013)

All 3 actionable findings were the same issue: the "already exists; skipping purchase" log lines emitted the raw idempotency token.

Fixed (all addressed, not dismissed):

  • Added common.MaskToken (pkg/common/tokens.go) - 8-char prefix + ellipsis, (none) for empty - and applied it to the skip-purchase logs in all five idempotent AWS executors: RDS, ElastiCache, MemoryDB, plus OpenSearch and Redshift (the latter two carried the identical pattern but were not individually flagged; fixed for consistency).
  • The masked prefix still correlates log lines for a single purchase, so we keep operational debuggability without leaking the full per-execution identifier.
  • Unit tests added for MaskToken (empty / short / long / never-emits-full-token).
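
The masking rule described above (8-character prefix plus ellipsis, `(none)` for empty) could be sketched like this. The lower-case `maskToken` name, the ASCII `...` ellipsis, and the handling of tokens of 8 characters or fewer are assumptions of this sketch; the PR only states that the full token is never emitted.

```go
package main

import "fmt"

// maskToken is a hypothetical sketch of the masking rule: it never returns
// the full token. It slices bytes, so it assumes ASCII (for example hex) tokens.
func maskToken(token string) string {
	const visible = 8
	switch {
	case token == "":
		return "(none)"
	case len(token) <= visible:
		// Assumption: a short token is hidden entirely, because printing
		// its "prefix" would print the whole token.
		return "..."
	default:
		return token[:visible] + "..."
	}
}

func main() {
	fmt.Println(maskToken(""))
	fmt.Println(maskToken("abc"))
	fmt.Println(maskToken("0123456789abcdef0123456789abcdef"))
}
```

A log call would then pass `maskToken(token)` instead of `token`, keeping the existing reservation ID visible as before.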

Idempotency invariant preserved: this is a log-message-only change. The idempotency guard, derived reservation IDs, and AlreadyExists recovery are untouched, so a re-drive with the same token still cannot create a second commitment, per service.

Out of scope, follow-up filed: the EC2 RI executor (#636, not part of this PR) has the same raw-token log line - tracked in #656.

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 22, 2026

@cristim: Thanks for the thorough write-up! The common.MaskToken approach (8-char prefix + ellipsis) is a clean solution — it preserves log correlatability across a single purchase flow without leaking the full token. Applying it proactively to OpenSearch and Redshift (even though they weren't individually flagged) is the right call for consistency. Good to see unit tests covering the edge cases (empty, short, long, never-emits-full-token) as well.

Triggering a review of the new commit now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim cristim merged commit 97a1cd0 into feat/multicloud-web-frontend May 22, 2026
4 checks passed
@cristim
Member Author

cristim commented May 22, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cristim added a commit that referenced this pull request May 22, 2026
GCP Compute CUD creation named commitments cud-<unix-second> and ignored
opts.IdempotencyToken, so a re-drive of the same execution (issue #639's
recovery sweep) would create a SECOND committed-use discount: a financial
double-purchase.

Thread opts.IdempotencyToken into RegionCommitments.Insert via two
deterministic mechanisms so the same token can never buy twice:

- RequestId: GCP's native server-side idempotency key on Insert, which the
  API documents as preventing duplicate commitments. It must be a valid
  non-zero UUID, so we format the SHA-256 token into a canonical UUID with
  the existing common.IdempotencyGUID helper (the same derivation PR #653
  used for the Azure reservationOrderID). The same token always yields the
  same RequestId, so the second Insert is a server-side no-op.
- Name: also derived from the token (cud-<first-32-hex>, RFC1035-valid) as
  defense in depth. Commitment names are unique per project+region, so a
  re-drive that somehow reached Insert collides on the name and GCP rejects
  it with ALREADY_EXISTS rather than creating a duplicate.
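
The two derivations listed above can be illustrated with a short sketch. It is not the code of `common.IdempotencyGUID`: the function names `requestIDFromToken` and `commitmentNameFromToken` are hypothetical, hashing the raw token with SHA-256 inside each function is an assumption, and the sketch does not set UUID version or variant bits (whether the real helper does is not stated).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// requestIDFromToken formats the first 16 bytes of SHA-256(token) in the
// canonical 8-4-4-4-12 UUID layout. Hypothetical sketch, not the real helper.
func requestIDFromToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	h := hex.EncodeToString(sum[:16]) // 32 hex characters
	return fmt.Sprintf("%s-%s-%s-%s-%s", h[0:8], h[8:12], h[12:16], h[16:20], h[20:32])
}

// commitmentNameFromToken builds "cud-" plus 32 hex characters: lower-case,
// starting with a letter, and well under the 63-character RFC 1035 label limit.
func commitmentNameFromToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return "cud-" + hex.EncodeToString(sum[:])[:32]
}

func main() {
	token := "exec-42" // hypothetical execution token
	fmt.Println(requestIDFromToken(token))
	fmt.Println(commitmentNameFromToken(token))
	// A re-drive with the same token produces the same pair of values.
	fmt.Println(requestIDFromToken(token) == requestIDFromToken(token))
}
```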

An empty token preserves the prior non-idempotent timestamp-based name (the
CLI path, which has no owning execution). The token is masked in logs via
common.MaskToken (never logged verbatim), matching PR #652.

Adds a re-drive regression test mirroring the AWS/Azure ones: the same token
on a second PurchaseCommitment yields an identical RequestId and commitment
name, plus an empty-token test confirming the CLI path stays non-idempotent.

Scope is computeengine only; CloudSQL/Storage/Memorystore were made
advisory-only in #649 and are untouched.

Closes #654
refs #641
cristim added a commit that referenced this pull request May 22, 2026
New Manager.ReapStuckExecutions(ctx, reapAfter): one sweep that finds
executions stuck in approved/running longer than reapAfter and
atomically transitions them to "failed" via the existing
TransitionExecutionStatus CAS. Each successful transition also writes
a canonical, human-readable error string so the History UI (#621)
shows operators why the row was reaped and confirms it's safe to retry.

Safety properties:
- Local-status-only: never touches provider commitments. If the real
  executor did manage to create a commitment before dying, the
  retry hits the idempotency path (#636/#638/#652) which surfaces a
  duplicate-reservation error cleanly.
- CAS-protected: if the real executor wakes up and finishes between
  the SELECT and the transition, the CAS rejects; logged at INFO,
  not surfaced as an error. The real executor wins the race.
- Per-row error-isolation: a failure on row N never blocks N+1..K.

Adds ParseReapAfterFromEnv to read PURCHASE_APPROVED_REAP_AFTER and
fall back to a 10m default on missing-or-malformed env (with a WARN
log so ops can spot a typo). Never panics — a misconfigured env var
must not crash the other scheduled tasks sharing the Lambda.

Tests cover:
- stale approved row → flipped to failed with canonical message
- stale running row → same
- younger-than-threshold rows are filtered out by the SELECT (no
  TransitionExecutionStatus / SavePurchaseExecution calls happen)
- terminal-status rows are never passed to the SELECT (regression
  guard against accidentally widening stuckStatuses)
- CAS race lost → logged + counted, not surfaced as sweep error
- CAS (nil, nil) defensive path treated as race-lost
- 3-stuck-row integration with per-row CAS expectations
- SELECT error → propagated as sweep error
- per-row save error after successful CAS → Reaped++ AND Errored++
- env-var: unset, valid, invalid, non-default unit (2h30m)
cristim added a commit that referenced this pull request May 22, 2026
closes #678) (#681)

* feat(config): add ListStuckExecutions store method

Selects purchase_executions rows whose status is in a caller-supplied
set AND whose updated_at is older than a caller-supplied interval.
Newest-stuck-first, capped at MaxListLimit per call. Mirrors the
existing GetStaleProcessingExchanges pattern for RI exchanges.

Used by the reaper sweep (issue #678) to find executions stuck in
approved/running because the synchronous executor failed mid-flight
(Lambda timeout, OOM, network hang) without flipping the row to a
terminal state. Local SELECT only — no provider-side mutation.

Adds the method to StoreInterface and to every test-side mock store
that satisfies it (purchase, api, analytics, scheduler, server
health). Adds pgxmock coverage for the happy path, the
empty-statuses short-circuit, and a query-error path.

* feat(purchase): add ReapStuckExecutions sweep + per-row CAS to failed

New Manager.ReapStuckExecutions(ctx, reapAfter): one sweep that finds
executions stuck in approved/running longer than reapAfter and
atomically transitions them to "failed" via the existing
TransitionExecutionStatus CAS. Each successful transition also writes
a canonical, human-readable error string so the History UI (#621)
shows operators why the row was reaped and confirms it's safe to retry.

Safety properties:
- Local-status-only: never touches provider commitments. If the real
  executor did manage to create a commitment before dying, the
  retry hits the idempotency path (#636/#638/#652) which surfaces a
  duplicate-reservation error cleanly.
- CAS-protected: if the real executor wakes up and finishes between
  the SELECT and the transition, the CAS rejects; logged at INFO,
  not surfaced as an error. The real executor wins the race.
- Per-row error-isolation: a failure on row N never blocks N+1..K.

Adds ParseReapAfterFromEnv to read PURCHASE_APPROVED_REAP_AFTER and
fall back to a 10m default on missing-or-malformed env (with a WARN
log so ops can spot a typo). Never panics — a misconfigured env var
must not crash the other scheduled tasks sharing the Lambda.

Tests cover:
- stale approved row → flipped to failed with canonical message
- stale running row → same
- younger-than-threshold rows are filtered out by the SELECT (no
  TransitionExecutionStatus / SavePurchaseExecution calls happen)
- terminal-status rows are never passed to the SELECT (regression
  guard against accidentally widening stuckStatuses)
- CAS race lost → logged + counted, not surfaced as sweep error
- CAS (nil, nil) defensive path treated as race-lost
- 3-stuck-row integration with per-row CAS expectations
- SELECT error → propagated as sweep error
- per-row save error after successful CAS → Reaped++ AND Errored++
- env-var: unset, valid, invalid, non-default unit (2h30m)

* feat(purchase): wire periodic reaper invocation (closes #678)

Wires the ReapStuckExecutions sweep (added in the previous commit) into
the existing scheduled-task pipeline:

- New ScheduledTaskType "reap_stuck_purchases" registered alongside the
  other periodic tasks (cleanup, analytics_refresh, ri_exchange_reshape).
  Adds a dedicated handleReapStuckPurchases dispatcher that reads the
  threshold via purchase.ParseReapAfterFromEnv on every invocation so an
  ops-side env-var tune via PURCHASE_APPROVED_REAP_AFTER takes effect on
  the next sweep without a redeploy.
- Adds ReapStuckExecutions to PurchaseManagerInterface so the handler
  contract is symmetric with the Manager + the testutil mock can satisfy
  it. MockPurchaseManager gains ReapStuckExecutionsFunc following the
  existing per-method-Func mock convention.
- New EventBridge rule "${stack_name}-reap-stuck-purchases" on
  rate(5 minutes), targeting the main Lambda with
  {"action":"reap_stuck_purchases"}. Cadence is intentionally more
  frequent than the 10m default threshold so a stuck row is reaped
  within ~1 threshold-window. The reaper itself is CAS-protected
  (TransitionExecutionStatus from approved/running → failed) so an
  over-run is safe — the real executor wins the race.
- New terraform vars: enable_reap_stuck_purchases_schedule (bool,
  default true), reap_stuck_purchases_schedule (string,
  default "rate(5 minutes)"), purchase_approved_reap_after (string,
  default "" → use in-code DefaultReapAfter). Empty-string default for
  the threshold avoids pinning a Lambda env value when ops hasn't taken
  a position.
- ParseScheduledEvent learns the "reap_stuck_purchases" action so the
  EventBridge → Lambda payload round-trips through the dispatcher.

Tests cover the dispatcher success path (asserts the default 10m
threshold is passed when the env var is unset), the store-error
propagation path, the lock-ID uniqueness guard, and the
ParseScheduledEvent action mapping.

* fix(purchase/reaper): distinguish CAS race-loss from real DB errors

The reaper's per-row error handler bucketed every failure from
TransitionExecutionStatus into RaceLost, including real DB outages
(connection refused, query timeout, etc.). That masked persistent
store failures: a downed DB would surface only as a quiet count
inflation on RaceLost, with no errored signal for ops to alert on.

Wrap the two legitimate race-loss outcomes from
TransitionExecutionStatus in sentinel errors so callers can use
errors.Is rather than brittle string matching:

  - ErrExecutionNotInExpectedStatus: row exists but its status moved
    out of the allowed set before the CAS (the real executor finished
    between SELECT and UPDATE — race lost).
  - ErrNotFound (reused for the "row vanished" case): row was deleted
    between SELECT and UPDATE — also a race outcome, nothing for the
    reaper to do.

In the reaper, classify errors.Is(err, ErrExecutionNotInExpectedStatus)
or errors.Is(err, ErrNotFound) as RaceLost (INFO log); everything else
bumps Errored with an ERROR log so the ops signal is visible.

Addresses A1 from CodeRabbit round-1 review on #681.

Regression tests:
  - TestReapStuckExecutions_CASRaceLostNoError now wraps the sentinel
    (was raw errors.New); guards the "race-loss is INFO-only" path.
  - TestReapStuckExecutions_RowVanishedTreatedAsRaceLost covers the
    ErrNotFound branch (manual DELETE between SELECT and CAS).
  - TestReapStuckExecutions_HardDBErrorClassifiedAsErrored asserts a
    non-sentinel error bumps Errored, not RaceLost — the regression
    guard for the original A1 finding.

The wrapped errors keep the original message substrings ("not found",
"cannot transition from ...") so existing call sites that match on
substrings (e.g. internal/purchase/manager.go's
RecoverStrandedApprovals, internal/purchase/approvals_test.go) are
unaffected.

* fix(purchase/reaper): reject non-positive reap durations + stabilize env tests

ParseReapAfterFromEnv accepted "0s" and negative durations such as
"-5m" because time.ParseDuration treats them as syntactically valid.
Feeding either into ListStuckExecutions would be catastrophic:

  - 0s: the cutoff "updated_at < NOW() - 0s" matches every
    approved/running row regardless of age — the reaper would flip
    fresh, in-flight executions to failed.
  - -5m: the cutoff "updated_at < NOW() - (-5m)" == "updated_at <
    NOW() + 5m" matches rows from 5 minutes in the future, i.e.
    effectively every row. Same outcome.

The store-side guard added in the original PR (ListStuckExecutions
rejects olderThan <= 0) prevents the broken SELECT from executing,
but a misconfigured env value would still cause every sweep to fail
silently with a confusing store error rather than a WARN at the
config-parse boundary. Reject non-positive durations at the env-parse
layer with a WARN log + fallback to DefaultReapAfter so the misconfig
is visible in the Lambda's startup logs.
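
A parse function with this validation might look like the sketch below. The environment variable name and the 10m default come from the commit message; the split into a pure `parseReapAfter` plus an env-reading wrapper, the lower-case names, the log wording, and staying silent when the variable is unset are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

const (
	reapAfterEnv     = "PURCHASE_APPROVED_REAP_AFTER"
	defaultReapAfter = 10 * time.Minute
)

// parseReapAfter validates a raw duration string and never panics.
// Unset, malformed, zero, and negative values all fall back to the default.
func parseReapAfter(raw string) time.Duration {
	if raw == "" {
		return defaultReapAfter
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		log.Printf("WARN: %s=%q is not a valid duration, using default %s", reapAfterEnv, raw, defaultReapAfter)
		return defaultReapAfter
	}
	if d <= 0 {
		// "0s" and "-5m" parse successfully but would match every row.
		log.Printf("WARN: %s=%q must be positive, using default %s", reapAfterEnv, raw, defaultReapAfter)
		return defaultReapAfter
	}
	return d
}

// parseReapAfterFromEnv reads the variable on every call, so a changed
// value takes effect on the next sweep.
func parseReapAfterFromEnv() time.Duration {
	return parseReapAfter(os.Getenv(reapAfterEnv))
}

func main() {
	for _, raw := range []string{"", "2h30m", "0s", "-5m", "garbage"} {
		fmt.Printf("%q -> %s\n", raw, parseReapAfter(raw))
	}
	fmt.Println("from env:", parseReapAfterFromEnv())
}
```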

Addresses A2 + A3 (defense-in-depth) + A4 + A5 from CodeRabbit
round-1 review on #681.

Regression tests (A4):
  - TestParseReapAfterFromEnv_ZeroFallsBackToDefault — "0s"
  - TestParseReapAfterFromEnv_NegativeFallsBackToDefault — "-5m"
  - TestParseReapAfterFromEnv_GarbageFallsBackToDefault — explicit
    "garbage" case complementing the existing _Invalid* test, named
    per the CR finding so the regression intent is searchable.

handler_test.go (A5):
  - TestHandleScheduledTask's table-driven loop now calls
    t.Setenv("PURCHASE_APPROVED_REAP_AFTER", "") inside the per-case
    sub-Run. The two reap_stuck_purchases cases assert reapAfter ==
    10*time.Minute (the default); without an explicit env clear, an
    ambient PURCHASE_APPROVED_REAP_AFTER in CI/dev would silently
    make them flaky. t.Setenv auto-restores the prior value at
    cleanup.

The store-side olderThan <= 0 guard (A3) was already folded into the
earlier rebased commit "feat(config): add ListStuckExecutions store
method" so no additional store-side change is needed here — A3 is
covered by the existing defense-in-depth, and this commit completes
the env-side validation A2 calls for.