Symptom
Multiple purchase_executions rows observed stuck in the approved state — the user clicked approve, the row flipped to approved, but the synchronous executor that should drive it through running → completed/failed never finished. Result: the rec sits indefinitely as "approved — purchase in progress" with no automatic recovery.
Why this happens
Manager.ApproveAndExecute (internal/purchase/approvals.go) persists Status = "approved" BEFORE calling executeAndFinalize. The terminal transition to completed / failed only fires inside finalizeExecution (internal/purchase/manager.go) AFTER executePurchase returns. If the synchronous path is interrupted — Lambda timeout, container kill, panic, network hang to the provider API, request-context cancellation — the row remains in approved with no further state machinery to advance it.
The repro path is just "approve a purchase and have anything fail mid-flight before finalizeExecution". The window is real and reproducible — multiple rows already in this state on the deployed branch.
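A toy model of that ordering makes the window easier to see. It is a sketch only, not the real Manager code: the execution type, the approveAndExecute helper, and the survived flag are hypothetical stand-ins, and the intermediate running state is left out for brevity.

```go
package main

// execution is a toy stand-in for a purchase_executions row.
type execution struct {
	Status string
}

// approveAndExecute models the ordering described above: the status flip is
// persisted first, the purchase runs second, and the terminal status is only
// written if the process is still alive afterwards. The purchase callback
// stands in for executePurchase and reports whether the purchase succeeded
// and whether the process survived long enough to see the result.
func approveAndExecute(e *execution, purchase func() (succeeded, survived bool)) {
	e.Status = "approved" // persisted before any provider call

	succeeded, survived := purchase()
	if !survived {
		// Models a Lambda timeout or SIGKILL: no code runs past this point,
		// so the finalize step below is never reached.
		return
	}

	// Stand-in for finalizeExecution.
	if succeeded {
		e.Status = "completed"
	} else {
		e.Status = "failed"
	}
}
```

Any callback that reports survived = false leaves the row in approved, which is the orphaned state described in the symptom.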
Why a reaper, not just better error handling on the hot path
Even with perfect synchronous error handling, the row can be orphaned by:
- Lambda timeout exceeded mid-PurchaseCommitment call
- Container OOM/SIGKILL
- Database connection drop after status flip but before purchase result is persisted
- Concurrent re-drive crashing one of the two executors
These are non-bug failure modes inherent to the architecture. A reaper is the correct backstop.
Proposed behaviour
A background sweep (every 1-5 minutes is fine, exact cadence less important than the timeout) that:
- Selects purchase_executions rows where status IN ('approved', 'running') AND updated_at < NOW() - INTERVAL '10 minutes'.
- For each, calls TransitionExecutionStatus(executionID, fromStatuses=['approved', 'running'], toStatus='failed'), an atomic CAS to avoid racing the real executor if it happens to wake up.
- Records an error field on the failed row: something like "reaped after 10m in approved state: executor did not complete; safe to retry".
- Logs the reap event (per-execution-ID, with original status + age) at WARN so it's surfaced in ops dashboards.
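The sweep above could be sketched roughly as follows, using only database/sql. This is an illustration under assumptions, not the project's implementation: the id column, the reapOnce and reapMessage helpers, and the exact SQL are hypothetical, and the real code would presumably go through TransitionExecutionStatus rather than issuing the UPDATE directly. The cutoff is computed from the application clock rather than NOW(), which assumes app and database clocks are reasonably in sync.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"
)

// defaultReapAfter matches the 10-minute threshold proposed in this issue.
const defaultReapAfter = 10 * time.Minute

// selectStale finds candidate rows. The id column name is an assumption.
const selectStale = `
SELECT id, status, updated_at
FROM purchase_executions
WHERE status IN ('approved', 'running') AND updated_at < $1`

// casFail is the compare-and-swap: it only matches if the row is still in a
// non-terminal status AND still stale, so a row the executor touched after
// the SELECT is left alone.
const casFail = `
UPDATE purchase_executions
SET status = 'failed', error = $1, updated_at = NOW()
WHERE id = $2 AND status IN ('approved', 'running') AND updated_at < $3`

// reapMessage approximates the error text suggested above.
func reapMessage(from string, reapAfter time.Duration) string {
	return fmt.Sprintf("reaped after %s in %s state: executor did not complete; safe to retry", reapAfter, from)
}

type staleRow struct {
	id        string
	status    string
	updatedAt time.Time
}

// reapOnce runs a single sweep and returns how many rows it flipped to failed.
// It never calls the provider; it only updates local rows.
func reapOnce(ctx context.Context, db *sql.DB, reapAfter time.Duration) (int, error) {
	now := time.Now()
	cutoff := now.Add(-reapAfter)

	rows, err := db.QueryContext(ctx, selectStale, cutoff)
	if err != nil {
		return 0, fmt.Errorf("select stale executions: %w", err)
	}
	defer rows.Close()

	var stale []staleRow
	for rows.Next() {
		var r staleRow
		if err := rows.Scan(&r.id, &r.status, &r.updatedAt); err != nil {
			return 0, fmt.Errorf("scan stale execution: %w", err)
		}
		stale = append(stale, r)
	}
	if err := rows.Err(); err != nil {
		return 0, fmt.Errorf("iterate stale executions: %w", err)
	}

	reaped := 0
	for _, r := range stale {
		res, err := db.ExecContext(ctx, casFail, reapMessage(r.status, reapAfter), r.id, cutoff)
		if err != nil {
			return reaped, fmt.Errorf("reap %s: %w", r.id, err)
		}
		n, err := res.RowsAffected()
		if err != nil {
			return reaped, fmt.Errorf("reap %s: %w", r.id, err)
		}
		if n == 0 {
			continue // CAS rejected: the executor moved the row first
		}
		reaped++
		log.Printf("WARN reaped purchase execution id=%s original_status=%s age=%s",
			r.id, r.status, now.Sub(r.updatedAt).Round(time.Second))
	}
	return reaped, nil
}
```

Checking RowsAffected is what makes the update a CAS from the caller's point of view: zero rows means the reaper lost the race and must not log a reap.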
Operator UX
Once the row is failed, it will appear in the History view (per the #621 fix). The user can then choose to re-approve or cancel the rec. If the real executor was actually still working and finalized the row first, the atomic CAS will have rejected the reap (the real executor wins the race).
Important: the reaper MUST NOT touch the underlying provider commitment. It only flips the local row. If AWS actually did get the purchase through but our row says failed, the user will see a duplicate-reservation-detected error on retry (already covered by #636/#638/#652 idempotency work) — which is the correct and safe behaviour.
Threshold choice
10 minutes is conservative: most successful purchases complete in <60s, and the longest legitimately slow paths (multi-account fan-out, retry-backoff on provider rate limit) settle in <3 min. A 10-minute threshold leaves more than 3x headroom over those slow paths, so the reaper should not fight a live executor.
Make it configurable via env var (PURCHASE_APPROVED_REAP_AFTER, default 10m) so ops can tune without a deploy.
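Reading the threshold could look something like this. The helper names are hypothetical. Falling back to the default on malformed or non-positive values is an assumption about desired behaviour, chosen so a typo cannot make the reaper fire immediately.

```go
package main

import (
	"os"
	"time"
)

// defaultReapAfter mirrors the 10m default proposed above.
const defaultReapAfter = 10 * time.Minute

// parseReapAfter turns a raw setting such as "10m" or "90s" into a threshold.
// Empty, malformed, or non-positive values fall back to the default.
func parseReapAfter(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultReapAfter
	}
	return d
}

// reapAfterFromEnv reads PURCHASE_APPROVED_REAP_AFTER from the environment.
func reapAfterFromEnv() time.Duration {
	return parseReapAfter(os.Getenv("PURCHASE_APPROVED_REAP_AFTER"))
}
```

Keeping the parsing separate from the os.Getenv call makes the fallback rules easy to unit test without touching the process environment.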
Scope
- New: internal/purchase/reaper.go (or attach to scheduler) with the sweep + CAS logic.
- New: scheduled invocation, either via the existing scheduler Lambda or a Postgres-backed pg_cron-style scan from the API, whichever matches the project's existing background-job pattern.
- New: integration test that creates a stale-approved row, runs the reaper, asserts the row is now failed with the canonical error message.
- Unit tests for the CAS race (reaper loses to real executor / wins when executor is gone).
- Wire-up via cmd/server or wherever the scheduler is started.
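For the CAS race unit tests, an in-memory fake is probably enough. The sketch below is hypothetical throughout: memStore, transition, and race are illustrative names, and transition only mimics the semantics this issue assumes for TransitionExecutionStatus. It also assumes the executor's terminal write goes through the same compare-and-swap; if finalizeExecution writes unconditionally, the race test would need to model that instead.

```go
package main

import "sync"

// memStore is an in-memory stand-in for the purchase_executions table.
type memStore struct {
	mu     sync.Mutex
	status map[string]string
}

// transition swaps the status only if the current status is one of from.
// It reports whether the swap happened.
func (s *memStore) transition(id string, from []string, to string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.status[id]
	if !ok {
		return false
	}
	for _, f := range from {
		if cur == f {
			s.status[id] = to
			return true
		}
	}
	return false
}

// race starts a reaper and an executor concurrently against one row and
// returns which of them won plus the final status.
func race(initial string) (reaperWon, executorWon bool, final string) {
	s := &memStore{status: map[string]string{"exec-1": initial}}
	nonTerminal := []string{"approved", "running"}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		reaperWon = s.transition("exec-1", nonTerminal, "failed")
	}()
	go func() {
		defer wg.Done()
		executorWon = s.transition("exec-1", nonTerminal, "completed")
	}()
	wg.Wait()

	s.mu.Lock()
	defer s.mu.Unlock()
	return reaperWon, executorWon, s.status["exec-1"]
}
```

Whatever the scheduling, exactly one side should win when the row starts in approved or running, and neither should win when it is already terminal.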
Related
- approved/running rows visible in History (so reaped rows are surfaced).
Why this matters
Without the reaper, every Lambda timeout / container kill / network hiccup leaves a permanent orphan row that the user must manually intervene on. With dozens of rows already in this state on the production deployment, it's a continuous source of confusion ("did my approve work? why is it still in progress?") and operator toil.