bug(purchases): reap purchase_executions stuck in approved/running for >10min (mark as failed) #678

@cristim

Symptom

Multiple purchase_executions rows observed stuck in the approved state — the user clicked approve, the row flipped to approved, but the synchronous executor that should drive it through running → completed/failed never finished. Result: the rec sits indefinitely as "approved — purchase in progress" with no automatic recovery.

Why this happens

Manager.ApproveAndExecute (internal/purchase/approvals.go) persists Status = "approved" BEFORE calling executeAndFinalize. The terminal transition to completed / failed only fires inside finalizeExecution (internal/purchase/manager.go) AFTER executePurchase returns. If the synchronous path is interrupted — Lambda timeout, container kill, panic, network hang to the provider API, request-context cancellation — the row remains in approved with no further state machinery to advance it.

The repro path is just "approve a purchase and have anything fail mid-flight before finalizeExecution". The window is real and reproducible — multiple rows already in this state on the deployed branch.

Why a reaper, not just better error handling on the hot path

Even with perfect synchronous error handling, the row can be orphaned by:

  • Lambda timeout exceeded mid-PurchaseCommitment call
  • Container OOM/SIGKILL
  • Database connection drop after status flip but before purchase result is persisted
  • Concurrent re-drive crashing one of the two executors

These are non-bug failure modes inherent to the architecture. A reaper is the correct backstop.

Proposed behaviour

A background sweep (every 1-5 minutes is fine, exact cadence less important than the timeout) that:

  1. Selects purchase_executions rows where status IN ('approved', 'running') AND updated_at < NOW() - INTERVAL '10 minutes'.
  2. For each, calls TransitionExecutionStatus(executionID, fromStatuses=['approved', 'running'], toStatus='failed') — atomic CAS to avoid racing the real executor if it happens to wake up.
  3. Records an error field on the failed row: something like "reaped after 10m in approved state — executor did not complete; safe to retry".
  4. Logs the reap event (per-execution-ID, with original status + age) at WARN so it's surfaced in ops dashboards.

Operator UX

Once the row is failed, it will appear in the History view (per #621 fix). The user can choose to re-approve or cancel the rec. If the real executor had already moved the row to a terminal state, the atomic CAS rejects the reap (the real executor wins the race).

Important: the reaper MUST NOT touch the underlying provider commitment. It only flips the local row. If AWS actually did get the purchase through but our row says failed, the user will see a duplicate-reservation-detected error on retry (already covered by #636/#638/#652 idempotency work) — which is the correct and safe behaviour.

Threshold choice

10 minutes is conservative: most successful purchases complete in <60s, and the longest legitimately slow paths (multi-account fan-out, retry-backoff on provider rate limit) settle in <3 min. A 10-minute threshold leaves more than 3x headroom over the slowest legitimate path, so the reaper should not fight a real executor in normal operation.

Make it configurable via env var (PURCHASE_APPROVED_REAP_AFTER, default 10m) so ops can tune without a deploy.

Scope

  • New: internal/purchase/reaper.go (or attach to scheduler) with the sweep + CAS logic.
  • New: scheduled invocation — either via the existing scheduler Lambda or a Postgres-backed pg_cron-style scan from the API, whichever matches the project's existing background-job pattern.
  • New: integration test that creates a stale-approved row, runs the reaper, asserts the row is now failed with the canonical error message.
  • Unit tests for the CAS race (reaper loses to real executor / wins when executor is gone).
  • Wire-up via cmd/server or wherever the scheduler is started.
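The two CAS race outcomes named in the unit-test bullet can be exercised with something like the sketch below. The `row` type and the `executorFinalize` / `reap` helpers are hypothetical. It also assumes the executor's terminal write goes through the same compare-and-set, which the issue does not confirm for finalizeExecution.

```go
package main

import (
	"fmt"
	"sync"
)

// row is a hypothetical single purchase_executions row guarded by a mutex.
type row struct {
	mu     sync.Mutex
	status string
}

// transition moves the row to `to` only if its status is one of `from`.
func (r *row) transition(from []string, to string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, f := range from {
		if r.status == f {
			r.status = to
			return true
		}
	}
	return false
}

var inFlight = []string{"approved", "running"}

// executorFinalize is the real executor writing its terminal state.
func executorFinalize(r *row) bool { return r.transition(inFlight, "completed") }

// reap is the reaper's attempt to fail the same row.
func reap(r *row) bool { return r.transition(inFlight, "failed") }

// raceOnce runs executor and reaper concurrently against one row and
// reports which of them won, plus the final status.
func raceOnce() (executorWon, reaperWon bool, final string) {
	r := &row{status: "running"}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); executorWon = executorFinalize(r) }()
	go func() { defer wg.Done(); reaperWon = reap(r) }()
	wg.Wait()
	return executorWon, reaperWon, r.status
}

func main() {
	// Reaper loses: the executor finalized first.
	a := &row{status: "approved"}
	fmt.Println(executorFinalize(a), reap(a), a.status) // true false completed

	// Reaper wins: the executor is gone and never finalizes.
	b := &row{status: "approved"}
	fmt.Println(reap(b), b.status) // true failed

	// Concurrent case: exactly one side may win each time.
	for i := 0; i < 1000; i++ {
		if e, p, _ := raceOnce(); e == p {
			panic("exactly one of executor and reaper must win")
		}
	}
	fmt.Println("race invariant held")
}
```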

Related

Why this matters

Without the reaper, every Lambda timeout / container kill / network hiccup leaves a permanent orphan row that the user must manually intervene on. With dozens of rows already in this state on the production deployment, it's a continuous source of confusion ("did my approve work? why is it still in progress?") and operator toil.
