bug(purchases): reap purchase_executions stuck in approved/running for >10min (mark as failed) #678

## Symptom

Multiple `purchase_executions` rows observed stuck in the `approved` state — the user clicked approve, the row flipped to `approved`, but the synchronous executor that should drive it through `running → completed`/`failed` never finished. Result: the rec sits indefinitely as "approved — purchase in progress" with no automatic recovery.

## Why this happens

`Manager.ApproveAndExecute` (`internal/purchase/approvals.go`) persists `Status = "approved"` BEFORE calling `executeAndFinalize`. The terminal transition to `completed` / `failed` only fires inside `finalizeExecution` (`internal/purchase/manager.go`) AFTER `executePurchase` returns. If the synchronous path is interrupted — Lambda timeout, container kill, panic, network hang to the provider API, request-context cancellation — the row remains in `approved` with no further state machinery to advance it.

The repro path is just "approve a purchase and have anything fail mid-flight before `finalizeExecution`". The window is real and reproducible — multiple rows already in this state on the deployed branch.

## Why a reaper, not just better error handling on the hot path

Even with perfect synchronous error handling, the row can be orphaned by:
- Lambda timeout exceeded mid-`PurchaseCommitment` call
- Container OOM/SIGKILL
- Database connection drop after status flip but before purchase result is persisted
- Concurrent re-drive crashing one of the two executors

These are non-bug failure modes inherent to the architecture. A reaper is the correct backstop.

## Proposed behaviour

A background sweep (every 1-5 minutes is fine, exact cadence less important than the timeout) that:

1. Selects `purchase_executions` rows where `status IN ('approved', 'running')` AND `updated_at < NOW() - INTERVAL '10 minutes'`.
2. For each, calls `TransitionExecutionStatus(executionID, fromStatuses=['approved', 'running'], toStatus='failed')` — atomic CAS to avoid racing the real executor if it happens to wake up.
3. Records an `error` field on the failed row that names the state the row was actually stuck in (`approved` or `running`): something like `"reaped after 10m in approved state; executor did not complete; safe to retry"`.
4. Logs the reap event (per-execution-ID, with original status + age) at WARN so it's surfaced in ops dashboards.
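
The four steps above can be sketched as follows. This is a minimal illustration, not the project's code: `Execution`, `memStore`, `Stale`, `Transition`, `Sweep` and `exampleSweep` are hypothetical names, and an in-memory map stands in for the `purchase_executions` table so the logic can be read and run without a database. The real `TransitionExecutionStatus` signature may differ.

```go
package main

import (
	"fmt"
	"log"
	"sort"
	"sync"
	"time"
)

// Execution is a hypothetical stand-in for a purchase_executions row.
type Execution struct {
	ID        string
	Status    string
	Error     string
	UpdatedAt time.Time
}

// memStore is an in-memory stand-in for the database (illustration only).
type memStore struct {
	mu   sync.Mutex
	rows map[string]*Execution
}

// Stale mirrors step 1: status IN ('approved','running') AND updated_at < cutoff.
func (s *memStore) Stale(cutoff time.Time) []Execution {
	s.mu.Lock()
	defer s.mu.Unlock()
	var out []Execution
	for _, r := range s.rows {
		if (r.Status == "approved" || r.Status == "running") && r.UpdatedAt.Before(cutoff) {
			out = append(out, *r) // copy, so callers never touch shared rows
		}
	}
	return out
}

// Transition mirrors step 2: a compare-and-swap on status. It reports false
// when the row is missing or has already left every status in from.
func (s *memStore) Transition(id string, from []string, to, errMsg string, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.rows[id]
	if !ok {
		return false
	}
	for _, f := range from {
		if r.Status == f {
			r.Status, r.Error, r.UpdatedAt = to, errMsg, now
			return true
		}
	}
	return false
}

// Sweep runs one reaper pass and returns the IDs it reaped, sorted.
func Sweep(s *memStore, now time.Time, reapAfter time.Duration) []string {
	var reaped []string
	for _, e := range s.Stale(now.Add(-reapAfter)) {
		age := now.Sub(e.UpdatedAt).Round(time.Second)
		// Step 3: the message wording here is illustrative, not canonical.
		msg := fmt.Sprintf("reaped after %s in %s state: executor did not complete; safe to retry", age, e.Status)
		if s.Transition(e.ID, []string{"approved", "running"}, "failed", msg, now) {
			// Step 4: per-execution log line with original status and age.
			log.Printf("WARN reaped execution id=%s status=%s age=%s", e.ID, e.Status, age)
			reaped = append(reaped, e.ID)
		}
	}
	sort.Strings(reaped)
	return reaped
}

// exampleSweep builds three rows and runs one pass with a 10-minute threshold.
func exampleSweep() ([]string, *memStore) {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	s := &memStore{rows: map[string]*Execution{
		"a": {ID: "a", Status: "approved", UpdatedAt: now.Add(-15 * time.Minute)},
		"b": {ID: "b", Status: "running", UpdatedAt: now.Add(-2 * time.Minute)},
		"c": {ID: "c", Status: "completed", UpdatedAt: now.Add(-time.Hour)},
	}}
	return Sweep(s, now, 10*time.Minute), s
}
```

In `exampleSweep`, only row `a` is reaped: `b` is still inside the threshold and `c` is already terminal. Against Postgres, the same compare-and-swap would presumably be a single `UPDATE ... WHERE id = $1 AND status IN ('approved', 'running')` with a rows-affected check, so the status test and the write happen in one statement.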

## Operator UX

Once the row is `failed`, it will appear in the History view (per #621 fix). The user can choose to re-approve or cancel the rec. If the real executor finishes and writes its terminal status before the reaper gets to the row, the atomic CAS rejects the reap (real executor wins the race).

Important: the reaper MUST NOT touch the underlying provider commitment. It only flips the local row. If AWS actually did get the purchase through but our row says `failed`, the user will see a duplicate-reservation-detected error on retry (already covered by #636/#638/#652 idempotency work) — which is the correct and safe behaviour.

## Threshold choice

10 minutes is conservative: most successful purchases complete in <60s, and the longest legitimately slow paths (multi-account fan-out, retry-backoff on provider rate limit) settle in <3 min. 10 min is more than 3x the slowest legitimate path, so the reaper should not end up fighting a real executor.

Make it configurable via env var (`PURCHASE_APPROVED_REAP_AFTER`, default `10m`) so ops can tune without a deploy.
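
A minimal sketch of reading that setting, assuming the value uses Go duration syntax (`10m`, `90s`) as the default suggests. `parseReapAfter`, `reapAfterFromEnv` and `defaultReapAfter` are hypothetical names; only the env var name and the `10m` default come from this issue.

```go
package main

import (
	"os"
	"time"
)

// defaultReapAfter is the 10-minute default proposed in this issue.
const defaultReapAfter = 10 * time.Minute

// parseReapAfter converts a raw setting into a threshold. Empty, malformed,
// or non-positive values fall back to the default, so a bad setting cannot
// make the reaper fire on fresh rows.
func parseReapAfter(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultReapAfter
	}
	return d
}

// reapAfterFromEnv reads PURCHASE_APPROVED_REAP_AFTER from the environment.
func reapAfterFromEnv() time.Duration {
	return parseReapAfter(os.Getenv("PURCHASE_APPROVED_REAP_AFTER"))
}
```

Falling back silently keeps the reaper safe, but it can hide a typo in the setting; logging the effective threshold at startup would make that visible.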

## Scope

- New: `internal/purchase/reaper.go` (or attach to scheduler) with the sweep + CAS logic.
- New: scheduled invocation — either via the existing scheduler Lambda or a Postgres-backed `pg_cron`-style scan from the API, whichever matches the project's existing background-job pattern.
- New: integration test that creates a stale-`approved` row, runs the reaper, asserts the row is now `failed` with the canonical error message.
- Unit tests for the CAS race (reaper loses to real executor / wins when executor is gone).
- Wire-up via `cmd/server` or wherever the scheduler is started.
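
The property the CAS race tests need to pin down can be sketched like this. The `row` type, `cas` and `race` are hypothetical, and the sketch assumes the executor's terminal write is itself a compare-and-swap, which this issue does not confirm.

```go
package main

import "sync"

// row is a hypothetical single execution row guarded by a mutex, standing in
// for the atomicity a database UPDATE ... WHERE status IN (...) would give.
type row struct {
	mu     sync.Mutex
	status string
}

// cas moves the row to `to` only if its current status is one of from.
func (r *row) cas(from []string, to string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, f := range from {
		if r.status == f {
			r.status = to
			return true
		}
	}
	return false
}

// race lets a reaper and an executor compete for a row in `running` and
// reports who won plus the final status.
func race() (reaperWon, executorWon bool, final string) {
	r := &row{status: "running"}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		reaperWon = r.cas([]string{"approved", "running"}, "failed")
	}()
	go func() {
		defer wg.Done()
		// Assumption: the executor finalizes via the same CAS primitive.
		executorWon = r.cas([]string{"running"}, "completed")
	}()
	wg.Wait()
	r.mu.Lock()
	defer r.mu.Unlock()
	return reaperWon, executorWon, r.status
}
```

A test would assert that exactly one of the two writers wins on every run and that the final status matches the winner. The two deterministic cases from the bullet above follow directly: `cas` on a row already in `completed` returns false (reaper loses), and `cas` on an untouched `approved` row returns true (executor is gone).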

## Related

- #621 — makes `approved`/`running` rows visible in History (so reaped rows are surfaced).
- #636 / #638 / #652 — idempotent commitment creation; ensures a manual retry of a reaped-but-actually-succeeded purchase doesn't double-buy.

## Why this matters

Without the reaper, every Lambda timeout / container kill / network hiccup leaves a permanent orphan row that the user must manually intervene on. With dozens of rows already in this state on the production deployment, it's a continuous source of confusion ("did my approve work? why is it still in progress?") and operator toil.

