fix(pool): reap orphaned assigned/failed pool_items (sweep #8) #44

Merged: mastermanas805 merged 1 commit into master from fix/pool-reap-orphaned-items on Jun 4, 2026

@mastermanas805

What

Adds a reaper to the hot-pool Manager (internal/pool/manager.go) so leaked pool_items no longer strand backing infra. Closes sweep-backlog finding #8 (P2).

On each maintenance tick reapStale now:

  1. Reaps failed items older than failedReapGrace (10m), bounded to reapBatchLimit (50) per pass: routes Deprovision through the resource-type backend using the row's pool_token as the naming token, then deletes the row. A Deprovision failure leaves the row for the next tick — the tracking row is never deleted before its infra is freed. Deprovision is idempotent (DROP ... IF EXISTS), so reaping an item whose infra is already gone is a safe no-op. A failed row (set only by Discard, before the item is ever returned to api) has by construction no owning resources row, so this is a pure leak with no live owner.

  2. Reports assigned items older than stuckAssignedGrace (30m) on the new instant_pool_stuck_assigned gauge — but does not deprovision them.
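The ordering in step 1 (free the infra first, delete the tracking row only on success) can be sketched in memory as below. This is a minimal illustration, not the code in internal/pool/manager.go: the `poolItem` struct, the `backend` interface, the `UpdatedAt` age column, and the `fakeBackend` are all hypothetical stand-ins, and the real reaper works against Postgres rather than a map.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Values quoted in the PR description.
const (
	failedReapGrace = 10 * time.Minute
	reapBatchLimit  = 50
)

// poolItem is a hypothetical stand-in for a pool_items row.
type poolItem struct {
	ResourceType string
	Status       string
	UpdatedAt    time.Time // assumption: the column used to measure age
}

// backend is a hypothetical per-resource-type deprovisioner.
type backend interface {
	Deprovision(poolToken string) error
}

// reapFailed frees infra for stale 'failed' items and only then deletes the
// row. items is keyed by pool_token. Rows whose Deprovision fails (or whose
// resource type is unknown) are left in place for the next pass.
func reapFailed(items map[string]poolItem, backends map[string]backend, now time.Time) []string {
	var reaped []string
	attempted := 0
	for token, it := range items {
		if attempted >= reapBatchLimit {
			break // bound the work done in a single pass
		}
		if it.Status != "failed" || now.Sub(it.UpdatedAt) < failedReapGrace {
			continue // never touches 'assigned' rows or rows inside the grace window
		}
		attempted++
		b, ok := backends[it.ResourceType]
		if !ok {
			continue // unknown type: keep the row rather than lose track of infra
		}
		if err := b.Deprovision(token); err != nil {
			continue // infra not freed, so the tracking row must survive
		}
		delete(items, token) // safe during range in Go
		reaped = append(reaped, token)
	}
	return reaped
}

// fakeBackend fails Deprovision for the tokens listed in failFor.
type fakeBackend struct{ failFor map[string]bool }

func (f fakeBackend) Deprovision(poolToken string) error {
	if f.failFor[poolToken] {
		return errors.New("simulated deprovision failure")
	}
	return nil
}

// demo runs one pass over four illustrative rows.
func demo() ([]string, int) {
	now := time.Now()
	old := now.Add(-time.Hour)
	items := map[string]poolItem{
		"tok-a": {"postgres", "failed", old},   // stale failed: reaped
		"tok-b": {"postgres", "failed", now},   // inside grace: kept
		"tok-c": {"postgres", "assigned", old}, // assigned: never reaped
		"tok-d": {"postgres", "failed", old},   // deprovision fails: kept
	}
	backends := map[string]backend{
		"postgres": fakeBackend{failFor: map[string]bool{"tok-d": true}},
	}
	reaped := reapFailed(items, backends, now)
	return reaped, len(items)
}

func main() {
	reaped, remaining := demo()
	fmt.Println(reaped, remaining)
}
```

Because the row outlives any failed Deprovision, the worst case is a retry on the next tick, which the idempotent Deprovision makes harmless.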

Why assigned is reported, not reaped (scope decision)

The finding asked to reap orphaned assigned items (crashed-claim). I verified this cannot be done safely from the provisioner:

  • The pool item lifecycle has only two status writers: Claim (→assigned) and Discard (→failed). There is no write-back when api successfully binds a claimed item to a resources row.
  • So from the provisioner's own DB, an orphaned assigned row is indistinguishable from one a live api request bound to a resources row. A bound item's infra is owned by that resource row and reaped by the worker's resource-TTL path. Deprovisioning by age here would destroy live customer infra — the truehomie-db DROP incident class.
  • pool_items lives in the provisioner's own standalone Postgres (PROVISIONER_DATABASE_URL); resources lives in platform_db. No current service has both, so a safe orphan-assigned anti-join reaper has no correct home today. The failed-drain half is squarely a provisioner job; the assigned half is surfaced as an operator signal + documented follow-up rather than forced unsafely.
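To make the last two bullets concrete, here is a rough sketch of the anti-join a safe assigned reaper would need. It is purely illustrative: `orphanedAssigned` is a hypothetical helper, and it assumes that resources rows record the pool token they were bound to, which this PR does not confirm. The point is only that the second input must come from platform_db, which the provisioner cannot read.

```go
package main

import (
	"fmt"
	"sort"
)

// orphanedAssigned returns the stale 'assigned' pool tokens that no
// resources row owns.
//
//	assigned: tokens of stale 'assigned' rows (provisioner DB)
//	bound:    tokens referenced by resources rows (platform_db; assumed shape)
func orphanedAssigned(assigned []string, bound map[string]bool) []string {
	var orphans []string
	for _, token := range assigned {
		if !bound[token] {
			orphans = append(orphans, token)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	assigned := []string{"tok-1", "tok-2", "tok-3"}
	bound := map[string]bool{"tok-1": true, "tok-3": true}

	// With both inputs, only the true orphan is selected.
	fmt.Println(orphanedAssigned(assigned, bound))

	// Without the bound set (the provisioner's view today), every stale
	// 'assigned' row looks like an orphan, including live customer infra.
	fmt.Println(orphanedAssigned(assigned, nil))
}
```

With an empty bound set the function flags every token, which is the age-only behavior this PR deliberately avoids.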

Metrics / Rule 25 follow-up

New metrics: instant_pool_reap_total{resource_type,status,outcome}, instant_pool_stuck_assigned{resource_type}.

The alert + dashboard tile + METRICS-CATALOG.md row mandated by rule 25 live in the infra repo, which is out of scope for this provisioner-only PR. Follow-up required in infra: Prom rule + NR alert (P2 observability: rising instant_pool_stuck_assigned = claim-path leak; non-zero instant_pool_reap_total{outcome="deprovision_err"} rate = wedged reaper), dashboard tile, catalog rows.
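On the gauge reset-to-zero behavior listed under Coverage: a per-type count typically returns nothing for a resource type with zero stuck items, so the gauge has to be written for every known type or it keeps its last non-zero value. A minimal sketch of that idea, with a hypothetical `stuckGaugeValues` helper and made-up resource type names (the real code presumably sets a Prometheus gauge rather than returning a map):

```go
package main

import "fmt"

// stuckGaugeValues returns the value to publish on
// instant_pool_stuck_assigned for every known resource type. Types missing
// from stuck (no stale 'assigned' rows this pass) are published as 0 so a
// previous non-zero reading does not linger.
func stuckGaugeValues(knownTypes []string, stuck map[string]int) map[string]float64 {
	out := make(map[string]float64, len(knownTypes))
	for _, t := range knownTypes {
		out[t] = float64(stuck[t]) // a missing key reads as 0
	}
	return out
}

func main() {
	known := []string{"postgres", "redis"} // illustrative type names
	// This pass found stuck items only for postgres.
	fmt.Println(stuckGaugeValues(known, map[string]int{"postgres": 2}))
}
```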

Coverage

Symptom:        pool_items stuck 'failed' (leaked infra) / 'assigned' (orphaned by crashed claim)
Enumeration:    rg -n "UPDATE pool_items|status =" internal/pool/manager.go  → 2 status writers (Claim, Discard); no reaper, no bound write-back
Sites found:    1 (the maintenance loop ticker arm in run())
Sites touched:  1 (reapStale wired into the ticker arm)
Coverage test:  TestReapFailed_ReapsOrphanedPastGrace (orphaned past grace IS reaped + correct pool_token);
                TestReapFailed_DeprovisionErrorLeavesRow (no orphaned infra on Deprovision failure);
                TestReapStale_NeverDeprovisionsAssigned (truehomie guard — assigned never deprovisioned);
                TestReapFailed_BatchBounded; gauge reset-to-zero; fakeDB/fakeRows seam for Query/Scan/Rows.Err/DELETE error arms
Live verified:  N/A pre-merge (provisioner is in-cluster gRPC, no public URL). reaper functions 100% covered,
                package 97.6%, full `go build ./... && go vet ./... && go test ./... -short -p 1` GREEN + -race clean
                against a local Postgres (DB-gated tests ran, not skipped). CI runs the same with TEST_PROVISIONER_DATABASE_URL set.

🤖 Generated with Claude Code

The hot-pool Manager had no reaper for pool_items left in 'failed' or
'assigned' state. A 'failed' item (Discard marked it unusable on the
provisioner-side claim path) leaks its backing infra forever: no
resources row owns it, so the worker's resource-TTL reaper never touches
db_pool-<uuid> / usr_pool-<uuid> / keyspace pool-<uuid>:*.

Adds a reaper on the maintenance ticker that, per pass:
  - deprovisions + deletes 'failed' rows older than failedReapGrace
    (10m), bounded to reapBatchLimit (50) per tick, routed through the
    resource-type backend with the pool_token as naming token; a
    Deprovision failure leaves the row for the next tick so infra is
    never orphaned by deleting its tracking row first (Deprovision is
    idempotent — DROP ... IF EXISTS);
  - reports 'assigned' rows older than stuckAssignedGrace (30m) on the
    new instant_pool_stuck_assigned gauge but does NOT deprovision them.

Why 'assigned' is reported, not reaped: from the provisioner's own DB an
orphaned (crashed-claim) 'assigned' row is indistinguishable from one a
live api request successfully bound to a resources row — there is no
write-back on a successful bind. The bound item's infra is owned by that
resources row and reaped by the worker's resource-TTL path;
deprovisioning it here would destroy live customer infra (the
truehomie-db DROP incident class). A safe orphan-'assigned' reaper needs
an anti-join against the resources table, which lives in a different
database than pool_items, so it cannot be done from the provisioner.

Adds the metrics instant_pool_reap_total{resource_type,status,outcome}
and instant_pool_stuck_assigned{resource_type}. The rule-25 alert,
dashboard, and catalog rows belong in the infra repo (out of scope for
this PR); see the PR description for the follow-up.

Tests: deprovisionBacking routing/unknown-type/error; DB-gated
reapFailed orphaned-past-grace IS reaped + fresh-inside-grace is NOT +
correct pool_token deprovisioned, deprovision-error-leaves-row,
batch-bound; reapStale NEVER deprovisions assigned; gauge reset-to-zero;
and a fakeDB/fakeRows seam covering the Query/Scan/Rows.Err/DELETE error
arms. internal/pool reaper functions at 100% coverage, package 97.6%,
-race clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 4, 2026 15:28
@mastermanas805 mastermanas805 merged commit abfb80a into master Jun 4, 2026
12 checks passed