fix(pool): reap orphaned assigned/failed pool_items (sweep #8) by mastermanas805 · Pull Request #44 · InstaNode-dev/provisioner

mastermanas805 · 2026-06-04T15:27:55Z

What

Adds a reaper to the hot-pool Manager (internal/pool/manager.go) so leaked pool_items no longer strand backing infra. Closes sweep-backlog finding #8 (P2).

On each maintenance tick reapStale now:

- Reaps failed items older than failedReapGrace (10m), bounded to reapBatchLimit (50) per pass: routes Deprovision through the resource-type backend using the row's pool_token as the naming token, then deletes the row. A Deprovision failure leaves the row for the next tick, so the tracking row is never deleted before its infra is freed. Deprovision is idempotent (DROP ... IF EXISTS), so reaping an item whose infra is already gone is a safe no-op. A failed row (set only by Discard, before the item is ever returned to api) has by construction no owning resources row, so this is a pure leak with no live owner.
- Reports assigned items older than stuckAssignedGrace (30m) on the new instant_pool_stuck_assigned gauge, but does not deprovision them.

Why `assigned` is reported, not reaped (scope decision)

The finding asked to reap orphaned assigned items (crashed-claim). I verified this cannot be done safely from the provisioner:

- The pool item lifecycle has only two status writers: Claim (→assigned) and Discard (→failed). There is no write-back when api successfully binds a claimed item to a resources row.
- So from the provisioner's own DB, an orphaned assigned row is indistinguishable from one a live api request bound to a resources row. A bound item's infra is owned by that resources row and reaped by the worker's resource-TTL path. Deprovisioning by age here would destroy live customer infra, which is the truehomie-db DROP incident class.
- pool_items lives in the provisioner's own standalone Postgres (PROVISIONER_DATABASE_URL); resources lives in platform_db. No current service has both, so a safe orphan-assigned anti-join reaper has no correct home today. The failed-drain half is squarely a provisioner job; the assigned half is surfaced as an operator signal + documented follow-up rather than forced unsafely.

Metrics / Rule 25 follow-up

New metrics: instant_pool_reap_total{resource_type,status,outcome}, instant_pool_stuck_assigned{resource_type}.

The alert + dashboard tile + METRICS-CATALOG.md row mandated by rule 25 live in the infra repo, which is out of scope for this provisioner-only PR. Follow-up required in infra: Prom rule + NR alert (P2 observability: rising instant_pool_stuck_assigned = claim-path leak; non-zero instant_pool_reap_total{outcome="deprovision_err"} rate = wedged reaper), dashboard tile, catalog rows.

Coverage

Symptom:        pool_items stuck 'failed' (leaked infra) / 'assigned' (orphaned by crashed claim)
Enumeration:    rg -n "UPDATE pool_items|status =" internal/pool/manager.go  → 2 status writers (Claim, Discard); no reaper, no bound write-back
Sites found:    1 (the maintenance loop ticker arm in run())
Sites touched:  1 (reapStale wired into the ticker arm)
Coverage test:  TestReapFailed_ReapsOrphanedPastGrace (orphaned past grace IS reaped + correct pool_token);
                TestReapFailed_DeprovisionErrorLeavesRow (no orphaned infra on Deprovision failure);
                TestReapStale_NeverDeprovisionsAssigned (truehomie guard — assigned never deprovisioned);
                TestReapFailed_BatchBounded; gauge reset-to-zero; fakeDB/fakeRows seam for Query/Scan/Rows.Err/DELETE error arms
Live verified:  N/A pre-merge (provisioner is in-cluster gRPC, no public URL). reaper functions 100% covered,
                package 97.6%, full `go build ./... && go vet ./... && go test ./... -short -p 1` GREEN + -race clean
                against a local Postgres (DB-gated tests ran, not skipped). CI runs the same with TEST_PROVISIONER_DATABASE_URL set.

🤖 Generated with Claude Code

The hot-pool Manager had no reaper for pool_items left in 'failed' or 'assigned' state. A 'failed' item (Discard marked it unusable on the provisioner-side claim path) leaks its backing infra forever: no resources row owns it, so the worker's resource-TTL reaper never touches db_pool-<uuid> / usr_pool-<uuid> / keyspace pool-<uuid>:*.

Adds a reaper on the maintenance ticker that, per pass:

- deprovisions + deletes 'failed' rows older than failedReapGrace (10m), bounded to reapBatchLimit (50) per tick, routed through the resource-type backend with the pool_token as naming token; a Deprovision failure leaves the row for the next tick so infra is never orphaned by deleting its tracking row first (Deprovision is idempotent: DROP ... IF EXISTS);
- reports 'assigned' rows older than stuckAssignedGrace (30m) on the new instant_pool_stuck_assigned gauge but does NOT deprovision them.

Why 'assigned' is reported, not reaped: from the provisioner's own DB an orphaned (crashed-claim) 'assigned' row is indistinguishable from one a live api request successfully bound to a resources row, because there is no write-back on a successful bind. The bound item's infra is owned by that resources row and reaped by the worker's resource-TTL path; deprovisioning it here would destroy live customer infra (the truehomie-db DROP incident class). A safe orphan-'assigned' reaper needs an anti-join against the resources table, which lives in a different database than pool_items, so it cannot be done from the provisioner.

Metrics: instant_pool_reap_total{resource_type,status,outcome} + instant_pool_stuck_assigned{resource_type}. Rule-25 alert + dashboard + catalog rows belong in the infra repo (out of scope for this PR); see the PR description for the follow-up.

Tests: deprovisionBacking routing/unknown-type/error; DB-gated reapFailed orphaned-past-grace IS reaped + fresh-inside-grace is NOT + correct pool_token deprovisioned, deprovision-error-leaves-row, batch-bound; reapStale NEVER deprovisions assigned; gauge reset-to-zero; and a fakeDB/fakeRows seam covering the Query/Scan/Rows.Err/DELETE error arms. internal/pool reaper functions at 100% coverage, package 97.6%, -race clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mastermanas805 enabled auto-merge (squash) June 4, 2026 15:28

mastermanas805 merged commit abfb80a into master Jun 4, 2026
12 checks passed

mastermanas805 merged 1 commit into master from fix/pool-reap-orphaned-items
