fix(pool): reap orphaned assigned/failed pool_items (sweep #8)#44
Merged
Conversation
The hot-pool Manager had no reaper for pool_items left in 'failed' or
'assigned' state. A 'failed' item (Discard marked it unusable on the
provisioner-side claim path) leaks its backing infra forever: no
resources row owns it, so the worker's resource-TTL reaper never touches
db_pool-<uuid> / usr_pool-<uuid> / keyspace pool-<uuid>:*.
Adds a reaper on the maintenance ticker that, per pass:
- deprovisions + deletes 'failed' rows older than failedReapGrace
(10m), bounded to reapBatchLimit (50) per tick, routed through the
resource-type backend with the pool_token as naming token; a
Deprovision failure leaves the row for the next tick so infra is
never orphaned by deleting its tracking row first (Deprovision is
idempotent — DROP ... IF EXISTS);
- reports 'assigned' rows older than stuckAssignedGrace (30m) on the
new instant_pool_stuck_assigned gauge but does NOT deprovision them.
Why 'assigned' is reported, not reaped: from the provisioner's own DB an
orphaned (crashed-claim) 'assigned' row is indistinguishable from one a
live api request successfully bound to a resources row — there is no
write-back on a successful bind. The bound item's infra is owned by that
resources row and reaped by the worker's resource-TTL path;
deprovisioning it here would destroy live customer infra (the
truehomie-db DROP incident class). A safe orphan-'assigned' reaper needs
an anti-join against the resources table, which lives in a different
database than pool_items, so it cannot be done from the provisioner.
Metric instant_pool_reap_total{resource_type,status,outcome} +
instant_pool_stuck_assigned{resource_type}. Rule-25 alert + dashboard +
catalog rows belong in the infra repo (out of scope for this PR) — see
PR description for the follow-up.
Tests: deprovisionBacking routing/unknown-type/error; DB-gated
reapFailed orphaned-past-grace IS reaped + fresh-inside-grace is NOT +
correct pool_token deprovisioned, deprovision-error-leaves-row,
batch-bound; reapStale NEVER deprovisions assigned; gauge reset-to-zero;
and a fakeDB/fakeRows seam covering the Query/Scan/Rows.Err/DELETE error
arms. internal/pool reaper functions at 100% coverage, package 97.6%,
-race clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds a reaper to the hot-pool
Manager(internal/pool/manager.go) so leakedpool_itemsno longer strand backing infra. Closes sweep-backlog finding #8 (P2).On each maintenance tick
reapStalenow:Reaps
faileditems older thanfailedReapGrace(10m), bounded toreapBatchLimit(50) per pass: routesDeprovisionthrough the resource-type backend using the row'spool_tokenas the naming token, then deletes the row. ADeprovisionfailure leaves the row for the next tick — the tracking row is never deleted before its infra is freed.Deprovisionis idempotent (DROP ... IF EXISTS), so reaping an item whose infra is already gone is a safe no-op. Afailedrow (set only byDiscard, before the item is ever returned to api) has by construction no owningresourcesrow, so this is a pure leak with no live owner.Reports
assigneditems older thanstuckAssignedGrace(30m) on the newinstant_pool_stuck_assignedgauge — but does not deprovision them.Why
assignedis reported, not reaped (scope decision)The finding asked to reap orphaned
assigneditems (crashed-claim). I verified this cannot be done safely from the provisioner:Claim(→assigned) andDiscard(→failed). There is no write-back when api successfully binds a claimed item to aresourcesrow.assignedrow is indistinguishable from one a live api request bound to aresourcesrow. A bound item's infra is owned by that resource row and reaped by the worker's resource-TTL path. Deprovisioning by age here would destroy live customer infra — thetruehomie-dbDROP incident class.pool_itemslives in the provisioner's own standalone Postgres (PROVISIONER_DATABASE_URL);resourceslives inplatform_db. No current service has both, so a safe orphan-assignedanti-join reaper has no correct home today. Thefailed-drain half is squarely a provisioner job; theassignedhalf is surfaced as an operator signal + documented follow-up rather than forced unsafely.Metrics / Rule 25 follow-up
New metrics:
instant_pool_reap_total{resource_type,status,outcome},instant_pool_stuck_assigned{resource_type}.The alert + dashboard tile +
METRICS-CATALOG.mdrow mandated by rule 25 live in the infra repo, which is out of scope for this provisioner-only PR. Follow-up required in infra: Prom rule + NR alert (P2 observability: risinginstant_pool_stuck_assigned= claim-path leak; non-zeroinstant_pool_reap_total{outcome="deprovision_err"}rate = wedged reaper), dashboard tile, catalog rows.Coverage
🤖 Generated with Claude Code