feat: add RegradeResource RPC — re-apply tier connection caps to live Postgres roles #8
Merged
A plan upgrade flips resources.tier but never re-applies the HARD
infrastructure limits baked at provision time — the Postgres role
CONNECTION LIMIT in particular. RegradeResource closes that gap:
* server.go — RegradeResource handler dispatches to the postgres
backend; non-postgres types return applied=false + skip_reason.
* backend/postgres — Regrade() ALTERs the role CONNECTION LIMIT to
the tier-entitled cap from instant.dev/common/plans. K8sBackend
is fail-soft when the customer pod is unreachable (returns
applied=false so the caller retries on the next sweep).
Idempotent: re-applying the same limit is a no-op. 3 new server tests.
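The dispatch and fail-soft behaviour described above could look roughly like the following sketch. It is not the PR's code: `connLimitByTier`, `errUnreachable`, `execFunc`, and `regrade` are hypothetical names, the cap table stands in for instant.dev/common/plans (only hobby=8 and pro=20 are taken from the PR's verification note), and "redis" is used purely as an example of a non-postgres type.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// connLimitByTier is a hypothetical stand-in for instant.dev/common/plans.
// hobby=8 and pro=20 come from the PR's verification note; the real table may differ.
var connLimitByTier = map[string]int{"hobby": 8, "pro": 20}

// errUnreachable is a hypothetical sentinel for "customer pod could not be reached".
var errUnreachable = errors.New("customer pod unreachable")

// RegradeResult mirrors the applied / skip_reason fields named in the commit message.
type RegradeResult struct {
	Applied    bool
	SkipReason string
}

// execFunc abstracts "run this statement against the customer's Postgres".
type execFunc func(sql string) error

// quoteIdent double-quotes a Postgres identifier and doubles embedded quotes.
func quoteIdent(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// alterRoleSQL builds the statement text. The role name is quoted as an
// identifier and the limit is an int, so no caller-supplied text is inlined raw.
func alterRoleSQL(role string, limit int) string {
	return fmt.Sprintf("ALTER ROLE %s CONNECTION LIMIT %d", quoteIdent(role), limit)
}

// regrade re-applies the tier cap. Non-postgres types are skipped, and an
// unreachable pod yields applied=false with a nil error so a sweep can continue.
func regrade(resourceType, tier, role string, exec execFunc) (RegradeResult, error) {
	if resourceType != "postgres" {
		return RegradeResult{SkipReason: "no DB-level cap for " + resourceType}, nil
	}
	limit, ok := connLimitByTier[tier]
	if !ok {
		return RegradeResult{}, fmt.Errorf("unknown tier %q", tier)
	}
	if err := exec(alterRoleSQL(role, limit)); err != nil {
		if errors.Is(err, errUnreachable) {
			return RegradeResult{SkipReason: err.Error()}, nil // retried on the next sweep
		}
		return RegradeResult{}, err
	}
	return RegradeResult{Applied: true}, nil
}

func main() {
	exec := func(sql string) error { fmt.Println("exec:", sql); return nil }
	res, err := regrade("postgres", "pro", "usr_example", exec)
	fmt.Println(res, err)
}
```

Because the statement depends only on the role and the tier's cap, running it twice with the same inputs sets the same limit, which is what makes a repeated sweep harmless.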
Phase 1 of the entitlement re-grade work. Requires the matching proto
change (RegradeResource RPC) — CI builds check out proto@master, so
the proto PR must merge before this one's CI goes green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request on Jun 4, 2026
The hot-pool Manager had no reaper for pool_items left in 'failed' or
'assigned' state. A 'failed' item (Discard marked it unusable on the
provisioner-side claim path) leaks its backing infra forever: no
resources row owns it, so the worker's resource-TTL reaper never touches
db_pool-<uuid> / usr_pool-<uuid> / keyspace pool-<uuid>:*.
Adds a reaper on the maintenance ticker that, per pass:
- deprovisions + deletes 'failed' rows older than failedReapGrace
(10m), bounded to reapBatchLimit (50) per tick, routed through the
resource-type backend with the pool_token as naming token; a
Deprovision failure leaves the row for the next tick so infra is
never orphaned by deleting its tracking row first (Deprovision is
idempotent — DROP ... IF EXISTS);
- reports 'assigned' rows older than stuckAssignedGrace (30m) on the
new instant_pool_stuck_assigned gauge but does NOT deprovision them.
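A minimal in-memory sketch of one reaper pass as described in the two bullets above. The constants use the values quoted in the commit message; everything else (`poolItem`, its `UpdatedAt` field, `deprovisionFunc`, `reapPass`, `demo`) is a hypothetical illustration, and the real code works against the pool_items table rather than a slice. The batch bound here counts deprovision attempts, which is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	failedReapGrace    = 10 * time.Minute
	stuckAssignedGrace = 30 * time.Minute
	reapBatchLimit     = 50
)

// poolItem is a hypothetical in-memory stand-in for a pool_items row.
// Which timestamp the grace periods are measured from is an assumption.
type poolItem struct {
	PoolToken    string
	ResourceType string
	Status       string
	UpdatedAt    time.Time
}

// deprovisionFunc routes to the resource-type backend, using the pool token
// as the naming token.
type deprovisionFunc func(resourceType, poolToken string) error

// reapPass returns the rows that remain after one pass plus the number of
// stuck 'assigned' rows per resource type (the gauge value).
func reapPass(items []poolItem, now time.Time, deprovision deprovisionFunc) ([]poolItem, map[string]int) {
	stuck := map[string]int{}
	kept := make([]poolItem, 0, len(items))
	attempts := 0
	for _, it := range items {
		age := now.Sub(it.UpdatedAt)
		switch {
		case it.Status == "failed" && age > failedReapGrace && attempts < reapBatchLimit:
			attempts++
			if err := deprovision(it.ResourceType, it.PoolToken); err != nil {
				// Keep the tracking row so the next tick can retry.
				kept = append(kept, it)
			}
			// On success the row is dropped, and only after the infra is gone.
		case it.Status == "assigned" && age > stuckAssignedGrace:
			stuck[it.ResourceType]++ // reported, never deprovisioned
			kept = append(kept, it)
		default:
			kept = append(kept, it)
		}
	}
	return kept, stuck
}

// demo runs one pass over four illustrative rows.
func demo() ([]poolItem, map[string]int) {
	now := time.Now()
	items := []poolItem{
		{"pool-a", "postgres", "failed", now.Add(-11 * time.Minute)}, // past grace
		{"pool-b", "postgres", "failed", now.Add(-1 * time.Minute)},  // inside grace
		{"pool-c", "redis", "failed", now.Add(-time.Hour)},           // deprovision fails
		{"pool-d", "postgres", "assigned", now.Add(-time.Hour)},      // stuck assigned
	}
	deprovision := func(resourceType, poolToken string) error {
		if poolToken == "pool-c" {
			return errors.New("backend down")
		}
		return nil
	}
	return reapPass(items, now, deprovision)
}

func main() {
	kept, stuck := demo()
	for _, it := range kept {
		fmt.Println("kept:", it.PoolToken, it.Status)
	}
	fmt.Println("stuck assigned:", stuck)
}
```

The ordering inside the 'failed' branch is the point of the sketch: the row disappears only after Deprovision succeeds, so a failure never leaves infra without a tracking row.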
Why 'assigned' is reported, not reaped: from the provisioner's own DB an
orphaned (crashed-claim) 'assigned' row is indistinguishable from one a
live api request successfully bound to a resources row — there is no
write-back on a successful bind. The bound item's infra is owned by that
resources row and reaped by the worker's resource-TTL path;
deprovisioning it here would destroy live customer infra (the
truehomie-db DROP incident class). A safe orphan-'assigned' reaper needs
an anti-join against the resources table, which lives in a different
database than pool_items, so it cannot be done from the provisioner.
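To make the anti-join concrete, here is a small sketch of the check a component with access to both databases would need to run. It assumes, hypothetically, that a resources row records the pool token it was bound from; the commit message does not say what the join key is, and `orphanedAssigned` is an illustrative name only.

```go
package main

import "fmt"

// orphanedAssigned returns the 'assigned' pool tokens that no resources row
// references. boundTokens would be loaded from the resources database and
// assignedTokens from pool_items; both inputs are hypothetical here.
func orphanedAssigned(assignedTokens []string, boundTokens map[string]struct{}) []string {
	var orphans []string
	for _, t := range assignedTokens {
		if _, bound := boundTokens[t]; !bound {
			orphans = append(orphans, t)
		}
	}
	return orphans
}

func main() {
	assigned := []string{"pool-1", "pool-2", "pool-3"}
	bound := map[string]struct{}{"pool-2": {}}
	fmt.Println(orphanedAssigned(assigned, bound))
}
```

Only tokens absent from the bound set are safe candidates for deprovisioning; a token present in it belongs to live customer infra.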
Metric instant_pool_reap_total{resource_type,status,outcome} +
instant_pool_stuck_assigned{resource_type}. Rule-25 alert + dashboard +
catalog rows belong in the infra repo (out of scope for this PR) — see
PR description for the follow-up.
Tests: deprovisionBacking routing/unknown-type/error; DB-gated
reapFailed orphaned-past-grace IS reaped + fresh-inside-grace is NOT +
correct pool_token deprovisioned, deprovision-error-leaves-row,
batch-bound; reapStale NEVER deprovisions assigned; gauge reset-to-zero;
and a fakeDB/fakeRows seam covering the Query/Scan/Rows.Err/DELETE error
arms. internal/pool reaper functions at 100% coverage, package 97.6%,
-race clean.
Co-authored-by: Manas Srivastava <[email protected]>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
- RegradeResource gRPC RPC: re-applies a resource's tier-derived Postgres role CONNECTION LIMIT to its live, already-provisioned infrastructure.
- K8sBackend.Regrade runs ALTER ROLE … CONNECTION LIMIT, capping at the tier-entitled value from instant.dev/common/plans. Fail-soft: an unreachable customer pod returns applied=false so the caller retries next sweep; one bad pod never aborts a fleet sweep.
- Non-postgres types return applied=false + skip_reason (no DB-level cap).

Phase 1 of the entitlement re-grade work: fixes "upgrade drift", where a plan upgrade flips resources.tier but the baked role connection cap is never re-applied.

CI checks out proto@master as a sibling. This PR's CI will be red until proto#2 merges (it needs the new RegradeResource RPC). Merge proto first, then re-run CI here.

Verified
Integration-tested live on the prod cluster (2026-05-15): a hobby→pro upgrade re-graded a real Postgres role from 8 → 20 connections via this RPC; idempotent on the next sweep.

Test plan
- go test ./... -short green locally
- … master-<sha>, replacing the phase1-regrade pre-release image

🤖 Generated with Claude Code