feat: add RegradeResource RPC — re-apply tier connection caps to live Postgres roles by mastermanas805 · Pull Request #8 · InstaNode-dev/provisioner

mastermanas805 · 2026-05-15T18:49:05Z

Summary

Implements the RegradeResource gRPC RPC: re-applies a resource's tier-derived Postgres role CONNECTION LIMIT to its live, already-provisioned infrastructure.
* K8sBackend.Regrade runs ALTER ROLE … CONNECTION LIMIT, capping at the tier-entitled value from instant.dev/common/plans. Fail-soft: an unreachable customer pod returns applied=false so the caller retries on the next sweep, and one bad pod never aborts a fleet sweep.
* Non-postgres resource types return applied=false + skip_reason (no DB-level cap).
* 3 new server tests.

Phase 1 of the entitlement re-grade work — fixes "upgrade drift" where a plan upgrade flips resources.tier but the baked role connection cap is never re-applied.

⚠️ Dependency order

CI checks out proto@master as a sibling. This PR's CI will be red until proto#2 merges (it needs the new RegradeResource RPC). Merge proto first, then re-run CI here.

Verified

Integration-tested live on the prod cluster (2026-05-15): hobby→pro upgrade re-graded a real Postgres role 8 → 20 connections via this RPC; idempotent on the next sweep.
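
As an illustration of the statement behind the 8 → 20 re-grade above, here is a hedged sketch of building the ALTER ROLE text. PostgreSQL does not accept bind parameters for the role name in ALTER ROLE, so the identifier has to be quoted into the statement. The `tierConnLimit` map is an illustrative stand-in for instant.dev/common/plans and contains only the two values mentioned in this PR; `quoteIdent` and `alterRoleSQL` are hypothetical helpers, not the backend's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// tierConnLimit is an illustrative stand-in for instant.dev/common/plans.
// Only the hobby and pro values appear in this PR's verification note.
var tierConnLimit = map[string]int{"hobby": 8, "pro": 20}

// quoteIdent double-quotes a Postgres identifier, doubling embedded quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// alterRoleSQL builds the statement for a tier; ok is false for unknown tiers.
func alterRoleSQL(role, tier string) (stmt string, ok bool) {
	limit, ok := tierConnLimit[tier]
	if !ok {
		return "", false
	}
	return fmt.Sprintf("ALTER ROLE %s CONNECTION LIMIT %d", quoteIdent(role), limit), true
}

func main() {
	stmt, _ := alterRoleSQL("usr_example", "pro")
	fmt.Println(stmt) // ALTER ROLE "usr_example" CONNECTION LIMIT 20
}
```

Because the statement sets an absolute limit rather than adjusting a delta, running it again on the next sweep leaves the role unchanged, which matches the idempotency noted above.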

Test plan

go test ./... -short green locally
CI green after proto#2 merges
Merge auto-deploys master-<sha>, replacing the phase1-regrade pre-release image

🤖 Generated with Claude Code

… Postgres roles

A plan upgrade flips resources.tier but never re-applies the HARD infrastructure limits baked at provision time, the Postgres role CONNECTION LIMIT in particular. RegradeResource closes that gap:

* server.go: RegradeResource handler dispatches to the postgres backend; non-postgres types return applied=false + skip_reason.
* backend/postgres: Regrade() ALTERs the role CONNECTION LIMIT to the tier-entitled cap from instant.dev/common/plans. K8sBackend is fail-soft when the customer pod is unreachable (returns applied=false so the caller retries on the next sweep). Idempotent: re-applying the same limit is a no-op.

3 new server tests. Phase 1 of the entitlement re-grade work.

Requires the matching proto change (RegradeResource RPC). CI builds check out proto@master, so the proto PR must merge before this one's CI goes green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The hot-pool Manager had no reaper for pool_items left in 'failed' or 'assigned' state. A 'failed' item (Discard marked it unusable on the provisioner-side claim path) leaks its backing infra forever: no resources row owns it, so the worker's resource-TTL reaper never touches db_pool-<uuid> / usr_pool-<uuid> / keyspace pool-<uuid>:*.

Adds a reaper on the maintenance ticker that, per pass:
- deprovisions + deletes 'failed' rows older than failedReapGrace (10m), bounded to reapBatchLimit (50) per tick, routed through the resource-type backend with the pool_token as naming token; a Deprovision failure leaves the row for the next tick so infra is never orphaned by deleting its tracking row first (Deprovision is idempotent: DROP ... IF EXISTS);
- reports 'assigned' rows older than stuckAssignedGrace (30m) on the new instant_pool_stuck_assigned gauge but does NOT deprovision them.

Why 'assigned' is reported, not reaped: from the provisioner's own DB an orphaned (crashed-claim) 'assigned' row is indistinguishable from one a live api request successfully bound to a resources row, because there is no write-back on a successful bind. The bound item's infra is owned by that resources row and reaped by the worker's resource-TTL path; deprovisioning it here would destroy live customer infra (the truehomie-db DROP incident class). A safe orphan-'assigned' reaper needs an anti-join against the resources table, which lives in a different database than pool_items, so it cannot be done from the provisioner.

Metric instant_pool_reap_total{resource_type,status,outcome} + instant_pool_stuck_assigned{resource_type}. Rule-25 alert + dashboard + catalog rows belong in the infra repo (out of scope for this PR); see PR description for the follow-up.

Tests: deprovisionBacking routing/unknown-type/error; DB-gated reapFailed orphaned-past-grace IS reaped + fresh-inside-grace is NOT + correct pool_token deprovisioned, deprovision-error-leaves-row, batch-bound; reapStale NEVER deprovisions assigned; gauge reset-to-zero; and a fakeDB/fakeRows seam covering the Query/Scan/Rows.Err/DELETE error arms. internal/pool reaper functions at 100% coverage, package 97.6%, -race clean.

Co-authored-by: Manas Srivastava <[email protected]>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mastermanas805 mentioned this pull request May 15, 2026

feat(jobs): add entitlement reconciler — fix plan-upgrade connection-cap drift InstaNode-dev/worker#33

Merged

3 tasks

mastermanas805 merged commit 339a322 into master May 15, 2026

mastermanas805 mentioned this pull request Jun 4, 2026

fix(pool): reap orphaned assigned/failed pool_items (sweep #8) #44

Merged

mastermanas805 merged 1 commit into master from phase1-resource-regrade
