feat: add RegradeResource RPC — re-apply tier connection caps to live Postgres roles#8

Merged
mastermanas805 merged 1 commit into master from phase1-resource-regrade on May 15, 2026
@mastermanas805

Summary

  • Implements the RegradeResource gRPC RPC: re-applies a resource's tier-derived Postgres role CONNECTION LIMIT to its live, already-provisioned infrastructure.
  • K8sBackend.Regrade runs ALTER ROLE … CONNECTION LIMIT, capping at the tier-entitled value from instant.dev/common/plans. Fail-soft: an unreachable customer pod returns applied=false so the caller retries next sweep — one bad pod never aborts a fleet sweep.
  • Non-postgres resource types return applied=false + skip_reason (no DB-level cap).
  • 3 new server tests.
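
As a rough illustration of the statement the summary describes, the sketch below builds an ALTER ROLE … CONNECTION LIMIT command from a tier name. It is not the PR's code: `tierConnLimit`, `quoteIdent`, `regradeSQL`, and the role name `usr_example` are hypothetical, and the map only stands in for the lookup in instant.dev/common/plans. The hobby and pro values are taken from the verification note in this description; any other tier is unknown here.

```go
package main

import (
	"fmt"
	"strings"
)

// tierConnLimit is a hypothetical stand-in for the plans lookup.
// Only the hobby (8) and pro (20) values appear in the PR description.
var tierConnLimit = map[string]int{"hobby": 8, "pro": 20}

// quoteIdent double-quotes a Postgres identifier and doubles any
// embedded double quotes, so the role name cannot break out of it.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// regradeSQL returns the statement that re-applies the tier's cap.
// ALTER ROLE is a utility statement and takes no bind parameters,
// so the identifier is quoted and the limit is formatted as an integer.
func regradeSQL(role, tier string) (string, error) {
	limit, ok := tierConnLimit[tier]
	if !ok {
		return "", fmt.Errorf("unknown tier %q", tier)
	}
	return fmt.Sprintf("ALTER ROLE %s CONNECTION LIMIT %d", quoteIdent(role), limit), nil
}

func main() {
	stmt, err := regradeSQL("usr_example", "pro")
	if err != nil {
		panic(err)
	}
	fmt.Println(stmt) // ALTER ROLE "usr_example" CONNECTION LIMIT 20
}
```

Running the same statement twice leaves the role unchanged, which is consistent with the idempotency noted in this PR.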

Phase 1 of the entitlement re-grade work — fixes "upgrade drift" where a plan upgrade flips resources.tier but the baked role connection cap is never re-applied.

⚠️ Dependency order

CI checks out proto@master as a sibling. This PR's CI will be red until proto#2 merges (it needs the new RegradeResource RPC). Merge proto first, then re-run CI here.

Verified

Integration-tested live on the prod cluster (2026-05-15): hobby→pro upgrade re-graded a real Postgres role 8 → 20 connections via this RPC; idempotent on the next sweep.

Test plan

  • go test ./... -short green locally
  • CI green after proto#2 merges
  • Merge auto-deploys master-<sha>, replacing the phase1-regrade pre-release image

🤖 Generated with Claude Code

… Postgres roles

A plan upgrade flips resources.tier but never re-applies the HARD
infrastructure limits baked at provision time — the Postgres role
CONNECTION LIMIT in particular. RegradeResource closes that gap:

  * server.go — RegradeResource handler dispatches to the postgres
    backend; non-postgres types return applied=false + skip_reason.
  * backend/postgres — Regrade() ALTERs the role CONNECTION LIMIT to
    the tier-entitled cap from instant.dev/common/plans. K8sBackend
    is fail-soft when the customer pod is unreachable (returns
    applied=false so the caller retries on the next sweep).

Idempotent: re-applying the same limit is a no-op. 3 new server tests.

Phase 1 of the entitlement re-grade work. Requires the matching proto
change (RegradeResource RPC) — CI builds check out proto@master, so
the proto PR must merge before this one's CI goes green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 339a322 into master May 15, 2026
mastermanas805 added a commit that referenced this pull request Jun 4, 2026
The hot-pool Manager had no reaper for pool_items left in 'failed' or
'assigned' state. A 'failed' item (Discard marked it unusable on the
provisioner-side claim path) leaks its backing infra forever: no
resources row owns it, so the worker's resource-TTL reaper never touches
db_pool-<uuid> / usr_pool-<uuid> / keyspace pool-<uuid>:*.

Adds a reaper on the maintenance ticker that, per pass:
  - deprovisions + deletes 'failed' rows older than failedReapGrace
    (10m), bounded to reapBatchLimit (50) per tick, routed through the
    resource-type backend with the pool_token as naming token; a
    Deprovision failure leaves the row for the next tick so infra is
    never orphaned by deleting its tracking row first (Deprovision is
    idempotent — DROP ... IF EXISTS);
  - reports 'assigned' rows older than stuckAssignedGrace (30m) on the
    new instant_pool_stuck_assigned gauge but does NOT deprovision them.

Why 'assigned' is reported, not reaped: from the provisioner's own DB an
orphaned (crashed-claim) 'assigned' row is indistinguishable from one a
live api request successfully bound to a resources row — there is no
write-back on a successful bind. The bound item's infra is owned by that
resources row and reaped by the worker's resource-TTL path;
deprovisioning it here would destroy live customer infra (the
truehomie-db DROP incident class). A safe orphan-'assigned' reaper needs
an anti-join against the resources table, which lives in a different
database than pool_items, so it cannot be done from the provisioner.

Metric instant_pool_reap_total{resource_type,status,outcome} +
instant_pool_stuck_assigned{resource_type}. Rule-25 alert + dashboard +
catalog rows belong in the infra repo (out of scope for this PR) — see
PR description for the follow-up.

Tests: deprovisionBacking routing/unknown-type/error; DB-gated
reapFailed orphaned-past-grace IS reaped + fresh-inside-grace is NOT +
correct pool_token deprovisioned, deprovision-error-leaves-row,
batch-bound; reapStale NEVER deprovisions assigned; gauge reset-to-zero;
and a fakeDB/fakeRows seam covering the Query/Scan/Rows.Err/DELETE error
arms. internal/pool reaper functions at 100% coverage, package 97.6%,
-race clean.

Co-authored-by: Manas Srivastava <[email protected]>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>