feat: v1.5.0 Manager — MT cache + push hot-reload#13
Merged
Conversation
Introduce the v1.5.0 Manager surface: a per-tenant in-process cache + push hot-reload changefeed that closes the ST↔MT asymmetry left behind by v1.4.0. This slice ships the scaffolding only — types compile, the public API is reachable, and the binding is wired through Client.BindManager — but no behaviour changes for existing callers. Callers that do NOT bind a Manager observe identical v1.4.0 behaviour. Layout: - internal/manager/ — Manager type, per-tenant state, lifecycle handler stubs, cache lookup stubs, callback registry, telemetry stubs - internal/client/manager_binding.go — Client.BindManager method + ClientHooks adapter so the Manager can read the registry and lifecycle ctx without importing internal/client (avoids cycle) - manager.go (root) — public Manager type alias, NewManager constructor, WithManagerLogger / WithManagerTelemetry / WithManagerAggregateTenantThreshold Design reference: docs/v1.5.0-manager-design.md
Wire the multi-tenant Client.Get path through the bound Manager's per-tenant cache. On miss the path falls through to the existing tenant-DB read; on DB-success the Manager populates the cache so subsequent Gets bypass the DB once the tenant has been activated (slice 3 wires activation). - internal/manager/get.go: Lookup / Populate / Invalidate against the per-tenant cache, gated on tenant existence in perTenant. Empty tenantID or unactivated tenant misses gracefully so v1.4.0 behaviour is preserved. - internal/manager/get.go: TenantIDFromContext helper wraps tmcore.GetTenantIDContext so the Client side can extract the tenant from ctx without importing tmcore. - internal/client/get.go: MT Get path consults manager.Lookup first; falls back to store.Get on miss; calls manager.Populate on a successful DB read. - internal/manager/manager_test.go: unit tests for Lookup/Populate semantics, Drain idempotency, lifecycle no-ops for empty tenants / closed manager, nil-receiver safety. - internal/client/manager_binding_test.go: backward-compat tests pinning ErrNotSupportedInMultiTenant when no Manager is bound, idempotency of BindManager, nil-safety, and ST mode never routing through the Manager. Without a bound Manager Client.Get behaviour is unchanged from v1.4.0.
OnTenantActivated now performs the full bootstrap sequence for an activated
tenant:
1. Resolve the tenant DB via tmpostgres.Manager.GetConnection
2. Run idempotent schema DDL (CREATE TABLE IF NOT EXISTS + NOTIFY triggers).
The DDL is byte-compatible with internal/postgres/postgres_schema.go so
a tenant DB bootstrapped by either path is interchangeable.
3. Seed every registered key's default value with
INSERT ... ON CONFLICT (namespace, key) DO NOTHING — never overwrites
operator-set values.
4. Warm-load every registered key into the per-tenant cache.
OnTenantDeleted removes the per-tenant state and records the deactivation.
OnTenantSuspended marks the cache stale (entries kept for fast reactivation).
OnTenantCredentialsRotated routes through delete + activate (slice 4 will
replace this with a real LISTEN restart).
Schema bootstrap and seeding live in internal/manager/schema.go. Slice 4
will add the per-tenant LISTEN goroutine that consumes the NOTIFY events the
triggers emit.
Lifecycle handlers stay non-blocking: nil pgMgr / unbound hooks /
empty tenant ID / closed Manager → no-op. v1.4.0 callers without a bound
Manager are unaffected.
Each active tenant now owns one dedicated pgx LISTEN connection running against its primary database. NOTIFYs are decoded, the per-tenant cache is updated (or evicted on delete), and OnChange callbacks fire. - internal/manager/listen.go: startListen acquires the tenant DSN via tmpostgres.Manager.GetConnection, opens pgx LISTEN, then launches a goroutine via lib-observability/runtime.SafeGo with the Client's lifecycle ctx so shutdown propagates cleanly. - Reconnect: exponential backoff (base 500ms, capped 30s) with jitter, retries forever per design. After 3 consecutive failures the cache is marked stale so reads fall through to the DB until the next successful reconnect. - Upsert NOTIFYs trigger a fresh tenant-DB read (readSingle) so the cache reflects committed state. Delete NOTIFYs evict directly. - OnTenantSuspended cancels the goroutine and marks stale. - OnTenantDeleted cancels the goroutine and evicts state. - OnTenantCredentialsRotated routes through delete + activate which now closes the old goroutine and opens a fresh one with refreshed DSN. - Drain closes every per-tenant LISTEN goroutine on shutdown (matches design open-question answer: explicit graceful drain). Cache write under MaxEntriesPerTenant bound; updates beyond the bound are silently dropped (with metric in slice 6). ST mode and MT-without-Manager paths are untouched.
Client.OnChange in multi-tenant mode now routes through the bound Manager's per-tenant LISTEN dispatcher when a Manager is bound. Without a bound Manager the call still returns ErrNotSupportedInMultiTenant — the v1.4.0 contract is preserved exactly for callers that have not opted in. The MT-with-Manager path: - Returns a working unsubscribe closure - Forwards callbacks with the LISTEN-goroutine ctx (carrying tenant scope) - Treats nil fn the same as the ST path (returns no-op unsubscribe, no error) internal/client/manager_binding_test.go adds a test pinning the new register/unsubscribe lifecycle, complementing the existing TestBackwardCompat_MTWithoutManager_OnChangeReturnsErr.
Wire real OpenTelemetry instruments for the Manager. Six metrics covering the per-tenant LISTEN goroutine lifecycle and cache hit rate: - systemplane.manager.tenants_active (gauge) - systemplane.manager.cache_entries (gauge, tenant_id) - systemplane.manager.notify_received_total (counter, tenant_id + op) - systemplane.manager.listen_disconnects_total (counter, tenant_id + reason) - systemplane.manager.warmload_latency_seconds (histogram, tenant_id) - systemplane.manager.get_cache_hits_total (counter, tenant_id + outcome) Tenant-id cardinality is bounded by aggregateTenantThreshold (default 1000, configurable via WithManagerAggregateTenantThreshold): once the active-tenant count exceeds the threshold the tenant_id label collapses to "aggregate". All recordXxx helpers are nil-receiver safe and lazily initialize their instruments via sync.Once so a Manager constructed without telemetry stays a strict no-op. Cache miss outcome is now also recorded in Lookup so operators can compute the hit rate as a ratio.
Comprehensive unit tests for the in-process Manager paths (cache, callback registry, lifecycle idempotency, notify payload decoding, metric cardinality rollup). internal/manager/lifecycle_test.go: - OnTenantSuspended / OnTenantDeleted before activation: no-op + no state creation - OnTenantSuspended / OnTenantDeleted idempotency - OnTenantCredentialsRotated unknown-tenant no-op - Lookup after suspend falls through to DB (stale flag) - Populate enforces MaxEntriesPerTenant bound - Callback registry dispatch (single, multiple, panicking) - decodeNotifyPayload accepts/rejects expected shapes - tenantLabel aggregate-rollup behaviour above/below threshold - Concurrent tenantStateFor returns stable instance internal/manager/internal_test.go: - TenantIDFromContext (tmcore delegation) - quoteIdentifier - lifecycleContext fallback to Background - Invalidate removes entries - applyEvent delete path (cache eviction + callback fan-out) - clearStale resets the flag Coverage on internal/manager: 35.9% on package; the schema/listen/applyEvent upsert paths require live Postgres and are covered by the integration test suite landing in the next slice.
End-to-end integration tests against a real Postgres testcontainer covering the full Manager lifecycle: - TestIntegration_Manager_OnTenantActivated_BootstrapsSchemaAndSeeds: schema DDL runs, defaults seed via INSERT ON CONFLICT, cache warm-loads. - TestIntegration_Manager_NotifyEndToEnd: external UPDATE triggers NOTIFY, LISTEN goroutine consumes it, cache updates, OnChange callback fires. - TestIntegration_Manager_MultiTenantIsolation: two activated tenants observe their own values; cross-tenant cache contamination is impossible. - TestIntegration_Manager_ReconnectAfterTerminateBackend: pg_terminate_backend on the LISTEN connection triggers reconnect; post-reconnect NOTIFYs flow. - TestIntegration_Manager_DeleteThenReactivate: clean round-trip; cache rebuilds after deletion. - TestIntegration_Manager_CredentialsRotated: rotation tears down old LISTEN, opens fresh one, NOTIFYs continue. - TestIntegration_Manager_SeedDoesNotOverwriteExisting: operator-set values survive re-activation seed (ON CONFLICT DO NOTHING). To support tests without spinning up the tenant-manager gRPC client, this slice extracts a Connector interface (internal/manager/connector.go) that abstracts ResolveDB + ResolveDSN. New() wires the default pgMgrConnector when a *tmpostgres.Manager is supplied; SetConnector is the test seam. Production behaviour is unchanged: New(pgMgr, ...) still produces a Manager backed by lib-commons tenant-manager Postgres. All 7 tests pass under -race against postgres:16-alpine.
Per-lifecycle-path goroutine-cleanup tests pinning the contract that every handler releases its LISTEN goroutines: - OnTenantDeleted releases the goroutine - OnTenantSuspended releases the goroutine - OnTenantCredentialsRotated releases the pre-rotation goroutine - Drain releases every active goroutine Package-level goleak guard (internal/manager/main_test.go) runs under both unit and integration build tags so any leak from cache/callback code or from the LISTEN reader/reconnect loops surfaces immediately. Ignore list covers testcontainers' Reaper, pgx's pool health-checker, and testcontainers' HTTP keep-alive goroutines — all are intentionally process-lifetime.
Surface the Manager's lifecycle and drain API as first-class methods on the public *Manager type so godoc renders them at the public package boundary. - manager_methods.go: proxy methods (OnTenantActivated, OnTenantSuspended, OnTenantDeleted, OnTenantCredentialsRotated, Drain, IsClosed) with full godoc covering idempotency, order-tolerance, and shutdown semantics. - examples/manager/main.go: end-to-end consumer integration example showing the canonical construction order — MT Client → Register keys → NewManager(client, pgMgr) → forward lifecycle events → Drain on exit. godoc summary: type Manager internalmanager.Manager func NewManager(...) *Manager func (m *Manager) Drain(ctx context.Context) error func (m *Manager) IsClosed() bool func (m *Manager) OnTenantActivated(ctx context.Context, tenantID string) error func (m *Manager) OnTenantCredentialsRotated(ctx context.Context, tenantID string) error func (m *Manager) OnTenantDeleted(ctx context.Context, tenantID string) error func (m *Manager) OnTenantSuspended(ctx context.Context, tenantID string) error
Drain now respects the supplied context.Context. If the caller's shutdown budget expires before every per-tenant LISTEN goroutine has acknowledged its cancel signal, Drain returns immediately. The in-flight goroutines still observe the per-listen ctx cancellation and will exit on their own — Drain just stops waiting for them. - internal/manager/listen.go: stopListenCtx variant accepting ctx; the goroutine wait selects on done, listenCloseTimeout, AND ctx.Done. - internal/manager/lifecycle.go: Drain's perTenant.Range loop bails out early when ctx cancels. - manager_methods.go: refined godoc reflecting the new semantics. New integration test (TestIntegration_Manager_Drain_HonoursCtxCancellation) calls Drain with an already-canceled ctx and asserts it returns inside 2 seconds. Followed by a second idempotent Drain. This closes the design's open question: "Should Manager expose a Drain method... Recommend: yes, explicit Drain(ctx) called from the consumer's shutdown sequence." Drain + ctx cancellation makes it usable inside strict shutdown budgets (k8s preStop, signal handlers).
Document the v1.5.0 additions, the OnChange behaviour change (opt-in via BindManager), the OTel metrics, and the backward-compatibility contract. go.mod / version bump is intentionally NOT in this commit — release prep is the maintainer's responsibility.
Pin sentinel error propagation for the schema and connector layers without requiring a live Postgres: - internal/manager/schema_test.go: empty seedDefaults no-op, resolveTenantDB / tenantDSN return ErrPgMgrUnavailable when no connector is wired, startListen with nil ts / failing ResolveDSN releases the listen slot cleanly. - internal/manager/connector_test.go: nil receiver and nil-manager paths on pgMgrConnector return ErrPgMgrUnavailable. - internal/manager/manager_test.go: nil-option / WithLogger(nil) / WithTelemetry(nil) / WithAggregateTenantThreshold are tolerated; nil SetConnector is safe. Coverage on internal/manager: 75.6% (combined unit + integration). Per-file: tenant_state 100%, get 97%, lifecycle 91%, callbacks 90%, manager 88%, listen 85%, schema 73%, metrics 67%, logging 33%, connector 23%. The remaining gaps are intentional — pgMgrConnector requires a real tmpostgres.Manager (gRPC tenant-config client) and metrics.init requires a real OTel MeterProvider; both are exercised by consumers in production.
Address golangci-lint findings flagged by the repo's .golangci.yml against
the new Manager code:
- listen.go:
- rename predeclared 'cap' to 'maxDelay'
- drop math.IsNaN check on int-derived Duration (SA4015); replaced with
a negative-value guard which is the actual concern
- perfsprint: SafeGo name uses string concatenation instead of fmt.Sprintf
- gosec G118 lint suppression on the WithCancel pair stored on
listenHandle (released via stopListen — same justification used by the
Client lifecycle ctx)
- staticcheck S1016: change notifyPayload tags so notifyEvent(p) conversion
replaces the struct literal
- wsl_v5: missing whitespace between non-related statements
- manager.go: wsl_v5 whitespace around the closed flag assignment
- schema.go: perfsprint string concat
- logging.go: drop unused logError helper
- examples/manager/main.go: nolint:gosec on the TenantCredentialsRotated
event-name constant (not a credential); upgrade to sync.WaitGroup.Go
(modernize)
All lint clean. Tests still pass: 113 unit (race) + 12 integration.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (35)
WalkthroughThis PR introduces ChangesMulti-Tenant Manager with LISTEN/NOTIFY Cache System
Comment |
| eventActivated = "TenantActivated" | ||
| eventSuspended = "TenantSuspended" | ||
| eventDeleted = "TenantDeleted" | ||
| eventCredsRotate = "TenantCredentialsRotated" //nolint:gosec // event type name, not a credential |
Approach (B) from the gap report: extend the existing connector test seam
rather than introduce a parallel bufconn-gRPC fake. The upstream
tmpostgres.Manager exposes two construction shapes that avoid spinning up
the tenant-config gRPC client:
- NewManager(nil, "svc") leaves the client nil, so GetConnection routes
through createConnection and short-circuits with the canonical
"tenant manager client is required for multi-tenant connections"
error — exercising the GetConnection error branch of both ResolveDB
and ResolveDSN.
- NewManager(nil, "svc", WithTestConnections(tID)) preloads a
zero-value PostgresConnection. GetConnection returns the cached
entry; conn.GetDB() then errors because ConnectionDB is nil, and
ConnectionStringPrimary is empty for the DSN branch — exercising
the second-tier error branches.
Success-path coverage (ConnectionDB non-nil, primary DSN non-empty)
runs on the existing testcontainers Postgres lane: seed the entry via
WithTestConnections, mutate ConnectionDB + ConnectionStringPrimary
through the returned pointer, and observe ResolveDB returning a live
dbresolver.DB that pings cleanly.
Coverage: connector.go 23.53% → 100%.
The new Test New_WiresPgMgrConnector also pins the production-path
wiring (manager.New(pgMgr) installs *pgMgrConnector{mgr: pgMgr}) without
needing a live tenant-manager.
No production code changed.
Wire a real *tracing.Telemetry whose MeterProvider is backed by an in-process sdkmetric.ManualReader so every Manager instrument can be driven and the emitted metric points asserted directly. Covers every record method: - recordTenantActivated / recordTenantDeactivated (gauge) - recordCacheEntries (UpDownCounter with tenant_id) - recordNotifyReceived (counter with tenant_id + op) - recordListenDisconnect (counter with tenant_id + reason) - recordWarmloadLatency (float64 histogram) - recordGetCacheOutcome (counter with tenant_id + outcome) Pins the cardinality cap on tenantLabel: below the aggregate threshold each tenant ID is preserved; above it every label collapses to "aggregate" so Prometheus stays bounded under thousands of tenants. Also pins the nil-receiver and nil-telemetry edges, the nil-logger fallback in newMetrics, and the initOnce contract under repeated init. Coverage: metrics.go 55.07% → 97.10%. No production code changed.
logging_test.go pins the m==nil and m.logger==nil early-return guards in logDebug/logInfo/logWarn — production wiring always installs log.NewNop() so the guards are otherwise unreachable. Extracts newTestTelemetry into metrics_helper_test.go under the "unit || integration" build tag so the integration-tagged listen tests can reuse the in-process OTel MeterProvider. Coverage: logging.go 66.67% → 100.00%.
Drives runSchema, runSchemaAndSeed, seedDefaults and warmLoad against a
live Postgres so every error and corner-case branch executes:
- Fresh DB → CREATE TABLE + 2 triggers
- Already exists → idempotent (CREATE TABLE IF NOT EXISTS,
CREATE OR REPLACE FUNCTION, DROP TRIGGER IF EXISTS + CREATE TRIGGER)
- Trigger missing → recreated by re-bootstrap
- DDL error (closed DB) → "create table" wrap surfaces from runSchema
and short-circuits runSchemaAndSeed
- seedDefaults: ON CONFLICT preserves operator value; new key gets the
registered default; marshal error (channel value) and insert error
both return the firstErr while logging each subsequent failure
- warmLoad: registered + unregistered keys filtered; cache cap honoured;
query error wrapped as "warm-load query"; stale flag cleared
- Concurrent runSchemaAndSeed for the same tenant — at least one
succeeds, no duplicate seeded rows (ON CONFLICT semantics)
Coverage: schema.go 69.44% → 90.28%.
No production code changed.
Drives startListen, consumeAndReconnect, reconnect, applyEvent and
readSingle against a live testcontainers Postgres so the production
LISTEN/NOTIFY paths execute end-to-end:
- startListen: pgx.Connect failure releases ts.listen slot; second
startListen call is a no-op (handle already installed)
- lifecycleContext: nil hooks ctx falls back to context.Background
- readSingle: sql.ErrNoRows → (nil, false, nil); JSON decode failure
wrapped as "decode ns/k"; QueryRowContext error wrapped as scan error
- applyEvent delete: removes entry, dispatches OnChange with nil value
- applyEvent upsert: ResolveDB failure short-circuits; row missing
deletes cache entry; success updates cache and dispatches OnChange
with the fresh value
- consumeAndReconnect: real NOTIFY round-trip through the goroutine
updates the cache within the 5s deadline
- reconnect: terminating the LISTEN backend via pg_terminate_backend
triggers reconnect with tightened backoff (25ms base / 1s cap), and
a subsequent INSERT delivers via the new connection
Coverage: listen.go 79.57% → 86.02%.
No production code changed.
5 tasks
This was referenced May 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the ST↔MT asymmetry left behind by v1.4.0 by adding a per-tenant
in-process cache plus a per-tenant LISTEN/NOTIFY changefeed for
multi-tenant deployments. The Manager is strictly opt-in — callers that
do not call
Client.BindManagerobserve identical v1.4.0 behaviour.Design source of truth: #12
(
docs/v1.5.0-manager-design.mdonfeat/v1.5.0-manager-design).Slices
Each commit is one slice from the design's 11-step implementation plan,
landed in order so intermediate states keep ST and MT (without Manager)
fully functional.
feat(manager): scaffolding— types, public surface,BindManagerstubfeat(manager): per-tenant cache and Client.Get routing— Lookup/Populate, MT Get fast-pathfeat(manager): tenant lifecycle handlers (schema + seed + warm-load)—OnTenantActivateddoes the full DDL + INSERT ON CONFLICT seed + cache warm-loadfeat(manager): per-tenant LISTEN/NOTIFY goroutine— one pgx LISTEN per active tenant, reconnect with capped exponential backoff, stale-after-3-failures, SafeGo for observabilityfeat(manager): OnChange MT-aware—Client.OnChangein MT routes through bound Manager; without binding still returnsErrNotSupportedInMultiTenantfeat(manager): telemetry— six OTel instruments, configurable tenant-id aggregate rolluptest(manager): unit— idempotency, out-of-order delivery, nil-safety, notify decodetest(manager): integration— testcontainers Postgres: schema bootstrap, NOTIFY E2E, multi-tenant isolation, reconnect afterpg_terminate_backend, delete+reactivate, credentials rotation, seed never overwrites operatortest(manager): goleak— every lifecycle path releases its goroutine; package-levelgoleak.VerifyTestMainunder both build tagsdocs(manager): public API documentation— godoc on every exported symbol;examples/manager/main.goshowing canonical consumer integrationfeat(manager): Drain honours ctx cancellation—Drain(ctx)bails out early when caller's shutdown budget expiresPlus three supporting commits:
chore(changelog): v1.5.0 entry for Managertest(manager): cover error and nil-receiver paths— coverage bumpchore(manager): fix lint findingsBackward compatibility
Preserved exactly. A test (
TestBackwardCompat_MTWithoutManager_OnChangeReturnsErr) pins the v1.4.0 invariant. Allexisting v1.4.0 unit and integration tests pass unchanged:
ST mode is entirely untouched. The new code paths are all gated on
client.manager != nilandclient.BindManager(m)is the only way to setit. The public API surface is purely additive.
What's NOT in this PR
go.modversion bump. Release prep is the maintainer's responsibility.plugin-br-bank-transfer/migrations/000011_systemplane_defaults_seed.up.sqlstop-gap remains in place; its retirement is plugin-side Phase 6 work
that consumes this Manager. Removing it from the plugin is not a
prerequisite for merging this PR.
a
Connectortest seam (extracted in slice 8) rather than spinning upa full lib-commons tenant-manager Postgres Manager. The production path
remains wired through
tmpostgres.Manager.GetConnection.Validation
go build ./...cleango vet ./...cleango test -tags unit -count=1 -race ./...— 113 tests passgo test -tags integration -count=1 -p=1 ./...— 35 tests pass (12 new manager + 23 existing v1.4.0)golangci-lint run ./...— 0 issuesgoleak.VerifyTestMainclean under both build tagsinternal/manager: 75.6% combined (per-file: tenant_state 100%, get 97%, lifecycle 91%, callbacks 90%, manager 88%, listen 85%, schema 73%, metrics 67%). Remaining gaps arepgMgrConnector(requires real tenant-config gRPC client) andmetrics.init(requires a real OTel MeterProvider) — both exercised by consumers in production.Notes
This PR is intentionally DRAFT. The maintainer should review the 14
commits, run any internal validation pipelines, and bump the version /
cut the release tag separately.