Skip to content

feat(store): Postgres backend — pgx driver, Ent schema parity, full store port, LISTEN/NOTIFY#304

Merged
ptone merged 58 commits into
GoogleCloudPlatform:mainfrom
ptone:pr/postgres-core
Jun 5, 2026
Merged

feat(store): Postgres backend — pgx driver, Ent schema parity, full store port, LISTEN/NOTIFY#304
ptone merged 58 commits into
GoogleCloudPlatform:mainfrom
ptone:pr/postgres-core

Conversation

@ptone
Copy link
Copy Markdown
Member

@ptone ptone commented Jun 4, 2026

Summary

Adds Postgres as a supported database backend alongside SQLite. SQLite remains the default — no existing deployments are affected unless they opt in via database.driver: postgres.

This is the foundation PR. PRs for multi-node broker dispatch and NFS workspace depend on this merging first.

What's included

  • P0 — Switch from lib/pqpgx/v5 stdlib; connection pool config (MaxOpenConns, ConnMaxIdleTime, ConnMaxLifetime); CRUD-parity test harness; Cloud Monitoring scaffolding for LISTEN/NOTIFY metrics
  • P1 — 23 new Ent schemas covering all ~25 raw-SQL tables (full table parity with pkg/store/sqlite/)
  • P2 — Port all store domains to Ent entadapter: user, allowlist/invite, schedule, maintenance, message, notification, GCP/GitHub/token, secret/env_var, template/harness_config, project/broker, agent. Remove raw-SQL store. Collapse dual-DB into single Ent store.
  • P3 — Postgres LISTEN/NOTIFY event publisher (PostgresEventPublisher); dialect-aware multi-replica concurrency primitives (advisory locks, singleton scheduler); parameterized store tests over {sqlite, postgres}
  • Pool hardeningConnMaxIdleTime (5 min default), TCP keepalive GUCs, advisory-lock tick-skip on connection error, bounded lock-checkout timeout
  • Migration tooling — In-process migration α (legacy raw-SQL hub.db → Ent-on-SQLite); migration β tool (Ent-SQLite → Ent-Postgres)

Test plan

  • go build ./... passes (verified on branch)
  • go test ./pkg/ent/... ./pkg/store/... passes
  • Integration test: go test -tags integration ./pkg/store/integrationtest/... against a Postgres instance
  • Deploy with database.driver: sqlite (default) — confirm zero behavior change
  • Deploy with database.driver: postgres + DSN — confirm hub boots and operates normally

@google-cla
Copy link
Copy Markdown

google-cla Bot commented Jun 4, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements database-related enhancements, including PostgreSQL support, database connection pooling, SQLite-to-PostgreSQL migration commands, and a test fixture generator. Feedback on these changes highlights potential goroutine and connection leaks due to the use of context.Background() when initializing the event publisher and running database migrations. Additionally, the DSN parsing logic for SQLite should be refined to handle file:// prefixes correctly to avoid leaving triple slashes in file paths.

Comment thread cmd/server_foreground.go Outdated

// Create shared event publisher for real-time SSE
eventPub := hub.NewChannelEventPublisher()
eventPub := newEventPublisher(context.Background(), cfg)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using context.Background() here when initializing the event publisher can lead to goroutine and connection leaks. The Postgres event publisher starts a background listener for LISTEN/NOTIFY that runs indefinitely until the context is cancelled. Since context.Background() is never cancelled, this connection and its associated goroutines will leak when the server shuts down or during testing. Consider refactoring initWebServer to accept a context.Context parameter and pass it here.

Comment thread cmd/server_foreground.go Outdated
s := entadapter.NewCompositeStore(sqliteStore, entClient)
// Migrate runs Ent's schema migration and seeds built-in maintenance
// operations (parity with the former raw-SQL store).
if err := s.Migrate(context.Background()); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using context.Background() for database migrations prevents graceful cancellation (e.g., if the operator interrupts the startup process with Ctrl+C). Consider refactoring initStore to accept a context.Context parameter so that s.Migrate and s.Ping can be cancelled properly.

Comment thread cmd/server_migrate.go
Comment on lines +165 to +166
case strings.HasPrefix(raw, "file:"):
path = strings.TrimPrefix(raw, "file:")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When parsing a file:/// DSN, trimming only the file: prefix leaves triple slashes (e.g., ///var/lib/...), which can cause issues when passed to file system operations like os.Remove. Handling file:// explicitly avoids this.

	case strings.HasPrefix(raw, "file://"):
		path = strings.TrimPrefix(raw, "file://")
	case strings.HasPrefix(raw, "file:"):
		path = strings.TrimPrefix(raw, "file:")

scion-gteam Bot pushed a commit to ptone/scion that referenced this pull request Jun 4, 2026
… and DSN parsing

Thread the server's cancellable context into initStore and
initWebServer instead of using context.Background(), so that:

- DB migrations and the health-check ping cancel on Ctrl+C during
  startup (medium-priority review comment).
- The Postgres LISTEN/NOTIFY event publisher goroutine shuts down
  cleanly when the server exits, preventing connection leaks
  (high-priority review comment).

Also fix parseSQLiteSourceDSN to handle the file:// prefix before
the file: prefix, so that file:///var/lib/hub.db correctly resolves
to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for
file:// and file:/// DSN forms.
scion-gteam Bot pushed a commit to ptone/scion that referenced this pull request Jun 4, 2026
scion-gteam Bot pushed a commit to ptone/scion that referenced this pull request Jun 5, 2026
… and DSN parsing

Thread the server's cancellable context into initStore and
initWebServer instead of using context.Background(), so that:

- DB migrations and the health-check ping cancel on Ctrl+C during
  startup (medium-priority review comment).
- The Postgres LISTEN/NOTIFY event publisher goroutine shuts down
  cleanly when the server exits, preventing connection leaks
  (high-priority review comment).

Also fix parseSQLiteSourceDSN to handle the file:// prefix before
the file: prefix, so that file:///var/lib/hub.db correctly resolves
to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for
file:// and file:/// DSN forms.
scion-gteam Bot pushed a commit to ptone/scion that referenced this pull request Jun 5, 2026
@scion-gteam scion-gteam Bot force-pushed the pr/postgres-core branch from b377aff to bdbe327 Compare June 5, 2026 00:26
Scion and others added 21 commits June 5, 2026 00:35
- Add github.com/jackc/pgx/v5/stdlib (registers as "pgx")
- driver_postgres.go: blank import pgx stdlib instead of lib/pq
- OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB
- Introduce PoolConfig (applied to *sql.DB); thread through
  OpenSQLite/OpenPostgres and update all callers
- go mod tidy drops lib/pq
- DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime
  plus ConnMaxLifetimeDuration() helper
- DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1,
  load-bearing for write serialization)
- applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and
  forces sqlite MaxOpenConns=1; called in both load paths
- Mirror fields in V1DatabaseConfig + both conversion directions
- Wire pool settings into entc.OpenSQLite in initStore
P0-3: pkg/store/storetest/ — backend-agnostic, table-driven CRUD oracle.
A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive
Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter.
Ships group + policy domains and runs green against today's CompositeStore
(SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2.

P0-4: internal/fixturegen/ — Go-defined spec seeding >=1 row per table across
all 30 domain tables, with edge cases (NULL optionals, max-length strings,
nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run
./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table
coverage report, and caches the blob to the scratchpad mount. CI gate fails if
any table has zero rows.
Add Ent-backed implementations of the notification, GCP service account,
GitHub App installation, and user access token store sub-interfaces:

- notification_store.go: NotificationStore (subscriptions, notifications,
  templates). Dispatch uses an atomic conditional update as the multi-replica
  claim primitive, and an optional NotificationPublisher designs in the
  LISTEN/NOTIFY fan-out for created/dispatched events.
- external_store.go: GCPServiceAccountStore + GitHubInstallationStore +
  UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE
  semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens
  support key-hash lookup. Legacy api_keys is intentionally not surfaced.
- storetest: add GCPServiceAccount, SubscriptionTemplate, and
  NotificationSubscription CRUD-parity domains.

Does not modify composite.go.
- schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with
  dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the
  ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT
  on SQLite, SKIP LOCKED on Postgres).
- maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side
  seed (uuid.New) replacing SQLite randomblob() UUID seeds.
- message_store.go: CRUD, read flags, PurgeOldMessages, design-in
  PublishUserMessage hook for Postgres LISTEN/NOTIFY.
- pkg/ent/client_driver.go: hand-written Client.Driver() accessor for
  dialect detection + raw locking queries.
Implements the Ent-backed store adapters for the user and
allowlist/invite domains, plus their CRUD-parity oracle descriptors.

pkg/store/entadapter/user_store.go (store.UserStore):
- CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/
  DeleteUser/ListUsers.
- Case-insensitive email: emails are normalized to lower case on write
  (so the plain unique index enforces case-insensitive uniqueness,
  equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with
  EmailEqualFold (lower(email)=lower($1)) on read. ent codegen +
  AutoMigrate cannot emit a real lower(email) functional index across
  both SQLite (tests) and Postgres, so the invariant is enforced at the
  port layer.
- Offset-based pagination matching the legacy SQLite store.

pkg/store/entadapter/allowlist_store.go (store.AllowListStore +
store.InviteCodeStore):
- Full allow-list + invite-code CRUD.
- BulkAddAllowListEntries uses CreateBulk + OnConflictColumns(email).
  Ignore() for race-safe INSERT-OR-IGNORE; added/skipped counts mirror
  the legacy per-row semantics (existing + within-batch dups skipped).
- IncrementInviteUseCount is a single atomic conditional UPDATE
  (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)),
  which is race-free on both backends without SELECT...FOR UPDATE. The
  sql/lock feature is enabled and ForUpdate is available for genuine
  multi-statement RMW paths.
- ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is
  a plain column, not an Ent edge).

Schema:
- pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed
  by UpdateUserLastSeen / lastSeen sort; document the case-insensitive
  email strategy.
- pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for
  OnConflict and ForUpdate).

Tests (all passing):
- pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain,
  InviteCodeDomain oracle descriptors (kept in a separate file to avoid
  contending on domains.go).
- entadapter oracle test runs the shared CRUD-parity suite directly
  against the new adapters; behavior tests cover case-insensitivity,
  bulk idempotency, conditional increment, stats, and the invite join.

NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included.
This is a shared worktree where sibling port agents concurrently modify
schemas and the same feature flags; the generated code must be
regenerated at wave integration via:
    go generate ./pkg/ent/...
Verified locally that regeneration + full build + tests pass.

Per P2 scope: composite.go wiring and ensureEntUser shadow removal are
deferred to P2-collapse.
Add Ent-backed store implementations for the secret/env and
template/harness domains, mirroring the legacy SQLite semantics:

- entadapter/secret_store.go: SecretStore implementing store.SecretStore
  + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE
  target->key projection, version bump on update, get-then-update upsert,
  and transitive ListProgenySecrets via a created_by IN-list over the
  ancestor set (user scope + allow_progeny only; encrypted value withheld).
- entadapter/template_store.go: TemplateStore implementing
  store.TemplateStore + store.HarnessConfigStore. base_template hierarchy,
  scope/project_id backwards-compat lookups, content_hash, JSON config/files
  columns, DeleteByScope. Subscription templates are owned by NotificationStore.
- Direct Ent unit tests incl. a progeny-inheritance parity test.
- storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired
  into RunStoreSuite for cross-backend CRUD parity.
Port the project/broker domain (projects, runtime_brokers, project_contributors,
project_sync_state) and the broker-auth domain (broker_secrets,
broker_join_tokens) from raw SQL to Ent adapters.

- pkg/store/entadapter/project_store.go: implements ProjectStore,
  RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore.
  * provider + sync-state upserts use Ent OnConflict().UpdateNewValues()
    (sql/upsert) keyed on the (project_id, broker_id) unique index.
  * runtime broker heartbeat/update use an optimistic version-CAS loop on a
    new internal lock_version token, serializing concurrent writers portably
    across SQLite (tests) and Postgres without SELECT ... FOR UPDATE.
  * slug lookups support case-insensitive matching (EqualFold).
  * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are
    derived via Ent queries, matching the legacy SQLite store.
- pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore
  (per-broker HMAC secrets + short-lived join tokens, expiry cleanup).
- Project Ent schema: add operational fields for full parity
  (default_runtime_broker_id, shared_dirs, github_*, git_identity).
- RuntimeBroker Ent schema: relax vestigial type column to Optional, add
  internal lock_version concurrency token.
- Regenerate Ent with sql/upsert,sql/lock features.
- storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken
  CRUD-parity domains.
- Unit tests for both adapters.

Per the integration plan, composite.go wiring and ensureEntProject shadow
removal are deferred to P2-collapse.
Regenerated with --feature sql/upsert,sql/lock to support
OnConflict upserts and ForUpdate/SKIP LOCKED job claims.
Wire all Ent-backed sub-stores into CompositeStore via embedding, removing
the raw-SQL base store and the User/Agent/Project shadow-sync machinery
(ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves
every domain from a single Ent client and implements Close/Ping/Migrate
directly.

Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no
MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList,
and InviteCode domains in the storetest CRUD-parity suite. Update entadapter
tests for the single-DB NewCompositeStore(client) signature.

go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green.
Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling
files (brokersecret, gcp_service_account, github_installation, maintenance,
messages, notification, project_sync_state, schedule, scheduled_event) plus
their tests, including the inline schema-migration scaffold. Keep driver.go,
which registers the pure-Go SQLite driver used by Ent's SQLite backend.

Repoint the two non-test consumers to the Ent-backed store:
  - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore.
  - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB.

go build ./... green; no remaining production references to the raw store.
…y PK

Replace the removed raw-SQL store in downstream tests with an Ent-backed
newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and
internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests
via a new CompositeStore.DB() escape-hatch accessor.

Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID
generated a non-UUID secret primary key, which the Ent secret store rejects;
it now derives a deterministic UUIDv5. go build ./... green; entadapter and
storetest suites green.

NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail
because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema
rejects; addressed in follow-up commits (tid() helper).
Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the
UUID-PK Ent store accepts them while preserving cross-reference consistency and
ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining
failures are behavioral, not ID-format, and are addressed separately.

# Conflicts:
#	pkg/hub/handlers_project_test.go
#	pkg/hub/httpdispatcher_test.go
Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds
built-in maintenance operations (the raw store seeded these in its migrations).
initStore and hub test helpers call s.Migrate() so production and tests seed
consistently. Fixes the maintenance-operation hub tests (404 'Operation not
found'). pkg/hub failures 79 -> 71.
Add slugs/broker names to test fixtures that previously relied on the raw
store's lenient (no-validator) inserts: project/agent slugs in the logs test
helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on
envgather ProjectProvider literals. pkg/hub failures 71 -> 57.
Apply the tid() helper to pkg/secret fixtures (including a dynamically built
secret ID) so the UUID-PK Ent store accepts them. pkg/secret now fully green.
…ug/name

Wrap broker/grove/agent IDs passed to registerGlobalProjectAndBroker and the
dispatcher tests in tid(), and supply RuntimeBroker.slug / ProjectContributor
broker_name to satisfy Ent validators. cmd now green except
TestDeleteStopped_RequiresGroveContext, which requires the 'docker' binary
(absent in this sandbox) and is unrelated to the store migration.

# Conflicts:
#	cmd/server_dispatcher_test.go
Catch IDs that surfaced behind earlier failures (stale-agent-*, agent-visible-authz,
agent-profile-hb, env-owner-1). No more UUID-parse errors in pkg/hub; the
remaining ~56 failures are behavioral (URL paths built from old raw IDs,
assertion mismatches), addressed next.
Scion and others added 20 commits June 5, 2026 00:36
…arity); test(hub): seed FK users, broker_name, deterministic UUIDs
…er/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs
…fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation
Implements 'scion server migrate --from sqlite://... --to postgres://...'
per postgres-strategy.md §7.3.

- entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL
  write), MaxOpenConns=1 so the source is never mutated.
- entc.MigrateData: generic reflection-based, dependency-ordered copy of all
  30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK
  already exists), atomic per entity (txn), chunked CreateBulk, source/dest
  row-count verification after each entity, plus the Group.child_groups M2M
  edge. FK columns are plain fields so edges are preserved via setters.
- cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL
  or keyword form), --keep-source default / --drop-source cutover, progress
  logging.

Verified end-to-end against live CloudSQL Postgres 16 (integration test +
real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips,
--drop-source removal.
…s (P3-3..6)

Add cluster-coordination primitives so N stateless hub processes can share one
Postgres, each degrading to a no-op on single-writer SQLite:

- store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a
  dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat,
  stalled, purge, schedule-evaluator and github-health sweeps to one
  replica/tick.
- store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent
  claims one-shot events before side effects (dedup across replica startup
  recovery).
- CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single
  run on SQLite) for future multi-row invariants.
- dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5
  scaffold; wired into StartBackgroundServices via SetDBMetrics.

Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps,
notification atomic dispatch). Found and documented the schedule SKIP LOCKED
early-commit gap (lock released before the status transition), closed by the
singleton evaluator. Audit + budget docs in scratchpad.

Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl.
8-way concurrent), pool_sampler_test.go.
…/NOTIFY publisher

P3-7: Decouple call sites from the concrete *ChannelEventPublisher.
- Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher
  interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher
  already had it.
- Factor the Publish* methods into a shared eventBuilder (sink func) so every
  backend emits identical subjects/payloads; ChannelEventPublisher embeds it.
- web.go (field + SetEventPublisher), messagebroker.go and notifications.go
  (field + constructor) now take EventPublisher; handlers_messages.go gates SSE
  on "not the no-op publisher" instead of a concrete type assertion.

P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery).
- Per-grove channels plus a global channel (flat exact-match); event type in the
  JSON envelope. Grove-scoped subjects publish to both the grove channel and the
  global channel; subscriptions group their patterns by resolved channel so an
  event is matched only against patterns that opted into the arriving channel
  (no double delivery).
- 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads
  (TTL-swept so every replica can refetch).
- PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish;
  rollback => no deliver). Delivery flows exclusively through the listener.
- Listener goroutine reconnects with backoff and re-LISTENs (resubscribe);
  dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does
  not invalidate the pgconn connection).
- Emits pkg/observability/dbmetrics signals (published/delivered/dropped,
  payload size, publish->deliver latency, reconnects, pool stats).
- cmd: newEventPublisher selects the backend by database driver (postgres =>
  PostgresEventPublisher, else ChannelEventPublisher) with safe fallback.

Tests: routing/registry/payload-offload/metrics/transactional-executor unit
tests run without a DB; cross-replica delivery, oversized round-trip,
transactional rollback, and reconnect+resubscribe are gated behind
SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green.

Note: server.go's equivalent type-assertion cleanup is left in the working tree
(co-edited with concurrent P0-5/scheduler work) and is functionally optional —
HEAD server.go already compiles against the widened interface.
Add pkg/store/enttest: a backend-selecting Ent client factory for the store
test suites. Default is in-memory SQLite; built with -tags integration and
SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres
database (created/dropped via TestMain) and isolates each test in its own
schema (search_path) so tests never observe each other's rows. Falls back to
SQLite when the env var is unset.

Route all entadapter and storetest helpers through enttest.NewClient so the
same CRUD-parity oracle runs unchanged against either backend.

Fix two real Postgres bugs surfaced by the new path:
- entadapter/dialect.go ancestryContains: emit the bind parameter via
  Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which
  was not rebound and produced a syntax error; and use jsonb_array_elements_text
  (the column is jsonb on Postgres, not json).
- schedule_store_test ClaimPath: make the concurrent-claim assertion
  backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every
  caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent
  callers may observe a disjoint subset (0..2) and must only never error or
  exceed 2.

Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL
Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed).
…ublisher

Wave C integration: newEventPublisher can now return a PostgresEventPublisher
(LISTEN/NOTIFY) in addition to ChannelEventPublisher. The dispatcher/broker
startup previously hard-asserted *ChannelEventPublisher, which silently skipped
starting them under Postgres. Gate on (not noop and not nil) instead, matching
the existing pattern in handlers_messages.go.
…l default

Task 1 — LISTEN/NOTIFY publish path:
- Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real
  POST /api/v1/projects handler with a PostgresEventPublisher and asserts a
  pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact
  capability the multi-replica live test probed. Verified PASSING against live
  CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end
  to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also
  pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the
  current tree.
- Bound the autocommit publish (Publish* methods) with publishTimeout (5s).
  These run synchronously on the caller's (request handler) goroutine and
  acquire from the event pool; on a connection-starved instance that acquire
  could block indefinitely, stalling CRUD and silently never emitting NOTIFY.
  The timeout converts that into a logged error + dropped event (publishing is
  fire-and-forget). PublishTx (transactional path) is unaffected.

Task 2 — connection budget:
- Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a
  modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance
  scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections
  set to 100 (out of band).
…tion, pool, NOTIFY, migration, schema, multi-process)

Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior
the SQLite parity suites cannot reach. Gated by //go:build integration and
SCION_TEST_POSTGRES_URL; skips cleanly otherwise.

Coverage:
- Contention: state_version CAS race (no lost updates, >=N-1 retries, final
  version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner +
  disjoint drain), unique-key races (project slug, user email, agent slug).
- Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE
  READ no-phantom snapshot, READ COMMITTED dirty-read prevention.
- Pool: exhaustion + queued recovery, saturated pool honoring context deadline,
  long txn not starving short queries, healing after pg_terminate_backend.
- LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener
  reconnect/resume, cross-channel isolation.
- Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration.
- Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text
  non-truncation, TIMESTAMPTZ microsecond precision.
- Multi-process: forks the test binary for cross-process advisory-lock
  exclusivity and cross-process NOTIFY delivery.

Configurable concurrency via SCION_TEST_CONCURRENCY (default 10).

Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open
custom-pool clients and share a DSN with forked child processes; non-integration
stubs keep the package API stable.
…k error

Stale-connection pool stalls (CloudSQL drops idle conns after ~10m):
- Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite)
  and apply SetConnMaxIdleTime on the database/sql pool.
- OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with
  TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect
  timeout, so a silently-dropped peer is detected instead of the first query
  after idle hanging on a dead socket.
- pgx event pool (events_postgres.go): set keepalives + connect timeout on
  both the pool's ConnConfig and the dedicated listener connection, plus
  MaxConnIdleTime 5m / MaxConnLifetime 30m.

Advisory-lock leader election (scheduler.go):
- A lock-acquisition error no longer falls open to running the handler
  unguarded (which would duplicate singleton work across replicas); the tick
  is skipped and retried next interval. Added regression tests.

Test harness (enttest/integrationtest):
- Accept libpq keyword/value DSNs (not just URL form) when deriving the
  ephemeral db/schema/params; add WithConnParam helper.
- Fix migration idempotency test's per-pass row-count expectation.
…eout

TryAdvisoryLock checked a connection out of the pool and ran the unlock
on the full 55s scheduler-handler context (acquire) and an unbounded
context.Background() (release). On a pool that could not promptly serve a
healthy connection, db.Conn() blocked for the entire 55s before failing
with 'context deadline exceeded' on every tick; with several singleton
handlers firing each 60s tick, those long-blocked goroutines and their
pending pool connection requests piled up across ticks and kept the pool
jammed (checked out client-side, idle server-side).

The unbounded unlock was a second leak vector: if the held connection
died mid critical-section, ExecContext could hang forever, so conn.Close()
never ran and the connection leaked out of the pool permanently.

Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release
(pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries
next tick instead of parking a goroutine for ~55s, and so a dead
connection can never block release from freeing the conn. Lock semantics
are unchanged: cancelling the acquire context tears down only that
context, not the checked-out session that holds the lock.
Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema
from the removed pkg/store/sqlite store) to the consolidated Ent-backed
SQLite schema, in-process on first boot, behind an automatic backup.

pkg/ent/entc/migrate_alpha.go:
- IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the
  legacy-only agents.agent_id column (no-op for an Ent/empty/absent file).
- MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>),
  AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table
  with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then
  atomically swap the migrated file into place.
- Data-driven column mapping (created_at→created, updated_at→updated,
  agents.agent_id→slug, policies→access_policies); bespoke SQL for the
  group_members/policy_bindings polymorphic splits and surrogate ids;
  groups.parent_id→group_child_groups edge.
- Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal
  signing-key secrets; plugin runtime-broker ids) with consistent rewrite
  of every foreign-key reference via a TEMP _id_remap table.
- Tolerates missing legacy tables (older schema versions).

cmd/server_foreground.go: detect + migrate in initStore's sqlite path,
with a --no-auto-migrate operator opt-out (cmd/server.go).

Validated end-to-end against four production hub.db files (scion-integration,
-integration2, -demo, -gteam): exact row-count parity (up to ~19k rows),
every entity reads back through the live Ent store, idempotent re-runs, and
broker FK references resolve post-remap. Pre-existing dangling agent
created_by/owner_id refs are faithfully preserved (loader runs FK-off).
…1 starved the pool)

The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the
value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only
bumped postgres to a real pool when the value was <= 0, but a postgres
deployment configured via env/driver override inherits the embedded default
of 1, so the guard never fired and the Ent pool ran with a SINGLE connection.

Effect in production (both integration hubs): every singleton scheduler tick
checks out the lone pool connection to hold its advisory lock, then blocks
waiting for a second connection to do its work — a self-deadlock that resolves
only at the 55s handler context deadline. All API requests serialize behind
the one connection, so GET /api/v1/* served in ~55s across the board.

Note env overrides could not paper over this: envKeyToConfigKey splits on
every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to
database.max.open.conns, not database.max_open_conns — silently ignored.

Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool
default (10) applies; explicit sizing of 2+ is still respected. SQLite remains
pinned to 1. Adds regression tests for all three cases.
- broker-dispatch.md: DB-as-state-machine + LISTEN/NOTIFY pattern for
  cross-replica broker command routing and agent lifecycle dispatch
- nfs-workspace.md: NFS workspace coordination for VM (host bind-mount)
  and K8s/Cloud Run (per-pod mount) runtime models
… and DSN parsing

Thread the server's cancellable context into initStore and
initWebServer instead of using context.Background(), so that:

- DB migrations and the health-check ping cancel on Ctrl+C during
  startup (medium-priority review comment).
- The Postgres LISTEN/NOTIFY event publisher goroutine shuts down
  cleanly when the server exits, preventing connection leaks
  (high-priority review comment).

Also fix parseSQLiteSourceDSN to handle the file:// prefix before
the file: prefix, so that file:///var/lib/hub.db correctly resolves
to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for
file:// and file:/// DSN forms.
1. Thread the server's cancellable context through
   maybeMigrateLegacySQLite → MigrateAlphaSQLite so that Ctrl+C
   during first-boot legacy migration aborts it instead of running
   with an uncancellable context.Background().

2. Guard against a double "file:" prefix when constructing the
   SQLite DSN. If the operator's database.url already starts with
   "file:", we no longer blindly prepend another "file:" prefix.
   Also correctly appends cache=shared with "&" when the DSN
   already contains query parameters.
…se fixup)

Upstream renamed hub-native to hub-managed while the PR was in
flight. Update the two remaining references that the rebase
conflict resolution missed.
@scion-gteam scion-gteam Bot force-pushed the pr/postgres-core branch from bdbe327 to b0da883 Compare June 5, 2026 00:37
@ptone ptone merged commit f7e0109 into GoogleCloudPlatform:main Jun 5, 2026
4 of 5 checks passed
@scion-gteam scion-gteam Bot deleted the pr/postgres-core branch June 5, 2026 04:07
ptone added a commit that referenced this pull request Jun 6, 2026
…by/owner_id)

The Agent Ent schema modeled created_by/owner_id as foreign keys to the
users table. When an agent creates a sub-agent, those columns hold the
*creating agent's* ID, which has no users-table row, so Postgres rejected
the insert with a foreign-key violation. mapError maps that to
ErrInvalidInput, surfacing as a detail-free "validation_error: Invalid
input (status: 400)" on every agent-initiated `scion start`. User-created
agents were unaffected, masking the regression (introduced when #304
ported the agent store onto Ent).

created_by/owner_id are polymorphic principal references (user OR agent),
like ancestry. Drop the User-typed edges and keep them as plain principal
UUID fields; resolve the delegation creator by ID and tolerate "no such
user". Atlas AutoMigrate drops the two FK constraints on existing DBs at
next boot.

Tests: the sole sub-agent creation test only passed because it seeded a
fake user row sharing the agent's ID — an impossible production state.
Remove that workaround so it exercises the real path, and add store/ent
regression tests asserting a non-user principal ID is accepted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant