feat(store): Postgres backend — pgx driver, Ent schema parity, full store port, LISTEN/NOTIFY#304
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Code Review
This pull request implements database-related enhancements, including PostgreSQL support, database connection pooling, SQLite-to-PostgreSQL migration commands, and a test fixture generator. Feedback on these changes highlights potential goroutine and connection leaks due to the use of context.Background() when initializing the event publisher and running database migrations. Additionally, the DSN parsing logic for SQLite should be refined to handle file:// prefixes correctly to avoid leaving triple slashes in file paths.
|
|
||
| // Create shared event publisher for real-time SSE | ||
| eventPub := hub.NewChannelEventPublisher() | ||
| eventPub := newEventPublisher(context.Background(), cfg) |
There was a problem hiding this comment.
Using context.Background() here when initializing the event publisher can lead to goroutine and connection leaks. The Postgres event publisher starts a background listener for LISTEN/NOTIFY that runs indefinitely until the context is cancelled. Since context.Background() is never cancelled, this connection and its associated goroutines will leak when the server shuts down or during testing. Consider refactoring initWebServer to accept a context.Context parameter and pass it here.
| s := entadapter.NewCompositeStore(sqliteStore, entClient) | ||
| // Migrate runs Ent's schema migration and seeds built-in maintenance | ||
| // operations (parity with the former raw-SQL store). | ||
| if err := s.Migrate(context.Background()); err != nil { |
There was a problem hiding this comment.
| case strings.HasPrefix(raw, "file:"): | ||
| path = strings.TrimPrefix(raw, "file:") |
There was a problem hiding this comment.
When parsing a file:/// DSN, trimming only the file: prefix leaves triple slashes (e.g., ///var/lib/...), which can cause issues when passed to file system operations like os.Remove. Handling file:// explicitly avoids this.
case strings.HasPrefix(raw, "file://"):
path = strings.TrimPrefix(raw, "file://")
case strings.HasPrefix(raw, "file:"):
path = strings.TrimPrefix(raw, "file:")… and DSN parsing Thread the server's cancellable context into initStore and initWebServer instead of using context.Background(), so that: - DB migrations and the health-check ping cancel on Ctrl+C during startup (medium-priority review comment). - The Postgres LISTEN/NOTIFY event publisher goroutine shuts down cleanly when the server exits, preventing connection leaks (high-priority review comment). Also fix parseSQLiteSourceDSN to handle the file:// prefix before the file: prefix, so that file:///var/lib/hub.db correctly resolves to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for file:// and file:/// DSN forms.
… and DSN parsing Thread the server's cancellable context into initStore and initWebServer instead of using context.Background(), so that: - DB migrations and the health-check ping cancel on Ctrl+C during startup (medium-priority review comment). - The Postgres LISTEN/NOTIFY event publisher goroutine shuts down cleanly when the server exits, preventing connection leaks (high-priority review comment). Also fix parseSQLiteSourceDSN to handle the file:// prefix before the file: prefix, so that file:///var/lib/hub.db correctly resolves to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for file:// and file:/// DSN forms.
b377aff to
bdbe327
Compare
- Add github.com/jackc/pgx/v5/stdlib (registers as "pgx")
- driver_postgres.go: blank import pgx stdlib instead of lib/pq
- OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB
- Introduce PoolConfig (applied to *sql.DB); thread through
OpenSQLite/OpenPostgres and update all callers
- go mod tidy drops lib/pq
- DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime plus ConnMaxLifetimeDuration() helper - DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1, load-bearing for write serialization) - applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and forces sqlite MaxOpenConns=1; called in both load paths - Mirror fields in V1DatabaseConfig + both conversion directions - Wire pool settings into entc.OpenSQLite in initStore
P0-3: pkg/store/storetest/ — backend-agnostic, table-driven CRUD oracle. A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter. Ships group + policy domains and runs green against today's CompositeStore (SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2. P0-4: internal/fixturegen/ — Go-defined spec seeding >=1 row per table across all 30 domain tables, with edge cases (NULL optionals, max-length strings, nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run ./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table coverage report, and caches the blob to the scratchpad mount. CI gate fails if any table has zero rows.
…FY metrics (P0-5)
Add Ent-backed implementations of the notification, GCP service account, GitHub App installation, and user access token store sub-interfaces: - notification_store.go: NotificationStore (subscriptions, notifications, templates). Dispatch uses an atomic conditional update as the multi-replica claim primitive, and an optional NotificationPublisher designs in the LISTEN/NOTIFY fan-out for created/dispatched events. - external_store.go: GCPServiceAccountStore + GitHubInstallationStore + UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens support key-hash lookup. Legacy api_keys is intentionally not surfaced. - storetest: add GCPServiceAccount, SubscriptionTemplate, and NotificationSubscription CRUD-parity domains. Does not modify composite.go.
- schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT on SQLite, SKIP LOCKED on Postgres). - maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side seed (uuid.New) replacing SQLite randomblob() UUID seeds. - message_store.go: CRUD, read flags, PurgeOldMessages, design-in PublishUserMessage hook for Postgres LISTEN/NOTIFY. - pkg/ent/client_driver.go: hand-written Client.Driver() accessor for dialect detection + raw locking queries.
Implements the Ent-backed store adapters for the user and
allowlist/invite domains, plus their CRUD-parity oracle descriptors.
pkg/store/entadapter/user_store.go (store.UserStore):
- CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/
DeleteUser/ListUsers.
- Case-insensitive email: emails are normalized to lower case on write
(so the plain unique index enforces case-insensitive uniqueness,
equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with
EmailEqualFold (lower(email)=lower($1)) on read. ent codegen +
AutoMigrate cannot emit a real lower(email) functional index across
both SQLite (tests) and Postgres, so the invariant is enforced at the
port layer.
- Offset-based pagination matching the legacy SQLite store.
pkg/store/entadapter/allowlist_store.go (store.AllowListStore +
store.InviteCodeStore):
- Full allow-list + invite-code CRUD.
- BulkAddAllowListEntries uses CreateBulk + OnConflictColumns(email).
Ignore() for race-safe INSERT-OR-IGNORE; added/skipped counts mirror
the legacy per-row semantics (existing + within-batch dups skipped).
- IncrementInviteUseCount is a single atomic conditional UPDATE
(revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)),
which is race-free on both backends without SELECT...FOR UPDATE. The
sql/lock feature is enabled and ForUpdate is available for genuine
multi-statement RMW paths.
- ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is
a plain column, not an Ent edge).
Schema:
- pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed
by UpdateUserLastSeen / lastSeen sort; document the case-insensitive
email strategy.
- pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for
OnConflict and ForUpdate).
Tests (all passing):
- pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain,
InviteCodeDomain oracle descriptors (kept in a separate file to avoid
contending on domains.go).
- entadapter oracle test runs the shared CRUD-parity suite directly
against the new adapters; behavior tests cover case-insensitivity,
bulk idempotency, conditional increment, stats, and the invite join.
NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included.
This is a shared worktree where sibling port agents concurrently modify
schemas and the same feature flags; the generated code must be
regenerated at wave integration via:
go generate ./pkg/ent/...
Verified locally that regeneration + full build + tests pass.
Per P2 scope: composite.go wiring and ensureEntUser shadow removal are
deferred to P2-collapse.
Add Ent-backed store implementations for the secret/env and template/harness domains, mirroring the legacy SQLite semantics: - entadapter/secret_store.go: SecretStore implementing store.SecretStore + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE target->key projection, version bump on update, get-then-update upsert, and transitive ListProgenySecrets via a created_by IN-list over the ancestor set (user scope + allow_progeny only; encrypted value withheld). - entadapter/template_store.go: TemplateStore implementing store.TemplateStore + store.HarnessConfigStore. base_template hierarchy, scope/project_id backwards-compat lookups, content_hash, JSON config/files columns, DeleteByScope. Subscription templates are owned by NotificationStore. - Direct Ent unit tests incl. a progeny-inheritance parity test. - storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired into RunStoreSuite for cross-backend CRUD parity.
Port the project/broker domain (projects, runtime_brokers, project_contributors,
project_sync_state) and the broker-auth domain (broker_secrets,
broker_join_tokens) from raw SQL to Ent adapters.
- pkg/store/entadapter/project_store.go: implements ProjectStore,
RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore.
* provider + sync-state upserts use Ent OnConflict().UpdateNewValues()
(sql/upsert) keyed on the (project_id, broker_id) unique index.
* runtime broker heartbeat/update use an optimistic version-CAS loop on a
new internal lock_version token, serializing concurrent writers portably
across SQLite (tests) and Postgres without SELECT ... FOR UPDATE.
* slug lookups support case-insensitive matching (EqualFold).
* project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are
derived via Ent queries, matching the legacy SQLite store.
- pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore
(per-broker HMAC secrets + short-lived join tokens, expiry cleanup).
- Project Ent schema: add operational fields for full parity
(default_runtime_broker_id, shared_dirs, github_*, git_identity).
- RuntimeBroker Ent schema: relax vestigial type column to Optional, add
internal lock_version concurrency token.
- Regenerate Ent with sql/upsert,sql/lock features.
- storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken
CRUD-parity domains.
- Unit tests for both adapters.
Per the integration plan, composite.go wiring and ensureEntProject shadow
removal are deferred to P2-collapse.
Regenerated with --feature sql/upsert,sql/lock to support OnConflict upserts and ForUpdate/SKIP LOCKED job claims.
Wire all Ent-backed sub-stores into CompositeStore via embedding, removing the raw-SQL base store and the User/Agent/Project shadow-sync machinery (ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves every domain from a single Ent client and implements Close/Ping/Migrate directly. Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList, and InviteCode domains in the storetest CRUD-parity suite. Update entadapter tests for the single-DB NewCompositeStore(client) signature. go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green.
Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling files (brokersecret, gcp_service_account, github_installation, maintenance, messages, notification, project_sync_state, schedule, scheduled_event) plus their tests, including the inline schema-migration scaffold. Keep driver.go, which registers the pure-Go SQLite driver used by Ent's SQLite backend. Repoint the two non-test consumers to the Ent-backed store: - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore. - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB. go build ./... green; no remaining production references to the raw store.
…y PK Replace the removed raw-SQL store in downstream tests with an Ent-backed newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests via a new CompositeStore.DB() escape-hatch accessor. Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID generated a non-UUID secret primary key, which the Ent secret store rejects; it now derives a deterministic UUIDv5. go build ./... green; entadapter and storetest suites green. NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema rejects; addressed in follow-up commits (tid() helper).
Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the UUID-PK Ent store accepts them while preserving cross-reference consistency and ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining failures are behavioral, not ID-format, and are addressed separately. # Conflicts: # pkg/hub/handlers_project_test.go # pkg/hub/httpdispatcher_test.go
Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds built-in maintenance operations (the raw store seeded these in its migrations). initStore and hub test helpers call s.Migrate() so production and tests seed consistently. Fixes the maintenance-operation hub tests (404 'Operation not found'). pkg/hub failures 79 -> 71.
Add slugs/broker names to test fixtures that previously relied on the raw store's lenient (no-validator) inserts: project/agent slugs in the logs test helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on envgather ProjectProvider literals. pkg/hub failures 71 -> 57.
Apply the tid() helper to pkg/secret fixtures (including a dynamically built secret ID) so the UUID-PK Ent store accepts them. pkg/secret now fully green.
…ug/name Wrap broker/grove/agent IDs passed to registerGlobalProjectAndBroker and the dispatcher tests in tid(), and supply RuntimeBroker.slug / ProjectContributor broker_name to satisfy Ent validators. cmd now green except TestDeleteStopped_RequiresGroveContext, which requires the 'docker' binary (absent in this sandbox) and is unrelated to the store migration. # Conflicts: # cmd/server_dispatcher_test.go
Catch IDs that surfaced behind earlier failures (stale-agent-*, agent-visible-authz, agent-profile-hb, env-owner-1). No more UUID-parse errors in pkg/hub; the remaining ~56 failures are behavioral (URL paths built from old raw IDs, assertion mismatches), addressed next.
…arity); test(hub): seed FK users, broker_name, deterministic UUIDs
…id() URLs + FK user seeds in events/stopall
…er/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs
…fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation
Implements 'scion server migrate --from sqlite://... --to postgres://...' per postgres-strategy.md §7.3. - entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL write), MaxOpenConns=1 so the source is never mutated. - entc.MigrateData: generic reflection-based, dependency-ordered copy of all 30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK already exists), atomic per entity (txn), chunked CreateBulk, source/dest row-count verification after each entity, plus the Group.child_groups M2M edge. FK columns are plain fields so edges are preserved via setters. - cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL or keyword form), --keep-source default / --drop-source cutover, progress logging. Verified end-to-end against live CloudSQL Postgres 16 (integration test + real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips, --drop-source removal.
…s (P3-3..6) Add cluster-coordination primitives so N stateless hub processes can share one Postgres, each degrading to a no-op on single-writer SQLite: - store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat, stalled, purge, schedule-evaluator and github-health sweeps to one replica/tick. - store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent claims one-shot events before side effects (dedup across replica startup recovery). - CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single run on SQLite) for future multi-row invariants. - dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5 scaffold; wired into StartBackgroundServices via SetDBMetrics. Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps, notification atomic dispatch). Found and documented the schedule SKIP LOCKED early-commit gap (lock released before the status transition), closed by the singleton evaluator. Audit + budget docs in scratchpad. Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl. 8-way concurrent), pool_sampler_test.go.
…/NOTIFY publisher P3-7: Decouple call sites from the concrete *ChannelEventPublisher. - Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher already had it. - Factor the Publish* methods into a shared eventBuilder (sink func) so every backend emits identical subjects/payloads; ChannelEventPublisher embeds it. - web.go (field + SetEventPublisher), messagebroker.go and notifications.go (field + constructor) now take EventPublisher; handlers_messages.go gates SSE on "not the no-op publisher" instead of a concrete type assertion. P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery). - Per-grove channels plus a global channel (flat exact-match); event type in the JSON envelope. Grove-scoped subjects publish to both the grove channel and the global channel; subscriptions group their patterns by resolved channel so an event is matched only against patterns that opted into the arriving channel (no double delivery). - 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads (TTL-swept so every replica can refetch). - PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish; rollback => no deliver). Delivery flows exclusively through the listener. - Listener goroutine reconnects with backoff and re-LISTENs (resubscribe); dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does not invalidate the pgconn connection). - Emits pkg/observability/dbmetrics signals (published/delivered/dropped, payload size, publish->deliver latency, reconnects, pool stats). - cmd: newEventPublisher selects the backend by database driver (postgres => PostgresEventPublisher, else ChannelEventPublisher) with safe fallback. Tests: routing/registry/payload-offload/metrics/transactional-executor unit tests run without a DB; cross-replica delivery, oversized round-trip, transactional rollback, and reconnect+resubscribe are gated behind SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green. Note: server.go's equivalent type-assertion cleanup is left in the working tree (co-edited with concurrent P0-5/scheduler work) and is functionally optional — HEAD server.go already compiles against the widened interface.
Add pkg/store/enttest: a backend-selecting Ent client factory for the store test suites. Default is in-memory SQLite; built with -tags integration and SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres database (created/dropped via TestMain) and isolates each test in its own schema (search_path) so tests never observe each other's rows. Falls back to SQLite when the env var is unset. Route all entadapter and storetest helpers through enttest.NewClient so the same CRUD-parity oracle runs unchanged against either backend. Fix two real Postgres bugs surfaced by the new path: - entadapter/dialect.go ancestryContains: emit the bind parameter via Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which was not rebound and produced a syntax error; and use jsonb_array_elements_text (the column is jsonb on Postgres, not json). - schedule_store_test ClaimPath: make the concurrent-claim assertion backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent callers may observe a disjoint subset (0..2) and must only never error or exceed 2. Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed).
…ublisher Wave C integration: newEventPublisher can now return a PostgresEventPublisher (LISTEN/NOTIFY) in addition to ChannelEventPublisher. The dispatcher/broker startup previously hard-asserted *ChannelEventPublisher, which silently skipped starting them under Postgres. Gate on (not noop and not nil) instead, matching the existing pattern in handlers_messages.go.
…l default Task 1 — LISTEN/NOTIFY publish path: - Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real POST /api/v1/projects handler with a PostgresEventPublisher and asserts a pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact capability the multi-replica live test probed. Verified PASSING against live CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the current tree. - Bound the autocommit publish (Publish* methods) with publishTimeout (5s). These run synchronously on the caller's (request handler) goroutine and acquire from the event pool; on a connection-starved instance that acquire could block indefinitely, stalling CRUD and silently never emitting NOTIFY. The timeout converts that into a logged error + dropped event (publishing is fire-and-forget). PublishTx (transactional path) is unaffected. Task 2 — connection budget: - Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections set to 100 (out of band).
…tion, pool, NOTIFY, migration, schema, multi-process) Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior the SQLite parity suites cannot reach. Gated by //go:build integration and SCION_TEST_POSTGRES_URL; skips cleanly otherwise. Coverage: - Contention: state_version CAS race (no lost updates, >=N-1 retries, final version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner + disjoint drain), unique-key races (project slug, user email, agent slug). - Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE READ no-phantom snapshot, READ COMMITTED dirty-read prevention. - Pool: exhaustion + queued recovery, saturated pool honoring context deadline, long txn not starving short queries, healing after pg_terminate_backend. - LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener reconnect/resume, cross-channel isolation. - Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration. - Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text non-truncation, TIMESTAMPTZ microsecond precision. - Multi-process: forks the test binary for cross-process advisory-lock exclusivity and cross-process NOTIFY delivery. Configurable concurrency via SCION_TEST_CONCURRENCY (default 10). Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open custom-pool clients and share a DSN with forked child processes; non-integration stubs keep the package API stable.
…k error Stale-connection pool stalls (CloudSQL drops idle conns after ~10m): - Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite) and apply SetConnMaxIdleTime on the database/sql pool. - OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect timeout, so a silently-dropped peer is detected instead of the first query after idle hanging on a dead socket. - pgx event pool (events_postgres.go): set keepalives + connect timeout on both the pool's ConnConfig and the dedicated listener connection, plus MaxConnIdleTime 5m / MaxConnLifetime 30m. Advisory-lock leader election (scheduler.go): - A lock-acquisition error no longer falls open to running the handler unguarded (which would duplicate singleton work across replicas); the tick is skipped and retried next interval. Added regression tests. Test harness (enttest/integrationtest): - Accept libpq keyword/value DSNs (not just URL form) when deriving the ephemeral db/schema/params; add WithConnParam helper. - Fix migration idempotency test's per-pass row-count expectation.
…eout TryAdvisoryLock checked a connection out of the pool and ran the unlock on the full 55s scheduler-handler context (acquire) and an unbounded context.Background() (release). On a pool that could not promptly serve a healthy connection, db.Conn() blocked for the entire 55s before failing with 'context deadline exceeded' on every tick; with several singleton handlers firing each 60s tick, those long-blocked goroutines and their pending pool connection requests piled up across ticks and kept the pool jammed (checked out client-side, idle server-side). The unbounded unlock was a second leak vector: if the held connection died mid critical-section, ExecContext could hang forever, so conn.Close() never ran and the connection leaked out of the pool permanently. Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release (pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries next tick instead of parking a goroutine for ~55s, and so a dead connection can never block release from freeing the conn. Lock semantics are unchanged: cancelling the acquire context tears down only that context, not the checked-out session that holds the lock.
Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema from the removed pkg/store/sqlite store) to the consolidated Ent-backed SQLite schema, in-process on first boot, behind an automatic backup. pkg/ent/entc/migrate_alpha.go: - IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the legacy-only agents.agent_id column (no-op for an Ent/empty/absent file). - MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>), AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then atomically swap the migrated file into place. - Data-driven column mapping (created_at→created, updated_at→updated, agents.agent_id→slug, policies→access_policies); bespoke SQL for the group_members/policy_bindings polymorphic splits and surrogate ids; groups.parent_id→group_child_groups edge. - Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal signing-key secrets; plugin runtime-broker ids) with consistent rewrite of every foreign-key reference via a TEMP _id_remap table. - Tolerates missing legacy tables (older schema versions). cmd/server_foreground.go: detect + migrate in initStore's sqlite path, with a --no-auto-migrate operator opt-out (cmd/server.go). Validated end-to-end against four production hub.db files (scion-integration, -integration2, -demo, -gteam): exact row-count parity (up to ~19k rows), every entity reads back through the live Ent store, idempotent re-runs, and broker FK references resolve post-remap. Pre-existing dangling agent created_by/owner_id refs are faithfully preserved (loader runs FK-off).
…1 starved the pool) The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only bumped postgres to a real pool when the value was <= 0, but a postgres deployment configured via env/driver override inherits the embedded default of 1, so the guard never fired and the Ent pool ran with a SINGLE connection. Effect in production (both integration hubs): every singleton scheduler tick checks out the lone pool connection to hold its advisory lock, then blocks waiting for a second connection to do its work — a self-deadlock that resolves only at the 55s handler context deadline. All API requests serialize behind the one connection, so GET /api/v1/* served in ~55s across the board. Note env overrides could not paper over this: envKeyToConfigKey splits on every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to database.max.open.conns, not database.max_open_conns — silently ignored. Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool default (10) applies; explicit sizing of 2+ is still respected. SQLite remains pinned to 1. Adds regression tests for all three cases.
- broker-dispatch.md: DB-as-state-machine + LISTEN/NOTIFY pattern for cross-replica broker command routing and agent lifecycle dispatch - nfs-workspace.md: NFS workspace coordination for VM (host bind-mount) and K8s/Cloud Run (per-pod mount) runtime models
… and DSN parsing Thread the server's cancellable context into initStore and initWebServer instead of using context.Background(), so that: - DB migrations and the health-check ping cancel on Ctrl+C during startup (medium-priority review comment). - The Postgres LISTEN/NOTIFY event publisher goroutine shuts down cleanly when the server exits, preventing connection leaks (high-priority review comment). Also fix parseSQLiteSourceDSN to handle the file:// prefix before the file: prefix, so that file:///var/lib/hub.db correctly resolves to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for file:// and file:/// DSN forms.
1. Thread the server's cancellable context through maybeMigrateLegacySQLite → MigrateAlphaSQLite so that Ctrl+C during first-boot legacy migration aborts it instead of running with an uncancellable context.Background(). 2. Guard against a double "file:" prefix when constructing the SQLite DSN. If the operator's database.url already starts with "file:", we no longer blindly prepend another "file:" prefix. Also correctly appends cache=shared with "&" when the DSN already contains query parameters.
…se fixup) Upstream renamed hub-native to hub-managed while the PR was in flight. Update the two remaining references that the rebase conflict resolution missed.
bdbe327 to
b0da883
Compare
…by/owner_id) The Agent Ent schema modeled created_by/owner_id as foreign keys to the users table. When an agent creates a sub-agent, those columns hold the *creating agent's* ID, which has no users-table row, so Postgres rejected the insert with a foreign-key violation. mapError maps that to ErrInvalidInput, surfacing as a detail-free "validation_error: Invalid input (status: 400)" on every agent-initiated `scion start`. User-created agents were unaffected, masking the regression (introduced when #304 ported the agent store onto Ent). created_by/owner_id are polymorphic principal references (user OR agent), like ancestry. Drop the User-typed edges and keep them as plain principal UUID fields; resolve the delegation creator by ID and tolerate "no such user". Atlas AutoMigrate drops the two FK constraints on existing DBs at next boot. Tests: the sole sub-agent creation test only passed because it seeded a fake user row sharing the agent's ID — an impossible production state. Remove that workaround so it exercises the real path, and add store/ent regression tests asserting a non-user principal ID is accepted.
Summary
Adds Postgres as a supported database backend alongside SQLite. SQLite remains the default — no existing deployments are affected unless they opt in via
database.driver: postgres.This is the foundation PR. PRs for multi-node broker dispatch and NFS workspace depend on this merging first.
What's included
lib/pq→pgx/v5stdlib; connection pool config (MaxOpenConns,ConnMaxIdleTime,ConnMaxLifetime); CRUD-parity test harness; Cloud Monitoring scaffolding for LISTEN/NOTIFY metricspkg/store/sqlite/)PostgresEventPublisher); dialect-aware multi-replica concurrency primitives (advisory locks, singleton scheduler); parameterized store tests over{sqlite, postgres}ConnMaxIdleTime(5 min default), TCP keepalive GUCs, advisory-lock tick-skip on connection error, bounded lock-checkout timeouthub.db→ Ent-on-SQLite); migration β tool (Ent-SQLite → Ent-Postgres)Test plan
go build ./...passes (verified on branch)go test ./pkg/ent/... ./pkg/store/...passesgo test -tags integration ./pkg/store/integrationtest/...against a Postgres instancedatabase.driver: sqlite(default) — confirm zero behavior changedatabase.driver: postgres+ DSN — confirm hub boots and operates normally