fix(data-warehouse): gate CDC source setup and per-table PK#60274
Merged
Conversation
Postgres CDC (logical replication) cannot identify rows for UPDATE/DELETE without a primary key. The frontend wizard gates on `supports_cdc`, but the backend API previously accepted `sync_type=cdc` with an empty `primary_key_columns`, letting schemas reach streaming mode and brick on the first run when no PK could be discovered. Adds three backend gates: - Source create: pre-fetch PKs via `get_primary_key_columns` and 400 with the list of missing-PK tables before creating the replication slot/publication. - Schema update: when switching `sync_type` to `cdc`, query the source PK, reject if missing, persist if found. - Schema update: when enabling `should_sync` on an existing CDC schema with empty stored PK, reject — covers schemas created before the gate landed. The slot-setup call moves after PK validation so we don't leave replication state behind for configs we're about to refuse.
Contributor
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
products/data_warehouse/backend/api/test/test_external_data_schema.py:323-443
**Prefer parameterised tests for the CDC update pair**
`test_update_schema_to_cdc_persists_primary_key` and `test_update_schema_to_cdc_rejects_table_without_primary_key` test the same code path (PATCH `sync_type=cdc`) with the only meaningful difference being the return value of `get_primary_key_columns` and the expected outcome. The two tests duplicate the source/schema creation, the mock setup for `cdc_pg_connection` and `is_cdc_enabled_for_team`, and the patch for the connection context manager. Collapsing them into a single `@parameterized.expand` case (parameterising on the `get_primary_key_columns` return value, expected status, and expected `sync_type` after the call) would halve the boilerplate and match the team convention for parameterised tests.
Reviews (1): Last reviewed commit: "fix(data-warehouse): reject CDC schemas ..." | Re-trigger Greptile |
When a user skipped the CDC toggle on step 1 of the source wizard but the schema-selection step still let them pick `sync_type=cdc`, the backend would persist CDC schemas without ever calling `_setup_cdc_slot`. The result: no replication slot or publication on the source DB, empty `primary_key_columns` in `sync_type_config`, and every CDC schema failing once it flipped from snapshot to streaming mode. Backend: reject any `sync_type=cdc` schema in the create payload when source- level `cdc_enabled` (`payload.cdc_enabled` + team CDC flag) is False. Frontend: in `getDatabaseSchemas`, clear `cdc_available` on every discovered schema when source-level CDC isn't enabled. The backend's `cdc_available` only reflects the team-level flag, so without this guard `SyncMethodForm` would still offer CDC for every PK-having table even though the source itself has CDC off.
Contributor
|
Size Change: 0 B Total Size: 80.3 MB ℹ️ View Unchanged
|
Per code-review feedback. The two PATCH `sync_type=cdc` tests duplicated source/schema creation and mock setup with only the queried PK and expected outcome differing — fold them into a single `@parameterized.expand` case.
Contributor
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
- Remove the Postgres-specific live PK lookup from `PATCH sync_type=cdc`. Use the caller's `primary_key_columns` or the value discovery already stored; refuse otherwise. Drops the inline imports and source-type branch so the path is source-agnostic. - Switch the enable-sync gate to read `instance.primary_key_columns`. - Drop the redundant `source_type_model == POSTGRES` check on `_setup_cdc_slot` — `cdc_enabled` already implies a CDC-capable source. - Shorten the verbose explanatory comments.
Contributor
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
The new CDC gate refuses sync_type=cdc when neither caller nor stored sync_type_config has primary_key_columns. The fixture in test_update_schema_cdc_with_blank_source_schema_uses_physical_schema_metadata was created pre-gate and needs the PK that discovery would have populated.
Per review — cdc_enabled was only gated on the payload flag + team feature, so a non-Postgres source could set cdc_enabled=True and reach the Postgres-only slot setup. Bake the POSTGRES check into cdc_enabled itself so every downstream block stays safe without repeating the source-type guard.
Gilbert09
approved these changes
May 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Postgres CDC schemas could be created in unworkable states via the API:
sync_type=cdcwhile the source had no replication slot/publication.sync_type=cdcwith an emptyprimary_key_columns, so schemas entered streaming mode and bricked once the snapshot→streaming transition fired.Changes
Three backend gates, one frontend gate:
sync_type=cdcbut source-level CDC isn't enabled (cdc_enablednow also requires a Postgres source).pg_catalogPK lookup on the source. Slot setup moved after PK validation so we don't leave replication state on the source for a rejected config.PATCH sync_type=cdcwhen no PK is available. This path does NOT hit the source — it uses the caller-providedprimary_key_columnsor the value discovery already stored onsync_type_config, and refuses when neither is set. (Earlier revisions did a live lookup here too; dropped per review to keep the update path source-agnostic and avoid a second round-trip.)PATCH should_sync=trueon a CDC schema with empty stored PK (covers schemas created before this gate).cdc_availableon every discovered schema when source-level CDC isn't enabled, soSyncMethodFormstops offering CDC for those sources.How did you test this code?
Agent-authored — automated tests only. Added (and parameterised) tests for each gate:
test_create_postgres_cdc_rejects_table_without_primary_keytest_create_rejects_cdc_schemas_when_source_cdc_disabledtest_update_schema_to_cdc(parameterised: reuses stored PK / caller override wins / 400 when neither set)test_update_schema_enable_should_sync_rejects_cdc_without_primary_keyAll pass locally via
./bin/hogli test.Publish to changelog?
no
Docs update
skip-inkeep-docs
🤖 Agent context
Found by debugging a prod source where every CDC schema was failing. Source
job_inputshad nocdc_enabled/slot/publication keys but all schemas hadsync_type=cdcwithprimary_key_columns: []—_setup_cdc_slotnever ran. Two regressions in the same flow: backend let CDC slip in when source-level CDC was off, and the wizard'scdc_availableonly reflected the team flag, not the source toggle. Source-create does a livepg_catalogPK lookup; the schema-update path trusts the caller value or the PK discovery already stored, so it stays source-agnostic.Tools: Claude Code.