feat(data-warehouse): column selection on MySQL, MSSQL, BigQuery, Snowflake, Redshift#60155
Conversation
Hoists the column-projection logic (`build_select_clause`,
`filter_columns_by_enabled_columns`, `filter_dwh_columns_by_enabled_columns`,
`postgres_schema_metadata`) from Postgres-specific helpers into
`posthog/temporal/data_imports/sources/common/sql/{projection,metadata}.py`.
Every SQL source can now reuse them.
Adds the `SQLSource.supports_column_selection` capability flag (Postgres
flips to `True`; every other SQL source stays `False` until PR2 enables
their `enabled_columns` plumbing), exposes it on `ExternalDataSourceSerializer`
so the frontend can gate the picker without a hardcoded source-type list,
and threads `enabled_columns` through `SourceInputs` so a single place
populates it for every SQL driver.
Also extends `SelectQueryBuilder.select_all` with `enabled_columns` /
`primary_keys` so MySQL/MSSQL/Snowflake/Redshift/BigQuery pick up
column projection for free in PR2, and adds a generic
`SQLSource.reconcile_schema_metadata` hook + a `prune_enabled_columns`
pass so a source-side column drop can't break the next sync.
No user-visible behavior change in this PR — Postgres is still the only
source with `supports_column_selection=True`. PR2 broadens to MySQL,
MSSQL, BigQuery, Snowflake, Redshift and lands the warehouse-mode UI.
Generated-By: PostHog Code
Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
|
Size Change: 0 B Total Size: 80.7 MB ℹ️ View Unchanged
|
|
| projected = compute_projected_columns(enabled_columns, primary_keys, incremental_field) | ||
| if projected is None: | ||
| return "*" | ||
| return ", ".join(f"`{column}`" for column in projected) |
There was a problem hiding this comment.
BigQuery SQL injection via unvalidated column names in
_bq_select_clause
_bq_select_clause (bigquery.py line 387) embeds column names directly with f"`{column}`" — it is the only driver that bypasses IdentifierQuoter._validate_identifier(). Every other driver routes through format_projected_select_clause(projected, quoter) which calls _validate_identifier and raises InvalidIdentifierError for any character outside the allowlist (so backtick, semicolon, space, etc. are rejected for MySQL/MSSQL/Snowflake; psycopg escaping handles Redshift).
The API-layer guard in ExternalDataSchemaSerializer.update() only rejects unknown column names if known: — i.e., only when schema_metadata.columns is already populated. For legacy BigQuery sources that have never been refreshed (or any source where schema_metadata is null), arbitrary strings can be stored as enabled_columns.
A team member with write access to warehouse schema configuration can then set enabled_columns to a value containing a backtick (e.g. col1\, (SELECT secret FROM `proj.ds`.`t` LIMIT 1) --) on such a source, trigger an incremental sync (or a VIEW/MATERIALIZED_VIEW sync), and the injected subquery executes against the BigQuery project under the stored service-account credentials. The incremental path goes through _get_query()→_bq_select_clause()and the result is fed directly tobq_client.query()`. Impact is scoped to tables accessible to the service-account key (attacker's own organization's GCP project), but can expose columns/tables beyond what column selection was intended to restrict.
Prompt To Fix With AI
Replace the raw f-string backtick wrapping in `_bq_select_clause` with `format_projected_select_clause` using a `BacktickIdentifierQuoter` (or a new `BigQueryIdentifierQuoter` that uses backticks), matching exactly what MySQL does. This ensures `_validate_identifier` rejects any column name that contains a backtick or other disallowed character before it reaches the SQL string.
Concrete change in `bigquery.py`:
1. Import `BacktickIdentifierQuoter` and `format_projected_select_clause` from `posthog.temporal.data_imports.sources.common.sql`.
2. Add a module-level quoter: `_BQ_IDENTIFIER_QUOTER = BacktickIdentifierQuoter()`
3. Replace `_bq_select_clause`:
```python
def _bq_select_clause(
enabled_columns: list[str] | None,
primary_keys: list[str] | None,
incremental_field: str | None,
) -> str:
projected = compute_projected_columns(enabled_columns, primary_keys, incremental_field)
return format_projected_select_clause(projected, _BQ_IDENTIFIER_QUOTER)
```
This is a one-line change to the return and makes BigQuery consistent with all other drivers.Severity: medium | Confidence: 65%
Fixed in 5ac3078f.
…factor Two cleanups from Greptile review on #60153: 1. `compute_projected_columns` had redundant `column in retained` guards in each iteration. Since `retained` was built from `enabled_columns + primary_keys + incremental_field`, every iterated element was guaranteed to be in `retained`. Drop the set and rely on `seen` for dedup. 2. Add tests for `project_arrow_columns`: `retained=None` pass-through, normal subset, source-order preservation, all-missing fallback, partial overlap. The silent fallback when no columns match could otherwise mask a driver passing the wrong column list in PR2. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
dacd58a to
93114cc
Compare
PR overviewAll previously flagged issues have been addressed. No open security concerns remain on this pull request. Security reviewNo open security issues remain on this pull request. Fixed/addressed: 1 · PR risk: 0/10 |
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
Drop self-explanatory comments, history references, and PR-specific prose. Shorten docstrings to one line where the function name plus the module docstring already cover the semantics. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
8b0f5be to
097150e
Compare
…line imports The Postgres `build_select_clause` helper was a 4-line wrapper around `compute_projected_columns` plus a `psycopg.sql.Identifier` wrap. Inline at the two call sites (`postgres.py`, `partitioned_tables.py`) and delete the file. Move four inline imports to module top — `reconcile_postgres_schemas` in the API + `PostgresSource`, and `ExternalDataSchema` in `SQLSource.reconcile_schema_metadata`. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…ry, Snowflake, Redshift Builds on PR #60153. Flips `supports_column_selection=True` on the five remaining SQL drivers, threading `enabled_columns` from `SourceInputs` through each driver's `_build_query` and Arrow-schema projection so the SELECT clause and the streaming reader's row shape stay in lock-step. Drivers: - **MySQL / MSSQL / Redshift**: project via `compute_projected_columns` + each driver's existing identifier quoter (`BacktickIdentifierQuoter`, `BracketIdentifierQuoter`, psycopg `sql.Identifier`). Arrow schema is filtered through `project_arrow_columns` so `cursor.description` matches the projected SELECT. - **Snowflake**: hand-rolled `_build_query` keeps the `IDENTIFIER(%s)` table-name wrapping, but the SELECT clause now renders the explicit column list via `AnsiIdentifierQuoter` when projection is active. - **BigQuery**: rewrites `_get_query` to honor `enabled_columns`. On the direct-storage-API path (non-incremental tables), the `ReadSession.TableReadOptions.selected_fields` field restricts the stream to the projected columns; on the query path (incremental + view/materialized_view), the destination temp table already holds only the projected columns. API + frontend: - Drops the POSTGRES gate at `external_data_source.py:1168` so every SQL source persists `schema_metadata.columns` on create — the column picker has `available_columns` to render immediately. Non-Postgres SQL sources also route through `SQLSource.reconcile_schema_metadata` on `refresh_schemas` so the metadata stays current. - `SchemaScene.showColumnsSection` now reads `source.supports_column_selection` with a fallback to the legacy `available_columns.length > 0` check — picker surfaces even before the first refresh has populated columns. Per-driver tests added under `tests/test_<driver>.py::TestBuildQueryEnabledColumns` mirroring the existing Postgres projection test cases. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
6736fad to
2da241e
Compare
…-selection-all-db-sources-pr2 # Conflicts: # products/data_warehouse/backend/api/external_data_source.py
…er available_columns across sql sources Hoist the "Columns" button + `ColumnSelectionModal` from `DirectQuerySchemasTab` into `ManagedSchemasTab`, gated on `source.supports_column_selection` (no source-type list in TS). Adds parameterized API tests covering `available_columns` + `supports_column_selection` for every SQL source so the serializer's source-type-agnostic behavior is locked in. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…river retained-column helpers Routes BigQuery's `_bq_select_clause` through `format_projected_select_clause` + `BacktickIdentifierQuoter` so user-controlled `enabled_columns` go through the same allowlist check every other SQL driver uses. Without this, a column name like `` `x` FROM `other.dataset.private` -- `` would inject into the SELECT list (Veria HIGH / Hex MEDIUM). Also removes the three near-identical `_retained_column_names` helpers from MySQL / MSSQL / Redshift in favor of the existing `compute_projected_columns` + `project_arrow_columns` chain in `common/sql/projection.py` — same source- order behavior, one less code path to keep in sync. Migrates the MySQL `TestBuildQueryEnabledColumns` test off `parameterized` to native `pytest.mark.parametrize` for style consistency with the rest of the driver tests added in this PR. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…ection `SchemaForm` only checked `!isDirectQueryMode`, so the column picker was shown for every non-direct-query SQL source — including ones whose driver silently ignores `enabled_columns` at sync time. Adds a `supportsColumnSelection` flag to `SourceConfig` (sourced from `SQLSource.supports_column_selection`), piped through both `wizard` and `public_source_configs` endpoints, and gates the picker on it in the wizard. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
The original gating used `a ?? (b > 0)`, which is correct but depends on JavaScript operator precedence reading. Split into two named locals so the intent — "show if the source advertises column selection OR available columns are already populated" — is obvious and survives future edits. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
…as tab The per-schema configuration page already exposes column selection through the dedicated "Columns" section in the sidebar, so the per-row button next to "Configure" was just a second entry point to the same flow. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…ll new schemas" button When a schema row has no `available_columns` (e.g. a source created before PR2's `reconcile_schema_metadata` ran), the columns section showed a dead end. Mirrors the column-picker modal's empty-state copy and routes through the existing `refreshSchemas` action so the user can fix it from the same page instead of bouncing back to the Schemas tab. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…g master merge The merge from master into this branch (522b7ca) had a conflict in external_data_source.py that I resolved without preserving the new CDC safety checks master had added in parallel. Two validations and their regression tests were silently dropped: - Refuse per-schema `sync_type=cdc` when source-level CDC is off — prevents saving schemas that would never get a replication slot. - Refuse CDC tables with no primary key — prevents creating replication state on the customer source for a config we'd reject downstream. `_setup_cdc_slot` also runs after PK validation again, restoring the ordering invariant ("don't leave replication state on the source for a config we're about to refuse"). Both regression tests are back too. No interaction with PR2's column-selection changes — the dropped block was unrelated production code that the merge swallowed. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…sfy mypy `get_postgres_source_table_location` returns `tuple[str | None, str, str]`, so mypy inferred `resolved_source_schema` / `resolved_source_table_name` as `str` from the if-branch and rejected the `else` branch's `... if X else None` assignments. `sql_schema_metadata` already accepts `str | None`, so widen the variable annotations to match. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
| table=bq_table.to_bqstorage(), | ||
| data_format=bigquery_storage.DataFormat.ARROW, | ||
| read_options=bigquery_storage.ReadSession.TableReadOptions( | ||
| selected_fields=storage_selected_fields or [], |
There was a problem hiding this comment.
Critical bug: selected_fields=storage_selected_fields or [] will pass an empty list to BigQuery when storage_selected_fields is None, causing BigQuery to select zero columns instead of all columns.
When doing a full-refresh sync (not incremental) with no column projection (SELECT *), storage_selected_fields remains None (initialized line 732, only set in lines 784-785). The or [] fallback converts this to an empty list, which BigQuery interprets as "select no fields".
Fix:
read_options=bigquery_storage.ReadSession.TableReadOptions(
selected_fields=storage_selected_fields, # Remove the `or []`
arrow_serialization_options=...
)BigQuery's API treats None/absent as "all fields" but an empty list as "no fields".
| selected_fields=storage_selected_fields or [], | |
| selected_fields=storage_selected_fields, |
Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
There was a problem hiding this comment.
False positive — for BigQuery Storage API TableReadOptions.selected_fields (a repeated string proto field), an empty list is the documented signal for "read all fields", not zero. Per the BQ Storage API docs: "Names of the fields in the table that should be read. If empty, all fields will be read."
So storage_selected_fields or [] correctly produces:
None or [] → []→ all columns read (full-refresh no projection, intentional)["col_a"] or [] → ["col_a"]→ only selected column read (projection case)
No change needed.
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
posthog/temporal/data_imports/sources/snowflake/tests/test_snowflake.py:160-173
`test_full_refresh_subset_keeps_incremental_field` calls `_build_query` with `should_use_incremental_field=False` and `incremental_field=None`, so there is no incremental field to "keep". The assertion (`'"EMAIL"' in sql`, `'"ID"' in sql`) is also a strict subset of what `test_full_refresh_subset_projects_pk_retained` already checks with `sql.startswith('SELECT "EMAIL", "ID" FROM IDENTIFIER(%s)')`. The test says something different from what it does (OnceAndOnlyOnce).
```suggestion
@pytest.mark.parametrize(
"enabled_columns,primary_keys,incremental_field,expected_start",
[
(["EMAIL"], ["ID"], None, 'SELECT "EMAIL", "ID" FROM IDENTIFIER(%s)'),
(["EMAIL"], ["ID"], "CREATED_AT", 'SELECT "EMAIL", "ID", "CREATED_AT" FROM IDENTIFIER(%s)'),
],
)
def test_full_refresh_subset_retains_pk_and_incremental_field(
self, enabled_columns, primary_keys, incremental_field, expected_start
):
sql, _ = _build_query(
"DB",
"PUBLIC",
"t",
False,
incremental_field,
None,
None,
enabled_columns=enabled_columns,
primary_keys=primary_keys,
)
assert sql.startswith(expected_start)
```
Reviews (2): Last reviewed commit: "fix(data-warehouse): annotate resolved_s..." | Re-trigger Greptile |
…ions to satisfy mypy The CDC pre-fetch loop earlier in the function already typed `resolved_source_schema` / `resolved_source_table_name` as `str` from `get_postgres_source_table_location`. When PR2 added the non-Postgres `else` branch with `... if X else None`, mypy flagged it as `str | None` re-binding into a `str` name. Split the two concerns: - `metadata_source_*` (`str | None`) feeds the source-type-agnostic `sql_schema_metadata` call, which already accepts `None`. - `postgres_db_schema` / `postgres_db_table_name` (`str`) hold the guaranteed-non-None Postgres resolver output, used by the CDC publication + direct-postgres upsert paths that only ever run for Postgres sources anyway. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…rce, drop literal source-type set - All 6 concrete `SQLSource` subclasses overrode `supports_column_selection` to `True`. Flipping the base default and dropping the overrides is a net deletion that better matches the contract: a class extending `SQLSource` can select columns unless it explicitly opts out. - `_SQL_SOURCE_TYPES_WITH_COLUMN_SELECTION` was a literal mirror of the same flag, kept around because test mocks of `SourceRegistry.get_source` don't satisfy `isinstance(SQLSource)`. The schema-metadata gate now uses `bool(getattr(source, "supports_column_selection", False))` — same coercion the serializer already uses, lets mocks satisfy it. The `refresh_schemas` gate keeps the `isinstance(SQLSource)` check (needed to call the typed hook) but reads the flag off the source directly instead of a parallel enum set. - Trim a few overly-explanatory comments left over from earlier review iterations. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…ource, drop dual postgres location variables Three cleanups from review: - Default `supports_column_selection = False` on `_BaseSource` so every source has the attribute. `getattr(source, "supports_column_selection", False)` collapses to `source.supports_column_selection` at all 4 call sites (serializer, schema-metadata gate, wizard, public source configs viewset). The `bool()` wrap stays on the serializer (Mock attrs aren't orjson-safe). - The previous fix introduced parallel `postgres_db_schema` / `postgres_db_table_name` locals to satisfy mypy at the Postgres-only call sites (CDC publication, direct-postgres upsert). Replace with a single set of `metadata_source_*` locals plus narrowing `assert` at each Postgres-only call site — `get_postgres_source_table_location` already guarantees non-None inside that branch. - `SchemaScene.showColumnsSection` no longer falls back to `available_columns.length > 0` — the legacy fallback was added during a local-env diagnosis that turned out to be a stale Vite issue, not a real backward-compat concern. Single source of truth is now the backend `supports_column_selection` flag. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…rowing Asserts get stripped under `python -O` and feel out of place in a request handler. `cast(str, ...)` is the mypy-only narrowing we actually want here — no runtime check, no behavior change. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
|
🎭 Playwright report · View test results →
These issues are not necessarily caused by your changes. |
Per-column sync selection is now supported on Redshift (previously Postgres only). Adds documentation for selecting specific columns during source setup and after setup. Ref: PostHog/posthog#60155
Problem
Postgres has had per-column sync selection for a while. Every other SQL source we ingest from — MySQL, MSSQL, BigQuery, Snowflake, Redshift — was hard-coded to
SELECT *. That inflates sync cost on wide tables, leaks PII columns into HogQL, and trips up customers who want to exclude expensive blob/JSON columns from their warehouse.PR1 hoisted the column-projection logic into shared SQL primitives. This PR flips the capability on for the five remaining SQL drivers and surfaces the picker in the warehouse-mode UI.
Changes
Per-driver
_build_query+ Arrow projectionEvery driver now reads
inputs.enabled_columnsand threads it into the SQL builder. The Arrow schema fed to dlt is projected throughproject_arrow_columns(or each driver's hand-rolled equivalent) socursor.descriptionmatches the projected SELECT.mysql.py)_QUERY_BUILDER.select_all(..., enabled_columns=, primary_keys=)viaBacktickIdentifierQuoter. Arrow schema projected viaproject_arrow_columns.mssql.py)format_projected_select_clause(...)viaBracketIdentifierQuoter. Same Arrow projection.redshift.py)_redshift_select_clause(...)wrapscompute_projected_columnswithpsycopg.sql.Identifier(...)— keeps psycopg's escaping for parity with Postgres.snowflake.py)IDENTIFIER(%s)table-name wrapping; SELECT clause renders the explicit column list viaAnsiIdentifierQuoter.bigquery.py)SELECT col1, col2 FROM …so it holds only the projection. Direct-storage-API path:ReadSession.TableReadOptions.selected_fieldsrestricts the stream.Each driver also sets
supports_column_selection: bool = Trueon itsSQLSourcesubclass — that flag flows out the API to the frontend.API + persistence
if source_type_model == POSTGRESgate inexternal_data_source.py:1168so every SQL source persistsschema_metadata.columnson create. The column picker hasavailable_columnsto render immediately — no waiting for the first 6h refresh._SQL_SOURCE_TYPES_WITH_COLUMN_SELECTIONconstant gates this and avoidsisinstance(source, SQLSource)checks that tests'Mock(SourceRegistry.get_source)can't satisfy.refresh_schemasalso routes non-Postgres SQL sources through theirSQLSource.reconcile_schema_metadatahook (Postgres keeps its direct call toreconcile_postgres_schemasfor the direct-queryDataWarehouseTablerebuild path).Frontend
SchemaScene.showColumnsSectionnow readssource.supports_column_selectionwith a fallback to the legacyavailable_columns.length > 0check. The picker surfaces even before the first refresh has populated columns —ColumnsSectionitself shows a "no columns discovered yet" hint in that state.Tests
Each driver gets a
TestBuildQueryEnabledColumnsclass mirroringtest_postgres.py:None→SELECT *SELECT col, pk FROM …(PK retained)SELECT col, pk, incremental_field FROM …withWHERE incremental_field > …[]with no PK / incremental → falls back toSELECT *[]with PK only →SELECT pk FROM …Plus a BigQuery-specific
_bq_select_clauseparameterized test.How did you test this code?
Agent-authored. I'm an AI agent; all testing below is automated.
mysql,mssql,bigquery,snowflake,redshift) + new projection casesposthog/temporal/data_imports/sources/suiteruff check+ruff formathogli build:openapiNot tested manually — the configuration UI loads behind a route, the SQL driver behavior requires real DB connections to verify end-to-end. The PR1 test plan listed the manual verifications a reviewer should run before merge — same plan applies here:
DataWarehouseTablewith only those columns + PKs.Publish to changelog?
yes — "Per-column sync selection now works on MySQL, MSSQL, BigQuery, Snowflake, and Redshift (previously Postgres only). Cuts sync cost on wide tables and lets you exclude columns from HogQL."
Docs update
Skip — column-selection docs already cover the Postgres case; updating to drop the "Postgres only" caveat is a docs-only follow-up.
🤖 Agent context
Authored by PostHog Code (Claude Opus 4.7), second of two PRs stacked on the user-approved plan at
claude/plans/1-about-stuff-like-lovely-sundae.md.Key decisions:
_SQL_SOURCE_TYPES_WITH_COLUMN_SELECTIONis a literal set, notisinstance(source, SQLSource)— every test intest_external_data_source.pymocksSourceRegistry.get_source, so anisinstancecheck would always be False in tests. The set duplicates thesupports_column_selection = Trueattribute on each driver class, which is a minor maintenance cost; the alternative (rewriting ~10 test mocks to usespec=PostgresSource) is worse.bq_client.query(...), so projecting at the query level is enough — the Storage API session reads from the projected temp table. For the direct-table read path,TableReadOptions.selected_fieldsis the only knob that restricts the stream.IDENTIFIER(%s)for the table name but projects the SELECT clause inline viaAnsiIdentifierQuoter. Snowflake doesn't accept identifier-bind for column lists, so this is necessary.??for backward compat: if an old API client doesn't returnsupports_column_selection, we fall back to the legacyavailable_columns.length > 0check rather than hiding the picker. Avoids a flicker for sessions that span a deploy.Created with PostHog Code