refactor(data-warehouse): hoist column-selection helpers into common/sql#60153
Conversation
Hoists the column-projection logic (`build_select_clause`,
`filter_columns_by_enabled_columns`, `filter_dwh_columns_by_enabled_columns`,
`postgres_schema_metadata`) from Postgres-specific helpers into
`posthog/temporal/data_imports/sources/common/sql/{projection,metadata}.py`.
Every SQL source can now reuse them.
Adds the `SQLSource.supports_column_selection` capability flag (Postgres
flips to `True`; every other SQL source stays `False` until PR2 enables
their `enabled_columns` plumbing), exposes it on `ExternalDataSourceSerializer`
so the frontend can gate the picker without a hardcoded source-type list,
and threads `enabled_columns` through `SourceInputs` so a single place
populates it for every SQL driver.
Also extends `SelectQueryBuilder.select_all` with `enabled_columns` /
`primary_keys` so MySQL/MSSQL/Snowflake/Redshift/BigQuery pick up
column projection for free in PR2, and adds a generic
`SQLSource.reconcile_schema_metadata` hook + a `prune_enabled_columns`
pass so a source-side column drop can't break the next sync.
No user-visible behavior change in this PR — Postgres is still the only
source with `supports_column_selection=True`. PR2 broadens to MySQL,
MSSQL, BigQuery, Snowflake, Redshift and lands the warehouse-mode UI.
Generated-By: PostHog Code
Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
|
🎭 Playwright didn't run on this PR — your changes touch code that could affect E2E behavior, but Playwright is opt-in via label now to keep CI cost down. Add the Most PRs don't need this. Real regressions still get caught on master and fix-forward. |
|
Size Change: 0 B Total Size: 80.2 MB ℹ️ View Unchanged
|
Prompt To Fix All With AIFix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
posthog/temporal/data_imports/sources/common/sql/projection.py:60-71
The `column in retained` guard in the first loop is always `True` because `retained` is initialised as `set(enabled_columns)`, making it a superset of every element that will be iterated. The same redundancy appears for `pk in retained` in the second loop (all PKs are explicitly added to `retained`), and for `incremental_field in retained` (also added above). These checks add noise without protecting against any real case — simplifying removes the conceptual misdirection.
```suggestion
seen: set[str] = set()
ordered: list[str] = []
for column in enabled_columns:
if column not in seen:
seen.add(column)
ordered.append(column)
for column in primary_keys or []:
if column not in seen:
seen.add(column)
ordered.append(column)
if incremental_field and incremental_field not in seen:
ordered.append(incremental_field)
```
### Issue 2 of 2
posthog/temporal/data_imports/sources/common/sql/tests/test_projection.py:1-21
`project_arrow_columns` is the only exported function in `projection.py` with no tests. Its silent fallback — when `retained` is non-`None` but none of the requested column names exist in the table it returns the original table unchanged — is the kind of behaviour that could mask a driver passing the wrong column list in PR2. Adding a parameterised test covering at least `retained=None` (pass-through), a normal subset, and the all-missing fallback would match the coverage level of every other function in this module.
Reviews (1): Last reviewed commit: "refactor(data-warehouse): hoist column-s..." | Re-trigger Greptile |
| seen: set[str] = set() | ||
| ordered: list[str] = [] | ||
| for column in enabled_columns: | ||
| if column in retained and column not in seen: | ||
| seen.add(column) | ||
| ordered.append(column) | ||
| for column in primary_keys or []: | ||
| if column in retained and column not in seen: | ||
| seen.add(column) | ||
| ordered.append(column) | ||
| if incremental_field and incremental_field in retained and incremental_field not in seen: | ||
| ordered.append(incremental_field) |
There was a problem hiding this comment.
The
column in retained guard in the first loop is always True because retained is initialised as set(enabled_columns), making it a superset of every element that will be iterated. The same redundancy appears for pk in retained in the second loop (all PKs are explicitly added to retained), and for incremental_field in retained (also added above). These checks add noise without protecting against any real case — simplifying removes the conceptual misdirection.
| seen: set[str] = set() | |
| ordered: list[str] = [] | |
| for column in enabled_columns: | |
| if column in retained and column not in seen: | |
| seen.add(column) | |
| ordered.append(column) | |
| for column in primary_keys or []: | |
| if column in retained and column not in seen: | |
| seen.add(column) | |
| ordered.append(column) | |
| if incremental_field and incremental_field in retained and incremental_field not in seen: | |
| ordered.append(incremental_field) | |
| seen: set[str] = set() | |
| ordered: list[str] = [] | |
| for column in enabled_columns: | |
| if column not in seen: | |
| seen.add(column) | |
| ordered.append(column) | |
| for column in primary_keys or []: | |
| if column not in seen: | |
| seen.add(column) | |
| ordered.append(column) | |
| if incremental_field and incremental_field not in seen: | |
| ordered.append(incremental_field) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/data_imports/sources/common/sql/projection.py
Line: 60-71
Comment:
The `column in retained` guard in the first loop is always `True` because `retained` is initialised as `set(enabled_columns)`, making it a superset of every element that will be iterated. The same redundancy appears for `pk in retained` in the second loop (all PKs are explicitly added to `retained`), and for `incremental_field in retained` (also added above). These checks add noise without protecting against any real case — simplifying removes the conceptual misdirection.
```suggestion
seen: set[str] = set()
ordered: list[str] = []
for column in enabled_columns:
if column not in seen:
seen.add(column)
ordered.append(column)
for column in primary_keys or []:
if column not in seen:
seen.add(column)
ordered.append(column)
if incremental_field and incremental_field not in seen:
ordered.append(incremental_field)
```
How can I resolve this? If you propose a fix, please make it concise.| """Tests for the column-projection helpers in `common/sql/projection.py`.""" | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import pytest | ||
|
|
||
| from parameterized import parameterized | ||
|
|
||
| from posthog.temporal.data_imports.sources.common.sql.identifiers import ( | ||
| AnsiIdentifierQuoter, | ||
| BacktickIdentifierQuoter, | ||
| BracketIdentifierQuoter, | ||
| InvalidIdentifierError, | ||
| ) | ||
| from posthog.temporal.data_imports.sources.common.sql.projection import ( | ||
| compute_projected_columns, | ||
| filter_columns_by_enabled_columns, | ||
| filter_dwh_columns_by_enabled_columns, | ||
| format_projected_select_clause, | ||
| prune_enabled_columns, | ||
| ) |
There was a problem hiding this comment.
project_arrow_columns is the only exported function in projection.py with no tests. Its silent fallback — when retained is non-None but none of the requested column names exist in the table it returns the original table unchanged — is the kind of behaviour that could mask a driver passing the wrong column list in PR2. Adding a parameterised test covering at least retained=None (pass-through), a normal subset, and the all-missing fallback would match the coverage level of every other function in this module.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/data_imports/sources/common/sql/tests/test_projection.py
Line: 1-21
Comment:
`project_arrow_columns` is the only exported function in `projection.py` with no tests. Its silent fallback — when `retained` is non-`None` but none of the requested column names exist in the table it returns the original table unchanged — is the kind of behaviour that could mask a driver passing the wrong column list in PR2. Adding a parameterised test covering at least `retained=None` (pass-through), a normal subset, and the all-missing fallback would match the coverage level of every other function in this module.
How can I resolve this? If you propose a fix, please make it concise.
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
…factor Two cleanups from Greptile review on #60153: 1. `compute_projected_columns` had redundant `column in retained` guards in each iteration. Since `retained` was built from `enabled_columns + primary_keys + incremental_field`, every iterated element was guaranteed to be in `retained`. Drop the set and rely on `seen` for dedup. 2. Add tests for `project_arrow_columns`: `retained=None` pass-through, normal subset, source-order preservation, all-missing fallback, partial overlap. The silent fallback when no columns match could otherwise mask a driver passing the wrong column list in PR2. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…ry, Snowflake, Redshift Builds on PR #60153. Flips `supports_column_selection=True` on the five remaining SQL drivers, threading `enabled_columns` from `SourceInputs` through each driver's `_build_query` and Arrow-schema projection so the SELECT clause and the streaming reader's row shape stay in lock-step. Drivers: - **MySQL / MSSQL / Redshift**: project via `compute_projected_columns` + each driver's existing identifier quoter (`BacktickIdentifierQuoter`, `BracketIdentifierQuoter`, psycopg `sql.Identifier`). Arrow schema is filtered through `project_arrow_columns` so `cursor.description` matches the projected SELECT. - **Snowflake**: hand-rolled `_build_query` keeps the `IDENTIFIER(%s)` table-name wrapping, but the SELECT clause now renders the explicit column list via `AnsiIdentifierQuoter` when projection is active. - **BigQuery**: rewrites `_get_query` to honor `enabled_columns`. On the direct-storage-API path (non-incremental tables), the `ReadSession.TableReadOptions.selected_fields` field restricts the stream to the projected columns; on the query path (incremental + view/materialized_view), the destination temp table already holds only the projected columns. API + frontend: - Drops the POSTGRES gate at `external_data_source.py:1168` so every SQL source persists `schema_metadata.columns` on create — the column picker has `available_columns` to render immediately. Non-Postgres SQL sources also route through `SQLSource.reconcile_schema_metadata` on `refresh_schemas` so the metadata stays current. - `SchemaScene.showColumnsSection` now reads `source.supports_column_selection` with a fallback to the legacy `available_columns.length > 0` check — picker surfaces even before the first refresh has populated columns. Per-driver tests added under `tests/test_<driver>.py::TestBuildQueryEnabledColumns` mirroring the existing Postgres projection test cases. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
Drop self-explanatory comments, history references, and PR-specific prose. Shorten docstrings to one line where the function name plus the module docstring already cover the semantics. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…ry, Snowflake, Redshift Builds on PR #60153. Flips `supports_column_selection=True` on the five remaining SQL drivers, threading `enabled_columns` from `SourceInputs` through each driver's `_build_query` and Arrow-schema projection so the SELECT clause and the streaming reader's row shape stay in lock-step. Drivers: - **MySQL / MSSQL / Redshift**: project via `compute_projected_columns` + each driver's existing identifier quoter (`BacktickIdentifierQuoter`, `BracketIdentifierQuoter`, psycopg `sql.Identifier`). Arrow schema is filtered through `project_arrow_columns` so `cursor.description` matches the projected SELECT. - **Snowflake**: hand-rolled `_build_query` keeps the `IDENTIFIER(%s)` table-name wrapping, but the SELECT clause now renders the explicit column list via `AnsiIdentifierQuoter` when projection is active. - **BigQuery**: rewrites `_get_query` to honor `enabled_columns`. On the direct-storage-API path (non-incremental tables), the `ReadSession.TableReadOptions.selected_fields` field restricts the stream to the projected columns; on the query path (incremental + view/materialized_view), the destination temp table already holds only the projected columns. API + frontend: - Drops the POSTGRES gate at `external_data_source.py:1168` so every SQL source persists `schema_metadata.columns` on create — the column picker has `available_columns` to render immediately. Non-Postgres SQL sources also route through `SQLSource.reconcile_schema_metadata` on `refresh_schemas` so the metadata stays current. - `SchemaScene.showColumnsSection` now reads `source.supports_column_selection` with a fallback to the legacy `available_columns.length > 0` check — picker surfaces even before the first refresh has populated columns. Per-driver tests added under `tests/test_<driver>.py::TestBuildQueryEnabledColumns` mirroring the existing Postgres projection test cases. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…line imports The Postgres `build_select_clause` helper was a 4-line wrapper around `compute_projected_columns` plus a `psycopg.sql.Identifier` wrap. Inline at the two call sites (`postgres.py`, `partitioned_tables.py`) and delete the file. Move four inline imports to module top — `reconcile_postgres_schemas` in the API + `PostgresSource`, and `ExternalDataSchema` in `SQLSource.reconcile_schema_metadata`. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…ry, Snowflake, Redshift Builds on PR #60153. Flips `supports_column_selection=True` on the five remaining SQL drivers, threading `enabled_columns` from `SourceInputs` through each driver's `_build_query` and Arrow-schema projection so the SELECT clause and the streaming reader's row shape stay in lock-step. Drivers: - **MySQL / MSSQL / Redshift**: project via `compute_projected_columns` + each driver's existing identifier quoter (`BacktickIdentifierQuoter`, `BracketIdentifierQuoter`, psycopg `sql.Identifier`). Arrow schema is filtered through `project_arrow_columns` so `cursor.description` matches the projected SELECT. - **Snowflake**: hand-rolled `_build_query` keeps the `IDENTIFIER(%s)` table-name wrapping, but the SELECT clause now renders the explicit column list via `AnsiIdentifierQuoter` when projection is active. - **BigQuery**: rewrites `_get_query` to honor `enabled_columns`. On the direct-storage-API path (non-incremental tables), the `ReadSession.TableReadOptions.selected_fields` field restricts the stream to the projected columns; on the query path (incremental + view/materialized_view), the destination temp table already holds only the projected columns. API + frontend: - Drops the POSTGRES gate at `external_data_source.py:1168` so every SQL source persists `schema_metadata.columns` on create — the column picker has `available_columns` to render immediately. Non-Postgres SQL sources also route through `SQLSource.reconcile_schema_metadata` on `refresh_schemas` so the metadata stays current. - `SchemaScene.showColumnsSection` now reads `source.supports_column_selection` with a fallback to the legacy `available_columns.length > 0` check — picker surfaces even before the first refresh has populated columns. Per-driver tests added under `tests/test_<driver>.py::TestBuildQueryEnabledColumns` mirroring the existing Postgres projection test cases. Generated-By: PostHog Code Task-Id: 8363dadb-b9a9-4287-9257-8f0e3c4d3cda
…-selection-all-db-sources-pr1
Problem
Column selection (the
ColumnSelectionModalpicker that lets a user pick a subset of source columns to sync into the warehouse) currently works only for Postgres — both direct-query and warehouse-managed modes. Every other SQL source (MySQL, MSSQL, BigQuery, Snowflake, Redshift) is forced intoSELECT *, which inflates sync cost, leaks PII columns into HogQL, and degrades sync reliability on wide tables.This is the first of two PRs that close that gap. It's a pure refactor: it hoists the Postgres column-selection logic into the common SQL layer so PR2 can flip the capability flag on for each remaining driver without touching
external_data_source.pyagain.Changes
Hoisted into
posthog/temporal/data_imports/sources/common/sql/projection.py(new) —compute_projected_columns,format_projected_select_clause,filter_columns_by_enabled_columns,filter_dwh_columns_by_enabled_columns,project_arrow_columns,prune_enabled_columns. Pure logic; no driver-specific quoting.metadata.py(new) —sql_schema_metadata(rename ofpostgres_schema_metadata, identical shape) andextract_available_column_names.SelectQueryBuilder.select_all— gainsenabled_columns+primary_keysparameters. When unset, emitsSELECT *(current behavior); when set, projects via the builder'squoterand always retains PKs + active incremental field. MySQL/MSSQL/Snowflake/Redshift pick this up for free in PR2.SQLSourceadditionssupports_column_selection: boolclass attribute. Postgres flips toTrue; every other SQL source staysFalse.reconcile_schema_metadatadefault hook — writesschema_metadatato every matching row (merging, not replacing), and prunesenabled_columnsto the intersection with the newly discovered column set so a source-side column drop can't break the next sync. Postgres overrides to delegate toreconcile_postgres_schemas(keeps direct-queryDataWarehouseTablerebuild).SourceInputs.enabled_columnsThreaded through
workflow_activities/import_data_sync.pyso every SQL driver reads the user's column selection from a single field instead of re-queryingschema.enabled_columnsin eachsource_for_pipeline.Capability flag exposed on the API
ExternalDataSourceSerializer.supports_column_selection(read-only, sourced fromSourceRegistry.get_source(...).supports_column_selection). The frontend will read this single boolean in PR2 — no hardcoded source-type lists anywhere in TS. Regeneratedapi.schemas.tsaccordingly.Backwards compatibility
postgres/postgres.py,partitioned_tables.py) still produce identical SQL —postgres/query_builders.build_select_clauseis now a thin wrapper overcompute_projected_columns+psycopg.sql.Identifier(...)so psycopg's escaping rules still apply.postgres_helpers.{filter_columns_by_enabled_columns, filter_dwh_columns_by_enabled_columns}remain importable as re-exports for legacy callers. All in-repo callers updated to import fromcommon/sql/directly.postgres_schema_metadatabecomes a thin wrapper that delegates tosql_schema_metadata— existing rows are read-compatible.schema_metadatawrite atexternal_data_source.py:1168keeps its Postgres gate; non-Postgres sources will populateschema_metadataon their nextrefresh_schemasrun (every 6h via the discovery schedule) via the new hook. Rejected the alternative — a one-shot data migration would hammer every customer DB simultaneously on deploy.How did you test this code?
Agent-authored. I'm an AI agent; all testing below is automated.
common/sql/tests/(new + existing)postgres/test_postgres.py::TestBuildQuery*(theenabled_columnsprojection cases at L711-790)products/data_warehouse/backend/test/test_postgres_helpers.pyproducts/data_warehouse/backend/api/test/test_external_data_source.pytest_list_external_data_sourcequery count, fails identically on master)products/data_warehouse/backend/api/test/test_external_data_schema.pyposthog/temporal/tests/data_imports/test_import_data.pyruff check+ruff formathogli build:openapisupports_column_selection: booleaninapi.schemas.tsandservices/mcp/src/api/generated.tspnpm typescript:checkWorkflowsScene.tsxerrors fail identically on master)Not tested manually — flag is dark, no UI surface area changes. PR2 ships the user-visible enablement and includes the manual test plan.
Publish to changelog?
no
Docs update
No docs change needed. PR2 updates user-facing docs.
🤖 Agent context
Authored by PostHog Code (Claude Opus 4.7) acting on a user-approved 2-PR plan. Key decisions:
_build_querychanges + frontend to PR2. Single PR would be ~2.5k lines; the refactor-vs-feature split keeps each side reviewable in isolation. PR2's blast radius is per-driver — a bug in one driver's_build_querydoesn't take down the others.supports_column_selection), not a hardcodedif source_type in [...]in TS. Source-type literals in frontend rot the moment a new SQL source lands.external_data_source.pystay as directreconcile_postgres_schemas(...)calls (not routed through the new hook) so existing tests that mockSourceRegistry.get_sourcestill exercise the real direct-queryDataWarehouseTablerebuild. The hook still works in production where the realPostgresSourceinstance is the one dispatched.enabled_columnspruning — pre-existing latent bug in Postgres (a source-side column drop would causeSELECT … missing_colnext sync). Fixed defensively here so PR2 doesn't enlarge exposure. Pruning is logged via structlog so we can monitor frequency.Created with PostHog Code