Skip to content

fix(data-warehouse): use inclusive cursor on date-typed incremental sync#59033

Merged
Gilbert09 merged 5 commits into
masterfrom
tom/date-cursor-inclusive
May 20, 2026
Merged

fix(data-warehouse): use inclusive cursor on date-typed incremental sync#59033
Gilbert09 merged 5 commits into
masterfrom
tom/date-cursor-inclusive

Conversation

@Gilbert09
Copy link
Copy Markdown
Member

Problem

SQL data warehouse sources (Postgres, Redshift, MySQL, MSSQL, BigQuery, Snowflake) advance their incremental cursor with WHERE incremental_field > saved_cursor. For DateTime / Timestamp / Integer / Numeric / ObjectID columns the cursor carries enough resolution that > is safe — two distinct rows essentially never share the boundary value.

For IncrementalFieldType.Date, the cursor only carries day granularity. If sync A completes mid-day and saves cursor 2026-05-13, sync B's > '2026-05-13' skips every row that lands on 2026-05-13 after sync A advanced the cursor — those rows are silently dropped from the warehouse on every subsequent sync. The bug is invisible: sync B "succeeds" with zero new rows for the boundary day.

Changes

  • New helper incremental_type_to_operator(field_type) in posthog/temporal/data_imports/pipelines/helpers.py returns ">=" for Date and ">" for every other IncrementalFieldType. Centralizing the decision keeps the six SQL sources in sync.
  • All single-shot incremental cursor predicates now flow through the helper:
    • postgres/postgres.py (main scan + count query)
    • postgres/partitioned_tables.py (build_partition_query, called per child partition)
    • redshift/redshift.py (sampled + non-sampled scans)
    • mysql/mysql.py
    • mssql/mssql.py
    • bigquery/bigquery.py
    • snowflake/snowflake.py
  • Postgres windowed mode (_build_query with upper_bound_inclusive set, used by iterate_date_windows) deliberately keeps > even for Date. Consecutive windows pass the previous window's hi as the next window's lo, so >= would re-fetch every boundary row inside a single run and produce duplicates between windows. A regression test (test_windowed_mode_keeps_exclusive_lower_bound_for_date) locks this behavior in.

Trade-off this PR accepts: for APPEND sync_type on a Date-typed cursor, the next sync after a cursor advance will re-ship rows at the boundary day. Without primary-key dedup those land as duplicates. This is a smaller harm than the current silent data loss, and INCREMENTAL sync_type (the much more common case) absorbs the re-shipped rows via primary-key dedup.

How did you test this code?

I am an agent. Only automated tests were run — no manual sync verification.

  • New parameterized helper test in posthog/temporal/data_imports/pipelines/test_helpers.py covering every IncrementalFieldType member.
  • New parameterized tests in posthog/temporal/data_imports/sources/postgres/test_postgres.py:
    • TestBuildQuery.test_operator_matches_field_type (Date, DateTime, Timestamp, Integer)
    • TestBuildQuery.test_count_query_operator_matches_field_type
    • TestBuildPartitionQuery.test_operator_matches_field_type
    • TestBuildQuery.test_windowed_mode_keeps_exclusive_lower_bound_for_date — regression guard for the windowed path
  • All 213 postgres builder/iterator tests + the 44 mysql tests pass against the new code, including the real-DB TestIterateDateWindowsRealDb::test_yields_all_rows_over_partitioned_table that caught the earlier draft over-applying >= to windowed mode.

Publish to changelog?

no

🤖 Agent context

  • Tooling: Claude Code (Opus 4.7).
  • Considered alternative: gate the >= switch on sync_type == INCREMENTAL so APPEND keeps >. Rejected because most SQL sources don't have sync_type available where the query is built, and the bug is severe enough (silent boundary-day loss across all syncs) that the small APPEND duplication is preferable to threading a new parameter through every source.
  • Considered alternative: step the saved cursor back by one day for Date types so > still captures the boundary day. Rejected as more invasive (every source's cursor advancement logic would need to know about it) and harder to reason about — the operator-side fix keeps the cursor's stored value honest and isolates the change to query construction.
  • The windowed Postgres path (iterate_date_windows) still has the same boundary-day loss between sync runs because the first window's lo is the saved cursor and uses >. Fixing that requires iterate_date_windows to step the initial lo back by one unit for Date types — left for a follow-up since it has its own correctness considerations around partition snap-forward (_step_back_one) and is a distinct codepath from the single-shot scan the customer-facing bug surfaces in.

Copilot AI review requested due to automatic review settings May 19, 2026 15:05
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team May 19, 2026 15:05
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Reviews (1): Last reviewed commit: "fix(data-warehouse): use inclusive curso..." | Re-trigger Greptile

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a correctness bug in incremental syncs for SQL data warehouse sources when the incremental cursor is a Date (day-granularity) by switching the lower-bound predicate to be inclusive (>=) for Date cursors while keeping the existing exclusive behavior (>) for higher-resolution cursor types.

Changes:

  • Introduces incremental_type_to_operator() to centralize the cursor lower-bound operator choice (>= for IncrementalFieldType.Date, otherwise >).
  • Updates multiple SQL source query builders (Postgres, Redshift, MySQL, MSSQL, BigQuery, Snowflake) to use the helper for incremental predicates.
  • Adds/extends tests covering the helper mapping and Postgres query builders, including a regression test ensuring Postgres windowed mode keeps an exclusive lower bound for Date.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
posthog/temporal/data_imports/pipelines/helpers.py Adds incremental_type_to_operator() helper to choose >= for Date cursors and > otherwise.
posthog/temporal/data_imports/pipelines/test_helpers.py Adds unit test for incremental_type_to_operator() across all IncrementalFieldType members.
posthog/temporal/data_imports/sources/postgres/postgres.py Uses the helper for single-shot incremental scans and count queries; keeps > in windowed mode.
posthog/temporal/data_imports/sources/postgres/partitioned_tables.py Uses the helper in per-child partition incremental queries.
posthog/temporal/data_imports/sources/postgres/test_postgres.py Adds tests asserting operator selection in main, count, partition queries, plus windowed-mode regression coverage.
posthog/temporal/data_imports/sources/redshift/redshift.py Uses the helper for sampled and non-sampled incremental predicates.
posthog/temporal/data_imports/sources/mysql/mysql.py Uses the helper in MySQL incremental WHERE clause.
posthog/temporal/data_imports/sources/mssql/mssql.py Uses the helper in MSSQL incremental WHERE clause.
posthog/temporal/data_imports/sources/bigquery/bigquery.py Uses the helper in BigQuery incremental query construction.
posthog/temporal/data_imports/sources/snowflake/snowflake.py Uses the helper in Snowflake incremental query construction.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread posthog/temporal/data_imports/pipelines/test_helpers.py
@Gilbert09 Gilbert09 merged commit a81a147 into master May 20, 2026
212 checks passed
@Gilbert09 Gilbert09 deleted the tom/date-cursor-inclusive branch May 20, 2026 14:09
@deployment-status-posthog
Copy link
Copy Markdown

deployment-status-posthog Bot commented May 20, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-05-20 14:48 UTC Run
prod-us ✅ Deployed 2026-05-20 15:20 UTC Run
prod-eu ✅ Deployed 2026-05-20 15:22 UTC Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants