fix: Strip toTimeZone from timestamp comparison in WHERE clause to fix primary key usage#54819
Conversation
ef5463b to
8cde337
Compare
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
2c6830c to
3baed27
Compare
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
bdfbdcc to
4a29517
Compare
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
4a29517 to
45ee5e0
Compare
|
🎭 Playwright report · View test results →
These issues are not necessarily caused by your changes. |
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
6a5c375 to
2191322
Compare
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
Prompt To Fix All With AIThis is a comment left during a code review.
Path: posthog/hogql/transforms/property_types.py
Line: 195-200
Comment:
**Imprecise type annotation on `_RANGE_OPS`**
`set[str]` should be `set[ast.CompareOperationOp]`. Even though `CompareOperationOp` is a `StrEnum` (and its members technically _are_ strings), the declared type doesn't communicate intent and the membership check on line 233 (`result.op not in self._RANGE_OPS`) is then typed as comparing a `CompareOperationOp` against a `set[str]`.
```suggestion
_RANGE_OPS: set[ast.CompareOperationOp] = {
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: posthog/hogql/transforms/test/test_property_types.py
Line: 325-366
Comment:
**Use `@parameterized.expand` instead of `subTest`**
The repo convention (AGENTS.md: "We always prefer parameterised tests") calls for the `parameterized` library rather than `self.subTest`. `test_toTimeZone_breaks_partition_and_pk_pruning` loops with `subTest`, and `test_index_hint_not_added_inside_function_calls` tests three separate SQL/count combinations in sequence. Both are candidates for `@parameterized.expand`.
For example, `test_index_hint_not_added_inside_function_calls` could become:
```python
@parameterized.expand([
("if_in_select",
"SELECT if(timestamp >= '2024-03-01', 'yes', 'no') FROM events",
0),
("mixed_where_and_if",
"SELECT if(timestamp >= '2024-01-01', 'new', 'old') FROM events "
"WHERE timestamp >= '2024-03-01' AND timestamp < '2024-04-01'",
2),
("deep_nesting",
"SELECT coalesce(if(timestamp < '2024-06-01', NULL, timestamp), now()) FROM events",
0),
])
def test_index_hint_not_added_inside_function_calls(self, _name, hogql, expected_count):
sql, _ = self._compile_hogql(hogql, timezone="America/New_York")
assert sql.count("indexHint") == expected_count
```
**Context Used:** Do not attempt to comment on incorrect alphabetica... ([source](https://app.greptile.com/review/custom-context?memory=instruction-0))
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: posthog/hogql/transforms/test/test_property_types.py
Line: 499-520
Comment:
**Misleading test name contradicts its own docstring**
The method is named `test_utc_zero_buffer_returns_correct_results`, but the docstring immediately says _"UTC gets the same 14h buffer as other timezones."_ The name implies no buffer is applied for UTC, which is the opposite of what the code does (`_tz_buffer_hours = 14` whenever `setTimeZones=True`, regardless of timezone). Consider renaming to something like `test_utc_timezone_returns_correct_results`.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: add indexHint to restore partition/..." | Re-trigger Greptile |
Query snapshots: Backend query snapshots updatedChanges: 62 snapshots (62 modified, 0 added, 0 deleted) What this means:
Next steps:
|
…pruning
ClickHouse can't use the partition key (toYYYYMM(timestamp)) or primary
key (toDate(timestamp)) when the WHERE clause wraps timestamp in
toTimeZone(). Since toTimeZone() only changes display metadata — not the
underlying epoch — we move the timezone from the field to the constant:
toTimeZone(timestamp, 'US/Pacific') >= '2024-03-01'
→ timestamp >= toDateTime64('2024-03-01', 6, 'US/Pacific')
This lets the planner see bare timestamp for pruning while ClickHouse
interprets the constant in the correct timezone.
Only applies to top-level WHERE range comparisons (>=, >, <, <=).
toTimeZone is preserved in SELECT expressions and inside function calls.
Generated-By: PostHog Code
Task-Id: 3282717a-63ab-4824-b509-b4ecb978a729
f4aa428 to
6ff37d0
Compare
Generated-By: PostHog Code Task-Id: 3282717a-63ab-4824-b509-b4ecb978a729
Generated-By: PostHog Code Task-Id: 3282717a-63ab-4824-b509-b4ecb978a729
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
Raw benchmark data (7-day funnel, run 9)All runs used EXPLAIN indexes (sharded_events, load_balancing=in_order)Baseline indexes[
{
"Type": "MinMax",
"Keys": [
"timestamp"
],
"Condition": "and((toTimezone(timestamp, 'US/Pacific') in (-Inf, '1776409199.999999']), (toTimezone(timestamp, 'US/Pacific') in ['1775718000', +Inf)))",
"Initial Parts": 1810,
"Selected Parts": 49,
"Initial Granules": 189403366,
"Selected Granules": 6194548
},
{
"Type": "Partition",
"Condition": "true",
"Initial Parts": 49,
"Selected Parts": 49,
"Initial Granules": 6194548,
"Selected Granules": 6194548
},
{
"Type": "PrimaryKey",
"Keys": [
"team_id",
"event"
],
"Condition": "and((event in 2-element set), (team_id in [2, 2]))",
"Search Algorithm": "generic exclusion search",
"Initial Parts": 49,
"Selected Parts": 49,
"Initial Granules": 6194548,
"Selected Granules": 73638
},
{
"Type": "Skip",
"Name": "minmax_sharded_events_timestamp",
"Description": "minmax GRANULARITY 1",
"Initial Parts": 49,
"Selected Parts": 47,
"Initial Granules": 73638,
"Selected Granules": 60683
}
]indexHint +7/+8 indexes[
{
"Type": "MinMax",
"Keys": [
"timestamp"
],
"Condition": "and((timestamp in (-Inf, '1776437999.999999']), and((toTimezone(timestamp, 'US/Pacific') in (-Inf, '1776409199.999999']), and((timestamp in ['1775743200', +Inf)), (toTimezone(timestamp, 'US/Pacific') in ['1775718000', +Inf)))))",
"Initial Parts": 1863,
"Selected Parts": 51,
"Initial Granules": 191551084,
"Selected Granules": 5514013
},
{
"Type": "Partition",
"Keys": [
"toYYYYMM(timestamp)"
],
"Condition": "and((toYYYYMM(timestamp) in (-Inf, 202604]), (toYYYYMM(timestamp) in [202604, +Inf)))",
"Initial Parts": 51,
"Selected Parts": 51,
"Initial Granules": 5514013,
"Selected Granules": 5514013
},
{
"Type": "PrimaryKey",
"Keys": [
"team_id",
"toDate(timestamp)",
"event"
],
"Condition": "and((event in 2-element set), and((toDate(timestamp) in (-Inf, 20560]), and((toDate(timestamp) in [20552, +Inf)), (team_id in [2, 2]))))",
"Search Algorithm": "generic exclusion search",
"Initial Parts": 51,
"Selected Parts": 51,
"Initial Granules": 5514013,
"Selected Granules": 60434
},
{
"Type": "Skip",
"Name": "minmax_sharded_events_timestamp",
"Description": "minmax GRANULARITY 1",
"Initial Parts": 51,
"Selected Parts": 49,
"Initial Granules": 60434,
"Selected Granules": 55716
}
]Stripping indexes (this PR)[
{
"Type": "MinMax",
"Keys": [
"timestamp"
],
"Condition": "and((timestamp in (-Inf, '1776409199.999999']), (timestamp in ['1775718000', +Inf)))",
"Initial Parts": 1990,
"Selected Parts": 60,
"Initial Granules": 180809332,
"Selected Granules": 5177243
},
{
"Type": "Partition",
"Keys": [
"toYYYYMM(timestamp)"
],
"Condition": "and((toYYYYMM(timestamp) in (-Inf, 202604]), (toYYYYMM(timestamp) in [202604, +Inf)))",
"Initial Parts": 60,
"Selected Parts": 60,
"Initial Granules": 5177243,
"Selected Granules": 5177243
},
{
"Type": "PrimaryKey",
"Keys": [
"team_id",
"toDate(timestamp)",
"event"
],
"Condition": "and((event in 2-element set), and((toDate(timestamp) in (-Inf, 20560]), and((toDate(timestamp) in [20552, +Inf)), (team_id in [2, 2]))))",
"Search Algorithm": "generic exclusion search",
"Initial Parts": 60,
"Selected Parts": 60,
"Initial Granules": 5177243,
"Selected Granules": 23828
},
{
"Type": "Skip",
"Name": "minmax_sharded_events_timestamp",
"Description": "minmax GRANULARITY 1",
"Initial Parts": 60,
"Selected Parts": 56,
"Initial Granules": 23828,
"Selected Granules": 23291
}
]Execution results (query_log, 5 runs each)Query log results[
{
"variant": "baseline",
"query_duration_ms": "4,755",
"read_rows": "27,535,611",
"read_bytes_human": "2.24 GiB",
"selected_parts": "59",
"selected_marks": "62,117"
},
{
"variant": "baseline",
"query_duration_ms": "5,970",
"read_rows": "25,203,657",
"read_bytes_human": "2.23 GiB",
"selected_parts": "73",
"selected_marks": "20,975"
},
{
"variant": "baseline",
"query_duration_ms": "2,824",
"read_rows": "27,530,511",
"read_bytes_human": "2.24 GiB",
"selected_parts": "73",
"selected_marks": "23,829"
},
{
"variant": "baseline",
"query_duration_ms": "6,440",
"read_rows": "25,745,832",
"read_bytes_human": "2.23 GiB",
"selected_parts": "67",
"selected_marks": "61,636"
},
{
"variant": "baseline",
"query_duration_ms": "3,357",
"read_rows": "27,649,289",
"read_bytes_human": "2.24 GiB",
"selected_parts": "81",
"selected_marks": "24,708"
},
{
"variant": "indexHint +7/+8",
"query_duration_ms": "9,307",
"read_rows": "25,427,025",
"read_bytes_human": "2.13 GiB",
"selected_parts": "81",
"selected_marks": "19,630"
},
{
"variant": "indexHint +7/+8",
"query_duration_ms": "6,582",
"read_rows": "26,139,681",
"read_bytes_human": "2.13 GiB",
"selected_parts": "47",
"selected_marks": "23,470"
},
{
"variant": "indexHint +7/+8",
"query_duration_ms": "4,127",
"read_rows": "23,915,002",
"read_bytes_human": "2.13 GiB",
"selected_parts": "62",
"selected_marks": "23,085"
},
{
"variant": "indexHint +7/+8",
"query_duration_ms": "6,677",
"read_rows": "25,213,166",
"read_bytes_human": "2.13 GiB",
"selected_parts": "95",
"selected_marks": "19,674"
},
{
"variant": "indexHint +7/+8",
"query_duration_ms": "8,449",
"read_rows": "26,260,186",
"read_bytes_human": "2.13 GiB",
"selected_parts": "83",
"selected_marks": "59,663"
},
{
"variant": "stripping",
"query_duration_ms": "2,192",
"read_rows": "27,488,851",
"read_bytes_human": "2.23 GiB",
"selected_parts": "55",
"selected_marks": "24,693"
},
{
"variant": "stripping",
"query_duration_ms": "3,149",
"read_rows": "27,469,271",
"read_bytes_human": "2.23 GiB",
"selected_parts": "57",
"selected_marks": "61,281"
},
{
"variant": "stripping",
"query_duration_ms": "3,250",
"read_rows": "27,001,413",
"read_bytes_human": "2.23 GiB",
"selected_parts": "80",
"selected_marks": "21,475"
},
{
"variant": "stripping",
"query_duration_ms": "8,356",
"read_rows": "27,260,208",
"read_bytes_human": "2.24 GiB",
"selected_parts": "103",
"selected_marks": "21,513"
},
{
"variant": "stripping",
"query_duration_ms": "2,464",
"read_rows": "26,727,215",
"read_bytes_human": "2.23 GiB",
"selected_parts": "77",
"selected_marks": "24,968"
}
] |
Query snapshots: Backend query snapshots updatedChanges: 68 snapshots (68 modified, 0 added, 0 deleted) What this means:
Next steps:
|
Generated-By: PostHog Code Task-Id: 3282717a-63ab-4824-b509-b4ecb978a729
a469095 to
7e571fb
Compare
The toTimeZone stripping is a pruning optimization that only benefits WHERE/PREWHERE clauses. Stripping in JOIN ON, SELECT, HAVING etc. is unnecessary and changes SQL output without benefit. Each SelectQuery now gets its own scope — a subquery inside WHERE correctly resets the flag so its SELECT/JOIN clauses aren't affected. Generated-By: PostHog Code Task-Id: 3282717a-63ab-4824-b509-b4ecb978a729
- Subquery in WHERE: inner SELECT preserves toTimeZone, inner WHERE strips - HAVING preserves toTimeZone Generated-By: PostHog Code Task-Id: 3282717a-63ab-4824-b509-b4ecb978a729
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
…are constants - Skip toDateTime64 wrapping when the constant is already a timezone-aware datetime (the printer already emits toDateTime64 with timezone for these) - Preserve Alias wrapper on bare constants, consistent with the assumeNotNull branch Generated-By: PostHog Code Task-Id: 3282717a-63ab-4824-b509-b4ecb978a729
…are constants - Skip toDateTime64 wrapping when the constant is already a timezone-aware datetime — the printer converts to team tz and emits toDateTime64 with the correct timezone regardless of the constant's original tzinfo - Preserve Alias wrapper on bare constants, consistent with the assumeNotNull branch Generated-By: PostHog Code Task-Id: 3282717a-63ab-4824-b509-b4ecb978a729
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
Query snapshots: Backend query snapshots updatedChanges: 68 snapshots (68 modified, 0 added, 0 deleted) What this means:
Next steps:
|
Public-facing post on how we used pi-autoresearch (with our own campaign / lane / hypothesis / experiment orchestration on top) at the Lisbon team offsite hackathon to find a 3-year-old ClickHouse primary-key bug. Links to PostHog/posthog#54819 in the main repo for the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* content: add autoresearch hackathon blog post Public-facing post on how we used pi-autoresearch (with our own campaign / lane / hypothesis / experiment orchestration on top) at the Lisbon team offsite hackathon to find a 3-year-old ClickHouse primary-key bug. Links to PostHog/posthog#54819 in the main repo for the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * content: correct "disabling" -> "stopping ... from fully using" PK Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * content: simplify the lead, attribute the wrong usage rather than the filter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * content: "every query with a timestamp filter" not "every timestamp filter" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix formatting * add a couple of internal links * Update contents/blog/autoresearch-found-a-3-year-old-clickhouse-bug.md Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com> * Update contents/blog/autoresearch-found-a-3-year-old-clickhouse-bug.md Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com> * Update contents/blog/autoresearch-found-a-3-year-old-clickhouse-bug.md Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com> * Update contents/blog/autoresearch-found-a-3-year-old-clickhouse-bug.md Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com> * Update contents/blog/autoresearch-found-a-3-year-old-clickhouse-bug.md Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com> * Update contents/blog/autoresearch-found-a-3-year-old-clickhouse-bug.md Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com> * Update contents/blog/autoresearch-found-a-3-year-old-clickhouse-bug.md Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com> * Update contents/blog/autoresearch-found-a-3-year-old-clickhouse-bug.md Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com> * Update contents/blog/autoresearch-found-a-3-year-old-clickhouse-bug.md Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com> * update slug * update date * fix spacing * a few more edits * date * date --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Ian Vanagas <34755028+ivanagas@users.noreply.github.com>
TLDR: 22-37% faster 7-day funnel, 62% fewer granules scanned — by restoring ClickHouse partition/PK pruning. Use
toTimeZonebroke clickhouse's ability to use the primary key, this fixes that.We found this with pi-autoresearch in #hackathon-lisbon-query-performance-autoresearch
PostHog Code continues:
Problem
The events table uses
PARTITION BY toYYYYMM(timestamp)andORDER BY (team_id, toDate(timestamp), event, ...), but HogQL wraps all timestamp fields withtoTimeZone(timestamp, tz)for timezone support. ClickHouse's query planner can't derivetoYYYYMM(timestamp)ortoDate(timestamp)bounds fromtoTimeZone(timestamp, tz)comparisons, so both partition pruning and primary key usage were lost.The MinMax index on
timestampstill worked (it evaluatestoTimeZone()against per-part min/max values), which is why queries weren't catastrophically slow — but after MinMax narrowed down the parts, ClickHouse scanned all granules within those parts instead of usingtoDate(timestamp)in the primary key to skip to the right date range.Verified via
EXPLAIN PLAN indexes=1, json=1onsharded_eventswithload_balancing='in_order':Partition: Condition='true'(not pruning), PK keys['team_id', 'event']→ 60,683 granules after skip indexPartition: toYYYYMM(timestamp) in [202604, 202604], PK keys['team_id', 'toDate(timestamp)', 'event']→ 23,291 granules after skip index (62% reduction)Changes
In
PropertySwapper.visit_compare_operation, for top-level WHERE range comparisons (>=,>,<,<=), we move the timezone from the field side to the constant side:This works because
toTimeZone()only changes display metadata on DateTime values — the underlying epoch is unchanged. By annotating the constant with the timezone instead (toDateTime64(constant, 6, tz)), ClickHouse interprets it in the correct timezone while the planner can see the baretimestampfield for pruning.Key design decisions:
if(),coalesce()— tracked via_inside_call_depthtoTimeZoneis preserved in SELECT expressions for correct displaytoDateTime64) are left unchangedassumeNotNull(toDateTime(...))are recursed intoWe also compared against an
indexHint()approach (addingindexHint(bare_timestamp >= constant ± tz_offset)alongside thetoTimeZonecomparison). The stripping approach outperforms it significantly — cleaner planner signal, fewer granules, lower variance.Credit to @andyzzhao for the insight that we can strip
toTimeZoneentirely and annotate the constant instead.How did you test this code?
New tests in
TestTimezoneIndexPruning(intest_property_types.py):EXPLAIN-based pruning tests:
test_bare_timestamp_prunes_partition_and_primary_key— baseline confirming bare timestamp pruning workstest_toTimeZone_breaks_partition_and_pk_pruning— canary test asserting the ClickHouse limitation still exists (if it starts failing, the workaround is removable)test_hogql_compiled_query_has_partition_pruning— full HogQL pipeline now has partition + PK pruningCorrectness tests (real ClickHouse queries with event data):
test_dst_boundary_does_not_drop_events— America/New_York DST transitiontest_positive_utc_offset_does_not_drop_events— Asia/Tokyo (UTC+9)test_utc_returns_correct_resultstest_brazil_historical_dst_does_not_drop_events— historical DST that was later abolishedtest_half_hour_offset_does_not_drop_events— Asia/Kolkata (UTC+5:30)test_lord_howe_half_hour_dst_does_not_drop_events— 30-minute DST shiftSQL shape tests:
test_toTimeZone_stripped_from_where_but_kept_in_selecttest_toTimeZone_not_stripped_inside_function_callstest_constant_gets_timezone_annotationBenchmarks
Profiled on prod ClickHouse via Metabase — 7-day funnel (pageview → autocapture) on team 2, US/Pacific timezone. All variants used
load_balancing='in_order'for consistency. Each variant run 5 times.EXPLAIN granule counts (single shard via
sharded_events):true❌Execution timing (distributed
eventstable, 5 runs each):Publish to changelog?
No
Docs update
No docs changes needed.
🤖 LLM context
Authored by PostHog Code. The approach evolved during the session:
indexHint()+ bare timestamp bounds with a 14-hour buffertoTimeZoneand annotating the constant instead, which is both simpler and more effective (fewer granules, lower variance, no buffer arithmetic)Created with PostHog Code