chore(skills): add optimizing-clickhouse-and-hogql-queries skill#60451
Merged
Conversation
Workflow skill for optimizing ClickHouse queries and HogQL queries (which compile to ClickHouse). Does not cover Postgres / Django ORM / app-DB queries, which need pganalyze and the Postgres section of docs/.../databases/query-performance-optimization.md instead. Structure: - Step 0 triages what layer the slow query lives at (HogQL printer vs hand-written ClickHouse SQL vs Django ORM vs personhog) and redirects out for the non-ClickHouse cases. Includes a multi-layer-workflow note for coordinators / Celery tasks / Temporal workflows where the ClickHouse query lives one dispatch further in. - Background reads: events / sessions / persons / person overrides / cohorts table SQL, HogQL printer entry points, cluster topology in posthog/clickhouse/migrations/CLAUDE.md, plus a pointer to find any ClickHouse table not in the listed schemas. - Step 1 extracts the printed ClickHouse SQL from a HogQL query via execute_hogql_query() / prepare_and_print_ast() / .ambr snapshots / production system.query_log, or skips ahead for hand-written SQL. - Step 2 calls out the recurring smells with rewrites and links: FROM ... FINAL (with the safe rewrite hierarchy: materialized column, narrow argMax, wide argMax as a last resort), JSONExtract over properties (with the test-vs-prod materialization caveat), primary key / skip index coverage, self-joins on events, CTE inlining. - Step 3 lists EXPLAIN flavors and links the upstream docs. - Step 4 covers measurement: local ClickHouse for correctness and bytes read, Test Cluster for timing, with use_uncompressed_cache=0, median of 5, pulling metrics from system.query_log, swapping in the cluster's materialized columns before timing, and the discipline to measure before proposing. - Step 5 picks the lowest-blast-radius layer for the fix: query runner, new HogQL function, printer rule, or ClickHouse migration. Defers to /clickhouse-migrations for migration mechanics. - Test discipline plus a pointer to the learnings log. references/learnings.md is an append-only case-study log for findings that contradict or nuance the skill's smell descriptions. Comes with a PII / customer-data prohibition (the file lands in the public OSS repo) and a worked first entry: dropping FROM person FINAL via argMax(properties, version) GROUP BY id was 46% slower than the original and used ~10x more memory on a 1M-person Test Cluster slice because argMax buffers wide blobs per group; argMax over the materialized pmat_$browser column was 4.6x faster and read 33x fewer bytes. The blanket 'FINAL bad, argMax good' framing is wrong; materialization dominates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d6833ff to
901c994
Compare
Links the materialization-swap mechanism that backs the test-vs-prod caveat in the JSON smell. Three pieces: - ee/clickhouse/materialized_columns/columns.py for the registry (get_materialized_columns / get_enabled_materialized_columns). - posthog/hogql/printer/base.py for the printer's visit_property_type swap point at line 1354 and its _get_materialized_property_source_* helper at 1260. - posthog/hogql/printer/clickhouse.py for the ClickHouse override at 412. Notes that the swap happens automatically for HogQL queries (default assumption: property access is materialized in prod), and that the exception is hand-written SQL in product code that never goes through the printer and has to do its own lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three URLs covering the conceptual model that frame the rest of the reading list: - query-performance-optimization for the general approach - hogql-python for the printer pipeline - clickhouse-queries-new-products for table/query-runner design Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the slow query is hand-written SQL in product code (Python f-strings shipped via sync_execute / client.read_query / client.execute, not HogQL printer output), surface this to the user before applying a local fix. HogQL queries get materialized-column substitution, property-group dispatch, lazy joins, team-id guards, and a pile of other optimizations automatically through the printer. Raw SQL has to reimplement each of those or live without them, so the structural fix is usually to move the query to HogQL rather than patch the raw string. Migrations and one-shot operational scripts are reasonable exceptions; long-lived read paths in product code usually aren't. Caught while dogfooding on a backfill workflow that f-strings JSONExtract calls and ships them raw; measured to be reading 1.27 GB on a 1M-person slice where the same query expressed in HogQL would have automatically landed on the materialized pmat_$browser column (30 MB). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mization Earlier wording said 'surface this to the user before going further' which read as 'stop and propose HogQL instead of the local fix'. Reframes it as: flag up front, then continue with the local optimization. The user gets both options (local fix now, HogQL move later, or both); the agent shouldn't withhold the local analysis they asked for. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uilt envelope HogQL is intentionally read-only. For INSERTs, the recommended pattern is to express the SELECT in HogQL, print it, and concatenate it into an 'INSERT INTO <table> <printed_select>' string. The read half still gets materialization, lazy joins, team-id guards, etc.; only the INSERT wrapper is hand-built. Adds this as a sentence right after the 'flag and continue' note. Keeps the migrations/one-shot-scripts caveat as the only true exception. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… to local md A blind dogfood test surfaced a routing problem: an agent given 'audit these slow queries' picked writing-clickhouse-queries (which has 'debugging slow ClickHouse queries' in its trigger list) instead of the optimization skill. The agent's worktree also predated the new skill's commit, so it couldn't even see it, but the routing description overlap is a real issue regardless. Three updates: - writing-clickhouse-queries: description now narrows to writing/designing scenarios and explicitly redirects existing-query optimization to /optimizing-clickhouse-and-hogql-queries. Adds a prominent in-body callout at the top so any reader who lands there for an audit gets bounced before reading further. - optimizing-clickhouse-and-hogql-queries: handbook URLs that were pointing at posthog.com/handbook switched to the local docs/published/handbook/engineering/databases/*.md files. The posthog.com pages are built from these md files; linking the source works offline and is versioned with the code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
|
Greptile encountered an error while reviewing this PR. Please reach out to support@greptile.com for assistance. |
Contributor
There was a problem hiding this comment.
Purely additive documentation — two new Markdown skill files for ClickHouse/HogQL query optimization guidance and a minor cross-reference update to an existing skill. No production code, API contracts, data models, or executable logic touched. The content is internally consistent Markdown referencing real PostHog files.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Optimizing a slow HogQL query takes scattered knowledge: where to read the generated ClickHouse SQL, which materialized-column strategy to reach for, which test helpers exist for asserting skip-index use, how to use the Test Cluster for real timing measurements, and how to translate the fix back into the printer / query runner / a migration. Each piece is documented somewhere, but the workflow that ties them together isn't.
Changes
Adds
.agents/skills/optimizing-clickhouse-queries/SKILL.md. Five steps mapped to where engineers actually get stuck:execute_hogql_query()/prepare_and_print_ast/print_prepared_ast,.ambrsnapshots, or production via/query-clickhouse-via-metabaseagainstposthog.query_log_archive(orsystem.query_logwithclusterAllReplicasandis_initial_query).JSONExtractover raw properties (with pointers to direct materialization, property groups, DMAT slots, and the new JSON type experiment, plus thematerialized()test context manager); primary key / skip indexes (withget_index_from_explain/get_indexes_from_explain); self-joins (rewrite to one-passsumIf/uniqIf/uniqMapIf); CTE inlining behavior.SETTINGS use_uncompressed_cache=0, median of 5, pull numbers fromsystem.query_lognot request latency), and pi-autoresearch as the powertool._get_optimized_materialized_column_equals_operationas a template), or ClickHouse migration (defers to/clickhouse-migrations).Plus a note on team-asymmetric optimizations (the funnels heuristic story) and a short test-discipline section.
The skill links to existing source files rather than duplicating their content, so it stays in sync when the underlying code moves. Defers to
/writing-clickhouse-queries,/clickhouse-migrations, and/query-clickhouse-via-metabaserather than re-stating them.How did you test this code?
I'm an agent. I ran
./bin/hogli lint:skills(passes:OK: 43 skill(s) passed lint checks) and verified all 28 file paths referenced in the skill exist in the tree. No manual testing of the skill in an agent session yet.Publish to changelog?
no
🤖 Agent context
Authored by Claude (Opus 4.7) via Claude Code. The user provided the outline of what the skill should cover, including which files to link to, and the workflow shape (CH SQL first, then translate the change back). I explored the codebase to resolve each pointer to a concrete file or symbol (e.g. confirming the events sort key is
(team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid)), findingmaterialized()andget_index_from_explaininposthog/test/base.py, locating the three materialization strategies, and the autoresearch coordinator).Decisions worth flagging for review:
.agents/skills/notproducts/*/skills/. Cross-cutting skill, follows the precedent ofwriting-clickhouse-queriesandclickhouse-migrations.use_uncompressed_cache=0. Mirrorsee/benchmarks/measure.sh. Reasonable people could argue formin_bytes_to_use_direct_io=1(bypasses OS page cache) instead or in addition; happy to swap if the existing benchmarking convention is wrong.