chore(skills): add optimizing-clickhouse-and-hogql-queries skill #60451

Merged
robbie-c merged 7 commits into master from claude/heuristic-lewin-6d555e
May 28, 2026
Conversation

@robbie-c
Member

Problem

Optimizing a slow HogQL query takes scattered knowledge: where to read the generated ClickHouse SQL, which materialized-column strategy to reach for, which test helpers exist for asserting skip-index use, how to use the Test Cluster for real timing measurements, and how to translate the fix back into the printer / query runner / a migration. Each piece is documented somewhere, but the workflow that ties them together isn't.

Changes

Adds .agents/skills/optimizing-clickhouse-queries/SKILL.md. Five steps mapped to where engineers actually get stuck:

  1. Get the ClickHouse SQL: execute_hogql_query() / prepare_and_print_ast / print_prepared_ast, .ambr snapshots, or production via /query-clickhouse-via-metabase against posthog.query_log_archive (or system.query_log with clusterAllReplicas and is_initial_query).
  2. Scan for smells: JSONExtract over raw properties (with pointers to direct materialization, property groups, DMAT slots, and the new JSON type experiment, plus the materialized() test context manager); primary key / skip indexes (with get_index_from_explain / get_indexes_from_explain); self-joins (rewrite to one-pass sumIf / uniqIf / uniqMapIf); CTE inlining behavior.
  3. EXPLAIN: useful flavors and link to the ClickHouse docs.
  4. Measure: local CH (correctness + bytes-read), Test Cluster via the metabase skill (team 2 only, SETTINGS use_uncompressed_cache=0, median of 5, pull numbers from system.query_log not request latency), and pi-autoresearch as the powertool.
  5. Apply at the right layer: query runner, new HogQL function, printer rule (with _get_optimized_materialized_column_equals_operation as a template), or ClickHouse migration (defers to /clickhouse-migrations).
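
As a rough illustration of the self-join smell from step 2, the sketch below uses Python's built-in sqlite3 as a stand-in engine. That is an assumption made only so the example runs anywhere; the table shape, event names, and sample rows are hypothetical. SQLite has no sumIf / uniqIf, so SUM(CASE ...) and COUNT(DISTINCT CASE ...) play their role here. In ClickHouse the one-pass form would be written with sumIf / uniqIf instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (team_id INTEGER, distinct_id TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (2, "a", "$pageview"),
        (2, "a", "signup"),
        (2, "b", "$pageview"),
        (2, "b", "$pageview"),
        (2, "c", "signup"),
    ],
)

# Smell: the events table is scanned twice and the two results are joined.
SELF_JOIN = """
SELECT pv.pageviews, su.signup_users
FROM (SELECT count(*) AS pageviews
      FROM events WHERE team_id = 2 AND event = '$pageview') AS pv
CROSS JOIN (SELECT count(DISTINCT distinct_id) AS signup_users
            FROM events WHERE team_id = 2 AND event = 'signup') AS su
"""

# Rewrite: one scan with conditional aggregates.
# ClickHouse spelling (not run here): countIf(event = '$pageview'),
# uniqIf(distinct_id, event = 'signup'); note uniqIf is approximate.
ONE_PASS = """
SELECT sum(CASE WHEN event = '$pageview' THEN 1 ELSE 0 END) AS pageviews,
       count(DISTINCT CASE WHEN event = 'signup' THEN distinct_id END) AS signup_users
FROM events
WHERE team_id = 2
"""

self_join_result = conn.execute(SELF_JOIN).fetchone()
one_pass_result = conn.execute(ONE_PASS).fetchone()
```

Both queries return the same row on this sample data; the point of the rewrite is that the second form reads the table once.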

Plus a note on team-asymmetric optimizations (the funnels heuristic story) and a short test-discipline section.

The skill links to existing source files rather than duplicating their content, so it stays in sync when the underlying code moves. Defers to /writing-clickhouse-queries, /clickhouse-migrations, and /query-clickhouse-via-metabase rather than re-stating them.

How did you test this code?

I'm an agent. I ran ./bin/hogli lint:skills (passes: OK: 43 skill(s) passed lint checks) and verified all 28 file paths referenced in the skill exist in the tree. No manual testing of the skill in an agent session yet.

Publish to changelog?

no

🤖 Agent context

Authored by Claude (Opus 4.7) via Claude Code. The user provided the outline of what the skill should cover, including which files to link to, and the workflow shape (CH SQL first, then translate the change back). I explored the codebase to resolve each pointer to a concrete file or symbol (e.g. confirming the events sort key is (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid)), finding materialized() and get_index_from_explain in posthog/test/base.py, locating the three materialization strategies, and the autoresearch coordinator).

Decisions worth flagging for review:

  • Location: .agents/skills/ not products/*/skills/. Cross-cutting skill, follows the precedent of writing-clickhouse-queries and clickhouse-migrations.
  • Cache setting: use_uncompressed_cache=0. Mirrors ee/benchmarks/measure.sh. Reasonable people could argue for min_bytes_to_use_direct_io=1 (bypasses OS page cache) instead or in addition; happy to swap if the existing benchmarking convention is wrong.
  • pi-autoresearch section is honest about uncertainty. I don't know the exact runtime commands beyond the README snippets, so the skill tells the agent to ask the user and links the relevant files rather than guessing.
  • Team-specific heuristics framed as "suggest, don't implement." The funnels example is concrete but the shape of the right heuristic depends on the rewrite, so this is a design call for the human.
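
To make the cache-setting decision concrete, here is a minimal sketch of a helper that appends the benchmarking settings to a query. The helper name and the bypass_page_cache flag are hypothetical and not part of the skill; the two setting names are the ones discussed above. It assumes the query does not already carry a SETTINGS clause.

```python
def with_benchmark_settings(sql: str, bypass_page_cache: bool = False) -> str:
    """Append cold-cache benchmarking settings to a ClickHouse query string.

    Hypothetical helper: assumes `sql` has no SETTINGS clause of its own.
    """
    settings = ["use_uncompressed_cache=0"]
    if bypass_page_cache:
        # Optional addition discussed above: also bypass the OS page cache.
        settings.append("min_bytes_to_use_direct_io=1")
    body = sql.strip().rstrip(";").rstrip()
    return f"{body} SETTINGS {', '.join(settings)}"


default_sql = with_benchmark_settings("SELECT count() FROM events WHERE team_id = 2;")
direct_io_sql = with_benchmark_settings(
    "SELECT count() FROM events WHERE team_id = 2", bypass_page_cache=True
)
```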

@robbie-c robbie-c changed the title from "chore(skills): add optimizing-clickhouse-queries skill" to "chore(skills): add optimizing-clickhouse-and-hogql-queries skill" May 28, 2026
Workflow skill for optimizing ClickHouse queries and HogQL queries
(which compile to ClickHouse). Does not cover Postgres / Django ORM /
app-DB queries, which need pganalyze and the Postgres section of
docs/.../databases/query-performance-optimization.md instead.

Structure:

- Step 0 triages what layer the slow query lives at (HogQL printer vs
  hand-written ClickHouse SQL vs Django ORM vs personhog) and redirects
  out for the non-ClickHouse cases. Includes a multi-layer-workflow
  note for coordinators / Celery tasks / Temporal workflows where the
  ClickHouse query lives one dispatch further in.
- Background reads: events / sessions / persons / person overrides /
  cohorts table SQL, HogQL printer entry points, cluster topology in
  posthog/clickhouse/migrations/CLAUDE.md, plus a pointer to find any
  ClickHouse table not in the listed schemas.
- Step 1 extracts the printed ClickHouse SQL from a HogQL query via
  execute_hogql_query() / prepare_and_print_ast() / .ambr snapshots /
  production system.query_log, or skips ahead for hand-written SQL.
- Step 2 calls out the recurring smells with rewrites and links:
  FROM ... FINAL (with the safe rewrite hierarchy: materialized column,
  narrow argMax, wide argMax as a last resort), JSONExtract over
  properties (with the test-vs-prod materialization caveat), primary
  key / skip index coverage, self-joins on events, CTE inlining.
- Step 3 lists EXPLAIN flavors and links the upstream docs.
- Step 4 covers measurement: local ClickHouse for correctness and bytes
  read, Test Cluster for timing, with use_uncompressed_cache=0, median
  of 5, pulling metrics from system.query_log, swapping in the cluster's
  materialized columns before timing, and the discipline to measure
  before proposing.
- Step 5 picks the lowest-blast-radius layer for the fix: query runner,
  new HogQL function, printer rule, or ClickHouse migration. Defers to
  /clickhouse-migrations for migration mechanics.
- Test discipline plus a pointer to the learnings log.
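
The measurement discipline in Step 4 (median of 5, metrics taken from system.query_log rather than request latency) could be sketched roughly as follows. The cluster placeholder, the log_comment tag used to find the runs, and the summarize_runs helper are all assumptions made for illustration; the column names are standard system.query_log columns. The sample rows are made up to show the effect of one outlier run.

```python
from statistics import median

# Assumptions: '<cluster>' and the 'bench-candidate-a' tag are placeholders,
# not values taken from the skill.
QUERY_LOG_SQL = """
SELECT query_duration_ms, read_bytes, memory_usage
FROM clusterAllReplicas('<cluster>', system.query_log)
WHERE type = 'QueryFinish'
  AND is_initial_query
  AND log_comment = 'bench-candidate-a'
ORDER BY event_time DESC
LIMIT 5
"""


def summarize_runs(rows: list[dict]) -> dict:
    """Return the median of each metric across repeated runs of one query."""
    if not rows:
        raise ValueError("no query_log rows to summarize")
    metrics = ("query_duration_ms", "read_bytes", "memory_usage")
    return {m: median(row[m] for row in rows) for m in metrics}


# Made-up rows standing in for five runs fetched with QUERY_LOG_SQL.
sample_rows = [
    {"query_duration_ms": 120, "read_bytes": 30_000_000, "memory_usage": 10},
    {"query_duration_ms": 95, "read_bytes": 30_000_000, "memory_usage": 12},
    {"query_duration_ms": 400, "read_bytes": 30_000_000, "memory_usage": 11},
    {"query_duration_ms": 101, "read_bytes": 30_000_000, "memory_usage": 50},
    {"query_duration_ms": 99, "read_bytes": 30_000_000, "memory_usage": 11},
]
summary = summarize_runs(sample_rows)
```

Using the median keeps a single slow run (400 ms in the sample) from dominating the comparison between candidates.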

references/learnings.md is an append-only case-study log for findings
that contradict or nuance the skill's smell descriptions. Comes with a
PII / customer-data prohibition (the file lands in the public OSS repo)
and a worked first entry: dropping FROM person FINAL via
argMax(properties, version) GROUP BY id was 46% slower than the
original and used ~10x more memory on a 1M-person Test Cluster slice
because argMax buffers wide blobs per group; argMax over the
materialized pmat_$browser column was 4.6x faster and read 33x fewer
bytes. The blanket 'FINAL bad, argMax good' framing is wrong;
materialization dominates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
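
For readers unfamiliar with the three query shapes compared in the learnings entry, the sketch below spells them out as ClickHouse SQL strings. These are illustrative reconstructions, not the queries that were actually measured: the selected columns, the team filter, and the use of JSONExtractString are assumptions. Only argMax(properties, version) GROUP BY id and the pmat_$browser column come from the text above.

```python
# Shape 1: the original, deduplicating at read time with FINAL.
FINAL_QUERY = """
SELECT id, JSONExtractString(properties, '$browser') AS browser
FROM person FINAL
WHERE team_id = 2
"""

# Shape 2: "wide" argMax. Every group buffers the whole properties blob,
# which is what the learnings entry reports as slower and memory-hungry.
WIDE_ARGMAX_QUERY = """
SELECT id, JSONExtractString(argMax(properties, version), '$browser') AS browser
FROM person
WHERE team_id = 2
GROUP BY id
"""

# Shape 3: argMax over the narrow materialized column only.
MATERIALIZED_ARGMAX_QUERY = """
SELECT id, argMax(`pmat_$browser`, version) AS browser
FROM person
WHERE team_id = 2
GROUP BY id
"""

shapes = {
    "final": FINAL_QUERY,
    "wide_argmax": WIDE_ARGMAX_QUERY,
    "materialized_argmax": MATERIALIZED_ARGMAX_QUERY,
}
```
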
@robbie-c robbie-c force-pushed the claude/heuristic-lewin-6d555e branch from d6833ff to 901c994 May 28, 2026 16:25
robbie-c and others added 6 commits May 28, 2026 17:33
Links the materialization-swap mechanism that backs the test-vs-prod
caveat in the JSON smell. Three pieces:

- ee/clickhouse/materialized_columns/columns.py for the registry
  (get_materialized_columns / get_enabled_materialized_columns).
- posthog/hogql/printer/base.py for the printer's visit_property_type
  swap point at line 1354 and its _get_materialized_property_source_*
  helper at 1260.
- posthog/hogql/printer/clickhouse.py for the ClickHouse override at 412.

Notes that the swap happens automatically for HogQL queries (default
assumption: property access is materialized in prod), and that the
exception is hand-written SQL in product code that never goes through
the printer and has to do its own lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
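
The swap described in this commit can be pictured with a deliberately simplified sketch. It is not the code in posthog/hogql/printer/base.py: the function name, the plain-dict registry, and the JSONExtractString fallback are assumptions chosen to show the idea that a property access resolves to a materialized column when one is registered and to JSON extraction otherwise.

```python
from typing import Optional


def property_source(property_name: str, materialized: dict[str, str]) -> str:
    """Pick the SQL expression used to read one property.

    Hypothetical simplification: `materialized` maps property names to
    column names. A real implementation would also need to escape or
    parameterize the property name rather than interpolate it.
    """
    column: Optional[str] = materialized.get(property_name)
    if column is not None:
        # Cheap path: read the narrow materialized column directly.
        return f"`{column}`"
    # Fallback path: parse the JSON blob at query time.
    return f"JSONExtractString(properties, '{property_name}')"


registry = {"$browser": "pmat_$browser"}
swapped = property_source("$browser", registry)
fallback = property_source("$os", registry)
```

Hand-written SQL that bypasses the printer always takes the fallback path unless it performs an equivalent lookup itself, which is the exception this commit message calls out.
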
Three URLs covering the conceptual model that frame the rest of the
reading list:

- query-performance-optimization for the general approach
- hogql-python for the printer pipeline
- clickhouse-queries-new-products for table/query-runner design

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the slow query is hand-written SQL in product code (Python f-strings
shipped via sync_execute / client.read_query / client.execute, not HogQL
printer output), surface this to the user before applying a local fix.
HogQL queries get materialized-column substitution, property-group dispatch,
lazy joins, team-id guards, and a pile of other optimizations automatically
through the printer. Raw SQL has to reimplement each of those or live
without them, so the structural fix is usually to move the query to HogQL
rather than patch the raw string. Migrations and one-shot operational
scripts are reasonable exceptions; long-lived read paths in product code
usually aren't.

Caught while dogfooding on a backfill workflow that f-strings JSONExtract
calls and ships them raw; measured to be reading 1.27 GB on a 1M-person
slice where the same query expressed in HogQL would have automatically
landed on the materialized pmat_$browser column (30 MB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mization

Earlier wording said 'surface this to the user before going further' which
read as 'stop and propose HogQL instead of the local fix'. Reframes it as:
flag up front, then continue with the local optimization. The user gets
both options (local fix now, HogQL move later, or both); the agent
shouldn't withhold the local analysis they asked for.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uilt envelope

HogQL is intentionally read-only. For INSERTs, the recommended pattern is
to express the SELECT in HogQL, print it, and concatenate it into an
'INSERT INTO <table> <printed_select>' string. The read half still gets
materialization, lazy joins, team-id guards, etc.; only the INSERT
wrapper is hand-built.

Adds this as a sentence right after the 'flag and continue' note. Keeps
the migrations/one-shot-scripts caveat as the only true exception.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
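
A minimal sketch of the INSERT envelope pattern, assuming the SELECT has already been printed from HogQL by whatever entry point the caller uses. The build_insert helper, the target table name, and the sample printed SQL are hypothetical; only the 'INSERT INTO <table> <printed_select>' shape comes from the commit message.

```python
def build_insert(table: str, printed_select: str) -> str:
    """Wrap an already-printed SELECT in a hand-built INSERT envelope."""
    select_sql = printed_select.strip()
    first_word = select_sql.split(None, 1)[0].upper() if select_sql else ""
    if first_word not in ("SELECT", "WITH"):
        # Guard against accidentally wrapping something that is not a read query.
        raise ValueError("printed_select must be a SELECT (or WITH ... SELECT) statement")
    return f"INSERT INTO {table} {select_sql}"


# Stand-in for printer output; in practice this string comes from HogQL.
printed = "SELECT id, `pmat_$browser` FROM person WHERE team_id = 2"
insert_sql = build_insert("backfill_target", printed)
```

Only the wrapper is concatenated by hand, so the read half keeps whatever the printer applied to it.
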
… to local md

A blind dogfood test surfaced a routing problem: an agent given 'audit these
slow queries' picked writing-clickhouse-queries (which has 'debugging slow
ClickHouse queries' in its trigger list) instead of the optimization skill.
The agent's worktree also predated the new skill's commit, so it couldn't
even see it, but the routing description overlap is a real issue regardless.

Three updates:

- writing-clickhouse-queries: description now narrows to writing/designing
  scenarios and explicitly redirects existing-query optimization to
  /optimizing-clickhouse-and-hogql-queries. Adds a prominent in-body
  callout at the top so any reader who lands there for an audit gets
  bounced before reading further.
- optimizing-clickhouse-and-hogql-queries: handbook URLs that were
  pointing at posthog.com/handbook switched to the local
  docs/published/handbook/engineering/databases/*.md files. The
  posthog.com pages are built from these md files; linking the source
  works offline and is versioned with the code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@robbie-c robbie-c marked this pull request as ready for review May 28, 2026 17:24
@greptile-apps
Contributor

greptile-apps Bot commented May 28, 2026

Greptile encountered an error while reviewing this PR. Please reach out to support@greptile.com for assistance.

@robbie-c robbie-c added the stamphog Request AI review from stamphog label May 28, 2026
Contributor

@github-actions github-actions Bot left a comment

Purely additive documentation — two new Markdown skill files for ClickHouse/HogQL query optimization guidance and a minor cross-reference update to an existing skill. No production code, API contracts, data models, or executable logic touched. The content is internally consistent Markdown referencing real PostHog files.

@robbie-c robbie-c merged commit dea99c5 into master May 28, 2026
169 checks passed
@robbie-c robbie-c deleted the claude/heuristic-lewin-6d555e branch May 28, 2026 18:49
@deployment-status-posthog

deployment-status-posthog Bot commented May 28, 2026

Deploy status

| Environment | Status | Deployed At | Workflow |
| --- | --- | --- | --- |
| dev | ✅ Deployed | 2026-05-28 19:38 UTC | Run |
| prod-us | ✅ Deployed | 2026-05-28 19:52 UTC | Run |
| prod-eu | ✅ Deployed | 2026-05-28 19:56 UTC | Run |
