feat(hogql): default to cpp with rust-py shadow + parity metrics#60201
Conversation
Make `cpp_with_rust_py_shadow` the default parser mode (cpp stays primary, rust-py runs as the shadow) in both test and prod, to validate the hand-rolled rust-py parser against the cpp oracle on real traffic before ever flipping rust-py to primary. - Add the `cpp_with_rust_py_shadow` ParserMode (schema + regenerated schema.json/schema.py) and resolve it as the no-modifier default. - Test shadows 100% of parses (raising on any divergence so regressions fail loud); prod shadows 1% for the initial rollout (recording only), to ramp up once rust-py looks safe. The rust-py shadow runs on top of the existing cpp parse and is far cheaper, so cpp is never parsed twice. - Shadowed-run telemetry: a `hogql_parser_shadow_comparisons_total` counter (run count plus agreement rate via its result label); cpp/rust durations and their ratio come from the existing per-backend `parse_*_seconds` timer. The raw query behind a divergence is logged at warning (`hogql_parser_shadow_divergence`), the one thing that can't be a metric. - If the rust wheel fails to import, the default drops the shadow (cpp-only) so a broken wheel can't throw or spam the parse path; the unavailability is already logged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🎭 Playwright didn't run on this PR — your changes touch code that could affect E2E behavior, but Playwright is opt-in via label now to keep CI cost down. Add the Most PRs don't need this. Real regressions still get caught on master and fix-forward. |
|
|
Size Change: 0 B Total Size: 80.2 MB ℹ️ View Unchanged
|
Review feedback on #60201: - A test comment claimed sampling is 100% everywhere, but prod is 1%. Clarify that the shadow mode is active in both envs while the rate differs (prod 1%, tests 100%). - Add a prod-mode test that the shadow_rejected branch logs hogql_parser_shadow_divergence carrying the SQL; only the disagree path was covered before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR overviewAll previously flagged issues have been addressed. No open security concerns remain on this pull request. Security reviewNo open security issues remain on this pull request. Fixed/addressed: 3 · PR risk: 0/10 |
Comparing source positions in the shadow surfaced widespread cpp-vs-rust-py divergences on real-world queries, so the comparison stays structure-only (clear_locations) for now; position parity is validated by the cross-backend parser test suite and left to a follow-up. Documents why, and drops the prod sample rate from a test comment so it can't go stale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve cpp/rust parser wheel versions once at import (via importlib.metadata, falling back to "unknown") and attach them to the shadow-rollout telemetry so results can be filtered by version once a fix ships: a `version` label on the parse_*_seconds histogram, and `primary_version`/`shadow_version` on the shadow-comparison counter and the capture_exception divergence properties. Also route the divergent query SQL through capture_exception (the channel that already carries query SQL on parse/execution failures) rather than the structured logs, and drop the now-redundant hogql_parser_shadow_divergence warning whose only payload was that SQL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lift the per-method `parser as parser_module`, `HogQLSyntaxError`, and `parser` imports to module level (per the imports-at-top convention), and collapse the multi-line test comments into single-line notes that stay under 120 chars. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the per-test leading comments and let the test names carry the intent. Where a comment held context the name lacked, rename instead: e.g. the parser-class throw test now spells out "rejects primary accepted input", the agreement test names its version-label coverage, and the prod-path divergence/rejection tests say "in prod".
…st-py-shadow-default
The shadow comparison used dataclass `==`, which is always False for an
AST holding a NaN constant (`float("nan") != float("nan")`) even when
both backends build the identical node. That made the default cpp +
rust-py shadow raise HogQLParserShadowMismatch in TEST for every `nan` /
`+nan` parse (e.g. the test_parser_cpp_json number battery) and would
emit false "disagree" telemetry in prod for any NaN-bearing query.
Fall back to a repr comparison, which is stable for NaN, before flagging
a divergence. repr is only consulted when `==` is already False, so the
int/float leniency of `==` is preserved and only the NaN case flips back
to agreement.
Terseness pass over the comments and docstrings this PR added (version helper, sample rate, shadow counter, _run_shadow_comparison, divergence routing). No behavior change.
|
👋 Visual changes detected for this PR. Review and approve in PostHog Visual Review If these changes are unexpected, they may be caused by a flaky test or a broken snapshot on master. Don't approve — rerun the job or wait for a fix. |
Problem
Let's turn on rust-py in production (shadowing the c++ parser), and tweak the metrics reported.
Claude code continues:
The hand-rolled Rust HogQL parser (
rust-py, which buildsposthog.hogql.astdataclasses directly via PyO3) reached very close to output parity with the C++ ANTLR parser in #59628. Before it can ever become the primary parser, we need to validate it against cpp on real production traffic and watch the parity rate, not just in tests.Changes
cpp_with_rust_py_shadow(added to theParserModeschema). cpp stays the primary and its result is always returned; rust-py runs as a shadow and the two ASTs are compared. With no explicitparserMode, this is now the default in both test and prod._SHADOW_SAMPLE_RATE = 0.01). The shadow never affects the response (primary result is always returned, divergences are only recorded), but keeping prod at 1% bounds the blast radius of any latent rust-py pathology during rollout. It's a one-line bump to 100% once it looks safe.hogql_parser_shadow_comparisons_total{rule, result, primary_version, shadow_version}counter: the number of shadowed runs (sum acrossresult) and the agreement rate (agree/ that sum).resultis one ofagree,disagree,shadow_rejected,shadow_error. The*_versionlabels carry the cpp/rust wheel versions (resolved once at import viaimportlib.metadata) so results can be filtered by version once a fix ships.parse_*_seconds{backend, version}histogram. The rust shadow runs on top of the already-done cpp parse, so cpp is never parsed twice (rust is far cheaper).capture_exception(tagged with rule + both parser versions), not the structured logs. That's the channel that already receives query SQL on parse/execution failures, so it adds no new exposure class for the SQL.Behavior split: prod only records divergences and shadow crashes (never raises into a request), while tests raise on a divergence or on the shadow rejecting input the primary accepted.
How did you test this code?
Agent-authored (Claude Code); requires human review. Automated tests that actually ran (locally, via the project venv):
posthog/hogql/test/test_parser_mode.py: 22 passed (new default resolution, prod-shadows-without-raising, rust-unavailable fallback, counter agreement with version labels, divergence captures the SQL via error tracking, and per-backend version-label coverage).posthog/hogql/compiler/test/test_bytecode.py: 8 passed.printer/resolver/query/metadata) under the new rust-py default shadow: 973 passed. The only failures were 3 pre-existingtest_query.pyClickHouse-execution errors that fail identically on clean master (confirmed by stashing this change), unrelated to parsing.ruffandtyclean (via the pre-commit hook).No manual or UI testing was performed.
Publish to changelog?
no, this is internal parser-rollout tooling with no user-facing change.
Docs update
No user-facing changes. Add
skip-inkeep-docsif needed.🤖 Agent context
Authored with Claude Code (agent-assisted); requires human review, not self-merged.
The telemetry approach evolved during review, worth recording for reviewers:
distinct_id), and an event per parse is the wrong tool and volume.capture_exception) for the divergent SQL. The SQL stays out of the structured logs and rides the error-tracking channel that already carries query SQL on parse/execution failures (ClickHousesystem.query_logis a separate channel, so "already in query_log" didn't justify adding it to logs). Also rejected re-parsing cpp inside the shadow for a precise per-run ratio, since that would parse cpp twice on every prod parse; the durations and ratio reuse the existing per-backend timer instead.Prod sampling deliberately starts at 1% (ramp to 100% later) to limit blast radius if rust-py misbehaves under real traffic.