
fix(llma): use toString zero-UUID guard in generation sampling#60632

Merged
andrewm4894 merged 3 commits into master from fix/llma-touuidorzero-generation-summarization on May 29, 2026


@andrewm4894 andrewm4894 commented May 29, 2026

Problem

$ai_generation_summary has been at 0 since 2026-05-20, which in turn took $ai_generation_clusters to 0 on 2026-05-27 (clustering uses the last 7 days of summary embeddings). Trace summaries were unaffected because the trace-level path doesn't hit the same expression.
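The 7-day lag between the two drops can be checked with a little date arithmetic. This is a minimal sketch that assumes the last day with any summaries was 2026-05-19 (the day before the first zero) and that clustering reads a half-open window of the 7 days before the run day; the function and constant names are illustrative, not taken from the codebase.

```python
from datetime import date, timedelta

WINDOW_DAYS = 7  # from the description: clustering uses the last 7 days of summaries
FIRST_ZERO_SUMMARY_DAY = date(2026, 5, 20)

# Assumption: the last day that still produced summaries is the day before.
LAST_SUMMARY_DAY = FIRST_ZERO_SUMMARY_DAY - timedelta(days=1)


def window_has_summaries(run_day: date) -> bool:
    """True if LAST_SUMMARY_DAY falls inside [run_day - 7 days, run_day)."""
    window_start = run_day - timedelta(days=WINDOW_DAYS)
    return window_start <= LAST_SUMMARY_DAY < run_day


# Walk forward from the first zero-summary day until the window is empty.
first_empty_run = next(
    day
    for day in (FIRST_ZERO_SUMMARY_DAY + timedelta(days=n) for n in range(30))
    if not window_has_summaries(day)
)
print(first_empty_run)  # 2026-05-27
```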

Root cause: the HAVING clause in the generation sampling query uses toUUIDOrZero(''), which is not in the HogQL function allowlist (posthog/hogql/functions/clickhouse/conversions.py). Every llma-generation-summarization-coordinator-schedule child workflow has been failing inside sample_items_in_window_activity with QueryError: Unsupported function call 'toUUIDOrZero(...)'. The existing unit test mocked execute_hogql_query, so the HogQL validator never ran against this query.
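To illustrate why the mocked test missed this, here is a toy sketch of an allowlist check sitting inside the query executor. Everything in it (ALLOWED_FUNCTIONS, validate, execute_query, sample_items) is hypothetical and much simpler than the real HogQL validator; it only shows that replacing the executor with a mock also removes the check.

```python
import re
from unittest import mock

# Hypothetical allowlist for illustration; the real one is far larger.
ALLOWED_FUNCTIONS = {"toUUID", "toString", "toIntOrZero"}


class UnsupportedFunction(Exception):
    pass


def validate(query: str) -> None:
    """Raise if the query calls a function that is not in the allowlist."""
    for name in re.findall(r"\b([A-Za-z_]\w*)\s*\(", query):
        if name not in ALLOWED_FUNCTIONS:
            raise UnsupportedFunction(f"Unsupported function call '{name}(...)'")


def execute_query(query: str) -> list:
    validate(query)  # the check lives inside the executor
    return []        # stand-in for real execution


def sample_items(having: str, execute=execute_query) -> list:
    return execute(f"SELECT 1 HAVING {having}")


BROKEN = "last_generation_id != toUUIDOrZero('')"
FIXED = "last_generation_id != toUUID('00000000-0000-0000-0000-000000000000')"

# With the executor mocked, the broken query passes silently.
sample_items(BROKEN, execute=mock.Mock(return_value=[]))

# With the real executor, the same query is rejected.
try:
    sample_items(BROKEN)
except UnsupportedFunction as exc:
    print(exc)  # Unsupported function call 'toUUIDOrZero(...)'

sample_items(FIXED)  # accepted
```

The design point is the same one the updated test relies on: the guard only helps if the test runs the query through the code that enforces the allowlist.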

Changes

  • posthog/temporal/ai_observability/trace_summarization/sampling.py: replace HAVING last_generation_id != toUUIDOrZero('') with HAVING last_generation_id != toUUID('00000000-0000-0000-0000-000000000000'). toUUID is HogQL-allowed and constant-folded at plan time, so the comparand is a single UInt128 value — the per-row work is a 16-byte UUID compare, no string materialization.
  • posthog/temporal/ai_observability/trace_summarization/tests/test_workflow.py: update test_generation_filter_is_applied_inside_argmaxif to assert the new HAVING shape, then run the materialized query through prepare_and_print_ast. This catches the same class of regression (function not whitelisted) at unit-test time instead of in production.
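The size difference behind the toUUID choice can be seen with Python's standard uuid module. This is only an analogy for the ClickHouse comparison, not a measurement of it, and is_real_generation is a hypothetical helper name.

```python
import uuid

ZERO_UUID = uuid.UUID(int=0)

# The zero UUID in its canonical text form matches the literal used in the fix.
assert str(ZERO_UUID) == "00000000-0000-0000-0000-000000000000"

print(len(ZERO_UUID.bytes))  # 16 bytes: the fixed-width 128-bit value
print(len(str(ZERO_UUID)))   # 36 characters: the hyphenated string form


def is_real_generation(generation_id: uuid.UUID) -> bool:
    """Compare as a 128-bit integer, with no string conversion."""
    return generation_id.int != ZERO_UUID.int


# Both comparison styles agree; the integer form just skips the string step.
candidate = uuid.UUID("12345678-1234-5678-1234-567812345678")
assert is_real_generation(candidate) == (str(candidate) != str(ZERO_UUID))
assert is_real_generation(ZERO_UUID) == (str(ZERO_UUID) != str(ZERO_UUID))
```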

How did you test this code?

I'm an agent (Claude Opus 4.7) collaborating with @andrewm4894.

Automated:

  • hogli test posthog/temporal/ai_observability/trace_summarization/tests/test_workflow.py — all pass, including the updated guard that now exercises the real HogQL printer.

Local end-to-end against real ClickHouse:

  • Confirmed local team 1 had 376 $ai_generation events spanning May 18–28 (via execute-sql).
  • Ran the activity directly through temporalio.testing.ActivityEnvironment in manage.py shell:
    inputs = BatchSummarizationInputs(
        team_id=1, max_items=20, analysis_level='generation',
        window_minutes=60*24*30,
        window_start='2026-04-29T00:00:00',
        window_end='2026-05-29T23:00:00',
    )
    items = await ActivityEnvironment().run(sample_items_in_window_activity, inputs)
    Returned 20 SampledItems with valid trace_id + generation_id, no QueryError. This is the same code path that's been crashing in prod since 2026-05-20.

Negative-case verification (reproducing the bug):

  • Temporarily reverted the HAVING clause back to toUUIDOrZero('') and re-ran the same activity.
  • Reproduced the exact production error string: QueryError: Unsupported function call 'toUUIDOrZero(...)'. Perhaps you meant 'toIntOrZero(...)'?
  • Restored the fix; git diff HEAD confirms the working tree matches the committed version.

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

Docs update

n/a

🤖 Agent context

  • Triggered from a Slack thread about the LLMA Generation Cluster Events anomaly alert. An earlier in-thread agent run had already pointed at #59186 (fix(llma): apply cluster event filters inside generation argMaxIf) and the unwhitelisted toUUIDOrZero. Verified independently by grepping posthog/hogql/functions/ (toUUIDOrZero absent; toString and toUUID present in clickhouse/conversions.py).
  • First pass shipped with toString(last_generation_id) != '...'. Reviewed the perf shape on a follow-up: toUUID('...') is constant-folded at plan time and reduces the per-row work to a UInt128 compare instead of materializing a 36-byte string per group. Swapped to the UUID form — same semantics, strictly less work for ClickHouse.
  • The prepare_and_print_ast step in the test needs database_sync_to_async (printer loads the team via Django ORM) and enable_select_queries=True (default HogQLContext blocks full SELECTs). Both caught by failing test runs while iterating, not guessed up front.
  • Verified both directions against local ClickHouse: with the fix the activity returns 20 SampledItems; with the broken HAVING restored it raises the same QueryError seen in prod. Working tree confirmed clean against HEAD afterwards. Re-ran end-to-end after the toString → toUUID swap, same 20-items result.

toUUIDOrZero is not in the HogQL function allowlist, so every
llma-generation-summarization-coordinator child workflow has been
failing inside sample_items_in_window_activity since #59186 with
QueryError: Unsupported function call 'toUUIDOrZero(...)'. That
collapsed $ai_generation_summary to 0, and (7 days later) $ai_generation_clusters too.

Replace the HAVING comparand with toString(...) against the literal
zero UUID — both are HogQL-supported. Update the test to assert the
new shape and to run the materialized query through prepare_and_print_ast
so the next "function not whitelisted" regression fails in CI, not prod.
@andrewm4894 andrewm4894 self-assigned this May 29, 2026
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team May 29, 2026 09:48
greptile-apps Bot commented May 29, 2026

Reviews (1): Last reviewed commit: "fix(llma): use toString zero-UUID guard ..."

@andrewm4894 andrewm4894 enabled auto-merge (squash) May 29, 2026 09:53
toUUID('...') is constant-folded at plan time and gives a UInt128
compare instead of a per-row 36-byte string compare. Same semantics,
cheaper. Test guard updated to match.
@andrewm4894 andrewm4894 merged commit b169dbb into master May 29, 2026
203 checks passed
@andrewm4894 andrewm4894 deleted the fix/llma-touuidorzero-generation-summarization branch May 29, 2026 10:40
deployment-status-posthog Bot commented May 29, 2026

Deploy status

Environment   Status        Deployed At            Workflow
dev           ✅ Deployed   2026-05-29 11:08 UTC   Run
prod-us       ✅ Deployed   2026-05-29 11:30 UTC   Run
prod-eu       ✅ Deployed   2026-05-29 11:42 UTC   Run
