Skip to content

fix(aiobs): stop trace-clustering self-perpetuating loop#60259

Merged
carlos-marchal-ph merged 2 commits into
masterfrom
fix(aio)/trace-clustering-loop
May 27, 2026
Merged

fix(aiobs): stop trace-clustering self-perpetuating loop#60259
carlos-marchal-ph merged 2 commits into
masterfrom
fix(aio)/trace-clustering-loop

Conversation

@carlos-marchal-ph
Copy link
Copy Markdown
Contributor

Problem

The daily trace-clustering coordinator was emitting $ai_generation_clusters and $ai_trace_clusters events on behalf of teams that had nothing to cluster, and those empty events kept the same teams re-eligible for clustering the next day. The result was a self-perpetuating loop that inflated the apparent set of AI-active teams by ~50% versus reality, and roughly 70% of cluster events in the last week carried $ai_total_items_analyzed = 0.

Three colluding bugs:

  1. The DailyTraceClusteringWorkflow did not short-circuit when the compute activity returned an empty result — it still ran labeling, aggregates, and emission, writing a cluster event with zero items.
  2. Team discovery derived its trigger-event list from AIEventType, which includes the cluster events themselves. Any team that had an empty cluster event written on its behalf stayed eligible for the next 30 days regardless of customer activity.
  3. The discovery window (30 days) was decoupled from the workflow window (7 days), so even without bug Reports that users will need #1 discovery would enroll teams whose data the workflow then ignored.

Changes

  • Fix 1 — Workflow short-circuit (workflow.py): after the compute activity returns, if compute_result.items is empty, return a zero-item ClusteringResult immediately and skip labeling, aggregates, and emission. Stops the loop at the source.
  • Fix 2 — Decoupled discovery trigger list (llm_analytics_usage_report.py, team_discovery.py): added LLM_ANALYTICS_DISCOVERY_TRIGGER_EVENTS containing only customer-emitted signals (no $ai_*_clusters, no $ai_*_summary, no $ai_tag). get_teams_with_ai_events takes the trigger list as a required parameter; usage-report and discovery pass their respective constants explicitly.
  • Fix 3 — Aligned discovery window (team_discovery.py, clustering coordinator): DEFAULT_DISCOVERY_LOOKBACK_DAYS is now 7 (matches DEFAULT_LOOKBACK_DAYS). The clustering coordinator passes lookback_days=inputs.lookback_days through TeamDiscoveryInput, so discovery scopes its eligibility query to the workflow's own window.

How did you test this code?

I'm an agent. Automated tests only — no manual verification.

  • New TestEmptyComputeResultShortCircuit::test_empty_compute_skips_label_aggregates_and_emit drives the workflow with a compute activity returning an empty ClusteringComputeResult and asserts that the label/aggregates/emit activities are never invoked and that no event is written.
  • New TestGetTeamIdsForLlmAnalytics::test_uses_discovery_trigger_events_not_report_list asserts the narrow LLM_ANALYTICS_DISCOVERY_TRIGGER_EVENTS list is passed to the ClickHouse query and excludes the cluster/summary/tag events.
  • New TestGetTeamIdsForLlmAnalytics::test_lookback_uses_inputs_value verifies the discovery window now derives from TeamDiscoveryInput.lookback_days.
  • Existing TestGetLlmaWorkflowConfig tests updated for the simplified LLMAWorkflowConfig (no more discovery_lookback_days field).
  • Existing usage-report tests updated to pass LLM_ANALYTICS_REPORT_TRIGGER_EVENTS explicitly.
  • Full local run of trace_clustering/tests/test_workflow.py, trace_clustering/tests/test_coordinator.py, and llm_analytics/test_team_discovery.py — 53/53 pass. Ruff lint and format clean on all modified files.

Publish to changelog?

no

Docs update

No docs change needed.

🤖 Agent context

Authored by PostHog Code (Claude Opus 4.7) under my supervision.

The bug analysis (three colluding causes, prod impact numbers, historical chronology) was provided up-front so the agent had concrete file:line targets and could focus on implementation rather than discovery.

Two notable course corrections during the session:

  1. Avoided over-preserving the FF override hook for discovery_lookback_days. First pass kept a full FF-payload override path (config field as int | None, fallback ternary in the activity, dedicated tests). On reflection, with the coordinator now passing lookback_days through explicitly, the override hook was speculative — removed the field, validation, ternary, and ~3 obsolete test cases. Can be added back if any team actually needs to widen the window without a deploy.
  2. Avoided over-engineering get_teams_with_ai_events's signature. First pass made trigger_events optional with a None default that fell back to the broad report list. Made the parameter required and updated both call sites (usage-report and discovery) to pass their constants explicitly — call sites now self-document which list is in play.
  3. Rejected a _shortcircuit_pipeline helper. Tried extracting the short-circuit return into a helper function; the data lives across inputs, compute_result, and local workflow state with no single carrier object, so the helper had 6 loose parameters and was less readable than the inline version. Kept the inline early return.

No manual UI verification — pure backend / Temporal workflow logic. Verification per the original spec's ClickHouse query will need to happen post-deploy.

@carlos-marchal-ph carlos-marchal-ph requested a review from a team May 27, 2026 14:44
@carlos-marchal-ph carlos-marchal-ph self-assigned this May 27, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

🎭 Playwright didn't run on this PR — your changes touch code that could affect E2E behavior, but Playwright is opt-in via label now to keep CI cost down.

Add the run-playwright label if you want an E2E sweep before merging — CI will pick it up automatically.

Most PRs don't need this. Real regressions still get caught on master and fix-forward.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 27, 2026

Comments Outside Diff (1)

  1. posthog/tasks/llm_analytics_usage_report.py, line 239-258 (link)

    P2 The SQL named parameter llm_analytics_report_trigger_events is now used by both the usage-report caller and the discovery caller, so the name no longer reflects its purpose and will confuse anyone reading the query. Renaming it to trigger_events keeps the SQL aligned with the function signature.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: posthog/tasks/llm_analytics_usage_report.py
    Line: 239-258
    
    Comment:
    The SQL named parameter `llm_analytics_report_trigger_events` is now used by both the usage-report caller and the discovery caller, so the name no longer reflects its purpose and will confuse anyone reading the query. Renaming it to `trigger_events` keeps the SQL aligned with the function signature.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
posthog/temporal/llm_analytics/team_discovery.py:143-146
**Misleading comment after `lookback_days` change**

The comment says "FF payload overrides TeamDiscoveryInput when set" but the lines that follow do the opposite for `lookback_days`: it now reads from `inputs.lookback_days` (the Temporal input), not from the feature-flag config. A reader scanning this block will incorrectly infer that both `sample_percentage` and `lookback_days` are sourced from the FF payload, when only the former is. The comment should be split or reworded to reflect that `sample_percentage` comes from the FF config while `lookback_days` comes from `inputs`.

### Issue 2 of 2
posthog/tasks/llm_analytics_usage_report.py:239-258
The SQL named parameter `llm_analytics_report_trigger_events` is now used by both the usage-report caller and the discovery caller, so the name no longer reflects its purpose and will confuse anyone reading the query. Renaming it to `trigger_events` keeps the SQL aligned with the function signature.

```suggestion
    query = """
        SELECT DISTINCT team_id
        FROM events
        WHERE event IN %(trigger_events)s
          AND timestamp >= %(begin)s
          AND timestamp < %(end)s
    """

    with tags_context(
        product=Product.LLM_ANALYTICS,
        feature=Feature.QUERY,
        kind="celery",
        id=CELERY_TASK_ID,
        name="Get teams with AI observability trigger events",
        workload=Workload.OFFLINE.value,
    ):
        results = sync_execute(
            query,
            {
                "trigger_events": trigger_events,
```

Reviews (1): Last reviewed commit: "fix(llma): stop trace-clustering self-pe..." | Re-trigger Greptile

Comment thread posthog/temporal/llm_analytics/team_discovery.py Outdated
Comment thread posthog/tasks/llm_analytics_usage_report.py Outdated
@carlos-marchal-ph carlos-marchal-ph merged commit 9880103 into master May 27, 2026
214 checks passed
@carlos-marchal-ph carlos-marchal-ph deleted the fix(aio)/trace-clustering-loop branch May 27, 2026 16:11
@deployment-status-posthog
Copy link
Copy Markdown

deployment-status-posthog Bot commented May 27, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-05-27 16:49 UTC Run
prod-us ✅ Deployed 2026-05-27 17:03 UTC Run
prod-eu ✅ Deployed 2026-05-27 17:06 UTC Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants