fix(usage-report): trim oversize org payloads before capture#60462

Merged
abhischekt merged 5 commits into master from posthog-code/trim-oversize-usage-report-payload
Jun 1, 2026

@abhischekt

Problem

The organization usage report event embeds a per-team breakdown under properties.teams. For very large organizations (a few hundred projects, each contributing ~3 KB of counters), the serialized payload exceeds Kafka's default message.max.bytes (~1 MiB). The broker rejects the event, and because posthog-python capture is fire-and-forget, the drop is silent: no organization usage report failure event fires, and the affected orgs simply don't appear in the dogfood project's usage data.

The largest org currently captured sits around 938 KB — right against the ceiling — and nothing past ~300 teams shows up at all. This was first surfaced by a stakeholder noticing one of their tracked orgs had no organization usage report events despite obvious activity in other event types.
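
To make the scaling concrete, here is a rough sketch that builds a synthetic report and measures its serialized size as the team count grows. The `make_report` helper, the counter names, and the 150-counters-per-team shape are invented for illustration and are not the real schema; only the `teams` nesting and the ~1 MiB ceiling come from the description above.

```python
import json

KAFKA_DEFAULT_MAX_BYTES = 1_048_576  # ~1 MiB, the broker ceiling cited above


def make_report(team_count: int, counters_per_team: int = 150) -> dict:
    """Build a synthetic org report (hypothetical shape, not the real schema)."""
    teams = {
        str(team_id): {f"counter_{i}": 0 for i in range(counters_per_team)}
        for team_id in range(team_count)
    }
    return {"organization_id": "org-demo", "team_count": team_count, "teams": teams}


def serialized_size(report: dict) -> int:
    """Size in bytes of the JSON-encoded report."""
    return len(json.dumps(report, default=str).encode("utf-8"))


if __name__ == "__main__":
    for n in (10, 100, 300, 600):
        size = serialized_size(make_report(n))
        status = "over" if size > KAFKA_DEFAULT_MAX_BYTES else "under"
        print(f"{n:>4} teams -> {size:>9,} bytes ({status} the broker limit)")
```

The size grows linearly with the number of teams, so any fixed per-message limit is eventually crossed once an org has enough projects.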

Changes

  • Add MAX_USAGE_REPORT_PAYLOAD_BYTES = 900_000 with headroom for the Kafka envelope and the groups / $set / scope properties that capture_event layers on top.
  • Introduce _trim_oversize_usage_report_payload(...): returns the dict unchanged when small enough; otherwise returns a copy with teams={} and teams_omitted_due_to_size=True. Every org-level counter (including team_count) is preserved so existing insights and downstream consumers keep working.
  • capture_report passes the trimmed payload to the dogfood capture_event. The billing SQS path (_queue_report), group_identify, and per-person captures still see the full report — only the in-app analytics capture is trimmed.
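
The behavior described in the bullets above could be sketched roughly as follows. This is a minimal illustration, not the merged code: the function name drops the leading underscore, the `max_bytes` parameter is an assumption, and measuring the UTF-8 byte length of `json.dumps(..., default=str)` is a guess at how the real helper sizes the payload.

```python
import json
from typing import Any

MAX_USAGE_REPORT_PAYLOAD_BYTES = 900_000


def trim_oversize_usage_report_payload(
    full_report_dict: dict[str, Any],
    max_bytes: int = MAX_USAGE_REPORT_PAYLOAD_BYTES,  # assumed parameter
) -> dict[str, Any]:
    # Measure the payload as it would be serialized; default=str keeps
    # datetimes and similar values from raising TypeError.
    size = len(json.dumps(full_report_dict, default=str).encode("utf-8"))
    if size <= max_bytes:
        return full_report_dict  # same object, no copy

    trimmed = dict(full_report_dict)  # shallow copy; the original is untouched
    trimmed["teams"] = {}
    trimmed["teams_omitted_due_to_size"] = True
    return trimmed
```

Because only the top-level `teams` key is replaced, a shallow copy is enough to leave the caller's dict intact for the billing and `group_identify` paths.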

The durable fix is to re-enable per-org capture from the v2 Temporal workflow (TODO at posthog/temporal/usage_report/activities.py:131-135), which chunks reports to S3 and avoids the Kafka per-message ceiling entirely. This PR is the narrow band-aid on the legacy Celery path until that cutover lands.

How did you test this code?

I'm an agent (PostHog Code). I did not perform any manual production verification. I added three automated tests in posthog/tasks/test/test_usage_report.py:

  • TestTrimOversizeUsageReportPayload.test_returns_dict_unchanged_when_under_limit — small payload returns the same object identity (no copy, no allocation).
  • TestTrimOversizeUsageReportPayload.test_drops_teams_and_sets_marker_when_over_limit — synthetic 600-team payload exceeds the threshold, the trimmed copy drops teams, sets the marker, keeps org-level counters, and serializes under the limit. Original dict is untouched.
  • TestCaptureReportTrimsOversizePayload.test_capture_report_drops_teams_when_payload_too_large — end-to-end through capture_report with a mocked PHA client, asserting the pha_client.capture call receives the trimmed properties.

All three pass. The pre-existing TestCaptureReportGroupProperties.test_capture_report_sets_org_group_properties still passes (group_identify continues to receive the full counts). Ran ruff check, ruff format, and ty check via the pre-commit hooks.
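
For readers who want the shape of the unit tests without opening the diff, here is a self-contained sketch of the under-limit and over-limit checks. `trim_payload` is a stand-in with the behavior described in this PR, and `build_report`, `event_count`, and the filler strings are hypothetical; the real tests live in posthog/tasks/test/test_usage_report.py and may differ.

```python
import json
import unittest

MAX_BYTES = 900_000  # mirrors MAX_USAGE_REPORT_PAYLOAD_BYTES


def trim_payload(report: dict, max_bytes: int = MAX_BYTES) -> dict:
    """Stand-in for the helper under test (assumed behavior, not the real code)."""
    if len(json.dumps(report, default=str)) <= max_bytes:
        return report
    return {**report, "teams": {}, "teams_omitted_due_to_size": True}


def build_report(team_count: int) -> dict:
    """Synthetic report with roughly 3 KB of filler per team."""
    teams = {str(i): {"filler": "x" * 3_000} for i in range(team_count)}
    return {"team_count": team_count, "event_count": 42, "teams": teams}


class TestTrimSketch(unittest.TestCase):
    def test_trim_by_team_count(self):
        cases = ((3, False), (600, True))
        for team_count, expect_trimmed in cases:
            with self.subTest(team_count=team_count):
                report = build_report(team_count)
                result = trim_payload(report)
                # Org-level counters survive in both cases.
                self.assertEqual(result["team_count"], team_count)
                self.assertEqual(result["event_count"], 42)
                if expect_trimmed:
                    self.assertEqual(result["teams"], {})
                    self.assertTrue(result["teams_omitted_due_to_size"])
                    self.assertLessEqual(len(json.dumps(result, default=str)), MAX_BYTES)
                    # The original dict must not be mutated.
                    self.assertEqual(len(report["teams"]), team_count)
                else:
                    self.assertIs(result, report)
```

Saved as a module, this can be run with `python -m unittest`.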

Publish to changelog?

no

Docs update

N/A — internal analytics capture path.

🤖 Agent context

Authored by PostHog Code in response to a CSM-reported anomaly: an internal insight filtering organization usage report on a specific organization_id was empty, but the org was clearly active in other event types. Investigation via the PostHog MCP confirmed the EU usage-report cron was running and emitting reports for ~140k other orgs in the same window — just not this one. No failure event existed for it either. A scan by team_count showed the largest captured org sat at 938 KB of payload (~294 teams). Nothing beyond that appeared at all — the pattern matched a per-message size limit, not a logic exclusion.

Decisions made along the way:

  • Scope of the fix. Considered (a) re-enabling per-org capture from the v2 Temporal flow, (b) emitting a separate organization usage report per team event, and (c) adding only a visibility-signal event. Picked the minimal trim because the legacy Celery path is still the only producer of organization usage report, the durable fix belongs with the v2 cutover, and bundling it here would expand blast radius unnecessarily.
  • What to drop. Dropped teams wholesale rather than truncating or sampling. team_count and every org-level counter are already in the top of the report; consumers that need per-team granularity can use the chunked JSONL the v2 workflow already writes to S3.
  • Where to apply. Only on the dogfood capture_event call — the billing SQS payload has separate chunking and a much larger size budget, so it stays unmodified.
  • Threshold choice. 900 KB leaves ~140 KB of headroom for the Kafka message envelope plus the groups, $set, scope, and instance_metadata keys that capture_event layers on. The largest currently-captured payload (938 KB) would not have been trimmed under this threshold, so steady-state behavior is unchanged for everyone already getting through.
  • Marker field. teams_omitted_due_to_size=True rather than e.g. removing the key silently, so downstream insight authors can detect the truncation and either backfill from the v2 chunks or skip the org.
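
The headroom figure can be checked with a few lines of arithmetic. This sketch assumes the broker ceiling is exactly 1 MiB (the precise default can vary by Kafka version and configuration) and uses the ~3 KB per team estimate from the problem statement; the variable names are illustrative.

```python
KAFKA_DEFAULT_MAX_BYTES = 1_048_576       # assumed: exactly 1 MiB
MAX_USAGE_REPORT_PAYLOAD_BYTES = 900_000  # threshold from this PR
APPROX_BYTES_PER_TEAM = 3_000             # ~3 KB of counters per team

headroom_bytes = KAFKA_DEFAULT_MAX_BYTES - MAX_USAGE_REPORT_PAYLOAD_BYTES
teams_that_fit = MAX_USAGE_REPORT_PAYLOAD_BYTES // APPROX_BYTES_PER_TEAM

print(f"headroom: {headroom_bytes:,} bytes")                   # prints "headroom: 148,576 bytes"
print(f"teams under the threshold: about {teams_that_fit}")    # prints "teams under the threshold: about 300"
```

The result is in the same ballpark as the ~140 KB of headroom quoted above, and the roughly 300-team budget lines up with the observation that nothing past ~300 teams was being captured.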

Created with PostHog Code

Large orgs with several hundred projects produced
`organization usage report` events above Kafka's ~1 MiB
`message.max.bytes`, so the broker dropped them at ingestion. The
`posthog-python` capture is fire-and-forget, so nothing surfaced —
the affected orgs simply never appeared in the dogfood project.

Trim the per-team breakdown when the serialized payload would
exceed the limit, preserving every org-level counter and marking
the trimmed event with `teams_omitted_due_to_size=True` for
downstream consumers. The billing SQS path is untouched.

Generated-By: PostHog Code
Task-Id: 5a0684db-daba-47e7-86ae-54b4701f995f
@greptile-apps

greptile-apps Bot commented May 28, 2026

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
posthog/tasks/test/test_usage_report.py:1741
**`json.dumps` precondition checks missing `default=str`**

The precondition assertions at lines 1741 and 1777 call `json.dumps(oversize_report)` and `json.dumps(full_report_dict)` without `default=str`, while `_trim_oversize_usage_report_payload` uses `json.dumps(full_report_dict, default=str)`. In production, the real usage-report dict can contain datetime or other non-JSON-serializable values; without `default=str`, these assertions would raise `TypeError` when the test data happens to include them, making the precondition a worse model of the actual check than intended. Both precondition calls should match the implementation's `default=str`.
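
A small illustration of the mismatch described in this issue, using only the standard library. The `period_start` key is a hypothetical example of a non-JSON-serializable value; the real report's field names may differ.

```python
import json
from datetime import datetime, timezone

# Hypothetical report fragment containing a datetime value.
report = {"team_count": 1, "period_start": datetime(2026, 5, 1, tzinfo=timezone.utc)}

try:
    json.dumps(report)  # no fallback encoder: datetime is rejected
    raised = False
except TypeError:
    raised = True

# With default=str, unknown types are converted via str() instead of raising.
encoded = json.dumps(report, default=str)

print(raised)
print(encoded)
```

Using the same `default=str` in the precondition as in the helper means the test measures the same serialized form the helper measures.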

### Issue 2 of 2
posthog/tasks/test/test_usage_report.py:1714-1752
**Non-parameterised tests**

The two methods in `TestTrimOversizeUsageReportPayload` — one for the under-limit path and one for the over-limit path — are good candidates for a single `@pytest.mark.parametrize` or `subTest` pattern, in line with the project's preference for parameterised tests. The under-limit case only asserts identity, but that could be expressed as a boolean flag in the parameter set (e.g., `expect_same_object=True/False`) alongside the other assertions, keeping both scenarios visible in one place without duplicating the test body structure.

Reviews (1): Last reviewed commit: "fix(usage-report): trim oversize org pay..." | Re-trigger Greptile

@github-actions

github-actions Bot commented May 28, 2026

🎭 Playwright report · View test results →

⚠️ 1 flaky test:

  • Inline editing insight title via compact card popover (chromium)

These issues are not necessarily caused by your changes.
Annoyed by this comment? Help fix flakies and failures and it'll disappear!

@ceyniustranberg

> Dropped teams wholesale rather than truncating or sampling. team_count and every org-level counter are already in the top of the report;

Worth double-checking whether this breaks any destinations.

I imagine option (b), emitting a per-team usage report event, would produce a lot of events, but I'm curious whether you would consider it at some point?

@ceyniustranberg ceyniustranberg left a comment

As far as I'm concerned, looks good

…versize-usage-report-payload

# Conflicts:
#	posthog/tasks/usage_report.py
Fold the under-/over-limit cases into one parameterized test driven by
team_count, and serialize the precondition assertions with default=str
so they mirror what _trim_oversize_usage_report_payload measures.
…versize-usage-report-payload

Resolve conflict in posthog/tasks/usage_report.py: keep both master's
get_teams_with_sdk_logs_records_in_period helper and this branch's
_trim_oversize_usage_report_payload helper.

Generated-By: PostHog Code
Task-Id: 5a0684db-daba-47e7-86ae-54b4701f995f
@abhischekt abhischekt enabled auto-merge (squash) June 1, 2026 09:12
@abhischekt abhischekt merged commit 779336f into master Jun 1, 2026
197 checks passed
@abhischekt abhischekt deleted the posthog-code/trim-oversize-usage-report-payload branch June 1, 2026 09:27
@deployment-status-posthog

deployment-status-posthog Bot commented Jun 1, 2026

Deploy status

| Environment | Status | Deployed At | Workflow |
| --- | --- | --- | --- |
| dev | ✅ Deployed | 2026-06-01 09:58 UTC | Run |
| prod-us | ✅ Deployed | 2026-06-01 10:10 UTC | Run |
| prod-eu | ✅ Deployed | 2026-06-01 10:14 UTC | Run |
