fix(usage-report): trim oversize org payloads before capture #60462
Merged
abhischekt merged 5 commits on Jun 1, 2026
Large orgs with several hundred projects produced `organization usage report` events above Kafka's ~1 MiB `message.max.bytes`, so the broker dropped them at ingestion. The `posthog-python` capture is fire-and-forget, so nothing surfaced; the affected orgs simply never appeared in the dogfood project.

Trim the per-team breakdown when the serialized payload would exceed the limit, preserving every org-level counter and marking the trimmed event with `teams_omitted_due_to_size=True` for downstream consumers. The billing SQS path is untouched.

Generated-By: PostHog Code
Task-Id: 5a0684db-daba-47e7-86ae-54b4701f995f
Contributor
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
posthog/tasks/test/test_usage_report.py:1741
**`json.dumps` precondition checks missing `default=str`**
The precondition assertions at lines 1741 and 1777 call `json.dumps(oversize_report)` and `json.dumps(full_report_dict)` without `default=str`, while `_trim_oversize_usage_report_payload` uses `json.dumps(full_report_dict, default=str)`. In production, the real usage-report dict can contain datetime or other non-JSON-serializable values; without `default=str`, these assertions would raise `TypeError` when the test data happens to include them, making the precondition a worse model of the actual check than intended. Both precondition calls should match the implementation's `default=str`.
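To make the difference concrete, here is a small standalone illustration of how `json.dumps` behaves with and without `default=str`. The `report_fragment` dict and its `period_start` key are hypothetical stand-ins, not fields confirmed to exist in the real usage report.

```python
import json
from datetime import datetime, timezone

# Hypothetical fragment of a usage report containing a non-JSON-native value.
report_fragment = {
    "organization_id": "org-1",
    "period_start": datetime(2026, 5, 1, tzinfo=timezone.utc),
}

try:
    json.dumps(report_fragment)
    plain_dumps_failed = False
except TypeError:
    # datetime is not JSON serializable without a fallback encoder.
    plain_dumps_failed = True

# With default=str the datetime is rendered via str(), so the size can be measured.
serialized = json.dumps(report_fragment, default=str)
size_in_bytes = len(serialized.encode("utf-8"))
```

A precondition that calls `json.dumps` without the fallback would fail on this fragment before it ever compared sizes, which is the mismatch the issue describes.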
### Issue 2 of 2
posthog/tasks/test/test_usage_report.py:1714-1752
**Non-parameterised tests**
The two methods in `TestTrimOversizeUsageReportPayload` — one for the under-limit path and one for the over-limit path — are good candidates for a single `@pytest.mark.parametrize` or `subTest` pattern, in line with the project's preference for parameterised tests. The under-limit case only asserts identity, but that could be expressed as a boolean flag in the parameter set (e.g., `expect_same_object=True/False`) alongside the other assertions, keeping both scenarios visible in one place without duplicating the test body structure.
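One way the folded test might look is sketched below, using the standard library's `subTest` (one of the two patterns named above) so the example runs without extra dependencies. The `trim` function is a simplified stand-in for `_trim_oversize_usage_report_payload`, and `build_report` with its `padding` field is a hypothetical fixture; the project's actual test may use a different parameterization helper.

```python
import json
import unittest

MAX_BYTES = 900_000  # threshold quoted in the PR description


def trim(report: dict) -> dict:
    """Stand-in for _trim_oversize_usage_report_payload (assumed behavior)."""
    if len(json.dumps(report, default=str)) <= MAX_BYTES:
        return report
    return {**report, "teams": {}, "teams_omitted_due_to_size": True}


def build_report(team_count: int) -> dict:
    """Hypothetical fixture: each team carries about 3 KB of padding."""
    teams = {str(i): {"padding": "x" * 3000} for i in range(team_count)}
    return {"team_count": team_count, "teams": teams}


class TestTrimOversizeUsageReportPayload(unittest.TestCase):
    def test_trim_by_team_count(self) -> None:
        # Each case is (team_count, expect_same_object).
        cases = [(3, True), (600, False)]
        for team_count, expect_same_object in cases:
            with self.subTest(team_count=team_count):
                report = build_report(team_count)
                result = trim(report)
                self.assertEqual(result is report, expect_same_object)
                self.assertEqual(result["team_count"], team_count)
                self.assertLessEqual(len(json.dumps(result, default=str)), MAX_BYTES)
                if not expect_same_object:
                    self.assertEqual(result["teams"], {})
                    self.assertTrue(result["teams_omitted_due_to_size"])
                    # The original report must not be mutated.
                    self.assertEqual(len(report["teams"]), team_count)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTrimOversizeUsageReportPayload)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Driving both paths from `team_count` keeps the under-limit and over-limit scenarios side by side, with the identity check expressed as a flag in the case table.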
Reviews (1): Last reviewed commit: "fix(usage-report): trim oversize org pay..."
Contributor
Worth double-checking whether this breaks any destinations. I imagine that option (b), a per-team usage report, would produce lots of events, but I'm curious whether you would consider it at some point.
ceyniustranberg approved these changes on May 29, 2026
ceyniustranberg (Contributor) left a comment:
As far as I'm concerned, looks good
…versize-usage-report-payload

# Conflicts:
#   posthog/tasks/usage_report.py
Fold the under-/over-limit cases into one parameterized test driven by team_count, and serialize the precondition assertions with default=str so they mirror what _trim_oversize_usage_report_payload measures.
…versize-usage-report-payload

Resolve conflict in posthog/tasks/usage_report.py: keep both master's get_teams_with_sdk_logs_records_in_period helper and this branch's _trim_oversize_usage_report_payload helper.

Generated-By: PostHog Code
Task-Id: 5a0684db-daba-47e7-86ae-54b4701f995f
…versize-usage-report-payload
Problem

The `organization usage report` event embeds a per-team breakdown under `properties.teams`. For very large organizations (a few hundred projects, each contributing ~3 KB of counters), the serialized payload exceeds Kafka's default `message.max.bytes` (~1 MiB). The broker rejects the event, and because `posthog-python` capture is fire-and-forget, the drop is silent: no `organization usage report failure` event fires, and the affected orgs simply don't appear in the dogfood project's usage data.

The largest org currently captured sits around 938 KB, right against the ceiling, and nothing past ~300 teams shows up at all. This was first surfaced by a stakeholder noticing that one of their tracked orgs had no `organization usage report` events despite obvious activity in other event types.

Changes
- `MAX_USAGE_REPORT_PAYLOAD_BYTES = 900_000`, with headroom for the Kafka envelope and the `groups`/`$set`/`scope` properties that `capture_event` layers on top.
- `_trim_oversize_usage_report_payload(...)`: returns the dict unchanged when small enough; otherwise returns a copy with `teams={}` and `teams_omitted_due_to_size=True`. Every org-level counter (including `team_count`) is preserved so existing insights and downstream consumers keep working.
- `capture_report` passes the trimmed payload to the dogfood `capture_event`. The billing SQS path (`_queue_report`), `group_identify`, and per-person captures still see the full report; only the in-app analytics capture is trimmed.

The durable fix is to re-enable per-org capture from the v2 Temporal workflow (TODO at `posthog/temporal/usage_report/activities.py:131-135`), which chunks reports to S3 and avoids the Kafka per-message ceiling entirely. This PR is the narrow band-aid on the legacy Celery path until that cutover lands.

How did you test this code?
I'm an agent (PostHog Code). I did not perform any manual production verification. I added three automated tests in `posthog/tasks/test/test_usage_report.py`:

- `TestTrimOversizeUsageReportPayload.test_returns_dict_unchanged_when_under_limit`: a small payload returns the same object identity (no copy, no allocation).
- `TestTrimOversizeUsageReportPayload.test_drops_teams_and_sets_marker_when_over_limit`: a synthetic 600-team payload exceeds the threshold; the trimmed copy drops `teams`, sets the marker, keeps org-level counters, and serializes under the limit. The original dict is untouched.
- `TestCaptureReportTrimsOversizePayload.test_capture_report_drops_teams_when_payload_too_large`: end-to-end through `capture_report` with a mocked PHA client, asserting that the `pha_client.capture` call receives the trimmed properties.

All three pass. The pre-existing `TestCaptureReportGroupProperties.test_capture_report_sets_org_group_properties` still passes (`group_identify` continues to receive the full counts). Ran `ruff check`, `ruff format`, and `ty check` via the pre-commit hooks.

Publish to changelog?
no
Docs update
N/A — internal analytics capture path.
🤖 Agent context
Authored by PostHog Code in response to a CSM-reported anomaly: an internal insight filtering `organization usage report` on a specific `organization_id` was empty, but the org was clearly active in other event types. Investigation via the PostHog MCP confirmed the EU usage-report cron was running and emitting reports for ~140k other orgs in the same window, just not this one. No failure event existed for it either. A scan by `team_count` showed the largest captured org sat at 938 KB of payload (~294 teams). Nothing beyond that appeared at all; the pattern matched a per-message size limit, not a logic exclusion.

Decisions made along the way:

- […] `organization usage report per team` event, and (c) adding only a visibility-signal event. Picked the minimal trim because the legacy Celery path is still the only producer of `organization usage report`, the durable fix belongs with the v2 cutover, and bundling it here would expand blast radius unnecessarily.
- […] `teams` wholesale rather than truncating or sampling. `team_count` and every org-level counter are already at the top level of the report; consumers that need per-team granularity can use the chunked JSONL the v2 workflow already writes to S3.
- […] `capture_event` call: the billing SQS payload has separate chunking and a much larger size budget, so it stays unmodified.
- […] `groups`, `$set`, `scope`, and `instance_metadata` keys that `capture_event` layers on. The largest currently-captured payload (938 KB) would not have been trimmed under this threshold, so steady-state behavior is unchanged for everyone already getting through.
- […] `teams_omitted_due_to_size=True` rather than e.g. removing the key silently, so downstream insight authors can detect the truncation and either backfill from the v2 chunks or skip the org.

Created with PostHog Code
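As an illustration of the last decision above, a downstream consumer could use the marker to tell a trimmed event apart from an org that genuinely has no teams. This is a sketch under assumptions: `per_team_breakdown` is a hypothetical helper, and the sample event properties (including the `events` counter) are invented for the example; only the `teams`, `team_count`, and `teams_omitted_due_to_size` keys come from this PR.

```python
from typing import Any, Optional


def per_team_breakdown(properties: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the per-team breakdown, or None when it was trimmed for size."""
    # Events captured before this change carry no marker, so default to False.
    if properties.get("teams_omitted_due_to_size", False):
        return None  # caller may skip the org or backfill from another source
    return properties.get("teams", {})


# Hypothetical event properties, shaped like the trimmed and untrimmed cases.
trimmed_event = {"team_count": 600, "teams": {}, "teams_omitted_due_to_size": True}
full_event = {"team_count": 2, "teams": {"1": {"events": 10}, "2": {"events": 5}}}

trimmed_breakdown = per_team_breakdown(trimmed_event)
full_breakdown = per_team_breakdown(full_event)
```

Without the marker, the empty `teams` dict in `trimmed_event` would be indistinguishable from a real empty breakdown, which is why an explicit flag is easier to consume than a silently removed key.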