Chunk session recording events #3705
Conversation
Force-pushed from ab3b451 to 2544866
Force-pushed from 2544866 to b743f39
Force-pushed from b743f39 to f01a876
Hey, I didn't run it (yet), but got two related questions from just reading the code:
- "assumption: all events within a request have same session_id". I'm not sure if I'd be brave enough to make this assumption. Right now, if a session recording event comes in through posthog-js, this assumption will be true. However, it's not a given that this is how we'll always get events into capture.py. What if there's someone who's using the BigQuery plugin to export all their events... and later imports all of them via some bulk import or API? We can't guarantee any future developments will respect this assumption.
- What if you re-capture a compressed snapshot? Similar case to the plugin scenario: what if these compressed snapshots are somehow re-ingested, either via a bulk export or some other plugin-based scenario? Will they all get compressed again, and will the frontend know how to handle them? (See the sketch after this list.)
- Finally, should we add some postgres indices for the `JSONExtractBool(snapshot_data, 'has_full_snapshot')`, etc. columns?
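To make questions 1 and 2 concrete, here is a minimal sketch of what a more defensive capture path could look like: events are grouped on their own `$session_id` rather than on one session assumed per request, and payloads that already carry chunk metadata are passed through instead of being compressed again. This is not the code in this PR; the property and key names (`$session_id`, `$snapshot_data`, `chunk_id`, `chunk_index`, `chunk_count`) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical chunk bookkeeping keys; an already-chunked snapshot that gets
# re-ingested would carry these and should not be compressed a second time.
CHUNK_KEYS = {"chunk_id", "chunk_index", "chunk_count"}


def is_already_chunked(snapshot_data: Any) -> bool:
    return isinstance(snapshot_data, dict) and CHUNK_KEYS.issubset(snapshot_data)


def group_by_session(events: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group events by each event's own $session_id instead of assuming
    every event in a request belongs to the same session."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for event in events:
        session_id = event.get("properties", {}).get("$session_id", "unknown")
        grouped[session_id].append(event)
    return grouped


def should_compress(event: Dict[str, Any]) -> bool:
    """Only compress snapshot payloads that have not been chunked before,
    so bulk re-imports do not get double-encoded."""
    snapshot_data = event.get("properties", {}).get("$snapshot_data", {})
    return not is_already_chunked(snapshot_data)
```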
Re 1: Good note. I generally don't think it's worth being super defensive in the code here for hypothetical use cases, but this is easy enough to add. Tangent: in general our event data is not symmetrical enough to allow for easy re-ingestion. Some fields are added to or removed from a valid request - the ip url parameter, token, project_id and others. IMO this is a small fail in our API design. This applies even more so to session recordings - the data format here is more volatile and likely to change over time, and will look completely different for e.g. an Android session recording library.

Re 2: Not quite sure what you mean re

Re 3: This requires some actual measurements to determine whether the index would do anything here. There's already an index
Re 3: I now think the index would do nothing, since the query does require looking through all session recording events in the time range, as it returns things like start/end/duration. :)
Re tangent: totally agreed. For added side effects (documenting on the fly here), the plugin server also adds

Re Re 2: I mean making a plugin that effectively does
LGTM!
Closes #3632, Closes #3510 and replaces https://github.com/PostHog/posthog/pull/3566/files
This should make it possible to ingest large full snapshot events, which currently get silently dropped due to Kafka size limits.
Base64 is used to encode the data for serialization purposes.
pytest.mock is used for cleanly patching methods.
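For readers who haven't looked at the diff, the idea can be sketched roughly as follows: a large snapshot payload is gzipped, base64-encoded, split into chunks small enough to fit under the Kafka message size limit, and reassembled on read. This is a simplified illustration under assumed names (`CHUNK_SIZE`, `chunk_id`, `chunk_index`, `chunk_count`, `data`), not the implementation in this PR.

```python
import base64
import gzip
import json
import uuid
from typing import Any, Dict, List

# Assumed limit; the real value would be derived from the Kafka broker's
# max message size minus overhead for the rest of the event.
CHUNK_SIZE = 512 * 1024


def compress_and_chunk(snapshot_data: Any) -> List[Dict[str, Any]]:
    """Gzip + base64 the payload, then split it into Kafka-sized chunks."""
    encoded = base64.b64encode(
        gzip.compress(json.dumps(snapshot_data).encode("utf-8"))
    ).decode("ascii")
    chunk_id = str(uuid.uuid4())
    parts = [encoded[i : i + CHUNK_SIZE] for i in range(0, len(encoded), CHUNK_SIZE)]
    return [
        {"chunk_id": chunk_id, "chunk_index": i, "chunk_count": len(parts), "data": part}
        for i, part in enumerate(parts)
    ]


def reassemble(chunks: List[Dict[str, Any]]) -> Any:
    """Inverse operation: order the chunks, concatenate, base64-decode, gunzip."""
    ordered = sorted(chunks, key=lambda c: c["chunk_index"])
    encoded = "".join(c["data"] for c in ordered)
    return json.loads(gzip.decompress(base64.b64decode(encoded)).decode("utf-8"))
```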
Checklist