Chunk session recording events #3705

macobo · 2021-03-19T07:58:36Z

Closes #3632, Closes #3510 and replaces https://github.com/PostHog/posthog/pull/3566/files

This should make it possible to ingest large full snapshot events which currently get silently dropped due to kafka size limits.

Base64 is used to compress the data for serialization purposes.

pytest.mock is used for clean patching methods

Checklist

All querysets/queries filter by Organization, by Team, and by User
Django backend tests
Jest frontend tests
Cypress end-to-end tests

Closes #3632 and replaces https://github.com/PostHog/posthog/pull/3566/files This should make it possible to ingest large full snapshot events Base64 is used to compress the data for serialization purposes. pytest.mock is used for clean patching methods

mariusandra

Hey, I didn't run it (yet), but got two related questions from just reading the code:

"assumption: all events within a request have same session_id". I'm not sure if I'd be brave enough to make this assumption. Right now, if a session recording event comes in through posthog-js, this assumption will be true. However it's not given that this is how we'll always get events into capture.py. What if there's someone who's using the bigquery plugin to export all their events... and later imports all of them via some bulk import or API? We can't guarantee any future developments will respect this assumption.
What if you re-capture a compressed snapshot? Similar case to the plugin scenario, what if these compressed snapshots are somehow re-ingested, either via a bulk export or some other plugin-based scenario. Will they all get compressed again and will the frontend know how to handle them?
Finally, should we add some postgres indices for the JSONExtractBool(snapshot_data, 'has_full_snapshot')), etc columns?

macobo · 2021-03-23T06:54:40Z

Re 1, Good note. I generally don't think it's worth being super defensive in the code here for hypothetical usecases, but this is easy enough to add.

Tangent: in general our event data is not symmetrical enough to allow for easy re-ingestion. Some fields are added in or removed from a valid request - the ip url parameter, token, project_id and others. IMO this is a small fail in our API design.

This applies even more so to session recordings - the data format here is more volatile and likely to change over time, and will look completely different for e.g. an android session recording library.

Re 2 - Not quite sure what you mean re What if you re-capture a compressed snapshot? Do you mean that someone manages to pollute the database with junk data by duplicating rows somehow? Note that it's not possible to send the payloads generated here to capture again - it would re-chunk them in a different way in all likelihood. Also, deduping the underlying snapshots is mostly handled in rrweb.

Re 3 - This requires some actual measurements to determine whether the index would do anything here. There's already a index "posthog_ses_team_id_0409c4_idx" btree (team_id, "timestamp") which I think will carry most of the weight here. Will add a follow-up task for this.

macobo · 2021-03-23T07:03:37Z

Re 3: I now think the index would do nothing since the query does require looking through all session recording events in timerange as it returns things like start/end/duration. :)

mariusandra · 2021-03-23T07:36:34Z

Re tangent: totally agreed. For added side effects (documenting on the fly here), the plugin server also adds $set_once: {'initial_utm_source': 'mailchimp'} if it detects a utm_source parameter for example. This is added during ingestion (captureEE), so the last plugin won't see it, but it'll show up in the database.

Re Re 2: I mean making a plugin that effectively does if event == $snapshot then posthog.capture(the event itself) (ignore the obvious recursive race condition)... and probably not directly, but through some import/export script. I think a block to prevent rechunking is easy enough to add and should be worth it. Being defensive again, but these usecases seem not that unlikely with a growing userbase and customisable open source code...

mariusandra

LGTM!

This reverts commit 1adaa8a.

timgl temporarily deployed to posthog-session-recordi-rj5w17 March 19, 2021 08:03 Inactive

macobo force-pushed the session-recording-chunking branch from ab3b451 to 2544866 Compare March 19, 2021 08:20

macobo temporarily deployed to posthog-session-recordi-rj5w17 March 19, 2021 08:21 Inactive

macobo force-pushed the session-recording-chunking branch from 2544866 to b743f39 Compare March 19, 2021 12:00

Chunk session recording events

f01a876

Closes #3632 and replaces https://github.com/PostHog/posthog/pull/3566/files This should make it possible to ingest large full snapshot events Base64 is used to compress the data for serialization purposes. pytest.mock is used for clean patching methods

macobo temporarily deployed to posthog-session-recordi-rj5w17 March 19, 2021 12:00 Inactive

macobo force-pushed the session-recording-chunking branch from b743f39 to f01a876 Compare March 19, 2021 12:00

macobo temporarily deployed to posthog-session-recordi-rj5w17 March 19, 2021 12:00 Inactive

Mock time.time for py3.7 compatibility

b347402

macobo temporarily deployed to posthog-session-recordi-rj5w17 March 19, 2021 13:28 Inactive

macobo requested a review from mariusandra March 19, 2021 13:49

mariusandra reviewed Mar 23, 2021

View reviewed changes

Group captured $snapshot events by $session_id

ec9474c

macobo temporarily deployed to posthog-session-recordi-rj5w17 March 23, 2021 07:00 Inactive

Don't chunk already chunked payloads

9972212

macobo temporarily deployed to posthog-session-recordi-rj5w17 March 23, 2021 09:52 Inactive

macobo requested a review from mariusandra March 23, 2021 10:24

mariusandra approved these changes Mar 23, 2021

View reviewed changes

mariusandra merged commit 1adaa8a into master Mar 23, 2021

mariusandra deleted the session-recording-chunking branch March 23, 2021 10:39

macobo added a commit that referenced this pull request Mar 23, 2021

Revert "Chunk session recording events (#3705)"

132b08e

This reverts commit 1adaa8a.

macobo mentioned this pull request Mar 23, 2021

Revert "Chunk session recording events" #3730

Closed

macobo added the highlight ⭐ Release highlight label Mar 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Chunk session recording events #3705

Chunk session recording events #3705

macobo commented Mar 19, 2021 •

edited

Loading

mariusandra left a comment •

edited

Loading

macobo commented Mar 23, 2021

macobo commented Mar 23, 2021

mariusandra commented Mar 23, 2021

mariusandra left a comment

Chunk session recording events #3705

Chunk session recording events #3705

Conversation

macobo commented Mar 19, 2021 • edited Loading

Checklist

mariusandra left a comment • edited Loading

Choose a reason for hiding this comment

macobo commented Mar 23, 2021

macobo commented Mar 23, 2021

mariusandra commented Mar 23, 2021

mariusandra left a comment

Choose a reason for hiding this comment

macobo commented Mar 19, 2021 •

edited

Loading

mariusandra left a comment •

edited

Loading