Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chunk session recording events #3705

Merged
merged 4 commits into from
Mar 23, 2021
Merged

Conversation

macobo
Copy link
Contributor

@macobo macobo commented Mar 19, 2021

Closes #3632, Closes #3510 and replaces https://github.com/PostHog/posthog/pull/3566/files

This should make it possible to ingest large full snapshot events which currently get silently dropped due to kafka size limits.

Base64 is used to compress the data for serialization purposes.

pytest.mock is used for clean patching methods

Checklist

  • All querysets/queries filter by Organization, by Team, and by User
  • Django backend tests
  • Jest frontend tests
  • Cypress end-to-end tests

@timgl timgl temporarily deployed to posthog-session-recordi-rj5w17 March 19, 2021 08:03 Inactive
@macobo macobo force-pushed the session-recording-chunking branch from ab3b451 to 2544866 Compare March 19, 2021 08:20
@macobo macobo temporarily deployed to posthog-session-recordi-rj5w17 March 19, 2021 08:21 Inactive
@macobo macobo force-pushed the session-recording-chunking branch from 2544866 to b743f39 Compare March 19, 2021 12:00
Closes #3632 and replaces https://github.com/PostHog/posthog/pull/3566/files

This should make it possible to ingest large full snapshot events

Base64 is used to compress the data for serialization purposes.

pytest.mock is used for clean patching methods
@macobo macobo temporarily deployed to posthog-session-recordi-rj5w17 March 19, 2021 12:00 Inactive
@macobo macobo force-pushed the session-recording-chunking branch from b743f39 to f01a876 Compare March 19, 2021 12:00
@macobo macobo temporarily deployed to posthog-session-recordi-rj5w17 March 19, 2021 12:00 Inactive
@macobo macobo temporarily deployed to posthog-session-recordi-rj5w17 March 19, 2021 13:28 Inactive
@macobo macobo requested a review from mariusandra March 19, 2021 13:49
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I didn't run it (yet), but got two related questions from just reading the code:

  1. "assumption: all events within a request have same session_id". I'm not sure if I'd be brave enough to make this assumption. Right now, if a session recording event comes in through posthog-js, this assumption will be true. However it's not given that this is how we'll always get events into capture.py. What if there's someone who's using the bigquery plugin to export all their events... and later imports all of them via some bulk import or API? We can't guarantee any future developments will respect this assumption.

  2. What if you re-capture a compressed snapshot? Similar case to the plugin scenario, what if these compressed snapshots are somehow re-ingested, either via a bulk export or some other plugin-based scenario. Will they all get compressed again and will the frontend know how to handle them?

  3. Finally, should we add some postgres indices for the JSONExtractBool(snapshot_data, 'has_full_snapshot')), etc columns?

@macobo
Copy link
Contributor Author

macobo commented Mar 23, 2021

Re 1, Good note. I generally don't think it's worth being super defensive in the code here for hypothetical usecases, but this is easy enough to add.

Tangent: in general our event data is not symmetrical enough to allow for easy re-ingestion. Some fields are added in or removed from a valid request - the ip url parameter, token, project_id and others. IMO this is a small fail in our API design.

This applies even more so to session recordings - the data format here is more volatile and likely to change over time, and will look completely different for e.g. an android session recording library.

Re 2 - Not quite sure what you mean re What if you re-capture a compressed snapshot? Do you mean that someone manages to pollute the database with junk data by duplicating rows somehow? Note that it's not possible to send the payloads generated here to capture again - it would re-chunk them in a different way in all likelihood. Also, deduping the underlying snapshots is mostly handled in rrweb.

Re 3 - This requires some actual measurements to determine whether the index would do anything here. There's already a index "posthog_ses_team_id_0409c4_idx" btree (team_id, "timestamp") which I think will carry most of the weight here. Will add a follow-up task for this.

@macobo macobo temporarily deployed to posthog-session-recordi-rj5w17 March 23, 2021 07:00 Inactive
@macobo
Copy link
Contributor Author

macobo commented Mar 23, 2021

Re 3: I now think the index would do nothing since the query does require looking through all session recording events in timerange as it returns things like start/end/duration. :)

@mariusandra
Copy link
Collaborator

Re tangent: totally agreed. For added side effects (documenting on the fly here), the plugin server also adds $set_once: {'initial_utm_source': 'mailchimp'} if it detects a utm_source parameter for example. This is added during ingestion (captureEE), so the last plugin won't see it, but it'll show up in the database.

Re Re 2: I mean making a plugin that effectively does if event == $snapshot then posthog.capture(the event itself) (ignore the obvious recursive race condition)... and probably not directly, but through some import/export script. I think a block to prevent rechunking is easy enough to add and should be worth it. Being defensive again, but these usecases seem not that unlikely with a growing userbase and customisable open source code...

@macobo macobo temporarily deployed to posthog-session-recordi-rj5w17 March 23, 2021 09:52 Inactive
@macobo macobo requested a review from mariusandra March 23, 2021 10:24
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mariusandra mariusandra merged commit 1adaa8a into master Mar 23, 2021
@mariusandra mariusandra deleted the session-recording-chunking branch March 23, 2021 10:39
macobo added a commit that referenced this pull request Mar 23, 2021
@macobo macobo added the highlight ⭐ Release highlight label Mar 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
highlight ⭐ Release highlight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Chunking session recording events Some teams are not getting any type 2 session recording events
3 participants