Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up sessions list query #2934

Merged
merged 28 commits into from Jan 14, 2021
Merged

Speed up sessions list query #2934

merged 28 commits into from Jan 14, 2021

Conversation

macobo
Copy link
Contributor

@macobo macobo commented Jan 13, 2021

Relevant issue: #2739

This PR rewrites how self-hosted postgres instances calculate sessions.

This query was slow because:

  1. Combining events to sessions has quite a bit of business logic involved around previous events. The postgres function LAG was causing datasets to get re-sorted multiple times on postgres disk.
  2. Because of the complicated business logic LIMIT could only be done at the very end -> much more data was processed than needed.
  3. Events were loaded together with the session - slowing both calculations as well as 'data download' down for the users.

The solution is sort of silly:

  1. instead of trying to do the business logic in postgres, stream rows to python and do the aggregation there. In practice, it seems like in this case the network io is less than doing all the extra work.
  2. Only load events for LIMIT + offset users

One consequence of this is that pagination logic got quite bit more complex for sessions. We're now passing opaque data blobs back-and-forth from BE to FE and back again for this purpose.

Behavioral changes:

  • For self-hosted users, events are only loaded as they click to open the table.
  • Sessions are now sorted by end time

Checklist

  • All querysets/queries filter by Organization, by Team, and by User
  • Django backend tests
  • Jest frontend tests
  • Cypress end-to-end tests

@macobo macobo requested a review from EDsCODE January 13, 2021 14:42
@timgl timgl temporarily deployed to posthog-simplify-list-q-r6cq9h January 13, 2021 14:44 Inactive
@macobo macobo marked this pull request as draft January 13, 2021 14:48
@macobo
Copy link
Contributor Author

macobo commented Jan 13, 2021

Need to draft this for a bit - there are some bugs in here I'm finding.

@macobo macobo temporarily deployed to posthog-simplify-list-q-r6cq9h January 13, 2021 14:58 Inactive
@macobo macobo temporarily deployed to posthog-simplify-list-q-r6cq9h January 14, 2021 08:41 Inactive
@macobo macobo temporarily deployed to posthog-simplify-list-q-r6cq9h January 14, 2021 08:56 Inactive
@macobo macobo temporarily deployed to posthog-simplify-list-q-r6cq9h January 14, 2021 09:11 Inactive
@macobo macobo temporarily deployed to posthog-simplify-list-q-r6cq9h January 14, 2021 11:21 Inactive
@macobo macobo temporarily deployed to posthog-simplify-list-q-r6cq9h January 14, 2021 11:24 Inactive
@macobo macobo marked this pull request as ready for review January 14, 2021 12:30
Copy link
Member

@EDsCODE EDsCODE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial code comments. will manually QA in a bit

ee/clickhouse/views/events.py Outdated Show resolved Hide resolved
const [page, setPage] = useState(1)
const [pageSize, setPageSize] = useState(50)

async function loadSessionEvents(): Promise<void> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we're trying to keep most data logic in kea logics if possible. even if this data is scoped here only it'll be more modular if there's an accompanying logic instead of localstate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case the data was only used locally and the kea logic would have ended up quite a bit more convoluted.

But moving this out will become neccessary soon as we will want to show highlighted events in session recording. Ok if I punt on this until then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup just a heads up not strict

return [session for i, session in enumerate(sessions) if session["distinct_id"] in person_ids]

@action(methods=["GET"], detail=False)
def session_events(self, request: request.Request, *args: Any, **kwargs: Any) -> Response:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does clickhouse also use this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bit of feature drift here.

For clickhouse, the sessions query still returns events and the frontend knows to use those. For postgres, events are loaded separately (for perf reasons) -> separate query is needed.

Not sure what the correct approach here is, I don't think it's worth refactoring clickhouse query here right now. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mm yeah I figured it was on these lines. I think if the frontend can handle then it's no problem. This will just be part of the consideration in consolidating our offering into clickhouse only. I think the duality will eventually be a major hinderance in moving quickly

@macobo macobo temporarily deployed to posthog-simplify-list-q-r6cq9h January 14, 2021 20:16 Inactive
Copy link
Member

@EDsCODE EDsCODE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA looks good!

@macobo macobo merged commit 17a31f0 into master Jan 14, 2021
@macobo macobo deleted the simplify-list-query branch January 14, 2021 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants