
Cache cohorts from clickhouse / make /decide fast #2316

Merged 6 commits into master on Nov 12, 2020

Conversation


@macobo (Contributor, Author) commented Nov 10, 2020

This fixes #2306

Changes


Checklist

  • All querysets/queries filter by Organization, Team, and User (if this PR affects ANY querysets/queries).
  • Django backend tests (if this PR affects the backend).
  • Cypress end-to-end tests (if this PR affects the frontend).

@timgl temporarily deployed to posthog-fix-decide-endp-d48q7r November 10, 2020 14:34
@macobo force-pushed the fix-decide-endpoint-slowness branch from 5b97ca1 to 3caa37d on November 10, 2020 15:13
@macobo marked this pull request as ready for review November 10, 2020 15:13
@macobo temporarily deployed to posthog-fix-decide-endp-d48q7r November 10, 2020 15:13
@macobo force-pushed the fix-decide-endpoint-slowness branch from 3caa37d to c32b51d on November 10, 2020 15:26
@macobo requested a review from EDsCODE November 10, 2020 15:26
@macobo temporarily deployed to posthog-fix-decide-endp-d48q7r November 10, 2020 15:26
@macobo requested a review from timgl November 10, 2020 15:26
from ee.clickhouse.models.cohort import get_person_ids_by_cohort_id

uuids = get_person_ids_by_cohort_id(team=self.team, cohort_id=self.pk)
return Person.objects.filter(uuid__in=uuids, team=self.team)
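For very large cohorts, the `uuid__in` filter above turns into a single enormous SQL IN clause. One mitigation (a sketch of the idea only, not code from this PR; `fetch_persons` stands in for the `Person.objects.filter(uuid__in=..., team=...)` queryset call) is to resolve the uuids in bounded chunks:

```python
# Sketch: bound the size of each SQL IN clause when resolving a very large
# cohort. fetch_persons stands in for Person.objects.filter(uuid__in=chunk);
# it and the chunk size are illustrative assumptions, not code from this PR.

def chunks(items, size):
    """Yield successive slices of at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def persons_for_uuids(uuids, fetch_persons, chunk_size=10_000):
    """Resolve persons for a possibly huge uuid list in bounded batches."""
    result = []
    for chunk in chunks(list(uuids), chunk_size):
        result.extend(fetch_persons(chunk))
    return result
```

This trades one huge query for several bounded ones; whether that helps depends on how the database plans large IN lists.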
timgl (Collaborator):
I wonder how well this scales with large (1m+) cohorts. Hence my suggestion to do this in Clickhouse rather than Postgres

macobo (Contributor, Author):

Can you be more specific about what you mean by "in clickhouse"?

timgl (Collaborator):

Have the cohort to person mapping table in clickhouse, rather than piping all of those IDs over the web every 15 minutes

macobo (Contributor, Author):

I wish I had understood what you were trying to get at earlier - "in clickhouse" can mean many things.

The 1m comment is fair. For reference, the largest cohort we currently have is 670k, but we only have so many cohorts in total.

I'm wondering about the scope creep here - I took this on because this bug is actively detracting from session recording. The proposed fix is doable but seems like ~a day of work plus some sync work.

I propose ticketing it instead, shipping this, and moving on it if we see our workers struggling or as we are clearing the backlog?

@timgl (Collaborator) commented Nov 11, 2020:

I disagree. We know we'll need to do this, scaling is one of our biggest challenges right now and introducing an obvious place where we could run into trouble at this point seems silly. Remember that we execute this query every fifteen minutes (I already added a commit to add this back in).

All we'll need to do is

  • Create a new cohort_people table in Clickhouse with a person_id and cohort_id field.
  • Execute the insert query in cohort.py:73 against that table instead of postgres
  • Do a sync_execute('select id from cohort_people where person_id = (select person_id from person_distinct_id where distinct_id=%(distinct_id)s limit 1) and cohort_id = %(cohort_id)s', {'cohort_id': cohort.pk, 'distinct_id': distinct_id}) in the _match_distinct_id function

Seems like a reasonable ask?
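The lookup in the third step can be illustrated with plain dicts standing in for the two ClickHouse tables named above (person_distinct_id and cohort_people); only the table and function names come from the comment, everything else is an illustrative sketch, and in the actual proposal this is a single sync_execute() query with a subselect:

```python
# Illustration of the proposed _match_distinct_id lookup, with dicts standing
# in for the ClickHouse tables person_distinct_id and cohort_people.
# The data values here are made up for the example.

person_distinct_id = {"device-abc": 1, "device-xyz": 2}  # distinct_id -> person_id
cohort_people = {(1, 42), (2, 7)}                        # (person_id, cohort_id) rows

def match_distinct_id(distinct_id, cohort_id):
    """True if the person behind distinct_id is a member of cohort_id."""
    person_id = person_distinct_id.get(distinct_id)
    return person_id is not None and (person_id, cohort_id) in cohort_people
```

The subselect in the real query plays the role of the `person_distinct_id.get()` step: resolve the distinct_id to a person_id first, then check cohort membership.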

@timgl (Collaborator) commented Nov 11, 2020:

Actually, thinking about it, there is an argument to be made for doing it in PostgreSQL: lookups there are much, much quicker. Lookups in ClickHouse might take a second or two vs milliseconds in psql. I'd still be worried about workers falling over when calculating, though. Maybe this is a question for @fuziontech

macobo (Contributor, Author):

Ok, I'll grab this, though I disagree with parts of the argument here.

(I already added a commit to add this back in).

Whoops, thank you!

introducing an obvious place where we could run into trouble at this point seems silly.

I'm generally against premature optimization and would argue this is an instance of it.

I don't think it's obvious at all that this will blow up right now - loading ~3m rows every 15 minutes is not a lot. We do have metrics that would show it if it does, though.

All we'll need to do is

Scope-wise, this misses a few things, which rounds the proposal up to ~a day of work in my mind.


In conclusion, while this PR might be flawed, I'd argue that from an operational standpoint it trades a big issue (autocapture/sessions data not being recorded) for a smaller one (worst case, cohorts are out of date because workers can't handle the query).

Given the severity of the /decide endpoint slowness I'd still encourage getting this merged/shipped. It's also stopping us from seeing if there are other sessions recording issues.

I'll meanwhile start chewing my way through doing this in clickhouse.

macobo (Contributor, Author):

I'd still be worried about workers falling over when calculating

This is reasonably easy to estimate. I could run the task in production now against a copy of postgres and see how it performs.

macobo (Contributor, Author):

Takeaways from call:

  1. Measure this
  2. Separate task per team/cohort

@macobo (Contributor, Author) commented Nov 11, 2020:

@timgl

Did measurements.

CPU times: user 1min 49s, sys: 4.67 s, total: 1min 54s
Wall time: 17min 54s

Of that, 906 seconds were spent in prod clickhouse doing cohort queries and 51 seconds doing inserts (the largest cohort, with 700k items, took 24s on its own).

Proposal:

  1. Calculate each cohort asynchronously in celery.
  2. Enable this on cloud to run every 1h, merge it.
  3. See if we can optimize cohort queries in clickhouse (will create a ticket).
  4. Optimize the insertion part - for the largest cohorts, this still took far too long. We can probably batch it and only delete/insert rows which have changed.
  5. Once both fixes have landed, measure again and decrease the interval back to 15 minutes.
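The "only delete/insert rows which have changed" idea in point 4 boils down to two set differences between the stored membership and the freshly calculated one. A minimal sketch of that idea (not code from the PR; the function name is illustrative):

```python
# Sketch of point 4: compute set differences so only changed rows are
# deleted/inserted, instead of rewriting the whole cohort every run.

def cohort_membership_diff(stored_ids, calculated_ids):
    """Return (to_insert, to_delete) given the stored person ids and
    the freshly calculated ones."""
    stored, calculated = set(stored_ids), set(calculated_ids)
    to_insert = calculated - stored  # joined the cohort since last calculation
    to_delete = stored - calculated  # left the cohort since last calculation
    return to_insert, to_delete
```

For a mostly stable 700k-person cohort, the write volume then scales with the churn between runs rather than with the cohort size.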

@timgl temporarily deployed to posthog-fix-decide-endp-d48q7r November 11, 2020 09:54
@macobo temporarily deployed to posthog-fix-decide-endp-d48q7r November 11, 2020 15:10
@macobo force-pushed the fix-decide-endpoint-slowness branch from 39754fb to 1e1952e on November 11, 2020 15:41
@macobo temporarily deployed to posthog-fix-decide-endp-d48q7r November 11, 2020 15:41

macobo commented Nov 11, 2020

@timgl updated the PR with:

  1. Calculating cohorts every 1h on cloud
  2. Running every cohort calculation in separate celery task

I didn't get to separating deletes/inserts (in a way that avoids uploading the ids twice to pg) - will look at it tomorrow. Based on the measurements this should still be safe to merge, though we need to return and optimize.
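The "separate celery task per cohort" change can be sketched as a dispatcher that enqueues one independent job per cohort instead of one monolithic calculation task; here `enqueue` stands in for a Celery task's `.delay()`, and all names are illustrative rather than PostHog's actual task names:

```python
# Sketch: fan out one calculation job per cohort rather than one big task,
# so one slow or failing cohort cannot block recalculation of the others.
# enqueue() stands in for a Celery task's .delay(); names are illustrative.

def dispatch_cohort_calculations(cohort_ids, enqueue):
    """Enqueue one independent recalculation job per cohort."""
    for cohort_id in cohort_ids:
        enqueue(cohort_id)
```

Besides isolation, per-cohort jobs let the worker pool spread the 17-minute wall time measured above across many workers.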


timgl commented Nov 11, 2020

Some typecheck errors; also, when you save a cohort it doesn't actually refresh the users below. Other than that, the approach makes sense to me! Great work :)

@macobo force-pushed the fix-decide-endpoint-slowness branch from 1e1952e to 14e6680 on November 12, 2020 07:39
@macobo temporarily deployed to posthog-fix-decide-endp-d48q7r November 12, 2020 07:39
@macobo force-pushed the fix-decide-endpoint-slowness branch from 14e6680 to 42916c7 on November 12, 2020 09:50
@macobo temporarily deployed to posthog-fix-decide-endp-d48q7r November 12, 2020 09:50
@macobo temporarily deployed to posthog-fix-decide-endp-d48q7r November 12, 2020 10:37
@macobo merged commit a3f0554 into master Nov 12, 2020
@macobo deleted the fix-decide-endpoint-slowness branch November 12, 2020 10:58
Successfully merging this pull request may close these issues.

/decide endpoint can become slow due to feature flags
2 participants