Cache cohorts from clickhouse / make /decide fast #2316
Conversation
```python
from ee.clickhouse.models.cohort import get_person_ids_by_cohort_id

uuids = get_person_ids_by_cohort_id(team=self.team, cohort_id=self.pk)
return Person.objects.filter(uuid__in=uuids, team=self.team)
```
I wonder how well this scales with large (1m+) cohorts. Hence my suggestion to do this in Clickhouse rather than Postgres
Can you be more specific about what you mean by "in clickhouse"?
Have the cohort to person mapping table in clickhouse, rather than piping all of those IDs over the web every 15 minutes
I wish I'd understood what you were trying to get at before - "in clickhouse" can mean many things.
The 1m comment is fair; for reference, the largest cohort we currently have is 670k, but we only have so many cohorts in total.
I'm wondering about the scope creep here - I took this on because this bug is actively detracting from session recording. The proposed fix is doable but seems like ~a day of work + some sync work.
I propose ticketing it instead, shipping this, and moving on it if we see our workers struggling or as we clear the backlog?
I disagree. We know we'll need to do this, scaling is one of our biggest challenges right now and introducing an obvious place where we could run into trouble at this point seems silly. Remember that we execute this query every fifteen minutes (I already added a commit to add this back in).
All we'll need to do is
- Create a new cohort_people table in Clickhouse with a person_id and cohort_id field.
- Execute the insert query in cohort.py:73 against that table instead of postgres
- Do a

  ```python
  sync_execute(
      "select id from cohort_people where person_id = (select person_id from person_distinct_id where distinct_id = %(distinct_id)s limit 1) and cohort_id = %(cohort_id)s",
      {"cohort_id": cohort.pk, "distinct_id": distinct_id},
  )
  ```

  in the _match_distinct_id function
Seems like a reasonable ask?
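The lookup proposed above can be sketched in plain Python. This is an illustration only: the dicts stand in for the hypothetical `person_distinct_id` and `cohort_people` ClickHouse tables, and the function name mirrors the `_match_distinct_id` mentioned above without being the real implementation.

```python
# Hypothetical sketch of the proposed lookup: resolve distinct_id -> person_id,
# then check (person_id, cohort_id) membership. The in-memory dicts stand in
# for the ClickHouse tables named in the comment above.

def match_distinct_id(distinct_id, cohort_id, person_distinct_id, cohort_people):
    """Return True if distinct_id's person belongs to cohort_id."""
    # Equivalent of: select person_id from person_distinct_id
    #                where distinct_id = %(distinct_id)s limit 1
    person_id = person_distinct_id.get(distinct_id)
    if person_id is None:
        return False
    # Equivalent of the cohort_people membership query
    return (person_id, cohort_id) in cohort_people

person_distinct_id = {"anon-1": "person-a", "anon-2": "person-b"}
cohort_people = {("person-a", 42)}

print(match_distinct_id("anon-1", 42, person_distinct_id, cohort_people))  # True
print(match_distinct_id("anon-2", 42, person_distinct_id, cohort_people))  # False
```

The two-step shape (resolve the person first, then check membership) is what keeps the real query a cheap point lookup rather than a join over all distinct ids.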
Actually, thinking about it, there is an argument to be made for doing it in postgresql: lookups are much quicker there. Lookups in clickhouse might take a second or two vs milliseconds in psql. I'd still be worried about workers falling over when calculating, though. Maybe this is a question for @fuziontech.
Ok, I'll grab this, though I disagree with parts of the argument here.

> (I already added a commit to add this back in).

Whoops, thank you!

> introducing an obvious place where we could run into trouble at this point seems silly.

I'm generally against premature optimization and would argue this is an instance of it.
I don't think it's obvious at all that this will blow up right now - loading ~3m rows every 15 minutes is not a lot. We do have metrics that would surface it if it does become a problem.
> All we'll need to do is

This misses, scope-wise:
- Doing this in a clickhouse-aware way (no deletes; using a MergeTree variant to collapse updates)
- Writing a clickhouse version of /api/people. This is needed because of https://github.com/PostHog/posthog/blob/fix-decide-endpoint-slowness/posthog/api/person.py#L90-L91 - otherwise we'd be reading cohort ids into memory anyway.
- Adding tests

That does round up to ~a day of work in my mind.
In conclusion, while this PR might be flawed, I'd argue that from an operational standpoint it trades a big issue (autocapture/sessions data not being recorded) for a smaller one (worst case, cohorts are out of date because workers can't handle the query).
Given the severity of the /decide endpoint slowness I'd still encourage getting this merged and shipped. It's also stopping us from seeing whether there are other session recording issues.
I'll meanwhile start chewing my way through doing this in clickhouse.
> I'd still be worried about workers falling over when calculating

This is reasonably easy to estimate. I could run the task in production now against a copy of postgres and see how it performs.
Takeaways from call:
- Measure this
- Separate task per team/cohort
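The "separate task per team/cohort" takeaway can be illustrated without celery: instead of one monolithic recalculation, plan one independent unit of work per cohort, so a single slow cohort can't block the rest. `plan_cohort_tasks` is a hypothetical helper; in the real codebase each returned spec would become something like a `task.delay(...)` call.

```python
# Sketch of per-cohort task fan-out (hypothetical names; a stand-in for
# enqueueing one celery task per cohort rather than one task for everything).

def plan_cohort_tasks(cohorts):
    """cohorts: iterable of (team_id, cohort_id) pairs.

    Returns one independent task spec per cohort, so each can be retried,
    timed, and scheduled on its own.
    """
    return [
        {"task": "calculate_cohort", "team_id": team_id, "cohort_id": cohort_id}
        for team_id, cohort_id in cohorts
    ]

tasks = plan_cohort_tasks([(1, 10), (1, 11), (2, 20)])
print(len(tasks))  # 3 independent tasks, one per cohort
```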
Did measurements.

```
CPU times: user 1min 49s, sys: 4.67 s, total: 1min 54s
Wall time: 17min 54s
```

Of that, 906 seconds were spent in prod clickhouse doing cohort queries and 51 seconds doing inserts (the largest cohort, 700k items, took 24s on its own).
Proposal:
- Calculate each cohort asynchronously in celery.
- Enable this in cloud to run every 1h, and merge it.
- See if we can optimize cohort queries in clickhouse (will create a ticket).
- Optimize the insertion part - for the largest cohorts this still took far too long. We can probably batch it and only delete/insert rows which have changed.
- Once both fixes have landed, measure again and decrease the interval back to 15 minutes.
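The "only delete/insert rows which have changed" idea from the list above amounts to diffing the stored membership against the freshly calculated one and writing only the delta. A minimal sketch, with `cohort_delta` as a hypothetical helper name:

```python
# Sketch of delta-based cohort writes: instead of rewriting the full
# membership every run, compute which ids were added and which were removed,
# and only touch those rows.

def cohort_delta(old_ids, new_ids):
    """Return (to_insert, to_delete) given the old and new member-id sets."""
    old, new = set(old_ids), set(new_ids)
    return new - old, old - new

to_insert, to_delete = cohort_delta({"a", "b", "c"}, {"b", "c", "d"})
print(sorted(to_insert), sorted(to_delete))  # ['d'] ['a']
```

For a mostly stable 700k-member cohort, the delta is typically a small fraction of the full set, which is what would bring the 24s insert down.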
@timgl updated the PR with:
Didn't get to the separate delete/insert (in a way that avoids uploading the ids twice to pg) - will look at it tomorrow. This should still be safe to merge based on the measurements, plus we need to return to optimize anyway.
Some typecheck errors; also, when you save a cohort it doesn't actually refresh the users below. Other than that the approach makes sense to me! Great work :)
This fixes #2306