Cache cohorts from clickhouse / make /decide fast #2316
Conversation
```python
from ee.clickhouse.models.cohort import get_person_ids_by_cohort_id

uuids = get_person_ids_by_cohort_id(team=self.team, cohort_id=self.pk)
return Person.objects.filter(uuid__in=uuids, team=self.team)
```
I wonder how well this scales with large (1m+) cohorts. Hence my suggestion to do this in Clickhouse rather than Postgres
Can you be more specific about what you mean by "in clickhouse"?
Have the cohort to person mapping table in clickhouse, rather than piping all of those IDs over the web every 15 minutes
I wish I'd understood what you were trying to get at before - "in clickhouse" can mean many things.
The 1m comment is fair; for reference, the largest cohort we currently have is 670k, but we only have so many cohorts in total.
I'm wondering about the scope creep here - I took this on because this bug is actively detracting from session recording. The proposed fix is doable but seems like ~a day of work + some sync work.
I propose ticketing it instead, shipping this, and moving on it if we see our workers struggling or as we clear the backlog?
I disagree. We know we'll need to do this, scaling is one of our biggest challenges right now and introducing an obvious place where we could run into trouble at this point seems silly. Remember that we execute this query every fifteen minutes (I already added a commit to add this back in).
All we'll need to do is
- Create a new cohort_people table in Clickhouse with a person_id and cohort_id field.
- Execute the insert query in cohort.py:73 against that table instead of postgres
- Do a

  ```python
  sync_execute(
      "select id from cohort_people where person_id = (select person_id from person_distinct_id where distinct_id = %(distinct_id)s limit 1) and cohort_id = %(cohort_id)s",
      {"cohort_id": cohort.pk, "distinct_id": distinct_id},
  )
  ```

  in the _match_distinct_id function
Seems like a reasonable ask?
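The lookup proposed above can be sketched in plain Python. This is an illustration only: the dicts stand in for the hypothetical `person_distinct_id` and `cohort_people` ClickHouse tables, and the function name mirrors the `_match_distinct_id` mentioned above without being the real implementation.

```python
# Hypothetical sketch of the proposed lookup: resolve distinct_id -> person_id,
# then check (person_id, cohort_id) membership. The in-memory dicts stand in
# for the ClickHouse tables named in the comment above.

def match_distinct_id(distinct_id, cohort_id, person_distinct_id, cohort_people):
    """Return True if distinct_id's person belongs to cohort_id."""
    # Equivalent of: select person_id from person_distinct_id
    #                where distinct_id = %(distinct_id)s limit 1
    person_id = person_distinct_id.get(distinct_id)
    if person_id is None:
        return False
    # Equivalent of the cohort_people membership query
    return (person_id, cohort_id) in cohort_people

person_distinct_id = {"anon-1": "person-a", "anon-2": "person-b"}
cohort_people = {("person-a", 42)}

print(match_distinct_id("anon-1", 42, person_distinct_id, cohort_people))  # True
print(match_distinct_id("anon-2", 42, person_distinct_id, cohort_people))  # False
```

The two-step shape (resolve the person first, then check membership) is what keeps the real query a cheap point lookup rather than a join over all distinct ids.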
Actually, thinking about it, there is an argument to be made for doing it in postgresql: lookups are much quicker there. Lookups in clickhouse might take a second or two vs milliseconds in psql. I'd still be worried about workers falling over when calculating, though. Maybe this is a question for @fuziontech.
Ok, I'll grab this, though I disagree with parts of the argument here.

> (I already added a commit to add this back in).

Whoops, thank you!

> introducing an obvious place where we could run into trouble at this point seems silly.

I'm generally against premature optimization and would argue this is an instance of it.
I don't think it's obvious at all that this will blow up right now - loading ~3m rows every 15 minutes is not a lot. We do have metrics that would surface it if it does become a problem.
> All we'll need to do is

This misses, scope-wise:
- Doing this in a clickhouse-aware way (no deletes; using a MergeTree variant to collapse updates)
- Writing a clickhouse version of /api/people. This is needed because of https://github.com/PostHog/posthog/blob/fix-decide-endpoint-slowness/posthog/api/person.py#L90-L91 - otherwise we'd be reading cohort ids into memory anyway.
- Adding tests

That does round up to ~a day of work in my mind.
In conclusion, while this PR might be flawed, I'd argue that from an operational standpoint it trades a big issue (autocapture/sessions data not being recorded) for a smaller one (worst case, cohorts are out of date because workers can't handle the query).
Given the severity of the /decide endpoint slowness I'd still encourage getting this merged and shipped. It's also stopping us from seeing whether there are other session recording issues.
I'll meanwhile start chewing my way through doing this in clickhouse.
> I'd still be worried about workers falling over when calculating

This is reasonably easy to estimate. I could run the task in production now against a copy of postgres and see how it performs.
Takeaways from call:
- Measure this
- Separate task per team/cohort
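The "separate task per team/cohort" takeaway can be illustrated without celery: instead of one monolithic recalculation, plan one independent unit of work per cohort, so a single slow cohort can't block the rest. `plan_cohort_tasks` is a hypothetical helper; in the real codebase each returned spec would become something like a `task.delay(...)` call.

```python
# Sketch of per-cohort task fan-out (hypothetical names; a stand-in for
# enqueueing one celery task per cohort rather than one task for everything).

def plan_cohort_tasks(cohorts):
    """cohorts: iterable of (team_id, cohort_id) pairs.

    Returns one independent task spec per cohort, so each can be retried,
    timed, and scheduled on its own.
    """
    return [
        {"task": "calculate_cohort", "team_id": team_id, "cohort_id": cohort_id}
        for team_id, cohort_id in cohorts
    ]

tasks = plan_cohort_tasks([(1, 10), (1, 11), (2, 20)])
print(len(tasks))  # 3 independent tasks, one per cohort
```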
Did measurements.

```
CPU times: user 1min 49s, sys: 4.67 s, total: 1min 54s
Wall time: 17min 54s
```

Of that, 906 seconds were spent in prod clickhouse doing cohort queries and 51 seconds doing inserts (the largest cohort, 700k items, took 24s on its own).
Proposal:
- Calculate each cohort asynchronously in celery.
- Enable this in cloud to run every 1h, and merge it.
- See if we can optimize cohort queries in clickhouse (will create a ticket).
- Optimize the insertion part - for the largest cohorts this still took far too long. We can probably batch it and only delete/insert rows which have changed.
- Once both fixes have landed, measure again and decrease the interval back to 15 minutes.
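The "only delete/insert rows which have changed" idea from the list above amounts to diffing the stored membership against the freshly calculated one and writing only the delta. A minimal sketch, with `cohort_delta` as a hypothetical helper name:

```python
# Sketch of delta-based cohort writes: instead of rewriting the full
# membership every run, compute which ids were added and which were removed,
# and only touch those rows.

def cohort_delta(old_ids, new_ids):
    """Return (to_insert, to_delete) given the old and new member-id sets."""
    old, new = set(old_ids), set(new_ids)
    return new - old, old - new

to_insert, to_delete = cohort_delta({"a", "b", "c"}, {"b", "c", "d"})
print(sorted(to_insert), sorted(to_delete))  # ['d'] ['a']
```

For a mostly stable 700k-member cohort, the delta is typically a small fraction of the full set, which is what would bring the 24s insert down.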
@timgl updated the PR with:
Didn't get to the separate delete/insert (in a way that avoids uploading the ids twice to pg) - will look at it tomorrow. This should still be safe to merge based on the measurements, plus we need to return to optimize anyway.
Some typecheck errors; also, when you save a cohort it doesn't actually refresh the users below. Other than that the approach makes sense to me! Great work :)
This fixes #2306