Closes #169 break down by cohort #690
Conversation
@EDsCODE @mariusandra I think we need to pre-calculate which people are in cohorts (every 30 mins or so, maybe?) to make breakdown by cohorts perform acceptably. I'm torn between storing person_id and distinct_id directly. The advantage of distinct_id is that it saves another query on every loop of events (which could be millions), so it should make the /trends page a bit faster. person_id just feels a bit more logical, especially in person.py for example. Thoughts?

Edit: another thought, maybe we should store person_id against the event? That would definitely speed up funnels/paths etc., as you don't have to connect distinct_ids with the person first. Then we could precalculate person_ids here too.
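The precalculation idea above can be sketched with an in-memory sqlite3 example (all table and column names here are hypothetical simplifications, not PostHog's actual schema): a background job fills a `cohort_people` table keyed by `(cohort_id, person_id)`, so the breakdown query becomes a plain join instead of re-evaluating the cohort definition per event.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Hypothetical simplified schema: events carry a distinct_id, and
# person_distinct_id maps each distinct_id to a person_id.
c.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY);
CREATE TABLE person_distinct_id (distinct_id TEXT PRIMARY KEY, person_id INTEGER);
CREATE TABLE events (id INTEGER PRIMARY KEY, event TEXT, distinct_id TEXT);
-- Precalculated membership, refreshed by a periodic background job.
CREATE TABLE cohort_people (cohort_id INTEGER, person_id INTEGER,
                            PRIMARY KEY (cohort_id, person_id));
""")

c.executemany("INSERT INTO person VALUES (?)", [(1,), (2,)])
c.executemany("INSERT INTO person_distinct_id VALUES (?, ?)",
              [("anon-a", 1), ("anon-b", 2)])
c.executemany("INSERT INTO events (event, distinct_id) VALUES (?, ?)",
              [("$pageview", "anon-a"), ("$pageview", "anon-b"),
               ("signup", "anon-a")])

# Background job: recalculate cohort 7 = "people who did signup".
c.execute("""
INSERT OR REPLACE INTO cohort_people (cohort_id, person_id)
SELECT 7, pdi.person_id
FROM events e JOIN person_distinct_id pdi ON pdi.distinct_id = e.distinct_id
WHERE e.event = 'signup'
""")

# Breakdown query: count $pageview events by cohort membership with a
# simple join, instead of running the full cohort filter per request.
rows = c.execute("""
SELECT COUNT(*) FROM events e
JOIN person_distinct_id pdi ON pdi.distinct_id = e.distinct_id
JOIN cohort_people cp ON cp.person_id = pdi.person_id AND cp.cohort_id = 7
WHERE e.event = '$pageview'
""").fetchone()
print(rows[0])  # pageviews by members of cohort 7 -> 1
```

Note the extra `person_distinct_id` join in the breakdown query: storing distinct_id in the membership table instead would drop that join, which is the trade-off described above.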
@timgl tricky situation. I'm not sure that storing person_id directly with events (and going through the trouble of updating them with alias events) will speed things up that much.

Regarding cohorts, yeah, finding people who belong in a cohort is a massive query right now. Denormalising this is definitely something to consider strongly. 30 minutes sounds rather slow though; is there any way to make this near-real-time, for example a background job that runs whenever any (or the last) action that makes up a cohort occurs? Is it also possible for people to be removed from cohorts, or only with the passage of time or the cohort definition being changed, I guess?
Going off both of your points, I agree we should have a calculated table for cohorts: just a table of people indexed by cohort. Instead of a cron-style system that recalculates at preset intervals, to get close to real time we could dispatch a job whenever an action happens (maybe using the worker system we already have, and only dispatching if some cohort uses that action, as Marius said). The job would determine whether the user needs to be added to the cohort related to that action; if so, it's a simple insert of a row into the above-mentioned table.

The drawback is that this adds a fair bit of complexity, and there will be a lot of unnecessary checking since a job is dispatched every single time an event happens. If we could figure out how to reduce that, it could work (we can also batch the processing). Also, if cohort properties are changed, we would trigger a full recalculation. The benefit is that, as long as the worker queue doesn't get stuck or backed up, retrieving cohort people should be really simple with the table, and it would almost always be up to date.
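A minimal sketch of the dispatch-and-batch idea (all names here are hypothetical, and a real worker would re-verify the cohort's filters before inserting): events only enqueue a check when some cohort actually uses the action, and pending checks are deduplicated into a set so repeated events from the same person collapse into one job.

```python
from collections import defaultdict

# Hypothetical mapping: action name -> cohort_ids that use it.
COHORTS_BY_ACTION = {"signup": {7}, "upgrade": {7, 9}}

pending_checks = set()            # deduplicated (cohort_id, person_id) pairs
cohort_people = defaultdict(set)  # cohort_id -> set of person_ids

def on_event(action, person_id):
    """Called on every ingested event; enqueues a membership check
    only if some cohort actually uses this action."""
    for cohort_id in COHORTS_BY_ACTION.get(action, ()):
        pending_checks.add((cohort_id, person_id))

def flush_pending():
    """Worker job: process the batch, adding people to cohorts.
    A real implementation would re-check the cohort's full definition here."""
    while pending_checks:
        cohort_id, person_id = pending_checks.pop()
        cohort_people[cohort_id].add(person_id)

# Many events, including repeats, produce only a handful of checks.
for _ in range(1000):
    on_event("signup", person_id=1)
on_event("upgrade", person_id=2)
on_event("$pageview", person_id=3)  # no cohort uses this action: no job

print(len(pending_checks))       # 3 deduplicated checks, not 1002
flush_pending()
print(sorted(cohort_people[7]))  # [1, 2]
```

Batching like this keeps the queue from being flooded, at the cost of the membership table lagging by one flush interval.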
Thanks both, I've moved this discussion to #696. I suggest we split the precalculation out of this PR, as it's already chunky. That means this PR is ready for review :)
One bug found below.
I was testing cohorts created by filter, and I found something that may just be me not fully understanding how the breakdown is supposed to work: if I create a cohort of users with the action $pageview and attempt to break down by that cohort with the entity $pageview selected, I get nothing. It seems like it should have something to show, because all the users within the cohort have done a $pageview.
@EDsCODE Should be ready for re-QA.
Very cool! QA'd, and the changes since the last review look good. There are a few merge conflicts that should be trivial to resolve.
Re-tested that migration just to be sure, looks good. Thanks for QAing!
* upstream/master:
  - Closes PostHog#169 break down by cohort (PostHog#690)
  - 703 multiple dashboards (PostHog#740)
  - Use person_id instead of distinct_id for unique count (PostHog#734)
  - new contributors (PostHog#739)
  - Update Trends dotted line UX (PostHog#735)