
Optimise evaluation strategy in circumstances where segmentation is local to each MeasureProvider #2

Open
matthewwardrop opened this issue Mar 9, 2018 · 7 comments

@matthewwardrop
Owner

matthewwardrop commented Mar 9, 2018

Currently the evaluation strategy always joins MeasureProviders together on the nominal statistical unit type. In cases where segmentation can be done entirely locally within each MeasureProvider, this is inefficient.

For example, suppose you have two measure providers: `users` (providing the `age` measure) and `signups` (providing `lead_time`), both keyed by `user` and exposing a `ds` dimension.

And you run:
`m.evaluate('user', measures=['age', 'lead_time'], segment_by=['ds'])`

The current SQL generated will be something like:

```sql
SELECT users.ds, SUM(age), SUM(lead_time)
FROM (
    SELECT * FROM users
) users
LEFT JOIN (
    SELECT * FROM signups
) signups ON users.user = signups.user AND users.ds = signups.ds
GROUP BY users.ds
```

Especially if the cardinality of users is large, this can be needlessly computationally taxing.

Instead, we should add an optimisation for the case where all requested segmentation can be performed locally by every measure provider, so that the generated SQL looks something like:

```sql
SELECT COALESCE(a.ds, b.ds) AS ds, sum_age, sum_lead_time
FROM (
    SELECT ds, SUM(age) AS sum_age FROM users GROUP BY ds
) a
FULL OUTER JOIN (
    SELECT ds, SUM(lead_time) AS sum_lead_time FROM signups GROUP BY ds
) b ON a.ds = b.ds
```

This potentially reduces the computational complexity of the join substantially in these simple evaluation scenarios.
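To make the two query shapes concrete, here is a minimal, self-contained sketch using `sqlite3` and hypothetical toy data (not mensor's actual API). A `LEFT JOIN` stands in for the proposed `FULL OUTER JOIN`, since older SQLite versions lack the latter; both strategies produce the same aggregates here, but the second joins only the tiny per-`ds` results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (user TEXT, ds TEXT, age INTEGER);
    CREATE TABLE signups (user TEXT, ds TEXT, lead_time INTEGER);
    INSERT INTO users   VALUES ('a', '2018-03-09', 30), ('b', '2018-03-09', 40);
    INSERT INTO signups VALUES ('a', '2018-03-09', 3),  ('b', '2018-03-09', 5);
""")

# Current strategy: join row-by-row on (user, ds), then aggregate.
join_then_agg = conn.execute("""
    SELECT users.ds, SUM(users.age), SUM(signups.lead_time)
    FROM users
    LEFT JOIN signups ON users.user = signups.user AND users.ds = signups.ds
    GROUP BY users.ds
""").fetchall()

# Proposed strategy: aggregate each provider locally, then join the
# (much smaller) per-ds results.
agg_then_join = conn.execute("""
    SELECT a.ds, sum_age, sum_lead_time
    FROM (SELECT ds, SUM(age) AS sum_age FROM users GROUP BY ds) a
    LEFT JOIN (SELECT ds, SUM(lead_time) AS sum_lead_time
               FROM signups GROUP BY ds) b ON a.ds = b.ds
""").fetchall()

print(join_then_agg)  # [('2018-03-09', 70, 8)]
print(agg_then_join)  # [('2018-03-09', 70, 8)]
```

The row-level join touches O(|users| × matching signups) rows before aggregating; the pre-aggregated join touches only one row per distinct `ds` per provider.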

Thanks for suggesting/flagging this @danfrankj

@matthewwardrop matthewwardrop self-assigned this Mar 9, 2018
@danfrankj
Collaborator

Yup, @matthewwardrop, this describes the problem, but there is in addition the larger question of whether Mensor wants to support use cases where there are no primary keys or subjects. For example, you could imagine having only two data sources, an impression log and a click log. Neither of these has a primary key, but the query below is still a valid query.

`m.evaluate(None, measures=['impressions', 'clicks'], segment_by=['ds'])`

@matthewwardrop
Owner Author

matthewwardrop commented Mar 12, 2018

@danfrankj I personally think the null unit type case is not really in the purview of mensor. It would be straightforward to associate an impression log or click log with an artificial id (i.e. `<name of event>+<ts>+<...>`) that could be considered primary, and then use them in unitted evaluations. As soon as you remove unit types, there is no way to determine what you can segment by (i.e. you'd be disabling any non-local segmentations), and it becomes less intuitive what is being calculated (i.e. I think `m.evaluate('impression', measures=['count'], segment_by=['ds'])` is clearer and more intuitive than `m.evaluate(None, measures='impressions/count', segment_by=['ds'])`).

I'm willing to be convinced on this point, but I'd need a super clear example of its utility that is not better and more intuitively served by some other pathway.
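The artificial-id idea above can be sketched in a few lines; the field names here are purely hypothetical, and any combination of fields that uniquely identifies a row would serve:

```python
# Toy impression log rows (hypothetical field names, not a real schema).
impressions = [
    {"event": "impression", "ts": "2018-03-12T00:00:01", "user": "a"},
    {"event": "impression", "ts": "2018-03-12T00:00:01", "user": "b"},
]

for row in impressions:
    # Derive a synthetic primary key of the form <name of event>+<ts>+<...>,
    # as suggested above, so the log can act as a unitted provider.
    row["id"] = "+".join([row["event"], row["ts"], row["user"]])

print(impressions[0]["id"])  # impression+2018-03-12T00:00:01+a
```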

@agorajek
Contributor

agorajek commented May 14, 2018

Forgive me if I am missing the point of this issue, but I am wondering: if the above example metric evaluation is allowed without joining on the primary/foreign key, then why does the solution still need to be a single query? Why not compute these two measures as separate queries and then merge the results as a next step?

In other words, when a join on the key is necessary, any query optimization should be taken care of by the underlying provider/platform; and when a join on the key is not needed, why not just run them as separate queries?
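The "separate queries, merge afterwards" approach can be sketched as follows, again with `sqlite3` and toy tables standing in for real providers (not mensor's API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (user TEXT, ds TEXT, age INTEGER);
    CREATE TABLE signups (user TEXT, ds TEXT, lead_time INTEGER);
    INSERT INTO users   VALUES ('a', '2018-03-09', 30), ('b', '2018-03-09', 40);
    INSERT INTO signups VALUES ('a', '2018-03-09', 3),  ('b', '2018-03-09', 5);
""")

# One query per provider, each aggregated locally on ds...
sum_age  = dict(conn.execute("SELECT ds, SUM(age) FROM users GROUP BY ds"))
sum_lead = dict(conn.execute("SELECT ds, SUM(lead_time) FROM signups GROUP BY ds"))

# ...then a client-side merge keyed on the shared segmentation.
merged = {
    ds: (sum_age.get(ds), sum_lead.get(ds))
    for ds in set(sum_age) | set(sum_lead)
}
print(merged)  # {'2018-03-09': (70, 8)}
```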

@matthewwardrop
Owner Author

@agorajek Hey Olek! Thanks for chiming in!

The main reason why this might be desirable, instead of running the evaluations separately, is when you want to push higher-level calculations down into (e.g.) SQL. If a Metric depends on these two measures and you split the evaluation up, you would not be able to push the computation into SQL, and would have to perform it locally in Python. That's not always prohibitive (since it would be quite cheap), but if you wanted the results cached in a database table, not being able to squash it down would be annoying.

Perhaps I'm missing something?
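As an illustration of the pushdown point: with a single query, a derived metric (here a hypothetical ratio of summed lead time to summed age, purely for illustration) can be computed inside SQL over the pre-aggregated subqueries, rather than merging two result sets in Python. Again `sqlite3` with toy data stands in for a real backend:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (user TEXT, ds TEXT, age INTEGER);
    CREATE TABLE signups (user TEXT, ds TEXT, lead_time INTEGER);
    INSERT INTO users   VALUES ('a', '2018-03-09', 30), ('b', '2018-03-09', 40);
    INSERT INTO signups VALUES ('a', '2018-03-09', 3),  ('b', '2018-03-09', 5);
""")

# The metric is computed entirely in SQL, so its result could be
# written straight to a cache table with CREATE TABLE ... AS SELECT.
rows = conn.execute("""
    SELECT a.ds, CAST(sum_lead_time AS REAL) / sum_age AS lead_time_per_age
    FROM (SELECT ds, SUM(age) AS sum_age FROM users GROUP BY ds) a
    JOIN (SELECT ds, SUM(lead_time) AS sum_lead_time
          FROM signups GROUP BY ds) b
      ON a.ds = b.ds
""").fetchall()
print(rows)
```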

@agorajek
Contributor

agorajek commented May 14, 2018

@matthewwardrop that makes sense. But then my question would be: why are you performing a join between users and signups at all (as shown in the first query)? Is the answer simply "because that part of the code hasn't been optimized yet"?

IMO both queries may be valid depending on what question you are asking of the data. Just to clarify, I don't like the first query either, especially the `SELECT *` ;) But I am trying to say that the choice between the first and second options may simply be determined by whether a join on the key between these datasets is needed, and the necessity of such a join should itself be inferred from the measure/metric definitions. I agree that for the question you asked (`m.evaluate('user', measures=['age', 'lead_time'], segment_by=['ds'])`) a join on the key should not be performed, but a single query may still be the desired choice.
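The inference described above could be sketched roughly as follows; the provider structures here are hypothetical stand-ins, not mensor's actual internals:

```python
# Hypothetical provider metadata: which dimensions each provider can
# segment by locally. 'country' is only available on users.
providers = {
    "users":   {"dimensions": {"user", "ds", "country"}, "measures": {"age"}},
    "signups": {"dimensions": {"user", "ds"}, "measures": {"lead_time"}},
}

def key_join_required(segment_by, providers):
    # A join on the unit key is only needed when some requested
    # segmentation is NOT available locally to every provider.
    return any(
        not set(segment_by) <= p["dimensions"] for p in providers.values()
    )

print(key_join_required(["ds"], providers))       # False: pre-aggregate locally
print(key_join_required(["country"], providers))  # True: signups lacks 'country'
```

When this returns `False`, the optimised aggregate-then-join form proposed in this issue applies; when `True`, the current key-join strategy remains necessary.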

@matthewwardrop
Owner Author

matthewwardrop commented May 15, 2018

@agorajek The provided SQL was meant to demonstrate the form of the query, rather than represent it verbatim; `SELECT *` would never feature except in (badly) user-crafted data extraction queries.

Regarding the necessity of the join: the main objective of the original join strategy was to prevent statistical unit counts from being under-estimated during joins that do not involve a primary table.

Currently, the evaluation strategy bundles each set of query statements for a given statistical unit type into a single root query (from a primary provider of that statistical unit type), and joins all other provisions into it. Likewise, all nested statistical units follow the same pattern, and are then joined into the parent query on their statistical unit type key. This is nice because it is a relatively simple recursive algorithm, and is guaranteed to work in all circumstances.

For the more interesting use-cases, this will be (close to) optimal, since you will be using a broader selection of measures and dimensions, and joins will be essential. It is only suboptimal in simpler cases, such as that noted above. It was always my plan to improve the optimality of the algorithm; I just believe gunning for completeness and correctness takes priority :). I cannot remember where I read the mantra, but I think it is a good one: make it work, make it correct, and then make it fast.

@agorajek
Contributor

@matthewwardrop thanks for the last explanation; it clarified this issue for me. And by the way, I had never heard that mantra... but I fully subscribe to it.
