New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Transform: Minimize the number of non-leaves in join equivalence expressions. #7895
Conversation
Item No. 1 . With this somewhat remotely possibly realistic query:
This plan contains the following redundant Filters
by contrast, the prior commit has only this:
Schema: dbt3-ddl.sql.zip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the unit tests!
I am able to hit this optimization both with my randomized workload that generates TPC-like queries and a purposely-built one that does 3 and 4-table joins with predicates that are similar to the original problematic query.
There are no wrong results or other such. I examined a lot of plans that have changed and the ones that have regressed all appear to follow the pattern described in "Item No1", which looks to me to be a common subexpression that is no longer detected as such. On the positive side, I see movement of some predicates towards the source. There have been no plans that were substantially impacted e.g. extra joins, avoidable cross joins, changes in join order and the like.
Thank you.
90ad673
to
90270aa
Compare
3b0463c
to
828dae6
Compare
@@ -268,34 +268,34 @@ CREATE INDEX bar_idx3 on bar(a + 4); | |||
query T multiline | |||
EXPLAIN PLAN FOR | |||
select foo.b, bar.b, baz.b | |||
FROM foo, bar, baz | |||
FROM bar, foo, baz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this query, if "foo" came before "bar", then we would not be able to use the index bar(a+4)
.
The set of equivalence classes (= #0 #2) (= #4 (#2 + 4))
would get rewritten to (= #0 #2) (= #4 (#0 + 4))
. Then, JoinImplementation will look for an index on foo(a+4)
but not an index on bar(a+4)
.
I have written up in #8002 that JoinImplementation could be changed to identify arrangements on expressions that are equivalent to expressions in equivalences
.
@@ -57,11 +117,29 @@ pub fn canonicalize_equivalences(equivalences: &mut Vec<Vec<MirScalarExpr>>) { | |||
equivalences.sort(); | |||
} | |||
|
|||
fn rank_complexity(expr: &MirScalarExpr) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wants a doc comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit, but why "non-leaves" instead of "nodes" which would simplify the logic a fair bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This question still stands? If we want to prevent cycling, why not just "nodes"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true that "nodes" will also prevent cycling because an expression will only be able to be replaced with an expression with equal or fewer nodes. In my mind, having fewer function calls makes an expression "less complex," but ideally in the future, we could rank complexity by the estimated amount of time it takes to compute the expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is true. The main property you want is that there is some measurement on the thing that must strictly increase when one thing is contained in the other, and strictly decrease when a greater thing is replaced by a strictly smaller thing. We can keep it as whatever, but "nodes" seemed much easier and if I swoop in and change it in the future is that a bug (is a thing I'd love the code to reflect).
/// This function: | ||
/// * ensures the same expression appears in only one equivalence class. | ||
/// * ensures the equivalence classes are sorted and dedupped. | ||
/// * simplifies expressions to involve the least number of non-leaves. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this last bullet point means. If it is important, could you say more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this is ok, but I'm going to push back on the presentation! I can't tell what the new 60 lines are meant to do, and I'd like to have their intended behavior described some other way than in the code itself. I recommend method-level doc comments with more detailed comments in the method body.
/// involving one or more arguments. This is to prevent infinite loops/100% | ||
/// CPU situations where transforms can cause the equivalence class to evolve | ||
/// from `[A f(A) f(C, A)]` to `[A f(A) g(C, A) g(C, f(A))]` and then to | ||
/// `[A f(A) g(C, A) g(C, f(A)) g(C, f(f(A)))]` and so on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I don't know what this means. The example doesn't make it as clear to me as I suspect it makes it clear to someone who has seen this problem before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, more reading here, can I recommend the following text instead:
This ensures that we only replace expressions by "even simpler" expressions. This ensures that repeated substitutions reduce the complexity of expressions and a fixed point is certain to be reached. Without this rule, we might repeatedly replace a simple expression with a complex expression containing that (or another replaceable) simple expression, and repeat indefinitely.
@@ -57,11 +117,29 @@ pub fn canonicalize_equivalences(equivalences: &mut Vec<Vec<MirScalarExpr>>) { | |||
equivalences.sort(); | |||
} | |||
|
|||
fn rank_complexity(expr: &MirScalarExpr) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This question still stands? If we want to prevent cycling, why not just "nodes"?
This allows distinguishing between the complexity of CallUnary{CallUnary{Column}} and CallUnary{Column}. Constants are considered to be less complex that any other type of MirScalarExpr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the nit about literals being sometimes 0, sometimes 1, this seems great, thanks!
b1d8a5b
to
46b103d
Compare
46b103d
to
9ebc436
Compare
Changed the ranking to be by number of non-literal nodes. That said, ranking by "nodes" will still prevent cycling even though bare literals are 0 but a literal encountered in visit would contribute a 1. It turns out that the property
is not required to prevent cycling. As an example, suppose the following is a set of join equivalences, and the complexity ranking is such that After the first round of substitutions, the equivalences look like this: |
Fixes the 100% CPU problem preventing #7471 from being merged.
Description of 100% CPU problem:
(= #1 (#2 + (#2 + #1)) (#3 + (#3 + #3)))
(= #1 (#2 + (#2 + #1)) (#2 + (#2 + (#3 + (#3 + #3)))) (#3 + (#3 + #3)))
(#2 + (#2 + (#3 + (#3 + #3)))) = (#3 + (#3 + #3))
is a single-input predicate, so we push(#2 + (#2 + (#3 + (#3 + #3)))) = (#3 + (#3 + #3))
down, and the join equivalence class becomes to(= #1 (#2 + (#2 + #1)) (#2 + (#2 + (#3 + (#3 + #3))))
(= #1 (#2 + (#2 + #1)) (#2 + (#2 + (#2 + (#2 + (#3 + (#3 + #3)))))))
This PR fixes the 100% CPU problem at step 2 in the above process. We observe that the
(#3 + (#3 + #3))
in(#2 + (#2 + (#3 + (#3 + #3))))
is equivalent to a simpler expression#1
, so we replace(#2 + (#2 + (#3 + (#3 + #3))))
with(#2 + (#2 + #1))
. After dedupping, the equivalence class becomes(= #1 (#2 + (#2 + #1)) (#3 + (#3 + #3)))
EDIT: I realized that measuring complexity by the number of leaves:
-(#0)
vs-(-(#0))
The new measure of complexity is that literals always have the least complexity, and then among all other expressions, they are ranked by the number of non-leaves.