Transform: Minimize the number of non-leaves in join equivalence expressions. #7895

wangandi · 2021-08-18T02:57:09Z

Fixes the 100% CPU problem preventing #7471 from being merged.

Description of the 100% CPU problem:

1. A join equivalence class starts off looking like this: (= #1 (#2 + (#2 + #1)) (#3 + (#3 + #3)))
2. After predicate pullup + predicate pushdown, the join equivalence class has become this: (= #1 (#2 + (#2 + #1)) (#2 + (#2 + (#3 + (#3 + #3)))) (#3 + (#3 + #3)))
3. We automatically detect that (#2 + (#2 + (#3 + (#3 + #3)))) = (#3 + (#3 + #3)) is a single-input predicate, so we push it down, and the join equivalence class becomes (= #1 (#2 + (#2 + #1)) (#2 + (#2 + (#3 + (#3 + #3)))))
4. Repeat steps 2 and 3. The join equivalence class now looks like this: (= #1 (#2 + (#2 + #1)) (#2 + (#2 + (#2 + (#2 + (#3 + (#3 + #3)))))))
5. And so on.

This PR fixes the 100% CPU problem at step 2 in the above process. We observe that the (#3 + (#3 + #3)) in (#2 + (#2 + (#3 + (#3 + #3)))) is equivalent to a simpler expression, #1, so we replace (#2 + (#2 + (#3 + (#3 + #3)))) with (#2 + (#2 + #1)). After deduplication, the equivalence class becomes (= #1 (#2 + (#2 + #1)) (#3 + (#3 + #3))).
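
A minimal sketch of the substitution described above, on a toy expression type rather than the real `MirScalarExpr`. The names `Expr`, `count_adds`, `replace_inside`, and `simplify_class` are hypothetical and chosen for illustration only, and the complexity measure (number of `Add` nodes) is an assumption, not necessarily what the PR uses.

```rust
// Toy stand-in for MirScalarExpr: only column references and `+`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Expr {
    Column(usize),
    Add(Box<Expr>, Box<Expr>),
}

fn col(i: usize) -> Expr {
    Expr::Column(i)
}

fn add(l: Expr, r: Expr) -> Expr {
    Expr::Add(Box::new(l), Box::new(r))
}

/// Number of `Add` nodes; an assumed stand-in for the complexity measure.
fn count_adds(e: &Expr) -> usize {
    match e {
        Expr::Column(_) => 0,
        Expr::Add(l, r) => 1 + count_adds(l) + count_adds(r),
    }
}

/// Returns `to` if `e` is exactly `from`, otherwise rewrites inside `e`.
fn rewrite(e: &Expr, from: &Expr, to: &Expr) -> Expr {
    if e == from {
        to.clone()
    } else {
        replace_inside(e, from, to)
    }
}

/// Replaces occurrences of `from` strictly inside `e` with `to`.
fn replace_inside(e: &Expr, from: &Expr, to: &Expr) -> Expr {
    match e {
        Expr::Column(_) => e.clone(),
        Expr::Add(l, r) => add(rewrite(l, from, to), rewrite(r, from, to)),
    }
}

/// Within one equivalence class, replaces any sub-expression equal to a
/// more complex class member with the simplest member, then sorts and
/// removes duplicates.
fn simplify_class(class: &mut Vec<Expr>) {
    let simplest = match class.iter().min_by_key(|e| count_adds(e)).cloned() {
        Some(e) => e,
        None => return,
    };
    let members = class.clone();
    for expr in class.iter_mut() {
        for member in &members {
            if count_adds(member) > count_adds(&simplest) {
                *expr = replace_inside(expr, member, &simplest);
            }
        }
    }
    class.sort();
    class.dedup();
}

fn main() {
    // The class from step 2 of the description.
    let three = add(col(3), add(col(3), col(3)));
    let mut class = vec![
        col(1),
        add(col(2), add(col(2), col(1))),
        add(col(2), add(col(2), three.clone())),
        three,
    ];
    simplify_class(&mut class);
    println!("{:?}", class);
}
```

In this sketch the four-member class from step 2 collapses back to the three-member class from step 1, so the pullup/pushdown cycle has nothing new to grow on.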

EDIT: I realized that measuring complexity by the number of leaves:

* fails to capture the difference in complexity between one or more unary functions nested within each other, like -(#0) vs -(-(#0))
* risks overwriting a constant with a column reference.

The new measure of complexity is that literals always have the least complexity; all other expressions are ranked by their number of non-leaves.
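
A small sketch of why counting leaves falls short and how the revised measure separates the two cases listed in the EDIT. The `Expr` type and the functions `leaves`, `non_leaves`, and `rank_sketch` are hypothetical illustrations, not the code in `canonicalize.rs`; encoding "literals are least" as rank 0 and everything else as `1 + non_leaves` is an assumption.

```rust
// Toy stand-in for MirScalarExpr.
#[derive(Debug, Clone)]
enum Expr {
    Column(usize),
    Literal(i64),
    CallUnary(&'static str, Box<Expr>),
    CallBinary(&'static str, Box<Expr>, Box<Expr>),
}

/// The original measure: number of leaves (columns and literals).
fn leaves(e: &Expr) -> usize {
    match e {
        Expr::Column(_) | Expr::Literal(_) => 1,
        Expr::CallUnary(_, a) => leaves(a),
        Expr::CallBinary(_, l, r) => leaves(l) + leaves(r),
    }
}

/// Number of function-call nodes.
fn non_leaves(e: &Expr) -> usize {
    match e {
        Expr::Column(_) | Expr::Literal(_) => 0,
        Expr::CallUnary(_, a) => 1 + non_leaves(a),
        Expr::CallBinary(_, l, r) => 1 + non_leaves(l) + non_leaves(r),
    }
}

/// Assumed encoding of the revised measure: a bare literal ranks lowest,
/// everything else ranks by its number of non-leaves (shifted by one so
/// that a bare column still ranks above a bare literal).
fn rank_sketch(e: &Expr) -> usize {
    match e {
        Expr::Literal(_) => 0,
        _ => 1 + non_leaves(e),
    }
}

fn main() {
    let neg = Expr::CallUnary("-", Box::new(Expr::Column(0)));
    let neg_neg = Expr::CallUnary("-", Box::new(neg.clone()));
    let sum = Expr::CallBinary("+", Box::new(Expr::Column(2)), Box::new(Expr::Column(1)));
    for e in [Expr::Literal(4), Expr::Column(0), neg, neg_neg, sum] {
        println!("{:?}: leaves={} rank={}", e, leaves(&e), rank_sketch(&e));
    }
}
```

Under the leaf count, -(#0) and -(-(#0)) tie, as do a literal and a column; under the sketched rank they do not.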

philip-stoev · 2021-08-18T07:50:25Z

Item No. 1. With this somewhat remotely possibly realistic query:

EXPLAIN SELECT l_commitDATE AS col58299
  FROM lineitem
  JOIN orders
    ON (l_orderkey = o_orderkey)
  LEFT JOIN customer
    ON (o_custkey = c_custkey)
 WHERE o_orderdate = l_shipDATE - INTERVAL ' 0 DAY '
   AND l_receiptDATE = o_orderdate

This plan contains the following redundant Filters:

 %5 =                                                    +
 | Get %2 (l0)                                           +
 | Filter (#12 = #20), (datetots(#12) = (#10 - 00:00:00))+
...
 %8 =                                                    +
 | Get %2 (l0)                                           +
 | Filter (#12 = #20), (datetots(#20) = (#10 - 00:00:00))+

By contrast, the prior commit has only this:

 %5 = Let l2 =                                           +
 | Get %2 (l0)                                           +
 | Filter (#12 = #20), (datetots(#20) = (#10 - 00:00:00))+

Schema: dbt3-ddl.sql.zip

philip-stoev

Thank you for the unit tests!

I am able to hit this optimization both with my randomized workload that generates TPC-like queries and a purposely-built one that does 3 and 4-table joins with predicates that are similar to the original problematic query.

There are no wrong results or anything of that sort. I examined many of the plans that changed, and the ones that regressed all appear to follow the pattern described in "Item No. 1", which looks to me like a common subexpression that is no longer detected as such. On the positive side, I see some predicates moving towards the source. No plans were substantially impacted, e.g. by extra joins, avoidable cross joins, changes in join order, and the like.

Thank you.

wangandi · 2021-08-23T20:33:16Z

test/sqllogictest/transform/join_index.slt

@@ -268,34 +268,34 @@ CREATE INDEX bar_idx3 on bar(a + 4);
 query T multiline
 EXPLAIN PLAN FOR
 select foo.b, bar.b, baz.b
-FROM foo, bar, baz
+FROM bar, foo, baz


In this query, if "foo" came before "bar", then we would not be able to use the index bar(a+4).
The set of equivalence classes (= #0 #2) (= #4 (#2 + 4)) would get rewritten to (= #0 #2) (= #4 (#0 + 4)). Then, JoinImplementation will look for an index on foo(a+4) but not an index on bar(a+4).

I have written up in #8002 that JoinImplementation could be changed to identify arrangements on expressions that are equivalent to expressions in equivalences.
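
To illustrate the rewrite described in this comment, here is a rough sketch of how an equivalence member gets attributed to a join input by its column references. It assumes, as the column numbers above suggest, that `foo`, `bar`, and `baz` each have two columns, so that with `FROM foo, bar, baz` they occupy #0-#1, #2-#3, and #4-#5. The `Expr` type and the `single_input` helper are hypothetical and do not reflect how JoinImplementation is actually written.

```rust
// Toy stand-in for MirScalarExpr.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Expr {
    Column(usize),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Collects every column index referenced by `e`.
fn columns(e: &Expr, out: &mut Vec<usize>) {
    match e {
        Expr::Column(c) => out.push(*c),
        Expr::Literal(_) => {}
        Expr::Add(l, r) => {
            columns(l, out);
            columns(r, out);
        }
    }
}

/// Returns the index of the only join input whose columns `e` refers to,
/// or `None` if `e` refers to no input or to more than one.
fn single_input(e: &Expr, arities: &[usize]) -> Option<usize> {
    let mut cols = Vec::new();
    columns(e, &mut cols);
    let mut found: Option<usize> = None;
    for c in cols {
        // Locate the input whose column range contains `c`.
        let mut start = 0;
        let mut input = None;
        for (i, &arity) in arities.iter().enumerate() {
            if c < start + arity {
                input = Some(i);
                break;
            }
            start += arity;
        }
        match (found, input) {
            (_, None) => return None,
            (None, Some(i)) => found = Some(i),
            (Some(f), Some(i)) if f != i => return None,
            _ => {}
        }
    }
    found
}

fn main() {
    // Assumed arities for foo, bar, baz in that order.
    let arities = [2, 2, 2];
    let before = Expr::Add(Box::new(Expr::Column(2)), Box::new(Expr::Literal(4)));
    let after = Expr::Add(Box::new(Expr::Column(0)), Box::new(Expr::Literal(4)));
    println!("(#2 + 4) belongs to input {:?}", single_input(&before, &arities));
    println!("(#0 + 4) belongs to input {:?}", single_input(&after, &arities));
}
```

Before the rewrite, (#2 + 4) is attributed to the second input (`bar`), where the index on `bar(a + 4)` lives; after the rewrite, (#0 + 4) is attributed to the first input (`foo`), which has no such index.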

frankmcsherry · 2021-08-25T14:14:26Z

src/expr/src/relation/canonicalize.rs

@@ -57,11 +117,29 @@ pub fn canonicalize_equivalences(equivalences: &mut Vec<Vec<MirScalarExpr>>) {
    equivalences.sort();
 }

+fn rank_complexity(expr: &MirScalarExpr) -> usize {


This wants a doc comment.

A nit, but why "non-leaves" instead of "nodes" which would simplify the logic a fair bit?

This question still stands? If we want to prevent cycling, why not just "nodes"?

It is true that "nodes" will also prevent cycling because an expression will only be able to be replaced with an expression with equal or fewer nodes. In my mind, having fewer function calls makes an expression "less complex," but ideally in the future, we could rank complexity by the estimated amount of time it takes to compute the expression.

I believe it is true. The main property you want is that there is some measure on expressions that must strictly increase when one expression is contained in another, and strictly decrease when an expression is replaced by a strictly smaller one. We can keep it as whatever, but "nodes" seemed much easier, and whether it would be a bug if I swoop in and change it in the future is a thing I'd love the code to reflect.

frankmcsherry · 2021-08-25T14:17:05Z

src/expr/src/relation/canonicalize.rs

+/// This function:
+/// * ensures the same expression appears in only one equivalence class.
+/// * ensures the equivalence classes are sorted and dedupped.
+/// * simplifies expressions to involve the least number of non-leaves.


I don't understand what this last bullet point means. If it is important, could you say more?

frankmcsherry

I suspect this is ok, but I'm going to push back on the presentation! I can't tell what the new 60 lines are meant to do, and I'd like to have their intended behavior described some other way than in the code itself. I recommend method-level doc comments with more detailed comments in the method body.

frankmcsherry · 2021-09-01T09:10:19Z

src/expr/src/relation/canonicalize.rs

+///   involving one or more arguments. This is to prevent infinite loops/100%
+///   CPU situations where transforms can cause the equivalence class to evolve
+///   from `[A f(A) f(C, A)]` to `[A f(A) g(C, A) g(C, f(A))]` and then to
+///   `[A f(A) g(C, A) g(C, f(A)) g(C, f(f(A)))]` and so on.


I'm sorry, I don't know what this means. The example doesn't make it as clear to me as I suspect it makes it clear to someone who has seen this problem before.

Ok, more reading here, can I recommend the following text instead:

This ensures that we only replace expressions by "even simpler" expressions. This ensures that repeated substitutions reduce the complexity of expressions and a fixed point is certain to be reached. Without this rule, we might repeatedly replace a simple expression with a complex expression containing that (or another replaceable) simple expression, and repeat indefinitely.
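
The termination argument in the suggested text can be sketched as follows. This is an illustration under assumptions, not the PR's code: `Expr`, `nodes`, `substitute`, and `rewrite_to_fixed_point` are hypothetical names, and the measure used is the plain node count discussed in the review.

```rust
// Toy stand-in for MirScalarExpr.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Expr {
    Column(usize),
    Call(&'static str, Vec<Expr>),
}

/// Total number of nodes in the expression tree.
fn nodes(e: &Expr) -> usize {
    match e {
        Expr::Column(_) => 1,
        Expr::Call(_, args) => 1 + args.iter().map(nodes).sum::<usize>(),
    }
}

/// Replaces every occurrence of `from` in `e` (including `e` itself) with `to`.
fn substitute(e: &Expr, from: &Expr, to: &Expr) -> Expr {
    if e == from {
        return to.clone();
    }
    match e {
        Expr::Column(_) => e.clone(),
        Expr::Call(name, args) => Expr::Call(
            *name,
            args.iter().map(|a| substitute(a, from, to)).collect(),
        ),
    }
}

/// Applies the `(from, to)` rules until nothing changes. A rule fires only
/// when `to` has strictly fewer nodes than `from`, so every change shrinks
/// the expression and the loop must stop. Returns the result and the number
/// of rounds that changed something.
fn rewrite_to_fixed_point(mut e: Expr, rules: &[(Expr, Expr)]) -> (Expr, usize) {
    let mut rounds = 0;
    loop {
        let mut next = e.clone();
        for (from, to) in rules {
            if nodes(to) < nodes(from) {
                next = substitute(&next, from, to);
            }
        }
        if next == e {
            return (e, rounds);
        }
        e = next;
        rounds += 1;
    }
}

fn main() {
    let a = Expr::Column(0);
    let c = Expr::Column(2);
    let f_a = Expr::Call("f", vec![a.clone()]);
    // The class [A f(A)] yields a rule in each direction; only f(A) -> A
    // passes the "strictly simpler" guard.
    let rules = vec![(a.clone(), f_a.clone()), (f_a.clone(), a.clone())];
    let start = Expr::Call("g", vec![c, Expr::Call("f", vec![f_a])]);
    let (result, rounds) = rewrite_to_fixed_point(start, &rules);
    println!("{:?} after {} rounds", result, rounds);
}
```

Dropping the `nodes(to) < nodes(from)` guard would let the A -> f(A) rule fire on every round, which is the unbounded growth the doc comment is trying to describe.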

frankmcsherry · 2021-09-01T09:11:11Z

src/expr/src/relation/canonicalize.rs

@@ -57,11 +117,29 @@ pub fn canonicalize_equivalences(equivalences: &mut Vec<Vec<MirScalarExpr>>) {
    equivalences.sort();
 }

+fn rank_complexity(expr: &MirScalarExpr) -> usize {


This question still stands? If we want to prevent cycling, why not just "nodes"?

This allows distinguishing between the complexity of CallUnary{CallUnary{Column}} and CallUnary{Column}. Constants are considered to be less complex than any other type of MirScalarExpr.

frankmcsherry

Other than the nit about literals being sometimes 0, sometimes 1, this seems great, thanks!

wangandi · 2021-09-01T22:10:41Z

Nit, but this means that bare literals are 0, but a literal encountered in visit would contribute a 1, right? I don't know that this makes anything wrong, though.

Changed the ranking to be by number of non-literal nodes.

That said, ranking by "nodes" will still prevent cycling even though bare literals are 0 but a literal encountered in visit would contribute a 1.

It turns out that the property

/// If expressions `e1` and `e2` are such that
/// `complexity(e1) < complexity(e2)` then for all SQL functions `f`,
/// `complexity(f(e1)) < complexity(f(e2))`.

is not required to prevent cycling.

As an example, suppose the following is a set of join equivalences, and the complexity ranking is such that complexity(4) < complexity(#0) but complexity(f(4)) > complexity(f(#0)).
[[4 #0] [f(#0) f(4)] ["string" g(f(#0))]]

After the first round of substitutions, the equivalences look like this:
[[4 #0] [f(4) f(4)] ["string" g(f(4))]]

wangandi requested a review from philip-stoev August 18, 2021 02:57

philip-stoev approved these changes Aug 18, 2021

wangandi force-pushed the joinequivalencereduce branch from 90ad673 to 90270aa Compare August 20, 2021 23:40

wangandi changed the title ~~Transform: Minimize the number of leaves in join equivalence expressions.~~ Transform: Minimize the number of non-leaves in join equivalence expressions. Aug 20, 2021

wangandi force-pushed the joinequivalencereduce branch 2 times, most recently from 3b0463c to 828dae6 Compare August 23, 2021 19:49

wangandi marked this pull request as ready for review August 23, 2021 20:34

wangandi requested review from frankmcsherry and asenac August 23, 2021 20:35

frankmcsherry reviewed Aug 25, 2021

frankmcsherry requested changes Aug 25, 2021

asenac approved these changes Aug 27, 2021

frankmcsherry reviewed Sep 1, 2021

Andi Wang added 5 commits September 1, 2021 15:49

Minimize the number of leaves in join equivalence expressions.

f9aa67d

Test changes.

cd47b5a

Count non-leaves instead of leaves.

056b44f

This allows distinguishing between the complexity of CallUnary{CallUnary{Column}} and CallUnary{Column}. Constants are considered to be less complex than any other type of MirScalarExpr.

Add comments.

6a83a23

Write down the rules for a valid complexity ranking.

040338f

frankmcsherry approved these changes Sep 1, 2021

wangandi force-pushed the joinequivalencereduce branch 2 times, most recently from b1d8a5b to 46b103d Compare September 1, 2021 21:57

Do not count literal nodes in the complexity ranking.

9ebc436

wangandi force-pushed the joinequivalencereduce branch from 46b103d to 9ebc436 Compare September 1, 2021 21:59

wangandi merged commit 6b7fdeb into MaterializeInc:main Sep 8, 2021

wangandi deleted the joinequivalencereduce branch September 8, 2021 17:08

materialize-bot mentioned this pull request Sep 14, 2021

release: v0.9.4-rc1 required reviews #8284

Closed

19 tasks
