
Transform: Minimize the number of non-leaves in join equivalence expressions. #7895

Merged
merged 6 commits into from Sep 8, 2021

Conversation

@wangandi (Contributor) commented Aug 18, 2021:

Fixes the 100% CPU problem preventing #7471 from being merged.

Description of 100% CPU problem:

  1. A join equivalence class starts off looking like this: (= #1 (#2 + (#2 + #1)) (#3 + (#3 + #3)))
  2. After predicate pullup + predicate pushdown, the join equivalence class has become: (= #1 (#2 + (#2 + #1)) (#2 + (#2 + (#3 + (#3 + #3)))) (#3 + (#3 + #3)))
  3. We automatically detect that (#2 + (#2 + (#3 + (#3 + #3)))) = (#3 + (#3 + #3)) is a single-input predicate, so we push it down, and the join equivalence class becomes (= #1 (#2 + (#2 + #1)) (#2 + (#2 + (#3 + (#3 + #3)))))
  4. Repeat steps 2 and 3. The join equivalence class now looks like this: (= #1 (#2 + (#2 + #1)) (#2 + (#2 + (#2 + (#2 + (#3 + (#3 + #3)))))))
  5. And so on, growing without bound.

This PR fixes the 100% CPU problem at step 2 of the process above. We observe that the (#3 + (#3 + #3)) in (#2 + (#2 + (#3 + (#3 + #3)))) is equivalent to the simpler expression #1, so we replace (#2 + (#2 + (#3 + (#3 + #3)))) with (#2 + (#2 + #1)). After deduplicating, the equivalence class becomes (= #1 (#2 + (#2 + #1)) (#3 + (#3 + #3))).

EDIT: I realized that measuring complexity by the number of leaves:

  • fails to distinguish nested unary functions, like -(#0) vs -(-(#0))
  • risks overwriting a constant with a column reference.

The new complexity measure ranks literals as least complex; all other expressions are then ranked by their number of non-leaves.

@philip-stoev (Contributor) commented:

Item No. 1. With this somewhat contrived but possibly realistic query:

EXPLAIN SELECT l_commitDATE AS col58299
  FROM lineitem
  JOIN orders
    ON (l_orderkey = o_orderkey)
  LEFT JOIN customer
    ON (o_custkey = c_custkey)
 WHERE o_orderdate = l_shipDATE - INTERVAL ' 0 DAY '
   AND l_receiptDATE = o_orderdate

This plan contains the following redundant Filters:

 %5 =                                                    +
 | Get %2 (l0)                                           +
 | Filter (#12 = #20), (datetots(#12) = (#10 - 00:00:00))+
...
 %8 =                                                    +
 | Get %2 (l0)                                           +
 | Filter (#12 = #20), (datetots(#20) = (#10 - 00:00:00))+

By contrast, the prior commit has only this:

 %5 = Let l2 =                                           +
 | Get %2 (l0)                                           +
 | Filter (#12 = #20), (datetots(#20) = (#10 - 00:00:00))+

Schema: dbt3-ddl.sql.zip

@philip-stoev (Contributor) left a comment:

Thank you for the unit tests!

I am able to hit this optimization both with my randomized workload that generates TPC-like queries and a purposely-built one that does 3 and 4-table joins with predicates that are similar to the original problematic query.

There are no wrong results or other correctness issues. I examined a lot of plans that have changed, and the ones that have regressed all appear to follow the pattern described in "Item No. 1", which looks to me to be a common subexpression that is no longer detected as such. On the positive side, I see movement of some predicates towards the source. No plans were substantially impacted, e.g. by extra joins, avoidable cross joins, or changes in join order.

Thank you.

@wangandi wangandi changed the title Transform: Minimize the number of leaves in join equivalence expressions. Transform: Minimize the number of non-leaves in join equivalence expressions. Aug 20, 2021
@wangandi wangandi force-pushed the joinequivalencereduce branch 2 times, most recently from 3b0463c to 828dae6 Compare August 23, 2021 19:49
@@ -268,34 +268,34 @@ CREATE INDEX bar_idx3 on bar(a + 4);
 query T multiline
 EXPLAIN PLAN FOR
 select foo.b, bar.b, baz.b
-FROM foo, bar, baz
+FROM bar, foo, baz
@wangandi (Contributor, Author) commented:

In this query, if "foo" came before "bar", then we would not be able to use the index bar(a+4).
The set of equivalence classes (= #0 #2) (= #4 (#2 + 4)) would get rewritten to (= #0 #2) (= #4 (#0 + 4)). Then, JoinImplementation will look for an index on foo(a+4) but not an index on bar(a+4).

I have written up in #8002 that JoinImplementation could be changed to identify arrangements on expressions that are equivalent to expressions in equivalences.

@wangandi wangandi marked this pull request as ready for review August 23, 2021 20:34
@@ -57,11 +117,29 @@ pub fn canonicalize_equivalences(equivalences: &mut Vec<Vec<MirScalarExpr>>) {
equivalences.sort();
}

fn rank_complexity(expr: &MirScalarExpr) -> usize {
Contributor:

This wants a doc comment.

Contributor:

A nit, but why "non-leaves" instead of "nodes" which would simplify the logic a fair bit?

Contributor:

This question still stands? If we want to prevent cycling, why not just "nodes"?

Contributor (Author):

It is true that "nodes" will also prevent cycling because an expression will only be able to be replaced with an expression with equal or fewer nodes. In my mind, having fewer function calls makes an expression "less complex," but ideally in the future, we could rank complexity by the estimated amount of time it takes to compute the expression.

Contributor:

I believe it is true. The main property you want is that there is some measurement that strictly increases when one thing is contained in another, and strictly decreases when a greater thing is replaced by a strictly smaller thing. We can keep it as is, but "nodes" seemed much easier, and I'd love the code to reflect whether it would be a bug if I swooped in and changed it in the future.

/// This function:
/// * ensures the same expression appears in only one equivalence class.
/// * ensures the equivalence classes are sorted and dedupped.
/// * simplifies expressions to involve the least number of non-leaves.
Contributor:

I don't understand what this last bullet point means. If it is important, could you say more?

@frankmcsherry (Contributor) left a comment:

I suspect this is ok, but I'm going to push back on the presentation! I can't tell what the new 60 lines are meant to do, and I'd like to have their intended behavior described some other way than in the code itself. I recommend method-level doc comments with more detailed comments in the method body.

Comment on lines 22 to 25
/// involving one or more arguments. This is to prevent infinite loops/100%
/// CPU situations where transforms can cause the equivalence class to evolve
/// from `[A f(A) f(C, A)]` to `[A f(A) g(C, A) g(C, f(A))]` and then to
/// `[A f(A) g(C, A) g(C, f(A)) g(C, f(f(A)))]` and so on.
Contributor:

I'm sorry, I don't know what this means. The example doesn't make it as clear to me as I suspect it makes it clear to someone who has seen this problem before.

Contributor:

Ok, more reading here, can I recommend the following text instead:

This ensures that we only replace expressions by "even simpler" expressions, so that repeated substitutions reduce the complexity of expressions and a fixed point is certain to be reached. Without this rule, we might repeatedly replace a simple expression with a complex expression containing that (or another replaceable) simple expression, and repeat indefinitely.

Andi Wang added 5 commits September 1, 2021 15:49
This allows distinguishing between the complexity of
CallUnary{CallUnary{Column}} and CallUnary{Column}.

Constants are considered to be less complex than any other type of
MirScalarExpr.
@frankmcsherry (Contributor) left a comment:

Other than the nit about literals being sometimes 0, sometimes 1, this seems great, thanks!

@wangandi wangandi force-pushed the joinequivalencereduce branch 2 times, most recently from b1d8a5b to 46b103d Compare September 1, 2021 21:57
@wangandi (Contributor, Author) commented Sep 1, 2021:

> Nit, but this means that bare literals are 0, but a literal encountered in visit would contribute a 1, right? I don't know that this makes anything wrong, though.

Changed the ranking to be by number of non-literal nodes.

That said, ranking by "nodes" would still prevent cycling, even though bare literals count 0 while a literal encountered in visit would contribute a 1.

It turns out that the property

/// If expressions `e1` and `e2` are such that
/// `complexity(e1) < complexity(e2)` then for all SQL functions `f`,
/// `complexity(f(e1)) < complexity(f(e2))`.

is not required to prevent cycling.

As an example, suppose the following is a set of join equivalences, and the complexity ranking is such that complexity(4) < complexity(#0) but complexity(f(4)) > complexity(f(#0)):
[[4 #0] [f(#0) f(4)] ["string" g(f(#0))]]

After the first round of substitutions, the equivalences look like this:
[[4 #0] [f(4) f(4)] ["string" g(f(4))]]

@wangandi wangandi merged commit 6b7fdeb into MaterializeInc:main Sep 8, 2021
@wangandi wangandi deleted the joinequivalencereduce branch September 8, 2021 17:08