Workflow [PR], commit [294b356] Summary: ❌
AI Review
Summary
This PR adds transitive predicate inference for join-order optimization behind …
Findings
ClickHouse Rules
Final Verdict
| /// Column equivalence classes derived from equi-join edges (e.g., A.x = B.x AND B.x = C.x | ||
| /// implies A.x, B.x, C.x are all equivalent). Used by the join order optimizer to detect | ||
| /// transitive connectivity between relations without synthesizing extra edges. | ||
| EquivalenceClasses<RelColumn, RelColumnHash> column_equivalences; |
Why can’t we use JoinActionRef instead of RelColumn? Is it because JoinActionRef would be different due to aliases in the DAG?
Changed it to use JoinActionRef-s
| /// | ||
| /// T must be equality-comparable; Hash must be a hash function for T. | ||
| template <typename T, typename Hash = std::hash<T>> | ||
| struct EquivalenceClasses |
There’s also another implementation of disjoint sets here, but I don’t think it’s worth unifying them or choosing one over the other. It probably wouldn’t make the code any cleaner, and performance isn’t critical in this case.
ClickHouse/src/Processors/QueryPlan/Optimizations/filterPushDown.cpp
Lines 229 to 302 in 34c6d70
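For readers unfamiliar with the disjoint-set idea discussed above, here is a minimal sketch of how equivalence classes over join columns could be tracked. It is not the PR's `EquivalenceClasses` nor the code in filterPushDown.cpp; the name `DisjointSets` and its methods `add` and `equivalent` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

/// Hypothetical sketch (not the PR's EquivalenceClasses): a union-find structure
/// over hashable, equality-comparable values.
template <typename T, typename Hash = std::hash<T>>
class DisjointSets
{
public:
    /// Record that a and b are equivalent, e.g. from an equi-join predicate a = b.
    void add(const T & a, const T & b)
    {
        size_t root_a = find(idOf(a));
        size_t root_b = find(idOf(b));
        if (root_a != root_b)
            parent[root_a] = root_b;
    }

    /// True if both values have been seen and belong to the same class.
    bool equivalent(const T & a, const T & b)
    {
        auto it_a = ids.find(a);
        auto it_b = ids.find(b);
        if (it_a == ids.end() || it_b == ids.end())
            return false;
        return find(it_a->second) == find(it_b->second);
    }

private:
    /// Assign a dense index to a value on first use; a new value is its own root.
    size_t idOf(const T & value)
    {
        auto [it, inserted] = ids.try_emplace(value, parent.size());
        if (inserted)
            parent.push_back(it->second);
        return it->second;
    }

    /// Find the class representative, shortening the path along the way (path halving).
    size_t find(size_t x)
    {
        assert(x < parent.size());
        while (parent[x] != x)
        {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    std::unordered_map<T, size_t, Hash> ids;
    std::vector<size_t> parent;
};
```

With column names used as plain strings, adding `A.x = B.x` and `B.x = C.x` makes `equivalent("A.x", "C.x")` return true, which is the transitive inference the PR relies on.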
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| SELECT * FROM t1 ALL INNER JOIN tj ON t1.key1 = tj.key1 AND t1.key2 = tj.key2 AND t1.key3 = tj.key3 AND t1.key1 = tj.key1; -- { serverError INCOMPATIBLE_TYPE_OF_JOIN } | ||
| SELECT '--- duplicate key dedup ---'; | ||
| SELECT * FROM t1 ALL INNER JOIN tj ON t1.key1 = tj.key1 AND t1.key2 = tj.key2 AND t1.key3 = tj.key3 AND t1.key1 = tj.key1 SETTINGS enable_analyzer = 0; -- { serverError INCOMPATIBLE_TYPE_OF_JOIN } | ||
| SELECT * FROM t1 ALL INNER JOIN tj ON t1.key1 = tj.key1 AND t1.key2 = tj.key2 AND t1.key3 = tj.key3 AND t1.key1 = tj.key1 ORDER BY t1.key1 SETTINGS enable_analyzer = 1; -- Whith analyzer duplicate predicates are removed |
Typo: Whith -> With in this test comment.
| @@ -194,7 +128,7 @@ struct DecorrelationContext | |||
| CorrelatedPlanStepMap correlated_plan_steps; | |||
| /// Equivalence classes stack for subqeiries. Equivalence classes should not be propagated | |||
Typo: subqeiries -> subqueries.
464a172 to 357db9b
| UInt64 max_ndv = std::max(lhs_ndv, rhs_ndv); | ||
| if (max_ndv > 0) | ||
| selectivity = std::min(selectivity, 1.0 / static_cast<double>(max_ndv)); | ||
| break; |
computeSelectivity currently uses the first matching pair from an equivalence class and then breaks. Because class traversal comes from unordered_map/unordered_set, that pair selection can vary and make selectivity estimates (and therefore join-order choice) unstable. Please evaluate all candidates across the class (or pick a deterministic key and strongest selectivity) before finalizing selectivity.
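One way to address the instability described above is to scan every candidate pair and keep the strongest estimate. The sketch below illustrates that idea under stated assumptions: `CandidatePair` and `strongestSelectivity` are hypothetical names, and the NDV values are assumed to be already looked up, with 0 meaning "unknown". It is not the PR's `computeSelectivity`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

/// Hypothetical stand-in for one candidate column pair from an equivalence class:
/// the number of distinct values (NDV) on each side, 0 meaning "unknown".
struct CandidatePair
{
    uint64_t lhs_ndv = 0;
    uint64_t rhs_ndv = 0;
};

/// Scan all candidates instead of breaking on the first match and keep the
/// smallest (strongest) selectivity. A minimum does not depend on the order
/// in which candidates are visited, so hash-container iteration order cannot
/// change the result.
double strongestSelectivity(const std::vector<CandidatePair> & candidates)
{
    double selectivity = 1.0;
    for (const auto & pair : candidates)
    {
        uint64_t max_ndv = std::max(pair.lhs_ndv, pair.rhs_ndv);
        if (max_ndv > 0)
            selectivity = std::min(selectivity, 1.0 / static_cast<double>(max_ndv));
    }
    assert(selectivity > 0.0 && selectivity <= 1.0);
    return selectivity;
}
```

Whether the strongest estimate is the right choice for the optimizer is a separate question; the point here is only that the result is the same for any traversal order.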
| @@ -72,6 +72,9 @@ struct QueryPlanOptimizationSettings | |||
| /// Maximum number of tables in query graph to reorder | |||
| UInt64 query_plan_optimize_join_order_limit; | |||
|
| /// Infer transitive equi-join predicates (e.g., A.x=B.x AND B.x=C.x implies A.x=C.x) | |||
This adds a new user-visible optimization knob, but the PR template still leaves documentation unchecked. Please add/update docs for enable_join_transitive_predicates (what it changes, safety constraints, and an example) so users can discover and roll it out safely.
b9b508d to d8f8bce
| auto edge = getApplicableExpressions(left->relations, right->relations); | ||
| if (edge.empty() && best_plan) | ||
| bool connected = !edge.empty() | ||
| || query_graph.areTransitivelyConnected(left->relations, right->relations); |
So, below we can add a join with an empty applied_edge, assuming the predicate will be added later via cleanupJoinPredicates. I was wondering what happens to the existing ones, and why the checks for non_applied_edges are still valid. It seems we will apply them anyway later, as soon as the required relations are present, and then remove them in cleanupJoinPredicates as well. It’s a bit tricky, but it should be okay.
Yes, this is done the same way as for DPsize, but there we might have many entries in the DP table that will not be included in the final plan, and creating real JoinPredicates for them upfront might be wasteful.
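To make the connectivity check in this thread concrete, here is a rough sketch of how two relation sets could be tested for a transitive link without materializing a predicate. It assumes each equivalence class has been reduced to the set of relations that contribute a column to it; `RelationSet`, `MAX_RELATIONS`, and `transitivelyConnected` are hypothetical names, and the PR's `areTransitivelyConnected` may work differently.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

/// Hypothetical fixed-size relation set; the PR uses its own BitSet type.
constexpr size_t MAX_RELATIONS = 64;
using RelationSet = std::bitset<MAX_RELATIONS>;

/// class_relations holds, for each column equivalence class, the relations
/// that have at least one column in that class (an assumption of this sketch).
/// Two sides are transitively connected if some class touches both of them.
bool transitivelyConnected(
    const std::vector<RelationSet> & class_relations,
    const RelationSet & left,
    const RelationSet & right)
{
    /// The two sides of a join are expected to be disjoint relation sets.
    assert((left & right).none());

    for (const auto & relations : class_relations)
        if ((relations & left).any() && (relations & right).any())
            return true;
    return false;
}
```

Under this scheme no join predicate object is created during enumeration; only the final plan would need real predicates, which matches the cost concern raised in the reply above.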
79d57bc to af1d306
| /// Column equivalence classes derived from equi-join edges (e.g., A.x = B.x AND B.x = C.x | ||
| /// implies A.x, B.x, C.x are all equivalent). Used by the join order optimizer to detect | ||
| /// transitive connectivity between relations without synthesizing extra edges. | ||
| /// Stored as alias-resolved JoinActionRef-s pointing to INPUT nodes. |
Typo in comment: JoinActionRef-s should be JoinActionRefs (or JoinActionRef in plural context) for readability.
| double computeSelectivity(const std::vector<JoinActionRef *> & edges, const BitSet & left, const BitSet & right); | ||
| size_t getColumnStats(BitSet rels, const String & column_name); | ||
|
| /// Peridically called from potentially long running optimization to check time limits and send progress |
Typo in comment: Peridically should be Periodically.
LLVM Coverage Report
Changed lines: 93.35% (295/316) | lost baseline coverage: 48 line(s) · Uncovered code |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
The join order optimizer now infers transitive equi-join predicates from existing join conditions. For example, given A.x = B.x AND B.x = C.x, the equivalence A.x = C.x is recognized, allowing the optimizer to consider direct joins between transitively connected tables. This can improve plan quality for star and snowflake schemas where dimension tables connect through a shared fact table. The feature is controlled by the new enable_join_transitive_predicates setting (off by default).
Documentation entry for user-facing changes