New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make nulls direction configuable for FullSortingMergeJoin #60896
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work on ClickHouse!
Could you please add more details about use-case?
Generally, NULL
s should not affect join, because they are never joined.
In case you would like to have specific NULL
s direction after join, you may just add ORDER BY
with specific modifier, but join itself does not guarantee an order
Thanks for you attention. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's then simply add an argument to the MergeJoinAlgorithm and define it as a constant somewhere outside. This will meet your requirements without introducing significant changes. Especially, adding it as a template argument is questionable since it increases the binary size, but nullableCompareAt
already has a runtime argument now.
@@ -356,7 +359,7 @@ void MergeJoinAlgorithm::consume(Input & input, size_t source_num) | |||
cursors[source_num]->setChunk(std::move(input.chunk)); | |||
} | |||
|
|||
template <JoinKind kind> | |||
template <JoinKind kind, NullOrder nullOrder> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we pass direction as a value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/Core/Settings.h
Outdated
@@ -880,6 +880,7 @@ class IColumn; | |||
M(Int64, ignore_cold_parts_seconds, 0, "Only available in ClickHouse Cloud", 0) \ | |||
M(Int64, prefer_warmed_unmerged_parts_seconds, 0, "Only available in ClickHouse Cloud", 0) \ | |||
M(Bool, iceberg_engine_ignore_schema_evolution, false, "Ignore schema evolution in Iceberg table engine and read all data using latest schema saved on table creation. Note that it can lead to incorrect result", 0) \ | |||
M(Bool, nulls_biggest_in_smj, true, "Treat nulls as biggest in sort. Used in sort merge join for compare null keys.", 0) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not expose it to user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Hi vdimir, thanks for your review.
Do you mean no changes in JoinStep,QueryPipelineBuilder,MergeJoinTransform etc and only add a runtime argument for MergeJoinAlgorithm? JoinPtr smj_join = std::make_shared<FullSortingMergeJoin>(table_join, right->getCurrentDataStream().header.cloneEmpty());
QueryPlanStepPtr join_step
= std::make_unique<DB::JoinStep>(left->getCurrentDataStream(), right->getCurrentDataStream(), smj_join, 8192, 1, false); |
This is an automated comment for commit 96e9043 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
Yes I mean just like that auto null_direction = getNullDirection();
JoinPtr smj_join = std::make_shared<FullSortingMergeJoin>(table_join, right->getCurrentDataStream().header.cloneEmpty(), null_direction);
QueryPlanStepPtr join_step
= std::make_unique<DB::JoinStep>(left->getCurrentDataStream(), right->getCurrentDataStream(), smj_join, 8192, 1, false); Also, to be honest I don't fully understand how it can be useful, because handling nulls is |
Particular null order after join does not matter, but null order before Join matters. -- a simple table
CREATE TABLE item (
i_current_price DECIMAL(7,2),
i_category STRING)
USING parquet;
-- data with nulls
insert into item values
(null,null),
(null,null),
(0.63,null),
(0.74,null),
(null,null),
(90.72,'Books'),
(99.89,'Books'),
(99.41,'Books');
-- the following SQL will use SMJ with join type Left Outer
-- expected result is 2, but gluten returns 0 when offloading to ClickHouse
-- notice there are nulls in join key 'i_category'
SELECT count(*) cnt
FROM item i
where
i.i_current_price > 1.0 *
(SELECT avg(j.i_current_price)
FROM item j
WHERE j.i_category = i.i_category
) The reason is sort node's order is asc and nulls first (both in Spark and ClickHouse's SortStep), but FullSortingMergeJoin joins streams treating data in asc order and nulls last. FullSortingMergeJoin will stop when find left/right table start with null, I think. Simply changing null direction in FullSortingMergeJoin can fix it. |
72b09f9
to
8ff21d7
Compare
CI failures seems unrelated. Ready for review. |
Thanks for your effort on the PR! |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add a flag for SMJ to treat null as biggest/smallest. So the behavior can be compitable with other SQL systems, like Apache Spark.
Documentation entry for user-facing changes