Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full sorting merge join, pt 1 #35796

Merged
merged 35 commits into from Jul 7, 2022
Merged

Conversation

vdimir
Copy link
Member

@vdimir vdimir commented Mar 31, 2022

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Added full sorting merge join algorithm

@vdimir vdimir added the do not test disable testing on pull request label Mar 31, 2022
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-feature Pull request with new product feature label Mar 31, 2022
@vdimir vdimir force-pushed the full-sorting-merge-join branch 3 times, most recently from 2710e9e to d6452fa Compare April 6, 2022 14:40
@vdimir vdimir removed the do not test disable testing on pull request label Apr 6, 2022
@vdimir vdimir force-pushed the full-sorting-merge-join branch 3 times, most recently from 2f92fc9 to 70a6f1a Compare May 3, 2022 10:04
@vdimir vdimir force-pushed the full-sorting-merge-join branch 2 times, most recently from 58bc152 to 5fd0d4f Compare May 4, 2022 10:51
@vdimir
Copy link
Member Author

vdimir commented May 16, 2022

Stateless tests flaky check (address, actions) — Tests are not finished, fail: 0, passed: 360

Still need to investigate

UPD: seems that server just hadn't enough time to shut down

@vdimir vdimir marked this pull request as ready for review May 16, 2022 11:21
@vdimir
Copy link
Member Author

vdimir commented May 18, 2022

This PR contains bunch of changed and added tests and lots of them are long, so seems that timeout 1800 bash -c run_tests in Stateless Tests Flaky Check is not enough. Should we fix test scripts to adjust timeout to number of tests, or just merge?

@vdimir vdimir changed the title [WIP] Full sorting merge join Full sorting merge join, pt 1 May 18, 2022
@vdimir vdimir force-pushed the full-sorting-merge-join branch 2 times, most recently from 130d235 to 96d67e0 Compare May 20, 2022 12:00
@yakov-olkhovskiy yakov-olkhovskiy self-assigned this May 27, 2022
Comment on lines +48 to +53
bool type_equals
= table_join->hasUsing() ? left_type->equals(*right_type) : removeNullable(left_type)->equals(*removeNullable(right_type));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, does it mean we can't join like UInt32 and UInt64 columns?
Maybe we need to find the least common type (at least for numeric columns)

Copy link
Member Author

@vdimir vdimir Jul 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We add conversion on previous steps, so UInt32 and UInt64 would work. Here is just to check once more that type conversion were added successfully

}

/// Used just to get result header
void joinBlock(Block & block, std::shared_ptr<ExtraBlock> & /* not_processed */) override
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we need to update IJoin interface a little bit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(later)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To have something like IJoin::transformHeader?

@@ -43,7 +64,8 @@ class IJoin

/// StorageJoin/Dictionary is already filled. No need to call addJoinedBlock.
/// Different query plan is used for such joins.
virtual bool isFilled() const { return false; }
virtual bool isFilled() const { return pipelineType() == JoinPipelineType::FilledRight; }
virtual JoinPipelineType pipelineType() const { return JoinPipelineType::FillRightFirst; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like JoinPipelineType enum, but could you add a comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some comment at enum class JoinPipelineType declaration, but I've added short info here as well https://github.com/vdimir/ClickHouse/blob/c262d4d2c7c3551d49d1b8d4f324e4a3a550698f/src/Interpreters/IJoin.h#L20-L39

@@ -105,6 +115,14 @@ class QueryPipelineBuilder
bool keep_left_read_in_order,
Processors * collected_processors = nullptr);

static std::unique_ptr<QueryPipelineBuilder> joinPipelines2(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a better name here :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to QueryPipelineBuilder::joinPipelinesYShaped and QueryPipelineBuilder::joinPipelinesRightLeft

Copy link
Member

@yakov-olkhovskiy yakov-olkhovskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM mostly

@vdimir
Copy link
Member Author

vdimir commented Jul 7, 2022

@KochetovNicolai @yakov-olkhovskiy let's merge?

@alanpaulkwan
Copy link

Hi, I have a question about how this is supposed to work. I posted an issue here: #39542

if (!cursors[1]->cursor.isValid() && !cursors[1]->fullyCompleted())
return Status(1);

if (auto result = handleAllJoinState())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just I am curious, why call handleAllJoinState() method here instead of calling it in allJoin(), and calling handleAnyJoinState is located in anyJoin() method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I don't remember exactly. Perhaps handle*State functions rely on the code below (/// check if blocks are not intersecting at all) in a different ways. Or it's also possible that there's no particular reason, need to dive into the code to understand.

{
assert(state == nullptr);
state = std::make_unique<AllJoinState>(left_cursor.cursor, lpos, right_cursor.cursor, rpos);
state->addRange(0, left_cursor.getCurrent().clone(), lpos, lnum);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we only copy the necessary rows from left_cursor.getCurrent() chunk and right_cursor.getCurrent() chunk?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature Pull request with new product feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants