Direct (nested loop) join for merge tree tables#89920

Merged
vdimir merged 8 commits into master from vdimir/nested_loop_with_mt
Dec 16, 2025

@vdimir
Member

@vdimir vdimir commented Nov 12, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

  • Support direct (nested loop) join for MergeTree tables. To use it, specify it as the single option in the setting: join_algorithm = 'direct'

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Contributor

clickhouse-gh bot commented Nov 12, 2025

Workflow [PR], commit [8381b14]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| BuzzHouse (amd_debug) | | failure | | |
| | Logical error: 'Inconsistent AST formatting in SelectQuery: the query: | FAIL | cidb | |

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Nov 12, 2025
@rschu1ze
Member

Did direct joins for merge tree tables not work before?

@vdimir
Member Author

vdimir commented Nov 17, 2025

> Did direct joins for merge tree tables not work before?

@rschu1ze

Not quite: it works only if you create a dictionary with LAYOUT(DIRECT()) on top of the table and join with that dictionary.

@rschu1ze
Member

Right, I misread the PR description somehow. Thanks.

@vdimir vdimir force-pushed the vdimir/nested_loop_with_mt branch from 1c690ac to 8a48c19 Compare November 26, 2025 10:11
@clickhouse-gh clickhouse-gh bot added the submodule changed At least one submodule changed in this PR. label Nov 28, 2025
@vdimir vdimir force-pushed the vdimir/nested_loop_with_mt branch 3 times, most recently from 8445be6 to afb97da Compare December 2, 2025 08:45
@vdimir vdimir removed the submodule changed At least one submodule changed in this PR. label Dec 2, 2025
@vdimir vdimir force-pushed the vdimir/nested_loop_with_mt branch from d18fa87 to 6d4fc19 Compare December 3, 2025 11:26
@vdimir vdimir changed the title [wip] Implementing direct (aka nested loop) join for merge tree tables Direct (nested loop) join for merge tree tables Dec 3, 2025
@vdimir vdimir marked this pull request as ready for review December 3, 2025 11:58
@Fgrtue Fgrtue self-assigned this Dec 3, 2025
@Fgrtue Fgrtue requested a review from Copilot December 10, 2025 12:02
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces support for direct (nested loop) join with MergeTree tables, controlled by setting join_algorithm = 'direct'. The implementation enables efficient joins between regular tables and MergeTree tables by performing index-based lookups instead of building hash tables.

Key Changes:

  • Added DirectJoinMergeTreeEntity class that implements IKeyValueEntity interface for MergeTree direct join operations
  • Extended IKeyValueEntity::getByKeys interface to support ALL join semantics via an out_offsets parameter
  • Implemented query plan cloning capabilities for ReadFromMergeTree and storage snapshots to enable parallel lookups
  • Added comprehensive test coverage for various join types (INNER, LEFT, SEMI LEFT, ANTI LEFT) with MergeTree tables
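To illustrate the `out_offsets` idea from the list above, here is a minimal, self-contained sketch of a key lookup with ALL semantics. It is not the PR's code: `LookupResult` and `getByKeysAll` are hypothetical names, and a `std::unordered_multimap` stands in for the right-hand storage. The assumption is that offsets are cumulative, so the matches for key `i` occupy the half-open range `[offsets[i - 1], offsets[i])` in the result rows.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical result type; not the real IKeyValueEntity interface.
struct LookupResult
{
    std::vector<std::uint64_t> rows;   // matched right-side values, grouped by key in key order
    std::vector<std::size_t> offsets;  // offsets[i] = total number of matches for keys[0..i]
};

// Looks up every key and keeps all matches (ALL semantics), not just the first one.
LookupResult getByKeysAll(
    const std::unordered_multimap<std::uint64_t, std::uint64_t> & right,
    const std::vector<std::uint64_t> & keys)
{
    LookupResult result;
    result.offsets.reserve(keys.size());

    std::size_t cumulative_offset = 0;
    for (std::uint64_t key : keys)
    {
        auto range = right.equal_range(key);
        for (auto it = range.first; it != range.second; ++it)
        {
            result.rows.push_back(it->second);
            ++cumulative_offset;
        }
        // A key without matches repeats the previous offset, i.e. an empty range.
        result.offsets.push_back(cumulative_offset);
    }
    return result;
}
```

With cumulative offsets, the caller can replicate each left row by `offsets[i] - offsets[i - 1]` without a second pass over the keys; a difference of zero marks a key with no match.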

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| `src/Interpreters/DirectJoinMergeTreeEntity.h/cpp` | New entity implementing direct join logic for MergeTree tables with IN-based filtering |
| `src/Interpreters/IKeyValueEntity.h` | Extended interface with out_offsets parameter for ALL join semantics |
| `src/Interpreters/DirectJoin.cpp` | Updated to handle ALL join semantics with offsets from key-value entities |
| `src/Planner/PlannerJoinTree.cpp` | Added logic to detect and create DirectJoinMergeTreeEntity when conditions are met |
| `src/Processors/QueryPlan/ReadFromMergeTree.h/cpp` | Added cloning support and query condition cache controls for multiple lookups |
| `src/Storages/StorageSnapshot.h` | Added virtual clone() method to Data base class |
| `src/Storages/MergeTree/MergeTreeData.h` | Implemented clone() for MergeTree snapshot data |
| `src/Storages/StorageMemory.h` | Implemented clone() for Memory snapshot data |
| `src/Storages/Storage*.h/cpp` | Updated getByKeys signatures to include out_offsets parameter |
| `tests/queries/0_stateless/03712_*.sql/reference` | Functional test verifying indexed vs full scan behavior |
| `tests/queries/0_stateless/03742_*.sql/reference` | Long test verifying different join types with large datasets |

Comment on lines +236 to +237
auto result_index = ColumnUInt64::create();
auto & result_index_data = result_index->getData();
Contributor

While reading this, I thought these variables could have slightly more meaningful names, something that relates to what we actually collect in them, e.g. indices_corresponding_to_right_table_keys or similar. Do you think renaming them would be worth it for clarity?

Member Author

Changed to selector

size_t cumulative_offset = 0;
for (size_t i = 0; i < num_keys; ++i)
{
auto find_result = key_getter.findKey(key_to_rows, i, pool);
Contributor

The same goes for this variable: a name that relates to its meaning might improve readability.

Member Author

const auto & found_columns = found_chunk.mutateColumns();

for (const auto & col : found_columns)
col->insertDefault();
Contributor

Do I understand correctly that this insert at the end of each column is there to provide a default null value for the rows from the left table for which we did not find anything?

Member Author

Yes, added a comment
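For readers following this thread, a rough sketch of the "default row at the end" technique may help. It is an illustration only, not the code under review: `fillLeftJoinColumn` and its parameters are hypothetical, and plain `std::vector<std::string>` stands in for a column. One default value is appended after the found rows, and every unmatched left row selects that extra row.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper, not ClickHouse code.
// `found` holds right-side values for matched keys; `match[i]` is the index into
// `found` for left row i, or std::nullopt when nothing was found for that row.
std::vector<std::string> fillLeftJoinColumn(
    std::vector<std::string> found,
    const std::vector<std::optional<std::size_t>> & match)
{
    // Append a single default value; all unmatched left rows will point at it.
    const std::size_t default_index = found.size();
    found.emplace_back();

    // The selector maps each left row to a row of `found` (or to the default row).
    std::vector<std::size_t> selector;
    selector.reserve(match.size());
    for (const auto & m : match)
        selector.push_back(m.value_or(default_index));

    // Materialize the output column by indexing `found` with the selector.
    std::vector<std::string> result;
    result.reserve(selector.size());
    for (std::size_t index : selector)
        result.push_back(found[index]);
    return result;
}
```

The point of the extra row is that the final gather step needs no special case for misses: matched and unmatched rows go through the same indexing path.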

@vdimir vdimir force-pushed the vdimir/nested_loop_with_mt branch from 6d4fc19 to 71a1963 Compare December 15, 2025 10:14
@vdimir vdimir force-pushed the vdimir/nested_loop_with_mt branch from 1ce51ae to 8381b14 Compare December 15, 2025 15:45
@vdimir vdimir enabled auto-merge December 16, 2025 12:04
@vdimir vdimir added this pull request to the merge queue Dec 16, 2025
Merged via the queue into master with commit 9e1dee9 Dec 16, 2025
250 of 256 checks passed
@vdimir vdimir deleted the vdimir/nested_loop_with_mt branch December 16, 2025 13:31
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Dec 16, 2025
The `direct` algorithm performs a lookup in the right table using rows from the left table as keys. It's supported only by special storage such as [Dictionary](/engines/table-engines/special/dictionary) or [EmbeddedRocksDB](../../engines/table-engines/integrations/embedded-rocksdb.md) and only the `LEFT` and `INNER` JOINs.
For MergeTree tables, the algorithm pushes join key filters directly to the storage layer. This can be more efficient when the key can use the table's primary key index for lookups, otherwise it performs full scans of the right table for each left table block.

Supports `INNER` and `LEFT` joins and only single-column equality join keys without other conditions.
Contributor

@EmeraldShift EmeraldShift Dec 22, 2025

Would it be possible to extend support to range conditions like BETWEEN? My use case is that I have a table sorted by (a, b, timestamp), and a materialized view that aggregates some_id, a, b, min(timestamp), max(timestamp) GROUP BY ALL. I'd like to query the materialized view to receive a, b, min_t, max_t "ranges", and then directly join these ranges against the original MergeTree table, where each condition is a single consecutive range and should be extremely efficient to find.

Member Author

@EmeraldShift does ASOF JOIN not work because it finds only a single first/last match, but you need all of them? Does your query work with a regular hash join (in some cases we can execute complex conditions, but maybe in an inefficient way)? So does your approach fail because of an error like "not supported" or because the query is too long/consumes too much memory?

Is your query like the following?

SELECT * FROM mv JOIN table
ON table.timestamp BETWEEN mv.min_t AND mv.max_t AND table.a = mv.a AND table.b = mv.b

Overall, supporting nested loops for arbitrary conditions would be nice, but we haven't planned it yet. You may create an issue with a more detailed explanation of the use case and why you think nested loops might work better than hash, just to track it.
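To make the requested range lookup concrete, here is a small sketch of why each `(a, b, min_t, max_t)` condition maps to one consecutive run of rows when the data is sorted by `(a, b, timestamp)`. This is not something the PR implements, and it says nothing about how ClickHouse would do it: `Row` and `findRange` are hypothetical, and an in-memory sorted vector stands in for the table's sorting key.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical row layout mirroring a table sorted by (a, b, timestamp).
struct Row
{
    std::uint32_t a;
    std::uint32_t b;
    std::int64_t timestamp;
};

// Returns the half-open index range [first, last) of rows with the given a and b
// and min_t <= timestamp <= max_t. Requires `sorted_rows` sorted by (a, b, timestamp).
std::pair<std::size_t, std::size_t> findRange(
    const std::vector<Row> & sorted_rows,
    std::uint32_t a, std::uint32_t b, std::int64_t min_t, std::int64_t max_t)
{
    auto key = [](const Row & r) { return std::make_tuple(r.a, r.b, r.timestamp); };

    // First row whose key is >= (a, b, min_t).
    auto first = std::lower_bound(
        sorted_rows.begin(), sorted_rows.end(), std::make_tuple(a, b, min_t),
        [&](const Row & r, const auto & k) { return key(r) < k; });

    // First row after `first` whose key is > (a, b, max_t).
    auto last = std::upper_bound(
        first, sorted_rows.end(), std::make_tuple(a, b, max_t),
        [&](const auto & k, const Row & r) { return k < key(r); });

    return {static_cast<std::size_t>(first - sorted_rows.begin()),
            static_cast<std::size_t>(last - sorted_rows.begin())};
}
```

Each lookup costs two binary searches, which is the efficiency argument made in the comment above; whether a nested loop built on this beats a hash join would still depend on how many ranges the left side produces.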


Labels

pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo

6 participants