Direct (nested loop) join for merge tree tables #89920
Did direct joins for merge tree tables not work before?
Not quite, it works only if you create
Right, I misread the PR description somehow. Thanks.
Pull request overview
This PR introduces support for direct (nested loop) join with MergeTree tables, controlled by setting join_algorithm = 'direct'. The implementation enables efficient joins between regular tables and MergeTree tables by performing index-based lookups instead of building hash tables.
Key Changes:
- Added `DirectJoinMergeTreeEntity` class that implements the `IKeyValueEntity` interface for MergeTree direct join operations
- Extended the `IKeyValueEntity::getByKeys` interface to support ALL join semantics via an `out_offsets` parameter
- Implemented query plan cloning capabilities for `ReadFromMergeTree` and storage snapshots to enable parallel lookups
- Added comprehensive test coverage for various join types (INNER, LEFT, SEMI LEFT, ANTI LEFT) with MergeTree tables
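
To illustrate how an offsets output can express ALL join semantics, here is a minimal standalone sketch. It does not use ClickHouse types: `LookupResult`, `lookupAll`, and the `std::unordered_multimap` standing in for the right table are hypothetical, and the actual `getByKeys` signature in this PR may differ. For each requested key the function appends every matching row and records the cumulative match count, so key `i` owns the half-open range `[offsets[i-1], offsets[i])` of result rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the result of a key-value lookup with ALL semantics.
struct LookupResult
{
    std::vector<std::string> rows;  // matched right-side values, grouped by requested key
    std::vector<size_t> offsets;    // offsets[i] = total number of matches for keys 0..i
};

// For every key, collect all matching rows (not just the first one).
LookupResult lookupAll(
    const std::unordered_multimap<uint64_t, std::string> & right_table,
    const std::vector<uint64_t> & keys)
{
    LookupResult result;
    result.offsets.reserve(keys.size());

    size_t cumulative_offset = 0;
    for (uint64_t key : keys)
    {
        auto [begin, end] = right_table.equal_range(key);
        for (auto it = begin; it != end; ++it)
        {
            result.rows.push_back(it->second);
            ++cumulative_offset;
        }
        // A key without matches repeats the previous offset, i.e. an empty range.
        result.offsets.push_back(cumulative_offset);
    }

    assert(result.offsets.size() == keys.size());
    return result;
}
```

With this layout a caller can replicate each left row `offsets[i] - offsets[i-1]` times, which is what distinguishes ALL from a one-match-per-key lookup.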
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/Interpreters/DirectJoinMergeTreeEntity.h/cpp | New entity implementing direct join logic for MergeTree tables with IN-based filtering |
| src/Interpreters/IKeyValueEntity.h | Extended interface with out_offsets parameter for ALL join semantics |
| src/Interpreters/DirectJoin.cpp | Updated to handle ALL join semantics with offsets from key-value entities |
| src/Planner/PlannerJoinTree.cpp | Added logic to detect and create DirectJoinMergeTreeEntity when conditions are met |
| src/Processors/QueryPlan/ReadFromMergeTree.h/cpp | Added cloning support and query condition cache controls for multiple lookups |
| src/Storages/StorageSnapshot.h | Added virtual clone() method to Data base class |
| src/Storages/MergeTree/MergeTreeData.h | Implemented clone() for MergeTree snapshot data |
| src/Storages/StorageMemory.h | Implemented clone() for Memory snapshot data |
| src/Storages/Storage*.h/cpp | Updated getByKeys signatures to include out_offsets parameter |
| tests/queries/0_stateless/03712_*.sql/reference | Functional test verifying indexed vs full scan behavior |
| tests/queries/0_stateless/03742_*.sql/reference | Long test verifying different join types with large datasets |
    auto result_index = ColumnUInt64::create();
    auto & result_index_data = result_index->getData();
While reading this, I had a thought that these variables could have slightly more meaningful names. Something that relates to what we actually collect in them, e.g. indices_corresponding_to_right_table_keys or something along those lines. Do you think it would be worth renaming them to improve clarity?
    size_t cumulative_offset = 0;
    for (size_t i = 0; i < num_keys; ++i)
    {
        auto find_result = key_getter.findKey(key_to_rows, i, pool);
The same goes for this variable. I thought a name relating to its meaning might improve readability.

Actually, this is the common name for a findKey result in the codebase: https://github.com/search?q=repo%3AClickHouse%2FClickHouse+%2Ffind_result.*findKey%2F&type=code
    const auto & found_columns = found_chunk.mutateColumns();

    for (const auto & col : found_columns)
        col->insertDefault();
Do I understand correctly that this insert at the end of each column is made in order to have a default (null) value for the rows from the left table for which we did not find anything?
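
The pattern the question describes can be sketched as follows, assuming that reading is correct (the thread shown here does not include a confirmation). This is a standalone illustration, not the PR's code: `gatherWithDefault` and the `std::optional` match index are hypothetical names, and plain `std::string` values stand in for columns. One default value is appended after the found rows, and every left row without a match is pointed at that slot.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: builds one output value per left row.
// match_index[i] is the position in found_values for left row i, or nullopt if nothing was found.
std::vector<std::string> gatherWithDefault(
    std::vector<std::string> found_values,
    const std::vector<std::optional<size_t>> & match_index)
{
    // Append a single default value; its position serves as the shared "not found" slot.
    const size_t default_row = found_values.size();
    found_values.emplace_back();

    std::vector<std::string> output;
    output.reserve(match_index.size());
    for (const auto & idx : match_index)
    {
        assert(!idx || *idx < default_row);  // real matches must point at found rows
        output.push_back(found_values[idx.value_or(default_row)]);
    }
    return output;
}
```

Sharing one default row keeps the gather step uniform: matched and unmatched left rows are both resolved by a plain index lookup.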
The `direct` algorithm performs a lookup in the right table using rows from the left table as keys. It's supported only by special storage such as [Dictionary](/engines/table-engines/special/dictionary) or [EmbeddedRocksDB](../../engines/table-engines/integrations/embedded-rocksdb.md) and only the `LEFT` and `INNER` JOINs.
For MergeTree tables, the algorithm pushes join key filters directly to the storage layer. This can be more efficient when the key can use the table's primary key index for lookups, otherwise it performs full scans of the right table for each left table block.

Supports `INNER` and `LEFT` joins and only single-column equality join keys without other conditions.
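
The difference between the indexed and the full-scan case described in this documentation hunk can be sketched outside ClickHouse. The code below is an illustration under simplifying assumptions, not the storage-layer implementation: a sorted `std::vector` stands in for a table whose primary key is the join key, a `std::set` of keys stands in for the filter built from one left block, and `ScanStats`, `lookupWithSortedKey`, and `lookupWithFullScan` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Row = std::pair<uint64_t, std::string>;  // (join key, payload)

struct ScanStats
{
    std::vector<Row> matched;
    size_t rows_examined = 0;  // rows read after positioning, binary-search probes not counted
};

// Right table sorted by the join key: each key from the left block becomes a narrow range read.
ScanStats lookupWithSortedKey(const std::vector<Row> & sorted_right, const std::set<uint64_t> & block_keys)
{
    assert(std::is_sorted(sorted_right.begin(), sorted_right.end()));

    ScanStats stats;
    for (uint64_t key : block_keys)
    {
        auto first = std::lower_bound(sorted_right.begin(), sorted_right.end(), key,
            [](const Row & row, uint64_t k) { return row.first < k; });
        for (auto it = first; it != sorted_right.end() && it->first == key; ++it)
        {
            stats.matched.push_back(*it);
            ++stats.rows_examined;
        }
    }
    return stats;
}

// No usable ordering: every left block has to filter the whole right table.
ScanStats lookupWithFullScan(const std::vector<Row> & right, const std::set<uint64_t> & block_keys)
{
    ScanStats stats;
    for (const Row & row : right)
    {
        ++stats.rows_examined;
        if (block_keys.count(row.first) != 0)
            stats.matched.push_back(row);
    }
    return stats;
}
```

Both functions return the same matches. The full scan reads the entire right table once per left block, which is why the join key needs to be usable with the primary key index for the direct algorithm to pay off.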
Would it be possible to extend support to range conditions like BETWEEN? My use case is that I have a table sorted by (a, b, timestamp), and a materialized view that aggregates some_id, a, b, min(timestamp), max(timestamp) GROUP BY ALL. I'd like to query the materialized view to receive a, b, min_t, max_t "ranges", and then directly join these ranges against the original MergeTree table, where each condition is a single consecutive range and should be extremely efficient to find.
@EmeraldShift does ASOF JOIN not work because it finds only a single first/last match, but you need all of them? Does your query work with a regular hash join (in some cases we can execute complex conditions, but maybe in an inefficient way)? So does your approach fail because of an error like "not supported", or because the query takes too long or consumes too much memory?
Is your query like the following?
    SELECT * FROM mv JOIN table
    ON table.timestamp BETWEEN mv.min_t AND mv.max_t AND table.a = mv.a AND table.b = mv.b
Overall, supporting nested loops for arbitrary conditions would be nice, but we haven't planned it yet. You may create an issue with a more detailed explanation of the use case and why you think nested loops might work better than hash, just to track it.
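
The observation behind the request, that on a table sorted by `(a, b, timestamp)` each `(a, b, min_t, max_t)` condition selects one consecutive run of rows, can be sketched in standalone C++. This is only an illustration of the idea, not something ClickHouse does for joins today: `SortedRow` and `findRange` are hypothetical names, and a sorted `std::vector` of tuples stands in for the MergeTree table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <utility>
#include <vector>

using SortedRow = std::tuple<uint32_t, uint32_t, int64_t>;  // (a, b, timestamp)

// Returns the half-open position range [first, last) of rows with the given a and b
// and min_t <= timestamp <= max_t. The table must be sorted by (a, b, timestamp).
std::pair<size_t, size_t> findRange(
    const std::vector<SortedRow> & table, uint32_t a, uint32_t b, int64_t min_t, int64_t max_t)
{
    assert(std::is_sorted(table.begin(), table.end()));

    // Tuples compare lexicographically, which matches the sort order of the table.
    auto first = std::lower_bound(table.begin(), table.end(), SortedRow{a, b, min_t});
    auto last = std::upper_bound(first, table.end(), SortedRow{a, b, max_t});
    return {static_cast<size_t>(first - table.begin()), static_cast<size_t>(last - table.begin())};
}
```

Two binary searches locate each range, so the cost per range depends on the logarithm of the table size plus the number of matching rows rather than on the whole table.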
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
`direct` (nested loop) join for MergeTree tables. To use it, specify it as the single option in the setting: `join_algorithm = 'direct'`
Documentation entry for user-facing changes