
Conversation

@vdimir (Member) commented Sep 30, 2025

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

  • Add new joined_block_split_single_row setting to reduce memory usage in hash joins with many matches per key. This allows hash join results to be chunked even within matches for a single left table row, which is particularly useful when one row from the left table matches thousands or millions of rows from the right table. Previously, all matches had to be materialized at once in memory. This reduces peak memory usage but may increase CPU usage.
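The core idea behind the setting can be illustrated with a minimal sketch (not ClickHouse code; the function name and signature are made up for illustration): when one left row has many matches, emit the result in chunks of at most `max_rows` rows instead of materializing all matches in one block.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of chunked emission: one left-table row with `matches`
// matching right-table rows is split into output chunks of at most `max_rows`
// rows each, instead of being materialized as a single huge block.
std::vector<uint64_t> chunkMatches(uint64_t matches, uint64_t max_rows)
{
    std::vector<uint64_t> chunk_sizes;
    for (uint64_t emitted = 0; emitted < matches; emitted += max_rows)
        chunk_sizes.push_back(std::min(max_rows, matches - emitted));
    return chunk_sizes;
}
```

Peak memory is then bounded by the chunk size rather than by the number of matches, at the cost of extra bookkeeping per chunk (hence the possible CPU overhead mentioned above).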

Additionally, this PR fixes bug #87974, which was discovered during implementation and reproduces on master.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh bot commented Sep 30, 2025

Workflow [PR], commit [cd8f877]

Summary:

Integration tests (amd_asan, old analyzer, 2/6): error

  • test_restore_replica/test.py::test_fix_metadata_version_on_attach_part_after_restore: FAIL
  • test_lost_part/test.py::test_lost_part_other_replica: FAIL
  • test_concurrent_ttl_merges/test.py::test_limited_ttl_merges_in_empty_pool: FAIL
  • OOM in dmesg: FAIL

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Sep 30, 2025
@vdimir vdimir force-pushed the vdimir/hash_join_granular_output branch from 17f6a9b to d5f70dd Compare September 30, 2025 13:43
@vdimir vdimir force-pushed the vdimir/hash_join_granular_output branch from d5f70dd to 28079bd Compare September 30, 2025 14:44
@vdimir vdimir force-pushed the vdimir/hash_join_granular_output branch from f9261db to e4f84ea Compare September 30, 2025 17:42
@vdimir vdimir marked this pull request as ready for review October 2, 2025 08:10
@nickitat nickitat self-assigned this Oct 2, 2025
DECLARE(UInt64, min_joined_block_size_bytes, 512 * 1024, R"(
Minimum block size in bytes for JOIN input and output blocks (if join algorithm supports it). Small blocks will be squashed. 0 means unlimited.
)", 0) \
DECLARE(Bool, joined_block_split_single_row, false, R"(
Member:
Do we have perf test results with this setting enabled?

}

void setFilter(IColumn::Filter && filter_) { filter = std::move(filter_); }
void setFilter(const IColumn::Filter * filter_) { filter = std::cref(*filter_); }
Member:

Can we limit ourselves to just this one overload?

Member Author:

There are two branches, for which I created an owning and a non-owning member, but it seems we can move in both cases, since the branch that uses the filter directly is supposed to be the last call of next.
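For context, a member that accepts both overloads above can be sketched like this (a minimal illustration with hypothetical names, not the actual ClickHouse class): a `std::variant` holds the filter either by value (owning) or via `std::reference_wrapper` (non-owning), matching the `std::move` and `std::cref` assignments in the two `setFilter` overloads.

```cpp
#include <cstdint>
#include <functional>
#include <variant>
#include <vector>

// Hypothetical filter type; ClickHouse's IColumn::Filter is similar in spirit.
using Filter = std::vector<uint8_t>;

struct FilterHolder
{
    // Owning alternative (Filter) or non-owning view (reference_wrapper).
    std::variant<Filter, std::reference_wrapper<const Filter>> filter;

    void setFilter(Filter && filter_) { filter = std::move(filter_); }       // owning
    void setFilter(const Filter * filter_) { filter = std::cref(*filter_); } // non-owning

    const Filter & get() const
    {
        if (const auto * owned = std::get_if<Filter>(&filter))
            return *owned;
        return std::get<std::reference_wrapper<const Filter>>(filter).get();
    }
};
```

The reviewer's point is that if the non-owning branch is always the last use, the view alternative can be dropped and everything can be moved into the owning member, leaving a single overload.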

Comment on lines 221 to 236
UInt64 out = 0;
UInt64 prev = 0;

for (size_t i = 0, n = offsets.size(); i < n; ++i)
{
const UInt64 curr = offsets[i];

const UInt64 start = std::max(prev, shift);
const UInt64 stop = std::min(curr, end);

if (start < stop)
out += (stop - start);

out_offsets[i] = out;
prev = curr;
}
Member:

As far as I can see, it is equivalent to

out_offsets[i] = std::min(offsets[i], end);
out_offsets[i] = std::max<int64_t>(0, out_offsets[i] - shift);

If so, the version above might be easier to read IMO.
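The equivalence holds because `offsets` is a non-decreasing cumulative-offsets column, so the clipped segment lengths telescope. A quick sketch checking both variants against each other (function names and the free-standing form are made up for this check; the bodies mirror the two snippets above):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Original loop: accumulate the lengths of [prev, curr) clipped to [shift, end).
std::vector<uint64_t> clampLoop(const std::vector<uint64_t> & offsets, uint64_t shift, uint64_t end)
{
    std::vector<uint64_t> out_offsets(offsets.size());
    uint64_t out = 0;
    uint64_t prev = 0;
    for (size_t i = 0, n = offsets.size(); i < n; ++i)
    {
        const uint64_t curr = offsets[i];
        const uint64_t start = std::max(prev, shift);
        const uint64_t stop = std::min(curr, end);
        if (start < stop)
            out += stop - start;
        out_offsets[i] = out;
        prev = curr;
    }
    return out_offsets;
}

// Suggested direct form: clamp each cumulative offset to [shift, end], then shift.
std::vector<uint64_t> clampDirect(const std::vector<uint64_t> & offsets, uint64_t shift, uint64_t end)
{
    std::vector<uint64_t> out_offsets(offsets.size());
    for (size_t i = 0; i < offsets.size(); ++i)
    {
        out_offsets[i] = std::min(offsets[i], end);
        out_offsets[i] = std::max<int64_t>(0, static_cast<int64_t>(out_offsets[i]) - static_cast<int64_t>(shift));
    }
    return out_offsets;
}
```

Both produce `max(0, min(offsets[i], end) - shift)`, provided the offsets are monotone and `shift <= end`.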

DECLARE(UInt64, min_joined_block_size_bytes, 512 * 1024, R"(
Minimum block size in bytes for JOIN input and output blocks (if join algorithm supports it). Small blocks will be squashed. 0 means unlimited.
)", 0) \
DECLARE(Bool, joined_block_split_single_row, false, R"(
Member:

Let's please add a dedicated perf test where each left row has a good number of matches on the right side, and run it with and without the new logic.

Member Author:

I've added the stateless test 03633_join_many_matches_limit.sql.j2, which verifies that the setting correctly limits the result block size. Do you think a performance test is still necessary? I believe the main issue this addresses is preventing memory usage from exploding.

Member:

Do you think a performance test is still necessary?

Yes, we should understand the performance implications of enabling this setting.

@vdimir vdimir enabled auto-merge October 9, 2025 16:44
@vdimir vdimir added this pull request to the merge queue Oct 10, 2025
Merged via the queue into master with commit 2fec69d Oct 10, 2025
239 of 242 checks passed
@vdimir vdimir deleted the vdimir/hash_join_granular_output branch October 10, 2025 08:48
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Oct 10, 2025

Labels

  • pr-improvement: Pull request with some product improvements
  • pr-synced-to-cloud: The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wrong join output with allow_experimental_join_right_table_sorting

4 participants