Optimize hash join probe with software prefetch #102444

Open
wudidapaopao wants to merge 18 commits into ClickHouse:master from wudidapaopao:opt_join_query2

@wudidapaopao
Contributor

@wudidapaopao wudidapaopao commented Apr 11, 2026

Add software prefetch during hash join probe to hide memory access latency on large hash tables. Reuses the same PrefetchingHelper infrastructure already used by aggregation. Controlled by a new setting enable_software_prefetch_in_join (default: true).

Benchmark (TPC-H SF100, AWS r6i.4xlarge, 8C/16T, 128 GB, Ubuntu 24.04)

5 runs, median, default settings. OFF = enable_software_prefetch_in_join = false.

| Query | OFF (median) | ON (median) | Change |
|-------|--------------|-------------|--------|
| Q01 | 5.086 s | 5.121 s | +0.7% |
| Q02 | 0.959 s | 0.954 s | −0.5% |
| Q03 | 6.992 s | 6.991 s | 0.0% |
| Q04 | 2.009 s | 1.656 s | −17.6% |
| Q05 | TIMEOUT | TIMEOUT | |
| Q06 | 1.248 s | 1.262 s | +1.1% |
| Q07 | 3.784 s | 3.714 s | −1.8% |
| Q08 | 13.789 s | 13.627 s | −1.2% |
| Q09 | 22.215 s | 22.117 s | −0.4% |
| Q10 | 4.972 s | 4.835 s | −2.8% |
| Q11 | 0.680 s | 0.683 s | +0.4% |
| Q12 | 2.494 s | 2.409 s | −3.4% |
| Q13 | 3.118 s | 3.022 s | −3.1% |
| Q14 | 1.383 s | 1.369 s | −1.0% |
| Q15 | 2.247 s | 2.222 s | −1.1% |
| Q16 | 0.754 s | 0.719 s | −4.6% |
| Q17 | 12.445 s | 12.262 s | −1.5% |
| Q18 | 3.902 s | 3.812 s | −2.3% |
| Q19 | 2.941 s | 2.879 s | −2.1% |
| Q20 | 1.487 s | 1.473 s | −0.9% |
| Q21 | 8.104 s | 7.835 s | −3.3% |
| Q22 | 0.928 s | 0.783 s | −15.6% |

Q04 (−17.6%) and Q22 (−15.6%) show the largest gains — both are join-heavy queries where the right-side hash table is large enough for prefetch to effectively hide cache miss latency. Most other queries show modest 1–5% improvements; no meaningful regressions observed.
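
For anyone re-checking the numbers: the Change column appears to be the relative difference between the ON and OFF medians, `(ON - OFF) / OFF`, rounded to one decimal place. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of this PR.

```cpp
#include <cmath>

// Relative change of the ON median versus the OFF median, in percent.
// Negative values mean the query ran faster with prefetch enabled.
// `relativeChangePercent` is a hypothetical helper name used only for illustration.
double relativeChangePercent(double off_seconds, double on_seconds)
{
    return (on_seconds - off_seconds) / off_seconds * 100.0;
}

// Round to one decimal place, matching the precision used in the table.
double roundToTenth(double value)
{
    return std::round(value * 10.0) / 10.0;
}
```

For Q04 this gives (1.656 - 2.009) / 2.009, about −17.6%, and for Q22 (0.783 - 0.928) / 0.928, about −15.6%, matching the table.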

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add software prefetch in hash join probe phase to reduce memory access latency for large hash tables, controlled by setting enable_software_prefetch_in_join.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

New setting enable_software_prefetch_in_join (Bool, default true): enables software prefetch during hash join probe to hide memory latency when the hash table exceeds L2 cache size.

Add `enable_software_prefetch_in_join` setting (default true) to enable
software prefetch during hash join probe, following the same pattern as
`enable_software_prefetch_in_aggregation`. When the hash table is large
enough (>4x L2 cache), prefetch future rows' hash table slots to hide
memory access latency. Uses adaptive look-ahead via `PrefetchingHelper`.

Applied to all three probe functions: `joinRightColumns` (single map),
`joinRightColumns` (multiple maps), and
`joinRightColumnsWithAdditionalFilter`.
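
To illustrate the general idea outside of ClickHouse, here is a minimal, self-contained sketch of a probe loop that prefetches the slot for a future row before looking up the current one. Everything in it is an assumption made for illustration: `SimpleHashSet` is a toy linear-probing table, not ClickHouse's hash table; the look-ahead is a fixed argument rather than the adaptive value from `PrefetchingHelper`; and `__builtin_prefetch` is the GCC/Clang builtin (a no-op fallback is used elsewhere).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy open-addressing hash set with linear probing; key 0 marks an empty slot.
// Illustrative stand-in only. The table must never become full.
struct SimpleHashSet
{
    std::vector<uint64_t> slots;
    size_t mask;

    explicit SimpleHashSet(size_t capacity_pow2) : slots(capacity_pow2, 0), mask(capacity_pow2 - 1)
    {
        assert(capacity_pow2 != 0 && (capacity_pow2 & mask) == 0);
    }

    size_t slotOf(uint64_t key) const
    {
        return static_cast<size_t>((key * 0x9E3779B97F4A7C15ULL) >> 32) & mask;
    }

    void insert(uint64_t key)
    {
        size_t pos = slotOf(key);
        while (slots[pos] != 0 && slots[pos] != key)
            pos = (pos + 1) & mask;
        slots[pos] = key;
    }

    bool contains(uint64_t key) const
    {
        size_t pos = slotOf(key);
        while (slots[pos] != 0)
        {
            if (slots[pos] == key)
                return true;
            pos = (pos + 1) & mask;
        }
        return false;
    }

    // Hint the CPU to start loading the slot this key hashes to.
    void prefetch(uint64_t key) const
    {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(&slots[slotOf(key)]);
#else
        (void)key;
#endif
    }
};

// Count probe keys present in the table, prefetching `look_ahead` rows ahead.
size_t probeWithPrefetch(const SimpleHashSet & table, const std::vector<uint64_t> & probe_keys, size_t look_ahead)
{
    size_t matches = 0;
    for (size_t row = 0; row < probe_keys.size(); ++row)
    {
        if (row + look_ahead < probe_keys.size())
            table.prefetch(probe_keys[row + look_ahead]);
        matches += table.contains(probe_keys[row]);
    }
    return matches;
}
```

The prefetch only changes timing, never results: the same matches are returned for any look-ahead. The benefit is expected only when the table is much larger than the cache, which is why the PR gates it on table size.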
@wudidapaopao wudidapaopao changed the title Add software prefetch for hash join probe phase Optimize hash join probe with software prefetch Apr 11, 2026
@clickhouse-gh
Contributor

clickhouse-gh Bot commented Apr 11, 2026

Workflow [PR], commit [206eee5]

Summary:


AI Review

Summary

This PR adds adaptive software prefetching for hash JOIN build/probe paths and wires the new enable_software_prefetch_in_join setting through query/plan settings and TableJoin/HashJoin. The main implementation looks coherent, but one previously flagged typo is still present in current head.

Findings
  • 💡 Nits
    • [src/Interpreters/HashJoin/HashJoinMethodsImpl.h:1013] Typo is still present as single_flag_know_rows (also used at line 1034). Rename to single_flag_known_rows for consistency with all_flag_known_rows and readability.
ClickHouse Rules
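| Item | Status | Notes |
|------|--------|-------|
| Deletion logging | | |
| Serialization versioning | | |
| Core-area scrutiny | | |
| No test removal | | |
| Experimental gate | | |
| No magic constants | | |
| Backward compatibility | | |
| SettingsChangesHistory.cpp | | |
| PR metadata quality | | |
| Safe rollout | | |
| Compilation time | | |
| No large/binary files | | |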
Final Verdict

Status: ⚠️ Request changes

Minimum required action:

  • Rename single_flag_know_rows to single_flag_known_rows in src/Interpreters/HashJoin/HashJoinMethodsImpl.h (declaration + use).

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Apr 11, 2026
Comment thread src/Interpreters/HashJoin/HashJoinMethodsImpl.h Outdated
@Fgrtue Fgrtue self-assigned this Apr 16, 2026
Comment thread src/Interpreters/HashJoin/HashJoinMethodsImpl.h Outdated
Comment thread src/Interpreters/TableJoin.h Outdated
@wudidapaopao
Contributor Author

Hi @Fgrtue, this is ready for review when you have time. The remaining CI failures are unrelated to this PR. Thanks!

@rschu1ze
Member

rschu1ze commented May 5, 2026

@nickitat FYI^^

@nickitat
Member

nickitat commented May 5, 2026

I will have a final look after @Fgrtue. At first glance it looks fine.

Contributor

@Fgrtue Fgrtue left a comment

It seems that the same prefetching technique could be used in the insertFromBlockImplTypeCase function on the build side. Did you consider adding it?

Other than that it looks good!

Comment on lines +981 to +995
if constexpr (can_prefetch)
{
    if (use_prefetch)
    {
        if (row_idx == PrefetchingHelper::iterationsToMeasure())
            prefetch_look_ahead = prefetching.calcPrefetchLookAhead();

        if (row_idx + prefetch_look_ahead < selector_size)
        {
            size_t prefetch_ind = selector[row_idx + prefetch_look_ahead];
            auto key_holder = key_getter_vector[0].getKeyHolder(prefetch_ind, *pool);
            mapv[0]->prefetch(std::move(key_holder));
        }
    }
}
Contributor

It seems the same logic is repeated in all three cases. What do you think about moving it into a separate function and calling it from all three places?

Contributor Author

Good point, done. Extracted the shared logic into helpers (`shouldUseJoinPrefetch`, `JoinPrefetcher`, `makeJoinPrefetcher`).
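
As a rough illustration of what such an extraction can look like, the sketch below wraps the "measure, pick a look-ahead, prefetch a future row" logic in one small class that each loop calls once per row. It is not the code from this PR: `LoopPrefetcher`, `calcLookAhead`, and all constants are hypothetical, and the formula (assumed load latency divided by measured time per iteration, clamped to a range) is only an assumption about how an adaptive look-ahead could be derived. Requires C++17 for `std::clamp`.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical constants chosen for illustration; PrefetchingHelper's real values may differ.
constexpr size_t kIterationsToMeasure = 100;
constexpr size_t kMinLookAhead = 4;
constexpr size_t kMaxLookAhead = 32;
constexpr double kAssumedLoadLatencyNs = 100.0;

// Look-ahead is roughly how many loop iterations fit into one assumed memory load latency.
size_t calcLookAhead(uint64_t elapsed_ns, size_t iterations)
{
    const double ns_per_iteration
        = std::max(1.0, static_cast<double>(elapsed_ns) / static_cast<double>(iterations));
    const auto wanted = static_cast<size_t>(std::ceil(kAssumedLoadLatencyNs / ns_per_iteration));
    return std::clamp(wanted, kMinLookAhead, kMaxLookAhead);
}

// One object per loop; each loop calls onRow() instead of repeating the prefetch logic.
class LoopPrefetcher
{
public:
    using Clock = std::chrono::steady_clock;

    LoopPrefetcher(bool enabled, size_t rows) : enabled_(enabled), rows_(rows), start_(Clock::now()) {}

    // `prefetch_row(i)` is supplied by the caller and issues the actual prefetch for row i.
    template <typename PrefetchRow>
    void onRow(size_t row, PrefetchRow && prefetch_row)
    {
        if (!enabled_)
            return;

        // After a fixed number of iterations, re-derive the look-ahead from the measured speed.
        if (row == kIterationsToMeasure)
        {
            const auto elapsed
                = std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - start_).count();
            look_ahead_ = calcLookAhead(static_cast<uint64_t>(elapsed), kIterationsToMeasure);
        }

        if (row + look_ahead_ < rows_)
            prefetch_row(row + look_ahead_);
    }

    size_t lookAhead() const { return look_ahead_; }

private:
    bool enabled_;
    size_t rows_;
    Clock::time_point start_;
    size_t look_ahead_ = kMinLookAhead;
};
```

Passing the prefetch action as a callable keeps the helper independent of the map and key-getter types, so the single-map, multi-map, and additional-filter loops could each pass their own lambda.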


auto ind = selector[row_idx];
KnownRowsHolder<true> all_flag_known_rows;
KnownRowsHolder<false> single_flag_know_rows;
Contributor

Typo: single_flag_know_rows looks like it should be single_flag_known_rows for readability/consistency with all_flag_known_rows.

Contributor

This typo is still present in the current head: single_flag_know_rows.

Please rename it to single_flag_known_rows for consistency with all_flag_known_rows.

Contributor

This still looks unresolved in the current head: the variable is still spelled single_flag_know_rows at this location (and at the corresponding use site below).

Please rename it to single_flag_known_rows to match all_flag_known_rows and avoid carrying the typo forward.

@wudidapaopao
Contributor Author

> It seems that the same prefetching technique could be used in the insertFromBlockImplTypeCase function on the build side. Did you consider adding it?
>
> Other than that it looks good!

That makes sense. insertFromBlockImplTypeCase now supports software prefetch as well.
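
For readers unfamiliar with the build-side variant, the sketch below shows the same look-ahead pattern applied to insertion: the slot for a future key is prefetched with write intent before the current key is inserted. This is a toy under stated assumptions, not the `insertFromBlockImplTypeCase` change itself: `BuildTable`, `Slot`, and `buildWithPrefetch` are hypothetical names, the look-ahead is fixed, and the second argument of the GCC/Clang builtin `__builtin_prefetch` (1) marks the access as a write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kNoRow = UINT32_MAX;

// A slot is empty while `row` equals kNoRow.
struct Slot
{
    uint64_t key = 0;
    uint32_t row = kNoRow;
};

// Illustrative build-side table: maps a key to the first row index that carried it.
// Linear probing; capacity must be a power of two and larger than the number of keys.
struct BuildTable
{
    std::vector<Slot> slots;
    size_t mask;

    explicit BuildTable(size_t capacity_pow2) : slots(capacity_pow2), mask(capacity_pow2 - 1)
    {
        assert(capacity_pow2 != 0 && (capacity_pow2 & mask) == 0);
    }

    size_t slotOf(uint64_t key) const
    {
        return static_cast<size_t>((key * 0x9E3779B97F4A7C15ULL) >> 32) & mask;
    }

    // Write-intent prefetch of the slot this key hashes to.
    void prefetchForInsert(uint64_t key) const
    {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(&slots[slotOf(key)], 1);
#else
        (void)key;
#endif
    }

    // Insert the key if absent; an existing key keeps its first row index.
    void emplace(uint64_t key, uint32_t row)
    {
        size_t pos = slotOf(key);
        while (slots[pos].row != kNoRow && slots[pos].key != key)
            pos = (pos + 1) & mask;
        if (slots[pos].row == kNoRow)
        {
            slots[pos].key = key;
            slots[pos].row = row;
        }
    }

    uint32_t find(uint64_t key) const
    {
        size_t pos = slotOf(key);
        while (slots[pos].row != kNoRow)
        {
            if (slots[pos].key == key)
                return slots[pos].row;
            pos = (pos + 1) & mask;
        }
        return kNoRow;
    }
};

// Insert all keys, prefetching the slot of the key `look_ahead` rows ahead.
void buildWithPrefetch(BuildTable & table, const std::vector<uint64_t> & keys, size_t look_ahead)
{
    for (size_t row = 0; row < keys.size(); ++row)
    {
        if (row + look_ahead < keys.size())
            table.prefetchForInsert(keys[row + look_ahead]);
        table.emplace(keys[row], static_cast<uint32_t>(row));
    }
}
```

One caveat with this toy version: it never resizes, so a prefetched address stays valid. In a table that can grow during the build, a prefetch issued before a resize may simply be wasted.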

@clickhouse-gh
Contributor

clickhouse-gh Bot commented May 9, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|--------|----------|---------|---|
| Lines | 84.10% | 84.10% | +0.00% |
| Functions | 91.10% | 91.70% | +0.60% |
| Branches | 76.60% | 76.60% | +0.00% |

Changed lines: 88.12% (141/160) · Uncovered code

Full report · Diff report

Labels

pr-performance Pull request with some performance improvements

4 participants