ARM performance regression: join_runtime_filter query 10 is slower #106607

@egor-click

Description

Company or project name

No response

Describe the situation

Summary

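The client_time metric for join_runtime_filter query 10 regressed on ARM/aarch64: the median is about 12.65% slower on the affected build (c7d5efb) than on the baseline build (ad347db).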

Builds and environment

Measured on ARM/aarch64, Neoverse-V2 class CPU, 32 logical CPUs, about 123 GiB RAM. Hostname and private workspace paths are intentionally omitted.

| Role | Revision | Public ARM binary | Note |
| --- | --- | --- | --- |
| baseline | ad347db | https://clickhouse-builds.s3.us-east-1.amazonaws.com/REFs/master/ad347dbafb074ccf13790b5045b25708a975fb77/build_arm_release/clickhouse | nearest available first-parent build was used |
| affected | c7d5efb | https://clickhouse-builds.s3.us-east-1.amazonaws.com/REFs/master/c7d5efb0b5abb8b759adb6950743055cd2218eb6/build_arm_release/clickhouse | |
| latest | 4853865 | https://clickhouse-builds.s3.us-east-1.amazonaws.com/REFs/master/4853865fd42a1eb208f751416ee64927fcbc237f/build_arm_release/clickhouse | |
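
The three links above share one layout, so the download URL for a revision can be rebuilt from its full commit SHA. The sketch below assumes that layout holds for other master builds, which the issue does not confirm; `arm_binary_url` and `REVISIONS` are hypothetical names used only for illustration.

```python
# Sketch: rebuild the ARM binary URL for a master build from its full commit SHA.
# Assumption: the URL layout is inferred from the three links in the build table.
BASE = "https://clickhouse-builds.s3.us-east-1.amazonaws.com/REFs/master"

# Full SHAs copied from the build table.
REVISIONS = {
    "baseline": "ad347dbafb074ccf13790b5045b25708a975fb77",
    "affected": "c7d5efb0b5abb8b759adb6950743055cd2218eb6",
    "latest": "4853865fd42a1eb208f751416ee64927fcbc237f",
}


def arm_binary_url(sha: str) -> str:
    """Return the build_arm_release binary URL for a full 40-character SHA."""
    if len(sha) != 40:
        raise ValueError("expected a full 40-character commit SHA")
    return f"{BASE}/{sha}/build_arm_release/clickhouse"


if __name__ == "__main__":
    for role, sha in REVISIONS.items():
        print(role, arm_binary_url(sha))
```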

Reproduction

Performance test: join_runtime_filter
Query index: 10
Metric: client_time

SQL:

SELECT avg(o_totalprice)
FROM orders
JOIN (SELECT * FROM customer JOIN nation ON c_nationkey = n_nationkey WHERE n_name = 'WAKANDA') AS cn
    ON c_custkey = o_custkey
SETTINGS enable_join_runtime_filters=1

Datasets / inputs:

| Dataset/input | How to obtain or recreate |
| --- | --- |
| tpch10 | https://clickhouse-datasets.s3.amazonaws.com/h/10/tpch_sf10.tar |

  1. Use an idle ARM/aarch64 host with a similar CPU class if possible; the measurements below were taken on Neoverse-V2.
  2. Download the public ARM ClickHouse binaries listed in the build table for the baseline and affected revisions.
  3. Load the dataset listed above using the normal ClickHouse performance-test data setup.
  4. Run the SQL above at least 101 times for each revision and compare median client_time.
  5. A valid reproduction should show the affected revision slower by approximately the measured shift, while same-revision reruns stay near zero shift.

A minimal manual loop, after starting each revision as a local server and loading data, is:

# QUERY must hold the SQL statement shown above.
for i in $(seq 1 9); do
  clickhouse-client --time --query "$QUERY"
done
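
To get a median from the manual loop rather than reading the timings by eye, the runs can be collected programmatically. This is a minimal sketch, assuming `clickhouse-client --time` writes the elapsed seconds as the last line of stderr. The value it reports may not be identical to the performance test's client_time metric. `run_once`, `parse_elapsed`, and `median_time` are hypothetical helper names.

```python
import statistics
import subprocess
from typing import Callable


def run_once(query: str) -> str:
    """Run the query once and return what clickhouse-client wrote to stderr."""
    proc = subprocess.run(
        ["clickhouse-client", "--time", "--query", query],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stderr


def parse_elapsed(stderr_text: str) -> float:
    """Assumption: the last non-empty stderr line is the elapsed time in seconds."""
    lines = [ln.strip() for ln in stderr_text.splitlines() if ln.strip()]
    return float(lines[-1])


def median_time(
    query: str,
    runs: int = 101,
    runner: Callable[[str], str] = run_once,
) -> float:
    """Run the query `runs` times and return the median reported time."""
    return statistics.median(parse_elapsed(runner(query)) for _ in range(runs))
```

Calling `median_time(query)` once against the baseline server and once against the affected server gives the two medians to compare. The `runner` parameter exists only so the timing source can be swapped out, for example in a dry run without a server.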

Measurements

| Comparison | Builds | Runs | Left median | Right median | Shift | Left range | Right range |
| --- | --- | --- | --- | --- | --- | --- | --- |
| before→after | baseline vs affected | 96 / 96 | 0.009844s | 0.011089s | +12.65% | 0.009132s–0.010607s | 0.010275s–0.011847s |
| before→latest | baseline vs latest | 97 / 97 | 0.009688s | 0.011050s | +14.06% | 0.009182s–0.010309s | 0.010445s–0.011608s |
| before→before | baseline vs baseline | 101 / 101 | 0.009979s | 0.009898s | -0.81% | 0.009134s–0.010670s | 0.009216s–0.011057s |
| after→after | affected vs affected | 91 / 91 | 0.011203s | 0.011019s | -1.65% | 0.010364s–0.012043s | 0.010048s–0.011487s |
| latest→latest | latest vs latest | 92 / 92 | 0.010774s | 0.011068s | +2.73% | 0.010256s–0.011607s | 0.010396s–0.011887s |

Stability checks: before→before -0.81%, after→after -1.65%, latest→latest +2.73%. Same-build comparisons are included above so reviewers can distinguish a regression from benchmark noise.
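
The Shift column is consistent with the relative change of the right median over the left median. The sketch below recomputes the two cross-build shifts from the medians in the table and compares them with the largest same-build shift, used here as a rough noise floor. Using the maximum absolute same-build shift as the threshold is an assumption of this sketch, not something the benchmark tooling is known to do.

```python
def shift_percent(left_median: float, right_median: float) -> float:
    """Relative change of the right median over the left median, in percent."""
    return (right_median - left_median) / left_median * 100.0


# Same-build shifts as reported in the stability checks.
same_build_shifts = [-0.81, -1.65, 2.73]
noise_floor = max(abs(s) for s in same_build_shifts)

# Cross-build medians (seconds) copied from the measurements table.
before_after = shift_percent(0.009844, 0.011089)
before_latest = shift_percent(0.009688, 0.011050)

if __name__ == "__main__":
    for name, shift in [("before->after", before_after), ("before->latest", before_latest)]:
        verdict = "above" if abs(shift) > noise_floor else "within"
        print(f"{name}: {shift:+.2f}% ({verdict} the {noise_floor:.2f}% noise floor)")
```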

Approximate introduction window

The regression is bounded to the revision/time window below. This is a localization aid, not a final root-cause claim.

| Bound | Revision | Time | Subject |
| --- | --- | --- | --- |
| Start | ad347db | 2026-05-08T10:34:54+00:00 | Merge pull request #103891 from clickgapai/qa-bot/coverage-pr60419 |
| End | c7d5efb | 2026-05-08T10:43:10+00:00 | Merge pull request #104136 from Algunenano/parts-metadata-arena |

Width: ~0.006 days
Evidence: baseline_bound at ad347db (baseline); signal at c7d5efb (controlled_window_endpoint)
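
The reported width can be checked directly from the two merge timestamps. A small sketch using only the standard library:

```python
from datetime import datetime

# Start and end timestamps copied from the window above.
start = datetime.fromisoformat("2026-05-08T10:34:54+00:00")
end = datetime.fromisoformat("2026-05-08T10:43:10+00:00")

width_seconds = (end - start).total_seconds()
width_days = width_seconds / 86400  # seconds per day

if __name__ == "__main__":
    print(f"{width_seconds:.0f} s = {width_days:.3f} days")
```

The window is 8 minutes 16 seconds wide, which rounds to the ~0.006 days reported.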

Code areas and mechanism clues

  • Files changed in the bounded window: src/Common/AsynchronousMetrics.cpp, src/Common/CurrentMetrics.cpp, src/Common/Jemalloc.cpp, src/Common/Jemalloc.h, src/Common/JemallocMergeTreeArena.cpp, src/Common/JemallocMergeTreeArena.h, src/Storages/MergeTree/DataPartsExchange.cpp, src/Storages/MergeTree/IMergeTreeDataPart.cpp, src/Storages/MergeTree/MergeTask.cpp, src/Storages/MergeTree/MergeTreeData.cpp, src/Storages/MergeTree/MergeTreeDataPartBuilder.cpp, src/Storages/MergeTree/MutateTask.cpp.

  • Shortstat for that window: 19 files changed, 436 insertions(+), 9 deletions(-).

  • Production files worth checking first: src/Common/AsynchronousMetrics.cpp, src/Common/JemallocMergeTreeArena.h, src/Storages/MergeTree/IMergeTreeDataPart.cpp, src/Storages/MergeTree/MergeTreeData.cpp, src/Storages/MergeTree/registerStorageMergeTree.cpp, src/Common/JemallocMergeTreeArena.cpp, src/Storages/MergeTree/MergeTreeDataPartBuilder.cpp, src/Storages/MergeTree/DataPartsExchange.cpp.

  • Static code/query review: Changed files did not strongly map to the benchmark query. Treat any code context as triage-only.

  • Suspect area from static review: No direct code-path suspect identified by deterministic rules.
    These are investigation leads only; the issue should not assign blame to a change without a validating patch or rollback measurement.

  • Probe-level client time: baseline 0.057404s → affected 0.057851s (+0.78%).

  • Server query duration: baseline 9 ms → affected 9 ms (+0.00%).

Largest captured ProfileEvents deltas:

| ProfileEvent | Baseline median | Affected median | Delta |
| --- | --- | --- | --- |
| LocalThreadPoolLockWaitMicroseconds | 18.0 | 9 | -50.00% |
| OSCPUWaitMicroseconds | 7 | 5 | -28.57% |
| QueryProfilerSignalOverruns | 2 | 1.5 | -25.00% |
| OSCPUVirtualTimeMicroseconds | 18,816 | 23,388 | +24.30% |
| GlobalThreadPoolLockWaitMicroseconds | 17.0 | 21.0 | +23.53% |
| NetworkSendElapsedMicroseconds | 165.0 | 134.0 | -18.79% |
| QueryProfilerRuns | 50.0 | 56.0 | +12.00% |
| FilteringMarksWithPrimaryKeyMicroseconds | 9 | 8 | -11.11% |
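
The deltas above appear to be relative changes of the affected median over the baseline median, listed by decreasing absolute size. The sketch below recomputes and ranks them from the values in the table; `ROWS`, `to_number`, and `delta_percent` are hypothetical names for this illustration.

```python
# (event name, baseline median, affected median) copied from the table as text.
ROWS = [
    ("LocalThreadPoolLockWaitMicroseconds", "18.0", "9"),
    ("OSCPUWaitMicroseconds", "7", "5"),
    ("QueryProfilerSignalOverruns", "2", "1.5"),
    ("OSCPUVirtualTimeMicroseconds", "18,816", "23,388"),
    ("GlobalThreadPoolLockWaitMicroseconds", "17.0", "21.0"),
    ("NetworkSendElapsedMicroseconds", "165.0", "134.0"),
    ("QueryProfilerRuns", "50.0", "56.0"),
    ("FilteringMarksWithPrimaryKeyMicroseconds", "9", "8"),
]


def to_number(text: str) -> float:
    """Parse a table cell, dropping thousands separators."""
    return float(text.replace(",", ""))


def delta_percent(baseline: float, affected: float) -> float:
    """Relative change of affected over baseline, in percent."""
    return (affected - baseline) / baseline * 100.0


# Rank events by the absolute size of their relative change.
ranked = sorted(
    ((name, delta_percent(to_number(b), to_number(a))) for name, b, a in ROWS),
    key=lambda row: abs(row[1]),
    reverse=True,
)

if __name__ == "__main__":
    for name, delta in ranked:
        print(f"{name}: {delta:+.2f}%")
```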

Largest captured processor elapsed-time deltas:

| Processor | Baseline µs | Affected µs | Delta |
| --- | --- | --- | --- |
| FillingRightJoinSide | 163.0 | 93.0 | -42.94% |
| SimpleSquashingTransform | 12.0 | 8 | -33.33% |
| LazyOutputFormat | 27.0 | 19.0 | -29.63% |
| ConvertingAggregatedToChunksTransform | 15.0 | 13.0 | -13.33% |
| ExpressionTransform | 11.0 | 12.0 | +9.09% |
| MergeTreeSelect(pool: ReadPoolInOrder, algorithm: InOrder) | 538.0 | 496.0 | -7.81% |

Fix or validation status

No validated fix is available yet.
Most useful next patch/revert targets: src/Common/AsynchronousMetrics.cpp, src/Common/JemallocMergeTreeArena.h, src/Storages/MergeTree/IMergeTreeDataPart.cpp, src/Storages/MergeTree/MergeTreeData.cpp, src/Storages/MergeTree/registerStorageMergeTree.cpp, src/Common/JemallocMergeTreeArena.cpp, src/Storages/MergeTree/MergeTreeDataPartBuilder.cpp, src/Storages/MergeTree/DataPartsExchange.cpp.
The current evidence narrows the time window and the candidate code areas, but no patch or rollback has yet been measured to confirm a fix.

Which ClickHouse versions are affected?

latest

How to reproduce

See the Reproduction section under "Describe the situation" above; the SQL, dataset, and steps are identical.

Expected performance

No response

Related issues and pull requests

No response

Additional context

No response
