Company or project name
No response
Describe the situation
Summary
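Query 1 of the tsv_csv_nullable_parsing performance test regressed in client_time on ARM/aarch64: the median rose from 0.015846s to 0.018723s (+18.16%) between the baseline and affected revisions.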
Builds and environment
Measured on ARM/aarch64, Neoverse-V2 class CPU, 32 logical CPUs, about 123 GiB RAM. Hostname and private workspace paths are intentionally omitted.
Reproduction
Performance test: tsv_csv_nullable_parsing
Query index: 1
Metric: client_time
SQL:
SELECT * FROM table_csv FORMAT Null
Datasets / inputs:
Not captured.
- Use an idle ARM/aarch64 host with similar CPU class if possible; the measurements below were taken on Neoverse-V2.
- Download the public ARM ClickHouse binaries for the baseline (9821949fd802) and affected (c1ccce21e7d3) revisions.
- Load the test's datasets/fixtures using the normal ClickHouse performance-test data setup; the exact inputs were not captured for this report.
- Run the SQL above at least 63 times for each revision and compare the median client_time.
- A valid reproduction should show the affected revision slower by approximately the measured shift, while same-revision reruns stay near zero shift.
A minimal manual loop, after starting each revision as a local server and loading data, is:
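# Query under test (query index 1 of tsv_csv_nullable_parsing).
QUERY='SELECT * FROM table_csv FORMAT Null'
for i in $(seq 1 63); do
    clickhouse-client --time --query "$QUERY"   # --time prints the elapsed time to stderr
done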
Measurements
| Comparison | Builds | Runs | Left median | Right median | Shift | Left range | Right range |
|---|---|---|---|---|---|---|---|
| before→after | baseline vs affected | 58 / 58 | 0.015846s | 0.018723s | +18.16% | 0.015366s–0.016506s | 0.017760s–0.019761s |
| before→latest | baseline vs latest | 58 / 58 | 0.015936s | 0.018464s | +15.86% | 0.015500s–0.016979s | 0.017837s–0.019492s |
| before→before | baseline vs baseline | 63 / 63 | 0.015826s | 0.015759s | -0.42% | 0.015350s–0.018299s | 0.015340s–0.016977s |
| after→after | affected vs affected | 55 / 55 | 0.018448s | 0.018322s | -0.68% | 0.017789s–0.019567s | 0.017731s–0.019688s |
| latest→latest | latest vs latest | 54 / 54 | 0.018471s | 0.018488s | +0.09% | 0.017871s–0.019795s | 0.017923s–0.019577s |
Stability checks: before→before -0.42%, after→after -0.68%, latest→latest +0.09%. Same-build comparisons are included above so reviewers can distinguish a regression from benchmark noise.
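The Shift column is consistent with the relative change of the right median against the left median, (right - left) / left. The following is a minimal sketch of that arithmetic, assuming this is how the shifts were derived (the report does not state the formula). The helper names `median` and `relative_shift_percent` are illustrative and not part of any ClickHouse tooling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a set of timing samples in seconds.
// Hypothetical helper, written for illustration only.
double median(std::vector<double> samples)
{
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    // For an even number of samples, average the two middle values.
    return n % 2 == 1 ? samples[n / 2] : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}

// Relative change of `right` against `left`, in percent.
// Assumption: this mirrors the "Shift" column of the measurements table.
double relative_shift_percent(double left, double right)
{
    assert(left > 0.0);
    return (right - left) / left * 100.0;
}
```

With the medians from the before→after row, `relative_shift_percent(0.015846, 0.018723)` evaluates to roughly 18.16, which matches the reported shift. The same check applied to the same-build rows gives values within about 1% of zero.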
Approximate introduction window
Controlled reproduction compares baseline revision 9821949fd802 with affected revision c1ccce21e7d3. No narrower self-contained localization window was captured.
Code areas and mechanism clues
- Files changed in the bounded window: base/base/find_symbols.h, src/Common/tests/gtest_find_symbols.cpp.
- Shortstat for that window: 2 files changed, 439 insertions(+), 14 deletions(-).
- Production files worth checking first: base/base/find_symbols.h.
- Static code/query review: The PR changes StringZilla source layout/submodule and AArch64 flags, while the benchmark is a string/regexp replacement path.
- Suspect area from static review: StringZilla submodule/CMake/ARM flag change alters string primitive implementation.
These are investigation leads only; the issue should not assign blame to a change without a validating patch or rollback measurement.
- Probe-level client time: baseline 0.066671s → affected 0.066504s (-0.25%).
- Server query duration: baseline 15.0 ms → affected 17.0 ms (+13.33%).
Largest captured ProfileEvents deltas:
| ProfileEvent | Baseline median | Affected median | Delta |
|---|---|---|---|
| GlobalThreadPoolLockWaitMicroseconds | 8 | n/a | -100.00% |
| LocalThreadPoolLockWaitMicroseconds | 1 | n/a | -100.00% |
| OSCPUWaitMicroseconds | 16.0 | n/a | -100.00% |
| NetworkSendElapsedMicroseconds | 121.0 | 154.0 | +27.27% |
| QueryProfilerRuns | 284.0 | 330.0 | +16.20% |
| DiskReadElapsedMicroseconds | 355.0 | 396.0 | +11.55% |
| LocalThreadPoolThreadCreationMicroseconds | 148.0 | 157.0 | +6.08% |
| OSCPUVirtualTimeMicroseconds | 55,627 | 58,950 | +5.97% |
Largest captured processor elapsed-time deltas:
| Processor | Baseline µs | Affected µs | Delta |
|---|---|---|---|
| LimitsCheckingTransform | 33.0 | 24.0 | -27.27% |
| File | 40,951 | 48,434 | +18.27% |
| LazyOutputFormat | 235.0 | 257.0 | +9.36% |
| ExpressionTransform | 121.0 | 117.0 | -3.31% |
EXPLAIN/pattern notes:
- find_symbols
- find_first_symbols
- NEON
- neon
- SerializationNullable
- Nullable
- CSV
- TSV
Fix or validation status
A local source-variant / rollback candidate improved the affected build. This is a measured fix lead, not yet a reviewed upstream patch.
- Patch idea: scoped_source_revert touching base/base/find_symbols.h
- Patch size: 12 hunks, 14 added lines, 296 removed lines.
- Validation result: affected/original vs local variant -7.72%; self checks: local-on-affected -0.06%, local-on-latest +1.16%, original-on-affected -0.54%.
How to recreate the idea for review:
- Check out the affected revision (c1ccce21e7d3).
- Revert or reimplement only the listed production-file changes from the baseline revision; start with the files above, then narrow to the smallest semantic change that keeps the measured improvement.
- Build ClickHouse for ARM release and rerun the benchmark above. Accept only if the affected-vs-patched comparison improves and same-build self checks stay near zero.
Validation shifts:
| Comparison | Shift |
|---|---|
| local unmodified affected vs local variant | -7.72% |
| self local unmodified affected | -0.54% |
| self local variant affected | -0.06% |
| self local variant latest | +1.16% |
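The acceptance rule for a candidate fix (the affected-vs-patched comparison must improve while same-build self checks stay near zero) can be written as a small predicate. This is a sketch only: the struct, the function name, and especially the noise threshold are assumptions for illustration, since the report does not define numerically what "near zero" means.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical container for the validation shifts, all in percent.
struct ValidationShifts
{
    double affected_vs_variant_percent;      // negative means the variant is faster
    std::vector<double> self_check_percent;  // same-build reruns, expected near zero
};

// Accept the variant only if every self check is within the noise band
// and the improvement is larger than that band.
// `noise_percent` is an assumed tolerance chosen by the reviewer.
bool accept_variant(const ValidationShifts & v, double noise_percent)
{
    assert(noise_percent >= 0.0);
    for (double self : v.self_check_percent)
        if (std::fabs(self) > noise_percent)
            return false;
    return v.affected_vs_variant_percent < -noise_percent;
}
```

With the values from the table (`-7.72` against self checks of `-0.54`, `-0.06` and `+1.16`), an assumed tolerance of 2% accepts the variant, while a stricter 1% rejects it because of the +1.16% self check on latest. That sensitivity is worth stating explicitly when the patch is reviewed.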
Patch excerpt used for the local validation
diff --git a/base/base/find_symbols.h b/base/base/find_symbols.h
index 5159201c4a6..c048a5e01b1 100644
--- a/base/base/find_symbols.h
+++ b/base/base/find_symbols.h
@@ -11,9 +11,6 @@
-#if defined(__aarch64__)
- #include <arm_neon.h>
-#endif
@@ -126,14 +123,6 @@ inline std::array<__m128i, 16u> mm_is_in_prepare(const char * symbols, size_t nu
- /// Pad unused slots with a repeat of an actual needle byte so `mm_is_in_execute`
- /// does not spuriously match `\0` bytes in the haystack via the zero-initialised
- /// slots. Callers ensure `num_chars >= 1` before reaching the SIMD body.
- for (size_t i = num_chars; i < 16u; ++i)
- {
- result[i] = result[0];
- }
-
@@ -151,46 +140,6 @@ inline __m128i mm_is_in_execute(__m128i bytes, const std::array<__m128i, 16u> &
-#if defined(__aarch64__)
-/// On AArch64 we use NEON. There is no direct equivalent of pmovmskb, so we
-/// use the well-known shrn-by-4 trick to compress a 16-byte vector of all-0/all-1
-/// bytes into a 64-bit value where each input byte is represented by a 4-bit
-/// nibble. The position of the lowest (or highest) matching byte is then
-/// recovered as `__builtin_ctzll(mask) >> 2` (or `__builtin_clzll(mask) >> 2`).
-template <char s0>
-inline uint8x16_t neon_is_in(uint8x16_t bytes)
-{
- return vceqq_u8(bytes, vdupq_n_u8(static_cast<uint8_t>(s0)));
-}
-
-template <char s0, char s1, char... tail>
-inline uint8x16_t neon_is_in(uint8x16_t bytes)
-{
- uint8x16_t eq0 = vceqq_u8(bytes, vdupq_n_u8(static_cast<uint8_t>(s0)));
- uint8x16_t eq = neon_is_in<s1, tail...>(bytes);
- return vorrq_u8(eq0, eq);
-}
-
-inline uint8x16_t neon_is_in(uint8x16_t bytes, const char * symbols, size_t num_chars)
-{
- uint8x16_t accumulator = vdupq_n_u8(0);
- for (size_t i = 0; i < num_chars; ++i)
- {
- uint8x16_t eq = vceqq_u8(bytes, vdupq_n_u8(static_cast<uint8_t>(symbols[i])));
- accumulator = vorrq_u8(accumulator, eq);
- }
-
- return accumulator;
-}
-
-/// Compresses a 16-byte all-0/all-1 vector into a 64-bit value where each input
-/// byte occupies a 4-bit nibble (all bits set if matched, all clear otherwise).
-inline uint64_t neon_to_bitmask(uint8x16_t eq)
-{
- return vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(vreinterpretq_u16_u8(eq), 4)), 0);
-}
-#endif
-
@@ -203,17 +152,6 @@ constexpr uint16_t maybe_negate(uint16_t x)
-#if defined(__aarch64__)
-template <bool positive>
-inline uint8x16_t maybe_negate(uint8x16_t x)
-{
- if constexpr (positive)
- return x;
- else
- return vmvnq_u8(x);
-}
-#endif
-
@@ -221,51 +159,6 @@ enum class ReturnMode : uint8_t
-#if defined(__aarch64__)
-/// NEON body for long haystacks (>= 16 bytes), kept out-of-line so the inline
-/// dispatcher above stays small enough for the compiler to keep auto-vectorising
-/// the scalar fast path and avoid extra branches in callers that handle one
-/// short string per row (e.g. `trim`).
-template <bool positive, ReturnMode return_mode, char... symbols>
-[[gnu::noinline]] const char * find_first_symbols_neon_long(const char * pos, const char * const end)
-{
- /// Many callers (CSV/TSV/JSON format readers, URL parsers) find a
- /// match within the first few bytes of the haystack. In that case the
- /// per-iteration NEON cost (load + 3-4 vceqq + vorrq + shrn + ctz)
- /// exceeds what a handful of byte compares would do, so an unguarded
- /// SIMD body regresses such workloads. An 8-byte scalar pre-check
- /// covers the common short-distance hit and leaves the SIMD body for
- /// the sparse case.
- if (maybe_negate<positive>(is_in<symbols...>(pos[0]))) return pos;
- if (maybe_negate<positive>(is_in<symbols...>(pos[1]))) return pos + 1;
- if (maybe_negate<positive>(is_in<symbols...>(pos[2]))) return pos + 2;
- if (maybe_negate<positive>(is_in<symbols...>(pos[3]))) return pos + 3;
- if (maybe_negate<positive>(is_in<symbols...>(pos[4]))) return pos + 4;
- if (maybe_negate<positive>(is_in<symbols...>(pos[5]))) return pos + 5;
- if (maybe_negate<positive>(is_in<symbols...>(pos[6]))) return pos + 6;
- if (maybe_negate<positive>(is_in<symbols...>(pos[7]))) return pos + 7;
- pos += 8;
-
- for (; pos + 15 < end; pos += 16)
- {
- uint8x16_t bytes = vld1q_u8(reinterpret_cast<const uint8_t *>(pos));
-
- uint8x16_t eq = maybe_negate<positive>(neon_is_in<symbols...>(bytes));
-
- uint64_t bit_mask = neon_to_bitmask(eq);
- if (bit_mask)
- return pos + (__builtin_ctzll(bit_mask) >> 2);
- }
-
- for (; pos < end; ++pos)
- if (maybe_negate<positive>(is_in<symbols...>(*pos)))
- return pos;
-
- return return_mode == ReturnMode::End ? end : nullptr;
-}
-#endif
-
-
@@ -282,12 +175,6 @@ inline const char * find_first_symbols_sse2(const char * const begin, const char
-#elif defined(__aarch64__)
- /// Short haystacks (< 16 bytes) dominate many callers (e.g. `trim` on
- /// `toString(number)`). Tail-call the out-of-line NEON helper so the
- /// inline body collapses to the original scalar loop on master.
- if (end - begin >= 16) [[unlikely]]
- return find_first_symbols_neon_long<positive, return_mode, symbols...>(begin, end);
@@ -297,75 +184,22 @@ inline const char * find_first_symbols_sse2(const char * const begin, const char
-#if defined(__aarch64__)
-/// Runtime-needle NEON body for long haystacks. Always returns either a
-/// matching pointer or `end` (treated by the caller as "no match"). Kept
-/// out-of-line so short-haystack callers stay free of the SIMD prologue
-/// (no stack frame, no extra branches).
-template <bool positive>
-[[gnu::noinline]] const char * find_first_symbols_neon_long_rt(const char * pos, const char * const end, const char * symbols, size_t num_chars)
@@ -377,44 +211,6 @@ inline const char * find_first_symbols_sse2(const char * const begin, const char
@@ -431,14 +227,6 @@ inline const char * find_last_symbols_sse2(const char * const begin, const char
@@ -449,72 +237,22 @@ inline const char * find_last_symbols_sse2(const char * const begin, const char
@@ -630,18 +368,6 @@ inline const char * find_first_symbols_dispatch(const char * begin, const char *
@@ -659,14 +385,6 @@ inline const char * find_last_symbols_dispatch(const char * begin, const char *
... (280 additional diff lines omitted in this issue draft; recreate by reverting the listed files on the affected revision to their baseline state, as described above)
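The removed NEON code locates the first match with the "shrn-by-4" nibble mask described in the excerpt's comments. For reviewers without an AArch64 machine at hand, the following portable sketch emulates that step in scalar C++. It is not the code from `find_symbols.h`: the function names are illustrative, it assumes a little-endian lane layout, and it relies on the GCC/Clang builtin `__builtin_ctzll` that the excerpt also uses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar stand-in for vshrn_n_u16(vreinterpretq_u16_u8(eq), 4), read back as
// one 64-bit value. Each input byte must be 0x00 (no match) or 0xFF (match).
uint64_t nibble_mask_emulated(const std::array<uint8_t, 16> & eq)
{
    uint64_t mask = 0;
    for (std::size_t lane = 0; lane < 8; ++lane)
    {
        const uint8_t lo = eq[2 * lane];
        const uint8_t hi = eq[2 * lane + 1];
        assert((lo == 0x00 || lo == 0xFF) && (hi == 0x00 || hi == 0xFF));
        // Two adjacent bytes form one little-endian 16-bit lane.
        const uint16_t wide = static_cast<uint16_t>(lo | (hi << 8));
        // Shift right by 4 and keep the low 8 bits (the "narrowing" step).
        const uint8_t narrowed = static_cast<uint8_t>(wide >> 4);
        mask |= static_cast<uint64_t>(narrowed) << (8 * lane);
    }
    return mask;
}

// Index of the first matching byte (0..15), or 16 if nothing matched.
// Each input byte occupies one 4-bit nibble, hence the shift by 2.
std::size_t first_match_index(uint64_t mask)
{
    return mask == 0 ? 16 : static_cast<std::size_t>(__builtin_ctzll(mask) >> 2);
}
```

For example, a comparison result with only byte 5 set yields the mask `0xF00000`, whose 20 trailing zero bits divide down to index 5. The sketch shows why the mask needs one extra narrowing instruction and a shift compared with the SSE2 `pmovmskb` path, which is part of the per-iteration cost that the excerpt's comments weigh against a handful of scalar byte compares.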
Which ClickHouse versions are affected?
latest
Expected performance
No response
Related issues and pull requests
No response
Additional context
No response