ARM performance regression: tsv_csv_nullable_parsing query 1 is slower #106605

@egor-click

Company or project name

No response

Describe the situation

Summary

tsv_csv_nullable_parsing query 1 regressed on ARM/aarch64 for client_time.

Builds and environment

Measured on ARM/aarch64, Neoverse-V2 class CPU, 32 logical CPUs, about 123 GiB RAM. Hostname and private workspace paths are intentionally omitted.

Role | Revision | Public ARM binary
baseline | 9821949 | https://clickhouse-builds.s3.us-east-1.amazonaws.com/REFs/master/9821949fd802368428d4977ded5fecc82f132afe/build_arm_release/clickhouse
affected | c1ccce2 | https://clickhouse-builds.s3.us-east-1.amazonaws.com/REFs/master/c1ccce21e7d349a3a3b13a13daffbd42c99aa7c9/build_arm_release/clickhouse
latest | 895cae1 | https://clickhouse-builds.s3.us-east-1.amazonaws.com/REFs/master/895cae1f9d5e9535b538a2fe935876b20821cb84/build_arm_release/clickhouse

Reproduction

Performance test: tsv_csv_nullable_parsing
Query index: 1
Metric: client_time

SQL:

SELECT * FROM table_csv FORMAT Null

Datasets / inputs:

Not captured.

  1. Use an idle ARM/aarch64 host with a similar CPU class if possible; the measurements in this report were taken on Neoverse-V2.
  2. Download the public ARM ClickHouse binaries listed in the build table for the baseline and affected revisions.
  3. Load the test data using the normal ClickHouse performance-test data setup; the exact datasets/fixtures were not captured for this report.
  4. Run the SQL above at least 63 times for each revision and compare median client_time.
  5. A valid reproduction should show the affected revision slower by approximately the measured shift, while same-revision reruns stay near zero shift.

A minimal manual loop, after starting each revision as a local server and loading data, is:

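# Query under test; --time prints the elapsed time of each run to stderr.
QUERY="SELECT * FROM table_csv FORMAT Null"
for i in $(seq 1 63); do
  clickhouse-client --time --query "$QUERY"
done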

Measurements

Comparison | Builds | Runs | Left median | Right median | Shift | Left range | Right range
before→after | baseline vs affected | 58 / 58 | 0.015846s | 0.018723s | +18.16% | 0.015366s–0.016506s | 0.017760s–0.019761s
before→latest | baseline vs latest | 58 / 58 | 0.015936s | 0.018464s | +15.86% | 0.015500s–0.016979s | 0.017837s–0.019492s
before→before | baseline vs baseline | 63 / 63 | 0.015826s | 0.015759s | -0.42% | 0.015350s–0.018299s | 0.015340s–0.016977s
after→after | affected vs affected | 55 / 55 | 0.018448s | 0.018322s | -0.68% | 0.017789s–0.019567s | 0.017731s–0.019688s
latest→latest | latest vs latest | 54 / 54 | 0.018471s | 0.018488s | +0.09% | 0.017871s–0.019795s | 0.017923s–0.019577s

Stability checks: before→before -0.42%, after→after -0.68%, latest→latest +0.09%. Same-build comparisons are included above so reviewers can distinguish a regression from benchmark noise.
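
The Shift values are consistent with the relative change of the right median against the left median, (right - left) / left. A minimal C++ sketch of that arithmetic follows, assuming that definition; the function names `median` and `shift_percent` are illustrative and are not part of the ClickHouse performance-test tooling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a set of timing samples. Takes the vector by value because
// std::nth_element reorders its input.
double median(std::vector<double> samples)
{
    assert(!samples.empty());
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    const double upper = samples[mid];
    if (samples.size() % 2 == 1)
        return upper;
    // Even sample count: average the two middle values. After nth_element,
    // everything before `mid` is <= samples[mid], so the lower middle value
    // is the maximum of that prefix.
    const double lower = *std::max_element(samples.begin(), samples.begin() + mid);
    return (lower + upper) / 2.0;
}

// Relative shift of the right median against the left median, in percent.
// Positive values mean the right-hand build is slower.
double shift_percent(double left_median, double right_median)
{
    assert(left_median > 0.0);
    return (right_median - left_median) / left_median * 100.0;
}
```

For the before→after row, `shift_percent(0.015846, 0.018723)` evaluates to roughly 18.16, which agrees with the reported shift.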

Approximate introduction window

Controlled reproduction compares baseline revision 9821949fd802 with affected revision c1ccce21e7d3. No narrower self-contained localization window was captured.

Code areas and mechanism clues

  • Files changed in the bounded window: base/base/find_symbols.h, src/Common/tests/gtest_find_symbols.cpp.

  • Shortstat for that window: 2 files changed, 439 insertions(+), 14 deletions(-).

  • Production files worth checking first: base/base/find_symbols.h.

  • Static code/query review: The PR changes StringZilla source layout/submodule and AArch64 flags, while the benchmark is a string/regexp replacement path.

  • Suspect area from static review: StringZilla submodule/CMake/ARM flag change alters string primitive implementation.
    These are investigation leads only; the issue should not assign blame to a change without a validating patch or rollback measurement.

  • Probe-level client time: baseline 0.066671s → affected 0.066504s (-0.25%).

  • Server query duration: baseline 15.0 ms → affected 17.0 ms (+13.33%).

Largest captured ProfileEvents deltas:

ProfileEvent | Baseline median | Affected median | Delta
GlobalThreadPoolLockWaitMicroseconds | 8 | | -100.00%
LocalThreadPoolLockWaitMicroseconds | 1 | | -100.00%
OSCPUWaitMicroseconds | 16.0 | | -100.00%
NetworkSendElapsedMicroseconds | 121.0 | 154.0 | +27.27%
QueryProfilerRuns | 284.0 | 330.0 | +16.20%
DiskReadElapsedMicroseconds | 355.0 | 396.0 | +11.55%
LocalThreadPoolThreadCreationMicroseconds | 148.0 | 157.0 | +6.08%
OSCPUVirtualTimeMicroseconds | 55,627 | 58,950 | +5.97%

Largest captured processor elapsed-time deltas:

Processor | Baseline µs | Affected µs | Delta
LimitsCheckingTransform | 33.0 | 24.0 | -27.27%
File | 40,951 | 48,434 | +18.27%
LazyOutputFormat | 235.0 | 257.0 | +9.36%
ExpressionTransform | 121.0 | 117.0 | -3.31%

EXPLAIN/pattern notes:

  • find_symbols
  • find_first_symbols
  • NEON
  • neon
  • SerializationNullable
  • Nullable
  • CSV
  • TSV

Fix or validation status

A local source-variant / rollback candidate improved the affected build. This is a measured fix lead, not yet a reviewed upstream patch.

  • Patch idea: scoped_source_revert touching base/base/find_symbols.h
  • Patch size: 12 hunks, 14 added lines, 296 removed lines.
  • Validation result: affected/original vs local variant -7.72%; self checks: local-on-affected -0.06%, local-on-latest +1.16%, original-on-affected -0.54%.

How to recreate the idea for review:

  1. Check out the affected revision from the build table.
  2. Revert or reimplement only the listed production-file changes from the baseline revision; start with the files above, then narrow to the smallest semantic change that keeps the measured improvement.
  3. Build ClickHouse for ARM release and rerun the benchmark above. Accept only if the affected-vs-patched comparison improves and same-build self checks stay near zero.

Validation shifts:

Comparison | Shift
local unmodified affected vs local variant | -7.72%
self local unmodified affected | -0.54%
self local variant affected | -0.06%
self local variant latest | +1.16%

Patch excerpt used for the local validation

diff --git a/base/base/find_symbols.h b/base/base/find_symbols.h
index 5159201c4a6..c048a5e01b1 100644
--- a/base/base/find_symbols.h
+++ b/base/base/find_symbols.h
@@ -11,9 +11,6 @@
-#if defined(__aarch64__)
-    #include <arm_neon.h>
-#endif
@@ -126,14 +123,6 @@ inline std::array<__m128i, 16u> mm_is_in_prepare(const char * symbols, size_t nu
-    /// Pad unused slots with a repeat of an actual needle byte so `mm_is_in_execute`
-    /// does not spuriously match `\0` bytes in the haystack via the zero-initialised
-    /// slots. Callers ensure `num_chars >= 1` before reaching the SIMD body.
-    for (size_t i = num_chars; i < 16u; ++i)
-    {
-        result[i] = result[0];
-    }
-
@@ -151,46 +140,6 @@ inline __m128i mm_is_in_execute(__m128i bytes, const std::array<__m128i, 16u> &
-#if defined(__aarch64__)
-/// On AArch64 we use NEON. There is no direct equivalent of pmovmskb, so we
-/// use the well-known shrn-by-4 trick to compress a 16-byte vector of all-0/all-1
-/// bytes into a 64-bit value where each input byte is represented by a 4-bit
-/// nibble. The position of the lowest (or highest) matching byte is then
-/// recovered as `__builtin_ctzll(mask) >> 2` (or `__builtin_clzll(mask) >> 2`).
-template <char s0>
-inline uint8x16_t neon_is_in(uint8x16_t bytes)
-{
-    return vceqq_u8(bytes, vdupq_n_u8(static_cast<uint8_t>(s0)));
-}
-
-template <char s0, char s1, char... tail>
-inline uint8x16_t neon_is_in(uint8x16_t bytes)
-{
-    uint8x16_t eq0 = vceqq_u8(bytes, vdupq_n_u8(static_cast<uint8_t>(s0)));
-    uint8x16_t eq = neon_is_in<s1, tail...>(bytes);
-    return vorrq_u8(eq0, eq);
-}
-
-inline uint8x16_t neon_is_in(uint8x16_t bytes, const char * symbols, size_t num_chars)
-{
-    uint8x16_t accumulator = vdupq_n_u8(0);
-    for (size_t i = 0; i < num_chars; ++i)
-    {
-        uint8x16_t eq = vceqq_u8(bytes, vdupq_n_u8(static_cast<uint8_t>(symbols[i])));
-        accumulator = vorrq_u8(accumulator, eq);
-    }
-
-    return accumulator;
-}
-
-/// Compresses a 16-byte all-0/all-1 vector into a 64-bit value where each input
-/// byte occupies a 4-bit nibble (all bits set if matched, all clear otherwise).
-inline uint64_t neon_to_bitmask(uint8x16_t eq)
-{
-    return vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(vreinterpretq_u16_u8(eq), 4)), 0);
-}
-#endif
-
@@ -203,17 +152,6 @@ constexpr uint16_t maybe_negate(uint16_t x)
-#if defined(__aarch64__)
-template <bool positive>
-inline uint8x16_t maybe_negate(uint8x16_t x)
-{
-    if constexpr (positive)
-        return x;
-    else
-        return vmvnq_u8(x);
-}
-#endif
-
@@ -221,51 +159,6 @@ enum class ReturnMode : uint8_t
-#if defined(__aarch64__)
-/// NEON body for long haystacks (>= 16 bytes), kept out-of-line so the inline
-/// dispatcher above stays small enough for the compiler to keep auto-vectorising
-/// the scalar fast path and avoid extra branches in callers that handle one
-/// short string per row (e.g. `trim`).
-template <bool positive, ReturnMode return_mode, char... symbols>
-[[gnu::noinline]] const char * find_first_symbols_neon_long(const char * pos, const char * const end)
-{
-    /// Many callers (CSV/TSV/JSON format readers, URL parsers) find a
-    /// match within the first few bytes of the haystack. In that case the
-    /// per-iteration NEON cost (load + 3-4 vceqq + vorrq + shrn + ctz)
-    /// exceeds what a handful of byte compares would do, so an unguarded
-    /// SIMD body regresses such workloads. An 8-byte scalar pre-check
-    /// covers the common short-distance hit and leaves the SIMD body for
-    /// the sparse case.
-    if (maybe_negate<positive>(is_in<symbols...>(pos[0]))) return pos;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[1]))) return pos + 1;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[2]))) return pos + 2;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[3]))) return pos + 3;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[4]))) return pos + 4;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[5]))) return pos + 5;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[6]))) return pos + 6;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[7]))) return pos + 7;
-    pos += 8;
-
-    for (; pos + 15 < end; pos += 16)
-    {
-        uint8x16_t bytes = vld1q_u8(reinterpret_cast<const uint8_t *>(pos));
-
-        uint8x16_t eq = maybe_negate<positive>(neon_is_in<symbols...>(bytes));
-
-        uint64_t bit_mask = neon_to_bitmask(eq);
-        if (bit_mask)
-            return pos + (__builtin_ctzll(bit_mask) >> 2);
-    }
-
-    for (; pos < end; ++pos)
-        if (maybe_negate<positive>(is_in<symbols...>(*pos)))
-            return pos;
-
-    return return_mode == ReturnMode::End ? end : nullptr;
-}
-#endif
-
-
@@ -282,12 +175,6 @@ inline const char * find_first_symbols_sse2(const char * const begin, const char
-#elif defined(__aarch64__)
-    /// Short haystacks (< 16 bytes) dominate many callers (e.g. `trim` on
-    /// `toString(number)`). Tail-call the out-of-line NEON helper so the
-    /// inline body collapses to the original scalar loop on master.
-    if (end - begin >= 16) [[unlikely]]
-        return find_first_symbols_neon_long<positive, return_mode, symbols...>(begin, end);
@@ -297,75 +184,22 @@ inline const char * find_first_symbols_sse2(const char * const begin, const char
-#if defined(__aarch64__)
-/// Runtime-needle NEON body for long haystacks. Always returns either a
-/// matching pointer or `end` (treated by the caller as "no match"). Kept
-/// out-of-line so short-haystack callers stay free of the SIMD prologue
-/// (no stack frame, no extra branches).
-template <bool positive>
-[[gnu::noinline]] const char * find_first_symbols_neon_long_rt(const char * pos, const char * const end, const char * symbols, size_t num_chars)
@@ -377,44 +211,6 @@ inline const char * find_first_symbols_sse2(const char * const begin, const char
@@ -431,14 +227,6 @@ inline const char * find_last_symbols_sse2(const char * const begin, const char
@@ -449,72 +237,22 @@ inline const char * find_last_symbols_sse2(const char * const begin, const char
@@ -630,18 +368,6 @@ inline const char * find_first_symbols_dispatch(const char * begin, const char *
@@ -659,14 +385,6 @@ inline const char * find_last_symbols_dispatch(const char * begin, const char *
... (280 additional diff lines omitted in this issue draft; recreate by reverting base/base/find_symbols.h on the affected revision to its baseline state, as described above)
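
For reviewers unfamiliar with the shrn-by-4 trick mentioned in the excerpt's comments, the following is a small standalone sketch of how a 16-byte comparison result becomes a 64-bit nibble mask and how the first match position is recovered from it. It is not the ClickHouse implementation: the names `nibble_mask` and `first_match_index` are hypothetical, only a single needle byte is handled, and the non-AArch64 branch is a scalar emulation added here so the example also builds on other hosts. It assumes a GCC- or Clang-compatible compiler for `__builtin_ctzll`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Builds a 64-bit mask for 16 input bytes in which nibble i is 0xF when
// bytes[i] == needle and 0x0 otherwise. `bytes` must point to at least
// 16 readable bytes.
std::uint64_t nibble_mask(const std::uint8_t * bytes, std::uint8_t needle)
{
    assert(bytes != nullptr);
#if defined(__aarch64__)
    // Compare all 16 lanes, then narrow each 16-bit pair with a right shift
    // by 4 so every input byte contributes exactly one nibble.
    const uint8x16_t eq = vceqq_u8(vld1q_u8(bytes), vdupq_n_u8(needle));
    return vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(vreinterpretq_u16_u8(eq), 4)), 0);
#else
    // Scalar emulation of the same layout (assumption: used only so the
    // sketch runs on hosts without NEON).
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < 16; ++i)
        if (bytes[i] == needle)
            mask |= std::uint64_t{0xF} << (4 * i);
    return mask;
#endif
}

// Index of the first byte equal to `needle` within a 16-byte block,
// or 16 when the block contains no match.
std::size_t first_match_index(const std::uint8_t * bytes, std::uint8_t needle)
{
    const std::uint64_t mask = nibble_mask(bytes, needle);
    if (mask == 0)
        return 16;
    // Each byte occupies 4 bits, so the bit position divided by 4 is the byte index.
    return static_cast<std::size_t>(__builtin_ctzll(mask) >> 2);
}
```

On a 16-byte block such as `abc,def\tghi,jklm`, the first comma is found at index 3 and the tab at index 7. The excerpt's concern is that when a match is this close to the start, the fixed per-block cost of the vector path can exceed a few scalar byte compares.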

Which ClickHouse versions are affected?

latest

Expected performance

No response

Related issues and pull requests

No response

Additional context

No response
