ARM performance regression: `tsv_csv_nullable_parsing` query `1` is slower

### Company or project name

_No response_

### Describe the situation

## Summary

`tsv_csv_nullable_parsing` query `1` regressed on ARM/aarch64 for `client_time`.

- Controlled baseline → affected shift: **+18.16%** (0.015846s → 0.018723s, 58 / 58 runs).
- Baseline → latest shift: **+15.86%**; latest still shows the slowdown.
- Historical public signal: **+30.07%** in ClickHouse performance history.
- Reproduction class: `reproduced`.
- Public history: https://performance.ci.clickhouse.com/history/tsv_csv_nullable_parsing/1?metric=client_time&arch=arm

## Builds and environment

Measured on ARM/aarch64, Neoverse-V2 class CPU, 32 logical CPUs, about 123 GiB RAM. Hostname and private workspace paths are intentionally omitted.

| Role | Revision | Public ARM binary | Note |
| --- | --- | --- | --- |
| baseline | 9821949fd802 | https://clickhouse-builds.s3.us-east-1.amazonaws.com/REFs/master/9821949fd802368428d4977ded5fecc82f132afe/build_arm_release/clickhouse | — |
| affected | c1ccce21e7d3 | https://clickhouse-builds.s3.us-east-1.amazonaws.com/REFs/master/c1ccce21e7d349a3a3b13a13daffbd42c99aa7c9/build_arm_release/clickhouse | — |
| latest | 895cae1f9d5e | https://clickhouse-builds.s3.us-east-1.amazonaws.com/REFs/master/895cae1f9d5e9535b538a2fe935876b20821cb84/build_arm_release/clickhouse | — |

## Reproduction

Performance test: `tsv_csv_nullable_parsing`  
Query index: `1`  
Metric: `client_time`

SQL:

```sql
SELECT * FROM table_csv FORMAT Null
```

Datasets / inputs:

Not captured.

1. Use an idle ARM/aarch64 host with a similar CPU class if possible; the measurements below were taken on Neoverse-V2.
2. Download the public ARM ClickHouse binaries listed in the build table for the baseline and affected revisions.
3. Load the test data using the normal ClickHouse performance-test data setup for `tsv_csv_nullable_parsing`; no dataset list was captured for this report.
4. Run the SQL above at least 63 times for each revision and compare median `client_time`.
5. A valid reproduction should show the affected revision slower by approximately the measured shift (about +18%), while same-revision reruns stay near zero shift.

A minimal manual loop, after starting each revision as a local server and loading data, is:

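```bash
# The benchmark query from the SQL block above.
QUERY='SELECT * FROM table_csv FORMAT Null'
for i in $(seq 1 63); do
  # --time prints the elapsed time of each run to stderr.
  clickhouse-client --time --query "$QUERY"
done
```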
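
To turn the loop output into the median that step 4 asks for, the per-run timings can be collected and summarized. The following is a minimal Python sketch, not part of any ClickHouse tooling: it assumes each run's elapsed time was captured as one decimal number of seconds per line (for example by redirecting the loop's stderr to a file), and the function names `parse_client_times` and `summarize` are hypothetical.

```python
import statistics


def parse_client_times(text: str) -> list:
    """Return the elapsed-seconds values found in `text`, one per line.

    Assumption: every timing occupies its own line as a plain decimal
    number. Blank lines and non-numeric lines are skipped.
    """
    times = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            times.append(float(line))
        except ValueError:
            # Not a timing line (e.g. a warning message); ignore it.
            continue
    return times


def summarize(times: list) -> dict:
    """Summarize a non-empty list of timings: run count, median, and range."""
    return {
        "runs": len(times),
        "median": statistics.median(times),
        "min": min(times),
        "max": max(times),
    }
```

Running `summarize(parse_client_times(open("times.txt").read()))` once per revision yields the medians to compare. Note that a manual `--time` loop is only an approximation of the `client_time` metric reported by the performance-test framework.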

## Measurements

| Comparison | Builds | Runs | Left median | Right median | Shift | Left range | Right range |
| --- | --- | --- | --- | --- | --- | --- | --- |
| before→after | baseline vs affected | 58 / 58 | 0.015846s | 0.018723s | +18.16% | 0.015366s–0.016506s | 0.017760s–0.019761s |
| before→latest | baseline vs latest | 58 / 58 | 0.015936s | 0.018464s | +15.86% | 0.015500s–0.016979s | 0.017837s–0.019492s |
| before→before | baseline vs baseline | 63 / 63 | 0.015826s | 0.015759s | -0.42% | 0.015350s–0.018299s | 0.015340s–0.016977s |
| after→after | affected vs affected | 55 / 55 | 0.018448s | 0.018322s | -0.68% | 0.017789s–0.019567s | 0.017731s–0.019688s |
| latest→latest | latest vs latest | 54 / 54 | 0.018471s | 0.018488s | +0.09% | 0.017871s–0.019795s | 0.017923s–0.019577s |

Stability checks: before→before -0.42%, after→after -0.68%, latest→latest +0.09%. Same-build comparisons are included above so reviewers can distinguish a regression from benchmark noise.
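
The shift column and the comparison against the same-build checks can be reproduced with a few lines of arithmetic. The sketch below recomputes the shift as `(right - left) / left` from the medians in the table and compares it with the largest same-build shift. The `factor=3.0` threshold is an illustrative assumption, not a criterion used by this report or by ClickHouse CI.

```python
def shift_percent(left_median: float, right_median: float) -> float:
    """Relative change from left to right, in percent."""
    return (right_median - left_median) / left_median * 100.0


def exceeds_noise(shift: float, self_shifts: list, factor: float = 3.0) -> bool:
    """True if |shift| is larger than `factor` times the largest same-build shift.

    `factor` is a hypothetical safety margin chosen for illustration only.
    """
    noise_floor = max(abs(s) for s in self_shifts)
    return abs(shift) > factor * noise_floor


# Medians taken from the measurements table above.
before_after = shift_percent(0.015846, 0.018723)    # about +18.16
before_latest = shift_percent(0.015936, 0.018464)   # about +15.86
self_checks = [-0.42, -0.68, 0.09]

print(round(before_after, 2), exceeds_noise(before_after, self_checks))
print(round(before_latest, 2), exceeds_noise(before_latest, self_checks))
```

With the same-build shifts all below 1% in magnitude, both cross-build shifts sit far outside the measured noise.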

## Approximate introduction window

Controlled reproduction compares baseline revision `9821949fd802` with affected revision `c1ccce21e7d3`. No narrower self-contained localization window was captured.

## Code areas and mechanism clues

- Files changed in the bounded window: `base/base/find_symbols.h`, `src/Common/tests/gtest_find_symbols.cpp`.
- Shortstat for that window: 2 files changed, 439 insertions(+), 14 deletions(-).
- Production files worth checking first: `base/base/find_symbols.h`.
- Static code/query review: the window adds AArch64 NEON code paths to the symbol-search helpers in `base/base/find_symbols.h` (see the patch excerpt under "Fix or validation status"), while the benchmark query reads CSV data with `Nullable` columns, a path that searches for delimiter bytes.
- Suspect area from static review: the new NEON implementation of `find_first_symbols` changes the string-scanning primitive used by the CSV/TSV readers.

These are investigation leads only; the issue should not assign blame to a change without a validating patch or rollback measurement.

- Probe-level client time: baseline 0.066671s → affected 0.066504s (-0.25%).
- Server query duration: baseline 15.0 ms → affected 17.0 ms (+13.33%).

Largest captured ProfileEvents deltas:

| ProfileEvent | Baseline median | Affected median | Delta |
| --- | --- | --- | --- |
| GlobalThreadPoolLockWaitMicroseconds | 8 | — | -100.00% |
| LocalThreadPoolLockWaitMicroseconds | 1 | — | -100.00% |
| OSCPUWaitMicroseconds | 16.0 | — | -100.00% |
| NetworkSendElapsedMicroseconds | 121.0 | 154.0 | +27.27% |
| QueryProfilerRuns | 284.0 | 330.0 | +16.20% |
| DiskReadElapsedMicroseconds | 355.0 | 396.0 | +11.55% |
| LocalThreadPoolThreadCreationMicroseconds | 148.0 | 157.0 | +6.08% |
| OSCPUVirtualTimeMicroseconds | 55,627 | 58,950 | +5.97% |

Largest captured processor elapsed-time deltas:

| Processor | Baseline µs | Affected µs | Delta |
| --- | --- | --- | --- |
| LimitsCheckingTransform | 33.0 | 24.0 | -27.27% |
| File | 40,951 | 48,434 | +18.27% |
| LazyOutputFormat | 235.0 | 257.0 | +9.36% |
| ExpressionTransform | 121.0 | 117.0 | -3.31% |

EXPLAIN/pattern notes: `find_symbols`, `find_first_symbols`, NEON, `SerializationNullable`, `Nullable`, CSV, TSV.

## Fix or validation status

A local source-variant / rollback candidate improved the affected build. This is a measured fix lead, not yet a reviewed upstream patch.

- Patch idea: scoped_source_revert touching `base/base/find_symbols.h`
- Patch size: 12 hunks, 14 added lines, 296 removed lines.
- Validation result: affected/original vs local variant -7.72%; self checks: local-on-affected -0.06%, local-on-latest +1.16%, original-on-affected -0.54% (see the validation table below).

How to recreate the idea for review:

1. Check out the affected revision from the build table.
2. Revert or reimplement only the listed production-file changes from the baseline revision; start with the files above, then narrow to the smallest semantic change that keeps the measured improvement.
3. Build ClickHouse for ARM release and rerun the benchmark above. Accept only if the affected-vs-patched comparison improves and same-build self checks stay near zero.

Validation shifts:

| Comparison | Shift |
| --- | --- |
| local unmodified affected vs local variant | -7.72% |
| self local unmodified affected | -0.54% |
| self local variant affected | -0.06% |
| self local variant latest | +1.16% |

<details><summary>Patch excerpt used for the local validation</summary>

```diff
diff --git a/base/base/find_symbols.h b/base/base/find_symbols.h
index 5159201c4a6..c048a5e01b1 100644
--- a/base/base/find_symbols.h
+++ b/base/base/find_symbols.h
@@ -11,9 +11,6 @@
-#if defined(__aarch64__)
-    #include <arm_neon.h>
-#endif
@@ -126,14 +123,6 @@ inline std::array<__m128i, 16u> mm_is_in_prepare(const char * symbols, size_t nu
-    /// Pad unused slots with a repeat of an actual needle byte so `mm_is_in_execute`
-    /// does not spuriously match `\0` bytes in the haystack via the zero-initialised
-    /// slots. Callers ensure `num_chars >= 1` before reaching the SIMD body.
-    for (size_t i = num_chars; i < 16u; ++i)
-    {
-        result[i] = result[0];
-    }
-
@@ -151,46 +140,6 @@ inline __m128i mm_is_in_execute(__m128i bytes, const std::array<__m128i, 16u> &
-#if defined(__aarch64__)
-/// On AArch64 we use NEON. There is no direct equivalent of pmovmskb, so we
-/// use the well-known shrn-by-4 trick to compress a 16-byte vector of all-0/all-1
-/// bytes into a 64-bit value where each input byte is represented by a 4-bit
-/// nibble. The position of the lowest (or highest) matching byte is then
-/// recovered as `__builtin_ctzll(mask) >> 2` (or `__builtin_clzll(mask) >> 2`).
-template <char s0>
-inline uint8x16_t neon_is_in(uint8x16_t bytes)
-{
-    return vceqq_u8(bytes, vdupq_n_u8(static_cast<uint8_t>(s0)));
-}
-
-template <char s0, char s1, char... tail>
-inline uint8x16_t neon_is_in(uint8x16_t bytes)
-{
-    uint8x16_t eq0 = vceqq_u8(bytes, vdupq_n_u8(static_cast<uint8_t>(s0)));
-    uint8x16_t eq = neon_is_in<s1, tail...>(bytes);
-    return vorrq_u8(eq0, eq);
-}
-
-inline uint8x16_t neon_is_in(uint8x16_t bytes, const char * symbols, size_t num_chars)
-{
-    uint8x16_t accumulator = vdupq_n_u8(0);
-    for (size_t i = 0; i < num_chars; ++i)
-    {
-        uint8x16_t eq = vceqq_u8(bytes, vdupq_n_u8(static_cast<uint8_t>(symbols[i])));
-        accumulator = vorrq_u8(accumulator, eq);
-    }
-
-    return accumulator;
-}
-
-/// Compresses a 16-byte all-0/all-1 vector into a 64-bit value where each input
-/// byte occupies a 4-bit nibble (all bits set if matched, all clear otherwise).
-inline uint64_t neon_to_bitmask(uint8x16_t eq)
-{
-    return vget_lane_u64(vreinterpret_u64_u8(vshrn_n_u16(vreinterpretq_u16_u8(eq), 4)), 0);
-}
-#endif
-
@@ -203,17 +152,6 @@ constexpr uint16_t maybe_negate(uint16_t x)
-#if defined(__aarch64__)
-template <bool positive>
-inline uint8x16_t maybe_negate(uint8x16_t x)
-{
-    if constexpr (positive)
-        return x;
-    else
-        return vmvnq_u8(x);
-}
-#endif
-
@@ -221,51 +159,6 @@ enum class ReturnMode : uint8_t
-#if defined(__aarch64__)
-/// NEON body for long haystacks (>= 16 bytes), kept out-of-line so the inline
-/// dispatcher above stays small enough for the compiler to keep auto-vectorising
-/// the scalar fast path and avoid extra branches in callers that handle one
-/// short string per row (e.g. `trim`).
-template <bool positive, ReturnMode return_mode, char... symbols>
-[[gnu::noinline]] const char * find_first_symbols_neon_long(const char * pos, const char * const end)
-{
-    /// Many callers (CSV/TSV/JSON format readers, URL parsers) find a
-    /// match within the first few bytes of the haystack. In that case the
-    /// per-iteration NEON cost (load + 3-4 vceqq + vorrq + shrn + ctz)
-    /// exceeds what a handful of byte compares would do, so an unguarded
-    /// SIMD body regresses such workloads. An 8-byte scalar pre-check
-    /// covers the common short-distance hit and leaves the SIMD body for
-    /// the sparse case.
-    if (maybe_negate<positive>(is_in<symbols...>(pos[0]))) return pos;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[1]))) return pos + 1;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[2]))) return pos + 2;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[3]))) return pos + 3;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[4]))) return pos + 4;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[5]))) return pos + 5;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[6]))) return pos + 6;
-    if (maybe_negate<positive>(is_in<symbols...>(pos[7]))) return pos + 7;
-    pos += 8;
-
-    for (; pos + 15 < end; pos += 16)
-    {
-        uint8x16_t bytes = vld1q_u8(reinterpret_cast<const uint8_t *>(pos));
-
-        uint8x16_t eq = maybe_negate<positive>(neon_is_in<symbols...>(bytes));
-
-        uint64_t bit_mask = neon_to_bitmask(eq);
-        if (bit_mask)
-            return pos + (__builtin_ctzll(bit_mask) >> 2);
-    }
-
-    for (; pos < end; ++pos)
-        if (maybe_negate<positive>(is_in<symbols...>(*pos)))
-            return pos;
-
-    return return_mode == ReturnMode::End ? end : nullptr;
-}
-#endif
-
-
@@ -282,12 +175,6 @@ inline const char * find_first_symbols_sse2(const char * const begin, const char
-#elif defined(__aarch64__)
-    /// Short haystacks (< 16 bytes) dominate many callers (e.g. `trim` on
-    /// `toString(number)`). Tail-call the out-of-line NEON helper so the
-    /// inline body collapses to the original scalar loop on master.
-    if (end - begin >= 16) [[unlikely]]
-        return find_first_symbols_neon_long<positive, return_mode, symbols...>(begin, end);
@@ -297,75 +184,22 @@ inline const char * find_first_symbols_sse2(const char * const begin, const char
-#if defined(__aarch64__)
-/// Runtime-needle NEON body for long haystacks. Always returns either a
-/// matching pointer or `end` (treated by the caller as "no match"). Kept
-/// out-of-line so short-haystack callers stay free of the SIMD prologue
-/// (no stack frame, no extra branches).
-template <bool positive>
-[[gnu::noinline]] const char * find_first_symbols_neon_long_rt(const char * pos, const char * const end, const char * symbols, size_t num_chars)
@@ -377,44 +211,6 @@ inline const char * find_first_symbols_sse2(const char * const begin, const char
@@ -431,14 +227,6 @@ inline const char * find_last_symbols_sse2(const char * const begin, const char
@@ -449,72 +237,22 @@ inline const char * find_last_symbols_sse2(const char * const begin, const char
@@ -630,18 +368,6 @@ inline const char * find_first_symbols_dispatch(const char * begin, const char *
@@ -659,14 +385,6 @@ inline const char * find_last_symbols_dispatch(const char * begin, const char *
... (280 additional diff lines omitted in this issue draft; recreate by restoring the listed files at the affected revision to their baseline contents, as described above)
```

</details>
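
The comments in the excerpt describe compressing a 16-byte match vector into a 64-bit value with one nibble per input byte, then recovering the first match as `ctz(mask) >> 2`. The sketch below is a scalar Python model of that trick, written only to make the bit layout easy to check. It is not the NEON code from the patch, and the function names are hypothetical.

```python
from typing import Optional


def neon_to_bitmask_model(eq: bytes) -> int:
    """Model of narrowing eight u16 lanes by a right shift of 4.

    `eq` holds 16 bytes, each 0x00 (no match) or 0xFF (match). Adjacent
    byte pairs form little-endian u16 lanes; each lane is shifted right
    by 4 and truncated to 8 bits, so input byte i ends up as nibble i.
    """
    assert len(eq) == 16
    mask = 0
    for lane in range(8):
        u16 = eq[2 * lane] | (eq[2 * lane + 1] << 8)
        narrowed = (u16 >> 4) & 0xFF
        mask |= narrowed << (8 * lane)
    return mask


def first_match_index(chunk: bytes, symbols: bytes) -> Optional[int]:
    """Index of the first byte of a 16-byte chunk that is in `symbols`, or None."""
    eq = bytes(0xFF if b in symbols else 0x00 for b in chunk)
    mask = neon_to_bitmask_model(eq)
    if mask == 0:
        return None
    # Count trailing zeros: isolate the lowest set bit, then take its position.
    ctz = (mask & -mask).bit_length() - 1
    return ctz >> 2


print(first_match_index(b"abc,def\nghijklmn", b",\n"))  # first delimiter is at index 3
```

Because every byte maps to four mask bits, dividing the trailing-zero count by four (the `>> 2`) gives the byte position directly.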

### Which ClickHouse versions are affected?

latest

### How to reproduce

See the "Reproduction" section under "Describe the situation" above; the steps are the same.

### Expected performance

_No response_

### Related issues and pull requests

_No response_

### Additional context

_No response_

ARM performance regression: `tsv_csv_nullable_parsing` query `1` is slower #106605


