
Faster memchr, memchr2 and memchr3 in generic version #151

Merged
merged 1 commit into BurntSushi:master from the index-of-byte branch on Jun 13, 2024

Conversation

stepancheg
Contributor

The current generic ("all") implementation checks whether a chunk (usize) contains a zero byte, and if it does, iterates over the bytes of that chunk to find the index of the zero byte. Instead, we can use a few more bit operations to find the index without a loop.
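
A minimal sketch of the kind of SWAR trick described above (not necessarily the PR's exact code): the `lowest_zero_byte` name mirrors a helper visible later in this thread, and the sketch assumes the chunk is interpreted in little endian byte order.

```
/// Index of the lowest-addressed zero byte in `x`, if any. Sketch only:
/// assumes little endian, so the lowest-addressed byte is least significant.
fn lowest_zero_byte(x: usize) -> Option<usize> {
    const LO: usize = usize::MAX / 0xFF; // 0x0101...01
    const HI: usize = LO << 7;           // 0x8080...80
    // Classic "has zero byte" trick. Bytes below the first zero byte of `x`
    // are never flagged in `mask`; bytes above it may be flagged spuriously,
    // which does not matter because we only take the lowest set bit.
    let mask = x.wrapping_sub(LO) & !x & HI;
    if mask == 0 {
        None
    } else {
        Some(mask.trailing_zeros() as usize / 8)
    }
}
```

Searching for an arbitrary byte reduces to this by XOR-ing the chunk with that byte repeated in every byte position, which turns matching bytes into zero bytes.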

Context: we use memchr, but many of our strings are short. Currently, the SIMD-optimized memchr processes bytes one by one when the string is shorter than a SIMD register. I suspect it can be made faster if we take a string that does not fit into a SIMD register in usize-sized chunks and process them with such a utility, similar to how the AVX2 implementation falls back to SSE2. So I looked for a generic implementation to reuse in the SIMD-optimized version, but there was none. So here it is.

Any suggestion on how to check whether this PR makes it faster? In any case, it should not be slower.

@stepancheg stepancheg force-pushed the index-of-byte branch 4 times, most recently from 3f70ca9 to 08f82c9 on June 12, 2024 01:52
@BurntSushi
Owner

Any suggestion on how to check whether this PR makes it faster? In any case, it should not be slower.

You'll want to use rebar. From the root of this repo on master, build the fallback benchmark engine and get a baseline:

$ rebar build -e '^rust/memchr/memchr/fallback$'
$ rebar measure -e '^rust/memchr/memchr/fallback$' | tee before.csv

Then check out your PR branch, rebuild the benchmark engine and measure your change:

$ rebar build -e '^rust/memchr/memchr/fallback$'
$ rebar measure -e '^rust/memchr/memchr/fallback$' | tee after.csv

Then you can diff the results:

$ rebar diff before.csv after.csv
benchmark                          engine                       before.csv           after.csv
---------                          ------                       ----------           ---------
memchr/sherlock/common/huge1       rust/memchr/memchr/fallback  1189.8 MB/s (1.96x)  2.3 GB/s (1.00x)
memchr/sherlock/common/small1      rust/memchr/memchr/fallback  4.4 GB/s (1.00x)     4.0 GB/s (1.09x)
memchr/sherlock/common/tiny1       rust/memchr/memchr/fallback  1880.1 MB/s (1.00x)  1462.3 MB/s (1.29x)
memchr/sherlock/never/huge1        rust/memchr/memchr/fallback  24.3 GB/s (1.06x)    25.8 GB/s (1.00x)
memchr/sherlock/never/small1       rust/memchr/memchr/fallback  18.7 GB/s (1.00x)    18.2 GB/s (1.03x)
memchr/sherlock/never/tiny1        rust/memchr/memchr/fallback  3.8 GB/s (1.00x)     3.8 GB/s (1.00x)
memchr/sherlock/never/empty1       rust/memchr/memchr/fallback  12.00ns (1.00x)      12.00ns (1.00x)
memchr/sherlock/rare/huge1         rust/memchr/memchr/fallback  24.5 GB/s (1.00x)    24.2 GB/s (1.01x)
memchr/sherlock/rare/small1        rust/memchr/memchr/fallback  17.7 GB/s (1.00x)    17.7 GB/s (1.00x)
memchr/sherlock/rare/tiny1         rust/memchr/memchr/fallback  3.6 GB/s (1.00x)     3.2 GB/s (1.11x)
memchr/sherlock/uncommon/huge1     rust/memchr/memchr/fallback  6.5 GB/s (1.94x)     12.6 GB/s (1.00x)
memchr/sherlock/uncommon/small1    rust/memchr/memchr/fallback  12.1 GB/s (1.00x)    11.7 GB/s (1.04x)
memchr/sherlock/uncommon/tiny1     rust/memchr/memchr/fallback  2.9 GB/s (1.00x)     2.3 GB/s (1.27x)
memchr/sherlock/verycommon/huge1   rust/memchr/memchr/fallback  695.0 MB/s (1.87x)   1300.7 MB/s (1.00x)
memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  2.1 GB/s (1.00x)     1985.1 MB/s (1.09x)

Add a -t 1.1 to only show results with a 1.1x difference or bigger:

$ rebar diff before.csv after.csv -t 1.1
benchmark                         engine                       before.csv           after.csv
---------                         ------                       ----------           ---------
memchr/sherlock/common/huge1      rust/memchr/memchr/fallback  1189.8 MB/s (1.96x)  2.3 GB/s (1.00x)
memchr/sherlock/common/tiny1      rust/memchr/memchr/fallback  1880.1 MB/s (1.00x)  1462.3 MB/s (1.29x)
memchr/sherlock/rare/tiny1        rust/memchr/memchr/fallback  3.6 GB/s (1.00x)     3.2 GB/s (1.11x)
memchr/sherlock/uncommon/huge1    rust/memchr/memchr/fallback  6.5 GB/s (1.94x)     12.6 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1    rust/memchr/memchr/fallback  2.9 GB/s (1.00x)     2.3 GB/s (1.27x)
memchr/sherlock/verycommon/huge1  rust/memchr/memchr/fallback  695.0 MB/s (1.87x)   1300.7 MB/s (1.00x)

So I see a big improvement in memchr/sherlock/common/huge1, memchr/sherlock/uncommon/huge1 and memchr/sherlock/verycommon/huge1, which I think makes sense, since those benchmarks test the case of "lots of matches" (to varying degrees) on a very large haystack.

But it looks like there's a smaller regression on memchr/sherlock/common/tiny1, memchr/sherlock/uncommon/tiny1 and memchr/sherlock/rare/tiny1. In these benchmarks, the haystack is quite small, so the benchmark tends to be dominated by constant factors (like setup costs). That might be worth looking into.

@BurntSushi
Owner

I think this generally LGTM with one little nit. I do think the regressions warrant some investigation. I don't know when I'll have time to do it, but if you wanted to do an analysis, I'd be happy to read it. I'm inclined to accept the regression given the throughput gains, but it would be nice to spend at least some time trying to eliminate the regression (or reduce it).

src/arch/all/memchr.rs (review comment, outdated and resolved)
@stepancheg
Contributor Author

stepancheg commented Jun 13, 2024

Here are my guesses.

I ran the benchmarks several times on my Apple M1 laptop, with binaries properly built for arm64.

benchmark                          engine                       before2.csv           after2.csv
---------                          ------                       -----------           ----------
memchr/sherlock/common/huge1       rust/memchr/memchr/fallback  1032.0 MB/s (1.67x)   1719.7 MB/s (1.00x)
memchr/sherlock/common/small1      rust/memchr/memchr/fallback  3.0 GB/s (1.25x)      3.7 GB/s (1.00x)
memchr/sherlock/uncommon/huge1     rust/memchr/memchr/fallback  5.0 GB/s (1.54x)      7.7 GB/s (1.00x)
memchr/sherlock/uncommon/small1    rust/memchr/memchr/fallback  7.5 GB/s (1.98x)      14.7 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     rust/memchr/memchr/fallback  1605.0 MB/s (41.00x)  64.3 GB/s (1.00x)
memchr/sherlock/verycommon/huge1   rust/memchr/memchr/fallback  611.3 MB/s (1.62x)    988.0 MB/s (1.00x)
memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  1895.9 MB/s (1.00x)   1518.6 MB/s (1.25x)

It shows a consistent regression on this benchmark:

memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  1895.9 MB/s (1.00x)   1518.6 MB/s (1.25x)

which makes sense: it searches for a space, and the text is:

Mr. Sherlock Holmes, who was usually very late in the mornings, save
   ^        ^       ^   ^   ^       ^    ^    ^  ^             ^

so it often processes eight bytes at once with shifts and such instead of finding the character on the fourth iteration.

About your results:

  • memchr/sherlock/common/tiny1 — searches for a space; same issue
  • memchr/sherlock/rare/tiny1 — searches for . in a text which has only one .; if that regression is consistent, I cannot explain it
  • memchr/sherlock/uncommon/tiny1 — searches for l, likely the same issue as with the space:
Mr. Sherlock Holmes, who was usually very late in the mornings, save
        ^      ^                 ^^       ^

@stepancheg stepancheg force-pushed the index-of-byte branch 3 times, most recently from 9d0ab64 to 885bbc0 on June 13, 2024 10:05
@stepancheg
Contributor Author

Updated the comment above. Also updated the PR to do an unaligned read at the end of find/rfind instead of iterating byte-by-byte.
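
A hedged sketch of that idea (not the PR's actual code; `memchr_tail` and its shape are invented for illustration): read one final, possibly overlapping `usize` chunk that ends exactly at the end of the haystack and reuse the same zero-byte trick on it.

```
use core::convert::TryInto;

/// Illustrative only: search the last `size_of::<usize>()` bytes of
/// `haystack` for `needle` with one unaligned, possibly overlapping read
/// instead of a byte-by-byte loop.
fn memchr_tail(needle: u8, haystack: &[u8]) -> Option<usize> {
    const USIZE_BYTES: usize = core::mem::size_of::<usize>();
    if haystack.len() < USIZE_BYTES {
        // Too short for even one chunk; fall back to a simple scan.
        return haystack.iter().position(|&b| b == needle);
    }
    // The final chunk may overlap bytes already examined by the main loop;
    // re-checking them is harmless and avoids a scalar tail.
    let start = haystack.len() - USIZE_BYTES;
    let chunk = usize::from_le_bytes(haystack[start..].try_into().unwrap());
    // XOR turns bytes equal to `needle` into zero bytes, then the same SWAR
    // zero-byte trick applies. `from_le_bytes` keeps "lowest address ==
    // least significant" on any target.
    let x = chunk ^ (usize::from(needle) * (usize::MAX / 0xFF));
    let mask = x.wrapping_sub(usize::MAX / 0xFF) & !x & ((usize::MAX / 0xFF) << 7);
    if mask == 0 {
        None
    } else {
        Some(start + mask.trailing_zeros() as usize / 8)
    }
}
```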

The current generic ("all") implementation checks whether a chunk
(`usize`) contains a zero byte, and if it does, iterates over the
bytes of that chunk to find the index of the zero byte. Instead, we
can use a few more bit operations to find the index without a loop.

Context: we use `memchr`, but many of our strings are short.
Currently, the SIMD-optimized `memchr` processes bytes one by one
when the string is shorter than a SIMD register. I suspect it can
be made faster if we take a string that does not fit into a SIMD
register in `usize`-sized chunks and process them with such a
utility, similar to how the AVX2 implementation falls back to SSE2.
So I looked for a generic implementation to reuse in the
SIMD-optimized version, but there was none. So here it is.
@BurntSushi
Owner

With the updated PR, on x86-64:

$ rebar diff before.csv after.csv -t 1.1
benchmark                         engine                       before.csv           after.csv
---------                         ------                       ----------           ---------
memchr/sherlock/common/huge1      rust/memchr/memchr/fallback  1187.8 MB/s (1.99x)  2.3 GB/s (1.00x)
memchr/sherlock/common/tiny1      rust/memchr/memchr/fallback  1880.1 MB/s (1.00x)  1495.5 MB/s (1.26x)
memchr/sherlock/never/tiny1       rust/memchr/memchr/fallback  3.8 GB/s (1.21x)     4.6 GB/s (1.00x)
memchr/sherlock/never/empty1      rust/memchr/memchr/fallback  12.00ns (1.00x)      18.00ns (1.50x)
memchr/sherlock/rare/tiny1        rust/memchr/memchr/fallback  3.6 GB/s (1.20x)     4.3 GB/s (1.00x)
memchr/sherlock/uncommon/huge1    rust/memchr/memchr/fallback  6.5 GB/s (2.19x)     14.2 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1    rust/memchr/memchr/fallback  3.1 GB/s (1.00x)     2.4 GB/s (1.29x)
memchr/sherlock/verycommon/huge1  rust/memchr/memchr/fallback  691.6 MB/s (1.88x)   1302.9 MB/s (1.00x)

And on my M2 mac mini:

$ rebar diff before.csv after.csv -t 1.1
benchmark                          engine                       before.csv            after.csv
---------                          ------                       ----------            ---------
memchr/sherlock/common/huge1       rust/memchr/memchr/fallback  1170.6 MB/s (1.61x)   1887.8 MB/s (1.00x)
memchr/sherlock/common/tiny1       rust/memchr/memchr/fallback  1605.0 MB/s (41.00x)  64.3 GB/s (1.00x)
memchr/sherlock/never/huge1        rust/memchr/memchr/fallback  25.3 GB/s (1.00x)     23.0 GB/s (1.10x)
memchr/sherlock/uncommon/huge1     rust/memchr/memchr/fallback  6.0 GB/s (1.35x)      8.1 GB/s (1.00x)
memchr/sherlock/uncommon/small1    rust/memchr/memchr/fallback  7.5 GB/s (1.98x)      14.7 GB/s (1.00x)
memchr/sherlock/verycommon/huge1   rust/memchr/memchr/fallback  692.4 MB/s (1.58x)    1096.0 MB/s (1.00x)
memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  1901.6 MB/s (1.00x)   1688.6 MB/s (1.13x)

I think this is on balance a good change. I'm still a little worried about the regressions, but I generally side toward throughput gains.

Also updated the PR to do an unaligned read at the end of find/rfind instead of iterating byte-by-byte.

Nice, thank you!

@BurntSushi BurntSushi merged commit 345fab7 into BurntSushi:master Jun 13, 2024
17 checks passed
@BurntSushi
Owner

This PR is on crates.io in memchr 2.7.3.

@CryZe

CryZe commented Jun 13, 2024

Looks like this broke all big endian targets. #152

BurntSushi added a commit that referenced this pull request Jun 14, 2024
Unbelievably, some of the steps in the main CI configuration were using
the hard-coded `cargo` instead of `${{ env.CARGO }}`. The latter will
interpolate to `cross` for targets requiring cross compilation. And
since all big endian targets are only tested via cross compilation, our
CI was not testing big endian at all (beyond testing that compilation
succeeded).

This led to a [performance optimization] [breaking big endian] targets.

[performance optimization]: #151
[breaking big endian]: #152
BurntSushi added a commit that referenced this pull request Jun 14, 2024
This reverts #151 because it broke big endian targets. CI didn't catch
it because of a misconfiguration that resulted in tests being skipped
for any target that required cross compilation.

This reverts commit 345fab7.

Fixes #152
@BurntSushi
Owner

This PR was reverted in #153, and the revert was released as memchr 2.7.4 on crates.io.

stepancheg added a commit to stepancheg/memchr that referenced this pull request Jun 14, 2024
Resubmit of PR BurntSushi#151

That PR was reverted because it broke the big endian implementation
and CI did not catch it (see the revert PR BurntSushi#153 for details).

Andrew, thank you for the new test cases, which made it easier to fix
the issue.

The fix is:

```
--- a/src/arch/all/memchr.rs
+++ b/src/arch/all/memchr.rs
@@ -1019,7 +1019,7 @@ fn find_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         lowest_zero_byte(x)
     } else {
-        highest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - highest_zero_byte(x)?)
     }
 }

@@ -1028,7 +1028,7 @@ fn rfind_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         highest_zero_byte(x)
     } else {
-        lowest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - lowest_zero_byte(x)?)
     }
 }
```

Original description:

The current generic ("all") implementation checks whether a chunk
(`usize`) contains a zero byte, and if it does, iterates over the
bytes of that chunk to find the index of the zero byte. Instead, we
can use a few more bit operations to find the index without a loop.

Context: we use `memchr`, but many of our strings are short.
Currently, the SIMD-optimized `memchr` processes bytes one by one
when the string is shorter than a SIMD register. I suspect it can
be made faster if we take a string that does not fit into a SIMD
register in `usize`-sized chunks and process them with such a
utility, similar to how the AVX2 implementation falls back to SSE2.
So I looked for a generic implementation to reuse in the
SIMD-optimized version, but there was none. So here it is.
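
A small hypothetical test (not part of the PR) illustrating why the big endian branches need the `USIZE_BYTES - 1 - ...` correction: the zero-byte helpers report a byte's position by significance within the word, which matches memory order only on little endian.

```
#[test]
fn zero_byte_index_depends_on_endianness() {
    // Memory index 1 holds the zero byte.
    let bytes = [1u8, 0, 2, 3, 4, 5, 6, 7];
    let x = u64::from_ne_bytes(bytes);
    if cfg!(target_endian = "little") {
        // Little endian: memory index 1 is the second least significant
        // byte, i.e. bits 8..16.
        assert_eq!((x >> 8) & 0xFF, 0);
    } else {
        // Big endian: memory index 1 is the second most significant byte,
        // i.e. bits 48..56; hence the index flip in the fix above.
        assert_eq!((x >> 48) & 0xFF, 0);
    }
}
```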
@stepancheg
Contributor Author

Resubmitted as #154.
