
Faster memchr, memchr2 and memchr3 in generic version #151

Merged
merged 1 commit into BurntSushi:master from the index-of-byte branch on Jun 13, 2024

Conversation

stepancheg
Contributor

The current generic ("all") implementation checks whether a chunk (usize) contains a zero byte, and if it does, iterates over the bytes of that chunk to find the index of the zero byte. Instead, we can use a few more bit operations to find the index without a loop.
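
A minimal sketch of the kind of SWAR trick described above (not necessarily the PR's exact code): the `lowest_zero_byte` name mirrors a helper visible later in this thread, and the sketch assumes the chunk is interpreted in little endian byte order.

```
/// Index of the lowest-addressed zero byte in `x`, if any. Sketch only:
/// assumes little endian, so the lowest-addressed byte is least significant.
fn lowest_zero_byte(x: usize) -> Option<usize> {
    const LO: usize = usize::MAX / 0xFF; // 0x0101...01
    const HI: usize = LO << 7;           // 0x8080...80
    // Classic "has zero byte" trick. Bytes below the first zero byte of `x`
    // are never flagged in `mask`; bytes above it may be flagged spuriously,
    // which does not matter because we only take the lowest set bit.
    let mask = x.wrapping_sub(LO) & !x & HI;
    if mask == 0 {
        None
    } else {
        Some(mask.trailing_zeros() as usize / 8)
    }
}
```

Searching for an arbitrary byte reduces to this by XOR-ing the chunk with that byte repeated in every byte position, which turns matching bytes into zero bytes.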

Context: we use memchr, but many of our strings are short. Currently, the SIMD-optimized memchr processes bytes one by one when the string is shorter than a SIMD register. I suspect it can be made faster if we take a string that does not fit into a SIMD register in usize-sized chunks and process them with such a utility, similar to how the AVX2 implementation falls back to SSE2. So I looked for a generic implementation to reuse in the SIMD-optimized version, but there was none. So here it is.

Any suggestion on how to check whether this PR makes it faster? In any case, it should not be slower.

@stepancheg stepancheg force-pushed the index-of-byte branch 4 times, most recently from 3f70ca9 to 08f82c9 on June 12, 2024 01:52
@BurntSushi
Owner

Any suggestion on how to check whether this PR makes it faster? In any case, it should not be slower.

You'll want to use rebar. From the root of this repo on master, build the fallback benchmark engine and get a baseline:

$ rebar build -e '^rust/memchr/memchr/fallback$'
$ rebar measure -e '^rust/memchr/memchr/fallback$' | tee before.csv

Then check out your PR branch, rebuild the benchmark engine and measure your change:

$ rebar build -e '^rust/memchr/memchr/fallback$'
$ rebar measure -e '^rust/memchr/memchr/fallback$' | tee after.csv

Then you can diff the results:

$ rebar diff before.csv after.csv
benchmark                          engine                       before.csv           after.csv
---------                          ------                       ----------           ---------
memchr/sherlock/common/huge1       rust/memchr/memchr/fallback  1189.8 MB/s (1.96x)  2.3 GB/s (1.00x)
memchr/sherlock/common/small1      rust/memchr/memchr/fallback  4.4 GB/s (1.00x)     4.0 GB/s (1.09x)
memchr/sherlock/common/tiny1       rust/memchr/memchr/fallback  1880.1 MB/s (1.00x)  1462.3 MB/s (1.29x)
memchr/sherlock/never/huge1        rust/memchr/memchr/fallback  24.3 GB/s (1.06x)    25.8 GB/s (1.00x)
memchr/sherlock/never/small1       rust/memchr/memchr/fallback  18.7 GB/s (1.00x)    18.2 GB/s (1.03x)
memchr/sherlock/never/tiny1        rust/memchr/memchr/fallback  3.8 GB/s (1.00x)     3.8 GB/s (1.00x)
memchr/sherlock/never/empty1       rust/memchr/memchr/fallback  12.00ns (1.00x)      12.00ns (1.00x)
memchr/sherlock/rare/huge1         rust/memchr/memchr/fallback  24.5 GB/s (1.00x)    24.2 GB/s (1.01x)
memchr/sherlock/rare/small1        rust/memchr/memchr/fallback  17.7 GB/s (1.00x)    17.7 GB/s (1.00x)
memchr/sherlock/rare/tiny1         rust/memchr/memchr/fallback  3.6 GB/s (1.00x)     3.2 GB/s (1.11x)
memchr/sherlock/uncommon/huge1     rust/memchr/memchr/fallback  6.5 GB/s (1.94x)     12.6 GB/s (1.00x)
memchr/sherlock/uncommon/small1    rust/memchr/memchr/fallback  12.1 GB/s (1.00x)    11.7 GB/s (1.04x)
memchr/sherlock/uncommon/tiny1     rust/memchr/memchr/fallback  2.9 GB/s (1.00x)     2.3 GB/s (1.27x)
memchr/sherlock/verycommon/huge1   rust/memchr/memchr/fallback  695.0 MB/s (1.87x)   1300.7 MB/s (1.00x)
memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  2.1 GB/s (1.00x)     1985.1 MB/s (1.09x)

Add a -t 1.1 to only show results with a 1.1x difference or bigger:

$ rebar diff before.csv after.csv -t 1.1
benchmark                         engine                       before.csv           after.csv
---------                         ------                       ----------           ---------
memchr/sherlock/common/huge1      rust/memchr/memchr/fallback  1189.8 MB/s (1.96x)  2.3 GB/s (1.00x)
memchr/sherlock/common/tiny1      rust/memchr/memchr/fallback  1880.1 MB/s (1.00x)  1462.3 MB/s (1.29x)
memchr/sherlock/rare/tiny1        rust/memchr/memchr/fallback  3.6 GB/s (1.00x)     3.2 GB/s (1.11x)
memchr/sherlock/uncommon/huge1    rust/memchr/memchr/fallback  6.5 GB/s (1.94x)     12.6 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1    rust/memchr/memchr/fallback  2.9 GB/s (1.00x)     2.3 GB/s (1.27x)
memchr/sherlock/verycommon/huge1  rust/memchr/memchr/fallback  695.0 MB/s (1.87x)   1300.7 MB/s (1.00x)

So I see a big improvement in memchr/sherlock/common/huge1, memchr/sherlock/uncommon/huge1 and memchr/sherlock/verycommon/huge1, which I think makes sense, since those benchmarks test the case of "lots of matches" (to varying degrees) on a very large haystack.

But it looks like there's a smaller regression on memchr/sherlock/common/tiny1, memchr/sherlock/uncommon/tiny1 and memchr/sherlock/rare/tiny1. In these benchmarks, the haystack is quite small, so the benchmark tends to be dominated by constant factors (like setup costs). That might be worth looking into.

@BurntSushi
Owner

I think this generally LGTM with one little nit. I do think the regressions warrant some investigation. I don't know when I'll have time to do it, but if you wanted to do an analysis, I'd be happy to read it. I'm inclined to accept the regression given the throughput gains, but it would be nice to spend at least some time trying to eliminate the regression (or reduce it).

src/arch/all/memchr.rs (review comment, outdated and resolved)
@stepancheg
Contributor Author

stepancheg commented Jun 13, 2024

Here are my guesses.

I ran the benchmarks several times on my Apple M1 laptop, with binaries properly built for arm64.

benchmark                          engine                       before2.csv           after2.csv
---------                          ------                       -----------           ----------
memchr/sherlock/common/huge1       rust/memchr/memchr/fallback  1032.0 MB/s (1.67x)   1719.7 MB/s (1.00x)
memchr/sherlock/common/small1      rust/memchr/memchr/fallback  3.0 GB/s (1.25x)      3.7 GB/s (1.00x)
memchr/sherlock/uncommon/huge1     rust/memchr/memchr/fallback  5.0 GB/s (1.54x)      7.7 GB/s (1.00x)
memchr/sherlock/uncommon/small1    rust/memchr/memchr/fallback  7.5 GB/s (1.98x)      14.7 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1     rust/memchr/memchr/fallback  1605.0 MB/s (41.00x)  64.3 GB/s (1.00x)
memchr/sherlock/verycommon/huge1   rust/memchr/memchr/fallback  611.3 MB/s (1.62x)    988.0 MB/s (1.00x)
memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  1895.9 MB/s (1.00x)   1518.6 MB/s (1.25x)

It shows a consistent regression on this benchmark:

memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  1895.9 MB/s (1.00x)   1518.6 MB/s (1.25x)

which makes sense: it searches for a space, and the text is:

Mr. Sherlock Holmes, who was usually very late in the mornings, save
   ^        ^       ^   ^   ^       ^    ^    ^  ^             ^

so it often processes eight bytes at once with shifts and such instead of finding the character on the fourth iteration.

About your results:

  • memchr/sherlock/common/tiny1 — searches for a space; same issue
  • memchr/sherlock/rare/tiny1 — searches for . in a text which has only one .; if that regression is consistent, I cannot explain it
  • memchr/sherlock/uncommon/tiny1 — searches for l, likely the same issue as with the space:
Mr. Sherlock Holmes, who was usually very late in the mornings, save
        ^      ^                 ^^       ^

@stepancheg stepancheg force-pushed the index-of-byte branch 3 times, most recently from 9d0ab64 to 885bbc0 on June 13, 2024 10:05
@stepancheg
Contributor Author

Updated the comment above. Also updated the PR to do an unaligned read at the end of find/rfind instead of iterating byte-by-byte.
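
A hedged sketch of that idea (not the PR's actual code; `memchr_tail` and its shape are invented for illustration): read one final, possibly overlapping `usize` chunk that ends exactly at the end of the haystack and reuse the same zero-byte trick on it.

```
use core::convert::TryInto;

/// Illustrative only: search the last `size_of::<usize>()` bytes of
/// `haystack` for `needle` with one unaligned, possibly overlapping read
/// instead of a byte-by-byte loop.
fn memchr_tail(needle: u8, haystack: &[u8]) -> Option<usize> {
    const USIZE_BYTES: usize = core::mem::size_of::<usize>();
    if haystack.len() < USIZE_BYTES {
        // Too short for even one chunk; fall back to a simple scan.
        return haystack.iter().position(|&b| b == needle);
    }
    // The final chunk may overlap bytes already examined by the main loop;
    // re-checking them is harmless and avoids a scalar tail.
    let start = haystack.len() - USIZE_BYTES;
    let chunk = usize::from_le_bytes(haystack[start..].try_into().unwrap());
    // XOR turns bytes equal to `needle` into zero bytes, then the same SWAR
    // zero-byte trick applies. `from_le_bytes` keeps "lowest address ==
    // least significant" on any target.
    let x = chunk ^ (usize::from(needle) * (usize::MAX / 0xFF));
    let mask = x.wrapping_sub(usize::MAX / 0xFF) & !x & ((usize::MAX / 0xFF) << 7);
    if mask == 0 {
        None
    } else {
        Some(start + mask.trailing_zeros() as usize / 8)
    }
}
```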

The current generic ("all") implementation checks whether a chunk
(`usize`) contains a zero byte, and if it does, iterates over the
bytes of that chunk to find the index of the zero byte. Instead, we
can use a few more bit operations to find the index without a loop.

Context: we use `memchr`, but many of our strings are short.
Currently, the SIMD-optimized `memchr` processes bytes one by one
when the string is shorter than a SIMD register. I suspect it can
be made faster if we take a string that does not fit into a SIMD
register in `usize`-sized chunks and process them with such a
utility, similar to how the AVX2 implementation falls back to SSE2.
So I looked for a generic implementation to reuse in the
SIMD-optimized version, but there was none. So here it is.
@BurntSushi
Owner

With the updated PR, on x86-64:

$ rebar diff before.csv after.csv -t 1.1
benchmark                         engine                       before.csv           after.csv
---------                         ------                       ----------           ---------
memchr/sherlock/common/huge1      rust/memchr/memchr/fallback  1187.8 MB/s (1.99x)  2.3 GB/s (1.00x)
memchr/sherlock/common/tiny1      rust/memchr/memchr/fallback  1880.1 MB/s (1.00x)  1495.5 MB/s (1.26x)
memchr/sherlock/never/tiny1       rust/memchr/memchr/fallback  3.8 GB/s (1.21x)     4.6 GB/s (1.00x)
memchr/sherlock/never/empty1      rust/memchr/memchr/fallback  12.00ns (1.00x)      18.00ns (1.50x)
memchr/sherlock/rare/tiny1        rust/memchr/memchr/fallback  3.6 GB/s (1.20x)     4.3 GB/s (1.00x)
memchr/sherlock/uncommon/huge1    rust/memchr/memchr/fallback  6.5 GB/s (2.19x)     14.2 GB/s (1.00x)
memchr/sherlock/uncommon/tiny1    rust/memchr/memchr/fallback  3.1 GB/s (1.00x)     2.4 GB/s (1.29x)
memchr/sherlock/verycommon/huge1  rust/memchr/memchr/fallback  691.6 MB/s (1.88x)   1302.9 MB/s (1.00x)

And on my M2 mac mini:

$ rebar diff before.csv after.csv -t 1.1
benchmark                          engine                       before.csv            after.csv
---------                          ------                       ----------            ---------
memchr/sherlock/common/huge1       rust/memchr/memchr/fallback  1170.6 MB/s (1.61x)   1887.8 MB/s (1.00x)
memchr/sherlock/common/tiny1       rust/memchr/memchr/fallback  1605.0 MB/s (41.00x)  64.3 GB/s (1.00x)
memchr/sherlock/never/huge1        rust/memchr/memchr/fallback  25.3 GB/s (1.00x)     23.0 GB/s (1.10x)
memchr/sherlock/uncommon/huge1     rust/memchr/memchr/fallback  6.0 GB/s (1.35x)      8.1 GB/s (1.00x)
memchr/sherlock/uncommon/small1    rust/memchr/memchr/fallback  7.5 GB/s (1.98x)      14.7 GB/s (1.00x)
memchr/sherlock/verycommon/huge1   rust/memchr/memchr/fallback  692.4 MB/s (1.58x)    1096.0 MB/s (1.00x)
memchr/sherlock/verycommon/small1  rust/memchr/memchr/fallback  1901.6 MB/s (1.00x)   1688.6 MB/s (1.13x)

I think this is on balance a good change. I'm still a little worried about the regressions, but I generally side toward throughput gains.

Also updated the PR to do an unaligned read at the end of find/rfind instead of iterating byte-by-byte.

Nice, thank you!

@BurntSushi BurntSushi merged commit 345fab7 into BurntSushi:master Jun 13, 2024
17 checks passed
@BurntSushi
Owner

This PR is on crates.io in memchr 2.7.3.

@CryZe

CryZe commented Jun 13, 2024

Looks like this broke all big endian targets. #152

BurntSushi added a commit that referenced this pull request Jun 14, 2024
Unbelievably, some of the steps in the main CI configuration were using
the hard-coded `cargo` instead of `${{ env.CARGO }}`. The latter will
interpolate to `cross` for targets requiring cross compilation. And
since all big endian targets are only tested via cross compilation, our
CI was not testing big endian at all (beyond testing that compilation
succeeded).

This led to a [performance optimization] [breaking big endian] targets.

[performance optimization]: #151
[breaking big endian]: #152
BurntSushi added a commit that referenced this pull request Jun 14, 2024
This reverts #151 because it broke big endian targets. CI didn't catch
it because of a misconfiguration that resulted in tests being skipped
for any target that required cross compilation.

This reverts commit 345fab7.

Fixes #152
@BurntSushi
Owner

This PR was reverted in #153, and the revert was released as memchr 2.7.4 on crates.io.

stepancheg added a commit to stepancheg/memchr that referenced this pull request Jun 14, 2024
Resubmit of PR BurntSushi#151

That PR was reverted because it broke the big endian implementation
and CI did not catch it (see the revert PR BurntSushi#153 for details).

Andrew, thank you for the new test cases, which made it easier to fix
the issue.

The fix is:

```
--- a/src/arch/all/memchr.rs
+++ b/src/arch/all/memchr.rs
@@ -1019,7 +1019,7 @@ fn find_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         lowest_zero_byte(x)
     } else {
-        highest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - highest_zero_byte(x)?)
     }
 }

@@ -1028,7 +1028,7 @@ fn rfind_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         highest_zero_byte(x)
     } else {
-        lowest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - lowest_zero_byte(x)?)
     }
 }
```

Original description:

The current generic ("all") implementation checks whether a chunk
(`usize`) contains a zero byte, and if it does, iterates over the
bytes of that chunk to find the index of the zero byte. Instead, we
can use a few more bit operations to find the index without a loop.

Context: we use `memchr`, but many of our strings are short.
Currently, the SIMD-optimized `memchr` processes bytes one by one
when the string is shorter than a SIMD register. I suspect it can
be made faster if we take a string that does not fit into a SIMD
register in `usize`-sized chunks and process them with such a
utility, similar to how the AVX2 implementation falls back to SSE2.
So I looked for a generic implementation to reuse in the
SIMD-optimized version, but there was none. So here it is.
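
A small hypothetical test (not part of the PR) illustrating why the big endian branches need the `USIZE_BYTES - 1 - ...` correction: the zero-byte helpers report a byte's position by significance within the word, which matches memory order only on little endian.

```
#[test]
fn zero_byte_index_depends_on_endianness() {
    // Memory index 1 holds the zero byte.
    let bytes = [1u8, 0, 2, 3, 4, 5, 6, 7];
    let x = u64::from_ne_bytes(bytes);
    if cfg!(target_endian = "little") {
        // Little endian: memory index 1 is the second least significant
        // byte, i.e. bits 8..16.
        assert_eq!((x >> 8) & 0xFF, 0);
    } else {
        // Big endian: memory index 1 is the second most significant byte,
        // i.e. bits 48..56; hence the index flip in the fix above.
        assert_eq!((x >> 48) & 0xFF, 0);
    }
}
```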
@stepancheg
Contributor Author

Resubmitted as #154.
