Port the memchr implementation to "genericsimd" #103

Closed

alexcrichton wants to merge 7 commits

Conversation

alexcrichton
Contributor

This commit ports the various memchr algorithms implemented for SSE2 in
`src/memchr/x86/sse2.rs` to the `Vector` trait. This required adding a few
minor methods to the `Vector` trait, and the rewrite is hopefully quite
straightforward.

Afterwards the `src/memchr/x86/avx.rs` module is deleted, because in theory
it's simply the AVX2-using version of the new generic code (basically a new
type parameter). Further changes were made to all the `memchr`-related
functions to also enable WebAssembly acceleration with the `v128` type.
Ideally, new architectures only need to specify their implementation in one
place.

This method of abstraction is distinct from the `memmem` module, however.
I've tried to draw inspiration from the previous design of the `memchr`
module, but I believe that the `memmem` module is newer, so further
refactorings may be necessary to align the two in style.
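As a rough illustration of the shape of this abstraction (a sketch only; the method names and exact trait surface here are assumptions, not necessarily what the PR uses), the idea is a small `Vector` trait carrying the handful of SIMD operations memchr needs, with the search loops written generically against it:

```rust
/// Illustrative sketch of a generic SIMD vector abstraction for memchr.
trait Vector: Copy {
    /// Number of bytes in one vector.
    const SIZE: usize;
    unsafe fn splat(byte: u8) -> Self;
    unsafe fn load_unaligned(ptr: *const u8) -> Self;
    /// Lane-wise byte equality, producing 0xFF/0x00 per lane.
    unsafe fn cmpeq(self, other: Self) -> Self;
    /// One bit per lane, set when that lane matched.
    unsafe fn movemask(self) -> u32;
}

/// Forward search for `needle`, one vector at a time. Head/tail handling is
/// simplified to a scalar fallback for brevity.
unsafe fn find_generic<V: Vector>(needle: u8, haystack: &[u8]) -> Option<usize> {
    let vn = V::splat(needle);
    let mut i = 0;
    while i + V::SIZE <= haystack.len() {
        let chunk = V::load_unaligned(haystack.as_ptr().add(i));
        let mask = chunk.cmpeq(vn).movemask();
        if mask != 0 {
            // The lowest set bit identifies the first matching lane.
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += V::SIZE;
    }
    // Scalar scan for the remainder.
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```

In a design like this, SSE2's `__m128i`, AVX2's `__m256i`, and WebAssembly's `v128` would each implement the trait, so supporting a new architecture mostly comes down to providing these few primitives.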

@alexcrichton
Contributor Author

@BurntSushi ok this is my old branch dusted off. I wanted to open this up to start CI testing and figure out what's broken. I haven't had a chance to benchmark this thoroughly yet, especially the deletion of the avx2 module, which assumes that the generic implementation instantiated with `__m256i` performs the same. I'll want to do that before actually landing this.

Otherwise, though, I figured it might be good to post something that can be poked around.

@alexcrichton
Contributor Author

Ok, the PR as-is has these numbers. To generate them I ran:

$ cargo bench -- 'mem.*chr.*krate' --save-baseline $name

in the bench directory. For generating the reports I ran:

$ critcmp -t 5 generic master

to hopefully cut down on the noise (the `-t 5` threshold hides comparisons that differ by less than 5%).

AVX2 implementation

These numbers were generated with the PR as-is, since the machine I ran them on detects AVX2 at runtime:


raw data:

```
group                            generic                                 master
-----                            -------                                 ------
memchr1/krate/empty/never        1.74     0.9±0.00ns        0 B/sec      1.00     0.5±0.00ns        0 B/sec
memchr1/krate/huge/never         1.15    15.1±0.12µs     36.7 GB/sec     1.00    13.1±0.11µs     42.3 GB/sec
memchr1/krate/huge/uncommon      1.00   110.4±0.12µs      5.0 GB/sec     1.06   117.0±0.55µs      4.7 GB/sec
memchr1/krate/small/never        1.35    17.4±0.20ns     35.6 GB/sec     1.00    12.9±0.19ns     48.0 GB/sec
memchr1/krate/small/rare         1.18    24.8±0.07ns     24.9 GB/sec     1.00    21.1±0.22ns     29.3 GB/sec
memchr1/krate/small/uncommon     1.06    64.4±0.30ns      9.6 GB/sec     1.00    61.1±0.56ns     10.1 GB/sec
memchr1/krate/tiny/never         1.27     9.3±0.08ns      6.9 GB/sec     1.00     7.3±0.03ns      8.8 GB/sec
memchr1/krate/tiny/rare          1.15    13.5±0.05ns      4.8 GB/sec     1.00    11.7±0.06ns      5.5 GB/sec
memchr1/krate/tiny/uncommon      1.42    47.0±0.13ns   1398.6 MB/sec     1.00    33.2±0.11ns   1981.4 MB/sec
memchr2/krate/empty/never        1.41     1.0±0.01ns        0 B/sec      1.00     0.7±0.00ns        0 B/sec
memchr2/krate/huge/rare          1.00    24.8±0.12µs     22.3 GB/sec     1.08    26.7±0.23µs     20.7 GB/sec
memchr2/krate/huge/uncommon      1.00   225.4±1.35µs      2.5 GB/sec     1.05   236.9±2.13µs      2.3 GB/sec
memchr2/krate/small/never        1.00    24.7±0.23ns     25.1 GB/sec     1.20    29.6±1.31ns     20.9 GB/sec
memchr2/krate/small/rare         1.00    41.2±0.22ns     15.0 GB/sec     1.29    53.0±0.55ns     11.7 GB/sec
memchr2/krate/small/uncommon     1.00   150.8±0.43ns      4.1 GB/sec     1.22   183.4±1.04ns      3.4 GB/sec
memchr2/krate/tiny/never         1.00     8.8±0.04ns      7.3 GB/sec     1.35    11.9±0.09ns      5.4 GB/sec
memchr3/krate/empty/never        1.52     1.0±0.01ns        0 B/sec      1.00     0.6±0.01ns        0 B/sec
memchr3/krate/huge/never         1.22    30.8±0.12µs     18.0 GB/sec     1.00    25.3±0.20µs     21.9 GB/sec
memchr3/krate/huge/rare          1.18    37.4±0.14µs     14.8 GB/sec     1.00    31.7±0.23µs     17.5 GB/sec
memchr3/krate/huge/uncommon      1.00   268.1±0.35µs      2.1 GB/sec     1.07   285.7±0.52µs   1986.0 MB/sec
memchr3/krate/small/never        1.39    41.1±0.36ns     15.1 GB/sec     1.00    29.6±0.08ns     20.9 GB/sec
memchr3/krate/small/rare         1.11    60.5±0.13ns     10.2 GB/sec     1.00    54.6±0.24ns     11.3 GB/sec
memchr3/krate/tiny/never         1.00     9.8±0.35ns      6.5 GB/sec     1.16    11.4±0.04ns      5.7 GB/sec
memrchr1/krate/empty/never       1.71     1.8±0.05ns        0 B/sec      1.00     1.1±0.00ns        0 B/sec
memrchr1/krate/huge/never        1.14    14.4±0.57µs     38.4 GB/sec     1.00    12.7±0.05µs     43.7 GB/sec
memrchr3/krate/small/common      1.20   925.5±4.70ns    684.2 MB/sec     1.00   772.8±1.44ns    819.4 MB/sec
memrchr3/krate/small/rare        1.10    57.8±0.19ns     10.7 GB/sec     1.00    52.4±0.31ns     11.8 GB/sec
memrchr3/krate/small/uncommon    1.23   230.7±0.98ns      2.7 GB/sec     1.00   186.9±1.31ns      3.3 GB/sec
memrchr3/krate/tiny/never        1.05    11.4±0.10ns      5.6 GB/sec     1.00    10.8±0.04ns      5.9 GB/sec
memrchr3/krate/tiny/rare         1.17    37.1±0.20ns   1773.9 MB/sec     1.00    31.6±0.19ns      2.0 GB/sec
memrchr3/krate/tiny/uncommon     1.23   145.6±0.59ns    451.9 MB/sec     1.00   118.2±0.32ns    556.7 MB/sec
```

SSE2 implementation

For these numbers I edited the run-time detection to skip the AVX2 feature check, meaning that the SSE2 implementation was forced to be used.


raw data:

```
group                            generic-no-avx                          master-no-avx
-----                            --------------                          -------------
memchr1/krate/huge/never         1.10    30.6±0.08µs     18.1 GB/sec     1.00    27.8±0.42µs     19.9 GB/sec
memchr1/krate/huge/rare          1.08    32.8±0.07µs     16.9 GB/sec     1.00    30.2±0.09µs     18.3 GB/sec
memchr1/krate/small/never        1.07    41.1±1.29ns     15.0 GB/sec     1.00    38.6±1.18ns     16.0 GB/sec
memchr1/krate/small/rare         1.07    47.6±0.17ns     13.0 GB/sec     1.00    44.6±0.58ns     13.9 GB/sec
memchr3/krate/small/rare         1.12   120.6±0.94ns      5.1 GB/sec     1.00   108.2±0.39ns      5.7 GB/sec
memchr3/krate/small/uncommon     1.09   256.9±1.20ns      2.4 GB/sec     1.00   235.9±2.89ns      2.6 GB/sec
memchr3/krate/tiny/never         1.07    15.6±0.04ns      4.1 GB/sec     1.00    14.5±0.04ns      4.4 GB/sec
memrchr1/krate/empty/never       1.05     1.9±0.03ns        0 B/sec      1.00     1.8±0.01ns        0 B/sec
memrchr1/krate/tiny/never        1.00     9.4±0.04ns      6.8 GB/sec     1.12    10.5±0.06ns      6.1 GB/sec
memrchr1/krate/tiny/rare         1.00    13.4±0.04ns      4.8 GB/sec     1.08    14.5±0.07ns      4.4 GB/sec
memrchr2/krate/huge/common       1.00   566.1±1.09µs   1002.3 MB/sec     1.21   684.4±2.13µs    829.0 MB/sec
memrchr2/krate/huge/uncommon     1.00   272.9±0.32µs      2.0 GB/sec     1.06   288.4±0.58µs   1967.3 MB/sec
memrchr2/krate/small/common      1.00   462.8±2.07ns   1368.4 MB/sec     1.36   629.7±2.91ns   1005.6 MB/sec
memrchr2/krate/small/uncommon    1.00   167.7±0.95ns      3.7 GB/sec     1.08   181.4±0.69ns      3.4 GB/sec
memrchr2/krate/tiny/never        1.09    13.0±0.22ns      4.9 GB/sec     1.00    11.9±0.04ns      5.4 GB/sec
memrchr2/krate/tiny/rare         1.17    36.7±0.32ns   1793.7 MB/sec     1.00    31.2±0.17ns      2.1 GB/sec
memrchr2/krate/tiny/uncommon     1.00    85.5±0.39ns    769.2 MB/sec     1.11    95.3±0.35ns    690.5 MB/sec
memrchr3/krate/huge/never        1.00    52.5±0.17µs     10.6 GB/sec     1.08    56.6±0.21µs      9.8 GB/sec
memrchr3/krate/huge/rare         1.00    62.9±1.26µs      8.8 GB/sec     1.06    66.5±0.92µs      8.3 GB/sec
```

I'm not sure how much variation these benchmarks have, but that's at least the current state! My goal was basically parity with the master branch, but I may need to dig into some of these cases because the speedups/slowdowns are bigger than I'd expect.

@BurntSushi
Owner

Thanks so much for working on this! I've finally gotten around to taking a look at this PR. I'm also trying to knock out #76 at the same time. Unfortunately, the neon ISA doesn't have a movemask routine, which means that if I wind up going with a generic version, it's going to look a bit different.
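For context on the movemask point: one well-known workaround on aarch64 (a rough sketch only, assuming a 16-byte `uint8x16_t` comparison result, and not necessarily the approach that will be taken here) is to build a 64-bit pseudo-mask with a narrowing shift instead of a 16-bit movemask:

```rust
// Illustrative sketch of emulating x86's movemask on aarch64 NEON, which has
// no single-instruction equivalent. Not code from this PR or the memchr crate.
#[cfg(target_arch = "aarch64")]
unsafe fn pseudo_movemask(cmp: std::arch::aarch64::uint8x16_t) -> u64 {
    use std::arch::aarch64::*;
    // Narrowing shift by 4: each byte lane of the comparison result (0xFF or
    // 0x00) contributes 4 bits, packing 16 lanes into one 64-bit value.
    let narrowed = vshrn_n_u16::<4>(vreinterpretq_u16_u8(cmp));
    vget_lane_u64::<0>(vreinterpret_u64_u8(narrowed))
}
// A match in lane `i` shows up as a nonzero nibble at bit offset `4 * i`, so
// `pseudo_movemask(cmp).trailing_zeros() / 4` gives the first matching lane.
```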

I'm also having a little trouble with the runtime feature detection macros, and with all the conditional compilation in memmem (where I also plan to add aarch64 support). Long story short, I think I'm just going to try to think more holistically about how to manage all the complexity here and find a way to push down and encapsulate the conditional compilation in a more sensible way than is being done today.

Thanks again for working on this and getting the wasm stuff working in memmem. My plan is to get x86, wasm and aarch64 supported in memchr as well.

@BurntSushi closed this Jul 12, 2023