Port the memchr implementation to "genericsimd" #103

Closed

alexcrichton wants to merge 7 commits

Conversation

alexcrichton
Contributor

This commit ports the various memchr algorithms implemented for SSE2 in
`src/memchr/x86/sse2.rs` to the `Vector` trait. This required adding a few
minor methods to the `Vector` trait, and the rewrite is hopefully quite
straightforward.

Afterwards the `src/memchr/x86/avx.rs` module is deleted, because in theory
it's simply the AVX2-using version of the new generic code (basically a new
type parameter). Further changes were made to all the `memchr`-related
functions to also enable WebAssembly acceleration with the `v128` type.
Ideally, new architectures only need to specify their implementation in one
place.

This method of abstraction is distinct from the `memmem` module, however.
I've tried to draw inspiration from the previous design of the `memchr`
module, but I believe that the `memmem` module is newer, so further
refactorings may be necessary to align the two in style.
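As a rough illustration of the shape of this abstraction (a sketch only; the method names and exact trait surface here are assumptions, not necessarily what the PR uses), the idea is a small `Vector` trait carrying the handful of SIMD operations memchr needs, with the search loops written generically against it:

```rust
/// Illustrative sketch of a generic SIMD vector abstraction for memchr.
trait Vector: Copy {
    /// Number of bytes in one vector.
    const SIZE: usize;
    unsafe fn splat(byte: u8) -> Self;
    unsafe fn load_unaligned(ptr: *const u8) -> Self;
    /// Lane-wise byte equality, producing 0xFF/0x00 per lane.
    unsafe fn cmpeq(self, other: Self) -> Self;
    /// One bit per lane, set when that lane matched.
    unsafe fn movemask(self) -> u32;
}

/// Forward search for `needle`, one vector at a time. Head/tail handling is
/// simplified to a scalar fallback for brevity.
unsafe fn find_generic<V: Vector>(needle: u8, haystack: &[u8]) -> Option<usize> {
    let vn = V::splat(needle);
    let mut i = 0;
    while i + V::SIZE <= haystack.len() {
        let chunk = V::load_unaligned(haystack.as_ptr().add(i));
        let mask = chunk.cmpeq(vn).movemask();
        if mask != 0 {
            // The lowest set bit identifies the first matching lane.
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += V::SIZE;
    }
    // Scalar scan for the remainder.
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```

In a design like this, SSE2's `__m128i`, AVX2's `__m256i`, and WebAssembly's `v128` would each implement the trait, so supporting a new architecture mostly comes down to providing these few primitives.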

@alexcrichton
Contributor Author

@BurntSushi ok this is my old branch dusted off. I wanted to open this up to start CI testing and figure out what's broken. I haven't had a chance to benchmark this thoroughly yet, especially the deletion of the avx2 module, which assumes that the generic implementation instantiated with `__m256i` performs the same. I'll want to do that before actually landing this.

Otherwise, though, I figured it might be good to post something that can be poked around.

@alexcrichton
Contributor Author

Ok, the PR as-is has these numbers. To generate them I ran:

$ cargo bench -- 'mem.*chr.*krate' --save-baseline $name

in the bench directory. For generating the reports I ran:

$ critcmp -t 5 generic master

to hopefully cut down on the noise (the `-t 5` threshold hides comparisons that differ by less than 5%).

AVX2 implementation

These numbers were generated with the PR as-is, since the machine I ran them on detects AVX2 at runtime:


raw data:

```
group                            generic                                 master
-----                            -------                                 ------
memchr1/krate/empty/never        1.74     0.9±0.00ns        0 B/sec      1.00     0.5±0.00ns        0 B/sec
memchr1/krate/huge/never         1.15    15.1±0.12µs     36.7 GB/sec     1.00    13.1±0.11µs     42.3 GB/sec
memchr1/krate/huge/uncommon      1.00   110.4±0.12µs      5.0 GB/sec     1.06   117.0±0.55µs      4.7 GB/sec
memchr1/krate/small/never        1.35    17.4±0.20ns     35.6 GB/sec     1.00    12.9±0.19ns     48.0 GB/sec
memchr1/krate/small/rare         1.18    24.8±0.07ns     24.9 GB/sec     1.00    21.1±0.22ns     29.3 GB/sec
memchr1/krate/small/uncommon     1.06    64.4±0.30ns      9.6 GB/sec     1.00    61.1±0.56ns     10.1 GB/sec
memchr1/krate/tiny/never         1.27     9.3±0.08ns      6.9 GB/sec     1.00     7.3±0.03ns      8.8 GB/sec
memchr1/krate/tiny/rare          1.15    13.5±0.05ns      4.8 GB/sec     1.00    11.7±0.06ns      5.5 GB/sec
memchr1/krate/tiny/uncommon      1.42    47.0±0.13ns   1398.6 MB/sec     1.00    33.2±0.11ns   1981.4 MB/sec
memchr2/krate/empty/never        1.41     1.0±0.01ns        0 B/sec      1.00     0.7±0.00ns        0 B/sec
memchr2/krate/huge/rare          1.00    24.8±0.12µs     22.3 GB/sec     1.08    26.7±0.23µs     20.7 GB/sec
memchr2/krate/huge/uncommon      1.00   225.4±1.35µs      2.5 GB/sec     1.05   236.9±2.13µs      2.3 GB/sec
memchr2/krate/small/never        1.00    24.7±0.23ns     25.1 GB/sec     1.20    29.6±1.31ns     20.9 GB/sec
memchr2/krate/small/rare         1.00    41.2±0.22ns     15.0 GB/sec     1.29    53.0±0.55ns     11.7 GB/sec
memchr2/krate/small/uncommon     1.00   150.8±0.43ns      4.1 GB/sec     1.22   183.4±1.04ns      3.4 GB/sec
memchr2/krate/tiny/never         1.00     8.8±0.04ns      7.3 GB/sec     1.35    11.9±0.09ns      5.4 GB/sec
memchr3/krate/empty/never        1.52     1.0±0.01ns        0 B/sec      1.00     0.6±0.01ns        0 B/sec
memchr3/krate/huge/never         1.22    30.8±0.12µs     18.0 GB/sec     1.00    25.3±0.20µs     21.9 GB/sec
memchr3/krate/huge/rare          1.18    37.4±0.14µs     14.8 GB/sec     1.00    31.7±0.23µs     17.5 GB/sec
memchr3/krate/huge/uncommon      1.00   268.1±0.35µs      2.1 GB/sec     1.07   285.7±0.52µs   1986.0 MB/sec
memchr3/krate/small/never        1.39    41.1±0.36ns     15.1 GB/sec     1.00    29.6±0.08ns     20.9 GB/sec
memchr3/krate/small/rare         1.11    60.5±0.13ns     10.2 GB/sec     1.00    54.6±0.24ns     11.3 GB/sec
memchr3/krate/tiny/never         1.00     9.8±0.35ns      6.5 GB/sec     1.16    11.4±0.04ns      5.7 GB/sec
memrchr1/krate/empty/never       1.71     1.8±0.05ns        0 B/sec      1.00     1.1±0.00ns        0 B/sec
memrchr1/krate/huge/never        1.14    14.4±0.57µs     38.4 GB/sec     1.00    12.7±0.05µs     43.7 GB/sec
memrchr3/krate/small/common      1.20   925.5±4.70ns    684.2 MB/sec     1.00   772.8±1.44ns    819.4 MB/sec
memrchr3/krate/small/rare        1.10    57.8±0.19ns     10.7 GB/sec     1.00    52.4±0.31ns     11.8 GB/sec
memrchr3/krate/small/uncommon    1.23   230.7±0.98ns      2.7 GB/sec     1.00   186.9±1.31ns      3.3 GB/sec
memrchr3/krate/tiny/never        1.05    11.4±0.10ns      5.6 GB/sec     1.00    10.8±0.04ns      5.9 GB/sec
memrchr3/krate/tiny/rare         1.17    37.1±0.20ns   1773.9 MB/sec     1.00    31.6±0.19ns      2.0 GB/sec
memrchr3/krate/tiny/uncommon     1.23   145.6±0.59ns    451.9 MB/sec     1.00   118.2±0.32ns    556.7 MB/sec
```

SSE2 implementation

For these numbers I edited the run-time detection to skip the AVX2 feature check, meaning that the SSE2 implementation was forced to be used.


raw data:

```
group                            generic-no-avx                          master-no-avx
-----                            --------------                          -------------
memchr1/krate/huge/never         1.10    30.6±0.08µs     18.1 GB/sec     1.00    27.8±0.42µs     19.9 GB/sec
memchr1/krate/huge/rare          1.08    32.8±0.07µs     16.9 GB/sec     1.00    30.2±0.09µs     18.3 GB/sec
memchr1/krate/small/never        1.07    41.1±1.29ns     15.0 GB/sec     1.00    38.6±1.18ns     16.0 GB/sec
memchr1/krate/small/rare         1.07    47.6±0.17ns     13.0 GB/sec     1.00    44.6±0.58ns     13.9 GB/sec
memchr3/krate/small/rare         1.12   120.6±0.94ns      5.1 GB/sec     1.00   108.2±0.39ns      5.7 GB/sec
memchr3/krate/small/uncommon     1.09   256.9±1.20ns      2.4 GB/sec     1.00   235.9±2.89ns      2.6 GB/sec
memchr3/krate/tiny/never         1.07    15.6±0.04ns      4.1 GB/sec     1.00    14.5±0.04ns      4.4 GB/sec
memrchr1/krate/empty/never       1.05     1.9±0.03ns        0 B/sec      1.00     1.8±0.01ns        0 B/sec
memrchr1/krate/tiny/never        1.00     9.4±0.04ns      6.8 GB/sec     1.12    10.5±0.06ns      6.1 GB/sec
memrchr1/krate/tiny/rare         1.00    13.4±0.04ns      4.8 GB/sec     1.08    14.5±0.07ns      4.4 GB/sec
memrchr2/krate/huge/common       1.00   566.1±1.09µs   1002.3 MB/sec     1.21   684.4±2.13µs    829.0 MB/sec
memrchr2/krate/huge/uncommon     1.00   272.9±0.32µs      2.0 GB/sec     1.06   288.4±0.58µs   1967.3 MB/sec
memrchr2/krate/small/common      1.00   462.8±2.07ns   1368.4 MB/sec     1.36   629.7±2.91ns   1005.6 MB/sec
memrchr2/krate/small/uncommon    1.00   167.7±0.95ns      3.7 GB/sec     1.08   181.4±0.69ns      3.4 GB/sec
memrchr2/krate/tiny/never        1.09    13.0±0.22ns      4.9 GB/sec     1.00    11.9±0.04ns      5.4 GB/sec
memrchr2/krate/tiny/rare         1.17    36.7±0.32ns   1793.7 MB/sec     1.00    31.2±0.17ns      2.1 GB/sec
memrchr2/krate/tiny/uncommon     1.00    85.5±0.39ns    769.2 MB/sec     1.11    95.3±0.35ns    690.5 MB/sec
memrchr3/krate/huge/never        1.00    52.5±0.17µs     10.6 GB/sec     1.08    56.6±0.21µs      9.8 GB/sec
memrchr3/krate/huge/rare         1.00    62.9±1.26µs      8.8 GB/sec     1.06    66.5±0.92µs      8.3 GB/sec
```

I'm not sure how much variation these benchmarks have, but that's at least the current state! My goal was basically parity with the master branch, but I may need to dig into some of these cases because the speedups/slowdowns are bigger than I'd expect.

@BurntSushi
Owner

Thanks so much for working on this! I've finally gotten around to taking a look at this PR. I'm also trying to knock out #76 at the same time. Unfortunately, the neon ISA doesn't have a movemask routine, which means that if I wind up going with a generic version, it's going to look a bit different.
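For context on the movemask point: one well-known workaround on aarch64 (a rough sketch only, assuming a 16-byte `uint8x16_t` comparison result, and not necessarily the approach that will be taken here) is to build a 64-bit pseudo-mask with a narrowing shift instead of a 16-bit movemask:

```rust
// Illustrative sketch of emulating x86's movemask on aarch64 NEON, which has
// no single-instruction equivalent. Not code from this PR or the memchr crate.
#[cfg(target_arch = "aarch64")]
unsafe fn pseudo_movemask(cmp: std::arch::aarch64::uint8x16_t) -> u64 {
    use std::arch::aarch64::*;
    // Narrowing shift by 4: each byte lane of the comparison result (0xFF or
    // 0x00) contributes 4 bits, packing 16 lanes into one 64-bit value.
    let narrowed = vshrn_n_u16::<4>(vreinterpretq_u16_u8(cmp));
    vget_lane_u64::<0>(vreinterpret_u64_u8(narrowed))
}
// A match in lane `i` shows up as a nonzero nibble at bit offset `4 * i`, so
// `pseudo_movemask(cmp).trailing_zeros() / 4` gives the first matching lane.
```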

I'm also having a little trouble with the runtime feature detection macros, and with all the conditional compilation in memmem (where I also plan to add aarch64 support). Long story short, I think I'm just going to try to think more holistically about how to manage all the complexity here and find a way to push down and encapsulate the conditional compilation in a more sensible way than is being done today.

Thanks again for working on this and getting the wasm stuff working in memmem. My plan is to get x86, wasm and aarch64 supported in memchr as well.

@BurntSushi closed this Jul 12, 2023