
rewrite the entire crate #35

Merged: 7 commits, Sep 17, 2018

Conversation

BurntSushi (Owner)

This fills out the complete API of the memchr crate, namely, by including
memrchr2 and memrchr3, along with double-ended iterators for memchr2
and memchr3. We also add `memr?chr[123]_iter` convenience free functions
for constructing iterators, so there is no need to import the iterator
types themselves.
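
For illustration, the new convenience constructors are meant to be used
roughly like this (an illustrative sketch, not code from this PR):

use memchr::{memchr2_iter, memrchr_iter};

fn positions(haystack: &[u8]) {
    // All positions of b'a' or b'b', front to back.
    let fwd: Vec<usize> = memchr2_iter(b'a', b'b', haystack).collect();
    // All positions of b'z', back to front, via the reverse iterator.
    let rev: Vec<usize> = memrchr_iter(b'z', haystack).collect();
    // The forward iterators are double-ended, so .rev() also works.
    let fwd_rev: Vec<usize> = memchr2_iter(b'a', b'b', haystack).rev().collect();
    let _ = (fwd, rev, fwd_rev);
}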

In doing so, we provide vectorized implementations for everything. Both
memchr and memrchr should now be competitive with glibc on x86_64, and
therefore, we now drop libc on x86_64. This also has the benefit of
providing consistent performance across all platforms, regardless of
which libc is used. Namely, the performance of the various
implementations of memchr across different implementations of libc can
vary greatly. In particular, static builds of Rust programs using MUSL can
now benefit from this work. Previously, such programs would use MUSL's
(comparatively) unoptimized routine for memchr.

We also rewrite the fallback routines to make them faster. For the most
part, throughput remains roughly the same, but we reduce overhead quite
a bit by just accepting our fate and using raw pointers everywhere.
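
To make the "raw pointers everywhere" point concrete, here is a minimal
sketch of the style of fallback loop meant here (not the actual
implementation, which also uses word-at-a-time tricks):

fn memchr_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
    let start = haystack.as_ptr();
    let end = unsafe { start.add(haystack.len()) };
    let mut ptr = start;
    while ptr < end {
        // Raw pointer reads avoid the per-iteration bounds checks that
        // slice indexing would incur in the hot loop.
        if unsafe { *ptr } == needle {
            return Some(ptr as usize - start as usize);
        }
        ptr = unsafe { ptr.add(1) };
    }
    None
}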

Finally, we dramatically improve the coverage of tests and benchmarks.
In the course of working on this, I played the role of a mutation tester,
and generally speaking, small tweaks to make any of the algorithms
incorrect should now be caught by the test suite. This is important because
there are a ton of parameters that impact how the algorithm behaves (such
as CPU support, alignment, length, and of course, the haystack itself).

Closes #19, Closes #34

cc'ing some folks that I know have worked in this area: @Manishearth @bluss @llogiq @Veedrac --- any review you might be able to provide would be awesome! I also imagine we might want to pull some of this into std.

This lets us conveniently test CI as we move along.
This patch is from the standard library.
This represents an initial go at adding comprehensive criterion
benchmarks. We hit every public API function, and vary benchmarks based
on both corpus size and match frequency.

The point of this exercise is to establish a baseline on the status quo,
and make sure we don't introduce any major regressions after a
refactoring toward vendor intrinsics.
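
A rough sketch of the shape of such a benchmark (illustrative only; the
corpus and benchmark names here are made up, not taken from this PR):

use criterion::{criterion_group, criterion_main, Criterion};
use memchr::memchr;

fn bench_memchr(c: &mut Criterion) {
    // Hypothetical corpus: a large haystack in which the needle never
    // occurs, to measure raw throughput at the lowest match frequency.
    let haystack = vec![b'z'; 1 << 20];
    c.bench_function("memchr/huge/never", |b| b.iter(|| memchr(b'a', &haystack)));
}

criterion_group!(benches, bench_memchr);
criterion_main!(benches);
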
This reorganizes the internal structure of the crate to be less of a
mess, including rewriting the tests. In particular, we commit to no_std
and obviate the need for the `use_std` feature, since we simply elect
to always use `core`.

We also add an explicit compilation error for non-32/64-bit systems. We
technically supported them previously, but it's not clear that support
ever worked, since it was never tested. We can add it back if necessary.
We'll use these configs to enable the use of vendor intrinsics on stable
Rust compilers that support them. To check the compiler version, we use
the lightweight version_check crate.
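
Roughly, the build script does something along these lines (a sketch;
the cfg name is made up and the version_check call reflects my
understanding of that crate's API, so treat both as assumptions):

// build.rs (sketch)
extern crate version_check;

fn main() {
    // core::arch vendor intrinsics stabilized in Rust 1.27, so only emit
    // the cfg on compilers at least that new.
    if version_check::is_min_version("1.27.0") == Some(true) {
        println!("cargo:rustc-cfg=memchr_stable_simd");
    }
}
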
This commit updates the tests to increase their coverage dramatically.
The changes were informed by experimenting with the existing tests:
introducing subtle bugs into the memchr implementations and watching
with horror as the test suite continued to pass.

We increase coverage by doing two things. First, we pad the corpus of
each test with bytes that are never used as a needle. We go all the way
up to ~500 bytes to make sure we cover all forms of case analysis. We
pad both sides of the corpus and update our expected positions
accordingly.

Second, for each `find` test, we vary the alignment of the slice we
pass to memchr, all the way up to 66, to account for AVX2 optimizations
(which will read at least 32 bytes at a time, but loop unrolling might
make this 64 bytes at a time).
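
In sketch form, the alignment sweep looks something like this
(illustrative, not the actual test code):

fn check_all_alignments(needle: u8, corpus: &[u8], expected: &[usize]) {
    // Slide the start of the haystack forward by 0..=66 bytes so that
    // every offset relative to a 32-byte (and unrolled 64-byte) vector
    // load is exercised, adjusting the expected positions to match.
    for align in 0..=66 {
        if align > corpus.len() {
            break;
        }
        let haystack = &corpus[align..];
        let got: Vec<usize> = memchr::memchr_iter(needle, haystack).collect();
        let want: Vec<usize> = expected
            .iter()
            .filter(|&&p| p >= align)
            .map(|&p| p - align)
            .collect();
        assert_eq!(want, got, "alignment {}", align);
    }
}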

This coverage passes the smell test described above: every bug I tried
to introduce is now caught by the test suite (unless the bug introduces
UB, but that's par for the course).
@BurntSushi (Owner Author) commented Sep 17, 2018

I'll try to see about posting numbers, but the switch to criterion has caused me to lose convenient condensed output. (But it was invaluable for actually doing this rewrite otherwise.)

One interesting thing I discovered in my work on this is that glibc's AVX implementation of memchr, as far as I can tell from looking at the disassembly, does not actually have a "byte at a time" matching loop to handle haystacks smaller than the vector size. In particular, it appears to potentially read past the end of the provided buffer, and will "fix up" anything afterwards by making sure it doesn't return a match position past the end of the buffer. From what I can tell, this is actually intentional, but whether it's UB or not isn't clear. It might be a case where it's legal to do in assembly, but perhaps not in C. Either way, there's some interesting reading on the topic: https://stackoverflow.com/questions/37800739/is-it-safe-to-read-past-the-end-of-a-buffer-within-the-same-page-on-x86-and-x64

@BurntSushi force-pushed the ag/vendor branch 4 times, most recently from 2e07cc4 to 37d38c5 on September 17, 2018 at 00:58
@llogiq commented Sep 17, 2018

I'm impressed, as usual with your code. Kudos!

use core::usize;

#[cfg(target_pointer_width = "32")]
const USIZE_BYTES: usize = 4;

For inclusion in libcore, we need to handle pointer_width = 16 as well

BurntSushi (Owner Author)

Yeah, I actually made it a compile error to build memchr on 16 bit platforms because I don't have an easy way to test it: https://github.com/BurntSushi/rust-memchr/pull/35/files#diff-b4aea3e418ccdb71239b96952d9cddb6R31
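
The guard in question is along these lines (paraphrased from the linked
diff rather than quoted verbatim):

#[cfg(not(any(target_pointer_width = "32", target_pointer_width = "64")))]
compile_error!("memchr currently only supports 32-bit and 64-bit platforms");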

@Manishearth

Yeah, libcore uses memchr for slice::contains and str::find. Having these SIMD versions in tree would be pretty great! I don't know how easy it is to keep these in sync though.

@nagisa commented Sep 17, 2018

Inclusion in libcore/libstd will also need to use at least align_offset, but you could also use align_to, which was made specifically for splitting buffers into parts for such an optimisation.
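
For reference, this is the kind of split align_to gives you (a sketch
using the standard slice API, not code from this PR):

#[cfg(target_arch = "x86_64")]
fn split_for_simd(haystack: &[u8]) -> (usize, usize, usize) {
    use core::arch::x86_64::__m256i;
    // align_to splits a byte slice into an unaligned head, a maximal
    // 32-byte-aligned middle suitable for whole-vector loads, and an
    // unaligned tail; the head and tail would get the scalar loop.
    let (head, middle, tail) = unsafe { haystack.align_to::<__m256i>() };
    (head.len(), middle.len(), tail.len())
}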

@BurntSushi (Owner Author)

@nagisa Yeah, I saw those, but neither are stable yet. I also don't think I understand why core requires their use. In any case, I have no idea what the process is for including this code in core anyway. Does the code need to be copied/vendored? Then I think those changes can just be made as part of that process.

@gnzlbg commented Sep 17, 2018

This should probably use is_x86_feature_detected when std is available and use 256-bit wide vector instructions when that's the case. Doing so would mean that it does not make much sense to include this in libcore, since libcore can't do that.
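
In other words, roughly this shape of dispatch when std is available (a
sketch with placeholder function names, not the crate's actual
internals):

#[cfg(target_arch = "x86_64")]
fn find(needle: u8, haystack: &[u8]) -> Option<usize> {
    if is_x86_feature_detected!("avx2") {
        // Only safe to call once the runtime check above has passed.
        return unsafe { find_avx2(needle, haystack) };
    }
    find_fallback(needle, haystack)
}

// Placeholder bodies so the sketch stands alone; real versions would use
// core::arch intrinsics.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_avx2(needle: u8, haystack: &[u8]) -> Option<usize> {
    find_fallback(needle, haystack)
}

fn find_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}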

@nagisa commented Sep 17, 2018

Making the changes as part of integration into core/std is fine. Use of those functions is necessary for miri.

@BurntSushi (Owner Author)

@gnzlbg It does indeed do that. And yes, there are probably varying levels of what could be pushed into core:

  • The fallback implementations are a (small) improvement over the existing ones.
  • SSE2 implementations can be used on x86_64 without is_x86_feature_detected!.
  • I imagine on x86 in core you could still query CPU info no? You might not be able to use is_x86_feature_detected! though I guess. Not sure.

But yeah, we can certainly try and get the code to be as similar as possible in both places, but it sounds like there will need to be at least some changes. In any case, all the different implementations of memchr are pretty self contained and amenable to just copying even if the higher level dispatching needs to be customized for core.

@gnzlbg commented Sep 17, 2018

Looking more closely at how to do the integration, we can probably get away with just exposing different memchr functions in libcore and libstd. It probably makes sense to just have a use_std cargo feature here, and make sure it's disabled when compiling this code as part of libcore, but enabled when compiling it as part of libstd. The stdsimd crate is set up this way in std: there are two crates, coresimd and stdsimd, both reusing the same code, which ends up being compiled twice, once for libcore and once for libstd.

@gnzlbg commented Sep 17, 2018

The problem is that stuff like the slice methods get re-exported in libstd from libcore, so they won't use the potentially more efficient implementation of memchr available in libstd (they will use the ones from core).

I imagine on x86 in core you could still query CPU info no?

Not currently, and this is by design, since run-time feature detection is a platform thing (e.g. can the OS save AVX registers? that depends on the OS, and on whether you are in user space or kernel space), but libcore is platform agnostic.

@BurntSushi force-pushed the ag/vendor branch 5 times, most recently from 50086f0 to 11694b1 on September 17, 2018 at 18:08
@alkis commented Sep 17, 2018

Drive by comment: use of avx2 instructions clocks down the core (on older CPUs it might even clock down all the cores). So while benchmarks might seem favorable for avx2 it is unlikely that the performance gains of memchr are going to justify slowing down other parts of the system - unless the program is avx2 heavy to begin with.

@BurntSushi (Owner Author) commented Sep 17, 2018

Some benchmarks from aho-corasick (the regex benchmarks aren't quite good enough yet to differentiate the impact here, because of other optimizations kicking in):

$ cargo benchcmp ./tmp/master.log tmp/master-memchr.log --threshold 5
 name                                        master.log ns/iter   master-memchr.log ns/iter  diff ns/iter   diff %  speedup
 dense::ac_two_bytes                         1,240 (8064 MB/s)    144 (69444 MB/s)                 -1,096  -88.39%   x 8.61
 dense::ac_two_diff_prefix                   1,240 (8064 MB/s)    144 (69444 MB/s)                 -1,096  -88.39%   x 8.61
 dense_boxed::ac_ten_one_prefix_byte_random  15,346 (651 MB/s)    14,122 (708 MB/s)                -1,224   -7.98%   x 1.09
 dense_boxed::ac_two_bytes                   1,243 (8045 MB/s)    147 (68027 MB/s)                 -1,096  -88.17%   x 8.46
 dense_boxed::ac_two_diff_prefix             1,243 (8045 MB/s)    146 (68493 MB/s)                 -1,097  -88.25%   x 8.51
 dense_boxed::ac_two_one_prefix_byte_random  14,562 (686 MB/s)    13,294 (752 MB/s)                -1,268   -8.71%   x 1.10
 full::ac_two_bytes                          1,242 (8051 MB/s)    145 (68965 MB/s)                 -1,097  -88.33%   x 8.57
 full::ac_two_diff_prefix                    1,242 (8051 MB/s)    145 (68965 MB/s)                 -1,097  -88.33%   x 8.57
 full_overlap::ac_two_bytes                  1,243 (8045 MB/s)    147 (68027 MB/s)                 -1,096  -88.17%   x 8.46
 full_overlap::ac_two_diff_prefix            1,243 (8045 MB/s)    147 (68027 MB/s)                 -1,096  -88.17%   x 8.46
 sherlock::name_alt2                         136,524 (4357 MB/s)  55,499 (10719 MB/s)             -81,025  -59.35%   x 2.46
 sherlock::name_alt4                         134,124 (4435 MB/s)  53,523 (11115 MB/s)             -80,601  -60.09%   x 2.51
 sherlock::name_alt5                         199,892 (2976 MB/s)  93,936 (6333 MB/s)             -105,956  -53.01%   x 2.13
 sparse::ac_two_bytes                        1,240 (8064 MB/s)    143 (69930 MB/s)                 -1,097  -88.47%   x 8.67
 sparse::ac_two_diff_prefix                  1,240 (8064 MB/s)    143 (69930 MB/s)                 -1,097  -88.47%   x 8.67

In particular, the big speedups here are from vectorizing memchr2 and memchr3, which were previously unvectorized.
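
For context, the heart of a vectorized memchr2 looks roughly like this
on x86_64, where SSE2 is always available (an illustrative sketch, not
the code in this PR, which additionally handles alignment, unrolling,
and AVX2):

#[cfg(target_arch = "x86_64")]
fn memchr2_sse2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
    use core::arch::x86_64::*;
    unsafe {
        let v1 = _mm_set1_epi8(n1 as i8);
        let v2 = _mm_set1_epi8(n2 as i8);
        let mut i = 0;
        while i + 16 <= haystack.len() {
            // Compare 16 bytes against both needles at once and OR the
            // results; any set bit in the movemask is a match.
            let chunk = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
            let eq = _mm_or_si128(_mm_cmpeq_epi8(chunk, v1), _mm_cmpeq_epi8(chunk, v2));
            let mask = _mm_movemask_epi8(eq) as u32;
            if mask != 0 {
                return Some(i + mask.trailing_zeros() as usize);
            }
            i += 16;
        }
        // Scalar scan for the final partial chunk.
        haystack[i..].iter().position(|&b| b == n1 || b == n2).map(|p| i + p)
    }
}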

@BurntSushi (Owner Author)

@alkis

Drive by comment: use of avx2 instructions clocks down the core (on older CPUs it might even clock down all the cores). So while benchmarks might seem favorable for avx2 it is unlikely that the performance gains of memchr are going to justify slowing down other parts of the system - unless the program is avx2 heavy to begin with.

Yes, I'm aware of this downside to AVX2. However, that both glibc and Go's standard library use AVX2 for this exact same routine is a strong vote of confidence in favor of doing it. With that said, I would be happy to revise this understanding given a benchmark. From there, we can figure out how to deal with it.

@gnzlbg commented Sep 18, 2018

Just using AVX2 does not mean that a core (or all cores) will be downclocked. Whether AVX2 downclocks an old core depends on how exactly AVX2 is used (https://lemire.me/blog/2018/04/19/by-how-much-does-avx-512-slow-down-your-cpu-a-first-experiment/).

@alkis commented Sep 18, 2018

Just using AVX2 does not mean that a core (or all cores) will be downclocked. Whether AVX2 downclocks an old core depends on how exactly AVX2 is used (https://lemire.me/blog/2018/04/19/by-how-much-does-avx-512-slow-down-your-cpu-a-first-experiment/).

Haswell XEONs clock down all the cores on the same chip. More recent chips (Broadwell, Skylake, Kaby Lake) clock down progressively fewer cores. The same goes for when, and with which instructions, the downclocking happens: more recent CPUs are less prone to downclocking.

As for Lemire's analysis above, it is incomplete. For one, the tests are only on a single model, but more importantly, only 1 core out of 4 is exercised. The latter makes it a lot less likely for the core to clock down. CPUs clock down not because the manufacturers are trying to cheat you - it is because the core needs to stay below some max power threshold. If only a quarter of the total chip is used, it is hard to make the core downclock at all.

@gnzlbg commented Sep 18, 2018

As for Lemire's analysis above, it is incomplete. For one, the tests are only on a single model, but more importantly, only 1 core out of 4 is exercised. The latter makes it a lot less likely for the core to clock down. CPUs clock down not because the manufacturers are trying to cheat you - it is because the core needs to stay below some max power threshold. If only a quarter of the total chip is used, it is hard to make the core downclock at all.

The last example in Lemire's blog post:

#include <x86intrin.h>
#include <stdlib.h>
int main(int argc, char **argv) {
  if(argc>1) _mm256_zeroupper();
  float a = 3;
  float b = rand();
  if(argc>2) _mm256_zeroupper();
  for(int k = 0; k < 5000000; k++) {
    b = a + b * b;
    a = b - a * a;
  }
  return (b == 0)? 1 : 0;
}

shows that using two AVX1 instructions just once in a single-threaded program can have a catastrophic impact on performance. The second, multi-threaded example, which uses FMA heavily on all cores, shows that using a lot of AVX instructions in the hot spots of your code can make a measurable difference in performance, while the first, single-threaded example, which uses AVX heavily in a hot spot, shows no difference at all.

Haswell XEONs clock down all the cores on the same chip.

That depends on the instructions used, when they are used, the CPU involved, etc. The Intel white papers explicitly state that downclocking when using 256-bit and 512-bit wide registers only happens for "certain workloads". Using a single AVX1 instruction a lot does not necessarily downclock anything, but using another instruction sequence only once in your program execution might downclock one core, or all cores, for a long time.

If using memchr once or a lot downclocks your CPU in an undesired way, it would be nice for you to provide an example that reproduces the issue. It might be possible to tune the implementation to avoid the downclocking.
