
rewrite the entire crate #35

Merged: 7 commits, Sep 17, 2018

Conversation

BurntSushi (Owner)

This fills out the complete API of the memchr crate, namely, by including
memrchr2 and memrchr3, along with double-ended iterators for memchr2
and memchr3. We also add `memr?chr[123]_iter` convenience free functions
for constructing iterators, so there is no need to import the iterator
types themselves.
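
For illustration, the new convenience constructors are meant to be used
roughly like this (an illustrative sketch, not code from this PR):

use memchr::{memchr2_iter, memrchr_iter};

fn positions(haystack: &[u8]) {
    // All positions of b'a' or b'b', front to back.
    let fwd: Vec<usize> = memchr2_iter(b'a', b'b', haystack).collect();
    // All positions of b'z', back to front, via the reverse iterator.
    let rev: Vec<usize> = memrchr_iter(b'z', haystack).collect();
    // The forward iterators are double-ended, so .rev() also works.
    let fwd_rev: Vec<usize> = memchr2_iter(b'a', b'b', haystack).rev().collect();
    let _ = (fwd, rev, fwd_rev);
}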

In doing so, we provide vectorized implementations for everything. Both
memchr and memrchr should now be competitive with glibc on x86_64, and
therefore, we now drop libc on x86_64. This also has the benefit of
providing consistent performance across all platforms, regardless of
which libc is used. Namely, the performance of the various
implementations of memchr across different implementations of libc can
vary greatly. In particular, static builds of Rust programs using MUSL can
now benefit from this work. Previously, such programs would use MUSL's
(comparatively) unoptimized routine for memchr.

We also rewrite the fallback routines to make them faster. For the most
part, throughput remains roughly the same, but we reduce overhead quite
a bit by just accepting our fate and using raw pointers everywhere.
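
To make the "raw pointers everywhere" point concrete, here is a minimal
sketch of the style of fallback loop meant here (not the actual
implementation, which also uses word-at-a-time tricks):

fn memchr_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
    let start = haystack.as_ptr();
    let end = unsafe { start.add(haystack.len()) };
    let mut ptr = start;
    while ptr < end {
        // Raw pointer reads avoid the per-iteration bounds checks that
        // slice indexing would incur in the hot loop.
        if unsafe { *ptr } == needle {
            return Some(ptr as usize - start as usize);
        }
        ptr = unsafe { ptr.add(1) };
    }
    None
}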

Finally, we dramatically improve the coverage of tests and benchmarks.
In the course of working on this, I played the role of a mutation tester,
and generally speaking, small tweaks to make any of the algorithms
incorrect should now be caught by the test suite. This is important because
there are a ton of parameters that impact how the algorithm behaves (such
as CPU support, alignment, length, and of course, the haystack itself).

Closes #19, Closes #34

cc'ing some folks that I know have worked in this area: @Manishearth @bluss @llogiq @Veedrac --- any review you might be able to provide would be awesome! I also imagine we might want to pull some of this into std.

This lets us conveniently test CI as we move along.
This patch is from the standard library.
This represents an initial go at adding comprehensive criterion
benchmarks. We hit every public API function, and vary benchmarks based
on both corpus size and match frequency.

The point of this exercise is to establish a baseline on the status quo,
and make sure we don't introduce any major regressions after a
refactoring toward vendor intrinsics.
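
A rough sketch of the shape of such a benchmark (illustrative only; the
corpus and benchmark names here are made up, not taken from this PR):

use criterion::{criterion_group, criterion_main, Criterion};
use memchr::memchr;

fn bench_memchr(c: &mut Criterion) {
    // Hypothetical corpus: a large haystack in which the needle never
    // occurs, to measure raw throughput at the lowest match frequency.
    let haystack = vec![b'z'; 1 << 20];
    c.bench_function("memchr/huge/never", |b| b.iter(|| memchr(b'a', &haystack)));
}

criterion_group!(benches, bench_memchr);
criterion_main!(benches);
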
This reorganizes the internal structure of the crate to be less of a
mess, including rewriting the tests. In particular, we commit to no_std
and obviate the need for the `use_std` feature, since we simply elect
to always use `core`.

We also add an explicit compilation error for non-32/64-bit systems. We
technically supported them previously, but it's not clear that support
ever worked, since it was never tested. We can add it back if necessary.
We'll use these configs to enable the use of vendor intrinsics on stable
Rust compilers that support them. To check the compiler version, we use
the lightweight version_check crate.
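
Roughly, the build script does something along these lines (a sketch;
the cfg name is made up and the version_check call reflects my
understanding of that crate's API, so treat both as assumptions):

// build.rs (sketch)
extern crate version_check;

fn main() {
    // core::arch vendor intrinsics stabilized in Rust 1.27, so only emit
    // the cfg on compilers at least that new.
    if version_check::is_min_version("1.27.0") == Some(true) {
        println!("cargo:rustc-cfg=memchr_stable_simd");
    }
}
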
This commit updates the tests to increase their coverage dramatically.
The changes were informed by experimenting with the existing tests:
introducing subtle bugs into the memchr implementations and watching
with horror as the test suite continued to pass.

We increase coverage by doing two things. First, we pad the corpus of
each test with bytes that are never used as a needle. We go all the way
up to ~500 bytes to make sure we cover all forms of case analysis. We
pad both sides of the corpus and update our expected positions
accordingly.

Second, for each `find` test, we vary the alignment of the slice we
pass to memchr, all the way up to 66, to account for AVX2 optimizations
(which will read at least 32 bytes at a time, but loop unrolling might
make this 64 bytes at a time).
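
In sketch form, the alignment sweep looks something like this
(illustrative, not the actual test code):

fn check_all_alignments(needle: u8, corpus: &[u8], expected: &[usize]) {
    // Slide the start of the haystack forward by 0..=66 bytes so that
    // every offset relative to a 32-byte (and unrolled 64-byte) vector
    // load is exercised, adjusting the expected positions to match.
    for align in 0..=66 {
        if align > corpus.len() {
            break;
        }
        let haystack = &corpus[align..];
        let got: Vec<usize> = memchr::memchr_iter(needle, haystack).collect();
        let want: Vec<usize> = expected
            .iter()
            .filter(|&&p| p >= align)
            .map(|&p| p - align)
            .collect();
        assert_eq!(want, got, "alignment {}", align);
    }
}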

This coverage passes the smell test described above: every bug I tried
to introduce is now caught by the test suite (unless the bug introduces
UB, but that's par for the course).
@BurntSushi (Owner Author) commented Sep 17, 2018

I'll try to see about posting numbers, but the switch to criterion has caused me to lose convenient condensed output. (But it was invaluable for actually doing this rewrite otherwise.)

One interesting thing I discovered in my work on this is that glibc's AVX implementation of memchr, as far as I can tell from looking at the disassembly, does not actually have a "byte at a time" matching loop to handle haystacks smaller than the vector size. In particular, it appears to potentially read past the end of the provided buffer, and will "fix up" anything afterwards by making sure it doesn't return a match position past the end of the buffer. From what I can tell, this is actually intentional, but whether it's UB or not isn't clear. It might be a case where it's legal to do in assembly, but perhaps not in C. Either way, there's some interesting reading on the topic: https://stackoverflow.com/questions/37800739/is-it-safe-to-read-past-the-end-of-a-buffer-within-the-same-page-on-x86-and-x64

@BurntSushi force-pushed the ag/vendor branch 4 times, most recently from 2e07cc4 to 37d38c5 on September 17, 2018 at 00:58
@llogiq commented Sep 17, 2018

I'm impressed, as usual with your code. Kudos!

use core::usize;

#[cfg(target_pointer_width = "32")]
const USIZE_BYTES: usize = 4;

For inclusion in libcore, we need to handle pointer_width = 16 as well

BurntSushi (Owner Author)

Yeah, I actually made it a compile error to build memchr on 16 bit platforms because I don't have an easy way to test it: https://github.com/BurntSushi/rust-memchr/pull/35/files#diff-b4aea3e418ccdb71239b96952d9cddb6R31
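
The guard in question is along these lines (paraphrased from the linked
diff rather than quoted verbatim):

#[cfg(not(any(target_pointer_width = "32", target_pointer_width = "64")))]
compile_error!("memchr currently only supports 32-bit and 64-bit platforms");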

@Manishearth

Yeah, libcore uses memchr for slice::contains and str::find. Having these SIMD versions in tree would be pretty great! I don't know how easy it is to keep these in sync though.

@nagisa commented Sep 17, 2018

Inclusion in libcore/libstd will also need to use at least align_offset, but you could also use align_to, which was made specifically for splitting buffers into parts for such an optimisation.
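
For reference, this is the kind of split align_to gives you (a sketch
using the standard slice API, not code from this PR):

#[cfg(target_arch = "x86_64")]
fn split_for_simd(haystack: &[u8]) -> (usize, usize, usize) {
    use core::arch::x86_64::__m256i;
    // align_to splits a byte slice into an unaligned head, a maximal
    // 32-byte-aligned middle suitable for whole-vector loads, and an
    // unaligned tail; the head and tail would get the scalar loop.
    let (head, middle, tail) = unsafe { haystack.align_to::<__m256i>() };
    (head.len(), middle.len(), tail.len())
}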

@BurntSushi (Owner Author)

@nagisa Yeah, I saw those, but neither are stable yet. I also don't think I understand why core requires their use. In any case, I have no idea what the process is for including this code in core anyway. Does the code need to be copied/vendored? Then I think those changes can just be made as part of that process.

@gnzlbg commented Sep 17, 2018

This should probably use is_x86_feature_detected when std is available and use 256-bit wide vector instructions when that's the case. Doing so would mean that it does not make much sense to include this in libcore, since libcore can't do that.
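
In other words, roughly this shape of dispatch when std is available (a
sketch with placeholder function names, not the crate's actual
internals):

#[cfg(target_arch = "x86_64")]
fn find(needle: u8, haystack: &[u8]) -> Option<usize> {
    if is_x86_feature_detected!("avx2") {
        // Only safe to call once the runtime check above has passed.
        return unsafe { find_avx2(needle, haystack) };
    }
    find_fallback(needle, haystack)
}

// Placeholder bodies so the sketch stands alone; real versions would use
// core::arch intrinsics.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_avx2(needle: u8, haystack: &[u8]) -> Option<usize> {
    find_fallback(needle, haystack)
}

fn find_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}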

@nagisa commented Sep 17, 2018

Making the changes as part of integration into core/std is fine. Use of those functions is necessary for miri.

@BurntSushi (Owner Author)

@gnzlbg It does indeed do that. And yes, there are probably varying levels of what could be pushed into core:

  • The fallback implementations are a (small) improvement over the existing ones.
  • SSE2 implementations can be used on x86_64 without is_x86_feature_detected!.
  • I imagine on x86 in core you could still query CPU info no? You might not be able to use is_x86_feature_detected! though I guess. Not sure.

But yeah, we can certainly try and get the code to be as similar as possible in both places, but it sounds like there will need to be at least some changes. In any case, all the different implementations of memchr are pretty self contained and amenable to just copying even if the higher level dispatching needs to be customized for core.

@gnzlbg commented Sep 17, 2018

Looking more closely at how to do the integration, we can probably get away with just exposing different memchr functions in libcore and libstd. It probably makes sense to just have a use_std cargo feature here, and make sure it's disabled when compiling this code as part of libcore, but enabled when compiling it as part of libstd. The stdsimd crate is set up this way in std: there are two crates, coresimd and stdsimd, both reusing the same code, which ends up being compiled twice, once for libcore and once for libstd.

@gnzlbg commented Sep 17, 2018

The problem is that stuff like the slice methods get re-exported in libstd from libcore, so they won't use the potentially more efficient implementation of memchr available in libstd (they will use the ones from core).

I imagine on x86 in core you could still query CPU info no?

Not currently, and this is by design, since run-time feature detection is a platform thing (e.g. can the OS save AVX registers? that depends on the OS, and on whether you are in user space or kernel space), but libcore is platform agnostic.

@BurntSushi force-pushed the ag/vendor branch 5 times, most recently from 50086f0 to 11694b1 on September 17, 2018 at 18:08
@alkis commented Sep 17, 2018

Drive by comment: use of avx2 instructions clocks down the core (on older CPUs it might even clock down all the cores). So while benchmarks might seem favorable for avx2 it is unlikely that the performance gains of memchr are going to justify slowing down other parts of the system - unless the program is avx2 heavy to begin with.

@BurntSushi (Owner Author) commented Sep 17, 2018

Some benchmarks from aho-corasick (the regex benchmarks aren't quite good enough yet to differentiate the impact here, because of other optimizations kicking in):

$ cargo benchcmp ./tmp/master.log tmp/master-memchr.log --threshold 5
 name                                        master.log ns/iter   master-memchr.log ns/iter  diff ns/iter   diff %  speedup
 dense::ac_two_bytes                         1,240 (8064 MB/s)    144 (69444 MB/s)                 -1,096  -88.39%   x 8.61
 dense::ac_two_diff_prefix                   1,240 (8064 MB/s)    144 (69444 MB/s)                 -1,096  -88.39%   x 8.61
 dense_boxed::ac_ten_one_prefix_byte_random  15,346 (651 MB/s)    14,122 (708 MB/s)                -1,224   -7.98%   x 1.09
 dense_boxed::ac_two_bytes                   1,243 (8045 MB/s)    147 (68027 MB/s)                 -1,096  -88.17%   x 8.46
 dense_boxed::ac_two_diff_prefix             1,243 (8045 MB/s)    146 (68493 MB/s)                 -1,097  -88.25%   x 8.51
 dense_boxed::ac_two_one_prefix_byte_random  14,562 (686 MB/s)    13,294 (752 MB/s)                -1,268   -8.71%   x 1.10
 full::ac_two_bytes                          1,242 (8051 MB/s)    145 (68965 MB/s)                 -1,097  -88.33%   x 8.57
 full::ac_two_diff_prefix                    1,242 (8051 MB/s)    145 (68965 MB/s)                 -1,097  -88.33%   x 8.57
 full_overlap::ac_two_bytes                  1,243 (8045 MB/s)    147 (68027 MB/s)                 -1,096  -88.17%   x 8.46
 full_overlap::ac_two_diff_prefix            1,243 (8045 MB/s)    147 (68027 MB/s)                 -1,096  -88.17%   x 8.46
 sherlock::name_alt2                         136,524 (4357 MB/s)  55,499 (10719 MB/s)             -81,025  -59.35%   x 2.46
 sherlock::name_alt4                         134,124 (4435 MB/s)  53,523 (11115 MB/s)             -80,601  -60.09%   x 2.51
 sherlock::name_alt5                         199,892 (2976 MB/s)  93,936 (6333 MB/s)             -105,956  -53.01%   x 2.13
 sparse::ac_two_bytes                        1,240 (8064 MB/s)    143 (69930 MB/s)                 -1,097  -88.47%   x 8.67
 sparse::ac_two_diff_prefix                  1,240 (8064 MB/s)    143 (69930 MB/s)                 -1,097  -88.47%   x 8.67

In particular, the big speedups here are from vectorizing memchr2 and memchr3, which were previously unvectorized.
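
For context, the heart of a vectorized memchr2 looks roughly like this
on x86_64, where SSE2 is always available (an illustrative sketch, not
the code in this PR, which additionally handles alignment, unrolling,
and AVX2):

#[cfg(target_arch = "x86_64")]
fn memchr2_sse2(n1: u8, n2: u8, haystack: &[u8]) -> Option<usize> {
    use core::arch::x86_64::*;
    unsafe {
        let v1 = _mm_set1_epi8(n1 as i8);
        let v2 = _mm_set1_epi8(n2 as i8);
        let mut i = 0;
        while i + 16 <= haystack.len() {
            // Compare 16 bytes against both needles at once and OR the
            // results; any set bit in the movemask is a match.
            let chunk = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
            let eq = _mm_or_si128(_mm_cmpeq_epi8(chunk, v1), _mm_cmpeq_epi8(chunk, v2));
            let mask = _mm_movemask_epi8(eq) as u32;
            if mask != 0 {
                return Some(i + mask.trailing_zeros() as usize);
            }
            i += 16;
        }
        // Scalar scan for the final partial chunk.
        haystack[i..].iter().position(|&b| b == n1 || b == n2).map(|p| i + p)
    }
}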

@BurntSushi (Owner Author)

@alkis

Drive by comment: use of avx2 instructions clocks down the core (on older CPUs it might even clock down all the cores). So while benchmarks might seem favorable for avx2 it is unlikely that the performance gains of memchr are going to justify slowing down other parts of the system - unless the program is avx2 heavy to begin with.

Yes, I'm aware of this downside to AVX2. However, that both glibc and Go's standard library use AVX2 for this exact same routine is a strong vote of confidence in favor of doing it. With that said, I would be happy to revise this understanding given a benchmark. From there, we can figure out how to deal with it.

@gnzlbg commented Sep 18, 2018

Just using AVX2 does not mean that a core (or all cores) will be downclocked. Whether AVX2 downclocks an old core depends on how exactly AVX2 is used (https://lemire.me/blog/2018/04/19/by-how-much-does-avx-512-slow-down-your-cpu-a-first-experiment/).

@alkis commented Sep 18, 2018

Just using AVX2 does not mean that a core (or all cores) will be downclocked. Whether AVX2 downclocks an old core depends on how exactly AVX2 is used (https://lemire.me/blog/2018/04/19/by-how-much-does-avx-512-slow-down-your-cpu-a-first-experiment/).

Haswell XEONs clock down all the cores on the same chip. More recent chips (Broadwell, Skylake, Kaby Lake) clock down progressively fewer cores. The same goes for when, and with which instructions, the downclocking happens: more recent CPUs are less prone to downclocking.

As for Lemire's analysis above, it is incomplete. For one, the tests are only on a single model, but more importantly, only 1 core out of 4 is exercised. The latter makes it a lot less likely for the core to clock down. CPUs clock down not because the manufacturers are trying to cheat you - it is because the core needs to stay below some max power threshold. If only a quarter of the total chip is used, it is hard to make the core downclock at all.

@gnzlbg commented Sep 18, 2018

As for Lemire's analysis above, it is incomplete. For one, the tests are only on a single model, but more importantly, only 1 core out of 4 is exercised. The latter makes it a lot less likely for the core to clock down. CPUs clock down not because the manufacturers are trying to cheat you - it is because the core needs to stay below some max power threshold. If only a quarter of the total chip is used, it is hard to make the core downclock at all.

The last example in Lemire's blog post:

#include <x86intrin.h>
#include <stdlib.h>
int main(int argc, char **argv) {
  if(argc>1) _mm256_zeroupper();
  float a = 3;
  float b = rand();
  if(argc>2) _mm256_zeroupper();
  for(int k = 0; k < 5000000; k++) {
    b = a + b * b;
    a = b - a * a;
  }
  return (b == 0)? 1 : 0;
}

shows that using two AVX1 instructions just once in a single-threaded program can have a catastrophic impact on performance. The second, multi-threaded example, which uses FMA heavily on all cores, shows that using a lot of AVX instructions in the hot spots of your code can make a measurable difference in performance, while the first, single-threaded example, which uses AVX heavily in a hot spot, shows no difference at all.

Haswell XEONs clock down all the cores on the same chip.

That depends on the instructions used, when they are used, the CPU involved, etc. The Intel white papers explicitly state that downclocking when using 256-bit and 512-bit wide registers only happens for "certain workloads". Using a single AVX1 instruction a lot does not necessarily downclock anything, but using another instruction sequence only once in your program execution might downclock one core, or all cores, for a long time.

If using memchr once or a lot downclocks your CPU in an undesired way, it would be nice for you to provide an example that reproduces the issue. It might be possible to tune the implementation to avoid the downclocking.
