rewrite the entire crate #35
Conversation
This lets us conveniently test CI as we move along.
This patch is from the standard library.
This represents an initial go at adding comprehensive criterion benchmarks. We hit every public API function, and vary benchmarks based on both corpus size and match frequency. The point of this exercise is to establish a baseline on the status quo, and make sure we don't introduce any major regressions after a refactoring toward vendor intrinsics.
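To make the benchmark-matrix idea concrete, a Criterion benchmark along these lines is one way to vary corpus size per public API function (a sketch only; the group and parameter names, sizes, and never-matching needle below are illustrative, not the crate's actual suite):

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use memchr::memchr;

// Illustrative: vary haystack size with a needle that never occurs, which
// measures pure scan throughput. A fuller suite would also vary match
// frequency and cover the other public API functions.
fn bench_memchr(c: &mut Criterion) {
    let mut group = c.benchmark_group("memchr1/never");
    for (name, len) in [("small", 1_000usize), ("huge", 1_000_000)] {
        let haystack = vec![b'z'; len];
        group.throughput(Throughput::Bytes(len as u64));
        group.bench_with_input(BenchmarkId::from_parameter(name), &haystack, |b, h| {
            b.iter(|| memchr(b'a', h.as_slice()))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_memchr);
criterion_main!(benches);
```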
This reorganizes the internal structure of the crate to be less of a mess, including rewriting the tests. In particular, we commit to no_std and make the `use_std` feature obsolete, since we simply elect to always use `core`. We also add an explicit compilation error for non-32/64-bit systems. We technically supported such targets previously, but it's not clear whether that ever worked, since it was never tested. We can add support back if necessary.
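The compile-time gate is presumably along these lines (a sketch, not the exact code in the diff):

```rust
// Reject targets the crate has no tested support for. 16-bit targets
// previously "worked" on paper but were never exercised by CI.
#[cfg(not(any(target_pointer_width = "32", target_pointer_width = "64")))]
compile_error!("memchr currently only supports 32-bit and 64-bit targets");
```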
We'll use these configs to enable vendor intrinsics on stable Rust compilers that support them. To check the compiler version, we use the lightweight version_check crate.
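Roughly, a build script probes the compiler with version_check and emits a cfg the crate can test at compile time. This is a sketch assuming the version_check 0.9 API; the cfg name and version cutoff are illustrative:

```rust
// build.rs (sketch). Requires `version_check` as a [build-dependencies]
// entry. Emit a cfg only when the compiler is new enough to have stable
// std::arch support; the cfg name here is illustrative.
fn main() {
    // Returns None if rustc's version can't be determined.
    let new_enough = version_check::is_min_version("1.27.0").unwrap_or(false);
    if new_enough {
        println!("cargo:rustc-cfg=memchr_runtime_simd");
    }
}
```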
This commit updates the tests to increase their coverage dramatically. The changes were influenced by my experiments with the existing tests: I introduced subtle bugs into the memchr implementations and watched in horror as the test suite continued to pass. We increase coverage by doing two things. Firstly, we pad the corpus of each test with bytes that we never use as a needle, going all the way up to ~500 bytes to make sure we cover every form of case analysis; we pad both sides of the corpus and update our expected positions accordingly. Secondly, for each `find` test, we vary the alignment of the slice we pass to memchr, all the way up to 66, to account for AVX2 optimizations (which read at least 32 bytes at a time, but loop unrolling might make this 64 bytes at a time). This coverage now passes my smell test: every bug I tried to introduce is caught by the test suite (unless the bug introduces UB, but that's par for the course).
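A sketch of the offset/alignment sweep described above (simplified; the helper name and filler byte are made up for illustration, and the real tests also pad the corpus itself and check every expected position):

```rust
// For one (needle, haystack, expected) case, re-run the search at every
// leading offset up to 66 bytes, so that the scalar prologue and the
// SSE2/AVX2 paths (including unrolled loops) all get exercised.
fn check_all_offsets(needle: u8, haystack: &[u8], expected: Option<usize>) {
    for offset in 0..=66usize {
        // Prepend filler bytes that are assumed never to match the needle,
        // then search a slice starting just past the filler. Only the
        // slice's address offset changes; the searched bytes are the same.
        let mut padded = vec![b'%'; offset];
        padded.extend_from_slice(haystack);
        let got = memchr::memchr(needle, &padded[offset..]);
        assert_eq!(expected, got, "failed at offset {}", offset);
    }
}

#[test]
fn memchr_various_offsets() {
    check_all_offsets(b'a', b"zzzazzz", Some(3));
    check_all_offsets(b'a', b"zzzzzzz", None);
}
```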
I'll try to see about posting numbers, but the switch to criterion has cost me its conveniently condensed output. (Still, it was invaluable to actually doing this rewrite.) One interesting thing I discovered in this work is that glibc's AVX implementation of memchr, as far as I can tell from the disassembly, does not actually have a "byte at a time" matching loop to handle haystacks smaller than the vector size. In particular, it appears to potentially read past the end of the provided buffer, and then "fixes up" the result by making sure it doesn't return a match position past the end of the buffer. From what I can tell, this is intentional, but whether it's UB or not isn't clear. It might be a case where it's legal to do in assembly, but perhaps not in C. Either way, there's some interesting reading on the topic: https://stackoverflow.com/questions/37800739/is-it-safe-to-read-past-the-end-of-a-buffer-within-the-same-page-on-x86-and-x64
I'm impressed, as usual, with your code. Kudos!
use core::usize;

#[cfg(target_pointer_width = "32")]
const USIZE_BYTES: usize = 4;
For inclusion in libcore, we need to handle pointer_width = 16 as well
Yeah, I actually made it a compile error to build memchr on 16 bit platforms because I don't have an easy way to test it: https://github.com/BurntSushi/rust-memchr/pull/35/files#diff-b4aea3e418ccdb71239b96952d9cddb6R31
Yeah, libcore uses memchr for
Inclusion into libcore/libstd will also need to use at least
@nagisa Yeah, I saw those, but neither is stable yet. I also don't think I understand why
This should probably use
Making the changes as part of integration into core/std is fine. Use of those functions is necessary for miri.
@gnzlbg It does indeed do that. And yes, there are probably varying levels of what could be pushed into core:
But yeah, we can certainly try to get the code to be as similar as possible in both places, but it sounds like there will need to be at least some changes. In any case, all the different implementations of memchr are pretty self-contained and amenable to just copying, even if the higher-level dispatching needs to be customized for core.
Looking more closely at how to do the integration, we can probably get away with just exposing different
The problem is that stuff like the slice methods get re-exported in
Not currently, and this is by design, since run-time feature detection is a platform thing (e.g., can the OS save AVX registers? That depends on the OS, and on whether you are in user space or kernel space), but
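(For context, the std-side runtime dispatch under discussion looks roughly like the sketch below. The helper names are hypothetical stand-ins, not this crate's actual internals; `is_x86_feature_detected!` is the std facility with no core counterpart.)

```rust
// std-only runtime dispatch: the detection macro needs OS support (e.g.
// for saving AVX registers), which is why a libcore copy of this code
// cannot dispatch this way.
fn find(needle: u8, haystack: &[u8]) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { find_avx2(needle, haystack) };
        }
    }
    find_fallback(needle, haystack)
}

// Stand-in for a real AVX2 implementation; the #[target_feature] attribute
// is what would let the body use core::arch::x86_64 intrinsics.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_avx2(needle: u8, haystack: &[u8]) -> Option<usize> {
    find_fallback(needle, haystack)
}

fn find_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

fn main() {
    assert_eq!(find(b'c', b"abc"), Some(2));
}
```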
This fills out the complete API of the memchr crate, namely, including memrchr2 and memrchr3, along with double ended iterators for memchr2 and memchr3. We also add `memr?chr[123]_iter` convenience free functions for constructing iterators, instead of needing to import the iterator type itself. In doing so, we provide vectorized implementations for everything. Both memchr and memrchr should now be competitive with glibc on x86_64, and therefore, we now drop libc on x86_64. This also has the benefit of providing consistent performance across all platforms, regardless of which libc is used. Namely, the performance of the various implementations of memchr across different implementations of libc can vary greatly. We also rewrite the fallback routines to make them faster. For the most part, throughput remains roughly the same, but we reduce overhead quite a bit by just accepting our fate and using raw pointers everywhere. Closes #19, Closes #34
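For a sense of the resulting API surface, the iterator conveniences read roughly like this in use (a sketch; the names follow the `memr?chr[123]_iter` pattern described above):

```rust
use memchr::{memchr2_iter, memrchr_iter};

fn main() {
    let haystack = b"abcabcabc";

    // Forward iteration over positions of either b'a' or b'c'.
    let positions: Vec<usize> = memchr2_iter(b'a', b'c', haystack).collect();
    assert_eq!(positions, vec![0, 2, 3, 5, 6, 8]);

    // Reverse iteration over positions of b'b'.
    let rev: Vec<usize> = memrchr_iter(b'b', haystack).collect();
    assert_eq!(rev, vec![7, 4, 1]);
}
```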
Drive-by comment: use of AVX2 instructions clocks down the core (on older CPUs it might even clock down all the cores). So while benchmarks might seem favorable for AVX2, it is unlikely that the performance gains in memchr are going to justify slowing down other parts of the system, unless the program is AVX2-heavy to begin with.
Some benchmarks from aho-corasick (the regex benchmarks aren't quite good enough yet to differentiate the impact here, because of other optimizations kicking in):
In particular, the big speedups here are from vectorizing
Yes, I'm aware of this downside to AVX2. However, the fact that both glibc and Go's standard library use AVX2 for this exact routine is a strong vote of confidence in favor of doing it. With that said, I would be happy to revise this understanding given a benchmark. From there, we can figure out how to deal with it.
Just using AVX2 does not mean that a core (or all cores) will be downclocked. Whether AVX2 downclocks an old core depends on how exactly AVX2 is used (https://lemire.me/blog/2018/04/19/by-how-much-does-avx-512-slow-down-your-cpu-a-first-experiment/).
Haswell Xeons clock down all the cores on the same chip. More recent chips (Broadwell, Skylake, Kaby Lake) clock down progressively fewer cores. The same goes for when, and with which instructions, the downclocking happens: more recent CPUs are less prone to downclocking. As for Lemire's analysis above, it is incomplete. For one, the tests are only on a single model, but more importantly, only 1 core out of 4 is exercised. The latter makes it a lot less likely for the core to clock down. CPUs clock down not because the manufacturers are trying to cheat you, but because the chip needs to stay below some maximum power threshold. If only a quarter of the total chip is used, it is hard to make the core downclock at all.
The last example in Lemire's blog post:

#include <x86intrin.h>
#include <stdlib.h>
int main(int argc, char **argv) {
  if(argc>1) _mm256_zeroupper();
  float a = 3;
  float b = rand();
  if(argc>2) _mm256_zeroupper();
  for(int k = 0; k < 5000000; k++) {
    b = a + b * b;
    a = b - a * a;
  }
  return (b == 0)? 1 : 0;
}

shows that using two AVX1 instructions once in a single-threaded program can have a catastrophic impact on performance. The second, multi-threaded example, which uses FMA heavily on all cores, shows that using a lot of AVX instructions in the hot spots of your code can have measurable differences in performance, while the first single-threaded example, which uses AVX heavily in a hotspot, shows no difference at all.
That depends on the instructions used, when they are used, the CPU involved, etc. The Intel white papers explicitly state that downclocking when using 256-bit and 512-bit wide registers only happens for "certain workloads". Using a single AVX1 instruction a lot does not necessarily downclock anything, but using another instruction sequence only once in your program's execution might downclock one core, or all cores, for a long time. If using
This fills out the complete API of the memchr crate, namely, including
memrchr2 and memrchr3, along with double ended iterators for memchr2
and memchr3. We also add `memr?chr[123]_iter` convenience free functions
for constructing iterators, instead of needing to import the iterator
type itself.
In doing so, we provide vectorized implementations for everything. Both
memchr and memrchr should now be competitive with glibc on x86_64, and
therefore, we now drop libc on x86_64. This also has the benefit of
providing consistent performance across all platforms, regardless of
which libc is used. Namely, the performance of the various
implementations of memchr across different implementations of libc can
vary greatly. In particular, static builds of Rust programs using MUSL can
now benefit from this work. Previously, such programs would use MUSL's
(comparatively) unoptimized routine for memchr.
We also rewrite the fallback routines to make them faster as well.
For the most part, throughput remains roughly the same, but we reduce
overhead quite a bit by just accepting our fate and using raw pointers
everywhere.
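As background on the fallback, the word-at-a-time core such routines typically rely on is the classic zero-byte test. A standalone sketch follows (using u64 words for brevity, whereas the crate works on usize-sized words through raw pointers):

```rust
// Broadcast constants: LO = 0x0101...01, HI = 0x8080...80.
const LO: u64 = u64::MAX / 0xff;
const HI: u64 = LO * 0x80;

// True iff any byte of `x` is zero: the classic (x - LO) & !x & HI test.
fn contains_zero_byte(x: u64) -> bool {
    x.wrapping_sub(LO) & !x & HI != 0
}

// True iff any byte of `word` equals `needle`: XOR turns matching bytes
// into zero bytes, then reuse the zero-byte test.
fn word_has_byte(word: u64, needle: u8) -> bool {
    contains_zero_byte(word ^ (LO * needle as u64))
}

fn main() {
    let word = u64::from_ne_bytes(*b"xxxzxxxx");
    assert!(word_has_byte(word, b'z'));
    assert!(!word_has_byte(word, b'q'));
}
```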
Finally, we dramatically improve the coverage of tests and benchmarks.
In the course of working on this, I played the role of a mutation tester,
and generally speaking, small tweaks to make any of the algorithms
incorrect should now be caught by the test suite. This is important because
there are a ton of parameters that impact how the algorithm behaves (such
as CPU support, alignment, length, and of course, the haystack itself).
Closes #19, Closes #34
cc'ing some folks that I know have worked in this area: @Manishearth @bluss @llogiq @Veedrac --- any review you might be able to provide would be awesome! I also imagine we might want to pull some of this into `std`.