
Consider enabling AVX2 or SSE2 features at runtime #143

Closed
ethack opened this issue Jul 9, 2021 · 17 comments
Labels: enhancement (New feature or request)


@ethack commented Jul 9, 2021

This might be a long-shot, but figured I'd suggest it.

How I got here:
I have a Docker image where I compile ugrep and then I end up running it on a variety of different systems. The other day ugrep crashed on a particular system and gave me a core dump. After some troubleshooting I concluded it was because ugrep was compiled with AVX2 support enabled, but the CPU on the system I was running on didn't support it.

I know I could use the --disable-avx flag while building, but I'm assuming that comes with a performance hit on CPUs that do support AVX2. Or I could build ugrep each time on the target system, but that doesn't really work for my use case.

I often switch between ugrep and ripgrep for different reasons. When ugrep crashed on that no-AVX2 system I switched to ripgrep and it worked fine. While investigating more later I noticed that ripgrep does something interesting: it enables certain features dynamically when the CPU supports AVX2 or SSSE3.

I'm wondering if it would be possible for ugrep to do the same? I realize it's likely not a simple thing to implement, so feel free to close this if the cost to implement is too high.

I should also point out that my Docker image (and thus ugrep's compilation) is built on a GitHub runner in the cloud, where I don't really have control over the CPU instruction sets. So if I happen to get unlucky and compile ugrep on a legacy system without AVX2 support, then AVX2 would be disabled even if I ran the ugrep binary on systems that did support it. Runtime support would help in this scenario as well.

Relevant links:

@genivia-inc (Member)

There is a runtime check to disable AVX and SSE, but unfortunately, the way the C++ code is typically built on *nix systems, this will not prevent AVX/SSE optimizations done by the compiler. That is the problem, at least on *nix. Windows is OK, because I've built ugrep.exe without AVX/SSE optimizations but with specific code that uses AVX and SSE intrinsics to speed up searching, as intended. Unfortunately, it is a known problem that this approach will not work with GNU or Clang C++ on *nix, because the AVX/SSE intrinsics require the AVX/SSE compiler options to be enabled, and those options are then applied globally, essentially to all code. We're not alone in facing this issue; it was discussed before in #103, with the resolution to build ugrep with ./build.sh --disable-avx. SSE is widely supported and should be fast enough anyway.
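To illustrate, the runtime check itself boils down to something like the following minimal sketch (hypothetical helper name, not ugrep's actual detection code), using the __builtin_cpu_supports() builtin that GCC and Clang provide:

#include <cstdio>

// Hypothetical sketch of a runtime CPU feature check; on GCC and Clang
// __builtin_cpu_supports() compiles to a CPUID-based lookup.
static bool have_avx2_sketch()
{
  return __builtin_cpu_supports("avx2") != 0;
}

int main()
{
  // The guard protects only the branch it wraps: if the whole program is
  // compiled with -mavx2, the compiler may emit AVX2 instructions in
  // unguarded code, and no runtime check can prevent that.
  if (have_avx2_sketch())
    std::puts("AVX2 path selected");
  else
    std::puts("SSE2/scalar fallback selected");
  return 0;
}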

I will talk to an old friend of mine at Google who used to work at Intel on the Intel C++ compiler's auto-vectorization algorithms and see what he recommends.

@ethack (Author) commented Jul 9, 2021

Thank you for the explanation! That is nifty that you already handle it at runtime as well. But a bummer that the build tooling is the wrench in the gears here.

@genivia-inc (Member)

There are two possibilities, neither is ideal:

  1. isolate the custom SIMD-specialized parts with AVX/SSE intrinsics and move them to new C++ source files to compile separately.
  2. rewrite the SIMD-specialized parts with inlined assembly.

IMHO option 1 is preferable (see the sketch below). The code duplication it entails can also limit the call-return overhead by hoisting the runtime CPU check up and out of loops (not a bad idea anyway).
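A rough sketch of option 1, with hypothetical file and function names (not ugrep's actual code): the AVX2 intrinsics live in their own translation unit, the only one compiled with -mavx2, so the compiler cannot leak AVX2 instructions into the fallback code:

// simd_avx2_sketch.cpp -- compile only this file with -mavx2.
#include <immintrin.h>
#include <cstddef>

const char *find_byte_avx2(const char *s, size_t n, char c)
{
  const __m256i needle = _mm256_set1_epi8(c);
  size_t i = 0;
  for (; i + 32 <= n; i += 32)
  {
    __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + i));
    unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle));
    if (mask != 0)
      return s + i + __builtin_ctz(mask); // first matching byte in this chunk
  }
  for (; i < n; ++i) // scalar tail
    if (s[i] == c)
      return s + i;
  return nullptr;
}

The rest of the code sees only a plain declaration in a header and calls this after the runtime CPU check passes; checking once, up front, keeps the call overhead out of the inner loop.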

The structure of the code right now uses both compile-time and runtime CPU checks to guard the SIMD-optimized parts with intrinsics:

  while (true)
  {
    if (lcs_ < len)
    {
      ...
#if defined(HAVE_AVX512BW) && (!defined(_MSC_VER) || defined(_WIN64))
      if (have_HW_AVX512BW())
      ...
#elif defined(HAVE_AVX2)
      if (have_HW_AVX2())
      ...
#elif defined(HAVE_SSE2)
      if (have_HW_SSE2())
      ...
#elif defined(HAVE_NEON)
      ...
#endif
      ... // non-SIMD code fallback with memchr/memcmp etc. to find a match
    }
    else
    {
      ... // non-SIMD code fallback when Boyer-Moore is expected to be faster
    }
  }

The goal is to prevent the compiler from using SIMD instructions to optimize the unguarded fallback parts. The idea is to create four or five specialized versions of the entire source file, because I don't want to introduce new functions to execute each guarded branch; that would be easier, but it adds call-return overhead to this performance-critical part. Having multiple copies of essentially the same source code does not help maintainability, though.

It is surprisingly dumb of Clang and GCC to require -mavx2 and friends just to use intrinsics, because those flags also enable CPU-specific instructions and optimizations everywhere else in the code.
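For completeness, both compilers do offer a per-function escape hatch: with __attribute__((target("avx2"))) a translation unit built without -mavx2 can still define one AVX2-enabled function (reasonably recent GCC and Clang). Whether that fits the hot path here is another matter, since such a function cannot be inlined into baseline callers. A minimal sketch, with a hypothetical function name:

#include <immintrin.h>

// AVX2 intrinsics are legal inside this function because of the target
// attribute; the rest of the file is still compiled for the baseline.
__attribute__((target("avx2")))
unsigned count_eq_avx2(const unsigned char *p, unsigned char c)
{
  __m256i needle = _mm256_set1_epi8((char)c);
  __m256i chunk  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
  int mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle));
  return (unsigned)__builtin_popcount((unsigned)mask);
}

The caller must of course still perform the runtime CPU check before calling it.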

@genivia-inc (Member) commented Jul 16, 2021

Checking with Compiler Explorer's assembly output, Clang already enables SSE2 instructions in plain x86 builds with -O2 (without -msse2 etc.). Therefore, the suggested build step:

./build.sh --disable-avx

should produce a clean x86 executable that is binary-portable to all (modern) x86 systems. ugrep --version displays +sse2 when SSE2 is enabled and supported, or -sse2 when it is not supported.

I had some time today to work on refactoring the code to effectively enable the AVX/SSE2 runtime optimizations only when the CPU supports them. All tests pass with different AVX/SSE configurations. But I don't have machines to verify that the binary is indeed movable from an AVX machine to a non-AVX-capable machine (all my machines are AVX-capable). There is a cascade of small changes to the source code and the build scripts, but none should negatively impact performance.

@genivia-inc (Member)

It would help a lot if you could give this beta dev version a try, since I don't have non-AVX machines to test this against: https://www.genivia.com/files/ugrep-v3.3.5-beta.zip
The zip contains the updated source code (still with ugly FIXME reminders) and the ./build.sh script.

@genivia-inc added the enhancement (New feature or request) label Jul 16, 2021
@ethack (Author) commented Jul 16, 2021

More than happy to! You may have to coach me through any specific tests or information you need for debugging.

Summary: ug --version reports +avx2 on my modern CPU and my test search worked. On the legacy CPU it reports +sse2 and outputs Illegal instruction (core dumped) on my test search.

Here are my steps.

  1. Created a ugrep-beta.Dockerfile based on Ubuntu 21.04 to build ugrep 3.3.5-beta.
  2. Modern: Built the Docker image (which built ugrep).
  3. Modern: Inside the container, ran ug --version and ug -i avx2 on the source code directory. This produced matches.

     ug --version
     ugrep 3.3.5 x86_64-pc-linux-gnu +avx2 +pcre2_jit +zlib

     ug -i avx2 | wc -l
     57

  4. Transferred the Docker image from the modern system to the legacy one.
  5. Legacy: Inside the container, ran ug --version and ug -i avx2. This crashed.

     ug --version
     ugrep 3.3.5 x86_64-pc-linux-gnu +sse2 +pcre2_jit +zlib

     ug -i avx2
     Illegal instruction (core dumped)

  6. Legacy: Built the Docker image (which built ugrep).
  7. Legacy: Inside the container, ran ug --version and ug -i avx2. This produced matches. (The different number of matches seems related to binary files and log output from building on the different systems.)

     ug --version
     ugrep 3.3.5 x86_64-pc-linux-gnu +sse2 +pcre2_jit +zlib

     ug -i avx2 | wc -l
     45

CPU flags from /proc/cpuinfo:

  • Modern: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp sgx_lc md_clear flush_l1d
  • Legacy: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt xsave rdrand lahf_lm abm cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust erms invpcid xsaveopt dtherm arat pln pts

Let me know what would be most useful for you: build logs, core dump, or running with different flags, etc.

@ethack (Author) commented Jul 16, 2021

Also, please let me know if I did the right build steps in my Dockerfile. I just did ./build.sh, but I'm wondering if I should have done ./build.sh --disable-avx. After re-reading your earlier comment, I'm thinking that --disable-avx only affects the compiler in the beta and doesn't affect the runtime. So even with --disable-avx in the build step, will it still enable AVX features at runtime?

You also mentioned performance. If there are any cases you are worried about, let me know and I can run benchmarks.

@ethack (Author) commented Jul 16, 2021

I built it with ./build.sh --disable-avx as well. The good news is that it works the same on both systems without crashing. But I'm not sure how to tell whether the AVX2 features are being enabled at runtime on my modern CPU. This is the ug --version output on my modern CPU:

ug --version
ugrep 3.3.5 x86_64-pc-linux-gnu +sse2 +pcre2_jit +zlib

It no longer shows +avx2 like it did before, which might not mean anything for the runtime. ripgrep also had an issue where it wasn't clear which features were actually enabled and supported by the current CPU. They solved that by outputting two lines of features, one for compile time and one for runtime, like this:

rg --version                       
ripgrep 13.0.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

@genivia-inc (Member)

I think the illegal instruction is probably caused by vpbroadcastb in the SSE2 code path. I can tell because, in the meantime, I had ported this isolated SIMD part of the code with a bit of effort to a small stand-alone test program to compile with Compiler Explorer and verify the assembly: https://godbolt.org/z/Med3sWKav

Lo and behold: there is an AVX2 instruction vpbroadcastb in the SSE2 version of the code! Sure, it is compiled with -mavx2, but the code with SSE2 intrinsics is so simple that I did not expect vpbroadcastb or any other AVX2 instruction to occur in the SSE2-specific code. Besides the SSE2 intrinsics, there are a few other calculations necessary; apparently those are optimized with AVX2. Alas, Compiler Explorer won't even tell me where the vpbroadcastb originates in the source code.

I was hoping that code duplication would not be necessary, so I isolated the SIMD-critical code in one new file (simd.cpp) instead of splitting it up into many files. The only way around this will be to create one simd.cpp code version for SSE2 compiled with -msse2, one for AVX2 compiled with -mavx2, and one for AVX512BW, then select one of the three at runtime to execute.
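The selection itself can then be a one-time dispatch along these lines (a sketch with hypothetical function names, not ugrep's actual API):

// dispatch_sketch.cpp -- compiled with baseline flags only. Each declared
// find_* function is defined in its own translation unit: one compiled
// with -msse2, one with -mavx2, one with -mavx512bw.
#include <cstddef>

const char *find_sse2(const char *s, size_t n, char c);
const char *find_avx2(const char *s, size_t n, char c);
const char *find_avx512bw(const char *s, size_t n, char c);

bool have_HW_AVX2();       // CPUID-based runtime checks, baseline code
bool have_HW_AVX512BW();

// Resolved once at startup; the hot loops call through the pointer, so
// the CPU check stays out of the loops entirely.
const char *(*const find_byte)(const char*, size_t, char) =
    have_HW_AVX512BW() ? find_avx512bw :
    have_HW_AVX2()     ? find_avx2     :
                         find_sse2;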

What a pain...

@genivia-inc (Member)

To explain the +sse2: I may have incorrectly stated that -avx2 would be shown. Only when SIMD optimizations are not possible at runtime at all will (-avx2) or (-sse2) be shown. So +sse2 simply means that SSE2 is selected, either because of the build flag (--disable-avx) or because of the runtime check when compiled with AVX2.

@genivia-inc (Member)

Second attempt. This time I've split simd.cpp up into simd_avx2.cpp and simd_avx512bw.cpp, separately compiled with -mavx2 and -mavx512bw, respectively. Tested and verified the build on AVX-capable machines and ARM NEON machines: https://www.genivia.com/files/ugrep-v3.3.5-beta2.zip

Splitting up the code this way should make the x86 binary robust to transfers from an AVX-capable machine to a non-AVX2-capable machine that supports at least SSE2 (if the build steps detect SSE2 or greater, e.g. AVX2, then SSE2 will always be enabled in the generated x86 code with -msse2). The runtime CPU detection picks the AVX or SSE version of the code to execute.

@genivia-inc (Member)

Perhaps you could give this update a spin? I don't have a non-AVX machine to test: https://www.genivia.com/files/ugrep-v3.3.5-beta2.zip

@ethack (Author) commented Jul 20, 2021

I went through the same steps as before, building ugrep in an image on my modern system (./build.sh) and transferring it over to the legacy one. It now searches fine on both systems. In addition, I get this output:

# modern
$ md5sum /usr/local/bin/ugrep
2e2c465ae10ce3bd3a2ab27e7506a6e3  /usr/local/bin/ugrep
$ /usr/local/bin/ugrep --version
ugrep 3.3.5 x86_64-pc-linux-gnu +avx2 +pcre2_jit +zlib

# legacy
$ md5sum /usr/local/bin/ugrep
2e2c465ae10ce3bd3a2ab27e7506a6e3  /usr/local/bin/ugrep
$ /usr/local/bin/ugrep --version
ugrep 3.3.5 x86_64-pc-linux-gnu +sse2 +pcre2_jit +zlib

Are there any other commands or tests you'd like me to run?

@genivia-inc (Member)

Thanks! This is expected to work, since the code is now split up into separate compilation units when built with make on *nix systems. The output also looks fine: I opted to report the runtime support for AVX/SSE as +avx512bw, +avx2, or +sse2. If none applies at runtime but the binary was compiled with SSE2, then (no sse2!) is shown in v3.3.5; this indicates an incompatibility, meaning an abort with an illegal instruction may happen when running the x86 binary on the current machine.

If an x86 ugrep was built on a legacy *nix machine without SSE2/AVX2, then ugrep -V will not show anything at all with respect to SSE/AVX, and it will run just fine since no SIMD optimizations were applied.
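A sketch of the reporting logic just described (hypothetical code, not ugrep's actual source; the have_HW_*() runtime checks are plain baseline code that is always safe to call, while the HAVE_* macros reflect how the binary was compiled):

#include <string>

bool have_HW_SSE2();
bool have_HW_AVX2();
bool have_HW_AVX512BW();

std::string simd_feature()
{
#if defined(HAVE_AVX512BW)
  if (have_HW_AVX512BW())
    return "+avx512bw";
#endif
#if defined(HAVE_AVX512BW) || defined(HAVE_AVX2)
  if (have_HW_AVX2())
    return "+avx2";
#endif
#if defined(HAVE_AVX512BW) || defined(HAVE_AVX2) || defined(HAVE_SSE2)
  if (have_HW_SSE2())
    return "+sse2";
  return "(no sse2!)"; // compiled with SIMD, but this CPU cannot run it
#else
  return "";           // built without SIMD: nothing is shown
#endif
}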

@genivia-inc (Member)

The tricky part was to force -mavx512bw and -mavx2 when compiling the simd_avx512bw.cpp and simd_avx2.cpp code with AVX512BW and AVX2 intrinsics, using the autoconf and automake tools. The Makefile.am uses a trick to extend the compiler flags for these two files:

noinst_LIBRARIES      = libreflex.a
libreflex_a_CPPFLAGS  = -I$(top_srcdir)/include $(SIMD_FLAGS)
libreflex_a_SOURCES   = ...
libreflex_a-simd_avx2.$(OBJEXT) : CXXFLAGS += $(SIMD_AVX2_FLAGS)
libreflex_a-simd_avx512bw.$(OBJEXT) : CXXFLAGS += $(SIMD_AVX512BW_FLAGS)

The SIMD_FLAGS are just -msse2 -DHAVE_XXX, with HAVE_XXX one of HAVE_SSE2, HAVE_AVX2, or HAVE_AVX512BW, when SSE2 or greater is detected on the host machine. The SIMD_AVX512BW_FLAGS and SIMD_AVX2_FLAGS are -mavx512bw and -mavx2, or just empty if this is not an Intel x86 with SSE/AVX. Both files have the proper #if defined(HAVE_AVX512BW) and #if defined(HAVE_AVX2) guards to compile the optimized versions of the vectorized functions with intrinsics.
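Each specialized file then reduces to a pattern like this (a sketch with a hypothetical placeholder function; when the HAVE_* guard is not defined, the file compiles to an empty translation unit, so the build stays portable):

// simd_avx2.cpp (sketch) -- compiled with -mavx2 via SIMD_AVX2_FLAGS, but
// contains code only when configure defined HAVE_AVX2.
#if defined(HAVE_AVX2)

#include <immintrin.h>
#include <cstddef>

size_t simd_match_avx2(const char *s, size_t n)
{
  // ... AVX2 intrinsics implementing the vectorized matching here ...
  (void)s;
  (void)n;
  return 0;
}

#endif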

Easy does it, once you have a plan of attack!

@genivia-inc (Member)

The v3.3.5 update is available.

leahneukirchen added a commit to void-linux/void-packages that referenced this issue Jul 21, 2021
AVX2 is now checked on runtime: Genivia/ugrep#143
@genivia-inc (Member)

@ethack the ugrep v3.6.0 release casts a wider net for the AVX optimizations applied to the code. The code base uses multi-versioning as before, but now the entire matcher engine is optimized rather than a small portion of it. This was done to restore the speed to what it was before 3.3.5 was released. The latest version should work just as well as the previous versions (after 3.3.5) with respect to binary portability to non-AVX systems, with the runtime AVX check selecting the proper optimized version of the code. But one cannot be too cautious, so if you have any trouble, please report your observations.
