Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dispatch regression with recent commit #763

Closed
tycho opened this issue Nov 27, 2022 · 28 comments · Fixed by #865
Closed

dispatch regression with recent commit #763

tycho opened this issue Nov 27, 2022 · 28 comments · Fixed by #865

Comments

@tycho
Copy link

tycho commented Nov 27, 2022

Commit 91788f1 seems to have caused a regression with dispatch, causing it to unconditionally use AVX512 on my system -- while my toolchain supports AVX512, my host CPU does not.

My system hits an illegal instruction through here:

xxHash/xxhash.h

Line 5028 in 91788f1

XXH3_accumulate_512(acc, p, secret + secretSize - XXH_STRIPE_LEN - XXH_SECRET_LASTACC_START);

Call stack:

>	game.exe!XXH3_accumulate_512_avx512(void * acc, const void * input, const void * secret) Line 4195	C
 	game.exe!XXH3_hashLong_internal_loop(unsigned __int64 * acc, const unsigned char * input, unsigned __int64 len, const unsigned char * secret, unsigned __int64 secretSize, void(*)(unsigned __int64 *, const unsigned char *, const unsigned char *, unsigned __int64) f_acc, void(*)(void *, const void *) f_scramble) Line 5093	C
 	game.exe!XXH3_hashLong_64b_internal(const void * input, unsigned __int64 len, const void * secret, unsigned __int64 secretSize, void(*)(unsigned __int64 *, const unsigned char *, const unsigned char *, unsigned __int64) f_acc, void(*)(void *, const void *) f_scramble) Line 5144	C
 	game.exe!XXHL64_default_avx2(const void * input, unsigned __int64 len) Line 555	C
 	game.exe!XXH3_hashLong_64b_defaultSecret_selection(const void * input, unsigned __int64 len, unsigned __int64 seed64, const unsigned char * secret, unsigned __int64 secretLen) Line 688	C
 	game.exe!XXH3_64bits_internal(const void * input, unsigned __int64 len, unsigned __int64 seed64, const void * secret, unsigned __int64 secretLen, unsigned __int64(*)(const void *, unsigned __int64, unsigned __int64, const unsigned char *, unsigned __int64) f_hashLong) Line 5244	C
 	game.exe!XXH3_64bits_dispatch(const void * input, unsigned __int64 len) Line 693	C

EDIT: For clarity, note that dispatch correctly selected AVX2 (XXHL64_default_avx2) but XXH3_hashLong_internal_loop directly uses the XXH3_accumulate_512 macro which resolved to XXH3_accumulate_512_avx512 (which was the highest toolchain-supported ISA at compile-time).

I'm not sure what the correct line would look like.

EDIT: Removed bogus patch. Better one in comments.

@t-mat
Copy link
Contributor

t-mat commented Nov 28, 2022

Hi @tycho, thanks for the report.
I think it's our side mistake. So I'd like to mention other devs to understand/resolve this issue.

If you can, could you provide some details of your project? Especially your compiler and compile-time options for xxhash will greatly help us.


  • @hzhuang1, could you review above patch? It seems good for me, but I'm not sure about corner cases.

  • @Cyan4973, TBH I'm just a bit surprised that we don't have test for this issue.

So far, I think we can introduce qemu-system-x86 to our qemu test on GitHub actions.

sudo apt install qemu-system-x86
qemu-x86_64-static --version
RUN_ENV="qemu-x86_64-static -cpu Haswell" LDFLAGS="-static" make clean check

I'm also not sure we have proper test for this issue. Since xxhsum is special case for xxh_x86dispatch, we may need dedicated simple test to prevent same issue again in future.
Please let me know if you have idea.

@hzhuang1
Copy link
Contributor

diff --git a/xxhash.h b/xxhash.h
index a17a8eb..3428620 100644
--- a/xxhash.h
+++ b/xxhash.h
@@ -5088,7 +5088,7 @@ XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc,
/* last stripe /
{ const xxh_u8
const p = input + len - XXH_STRIPE_LEN;
#define XXH_SECRET_LASTACC_START 7 /* not aligned on 8, last secret is different from acc & scrambler */

  •        XXH3_accumulate_512(acc, p, secret + secretSize - XXH_STRIPE_LEN - XXH_SECRET_LASTACC_START);
    
  •        f_acc(acc, p, secret + secretSize - XXH_STRIPE_LEN - XXH_SECRET_LASTACC_START, 1);
    
    } }
    }

f_acc means the full accumulation loop, and XXH3_accumulate_512 means the internal 512-byte loop. At here, it should be 512-byte internal loop since it's the last stripe. It was talked in the pull request #692.

I need to understand why f_acc() won't trigger the same issue on your machine.

@hzhuang1
Copy link
Contributor

@tycho Could you help me to do a quick test? How about only revert the change on xxh_x86dispatch.c in commit 91788f1? I think that I changed it with wrong settings.

@tycho
Copy link
Author

tycho commented Nov 28, 2022

Hi @tycho, thanks for the report. I think it's our side mistake. So I'd like to mention other devs to understand/resolve this issue.

If you can, could you provide some details of your project? Especially your compiler and compile-time options for xxhash will greatly help us.

I compile with LLVM clang-cl using Visual Studio 2019, with these flags (trimmed out include paths and other unnecessary flags):

C:\LLVM\main-release\bin\clang-cl.exe /c /I<several include paths> /Z7 /W4 /WX- /diagnostics:column /MP /O2 /Ob2 /Oi /Ot /D WIN32 /D _WIN32 /D LIBARCHIVE_STATIC /D PCRE2_STATIC /D TARGET_DEBUG /D __SSE__=1 /D __SSE_MATH__=1 /D __SSE2__=1 /D __SSE2_MATH__=1 /D __SSE3__=1 /D __SSSE3__=1 /D __SSE4_1__=1 /D __SSE4_2__=1 /D NDEBUG /D __STATIC__ /D NTDDI_VERSION=0x06010000 /D _WIN32_WINNT=0x0601 /D _CRT_SECURE_NO_WARNINGS /D _CRT_NONSTDC_NO_WARNINGS /D WIN32 /D NOMINMAX /D XXH_DISPATCH_AVX512=0 /D _HAS_ITERATOR_DEBUGGING=0 /D _SCL_SECURE=0 /D _UNICODE /D UNICODE /GF /MD /GS- /fp:fast /fp:except- /GR- /std:c17 /Fo"..." /Gd /TC --target=x86_64-pc-windows-msvc /showFilenames -Xclang -fopenmp-simd -Wno-microsoft-enum-forward-reference -Wno-deprecated-declarations -Wno-reorder-ctor -Wno-deprecated-builtins -Werror=implicit-function-declaration -Wno-overloaded-virtual -Wno-microsoft-include -Wno-single-bit-bitfield-constant-conversion -Xclang -fdenormal-fp-math=positive-zero -Xclang -fdenormal-fp-math-f32=positive-zero -march=x86-64-v2 -tune:znver3 -flto=thin -fwhole-program-vtables -Xclang -O2 -Xclang -ffast-math -fmerge-all-constants /Zc:__cplusplus /volatile:iso /Zc:preprocessor ..\..\contrib\xxHash\xxh_x86dispatch.c

@tycho Could you help me to do a quick test? How about only revert the change on xxh_x86dispatch.c in commit 91788f1? I think that I changed it with wrong settings.

Which part, the whole file? Because that just results in compile errors (looks like missing intrinsic header includes).

f_acc means the full accumulation loop, and XXH3_accumulate_512 means the internal 512-byte loop. At here, it should be 512-byte internal loop since it's the last stripe. It was talked in the pull request #692.

Ah.

I need to understand why f_acc() won't trigger the same issue on your machine.

Because f_acc is selected by dispatch, but the XXH3_accumulate_512 macro is just the highest toolchain-supported ISA at compile time? If it had another argument with a dispatch-selected XXH3_accumulate_512 it'd work fine. You can see in the call stack in my OP that dispatch selected AVX2 (XXHL64_default_avx2) but it called the AVX512 version of XXH3_accumulate_512.

@t-mat
Copy link
Contributor

t-mat commented Nov 28, 2022

I've tried to reproduce this issue with the following code. But failed. @tycho , could you provide some advice for reproducing your issue ? Since we have a plan to add new test for this issue, minimal repro code helps us a lot.

// test-issue-763.c
//
// ## test commands
//
//  git branch
//  # dev
//
//  git log -1
//  # 30d6a3e, Thu Nov 24 01:17:08 2022 -0800, Merge pull request #756 from hzhuang1/sve_02
//
//  gcc -DXXH_DISPATCH_AVX512=0 test-issue-763.c xxh_x86dispatch.c
//
//  ./a.out
//  # eb4b7c3707879151
//
//  RUN_ENV="qemu-x86_64-static -cpu Haswell" LDFLAGS="-static" ./a.out
//  # eb4b7c3707879151

#include <stdio.h>
#define XXH_STATIC_LINKING_ONLY
#define XXH_IMPLEMENTATION
#define XXH_INLINE_ALL
#include "xxhash.h"
#include "xxh_x86dispatch.h"

int main(int argc, char* argv[]) {
    static char src[4096];
    for(size_t i = 0; i < sizeof(src); ++i) {
        src[i] = (char) i;
    }

    XXH64_hash_t const hash_xxh64 = XXH3_64bits(src, sizeof(src));

    {
        XXH64_canonical_t canonical_xxh64;
        XXH64_canonicalFromHash(&canonical_xxh64, hash_xxh64);
        for(size_t i = 0; i < sizeof(canonical_xxh64.digest); ++i) {
            printf("%02x", canonical_xxh64.digest[i]);
        }
        printf("\n");
    }

    return 0;
}

@tycho
Copy link
Author

tycho commented Nov 28, 2022

Oh! I see why you're having trouble reproducing it.

I had forgotten that I have some patches on top of upstream xxHash, which make it possible to use dispatching on clang-cl with Visual Studio (note that clang-cl defines both _MSC_VER and __clang__):

diff --git a/xxh_x86dispatch.c b/xxh_x86dispatch.c
index b509035..094cc99 100644
--- a/xxh_x86dispatch.c
+++ b/xxh_x86dispatch.c
@@ -91,6 +91,18 @@ extern "C" {
 #  define XXH_HAS_INCLUDE(header) 0
 #endif

+/* clang-cl's intrinsic headers depend on the predefined macros coming from
+ * -march or -mfeature flags, which are not defined if we are targeting something
+ * without those features. Since we are deliberately doing dispatch for intrinsics
+ * we may not be able to execute on our primary target, we must define these here
+ * in order for the right intrinsics to get defined.
+ */
+#if defined(_MSC_VER) && defined(__clang__)
+#  define __AVX__
+#  define __AVX2__
+#  define __AVX512F__
+#endif
+
 /*!
  * @def XXH_DISPATCH_SCALAR
  * @brief Enables/dispatching the scalar code path.
@@ -175,7 +187,7 @@ extern "C" {
  * @def XXH_TARGET_AVX512
  * @brief Like @ref XXH_TARGET_SSE2, but for AVX512.
  */
-#if defined(__GNUC__)
+#if defined(__GNUC__) || defined(__clang__)
 #  include <emmintrin.h> /* SSE2 */
 #  if XXH_DISPATCH_AVX2 || XXH_DISPATCH_AVX512
 #    include <immintrin.h> /* AVX2, AVX512F */

The above is added to serves a few purposes:

  • Allow xxHash dispatch to compile on clang-cl at all
  • Include all the right intrinsics for each of the dispatch targets
  • Declare the targeting macros for each of the dispatch targets (i.e. __attribute__((target(*))), which are supported by clang-cl, unlike MSVC's cl)

Admittedly, defining __AVX__, __AVX2__, and __AVX512F__ is almost always a terrible idea, but there's no other way to get the headers for clang-cl to define the intrinsics for all the targets used by dispatch (as my comment in the patch above mentions, they are defined only when the appropriate __FEATURE__ macros are present).

The problem with defining those feature macros is that it made the automatic selection for XXH_VECTOR to be wrong (and thus use AVX512 for XXH3_accumulate_512). Note also that this was not a problem before, because previously all the dispatch paths used the correct ISA-specific functions. If I had also done #define XXH_VECTOR XXH_SSE2 as well (which I did in my project, but not within the project compiling xxh_x86dispatch.c), it would have compiled and run, though it would have still used the incorrect SSE2 instructions for the last stripe in XXH3_hashLong_internal_loop for all of the dispatch functions.

After adding my above patch to upstream xxHash, I can repro the issue (where test.c is the test code in your post):

> C:\llvm\main-release\bin\clang-cl.exe -march=x86-64-v2 /o xxhtest.exe xxh_x86dispatch.c test.c
> xxhtest.exe
* crash*

I think the following is probably the right approach, now that I understand the distinction between accumulate and accumulate_512:

diff --git a/xxh_x86dispatch.c b/xxh_x86dispatch.c
index b509035..b5f9e9e 100644
--- a/xxh_x86dispatch.c
+++ b/xxh_x86dispatch.c
@@ -459,7 +459,8 @@ XXHL64_default_##suffix(const void* XXH_RESTRICT input, size_t len)           \
 {                                                                             \
     return XXH3_hashLong_64b_internal(                                        \
                input, len, XXH3_kSecret, sizeof(XXH3_kSecret),                \
-               XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix            \
+               XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix,        \
+               XXH3_scrambleAcc_##suffix                                      \
     );                                                                        \
 }                                                                             \
                                                                               \
@@ -471,7 +472,8 @@ XXHL64_seed_##suffix(const void* XXH_RESTRICT input, size_t len,              \
 {                                                                             \
     return XXH3_hashLong_64b_withSeed_internal(                               \
                     input, len, seed, XXH3_accumulate_##suffix,               \
-                    XXH3_scrambleAcc_##suffix, XXH3_initCustomSecret_##suffix \
+                    XXH3_accumulate_512_##suffix, XXH3_scrambleAcc_##suffix,  \
+                    XXH3_initCustomSecret_##suffix                            \
     );                                                                        \
 }                                                                             \
                                                                               \
@@ -483,7 +485,8 @@ XXHL64_secret_##suffix(const void* XXH_RESTRICT input, size_t len,            \
 {                                                                             \
     return XXH3_hashLong_64b_internal(                                        \
                     input, len, secret, secretLen,                            \
-                    XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix       \
+                    XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix,   \
+                    XXH3_scrambleAcc_##suffix                                 \
     );                                                                        \
 }                                                                             \
                                                                               \
@@ -503,7 +506,8 @@ XXHL128_default_##suffix(const void* XXH_RESTRICT input, size_t len)          \
 {                                                                             \
     return XXH3_hashLong_128b_internal(                                       \
                     input, len, XXH3_kSecret, sizeof(XXH3_kSecret),           \
-                    XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix       \
+                    XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix,   \
+                    XXH3_scrambleAcc_##suffix                                 \
     );                                                                        \
 }                                                                             \
                                                                               \
@@ -515,7 +519,8 @@ XXHL128_secret_##suffix(const void* XXH_RESTRICT input, size_t len,           \
 {                                                                             \
     return XXH3_hashLong_128b_internal(                                       \
                     input, len, (const xxh_u8*)secret, secretLen,             \
-                    XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix);     \
+                    XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix,   \
+                    XXH3_scrambleAcc_##suffix);                               \
 }                                                                             \
                                                                               \
 /* ===   XXH128 Seeded variants   === */                                      \
@@ -525,7 +530,8 @@ XXHL128_seed_##suffix(const void* XXH_RESTRICT input, size_t len,             \
                       XXH64_hash_t seed)                                      \
 {                                                                             \
     return XXH3_hashLong_128b_withSeed_internal(input, len, seed,             \
-                    XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix,      \
+                    XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix,   \
+                    XXH3_scrambleAcc_##suffix,                                \
                     XXH3_initCustomSecret_##suffix);                          \
 }
 
diff --git a/xxhash.h b/xxhash.h
index 1767b7e..a622f52 100644
--- a/xxhash.h
+++ b/xxhash.h
@@ -5001,6 +5001,7 @@ XXH3_initCustomSecret_scalar(void* XXH_RESTRICT customSecret, xxh_u64 seed64)
 
 
 typedef void (*XXH3_f_accumulate)(xxh_u64* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, size_t);
+typedef void (*XXH3_f_accumulate_512)(void *XXH_RESTRICT acc, const void *XXH_RESTRICT input, const void *XXH_RESTRICT secret);
 typedef void (*XXH3_f_scrambleAcc)(void* XXH_RESTRICT, const void*);
 typedef void (*XXH3_f_initCustomSecret)(void* XXH_RESTRICT, xxh_u64);
 
@@ -5065,6 +5066,7 @@ XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc,
                       const xxh_u8* XXH_RESTRICT input, size_t len,
                       const xxh_u8* XXH_RESTRICT secret, size_t secretSize,
                             XXH3_f_accumulate f_acc,
+                            XXH3_f_accumulate_512 f_acc512,
                             XXH3_f_scrambleAcc f_scramble)
 {
     size_t const nbStripesPerBlock = (secretSize - XXH_STRIPE_LEN) / XXH_SECRET_CONSUME_RATE;
@@ -5089,7 +5091,7 @@ XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc,
         /* last stripe */
         {   const xxh_u8* const p = input + len - XXH_STRIPE_LEN;
 #define XXH_SECRET_LASTACC_START 7  /* not aligned on 8, last secret is different from acc & scrambler */
-            XXH3_accumulate_512(acc, p, secret + secretSize - XXH_STRIPE_LEN - XXH_SECRET_LASTACC_START);
+            f_acc512(acc, p, secret + secretSize - XXH_STRIPE_LEN - XXH_SECRET_LASTACC_START);
     }   }
 }
 
@@ -5135,11 +5137,12 @@ XXH_FORCE_INLINE XXH64_hash_t
 XXH3_hashLong_64b_internal(const void* XXH_RESTRICT input, size_t len,
                            const void* XXH_RESTRICT secret, size_t secretSize,
                            XXH3_f_accumulate f_acc,
+                           XXH3_f_accumulate_512 f_acc512,
                            XXH3_f_scrambleAcc f_scramble)
 {
     XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64 acc[XXH_ACC_NB] = XXH3_INIT_ACC;
 
-    XXH3_hashLong_internal_loop(acc, (const xxh_u8*)input, len, (const xxh_u8*)secret, secretSize, f_acc, f_scramble);
+    XXH3_hashLong_internal_loop(acc, (const xxh_u8*)input, len, (const xxh_u8*)secret, secretSize, f_acc, f_acc512, f_scramble);
 
     /* converge into final hash */
     XXH_STATIC_ASSERT(sizeof(acc) == 64);
@@ -5159,7 +5162,7 @@ XXH3_hashLong_64b_withSecret(const void* XXH_RESTRICT input, size_t len,
                              XXH64_hash_t seed64, const xxh_u8* XXH_RESTRICT secret, size_t secretLen)
 {
     (void)seed64;
-    return XXH3_hashLong_64b_internal(input, len, secret, secretLen, XXH3_accumulate, XXH3_scrambleAcc);
+    return XXH3_hashLong_64b_internal(input, len, secret, secretLen, XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc);
 }
 
 /*
@@ -5173,7 +5176,7 @@ XXH3_hashLong_64b_default(const void* XXH_RESTRICT input, size_t len,
                           XXH64_hash_t seed64, const xxh_u8* XXH_RESTRICT secret, size_t secretLen)
 {
     (void)seed64; (void)secret; (void)secretLen;
-    return XXH3_hashLong_64b_internal(input, len, XXH3_kSecret, sizeof(XXH3_kSecret), XXH3_accumulate, XXH3_scrambleAcc);
+    return XXH3_hashLong_64b_internal(input, len, XXH3_kSecret, sizeof(XXH3_kSecret), XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc);
 }
 
 /*
@@ -5191,6 +5194,7 @@ XXH_FORCE_INLINE XXH64_hash_t
 XXH3_hashLong_64b_withSeed_internal(const void* input, size_t len,
                                     XXH64_hash_t seed,
                                     XXH3_f_accumulate f_acc,
+                                    XXH3_f_accumulate_512 f_acc512,
                                     XXH3_f_scrambleAcc f_scramble,
                                     XXH3_f_initCustomSecret f_initSec)
 {
@@ -5198,12 +5202,12 @@ XXH3_hashLong_64b_withSeed_internal(const void* input, size_t len,
     if (seed == 0)
         return XXH3_hashLong_64b_internal(input, len,
                                           XXH3_kSecret, sizeof(XXH3_kSecret),
-                                          f_acc, f_scramble);
+                                          f_acc, f_acc512, f_scramble);
 #endif
     {   XXH_ALIGN(XXH_SEC_ALIGN) xxh_u8 secret[XXH_SECRET_DEFAULT_SIZE];
         f_initSec(secret, seed);
         return XXH3_hashLong_64b_internal(input, len, secret, sizeof(secret),
-                                          f_acc, f_scramble);
+                                          f_acc, f_acc512, f_scramble);
     }
 }
 
@@ -5216,7 +5220,7 @@ XXH3_hashLong_64b_withSeed(const void* XXH_RESTRICT input, size_t len,
 {
     (void)secret; (void)secretLen;
     return XXH3_hashLong_64b_withSeed_internal(input, len, seed,
-                XXH3_accumulate, XXH3_scrambleAcc, XXH3_initCustomSecret);
+                XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc, XXH3_initCustomSecret);
 }
 
 
@@ -5927,11 +5931,12 @@ XXH_FORCE_INLINE XXH128_hash_t
 XXH3_hashLong_128b_internal(const void* XXH_RESTRICT input, size_t len,
                             const xxh_u8* XXH_RESTRICT secret, size_t secretSize,
                             XXH3_f_accumulate f_acc,
+                            XXH3_f_accumulate_512 f_acc512,
                             XXH3_f_scrambleAcc f_scramble)
 {
     XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64 acc[XXH_ACC_NB] = XXH3_INIT_ACC;
 
-    XXH3_hashLong_internal_loop(acc, (const xxh_u8*)input, len, secret, secretSize, f_acc, f_scramble);
+    XXH3_hashLong_internal_loop(acc, (const xxh_u8*)input, len, secret, secretSize, f_acc, f_acc512, f_scramble);
 
     /* converge into final hash */
     XXH_STATIC_ASSERT(sizeof(acc) == 64);
@@ -5958,7 +5963,7 @@ XXH3_hashLong_128b_default(const void* XXH_RESTRICT input, size_t len,
 {
     (void)seed64; (void)secret; (void)secretLen;
     return XXH3_hashLong_128b_internal(input, len, XXH3_kSecret, sizeof(XXH3_kSecret),
-                                       XXH3_accumulate, XXH3_scrambleAcc);
+                                       XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc);
 }
 
 /*
@@ -5972,24 +5977,25 @@ XXH3_hashLong_128b_withSecret(const void* XXH_RESTRICT input, size_t len,
 {
     (void)seed64;
     return XXH3_hashLong_128b_internal(input, len, (const xxh_u8*)secret, secretLen,
-                                       XXH3_accumulate, XXH3_scrambleAcc);
+                                       XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc);
 }
 
 XXH_FORCE_INLINE XXH128_hash_t
 XXH3_hashLong_128b_withSeed_internal(const void* XXH_RESTRICT input, size_t len,
                                 XXH64_hash_t seed64,
                                 XXH3_f_accumulate f_acc,
+                                XXH3_f_accumulate_512 f_acc512,
                                 XXH3_f_scrambleAcc f_scramble,
                                 XXH3_f_initCustomSecret f_initSec)
 {
     if (seed64 == 0)
         return XXH3_hashLong_128b_internal(input, len,
                                            XXH3_kSecret, sizeof(XXH3_kSecret),
-                                           f_acc, f_scramble);
+                                           f_acc, f_acc512, f_scramble);
     {   XXH_ALIGN(XXH_SEC_ALIGN) xxh_u8 secret[XXH_SECRET_DEFAULT_SIZE];
         f_initSec(secret, seed64);
         return XXH3_hashLong_128b_internal(input, len, (const xxh_u8*)secret, sizeof(secret),
-                                           f_acc, f_scramble);
+                                           f_acc, f_acc512, f_scramble);
     }
 }
 
@@ -6002,7 +6008,7 @@ XXH3_hashLong_128b_withSeed(const void* input, size_t len,
 {
     (void)secret; (void)secretLen;
     return XXH3_hashLong_128b_withSeed_internal(input, len, seed64,
-                XXH3_accumulate, XXH3_scrambleAcc, XXH3_initCustomSecret);
+                XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc, XXH3_initCustomSecret);
 }
 
 typedef XXH128_hash_t (*XXH3_hashLong128_f)(const void* XXH_RESTRICT, size_t,

@t-mat
Copy link
Contributor

t-mat commented Nov 28, 2022

@tycho ,

Before discussing modification, could you please confirm basic problem firstly?
Because this issue claims issue about 91788f1 and dev branch.

My question is:

Does your issue happen with mainline, un-modified version of dev branch of xxHash?
I mean, can you observe any issue in the following configuration?

  • Using your clang-cl.exe.
  • With un-modified, plain, as-is version of dev branch of xxHash.
  • With you own code.

@tycho
Copy link
Author

tycho commented Nov 28, 2022

@tycho ,

Before discussing modification, could you please confirm basic problem firstly? Because this issue claims issue about 91788f1 and dev branch.

The basic problem is less severe but still present: not using the right (suboptimal) ISA for the final XXH3_accumulate_512. Look at the preprocessed version of xxh_x86dispatch.c and you'll note it calls an ISA-specific version of XXH3_accumulate_512, and the resulting code for each dispatch variant will use that same ISA for that accumulate.

My question is:

Does your issue happen with mainline, un-modified version of dev branch of xxHash? I mean, can you observe any issue in the following configuration?

* Using your `clang-cl.exe`.

* With un-modified, plain, as-is version of `dev` branch of xxHash.

* With you own code.

With clang-cl and no modifications it cannot compile with dispatch. That's why my modification is there, as I mentioned.

@tycho
Copy link
Author

tycho commented Nov 28, 2022

With clang-cl and no modifications it cannot compile with dispatch. That's why my modification is there, as I mentioned.

Illustrated:

>C:\llvm\main-release\bin\clang-cl.exe /o xxhtest.exe xxh_x86dispatch.c test.c
In file included from xxh_x86dispatch.c:208:
./xxhash.h(4190,14): error: expected expression
    __m512i* const xacc = (__m512i *) acc;
             ^
./xxhash.h(4190,5): error: use of undeclared identifier '__m512i'
    __m512i* const xacc = (__m512i *) acc;
    ^
./xxhash.h(4192,48): error: use of undeclared identifier '__m512i'
    XXH_STATIC_ASSERT(XXH_STRIPE_LEN == sizeof(__m512i));
                                               ^
./xxhash.h(4196,9): error: use of undeclared identifier '__m512i'
        __m512i const data_vec    = _mm512_loadu_si512   (input);
        ^
./xxhash.h(4198,9): error: use of undeclared identifier '__m512i'
        __m512i const key_vec     = _mm512_loadu_si512   (secret);
        ^
./xxhash.h(4200,9): error: use of undeclared identifier '__m512i'
        __m512i const data_key    = _mm512_xor_si512     (data_vec, key_vec);
        ^
./xxhash.h(4202,9): error: use of undeclared identifier '__m512i'
        __m512i const data_key_lo = _mm512_shuffle_epi32 (data_key, (_MM_PERM_ENUM)_MM_SHUFFLE(0, 3, 0, 1));
        ^
./xxhash.h(4204,9): error: use of undeclared identifier '__m512i'
        __m512i const product     = _mm512_mul_epu32     (data_key, data_key_lo);
        ^
./xxhash.h(4206,9): error: use of undeclared identifier '__m512i'
        __m512i const data_swap = _mm512_shuffle_epi32(data_vec, (_MM_PERM_ENUM)_MM_SHUFFLE(1, 0, 3, 2));
        ^
./xxhash.h(4207,9): error: use of undeclared identifier '__m512i'
        __m512i const sum       = _mm512_add_epi64(*xacc, data_swap);
        ^
./xxhash.h(4209,17): error: call to undeclared function '_mm512_add_epi64'; ISO C99 and later do not support implicit function declarations
      [-Wimplicit-function-declaration]
        *xacc = _mm512_add_epi64(product, sum);
                ^
./xxhash.h(4209,17): note: did you mean '_mm_add_epi64'?
C:\llvm\main-release\lib\clang\16\include\emmintrin.h(2096,46): note: '_mm_add_epi64' declared here
static __inline__ __m128i __DEFAULT_FN_ATTRS _mm_add_epi64(__m128i __a,
                                             ^
In file included from xxh_x86dispatch.c:208:
./xxhash.h(4209,10): error: use of undeclared identifier 'xacc'
        *xacc = _mm512_add_epi64(product, sum);
         ^
./xxhash.h(4209,34): error: use of undeclared identifier 'product'
        *xacc = _mm512_add_epi64(product, sum);
                                 ^
./xxhash.h(4209,43): error: use of undeclared identifier 'sum'
        *xacc = _mm512_add_epi64(product, sum);
                                          ^
./xxhash.h(4239,48): error: use of undeclared identifier '__m512i'
    XXH_STATIC_ASSERT(XXH_STRIPE_LEN == sizeof(__m512i));
                                               ^
./xxhash.h(4240,18): error: expected expression
    {   __m512i* const xacc = (__m512i*) acc;
                 ^
./xxhash.h(4240,9): error: use of undeclared identifier '__m512i'
    {   __m512i* const xacc = (__m512i*) acc;
        ^
./xxhash.h(4241,15): error: unknown type name '__m512i'
        const __m512i prime32 = _mm512_set1_epi32((int)XXH_PRIME32_1);
              ^
./xxhash.h(4241,33): error: call to undeclared function '_mm512_set1_epi32'; ISO C99 and later do not support implicit function declarations
      [-Wimplicit-function-declaration]
        const __m512i prime32 = _mm512_set1_epi32((int)XXH_PRIME32_1);
                                ^
./xxhash.h(4241,33): note: did you mean '_mm_set1_epi32'?
C:\llvm\main-release\lib\clang\16\include\emmintrin.h(3618,46): note: '_mm_set1_epi32' declared here
static __inline__ __m128i __DEFAULT_FN_ATTRS _mm_set1_epi32(int __i) {
                                             ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.

And with AVX512 dispatch off, it has similar errors with AVX2:

>C:\llvm\main-release\bin\clang-cl.exe /D XXH_DISPATCH_AVX512=0 /o xxhtest.exe xxh_x86dispatch.c test.c
In file included from xxh_x86dispatch.c:208:
./xxhash.h(4295,9): error: use of undeclared identifier '__m256i'
    {   __m256i* const xacc    =       (__m256i *) acc;
        ^
./xxhash.h(4295,18): error: expected expression
    {   __m256i* const xacc    =       (__m256i *) acc;
                 ^
./xxhash.h(4298,23): error: unknown type name '__m256i'
        const         __m256i* const xinput  = (const __m256i *) input;
                      ^
./xxhash.h(4298,55): error: unknown type name '__m256i'
        const         __m256i* const xinput  = (const __m256i *) input;
                                                      ^
./xxhash.h(4301,23): error: unknown type name '__m256i'
        const         __m256i* const xsecret = (const __m256i *) secret;
                      ^
./xxhash.h(4301,55): error: unknown type name '__m256i'
        const         __m256i* const xsecret = (const __m256i *) secret;
                                                      ^
./xxhash.h(4304,45): error: use of undeclared identifier '__m256i'
        for (i=0; i < XXH_STRIPE_LEN/sizeof(__m256i); i++) {
                                            ^
./xxhash.h(4306,13): error: use of undeclared identifier '__m256i'
            __m256i const data_vec    = _mm256_loadu_si256    (xinput+i);
            ^
./xxhash.h(4308,13): error: use of undeclared identifier '__m256i'
            __m256i const key_vec     = _mm256_loadu_si256   (xsecret+i);
            ^
./xxhash.h(4310,13): error: use of undeclared identifier '__m256i'
            __m256i const data_key    = _mm256_xor_si256     (data_vec, key_vec);
            ^
./xxhash.h(4312,13): error: use of undeclared identifier '__m256i'
            __m256i const data_key_lo = _mm256_shuffle_epi32 (data_key, _MM_SHUFFLE(0, 3, 0, 1));
            ^
./xxhash.h(4314,13): error: use of undeclared identifier '__m256i'
            __m256i const product     = _mm256_mul_epu32     (data_key, data_key_lo);
            ^
./xxhash.h(4316,13): error: use of undeclared identifier '__m256i'
            __m256i const data_swap = _mm256_shuffle_epi32(data_vec, _MM_SHUFFLE(1, 0, 3, 2));
            ^
./xxhash.h(4317,13): error: use of undeclared identifier '__m256i'
            __m256i const sum       = _mm256_add_epi64(xacc[i], data_swap);
            ^
./xxhash.h(4319,13): error: use of undeclared identifier 'xacc'; did you mean 'acc'?
            xacc[i] = _mm256_add_epi64(product, sum);
            ^~~~
            acc
./xxhash.h(4290,46): note: 'acc' declared here
XXH3_accumulate_512_avx2( void* XXH_RESTRICT acc,
                                             ^
./xxhash.h(4319,23): error: call to undeclared function '_mm256_add_epi64'; ISO C99 and later do not support implicit function declarations
      [-Wimplicit-function-declaration]
            xacc[i] = _mm256_add_epi64(product, sum);
                      ^
./xxhash.h(4319,23): note: did you mean '_mm_add_epi64'?
C:\llvm\main-release\lib\clang\16\include\emmintrin.h(2096,46): note: '_mm_add_epi64' declared here
static __inline__ __m128i __DEFAULT_FN_ATTRS _mm_add_epi64(__m128i __a,
                                             ^
In file included from xxh_x86dispatch.c:208:
./xxhash.h(4319,40): error: use of undeclared identifier 'product'
            xacc[i] = _mm256_add_epi64(product, sum);
                                       ^
./xxhash.h(4319,49): error: use of undeclared identifier 'sum'
            xacc[i] = _mm256_add_epi64(product, sum);
                                                ^
./xxhash.h(4328,9): error: use of undeclared identifier '__m256i'
    {   __m256i* const xacc = (__m256i*) acc;
        ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.

@hzhuang1
Copy link
Contributor

diff --git a/xxh_x86dispatch.c b/xxh_x86dispatch.c
index b509035..b5f9e9e 100644
--- a/xxh_x86dispatch.c
+++ b/xxh_x86dispatch.c
@@ -459,7 +459,8 @@ XXHL64_default_##suffix(const void* XXH_RESTRICT input, size_t len)           \
 {                                                                             \
     return XXH3_hashLong_64b_internal(                                        \
                input, len, XXH3_kSecret, sizeof(XXH3_kSecret),                \
-               XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix            \
+               XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix,        \
+               XXH3_scrambleAcc_##suffix                                      \
     );                                                                        \
 }                                                                             \
                                                                               \
@@ -471,7 +472,8 @@ XXHL64_seed_##suffix(const void* XXH_RESTRICT input, size_t len,              \
 {                                                                             \
     return XXH3_hashLong_64b_withSeed_internal(                               \
                     input, len, seed, XXH3_accumulate_##suffix,               \
-                    XXH3_scrambleAcc_##suffix, XXH3_initCustomSecret_##suffix \
+                    XXH3_accumulate_512_##suffix, XXH3_scrambleAcc_##suffix,  \
+                    XXH3_initCustomSecret_##suffix                            \
     );                                                                        \
 }                                                                             \
                                                                               \
@@ -483,7 +485,8 @@ XXHL64_secret_##suffix(const void* XXH_RESTRICT input, size_t len,            \
 {                                                                             \
     return XXH3_hashLong_64b_internal(                                        \
                     input, len, secret, secretLen,                            \
-                    XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix       \
+                    XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix,   \
+                    XXH3_scrambleAcc_##suffix                                 \
     );                                                                        \
 }                                                                             \
                                                                               \
@@ -503,7 +506,8 @@ XXHL128_default_##suffix(const void* XXH_RESTRICT input, size_t len)          \
 {                                                                             \
     return XXH3_hashLong_128b_internal(                                       \
                     input, len, XXH3_kSecret, sizeof(XXH3_kSecret),           \
-                    XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix       \
+                    XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix,   \
+                    XXH3_scrambleAcc_##suffix                                 \
     );                                                                        \
 }                                                                             \
                                                                               \
@@ -515,7 +519,8 @@ XXHL128_secret_##suffix(const void* XXH_RESTRICT input, size_t len,           \
 {                                                                             \
     return XXH3_hashLong_128b_internal(                                       \
                     input, len, (const xxh_u8*)secret, secretLen,             \
-                    XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix);     \
+                    XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix,   \
+                    XXH3_scrambleAcc_##suffix);                               \
 }                                                                             \
                                                                               \
 /* ===   XXH128 Seeded variants   === */                                      \
@@ -525,7 +530,8 @@ XXHL128_seed_##suffix(const void* XXH_RESTRICT input, size_t len,             \
                       XXH64_hash_t seed)                                      \
 {                                                                             \
     return XXH3_hashLong_128b_withSeed_internal(input, len, seed,             \
-                    XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix,      \
+                    XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix,   \
+                    XXH3_scrambleAcc_##suffix,                                \
                     XXH3_initCustomSecret_##suffix);                          \
 }
 
diff --git a/xxhash.h b/xxhash.h
index 1767b7e..a622f52 100644
--- a/xxhash.h
+++ b/xxhash.h
@@ -5001,6 +5001,7 @@ XXH3_initCustomSecret_scalar(void* XXH_RESTRICT customSecret, xxh_u64 seed64)
 
 
 typedef void (*XXH3_f_accumulate)(xxh_u64* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, size_t);
+typedef void (*XXH3_f_accumulate_512)(void *XXH_RESTRICT acc, const void *XXH_RESTRICT input, const void *XXH_RESTRICT secret);
 typedef void (*XXH3_f_scrambleAcc)(void* XXH_RESTRICT, const void*);
 typedef void (*XXH3_f_initCustomSecret)(void* XXH_RESTRICT, xxh_u64);
 
@@ -5065,6 +5066,7 @@ XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc,
                       const xxh_u8* XXH_RESTRICT input, size_t len,
                       const xxh_u8* XXH_RESTRICT secret, size_t secretSize,
                             XXH3_f_accumulate f_acc,
+                            XXH3_f_accumulate_512 f_acc512,
                             XXH3_f_scrambleAcc f_scramble)
 {
     size_t const nbStripesPerBlock = (secretSize - XXH_STRIPE_LEN) / XXH_SECRET_CONSUME_RATE;

I'm fine on your fix. But I really hope that you could help CI to identify this issue. Since you could reproduce this issue on SSE2, maybe it could simplify the procedure.

By the way, I'm working on dispatch on aarch64. I really appreciate that you find the issue at early stage.

@tycho
Copy link
Author

tycho commented Nov 29, 2022

I'm fine on your fix. But I really hope that you could help CI to identify this issue. Since you could reproduce this issue on SSE2, maybe it could simplify the procedure.

I understand what you're hoping for, but I don't see a good way for CI to identify this problem with the code as-is. You could probably do something like this, which would be an architecture-independent way to detect misuse of XXH_VECTOR-dependent XXH3_* functions (basically just abort when any of them are called):

diff --git a/xxh_x86dispatch.c b/xxh_x86dispatch.c
index b509035..d09c51f 100644
--- a/xxh_x86dispatch.c
+++ b/xxh_x86dispatch.c
@@ -203,6 +203,7 @@ extern "C" {
 #endif
 #include <assert.h>
 
+#define XXH_VECTOR XXH_ABORT
 #define XXH_INLINE_ALL
 #define XXH_X86DISPATCH
 #include "xxhash.h"
diff --git a/xxhash.h b/xxhash.h
index 1767b7e..62771dd 100644
--- a/xxhash.h
+++ b/xxhash.h
@@ -3134,6 +3134,7 @@ enum XXH_VECTOR_TYPE /* fake enum */ {
     XXH_NEON   = 4,  /*!< NEON for most ARMv7-A and all AArch64 */
     XXH_VSX    = 5,  /*!< VSX and ZVector for POWER8/z13 (64-bit) */
     XXH_SVE    = 6,  /*!< SVE for some ARMv8-A and ARMv9-A */
+    XXH_ABORT  = 7,
 };
 /*!
  * @ingroup tuning
@@ -3156,6 +3157,7 @@ enum XXH_VECTOR_TYPE /* fake enum */ {
 #  define XXH_NEON   4
 #  define XXH_VSX    5
 #  define XXH_SVE    6
+#  define XXH_ABORT  7
 #endif
 
 #ifndef XXH_VECTOR    /* can be defined on command line */
@@ -3200,7 +3202,7 @@ enum XXH_VECTOR_TYPE /* fake enum */ {
  * for compatibility with aligned vector loads, which are usually faster.
  */
 #ifndef XXH_ACC_ALIGN
-#  if defined(XXH_X86DISPATCH)
+#  if defined(XXH_X86DISPATCH) || XXH_VECTOR == XXH_ABORT
 #     define XXH_ACC_ALIGN 64  /* for compatibility with avx512 */
 #  elif XXH_VECTOR == XXH_SCALAR  /* scalar */
 #     define XXH_ACC_ALIGN 8
@@ -5000,6 +5002,36 @@ XXH3_initCustomSecret_scalar(void* XXH_RESTRICT customSecret, xxh_u64 seed64)
 }
 
 
+XXH_FORCE_INLINE void
+XXH3_accumulate_512_abort(void* XXH_RESTRICT acc,
+                          const void* XXH_RESTRICT input,
+                          const void* XXH_RESTRICT secret)
+{
+	abort();
+}
+
+XXH_FORCE_INLINE void
+XXH3_accumulate_abort(xxh_u64* XXH_RESTRICT acc,
+                      const xxh_u8* XXH_RESTRICT input,
+                      const xxh_u8* XXH_RESTRICT secret,
+                      size_t nbStripes)
+{
+	abort();
+}
+
+XXH_FORCE_INLINE void
+XXH3_scrambleAcc_abort(void* XXH_RESTRICT acc, const void* XXH_RESTRICT secret)
+{
+	abort();
+}
+
+XXH_FORCE_INLINE void
+XXH3_initCustomSecret_abort(void* XXH_RESTRICT customSecret, xxh_u64 seed64)
+{
+	abort();
+}
+
+
 typedef void (*XXH3_f_accumulate)(xxh_u64* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, size_t);
 typedef void (*XXH3_f_scrambleAcc)(void* XXH_RESTRICT, const void*);
 typedef void (*XXH3_f_initCustomSecret)(void* XXH_RESTRICT, xxh_u64);
@@ -5046,6 +5078,13 @@ typedef void (*XXH3_f_initCustomSecret)(void* XXH_RESTRICT, xxh_u64);
 #define XXH3_scrambleAcc    XXH3_scrambleAcc_scalar
 #define XXH3_initCustomSecret XXH3_initCustomSecret_scalar
 
+#elif (XXH_VECTOR == XXH_ABORT)
+
+#define XXH3_accumulate_512 XXH3_accumulate_512_abort
+#define XXH3_accumulate     XXH3_accumulate_abort
+#define XXH3_scrambleAcc    XXH3_scrambleAcc_abort
+#define XXH3_initCustomSecret XXH3_initCustomSecret_abort
+
 #else /* scalar */
 
 #define XXH3_accumulate_512 XXH3_accumulate_512_scalar

The above does cause the test case to abort when it hits that XXH3_accumulate_512.

@hzhuang1
Copy link
Contributor

@t-mat Let me summarize this issue.
Dispatch is built on AVX, and run on non-AVX machine. Error occurs.
@tycho submitted the first patch on XXH3_hashLong_internal_loop(). I mentioned it would cause some issues on s390x. And it does (https://github.com/hzhuang1/xxHash/actions/runs/3579213227).
After checked code again, there's no reason to fail. Although code is not efficient since it needn't prefetch instructions at here. At least, the same code works on other architectures.

I tried to build test s390x test environments.

  1. Run in multi-arch docker for s390x on MacOS. It failed with segment fault issue in Qemu.
  2. Run in multi-arch docker for s390x on Ubuntu. It succeed.
    The docker version is 20.10.12. I think it may be an issue in Qemu, not in code.

I'll try to setup s390x Qemu environment with version 6.x or 7.x to verify it.

@hzhuang1
Copy link
Contributor

I tried to use ubuntu s390x cloud image 22.04 on Qemu 6.2. It succeed.

So it's clear that it's the issue on Qemu. Could we check the Qemu version for CI? I doubt it may be a little older.

@t-mat
Copy link
Contributor

t-mat commented Nov 30, 2022

@hzhuang1 thanks for explanation and investigation.

The following section of ci.yml shows version of gcc cross compiler and qemu.
https://github.com/hzhuang1/xxHash/blob/bb16c785c650b89b6ea5f0f4d043e6f149d98861/.github/workflows/ci.yml#L373

Please check "Environment info(2)" in your QEMU test log.

https://github.com/hzhuang1/xxHash/actions/runs/3579213227/jobs/6020181259#step:6:31

/usr/bin/qemu-s390x-static

qemu-s390x version 4.2.1 (Debian 1:4.2-3ubuntu6.23)
Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers

@tycho
Copy link
Author

tycho commented Nov 30, 2022

I'm confused. What is the goal of all this testing of s390x using the wild-guess patch I threw out in the initial post? Don't we already know it was a Bad Idea(tm) from the start? What does this additional testing tell us?

@hzhuang1
Copy link
Contributor

qemu-s390x version 4.2.1 (Debian 1:4.2-3ubuntu6.23)
Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers

4.2.1 is too old. Let's upgrade it to 6.2 at least. By the way, Qemu v6.2 is used on ubuntu 22.04.

@hzhuang1
Copy link
Contributor

I'm confused. What is the goal of all this testing of s390x using the wild-guess patch I threw out in the initial post? Don't we already know it was a Bad Idea(tm) from the start? What does this additional testing tell us?

When the issue on Qemu is resolved, the two patches could both fix the issue. Let maintainer decide which one is better. Even maintainer select the second patch, we still need to fix the Qemu issue. It may be triggered again at any time.

@t-mat
Copy link
Contributor

t-mat commented Nov 30, 2022

@hzhuang1 as for S390X VSX and ARM64 SVE, I found my embarrassed mistake. I started #766.

4.2.1 is too old. Let's upgrade it to 6.2 at least. By the way, Qemu v6.2 is used on ubuntu 22.04.

As of today, ubuntu-latest is ubuntu-20.04. So replacing ubuntu-latest with ubunutu-22.04 may fix it.

-         { name: 'S390X',           xcc_pkg: gcc-s390x-linux-gnu,          xcc: s390x-linux-gnu-gcc,          xemu_pkg: qemu-system-s390x, xemu: qemu-s390x-static,   os: ubuntu-latest, },
+         { name: 'S390X',           xcc_pkg: gcc-s390x-linux-gnu,          xcc: s390x-linux-gnu-gcc,          xemu_pkg: qemu-system-s390x, xemu: qemu-s390x-static,   os: ubuntu-22.04, },

t-mat added a commit to t-mat/xxHash that referenced this issue Nov 30, 2022
To use recent version of QEMU, this patch replaces ubuntu VM images for QEMU test matrix.

See
Cyan4973#763
@Cyan4973
Copy link
Owner

Cyan4973 commented Dec 6, 2022

My understanding of this (relatively complex) situation is that
it all starts with a need to compile xxhash + xxh_x86_dispatch
on platform Windows + Visual studio + clang-cl .

This is arguably an uncommon platform for us,
but it's nonetheless part of our CI,
as can seen here : https://github.com/Cyan4973/xxHash/blob/dev/.github/workflows/ci.yml#L456

The problem is, this test only compiles the "regular" version of xxhsum,
not the DISPATCH=1 one. So it's successful.

But I presume a DISPATCH=1 test could be added in this category,
and if my understanding is correct,
the new test would then fail, as compilation would fail.

And I believe that's a good place to start this topic.

If we can fix x86dispatch for Visual + clang-cl,
then there is no need for users to manually modify the source file themselves,
and then we probably have less surprising side effects to understand.

Worst case, we may conclude that 91788f1 is problematic and must be reverted,
and now we have a good test to check its successor.

@easyaspi314
Copy link
Contributor

As for the clang-cl issue, this could be done:

#ifdef __clang__
#  pragma push_macro("__AVX__")
#  pragma push_macro("__AVX2__")
#  pragma push_macro("__AVX512F__")
#  define __AVX__ 1
#  define __AVX2__ 1
#  define __AVX512F__ 1
#  include <immintrin.h>
#  pragma pop_macro("__AVX512F__")
#  pragma pop_macro("__AVX2__")
#  pragma pop_macro("__AVX__")
#endif

This trick also works with arm_neon.h and arm_sve.h btw, Clang fully supports and exposes the intrinsics in targeted functions but errors if the feature macros are not defined.

easyaspi314 added a commit to easyaspi314/xxHash that referenced this issue Mar 16, 2023
While it would be nice to do these seperately, the dispatching needed to be
rewritten anyways to support this change.

This introduces a new hashLong model which
 1. Greatly reduces code size by reducing the inlined hashLong copies
   - Only two copies of hashLong are emitted, one for known secret size and one for
     unknown
     - `XXH_SIZE_OPT` only emits one
   - Significant code size improvement on 32-bit scalar since especially with no unaligned
     access, it gets bloated.
   - On GCC 12 x64, this allows dispatching to be included with a net size *decrease*
     compared to before (34 kB vs 35 kB, far less than old dispatch which was 48 kB)
 2. Natively dispatches using a function table
   - Adding support to other targets is trivial
   - Modern compilers inline the table when dispatching is disabled
   - xxh_x86dispatch.c is no longer required (and now `#pragma message`s)
   - Can be done fully inline
 3. Has a very minimal overhead
 4. Has no mutable global variables (merely one function-static pointer)

The dispatching logic has also been improved:
 - For purposes of stupid backwards compatibility, a FXSAVE check is included to catch
   ancient OSes that don't support SSE.
 - AVX2 is always dispatched. Aside from old MSVC versions, the compilers that
   don't support AVX2 don't support dispatching in this method.
   - This also allows old compiler hacks to be removed.
 - Dispatching AVX512 is now supported on macOS 12.2+
   - macOS doesn't show AVX512 support in XGETBV until it traps an instruction and
     "promotes" the thread.
   - Versions < 12.2 didn't properly save the mask registers in signal handlers and are
     not safe to use.
 - Excess logging and commenting is removed, and macros are cleaned up

And some bugs have been fixed:
 - GCC complaining about `-Wmaybe-uninitialized` on its own intrinsics
 - Clang 14 breaking on `-masm=intel` again
 - RBX not being preserved on x86_64 (it is reserved on the medium and large code model)
 - clang-cl did not include the correct intrinsics (fixes Cyan4973#763)
 - Clang < 3.7 did not support using intrinsic headers without the correct macros
 - GCC 4.9 dispatched AVX512 but not AVX2
 - As mentioned before, AVX512 was not dispatched on macOS

xxhsum, makefile, and tests haven't been updated yet, for now testing can be done with
`make CPPFLAGS="-DXXH_DISPATCH"`
@Cyan4973
Copy link
Owner

I believe the specific issue mentioned in this thread has been fixed,
but it was also supposed to be completed with a CI test,
and this one I don't see.

Specifically, I see we have a compilation and runtime test for Windows + Visual + clang-cl, but, unless I'm missing something, this test doesn't seem to employ DISPATCH=1, which was the core issue.

@t-mat
Copy link
Contributor

t-mat commented Jul 15, 2023

edit : Update actual patch and cmake options

The following patch for cmake_unofficial/CMakeList.txt enables dispatch mode for clang-cl.

# Only support DISPATCH option on x86_64.
- if("${PLATFORM}" STREQUAL "x86_64")
+ if(("${PLATFORM}" STREQUAL "x86_64") OR ("${PLATFORM}" STREQUAL "AMD64"))
+    set(XXHSUM_DISPATCH ON)

...

if(XXHASH_BUILD_XXHSUM)
  set(XXHSUM_DIR "${XXHASH_DIR}/cli")
  # xxhsum
- add_executable(xxhsum "${XXHSUM_DIR}/xxhsum.c"
+ set(XXHSUM_SOURCES)
+ if (XXHSUM_DISPATCH)
+   list(APPEND XXHSUM_SOURCES "${XXHASH_DIR}/xxh_x86dispatch.c")
+ endif()
+ list(APPEND XXHSUM_SOURCES "${XXHSUM_DIR}/xxhsum.c"
                             "${XXHSUM_DIR}/xsum_os_specific.c"
                             "${XXHSUM_DIR}/xsum_output.c"
                             "${XXHSUM_DIR}/xsum_sanity_check.c"
                             "${XXHSUM_DIR}/xsum_bench.c"
      )
+ add_executable(xxhsum ${XXHSUM_SOURCES})
  add_executable(${PROJECT_NAME}::xxhsum ALIAS xxhsum)

note: as for xxh_x86dispatch.c, see also

xxHash/Makefile

Lines 104 to 107 in 4fd75d7

ifeq ($(DISPATCH),1)
xxhsum: CPPFLAGS += -DXXHSUM_DISPATCH=1
xxhsum: xxh_x86dispatch.o
endif

With the following build procedure

git clone https://github.com/Cyan4973/xxHash.git
cd xxHash
git branch -v

# * dev 4fd75d7 Merge pull request #860 from Cyan4973/old_names_warning

cd cmake_unofficial

# Patch the CMakeList.txt

mkdir build-clang-cl
cd build-clang-cl
cmake .. -DXXHASH_C_FLAGS="/arch:AVX512 -DXXH_X86DISPATCH_ALLOW_AVX=1" -DCMAKE_BUILD_TYPE=Release -DDISPATCH=ON -A x64 -DCMAKE_GENERATOR_TOOLSET=ClangCL
#
# > ...
# > -- Architecture: AMD64
# > -- Enable xxHash dispatch mode
#
cmake --build . --config Release

@t-mat
Copy link
Contributor

t-mat commented Jul 15, 2023

The actual binary crashes 😿

.\Release\xxhsum.exe -bi1

DEBUG: xxHash dispatch: SSE2 support detected.
DEBUG: xxHash dispatch: AVX2 support detected.
DEBUG: xxHash dispatch: SSE2 support detected.
DEBUG: xxHash dispatch: AVX2 support detected.
xxhsum.exe 0.8.2 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with Clang 15.0.1
(-- crash --)

↑ My environment doesn't support AVX512. But for some reason, it selects AVX512 and crashes. Investigating.

image
  • update(1) : Since I explicitly set /arch:AVX512, XXH_VECTOR becomes XXH_AVX512 and it crashes properly.
  • update(2) : /arch:AVX512 with forcing code path to SSE2 by -DXXH_VECTOR=XXH_SSE2 doesn't work as already warned.
    Even _mm_mul_epu32() contains vpmullq (AVX512 instruction) in this setting.
Disassembly of _mm_mul_epu32
            __m128i const prod_hi     = _mm_mul_epu32     (data_key_hi, prime32);
00007FFAB5408493 C5 F9 6F 8C 24 90 00 00 00 vmovdqa     xmm1,xmmword ptr [rsp+90h]  
00007FFAB540849C C5 F9 6F 44 24 20    vmovdqa     xmm0,xmmword ptr [rsp+20h]  
00007FFAB54084A2 C5 F9 7F 8C 24 20 01 00 00 vmovdqa     xmmword ptr [rsp+120h],xmm1  
00007FFAB54084AB C5 F9 7F 84 24 10 01 00 00 vmovdqa     xmmword ptr [rsp+110h],xmm0  
00007FFAB54084B4 C5 F9 6F 84 24 10 01 00 00 vmovdqa     xmm0,xmmword ptr [rsp+110h]  
00007FFAB54084BD C5 F9 6F 8C 24 20 01 00 00 vmovdqa     xmm1,xmmword ptr [rsp+120h]  
00007FFAB54084C6 C5 E9 EF D2          vpxor       xmm2,xmm2,xmm2  
00007FFAB54084CA C4 E3 79 02 C2 0A    vpblendd    xmm0,xmm0,xmm2,0Ah  
00007FFAB54084D0 C4 E3 71 02 CA 0A    vpblendd    xmm1,xmm1,xmm2,0Ah  
00007FFAB54084D6 62 F2 FD 08 40 C1    vpmullq     xmm0,xmm0,xmm1  
00007FFAB54084DC C5 F9 7F 04 24       vmovdqa     xmmword ptr [rsp],xmm0

@Cyan4973
Copy link
Owner

Why setting /arch:AVX512 at build time ? What's the intention ?

@t-mat
Copy link
Contributor

t-mat commented Jul 15, 2023

Without explicit /arch setting, clang-cl reports the following error. Since default is /arch:SSE2, same error occurs for AVX and AVX2.

/arch:AVX2 :

D:\xxHash\cmake_unofficial\../xxhash.h(4635,37):
  error : always_inline function '_mm512_srli_epi64' requires target feature 'avx512f',
  but would be inlined into function 'XXH3_scrambleAcc_avx512' that is compiled
  without support for 'avx512f' [D:\xxHash\cmake_unofficial\build-clang-cl\xxhash.vcxproj]

/arch:AVX :

D:\xxHash\cmake_unofficial\../xxhash.h(4721,41):
  error : always_inline function '_mm256_srli_epi64' requires target feature 'avx2',
  but would be inlined into function 'XXH3_scrambleAcc_avx2' that is compiled
  without support for 'avx2' [D:\xxHash\cmake_unofficial\build-clang-cl\xxhash.vcxproj]

/arch:SSE2 :

D:\xxHash\cmake_unofficial\../xxhash.h(4715,33):
  error : always_inline function '_mm256_set1_epi32' requires target feature 'avx',
  but would be inlined into function 'XXH3_scrambleAcc_avx2' that is compiled
  without support for 'avx' [D:\xxHash\cmake_unofficial\build-clang-cl\xxhash.vcxproj]

@Cyan4973
Copy link
Owner

Cyan4973 commented Jul 15, 2023

xxh_x86dispatch is designed around the concept of selectively applying __attribute__((__target__("TARGET"))) capability. Which means, the capability must exist.

Additionally, compilation of a specific vector extension may (and generally does) require some #include to declare the corresponding intrinsics. Otherwise, we have missing symbols at compilation or link time.

These #include are properly defined in xxh_x86dispatch.c, but looking at the code, they seem to rely on detection of __GNUC__, and I'm not sure if visual studio + clang_cl defines __GNUC__. I suspect it does not.

To be more complete, when __GNUC__ is not detected, the next test tries to detect _MSC_VER, and I would expect this one to be defined when employing Visual Studio, even with the clang_cl backend. (This is just speculation, alas I have almost zero experience with such a setup). But I don't understand what happens under visual. The code features several lines which are supposed to take care of this situation, but it looks foreign to me. I can't tell if xxh_x86dispatch.c can actually dispatch when compiled under visual. Once again, lack of direct experience.

@t-mat
Copy link
Contributor

t-mat commented Jul 15, 2023

I think I've managed to build and run it properly (hopefully).

How can we detect clang-cl?

  • clang-cl doesn't define __GNUC__
  • clang-cl defines both __clang__ and _MSC_VER

How can we compile AVX512 intrinsics with clang-cl properly?

It seems clang-cl needs to include some extra headers to compile AVX512.
We can compile the following code with gcc, genuine clang and clang-cl without /arch etc.

test.c
// test.c
// building: clang test.c
// building: gcc test.c
// building: clang-cl test.c

#include <emmintrin.h> // SSE2
#include <immintrin.h> // AVX2, AVX512F

#if !defined(XXH_COMPILER_CLANGCL) && defined(__clang__) && defined(_MSC_VER)
#  define XXH_COMPILER_CLANGCL 1  // clang-cl.exe
#endif

#if XXH_COMPILER_CLANGCL
#  include <smmintrin.h>     // _MM_FROUND_*
#  include <avxintrin.h>
#  include <avx2intrin.h>
#  include <avx512fintrin.h>
#endif


#define XXH_TARGET_AVX512 __attribute__((__target__("avx512f")))

XXH_TARGET_AVX512 void test(void* p0, const void* p1, const void* p2) {
    __m512i  v0 = _mm512_loadu_si512(p1);
    __m512i  v1 = _mm512_loadu_si512(p2);
    __m512i* d  = (__m512i *) p0;
    *d = _mm512_add_epi64(v0, v1);
}

int main(void) {
    static char p0[1024] = { 0 };
    static char p1[1024] = { 0 };
    static char p2[1024] = { 0 };

    test(p0, p1, p2);

    return 0;
}

Patch for clang-cl

xxh_x86dispatch.h
  #  define XXH_TARGET_AVX2 __attribute__((__target__("avx2")))
  #  define XXH_TARGET_AVX512 __attribute__((__target__("avx512f")))
+ #elif defined(__clang__) && defined(_MSC_VER) /* clang-cl.exe */
+ #  include <emmintrin.h> /* SSE2 */
+ #  if XXH_DISPATCH_AVX2 || XXH_DISPATCH_AVX512
+ #    include <immintrin.h> /* AVX2, AVX512F */
+ #    include <smmintrin.h>
+ #    include <avxintrin.h>
+ #    include <avx2intrin.h>
+ #    include <avx512fintrin.h>
+ #  endif
+ #  define XXH_TARGET_SSE2 __attribute__((__target__("sse2")))
+ #  define XXH_TARGET_AVX2 __attribute__((__target__("avx2")))
+ #  define XXH_TARGET_AVX512 __attribute__((__target__("avx512f")))
  #elif defined(_MSC_VER)
  #  include <intrin.h>
CMakeLists.txt
  # Only support DISPATCH option on x86_64.
- if("${PLATFORM}" STREQUAL "x86_64")
+ if(("${PLATFORM}" STREQUAL "x86_64") OR ("${PLATFORM}" STREQUAL "AMD64"))
+  set(XXHSUM_DISPATCH ON)
    message(STATUS "Enable xxHash dispatch mode")
    add_library(xxhash "${XXHASH_DIR}/xxh_x86dispatch.c"
                       "${XXHASH_DIR}/xxhash.c"
if(XXHASH_BUILD_XXHSUM)
  set(XXHSUM_DIR "${XXHASH_DIR}/cli")
  # xxhsum
- add_executable(xxhsum "${XXHSUM_DIR}/xxhsum.c"
+ set(XXHSUM_SOURCES)
+ if (XXHSUM_DISPATCH)
+   list(APPEND XXHSUM_SOURCES "${XXHASH_DIR}/xxh_x86dispatch.c")
+ endif()
+ list(APPEND XXHSUM_SOURCES "${XXHSUM_DIR}/xxhsum.c"
                             "${XXHSUM_DIR}/xsum_os_specific.c"
                             "${XXHSUM_DIR}/xsum_output.c"
                             "${XXHSUM_DIR}/xsum_sanity_check.c"
                             "${XXHSUM_DIR}/xsum_bench.c"
      )
+ add_executable(xxhsum ${XXHSUM_SOURCES})
  add_executable(${PROJECT_NAME}::xxhsum ALIAS xxhsum)

Build & test

cd cmake_unofficial
mkdir build-clang-cl
cd build-clang-cl

cmake .. -DCMAKE_BUILD_TYPE=Release -DDISPATCH=ON -A x64 -DCMAKE_GENERATOR_TOOLSET=ClangCL && cmake --build . --config Release
cmake --build . --config Release
.\Release\xxhsum.exe -bi1

Benchmark

Dispatch

cmake .. -DCMAKE_BUILD_TYPE=Release -DDISPATCH=ON -A x64 -DCMAKE_GENERATOR_TOOLSET=ClangCL
cmake --build . --config Release
.\Release\xxhsum.exe -bi1
#
# xxhsum.exe 0.8.2 by Yann Collet
# compiled as 64-bit x86_64 autoVec little endian with Clang 15.0.1
# Sample of 100 KB...
#  1#XXH32                         :     102400 ->    78567 it/s ( 7672.5 MB/s)
#  3#XXH64                         :     102400 ->   157360 it/s (15367.2 MB/s)
#  5#XXH3_64b                      :     102400 ->   443833 it/s (43343.0 MB/s)
# 11#XXH128                        :     102400 ->   439046 it/s (42875.6 MB/s)

Dispatch, XXH_DISPATCH_AVX2=0

cmake .. -DXXHASH_C_FLAGS="-DXXH_DISPATCH_AVX2=0" -DCMAKE_BUILD_TYPE=Release -DDISPATCH=ON -A x64 -DCMAKE_GENERATOR_TOOLSET=ClangCL
cmake --build . --config Release
.\Release\xxhsum.exe -bi1
#
# xxhsum.exe 0.8.2 by Yann Collet
# compiled as 64-bit x86_64 autoVec little endian with Clang 15.0.1
# Sample of 100 KB...
#  1#XXH32                         :     102400 ->    77854 it/s ( 7602.9 MB/s)
#  3#XXH64                         :     102400 ->   157988 it/s (15428.5 MB/s)
#  5#XXH3_64b                      :     102400 ->   253696 it/s (24775.0 MB/s)
# 11#XXH128                        :     102400 ->   253085 it/s (24715.3 MB/s)

It seems working (except AVX512, since I don't have it). I'm writing PR. Done ( #865 ).

@Cyan4973
Copy link
Owner

This looks excellent @t-mat !

Cyan4973 added a commit that referenced this issue Jul 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants