dispatch regression with recent commit #763
Hi @tycho, thanks for the report. If you can, could you provide some details of your project? In particular, your compiler and compile-time options for xxHash would greatly help us.
So far, I think we can introduce
I'm also not sure we have a proper test for this issue. Since
f_acc means the full accumulation loop, and XXH3_accumulate_512 means the internal 512-byte kernel. Here, it should be the 512-byte kernel, since it's the last stripe. This was discussed in pull request #692. I need to understand why f_acc() doesn't trigger the same issue on your machine.
I compile with LLVM clang-cl using Visual Studio 2019, with these flags (trimmed out include paths and other unnecessary flags):
Which part, the whole file? Because that just results in compile errors (looks like missing intrinsic header includes).
Ah.
Because
I've tried to reproduce this issue with the following code, but failed. @tycho, could you provide some advice for reproducing your issue? Since we plan to add a new test for this issue, a minimal repro would help us a lot.

// test-issue-763.c
//
// ## test commands
//
// git branch
// # dev
//
// git log -1
// # 30d6a3e, Thu Nov 24 01:17:08 2022 -0800, Merge pull request #756 from hzhuang1/sve_02
//
// gcc -DXXH_DISPATCH_AVX512=0 test-issue-763.c xxh_x86dispatch.c
//
// ./a.out
// # eb4b7c3707879151
//
// RUN_ENV="qemu-x86_64-static -cpu Haswell" LDFLAGS="-static" ./a.out
// # eb4b7c3707879151
#include <stdio.h>
#define XXH_STATIC_LINKING_ONLY
#define XXH_IMPLEMENTATION
#define XXH_INLINE_ALL
#include "xxhash.h"
#include "xxh_x86dispatch.h"
int main(int argc, char* argv[]) {
    static char src[4096];
    for (size_t i = 0; i < sizeof(src); ++i) {
        src[i] = (char) i;
    }

    XXH64_hash_t const hash_xxh64 = XXH3_64bits(src, sizeof(src));
    {
        XXH64_canonical_t canonical_xxh64;
        XXH64_canonicalFromHash(&canonical_xxh64, hash_xxh64);
        for (size_t i = 0; i < sizeof(canonical_xxh64.digest); ++i) {
            printf("%02x", canonical_xxh64.digest[i]);
        }
        printf("\n");
    }
    return 0;
}
Oh! I see why you're having trouble reproducing it. I had forgotten that I have some patches on top of upstream xxHash, which make it possible to use dispatching on

diff --git a/xxh_x86dispatch.c b/xxh_x86dispatch.c
index b509035..094cc99 100644
--- a/xxh_x86dispatch.c
+++ b/xxh_x86dispatch.c
@@ -91,6 +91,18 @@ extern "C" {
# define XXH_HAS_INCLUDE(header) 0
#endif
+/* clang-cl's intrinsic headers depend on the predefined macros coming from
+ * -march or -mfeature flags, which are not defined if we are targeting something
+ * without those features. Since we are deliberately doing dispatch for intrinsics
+ * we may not be able to execute on our primary target, we must define these here
+ * in order for the right intrinsics to get defined.
+ */
+#if defined(_MSC_VER) && defined(__clang__)
+# define __AVX__
+# define __AVX2__
+# define __AVX512F__
+#endif
+
/*!
* @def XXH_DISPATCH_SCALAR
* @brief Enables/dispatching the scalar code path.
@@ -175,7 +187,7 @@ extern "C" {
* @def XXH_TARGET_AVX512
* @brief Like @ref XXH_TARGET_SSE2, but for AVX512.
*/
-#if defined(__GNUC__)
+#if defined(__GNUC__) || defined(__clang__)
# include <emmintrin.h> /* SSE2 */
# if XXH_DISPATCH_AVX2 || XXH_DISPATCH_AVX512
# include <immintrin.h> /* AVX2, AVX512F */

The above is added to serve a few purposes:
Admittedly, defining
The problem with defining those feature macros is that it made the automatic selection for
After adding my above patch to upstream xxHash, I can repro the issue (where
I think the following is probably the right approach, now that I understand the distinction between

diff --git a/xxh_x86dispatch.c b/xxh_x86dispatch.c
index b509035..b5f9e9e 100644
--- a/xxh_x86dispatch.c
+++ b/xxh_x86dispatch.c
@@ -459,7 +459,8 @@ XXHL64_default_##suffix(const void* XXH_RESTRICT input, size_t len) \
{ \
return XXH3_hashLong_64b_internal( \
input, len, XXH3_kSecret, sizeof(XXH3_kSecret), \
- XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix \
+ XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix, \
+ XXH3_scrambleAcc_##suffix \
); \
} \
\
@@ -471,7 +472,8 @@ XXHL64_seed_##suffix(const void* XXH_RESTRICT input, size_t len, \
{ \
return XXH3_hashLong_64b_withSeed_internal( \
input, len, seed, XXH3_accumulate_##suffix, \
- XXH3_scrambleAcc_##suffix, XXH3_initCustomSecret_##suffix \
+ XXH3_accumulate_512_##suffix, XXH3_scrambleAcc_##suffix, \
+ XXH3_initCustomSecret_##suffix \
); \
} \
\
@@ -483,7 +485,8 @@ XXHL64_secret_##suffix(const void* XXH_RESTRICT input, size_t len, \
{ \
return XXH3_hashLong_64b_internal( \
input, len, secret, secretLen, \
- XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix \
+ XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix, \
+ XXH3_scrambleAcc_##suffix \
); \
} \
\
@@ -503,7 +506,8 @@ XXHL128_default_##suffix(const void* XXH_RESTRICT input, size_t len) \
{ \
return XXH3_hashLong_128b_internal( \
input, len, XXH3_kSecret, sizeof(XXH3_kSecret), \
- XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix \
+ XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix, \
+ XXH3_scrambleAcc_##suffix \
); \
} \
\
@@ -515,7 +519,8 @@ XXHL128_secret_##suffix(const void* XXH_RESTRICT input, size_t len, \
{ \
return XXH3_hashLong_128b_internal( \
input, len, (const xxh_u8*)secret, secretLen, \
- XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix); \
+ XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix, \
+ XXH3_scrambleAcc_##suffix); \
} \
\
/* === XXH128 Seeded variants === */ \
@@ -525,7 +530,8 @@ XXHL128_seed_##suffix(const void* XXH_RESTRICT input, size_t len, \
XXH64_hash_t seed) \
{ \
return XXH3_hashLong_128b_withSeed_internal(input, len, seed, \
- XXH3_accumulate_##suffix, XXH3_scrambleAcc_##suffix, \
+ XXH3_accumulate_##suffix, XXH3_accumulate_512_##suffix, \
+ XXH3_scrambleAcc_##suffix, \
XXH3_initCustomSecret_##suffix); \
}
diff --git a/xxhash.h b/xxhash.h
index 1767b7e..a622f52 100644
--- a/xxhash.h
+++ b/xxhash.h
@@ -5001,6 +5001,7 @@ XXH3_initCustomSecret_scalar(void* XXH_RESTRICT customSecret, xxh_u64 seed64)
typedef void (*XXH3_f_accumulate)(xxh_u64* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, size_t);
+typedef void (*XXH3_f_accumulate_512)(void *XXH_RESTRICT acc, const void *XXH_RESTRICT input, const void *XXH_RESTRICT secret);
typedef void (*XXH3_f_scrambleAcc)(void* XXH_RESTRICT, const void*);
typedef void (*XXH3_f_initCustomSecret)(void* XXH_RESTRICT, xxh_u64);
@@ -5065,6 +5066,7 @@ XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc,
const xxh_u8* XXH_RESTRICT input, size_t len,
const xxh_u8* XXH_RESTRICT secret, size_t secretSize,
XXH3_f_accumulate f_acc,
+ XXH3_f_accumulate_512 f_acc512,
XXH3_f_scrambleAcc f_scramble)
{
size_t const nbStripesPerBlock = (secretSize - XXH_STRIPE_LEN) / XXH_SECRET_CONSUME_RATE;
@@ -5089,7 +5091,7 @@ XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc,
/* last stripe */
{ const xxh_u8* const p = input + len - XXH_STRIPE_LEN;
#define XXH_SECRET_LASTACC_START 7 /* not aligned on 8, last secret is different from acc & scrambler */
- XXH3_accumulate_512(acc, p, secret + secretSize - XXH_STRIPE_LEN - XXH_SECRET_LASTACC_START);
+ f_acc512(acc, p, secret + secretSize - XXH_STRIPE_LEN - XXH_SECRET_LASTACC_START);
} }
}
@@ -5135,11 +5137,12 @@ XXH_FORCE_INLINE XXH64_hash_t
XXH3_hashLong_64b_internal(const void* XXH_RESTRICT input, size_t len,
const void* XXH_RESTRICT secret, size_t secretSize,
XXH3_f_accumulate f_acc,
+ XXH3_f_accumulate_512 f_acc512,
XXH3_f_scrambleAcc f_scramble)
{
XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64 acc[XXH_ACC_NB] = XXH3_INIT_ACC;
- XXH3_hashLong_internal_loop(acc, (const xxh_u8*)input, len, (const xxh_u8*)secret, secretSize, f_acc, f_scramble);
+ XXH3_hashLong_internal_loop(acc, (const xxh_u8*)input, len, (const xxh_u8*)secret, secretSize, f_acc, f_acc512, f_scramble);
/* converge into final hash */
XXH_STATIC_ASSERT(sizeof(acc) == 64);
@@ -5159,7 +5162,7 @@ XXH3_hashLong_64b_withSecret(const void* XXH_RESTRICT input, size_t len,
XXH64_hash_t seed64, const xxh_u8* XXH_RESTRICT secret, size_t secretLen)
{
(void)seed64;
- return XXH3_hashLong_64b_internal(input, len, secret, secretLen, XXH3_accumulate, XXH3_scrambleAcc);
+ return XXH3_hashLong_64b_internal(input, len, secret, secretLen, XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc);
}
/*
@@ -5173,7 +5176,7 @@ XXH3_hashLong_64b_default(const void* XXH_RESTRICT input, size_t len,
XXH64_hash_t seed64, const xxh_u8* XXH_RESTRICT secret, size_t secretLen)
{
(void)seed64; (void)secret; (void)secretLen;
- return XXH3_hashLong_64b_internal(input, len, XXH3_kSecret, sizeof(XXH3_kSecret), XXH3_accumulate, XXH3_scrambleAcc);
+ return XXH3_hashLong_64b_internal(input, len, XXH3_kSecret, sizeof(XXH3_kSecret), XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc);
}
/*
@@ -5191,6 +5194,7 @@ XXH_FORCE_INLINE XXH64_hash_t
XXH3_hashLong_64b_withSeed_internal(const void* input, size_t len,
XXH64_hash_t seed,
XXH3_f_accumulate f_acc,
+ XXH3_f_accumulate_512 f_acc512,
XXH3_f_scrambleAcc f_scramble,
XXH3_f_initCustomSecret f_initSec)
{
@@ -5198,12 +5202,12 @@ XXH3_hashLong_64b_withSeed_internal(const void* input, size_t len,
if (seed == 0)
return XXH3_hashLong_64b_internal(input, len,
XXH3_kSecret, sizeof(XXH3_kSecret),
- f_acc, f_scramble);
+ f_acc, f_acc512, f_scramble);
#endif
{ XXH_ALIGN(XXH_SEC_ALIGN) xxh_u8 secret[XXH_SECRET_DEFAULT_SIZE];
f_initSec(secret, seed);
return XXH3_hashLong_64b_internal(input, len, secret, sizeof(secret),
- f_acc, f_scramble);
+ f_acc, f_acc512, f_scramble);
}
}
@@ -5216,7 +5220,7 @@ XXH3_hashLong_64b_withSeed(const void* XXH_RESTRICT input, size_t len,
{
(void)secret; (void)secretLen;
return XXH3_hashLong_64b_withSeed_internal(input, len, seed,
- XXH3_accumulate, XXH3_scrambleAcc, XXH3_initCustomSecret);
+ XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc, XXH3_initCustomSecret);
}
@@ -5927,11 +5931,12 @@ XXH_FORCE_INLINE XXH128_hash_t
XXH3_hashLong_128b_internal(const void* XXH_RESTRICT input, size_t len,
const xxh_u8* XXH_RESTRICT secret, size_t secretSize,
XXH3_f_accumulate f_acc,
+ XXH3_f_accumulate_512 f_acc512,
XXH3_f_scrambleAcc f_scramble)
{
XXH_ALIGN(XXH_ACC_ALIGN) xxh_u64 acc[XXH_ACC_NB] = XXH3_INIT_ACC;
- XXH3_hashLong_internal_loop(acc, (const xxh_u8*)input, len, secret, secretSize, f_acc, f_scramble);
+ XXH3_hashLong_internal_loop(acc, (const xxh_u8*)input, len, secret, secretSize, f_acc, f_acc512, f_scramble);
/* converge into final hash */
XXH_STATIC_ASSERT(sizeof(acc) == 64);
@@ -5958,7 +5963,7 @@ XXH3_hashLong_128b_default(const void* XXH_RESTRICT input, size_t len,
{
(void)seed64; (void)secret; (void)secretLen;
return XXH3_hashLong_128b_internal(input, len, XXH3_kSecret, sizeof(XXH3_kSecret),
- XXH3_accumulate, XXH3_scrambleAcc);
+ XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc);
}
/*
@@ -5972,24 +5977,25 @@ XXH3_hashLong_128b_withSecret(const void* XXH_RESTRICT input, size_t len,
{
(void)seed64;
return XXH3_hashLong_128b_internal(input, len, (const xxh_u8*)secret, secretLen,
- XXH3_accumulate, XXH3_scrambleAcc);
+ XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc);
}
XXH_FORCE_INLINE XXH128_hash_t
XXH3_hashLong_128b_withSeed_internal(const void* XXH_RESTRICT input, size_t len,
XXH64_hash_t seed64,
XXH3_f_accumulate f_acc,
+ XXH3_f_accumulate_512 f_acc512,
XXH3_f_scrambleAcc f_scramble,
XXH3_f_initCustomSecret f_initSec)
{
if (seed64 == 0)
return XXH3_hashLong_128b_internal(input, len,
XXH3_kSecret, sizeof(XXH3_kSecret),
- f_acc, f_scramble);
+ f_acc, f_acc512, f_scramble);
{ XXH_ALIGN(XXH_SEC_ALIGN) xxh_u8 secret[XXH_SECRET_DEFAULT_SIZE];
f_initSec(secret, seed64);
return XXH3_hashLong_128b_internal(input, len, (const xxh_u8*)secret, sizeof(secret),
- f_acc, f_scramble);
+ f_acc, f_acc512, f_scramble);
}
}
@@ -6002,7 +6008,7 @@ XXH3_hashLong_128b_withSeed(const void* input, size_t len,
{
(void)secret; (void)secretLen;
return XXH3_hashLong_128b_withSeed_internal(input, len, seed64,
- XXH3_accumulate, XXH3_scrambleAcc, XXH3_initCustomSecret);
+ XXH3_accumulate, XXH3_accumulate_512, XXH3_scrambleAcc, XXH3_initCustomSecret);
}
typedef XXH128_hash_t (*XXH3_hashLong128_f)(const void* XXH_RESTRICT, size_t,
@tycho, before discussing modifications, could you please confirm the basic problem first? My question is: does your issue happen with the mainline, unmodified version of
The basic problem is less severe but still present: not using the right (suboptimal) ISA for the final
With
Illustrated:
And with AVX512 dispatch off, it has similar errors with AVX2:
I'm fine with your fix. But I really hope you could help CI identify this issue. Since you can reproduce this issue with SSE2, maybe that could simplify the procedure. By the way, I'm working on dispatch for aarch64. I really appreciate that you found the issue at an early stage.
I understand what you're hoping for, but I don't see a good way for CI to identify this problem with the code as-is. You could probably do something like this, which would be an architecture-independent way to detect misuse of

diff --git a/xxh_x86dispatch.c b/xxh_x86dispatch.c
index b509035..d09c51f 100644
--- a/xxh_x86dispatch.c
+++ b/xxh_x86dispatch.c
@@ -203,6 +203,7 @@ extern "C" {
#endif
#include <assert.h>
+#define XXH_VECTOR XXH_ABORT
#define XXH_INLINE_ALL
#define XXH_X86DISPATCH
#include "xxhash.h"
diff --git a/xxhash.h b/xxhash.h
index 1767b7e..62771dd 100644
--- a/xxhash.h
+++ b/xxhash.h
@@ -3134,6 +3134,7 @@ enum XXH_VECTOR_TYPE /* fake enum */ {
XXH_NEON = 4, /*!< NEON for most ARMv7-A and all AArch64 */
XXH_VSX = 5, /*!< VSX and ZVector for POWER8/z13 (64-bit) */
XXH_SVE = 6, /*!< SVE for some ARMv8-A and ARMv9-A */
+ XXH_ABORT = 7,
};
/*!
* @ingroup tuning
@@ -3156,6 +3157,7 @@ enum XXH_VECTOR_TYPE /* fake enum */ {
# define XXH_NEON 4
# define XXH_VSX 5
# define XXH_SVE 6
+# define XXH_ABORT 7
#endif
#ifndef XXH_VECTOR /* can be defined on command line */
@@ -3200,7 +3202,7 @@ enum XXH_VECTOR_TYPE /* fake enum */ {
* for compatibility with aligned vector loads, which are usually faster.
*/
#ifndef XXH_ACC_ALIGN
-# if defined(XXH_X86DISPATCH)
+# if defined(XXH_X86DISPATCH) || XXH_VECTOR == XXH_ABORT
# define XXH_ACC_ALIGN 64 /* for compatibility with avx512 */
# elif XXH_VECTOR == XXH_SCALAR /* scalar */
# define XXH_ACC_ALIGN 8
@@ -5000,6 +5002,36 @@ XXH3_initCustomSecret_scalar(void* XXH_RESTRICT customSecret, xxh_u64 seed64)
}
+XXH_FORCE_INLINE void
+XXH3_accumulate_512_abort(void* XXH_RESTRICT acc,
+ const void* XXH_RESTRICT input,
+ const void* XXH_RESTRICT secret)
+{
+ abort();
+}
+
+XXH_FORCE_INLINE void
+XXH3_accumulate_abort(xxh_u64* XXH_RESTRICT acc,
+ const xxh_u8* XXH_RESTRICT input,
+ const xxh_u8* XXH_RESTRICT secret,
+ size_t nbStripes)
+{
+ abort();
+}
+
+XXH_FORCE_INLINE void
+XXH3_scrambleAcc_abort(void* XXH_RESTRICT acc, const void* XXH_RESTRICT secret)
+{
+ abort();
+}
+
+XXH_FORCE_INLINE void
+XXH3_initCustomSecret_abort(void* XXH_RESTRICT customSecret, xxh_u64 seed64)
+{
+ abort();
+}
+
+
typedef void (*XXH3_f_accumulate)(xxh_u64* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, const xxh_u8* XXH_RESTRICT, size_t);
typedef void (*XXH3_f_scrambleAcc)(void* XXH_RESTRICT, const void*);
typedef void (*XXH3_f_initCustomSecret)(void* XXH_RESTRICT, xxh_u64);
@@ -5046,6 +5078,13 @@ typedef void (*XXH3_f_initCustomSecret)(void* XXH_RESTRICT, xxh_u64);
#define XXH3_scrambleAcc XXH3_scrambleAcc_scalar
#define XXH3_initCustomSecret XXH3_initCustomSecret_scalar
+#elif (XXH_VECTOR == XXH_ABORT)
+
+#define XXH3_accumulate_512 XXH3_accumulate_512_abort
+#define XXH3_accumulate XXH3_accumulate_abort
+#define XXH3_scrambleAcc XXH3_scrambleAcc_abort
+#define XXH3_initCustomSecret XXH3_initCustomSecret_abort
+
#else /* scalar */
#define XXH3_accumulate_512 XXH3_accumulate_512_scalar

The above does cause the test case to abort when it hits that
@t-mat Let me summarize this issue. I tried to build s390x test environments.
I'll try to set up an s390x QEMU environment with version 6.x or 7.x to verify it.
I tried the Ubuntu 22.04 s390x cloud image on QEMU 6.2. It succeeded. So it's clear that this is a QEMU issue. Could we check the QEMU version used in CI? I suspect it may be a little older.
@hzhuang1 thanks for the explanation and investigation. The following section of
Please check "Environment info(2)" in your QEMU test log: https://github.com/hzhuang1/xxHash/actions/runs/3579213227/jobs/6020181259#step:6:31
I'm confused. What is the goal of all this testing of s390x using the wild-guess patch I threw out in the initial post? Don't we already know it was a Bad Idea(tm) from the start? What does this additional testing tell us? |
4.2.1 is too old. Let's upgrade it to 6.2 at least. By the way, QEMU v6.2 is what Ubuntu 22.04 uses.
Once the QEMU issue is resolved, either patch could fix the issue. Let the maintainer decide which one is better. Even if the maintainer selects the second patch, we still need to fix the QEMU issue; it may be triggered again at any time.
@hzhuang1 as for S390X VSX and ARM64 SVE, I found an embarrassing mistake of mine. I started #766.
As of today,
To use recent version of QEMU, this patch replaces ubuntu VM images for QEMU test matrix. See Cyan4973#763
My understanding of this (relatively complex) situation is that
This is arguably an uncommon platform for us,
The problem is, this test only compiles the "regular" version of
But I presume a
And I believe that's a good place to start this topic. If we can fix
Worst case, we may conclude that 91788f1 is problematic and must be reverted,
As for the clang-cl issue, this could be done:

#ifdef __clang__
# pragma push_macro("__AVX__")
# pragma push_macro("__AVX2__")
# pragma push_macro("__AVX512F__")
# define __AVX__ 1
# define __AVX2__ 1
# define __AVX512F__ 1
# include <immintrin.h>
# pragma pop_macro("__AVX512F__")
# pragma pop_macro("__AVX2__")
# pragma pop_macro("__AVX__")
#endif

This trick also works with
While it would be nice to do these separately, the dispatching needed to be rewritten anyway to support this change. This introduces a new hashLong model which:

1. Greatly reduces code size by reducing the inlined hashLong copies
   - Only two copies of hashLong are emitted, one for a known secret size and one for an unknown size
   - `XXH_SIZE_OPT` only emits one
   - Significant code size improvement on 32-bit scalar, especially with no unaligned access, where it gets bloated
   - On GCC 12 x64, this allows dispatching to be included with a net size *decrease* compared to before (34 kB vs 35 kB, far less than the old dispatch, which was 48 kB)
2. Natively dispatches using a function table
   - Adding support for other targets is trivial
   - Modern compilers inline the table when dispatching is disabled
   - xxh_x86dispatch.c is no longer required (and now `#pragma message`s)
   - Can be done fully inline
3. Has very minimal overhead
4. Has no mutable global variables (merely one function-static pointer)

The dispatching logic has also been improved:

- For purposes of stupid backwards compatibility, an FXSAVE check is included to catch ancient OSes that don't support SSE.
- AVX2 is always dispatched. Aside from old MSVC versions, the compilers that don't support AVX2 don't support dispatching in this method.
  - This also allows old compiler hacks to be removed.
- Dispatching AVX512 is now supported on macOS 12.2+
  - macOS doesn't show AVX512 support in XGETBV until it traps an instruction and "promotes" the thread.
  - Versions < 12.2 didn't properly save the mask registers in signal handlers and are not safe to use.
- Excess logging and commenting is removed, and macros are cleaned up

And some bugs have been fixed:

- GCC complaining about `-Wmaybe-uninitialized` on its own intrinsics
- Clang 14 breaking on `-masm=intel` again
- RBX not being preserved on x86_64 (it is reserved on the medium and large code models)
- clang-cl did not include the correct intrinsics (fixes Cyan4973#763)
- Clang < 3.7 did not support using intrinsic headers without the correct macros
- GCC 4.9 dispatched AVX512 but not AVX2
- As mentioned before, AVX512 was not dispatched on macOS

xxhsum, the makefile, and the tests haven't been updated yet; for now, testing can be done with `make CPPFLAGS="-DXXH_DISPATCH"`
I believe the specific issue mentioned in this thread has been fixed,
Specifically, I see we have a compilation and runtime test for
edit: Update actual patch and
The following patch for

# Only support DISPATCH option on x86_64.
- if("${PLATFORM}" STREQUAL "x86_64")
+ if(("${PLATFORM}" STREQUAL "x86_64") OR ("${PLATFORM}" STREQUAL "AMD64"))
+ set(XXHSUM_DISPATCH ON)
...
if(XXHASH_BUILD_XXHSUM)
set(XXHSUM_DIR "${XXHASH_DIR}/cli")
# xxhsum
- add_executable(xxhsum "${XXHSUM_DIR}/xxhsum.c"
+ set(XXHSUM_SOURCES)
+ if (XXHSUM_DISPATCH)
+ list(APPEND XXHSUM_SOURCES "${XXHASH_DIR}/xxh_x86dispatch.c")
+ endif()
+ list(APPEND XXHSUM_SOURCES "${XXHSUM_DIR}/xxhsum.c"
"${XXHSUM_DIR}/xsum_os_specific.c"
"${XXHSUM_DIR}/xsum_output.c"
"${XXHSUM_DIR}/xsum_sanity_check.c"
"${XXHSUM_DIR}/xsum_bench.c"
)
+ add_executable(xxhsum ${XXHSUM_SOURCES})
add_executable(${PROJECT_NAME}::xxhsum ALIAS xxhsum)
note: as for Lines 104 to 107 in 4fd75d7
With the following build procedure:

git clone https://github.com/Cyan4973/xxHash.git
cd xxHash
git branch -v
# * dev 4fd75d7 Merge pull request #860 from Cyan4973/old_names_warning
cd cmake_unofficial
# Patch the CMakeList.txt
mkdir build-clang-cl
cd build-clang-cl
cmake .. -DXXHASH_C_FLAGS="/arch:AVX512 -DXXH_X86DISPATCH_ALLOW_AVX=1" -DCMAKE_BUILD_TYPE=Release -DDISPATCH=ON -A x64 -DCMAKE_GENERATOR_TOOLSET=ClangCL
#
# > ...
# > -- Architecture: AMD64
# > -- Enable xxHash dispatch mode
#
cmake --build . --config Release
The actual binary crashes 😿
↑ My environment doesn't support AVX512. But for some reason, it selects AVX512 and crashes. Investigating.
Disassembly of _mm_mul_epu32
Why setting
Without explicit
Additionally, compilation of a specific vector extension may (and generally does) require some
These
To be more complete, when
I think I've managed to build and run it properly (hopefully). How can we detect
This looks excellent @t-mat!
test that #763 is fixed and remains fixed
Commit 91788f1 seems to have caused a regression with dispatch, causing it to unconditionally use AVX512 on my system -- while my toolchain supports AVX512, my host CPU does not.
My system hits an illegal instruction through here:
xxHash/xxhash.h
Line 5028 in 91788f1
Call stack:
EDIT: For clarity, note that dispatch correctly selected AVX2 (XXHL64_default_avx2), but XXH3_hashLong_internal_loop directly uses the XXH3_accumulate_512 macro, which resolved to XXH3_accumulate_512_avx512 (the highest toolchain-supported ISA at compile time).

I'm not sure what the correct line would look like.
EDIT: Removed bogus patch. Better one in comments.