Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add avx512 optimization #446

Closed
wants to merge 4 commits into from
Closed

Conversation

huihan365
Copy link
Contributor

Nowadays icelake servers are widely used. this patch add avx512 optimization for better performance than avx2. croaring_avx512() will be true if platform support avx512 during runtime dispatch.

Signed-off-by: lifan lin(lifan.lin@intel.com)
hui han(hui.han@intel.com)

here is the benchmark result(platform icelake 6330),
Not all operations have improvement, but some operations have obvious performance boost.
avx2
taskset -c 0 benchmarks/add_benchmark
[cycles/element]
intvlen=1 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 62.9
roaring_bitmap_add_many(): 85.5
roaring_bitmap_add_bulk(): 65.4
roaring_bitmap_add_range(): 79.4
roaring_bitmap_remove(): 993.9
roaring_bitmap_remove_range(): 81.6
intvlen=4 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 24.7
roaring_bitmap_add_many(): 23.7
roaring_bitmap_add_bulk(): 23.1
roaring_bitmap_add_range(): 16.6
roaring_bitmap_remove(): 548.9
roaring_bitmap_remove_range(): 26.8
intvlen=16 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 15.0
roaring_bitmap_add_many(): 10.8
roaring_bitmap_add_bulk(): 12.6
roaring_bitmap_add_range(): 8.1
roaring_bitmap_remove(): 191.8
roaring_bitmap_remove_range(): 15.4
intvlen=64 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 13.8
roaring_bitmap_add_many(): 8.7
roaring_bitmap_add_bulk(): 11.4
roaring_bitmap_add_range(): 2.0
roaring_bitmap_remove(): 70.2
roaring_bitmap_remove_range(): 3.4

intvlen=1 density=0.200000 order=ASC
roaring_bitmap_add(): 7.5
roaring_bitmap_add_many(): 13.9
roaring_bitmap_add_bulk(): 6.4
roaring_bitmap_add_range(): 27.8
roaring_bitmap_remove(): 606.3
roaring_bitmap_remove_range(): 41.5
intvlen=4 density=0.200000 order=ASC
roaring_bitmap_add(): 7.1
roaring_bitmap_add_many(): 5.3
roaring_bitmap_add_bulk(): 5.8
roaring_bitmap_add_range(): 6.9
roaring_bitmap_remove(): 389.0
roaring_bitmap_remove_range(): 15.1
intvlen=16 density=0.200000 order=ASC
roaring_bitmap_add(): 6.9
roaring_bitmap_add_many(): 3.6
roaring_bitmap_add_bulk(): 5.7
roaring_bitmap_add_range(): 2.7
roaring_bitmap_remove(): 164.3
roaring_bitmap_remove_range(): 8.2
intvlen=64 density=0.200000 order=ASC
roaring_bitmap_add(): 6.9
roaring_bitmap_add_many(): 3.3
roaring_bitmap_add_bulk(): 5.8
roaring_bitmap_add_range(): 0.8
roaring_bitmap_remove(): 66.1
roaring_bitmap_remove_range(): 1.7

intvlen=1 density=0.200000 order=DESC
roaring_bitmap_add(): 28.5
roaring_bitmap_add_many(): 35.1
roaring_bitmap_add_bulk(): 24.5
roaring_bitmap_add_range(): 34.3
roaring_bitmap_remove(): 884.0
roaring_bitmap_remove_range(): 48.6
intvlen=4 density=0.200000 order=DESC
roaring_bitmap_add(): 17.1
roaring_bitmap_add_many(): 14.1
roaring_bitmap_add_bulk(): 14.0
roaring_bitmap_add_range(): 8.6
roaring_bitmap_remove(): 610.2
roaring_bitmap_remove_range(): 18.1
intvlen=16 density=0.200000 order=DESC
roaring_bitmap_add(): 15.3
roaring_bitmap_add_many(): 10.1
roaring_bitmap_add_bulk(): 12.2
roaring_bitmap_add_range(): 4.3
roaring_bitmap_remove(): 182.8
roaring_bitmap_remove_range(): 10.4
intvlen=64 density=0.200000 order=DESC
roaring_bitmap_add(): 16.1
roaring_bitmap_add_many(): 11.2
roaring_bitmap_add_bulk(): 13.8
roaring_bitmap_add_range(): 0.9
roaring_bitmap_remove(): 67.2
roaring_bitmap_remove_range(): 2.0

avx512
taskset -c 0 benchmarks/add_benchmark
[cycles/element]
intvlen=1 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 64.7
roaring_bitmap_add_many(): 86.4
roaring_bitmap_add_bulk(): 66.5
roaring_bitmap_add_range(): 82.5
roaring_bitmap_remove(): 852.2
roaring_bitmap_remove_range(): 78.1
intvlen=4 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 25.7
roaring_bitmap_add_many(): 24.6
roaring_bitmap_add_bulk(): 23.0
roaring_bitmap_add_range(): 17.4
roaring_bitmap_remove(): 391.7
roaring_bitmap_remove_range(): 24.5
intvlen=16 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 15.3
roaring_bitmap_add_many(): 11.7
roaring_bitmap_add_bulk(): 13.1
roaring_bitmap_add_range(): 8.1
roaring_bitmap_remove(): 126.0
roaring_bitmap_remove_range(): 13.1
intvlen=64 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 14.5
roaring_bitmap_add_many(): 9.5
roaring_bitmap_add_bulk(): 11.6
roaring_bitmap_add_range(): 2.0
roaring_bitmap_remove(): 57.1
roaring_bitmap_remove_range(): 2.9

intvlen=1 density=0.200000 order=ASC
roaring_bitmap_add(): 8.4
roaring_bitmap_add_many(): 15.0
roaring_bitmap_add_bulk(): 6.9
roaring_bitmap_add_range(): 31.6
roaring_bitmap_remove(): 374.2
roaring_bitmap_remove_range(): 36.9
intvlen=4 density=0.200000 order=ASC
roaring_bitmap_add(): 7.7
roaring_bitmap_add_many(): 6.1
roaring_bitmap_add_bulk(): 6.2
roaring_bitmap_add_range(): 7.9
roaring_bitmap_remove(): 226.5
roaring_bitmap_remove_range(): 12.5
intvlen=16 density=0.200000 order=ASC
roaring_bitmap_add(): 7.4
roaring_bitmap_add_many(): 4.4
roaring_bitmap_add_bulk(): 6.0
roaring_bitmap_add_range(): 2.9
roaring_bitmap_remove(): 103.1
roaring_bitmap_remove_range(): 6.2
intvlen=64 density=0.200000 order=ASC
roaring_bitmap_add(): 7.6
roaring_bitmap_add_many(): 4.0
roaring_bitmap_add_bulk(): 6.3
roaring_bitmap_add_range(): 0.9
roaring_bitmap_remove(): 51.6
roaring_bitmap_remove_range(): 1.4

intvlen=1 density=0.200000 order=DESC
roaring_bitmap_add(): 28.8
roaring_bitmap_add_many(): 35.7
roaring_bitmap_add_bulk(): 24.7
roaring_bitmap_add_range(): 38.2
roaring_bitmap_remove(): 752.7
roaring_bitmap_remove_range(): 46.5
intvlen=4 density=0.200000 order=DESC
roaring_bitmap_add(): 17.5
roaring_bitmap_add_many(): 15.0
roaring_bitmap_add_bulk(): 14.4
roaring_bitmap_add_range(): 9.4
roaring_bitmap_remove(): 466.9
roaring_bitmap_remove_range(): 16.6
intvlen=16 density=0.200000 order=DESC
roaring_bitmap_add(): 15.2
roaring_bitmap_add_many(): 10.9
roaring_bitmap_add_bulk(): 12.2
roaring_bitmap_add_range(): 4.4
roaring_bitmap_remove(): 124.5
roaring_bitmap_remove_range(): 8.9
intvlen=64 density=0.200000 order=DESC
roaring_bitmap_add(): 17.1
roaring_bitmap_add_many(): 12.0
roaring_bitmap_add_bulk(): 14.3
roaring_bitmap_add_range(): 0.9
roaring_bitmap_remove(): 57.1
roaring_bitmap_remove_range(): 1.8

avx2
taskset -c 0 benchmarks/bitset_container_benchmark
bitset container benchmarks
set_test(B): 1.50 cycles per operation
get_test(B): 0.96 cycles per operation
bitset_container_cardinality(B): 201.00 cycles per operation
bitset_container_compute_cardinality(B): 0.40 cycles per operation
unset_test(B): 1.51 cycles per operation

number of values in container = 4096
bitset_container_to_uint32_array(out, Bt, 1234): 0.89 cycles per operation
bitset_container_get bitset_cache_prefetch: 205.69 cycles per operation
bitset_container_get bitset_cache_flush: 327.49 cycles per operation

number of values in container = 8192
bitset_container_to_uint32_array(out, Bt, 1234): 1.16 cycles per operation
bitset_container_get bitset_cache_prefetch: 208.41 cycles per operation
bitset_container_get bitset_cache_flush: 312.30 cycles per operation

number of values in container = 16384
bitset_container_to_uint32_array(out, Bt, 1234): 0.60 cycles per operation
bitset_container_get bitset_cache_prefetch: 210.64 cycles per operation
bitset_container_get bitset_cache_flush: 308.05 cycles per operation

number of values in container = 32768
bitset_container_to_uint32_array(out, Bt, 1234): 0.30 cycles per operation
bitset_container_get bitset_cache_prefetch: 206.42 cycles per operation
bitset_container_get bitset_cache_flush: 309.61 cycles per operation

number of values in container = 65536
bitset_container_to_uint32_array(out, Bt, 1234): 0.17 cycles per operation
bitset_container_get bitset_cache_prefetch: 212.69 cycles per operation
bitset_container_get bitset_cache_flush: 317.35 cycles per operation

Logical operations (time units per single operation):
bitset_container_and_nocard(B1, B2, BO): 507.00 cycles per operation
bitset_container_and(B1, B2, BO): 534.00 cycles per operation
bitset_container_and_justcard(B1, B2): 353.00 cycles per operation
bitset_container_compute_cardinality(BO): 417.00 cycles per operation
bitset_container_or_nocard(B1, B2, BO): 628.00 cycles per operation
bitset_container_or(B1, B2, BO): 528.00 cycles per operation
bitset_container_or_justcard(B1, B2): 351.00 cycles per operation
bitset_container_compute_cardinality(BO): 419.00 cycles per operation

get_cardinality_through_conversion_to_array(B1): 4.28 cycles per operation

avx512
taskset -c 0 benchmarks/bitset_container_benchmark
bitset container benchmarks
set_test(B): 2.40 cycles per operation
get_test(B): 1.14 cycles per operation
bitset_container_cardinality(B): 203.00 cycles per operation
bitset_container_compute_cardinality(B): 0.31 cycles per operation
unset_test(B): 2.41 cycles per operation

number of values in container = 4096
bitset_container_to_uint32_array(out, Bt, 1234): 0.96 cycles per operation
bitset_container_get bitset_cache_prefetch: 209.76 cycles per operation
bitset_container_get bitset_cache_flush: 207.77 cycles per operation

number of values in container = 8192
bitset_container_to_uint32_array(out, Bt, 1234): 0.73 cycles per operation
bitset_container_get bitset_cache_prefetch: 209.72 cycles per operation
bitset_container_get bitset_cache_flush: 208.29 cycles per operation

number of values in container = 16384
bitset_container_to_uint32_array(out, Bt, 1234): 0.36 cycles per operation
bitset_container_get bitset_cache_prefetch: 210.37 cycles per operation
bitset_container_get bitset_cache_flush: 207.80 cycles per operation

number of values in container = 32768
bitset_container_to_uint32_array(out, Bt, 1234): 0.21 cycles per operation
bitset_container_get bitset_cache_prefetch: 208.17 cycles per operation
bitset_container_get bitset_cache_flush: 208.69 cycles per operation

number of values in container = 65536
bitset_container_to_uint32_array(out, Bt, 1234): 0.13 cycles per operation
bitset_container_get bitset_cache_prefetch: 207.84 cycles per operation
bitset_container_get bitset_cache_flush: 208.03 cycles per operation

Logical operations (time units per single operation):
bitset_container_and_nocard(B1, B2, BO): 392.00 cycles per operation
bitset_container_and(B1, B2, BO): 424.00 cycles per operation
bitset_container_and_justcard(B1, B2): 333.00 cycles per operation
bitset_container_compute_cardinality(BO): 322.00 cycles per operation
bitset_container_or_nocard(B1, B2, BO): 392.00 cycles per operation
bitset_container_or(B1, B2, BO): 424.00 cycles per operation
bitset_container_or_justcard(B1, B2): 337.00 cycles per operation
bitset_container_compute_cardinality(BO): 316.00 cycles per operation

get_cardinality_through_conversion_to_array(B1): 4.30 cycles per operation

@chenyang8094
Copy link

exciting work👍

@@ -64,7 +64,7 @@ bitset_container_t *bitset_container_clone(const bitset_container_t *src);
void bitset_container_set_range(bitset_container_t *bitset, uint32_t begin,
uint32_t end);

#if defined(CROARING_ASMBITMANIPOPTIMIZATION) && defined(__AVX2__)
#if defined(CROARING_ASMBITMANIPOPTIMIZATION) && !defined(ROARING_DISABLE_AVX)
Copy link
Member

@lemire lemire Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ASM_SHIFT_RIGHT routine below expects BMI2, I believe.

So not only will it only run under x64, but it also requires BMI2. It is satisfied if __AVX2__ is defined, but the !defined(ROARING_DISABLE_AVX) is insufficient. At the very least, we need to check that CROARING_IS_X64 is defined, and then, before using it, we need to do a runtime check (e.g., croaring_avx2()). However, it is dubious whether you can do runtime dispatching at the granularity level of such a tiny function and see benefits. Now, it coudb be fruitfully used within a larger function that repeatedly does bitset_container_set, then you can amortize the runtime dispatching. Otherwise, I would simply not bother.

Copy link
Contributor Author

@huihan365 huihan365 Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, runtime dispatching maybe get little benefits. So this change will revert

@@ -71,6 +71,7 @@ enum croaring_instruction_set {
CROARING_BMI1 = 0x20,
CROARING_BMI2 = 0x40,
CROARING_ALTIVEC = 0x80,
CROARING_AVX512 =0x100,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please insert a space before 0x100 to make the code look consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will change

@@ -328,6 +328,7 @@ static inline int hamming(uint64_t x) {
#endif

#define CROARING_TARGET_AVX2 CROARING_TARGET_REGION("avx2,bmi,pclmul,lzcnt")
#define CROARING_TARGET_AVX512 CROARING_TARGET_REGION("bmi2,avx512f,avx512bw,avx512dq")
Copy link
Member

@lemire lemire Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like you only check for the AVX-512 bit and then assume that avx512bw and avx512dq follow, but that's not true on Xeon Phi accelerators (and I own one of those, they do exist).

Furthermore, you start the comment with a reference to icelake.

So I recommend going all out and checking for everything up to VBMI2...

https://github.com/simdjson/simdjson/blob/005054916d2d5b89c1a2f7916704444b054e820c/include/simdjson/internal/isadetection.h#L70-L78

Then you can target the whole range...

"avx512f,avx512dq,avx512cd,avx512bw,avx512vbmi,avx512vbmi2,avx512vl,avx2,bmi,pclmul,lzcnt,avx512vpopcntdq"

This opens up crazily fast bitmap decoding...

https://lemire.me/blog/2022/05/10/faster-bitset-decoding-using-intel-avx-512/

You don't need to implement that in this PR, but targeting Zen 4 and Ice Lake right away, allows us to add this easily in the future.

Furthermore, Ice Lake and Zen 4 do not suffer from frequency throttling, like the previous AVX-512-capable chips. This makes it much more interesting to use AVX-512 just on recent chips capable of VBMI2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will add checking

@@ -569,6 +572,282 @@ CROARING_TARGET_AVX2
AVXPOPCNTFNC(andnot, _mm256_andnot_si256)
CROARING_UNTARGET_REGION

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we go straight to requiring VPOPCNTDQ which both Ice Lake and Zen 4 support. Then we can simply do _mm512_popcnt_epi64.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. how about implement in the next release? so this release can apply servers not only icelake but also casalake...

@@ -131,6 +132,7 @@ static inline uint32_t dynamic_croaring_detect_supported_architectures() {
static uint32_t cpuid_avx2_bit = 1 << 5; ///< @private Bit 5 of EBX for EAX=0x7
static uint32_t cpuid_bmi1_bit = 1 << 3; ///< @private bit 3 of EBX for EAX=0x7
static uint32_t cpuid_bmi2_bit = 1 << 8; ///< @private bit 8 of EBX for EAX=0x7
static uint32_t cpuid_avx512f_bit = 1 << 16; ///< @private bit 16 of EBX for EAX=0x7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend checking for VBMI2 and VPOPCNTDQ as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about next release?

src/bitset_util.c Outdated Show resolved Hide resolved
@lemire
Copy link
Member

lemire commented Mar 13, 2023

I agree that this is very exciting work.

Let us make this happen.

@wanweiqiangintel might be interested as well.

@lemire
Copy link
Member

lemire commented Mar 13, 2023

@huihan365 When compiling under Visual Studio to a 32-bit target, some code seems to expect croaring_avx512 to be present.

@lemire
Copy link
Member

lemire commented Mar 13, 2023

Under the macos build, we are getting illegal instructions which might be related to bitset code I commented: https://github.com/RoaringBitmap/CRoaring/actions/runs/4403958055/jobs/7722282318

const __m512i *ptr1 = (const __m512i*)container1->words;
const __m512i *ptr2 = (const __m512i*)container2->words;
for (size_t i = 0; i < BITSET_CONTAINER_SIZE_IN_WORDS*sizeof(uint64_t)/64; i++) {
__m512i r1 = _mm512_loadu_si512(ptr1+i);

Check failure

Code scanning / CodeQL

Suspicious pointer scaling High

This pointer might have type
unsigned long
(size 8), but this pointer arithmetic is done with type const __m512i * (size 64).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can someone see what the github-code-scanning bot found? It does not look like a bug to me. Maybe I am misreading the code? It looks correct at a glance.

const __m512i *ptr2 = (const __m512i*)container2->words;
for (size_t i = 0; i < BITSET_CONTAINER_SIZE_IN_WORDS*sizeof(uint64_t)/64; i++) {
__m512i r1 = _mm512_loadu_si512(ptr1+i);
__m512i r2 = _mm512_loadu_si512(ptr2+i);

Check failure

Code scanning / CodeQL

Suspicious pointer scaling High

This pointer might have type
unsigned long
(size 8), but this pointer arithmetic is done with type const __m512i * (size 64).
src/bitset_util.c Fixed Show resolved Hide resolved
@lemire
Copy link
Member

lemire commented Mar 13, 2023

Note: I am committed to make this happen. Please review my comments.

@CharlesChen888
Copy link
Contributor

It is exciting to see this PR. Looks like you have implemented AVX512 calculation between bitset containers. Have you done or considered applying AVX512 to array containers?

@lemire
Copy link
Member

lemire commented Mar 14, 2023

@CharlesChen888 Note that we can use multiple rounds of PRs to get all the performance out of AVX-512. :-)

@lemire
Copy link
Member

lemire commented Mar 14, 2023

@huihan365

I agree that we do not need to get everything right or perfect at once. You have done a lot of work and I don't want to hold on to the PR to long.

However, I'd like to discuss this point...

I agree. how about implement in the next release? so this release can apply servers not only icelake but also casalake...

Probably you mean cascade lake... that is, essentially skylake.

Please read this part of the CRoaring documentation...

Our AVX2 code does not use floating-point numbers or multiplications, so it is not subject to turbo frequency throttling on many-core Intel processors.

Notice "so it is not subject to turbo frequency throttling on many-core Intel processors."

As you are no doubt aware, prior to Rocket Lake and Ice Lake, using AVX-512 on Intel processors could lead to severe frequency throttling. I have tried to document it the best I could, based on my experience... https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

Even Intel documentation, at the time, urged engineers to benchmark and assess the benefits. It is fairly complicated...

On server-class processor, the downclocking is determined on a per-core basis based on the license and the total number of active cores, on the same CPU socket, irrespective of the license of the other cores. That is, to determine the frequency of core under downclocking, you need only to know its license (determined by the type of instructions it runs) and count the number of cores where code is running. Thus you cannot downclock other cores on the same socket, other than the sibling logical core when hyperthreading is used, merely by running heavy and sustained AVX-512 instructions on one core. If you can isolate your heavy numerical work on a few cores (or just one), then the downclocking is limited to these cores. On linux, you can control which cores are running your processes using tools such as taskset or numactl.

Thankfully, recent Intel servers do not have this issue.

You can check my answer to @wanweiqiangintel on a related PR in another project...

simdjson/simdjson#1960

Even Intel would prefer that users do not get bad surprises.

So here is what I propose, if you want to support pre-Ice Lake processor: AVX-512 should be disabled by default on these processors. Maybe a compilation flag (macro) could change this default behaviour.

Only on recent Intel processors (where VBMI2 is available) should AVX-512 be enabled by default.

Now, this makes the engineering more complicated because instead of having to support fallback, AVX2 and AVX-512, are going to have to support fallback, AVX2, AVX-512 and AVX-512-VBMI2.

I submit to you that it is probably not worth it. It is going to be more effort for relatively little gain in the sense that Intel will quickly move the AVX-512 support to something like Ice Lake or better (that is, VBMI2 + no throttling) across all its server chips.

I am aware that Intel still sells pre-Ice Lake server chips... but I submit to you that it is selling these chips primarily to customers who have little interest for AVX-512.

So my strong suggestion is that we only enable the AVX-512 code when VBMI2 is available. That's where AVX-512 really shines. We won't have to warn users about complicated licensing issues (frequency throttling) and we can do all sorts of nifty tricks.

My recommandation is not that you update your code (e.g., use VPOPCNTDQ and VBMI2) in this PR. It is perfectly acceptable to proceed in steps. However, please consider my point about requiring that the processor has VBMI2. Think about how much simpler it will make future work, and about how it does not require changing out documentation to warn people about frequency throttling.

@lemire
Copy link
Member

lemire commented Mar 14, 2023

@huihan365 In effect, short of the small functional changes, I am only suggesting that you check for VBMI2 to enable the AVX-512 kernel.

So I am only suggesting you change a few lines. Your PR is basically good as-is.

@huihan365 huihan365 force-pushed the avx512_perf branch 2 times, most recently from 2d93bc9 to cb0b4c2 Compare March 15, 2023 11:30
@lemire
Copy link
Member

lemire commented Mar 15, 2023

Running tests.

#elif defined(__AVX2__)
static inline bool croaring_avx2() {
return true;
}
static inline bool croaring_avx512() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this is correct. If AVX2 is defined, it could still be true that AVX-512 is supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but if AVX-512 is supported. the upper AVX-512 branch will be compiled.

include/roaring/isadetection.h Fixed Show resolved Hide resolved
@lemire
Copy link
Member

lemire commented Mar 20, 2023

@huihan365 I will fix the remaining issues.

@lemire lemire mentioned this pull request Mar 20, 2023
@lemire
Copy link
Member

lemire commented Mar 20, 2023

@huihan365 Please see #451 where I have applied various small fixes, mostly to help portability.

@lemire
Copy link
Member

lemire commented Mar 21, 2023

Closing in favour of #451

@lemire lemire closed this Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants