add avx512 optimization #446

huihan365 · 2023-03-12T11:17:44Z

Nowadays icelake servers are widely used. this patch add avx512 optimization for better performance than avx2. croaring_avx512() will be true if platform support avx512 during runtime dispatch.

Signed-off-by: lifan lin(lifan.lin@intel.com)
hui han(hui.han@intel.com)

here is the benchmark result(platform icelake 6330),
Not all operations have improvement, but some operations have obvious performance boost.
avx2
taskset -c 0 benchmarks/add_benchmark
[cycles/element]
intvlen=1 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 62.9
roaring_bitmap_add_many(): 85.5
roaring_bitmap_add_bulk(): 65.4
roaring_bitmap_add_range(): 79.4
roaring_bitmap_remove(): 993.9
roaring_bitmap_remove_range(): 81.6
intvlen=4 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 24.7
roaring_bitmap_add_many(): 23.7
roaring_bitmap_add_bulk(): 23.1
roaring_bitmap_add_range(): 16.6
roaring_bitmap_remove(): 548.9
roaring_bitmap_remove_range(): 26.8
intvlen=16 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 15.0
roaring_bitmap_add_many(): 10.8
roaring_bitmap_add_bulk(): 12.6
roaring_bitmap_add_range(): 8.1
roaring_bitmap_remove(): 191.8
roaring_bitmap_remove_range(): 15.4
intvlen=64 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 13.8
roaring_bitmap_add_many(): 8.7
roaring_bitmap_add_bulk(): 11.4
roaring_bitmap_add_range(): 2.0
roaring_bitmap_remove(): 70.2
roaring_bitmap_remove_range(): 3.4

intvlen=1 density=0.200000 order=ASC
roaring_bitmap_add(): 7.5
roaring_bitmap_add_many(): 13.9
roaring_bitmap_add_bulk(): 6.4
roaring_bitmap_add_range(): 27.8
roaring_bitmap_remove(): 606.3
roaring_bitmap_remove_range(): 41.5
intvlen=4 density=0.200000 order=ASC
roaring_bitmap_add(): 7.1
roaring_bitmap_add_many(): 5.3
roaring_bitmap_add_bulk(): 5.8
roaring_bitmap_add_range(): 6.9
roaring_bitmap_remove(): 389.0
roaring_bitmap_remove_range(): 15.1
intvlen=16 density=0.200000 order=ASC
roaring_bitmap_add(): 6.9
roaring_bitmap_add_many(): 3.6
roaring_bitmap_add_bulk(): 5.7
roaring_bitmap_add_range(): 2.7
roaring_bitmap_remove(): 164.3
roaring_bitmap_remove_range(): 8.2
intvlen=64 density=0.200000 order=ASC
roaring_bitmap_add(): 6.9
roaring_bitmap_add_many(): 3.3
roaring_bitmap_add_bulk(): 5.8
roaring_bitmap_add_range(): 0.8
roaring_bitmap_remove(): 66.1
roaring_bitmap_remove_range(): 1.7

intvlen=1 density=0.200000 order=DESC
roaring_bitmap_add(): 28.5
roaring_bitmap_add_many(): 35.1
roaring_bitmap_add_bulk(): 24.5
roaring_bitmap_add_range(): 34.3
roaring_bitmap_remove(): 884.0
roaring_bitmap_remove_range(): 48.6
intvlen=4 density=0.200000 order=DESC
roaring_bitmap_add(): 17.1
roaring_bitmap_add_many(): 14.1
roaring_bitmap_add_bulk(): 14.0
roaring_bitmap_add_range(): 8.6
roaring_bitmap_remove(): 610.2
roaring_bitmap_remove_range(): 18.1
intvlen=16 density=0.200000 order=DESC
roaring_bitmap_add(): 15.3
roaring_bitmap_add_many(): 10.1
roaring_bitmap_add_bulk(): 12.2
roaring_bitmap_add_range(): 4.3
roaring_bitmap_remove(): 182.8
roaring_bitmap_remove_range(): 10.4
intvlen=64 density=0.200000 order=DESC
roaring_bitmap_add(): 16.1
roaring_bitmap_add_many(): 11.2
roaring_bitmap_add_bulk(): 13.8
roaring_bitmap_add_range(): 0.9
roaring_bitmap_remove(): 67.2
roaring_bitmap_remove_range(): 2.0

avx512
taskset -c 0 benchmarks/add_benchmark
[cycles/element]
intvlen=1 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 64.7
roaring_bitmap_add_many(): 86.4
roaring_bitmap_add_bulk(): 66.5
roaring_bitmap_add_range(): 82.5
roaring_bitmap_remove(): 852.2
roaring_bitmap_remove_range(): 78.1
intvlen=4 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 25.7
roaring_bitmap_add_many(): 24.6
roaring_bitmap_add_bulk(): 23.0
roaring_bitmap_add_range(): 17.4
roaring_bitmap_remove(): 391.7
roaring_bitmap_remove_range(): 24.5
intvlen=16 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 15.3
roaring_bitmap_add_many(): 11.7
roaring_bitmap_add_bulk(): 13.1
roaring_bitmap_add_range(): 8.1
roaring_bitmap_remove(): 126.0
roaring_bitmap_remove_range(): 13.1
intvlen=64 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 14.5
roaring_bitmap_add_many(): 9.5
roaring_bitmap_add_bulk(): 11.6
roaring_bitmap_add_range(): 2.0
roaring_bitmap_remove(): 57.1
roaring_bitmap_remove_range(): 2.9

intvlen=1 density=0.200000 order=ASC
roaring_bitmap_add(): 8.4
roaring_bitmap_add_many(): 15.0
roaring_bitmap_add_bulk(): 6.9
roaring_bitmap_add_range(): 31.6
roaring_bitmap_remove(): 374.2
roaring_bitmap_remove_range(): 36.9
intvlen=4 density=0.200000 order=ASC
roaring_bitmap_add(): 7.7
roaring_bitmap_add_many(): 6.1
roaring_bitmap_add_bulk(): 6.2
roaring_bitmap_add_range(): 7.9
roaring_bitmap_remove(): 226.5
roaring_bitmap_remove_range(): 12.5
intvlen=16 density=0.200000 order=ASC
roaring_bitmap_add(): 7.4
roaring_bitmap_add_many(): 4.4
roaring_bitmap_add_bulk(): 6.0
roaring_bitmap_add_range(): 2.9
roaring_bitmap_remove(): 103.1
roaring_bitmap_remove_range(): 6.2
intvlen=64 density=0.200000 order=ASC
roaring_bitmap_add(): 7.6
roaring_bitmap_add_many(): 4.0
roaring_bitmap_add_bulk(): 6.3
roaring_bitmap_add_range(): 0.9
roaring_bitmap_remove(): 51.6
roaring_bitmap_remove_range(): 1.4

intvlen=1 density=0.200000 order=DESC
roaring_bitmap_add(): 28.8
roaring_bitmap_add_many(): 35.7
roaring_bitmap_add_bulk(): 24.7
roaring_bitmap_add_range(): 38.2
roaring_bitmap_remove(): 752.7
roaring_bitmap_remove_range(): 46.5
intvlen=4 density=0.200000 order=DESC
roaring_bitmap_add(): 17.5
roaring_bitmap_add_many(): 15.0
roaring_bitmap_add_bulk(): 14.4
roaring_bitmap_add_range(): 9.4
roaring_bitmap_remove(): 466.9
roaring_bitmap_remove_range(): 16.6
intvlen=16 density=0.200000 order=DESC
roaring_bitmap_add(): 15.2
roaring_bitmap_add_many(): 10.9
roaring_bitmap_add_bulk(): 12.2
roaring_bitmap_add_range(): 4.4
roaring_bitmap_remove(): 124.5
roaring_bitmap_remove_range(): 8.9
intvlen=64 density=0.200000 order=DESC
roaring_bitmap_add(): 17.1
roaring_bitmap_add_many(): 12.0
roaring_bitmap_add_bulk(): 14.3
roaring_bitmap_add_range(): 0.9
roaring_bitmap_remove(): 57.1
roaring_bitmap_remove_range(): 1.8

avx2
taskset -c 0 benchmarks/bitset_container_benchmark
bitset container benchmarks
set_test(B): 1.50 cycles per operation
get_test(B): 0.96 cycles per operation
bitset_container_cardinality(B): 201.00 cycles per operation
bitset_container_compute_cardinality(B): 0.40 cycles per operation
unset_test(B): 1.51 cycles per operation

number of values in container = 4096
bitset_container_to_uint32_array(out, Bt, 1234): 0.89 cycles per operation
bitset_container_get bitset_cache_prefetch: 205.69 cycles per operation
bitset_container_get bitset_cache_flush: 327.49 cycles per operation

number of values in container = 8192
bitset_container_to_uint32_array(out, Bt, 1234): 1.16 cycles per operation
bitset_container_get bitset_cache_prefetch: 208.41 cycles per operation
bitset_container_get bitset_cache_flush: 312.30 cycles per operation

number of values in container = 16384
bitset_container_to_uint32_array(out, Bt, 1234): 0.60 cycles per operation
bitset_container_get bitset_cache_prefetch: 210.64 cycles per operation
bitset_container_get bitset_cache_flush: 308.05 cycles per operation

number of values in container = 32768
bitset_container_to_uint32_array(out, Bt, 1234): 0.30 cycles per operation
bitset_container_get bitset_cache_prefetch: 206.42 cycles per operation
bitset_container_get bitset_cache_flush: 309.61 cycles per operation

number of values in container = 65536
bitset_container_to_uint32_array(out, Bt, 1234): 0.17 cycles per operation
bitset_container_get bitset_cache_prefetch: 212.69 cycles per operation
bitset_container_get bitset_cache_flush: 317.35 cycles per operation

Logical operations (time units per single operation):
bitset_container_and_nocard(B1, B2, BO): 507.00 cycles per operation
bitset_container_and(B1, B2, BO): 534.00 cycles per operation
bitset_container_and_justcard(B1, B2): 353.00 cycles per operation
bitset_container_compute_cardinality(BO): 417.00 cycles per operation
bitset_container_or_nocard(B1, B2, BO): 628.00 cycles per operation
bitset_container_or(B1, B2, BO): 528.00 cycles per operation
bitset_container_or_justcard(B1, B2): 351.00 cycles per operation
bitset_container_compute_cardinality(BO): 419.00 cycles per operation

get_cardinality_through_conversion_to_array(B1): 4.28 cycles per operation

avx512
taskset -c 0 benchmarks/bitset_container_benchmark
bitset container benchmarks
set_test(B): 2.40 cycles per operation
get_test(B): 1.14 cycles per operation
bitset_container_cardinality(B): 203.00 cycles per operation
bitset_container_compute_cardinality(B): 0.31 cycles per operation
unset_test(B): 2.41 cycles per operation

number of values in container = 4096
bitset_container_to_uint32_array(out, Bt, 1234): 0.96 cycles per operation
bitset_container_get bitset_cache_prefetch: 209.76 cycles per operation
bitset_container_get bitset_cache_flush: 207.77 cycles per operation

number of values in container = 8192
bitset_container_to_uint32_array(out, Bt, 1234): 0.73 cycles per operation
bitset_container_get bitset_cache_prefetch: 209.72 cycles per operation
bitset_container_get bitset_cache_flush: 208.29 cycles per operation

number of values in container = 16384
bitset_container_to_uint32_array(out, Bt, 1234): 0.36 cycles per operation
bitset_container_get bitset_cache_prefetch: 210.37 cycles per operation
bitset_container_get bitset_cache_flush: 207.80 cycles per operation

number of values in container = 32768
bitset_container_to_uint32_array(out, Bt, 1234): 0.21 cycles per operation
bitset_container_get bitset_cache_prefetch: 208.17 cycles per operation
bitset_container_get bitset_cache_flush: 208.69 cycles per operation

number of values in container = 65536
bitset_container_to_uint32_array(out, Bt, 1234): 0.13 cycles per operation
bitset_container_get bitset_cache_prefetch: 207.84 cycles per operation
bitset_container_get bitset_cache_flush: 208.03 cycles per operation

Logical operations (time units per single operation):
bitset_container_and_nocard(B1, B2, BO): 392.00 cycles per operation
bitset_container_and(B1, B2, BO): 424.00 cycles per operation
bitset_container_and_justcard(B1, B2): 333.00 cycles per operation
bitset_container_compute_cardinality(BO): 322.00 cycles per operation
bitset_container_or_nocard(B1, B2, BO): 392.00 cycles per operation
bitset_container_or(B1, B2, BO): 424.00 cycles per operation
bitset_container_or_justcard(B1, B2): 337.00 cycles per operation
bitset_container_compute_cardinality(BO): 316.00 cycles per operation

get_cardinality_through_conversion_to_array(B1): 4.30 cycles per operation

chenyang8094 · 2023-03-13T07:28:06Z

exciting work👍

lemire · 2023-03-13T17:13:03Z

include/roaring/containers/bitset.h

@@ -64,7 +64,7 @@ bitset_container_t *bitset_container_clone(const bitset_container_t *src);
 void bitset_container_set_range(bitset_container_t *bitset, uint32_t begin,
                                uint32_t end);

-#if defined(CROARING_ASMBITMANIPOPTIMIZATION) && defined(__AVX2__)
+#if defined(CROARING_ASMBITMANIPOPTIMIZATION) && !defined(ROARING_DISABLE_AVX)


The ASM_SHIFT_RIGHT routine below expects BMI2, I believe.

So not only will it only run under x64, but it also requires BMI2. It is satisfied if __AVX2__ is defined, but the !defined(ROARING_DISABLE_AVX) is insufficient. At the very least, we need to check that CROARING_IS_X64 is defined, and then, before using it, we need to do a runtime check (e.g., croaring_avx2()). However, it is dubious whether you can do runtime dispatching at the granularity level of such a tiny function and see benefits. Now, it coudb be fruitfully used within a larger function that repeatedly does bitset_container_set, then you can amortize the runtime dispatching. Otherwise, I would simply not bother.

I agree, runtime dispatching maybe get little benefits. So this change will revert

lemire · 2023-03-13T17:13:48Z

include/roaring/isadetection.h

@@ -71,6 +71,7 @@ enum croaring_instruction_set {
  CROARING_BMI1 = 0x20,
  CROARING_BMI2 = 0x40,
  CROARING_ALTIVEC = 0x80,
+  CROARING_AVX512 =0x100,


Please insert a space before 0x100 to make the code look consistent.

i will change

lemire · 2023-03-13T17:22:15Z

include/roaring/portability.h

@@ -328,6 +328,7 @@ static inline int hamming(uint64_t x) {
 #endif

 #define CROARING_TARGET_AVX2 CROARING_TARGET_REGION("avx2,bmi,pclmul,lzcnt")
+#define CROARING_TARGET_AVX512 CROARING_TARGET_REGION("bmi2,avx512f,avx512bw,avx512dq")


It looks to me like you only check for the AVX-512 bit and then assume that avx512bw and avx512dq follow, but that's not true on Xeon Phi accelerators (and I own one of those, they do exist).

Furthermore, you start the comment with a reference to icelake.

So I recommend going all out and checking for everything up to VBMI2...

https://github.com/simdjson/simdjson/blob/005054916d2d5b89c1a2f7916704444b054e820c/include/simdjson/internal/isadetection.h#L70-L78

Then you can target the whole range...

"avx512f,avx512dq,avx512cd,avx512bw,avx512vbmi,avx512vbmi2,avx512vl,avx2,bmi,pclmul,lzcnt,avx512vpopcntdq"

This opens up crazily fast bitmap decoding...

https://lemire.me/blog/2022/05/10/faster-bitset-decoding-using-intel-avx-512/

You don't need to implement that in this PR, but targeting Zen 4 and Ice Lake right away, allows us to add this easily in the future.

Furthermore, Ice Lake and Zen 4 do not suffer from frequency throttling, like the previous AVX-512-capable chips. This makes it much more interesting to use AVX-512 just on recent chips capable of VBMI2.

i will add checking

lemire · 2023-03-13T17:27:35Z

include/roaring/bitset_util.h

@@ -569,6 +572,282 @@ CROARING_TARGET_AVX2
 AVXPOPCNTFNC(andnot, _mm256_andnot_si256)
 CROARING_UNTARGET_REGION



Why don't we go straight to requiring VPOPCNTDQ which both Ice Lake and Zen 4 support. Then we can simply do _mm512_popcnt_epi64.

I agree. how about implement in the next release? so this release can apply servers not only icelake but also casalake...

lemire · 2023-03-13T17:42:24Z

include/roaring/isadetection.h

@@ -131,6 +132,7 @@ static inline uint32_t dynamic_croaring_detect_supported_architectures() {
  static uint32_t cpuid_avx2_bit = 1 << 5;      ///< @private Bit 5 of EBX for EAX=0x7
  static uint32_t cpuid_bmi1_bit = 1 << 3;      ///< @private bit 3 of EBX for EAX=0x7
  static uint32_t cpuid_bmi2_bit = 1 << 8;      ///< @private bit 8 of EBX for EAX=0x7
+  static uint32_t cpuid_avx512f_bit = 1 << 16;  ///< @private bit 16 of EBX for EAX=0x7


I recommend checking for VBMI2 and VPOPCNTDQ as well.

how about next release?

src/bitset_util.c

lemire · 2023-03-13T17:46:52Z

I agree that this is very exciting work.

Let us make this happen.

@wanweiqiangintel might be interested as well.

lemire · 2023-03-13T17:48:53Z

@huihan365 When compiling under Visual Studio to a 32-bit target, some code seems to expect croaring_avx512 to be present.

lemire · 2023-03-13T17:49:40Z

Under the macos build, we are getting illegal instructions which might be related to bitset code I commented: https://github.com/RoaringBitmap/CRoaring/actions/runs/4403958055/jobs/7722282318

lemire · 2023-03-13T20:00:50Z

src/containers/bitset.c

+  const __m512i *ptr1 = (const __m512i*)container1->words;
+  const __m512i *ptr2 = (const __m512i*)container2->words;
+  for (size_t i = 0; i < BITSET_CONTAINER_SIZE_IN_WORDS*sizeof(uint64_t)/64; i++) {
+      __m512i r1 = _mm512_loadu_si512(ptr1+i);


Can someone see what the github-code-scanning bot found? It does not look like a bug to me. Maybe I am misreading the code? It looks correct at a glance.

src/containers/bitset.c

+  const __m512i *ptr2 = (const __m512i*)container2->words;
+  for (size_t i = 0; i < BITSET_CONTAINER_SIZE_IN_WORDS*sizeof(uint64_t)/64; i++) {
+      __m512i r1 = _mm512_loadu_si512(ptr1+i);
+      __m512i r2 = _mm512_loadu_si512(ptr2+i);


src/bitset_util.c

lemire · 2023-03-13T18:06:31Z

Note: I am committed to make this happen. Please review my comments.

CharlesChen888 · 2023-03-14T01:47:34Z

It is exciting to see this PR. Looks like you have implemented AVX512 calculation between bitset containers. Have you done or considered applying AVX512 to array containers?

lemire · 2023-03-14T01:50:10Z

@CharlesChen888 Note that we can use multiple rounds of PRs to get all the performance out of AVX-512. :-)

lemire · 2023-03-14T16:23:00Z

@huihan365

I agree that we do not need to get everything right or perfect at once. You have done a lot of work and I don't want to hold on to the PR to long.

However, I'd like to discuss this point...

I agree. how about implement in the next release? so this release can apply servers not only icelake but also casalake...

Probably you mean cascade lake... that is, essentially skylake.

Please read this part of the CRoaring documentation...

CRoaring/README.md

Line 592 in 84fe3c8

    
           Our AVX2 code does not use floating-point numbers or multiplications, so it is not subject to turbo frequency throttling on many-core Intel processors.

Notice "so it is not subject to turbo frequency throttling on many-core Intel processors."

As you are no doubt aware, prior to Rocket Lake and Ice Lake, using AVX-512 on Intel processors could lead to severe frequency throttling. I have tried to document it the best I could, based on my experience... https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

Even Intel documentation, at the time, urged engineers to benchmark and assess the benefits. It is fairly complicated...

On server-class processor, the downclocking is determined on a per-core basis based on the license and the total number of active cores, on the same CPU socket, irrespective of the license of the other cores. That is, to determine the frequency of core under downclocking, you need only to know its license (determined by the type of instructions it runs) and count the number of cores where code is running. Thus you cannot downclock other cores on the same socket, other than the sibling logical core when hyperthreading is used, merely by running heavy and sustained AVX-512 instructions on one core. If you can isolate your heavy numerical work on a few cores (or just one), then the downclocking is limited to these cores. On linux, you can control which cores are running your processes using tools such as taskset or numactl.

Thankfully, recent Intel servers do not have this issue.

You can check my answer to @wanweiqiangintel on a related PR in another project...

simdjson/simdjson#1960

Even Intel would prefer that users do not get bad surprises.

So here is what I propose, if you want to support pre-Ice Lake processor: AVX-512 should be disabled by default on these processors. Maybe a compilation flag (macro) could change this default behaviour.

Only on recent Intel processors (where VBMI2 is available) should AVX-512 be enabled by default.

Now, this makes the engineering more complicated because instead of having to support fallback, AVX2 and AVX-512, are going to have to support fallback, AVX2, AVX-512 and AVX-512-VBMI2.

I submit to you that it is probably not worth it. It is going to be more effort for relatively little gain in the sense that Intel will quickly move the AVX-512 support to something like Ice Lake or better (that is, VBMI2 + no throttling) across all its server chips.

I am aware that Intel still sells pre-Ice Lake server chips... but I submit to you that it is selling these chips primarily to customers who have little interest for AVX-512.

So my strong suggestion is that we only enable the AVX-512 code when VBMI2 is available. That's where AVX-512 really shines. We won't have to warn users about complicated licensing issues (frequency throttling) and we can do all sorts of nifty tricks.

My recommandation is not that you update your code (e.g., use VPOPCNTDQ and VBMI2) in this PR. It is perfectly acceptable to proceed in steps. However, please consider my point about requiring that the processor has VBMI2. Think about how much simpler it will make future work, and about how it does not require changing out documentation to warn people about frequency throttling.

lemire · 2023-03-14T16:30:37Z

@huihan365 In effect, short of the small functional changes, I am only suggesting that you check for VBMI2 to enable the AVX-512 kernel.

So I am only suggesting you change a few lines. Your PR is basically good as-is.

lemire · 2023-03-15T14:13:02Z

Running tests.

lemire · 2023-03-15T14:14:27Z

include/roaring/isadetection.h

 #elif defined(__AVX2__)
 static inline bool croaring_avx2() {
  return true;
 }
+static inline bool croaring_avx512() {


I don't think that this is correct. If AVX2 is defined, it could still be true that AVX-512 is supported.

I agree, but if AVX-512 is supported. the upper AVX-512 branch will be compiled.

include/roaring/isadetection.h

include/roaring/portability.h

src/containers/bitset.c

include/roaring/isadetection.h

src/containers/bitset.c

lemire · 2023-03-20T20:02:34Z

@huihan365 I will fix the remaining issues.

lemire · 2023-03-20T21:29:40Z

@huihan365 Please see #451 where I have applied various small fixes, mostly to help portability.

lemire · 2023-03-21T02:56:13Z

Closing in favour of #451

huihan365 force-pushed the avx512_perf branch from c2fc19a to 99b40ea Compare March 13, 2023 01:59

add avx512 optimization

99b40ea

lemire reviewed Mar 13, 2023

View reviewed changes

src/bitset_util.c Outdated Show resolved Hide resolved

github-advanced-security bot found potential problems Mar 13, 2023

View reviewed changes

huihan365 force-pushed the avx512_perf branch from e1b6313 to 0aafa11 Compare March 15, 2023 10:59

support bmi2 when use avx

cbdb8bd

huihan365 force-pushed the avx512_perf branch 2 times, most recently from 2d93bc9 to cb0b4c2 Compare March 15, 2023 11:30

support bmi2 when use avx

604ff9f

lemire reviewed Mar 15, 2023

View reviewed changes

include/roaring/isadetection.h Show resolved Hide resolved

lemire reviewed Mar 15, 2023

View reviewed changes

include/roaring/portability.h Outdated Show resolved Hide resolved

lemire reviewed Mar 15, 2023

View reviewed changes

src/containers/bitset.c Show resolved Hide resolved

github-advanced-security bot found potential problems Mar 15, 2023

View reviewed changes

include/roaring/isadetection.h Fixed Show resolved Hide resolved

lemire reviewed Mar 15, 2023

View reviewed changes

src/containers/bitset.c Show resolved Hide resolved

huihan365 force-pushed the avx512_perf branch from cb0b4c2 to e67743f Compare March 16, 2023 03:09

update

e67743f

lemire mentioned this pull request Mar 20, 2023

add avx512 support #451

Merged

lemire closed this Mar 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add avx512 optimization #446

add avx512 optimization #446

huihan365 commented Mar 12, 2023

chenyang8094 commented Mar 13, 2023

lemire Mar 13, 2023 •

edited

huihan365 Mar 14, 2023 •

edited

lemire Mar 13, 2023

huihan365 Mar 14, 2023

lemire Mar 13, 2023 •

edited

huihan365 Mar 14, 2023

lemire Mar 13, 2023

huihan365 Mar 14, 2023

lemire Mar 13, 2023

huihan365 Mar 14, 2023

lemire commented Mar 13, 2023

lemire commented Mar 13, 2023

lemire commented Mar 13, 2023

lemire Mar 13, 2023

lemire commented Mar 13, 2023

CharlesChen888 commented Mar 14, 2023

lemire commented Mar 14, 2023

lemire commented Mar 14, 2023

lemire commented Mar 14, 2023

lemire commented Mar 15, 2023

lemire Mar 15, 2023

huihan365 Mar 16, 2023

lemire commented Mar 20, 2023

lemire commented Mar 20, 2023

lemire commented Mar 21, 2023

		@@ -569,6 +572,282 @@ CROARING_TARGET_AVX2
		AVXPOPCNTFNC(andnot, _mm256_andnot_si256)
		CROARING_UNTARGET_REGION

add avx512 optimization #446

add avx512 optimization #446

Conversation

huihan365 commented Mar 12, 2023

chenyang8094 commented Mar 13, 2023

lemire Mar 13, 2023 • edited

Choose a reason for hiding this comment

huihan365 Mar 14, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lemire Mar 13, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lemire commented Mar 13, 2023

lemire commented Mar 13, 2023

lemire commented Mar 13, 2023

Choose a reason for hiding this comment

lemire commented Mar 13, 2023

CharlesChen888 commented Mar 14, 2023

lemire commented Mar 14, 2023

lemire commented Mar 14, 2023

lemire commented Mar 14, 2023

lemire commented Mar 15, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lemire commented Mar 20, 2023

lemire commented Mar 20, 2023

lemire commented Mar 21, 2023

lemire Mar 13, 2023 •

edited

huihan365 Mar 14, 2023 •

edited

lemire Mar 13, 2023 •

edited