New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add avx512 optimization #446
Conversation
exciting work👍 |
include/roaring/containers/bitset.h
Outdated
@@ -64,7 +64,7 @@ bitset_container_t *bitset_container_clone(const bitset_container_t *src); | |||
void bitset_container_set_range(bitset_container_t *bitset, uint32_t begin, | |||
uint32_t end); | |||
|
|||
#if defined(CROARING_ASMBITMANIPOPTIMIZATION) && defined(__AVX2__) | |||
#if defined(CROARING_ASMBITMANIPOPTIMIZATION) && !defined(ROARING_DISABLE_AVX) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ASM_SHIFT_RIGHT routine below expects BMI2, I believe.
So not only will it only run under x64, but it also requires BMI2. It is satisfied if __AVX2__
is defined, but the !defined(ROARING_DISABLE_AVX)
is insufficient. At the very least, we need to check that CROARING_IS_X64
is defined, and then, before using it, we need to do a runtime check (e.g., croaring_avx2()). However, it is dubious whether you can do runtime dispatching at the granularity level of such a tiny function and see benefits. Now, it coudb be fruitfully used within a larger function that repeatedly does bitset_container_set, then you can amortize the runtime dispatching. Otherwise, I would simply not bother.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, runtime dispatching maybe get little benefits. So this change will revert
include/roaring/isadetection.h
Outdated
@@ -71,6 +71,7 @@ enum croaring_instruction_set { | |||
CROARING_BMI1 = 0x20, | |||
CROARING_BMI2 = 0x40, | |||
CROARING_ALTIVEC = 0x80, | |||
CROARING_AVX512 =0x100, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please insert a space before 0x100 to make the code look consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will change
include/roaring/portability.h
Outdated
@@ -328,6 +328,7 @@ static inline int hamming(uint64_t x) { | |||
#endif | |||
|
|||
#define CROARING_TARGET_AVX2 CROARING_TARGET_REGION("avx2,bmi,pclmul,lzcnt") | |||
#define CROARING_TARGET_AVX512 CROARING_TARGET_REGION("bmi2,avx512f,avx512bw,avx512dq") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to me like you only check for the AVX-512 bit and then assume that avx512bw and avx512dq follow, but that's not true on Xeon Phi accelerators (and I own one of those, they do exist).
Furthermore, you start the comment with a reference to icelake.
So I recommend going all out and checking for everything up to VBMI2...
Then you can target the whole range...
"avx512f,avx512dq,avx512cd,avx512bw,avx512vbmi,avx512vbmi2,avx512vl,avx2,bmi,pclmul,lzcnt,avx512vpopcntdq"
This opens up crazily fast bitmap decoding...
https://lemire.me/blog/2022/05/10/faster-bitset-decoding-using-intel-avx-512/
You don't need to implement that in this PR, but targeting Zen 4 and Ice Lake right away, allows us to add this easily in the future.
Furthermore, Ice Lake and Zen 4 do not suffer from frequency throttling, like the previous AVX-512-capable chips. This makes it much more interesting to use AVX-512 just on recent chips capable of VBMI2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will add checking
@@ -569,6 +572,282 @@ CROARING_TARGET_AVX2 | |||
AVXPOPCNTFNC(andnot, _mm256_andnot_si256) | |||
CROARING_UNTARGET_REGION | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we go straight to requiring VPOPCNTDQ which both Ice Lake and Zen 4 support. Then we can simply do _mm512_popcnt_epi64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. how about implement in the next release? so this release can apply servers not only icelake but also casalake...
@@ -131,6 +132,7 @@ static inline uint32_t dynamic_croaring_detect_supported_architectures() { | |||
static uint32_t cpuid_avx2_bit = 1 << 5; ///< @private Bit 5 of EBX for EAX=0x7 | |||
static uint32_t cpuid_bmi1_bit = 1 << 3; ///< @private bit 3 of EBX for EAX=0x7 | |||
static uint32_t cpuid_bmi2_bit = 1 << 8; ///< @private bit 8 of EBX for EAX=0x7 | |||
static uint32_t cpuid_avx512f_bit = 1 << 16; ///< @private bit 16 of EBX for EAX=0x7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend checking for VBMI2 and VPOPCNTDQ as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about next release?
I agree that this is very exciting work. Let us make this happen. @wanweiqiangintel might be interested as well. |
@huihan365 When compiling under Visual Studio to a 32-bit target, some code seems to expect croaring_avx512 to be present. |
Under the macos build, we are getting illegal instructions which might be related to bitset code I commented: https://github.com/RoaringBitmap/CRoaring/actions/runs/4403958055/jobs/7722282318 |
const __m512i *ptr1 = (const __m512i*)container1->words; | ||
const __m512i *ptr2 = (const __m512i*)container2->words; | ||
for (size_t i = 0; i < BITSET_CONTAINER_SIZE_IN_WORDS*sizeof(uint64_t)/64; i++) { | ||
__m512i r1 = _mm512_loadu_si512(ptr1+i); |
Check failure
Code scanning / CodeQL
Suspicious pointer scaling High
unsigned long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can someone see what the github-code-scanning bot found? It does not look like a bug to me. Maybe I am misreading the code? It looks correct at a glance.
const __m512i *ptr2 = (const __m512i*)container2->words; | ||
for (size_t i = 0; i < BITSET_CONTAINER_SIZE_IN_WORDS*sizeof(uint64_t)/64; i++) { | ||
__m512i r1 = _mm512_loadu_si512(ptr1+i); | ||
__m512i r2 = _mm512_loadu_si512(ptr2+i); |
Check failure
Code scanning / CodeQL
Suspicious pointer scaling High
unsigned long
Note: I am committed to make this happen. Please review my comments. |
It is exciting to see this PR. Looks like you have implemented AVX512 calculation between bitset containers. Have you done or considered applying AVX512 to array containers? |
@CharlesChen888 Note that we can use multiple rounds of PRs to get all the performance out of AVX-512. :-) |
I agree that we do not need to get everything right or perfect at once. You have done a lot of work and I don't want to hold on to the PR to long. However, I'd like to discuss this point...
Probably you mean cascade lake... that is, essentially skylake. Please read this part of the CRoaring documentation... Line 592 in 84fe3c8
Notice "so it is not subject to turbo frequency throttling on many-core Intel processors." As you are no doubt aware, prior to Rocket Lake and Ice Lake, using AVX-512 on Intel processors could lead to severe frequency throttling. I have tried to document it the best I could, based on my experience... https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/ Even Intel documentation, at the time, urged engineers to benchmark and assess the benefits. It is fairly complicated...
Thankfully, recent Intel servers do not have this issue. You can check my answer to @wanweiqiangintel on a related PR in another project... Even Intel would prefer that users do not get bad surprises. So here is what I propose, if you want to support pre-Ice Lake processor: AVX-512 should be disabled by default on these processors. Maybe a compilation flag (macro) could change this default behaviour. Only on recent Intel processors (where VBMI2 is available) should AVX-512 be enabled by default. Now, this makes the engineering more complicated because instead of having to support fallback, AVX2 and AVX-512, are going to have to support fallback, AVX2, AVX-512 and AVX-512-VBMI2. I submit to you that it is probably not worth it. It is going to be more effort for relatively little gain in the sense that Intel will quickly move the AVX-512 support to something like Ice Lake or better (that is, VBMI2 + no throttling) across all its server chips. I am aware that Intel still sells pre-Ice Lake server chips... but I submit to you that it is selling these chips primarily to customers who have little interest for AVX-512. So my strong suggestion is that we only enable the AVX-512 code when VBMI2 is available. That's where AVX-512 really shines. We won't have to warn users about complicated licensing issues (frequency throttling) and we can do all sorts of nifty tricks. My recommandation is not that you update your code (e.g., use VPOPCNTDQ and VBMI2) in this PR. It is perfectly acceptable to proceed in steps. However, please consider my point about requiring that the processor has VBMI2. Think about how much simpler it will make future work, and about how it does not require changing out documentation to warn people about frequency throttling. |
@huihan365 In effect, short of the small functional changes, I am only suggesting that you check for VBMI2 to enable the AVX-512 kernel. So I am only suggesting you change a few lines. Your PR is basically good as-is. |
2d93bc9
to
cb0b4c2
Compare
Running tests. |
#elif defined(__AVX2__) | ||
static inline bool croaring_avx2() { | ||
return true; | ||
} | ||
static inline bool croaring_avx512() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that this is correct. If AVX2 is defined, it could still be true that AVX-512 is supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but if AVX-512 is supported. the upper AVX-512 branch will be compiled.
@huihan365 I will fix the remaining issues. |
@huihan365 Please see #451 where I have applied various small fixes, mostly to help portability. |
Closing in favour of #451 |
Nowadays icelake servers are widely used. this patch add avx512 optimization for better performance than avx2. croaring_avx512() will be true if platform support avx512 during runtime dispatch.
Signed-off-by: lifan lin(lifan.lin@intel.com)
hui han(hui.han@intel.com)
here is the benchmark result(platform icelake 6330),
Not all operations have improvement, but some operations have obvious performance boost.
avx2
taskset -c 0 benchmarks/add_benchmark
[cycles/element]
intvlen=1 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 62.9
roaring_bitmap_add_many(): 85.5
roaring_bitmap_add_bulk(): 65.4
roaring_bitmap_add_range(): 79.4
roaring_bitmap_remove(): 993.9
roaring_bitmap_remove_range(): 81.6
intvlen=4 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 24.7
roaring_bitmap_add_many(): 23.7
roaring_bitmap_add_bulk(): 23.1
roaring_bitmap_add_range(): 16.6
roaring_bitmap_remove(): 548.9
roaring_bitmap_remove_range(): 26.8
intvlen=16 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 15.0
roaring_bitmap_add_many(): 10.8
roaring_bitmap_add_bulk(): 12.6
roaring_bitmap_add_range(): 8.1
roaring_bitmap_remove(): 191.8
roaring_bitmap_remove_range(): 15.4
intvlen=64 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 13.8
roaring_bitmap_add_many(): 8.7
roaring_bitmap_add_bulk(): 11.4
roaring_bitmap_add_range(): 2.0
roaring_bitmap_remove(): 70.2
roaring_bitmap_remove_range(): 3.4
intvlen=1 density=0.200000 order=ASC
roaring_bitmap_add(): 7.5
roaring_bitmap_add_many(): 13.9
roaring_bitmap_add_bulk(): 6.4
roaring_bitmap_add_range(): 27.8
roaring_bitmap_remove(): 606.3
roaring_bitmap_remove_range(): 41.5
intvlen=4 density=0.200000 order=ASC
roaring_bitmap_add(): 7.1
roaring_bitmap_add_many(): 5.3
roaring_bitmap_add_bulk(): 5.8
roaring_bitmap_add_range(): 6.9
roaring_bitmap_remove(): 389.0
roaring_bitmap_remove_range(): 15.1
intvlen=16 density=0.200000 order=ASC
roaring_bitmap_add(): 6.9
roaring_bitmap_add_many(): 3.6
roaring_bitmap_add_bulk(): 5.7
roaring_bitmap_add_range(): 2.7
roaring_bitmap_remove(): 164.3
roaring_bitmap_remove_range(): 8.2
intvlen=64 density=0.200000 order=ASC
roaring_bitmap_add(): 6.9
roaring_bitmap_add_many(): 3.3
roaring_bitmap_add_bulk(): 5.8
roaring_bitmap_add_range(): 0.8
roaring_bitmap_remove(): 66.1
roaring_bitmap_remove_range(): 1.7
intvlen=1 density=0.200000 order=DESC
roaring_bitmap_add(): 28.5
roaring_bitmap_add_many(): 35.1
roaring_bitmap_add_bulk(): 24.5
roaring_bitmap_add_range(): 34.3
roaring_bitmap_remove(): 884.0
roaring_bitmap_remove_range(): 48.6
intvlen=4 density=0.200000 order=DESC
roaring_bitmap_add(): 17.1
roaring_bitmap_add_many(): 14.1
roaring_bitmap_add_bulk(): 14.0
roaring_bitmap_add_range(): 8.6
roaring_bitmap_remove(): 610.2
roaring_bitmap_remove_range(): 18.1
intvlen=16 density=0.200000 order=DESC
roaring_bitmap_add(): 15.3
roaring_bitmap_add_many(): 10.1
roaring_bitmap_add_bulk(): 12.2
roaring_bitmap_add_range(): 4.3
roaring_bitmap_remove(): 182.8
roaring_bitmap_remove_range(): 10.4
intvlen=64 density=0.200000 order=DESC
roaring_bitmap_add(): 16.1
roaring_bitmap_add_many(): 11.2
roaring_bitmap_add_bulk(): 13.8
roaring_bitmap_add_range(): 0.9
roaring_bitmap_remove(): 67.2
roaring_bitmap_remove_range(): 2.0
avx512
taskset -c 0 benchmarks/add_benchmark
[cycles/element]
intvlen=1 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 64.7
roaring_bitmap_add_many(): 86.4
roaring_bitmap_add_bulk(): 66.5
roaring_bitmap_add_range(): 82.5
roaring_bitmap_remove(): 852.2
roaring_bitmap_remove_range(): 78.1
intvlen=4 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 25.7
roaring_bitmap_add_many(): 24.6
roaring_bitmap_add_bulk(): 23.0
roaring_bitmap_add_range(): 17.4
roaring_bitmap_remove(): 391.7
roaring_bitmap_remove_range(): 24.5
intvlen=16 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 15.3
roaring_bitmap_add_many(): 11.7
roaring_bitmap_add_bulk(): 13.1
roaring_bitmap_add_range(): 8.1
roaring_bitmap_remove(): 126.0
roaring_bitmap_remove_range(): 13.1
intvlen=64 density=0.200000 order=SHUFFLE
roaring_bitmap_add(): 14.5
roaring_bitmap_add_many(): 9.5
roaring_bitmap_add_bulk(): 11.6
roaring_bitmap_add_range(): 2.0
roaring_bitmap_remove(): 57.1
roaring_bitmap_remove_range(): 2.9
intvlen=1 density=0.200000 order=ASC
roaring_bitmap_add(): 8.4
roaring_bitmap_add_many(): 15.0
roaring_bitmap_add_bulk(): 6.9
roaring_bitmap_add_range(): 31.6
roaring_bitmap_remove(): 374.2
roaring_bitmap_remove_range(): 36.9
intvlen=4 density=0.200000 order=ASC
roaring_bitmap_add(): 7.7
roaring_bitmap_add_many(): 6.1
roaring_bitmap_add_bulk(): 6.2
roaring_bitmap_add_range(): 7.9
roaring_bitmap_remove(): 226.5
roaring_bitmap_remove_range(): 12.5
intvlen=16 density=0.200000 order=ASC
roaring_bitmap_add(): 7.4
roaring_bitmap_add_many(): 4.4
roaring_bitmap_add_bulk(): 6.0
roaring_bitmap_add_range(): 2.9
roaring_bitmap_remove(): 103.1
roaring_bitmap_remove_range(): 6.2
intvlen=64 density=0.200000 order=ASC
roaring_bitmap_add(): 7.6
roaring_bitmap_add_many(): 4.0
roaring_bitmap_add_bulk(): 6.3
roaring_bitmap_add_range(): 0.9
roaring_bitmap_remove(): 51.6
roaring_bitmap_remove_range(): 1.4
intvlen=1 density=0.200000 order=DESC
roaring_bitmap_add(): 28.8
roaring_bitmap_add_many(): 35.7
roaring_bitmap_add_bulk(): 24.7
roaring_bitmap_add_range(): 38.2
roaring_bitmap_remove(): 752.7
roaring_bitmap_remove_range(): 46.5
intvlen=4 density=0.200000 order=DESC
roaring_bitmap_add(): 17.5
roaring_bitmap_add_many(): 15.0
roaring_bitmap_add_bulk(): 14.4
roaring_bitmap_add_range(): 9.4
roaring_bitmap_remove(): 466.9
roaring_bitmap_remove_range(): 16.6
intvlen=16 density=0.200000 order=DESC
roaring_bitmap_add(): 15.2
roaring_bitmap_add_many(): 10.9
roaring_bitmap_add_bulk(): 12.2
roaring_bitmap_add_range(): 4.4
roaring_bitmap_remove(): 124.5
roaring_bitmap_remove_range(): 8.9
intvlen=64 density=0.200000 order=DESC
roaring_bitmap_add(): 17.1
roaring_bitmap_add_many(): 12.0
roaring_bitmap_add_bulk(): 14.3
roaring_bitmap_add_range(): 0.9
roaring_bitmap_remove(): 57.1
roaring_bitmap_remove_range(): 1.8
avx2
taskset -c 0 benchmarks/bitset_container_benchmark
bitset container benchmarks
set_test(B): 1.50 cycles per operation
get_test(B): 0.96 cycles per operation
bitset_container_cardinality(B): 201.00 cycles per operation
bitset_container_compute_cardinality(B): 0.40 cycles per operation
unset_test(B): 1.51 cycles per operation
number of values in container = 4096
bitset_container_to_uint32_array(out, Bt, 1234): 0.89 cycles per operation
bitset_container_get bitset_cache_prefetch: 205.69 cycles per operation
bitset_container_get bitset_cache_flush: 327.49 cycles per operation
number of values in container = 8192
bitset_container_to_uint32_array(out, Bt, 1234): 1.16 cycles per operation
bitset_container_get bitset_cache_prefetch: 208.41 cycles per operation
bitset_container_get bitset_cache_flush: 312.30 cycles per operation
number of values in container = 16384
bitset_container_to_uint32_array(out, Bt, 1234): 0.60 cycles per operation
bitset_container_get bitset_cache_prefetch: 210.64 cycles per operation
bitset_container_get bitset_cache_flush: 308.05 cycles per operation
number of values in container = 32768
bitset_container_to_uint32_array(out, Bt, 1234): 0.30 cycles per operation
bitset_container_get bitset_cache_prefetch: 206.42 cycles per operation
bitset_container_get bitset_cache_flush: 309.61 cycles per operation
number of values in container = 65536
bitset_container_to_uint32_array(out, Bt, 1234): 0.17 cycles per operation
bitset_container_get bitset_cache_prefetch: 212.69 cycles per operation
bitset_container_get bitset_cache_flush: 317.35 cycles per operation
Logical operations (time units per single operation):
bitset_container_and_nocard(B1, B2, BO): 507.00 cycles per operation
bitset_container_and(B1, B2, BO): 534.00 cycles per operation
bitset_container_and_justcard(B1, B2): 353.00 cycles per operation
bitset_container_compute_cardinality(BO): 417.00 cycles per operation
bitset_container_or_nocard(B1, B2, BO): 628.00 cycles per operation
bitset_container_or(B1, B2, BO): 528.00 cycles per operation
bitset_container_or_justcard(B1, B2): 351.00 cycles per operation
bitset_container_compute_cardinality(BO): 419.00 cycles per operation
get_cardinality_through_conversion_to_array(B1): 4.28 cycles per operation
avx512
taskset -c 0 benchmarks/bitset_container_benchmark
bitset container benchmarks
set_test(B): 2.40 cycles per operation
get_test(B): 1.14 cycles per operation
bitset_container_cardinality(B): 203.00 cycles per operation
bitset_container_compute_cardinality(B): 0.31 cycles per operation
unset_test(B): 2.41 cycles per operation
number of values in container = 4096
bitset_container_to_uint32_array(out, Bt, 1234): 0.96 cycles per operation
bitset_container_get bitset_cache_prefetch: 209.76 cycles per operation
bitset_container_get bitset_cache_flush: 207.77 cycles per operation
number of values in container = 8192
bitset_container_to_uint32_array(out, Bt, 1234): 0.73 cycles per operation
bitset_container_get bitset_cache_prefetch: 209.72 cycles per operation
bitset_container_get bitset_cache_flush: 208.29 cycles per operation
number of values in container = 16384
bitset_container_to_uint32_array(out, Bt, 1234): 0.36 cycles per operation
bitset_container_get bitset_cache_prefetch: 210.37 cycles per operation
bitset_container_get bitset_cache_flush: 207.80 cycles per operation
number of values in container = 32768
bitset_container_to_uint32_array(out, Bt, 1234): 0.21 cycles per operation
bitset_container_get bitset_cache_prefetch: 208.17 cycles per operation
bitset_container_get bitset_cache_flush: 208.69 cycles per operation
number of values in container = 65536
bitset_container_to_uint32_array(out, Bt, 1234): 0.13 cycles per operation
bitset_container_get bitset_cache_prefetch: 207.84 cycles per operation
bitset_container_get bitset_cache_flush: 208.03 cycles per operation
Logical operations (time units per single operation):
bitset_container_and_nocard(B1, B2, BO): 392.00 cycles per operation
bitset_container_and(B1, B2, BO): 424.00 cycles per operation
bitset_container_and_justcard(B1, B2): 333.00 cycles per operation
bitset_container_compute_cardinality(BO): 322.00 cycles per operation
bitset_container_or_nocard(B1, B2, BO): 392.00 cycles per operation
bitset_container_or(B1, B2, BO): 424.00 cycles per operation
bitset_container_or_justcard(B1, B2): 337.00 cycles per operation
bitset_container_compute_cardinality(BO): 316.00 cycles per operation
get_cardinality_through_conversion_to_array(B1): 4.30 cycles per operation