Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance of avx512 bit ops and popcounts #41

Open
brentp opened this issue Nov 7, 2019 · 4 comments
Open

performance of avx512 bit ops and popcounts #41

brentp opened this issue Nov 7, 2019 · 4 comments

Comments

@brentp
Copy link

brentp commented Nov 7, 2019

as requested, I am opening an issue.
somalier calculates relatedness between pairs of samples using bitwise operations and popcounts here

where genotypes is effectively:

type genotypes* = tuple[hom_ref:seq[uint64], het:seq[uint64], hom_alt:seq[uint64]]

and currently, those seqs have a len of about 300.
so for 10K samples, doing relatedness for all-vs-all, it will do ~50 million calls to the IBS function linked above.

the "simple" avx512 version is here which is identical in speed to the version currently in somalier

the unrolled version is here. this gives ~10-15% speedup.

this problem is embarrassingly parallel and it's currently single-threaded, so I could also improve speed that way, but I was interested to explore the simd stuff.

I'd be interested to hear your thoughts, maybe the "parallelization" scheme should be changed to load 8 samples at once instead of current 8 (*64) sites at once.

@mratsim
Copy link
Owner

mratsim commented Nov 7, 2019

Since its integers the compiler has the liberty to reorder lots of things while in floating point it can't because addition is not asociative so the usual suspects that slow down reductions do not exist.

In conclusion like you, I think fast popcnt might be the bottleneck.

Note on multithreading:
Another potential bottleneck is memory because your algorithm needs to iterate
on 6 sequences at the same time. But 300x6 is probably to small to benefits from multithreading see: https://github.com/zy97140/omp-benchmark-for-pytorch as I don't think popcnt is more CPU intensive than exp or sin.

Bench on my CPU

1.355095338 AVX512 got ibs0:21482500000 ibs2:28522500000 hets:36842500000 N:37740000000
1.918461567 got ibs0:21482500000 ibs2:28522500000 hets:36842500000 N:37740000000

Frequencies: scalar code 4.1 GHz, AVX512 3.5 Ghz (liquid cooling, I think for air-cooled CPU you are probably looking at a 1GHz downclocking).

It's important to know if your CPU is a Skylake-X, Xeon-W or Xeon Gold which have 2 AVX512 units per core or a Xeon Bronze / Silver which only have 1.

Now regarding popcnt, I did a bit of research and found the following:

I've forked the repo so that on AVX512 it doesn't use software emulation: https://github.com/mratsim/sse-popcount/tree/dont-use-avx512-emulation

You can run make run_avx512 after cloning to bench the speed on your computer. Here is mine:

./speed_avx512bw_g++ 1000000 100
lookup-8                              ... time = 0.032190 s
lookup-64                             ... time = 0.025103 s (speedup: 1.28)
bit-parallel                          ... time = 0.025244 s (speedup: 1.28)
bit-parallel-optimized                ... time = 0.016745 s (speedup: 1.92)
bit-parallel-mul                      ... time = 0.015128 s (speedup: 2.13)
bit-parallel32                        ... time = 0.041638 s (speedup: 0.77)
bit-parallel-optimized32              ... time = 0.030790 s (speedup: 1.05)
harley-seal                           ... time = 0.006844 s (speedup: 4.70)
sse-bit-parallel                      ... time = 0.008196 s (speedup: 3.93)
sse-bit-parallel-original             ... time = 0.008840 s (speedup: 3.64)
sse-bit-parallel-better               ... time = 0.006941 s (speedup: 4.64)
sse-harley-seal                       ... time = 0.003334 s (speedup: 9.65)
sse-lookup                            ... time = 0.004037 s (speedup: 7.97)
sse-lookup-original                   ... time = 0.005569 s (speedup: 5.78)
avx2-lookup                           ... time = 0.002572 s (speedup: 12.52)
avx2-lookup-original                  ... time = 0.003401 s (speedup: 9.46)
avx2-harley-seal                      ... time = 0.001867 s (speedup: 17.24)
cpu                                   ... time = 0.003142 s (speedup: 10.25)
sse-cpu                               ... time = 0.004497 s (speedup: 7.16)
avx2-cpu                              ... time = 0.004806 s (speedup: 6.70)
avx512-harley-seal                    ... time = 0.001112 s (speedup: 28.94)
avx512bw-shuf                         ... time = 0.001707 s (speedup: 18.86)
builtin-popcnt                        ... time = 0.004645 s (speedup: 6.93)
builtin-popcnt32                      ... time = 0.012765 s (speedup: 2.52)
builtin-popcnt-unrolled               ... time = 0.003129 s (speedup: 10.29)
builtin-popcnt-unrolled32             ... time = 0.007560 s (speedup: 4.26)
builtin-popcnt-unrolled-errata        ... time = 0.003150 s (speedup: 10.22)
builtin-popcnt-unrolled-errata-manual ... time = 0.003153 s (speedup: 10.21)
builtin-popcnt-movdq                  ... time = 0.004647 s (speedup: 6.93)
builtin-popcnt-movdq-unrolled         ... time = 0.003644 s (speedup: 8.83)
builtin-popcnt-movdq-unrolled_manual  ... time = 0.004313 s (speedup: 7.46)

Then in terms of implementations, given that there are many variable to track I expect that for your need AVX512 has a lot of advantages over AVX2 even with the downclocking due to 32 AVX512 registers versus only 16 for AVX2. The fastest popcnt on my machine, Harley Seal, needs 12 registers per popcnt https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L39-L45

Details of the implementations

Assuming Harley seal is also the fastest on your machine here are some details:

  1. The wrapping function

The popcnt is done 64 by 64
https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L88

And the remainder byte by byte: https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L90-L91

  1. The vectorized loop

Here we stay in the m512i domain including for total, you want to have your for loop here:

https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L38-L69

This is unrolled by 16

So you need 4 AVX512 running totals.

  1. Sum reduction of the running total
    The running total is reduced (following some algorithm probably in the paper), happening here:
    https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L71-L75

  2. Then for the up to 15 remaining items we do the byte by byte popcount
    https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L77-L78

  3. Then the reduction of all the totals

https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L81

@brentp
Copy link
Author

brentp commented Nov 7, 2019

wow. thanks for taking a look. here is what I see:

lookup-8                              ... time = 0.067491 s
lookup-64                             ... time = 0.056205 s (speedup: 1.20)
bit-parallel                          ... time = 0.050526 s (speedup: 1.34)
bit-parallel-optimized                ... time = 0.032638 s (speedup: 2.07)
bit-parallel-mul                      ... time = 0.030241 s (speedup: 2.23)
bit-parallel32                        ... time = 0.081122 s (speedup: 0.83)
bit-parallel-optimized32              ... time = 0.060367 s (speedup: 1.12)
harley-seal                           ... time = 0.013470 s (speedup: 5.01)
sse-bit-parallel                      ... time = 0.015314 s (speedup: 4.41)
sse-bit-parallel-original             ... time = 0.018955 s (speedup: 3.56)
sse-bit-parallel-better               ... time = 0.013494 s (speedup: 5.00)
sse-harley-seal                       ... time = 0.006740 s (speedup: 10.01)
sse-lookup                            ... time = 0.008216 s (speedup: 8.21)
sse-lookup-original                   ... time = 0.012974 s (speedup: 5.20)
avx2-lookup                           ... time = 0.005621 s (speedup: 12.01)
avx2-lookup-original                  ... time = 0.007585 s (speedup: 8.90)
avx2-harley-seal                      ... time = 0.004249 s (speedup: 15.89)
cpu                                   ... time = 0.006446 s (speedup: 10.47)
sse-cpu                               ... time = 0.008706 s (speedup: 7.75)
avx2-cpu                              ... time = 0.010263 s (speedup: 6.58)
avx512-harley-seal                    ... time = 0.002780 s (speedup: 24.28)
avx512bw-shuf                         ... time = 0.004230 s (speedup: 15.95)
builtin-popcnt                        ... time = 0.009166 s (speedup: 7.36)
builtin-popcnt32                      ... time = 0.024966 s (speedup: 2.70)
builtin-popcnt-unrolled               ... time = 0.006413 s (speedup: 10.52)
builtin-popcnt-unrolled32             ... time = 0.014971 s (speedup: 4.51)
builtin-popcnt-unrolled-errata        ... time = 0.006482 s (speedup: 10.41)
builtin-popcnt-unrolled-errata-manual ... time = 0.006547 s (speedup: 10.31)
builtin-popcnt-movdq                  ... time = 0.009178 s (speedup: 7.35)
builtin-popcnt-movdq-unrolled         ... time = 0.007436 s (speedup: 9.08)
builtin-popcnt-movdq-unrolled_manual  ... time = 0.008399 s (speedup: 8.04)

As I understand it, the harley-seal will require a substantial change to the code. And I haven't figure out how to keep an m512i accumulating into an array (I think that's what you mean). I guess because the size is unknown... I'll revisit next week. Thanks again.

@mratsim
Copy link
Owner

mratsim commented Nov 8, 2019

You don't need to invent anything, the algorithm accumulate in one total already. You just need to have 4 variables vibs0, vibs2, vnhets, vnN instead of a single total.

@mratsim
Copy link
Owner

mratsim commented Nov 10, 2019

Another biologist exploring the popcount rabbit hole: http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants