Vulkan implementation of b3sum #80
Conversation
Wow! I've often wondered if something like this could be done, but I have exactly zero GPU programming experience. What's a typical max bandwidth at which the CPU can send bytes to the GPU?
I didn't know that was relevant to GPUs. Do they not support 64-bit arithmetic?
I also have zero GPU programming experience (which is why this took me so long, and perhaps why I couldn't beat the CPU code in speed).
As far as I know, it depends on the speed of the PCIe bus (or, for an integrated GPU like mine, it should be about the same speed as the CPU since they share the same memory bus). But this (the buffer management) is probably one of the parts which I might have done in a less than ideal way, though it should be less critical on an integrated GPU which has only one memory heap and memory type (which is both device local and host visible/coherent/cached).
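For a rough sense of scale, the theoretical PCIe ceiling can be worked out from the per-lane signalling rate. The numbers below (PCIe 3.0, x16, 128b/130b encoding) are illustrative assumptions for a typical discrete GPU, not measurements from this PR:

```rust
// Rough upper bound on host-to-device bandwidth over PCIe 3.0.
// PCIe 3.0 signals at 8 GT/s per lane with 128b/130b encoding, so each
// lane carries 8e9 * 128/130 payload bits per second.
fn pcie3_bandwidth_gb_per_s(lanes: u32) -> f64 {
    let per_lane_bits = 8e9 * 128.0 / 130.0; // payload bits/s per lane
    per_lane_bits * lanes as f64 / 8.0 / 1e9 // bits/s -> GB/s
}

fn main() {
    // An x16 slot tops out just under 16 GB/s per direction, which is
    // the ceiling for streaming file data to a discrete GPU.
    println!("{:.2} GB/s", pcie3_bandwidth_gb_per_s(16));
}
```

By comparison, a dual-channel DDR4 memory bus (the path an integrated GPU shares with the CPU) is typically a few times faster than that, which is why the buffer-management strategy matters more on discrete hardware.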
Support for 64-bit arithmetic in GPU shaders is optional. In Vulkan, it's only available if the device reports the `shaderInt64` feature.
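The reason optional 64-bit support doesn't hurt here is that BLAKE3's compression function is built entirely from 32-bit adds, XORs, and rotates. A sketch of its quarter-round (the `g` function from the BLAKE3 spec, with its 16/12/8/7 rotation constants) in plain Rust:

```rust
// BLAKE3's quarter-round: only u32 wrapping adds, xors, and rotates,
// so a shader needs no 64-bit integer support to implement it.
fn g(state: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize, mx: u32, my: u32) {
    state[a] = state[a].wrapping_add(state[b]).wrapping_add(mx);
    state[d] = (state[d] ^ state[a]).rotate_right(16);
    state[c] = state[c].wrapping_add(state[d]);
    state[b] = (state[b] ^ state[c]).rotate_right(12);
    state[a] = state[a].wrapping_add(state[b]).wrapping_add(my);
    state[d] = (state[d] ^ state[a]).rotate_right(8);
    state[c] = state[c].wrapping_add(state[d]);
    state[b] = (state[b] ^ state[c]).rotate_right(7);
}

fn main() {
    // Mix one column of an (arbitrary, all-zero) state with two
    // message words, just to show the operation set involved.
    let mut state = [0u32; 16];
    g(&mut state, 0, 4, 8, 12, 1, 2);
    println!("{:08x} {:08x}", state[0], state[12]);
}
```

This maps directly onto GLSL/SPIR-V `uint` arithmetic, which every Vulkan implementation must support.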
Next on my reading list: https://devblogs.nvidia.com/even-easier-introduction-cuda/
Looks like a Phoronix post got written about this PR :) https://www.phoronix.com/scan.php?page=news_item&px=BLAKE3-Experimental-Vulkan
I also rewrote the GPU hashing loop to avoid pipeline barriers.
On a discrete GPU, this should allow writing directly from the mmap to the device through the PCIe bus, instead of writing to a memory buffer and letting the device read it through the PCIe bus. On an integrated GPU, this change has no effect, since there's only one memory type.
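The "only one memory type" point follows from the standard Vulkan memory-type selection loop: scan the device's memory types for the first one that is both allowed by the resource's type bits and has the required property flags. A dependency-free model of that loop in plain Rust (the flag constants mirror `VkMemoryPropertyFlagBits`, but this is an illustrative sketch, not the PR's actual code):

```rust
// Bit flags mirroring VkMemoryPropertyFlagBits (illustrative subset).
const DEVICE_LOCAL: u32 = 0x1;
const HOST_VISIBLE: u32 = 0x2;
const HOST_COHERENT: u32 = 0x4;

/// Canonical Vulkan memory-type search: pick the first type index
/// permitted by `type_bits` (from the resource's memory requirements)
/// whose property flags contain everything in `required`.
fn find_memory_type(types: &[u32], type_bits: u32, required: u32) -> Option<usize> {
    types.iter().enumerate().find_map(|(i, &flags)| {
        let allowed = type_bits & (1 << i) != 0;
        (allowed && flags & required == required).then_some(i)
    })
}

fn main() {
    // A discrete GPU typically separates DEVICE_LOCAL VRAM from
    // HOST_VISIBLE staging memory; an integrated GPU often exposes a
    // single type that is all of these at once.
    let discrete = [DEVICE_LOCAL, HOST_VISIBLE | HOST_COHERENT];
    let integrated = [DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT];
    println!("{:?}", find_memory_type(&discrete, 0b11, HOST_VISIBLE));
    println!("{:?}", find_memory_type(&integrated, 0b1, DEVICE_LOCAL | HOST_VISIBLE));
}
```

On the integrated case the search trivially returns the one combined type, which is why the change above is a no-op there.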
So to try to make this faster, I made several changes: …
Unfortunately, on my integrated GPU it's still slower than hashing directly on the CPU, except when using a single thread. But I think I now know the issue: it's reading twice from memory (and writing once), instead of just reading once. Things might be better with a discrete GPU, since it would read only once from CPU memory, write through the PCIe bus (which is separate from the CPU memory bus), and then read again on the GPU (which has faster memory, and is also separate from the CPU memory bus); of course, that depends on how fast the PCIe bus and the GPU memory are. Unfortunately, I don't currently have a device with a discrete GPU to test and see how it works.
I'd be happy to run tests on my Windows gaming PC, if you think that would be helpful? I've also got an AWS Linux machine with a GPU that I was going to use for CUDA experiments but haven't gotten to yet. What commands should I try?
Yes, it would be helpful. I'm not too hopeful it will end up being faster, but it doesn't hurt to try. First, you should run the … Then, you can compare the speed of …
Also, since I'm leaving my informal benchmark results here: with …
Sorry, my mistake, I forgot the …
Ah well, it was exciting for a moment there :) It's going to be awesome when this works! Apologies for being less active these days. I'm a month into my new job at Zoom, and I don't have as much time for side projects as I did before.
I am trying to implement the BLAKE3 hash on a TUL PYNQ-Z2 FPGA as well, would be happy to post the results when I am done with that. |
Hey all! If anyone is coming back to this in 2021, I recently ran a couple of tests, and here are the results. They demonstrate to me that this is a cool idea, but not quite ready for primetime yet. @cesarb crazy good job for actually making this proof-of-concept totally functional though!

I compiled the b3sum program using cargo, once with a standard cargo build, and once with … I ran these tests on a rented AWS g4dn.4xlarge machine with 16 vCPUs (Xeon Platinum 8259CL, 2.5 GHz base, 3.5 GHz boost), 64 GB of RAM, and a Tesla T4 GPU, which is essentially a slightly worse RTX 2060. I wrote a 50 GB file full of random data to a ramdisk, and then used the command … The results are as expected: the GPU actually performs slightly worse than a single core.
I used sha256 as a point of comparison from the openssl library, but note that it only runs on 1 core. And yes, I monitored both …
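For anyone reproducing benchmarks like these, the headline number is just bytes over elapsed seconds; a tiny helper (illustrative, not from the PR):

```rust
// Convert a timed hash of `bytes` into GB/s (decimal gigabytes,
// matching how bus and drive bandwidths are usually quoted).
fn throughput_gb_per_s(bytes: u64, seconds: f64) -> f64 {
    bytes as f64 / seconds / 1e9
}

fn main() {
    // e.g. a 50 GB ramdisk file hashed in 20 s comes out at 2.5 GB/s.
    println!("{:.2} GB/s", throughput_gb_per_s(50_000_000_000, 20.0));
}
```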
The highly parallel structure of Blake3, and its use of 32-bit words, in theory make it suitable for being accelerated by a GPU. I wanted to see if the very high number of threads in a GPU would be enough to offset the extra overhead of sending the work to the GPU and getting the result back.
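Concretely, BLAKE3 splits its input into independent 1 KiB chunks whose outputs are merged pairwise up a binary tree, so the number of independently hashable leaves (and hence the parallelism a GPU dispatch can exploit) grows linearly with input size. A sketch of that arithmetic, assuming only the 1024-byte chunk size from the BLAKE3 spec:

```rust
/// Number of 1024-byte chunks (tree leaves) for an input of `len` bytes.
/// Each chunk can be compressed independently, which is what a GPU
/// dispatch can exploit; the empty input still occupies one chunk.
fn chunk_count(len: u64) -> u64 {
    const CHUNK_LEN: u64 = 1024;
    if len == 0 { 1 } else { (len + CHUNK_LEN - 1) / CHUNK_LEN }
}

fn main() {
    // A 1 MiB buffer yields 1024 independent leaves...
    println!("{}", chunk_count(1 << 20));
    // ...and a 50 GB benchmark file yields almost 49 million.
    println!("{}", chunk_count(50_000_000_000));
}
```

By contrast, a classic sequentially-chained hash offers no such leaf-level parallelism, which is what makes this experiment interesting at all.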
My results were:

- … faster than the single-threaded `--no-mmap` path, but still slower than the rayon-optimized `mmap` path;
- … (`vulkano`) inserts the required barriers between each step automatically, but its barrier-insertion code is unstable: it sometimes fails with a bogus conflict (changing the value of `TASKS` makes it more or less probable; it seems to depend on something like the memory address of each buffer).

Since the performance results weren't satisfactory, I don't think this code should be merged as is, but it could be useful as a reference, or perhaps someone more experienced with GPU programming could fix it to be faster.
After a few false starts, the design I ended up with splits the necessary code between the blake3 and b3sum crates. In the blake3 crate, I put the shaders (pre-compiled to SPIR-V) and an extended version of the Blake3 `Hasher`, which can export parts of its internal state (the key and the flags) to be used by the shaders, and which can update its internal state from the output of the shaders. In the b3sum crate, I put all the code which calls into Vulkan (through the `vulkano` crate), and the main loop which reads from the file, dispatches the shaders, and gives their output to the hasher. This in theory allows the generic part in the `blake3` crate to be used with a different GPU library (perhaps using OpenGL instead of Vulkan, or sharing the Vulkan device with other users).

A couple of design notes:
- … (`vulkano`). However, this pipeline barrier makes all following shader invocations wait for all the preceding shader invocations in the same Vulkan queue, even if they are from a different queue submit operation. To prevent this wait when not necessary, one should use more than one queue, and the code I wrote already tries to do that when possible. Unfortunately, the hardware I have has only one queue, so I cannot test how this actually affects the performance.
- … the `rio` crate to read into the input buffer while the output buffer is being hashed.
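The overlap described here is plain double buffering: while one buffer is being hashed, the next read fills the other. A minimal CPU-only sketch of the pattern with two buffers and a worker thread (illustrative; the PR itself uses the `rio` crate for the I/O side, and the "hash" below is faked as a byte sum):

```rust
use std::sync::mpsc;
use std::thread;

/// Double buffering: two buffers circulate between a "reader" (this
/// thread) and a "hasher" (worker thread), so filling one buffer
/// overlaps with hashing the other.
fn double_buffered_sum(blocks: u8) -> u64 {
    let (to_hasher, full) = mpsc::channel::<Vec<u8>>();
    let (to_reader, empty) = mpsc::channel::<Vec<u8>>();
    for _ in 0..2 {
        to_reader.send(vec![0u8; 4]).unwrap(); // seed two empty buffers
    }

    let hasher = thread::spawn(move || {
        let mut total = 0u64;
        for buf in full {
            total += buf.iter().map(|&b| u64::from(b)).sum::<u64>();
            let _ = to_reader.send(buf); // recycle the buffer
        }
        total
    });

    for block in 0..blocks {
        let mut buf = empty.recv().unwrap(); // wait for a free buffer
        buf.fill(block); // stand-in for reading the next file block
        to_hasher.send(buf).unwrap();
    }
    drop(to_hasher); // closing the channel lets the hasher finish
    hasher.join().unwrap()
}

fn main() {
    println!("{}", double_buffered_sum(4));
}
```

The same shape applies on the GPU side: with two buffers and (ideally) two queues, the transfer of chunk *n+1* can proceed while the shaders for chunk *n* are still running.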