Vulkan implementation of b3sum #80

Open: cesarb wants to merge 9 commits into master

Conversation

@cesarb (Contributor) commented Apr 12, 2020

The highly parallel structure of BLAKE3, and its use of 32-bit words, in theory make it well suited to GPU acceleration. I wanted to see whether the very high number of threads in a GPU would be enough to offset the extra overhead of sending the work to the GPU and getting the result back.

My results were:

  1. The hasher using the GPU produces the same output as the CPU-only hasher;
  2. At least on my computer (Intel integrated GPU), it's faster than --no-mmap, but still slower than the rayon-optimized mmap path;
  3. The library I used (vulkano) automatically inserts the required barriers between steps, but its barrier-insertion code is unstable: it sometimes fails with a bogus conflict (changing the value of TASKS makes this more or less likely; it seems to depend on something like the memory address of each buffer).

Since the performance results weren't satisfactory, I don't think this code should be merged as is, but it could be useful as a reference, or perhaps someone more experienced with GPU programming could fix it to be faster.

After a few false starts, the design I ended up with splits the necessary code between the blake3 and b3sum crates. In the blake3 crate, I put the shaders (pre-compiled to SPIR-V) and an extended version of the BLAKE3 Hasher, which can export parts of its internal state (the key and the flags) for use by the shaders, and which can update its internal state from the shaders' output. In the b3sum crate, I put all the code that calls into Vulkan (through the vulkano crate) and the main loop, which reads from the file, dispatches the shaders, and feeds their output to the hasher. In theory, this allows the generic part in the blake3 crate to be used with a different GPU library (perhaps using OpenGL instead of Vulkan, or sharing the Vulkan device with other users).
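
To make that split a little more concrete, here is a rough sketch of the shape such an API could take. All names below (GpuControl, GpuHasher, and their methods) are hypothetical illustrations, not the actual items added by this PR:

```rust
// Illustrative sketch only: these type and method names are hypothetical, not the
// PR's real API. The point is that the blake3 side exposes just enough state for
// the shaders, while all Vulkan-specific code stays on the b3sum side.

/// Constants the shaders need from the hasher: the key words and the domain flags.
pub struct GpuControl {
    pub key: [u32; 8],
    pub flags: u32,
}

/// GPU-agnostic extension of the hasher (the part that would live in the blake3 crate).
pub struct GpuHasher {
    control: GpuControl,
    // ... plus the usual incremental hashing state ...
}

impl GpuHasher {
    /// Export the constants that the GPU driver layer bakes into its dispatches.
    pub fn gpu_control(&self) -> &GpuControl {
        &self.control
    }

    /// Fold chaining values computed on the GPU back into the CPU-side state,
    /// as if the corresponding chunks had been hashed locally.
    pub fn update_from_gpu_output(&mut self, chaining_values: &[u8]) {
        // parent hashing happens here, on the CPU
        let _ = chaining_values;
    }
}
```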

A few design notes:

  1. Since each shader uses the output of the previous shader as an input, a pipeline barrier has to be inserted between them (vulkano does this automatically). However, this pipeline barrier makes all following shader invocations wait for all preceding shader invocations in the same Vulkan queue, even if they come from a different queue submit operation. To avoid this wait when it isn't necessary, one should use more than one queue, and the code I wrote already tries to do that when possible. Unfortunately, my hardware has only one queue, so I cannot test how this actually affects performance.
  2. The shaders receive their input as 32-bit words, and as far as I understand the Vulkan specification, the driver makes all shaders use the native byte order rather than a fixed byte order like little endian. To work around this, I inserted a separate shader that flips the byte order of all words, as the first and last step of the command sequence, but only when the native byte order is big endian. Unfortunately, I do not have access to any big-endian hardware with a Vulkan-compatible GPU (and I don't know whether such hardware even exists), so that code is mostly untested (the shader itself was tested in isolation).
  3. The main loop first updates the hasher from the output buffer, then reads from the file into the input buffer; a rough sketch of that ordering follows this list. A future enhancement could be to use the rio crate to read into the input buffer while the output buffer is being hashed.
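
Here is the sketch referenced in note 3: a minimal version of the loop ordering, with a hypothetical trait standing in for the real Vulkan plumbing (the real loop also handles the very first iteration, the file tail, multiple queues, and endianness):

```rust
/// Hypothetical interface, only to make the ordering concrete.
trait GpuBackend {
    fn wait_for_output(&mut self);      // block until the previous dispatch has finished
    fn output(&self) -> &[u8];          // chaining values produced by the shaders
    fn input(&mut self) -> &mut [u8];   // input buffer for the next dispatch
    fn dispatch(&mut self, len: usize); // submit the shaders for `len` bytes of input
}

fn hashing_loop<G: GpuBackend>(
    gpu: &mut G,
    mut read_file: impl FnMut(&mut [u8]) -> usize, // read the next piece of the file
    mut update_hasher: impl FnMut(&[u8]),          // fold shader output into the hasher
) {
    loop {
        gpu.wait_for_output();
        update_hasher(gpu.output());    // 1) first consume the previous output buffer
        let n = read_file(gpu.input()); // 2) then refill the input buffer from the file
        if n == 0 {
            break;
        }
        gpu.dispatch(n);                // 3) kick off the next batch of dispatches
    }
}
```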

@oconnor663 (Member):

Wow! I've often wondered if something like this could be done, but I have exactly zero GPU programming experience. What's a typical max bandwidth at which the CPU can send bytes to the GPU?

> and its use of 32-bit words

I didn't know that was relevant to GPUs. Do they not support 64-bit arithmetic?

@cesarb (Contributor, author) commented Apr 13, 2020

I also have zero GPU programming experience (which is why this took me so long, and perhaps why I couldn't beat the CPU code in speed).

> What's a typical max bandwidth at which the CPU can send bytes to the GPU?

As far as I know, it depends on the speed of the PCIe bus (or, for an integrated GPU like mine, it should be about the same speed as the CPU, since they share the same memory bus). But the buffer management is probably one of the parts I did in a less-than-ideal way, though it should matter less on an integrated GPU, which has only one memory heap and one memory type (one that is both device-local and host-visible/coherent/cached).

> I didn't know that was relevant to GPUs. Do they not support 64-bit arithmetic?

Support for 64-bit arithmetic in GPU shaders is optional. In Vulkan, it's only available if vkGetPhysicalDeviceFeatures returns true for the shaderInt64 feature, while 32-bit arithmetic is always available.
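
For example, here is a minimal sketch of that query using the ash crate (assuming a reasonably recent ash version; instance setup and error handling are kept to the bare minimum):

```rust
// Minimal sketch: enumerate the physical devices and print whether each one
// advertises the shaderInt64 feature.
use ash::vk;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let entry = unsafe { ash::Entry::load()? };
    let instance = unsafe { entry.create_instance(&vk::InstanceCreateInfo::default(), None)? };
    for device in unsafe { instance.enumerate_physical_devices()? } {
        let features = unsafe { instance.get_physical_device_features(device) };
        // shader_int64 is a VkBool32: nonzero means 64-bit integer shader arithmetic is available.
        println!("shaderInt64 supported: {}", features.shader_int64 != 0);
    }
    unsafe { instance.destroy_instance(None) };
    Ok(())
}
```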

@oconnor663 (Member):

Next on my reading list: https://devblogs.nvidia.com/even-easier-introduction-cuda/

@oconnor663 (Member):

Looks like a Phoronix post got written about this PR :) https://www.phoronix.com/scan.php?page=news_item&px=BLAKE3-Experimental-Vulkan

cesarb added 7 commits on May 6, 2020, including:

- "And rewrite the gpu hashing loop to avoid pipeline barriers."
- "On a discrete GPU, this should allow writing directly from the mmap to the device through the PCIe bus, instead of writing to a memory buffer and letting the device read it through the PCIe bus. On an integrated GPU, this change has no effect, since there's only one memory type."

@cesarb (Contributor, author) commented May 21, 2020

So to try to make this faster, I made several changes:

  1. I changed from the vulkano crate to the lower-level ash crate, which gave me more control over the exact Vulkan calls being made, without having to fight the higher-level abstractions all the time;
  2. I changed the dispatch loop to completely avoid pipeline barriers, through the use of software pipelining and push constants for the control data;
  3. I made the chunk shader do the input endian conversion, and did the output endian conversion on the CPU (note that this code is still untested, since I have no big-endian machine with a modern GPU);
  4. I changed the dispatch loop to do the final hashing of the parents in parallel with the GPU starting the next unit of work;
  5. I made the Vulkan code also use memmap and copy from the memory-mapped file, instead of using a read system call, which allowed making the dispatch loop much simpler (since it no longer has to read back the tail of the file from the GPU input buffer); a rough sketch of that mmap-and-copy step follows this list;
  6. I changed the input buffer from host cached to device local, so that on a discrete GPU, the data is copied from the memory-mapped file through the PCIe bus to the GPU, instead of being copied from the memory-mapped file to a buffer in the CPU memory and then read through the PCIe bus by the GPU (this makes no difference on an integrated GPU, which has a single memory type).
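
Here is the rough sketch referenced in item 5: a hypothetical helper that copies one piece of the memory-mapped file into the GPU input buffer, using the memmap crate (error handling simplified; real code would map the file once up front rather than on every call):

```rust
use std::fs::File;
use std::io;

use memmap::MmapOptions;

/// Hypothetical helper: copy one chunk of a memory-mapped file into the GPU input
/// buffer. Returns the number of bytes copied (0 at end of file).
fn fill_gpu_input(file: &File, offset: usize, gpu_input: &mut [u8]) -> io::Result<usize> {
    // Safety: the file must not be truncated or modified while it is mapped.
    let mmap = unsafe { MmapOptions::new().map(file)? };
    if offset >= mmap.len() {
        return Ok(0);
    }
    let end = (offset + gpu_input.len()).min(mmap.len());
    let src = &mmap[offset..end];
    gpu_input[..src.len()].copy_from_slice(src);
    Ok(src.len())
}
```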

Unfortunately, on my integrated GPU it's still slower than hashing directly on the CPU, except when using a single thread (--num-threads 1). I used VK_KHR_pipeline_executable_properties to peek at the generated shader executable, and saw no obvious issues there (no register spills, and the code looks sane).

But I think I now know the issue. If I comment out the copy_from_slice from the memory-mapped file to the shader input buffer, most of the performance loss goes away (and the result obviously becomes incorrect). It seems that the performance is being limited by the memory bandwidth; since the GPU is only allowed to read from buffers allocated specifically for the GPU, I have to copy from the memory-mapped file to the memory allocated for the GPU, while the CPU can read directly from the memory-mapped file.

That is, it's reading twice from memory (and writing once), instead of just reading once. Things might be better with a discrete GPU, since it would read only once from CPU memory, write through the PCIe bus (which is separate from the CPU memory bus), and then read again on the GPU (which has faster memory that is also separate from the CPU memory bus); of course, that would depend on how fast the PCIe bus and the GPU memory are. Unfortunately, I don't currently have a device with a discrete GPU to test how this works.
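
As a back-of-the-envelope illustration of that memory-traffic argument (the bandwidth figure below is a made-up placeholder, not a measurement of any particular machine):

```rust
// Toy calculation: with a shared CPU/iGPU memory bus, touching each byte three
// times instead of once cuts the achievable throughput to roughly a third.
fn main() {
    let dram_bandwidth = 30.0_f64; // GB/s, hypothetical shared DRAM bandwidth

    // CPU path: hash straight out of the memory-mapped file, so each byte crosses
    // the memory bus roughly once.
    let cpu_ceiling = dram_bandwidth / 1.0;

    // Integrated-GPU path: read the mmap, write the GPU input buffer, and then the
    // GPU reads that buffer over the same bus: roughly three crossings per byte.
    let igpu_ceiling = dram_bandwidth / 3.0;

    println!("bandwidth ceiling: CPU path ~{cpu_ceiling:.0} GB/s, iGPU path ~{igpu_ceiling:.0} GB/s");
}
```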

@oconnor663 (Member):

I'd be happy to run tests on my Windows gaming PC, if you think that would be helpful? I've also got an AWS Linux machine with a GPU that I was going to use for CUDA experiments but haven't gotten to yet. What commands should I try?

@cesarb (Contributor, author) commented May 27, 2020

Yes, it would be helpful. I'm not too hopeful it will end up being faster, but it doesn't hurt to try.

First, you should run the b3sum tests with Vulkan enabled (cargo test --features=vulkan), to make sure I didn't do anything that works only on my integrated GPU. If you have the Vulkan SDK installed, you should also run them with the validation layer enabled (VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation cargo test --features=vulkan); by default, Vulkan doesn't check for invalid uses of its API, that's the validation layer's job.

Then, you can compare the speed of b3sum on a large file that fits in the disk cache in RAM (I used CentOS-8.1.1911-x86_64-dvd1.iso), with and without --vulkan. Here, I'm getting around 1.4s with --vulkan, and only 0.66s without --vulkan, on an i5-8250U (4 cores, 8 threads).

@cesarb (Contributor, author) commented May 27, 2020

Also, since I'm leaving my informal benchmark results here: with --vulkan, it takes 1.4s regardless of --num-threads; without --vulkan, it takes 0.66s for --num-threads 8, 0.77s for --num-threads 4, 1.4s for --num-threads 2, and 2.6s for --num-threads 1. So the Vulkan code ties with two CPU threads, and beats a single CPU thread.

@cesarb (Contributor, author) commented Jun 14, 2020

I guess I finally found what I was doing wrong: my laptop CPU is too fast. Trying on an older laptop with an integrated GPU (Haswell), with a pair of files totaling 750M added twice to the command line (so 4 files, 1500M total), it took 0.6s without --vulkan, but only 0.5s with --vulkan. The hash is correct even though Mesa warns that "Haswell Vulkan support is incomplete".

So it seems the result is mixed; depending on your CPU and GPU, either could be the faster one.

Sorry, my mistake: I forgot --release on the Haswell laptop. With --release, the times are 0.3s without --vulkan and 0.5s with --vulkan.

@oconnor663 (Member):

Ah well, it was exciting for a moment there :) It's going to be awesome when this works!

Apologies for being less active these days. I'm a month into my new job at Zoom, and I don't have as much time for side projects as I did before.

@Sanjay-A-Menon:

I am trying to implement the BLAKE3 hash on a TUL PYNQ-Z2 FPGA as well; I'd be happy to post the results when I'm done with that.

@SebMoore:

Hey all! If anyone is coming back to this in 2021, I recently ran a couple of tests, and here are the results. They demonstrate to me that this is a cool idea, but not quite ready for primetime yet. @cesarb, crazy good job for actually making this proof of concept fully functional, though!

I compiled the b3sum program using cargo, once with a standard cargo build and once with cargo rustc --features "vulkan" --release; these are named b3cpu and b3gpu respectively below.

I ran these tests on a rented AWS g4dn.4xlarge machine with 16 vCPUs (Xeon Platinum 8259CL, 2.5 GHz base, 3.5 GHz boost), 64 GB of RAM, and a Tesla T4 GPU, which is essentially a slightly worse RTX 2060. I wrote a 50 GB file full of random data to a ramdisk, and then used the command taskset -c 0-<cores> time ./<function> /mnt/ram/test.dat to hash the file.

The results are as expected: the GPU actually performs slightly worse than a single core.

Function   16c    8c     4c     2c     1c       GPU
SHA256     -      -      -      -      146.74   -
b3cpu      2.05   2.26   3.80   6.89   13.02    -
b3gpu      -      -      -      -      -        14.22

(times in seconds; SHA256 runs on a single core only)

I used sha256 from the openssl library as a point of comparison, but note that it only runs on one core. And yes, I monitored both top and nvidia-smi throughout to make sure everything was running on the correct cores and devices.
