GPU acceleration #21

Open · JoeyBF opened this issue Mar 29, 2021 · 12 comments

@JoeyBF (Collaborator) commented Mar 29, 2021

Since we make such heavy use of finite field linear algebra, would it make sense to write some compute kernels to offload some computations to a GPU? Maybe something like an F_p-cuBLAS? I think some routines are parallel enough to make it worthwhile. I see there's ticket #19 for making use of M4RI, but I wonder if there's even more performance to be squeezed out of a GPU.
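
To make this concrete: the elementary operation such a kernel would parallelize is the F_p row update r_i <- r_i + c * r_j, and every entry update is independent of the others, which is exactly the kind of data parallelism a GPU is good at. A minimal sketch in plain Rust (not the fp crate's actual packed representation; the name is just for illustration):

```rust
/// Hypothetical sketch of the inner loop a GPU kernel would parallelize:
/// target <- target + c * source (mod p), entrywise.
fn row_axpy_mod_p(target: &mut [u32], source: &[u32], c: u32, p: u32) {
    assert_eq!(target.len(), source.len());
    for (t, &s) in target.iter_mut().zip(source) {
        // Entries are assumed reduced mod p, so the product fits in a u64.
        *t = ((*t as u64 + c as u64 * s as u64) % p as u64) as u32;
    }
}
```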

@hoodmane (Contributor)

Maybe it would make sense to do this for larger matrices? My guess is that up to a certain size the overhead for setting up the GPU to do anything would exceed the benefit. Of course if you implement this you can benchmark it on varying size matrices to see.
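
For instance, a criterion harness along these lines would show where the crossover is (purely a sketch; the matrix type and the two row_reduce_* functions are stand-ins for whatever actually gets implemented):

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

type Matrix = Vec<Vec<u32>>; // placeholder representation

// Deterministic filler so the sketch is self-contained; a real benchmark
// would generate random matrices over F_p.
fn test_matrix(n: usize) -> Matrix {
    (0..n)
        .map(|i| (0..n).map(|j| ((31 * i + 17 * j) % 2) as u32).collect())
        .collect()
}

fn row_reduce_cpu(_m: Matrix) { /* stand-in for the existing CPU code */ }
fn row_reduce_gpu(_m: Matrix) { /* stand-in for the hypothetical GPU path */ }

fn bench_row_reduce(c: &mut Criterion) {
    let mut group = c.benchmark_group("row_reduce");
    for n in [64, 128, 256, 512, 1024, 2048] {
        let matrix = test_matrix(n);
        group.bench_with_input(BenchmarkId::new("cpu", n), &matrix, |b, m| {
            b.iter(|| row_reduce_cpu(m.clone()))
        });
        group.bench_with_input(BenchmarkId::new("gpu", n), &matrix, |b, m| {
            b.iter(|| row_reduce_gpu(m.clone()))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_row_reduce);
criterion_main!(benches);
```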

@JoeyBF (Collaborator, Author) commented Mar 29, 2021

Maybe it would make sense to do this for larger matrices?

I agree; something around 500 x 500 would probably be a reasonable cutoff to start with. We would also have to keep in mind how to manage the memory, because the matrices quickly become too big to store entirely in a single block's cache. Having to constantly fetch from main memory would negate any potential benefit. Even so, the existence of something like cuBLAS shows that this can be worked around.
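
In code, the dispatch could be as simple as this (a sketch; the cutoff and every name here are placeholders, to be tuned once we have benchmarks):

```rust
// Hypothetical size-based dispatch; 500 is just the starting guess above.
const GPU_CUTOFF: usize = 500;

fn row_reduce(rows: usize, cols: usize, matrix: &mut [u32]) {
    if rows.min(cols) >= GPU_CUTOFF {
        reduce_on_gpu(matrix); // big enough that throughput wins over setup
    } else {
        reduce_on_cpu(matrix); // too small: transfer and launch overhead dominate
    }
}

fn reduce_on_gpu(_m: &mut [u32]) { /* stand-in */ }
fn reduce_on_cpu(_m: &mut [u32]) { /* stand-in */ }
```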

So do you think it's something worth working on? I think increasing scalability is important if we want to carry out big computations on something like an HPC grid.

@dalcde (Contributor) commented Mar 30, 2021 via email

@JoeyBF (Collaborator, Author) commented Mar 30, 2021

So would you recommend using Rust-GPU anyway? I wouldn't mind writing some CUDA code and calling it from the fp crate. Or maybe we would rather use OpenCL or OpenMP to keep it completely open-source. I just want to discuss how to attack this and whether or not it's worth pursuing before diving into the code.

@hoodmane (Contributor)

whether or not it's worth pursuing

In my opinion it's worth trying. It's really hard to know how much effort it will take, or how much of a performance boost we'll get, until someone tries. I think it'd be cool for someone to work out how to do this. Presumably you think it would be fun / interesting, or else you wouldn't have suggested it. I personally would be interested to find out how it works out.

I spent a good amount of time writing WebGL code for the chart rendering, which I would say was objectively much more of a waste of time than this would be.

@hoodmane (Contributor)

I think OpenCL is preferable to CUDA; a lot of graphics cards aren't made by NVIDIA.

@JoeyBF (Collaborator, Author) commented Mar 30, 2021

Presumably you think it would be fun / interesting or else you wouldn't have suggested it.

I do think it would be a fun challenge, and I think it would really help us push the computations if it pans out.

I think OpenCL is preferable to CUDA, a lot of graphics cards aren't made by NVIDIA.

That's true, but I worry that OpenCL might be too generic, if that makes sense. The way I see it, most people who will want to run an ext computation, at least in the range where using a GPU would become important, are probably going to be running the code on some kind of HPC cluster. As far as I know, most of the GPUs there are NVIDIA (e.g. Wayne has K20Xs, P100s, and V100s).

Actually, now that I'm looking into it more, something like Vulkan might be the right thing to use. Unlike OpenCL, which covers GPUs as well as FPGAs, DSPs, and more exotic hardware, Vulkan is GPU-only, so we can write at a lower level. It also has some Rust support through crates like gfx-rs and vulkano.
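
To give a sense of scale, the kernel itself stays small; it's the host-side setup that's verbose. Here's a rough sketch of the GLSL compute shader a vulkano pipeline could dispatch for the F_p row update (hedged: vulkano's API has changed across versions, so the pipeline setup is omitted; source like this can be compiled at build time via the vulkano_shaders crate):

```rust
// Hypothetical GLSL compute kernel for: target <- target + c * source (mod p).
// Each invocation handles one matrix entry, which is the data parallelism
// we want from the GPU.
const ROW_AXPY_SRC: &str = r#"
#version 450
layout(local_size_x = 64) in;

layout(set = 0, binding = 0) buffer Target { uint target[]; };
layout(set = 0, binding = 1) readonly buffer Source { uint source[]; };
layout(push_constant) uniform Params { uint c; uint p; uint len; } pc;

void main() {
    uint i = gl_GlobalInvocationID.x;
    if (i < pc.len) {
        // Entries are assumed reduced mod p; for the small primes we use,
        // pc.c * source[i] does not overflow a uint.
        target[i] = (target[i] + pc.c * source[i]) % pc.p;
    }
}
"#;
```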

@hoodmane (Contributor)

something like Vulkan might be the right thing to use

Yeah, Vulkan is supposed to be the new right way to do these things. Since it's pretty new, there will probably be less support, fewer how-to guides, and less example code, and it's supposed to be extremely verbose.

@hoodmane (Contributor)

I do wonder whether we aren't already limited more by memory access speed than by CPU speed, though: it's not clear that the basic operations involved are complicated enough to benefit from graphics card acceleration.

@dalcde (Contributor) commented Mar 30, 2021 via email

@JoeyBF (Collaborator, Author) commented Mar 30, 2021

I'm a fan of being able to test things locally and in CI, and CUDA would be problematic on that front

That's a good point. I think Vulkan would be the way to go then.

I do wonder whether we aren't already limited more by memory access speeds than by CPU speed though: it's not clear that the basic operations involved are complicated enough to benefit from graphics card acceleration.

What I'm thinking is that we could copy the entire matrix to the GPU's RAM, and then call some compute kernels that act on it in place. The copy is the real bottleneck; once the data is on the device, calling the kernels should be pretty fast. One immediate benefit is that the bandwidth between the GPU cores and GPU memory is substantially higher than on the CPU side, so even with the same algorithm it might be faster. (The figure I saw was 100 GB/s for a CPU vs 380 GB/s for a GPU.)
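
As a rough sanity check on those numbers (the bandwidths below are the figures just quoted, plus an assumed ~16 GB/s for a PCIe 3.0 x16 host-to-device copy):

```rust
// Back-of-envelope timings for an n x n matrix of u32 entries.
fn estimate_ms(n: usize) {
    let bytes = (n * n * 4) as f64;
    let copy_ms = bytes / 16.0e9 * 1e3; // one-time host -> device transfer
    let gpu_pass_ms = bytes / 380.0e9 * 1e3; // one full pass over GPU memory
    let cpu_pass_ms = bytes / 100.0e9 * 1e3; // one full pass over host memory
    println!("n = {n}: copy {copy_ms:.3} ms, GPU pass {gpu_pass_ms:.4} ms, CPU pass {cpu_pass_ms:.4} ms");
}

fn main() {
    // Row reduction makes O(n) passes over the matrix, so the one-time
    // copy is amortized quickly at these sizes.
    for n in [500, 2000, 10_000] {
        estimate_ms(n);
    }
}
```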

I was also thinking of using something like the "Method of the Four Russians" that M4RI uses for row reduction. Presumably it is parallelizable, since the M4RI library can use up to 4 cores.
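
For reference, the core table trick looks something like this over F_2 (a hedged sketch of the general idea, not M4RI's actual code; the per-row loop at the end is the part that parallelizes, since each remaining row is processed independently):

```rust
/// Clear k pivot columns from `rows` using one table lookup per row instead
/// of up to k row additions. Rows are packed 64 entries per u64 word.
/// `pivot_rows` holds k rows already in reduced form, with pivot i in
/// column c + i.
fn four_russians_eliminate(pivot_rows: &[Vec<u64>], rows: &mut [Vec<u64>], c: usize) {
    let k = pivot_rows.len();
    let words = pivot_rows[0].len();

    // Precompute all 2^k linear combinations of the pivot rows; each entry
    // is built with a single XOR from an already-computed smaller subset.
    let mut table = vec![vec![0u64; words]; 1 << k];
    for s in 1usize..(1 << k) {
        let i = s.trailing_zeros() as usize;
        for w in 0..words {
            table[s][w] = table[s & (s - 1)][w] ^ pivot_rows[i][w];
        }
    }

    for row in rows.iter_mut() {
        // The k bits of this row in the pivot columns select the unique
        // combination that clears all of them at once.
        let mut s = 0usize;
        for i in 0..k {
            s |= (((row[(c + i) / 64] >> ((c + i) % 64)) & 1) as usize) << i;
        }
        for w in 0..words {
            row[w] ^= table[s][w];
        }
    }
}
```

In practice k would stay small (on the order of log2 of the matrix size), so the table fits in fast memory; on a GPU that is presumably what would live in a block's shared memory.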

@hoodmane (Contributor)

Sounds generally reasonable to me.

dalcde mentioned this issue Apr 30, 2021