CUDA support for ImageBufAlgo (experimental and very incomplete) #1929

Status: Open. Wants to merge 1 commit into base: master.
Conversation

@lgritz (Collaborator) commented Apr 27, 2018

First stab at this; it's experimental, and the general organization may
change as we extend it.

  • To get these features, you must build with USE_CUDA=1, in which case
    the build will look for the Cuda toolkit. For simplicity, we're setting
    a version floor of Cuda 7.0 and sm_30.

  • To enable at runtime (duh, still only if you built with Cuda support
    enabled), you can either set OIIO::attribute("cuda",1) or use the
    magic environment variable OPENIMAGEIO_CUDA=1. When running oiiotool,
    the command line argument --cuda turns the attribute on (or cheat with
    the aforementioned env variable). There's a usage sketch after this
    list.

  • When the attribute is set, ImageBufs with "local" (not
    ImageCache-backed) float buffers (no other data types yet) will
    allocate and free with cudaMallocManaged/cudaFree; all other cases use
    the usual malloc/free. We are thus heavily leveraging Unified Memory
    and never do any explicit copying of data back and forth.

  • Certain ImageBufAlgo functions then have the option of calling Cuda
    implementations when all the stars align: Cuda support compiled in,
    Cuda turned on, all the ImageBufs in question having local storage
    that was allocated as visible to Cuda, all buffers being float, and
    other restrictions that limit it to the most common cases (all image
    inputs have identical ROIs, etc.).

  • Implemented this for IBA::add() and sub() initially (see the kernel
    sketch below the usage example); will extend to other operations in
    the future, as the need arises.
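
For the record, here's roughly what this looks like from the application
side. This is a minimal sketch of the intended usage, assuming the
attribute and allocation behavior described above; the file names are
made up, and the add silently falls back to the CPU path if any of the
conditions isn't met:

```cpp
#include <OpenImageIO/imageio.h>
#include <OpenImageIO/imagebuf.h>
#include <OpenImageIO/imagebufalgo.h>
using namespace OIIO;

int main ()
{
    // Turn on the Cuda path at runtime. This only has an effect if OIIO
    // was built with USE_CUDA=1; it's equivalent to OPENIMAGEIO_CUDA=1.
    OIIO::attribute ("cuda", 1);

    // Local (not ImageCache-backed) float ImageBufs: with the attribute
    // set, their pixel memory comes from cudaMallocManaged, not malloc.
    ImageBuf A ("a.exr"), B ("b.exr");
    A.read (0, 0, true, TypeDesc::FLOAT);   // force in-memory float
    B.read (0, 0, true, TypeDesc::FLOAT);

    // If all the stars align (local float storage, identical ROIs, etc.),
    // this may run the Cuda implementation; otherwise, the CPU one.
    ImageBuf R;
    ImageBufAlgo::add (R, A, B);
    R.write ("sum.exr");
}
```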
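
On the implementation side, the device code for an op like add boils down
to a kernel of roughly this shape. Again, a sketch rather than the PR's
actual code, assuming contiguous float pixels in managed memory:

```cuda
// Elementwise add over nvalues floats (npixels * nchannels).
__global__ void add_kernel (float *r, const float *a, const float *b,
                            size_t nvalues)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < nvalues)
        r[i] = a[i] + b[i];
}

// Host-side launch. Because the buffers came from cudaMallocManaged,
// there is no explicit cudaMemcpy; Unified Memory migrates pages on demand.
void cuda_add (float *r, const float *a, const float *b, size_t nvalues)
{
    const int blocksize = 256;
    int nblocks = int ((nvalues + blocksize - 1) / blocksize);
    add_kernel<<<nblocks, blocksize>>> (r, a, b, nvalues);
    cudaDeviceSynchronize ();   // so the CPU can safely read the result
}
```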

Results and discussion:

Perf: add and sub operations on 1920x1080 3-channel float images, on my
workstation (16-core Xeon Silver 4110; its ISA is AVX-512 but I'm only
compiling for SSE4.2 support at the moment), run in about 20ms single
threaded, ~3.8ms multithreaded. With Cuda enabled (NVIDIA Quadro P5000,
Pascal architecture), I am getting about 12ms (i.e., moderately faster
than a single core, quite a bit slower than fully using all the CPU
cores).

Now, this is not an especially good case for the GPU: the
compute-to-memory ratio is very poor, just a single math op for every 12
bytes of transfer on or off the GPU (4 bytes each to read the two float
inputs and write the result). When I contrive an example with about 10x
more math per pixel, the Cuda times are approximately equal to the CPU
times when I take advantage of all the CPU cores. Maybe it only helps if
we do a bunch of IBA operations in a row before needing the results.
Maybe it's only worth Cuda-accelerating the most expensive operations
(resize, area ops, etc.), and we'll never see a gain from something as
simple as add?

If anybody can point out ways in which I'm being very wasteful, please do
let me know!

Even after we flesh out many more image operations to be
Cuda-accelerated, and even if we see an improvement over the CPU in all
cases, I don't expect people to see much practical improvement in a
typical oiiotool command line, since reading input images and writing
results to disk or network are almost certain to dominate runtime
compared to the math. But if you have a program that's doing a whole
bunch of repeated image math via IBA calls, that's where the bigger
payoff is going to be, I think.

Note that CUDA is extremely finicky about what compilers it can use,
with an especially narrow idea of which "host compiler" is required by
each version of the Cuda Toolkit/nvcc. I'm still working through those
issues, and am considering the merits of compiling the Cuda code itself
with clang (if available) rather than nvcc, just to ease up on these
requirements. We'll be making the rest of the build more robust over
time as well.

@lgritz (Collaborator, Author) commented Apr 27, 2018

I think I am doing something silly to get substandard perf. I know that a single add isn't going to be great, but it's basically what they were doing in this blog post, where they added 1M-element arrays in 0.68ms on a GT750M and 0.094ms on a Tesla K80. I'm getting 12ms to add the equivalent of ~6M-element arrays on a P5000 (that is, about 1/3 of the speed they reported on their laptop). I expected better. Must be my fault, right?
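
In case it helps anyone reproduce or diagnose this, below is a minimal
standalone benchmark in the spirit of that blog post (all names here are
mine, not from the PR). One caveat worth checking: with cudaMallocManaged
and no prefetching, the first kernel touch pays for on-demand page
migration, so a single cold launch measures migration as much as the add
itself. Note that cudaMemPrefetchAsync requires CUDA 8+, which is above
the version floor this PR sets:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add (float *r, const float *a, const float *b, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        r[i] = a[i] + b[i];
}

int main ()
{
    const size_t n = 1920 * 1080 * 3;   // ~6.2M floats, like the IBA test
    float *a, *b, *r;
    cudaMallocManaged (&a, n * sizeof(float));
    cudaMallocManaged (&b, n * sizeof(float));
    cudaMallocManaged (&r, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Optional (CUDA 8+): migrate pages to the GPU up front so the kernel
    // timing doesn't include on-demand page faults.
    int dev = 0;
    cudaGetDevice (&dev);
    cudaMemPrefetchAsync (a, n * sizeof(float), dev);
    cudaMemPrefetchAsync (b, n * sizeof(float), dev);
    cudaMemPrefetchAsync (r, n * sizeof(float), dev);

    cudaEvent_t t0, t1;
    cudaEventCreate (&t0);
    cudaEventCreate (&t1);
    cudaEventRecord (t0);
    add<<<int((n + 255) / 256), 256>>> (r, a, b, n);
    cudaEventRecord (t1);
    cudaEventSynchronize (t1);
    float ms = 0.0f;
    cudaEventElapsedTime (&ms, t0, t1);
    printf ("kernel time: %.3f ms\n", ms);

    cudaFree (a);  cudaFree (b);  cudaFree (r);
    return 0;
}
```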

@samhodge commented:

@lgritz

I am looking into some faster image operations for a C++ AI plugin for OpenFX hosts.

The tensor math on the GPU is as fast as it is going to get, thanks to the good work of the Google engineering team on TensorFlow.

Now I could continue doing pixel operations with tensors on the GPU.

The other alternative is NVIDIA's NPP libraries.

But I stumbled on this old commit.

Where did all the balls land?

@lgritz (Collaborator, Author) commented Dec 28, 2018

@samhodge Yes, I intend to pick this back up in the new year. I got stalled by the performance issues I noticed (see comments above), and then got distracted by other things.

@samhodge commented:

Even if I cannot use it on this current project, I can see it being useful in other contexts. Thanks for your efforts.
