Fork of https://gist.github.com/Foadsf/628a046040c302f507c81fd0568d8b34
This minimal example demonstrates how to distribute a single vector-add job across multiple OpenCL devices (dGPU, iGPU, CPU) without host-side multithreading.
It uses one context per platform and one command queue per device, launches work asynchronously, and gates all kernel starts with a user event so devices begin together. Per-device timings are collected from profiling events, and the overall job time is the max queue span across devices.
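The gating idea can be sketched roughly as follows. This is a minimal sketch, not the project's exact code: it assumes the devices shown share a single context (with one context per platform you would create one user event per context and release them back-to-back), that kernels and buffers are already set up, and that each queue was created with `CL_QUEUE_PROFILING_ENABLE`.

```c
#include <CL/cl.h>

/* Launch one pre-built kernel per device, all gated on a single user event,
 * and return the longest per-device kernel span in nanoseconds. */
cl_ulong launch_gated(cl_context ctx, cl_command_queue *queues,
                      cl_kernel *kernels, const size_t *global_sizes,
                      cl_uint num_devices)
{
    cl_int err = CL_SUCCESS;
    cl_event gate = clCreateUserEvent(ctx, &err); /* gate shared by all launches */
    cl_event done[16];                            /* assumes num_devices <= 16 */

    /* Enqueue asynchronously on every device's queue; nothing runs yet,
     * because each launch waits on the gate event. */
    for (cl_uint d = 0; d < num_devices; ++d) {
        clEnqueueNDRangeKernel(queues[d], kernels[d], 1, NULL,
                               &global_sizes[d], NULL, 1, &gate, &done[d]);
        clFlush(queues[d]); /* push the command to the device now */
    }

    /* Open the gate: all devices start as close together as the runtime allows. */
    clSetUserEventStatus(gate, CL_COMPLETE);

    /* Block once on the host, then read per-device profiling spans. */
    clWaitForEvents(num_devices, done);
    cl_ulong max_span_ns = 0;
    for (cl_uint d = 0; d < num_devices; ++d) {
        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(done[d], CL_PROFILING_COMMAND_START,
                                sizeof start, &start, NULL);
        clGetEventProfilingInfo(done[d], CL_PROFILING_COMMAND_END,
                                sizeof end, &end, NULL);
        if (end - start > max_span_ns)
            max_span_ns = end - start; /* overall job time = max span */
        clReleaseEvent(done[d]);
    }
    clReleaseEvent(gate);
    return max_span_ns;
}
```

Because every launch waits only on the user event, enqueueing is purely asynchronous and no host threads are needed; the host blocks exactly once, in `clWaitForEvents`.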
Run `build.bat`. This configures with CMake, builds `async_multidevice.exe`, runs it, and leaves:

- `build\Release\timings.csv` – per-N timings (Total + each device)
- `build\Release\plot.gp` – gnuplot script
- `build\Release\vector_add_kernel.cl` – OpenCL kernel source
To plot:

```bat
cd build\Release
"C:\Program Files\gnuplot\bin\gnuplot.exe" ..\..\plot.gp
```
- Asynchronous multi-device execution with no OpenMP/threads
- Work splitting proportional to `compute_units × max_clock` (see the sketch after this list)
- Event profiling to verify overlap and compute wall-time
- Log-log visualization of scale vs. latency
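For the splitting heuristic, a device's share of the N elements can be derived from `CL_DEVICE_MAX_COMPUTE_UNITS` and `CL_DEVICE_MAX_CLOCK_FREQUENCY`. The helper below is an illustrative sketch; the function name and the rounding policy are assumptions, not the project's code:

```c
#include <CL/cl.h>

/* Illustrative helper: size of device `which`'s chunk out of total_elems,
 * weighted by compute_units * max_clock. The last device absorbs the
 * rounding remainder so the chunks always sum to total_elems. */
size_t device_share(cl_device_id *devices, cl_uint num_devices,
                    cl_uint which, size_t total_elems)
{
    double weights[16], sum = 0.0; /* assumes num_devices <= 16 */
    for (cl_uint d = 0; d < num_devices; ++d) {
        cl_uint cu = 0, mhz = 0;
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof cu, &cu, NULL);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_CLOCK_FREQUENCY,
                        sizeof mhz, &mhz, NULL);
        weights[d] = (double)cu * (double)mhz;
        sum += weights[d];
    }
    if (which == num_devices - 1) {
        size_t assigned = 0;
        for (cl_uint d = 0; d + 1 < num_devices; ++d)
            assigned += (size_t)(total_elems * (weights[d] / sum));
        return total_elems - assigned;
    }
    return (size_t)(total_elems * (weights[which] / sum));
}
```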
- OpenCL multi-device distribution:
  - https://stackoverflow.com/questions/50319531/opencl-how-to-distribute-a-calculation-on-different-devices-without-multithread
  - https://www.reddit.com/r/gpgpu/comments/8j5zgi/opencl_how_to_distribute_a_calculation_on/
  - https://www.reddit.com/r/OpenCL/comments/8j624a/how_to_distribute_a_calculation_on_different/
- Kernel concurrency & profiling: https://stackoverflow.com/questions/11763963/how-do-i-know-if-the-kernels-are-executing-concurrently
- Original gist this is based on: https://gist.github.com/Foadsf/628a046040c302f507c81fd0568d8b34
- Intro tutorial used for the baseline vector-add: https://web.archive.org/web/20160608052324/https://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/