Too many CmdBarrier's are inserted resulting in gpu underutilisation #143
Comments
As far as I'm aware, AMD's open source LLVM compiler doesn't produce the metadata the runtime needs to determine whether a buffer passed to a kernel is read-only. This was supported with the closed source compiler, and the appropriate optimization should already be implemented in the runtime. Could you set the env var GPU_ENABLE_LC=0 to force the closed source compiler and recapture the RGP trace? Hopefully there won't be so many barriers anymore.
After realising that my IDE scrubs environment variables and being puzzled for a while: setting that environment variable does indeed result in a correct-looking RGP trace: https://i.imgur.com/MoKHvBS.png That at least answers that mystery! It doesn't fix the problem, because the closed source compiler is a solid 40%-1000% slower (which unfortunately is not a typo) for the actual use case I'm debugging, but it makes sense. On the plus side, I think it may be possible to work around this by abusing multiple command queues, which, while not ideal, at least seems to recover some of the lost performance.
@gandryey might have a better solution to your issue, but ultimately AMD's LLVM team would need to support OpenCL buffer attributes, which I do not think will happen. The closed source compiler doesn't have official support for OpenCL on Navi; it sort of "just" works. That would explain your performance issues with it.
Thanks very much for the replies. Do you happen to know whether this affects HIP as well? I've been considering switching APIs to get better performance, but I'm slightly reluctant to go full Vulkan.
Some brief testing has shown that using a ring of command queues (in my case 16) to submit parallel work, with markers to synchronise, does seem to successfully recover most of the expected performance - in my case about a 10% decrease in runtime (144ms -> 130ms). So that workaround seems to work well enough, at least.
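As a rough model of why the ring of queues helps (a toy sketch, not real OpenCL code; the assumption that the driver serializes same-queue dispatches that share an argument is taken from the behaviour described in this thread):

```python
# Toy model of the ring-of-queues workaround. Assumption (from this thread):
# the driver inserts a barrier between consecutive dispatches on the same
# queue whenever they share any buffer argument, so N kernels that all read
# one shared buffer form a serial chain of length N on a single queue.
# Distributing them round-robin over Q queues cuts the longest per-queue
# chain to ceil(N / Q), letting the queues' chains overlap on the GPU.
import math

def longest_serial_chain(n_kernels: int, n_queues: int) -> int:
    """Length of the longest per-queue dependency chain when n_kernels that
    all share a read-only buffer are distributed round-robin over n_queues."""
    return math.ceil(n_kernels / n_queues)
```

For example, 100 conflicting kernels on one queue form a chain of 100, but spread over a ring of 16 queues the longest chain is only 7, which matches the intuition that most (though not all) of the lost parallelism is recovered.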
With HIP your situation is even worse. Because CUDA supports pointers to pointers (and has no concept of read-only memory), the runtime cannot do any memory tracking, so a memory barrier is inserted after each dispatch.
LC doesn't track read/write operations in the kernel, hence the runtime can't remove the barrier. Potentially you could work around this issue with the CL flag CL_MEM_READ_ONLY. Apps usually ignore those flags, and unfortunately the current logic in the runtime relies on the compiler's tracking only, to guard against possibly incorrect input from apps.
There are certain tasks where this optimization could help, but usually the app can just use multiple OCL queues in those scenarios. If you think asynchronous execution can help in your case, then please try 2 OCL queues. Please note the Windows driver doesn't support user mode submission, so to start asynchronous execution the app may need to call clFlush() on both queues.
In the end, I was able to create a longer-term solution for myself by inspecting the arguments to kernels for their read/write flags, and then automatically distributing work across a number of command queues, with events to synchronise between kernels with read/write and write/write dependencies. This does involve rewriting most of the APIs in question to support this model of execution, as you need to be able to inspect the cl_mem arguments to all functions like e.g. clEnqueueWriteBuffer. This was actually much easier to do than I thought it would be, which is nice. It's a shame that OpenCL has no way to query kernel arguments directly though, so it can't be done as directly as you might like.
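The workaround described above can be sketched like this (hypothetical names; a real implementation would attach cl_event wait lists to each clEnqueueNDRangeKernel call). The key point is that sync is needed only for read-after-write, write-after-write, and write-after-read hazards - pure read/read sharing needs none:

```python
# Sketch of dependency-aware dispatch across multiple queues. All names
# here are illustrative; a real version would record cl_event objects and
# pass them as wait lists when enqueuing on OpenCL command queues.
from dataclasses import dataclass, field

@dataclass
class Dispatch:
    name: str
    reads: frozenset    # buffers the kernel only reads
    writes: frozenset   # buffers the kernel writes
    queue: int = 0
    deps: list = field(default_factory=list)  # earlier dispatches to wait on

def needs_sync(earlier: Dispatch, later: Dispatch) -> bool:
    """True only for RAW / WAW / WAR hazards; read/read sharing is free."""
    return bool(earlier.writes & (later.reads | later.writes)
                or earlier.reads & later.writes)

def schedule(dispatches, n_queues):
    """Assign dispatches to queues round-robin and record which earlier
    dispatches each one must wait on (via events in the real API)."""
    done = []
    for i, d in enumerate(dispatches):
        d.queue = i % n_queues
        d.deps = [e.name for e in done if needs_sync(e, d)]
        done.append(d)
    return dispatches
```

Under this rule, two kernels that share only a read-only input get no dependency edge and can run on separate queues concurrently, while a kernel consuming another's output still waits on it.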
I've been playing around with this a lot since I wrote this bug report, and it's becoming an ever bigger thorn in trying to get acceptable GPU performance out of AMD's OpenCL runtime. There are a few problems that, unfortunately, the multiple-queue workaround can't fix.
In my case, I'm submitting a hundred or so kernels per 'tick', with each tick lasting ~30ms. Ideally, memory flags like CL_MEM_READ_ONLY would replace the compiler-based system: while you'd still need to correctly mark memory, it would massively alleviate the amount of work you have to put in to get good performance out of AMD GPUs with OpenCL.
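A minimal sketch of what a flag-based runtime check could look like. The flag values below are the standard ones from cl.h, but the decision logic is hypothetical - it is not AMD's actual runtime, just the behaviour this comment is asking for:

```python
# Hypothetical sketch: a runtime could use the CL_MEM_* flags the app
# already supplies at clCreateBuffer time to decide whether a barrier is
# needed between two dispatches that share a buffer.
# Flag values below match the standard cl.h definitions.
CL_MEM_READ_WRITE = 1 << 0
CL_MEM_WRITE_ONLY = 1 << 1
CL_MEM_READ_ONLY  = 1 << 2

def barrier_needed(shared_buffer_flags) -> bool:
    """No barrier if every buffer shared between the two dispatches was
    created read-only; any writable shared buffer forces one."""
    return any(not (f & CL_MEM_READ_ONLY) for f in shared_buffer_flags)
```

With this rule, a hundred kernels sharing only CL_MEM_READ_ONLY inputs would dispatch back-to-back with no barriers, while any shared CL_MEM_READ_WRITE buffer would still be serialised conservatively.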
Would you be able to point to some code and instructions for building and running it that will allow others to reproduce the segfault issue you mentioned? While this may be an AMD driver issue, applications leaking events is very common. |
I've been pinning this issue down for a while. I'm on Windows 10, driver 22.3.1, with a 6700 XT.
When enqueuing a lot of kernels, if the kernels share any common arguments, the driver inserts a barrier, which results in a stall. In Radeon GPU Profiler this looks like this:
https://i.imgur.com/P5PBmAV.png
This is for the following kernel
The contents of the kernel do not matter. When invoking multiple copies of this kernel with a single shared read-only buffer, each kernel invocation is serialised, with gaps in the GPU execution. This happens regardless of whether the queue is out-of-order or not - if any arguments of two kernels overlap, a barrier will be generated. This can be a significant overhead when executing a lot of kernels that all read from the same set of data but output to many different buffers.
As far as I know, a barrier shouldn't be needed just for shared read-only buffers, right? Is there any way around this at all, or is this a bug/wontfix?