subgroup_size support #479
hughperkins
started this conversation in
Design / API
Replies: 1 comment 1 reply
For AMD, GCN had 64-wide wavefronts. Same with CDNA. Only with RDNA did they move to 32-wide wavefronts. I don't think it's worth supporting AMD GPUs older than GCN, as they were VLIW.
For the register-only shuffle tiles, https://github.com/Genesis-Embodied-AI/quadrants/blob/hp/tiles-5/docs/source/user_guide/tile16.md , we need a subgroup size, https://www.khronos.org/blog/vulkan-subgroup-tutorial , of at least 16. We need 16 threads to run together in a single warp so that they can shuffle values between themselves, without going through shared memory (or global memory) to communicate.
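To make the shuffle requirement concrete, here is a minimal pure-Python simulation of a butterfly (XOR) reduction across 16 lanes. It is only a sketch of the communication pattern, not GPU code and not the tile implementation: `shuffle_xor` and `subgroup_sum` are hypothetical names standing in for a hardware subgroup shuffle and a reduction built on it.

```python
# Pure-Python simulation of a shuffle-based reduction; not GPU code.
# `shuffle_xor` is a hypothetical stand-in for a hardware subgroup shuffle.

def shuffle_xor(lanes, mask):
    """Each lane i reads the value currently held by lane i ^ mask."""
    return [lanes[i ^ mask] for i in range(len(lanes))]


def subgroup_sum(lanes):
    """Butterfly reduction: after log2(n) shuffle steps every lane holds the sum."""
    n = len(lanes)
    assert n > 0 and n & (n - 1) == 0, "lane count must be a power of two"
    offset = n // 2
    while offset >= 1:
        partner = shuffle_xor(lanes, offset)      # exchange with lane i ^ offset
        lanes = [a + b for a, b in zip(lanes, partner)]
        offset //= 2
    return lanes


values = list(range(16))       # one value per lane, 16 lanes
result = subgroup_sum(values)
print(result[0])               # 120, and every other lane holds 120 too
```

The first step pairs lane i with lane i ^ 8. If the hardware subgroup only had 8 lanes, that partner would live in a different subgroup and could not be reached by a shuffle, which is why a subgroup size of at least 16 is needed here.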
I will use 'warp size' and 'subgroup size' as synonyms in this body. I will use 'warp size' when discussing CUDA API, otherwise subgroup size. (AMD maybe uses 'wavefront size'? e.g. https://rocm.docs.amd.com/projects/HIP/en/latest/reference/hardware_features.html )
On CUDA GPUs, using CUDA, the warp size is fixed at 32. It is never higher and never lower, on any currently available GPU and with any driver.
On Metal, subgroup size is fixed to be 32 too.
On AMD, wavefront size can be 32 or 64, depending on GPU version.
Vulkan itself does not fix a subgroup size; it depends on the device and driver. On Intel, it can be 8, 16, or 32.
On Vulkan, using NVidia, I discovered that if we initialize the Vulkan instance ~11 times, the subgroup size becomes 8 and cannot be changed through the Vulkan API, and thus the 16x16 tiles break.
For this last issue, I merged #465 , which re-uses the same vulkan instance across init/reset cycles, avoiding the bug. I believe I have a .cu script (quadrants-free) to reproduce the issue, if anyone is interested.
Now, at some point here, I thought that it would be good to ensure that the subgroup size is at least 16 when using the tiles. Intel, at least, will not work with the tiles if it chooses a subgroup size of 8. When I started this train of work/thought, I believed that NVidia Vulkan was defaulting to subgroup size 8. It was only later that I discovered this was a bug, which is worked around by #465 .
So, initially I thought, let's make the subgroup size default to 32, and apply that automatically, everywhere:
I wasn't entirely sure whether this would work for AMD. Maybe not on older GPUs? And I thought it might be nice to make the subgroup size user-controllable anyway.
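The policy discussed so far (default to 32, allow a user override, require at least 16 when the tiles are used) could be sketched roughly as below. This is an illustration only: `choose_subgroup_size`, its parameters, and the two constants are hypothetical names, not the quadrants API, and the `supported` set is assumed to come from whatever the backend reports for the device.

```python
# Hypothetical policy sketch; names are illustrative, not the quadrants API.
TILE_MIN_SUBGROUP_SIZE = 16   # the 16x16 shuffle tiles need 16 lanes per subgroup
DEFAULT_SUBGROUP_SIZE = 32    # the default considered in this discussion


def choose_subgroup_size(supported, requested=None, uses_tiles=False):
    """Pick a subgroup size from the sizes the device reports as supported."""
    size = DEFAULT_SUBGROUP_SIZE if requested is None else requested
    if size not in supported:
        raise ValueError(
            f"subgroup size {size} is not supported; device offers {sorted(supported)}"
        )
    if uses_tiles and size < TILE_MIN_SUBGROUP_SIZE:
        raise ValueError(
            f"tiles need a subgroup size of at least {TILE_MIN_SUBGROUP_SIZE}, got {size}"
        )
    return size


print(choose_subgroup_size({8, 16, 32}))                # default applies: 32
print(choose_subgroup_size({32, 64}, requested=64))     # user override: 64
try:
    choose_subgroup_size({64})                          # device without 32
except ValueError as err:
    print(err)
```

The last call shows the open question: a hard default of 32 fails on any device that does not offer 32, so the default either needs a fallback or the assumption that every target GPU supports 32 has to hold.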
So I created #455 which provides for:
qd.loop_config
Doc here: https://github.com/Genesis-Embodied-AI/quadrants/blob/hp/vulkan-subgroup-size-fix/docs/source/user_guide/subgroup_size.md
This would allow people to write (conceptually), something like:
and similar.
However, what started off as a smallish PR (initially just setting the default to 32) has now ballooned a lot. And in the end I'm still assuming that all GPUs support subgroup size 32 anyway.
So, I'm wondering a few things:
I'm leaning towards either: we don't need this PR at all for now, or it might be sufficient to simply set the default subgroup size to 32 for now, to avoid introducing unnecessary code into our codebase.
Thoughts?