subgroup_size support #479

hughperkins · 2026-04-14T10:46:43Z

hughperkins
Apr 14, 2026
Maintainer

For the register-only shuffle tiles, https://github.com/Genesis-Embodied-AI/quadrants/blob/hp/tiles-5/docs/source/user_guide/tile16.md , we need a subgroup size, https://www.khronos.org/blog/vulkan-subgroup-tutorial , of at least 16. This is because all 16 threads must run together in a single warp so that they can shuffle values between each other, which avoids using shared memory (or global memory) for communication.
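To make the constraint concrete, here is a minimal plain-Python model. It is not Quadrants, CUDA or Vulkan code, and the names `subgroup_of`, `can_shuffle` and `tile_is_shuffle_only` are hypothetical. It assumes that lanes are packed into subgroups in consecutive blocks of `subgroup_size`, and that a shuffle can only read a register from a lane in the same subgroup.

```python
def subgroup_of(lane: int, subgroup_size: int) -> int:
    """Index of the subgroup a lane falls into, assuming consecutive packing."""
    return lane // subgroup_size


def can_shuffle(src_lane: int, dst_lane: int, subgroup_size: int) -> bool:
    """In this model, a register shuffle only works within one subgroup."""
    return subgroup_of(src_lane, subgroup_size) == subgroup_of(dst_lane, subgroup_size)


def tile_is_shuffle_only(num_threads: int, subgroup_size: int) -> bool:
    """True if every pair of the tile's threads can exchange data by shuffle."""
    return all(
        can_shuffle(a, b, subgroup_size)
        for a in range(num_threads)
        for b in range(num_threads)
    )
```

Under these assumptions, 16 threads fit in one subgroup of size 16 or 32, but with a subgroup size of 8 lanes 0-7 and lanes 8-15 land in different subgroups and would need shared or global memory to communicate.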

I will use 'warp size' and 'subgroup size' as synonyms in this body. I will use 'warp size' when discussing CUDA API, otherwise subgroup size. (AMD maybe uses 'wavefront size'? e.g. https://rocm.docs.amd.com/projects/HIP/en/latest/reference/hardware_features.html )

On CUDA GPUs, using CUDA, the warp size is fixed at 32. It is never higher and never lower, on any currently available GPU and with any driver.

On Metal, subgroup size is fixed to be 32 too.

On AMD, wavefront size can be 32 or 64, depending on GPU version.

Vulkan itself does not define the subgroup size; it depends on the device and driver. On Intel, it can be 8, 16 or 32.
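The sizes listed above can be collected into a small lookup table to see which backends could end up below the 16 threads the tiles need. This is only a sketch: the values are the ones stated in this post (not queried from any driver), and the keys and the function name `backends_at_risk` are made up for illustration.

```python
# Possible subgroup sizes per backend, as described in this post.
# Keys and names are illustrative, not a Quadrants API.
POSSIBLE_SUBGROUP_SIZES = {
    "cuda": {32},
    "metal": {32},
    "amd": {32, 64},
    "vulkan_intel": {8, 16, 32},
}

# Number of threads that must share one subgroup for the shuffle tiles.
TILE_THREADS = 16


def backends_at_risk(table=POSSIBLE_SUBGROUP_SIZES, required=TILE_THREADS):
    """Backends whose smallest possible subgroup size is below `required`."""
    return sorted(name for name, sizes in table.items() if min(sizes) < required)
```

With the values above, `backends_at_risk()` returns only `["vulkan_intel"]`, which matches the concern that Intel may pick a subgroup size of 8.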

On Vulkan, using NVidia, I discovered that if we initialize the Vulkan instance ~11 times, the subgroup size becomes 8 and cannot be changed through the Vulkan API, and thus the 16x16 tiles break.

For this last issue, I merged #465 , which reuses the same Vulkan instance across init/reset cycles, avoiding the bug. I believe I have a .cu script (quadrants-free) to reproduce the issue, if anyone is interested.
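The general idea of reusing one instance across init/reset cycles can be sketched as a simple create-once cache. This is not the code from #465: `get_or_create_instance` is a hypothetical name, and `create` stands in for whatever actually builds the Vulkan instance.

```python
from typing import Callable, Optional

# Module-level cache: survives repeated init/reset cycles in the same process.
_cached_instance: Optional[object] = None


def get_or_create_instance(create: Callable[[], object]) -> object:
    """Create the instance on first use, then hand back the same object."""
    global _cached_instance
    if _cached_instance is None:
        _cached_instance = create()
    return _cached_instance
```

However many times init/reset runs, `create` is only called once, so the process never reaches the ~11th instance creation that triggers the problem described above.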

Now, at some point here, I thought that it would be good to ensure that the subgroup size is at least 16 when using the tiles. At the very least, Intel will not work with the tiles if it chooses a subgroup size of 8. When I started this train of work/thought, I believed that NVidia Vulkan was defaulting to subgroup size 8. It was only later that I discovered that this was a bug, which is worked around by #465 .

So, initially I thought, let's make the subgroup size default to 32, and apply that automatically, everywhere:

on NVidia this will simply work
on Metal this will simply work
on AMD, this seems to be compatible with at least recent AMD GPUs (I have not checked this point rigorously)
on Intel, using Vulkan, Intel does support 32 (as well as 8 and 16), so this should work too

I wasn't entirely sure whether this would work for AMD. Maybe not on older GPUs? And I thought it might be nice to make the subgroup size user-controllable anyway.

So I created #455 which provides for:

configurable subgroup size, in qd.loop_config
function to get minimum subgroup size (from driver for Vulkan, hard-coded for NVidia and Metal)
and function to get maximum subgroup size

Doc here https://github.com/Genesis-Embodied-AI/quadrants/blob/hp/vulkan-subgroup-size-fix/docs/source/user_guide/subgroup_size.md

This would allow people to write (conceptually), something like:

@qd.loop_config(subgroup_size=max(16, get_min_subgroup_size()))

and similar.
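As a rough sketch of the selection logic behind that expression, the helper below picks the smallest allowed size that still fits the tiles. The function name `choose_subgroup_size` is hypothetical; `min_size` and `max_size` stand in for the values the min/max query functions from #455 would return, and the sketch assumes that every power of two between them is a valid choice.

```python
def choose_subgroup_size(min_size: int, max_size: int, required: int = 16) -> int:
    """Smallest allowed subgroup size that is at least `required`.

    Assumes min_size, max_size and required are powers of two, and that the
    device accepts any power of two in [min_size, max_size].
    """
    if min_size > max_size:
        raise ValueError(f"invalid range: min {min_size} > max {max_size}")
    if required > max_size:
        raise ValueError(
            f"device supports at most {max_size} lanes per subgroup, "
            f"but {required} are required"
        )
    return max(required, min_size)
```

For example, a device reporting a range of 8 to 32 would get 16, a device fixed at 32 would get 32, and a device limited to 8 would raise instead of silently breaking the tiles.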

However, what started off as a small-ish PR (initially just setting the default to 32) has now ballooned a lot. And in the end I'm still assuming that all GPUs support subgroup size 32 anyway.

So, I'm wondering a few things:

do we even need to support Intel by creating such a PR at all?
is it sufficient to simply set the default subgroup size to 32 for now (this would avoid issues on Intel)?
or do we actually want to provide the full-blown support for configurable subgroup size?

I'm leaning towards either not needing this PR at all for now, or simply setting the default subgroup size to 32 for now, to avoid introducing unnecessary code into our codebase.

Thoughts?

v01dXYZ · 2026-04-14T11:00:48Z

v01dXYZ
Apr 14, 2026

For AMD, GCN had 64-wide wavefronts (64 lanes, not 64 bits). Same with CDNA. It's only with RDNA that they went to 32-wide wavefronts.

I don't think it's worth supporting AMD GPUs older than GCN, as they were VLIW.

1 reply

hughperkins Apr 14, 2026
Maintainer Author

Interestingly, I just checked, and my current PR is a no-op for AMDGPU anyway:

it allows the user to set 32 or 64, without checking the hardware
throws an exception for any other values
does nothing with the values
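In other words, the AMDGPU path currently behaves roughly like the validation-only sketch below. This is an illustration of the behaviour described in the list above, not the PR's actual code; the names and the choice of `ValueError` are assumptions.

```python
# Values the AMDGPU path accepts, as described above (no hardware check).
ALLOWED_AMDGPU_SUBGROUP_SIZES = (32, 64)


def check_amdgpu_subgroup_size(subgroup_size: int) -> None:
    """Reject unsupported values; accepted values are otherwise ignored."""
    if subgroup_size not in ALLOWED_AMDGPU_SUBGROUP_SIZES:
        raise ValueError(
            f"subgroup_size must be one of {ALLOWED_AMDGPU_SUBGROUP_SIZES}, "
            f"got {subgroup_size}"
        )
    # Nothing else happens here: the value is not passed on to code generation.
```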

This pushes me more towards not merging the current PR at all, or simply setting the Vulkan subgroup size to 32, by default, and not doing anything else.
