As noted by upstream, this is expected: the suggested launch configuration is chosen so that it exactly covers the input space. Since our kernels perform bounds checks at run time, we don't need an exact cover and can use more relaxed launch configurations. A workaround is implemented in #431, but once there's a new driver release we should use the Level Zero extension to query the maximum launch configuration for a given kernel.
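To illustrate the difference, here is a minimal sketch in plain Python. It is a toy model, not the driver's actual heuristic: `exact_cover_groupsize`, `relaxed_launch`, and `run_kernel` are hypothetical helpers, and the only assumption taken from the discussion above is that the suggested group size has to divide the input size exactly. Under that assumption, a prime-sized input larger than the maximum group size can only be covered by groups of a single thread, whereas a relaxed configuration rounds the number of groups up and relies on a bounds check.

```python
def exact_cover_groupsize(n, max_groupsize):
    """Toy model: largest group size <= max_groupsize that divides n exactly."""
    for size in range(min(n, max_groupsize), 0, -1):
        if n % size == 0:
            return size
    return 1  # only reached if n < 1


def relaxed_launch(n, max_groupsize):
    """Relaxed configuration: full-sized groups, group count rounded up."""
    groupsize = min(n, max_groupsize)
    groups = -(-n // groupsize)  # ceiling division
    return groupsize, groups


def run_kernel(data, groupsize, groups):
    """Simulate a kernel that doubles each element, guarded by a bounds check."""
    out = list(data)
    for global_id in range(groups * groupsize):
        if global_id < len(data):  # surplus threads in the last group do nothing
            out[global_id] = 2 * data[global_id]
    return out


if __name__ == "__main__":
    n = 1009  # prime, and larger than the assumed maximum of 512 threads
    print(exact_cover_groupsize(n, 512))
    print(relaxed_launch(n, 512))
```

With the relaxed configuration the last group is only partially used, which is harmless as long as every kernel checks its global index against the input size.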
maleadt changed the title from "Confusing suggest_groupsize results" to "Launch configuration: use ZE_extension_kernel_max_group_size_properties" on Apr 22, 2024
With prime-sized inputs the suggested group size always consists of only a single thread:
But even with non-prime-sized inputs, the suggested configuration looks highly suboptimal:
(this kernel can launch groups of 512 threads on this system)
Maybe I'm misinterpreting the use of this API? I thought it was a counterpart of the CUDA occupancy API (cuOccupancyMaxPotentialBlockSize), suggesting a group size that achieves reasonable occupancy.