clinfo wrongly reports max work group size 256 #330
This is the result of a number of considerations, including performance and misbehaving programs that neither use the reqd_work_group_size kernel attribute nor call clGetKernelWorkGroupInfo. By using the kernel attribute you can indeed have a work-group "volume" as high as 1024, but of course this will reduce the number of registers available per work-item.
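For illustration, a minimal sketch of the kernel attribute being referred to; the kernel body and names here are hypothetical, only the attribute line is the point:

```c
// OpenCL C: the attribute promises the compiler a fixed work-group
// shape, so it can budget registers for 1024 work-items per CU.
__kernel __attribute__((reqd_work_group_size(1024, 1, 1)))
void scale(__global float *data, const float factor)
{
    size_t gid = get_global_id(0);
    data[gid] *= factor;
}
```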
Hi @preda, (Edit: this was written in parallel with b-sumner's post, so I apologize for the needless extra information.) This may not be communicated in the clearest way possible, but the max workgroup size being reported here is found by querying the runtime with clGetDeviceInfo for CL_DEVICE_MAX_WORK_GROUP_SIZE.

The AMD GCN ISA allows any thread to address up to 256 4-byte vector registers; in other words, each thread can use up to 1024 bytes of VGPR space. Every thread in a workgroup must run on the same compute unit (because, for instance, they must be able to communicate through that CU's local memory/LDS). Each compute unit in our existing microarchitectures has 256 KB of registers. With 256 KB of registers and each thread able to use up to 1 KB, we can only guarantee that a worst-case kernel can run 256 threads on a single CU. The runtime can't say anything more in the general case.

However, if you have a kernel that does not use all of its available register space, the GCN architecture can fit more wavefronts in a workgroup. In the Vega ISA manual, Section 4.3 describes how up to 16 64-wide wavefronts can be combined into a workgroup. This means that, if you limit the kernel's dynamic resource requests such that 1024 threads can fit into a single CU, the hardware can run the larger workgroup. For instance, if each thread only requests 64 VGPRs, then each thread will only request 256 bytes of VGPR space. This would allow 1024 threads to fit within a single CU. However, the runtime can only know if this is possible after the kernel has been compiled and it has laid out its VGPR allocation size.

As such, one way to request this "larger than maximum workgroup size" is to call clEnqueueNDRangeKernel() with localWorkSize set to NULL (so that the runtime does not complain about the size being larger than the maximum workgroup size), but writing/compiling the kernel using the reqd_work_group_size() attribute. The attribute will tell the compiler that you want to avoid using too many VGPRs, and thus 1024 threads from this kernel will be able to fit within a single workgroup and CU.

I'm not an expert at this, so there may be other, deeper, reasons for this decision. However, does this explanation make sense for why clinfo is reporting what it shows?
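A minimal host-side sketch of that workaround, assuming `queue` and `kernel` were created elsewhere and the kernel was built with reqd_work_group_size(1024, 1, 1); error handling is reduced to the return code:

```c
#include <CL/cl.h>

/* Enqueue with local_work_size = NULL so the runtime does not reject
 * the launch against the conservative 256 limit up front; the compiled
 * kernel's reqd_work_group_size attribute supplies the actual shape. */
cl_int launch_with_null_local_size(cl_command_queue queue,
                                   cl_kernel kernel,
                                   size_t global_size)
{
    return clEnqueueNDRangeKernel(queue, kernel,
                                  1,            /* work_dim */
                                  NULL,         /* global offset */
                                  &global_size, /* global size */
                                  NULL,         /* local size: left to runtime */
                                  0, NULL, NULL);
}
```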
Firstly, thanks for the explanations; it makes sense, but I still have these observations. As I understand it, there is a different, "hard" workgroup-size maximum, which is 1024: no matter how small a kernel is, the hardware can't execute a workgroup with more than 1024 threads. How is this value queried or reported by clinfo? So there are two limits: the conservative 256 that is guaranteed to work for any kernel, and the hard hardware maximum of 1024 that register-light kernels can reach.
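For reference, a sketch of how both device-level limits show up through the standard API; `device` is assumed to be a valid cl_device_id obtained via clGetDeviceIDs, and error checking is omitted for brevity:

```c
#include <stdio.h>
#include <CL/cl.h>

/* Print the two device-level limits clinfo derives its output from. */
void print_work_group_limits(cl_device_id device)
{
    size_t max_wg_size;        /* conservative per-work-group cap (256 here) */
    size_t max_item_sizes[3];  /* per-dimension caps (1024/1024/1024 here) */

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(max_item_sizes), max_item_sizes, NULL);

    printf("Max work group size: %zu\n", max_wg_size);
    printf("Max work items: %zu x %zu x %zu\n",
           max_item_sizes[0], max_item_sizes[1], max_item_sizes[2]);
}
```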
(IMO, I'd still say that "Max work group size" is 1024.) @jlgreathouse: the OpenCL spec says: "The work-group size to be used for kernel can also be specified in the program source using the __attribute__((reqd_work_group_size(X, Y, Z))) qualifier. In this case the size of work group specified by local_work_size must match the value specified by the reqd_work_group_size attribute qualifier." This seems to indicate that local_work_size can't be NULL when using reqd_work_group_size(), at least according to the spec.
I would say that you should look at the per-dimension "Max work items" values (CL_DEVICE_MAX_WORK_ITEM_SIZES) for that hard 1024 limit.

You're right, there are multiple limits in play, and that's why it's a bit unclear exactly what to report for "max work group size". One of the reasons I'm writing out these explanations here is that I don't think you want all of these details dumped to the screen every time you run clinfo.

However, if we reported the maximum work group size as 1024, and then the runtime failed when you tried to enqueue a kernel with 512 threads per workgroup, I would wager we would get a torrent of complaints about the runtime being broken. My personal opinion (not any official stance of AMD) is that the current "max work group = 256" description prevents this.

As for the reqd_work_group_size() spec question: the current ROCm OpenCL runtime allows you to set local_work_size to NULL even when the kernel carries the attribute, in which case the attribute's size is used.
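Since, as noted above, the runtime only knows a kernel's real limit after compilation, the per-kernel capability can be queried with clGetKernelWorkGroupInfo. A sketch, assuming `kernel` and `device` are valid and omitting error checking:

```c
#include <CL/cl.h>

/* Ask the runtime what this particular compiled kernel can sustain,
 * which accounts for its actual VGPR allocation. */
size_t query_kernel_max_wg_size(cl_kernel kernel, cl_device_id device)
{
    size_t kernel_max = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_max), &kernel_max, NULL);
    return kernel_max;  /* may exceed 256 for register-light kernels */
}
```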
OK, thanks! I'm going to close this issue. The information contained here is useful and may be extracted into some documentation.
ROCm 1.7, Ubuntu 17.10, Vega 64:
clinfo reports:

```
Max work items dimensions: 3
Max work items[0]:         1024
Max work items[1]:         1024
Max work items[2]:         1024
Max work group size:       256
```
In particular, the "max workgroup size" does not appear to be correct, as I can execute workgroups with larger sizes just fine (I tried 512).
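A sketch of that reproduction: enqueue with an explicit local size of 512, above the reported maximum of 256. `queue` and `kernel` are assumed to be valid; on the reporter's setup this apparently succeeds rather than failing with CL_INVALID_WORK_GROUP_SIZE:

```c
#include <CL/cl.h>

/* Launch with a local size larger than the reported 256 cap. */
cl_int try_local_512(cl_command_queue queue, cl_kernel kernel)
{
    size_t global = 1 << 20;  /* assumed global size, multiple of 512 */
    size_t local  = 512;      /* larger than the reported maximum */
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global, &local, 0, NULL, NULL);
}
```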