Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clinfo wrongly reports max work group size 256 #330

Closed
preda opened this issue Feb 8, 2018 · 5 comments
Closed

clinfo wrongly reports max work group size 256 #330

preda opened this issue Feb 8, 2018 · 5 comments

Comments

@preda
Copy link

preda commented Feb 8, 2018

ROCm 1.7, Ubuntu 17.10, Vega 64:
clinfo reports:
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256

In particular the "max workgroup size" does not appear to be correct as I can execute workgroups with larger sizes just fine (tried 512).

@b-sumner
Copy link
Collaborator

This is the result of a number of considerations including performance and misbehaving programs that do not use the reqd_work_group_size kernel attribute and do not call clGetKernelWorkGroupInfo. By using the kernel attribute, you can indeed have a work group "volume" as high as 1024, but of course this will affect the number of registers available per work-item.

@jlgreathouse
Copy link
Collaborator

jlgreathouse commented Feb 20, 2018

Hi @preda,

(Edit: this was written in parallel with b-sumner's post, so I apologize for the needless extra information.)

This may not be communicated in the clearest way possible, but the max workgroup size being reported here is found by querying the runtime using clGetDeviceInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE). This is runtime-level information, which can't make any guarantees about what kind of kernel will be be asked to run.

The AMD GCN ISA allows any thread to address up to 256 4-byte vector registers; in other words, each thread can use up to 1024 bytes of VGPR space. Every thread in a workgroup must run on the same compute unit (because, for instance, they must be able to communicate through that CU's local memory/LDS). Each compute unit in our existing microarchitectures has 256 KB of registers.

With 256 KB of registers and each thread able to use up to 1 KB, we can only guarantee that a worst-case kernel can run 256 threads on a single CU. The runtime can't say anything more in the general case.

However, if you have a kernel that does not use all of its available register space, the GCN architecture can fit more wavefronts in a workgroup. In the Vega ISA manual, Section 4.3 describes how up to 16 64-wide wavefronts can be combined into a workgroup. This means that, if you limit the kernel's dynamic resource requests such that 1024 threads can fit into a single CU, the hardware can run the larger workgroup.

For instance, if each thread only requests 64 VGPRs, then each thread will only requests 256 bytes of VGPR space. This would allow 1024 threads to fit within a single CU. However, the runtime can only know if this is possible after the kernel has been compiled and it has laid out its VGPR allocation size.

As such, one way to request this "larger than maximum workgroup size" is to call clEnqueueNDRangeKernel() with localWorkSize set to NULL (so that the runtime does not complain about the size being larger than the maximum workgroup size), but writing/compiling the kernel using __attribute__((reqd_work_group_size(X, Y, Z))), where X, Y, and Z make the total thread-count > 256. For instance, __attribute__((reqd_work_group_size(1024, 1, 1))).

The reqd_work_group_size() attribute will tell the compiler that you want to avoid using too many VGPRs and thus 1024 threads from this kernel will be able to fit within a single workgroup and CU.

I'm not an expert at this, so there may be other, deeper, reasons for this decision. However, does this explanation make sense for why clinfo is reporting what it shows?

@preda
Copy link
Author

preda commented Feb 20, 2018

Firstly, thanks for the explanations, it makes sense, but I still have these observations:

As I understand, there is a different, "hard" workgroup-size maximum, which is 1024. No matter how small a kernel is, the hardware can't execute a workgroup with more that 1024 threads. How is this value queried or reported by clinfo?

So there are two limits:

  • any kernel can be run with workgroup-size <= 256,
  • some kernels can be run with workgroup-size <= 1024 (if the kernel VGPR usage allows it).
  • no kernel can be run with workgroup-size > 1024.

(IMO, I'd still say that "Max work group size" is 1024)

@jlgreathouse:
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html
says

The work-group size to be used for kernel can also be specified in the program source using the attribute ((reqd_work_group_size(X, Y, Z))) qualifier. In this case the size of work group specified by local_work_size must match the value specified by the reqd_work_group_size attribute qualifier.

Which seems to indicate that local_work_size can't be NULL when using reqd_work_group_size(), at least according to the spec.

@jlgreathouse
Copy link
Collaborator

I would say that you should look at the Max work items[] outputs of clinfo to understand the maximum work items that the hardware limits you to. This is found by querying each of the dimensions of CL_DEVICE_MAX_WORK_ITEM_SIZES.

You're right, there are multiple limits in play, and that's why it's a bit non-clear exactly what to report for "max work group size". One of the reasons I'm writing out these explanations here is because I don't think you want all of these details dumped to the screen very time you run clinfo. :)

However, if we reported the maximum work group size as 1024, and then the runtime failed when you tried to enqueue a kernel with 512 threads per workgroup, I would wager we would get a torrent of complaints about the runtime being broken. My personal opinion (not any official stance of AMD) is that the current "max work group = 256" description prevents this.

As for the reqd_work_group_size() attribute, my previous post was overly specific to say that you needed to set local_work_size to NULL.

The current ROCm OpenCL runtime allows you to set local_work_size to NULL and it will run with the reqd_work_group_size() of > 256 threads. You can also set local_work_size to the same dimensions as your reqd_work_group_size() attribute and it will work. (e.g. setting local_work_size={1024,1,1} and reqd_work_group_size(1024,1,1)). If you don't set reqd_work_group_size(), however, requesting local_work_size={1024,1,1} will fail.

@preda
Copy link
Author

preda commented Feb 20, 2018

OK, thanks! I'm going to close this issue. The information contained here is useful, and may be extracted to some documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants