Why is HSA_ISA_INFO_WORKGROUP_MAX_SIZE hardcoded to 1024? #55

Closed
smartbitcoin opened this issue Mar 1, 2019 · 5 comments

@smartbitcoin

As I understand it, hcc can only use 1024 threads per workgroup. This is because libhsa_runtime.so returns 1024 as the maximum for HSA_ISA_INFO_WORKGROUP_MAX_SIZE.
What puzzles me is that the maximum number of wavefronts is 40 and the wavefront size is 64, which means HSA_ISA_INFO_WORKGROUP_MAX_SIZE could actually be 40*64 = 2560.
In the OpenCL API the maximum is 256 threads and in hcc it is 1024, but that still does not seem right to me, since it could actually be 2560, right?
Why does ROCR choose 1024 instead of 2560, which is the hardware's maximum thread count?

@skeelyamd
Collaborator

Workgroups also consume other resources, such as LDS, and those enforce finer partitioning than thread count alone.
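
To make that concrete, here is a minimal sketch of how LDS usage can cap the number of resident workgroups below what the wavefront slots alone would allow. The 64 KiB of LDS per CU is an assumption about a GCN/Vega compute unit, and `resident_workgroups` is a hypothetical helper, not part of ROCR.

```python
import math

LDS_BYTES_PER_CU = 64 * 1024   # assumption: 64 KiB of LDS per GCN compute unit
WAVE_SLOTS_PER_CU = 40         # from the question: 40 wavefronts per CU
WAVEFRONT_SIZE = 64            # from the question: 64 threads per wavefront


def resident_workgroups(workgroup_size, lds_bytes_per_group):
    """Workgroups that fit on one CU, considering wave slots and LDS only."""
    waves_per_group = math.ceil(workgroup_size / WAVEFRONT_SIZE)
    by_waves = WAVE_SLOTS_PER_CU // waves_per_group
    if lds_bytes_per_group == 0:
        return by_waves
    by_lds = LDS_BYTES_PER_CU // lds_bytes_per_group
    return min(by_waves, by_lds)


# 256-thread groups: 10 fit by wave slots, but 16 KiB of LDS each caps it at 4.
print(resident_workgroups(256, 0))          # 10
print(resident_workgroups(256, 16 * 1024))  # 4
```

Registers and other per-CU resources are ignored here; the only point is that the limiting factor is not necessarily the thread count.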

@jlgreathouse
Contributor

Hi @smartbitcoin

You may be interested in this post I made last year that details the various workgroup size limitations that AMD GPUs have. In particular, the AMD GCN ISA only allows up to 1024 threads in a workgroup. That is a hardware ISA limitation.

However, you may not always be able to fit a 1024-thread workgroup into a compute unit (e.g. if you request 256 VGPRs per thread, we can only fit 256 threads in a CU). So we can only guarantee that 256-thread workgroups will always work -- that is why the OpenCL API claims 256 (see the linked post for more details).
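
The VGPR example above can be sketched as a small calculation. This is only a rough model: it assumes 4 SIMDs per CU, 256 VGPRs per lane on each SIMD, and at most 10 wavefronts per SIMD, and it ignores register allocation granularity, SGPRs, and LDS. The function name is hypothetical.

```python
SIMDS_PER_CU = 4            # assumption: 4 SIMD units per GCN compute unit
VGPRS_PER_SIMD_LANE = 256   # assumption: 256 VGPRs available per lane on each SIMD
MAX_WAVES_PER_SIMD = 10     # assumption: 40 wavefronts per CU spread over 4 SIMDs
WAVEFRONT_SIZE = 64         # threads per wavefront


def max_threads_per_cu(vgprs_per_thread):
    """Upper bound on resident threads per CU when only VGPRs are limiting."""
    if not 1 <= vgprs_per_thread <= VGPRS_PER_SIMD_LANE:
        raise ValueError("vgprs_per_thread must be between 1 and 256")
    waves_per_simd = min(MAX_WAVES_PER_SIMD,
                         VGPRS_PER_SIMD_LANE // vgprs_per_thread)
    return waves_per_simd * SIMDS_PER_CU * WAVEFRONT_SIZE


for vgprs in (24, 64, 128, 256):
    print(vgprs, max_threads_per_cu(vgprs))
# 24 2560
# 64 1024
# 128 512
# 256 256
```

Under these assumptions, a 1024-thread workgroup only fits when each thread uses at most 64 VGPRs, while a 256-thread workgroup fits even at the maximum of 256 VGPRs per thread.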

While you can fit at most 2560 threads into a CU, those 2560 threads cannot all be in the same workgroup.
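
As a rough illustration of the difference between the two limits, the sketch below counts how many workgroups of a given size can share the 40 wavefront slots of one CU. It considers wave slots only (no registers or LDS), and the constant and function names are hypothetical, chosen to mirror the values discussed in this thread.

```python
WAVEFRONT_SIZE = 64            # threads per wavefront (from the question)
WAVE_SLOTS_PER_CU = 40         # wavefronts resident per CU (from the question)
ISA_WORKGROUP_MAX_SIZE = 1024  # value reported for HSA_ISA_INFO_WORKGROUP_MAX_SIZE


def groups_to_fill_cu(workgroup_size):
    """Return (resident_groups, resident_threads) for one CU, wave slots only."""
    if not 1 <= workgroup_size <= ISA_WORKGROUP_MAX_SIZE:
        raise ValueError("workgroup size must be between 1 and the ISA limit")
    waves_per_group = -(-workgroup_size // WAVEFRONT_SIZE)  # ceiling division
    groups = WAVE_SLOTS_PER_CU // waves_per_group
    return groups, groups * workgroup_size


print(groups_to_fill_cu(256))   # (10, 2560)
print(groups_to_fill_cu(512))   # (5, 2560)
print(groups_to_fill_cu(1024))  # (2, 2048)
```

In this model, reaching 2560 resident threads takes several smaller workgroups; two 1024-thread workgroups occupy 32 of the 40 slots, and a third does not fit.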

@smartbitcoin
Author

@jlgreathouse Thanks for pointing me to the ISA. I checked chapter 4.3 and you are right: it is a hardware limitation, not a resource issue.
Packing 2560 threads into one workgroup with proper LDS and VGPR allocation would work for plenty of algorithms, and I verified that LLVM is able to generate a binary for that. I am still curious why the Vega ISA has this design, because 10 bits (that is, 1024) is not aligned to any boundary; maybe it is a hardware stack size limitation.

@smartbitcoin
Author

@jlgreathouse
I was also able to bump HSA_ISA_INFO_WORKGROUP_MAX_SIZE to 1280 and pair it with a properly LLVM-compiled kernel.

My experiment confirmed that it's a hardware thing, lol.

@smartbitcoin
Author

or firmware ...
