cross-lane operation intrinsics in OpenCL #189
Comments
In OpenCL there are intrinsics for uint __builtin_amdgcn_ds_swizzle(uint src, uint pattern); |
The exec mask can be obtained with ulong __builtin_amdgcn_read_exec(void); You can get the 64-bit result of a compare (e.g. a != b) using ulong __builtin_amdgcn_sicmp(int a, int b, int c), where c indicates the type of comparison: 32 := EQ, 33 := NE, 40 := LT. This can be used to produce ballot, any, and all. |
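To make the ballot/any/all derivation concrete, here is a minimal host-side sketch in plain C of the bit logic only. It does not call the AMD builtins; `model_sicmp`, `wave_ballot`, `wave_any`, `wave_all`, and the `CMP_*` names are hypothetical helpers introduced for illustration. It also assumes that inactive lanes contribute a 0 bit to the compare result. On a device, the two masks would instead come from `__builtin_amdgcn_sicmp` and `__builtin_amdgcn_read_exec`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Comparison codes quoted in the thread; the enum names are hypothetical. */
enum { CMP_EQ = 32, CMP_NE = 33, CMP_LT = 40 };

/* Host-side model of a wavefront-wide compare (assumption: bit i is set only
   when lane i is active in exec and the comparison holds for that lane). */
static uint64_t model_sicmp(const int *a, const int *b, int n_lanes,
                            int code, uint64_t exec)
{
    uint64_t mask = 0;
    for (int lane = 0; lane < n_lanes && lane < 64; ++lane) {
        bool hit = (code == CMP_EQ) ? (a[lane] == b[lane])
                 : (code == CMP_NE) ? (a[lane] != b[lane])
                 : (code == CMP_LT) ? (a[lane] <  b[lane])
                 : false;
        if (hit && ((exec >> lane) & 1u))
            mask |= (uint64_t)1 << lane;
    }
    return mask;
}

/* ballot: one bit per active lane whose predicate is true. */
static uint64_t wave_ballot(uint64_t cmp, uint64_t exec) { return cmp & exec; }

/* any: at least one active lane has a true predicate. */
static bool wave_any(uint64_t cmp, uint64_t exec) { return (cmp & exec) != 0; }

/* all: every active lane has a true predicate. */
static bool wave_all(uint64_t cmp, uint64_t exec) { return (cmp & exec) == exec; }
```

Comparing against the exec mask in `wave_all` matters because a wavefront with inactive lanes would otherwise never report "all", even when every active lane agrees.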
Great, thanks for the feedback! Can you point me to the right documentation on this? Also, not directly related, but I could really use some documentation on wavefront scheduling and instruction throughput/latency tables. Do such things exist? |
|
You're going outside the scope of standard language features when you directly call __builtin_*. These are implementation details that may change. Your best bet for checking support ahead of a compilation failure is to examine https://github.com/llvm-mirror/clang/blob/master/include/clang/Basic/BuiltinsAMDGPU.def for the compiler you are using. However, also note that whether these are actually implemented depends on the target, although all of the builtins mentioned here should be available on most targets except for swizzle and mov_dpp, of which you'll likely only want to use one anyway. What heterogeneous languages need are portable high-level constructs for cross-lane access, and until those are available, these low-level vendor- and device-specific mechanisms will have to be used. Standard OpenCL does at least provide some portable (for vendors that support the sub-group extension, like AMD) built-in functions that may satisfy some of your needs. |
I know, that's why I asked whether OpenCL extensions could be added.
That doesn't sound like a solution that can be used outside of tinkering, can it? I want to use this in production code that might be compiled with the ancient fglrx compiler by some and with the latest ROCm compilers by others. Without some means to test for compiler support for these builtins, I don't see how we can use them in GROMACS (without awkward workarounds in the build system to test the compiler).
I understand and have no problem with it. I care more about performance than pretty high-level constructs, I'd just like to be able to write e.g.
|
...or is there a clang-way to programmatically test for the existence of these intrinsics? |
It wouldn't make sense to have a multitude of OpenCL extensions for fine-grained hardware-specific features. I don't think you want the HAVE_* guards, particularly if more than one can actually be true. What you want, I think, is a guard like AMD_SWIZZLE_IS_BEST. You will be able to construct such guards when clang starts defining target macros. Clang does support a __has_builtin macro; see https://clang.llvm.org/docs/LanguageExtensions.html |
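As a rough illustration of the `__has_builtin` approach, the following preprocessor sketch checks for the swizzle builtin and falls back cleanly on compilers that lack `__has_builtin` altogether. The macro `HAVE_AMD_DS_SWIZZLE` and the helper `have_amd_ds_swizzle` are hypothetical names chosen for this example, not something clang or ROCm defines, and the sketch only tests whether the builtin is known to the compiler, not whether it is the best choice on a given device.

```c
/* Compilers without __has_builtin (e.g. older non-clang compilers) would fail
   on the #if below, so define a fallback that reports every builtin as absent. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* HAVE_AMD_DS_SWIZZLE is a hypothetical project-side macro name. */
#if __has_builtin(__builtin_amdgcn_ds_swizzle)
#define HAVE_AMD_DS_SWIZZLE 1
#else
#define HAVE_AMD_DS_SWIZZLE 0
#endif

/* Returns 1 when the swizzle-based code path can be compiled, 0 otherwise. */
static int have_amd_ds_swizzle(void)
{
    return HAVE_AMD_DS_SWIZZLE;
}
```

Kernel code could then select between a builtin-based and a local-memory-based collective with `#if HAVE_AMD_DS_SWIZZLE`, keeping the compiler probing out of the build system.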
Thanks! For our current code release the clang
First performance-related observation/question: is it reasonable to expect that a wavefront "any" collective implemented using sicmp does not outperform a shared memory-based implementation? |
I would not expect that. |
@b-sumner Looks like increased register usage could be the cause of a 30% performance drop in one of the kernels. Is it normal for sicmp to require an extra register? |
For the record, I get the same 25-30% regression when using sicmp on Vega too, same register usage (one extra wrt local mem-based collective). |
In particular, exposing ds_permute and ds_swizzle would be useful, but given that hcc defines shfl ops too, why not add those as well via some OpenCL extensions?
In addition, the wavefront voting intrinsics __any, __all, and __ballot would also be valuable.
As an intermediate solution, is it possible to somehow write an inline ASM equivalent of these, or somehow use hc alongside OpenCL in kernels?