Fix `cuda::dynamic_shared_memory` alignment #7868
Conversation
```cpp
# if _CCCL_CUDA_COMPILATION()

template <class _Tp>
extern __shared__ _Tp __cccl_device_dyn_smem[];
```
@bernhardmgruber had a hard time enforcing shared memory alignment (he even wrote a guide on that). I'm not 100% sure that variable templates work here; I would defer to Bernhard.
It should be working fine, see https://godbolt.org/z/z6WWEzd5s
This explains the problem in a practical way: https://godbolt.org/z/e5P7dsGdq
But what's the problem here? The output is right. See https://godbolt.org/z/EaY1bvP36
The code above is fine if you never pass a type with an alignment higher than 16. If you do, the compiler correctly emits an `.align N` specifier in the PTX, but that alignment is then lost in the backend when writing the binary if you compile with `-G` or `-rdc=true` before nvcc 13.1.
Here is the full (internal) story on dynamic SMEM: https://github.com/NVIDIA/cccl_private/wiki/Dynamic-shared-memory-alignment
Remark: it is absolutely critical that CCCL never passes a type with an alignment larger than 16 bytes itself to this variable template.
We could `static_assert(alignof(typename _Opt::value_type) <= 16)`. Is this something we want to enforce?
Where would you add the `static_assert`? It's fine if the user passes a type with higher alignment. The problem I am referring to is that any type with alignment > 16 increases the static shared memory padding for the entire TU, which can impact occupancy. CCCL must not cause such a change, but if the user causes it, it's not our fault ;)
🥳 CI Workflow Results: Finished in 1h 38m. Pass: 100%/99 | Total: 1d 00h | Max: 1h 08m | Hits: 96%/255238
```cpp
template <class _Tp>
extern __shared__ _Tp __cccl_device_dyn_smem[];
```
Remark: it's not necessary to pull the declaration outside the function `dynamic_shared_memory`.
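A sketch of that alternative, assuming the accessor's shape (this is not the actual CCCL source): in CUDA, an `extern __shared__` declaration can also live at block scope inside the function.

```cuda
// Sketch: block-scope declaration inside the accessor instead of a
// namespace-scope variable template.
template <class _Tp>
__device__ _Tp* dynamic_shared_memory()
{
  extern __shared__ _Tp __cccl_device_dyn_smem[]; // local declaration
  return __cccl_device_dyn_smem;
}
```

One known pitfall with block-scope `extern __shared__` declarations is that two declarations with different element types in the same TU can clash, since they all name the same underlying symbol; the namespace-scope variable template sidesteps that by giving each instantiation its own mangled name.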
Fixes #7867.