Fix `cuda::dynamic_shared_memory` alignment #7868
Conversation
```cpp
# if _CCCL_CUDA_COMPILATION()

template <class _Tp>
extern __shared__ _Tp __cccl_device_dyn_smem[];
```
@bernhardmgruber had a hard time enforcing shared memory alignment (he even wrote a guide on that). I'm not 100% sure that variable templates work here; I would defer to Bernhard.
It should be working fine, see https://godbolt.org/z/z6WWEzd5s
This explains the problem in a practical way: https://godbolt.org/z/e5P7dsGdq
But what's the problem here? The output is right. See https://godbolt.org/z/EaY1bvP36
The code above is fine if you never pass a type with an alignment higher than 16. If you do, the compiler correctly emits an `.align N` specifier in the PTX, but that alignment is then lost in the backend when writing the binary if you compile with `-G` or `-rdc=true` before nvcc 13.1.
Here is the full (internal) story on dynamic SMEM: https://github.com/NVIDIA/cccl_private/wiki/Dynamic-shared-memory-alignment
Remark: it is absolutely critical that CCCL never passes a type with an alignment larger than 16 bytes itself to this variable template.
We could `static_assert(alignof(typename _Opt::value_type) <= 16)`. Is this something we want to enforce?
Where would you add the `static_assert`? It's fine if the user passes a type with higher alignment. The problem I am referring to is that any type with alignment > 16 increases the static shared memory padding for the entire TU, which can impact occupancy. CCCL must not cause such a change, but if the user causes it, it's not our fault ;)
🥳 CI Workflow Results: Finished in 1h 38m. Pass: 100%/99 | Total: 1d 00h | Max: 1h 08m | Hits: 96%/255238
```cpp
template <class _Tp>
extern __shared__ _Tp __cccl_device_dyn_smem[];
```
Remark: it's not necessary to pull the declaration outside the function `dynamic_shared_memory`.
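A sketch of that alternative, assuming the accessor's shape (this is not the actual CCCL source): in CUDA, an `extern __shared__` declaration can also live at block scope inside the function.

```cuda
// Sketch: block-scope declaration inside the accessor instead of a
// namespace-scope variable template.
template <class _Tp>
__device__ _Tp* dynamic_shared_memory()
{
  extern __shared__ _Tp __cccl_device_dyn_smem[]; // local declaration
  return __cccl_device_dyn_smem;
}
```

One known pitfall with block-scope `extern __shared__` declarations is that two declarations with different element types in the same TU can clash, since they all name the same underlying symbol; the namespace-scope variable template sidesteps that by giving each instantiation its own mangled name.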
Fixes #7867.