
Fix cuda::dynamic_shared_memory alignment#7868

Open
davebayer wants to merge 1 commit into NVIDIA:main from davebayer:fix_dynamic_shared_memory

Conversation

@davebayer (Contributor):

Fixes #7867.

@davebayer davebayer requested a review from a team as a code owner March 3, 2026 18:03
@davebayer davebayer requested a review from fbusato March 3, 2026 18:03
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 3, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 3, 2026
@davebayer davebayer force-pushed the fix_dynamic_shared_memory branch from 844d9ba to c16d46a on March 3, 2026 18:04
@davebayer davebayer force-pushed the fix_dynamic_shared_memory branch from c16d46a to 7563850 on March 3, 2026 18:10
# if _CCCL_CUDA_COMPILATION()

template <class _Tp>
extern __shared__ _Tp __cccl_device_dyn_smem[];
Contributor:

@bernhardmgruber had a hard time enforcing shared memory alignment (he even wrote a guide on that). I'm not 100% sure that variable templates work here. I would defer to Bernhard.

Contributor (Author):

It should be working fine, see https://godbolt.org/z/z6WWEzd5s
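For reference, the construct under discussion can be sketched as follows (hypothetical names, not the exact CCCL code; the accessor mirrors the `cuda::dynamic_shared_memory` API being fixed). Each instantiation of the variable template gets its own extern shared symbol, for which the compiler emits an `.align` specifier matching `alignof(_Tp)`:

```cuda
// Sketch (hypothetical names): one extern shared array symbol per element
// type, obtained through a variable template.
template <class _Tp>
extern __shared__ _Tp __dyn_smem[];

template <class _Tp>
__device__ _Tp* dynamic_shared_memory()
{
  return __dyn_smem<_Tp>;
}

__global__ void kernel(float* out)
{
  // Launch with e.g. kernel<<<1, 32, 32 * sizeof(float)>>>(out);
  float* buf = dynamic_shared_memory<float>();
  buf[threadIdx.x] = static_cast<float>(threadIdx.x);
  __syncthreads();
  out[threadIdx.x] = buf[threadIdx.x];
}
```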

@fbusato (Contributor) commented Mar 3, 2026:

This explains the problem in a practical way: https://godbolt.org/z/e5P7dsGdq

Contributor (Author):

But what's the problem here? The output is right. See https://godbolt.org/z/EaY1bvP36

Contributor:

The code above is fine if you never pass a type with an alignment higher than 16 bytes. If you do, the compiler correctly emits an .align N specifier into the PTX, but that alignment is then lost in the backend when writing the binary if you compile with -G or -rdc=true before nvcc 13.1.

See: https://godbolt.org/z/oW6Kcanc6

Contributor:

Here is the full (internal) story on dynamic SMEM: https://github.com/NVIDIA/cccl_private/wiki/Dynamic-shared-memory-alignment

Contributor:

Remark: it is absolutely critical that CCCL itself never passes a type with an alignment larger than 16 bytes to this variable template.

Contributor (Author):

We could static_assert(alignof(typename _Opt::value_type) <= 16). Is this something we want to enforce?

Contributor:

Where would you add the static_assert? It's fine if the user passes a type with higher alignment. The problem I am referring to is that any type with alignment > 16 increases the static shared-memory padding for the entire TU, which can impact occupancy. CCCL must not cause such a change, but if the user causes it, it's not our fault ;)


github-actions bot commented Mar 3, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 38m: Pass: 100%/99 | Total: 1d 00h | Max: 1h 08m | Hits: 96%/255238

See results here.

Comment on lines +797 to +798
template <class _Tp>
extern __shared__ _Tp __cccl_device_dyn_smem[];
Contributor:

Remark: it's not needed to pull the declaration outside the function dynamic_shared_memory.


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[BUG]: cuda::dynamic_shared_memory(config) may return a misaligned pointer

3 participants