PTX: Add helper functions for dsmem #1336

ahendriksen · 2024-02-05T08:41:50Z

Description

This PR prepares the cuda::ptx code for the addition of more instructions. It adds:

The cuda::ptx::n32 integral constant type that can be used with "n"-annotated inline PTX instructions.
The __as_ptr_dsmem function, for pointers that may point to local or remote distributed shared memory.
The __from_ptr_* functions, which map back from a state-space-specific pointer to a generic pointer.
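
To illustrate the idea behind an integral constant type such as cuda::ptx::n32, here is a minimal host-side sketch. It assumes the type is a thin alias over an integral constant; the names `hypothetical::n32_t` and `shift_left_by` are invented for illustration and are not taken from this PR, and the real definition may differ in name and underlying type.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

namespace hypothetical {

// Assumption: a thin alias over std::integral_constant. The real
// cuda::ptx definition may use a different name or underlying type.
template <std::uint32_t N>
using n32_t = std::integral_constant<std::uint32_t, N>;

// Hypothetical wrapper: the shift amount travels in the argument's type, so N
// can be used wherever a compile-time constant is required. In device code,
// this is the position where it could be bound to an "n" asm operand.
template <std::uint32_t N>
constexpr std::uint32_t shift_left_by(n32_t<N>, std::uint32_t value)
{
  static_assert(N < 32, "shift amount must be smaller than the register width");
  return value << N;
}

// Small usage check: the constant is passed as an ordinary function argument.
inline void usage_example()
{
  assert(shift_left_by(n32_t<3>{}, 1u) == 8u);
}

} // namespace hypothetical
```

Passing the constant as a tagged argument keeps the call site looking like a normal function call, while an out-of-range value is rejected at compile time rather than at run time.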

In addition, the error functions that generate link-time errors in case of an architecture mismatch have been renamed and now all return void. While writing new instruction wrappers, it turned out to be unsustainable to have the error functions produce a return value, especially in templates.
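
The pattern described above can be sketched roughly as follows. This is a host-side illustration under stated assumptions, not the code from this PR: the macro `HYPOTHETICAL_ARCH`, the error function name, and `hypothetical_wrapper` are all invented for the example. The point is that a void error function lets one fallback statement serve every return type of a templated wrapper.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the architecture check; in device code this would
// come from the compilation target rather than a hand-written macro.
#ifndef HYPOTHETICAL_ARCH
#  define HYPOTHETICAL_ARCH 900
#endif

// Declared but deliberately never defined: any call that survives to the
// linker fails with an undefined symbol whose name explains the problem.
extern "C" void hypothetical_instruction_is_not_supported_before_arch_900();

// Because the error function returns void, the same fallback statement works
// for every return type T; no per-type error overload is needed.
template <typename T>
T hypothetical_wrapper(T value)
{
#if HYPOTHETICAL_ARCH >= 900
  return value; // stands in for the inline PTX path
#else
  hypothetical_instruction_is_not_supported_before_arch_900();
  return T{}; // dummy value; a program that reaches this branch does not link
#endif
}

// Small usage check for the supported-architecture path.
inline void hypothetical_usage()
{
  assert(hypothetical_wrapper<std::uint32_t>(42u) == 42u);
}
```

If the error function had to return a value instead, each wrapper would need an error function whose return type matched its own, which is what becomes hard to maintain once the wrappers are templates.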

Finally, there are some whitespace and comment changes that make it easier to contribute code in the future.

Checklist

New or existing tests cover these changes.
No docs changes needed.

copy-pr-bot · 2024-02-05T08:41:54Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

miscco

Some minor nits

libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/ptx/ptx_helper_functions.h

libcudacxx/test/libcudacxx/cuda/ptx/ptx.mbarrier.arrive.compile.pass.cpp

...libcxx/include/__cuda/ptx/parallel_synchronization_and_communication_instructions_mbarrier.h

ahendriksen

Thanks for the review. I have implemented the requested changes. I have made a bigger update to the test code that removes the need for the non_eliminated_false function.

libcudacxx/test/libcudacxx/cuda/ptx/ptx.mbarrier.wait.compile.pass.cpp

libcudacxx/test/libcudacxx/cuda/ptx/ptx.red.async.compile.pass.cpp

libcudacxx/test/libcudacxx/cuda/ptx/ptx.st.async.compile.pass.cpp

ahendriksen · 2024-02-05T13:43:36Z

The spurious newlines have been removed. One minor addition is the use of .shared::cta in mbarrier.arrive for clarity.

ahendriksen · 2024-02-06T11:32:07Z

Is this one good to merge? I have added a follow-up PR: #1341

ahendriksen requested review from a team as code owners February 5, 2024 08:41

ahendriksen requested review from ericniebler and alliepiper February 5, 2024 08:41

ahendriksen added 11 commits February 5, 2024 09:57

Remove whitespace

2045b39

Remove deliberate downcast comment

253789d

Change name and return type of error function

e41be3b

Reserve section for special registers in ptx.h

9b1855c

Add more PTX helpers

6536cb7

- __as_ptr_dsmem
- __from_ptr_{smem, dsmem, remote_dsmem, gmem}

Add cuda::ptx::n32 type for compile-time constants

7001e37

Use __as_ptr_remote_dsmem

0c667d9

Use generated mbarrier.arrive test

ebb8aa7

Update asm comments in tests

ebb4299

Re-enable st.async test in C++11

e2afa00

Update docs

f60e347

ahendriksen force-pushed the integrate-ptx branch from c363edd to f60e347 Compare February 5, 2024 08:58

miscco reviewed Feb 5, 2024

ahendriksen added 4 commits February 5, 2024 10:22

Replace static_cast<bool>(0) with false

1d1f8b0

Delegate helper functions

070bee3

Simplify tests

26a24a3

Include cstddef for size_t

7d66f14

ahendriksen commented Feb 5, 2024

miscco approved these changes Feb 5, 2024

ahendriksen added 2 commits February 5, 2024 14:40

Remove newline in front of main

252daad

Use .shared::cta when possible

8cbdff7

The newer mbarrier.arrive instructions accept .shared::cta instead of .shared. Updated for clarity.

miscco approved these changes Feb 5, 2024

ahendriksen mentioned this pull request Feb 6, 2024

PTX: Add cuda::ptx::fence #1341

Merged

2 tasks

miscco merged commit ac83b5f into NVIDIA:main Feb 6, 2024
538 checks passed
