
PTX: Add cuda::ptx::cp_async_bulk_* #1403

Merged

merged 14 commits into NVIDIA:main from ptx-add-cp-async-bulk on Feb 26, 2024

Conversation

ahendriksen
Contributor

Add:

  • cp.async.bulk
  • cp.async.bulk.tensor
  • cp.reduce.async.bulk.tensor
  • cp.async.bulk.wait_group
  • cp.async.bulk.commit_group

Description

closes #1398, #1399, #1400, #1401, #1402
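As a quick illustration of how the new wrappers are meant to be called: the sketch below is not taken from this PR's tests, the parameter names and mbarrier handling are assumptions, and the exact overloads are documented in the libcu++ PTX docs.

```cuda
// Rough usage sketch (assumes -arch=sm_90 or newer, CUDA 12 / PTX ISA 8.0).
// Parameter names follow the PTX operand order; see the libcu++ PTX docs for
// the exact overloads.
#include <cuda/ptx>
#include <cstdint>

// Bulk copy global -> shared::cluster; completion is signaled on an mbarrier
// that the caller has initialized and later waits on.
__device__ void load_bulk(void* smem_dst, const void* gmem_src,
                          std::uint32_t size_bytes, std::uint64_t* smem_bar)
{
  // cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes
  cuda::ptx::cp_async_bulk(cuda::ptx::space_cluster, cuda::ptx::space_global,
                           smem_dst, gmem_src, size_bytes, smem_bar);
}

// Bulk copy shared -> global; completion is tracked through the bulk async-group.
__device__ void store_bulk(void* gmem_dst, const void* smem_src,
                           std::uint32_t size_bytes)
{
  // cp.async.bulk.global.shared::cta.bulk_group
  cuda::ptx::cp_async_bulk(cuda::ptx::space_global, cuda::ptx::space_shared,
                           gmem_dst, smem_src, size_bytes);
  cuda::ptx::cp_async_bulk_commit_group();
  // Wait until at most 0 bulk async-groups are still pending.
  cuda::ptx::cp_async_bulk_wait_group(cuda::ptx::n32_t<0>{});
}
```

As with the underlying PTX instructions, source and destination addresses must be suitably aligned (16 bytes) and the size a multiple of 16 bytes.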

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@ahendriksen
Contributor Author

Not sure if this is going to be caught by CI: the .multicast variants of cp.async.bulk{.tensor} are officially part of SM90, but in practice belong to (i.e. are only fast on) SM90a. As a result, ptxas emits the following advisory whenever a multicast instruction is used while compiling for sm_90:

Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used 
on .target 'sm_90a' instead of .target 'sm_90' as this feature is expected to have substantially 
reduced performance on some future architectures

However, since we compile with --warning-as-error, this advisory gets treated as an error:

#$ ptxas --warning-as-error -arch=sm_90 -m64  "/tmp/tmpxft_00009e7f_00000000-6_ptx.cp.async.bulk.compile.pass.ptx"  -o "/tmp/tmpxft_00009e7f_00000000-8_ptx.cp.async.bulk.compile.pass.cubin" 
ptxas /tmp/tmpxft_00009e7f_00000000-6_ptx.cp.async.bulk.compile.pass.ptx, line 154; error   : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
ptxas fatal   : Ptx assembly aborted due to errors
# --error 0xff --

Can we disable this behavior somehow in lit? @miscco
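For reference, one way to keep the advisory out of an sm_90 build is to restrict the multicast path to sm_90a compilations (nvcc -arch=sm_90a). A rough sketch follows; the use of __CUDA_ARCH_FEAT_SM90_ALL as the sm_90a feature macro and the ctaMask parameter name are assumptions on my part, not something this PR prescribes:

```cuda
// Rough sketch (assumptions: __CUDA_ARCH_FEAT_SM90_ALL is defined when
// compiling for sm_90a, and the multicast overload takes a trailing ctaMask).
#include <cuda/ptx>
#include <cstdint>

__device__ void bulk_copy_to_cluster(void* smem_dst, const void* gmem_src,
                                     std::uint32_t size_bytes,
                                     std::uint64_t* smem_bar,
                                     std::uint16_t cta_mask)
{
#if defined(__CUDA_ARCH_FEAT_SM90_ALL)
  // sm_90a: .multicast::cluster variant; no advisory because the target is sm_90a
  cuda::ptx::cp_async_bulk(cuda::ptx::space_cluster, cuda::ptx::space_global,
                           smem_dst, gmem_src, size_bytes, smem_bar, cta_mask);
#elif defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // plain sm_90: unicast bulk copy, which ptxas accepts without the advisory
  cuda::ptx::cp_async_bulk(cuda::ptx::space_cluster, cuda::ptx::space_global,
                           smem_dst, gmem_src, size_bytes, smem_bar);
#endif
}
```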

Contributor Author

@ahendriksen ahendriksen left a comment


Thanks for the review! I have addressed most of the comments. One comment is addressed in #1359, and one comment (combining the PTX ISA guards in the tests) I hope we can punt to a future PR.

@ahendriksen ahendriksen force-pushed the ptx-add-cp-async-bulk branch 3 times, most recently from 2819ca0 to 3c5f7de on February 22, 2024 16:33
@ahendriksen ahendriksen requested a review from a team as a code owner February 23, 2024 10:14
@miscco miscco enabled auto-merge (squash) February 24, 2024 10:27
@miscco miscco merged commit df4be01 into NVIDIA:main Feb 26, 2024
561 checks passed
miscco added a commit to miscco/cccl that referenced this pull request Feb 29, 2024
Add:

- cp.async.bulk
- cp.async.bulk.tensor
- cp.reduce.async.bulk.tensor
- cp.async.bulk.wait_group
- cp.async.bulk.commit_group

Co-authored-by: Jake Hemstad <jhemstad@nvidia.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Successfully merging this pull request may close these issues.

[FEA]: Add cuda::ptx::cp_async_bulk_tensor
3 participants