
PTX: Add cuda::ptx::cp_async_bulk_* #1403

Merged

merged 14 commits into NVIDIA:main from ptx-add-cp-async-bulk on Feb 26, 2024

Conversation

ahendriksen
Contributor

Add:

  • cp.async.bulk
  • cp.async.bulk.tensor
  • cp.reduce.async.bulk.tensor
  • cp.async.bulk.wait_group
  • cp.async.bulk.commit_group

Description

closes #1398, #1399, #1400, #1401, #1402
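As a quick illustration of how the new wrappers are meant to be called: the sketch below is not taken from this PR's tests, the parameter names and mbarrier handling are assumptions, and the exact overloads are documented in the libcu++ PTX docs.

```cuda
// Rough usage sketch (assumes -arch=sm_90 or newer, CUDA 12 / PTX ISA 8.0).
// Parameter names follow the PTX operand order; see the libcu++ PTX docs for
// the exact overloads.
#include <cuda/ptx>
#include <cstdint>

// Bulk copy global -> shared::cluster; completion is signaled on an mbarrier
// that the caller has initialized and later waits on.
__device__ void load_bulk(void* smem_dst, const void* gmem_src,
                          std::uint32_t size_bytes, std::uint64_t* smem_bar)
{
  // cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes
  cuda::ptx::cp_async_bulk(cuda::ptx::space_cluster, cuda::ptx::space_global,
                           smem_dst, gmem_src, size_bytes, smem_bar);
}

// Bulk copy shared -> global; completion is tracked through the bulk async-group.
__device__ void store_bulk(void* gmem_dst, const void* smem_src,
                           std::uint32_t size_bytes)
{
  // cp.async.bulk.global.shared::cta.bulk_group
  cuda::ptx::cp_async_bulk(cuda::ptx::space_global, cuda::ptx::space_shared,
                           gmem_dst, smem_src, size_bytes);
  cuda::ptx::cp_async_bulk_commit_group();
  // Wait until at most 0 bulk async-groups are still pending.
  cuda::ptx::cp_async_bulk_wait_group(cuda::ptx::n32_t<0>{});
}
```

As with the underlying PTX instructions, source and destination addresses must be suitably aligned (16 bytes) and the size a multiple of 16 bytes.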

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@ahendriksen
Contributor Author

Not sure if this is going to be caught by CI: the .multicast variants of cp.async.bulk{.tensor} are officially part of SM90, but in practice belong to (i.e. are only fast on) SM90a. As a result, ptxas emits the following advisory whenever a multicast instruction is used while compiling for sm_90:

Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used 
on .target 'sm_90a' instead of .target 'sm_90' as this feature is expected to have substantially 
reduced performance on some future architectures

However, since we compile with --warning-as-error, this advisory gets treated as an error:

#$ ptxas --warning-as-error -arch=sm_90 -m64  "/tmp/tmpxft_00009e7f_00000000-6_ptx.cp.async.bulk.compile.pass.ptx"  -o "/tmp/tmpxft_00009e7f_00000000-8_ptx.cp.async.bulk.compile.pass.cubin" 
ptxas /tmp/tmpxft_00009e7f_00000000-6_ptx.cp.async.bulk.compile.pass.ptx, line 154; error   : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
ptxas fatal   : Ptx assembly aborted due to errors
# --error 0xff --

Can we disable this behavior somehow in lit? @miscco
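For reference, one way to keep the advisory out of an sm_90 build is to restrict the multicast path to sm_90a compilations (nvcc -arch=sm_90a). A rough sketch follows; the use of __CUDA_ARCH_FEAT_SM90_ALL as the sm_90a feature macro and the ctaMask parameter name are assumptions on my part, not something this PR prescribes:

```cuda
// Rough sketch (assumptions: __CUDA_ARCH_FEAT_SM90_ALL is defined when
// compiling for sm_90a, and the multicast overload takes a trailing ctaMask).
#include <cuda/ptx>
#include <cstdint>

__device__ void bulk_copy_to_cluster(void* smem_dst, const void* gmem_src,
                                     std::uint32_t size_bytes,
                                     std::uint64_t* smem_bar,
                                     std::uint16_t cta_mask)
{
#if defined(__CUDA_ARCH_FEAT_SM90_ALL)
  // sm_90a: .multicast::cluster variant; no advisory because the target is sm_90a
  cuda::ptx::cp_async_bulk(cuda::ptx::space_cluster, cuda::ptx::space_global,
                           smem_dst, gmem_src, size_bytes, smem_bar, cta_mask);
#elif defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // plain sm_90: unicast bulk copy, which ptxas accepts without the advisory
  cuda::ptx::cp_async_bulk(cuda::ptx::space_cluster, cuda::ptx::space_global,
                           smem_dst, gmem_src, size_bytes, smem_bar);
#endif
}
```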

Contributor Author

@ahendriksen ahendriksen left a comment


Thanks for the review! I have addressed most of the comments. One comment is addressed in #1359, and one comment (combining the PTX ISA guards in the tests) I hope we can punt to a future PR.

@ahendriksen ahendriksen force-pushed the ptx-add-cp-async-bulk branch 3 times, most recently from 2819ca0 to 3c5f7de on February 22, 2024 16:33
@ahendriksen ahendriksen requested a review from a team as a code owner February 23, 2024 10:14
@miscco miscco enabled auto-merge (squash) February 24, 2024 10:27
@miscco miscco merged commit df4be01 into NVIDIA:main Feb 26, 2024
561 checks passed
miscco added a commit to miscco/cccl that referenced this pull request Feb 29, 2024
Add:

- cp.async.bulk
- cp.async.bulk.tensor
- cp.reduce.async.bulk.tensor
- cp.async.bulk.wait_group
- cp.async.bulk.commit_group

Co-authored-by: Jake Hemstad <jhemstad@nvidia.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Successfully merging this pull request may close these issues.

[FEA]: Add cuda::ptx::cp_async_bulk_tensor
3 participants