
cp async fallback #761

Closed

alihassanijr wants to merge 2 commits into NVIDIA:master from alihassanijr:feature/cp-async-fallback

Conversation

@alihassanijr
Contributor

This PR adds partial specializations to cp_async and cp_async_zfill for SM80 that fall back to regular (synchronous) global memory access when copying 1 or 2 bytes (rather than 4, 8, or 16).

The problem

Obviously this is just a temporary solution I've found to a problem I was facing.

In one particular operation I'm working on, which contains a GEMM at the CTA level, the left-hand-side matrix is never 128-bit aligned.
The contiguous dimension is odd-shaped, so float and tf32 can read at most 4 bytes at a time, while fp16 and bf16 are stuck with 2-byte reads.

These partial specializations admittedly defeat the purpose of asynchronous copies, but they were the only fix I found that didn't involve rewriting the GEMM mainloop, which I wanted to avoid because the rest of the operation deals with 128-bit aligned inputs.
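The fallback dispatch can be sketched in plain C++ (all names here are made up for illustration; the real cp_async specializations live in CUTLASS and emit PTX):

```cpp
#include <cstring>

// Illustrative sketch of the dispatch in this PR, with hypothetical names:
// copies of 4/8/16 bytes would go down the cp.async path on SM80, while
// 1- and 2-byte copies fall back to a plain synchronous copy. The async path
// is stubbed with memcpy here, since the point is only the compile-time
// dispatch on the copy size.
template <int SizeInBytes>
struct CpAsyncSketch {
  static constexpr bool uses_async = true;  // primary template: cp.async path
  static void copy(void* dst, void const* src) {
    // real code would emit a cp.async.ca/cg PTX instruction here
    std::memcpy(dst, src, SizeInBytes);
  }
};

// Specializations for the unaligned cases: same interface, synchronous copy,
// so callers compile unchanged but no longer get any asynchrony.
template <>
struct CpAsyncSketch<2> {
  static constexpr bool uses_async = false;
  static void copy(void* dst, void const* src) { std::memcpy(dst, src, 2); }
};

template <>
struct CpAsyncSketch<1> {
  static constexpr bool uses_async = false;
  static void copy(void* dst, void const* src) { std::memcpy(dst, src, 1); }
};
```

Note that the actual change specializes the cp_async and cp_async_zfill function templates; a struct is used above only to keep the sketch self-contained.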

Other possible solutions

The async copy instruction appears to support writing an arbitrary number of bytes to shared memory; that's how the zfill guard works. I considered exploiting that and shifting bytes to make 2-byte reads work, but that obviously introduces a race condition.

Closing this PR

This is so far the only solution I've been able to think of that at least allows the code to compile and run, but it is obviously far from ideal.

Instead of starting a discussion, I figured I'd open this PR, just in case someone has a better solution, in which case I can close or modify it accordingly.

If that is not to your liking, feel free to close this.

@mnicely mnicely added the feature request New feature or request label Feb 3, 2023
@mhoemmen
Contributor

mhoemmen commented Feb 6, 2023

Greetings and thanks for your interest and contribution!

The issue with this change is that cp_async is nonblocking both in name and (currently) in fact. As a result of this change, whether users would need to synchronize after calling cp_async would depend on its template arguments. This could lead to incorrect synchronization.

It's OK to have possibly-nonblocking functions. That's classic MPI (see elaboration below). However, if a function is possibly-nonblocking, it must always be correct to do the appropriate synchronization afterwards. (That sync might be a no-op, but it needs to be correct.) If users call cp_async, it's natural for them to invoke cp.async.wait_* or the relevant mbarrier operation to wait on the operation. However, if cp_async didn't actually call cp.async.*, then the wait won't have anything to wait on. It might wait forever (deadlock), or it might wait on the wrong operation.

To be more specific about MPI (the Message Passing Interface standard for distributed-memory parallelism): It's correct for an MPI implementation to implement MPI_Isend by stashing the send arguments in a data structure attached to the MPI_Comm, and doing nothing else until MPI_Wait is called on the resulting MPI_Request, at which point MPI blocks until the send is locally complete. However, users must always call MPI_Wait on the resulting MPI_Request.

One way to address this issue would be for cp_async to return some kind of continuation object that would tell the caller how to synchronize the result. This object would need to work correctly with both synchronous waits and CUTLASS's pipeline objects for tracking asynchronous execution. However, this approach would require changing a lot more than just cp_async. For example, the copy atoms would need to expose returning a continuation object. We're certainly interested in design discussions along these lines, and in other design approaches (e.g., based on P2300).
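The continuation idea above can be sketched in plain C++ (hypothetical names; not a CUTLASS API). A possibly-nonblocking copy returns a ticket, and calling wait() on it is always correct: it is a no-op if the copy already completed, and it performs the stashed work, MPI_Isend-style, if the copy was deferred.

```cpp
#include <cstring>
#include <functional>
#include <utility>

// Ticket returned by a possibly-nonblocking copy. wait() is always correct
// to call, whether or not the copy actually happened asynchronously.
class CopyTicket {
  std::function<void()> completion_;  // empty if the copy already completed
public:
  CopyTicket() = default;
  explicit CopyTicket(std::function<void()> c) : completion_(std::move(c)) {}
  void wait() {
    if (completion_) {
      completion_();        // deferred path: do the stashed work now
      completion_ = nullptr;
    }
  }
};

// Eager for 4-byte-multiple sizes (a stand-in for the cp.async path),
// deferred otherwise: nothing happens until the caller waits.
inline CopyTicket possibly_nonblocking_copy(void* dst, void const* src,
                                            int bytes) {
  if (bytes % 4 == 0) {
    std::memcpy(dst, src, bytes);  // stand-in for the asynchronous path
    return CopyTicket{};           // wait() will be a correct no-op
  }
  return CopyTicket{[dst, src, bytes] { std::memcpy(dst, src, bytes); }};
}
```

Either way the caller's obligation is uniform: call wait() before reading the destination, which is what makes a possibly-nonblocking interface safe.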

Again, thank you for your contribution and for your interest in CUTLASS!

@alihassanijr
Contributor Author

Thank you for the detailed comment and elaboration.

I can see why this happened to work in my use case, which was just a multistage GEMM where the A matrix was not aligned and had to fall back to synchronous copies in fp16, while my B matrix was still 128-bit aligned and called cp.async.

Given that this pull request would probably be the incorrect approach in the long run, I'll just go ahead and close it for now.
I'll probably have to figure out a different approach towards my specific problem.

But again, thanks so much for taking the time to explain.

@mhoemmen
Contributor

mhoemmen commented Feb 6, 2023

@alihassanijr Thank you for your careful investigation of this issue!
