
cp async fallback #761

Closed

alihassanijr wants to merge 2 commits into NVIDIA:master from alihassanijr:feature/cp-async-fallback

Conversation

@alihassanijr
Contributor

This PR adds partial specializations to cp_async and cp_async_zfill for SM80 that fall back to regular (synchronous) global memory access when copying 1 or 2 bytes (rather than 4, 8, or 16).

The problem

Obviously this is just a temporary solution I've found to a problem I was facing.

In one particular operation I'm working on, which contains a GEMM at the CTA level, the left-hand-side matrix is never 128-bit aligned.
The contiguous dimension is odd-shaped, so float and tf32 can read at most 4 bytes at a time, while fp16 and bf16 are stuck with 2-byte reads.

These partial specializations admittedly defeat the purpose of asynchronous copies, but they were the only fix I found that didn't involve rewriting the GEMM mainloop, which I wanted to avoid because the rest of the operation deals with 128-bit aligned inputs.
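The fallback dispatch can be sketched in plain C++ (all names here are made up for illustration; the real cp_async specializations live in CUTLASS and emit PTX):

```cpp
#include <cstring>

// Illustrative sketch of the dispatch in this PR, with hypothetical names:
// copies of 4/8/16 bytes would go down the cp.async path on SM80, while
// 1- and 2-byte copies fall back to a plain synchronous copy. The async path
// is stubbed with memcpy here, since the point is only the compile-time
// dispatch on the copy size.
template <int SizeInBytes>
struct CpAsyncSketch {
  static constexpr bool uses_async = true;  // primary template: cp.async path
  static void copy(void* dst, void const* src) {
    // real code would emit a cp.async.ca/cg PTX instruction here
    std::memcpy(dst, src, SizeInBytes);
  }
};

// Specializations for the unaligned cases: same interface, synchronous copy,
// so callers compile unchanged but no longer get any asynchrony.
template <>
struct CpAsyncSketch<2> {
  static constexpr bool uses_async = false;
  static void copy(void* dst, void const* src) { std::memcpy(dst, src, 2); }
};

template <>
struct CpAsyncSketch<1> {
  static constexpr bool uses_async = false;
  static void copy(void* dst, void const* src) { std::memcpy(dst, src, 1); }
};
```

Note that the actual change specializes the cp_async and cp_async_zfill function templates; a struct is used above only to keep the sketch self-contained.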

Other possible solutions

The async copy instruction appears to support writing an arbitrary number of bytes to shared memory; that's how the zfill guard works. I considered exploiting that and shifting bytes to make 2-byte reads work, but that obviously introduces a race condition.

Closing this PR

This is so far the only solution I've been able to think of that at least allows the code to compile and run, but it is obviously far from ideal.

Instead of starting a discussion, I figured I'd open this PR, just in case someone has a better solution, in which case I can close or modify it accordingly.

If that is not to your liking, feel free to close this.

@mnicely mnicely added the feature request New feature or request label Feb 3, 2023
@mhoemmen
Contributor

mhoemmen commented Feb 6, 2023

Greetings and thanks for your interest and contribution!

The issue with this change is that cp_async is nonblocking both in name and (currently) in fact. As a result of this change, whether users would need to synchronize after calling cp_async would depend on its template arguments. This could lead to incorrect synchronization.

It's OK to have possibly-nonblocking functions. That's classic MPI (see elaboration below). However, if a function is possibly-nonblocking, it must always be correct to do the appropriate synchronization afterwards. (That sync might be a no-op, but it needs to be correct.) If users call cp_async, it's natural for them to invoke cp.async.wait_* or the relevant mbarrier operation to wait on the operation. However, if cp_async didn't actually call cp.async.*, then the wait won't have anything to wait on. It might wait forever (deadlock), or it might wait on the wrong operation.

To be more specific about MPI (the Message Passing Interface standard for distributed-memory parallelism): It's correct for an MPI implementation to implement MPI_Isend by stashing the send arguments in a data structure attached to the MPI_Comm, and doing nothing else until MPI_Wait is called on the resulting MPI_Request, at which point MPI blocks until the send is locally complete. However, users must always call MPI_Wait on the resulting MPI_Request.

One way to address this issue would be for cp_async to return some kind of continuation object that would tell the caller how to synchronize the result. This object would need to work correctly with both synchronous waits and CUTLASS's pipeline objects for tracking asynchronous execution. However, this approach would require changing a lot more than just cp_async. For example, the copy atoms would need to expose returning a continuation object. We're certainly interested in design discussions along these lines, and in other design approaches (e.g., based on P2300).
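The continuation idea above can be sketched in plain C++ (hypothetical names; not a CUTLASS API). A possibly-nonblocking copy returns a ticket, and calling wait() on it is always correct: it is a no-op if the copy already completed, and it performs the stashed work, MPI_Isend-style, if the copy was deferred.

```cpp
#include <cstring>
#include <functional>
#include <utility>

// Ticket returned by a possibly-nonblocking copy. wait() is always correct
// to call, whether or not the copy actually happened asynchronously.
class CopyTicket {
  std::function<void()> completion_;  // empty if the copy already completed
public:
  CopyTicket() = default;
  explicit CopyTicket(std::function<void()> c) : completion_(std::move(c)) {}
  void wait() {
    if (completion_) {
      completion_();        // deferred path: do the stashed work now
      completion_ = nullptr;
    }
  }
};

// Eager for 4-byte-multiple sizes (a stand-in for the cp.async path),
// deferred otherwise: nothing happens until the caller waits.
inline CopyTicket possibly_nonblocking_copy(void* dst, void const* src,
                                            int bytes) {
  if (bytes % 4 == 0) {
    std::memcpy(dst, src, bytes);  // stand-in for the asynchronous path
    return CopyTicket{};           // wait() will be a correct no-op
  }
  return CopyTicket{[dst, src, bytes] { std::memcpy(dst, src, bytes); }};
}
```

Either way the caller's obligation is uniform: call wait() before reading the destination, which is what makes a possibly-nonblocking interface safe.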

Again, thank you for your contribution and for your interest in CUTLASS!

@alihassanijr
Contributor Author

Thank you for the detailed comment and elaboration.

I can see why this happened to work in my use case, which was just a multistage GEMM where the A matrix was not aligned and had to fall back to synchronous copies in fp16, while my B matrix was still 128-bit aligned and called cp.async.

Given that this pull request would probably be the incorrect approach in the long run, I'll just go ahead and close it for now.
I'll probably have to figure out a different approach towards my specific problem.

But again, thanks so much for taking the time to explain.

@mhoemmen
Contributor

mhoemmen commented Feb 6, 2023

@alihassanijr Thank you for your careful investigation of this issue!
