Conversation
|
Greetings and thanks for your interest and contribution! The issue with this change is that It's OK to have possibly-nonblocking functions. That's classic MPI (see elaboration below). However, if a function is possibly-nonblocking, it must always be correct to do the appropriate synchronization afterwards. (That sync might be a no-op, but it needs to be correct.) If users call To be more specific about MPI (the Message Passing Interface standard for distributed-memory parallelism): It's correct for an MPI implementation to implement One way to address this issue would be for Again, thank you for your contribution and for your interest in CUTLASS! |
|
Thank you for the detailed comment and elaboration. I can see why this happened to have worked in my use case, which was just a multistage GEMM, where the A matrix was not aligned and had to fall back to synchronous copy in fp16, but my B matrix was still 128-bit aligned and called cp async. Given that this pull request would probably be the incorrect approach in the long run, I'll just go ahead and close it for now. But again, thanks so much for taking the time to explain. |
|
@alihassanijr Thank you for your careful investigation of this issue! |
This PR adds partial specializations to
cp_asyncandcp_async_zfillfor SM80 to fall back to regular access when accessing 1 or 2 bytes (and not 4, 8, 16).The problem
Obviously this is just a temporary solution I've found to a problem I was facing.
In one particular operation I'm working on, which contains a GEMM at the CTA level, the left hand side matrix is never 128 bit aligned.
The contiguous dimension is actually odd-shaped, and as a result, float and tf32 can do a maximum 4 byte read at a time, but fp16 and bf16 are stuck with 2 byte reads.
I just did these partial specializations that basically defeat the purpose of asynchronous copies, but in a way were my only solution that didn't involve rewriting the GEMM main loop, which I didn't want to do because the rest of the operations deal with 128-bit aligned inputs.
Other possible solutions
The async copy instruction appears to support arbitrary number of bytes to be written to shared memory, that's how the guard is working. I was thinking about using that and shifting bytes to get 2-byte reads to work, but that obviously introduces a race condition.
Closing this PR
This is so far the only solution I've been able to think of, that would at least allow the code to compile and run, but obviously the solution is far from ideal.
Instead of starting a discussion, I figured I'd open this, just in case someone has a better solution, in which case I could close or modify this PR accordingly.
If that is not to your liking, feel free to close this.