I read the FlashMLA code and have some questions:
- Why does it load QK with the SM80 copy_async instead of TMA?
- To store data from registers to global memory, it goes through shared memory to change the layout, then reads the data back into registers, and then writes to global memory.
As far as I know, Hopper supports transferring data from shared memory to global memory directly. Why is the round trip through registers used instead?
I don't have deep insight into CUTLASS and Hopper, so I'm curious about this. Thank you!