Question: why TMA is not used

I read the FlashMLA code and have a couple of questions:

1. Why are Q and K loaded with the SM80 copy_async path instead of TMA?
2. To store data from registers to global memory, the code first writes it to shared memory to change the layout, then reads it back into registers, and only then writes it to global memory.
As far as I know, Hopper supports transferring data from shared memory to global memory directly. Why is the register round trip used instead?
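
To make point 2 concrete, here is a minimal CUDA sketch of the store pattern I mean. It is my own simplified illustration, not FlashMLA's code: the kernel name `store_via_smem`, the 32x32 tile, and the one-row-per-thread register layout are all assumptions chosen to keep the example small.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // hypothetical tile size: one warp, one row per thread

// Each thread starts with one ROW of a TILE x TILE tile in registers
// (a stand-in for an accumulator fragment). Storing that row straight to
// row-major global memory would make neighbouring threads write addresses
// TILE floats apart. Staging the tile in shared memory lets the warp read
// it back one row at a time and issue coalesced global stores.
__global__ void store_via_smem(float* out) {
    __shared__ float smem[TILE][TILE + 1];  // +1 padding avoids bank conflicts

    const int t = threadIdx.x;  // 0 .. TILE-1

    // Fill the per-thread fragment: element (t, c) holds its linear index.
    float frag[TILE];
    for (int c = 0; c < TILE; ++c) frag[c] = float(t * TILE + c);

    // Step 1: registers -> shared memory (thread t writes row t).
    for (int c = 0; c < TILE; ++c) smem[t][c] = frag[c];
    __syncthreads();

    // Step 2: shared memory -> registers -> global memory.
    // For each row r, thread t handles column t, so the warp writes
    // TILE consecutive floats.
    for (int r = 0; r < TILE; ++r) {
        float v = smem[r][t];
        out[r * TILE + t] = v;
    }
}

int main() {
    const int n = TILE * TILE;
    float* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    store_via_smem<<<1, TILE>>>(d_out);
    cudaDeviceSynchronize();

    static float h_out[TILE * TILE];
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Every element should equal its own linear index.
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == float(i));
    printf("%s\n", ok ? "match" : "mismatch");
    return ok ? 0 : 1;
}
```

My question is about step 2: on Hopper I would have expected the shared-memory tile to be written to global memory directly, without going back through registers.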

I don't have deep insight into CUTLASS or Hopper, so I'm curious about the reasoning. Thank you!


Question: why TMA is not used #61
