```python
def flash_mla():
    torch.cuda.synchronize()
    tile_scheduler_metadata, num_splits = get_mla_metadata(
        cache_seqlens, s_q * h_q // h_kv, h_kv
    )
```
I added a `torch.cuda.synchronize()` call and found that performance got much worse: with the sync it took 360us, while without it, it took only 50us.

Does this mean the 50us measurement is really just the CPU's time? (The kernel is launched asynchronously and hasn't finished executing yet.)
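Yes, that is the likely explanation: CUDA kernel launches are asynchronous, so without a synchronize the wall clock only captures the CPU-side launch overhead, not the kernel's execution. As a CPU-only analogy (no GPU or FlashMLA required; `slow_kernel` is a hypothetical stand-in for the real kernel), submitting work to a thread pool returns almost immediately, just like a kernel launch:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_kernel() -> int:
    time.sleep(0.05)  # stands in for GPU kernel execution time
    return 42

with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.perf_counter()
    fut = pool.submit(slow_kernel)  # "launch": returns immediately
    launch_us = (time.perf_counter() - t0) * 1e6

    t0 = time.perf_counter()
    result = fut.result()  # "synchronize": blocks until the work finishes
    sync_us = (time.perf_counter() - t0) * 1e6

# launch_us is tiny; sync_us reflects the actual work, just as the 50us
# vs 360us gap does in the issue report.
```

For accurate GPU timing, either call `torch.cuda.synchronize()` before reading the clock, or use `torch.cuda.Event(enable_timing=True)` pairs, which measure elapsed time on the device itself without stalling the whole stream around the measurement.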