Issue search results · repo:deepseek-ai/FlashMLA language:C++

42 results

https://github.com/deepseek-ai/FlashMLA/blob/b31bfe72a83ea205467b3271a5845440a03ed7cb/csrc/flash_api.cpp#L184 Hi all, just wondering why the shapes of O_accum and LSE_accum change from [numsplit, batch, ...
  • mingyangHao
  • Opened on Apr 7
  • #70

@sijiac Why add a permanently false condition for `determine`? https://github.com/deepseek-ai/FlashMLA/blob/b31bfe72a83ea205467b3271a5845440a03ed7cb/csrc/flash_fwd_mla_kernel.h#L528
  • LearnerInGithub
  • 1
  • Opened on Mar 6
  • #64

@sijiac Hello everyone! I want to raise a question about the usage of the CUDA qualifier launch_bounds. In the CUDA documentation, launch_bounds() takes only two parameters: maxThreadsPerBlock and minBlocksPerMultiprocessor. ...
  • LearnerInGithub
  • 2
  • Opened on Mar 6
  • #63
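For context on the question in #63, a minimal sketch of the qualifier's documented two-argument form and the newer three-argument form. CUDA 11.8 added an optional third argument, maxBlocksPerCluster, for thread-block clusters on compute capability 9.0 (Hopper), which may be the extra parameter being asked about. The kernel names and bodies below are illustrative, not FlashMLA's code:

```cuda
// Documented two-argument form: an upper bound on threads per block, and a
// desired minimum number of resident blocks per SM (constrains register use).
__global__ void __launch_bounds__(256, 1)
kernel_two_args(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.f;
}

// Since CUDA 11.8, an optional third argument (maxBlocksPerCluster) bounds
// the thread-block cluster size when targeting sm_90 (Hopper).
__global__ void __launch_bounds__(256, 1, 2)
kernel_three_args(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.f;
}
```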

Hope it helps someone: https://www.youtube.com/watch?v=0VLAoVGf_74
  • leo-smi
  • Opened on Mar 6
  • #62

I read the code of FlashMLA, and I have some questions: 1. Why not use TMA to load Q/K, instead of the SM80 copy_async? 2. To store data from registers to global memory, it uses shared memory to change ...
  • Idonthaveaname-wq
  • 1
  • Opened on Mar 5
  • #61
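On question 1 in #61, for anyone comparing the two copy paths: cp.async (the SM80 mechanism, exposed through the cuda_pipeline primitives) is issued per thread, while Hopper's TMA copies a whole tile described by a tensor-map descriptor with a single instruction issued by one thread. A minimal, illustrative cp.async sketch (not FlashMLA's code; assumes a 128-thread block and 16-byte-aligned pointers):

```cuda
#include <cuda_pipeline.h>  // __pipeline_memcpy_async and friends (SM80+)

__global__ void load_tile_cp_async(const float4* __restrict__ gmem) {
    __shared__ float4 smem[128];

    // Each thread asynchronously copies its own 16-byte element
    // global -> shared, bypassing the register file.
    __pipeline_memcpy_async(&smem[threadIdx.x], &gmem[threadIdx.x],
                            sizeof(float4));
    __pipeline_commit();       // close the current async-copy batch
    __pipeline_wait_prior(0);  // block until that batch has landed
    __syncthreads();

    // ... compute on smem ...
}
```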

I would like to request support for NVIDIA Ampere architecture GPUs in FlashMLA. I understand that many of the current optimizations are specific to Hopper GPUs, but having a lite version compatible with ...
  • ehartford
  • 2
  • Opened on Mar 3
  • #60

    def flash_mla():
        torch.cuda.synchronize()
        tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

I added a sync(), and found that the performance was much ...
  • pipul
  • Opened on Mar 3
  • #59
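The pitfall behind #59 is general to asynchronous APIs: CUDA kernel launches return immediately, so timing without a synchronize measures only launch overhead, and adding `torch.cuda.synchronize()` makes the timing include the actual work. A toy Python illustration of the same effect, using a single-worker thread pool as a stand-in for an in-order CUDA stream (the torch and get_mla_metadata names above come from the snippet and are not used here):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def work():
    time.sleep(0.05)  # stand-in for a GPU kernel

pool = ThreadPoolExecutor(max_workers=1)  # one worker ~ one in-order stream

# Timing without "synchronization": measures only the submission cost.
t0 = time.perf_counter()
fut = pool.submit(work)  # returns immediately, like an async kernel launch
launch_time = time.perf_counter() - t0

# Timing with "synchronization": waits for the work to actually finish,
# analogous to calling torch.cuda.synchronize() before reading the clock.
t0 = time.perf_counter()
pool.submit(work).result()
synced_time = time.perf_counter() - t0

fut.result()
pool.shutdown()
```

The "synced" number is larger not because the code got slower, but because the unsynchronized number never included the work at all.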

I couldn't find any paper that specifically introduces this library. If one exists, could someone kindly point me to it?
  • echoht
  • Opened on Feb 28
  • #53

[Image: https://github.com/user-attachments/assets/b1864bd1-7898-40b6-bd5f-33645cb91b0f] head_size here comes from q.sizes()[3]. But in modeling_deepseek.py of the DeepSeek-V3 model, ...
  • WangNorthSea
  • 1
  • Opened on Feb 27
  • #49

I'm curious about this. It seems we can overlap CUDA cores and tensor cores using warp specialization. But if it's just to overlap g2s and computation, is there any difference between the warp-specialization ...
  • sleepwalker2017
  • 1
  • Opened on Feb 27
  • #48