Description
Hi @yzh119 @AKKamath, thanks for your effort integrating this novel work into FlashInfer.
I recently opened a PR to integrate POD Attention into SGLang, but found its performance to be lower than BatchPrefillWithPagedKVCache. I ran a chunked-prefill workload with input length 4000 and output length 200, feeding decode and prefill requests into the same kernel.
To support multiple prefill requests, I use a 2D mask to compute local attention within the packed sequence (see the sketch below).
In the profiling traces, the POD Attention kernel consistently takes roughly 3x as long to finish as the BatchPrefill kernel.
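For reference, here is a minimal sketch of how such a 2D mask can be built; the function and tensor names are illustrative, not the exact code in my branch. It marks, for every query token in the packed prefill sequence, only the keys that belong to the same request and sit at or before that position (block-diagonal causal).

```python
import torch

def build_packed_causal_mask(seq_lens: list[int], device: str = "cuda") -> torch.Tensor:
    """Block-diagonal causal mask for several prefill requests packed into one sequence.

    mask[i, j] is True iff query token i may attend to key token j, i.e. both
    tokens belong to the same request and j <= i within that request.
    """
    total = sum(seq_lens)
    # Request id of every token in the packed sequence.
    req_ids = torch.repeat_interleave(
        torch.arange(len(seq_lens), device=device),
        torch.tensor(seq_lens, device=device),
    )
    pos = torch.arange(total, device=device)
    same_req = req_ids[:, None] == req_ids[None, :]
    causal = pos[:, None] >= pos[None, :]
    return same_req & causal

# Example: two prefill requests of length 3 and 2 packed together.
mask = build_packed_causal_mask([3, 2])
print(mask.int())
```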


The results can be reproduced with my branch:
```bash
export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile_log
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --disable-cuda-graph --attention-backend flashinfer --enable-mixed-chunk --enable-pd-colocation
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4000 --random-output 200 --request-rate 8 --num-prompt 480 --port 30000 --profile
```
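In case it helps, this is roughly how I compare kernel times from the exported chrome trace; the trace path and the kernel-name substrings below are placeholders, so adjust them to whatever appears in your trace.

```python
import json
from collections import defaultdict

# Placeholders: point TRACE at the exported chrome trace and adjust the
# kernel-name substrings to the actual kernel names shown in the trace.
TRACE = "/sgl-workspace/sglang/profile_log/trace.json"
PATTERNS = ("PODWithKVCache", "BatchPrefillWithPagedKVCache")

with open(TRACE) as f:
    events = json.load(f)["traceEvents"]

totals = defaultdict(float)
counts = defaultdict(int)
for ev in events:
    name = ev.get("name", "")
    dur = ev.get("dur")  # chrome trace durations are in microseconds
    if dur is None:
        continue
    for pat in PATTERNS:
        if pat in name:
            totals[pat] += dur
            counts[pat] += 1

for pat in PATTERNS:
    if counts[pat]:
        print(f"{pat}: {counts[pat]} launches, "
              f"avg {totals[pat] / counts[pat]:.1f} us, total {totals[pat] / 1e3:.1f} ms")
```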
Also, this line sometimes throws an illegal memory access on kernel launch. Could it be that the decode batch size plus the prefill sequence length is too large and we run out of shared memory?
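To narrow down where the illegal memory access comes from, I plan to rerun with blocking launches and an explicit synchronize around the suspected call; a rough sketch of that approach is below (the lambda is just a stand-in for the actual POD attention launch in my branch).

```python
import os

# Must be set before the CUDA context is created (i.e., before the first CUDA call),
# so kernel launches become synchronous and errors surface at the failing launch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def checked(launch_fn, label: str):
    """Run a kernel launch and force any asynchronous CUDA error to surface here."""
    try:
        out = launch_fn()
        torch.cuda.synchronize()
        return out
    except RuntimeError as err:
        print(f"CUDA error during {label}: {err}")
        raise

# Stand-in launch; in the backend this would wrap the POD attention call.
checked(lambda: torch.ones(4, device="cuda") * 2, "dummy launch")
```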
I wonder if you have any insights into these issues. Thanks!