
Low performance of POD Attention compared to BatchPrefillWithPagedKVCache #1022

Open
@Edenzzzz

Description


Hi @yzh119 @AKKamath, thank you for your effort integrating this novel work into FlashInfer.
I recently opened a PR to integrate POD Attention into SGLang, but found that its performance is lower than BatchPrefillWithPagedKVCache. I ran a chunked-prefill workload with input length 4000 and output length 200, feeding decode and prefill requests into the same kernel.
To support multiple prefill requests, I use a 2D mask to compute local attention over a packed sequence (see the sketch below).
In profiling, the POD Attention kernel consistently takes about 3x as long to finish as the BatchPrefill kernel.

[Profiler screenshots: POD Attention kernel vs. BatchPrefillWithPagedKVCache kernel durations]
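For reference, the 2D mask I mentioned is essentially a block-diagonal causal mask over the packed prefill tokens, so each request only attends to its own prefix. Below is a minimal standalone sketch (PyTorch); the actual code in my branch differs and also has to account for KV already cached from earlier chunks:

```python
import torch

def build_packed_causal_mask(seq_lens: list[int]) -> torch.Tensor:
    """Block-diagonal causal mask for several prefill requests packed into
    one sequence: token i may attend to token j only if both tokens belong
    to the same request and j comes at or before i."""
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seq_lens:
        # Causal (lower-triangular) block for this request only.
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool)
        )
        start += n
    return mask

# Example: three prefill requests packed into one 15-token sequence.
mask = build_packed_causal_mask([5, 3, 7])   # shape (15, 15)
```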

The results can be reproduced using my branch:

export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile_log
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --disable-cuda-graph --attention-backend flashinfer --enable-mixed-chunk --enable-pd-colocation
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4000 --random-output 200 --request-rate 8 --num-prompt 480 --port 30000 --profile
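If it helps to narrow things down, the per-kernel gap can also be measured outside the profiler with CUDA events. Here is a generic timing helper along those lines; `run_pod` and `run_batch_prefill` are placeholders for however the two wrappers end up being invoked in the backend:

```python
import torch

def time_op(fn, warmup: int = 10, iters: int = 50) -> float:
    """Average time of a CUDA op in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# pod_ms = time_op(run_pod)                  # placeholder callables
# prefill_ms = time_op(run_batch_prefill)
# print(f"POD: {pod_ms:.3f} ms  BatchPrefill: {prefill_ms:.3f} ms")
```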

Also, this line sometimes throws an illegal memory access on kernel launch. Perhaps the combined decode batch size and prefill sequence length is too large and we run out of shared memory?
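For context, the back-of-envelope sizing I have in mind is below. The decode batch size is an assumed number for this workload, and whether any of these buffers actually end up in shared memory inside the POD kernel is exactly what I am unsure about:

```python
# Rough sizing for one mixed chunk in this workload (decode_bs is assumed).
prefill_qo_len = 4000      # prefill tokens in the chunk (input length 4000)
prefill_kv_len = 4000      # KV length those prefill tokens attend to
decode_bs = 64             # concurrent decode requests batched in (assumption)

# The 2D custom mask scales with qo_len * kv_len; even bit-packed
# (1 bit per entry) it is far larger than per-SM shared memory
# (~164 KiB on A100, ~228 KiB on H100).
mask_bytes = prefill_qo_len * prefill_kv_len / 8
print(f"packed 2D mask: {mask_bytes / 1024:.0f} KiB")   # ~1953 KiB
```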

I'd appreciate any insights on these two issues. Thanks.
