Description
Hi @yzh119 @AKKamath, thanks for your effort integrating this novel work into FlashInfer.
I recently opened a PR to integrate POD Attention into SGLang, but found its performance to be lower than BatchPrefillWithPagedKVCache. I ran a chunked-prefill workload with input length 4000 and output length 200, feeding decode and prefill requests into the same kernel.
To support multiple prefill requests, I use a 2D mask to compute local attention within the packed sequence (see the sketch below).
In the profiling traces, the POD Attention kernel consistently takes roughly 3x as long to finish as the BatchPrefill kernel.
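For reference, here is a minimal sketch of how such a 2D mask can be built; the function and tensor names are illustrative, not the exact code in my branch. It marks, for every query token in the packed prefill sequence, only the keys that belong to the same request and sit at or before that position (block-diagonal causal).

```python
import torch

def build_packed_causal_mask(seq_lens: list[int], device: str = "cuda") -> torch.Tensor:
    """Block-diagonal causal mask for several prefill requests packed into one sequence.

    mask[i, j] is True iff query token i may attend to key token j, i.e. both
    tokens belong to the same request and j <= i within that request.
    """
    total = sum(seq_lens)
    # Request id of every token in the packed sequence.
    req_ids = torch.repeat_interleave(
        torch.arange(len(seq_lens), device=device),
        torch.tensor(seq_lens, device=device),
    )
    pos = torch.arange(total, device=device)
    same_req = req_ids[:, None] == req_ids[None, :]
    causal = pos[:, None] >= pos[None, :]
    return same_req & causal

# Example: two prefill requests of length 3 and 2 packed together.
mask = build_packed_causal_mask([3, 2])
print(mask.int())
```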


The results can be reproduced with my branch:
```bash
export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile_log
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --disable-cuda-graph --attention-backend flashinfer --enable-mixed-chunk --enable-pd-colocation
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4000 --random-output 200 --request-rate 8 --num-prompt 480 --port 30000 --profile
```
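In case it helps, this is roughly how I compare kernel times from the exported chrome trace; the trace path and the kernel-name substrings below are placeholders, so adjust them to whatever appears in your trace.

```python
import json
from collections import defaultdict

# Placeholders: point TRACE at the exported chrome trace and adjust the
# kernel-name substrings to the actual kernel names shown in the trace.
TRACE = "/sgl-workspace/sglang/profile_log/trace.json"
PATTERNS = ("PODWithKVCache", "BatchPrefillWithPagedKVCache")

with open(TRACE) as f:
    events = json.load(f)["traceEvents"]

totals = defaultdict(float)
counts = defaultdict(int)
for ev in events:
    name = ev.get("name", "")
    dur = ev.get("dur")  # chrome trace durations are in microseconds
    if dur is None:
        continue
    for pat in PATTERNS:
        if pat in name:
            totals[pat] += dur
            counts[pat] += 1

for pat in PATTERNS:
    if counts[pat]:
        print(f"{pat}: {counts[pat]} launches, "
              f"avg {totals[pat] / counts[pat]:.1f} us, total {totals[pat] / 1e3:.1f} ms")
```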
Also, this line sometimes throws an illegal memory access on kernel launch. Could it be that the decode batch size plus the prefill sequence length is too large and we run out of shared memory?
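To narrow down where the illegal memory access comes from, I plan to rerun with blocking launches and an explicit synchronize around the suspected call; a rough sketch of that approach is below (the lambda is just a stand-in for the actual POD attention launch in my branch).

```python
import os

# Must be set before the CUDA context is created (i.e., before the first CUDA call),
# so kernel launches become synchronous and errors surface at the failing launch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def checked(launch_fn, label: str):
    """Run a kernel launch and force any asynchronous CUDA error to surface here."""
    try:
        out = launch_fn()
        torch.cuda.synchronize()
        return out
    except RuntimeError as err:
        print(f"CUDA error during {label}: {err}")
        raise

# Stand-in launch; in the backend this would wrap the POD attention call.
checked(lambda: torch.ones(4, device="cuda") * 2, "dummy launch")
```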
I wonder if you have any insights into these issues. Thanks!