[Usage]: DeepSeek R1 input tokens cannot exceed 32k and how to correctly use FlashMLA #14882
How would you like to use vllm
Issue Description
When using vLLM with Triton as the attention backend for DeepSeek R1, input prompts cannot exceed 32k tokens; longer inputs fail with:

RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered

According to this PR (sgl-project/sglang#3779), this seems to be a known issue. My startup command is as follows (using a Docker environment):
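A minimal sketch of such a launch, assuming the official vllm/vllm-openai image and its OpenAI-compatible server; the image tag, model path, GPU count, and serve flags below are illustrative stand-ins, not the original command:

```bash
# Hypothetical sketch: serve DeepSeek R1 from the official vLLM image.
# Image tag, model path, parallelism, and context length are assumptions,
# not the original command from this issue.
docker run --gpus all \
  -v /path/to/DeepSeek-R1:/models/DeepSeek-R1 \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model /models/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --max-model-len 65536
```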
I've noticed that the new version seems to support FlashMLA, and I want to switch the attention backend from Triton to FlashMLA. After checking the documentation, this seems to be controlled through an environment variable called VLLM_ATTENTION_BACKEND, so I tried adding this environment variable when starting Docker (I'm not sure if my understanding is correct); the new startup command is sketched below. However, judging from the execution, the backend still appears to be Triton.
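A minimal sketch of the modified launch, assuming the same image, model path, and flags as in the sketch above; the value FLASHMLA is an assumption about the backend name VLLM_ATTENTION_BACKEND expects:

```bash
# Hypothetical sketch: same launch as above, but pinning vLLM's attention
# backend via the VLLM_ATTENTION_BACKEND environment variable.
# The value FLASHMLA is an assumed backend name, not confirmed by this issue.
docker run --gpus all \
  -e VLLM_ATTENTION_BACKEND=FLASHMLA \
  -v /path/to/DeepSeek-R1:/models/DeepSeek-R1 \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model /models/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --max-model-len 65536
```

A quick way to rule out the variable simply not reaching the container is `docker exec <container> env | grep VLLM_ATTENTION_BACKEND`; vLLM also names the selected attention backend in its startup output.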
Questions
1. Is the 32k input-token limit with the Triton backend a known issue, and is there a workaround?
2. How can I properly enable FlashMLA in vLLM?