[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). #19825
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Summary of Changes
Hello @pavanimajety, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant enhancement by integrating a new Flashinfer TRTLLM backend for the decode attention path, specifically optimized for Blackwell (SM100) architectures. This aims to provide a more performant decoding kernel, which can be toggled via an environment variable. The changes involve extending existing data structures, implementing conditional execution logic for the new backend, and adding a dedicated benchmark to validate its performance.
Highlights
- New TRTLLM Decode Backend Integration: I've integrated the `trtllm_batch_decode_with_kv_cache` function from Flashinfer, providing an alternative, potentially optimized, decode attention kernel for Blackwell (SM100) architectures.
- Feature Flag Control: A new environment variable, `VLLM_USE_TRTLLM_DECODE_ATTENTION`, has been added. This allows users to explicitly enable or disable the new TRTLLM decode path, offering flexibility and control over which backend is utilized.
- FlashInfer Metadata Extension: The `FlashInferMetadata` class has been extended to accommodate additional parameters required by the TRTLLM decode function, such as `max_seq_len`, `seq_lens`, `block_table_tensor`, and `workspace_buffer`.
- Conditional Execution Logic: The core logic in the `_plan` and `forward` methods of the `FlashInferBackend` has been updated to conditionally invoke either the new TRTLLM decode kernel or the existing Flashinfer decode based on the `VLLM_USE_TRTLLM_DECODE_ATTENTION` environment variable. This includes handling the specific KV cache layout expected by TRTLLM (see the sketch after this list).
- Dedicated Benchmark Test: A new test file (`tests/kernels/attention/test_flashinfer_trtllm_decode_attention.py`) has been introduced. This benchmark specifically measures the performance of the `trtllm_batch_decode_with_kv_cache` function across various sequence counts, ensuring the new integration can be properly evaluated.
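A minimal sketch of the env-var-gated dispatch and extended metadata described above, assuming a simple boolean-style reading of `VLLM_USE_TRTLLM_DECODE_ATTENTION`. The `DecodeMetadata` class and the `decode`, `trtllm_decode_fn`, and `wrapper_decode_fn` names are hypothetical placeholders for illustration, not the actual vLLM code; the real TRTLLM entry point is Flashinfer's `trtllm_batch_decode_with_kv_cache`.

```python
# Illustrative sketch only -- not the actual vLLM implementation.
import os
from dataclasses import dataclass

import torch

# Assumption: the flag is read as "1" to enable; the real parsing may differ.
USE_TRTLLM_DECODE = os.environ.get("VLLM_USE_TRTLLM_DECODE_ATTENTION", "0") == "1"


@dataclass
class DecodeMetadata:
    # Fields mirroring the additions to FlashInferMetadata named above.
    max_seq_len: int
    seq_lens: torch.Tensor            # per-request sequence lengths
    block_table_tensor: torch.Tensor  # paged KV block indices per request
    workspace_buffer: torch.Tensor    # scratch space required by the TRTLLM kernel


def decode(query, kv_cache, meta: DecodeMetadata, trtllm_decode_fn, wrapper_decode_fn):
    """Dispatch to the TRTLLM kernel when opted in, else the existing wrapper path."""
    if USE_TRTLLM_DECODE:
        # The TRTLLM kernel expects a specific KV cache layout, so the real code
        # also rearranges kv_cache before this call (see the PR diff).
        return trtllm_decode_fn(query, kv_cache, meta)
    # Fallback: the pre-existing BatchDecodePagedKVCacheWrapper-based decode.
    return wrapper_decode_fn(query, kv_cache, meta)
```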
Code Review
This pull request adds a Flashinfer TRTLLM backend for the Flashinfer decode path, specifically targeting the SM100 architecture. The changes modify the attention backend, add an environment variable that enables the TRTLLM backend, integrate it into the existing Flashinfer attention implementation, and introduce a new test file that benchmarks the backend's performance. There are several areas where the code could be improved, including hardcoded values, redundant calculations, and missing documentation.
Kernel benchmark:
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). (vllm-project#19825) Signed-off-by: Pavani Majety <pmajety@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: shuw <shuw@nvidia.com> Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored by @wenscarl
Essential Elements of an Effective PR Description Checklist
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.

Purpose
Adds decode kernels for Paged GQA for kv-cache-dtype="auto". A follow-up PR will add FA3-style Q=FP8 and KV=FP8 support.
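As a hypothetical usage sketch (not part of the PR itself), enabling the new decode path might look like the following. The model name, parallelism settings, and the assumption that the FlashInfer attention backend is selected explicitly via `VLLM_ATTENTION_BACKEND` are illustrative, not prescribed by this PR.

```python
# Hypothetical usage sketch; env var values and model choice are illustrative.
import os

# Assumption: the FlashInfer attention backend is selected explicitly.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
# Opt into the new TRTLLM decode kernel introduced by this PR (assumed "1" enables it).
os.environ["VLLM_USE_TRTLLM_DECODE_ATTENTION"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # example model (the PR benchmarks Llama 3.3 70B FP8)
    kv_cache_dtype="auto",                      # the KV cache dtype targeted by this PR
    tensor_parallel_size=4,                     # illustrative
)
outputs = llm.generate(
    ["Explain paged attention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```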
Test Plan
Test Result
Llama 3.3 70B FP8 Benchmarking results:
(Optional) Documentation Update
Introduces `VLLM_USE_TRTLLM_DECODE_ATTENTION` for switching between the flashinfer `BatchDecodePagedKVCacheWrapper` wrapper and the `trtllm_batch_decode_with_kv_cache` API.
Kernel-level benchmarks: see comments.
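For reference, a minimal, generic sketch of the kind of CUDA-event timing loop such a kernel-level benchmark typically uses; `run_decode` is a placeholder callable for whichever decode path is being measured, and the batch sizes and iteration counts are illustrative, not those of the PR's test file.

```python
# Generic timing-loop sketch; not the actual benchmark in this PR.
import torch

def benchmark_decode(run_decode, num_seqs_list=(1, 16, 64, 128, 256),
                     warmup=10, iters=100):
    """Return mean latency (ms) of run_decode(num_seqs) for each batch size."""
    timings = {}
    for num_seqs in num_seqs_list:
        for _ in range(warmup):            # warm up kernels and allocator
            run_decode(num_seqs)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            run_decode(num_seqs)
        end.record()
        torch.cuda.synchronize()
        timings[num_seqs] = start.elapsed_time(end) / iters
    return timings
```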
Test results: