[Wheel Size] Only build FA2 8.0+PTX #19336
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Hello @LucasWilkinson, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello everyone, gemini-code-assist here to provide a summary of this pull request. The primary goal of this PR, as indicated by the title and description, is to reduce the size of the built wheel package. This is done by modifying the build process to only include FlashAttention 2 (FA2) support for CUDA compute capability 8.0 plus PTX, rather than building for multiple CUDA architectures (8.0, 9.0, 10.0, etc.). The change itself updates the specific commit hash used for the `vllm-flash-attn` external project dependency within the CMake build configuration.
Highlights
- Dependency Update: The core change updates the Git tag for the `vllm-flash-attn` external project dependency in the CMake configuration.
- Build Configuration Change: The update aligns the build process with the goal of only building FlashAttention for CUDA 8.0+PTX, aiming to reduce the final wheel size (see the sketch below).
- Wheel Size Reduction: This PR is part of an effort to shrink the distribution package by being more selective about the CUDA architectures included for FlashAttention.
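For context, here is a minimal CMake sketch of what "8.0+PTX only" means at the build-system level. This is an assumed illustration, not vLLM's actual build logic: in CMake, an architecture entry without a `-real`/`-virtual` suffix emits both SASS for that GPU and PTX that newer GPUs can JIT-compile at runtime.

```cmake
cmake_minimum_required(VERSION 3.18)
project(fa2_arch_demo LANGUAGES CUDA)

# "80" emits SASS for sm_80 plus PTX for compute_80 (i.e. 8.0+PTX).
# GPUs newer than Ampere (e.g. Hopper, Blackwell) JIT-compile the PTX
# on first use, which is what causes the first-run slowdown noted below.
set(CMAKE_CUDA_ARCHITECTURES "80")

# The alternative this PR moves away from: fat binaries carrying SASS
# for every major architecture, which inflates the wheel considerably.
# set(CMAKE_CUDA_ARCHITECTURES "80;90;100")
```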
Changelog
- `cmake/external_projects/vllm_flash_attn.cmake`
  - Updated the `GIT_TAG` for the `vllm-flash-attn` `FetchContent_Declare` from `8798f27777fb57f447070301bf33a9f9c607f491` to `61f1cfbf303884e131632626e7349912d784726c` (line 41 in both the before and after snippets). This points the build process at a different commit of the FlashAttention repository, presumably one configured to support the targeted CUDA architecture subset.
Code Review
This pull request aims to reduce the final wheel size by updating the `flash-attention` dependency to a specific commit. The intention is that this version of `flash-attention` only builds FA2 kernels for the 8.0+PTX architecture. This is a good optimization, provided it matches the project's supported hardware targets. My review focuses on verifying this assumption and ensuring clarity for future maintenance.
Summary of Findings
- Verification of Dependency's Build Behavior: It's crucial to confirm that the specified `GIT_TAG` (commit `61f1cfbf303884e131632626e7349912d784726c`) for the `vllm-flash-attn` dependency indeed restricts the FA2 build to only the 8.0+PTX architecture, as this is the core purpose of the PR.
- Clarity of Dependency Versioning: Using a raw commit hash for dependencies is precise but can be less transparent for maintenance. Consider whether a tag or a dedicated branch in the `vllm-project/flash-attention` repository could mark this specific build configuration for better long-term clarity.
Merge Readiness
This pull request is currently marked as `[WIP]` and the test results are `TBD`. Before this PR can be considered ready for merge, the critical point of verifying that the new `flash-attention` commit (`61f1cfbf303884e131632626e7349912d784726c`) correctly implements the described build restriction for FA2 (8.0+PTX only) needs to be addressed. Additionally, the CI tests should pass to confirm the wheel size reduction and functionality. As a reviewer, I am not authorized to approve pull requests, but I recommend addressing these points before seeking final approval from other maintainers.
```diff
@@ -38,7 +38,7 @@ else()
   FetchContent_Declare(
     vllm-flash-attn
     GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-    GIT_TAG 8798f27777fb57f447070301bf33a9f9c607f491
+    GIT_TAG 61f1cfbf303884e131632626e7349912d784726c
```
Pinning to a specific commit hash ensures reproducibility, which is good. However, for long-term maintainability and understanding, raw commit hashes can be less clear than descriptive tags or dedicated branches.

Is this commit `61f1cfbf303884e131632626e7349912d784726c` part of a stable release, a feature branch, or a temporary fix in the `vllm-project/flash-attention` repository? If it's intended to be a specific version with these build characteristics, would it be feasible to create a tag (e.g., `vX.Y.Z-fa2-8.0ptx`) or use a named branch in the dependency repository? This could make it easier to understand the nature of this pinned version in the future.
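As a hedged sketch of that suggestion (the tag name is hypothetical and does not exist in `vllm-project/flash-attention`):

```cmake
include(FetchContent)

# Hypothetical alternative: pin to a descriptive tag that encodes the
# build configuration, instead of a raw commit hash.
FetchContent_Declare(
  vllm-flash-attn
  GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
  GIT_TAG        vX.Y.Z-fa2-8.0ptx  # placeholder tag name from the suggestion above
)
FetchContent_MakeAvailable(vllm-flash-attn)
```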
Unfortunately this does seem to hurt performance a lot. B200 long context throughput:
Did you see bad perf on subsequent runs as well? Maybe the JIT cache needs to warm up.

EDIT: got set up on a Blackwell machine; the slowdown seems to come from the JIT cache warmup.

First run:

Subsequent run:

NOTE: you can clear the JIT cache with
Nice work, I think this is a worthwhile tradeoff.
Signed-off-by: minpeter <kali2005611@gmail.com>
Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Purpose
Reduce wheel size by only building FA2 8.0+PTX instead of 8.0, 9.0, 10.0, etc.
376.35 MB -> 339.32 MB
This does cause a slowdown in the initial runs on a machine while the PTX is JIT-compiled.

Subsequent runs:
Test Plan
CI
Test Result
CI passing