[Wheel Size] Only build FA2 8.0+PTX #19336


Merged: 2 commits into vllm-project:main on Jun 17, 2025

Conversation

@LucasWilkinson (Collaborator) commented Jun 9, 2025

Purpose

Reduce wheel size by building FA2 only for compute capability 8.0 plus PTX, instead of 8.0, 9.0, 10.0, etc.

376.35 MB -> 339.32 MB

This does cause a slowdown in the initial runs on a machine while the PTX is JIT-compiled:

(vllm) lwilkinson@gpu66:~/code/vllm$ vllm bench throughput --model RedHatAI/Meta-Llama-3.1-8B-FP8 --load-format dummy --input-len 10000 --output-len 200 --num-prompts 100
...
Throughput: 1.27 requests/s, 12954.78 total tokens/s, 254.02 output tokens/s

Subsequent runs:

(vllm) lwilkinson@gpu66:~/code/vllm$ vllm bench throughput --model RedHatAI/Meta-Llama-3.1-8B-FP8 --load-format dummy --input-len 10000 --output-len 200 --num-prompts 100
...
Throughput: 4.15 requests/s, 42355.56 total tokens/s, 830.50 output tokens/s
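
For context, a rough sketch of what the reduced arch list means at build time (the env var value and kernel file name below are illustrative, not the exact vLLM build invocation): only SM 8.0 SASS is embedded, plus PTX that the driver JIT-compiles the first time the kernels run on a newer GPU (e.g. SM 9.0/10.0).

# Illustrative only: restrict the build to 8.0 SASS plus forward-compatible PTX
export TORCH_CUDA_ARCH_LIST="8.0+PTX"

# Roughly the nvcc code-generation flags this maps to:
#   code=sm_80      -> embed SASS for compute capability 8.0
#   code=compute_80 -> embed PTX, JIT-compiled by the driver on newer GPUs
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_80,code=compute_80 \
     -c flash_attn_kernel.cu -o flash_attn_kernel.o   # hypothetical file name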

Test Plan

CI

Test Result

CI passing

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

github-actions bot commented Jun 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Hello @LucasWilkinson, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

Summary of Changes

Hello everyone, gemini-code-assist here to provide a summary of this pull request. The primary goal of this PR, as indicated by the title and description, is to reduce the size of the built wheel package. This is done by modifying the build process to include FlashAttention 2 (FA2) support only for CUDA compute capability 8.0 plus PTX, rather than building native code for multiple compute capabilities (8.0, 9.0, 10.0, etc.). The change itself updates the commit hash used for the vllm-flash-attn external project dependency in the CMake build configuration.

Highlights

  • Dependency Update: The core change is updating the Git tag for the vllm-flash-attn external project dependency in the CMake configuration.
  • Build Configuration Change: The update is intended to align the build process with the goal of only building FlashAttention for CUDA 8.0+PTX, aiming to reduce the final wheel size.
  • Wheel Size Reduction: This PR is part of an effort to decrease the size of the distribution package by being more selective about the included CUDA architectures for FlashAttention.

Changelog

  • cmake/external_projects/vllm_flash_attn.cmake
    • Updated the GIT_TAG for the vllm-flash-attn FetchContent_Declare from 8798f27777fb57f447070301bf33a9f9c607f491 to 61f1cfbf303884e131632626e7349912d784726c (line 41 in both the before and after snippets). This points the build at a different commit of the flash-attention fork, presumably one configured to build only the targeted CUDA architecture subset.

@mergify bot added the ci/build label Jun 9, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to reduce the final wheel size by updating the flash-attention dependency to a specific commit. The intention is that this version of flash-attention will only build FA2 kernels for the 8.0+PTX architecture. This is a good optimization if it meets the project's target support. My review focuses on verifying this assumption and ensuring clarity for future maintenance.

Summary of Findings

  • Verification of Dependency's Build Behavior: It's crucial to confirm that the specified GIT_TAG (commit 61f1cfbf303884e131632626e7349912d784726c) for the vllm-flash-attention dependency indeed restricts the FA2 build to only the 8.0+PTX architecture, as this is the core purpose of the PR.
  • Clarity of Dependency Versioning: Using a raw commit hash for dependencies is precise but can be less transparent for maintenance. Consider if a tag or a dedicated branch in the vllm-project/flash-attention repository could be used to mark this specific build configuration for better long-term clarity.

Merge Readiness

This pull request is currently marked as [WIP] and the test results are TBD. Before this PR can be considered ready for merge, the critical point about verifying that the new flash-attention commit (61f1cfbf303884e131632626e7349912d784726c) correctly implements the described build restrictions for FA2 (8.0+PTX only) needs to be addressed. Additionally, the CI tests should pass to confirm the wheel size reduction and functionality. As a reviewer, I am not authorized to approve pull requests, but I recommend addressing these points before seeking final approval from other maintainers.
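
One way to verify the build restriction described above (a hedged sketch; the extension name and path inside the built wheel are assumed here) is to dump the embedded SASS and PTX from the FA2 shared library:

# Illustrative check: list the architectures baked into the built FA2 extension
cuobjdump --list-elf vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so   # expect only sm_80 entries
cuobjdump --list-ptx vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so   # expect compute_80 PTX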

cmake/external_projects/vllm_flash_attn.cmake
@@ -38,7 +38,7 @@ else()
   FetchContent_Declare(
     vllm-flash-attn
     GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-    GIT_TAG 8798f27777fb57f447070301bf33a9f9c607f491
+    GIT_TAG 61f1cfbf303884e131632626e7349912d784726c
@gemini-code-assist bot (Contributor) commented on this line (severity: medium):

Pinning to a specific commit hash ensures reproducibility, which is good. However, for long-term maintainability and understanding, raw commit hashes can sometimes be less clear than using descriptive tags or dedicated branches.

Is this commit 61f1cfbf303884e131632626e7349912d784726c part of a stable release, a feature branch, or a temporary fix in the vllm-project/flash-attention repository? If it's intended to be a specific version with these build characteristics, would it be feasible to create a tag (e.g., vX.Y.Z-fa2-8.0ptx) or use a named branch in the dependency repository? This could make it easier to understand the nature of this pinned version in the future.
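
A quick way to answer that provenance question locally (a sketch, assuming a fresh clone of vllm-project/flash-attention):

git clone https://github.com/vllm-project/flash-attention.git
cd flash-attention
# Which remote branches and tags contain the pinned commit?
git branch -r --contains 61f1cfbf303884e131632626e7349912d784726c
git tag --contains 61f1cfbf303884e131632626e7349912d784726c
# Commit subject and date, for the record
git show --no-patch --format='%h %s (%ci)' 61f1cfbf303884e131632626e7349912d784726c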

@LucasWilkinson marked this pull request as ready for review June 9, 2025 01:14
@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 9, 2025
@mgoin (Member) commented Jun 9, 2025

Unfortunately this does seem to hurt performance a lot.

B200 long context throughput

# Benchmark
vllm bench throughput --model RedHatAI/Meta-Llama-3.1-8B-FP8 --load-format dummy --input-len 10000 --output-len 200 --num-prompts 100

# Before
Throughput: 4.17 requests/s, 42510.83 total tokens/s, 833.55 output tokens/s

# After
Throughput: 1.27 requests/s, 12969.67 total tokens/s, 254.31 output tokens/s

@LucasWilkinson (Collaborator, Author) commented Jun 10, 2025

> Unfortunately this does seem to hurt performance a lot.
>
> B200 long context throughput
>
> # Benchmark
> vllm bench throughput --model RedHatAI/Meta-Llama-3.1-8B-FP8 --load-format dummy --input-len 10000 --output-len 200 --num-prompts 100
>
> # Before
> Throughput: 4.17 requests/s, 42510.83 total tokens/s, 833.55 output tokens/s
>
> # After
> Throughput: 1.27 requests/s, 12969.67 total tokens/s, 254.31 output tokens/s

Did you see bad perf on subsequent runs too? Maybe the JIT cache needs to warm up.

EDIT: got setup on a Blackwell machine, seems like the slowdown is from the JIT cache warmup

First run:

Throughput: 1.27 requests/s, 12954.78 total tokens/s, 254.02 output tokens/s

Subsequent run:

Throughput: 4.15 requests/s, 42355.56 total tokens/s, 830.50 output tokens/s

NOTE: you can clear the JIT cache with rm -rf ~/.nv/ComputeCache
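
For reference, the driver JIT cache being warmed up here can also be relocated or enlarged via the standard CUDA environment variables (the path and size below are illustrative):

# Reproduce a cold start by clearing the default JIT cache (as noted above)
rm -rf ~/.nv/ComputeCache

# Illustrative settings: move the cache and raise its size cap so the
# JIT-compiled FA2 kernels persist across runs
export CUDA_CACHE_PATH=/var/cache/cuda-jit             # hypothetical location
export CUDA_CACHE_MAXSIZE=$((4 * 1024 * 1024 * 1024))  # 4 GiB, in bytes
# export CUDA_CACHE_DISABLE=1                          # or disable JIT caching entirely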

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
@LucasWilkinson changed the title from [WIP][Wheel Size] Only build FA2 8.0+PTX to [Wheel Size] Only build FA2 8.0+PTX Jun 16, 2025
@mgoin (Member) left a comment

Nice work, I think this is a worthwhile tradeoff.

@mgoin merged commit 0733495 into vllm-project:main on Jun 17, 2025
46 checks passed

yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Jun 22, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: minpeter <kali2005611@gmail.com>
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: Yang Wang <elainewy@meta.com>
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>