[Wheel Size] Only build FA2 8.0+PTX #19336


Merged: 2 commits into vllm-project:main on Jun 17, 2025

Conversation

@LucasWilkinson (Collaborator) commented Jun 9, 2025

Purpose

Reduce wheel size by building FA2 only for compute capability 8.0 plus PTX, instead of 8.0, 9.0, 10.0, etc.

376.35 MB -> 339.32 MB

This does cause a slowdown in the initial runs on a machine while the PTX is JIT-compiled:

(vllm) lwilkinson@gpu66:~/code/vllm$ vllm bench throughput --model RedHatAI/Meta-Llama-3.1-8B-FP8 --load-format dummy --input-len 10000 --output-len 200 --num-prompts 100
...
Throughput: 1.27 requests/s, 12954.78 total tokens/s, 254.02 output tokens/s

Subsequent runs:

(vllm) lwilkinson@gpu66:~/code/vllm$ vllm bench throughput --model RedHatAI/Meta-Llama-3.1-8B-FP8 --load-format dummy --input-len 10000 --output-len 200 --num-prompts 100
...
Throughput: 4.15 requests/s, 42355.56 total tokens/s, 830.50 output tokens/s
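
For context, a rough sketch of what the reduced arch list means at build time (the env var value and kernel file name below are illustrative, not the exact vLLM build invocation): only SM 8.0 SASS is embedded, plus PTX that the driver JIT-compiles the first time the kernels run on a newer GPU (e.g. SM 9.0/10.0).

# Illustrative only: restrict the build to 8.0 SASS plus forward-compatible PTX
export TORCH_CUDA_ARCH_LIST="8.0+PTX"

# Roughly the nvcc code-generation flags this maps to:
#   code=sm_80      -> embed SASS for compute capability 8.0
#   code=compute_80 -> embed PTX, JIT-compiled by the driver on newer GPUs
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_80,code=compute_80 \
     -c flash_attn_kernel.cu -o flash_attn_kernel.o   # hypothetical file name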

Test Plan

CI

Test Result

CI passing

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

github-actions bot commented Jun 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Hello @LucasWilkinson, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

Summary of Changes

Hello everyone, gemini-code-assist here to provide a summary of this pull request. The primary goal of this PR, as indicated by the title and description, is to reduce the size of the built wheel package. This is done by modifying the build process to include FlashAttention 2 (FA2) support only for CUDA compute capability 8.0 plus PTX, rather than building native code for multiple compute capabilities (8.0, 9.0, 10.0, etc.). The change itself updates the commit hash used for the vllm-flash-attn external project dependency in the CMake build configuration.

Highlights

  • Dependency Update: The core change is updating the Git tag for the vllm-flash-attn external project dependency in the CMake configuration.
  • Build Configuration Change: The update is intended to align the build process with the goal of only building FlashAttention for CUDA 8.0+PTX, aiming to reduce the final wheel size.
  • Wheel Size Reduction: This PR is part of an effort to decrease the size of the distribution package by being more selective about the included CUDA architectures for FlashAttention.

Changelog

  • cmake/external_projects/vllm_flash_attn.cmake
    • Updated the GIT_TAG for the vllm-flash-attn FetchContent_Declare from 8798f27777fb57f447070301bf33a9f9c607f491 to 61f1cfbf303884e131632626e7349912d784726c (line 41 in both the before and after snippets). This points the build at a different commit of the flash-attention fork, presumably one configured to build only the targeted CUDA architecture subset.

@mergify bot added the ci/build label Jun 9, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to reduce the final wheel size by updating the flash-attention dependency to a specific commit. The intention is that this version of flash-attention will only build FA2 kernels for the 8.0+PTX architecture. This is a good optimization if it meets the project's target support. My review focuses on verifying this assumption and ensuring clarity for future maintenance.

Summary of Findings

  • Verification of Dependency's Build Behavior: It's crucial to confirm that the specified GIT_TAG (commit 61f1cfbf303884e131632626e7349912d784726c) for the vllm-flash-attention dependency indeed restricts the FA2 build to only the 8.0+PTX architecture, as this is the core purpose of the PR.
  • Clarity of Dependency Versioning: Using a raw commit hash for dependencies is precise but can be less transparent for maintenance. Consider if a tag or a dedicated branch in the vllm-project/flash-attention repository could be used to mark this specific build configuration for better long-term clarity.

Merge Readiness

This pull request is currently marked as [WIP] and the test results are TBD. Before this PR can be considered ready for merge, the critical point about verifying that the new flash-attention commit (61f1cfbf303884e131632626e7349912d784726c) correctly implements the described build restrictions for FA2 (8.0+PTX only) needs to be addressed. Additionally, the CI tests should pass to confirm the wheel size reduction and functionality. As a reviewer, I am not authorized to approve pull requests, but I recommend addressing these points before seeking final approval from other maintainers.
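
One way to verify the build restriction described above (a hedged sketch; the extension name and path inside the built wheel are assumed here) is to dump the embedded SASS and PTX from the FA2 shared library:

# Illustrative check: list the architectures baked into the built FA2 extension
cuobjdump --list-elf vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so   # expect only sm_80 entries
cuobjdump --list-ptx vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so   # expect compute_80 PTX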

cmake/external_projects/vllm_flash_attn.cmake
@@ -38,7 +38,7 @@ else()
   FetchContent_Declare(
     vllm-flash-attn
     GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-    GIT_TAG 8798f27777fb57f447070301bf33a9f9c607f491
+    GIT_TAG 61f1cfbf303884e131632626e7349912d784726c
@gemini-code-assist bot (Contributor) commented on this line (severity: medium):

Pinning to a specific commit hash ensures reproducibility, which is good. However, for long-term maintainability and understanding, raw commit hashes can sometimes be less clear than using descriptive tags or dedicated branches.

Is this commit 61f1cfbf303884e131632626e7349912d784726c part of a stable release, a feature branch, or a temporary fix in the vllm-project/flash-attention repository? If it's intended to be a specific version with these build characteristics, would it be feasible to create a tag (e.g., vX.Y.Z-fa2-8.0ptx) or use a named branch in the dependency repository? This could make it easier to understand the nature of this pinned version in the future.
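
A quick way to answer that provenance question locally (a sketch, assuming a fresh clone of vllm-project/flash-attention):

git clone https://github.com/vllm-project/flash-attention.git
cd flash-attention
# Which remote branches and tags contain the pinned commit?
git branch -r --contains 61f1cfbf303884e131632626e7349912d784726c
git tag --contains 61f1cfbf303884e131632626e7349912d784726c
# Commit subject and date, for the record
git show --no-patch --format='%h %s (%ci)' 61f1cfbf303884e131632626e7349912d784726c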

@LucasWilkinson marked this pull request as ready for review June 9, 2025 01:14
@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 9, 2025
@mgoin (Member) commented Jun 9, 2025

Unfortunately this does seem to hurt performance a lot.

B200 long context throughput

# Benchmark
vllm bench throughput --model RedHatAI/Meta-Llama-3.1-8B-FP8 --load-format dummy --input-len 10000 --output-len 200 --num-prompts 100

# Before
Throughput: 4.17 requests/s, 42510.83 total tokens/s, 833.55 output tokens/s

# After
Throughput: 1.27 requests/s, 12969.67 total tokens/s, 254.31 output tokens/s

@LucasWilkinson (Collaborator, Author) commented Jun 10, 2025

> Unfortunately this does seem to hurt performance a lot.
>
> B200 long context throughput
>
> # Benchmark
> vllm bench throughput --model RedHatAI/Meta-Llama-3.1-8B-FP8 --load-format dummy --input-len 10000 --output-len 200 --num-prompts 100
>
> # Before
> Throughput: 4.17 requests/s, 42510.83 total tokens/s, 833.55 output tokens/s
>
> # After
> Throughput: 1.27 requests/s, 12969.67 total tokens/s, 254.31 output tokens/s

Did you see bad perf on subsequent runs too? Maybe the JIT cache needs to warm up.

EDIT: got setup on a Blackwell machine, seems like the slowdown is from the JIT cache warmup

First run:

Throughput: 1.27 requests/s, 12954.78 total tokens/s, 254.02 output tokens/s

Subsequent run:

Throughput: 4.15 requests/s, 42355.56 total tokens/s, 830.50 output tokens/s

NOTE: you can clear the JIT cache with rm -rf ~/.nv/ComputeCache
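
For reference, the driver JIT cache being warmed up here can also be relocated or enlarged via the standard CUDA environment variables (the path and size below are illustrative):

# Reproduce a cold start by clearing the default JIT cache (as noted above)
rm -rf ~/.nv/ComputeCache

# Illustrative settings: move the cache and raise its size cap so the
# JIT-compiled FA2 kernels persist across runs
export CUDA_CACHE_PATH=/var/cache/cuda-jit             # hypothetical location
export CUDA_CACHE_MAXSIZE=$((4 * 1024 * 1024 * 1024))  # 4 GiB, in bytes
# export CUDA_CACHE_DISABLE=1                          # or disable JIT caching entirely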

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
@LucasWilkinson changed the title from [WIP][Wheel Size] Only build FA2 8.0+PTX to [Wheel Size] Only build FA2 8.0+PTX Jun 16, 2025
@mgoin (Member) left a comment

Nice work, I think this is a worthwhile tradeoff.

@mgoin merged commit 0733495 into vllm-project:main on Jun 17, 2025
46 checks passed

yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Jun 22, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: minpeter <kali2005611@gmail.com>
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: Yang Wang <elainewy@meta.com>
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>