Fixes IMA for TP w/ flex-attention #19712
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @drisspg, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request delivers a crucial fix for the FlexAttention backend, enabling its stable operation with multi-GPU Tensor Parallelism. By addressing a specific CUDA memory error related to block mask creation, it enhances the robustness and usability of FlexAttention in distributed environments, alongside minor code cleanups.
Highlights
- Tensor Parallelism Fix: Resolved CUDA memory errors that occurred when using FlexAttention with multi-GPU Tensor Parallelism (TP > 1) setups.
- Block Mask Generation: Modified the FlexAttention backend to consistently use `create_block_mask_compiled` for block mask generation, removing the previous conditional logic that bypassed it for TP > 1 configurations.
- Device Parameterization: Added an explicit `device` parameter to the `create_block_mask_compiled` call, ensuring the block mask is created on the correct device (see the sketch after this list).
- Code Cleanup: Removed an unused import (`get_tensor_model_parallel_world_size`) from the FlexAttention backend and an unnecessary environment variable setting (`VLLM_ENABLE_V1_MULTIPROCESSING=0`) from the FlexAttention tests.
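As a rough illustration of the resulting call pattern, here is a minimal sketch. It assumes `create_block_mask_compiled` is simply a `torch.compile`-wrapped version of PyTorch's `create_block_mask`, and `metadata` stands in for the backend object carrying `mask_mod`, the token counts, and `query_start_loc` (names taken from the diff shown later on this page).

```python
# A minimal sketch, assuming create_block_mask_compiled is a
# torch.compile-wrapped create_block_mask; field names mirror the diff below.
import torch
from torch.nn.attention.flex_attention import create_block_mask

create_block_mask_compiled = torch.compile(create_block_mask, fullgraph=True)


def build_block_mask(metadata):
    # B=None and H=None broadcast the mask over batch and heads.
    return create_block_mask_compiled(
        metadata.mask_mod,
        None,
        None,
        metadata.num_actual_tokens,
        metadata.total_cache_tokens,
        # The explicit device keeps the mask on this rank's GPU rather than
        # whatever default the compiled function would otherwise use.
        device=metadata.query_start_loc.device,
    )
```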
Code Review
This pull request effectively addresses potential CUDA illegal memory access errors in the FlexAttention backend, particularly when using tensor parallelism (TP > 1). The core of the fix lies in consistently utilizing `create_block_mask_compiled` and, importantly, providing an explicit `device` parameter during its invocation. This change enhances the robustness of the memory operations and simplifies the codebase by removing conditional logic that previously attempted to work around issues with the compiled version in TP > 1 scenarios.
The removal of the `VLLM_ENABLE_V1_MULTIPROCESSING=0` environment variable from the tests is a positive indicator that the fix allows for correct operation in multi-GPU or multi-process setups. The cleanup of an unused import also improves code maintainability.
While the PR description had a slight ambiguity regarding the necessity of the fix on the very latest PyTorch nightly, the implemented changes are sound engineering practice for multi-GPU CUDA programming and address a known class of issues. The code is now clearer and more robust.
return create_block_mask_compiled(
    self.mask_mod,
    None,
    None,
    self.num_actual_tokens,
    self.total_cache_tokens,
    device=self.query_start_loc.device,
)
This change to consistently use `create_block_mask_compiled` and explicitly pass `device=self.query_start_loc.device` is a significant improvement. It directly addresses potential CUDA illegal memory access errors in tensor parallel (TP > 1) setups by ensuring the block mask is created on the correct device.
This also simplifies the logic by removing the previous conditional use of `create_block_mask` (which was guarded by a `FIXME` for TP > 1 scenarios with the compiled version).
For future clarity, you might consider adding a brief inline comment above this call, noting that the explicit `device` parameter is key to resolving previous TP > 1 issues with `create_block_mask_compiled`.
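If such a comment were added inside the existing method, the call site might read roughly like this (attribute names are taken from the hunk above; the comment wording is only a suggestion):

```python
# Pass the device explicitly: create_block_mask_compiled does not pick it up
# from surrounding tensors, and building the mask on the wrong GPU previously
# surfaced as an illegal memory access (IMA) under TP > 1.
return create_block_mask_compiled(
    self.mask_mod,
    None,
    None,
    self.num_actual_tokens,
    self.total_cache_tokens,
    device=self.query_start_loc.device,
)
```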
Signed-off-by: drisspg <drisspguessous@gmail.com>
LGTM, thanks for the fix!
    self.mask_mod,
    None,
    None,
    self.num_actual_tokens,
    self.total_cache_tokens,
    device=self.block_table.device,
So passing in device is the key?
Yeah, we have run into this before, where torch.compile's lack of implicit device transfer causes it to show up as an IMA. I added some debug logging locally:
(VllmWorker rank=1 pid=4120217) INFO 06-16 18:59:19 [flex_attention.py:240] create_block_mask_compiled called with device: cuda:1
(VllmWorker rank=6 pid=4120237) INFO 06-16 18:59:19 [flex_attention.py:240] create_block_mask_compiled called with device: cuda:6
(VllmWorker rank=7 pid=4120242) INFO 06-16 18:59:19 [flex_attention.py:240] create_block_mask_compiled called with device: cuda:7
(VllmWorker rank=3 pid=4120224) INFO 06-16 18:59:19 [flex_attention.py:240] create_block_mask_compiled called with device: cuda:3
(VllmWorker rank=4 pid=4120227) INFO 06-16 18:59:19 [flex_attention.py:240] create_block_mask_compiled called with device: cuda:4
(VllmWorker rank=0 pid=4120216) INFO 06-16 18:59:19 [flex_attention.py:240] create_block_mask_compiled called with device: cuda:0
(VllmWorker rank=5 pid=4120233) INFO 06-16 18:59:19 [flex_attention.py:240] create_block_mask_compiled called with device: cuda:5
(VllmWorker rank=2 pid=4120220) INFO 06-16 18:59:19 [flex_attention.py:240] create_block_mask_compiled called with device: cuda:2
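For reference, the logging shown above could be produced by something as simple as the following; the helper name and exact placement in `flex_attention.py` are assumptions:

```python
# Sketch: log which device the block mask will be built on, mirroring the
# per-rank worker output above. Helper name and placement are illustrative.
import logging

logger = logging.getLogger(__name__)


def log_block_mask_device(query_start_loc):
    device = query_start_loc.device
    logger.info("create_block_mask_compiled called with device: %s", device)
    return device
```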
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: drisspg <drisspguessous@gmail.com> Signed-off-by: minpeter <kali2005611@gmail.com>
Signed-off-by: drisspg <drisspguessous@gmail.com> Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: drisspg <drisspguessous@gmail.com>
FlexAttention Backend Fix
So I have been using this file for testing: https://gist.github.com/drisspg/3050c61f587030f09b96d86e14b10711
I am on the latest PyTorch nightly, and I found that it is working even before this fix, so I am not sure the fix is strictly necessary there. That being said, people have run into the create_block_mask problem with compile before, so this was a mistake on my end.
Files Changed
tests/kernels/test_flex_attention.py
vllm/v1/attention/backends/flex_attention.py
Changes
- Removed the `VLLM_ENABLE_V1_MULTIPROCESSING=0` env var from the FlexAttention tests
- Always use `create_block_mask_compiled`, passing the target `device` explicitly
Result
FlexAttention now works with multi-GPU tensor parallel setups without CUDA errors.
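For anyone wanting to reproduce the multi-GPU check locally, a minimal script along these lines should exercise the path. The model name and the `VLLM_ATTENTION_BACKEND=FLEX_ATTENTION` selector are assumptions used for illustration:

```python
# Sketch: run a short generation with TP > 1 so the FlexAttention backend
# has to build block masks on ranks other than cuda:0.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"  # assumed backend selector

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```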