[Feature] add quick all reduce #19744
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically to catch errors quickly. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @lihaoyang-amd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances vLLM's distributed communication capabilities on AMD GPUs by introducing a highly optimized 'Quick All-Reduce' feature. It integrates a specialized quickreduce library, providing support for various quantization levels to accelerate all-reduce operations. The changes include adding low-level C++/HIP kernels, exposing them through Python bindings, and implementing intelligent dispatch logic to automatically select the most performant communication strategy based on tensor properties and system configuration, aiming to improve performance for large tensor all-reduces.
Highlights
- New Feature: Quick All-Reduce: Introduced a 'Quick All-Reduce' feature specifically optimized for AMD GPUs (ROCm), leveraging the quickreduce library for enhanced distributed communication performance.
- Quantization Support: The new quick all-reduce supports various quantization levels, including FP16, INT8, INT6, and INT4, allowing for flexible performance tuning based on precision requirements.
- Intelligent Dispatch Logic: Integrated the quick all-reduce into vLLM's existing custom all-reduce framework, implementing logic to dynamically select between the standard custom all-reduce and the new quick all-reduce based on message size, data type, and the distributed world size (a rough sketch of this gating follows the list).
- Low-Level Kernel Implementation: Added new C++/HIP kernels and Python bindings for the quick all-reduce operations, including specialized code for efficient IPC memory handling and GPU-specific intrinsics for AMD's CDNA architecture.
- Configurable Behavior: Introduced new environment variables (VLLM_ROCM_QR_QUANT_REGIME and VLLM_ROCM_QR_CAST_BF16_TO_FP16) to allow users to configure the quick all-reduce's quantization level and bfloat16 casting behavior.
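To make that gating concrete, here is a minimal sketch of the kind of dispatch described above. The thresholds, supported world sizes, and function name are illustrative assumptions, not the PR's actual implementation.

```python
# Minimal sketch of size/dtype/world-size gating for choosing quick all-reduce.
# Thresholds, supported world sizes, and the function name are assumptions for
# illustration; the real dispatch lives in vLLM's device communicator code.
import torch

QR_MIN_SIZE = 2 * 1024 * 1024           # assumed: below ~2 MB, custom allreduce is kept
QR_MAX_SIZE = 2 * 1024 * 1024 * 1024    # 2 GB default quick-reduce buffer (per this PR)

def should_use_quick_allreduce(inp: torch.Tensor, world_size: int) -> bool:
    """Return True if the quick all-reduce path should handle this tensor."""
    if world_size not in (2, 4, 8):                       # assumed supported TP sizes
        return False
    if inp.dtype not in (torch.float16, torch.bfloat16):  # fp16/bf16 inputs only
        return False
    nbytes = inp.numel() * inp.element_size()
    return QR_MIN_SIZE <= nbytes <= QR_MAX_SIZE
```

Anything outside this window would fall back to the existing custom allreduce or RCCL.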
Code Review
This pull request introduces a 'Quick All Reduce' feature, primarily targeting ROCm/HIP environments. It adds new C++/HIP kernels for an accelerated all-reduce operation, including support for quantization (FP16, INT8, INT6, INT4). The changes include new source files for the QuickReduce implementation, CMake build system updates, PyTorch C++ bindings, Python wrappers, and integration into the CustomAllreduce class. Key areas of feedback include a critical bug in a quantization codec, a potential runtime error in Python logic, and a suggestion for type clarity in C++ header declarations.
csrc/ops.h (outdated)
@@ -360,3 +360,13 @@ std::tuple<int64_t, torch::Tensor> allocate_shared_buffer_and_handle(
    int64_t size);
int64_t open_mem_handle(torch::Tensor& mem_handle);
void free_shared_buffer(int64_t buffer);

#ifdef USE_ROCM
fptr_t init_custom_qr(int64_t rank, int64_t world_size);
The new qr_* function declarations (lines 365-370) use fptr_t. Their implementations in custom_quickreduce.cu use quickreduce::fptr_t (which is an int64_t namespaced under quickreduce).
While both fptr_t (if it's a global int64_t typedef, common in such bindings) and quickreduce::fptr_t might resolve to int64_t, explicitly using quickreduce::fptr_t in these declarations would improve clarity, type safety, and robustness, especially if the underlying type of quickreduce::fptr_t were to change or if the global fptr_t had a different meaning or origin.
This change should be applied to all relevant qr_* function declarations in this block (lines 365-370).
Suggested change:
- fptr_t init_custom_qr(int64_t rank, int64_t world_size);
+ quickreduce::fptr_t init_custom_qr(int64_t rank, int64_t world_size);
+1 to sorting out the namespace issue
Forcing the prefix would result in a compilation error: ops.h doesn't include any headers for the .cu implementations, so the quickreduce namespace isn't visible there. Custom allreduce doesn't do that either; it defines fptr_t in ops.h, so I think we can reuse fptr_t.
This pull request has merge conflicts that must be resolved before it can be merged.
Can you merge from main to see if the CI failures are resolved?
This pull request has merge conflicts that must be resolved before it can be merged.
Why does this PR modify custom_all_reduce.py instead of introducing a quick_all_reduce.py device communicator? Prior to this commit, the two implementations were separate, and I think that approach was cleaner overall with more minimal changes.
- The external interfaces, initialization functions, reduce logic, and cleanup logic for quick allreduce and custom allreduce are very similar. Integrating quick allreduce into custom allreduce allows for code reuse, resulting in fewer and more concise changes. (This way, we don't need to modify cuda_communicator.py.)
- Quick allreduce is intended to complement custom allreduce. Even when quick allreduce is available, custom allreduce should be used for small amounts of data (e.g. 2MB or less, mostly during the decode phase), and integration facilitates easier switching between the two. Quick allreduce is a type of custom allreduce, but it cannot fully replace it; exposing an additional quick allreduce interface may confuse users as to why they would still need custom allreduce.
The main reason I think it should be split up is that it would significantly reduce the number of if statements, branches, and different cases that need to be handled in every function in this file. Most of the time, vLLM developers and users only need to think about custom allreduce, and this PR significantly increases the complexity of the code for that case. Even if split up, QuickReduce can still fall back to custom allreduce.
Makes sense. So let's go back to splitting.
Add min sizes for QR Cleanup (Signed-off-by: ilmarkov <imarkov@redhat.com>)
go back to splitting (Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>)
Just for ROCm.
1. Add a quickreduce alternative to custom allreduce. (For large amounts of data, custom quick reduce is used instead of custom allreduce and RCCL; see the kernel test results below.)
2. The collective is only enabled on AMD MI300, for fp16/bf16 inputs, and when custom allreduce is enabled. The kernels support full-precision as well as quantized int8, int6, and int4 (symmetric quantization with group size 32) all-reduce collectives; a sketch of the quantization scheme follows this list.
3. Quickreduce can be enabled by setting the VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=[NONE|FP|INT8|INT6|INT4] environment variable. Quickreduce supports int8, int6, and int4 quantization; NONE turns quick allreduce off.
4. The PR provides both fp16 and bf16 kernels, but given the lack of bf16 math intrinsics, the bf16 kernels perform worse (see the kernel benchmark results below), so by default we convert bf16 all-reduce inputs to fp16. To disable this behavior, set the environment variable VLLM_ROCM_QR_CAST_BF16_TO_FP16=0.
5. Since quickreduce only shows performance benefits at medium and large input sizes (see the kernel benchmarks), vLLM keeps using custom allreduce for small inputs. The lower limit for enabling quickreduce was chosen based on experimental results.
6. The default maximum input size for quickreduce is 2GB. For users with limited GPU memory, the preset buffer may be a bit too large; the value can be adjusted (in MB) via VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB. A usage sketch for these environment variables also follows this list.
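For reference, the symmetric quantization with group size 32 mentioned in item 2 can be sketched in plain PyTorch as follows. This only illustrates the numerical scheme; the actual codecs are packed HIP kernels, and the helper names here are hypothetical.

```python
# Illustrative sketch of symmetric per-group quantization with group size 32.
# The real quickreduce codecs are HIP kernels operating on packed data; this
# Python version only shows the arithmetic of the scheme.
import torch

GROUP_SIZE = 32

def quantize_symmetric(x: torch.Tensor, num_bits: int = 8):
    """Quantize a flat fp16 tensor in groups of 32 values with one scale per group."""
    assert x.numel() % GROUP_SIZE == 0
    qmax = 2 ** (num_bits - 1) - 1                  # 127 for int8, 31 for int6, 7 for int4
    groups = x.float().view(-1, GROUP_SIZE)         # [num_groups, 32]
    scale = groups.abs().amax(dim=1, keepdim=True) / qmax
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    q = torch.clamp(torch.round(groups / scale), -qmax, qmax).to(torch.int8)
    return q, scale                                 # scales travel alongside the payload

def dequantize_symmetric(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).view(-1).half()

# Round-trip example: each rank would dequantize peers' contributions before summing.
x = torch.randn(4096, dtype=torch.float16)
q, s = quantize_symmetric(x, num_bits=4)
print((dequantize_symmetric(q, s) - x).abs().max())
```

Each group of 32 values shares one scale, so int4 cuts the exchanged payload to roughly a quarter of fp16, plus the per-group scales.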
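And here is a hypothetical usage sketch for the environment variables from items 3, 4, and 6, set before launching the server (the serve command mirrors the benchmark setup below); the chosen values are examples only, not recommendations.

```python
# Example only: configure quick all-reduce via the environment variables described
# above, then launch a vLLM server. The specific values here are illustrative.
import os
import subprocess

env = dict(os.environ)
env["VLLM_ROCM_QUICK_REDUCE_QUANTIZATION"] = "INT6"      # NONE | FP | INT8 | INT6 | INT4
env["VLLM_ROCM_QR_CAST_BF16_TO_FP16"] = "1"              # keep the default bf16 -> fp16 cast
env["VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB"] = "512"  # shrink the 2 GB default buffer

subprocess.run(
    ["vllm", "serve", "meta-llama/Llama-3.1-70B-Instruct",
     "--tensor-parallel-size", "8", "--dtype", "float16"],
    env=env,
    check=True,
)
```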
Kernels benchmark
Baseline is custom allreduce when the data size is less than 16MB and RCCL when it is greater than 16MB.
[Benchmark plots: TP=2, TP=4, TP=8]
E2E server benchmark float16
Server:
VLLM_USE_V1=1 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve meta-llama/Llama-3.1-70B-Instruct --block_size=32 --disable-log-requests --no-enable-prefix-caching -tp $tp --dtype float16
Client:
python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.1-70B-Instruct --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts 500 --request-rate 10 --ignore-eos
[Result tables: TP=8, TP=4]
E2E server benchmark bfloat16
Server:
VLLM_USE_V1=1 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve model_path --block_size=32 --disable-log-requests --no-enable-prefix-caching -tp $tp --dtype auto
Client:
python benchmarks/benchmark_serving.py --model model_path --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts 500 --request-rate 10 --ignore-eos
bfloat16 kernels (the fp16 kernel results in the tables are obtained with VLLM_ROCM_QR_CAST_BF16_TO_FP16=1):
[Result tables: Llama-3.1-70B-Instruct TP=4 (with VLLM_ROCM_QR_CAST_BF16_TO_FP16=1), Qwen2.5-72B TP=8, Llama-3.1-70B TP=8, Llama-3.1-8B TP=8]
Evaluation results
[MMLU results (LLaMa 3.1 70B, TP=8); GSM8K results (with bf16-to-fp16 casting via VLLM_ROCM_QR_CAST_BF16_TO_FP16=1)]
@ilmarkov is the originator of this PR; I'm a collaborator.