deepep low latency + fp8 dispatch - test fixes #19911
Conversation
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @varun-sundar-rabindranath, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on enhancing the robustness and coverage of tests for DeepEP (Deep Expert Parallelism) with low-latency and FP8 dispatch capabilities. It refactors how test data is generated to align more closely with actual kernel implementations and expands the test parameter space, alongside making minor but important numerical adjustments within the core FP8 quantization kernel.
Highlights
- Test Infrastructure Enhancement: The test suite for DeepEP MoE (Mixture of Experts) has been updated to utilize the actual `fused_topk` kernel for generating `topk_weights` and `topk_ids`, moving away from randomized generation. This ensures that tests more accurately reflect the behavior of the production code.
- Expanded Test Coverage: The test matrix (`MNKs`) has been significantly expanded to include a wider range of tensor dimensions, and testing with `USE_FP8_DISPATCH=True` has been enabled. This broadens the scope of testing for low-latency and FP8 dispatch scenarios (see the sketch after this list).
- FP8 Kernel Numerical Stability: Adjustments were made within the `batched_deep_gemm_moe.py` kernel, specifically in `_silu_mul_fp8_quant_deep_gemm`, to improve numerical stability and precision during FP8 quantization, including refining type casting and adjusting the epsilon value.
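As a rough illustration of the expanded coverage, here is a hedged sketch of what the parametrization could look like; the shape list, parameter names, and test name below are assumptions for illustration, not the exact contents of the test file.

```python
import pytest

# Hypothetical (M, N, K) shapes; the real expanded list lives in the test file.
MNKs = [
    (1, 128, 2560),
    (2, 128, 2560),
    (32, 1024, 1024),
    (64, 1024, 512),
]

@pytest.mark.parametrize("m,n,k", MNKs)
@pytest.mark.parametrize("use_fp8_dispatch", [False, True])  # FP8 dispatch now exercised too
def test_deepep_low_latency(m, n, k, use_fp8_dispatch):
    # Build inputs of shape (m, k), run the DeepEP low-latency path with the
    # chosen dispatch mode, and compare against a reference MoE implementation.
    ...
```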
Code Review
This pull request introduces fixes and enhancements for DeepEP (DeepGEMM MoE) tests, particularly focusing on low latency and FP8 dispatch scenarios. Key changes include a critical correction in FP8 minimum value calculation, adjustments to a Triton kernel for clarity, and significant updates to test data generation and parameters to improve coverage and realism. The tests now cover both FP8 dispatch enabled and disabled cases, and use more representative dimensions.
```python
score = torch.randn((m, config.num_experts),
                    device="cuda",
                    dtype=dtype)
topk_weights, topk_ids, _ = fused_topk(rank_tokens, score, topk, False)
```
The `fused_topk` function returns three values: `topk_weights`, `topk_ids`, and `token_expert_indices`. Currently, `token_expert_indices` is being discarded with `_`. While `fused_experts` (called later in the test) re-derives the necessary routing information from `topk_ids`, it's worth confirming if there's any scenario where these pre-computed `token_expert_indices` might be useful for more direct testing, or if `fused_topk` incurs significant overhead in computing them only to be discarded. If they are truly not needed for this test's purpose, this is fine.
```diff
@@ -152,7 +152,7 @@ def silu_mul_fp8_quant_deep_gemm(
     f_info = torch.finfo(fp8_dtype)
     fp8_max = f_info.max
-    fp8_min = -f_info.max
+    fp8_min = f_info.min
```
This change from `fp8_min = -f_info.max` to `fp8_min = f_info.min` is a crucial correctness fix for handling FP8 data types. `torch.finfo(dtype).min` accurately provides the smallest representable finite number for the given FP8 `dtype` (e.g., `torch.float8_e4m3fn`), which is essential for correct clamping operations during quantization. Using `-f_info.max` is generally incorrect for floating-point types, especially for potentially asymmetric ones like FP8.
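As a quick sanity check (a minimal sketch, not part of the PR), `torch.finfo` reports the finite range for the FP8 dtypes in question:

```python
import torch

# Print the representable finite range of both PyTorch FP8 dtypes.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    f_info = torch.finfo(dtype)
    print(dtype, f_info.min, f_info.max)
# torch.float8_e4m3fn: min=-448.0,   max=448.0
# torch.float8_e5m2:   min=-57344.0, max=57344.0
```

For these two dtypes the range happens to be symmetric, so `-f_info.max` and `f_info.min` agree numerically; `f_info.min` is still the clearer and more robust choice for the lower clamp bound.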
```diff
-    x = x * (1.0 / (1.0 + tl.exp(-x)))
+    x = (x * (1.0 / (1.0 + tl.exp(-x)))).to(input_ptr.dtype.element_ty)
     y = x * y2
```
This type-conversion dance is needed to satisfy the tests when comparing against the CUDA kernel outputs. Reference: `vllm/csrc/activation_kernels.cu`, line 37 (at f1e840e):

```cpp
return (T)(((float)x) / (1.0f + expf((float)-x)));
```
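A minimal PyTorch sketch (an illustration of the intent described above, not code from the PR) of the rounding behavior the Triton change mirrors: compute SiLU in float32, then cast back to the activation dtype before the multiply.

```python
import torch

def silu_round_trip(x: torch.Tensor) -> torch.Tensor:
    # Mirror the CUDA kernel: do the math in float32, then cast back to
    # x.dtype, i.e. (T)(((float)x) / (1.0f + expf((float)-x))).
    x_f32 = x.float()
    return (x_f32 / (1.0 + torch.exp(-x_f32))).to(x.dtype)

# The Triton kernel's added `.to(input_ptr.dtype.element_ty)` reproduces this
# rounding step so that y = silu(x) * y2 matches the CUDA reference closely
# enough for the test comparison.
```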