Force vLLM non-gated MoE through Triton #1572
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #1572 +/- ##
==========================================
- Coverage 76.88% 76.50% -0.38%
==========================================
Files 478 478
Lines 52209 54286 +2077
==========================================
+ Hits 40140 41533 +1393
- Misses 12069 12753 +684
vLLM can select fused FlashInfer MoE backends whose expert GEMMs are not visible to ModelOpt fakequant hooks. Rebuilding vLLM's internal MoE kernel after model loading is fragile and can mismatch the weight layout that vLLM already selected during process_weights_after_loading.

Keep the ModelOpt plugin simple: accept decomposed Triton backends, leave disabled expert quantizers alone, and fail loudly for fused or unsupported MoE backends with guidance to configure vLLM with moe_backend='triton' before model construction. The vLLM fakequant serving example now defaults to that backend when the installed vLLM parser exposes the option.

Constraint: vLLM backend selection controls both kernel construction and weight post-processing, so changing only the runtime kernel inside ModelOpt is not robust.
Rejected: Reconstruct vLLM MoE kernels from oracle helper functions | too much low-level vLLM API coupling and can drift with vLLM internals.
Confidence: high
Scope-risk: narrow
Directive: Do not add ModelOpt-side vLLM MoE kernel reconstruction unless vLLM exposes a stable high-level API for post-load backend migration.
Tested: python -m py_compile examples/vllm_serve/vllm_serve_fakequant.py modelopt/torch/quantization/plugins/vllm.py tests/unit/torch/quantization/plugins/test_vllm.py
Tested: env -u NODE_OPTIONS .venv/bin/pre-commit run --all-files --show-diff-on-failure
Not-tested: Focused pytest; current .venv fails before collection because tests/conftest.py imports requests, which is not installed.
Signed-off-by: Meng Xin <mxin@nvidia.com>
The vLLM plugin-level backend guard added maintenance cost and depended on backend details that are likely to churn. Keep the behavior change at the user-facing example launcher instead: when vLLM exposes moe_backend, the example defaults it to triton so expert fakequant runs through a decomposed path.

Rejected: Runtime plugin assertions for MoE backend selection | too much low-level coupling for this example-only use case
Confidence: high
Scope-risk: narrow
Tested: python -m py_compile examples/vllm_serve/vllm_serve_fakequant.py
Tested: ruff check examples/vllm_serve/vllm_serve_fakequant.py
Tested: git diff --check on touched paths
Not-tested: Full vLLM serve smoke after removing plugin assertions
Signed-off-by: Meng Xin <mxin@nvidia.com>
What does this PR do?
Type of change: bug fix
Since v0.20.0, vLLM selects the FlashInfer CUTLASS unquantized MoE backend for Nano-style non-gated MoE layers. That fused backend hides the intermediate activation between the expert GEMMs, so the w2 input quantizer cannot be inserted.
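To illustrate why a fused backend blocks the w2 input quantizer, here is a minimal pure-Python sketch. It is not vLLM or ModelOpt code: `expert_decomposed`, `expert_fused`, `fake_quant`, and the `w2_input_quantizer` callable are hypothetical names, and ReLU is only an assumed stand-in for the real expert activation.

```python
def matvec(weight, vec):
    """Multiply a row-major matrix by a vector (stand-in for an expert GEMM)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def fake_quant(vec, num_bits=8):
    """Symmetric per-tensor fake quantization: quantize, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1
    amax = max(abs(v) for v in vec)
    if amax == 0:
        return list(vec)
    scale = amax / qmax
    return [round(v / scale) * scale for v in vec]


def expert_decomposed(x, w1, w2, w2_input_quantizer=None):
    """Non-gated expert run as two separate GEMMs.

    The intermediate activation is materialized between the GEMMs,
    so a hook can fake-quantize it before the second GEMM (w2).
    """
    hidden = [max(h, 0.0) for h in matvec(w1, x)]  # GEMM 1 + assumed ReLU
    if w2_input_quantizer is not None:
        hidden = w2_input_quantizer(hidden)  # hook point for the w2 input
    return matvec(w2, hidden)  # GEMM 2


def expert_fused(x, w1, w2):
    """Same math, but modeled as one opaque call.

    The intermediate never leaves the function, so there is no place
    for an external hook to quantize the w2 input.
    """
    return matvec(w2, [max(h, 0.0) for h in matvec(w1, x)])
```

In the decomposed form the caller can pass `fake_quant` as `w2_input_quantizer`; the fused form offers no equivalent seam, which is the situation the description above attributes to the fused FlashInfer backend.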
Solution:
Set the moe_backend parameter to force the Triton kernel. The example launcher defaults it to triton when the installed vLLM exposes the option.
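The conditional default can be sketched as follows. This is an illustration under stated assumptions, not the code in examples/vllm_serve/vllm_serve_fakequant.py: the helper name `default_moe_backend_to_triton` is hypothetical, the parser is a plain `argparse.ArgumentParser` standing in for the one vLLM builds, and the `--moe-backend` flag spelling and its `"auto"` default are assumed for the demo.

```python
import argparse


def default_moe_backend_to_triton(parser):
    """Default moe_backend to "triton" only if the parser knows the option.

    Returns True when the default was applied. Older parsers that lack
    the option are left untouched.
    """
    # NOTE: _actions is a private argparse attribute; it is used here only
    # to detect whether an option with dest "moe_backend" is registered.
    has_option = any(action.dest == "moe_backend" for action in parser._actions)
    if has_option:
        # set_defaults changes only the default; an explicit CLI value still wins.
        parser.set_defaults(moe_backend="triton")
    return has_option


if __name__ == "__main__":
    # Stand-in parser; the flag name and "auto" default are assumptions.
    demo = argparse.ArgumentParser()
    demo.add_argument("--moe-backend", default="auto")
    default_moe_backend_to_triton(demo)
    print(demo.parse_args([]).moe_backend)
    print(demo.parse_args(["--moe-backend", "auto"]).moe_backend)
```

Changing the parser default rather than overwriting the parsed value keeps a user-supplied backend intact, and skipping the change when the option is absent avoids breaking older vLLM versions.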
Usage
Testing
Tested on Nano3
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
CONTRIBUTING.md: ✅ / ❌ / N/A
Additional Information