
Force vLLM non-gated MoE through Triton#1572

Merged
mxinO merged 2 commits into
mainfrom
mxin/vllm-flashinfer-moe-fakequant
May 30, 2026

Conversation

@mxinO mxinO commented May 30, 2026

What does this PR do?

Type of change: bug fix

Starting with v0.20.0, vLLM selects the FlashInfer CUTLASS unquantized MoE backend for Nano-style non-gated MoE layers. That fused backend hides the intermediate activation between the two expert GEMMs, so the w2 input quantizer cannot be inserted.
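
To illustrate why the fused path is a problem, here is a minimal pure-Python sketch of a single non-gated expert. It is not vLLM or ModelOpt code: the function names, the squared-ReLU activation, and the 8-bit symmetric fake quantizer are all assumptions chosen for illustration. The decomposed version exposes the activation that feeds the second GEMM, so a w2 input quantizer can observe and quantize it. The fused version computes the same result but offers no place to hook in.

```python
def matvec(weight, x):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def relu2(x):
    """Squared ReLU, used here only as a stand-in non-gated activation."""
    return [max(v, 0.0) ** 2 for v in x]


def fake_quant(x, num_bits=8):
    """Symmetric per-tensor fake quantization: quantize, then dequantize."""
    amax = max(abs(v) for v in x)
    if amax == 0.0:
        return list(x)
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    return [round(v / scale) * scale for v in x]


def expert_decomposed(x, w1, w2, w2_input_quantizer=None):
    """Two separate GEMMs: the intermediate activation is visible."""
    hidden = relu2(matvec(w1, x))
    if w2_input_quantizer is not None:
        # This is the point a fused kernel never exposes.
        hidden = w2_input_quantizer(hidden)
    return matvec(w2, hidden)


def expert_fused(x, w1, w2):
    """Stand-in for a fused kernel: same math, but no hook point."""
    return matvec(w2, relu2(matvec(w1, x)))


if __name__ == "__main__":
    w1 = [[1.0, -1.0], [0.5, 0.5]]
    w2 = [[1.0, 1.0]]
    x = [2.0, 1.0]

    observed = []

    def recording_quantizer(hidden):
        # Record what calibration would see, then fake-quantize it.
        observed.append(list(hidden))
        return fake_quant(hidden)

    print("fused:     ", expert_fused(x, w1, w2))
    print("decomposed:", expert_decomposed(x, w1, w2, recording_quantizer))
    print("observed w2 input:", observed)
```

In this toy setup only the decomposed path lets the quantizer record the w2 input, which is the visibility that calibration needs.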

Solution:
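Make the fakequant serve launcher default to --moe-backend triton whenever the installed vLLM supports that option, which forces the decomposed Triton kernel.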

Usage

Testing

Tested on Nano3

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Documentation

    • Updated vLLM fakequant serve documentation with notes on moe-backend configuration defaults.
  • Improvements

    • Enhanced launcher logic to conditionally set moe-backend defaults only when the feature is available, improving configuration flexibility across different vLLM installations.

Review Change Stack

copy-pr-bot Bot commented May 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai Bot commented May 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: edf4be5b-6ca7-49c8-aa85-5e081a8e540f

📥 Commits

Reviewing files that changed from the base of the PR and between 40a4dd3 and a34d446.

📒 Files selected for processing (2)
  • examples/vllm_serve/README.md
  • examples/vllm_serve/vllm_serve_fakequant.py

📝 Walkthrough

Walkthrough

This PR updates the vLLM fakequant serve launcher to conditionally default to --moe-backend triton when the installed vLLM version supports it. The change adds a parser helper function, updates configuration logic to check for argument support before setting defaults, and documents the behavior for users.

Changes

Conditional MoE backend defaults

| Layer / File(s) | Summary |
| --- | --- |
| Conditional MoE backend defaults and documentation: examples/vllm_serve/vllm_serve_fakequant.py, examples/vllm_serve/README.md | Added a _parser_has_argument() helper that detects whether a parser supports the --moe-backend argument; updated main() to set moe_backend="triton" only when the parser includes this argument; documented that a decomposed MoE backend is required for expert fakequant calibration visibility. |

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main change: forcing vLLM non-gated MoE through Triton backend, which is the core of the pull request addressing the MoE fakequant issue. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | PR changes contain no security anti-patterns: no unsafe torch.load/numpy.load, no hardcoded trust_remote_code, no eval/exec on untrusted input, no nosec comments, no new unsafe dependencies. |

github-actions Bot commented May 30, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-30 16:44 UTC

@mxinO mxinO force-pushed the mxin/vllm-flashinfer-moe-fakequant branch 2 times, most recently from 6387746 to 900ddd5 Compare May 30, 2026 01:30
codecov Bot commented May 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.50%. Comparing base (5eba879) to head (a34d446).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1572      +/-   ##
==========================================
- Coverage   76.88%   76.50%   -0.38%     
==========================================
  Files         478      478              
  Lines       52209    54286    +2077     
==========================================
+ Hits        40140    41533    +1393     
- Misses      12069    12753     +684     
| Flag | Coverage Δ |
| --- | --- |
| examples | 41.64% <ø> (+8.77%) ⬆️ |
| unit | 53.60% <ø> (+0.08%) ⬆️ |

@mxinO mxinO force-pushed the mxin/vllm-flashinfer-moe-fakequant branch 3 times, most recently from e10c249 to 7819ae9 Compare May 30, 2026 02:56
vLLM can select fused FlashInfer MoE backends whose expert GEMMs are not visible to ModelOpt fakequant hooks. Rebuilding vLLM's internal MoE kernel after model loading is fragile and can mismatch the weight layout that vLLM already selected during process_weights_after_loading.

Keep the ModelOpt plugin simple: accept decomposed Triton backends, leave disabled expert quantizers alone, and fail loudly for fused or unsupported MoE backends with guidance to configure vLLM with moe_backend='triton' before model construction. The vLLM fakequant serving example now defaults to that backend when the installed vLLM parser exposes the option.

Constraint: vLLM backend selection controls both kernel construction and weight post-processing, so changing only the runtime kernel inside ModelOpt is not robust.
Rejected: Reconstruct vLLM MoE kernels from oracle helper functions | too much low-level vLLM API coupling and can drift with vLLM internals.
Confidence: high
Scope-risk: narrow
Directive: Do not add ModelOpt-side vLLM MoE kernel reconstruction unless vLLM exposes a stable high-level API for post-load backend migration.
Tested: python -m py_compile examples/vllm_serve/vllm_serve_fakequant.py modelopt/torch/quantization/plugins/vllm.py tests/unit/torch/quantization/plugins/test_vllm.py
Tested: env -u NODE_OPTIONS .venv/bin/pre-commit run --all-files --show-diff-on-failure
Not-tested: Focused pytest; current .venv fails before collection because tests/conftest.py imports requests, which is not installed.
Signed-off-by: Meng Xin <mxin@nvidia.com>
@mxinO mxinO force-pushed the mxin/vllm-flashinfer-moe-fakequant branch from 7819ae9 to 1a3d7c9 Compare May 30, 2026 03:00
The vLLM plugin-level backend guard added maintenance cost and depended on backend details that are likely to churn. Keep the behavior change at the user-facing example launcher instead: when vLLM exposes moe_backend, the example defaults it to triton so expert fakequant runs through a decomposed path.

Rejected: Runtime plugin assertions for MoE backend selection | too much low-level coupling for this example-only use case

Confidence: high

Scope-risk: narrow

Tested: python -m py_compile examples/vllm_serve/vllm_serve_fakequant.py

Tested: ruff check examples/vllm_serve/vllm_serve_fakequant.py

Tested: git diff --check on touched paths

Not-tested: Full vLLM serve smoke after removing plugin assertions
Signed-off-by: Meng Xin <mxin@nvidia.com>
@mxinO mxinO changed the title Calibrate vLLM non-gated MoE through Triton fakequant Force vLLM non-gated MoE through Triton May 30, 2026
@mxinO mxinO marked this pull request as ready for review May 30, 2026 04:50
@mxinO mxinO requested a review from a team as a code owner May 30, 2026 04:50
@mxinO mxinO enabled auto-merge (squash) May 30, 2026 04:55
@mxinO mxinO merged commit 7ae4ee7 into main May 30, 2026
40 checks passed
@mxinO mxinO deleted the mxin/vllm-flashinfer-moe-fakequant branch May 30, 2026 16:44