[OMNIML-2932] Fusing pre_quant_scale for NVFP4 AWQ #421
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main     #421   +/-  ##
=======================================
  Coverage   74.43%   74.43%
=======================================
  Files         182      182
  Lines       18234    18234
=======================================
  Hits        13572    13572
  Misses       4662     4662
|
force-pushed 6da3636 to cd036ed
force-pushed cd036ed to c5d9682
force-pushed d9dfc39 to a5a6e39
force-pushed ae2a32c to 6020e94
@@ -0,0 +1,193 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
No, that test will still pass.
force-pushed a591330 to 234b7c2
from .plugins import export_spec_ckpt_config, export_spec_ckpt_state_dict, spec_opt_only
from .quant_utils import (
    fuse_prequant_layernorm,
    fuse_prequant_to_linear,
Can fuse_prequant_to_linear and fuse_prequant_layernorm be combined, or are they mutually exclusive?
They are quite different: fuse_prequant_to_linear is rule-based fusion and doesn't need graph tracing.
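For context, a minimal sketch of what rule-based pre_quant_scale fusion into a producing Linear looks like; apart from `_pre_quant_scale` (which appears in the diff), the function and argument names here are placeholders, not the PR's implementation:

```python
import torch

def fold_pre_quant_scale(prev_linear: torch.nn.Linear, next_linear) -> None:
    # next_linear's AWQ pre_quant_scale is applied per input channel. Since
    # prev_linear's output channels feed those input channels through ops that
    # are linear in that output (e.g. the attention-weighted sum for
    # v_proj -> o_proj, or the elementwise gate product in a gated MLP), the
    # scale can be folded into prev_linear's weight rows and bias.
    scale = next_linear.input_quantizer._pre_quant_scale  # shape: [in_features of next_linear]
    prev_linear.weight.data.mul_(scale.view(-1, 1))
    if prev_linear.bias is not None:
        prev_linear.bias.data.mul_(scale)
    # After folding, the explicit scaling at next_linear's input is no longer
    # needed and can be dropped from the quantizer (omitted here).
```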
layernorm_module.weight = torch.nn.Parameter(
    layernorm_module.weight * getattr(modules[0].input_quantizer, "_pre_quant_scale")
)
if hasattr(layernorm_module, "bias"):
Do we need to handle bias now (and not before) because of some new model support, or is it NVFP4 AWQ related?
No, this is just for future-proofing.
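Spelled out, the fusion sketched by the snippet above would look roughly like this; the bias branch and the `is not None` guard are assumptions mirroring the diff rather than quoting it:

```python
import torch

def fuse_prequant_into_layernorm(layernorm_module, modules):
    # pre_quant_scale multiplies the norm's entire output, so both the
    # elementwise weight and (if present) the bias pick up the same factor:
    # s * (w * x_hat + b) == (s * w) * x_hat + (s * b)
    scale = getattr(modules[0].input_quantizer, "_pre_quant_scale")
    layernorm_module.weight = torch.nn.Parameter(layernorm_module.weight * scale)
    if hasattr(layernorm_module, "bias") and layernorm_module.bias is not None:
        layernorm_module.bias = torch.nn.Parameter(layernorm_module.bias * scale)
```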
        mtq.NVFP4_AWQ_LITE_CFG,
    ],
)
def test_pattern_fuse_prequant_moe(quant_config):
Could we also cover a test case for BMM-style MoE, like in Llama 4 or gpt-oss?
The current implementation does not work for BMM-style MoE, but we can add that support later.
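To make the distinction concrete, a small sketch (assumed shapes, not the PR's code) contrasting per-expert `nn.Linear` MoE weights, which the rule-based fusion can match, with a BMM-style packed layout, which it currently cannot:

```python
import torch

num_experts, d_in, d_out, tokens_per_expert = 8, 1024, 4096, 16

# Per-expert linears: every expert is its own nn.Linear, so each expert's
# pre_quant_scale has a matching producer Linear module to fold into.
experts = torch.nn.ModuleList(torch.nn.Linear(d_in, d_out) for _ in range(num_experts))

# BMM-style MoE (as in Llama 4 / gpt-oss): one packed 3D weight applied with a
# batched matmul, so there is no per-expert Linear module to pattern-match.
packed_weight = torch.nn.Parameter(torch.randn(num_experts, d_in, d_out))
hidden = torch.randn(num_experts, tokens_per_expert, d_in)
out = torch.bmm(hidden, packed_weight)  # [num_experts, tokens_per_expert, d_out]
```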
        .expand(num_kv_heads, n_rep, kv_head_dim)
        .reshape(-1)
    )
    # Update o_proj's pre_quant_scale
So this update is about o_proj's PQS, so that we can just take the first head's scale and apply it to v, right?
Yes, this updates o_proj's PQS so that the input channels of o_proj associated with the same query group (output channel group of v) share the same pre_quant_scale.
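As an illustration of that replication (the mean reduction here is an assumption; the PR may pick the group scale differently), making o_proj's pre_quant_scale uniform within each query group lets it be folded into v_proj, whose output channels are shared by `n_rep` query heads:

```python
import torch

num_kv_heads, n_rep, kv_head_dim = 8, 4, 128
# o_proj's per-input-channel pre_quant_scale, laid out head-major.
pqs = torch.rand(num_kv_heads * n_rep * kv_head_dim)

# Reduce over the n_rep query heads that share one KV head ...
grouped = pqs.view(num_kv_heads, n_rep, kv_head_dim).mean(dim=1, keepdim=True)

# ... then broadcast back so every head in a query group carries the same
# scale, matching the .expand(...).reshape(-1) pattern in the diff above.
uniform_pqs = grouped.expand(num_kv_heads, n_rep, kv_head_dim).reshape(-1)

# grouped.reshape(num_kv_heads * kv_head_dim) is the per-KV-head scale that
# can be folded into v_proj's output channels.
```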
cjluo-nv
left a comment
Thanks for implementing this.
force-pushed d8528f1 to 986824e
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
force-pushed 986824e to f55baad
What does this PR do?
Type of change: ?
Overview:
This PR and NVIDIA/TensorRT-LLM#8698 enable NVFP4 AWQ deployment for TRT-LLM. Specifically, this PR fuses pre_quant_scale in the following two cases:
- attention: o_proj's pre_quant_scale is fused into v_proj (with GQA handled by making the scale uniform within each query group)
- MoE: each expert's pre_quant_scale is fused into the producing expert linear (covered by test_pattern_fuse_prequant_moe)
Usage
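A minimal usage sketch (not taken from the PR; the model name and calibration data are placeholders, and the pre_quant_scale fusion is expected to happen inside the export path):

```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()

def forward_loop(m):
    # Tiny placeholder calibration loop; use a real calibration set in practice.
    for text in ["calibration sample one", "calibration sample two"]:
        m(**tokenizer(text, return_tensors="pt").to(m.device))

# NVFP4 AWQ-lite quantization produces the per-channel pre_quant_scale.
model = mtq.quantize(model, mtq.NVFP4_AWQ_LITE_CFG, forward_loop)

# Export a unified HF checkpoint for TRT-LLM deployment; fuse_prequant_layernorm /
# fuse_prequant_to_linear are applied during export.
export_hf_checkpoint(model, export_dir="qwen3_nvfp4_awq")
```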
Testing
Unit tests, plus e2e tests for Qwen3 dense and MoE models.
Before your PR is "Ready for review"
Additional Information