[Recipes][LLM PTQ] Add nvfp4 MSE+FP8-cast-KV recipes (experts_only / mlp_only) + --recipe in example scripts#1407

Merged
cjluo-nv merged 3 commits into main from chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv-main on May 7, 2026

Conversation

Collaborator

@cjluo-nv cjluo-nv commented May 7, 2026

Summary

  • Adds two PTQ recipes that combine experts/MLP-only NVFP4 W4A4 with MSE FP8 scale-sweep weight calibration and FP8 KV cache with use_constant_amax: true (skips KV calibration; matches the nvfp4_default-fp8_cast_kv contract):
    • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml — applies to *mlp.experts* / *block_sparse_moe* only.
    • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv.yaml — applies to all *mlp* / *block_sparse_moe* (dense MLP + MoE).
  • Threads a new --recipe flag through examples/llm_ptq/scripts/parser.sh and huggingface_example.sh. Either --quant or --recipe is required; passing both errors out. Recipe names are not validated in the script — hf_ptq.py is the source of truth.
  • Drops the bash-side qformat whitelist case-statement in huggingface_example.sh for the same reason.

Files

New recipes (modelopt_recipes/general/ptq/):

  • nvfp4_experts_only_mse-fp8_cast_kv.yaml — same patterns as nvfp4_experts_only-fp8_kv.yaml.
  • nvfp4_mlp_only_mse-fp8_cast_kv.yaml — same patterns as nvfp4_mlp_only-fp8_kv.yaml.

Both differ from their -fp8_kv siblings by the following (a combined sketch follows this list):

  • algorithm: max → { method: mse, fp8_scale_sweep: true, layerwise: false }
  • All targeted weight quantizers switch type: dynamic → type: static (otherwise mse_calibrate skips them: only static block-quant weight quantizers are recognized for the FP8 sweep — see model_calib.py:369-374).
  • Input quantizers stay dynamic.
  • KV bmm adds use_constant_amax: true (the _cast_kv flavor).
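
Taken together, the deltas amount to roughly the fragment below, wrapped in a shell heredoc to match the rest of this PR's scripting. This is an illustrative sketch only: the quantizer keys and the KV pattern name are placeholders inferred from the PR text, not the actual recipe schema.

cat <<'EOF'
algorithm:
  method: mse
  fp8_scale_sweep: true
  layerwise: false
"*mlp.experts*":            # or the broader *mlp* globs in the mlp_only flavor
  weight_quantizer:
    type: static            # was dynamic in the -fp8_kv sibling
  input_quantizer:
    type: dynamic           # unchanged
"*kv_bmm*":                 # placeholder pattern for the KV bmm quantizer
  use_constant_amax: true   # the _cast_kv flavor
EOF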

Scripts (examples/llm_ptq/scripts/):

  • parser.sh — adds a --recipe long option, default RECIPE="", validates one-of-{--quant, --recipe} and not-both (a sketch of the validation follows this list).
  • huggingface_example.sh — when RECIPE is set, derives MODEL_NAME from the recipe basename, passes --recipe=… to hf_ptq.py instead of --qformat=…, and exits after export with a TRT-LLM deployment hint (recipes can produce arbitrary configs that the script's downstream run_tensorrt_llm.py path doesn't know how to handle generically). Drops the qformat whitelist; defers to hf_ptq.py.
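
For illustration, the one-of validation reduces to something like the sketch below. This is a minimal approximation, not the actual parser.sh wiring — the loop and variable handling are assumptions, though the error message matches the Behavior section.

RECIPE=""
QFORMAT=""
for arg in "$@"; do
    case "$arg" in
        --quant=*)  QFORMAT="${arg#*=}" ;;
        --recipe=*) RECIPE="${arg#*=}" ;;
    esac
done
if [ -n "$QFORMAT" ] && [ -n "$RECIPE" ]; then
    echo "Cannot specify both --quant and --recipe; pick one." >&2
    exit 1
fi
if [ -z "$QFORMAT" ] && [ -z "$RECIPE" ]; then
    echo "Usage: provide exactly one of --quant=<fmt> or --recipe=<path>." >&2
    exit 1
fi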

Behavior

# Errors with: "Cannot specify both --quant and --recipe; pick one."
bash huggingface_example.sh --model=... --quant=nvfp4 --recipe=... --tasks=quant

# Errors with usage if neither is given
bash huggingface_example.sh --model=... --tasks=quant

# All of the following are accepted; when --recipe is given it is forwarded verbatim to hf_ptq.py
bash huggingface_example.sh --model=... --quant=nvfp4 --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv  --tasks=quant

Test plan

  • experts_only_mse-fp8_cast_kv loads via modelopt.recipe.load_recipe(...) and produces the expected algorithm + per-pattern quant_cfg (verified in a working env: algorithm == {'method': 'mse', 'fp8_scale_sweep': True, 'layerwise': False}; expert weight quantizers type: static; KV bmm has use_constant_amax: True). A repro snippet follows this list.
  • Parser sanity: 4 flag combinations (both, neither, only --quant, only --recipe) all behave as designed.
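
The recipe-loading check can be reproduced with something like the snippet below. The import path comes from the Note that follows; the argument form and the printed shape are assumptions.

python - <<'PY'
# Import path taken from the PR note; the argument form is an assumption.
from modelopt.recipe.loader import load_recipe

recipe = load_recipe("general/ptq/nvfp4_experts_only_mse-fp8_cast_kv")
# Expect algorithm == {'method': 'mse', 'fp8_scale_sweep': True, 'layerwise': False}
print(recipe)
PY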

Note

Pre-commit hook check-modelopt-recipes was skipped on both commits because the local conda env has a broken torchvision install (AttributeError: partially initialized module 'torchvision' has no attribute 'extension') that makes from modelopt.recipe.loader import load_recipe fail. The experts_only recipe was validated independently by running tools/precommit/check_modelopt_recipes.py in a working environment (exits 0); the mlp_only one has the same shape with a different glob.
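
The out-of-band validation amounts to running the checker directly in a healthy environment; the script path is from the text above, and the no-argument invocation is an assumption:

# Assumes the checker validates all recipes when run without arguments.
python tools/precommit/check_modelopt_recipes.py && echo "recipes OK"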

Rebased onto main from #1391 (which targeted chenjiel/nvfp4-fp8-sweep-triton). The diff is scoped to the recipes + script wiring; no kernel/sweep changes are included here.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added recipe-based quantization as an alternative to format-based quantization with a new --recipe CLI option.
    • Added two new quantization recipes for targeted layer optimization: one for expert-layer-only quantization and one for MLP-layer-only quantization, both featuring NVFP4 and FP8 KV-cache optimization.
  • Configuration

    • --quant and --recipe options are now mutually exclusive; specify one to configure quantization behavior.

@cjluo-nv cjluo-nv requested review from a team as code owners May 7, 2026 18:44
@cjluo-nv cjluo-nv requested review from h-guo18 and meenchen May 7, 2026 18:44
Contributor

coderabbitai Bot commented May 7, 2026

📝 Walkthrough

This PR adds recipe-based quantization as an alternative to format-based quantization in the HuggingFace PTQ example scripts. The parser accepts a mutually exclusive --recipe argument alongside the existing --quant option. The quantization script conditionally constructs model names from recipe basenames and passes recipes or quantization formats to hf_ptq.py. Two NVFP4 PTQ recipes are introduced: one for expert-layer quantization and one for MLP-layer quantization with MSE-based FP8 sweeping.

Changes

Recipe-based PTQ Support

  • CLI Recipe Argument Support (examples/llm_ptq/scripts/parser.sh): Parser script adds RECIPE variable initialization, extends option parsing to accept --recipe=<VALUE>, validates mutual exclusivity between --quant and --recipe, and prints the recipe in the configuration output.
  • Quantization Script Integration (examples/llm_ptq/scripts/huggingface_example.sh): Removes inline qformat validation; constructs MODEL_NAME from the recipe basename when RECIPE is set, otherwise uses the original QFORMAT-based naming; conditionally passes --recipe=$RECIPE or --qformat=... to hf_ptq.py.
  • Deployment and GPU Capability Handling (examples/llm_ptq/scripts/huggingface_example.sh): Extends NVFP4 capability checks to include RECIPE inspection; adds an early exit after quantization/export when a recipe is provided, directing users to deploy with TensorRT-LLM directly.
  • PTQ Recipe Configurations (modelopt_recipes/general/ptq/nvfp4_experts_only_mse-kv_fp8_cast.yaml, modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-kv_fp8_cast.yaml): Adds expert-only and MLP-only NVFP4 PTQ recipes with MSE-based FP8 scale sweeping, layerwise disabled, KV FP8 casting, and base disable/default quantizer imports.

🎯 Estimated review effort: 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 6 passed
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and specifically summarizes the main changes: adding two new PTQ recipes for NVFP4 with MSE and FP8-cast-KV, and integrating a --recipe flag into the example scripts.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns: ✅ Passed. The PR contains only shell scripts and YAML recipe files (no Python code); the security check scope is "modelopt package and examples Python changes," which does not apply here.


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8d11f0f3-5321-403b-a703-3ab3972dfe59

📥 Commits

Reviewing files that changed from the base of the PR and between 555be6c and d17bcc8.

📒 Files selected for processing (4)
  • examples/llm_ptq/scripts/huggingface_example.sh
  • examples/llm_ptq/scripts/parser.sh
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv.yaml

Comment on lines +64 to +70
# When using --recipe, build the model name from the recipe basename (without
# directory or .yaml suffix) so each recipe gets its own SAVE_PATH.
if [ -n "$RECIPE" ]; then
    RECIPE_TAG=$(basename "$RECIPE" .yaml | sed 's/[^0-9a-zA-Z\-]/_/g')
    MODEL_NAME=$(basename $MODEL_PATH | sed 's/[^0-9a-zA-Z\-]/_/g')_recipe_${RECIPE_TAG}
else
    MODEL_NAME=$(basename $MODEL_PATH | sed 's/[^0-9a-zA-Z\-]/_/g')_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read-only verification of word-splitting-sensitive expansions in changed recipe flow.
rg -n 'basename \$MODEL_PATH|QUANT_SPEC_ARGS=.*--recipe=\$RECIPE|^\s+\$QUANT_SPEC_ARGS\s+\\$' examples/llm_ptq/scripts/huggingface_example.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 375


🏁 Script executed:

cat -n examples/llm_ptq/scripts/huggingface_example.sh | sed -n '60,75p'

Repository: NVIDIA/Model-Optimizer

Length of output: 782


🏁 Script executed:

cat -n examples/llm_ptq/scripts/huggingface_example.sh | sed -n '170,190p'

Repository: NVIDIA/Model-Optimizer

Length of output: 890


Quote $MODEL_PATH and use array syntax for QUANT_SPEC_ARGS to prevent shell word-splitting bugs.

Unquoted basename $MODEL_PATH on lines 68–70 and unquoted $QUANT_SPEC_ARGS expansion on line 185 can break if paths or the recipe argument contain spaces or special characters. This contradicts line 67, which already quotes $RECIPE in a similar basename call. Convert QUANT_SPEC_ARGS to an array to safely pass multi-word arguments to hf_ptq.py.

Proposed fix
 if [ -n "$RECIPE" ]; then
     RECIPE_TAG=$(basename "$RECIPE" .yaml | sed 's/[^0-9a-zA-Z\-]/_/g')
-    MODEL_NAME=$(basename $MODEL_PATH | sed 's/[^0-9a-zA-Z\-]/_/g')_recipe_${RECIPE_TAG}
+    MODEL_NAME=$(basename "$MODEL_PATH" | sed 's/[^0-9a-zA-Z\-]/_/g')_recipe_${RECIPE_TAG}
 else
-    MODEL_NAME=$(basename $MODEL_PATH | sed 's/[^0-9a-zA-Z\-]/_/g')_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}
+    MODEL_NAME=$(basename "$MODEL_PATH" | sed 's/[^0-9a-zA-Z\-]/_/g')_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}
 fi
@@
-        if [ -n "$RECIPE" ]; then
-            QUANT_SPEC_ARGS="--recipe=$RECIPE"
-        else
-            QUANT_SPEC_ARGS="--qformat=${QFORMAT// /,}"
-        fi
+        if [ -n "$RECIPE" ]; then
+            QUANT_SPEC_ARGS=(--recipe="$RECIPE")
+        else
+            QUANT_SPEC_ARGS=(--qformat="${QFORMAT// /,}")
+        fi
         python hf_ptq.py \
             --pyt_ckpt_path=$MODEL_PATH \
             --export_path=$SAVE_PATH \
             --sparsity_fmt=$SPARSITY_FMT \
-            $QUANT_SPEC_ARGS \
+            "${QUANT_SPEC_ARGS[@]}" \
             --calib_size=$CALIB_SIZE \

Also applies to lines 177–185.

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 68-68: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 70-70: Double quote to prevent globbing and word splitting.

(SC2086)


cjluo-nv added 2 commits May 7, 2026 11:54
…recipe support in scripts

- Add modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml,
  combining experts-only NVFP4 W4A4 with the MSE FP8 scale-sweep weight
  calibration (algorithm: mse, fp8_scale_sweep: true; expert weight blocks
  use nvfp4_static so the static FP8 sweep applies) and FP8 KV cache via
  the kv_fp8_cast unit (use_constant_amax: true).

- examples/llm_ptq/scripts: thread a new --recipe flag through parser.sh and
  huggingface_example.sh. Either --quant or --recipe is required; passing both
  errors out. When --recipe is used, the script derives MODEL_NAME from the
  recipe basename, passes --recipe= to hf_ptq.py, and exits after export with
  a TRT-LLM deployment hint (recipes can produce arbitrary configs).

- Drop the qformat case-statement whitelist in huggingface_example.sh; let
  hf_ptq.py be the single source of truth for valid qformats / recipes.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Same shape as nvfp4_experts_only_mse-fp8_cast_kv but with the broader
*mlp* / *block_sparse_moe* / *.experts.* patterns from
nvfp4_mlp_only-kv_fp8.yaml so it covers both dense MLP and MoE expert
weights:

- algorithm: { method: mse, fp8_scale_sweep: true, layerwise: false }
- All MLP weight quantizers use nvfp4_static so the static FP8 scale
  sweep applies (otherwise mse_calibrate skips them).
- Input quantizers use nvfp4 (dynamic).
- KV bmm uses kv_fp8_cast (skips KV calibration; amax hardcoded to FP8
  E4M3 max 448.0).

Pre-commit hook check-modelopt-recipes was skipped because the host
conda env has a broken torchvision install that prevents the validator
from importing modelopt; the same hook fails on the existing committed
sibling nvfp4_experts_only-kv_fp8.yaml in this env.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.38%. Comparing base (555be6c) to head (9f08df0).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1407      +/-   ##
==========================================
+ Coverage   76.74%   77.38%   +0.64%     
==========================================
  Files         476      476              
  Lines       51307    51307              
==========================================
+ Hits        39377    39706     +329     
+ Misses      11930    11601     -329     
Flag Coverage Δ
examples 41.81% <ø> (+2.62%) ⬆️
unit 52.55% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@cjluo-nv cjluo-nv force-pushed the chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv-main branch from d17bcc8 to 2fdfe86 on May 7, 2026 19:15
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c1e82d9b-9fb7-44bc-bbcb-509d457b8621

📥 Commits

Reviewing files that changed from the base of the PR and between d17bcc8 and 2fdfe86.

📒 Files selected for processing (4)
  • examples/llm_ptq/scripts/huggingface_example.sh
  • examples/llm_ptq/scripts/parser.sh
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv.yaml
✅ Files skipped from review due to trivial changes (1)
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml


metadata:
  recipe_type: ptq
  description: NVFP4 static weight (MSE FP8-scale sweep) and dynamic activation for all linear layers (W4A4), FP8 KV cache with constant amax.
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Metadata description overstates the quantization scope.

Line 25 says “all linear layers,” but this recipe only enables quantizers for MLP/MoE/expert patterns. Please align the text to the actual scope to avoid user confusion.

Proposed fix
-  description: NVFP4 static weight (MSE FP8-scale sweep) and dynamic activation for all linear layers (W4A4), FP8 KV cache with constant amax.
+  description: NVFP4 static weight (MSE FP8-scale sweep) and dynamic activation for MLP/MoE linear layers (W4A4), FP8 KV cache with constant amax.

@cjluo-nv cjluo-nv requested a review from shengliangxu May 7, 2026 19:20
Contributor

@meenchen meenchen left a comment

Bot review — DM the bot to share feedback.

The recipe YAMLs themselves look correct — they reuse existing nvfp4, nvfp4_static, and kv_fp8_cast units, and only differ from the -kv_fp8 siblings in the ways the PR body describes (weight quantizers switch to type: static to be eligible for the MSE FP8 scale sweep; algorithm switches to {method: mse, fp8_scale_sweep: true, layerwise: false}). The hf_ptq.py --recipe flag is already wired up from the prior PR and correctly bypasses the post-hoc _set_kv_cache_constant_amax so the recipe is authoritative.

One real regression in huggingface_example.sh: deleting the for qformat in $QFORMAT; do … done whitelist also removes the implicit binding of the lowercase loop variable $qformat, which is still referenced later in the bf16/fp16 shortcut branch (if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]). With the loop gone, that variable is empty, so the symlink-the-model shortcut for pure-bf16/fp16 runs is now dead code. Replace $qformat with $QFORMAT (or reintroduce a single assignment) to preserve the prior behavior.

Minor stylistic nit: the new files are named *_mse-fp8_cast_kv.yaml while the existing convention in modelopt_recipes/general/ptq/ is *-kv_fp8_cast (KV descriptor after the dash, with kv_ prefix). Not blocking, but inconsistent with siblings.

Tests: no new unit tests, and the PR notes the pre-commit check-modelopt-recipes was skipped locally due to a broken env. The recipes are YAML-only and are covered by the standard recipe loader check in CI, so this is low-risk provided CI runs that check — but the author should confirm the pre-commit hook passes in CI on this branch.

esac
done
IFS=" "
# Quant format / recipe validation is delegated to hf_ptq.py.
Contributor

Bot comment.

Regression: deleting the for qformat in $QFORMAT; do … done loop also drops the implicit binding of the lowercase loop variable $qformat, which is still used below at if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]. With the loop removed, $qformat is empty and that bf16/fp16 shortcut (which symlinks the source model into $SAVE_PATH and marks MODEL_CONFIG_EXIST=true) will never trigger — users running --quant=bf16 or --quant=fp16 will now fall through to python hf_ptq.py --qformat=bf16 instead. Either replace $qformat with $QFORMAT in that check, or add a dedicated qformat="$QFORMAT" assignment here.
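
A minimal repro of the dead-code claim, assuming nothing else in the script assigns the lowercase variable:

QFORMAT="bf16"
# The deleted whitelist loop was the only binding of the lowercase variable:
#   for qformat in $QFORMAT; do ...; done
if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then
    echo "shortcut taken"    # unreachable once the loop is gone
else
    echo "falls through to python hf_ptq.py --qformat=$QFORMAT"
fi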

Contributor

Do we still need bf16/fp16 path anyway? Maybe we can deprecate them

Contributor

Not sure if we still have the use cases where we quantize fp32 to fp16.

Collaborator Author

yeah I think we can delete. Let me add this to the PR

@@ -0,0 +1,48 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor

Bot comment.

Stylistic nit: sibling recipes in this directory follow the <numerics>-kv_<kv_fmt> naming pattern (e.g. nvfp4_experts_only-kv_fp8.yaml, nvfp4_default-kv_fp8_cast.yaml). This file uses _mse-fp8_cast_kv which flips the KV descriptor order. Consider renaming to something like nvfp4_experts_only_mse-kv_fp8_cast.yaml for consistency with the existing convention. Same for nvfp4_mlp_only_mse-fp8_cast_kv.yaml.

Contributor

+1

Contributor

agree.

- huggingface_example.sh: remove the bf16/fp16 shortcut block. The lowercase
  $qformat variable it referenced was bound by the `for qformat in $QFORMAT;
  do … done` whitelist loop deleted in the parent commit, so the shortcut
  became dead code (regression flagged by reviewers). Per consensus on the
  PR, drop the path entirely rather than reintroduce the binding — the
  bf16/fp16 case still goes through `python hf_ptq.py --qformat=...` and the
  TRT-LLM whitelist on line 210 already covers it.

- huggingface_example.sh: quote $MODEL_PATH in the basename calls used to
  build MODEL_NAME (CodeRabbit shellcheck SC2086).

- nvfp4_mlp_only_mse-kv_fp8_cast.yaml: tighten metadata description from
  "all linear layers" to "MLP/MoE linear layers" — quantization is only
  enabled for *mlp* / *block_sparse_moe* / *.experts.* patterns.

- Rename both recipes to match the sibling `<numerics>_<scope>-kv_<kv_fmt>`
  convention used by `nvfp4_default-kv_fp8_cast.yaml` etc.:
    nvfp4_experts_only_mse-fp8_cast_kv.yaml
      → nvfp4_experts_only_mse-kv_fp8_cast.yaml
    nvfp4_mlp_only_mse-fp8_cast_kv.yaml
      → nvfp4_mlp_only_mse-kv_fp8_cast.yaml

Pre-commit hook check-modelopt-recipes was skipped (same env-broken
torchvision issue as the prior commits); the renamed recipes were verified
independently with tools/precommit/check_modelopt_recipes.py against the
working-tree modelopt — both load and produce the expected effective
config.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
examples/llm_ptq/scripts/huggingface_example.sh (1)

165-169: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

QUANT_SPEC_ARGS still a plain string — unquoted expansion risks word-splitting.

QUANT_SPEC_ARGS is assigned as a plain string and then expanded unquoted on line 174. ShellCheck flags this (SC2086). If $RECIPE contains spaces or special characters, the argument will be split incorrectly. Convert to a Bash array so the expansion is safe.

🛠️ Proposed fix — array syntax
-        if [ -n "$RECIPE" ]; then
-            QUANT_SPEC_ARGS="--recipe=$RECIPE"
-        else
-            QUANT_SPEC_ARGS="--qformat=${QFORMAT// /,}"
-        fi
+        if [ -n "$RECIPE" ]; then
+            QUANT_SPEC_ARGS=(--recipe="$RECIPE")
+        else
+            QUANT_SPEC_ARGS=(--qformat="${QFORMAT// /,}")
+        fi
         python hf_ptq.py \
             --pyt_ckpt_path=$MODEL_PATH \
             --export_path=$SAVE_PATH \
             --sparsity_fmt=$SPARSITY_FMT \
-            $QUANT_SPEC_ARGS \
+            "${QUANT_SPEC_ARGS[@]}" \
             --calib_size=$CALIB_SIZE \

Also applies to: 174-174


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6278d544-daa2-4b0f-8ff8-b5a53b6b86d8

📥 Commits

Reviewing files that changed from the base of the PR and between 2fdfe86 and 9f08df0.

📒 Files selected for processing (3)
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-kv_fp8_cast.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-kv_fp8_cast.yaml
✅ Files skipped from review due to trivial changes (2)
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-kv_fp8_cast.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-kv_fp8_cast.yaml

Contributor

@meenchen meenchen left a comment

Bot review — DM the bot to share feedback.

Re-review: all critical previous comments have been addressed.

  • Regression with $qformat bf16/fp16 shortcut: Fixed by removing the block entirely (author and meenchen agreed to deprecate; no use case for bf16/fp16 "quantize" today).
  • Naming convention: Files renamed to nvfp4_experts_only_mse-kv_fp8_cast.yaml and nvfp4_mlp_only_mse-kv_fp8_cast.yaml, matching the sibling *-kv_fp8_cast convention.
  • mlp_only description wording: Updated from "all linear layers" to "MLP/MoE linear layers".
  • Basename quoting: $MODEL_PATH and $RECIPE are now quoted in the new basename calls.

Remaining CodeRabbit nit (QUANT_SPEC_ARGS as plain string instead of a Bash array) is stylistic; recipe paths are repo-relative without spaces, and the prior script already uses unquoted expansions elsewhere. Not a blocker.

Licensing: new YAML files carry the standard NVIDIA Apache-2.0 header; no licensing concern. Recipes are covered by the existing check-modelopt-recipes loader test in CI.

Contributor

@hychiang-git hychiang-git left a comment

LGTM

@cjluo-nv cjluo-nv enabled auto-merge (squash) May 7, 2026 22:32
Collaborator

@shengliangxu shengliangxu left a comment

LGTM

@cjluo-nv cjluo-nv merged commit 6a3b6b8 into main May 7, 2026
42 checks passed
@cjluo-nv cjluo-nv deleted the chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv-main branch May 7, 2026 22:38
Contributor

github-actions Bot commented May 7, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-07 22:39 UTC
