[Recipes][LLM PTQ] Add nvfp4 MSE+FP8-cast-KV recipes (experts_only / mlp_only) + --recipe in example scripts#1407

Merged
cjluo-nv merged 3 commits into main from chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv-main on May 7, 2026

Conversation

Collaborator

@cjluo-nv cjluo-nv commented May 7, 2026

Summary

  • Adds two PTQ recipes that combine experts/MLP-only NVFP4 W4A4 with MSE FP8 scale-sweep weight calibration and FP8 KV cache with use_constant_amax: true (skips KV calibration; matches the nvfp4_default-fp8_cast_kv contract):
    • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml — applies to *mlp.experts* / *block_sparse_moe* only.
    • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv.yaml — applies to all *mlp* / *block_sparse_moe* (dense MLP + MoE).
  • Threads a new --recipe flag through examples/llm_ptq/scripts/parser.sh and huggingface_example.sh. Either --quant or --recipe is required; passing both errors out. Recipe names are not validated in the script — hf_ptq.py is the source of truth.
  • Drops the bash-side qformat whitelist case-statement in huggingface_example.sh for the same reason.

Files

New recipes (modelopt_recipes/general/ptq/):

  • nvfp4_experts_only_mse-fp8_cast_kv.yaml — same patterns as nvfp4_experts_only-fp8_kv.yaml.
  • nvfp4_mlp_only_mse-fp8_cast_kv.yaml — same patterns as nvfp4_mlp_only-fp8_kv.yaml.

Both differ from their -fp8_kv siblings by the following (a combined sketch follows this list):

  • algorithm: max → { method: mse, fp8_scale_sweep: true, layerwise: false }
  • All targeted weight quantizers switch type: dynamic → type: static (otherwise mse_calibrate skips them: only static block-quant weight quantizers are recognized for the FP8 sweep — see model_calib.py:369-374).
  • Input quantizers stay dynamic.
  • KV bmm adds use_constant_amax: true (the _cast_kv flavor).
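
Taken together, the deltas amount to roughly the fragment below, wrapped in a shell heredoc to match the rest of this PR's scripting. This is an illustrative sketch only: the quantizer keys and the KV pattern name are placeholders inferred from the PR text, not the actual recipe schema.

cat <<'EOF'
algorithm:
  method: mse
  fp8_scale_sweep: true
  layerwise: false
"*mlp.experts*":            # or the broader *mlp* globs in the mlp_only flavor
  weight_quantizer:
    type: static            # was dynamic in the -fp8_kv sibling
  input_quantizer:
    type: dynamic           # unchanged
"*kv_bmm*":                 # placeholder pattern for the KV bmm quantizer
  use_constant_amax: true   # the _cast_kv flavor
EOF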

Scripts (examples/llm_ptq/scripts/):

  • parser.sh — adds a --recipe long option, default RECIPE="", validates one-of-{--quant, --recipe} and not-both (a sketch of the validation follows this list).
  • huggingface_example.sh — when RECIPE is set, derives MODEL_NAME from the recipe basename, passes --recipe=… to hf_ptq.py instead of --qformat=…, and exits after export with a TRT-LLM deployment hint (recipes can produce arbitrary configs that the script's downstream run_tensorrt_llm.py path doesn't know how to handle generically). Drops the qformat whitelist; defers to hf_ptq.py.
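
For illustration, the one-of validation reduces to something like the sketch below. This is a minimal approximation, not the actual parser.sh wiring — the loop and variable handling are assumptions, though the error message matches the Behavior section.

RECIPE=""
QFORMAT=""
for arg in "$@"; do
    case "$arg" in
        --quant=*)  QFORMAT="${arg#*=}" ;;
        --recipe=*) RECIPE="${arg#*=}" ;;
    esac
done
if [ -n "$QFORMAT" ] && [ -n "$RECIPE" ]; then
    echo "Cannot specify both --quant and --recipe; pick one." >&2
    exit 1
fi
if [ -z "$QFORMAT" ] && [ -z "$RECIPE" ]; then
    echo "Usage: provide exactly one of --quant=<fmt> or --recipe=<path>." >&2
    exit 1
fi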

Behavior

# Errors with: "Cannot specify both --quant and --recipe; pick one."
bash huggingface_example.sh --model=... --quant=nvfp4 --recipe=... --tasks=quant

# Errors with usage if neither is given
bash huggingface_example.sh --model=... --tasks=quant

# All of the following are accepted; when --recipe is given it is forwarded verbatim to hf_ptq.py
bash huggingface_example.sh --model=... --quant=nvfp4 --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv  --tasks=quant

Test plan

  • experts_only_mse-fp8_cast_kv loads via modelopt.recipe.load_recipe(...) and produces the expected algorithm + per-pattern quant_cfg (verified in a working env: algorithm == {'method': 'mse', 'fp8_scale_sweep': True, 'layerwise': False}; expert weight quantizers type: static; KV bmm has use_constant_amax: True). A repro snippet follows this list.
  • Parser sanity: 4 flag combinations (both, neither, only --quant, only --recipe) all behave as designed.
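
The recipe-loading check can be reproduced with something like the snippet below. The import path comes from the Note that follows; the argument form and the printed shape are assumptions.

python - <<'PY'
# Import path taken from the PR note; the argument form is an assumption.
from modelopt.recipe.loader import load_recipe

recipe = load_recipe("general/ptq/nvfp4_experts_only_mse-fp8_cast_kv")
# Expect algorithm == {'method': 'mse', 'fp8_scale_sweep': True, 'layerwise': False}
print(recipe)
PY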

Note

Pre-commit hook check-modelopt-recipes was skipped on both commits because the local conda env has a broken torchvision install (AttributeError: partially initialized module 'torchvision' has no attribute 'extension') that makes from modelopt.recipe.loader import load_recipe fail. The experts_only recipe was validated independently by running tools/precommit/check_modelopt_recipes.py in a working environment (exits 0); the mlp_only one has the same shape with a different glob.
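
The out-of-band validation amounts to running the checker directly in a healthy environment; the script path is from the text above, and the no-argument invocation is an assumption:

# Assumes the checker validates all recipes when run without arguments.
python tools/precommit/check_modelopt_recipes.py && echo "recipes OK"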

Rebased onto main from #1391 (which targeted chenjiel/nvfp4-fp8-sweep-triton). The diff is scoped to the recipes + script wiring; no kernel/sweep changes are included here.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added recipe-based quantization as an alternative to format-based quantization with a new --recipe CLI option.
    • Added two new quantization recipes for targeted layer optimization: one for expert-layer-only quantization and one for MLP-layer-only quantization, both featuring NVFP4 and FP8 KV-cache optimization.
  • Configuration

    • --quant and --recipe options are now mutually exclusive; specify one to configure quantization behavior.

@cjluo-nv cjluo-nv requested review from a team as code owners May 7, 2026 18:44
@cjluo-nv cjluo-nv requested review from h-guo18 and meenchen May 7, 2026 18:44
Contributor

coderabbitai Bot commented May 7, 2026

📝 Walkthrough

This PR adds recipe-based quantization as an alternative to format-based quantization in the HuggingFace PTQ example scripts. The parser accepts a mutually exclusive --recipe argument alongside the existing --quant option. The quantization script conditionally constructs model names from recipe basenames and passes recipes or quantization formats to hf_ptq.py. Two NVFP4 PTQ recipes are introduced: one for expert-layer quantization and one for MLP-layer quantization with MSE-based FP8 sweeping.

Changes

Recipe-based PTQ Support

  • CLI Recipe Argument Support (examples/llm_ptq/scripts/parser.sh): Parser script adds RECIPE variable initialization, extends option parsing to accept --recipe=<VALUE>, validates mutual exclusivity between --quant and --recipe, and prints the recipe in the configuration output.
  • Quantization Script Integration (examples/llm_ptq/scripts/huggingface_example.sh): Removes inline qformat validation; constructs MODEL_NAME from the recipe basename when RECIPE is set, otherwise uses the original QFORMAT-based naming; conditionally passes --recipe=$RECIPE or --qformat=... to hf_ptq.py.
  • Deployment and GPU Capability Handling (examples/llm_ptq/scripts/huggingface_example.sh): Extends NVFP4 capability checks to include RECIPE inspection; adds an early exit after quantization/export when a recipe is provided, directing users to deploy with TensorRT-LLM directly.
  • PTQ Recipe Configurations (modelopt_recipes/general/ptq/nvfp4_experts_only_mse-kv_fp8_cast.yaml, modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-kv_fp8_cast.yaml): Adds expert-only and MLP-only NVFP4 PTQ recipes with MSE-based FP8 scale sweeping, layerwise disabled, KV FP8 casting, and base disable/default quantizer imports.

🎯 Estimated review effort: 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 6 passed
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and specifically summarizes the main changes: adding two new PTQ recipes for NVFP4 with MSE and FP8-cast-KV, and integrating a --recipe flag into the example scripts.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns: ✅ Passed. The PR contains only shell scripts and YAML recipe files (no Python code); the security check scope is "modelopt package and examples Python changes," which does not apply here.


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8d11f0f3-5321-403b-a703-3ab3972dfe59

📥 Commits

Reviewing files that changed from the base of the PR and between 555be6c and d17bcc8.

📒 Files selected for processing (4)
  • examples/llm_ptq/scripts/huggingface_example.sh
  • examples/llm_ptq/scripts/parser.sh
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv.yaml

Comment on lines +64 to +70
# When using --recipe, build the model name from the recipe basename (without
# directory or .yaml suffix) so each recipe gets its own SAVE_PATH.
if [ -n "$RECIPE" ]; then
    RECIPE_TAG=$(basename "$RECIPE" .yaml | sed 's/[^0-9a-zA-Z\-]/_/g')
    MODEL_NAME=$(basename $MODEL_PATH | sed 's/[^0-9a-zA-Z\-]/_/g')_recipe_${RECIPE_TAG}
else
    MODEL_NAME=$(basename $MODEL_PATH | sed 's/[^0-9a-zA-Z\-]/_/g')_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read-only verification of word-splitting-sensitive expansions in changed recipe flow.
rg -n 'basename \$MODEL_PATH|QUANT_SPEC_ARGS=.*--recipe=\$RECIPE|^\s+\$QUANT_SPEC_ARGS\s+\\$' examples/llm_ptq/scripts/huggingface_example.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 375


🏁 Script executed:

cat -n examples/llm_ptq/scripts/huggingface_example.sh | sed -n '60,75p'

Repository: NVIDIA/Model-Optimizer

Length of output: 782


🏁 Script executed:

cat -n examples/llm_ptq/scripts/huggingface_example.sh | sed -n '170,190p'

Repository: NVIDIA/Model-Optimizer

Length of output: 890


Quote $MODEL_PATH and use array syntax for QUANT_SPEC_ARGS to prevent shell word-splitting bugs.

Unquoted basename $MODEL_PATH on lines 68–70 and unquoted $QUANT_SPEC_ARGS expansion on line 185 can break if paths or the recipe argument contain spaces or special characters. This contradicts line 67, which already quotes $RECIPE in a similar basename call. Convert QUANT_SPEC_ARGS to an array to safely pass multi-word arguments to hf_ptq.py.

Proposed fix
 if [ -n "$RECIPE" ]; then
     RECIPE_TAG=$(basename "$RECIPE" .yaml | sed 's/[^0-9a-zA-Z\-]/_/g')
-    MODEL_NAME=$(basename $MODEL_PATH | sed 's/[^0-9a-zA-Z\-]/_/g')_recipe_${RECIPE_TAG}
+    MODEL_NAME=$(basename "$MODEL_PATH" | sed 's/[^0-9a-zA-Z\-]/_/g')_recipe_${RECIPE_TAG}
 else
-    MODEL_NAME=$(basename $MODEL_PATH | sed 's/[^0-9a-zA-Z\-]/_/g')_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}
+    MODEL_NAME=$(basename "$MODEL_PATH" | sed 's/[^0-9a-zA-Z\-]/_/g')_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}
 fi
@@
-        if [ -n "$RECIPE" ]; then
-            QUANT_SPEC_ARGS="--recipe=$RECIPE"
-        else
-            QUANT_SPEC_ARGS="--qformat=${QFORMAT// /,}"
-        fi
+        if [ -n "$RECIPE" ]; then
+            QUANT_SPEC_ARGS=(--recipe="$RECIPE")
+        else
+            QUANT_SPEC_ARGS=(--qformat="${QFORMAT// /,}")
+        fi
         python hf_ptq.py \
             --pyt_ckpt_path=$MODEL_PATH \
             --export_path=$SAVE_PATH \
             --sparsity_fmt=$SPARSITY_FMT \
-            $QUANT_SPEC_ARGS \
+            "${QUANT_SPEC_ARGS[@]}" \
             --calib_size=$CALIB_SIZE \

Also applies to lines 177–185.

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 68-68: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 70-70: Double quote to prevent globbing and word splitting.

(SC2086)


cjluo-nv added 2 commits May 7, 2026 11:54
…recipe support in scripts

- Add modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml,
  combining experts-only NVFP4 W4A4 with the MSE FP8 scale-sweep weight
  calibration (algorithm: mse, fp8_scale_sweep: true; expert weight blocks
  use nvfp4_static so the static FP8 sweep applies) and FP8 KV cache via
  the kv_fp8_cast unit (use_constant_amax: true).

- examples/llm_ptq/scripts: thread a new --recipe flag through parser.sh and
  huggingface_example.sh. Either --quant or --recipe is required; passing both
  errors out. When --recipe is used, the script derives MODEL_NAME from the
  recipe basename, passes --recipe= to hf_ptq.py, and exits after export with
  a TRT-LLM deployment hint (recipes can produce arbitrary configs).

- Drop the qformat case-statement whitelist in huggingface_example.sh; let
  hf_ptq.py be the single source of truth for valid qformats / recipes.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Same shape as nvfp4_experts_only_mse-fp8_cast_kv but with the broader
*mlp* / *block_sparse_moe* / *.experts.* patterns from
nvfp4_mlp_only-kv_fp8.yaml so it covers both dense MLP and MoE expert
weights:

- algorithm: { method: mse, fp8_scale_sweep: true, layerwise: false }
- All MLP weight quantizers use nvfp4_static so the static FP8 scale
  sweep applies (otherwise mse_calibrate skips them).
- Input quantizers use nvfp4 (dynamic).
- KV bmm uses kv_fp8_cast (skips KV calibration; amax hardcoded to FP8
  E4M3 max 448.0).

Pre-commit hook check-modelopt-recipes was skipped because the host
conda env has a broken torchvision install that prevents the validator
from importing modelopt; the same hook fails on the existing committed
sibling nvfp4_experts_only-kv_fp8.yaml in this env.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.38%. Comparing base (555be6c) to head (9f08df0).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1407      +/-   ##
==========================================
+ Coverage   76.74%   77.38%   +0.64%     
==========================================
  Files         476      476              
  Lines       51307    51307              
==========================================
+ Hits        39377    39706     +329     
+ Misses      11930    11601     -329     
Flag Coverage Δ
examples 41.81% <ø> (+2.62%) ⬆️
unit 52.55% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@cjluo-nv cjluo-nv force-pushed the chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv-main branch from d17bcc8 to 2fdfe86 on May 7, 2026 19:15
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c1e82d9b-9fb7-44bc-bbcb-509d457b8621

📥 Commits

Reviewing files that changed from the base of the PR and between d17bcc8 and 2fdfe86.

📒 Files selected for processing (4)
  • examples/llm_ptq/scripts/huggingface_example.sh
  • examples/llm_ptq/scripts/parser.sh
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv.yaml
✅ Files skipped from review due to trivial changes (1)
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml


metadata:
  recipe_type: ptq
  description: NVFP4 static weight (MSE FP8-scale sweep) and dynamic activation for all linear layers (W4A4), FP8 KV cache with constant amax.
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Metadata description overstates the quantization scope.

Line 25 says “all linear layers,” but this recipe only enables quantizers for MLP/MoE/expert patterns. Please align the text to the actual scope to avoid user confusion.

Proposed fix
-  description: NVFP4 static weight (MSE FP8-scale sweep) and dynamic activation for all linear layers (W4A4), FP8 KV cache with constant amax.
+  description: NVFP4 static weight (MSE FP8-scale sweep) and dynamic activation for MLP/MoE linear layers (W4A4), FP8 KV cache with constant amax.

@cjluo-nv cjluo-nv requested a review from shengliangxu May 7, 2026 19:20
Contributor

@meenchen meenchen left a comment

Bot review — DM the bot to share feedback.

The recipe YAMLs themselves look correct — they reuse existing nvfp4, nvfp4_static, and kv_fp8_cast units, and only differ from the -kv_fp8 siblings in the ways the PR body describes (weight quantizers switch to type: static to be eligible for the MSE FP8 scale sweep; algorithm switches to {method: mse, fp8_scale_sweep: true, layerwise: false}). The hf_ptq.py --recipe flag is already wired up from the prior PR and correctly bypasses the post-hoc _set_kv_cache_constant_amax so the recipe is authoritative.

One real regression in huggingface_example.sh: deleting the for qformat in $QFORMAT; do … done whitelist also removes the implicit binding of the lowercase loop variable $qformat, which is still referenced later in the bf16/fp16 shortcut branch (if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]). With the loop gone, that variable is empty, so the symlink-the-model shortcut for pure-bf16/fp16 runs is now dead code. Replace $qformat with $QFORMAT (or reintroduce a single assignment) to preserve the prior behavior.

Minor stylistic nit: the new files are named *_mse-fp8_cast_kv.yaml while the existing convention in modelopt_recipes/general/ptq/ is *-kv_fp8_cast (KV descriptor after the dash, with kv_ prefix). Not blocking, but inconsistent with siblings.

Tests: no new unit tests, and the PR notes the pre-commit check-modelopt-recipes was skipped locally due to a broken env. The recipes are YAML-only and are covered by the standard recipe loader check in CI, so this is low-risk provided CI runs that check — but the author should confirm the pre-commit hook passes in CI on this branch.

esac
done
IFS=" "
# Quant format / recipe validation is delegated to hf_ptq.py.
Contributor

Bot comment.

Regression: deleting the for qformat in $QFORMAT; do … done loop also drops the implicit binding of the lowercase loop variable $qformat, which is still used below at if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]. With the loop removed, $qformat is empty and that bf16/fp16 shortcut (which symlinks the source model into $SAVE_PATH and marks MODEL_CONFIG_EXIST=true) will never trigger — users running --quant=bf16 or --quant=fp16 will now fall through to python hf_ptq.py --qformat=bf16 instead. Either replace $qformat with $QFORMAT in that check, or add a dedicated qformat="$QFORMAT" assignment here.
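
A minimal repro of the dead-code claim, assuming nothing else in the script assigns the lowercase variable:

QFORMAT="bf16"
# The deleted whitelist loop was the only binding of the lowercase variable:
#   for qformat in $QFORMAT; do ...; done
if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then
    echo "shortcut taken"    # unreachable once the loop is gone
else
    echo "falls through to python hf_ptq.py --qformat=$QFORMAT"
fi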

Contributor

Do we still need bf16/fp16 path anyway? Maybe we can deprecate them

Contributor

Not sure if we still have the use cases where we quantize fp32 to fp16.

Collaborator Author

yeah I think we can delete. Let me add this to the PR

@@ -0,0 +1,48 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor

Bot comment.

Stylistic nit: sibling recipes in this directory follow the <numerics>-kv_<kv_fmt> naming pattern (e.g. nvfp4_experts_only-kv_fp8.yaml, nvfp4_default-kv_fp8_cast.yaml). This file uses _mse-fp8_cast_kv which flips the KV descriptor order. Consider renaming to something like nvfp4_experts_only_mse-kv_fp8_cast.yaml for consistency with the existing convention. Same for nvfp4_mlp_only_mse-fp8_cast_kv.yaml.

Contributor

+1

Contributor

agree.

- huggingface_example.sh: remove the bf16/fp16 shortcut block. The lowercase
  $qformat variable it referenced was bound by the `for qformat in $QFORMAT;
  do … done` whitelist loop deleted in the parent commit, so the shortcut
  became dead code (regression flagged by reviewers). Per consensus on the
  PR, drop the path entirely rather than reintroduce the binding — the
  bf16/fp16 case still goes through `python hf_ptq.py --qformat=...` and the
  TRT-LLM whitelist on line 210 already covers it.

- huggingface_example.sh: quote $MODEL_PATH in the basename calls used to
  build MODEL_NAME (CodeRabbit shellcheck SC2086).

- nvfp4_mlp_only_mse-kv_fp8_cast.yaml: tighten metadata description from
  "all linear layers" to "MLP/MoE linear layers" — quantization is only
  enabled for *mlp* / *block_sparse_moe* / *.experts.* patterns.

- Rename both recipes to match the sibling `<numerics>_<scope>-kv_<kv_fmt>`
  convention used by `nvfp4_default-kv_fp8_cast.yaml` etc.:
    nvfp4_experts_only_mse-fp8_cast_kv.yaml
      → nvfp4_experts_only_mse-kv_fp8_cast.yaml
    nvfp4_mlp_only_mse-fp8_cast_kv.yaml
      → nvfp4_mlp_only_mse-kv_fp8_cast.yaml

Pre-commit hook check-modelopt-recipes was skipped (same env-broken
torchvision issue as the prior commits); the renamed recipes were verified
independently with tools/precommit/check_modelopt_recipes.py against the
working-tree modelopt — both load and produce the expected effective
config.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
examples/llm_ptq/scripts/huggingface_example.sh (1)

165-169: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

QUANT_SPEC_ARGS still a plain string — unquoted expansion risks word-splitting.

QUANT_SPEC_ARGS is assigned as a plain string and then expanded unquoted on line 174. ShellCheck flags this (SC2086). If $RECIPE contains spaces or special characters, the argument will be split incorrectly. Convert to a Bash array so the expansion is safe.

🛠️ Proposed fix — array syntax
-        if [ -n "$RECIPE" ]; then
-            QUANT_SPEC_ARGS="--recipe=$RECIPE"
-        else
-            QUANT_SPEC_ARGS="--qformat=${QFORMAT// /,}"
-        fi
+        if [ -n "$RECIPE" ]; then
+            QUANT_SPEC_ARGS=(--recipe="$RECIPE")
+        else
+            QUANT_SPEC_ARGS=(--qformat="${QFORMAT// /,}")
+        fi
         python hf_ptq.py \
             --pyt_ckpt_path=$MODEL_PATH \
             --export_path=$SAVE_PATH \
             --sparsity_fmt=$SPARSITY_FMT \
-            $QUANT_SPEC_ARGS \
+            "${QUANT_SPEC_ARGS[@]}" \
             --calib_size=$CALIB_SIZE \

Also applies to: 174-174


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6278d544-daa2-4b0f-8ff8-b5a53b6b86d8

📥 Commits

Reviewing files that changed from the base of the PR and between 2fdfe86 and 9f08df0.

📒 Files selected for processing (3)
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-kv_fp8_cast.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-kv_fp8_cast.yaml
✅ Files skipped from review due to trivial changes (2)
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-kv_fp8_cast.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-kv_fp8_cast.yaml

Contributor

@meenchen meenchen left a comment

Bot review — DM the bot to share feedback.

Re-review: all critical previous comments have been addressed.

  • Regression with $qformat bf16/fp16 shortcut: Fixed by removing the block entirely (author and meenchen agreed to deprecate; no use case for bf16/fp16 "quantize" today).
  • Naming convention: Files renamed to nvfp4_experts_only_mse-kv_fp8_cast.yaml and nvfp4_mlp_only_mse-kv_fp8_cast.yaml, matching the sibling *-kv_fp8_cast convention.
  • mlp_only description wording: Updated from "all linear layers" to "MLP/MoE linear layers".
  • Basename quoting: $MODEL_PATH and $RECIPE are now quoted in the new basename calls.

Remaining CodeRabbit nit (QUANT_SPEC_ARGS as plain string instead of a Bash array) is stylistic; recipe paths are repo-relative without spaces, and the prior script already uses unquoted expansions elsewhere. Not a blocker.

Licensing: new YAML files carry the standard NVIDIA Apache-2.0 header; no licensing concern. Recipes are covered by the existing check-modelopt-recipes loader test in CI.

Contributor

@hychiang-git hychiang-git left a comment

LGTM

@cjluo-nv cjluo-nv enabled auto-merge (squash) May 7, 2026 22:32
Collaborator

@shengliangxu shengliangxu left a comment

LGTM

@cjluo-nv cjluo-nv merged commit 6a3b6b8 into main May 7, 2026
42 checks passed
@cjluo-nv cjluo-nv deleted the chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv-main branch May 7, 2026 22:38
Contributor

github-actions Bot commented May 7, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-07 22:39 UTC
