Define kv cache scaling factor as amax / 448 #790
Conversation
Unified the FP8 and NVFP4 kv cache scaling factor definition so the same checkpoint can be used for both FP8 and NVFP4 kv cache quantization deployment.
Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
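As context for the change, here is a minimal sketch of the unified definition; the function name and example values are illustrative assumptions, not Model-Optimizer API:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 (E4M3)

def kv_cache_scale(amax: torch.Tensor) -> torch.Tensor:
    """Unified KV cache scaling factor: amax / 448.

    Because FP8 and NVFP4 KV cache quantization now share this definition,
    the same exported checkpoint can be deployed with either format.
    """
    return amax / FP8_E4M3_MAX

# Example: a calibrated absolute max of 14.0 gives a scale of 0.03125.
print(kv_cache_scale(torch.tensor(14.0)))
```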
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.
📝 Walkthrough
Removed three constant imports and replaced the dynamic kv_cache_max_bound calculation with a hard-coded value of 448 in the export function, accompanied by a clarifying comment about FP8 and NVFP4 KV cache quantization.
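A rough before/after sketch of what the walkthrough describes; the format-specific constants named in the comment are hypothetical placeholders for the removed imports:

```python
# Before (sketch): the max bound was derived from format-specific constants, e.g.
# kv_cache_max_bound = FP8_MAX_BOUND if kv_cache_format == "FP8" else NVFP4_MAX_BOUND

# After: one hard-coded bound, shared by FP8 and NVFP4 KV cache quantization,
# so the exported scaling factor is always amax / 448.
kv_cache_max_bound = 448
```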
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 656-660: In _export_transformers_checkpoint the variable
kv_cache_format is undefined when calling postprocess_state_dict—extract
kv_cache_format from quant_config (the dict fetched earlier) using
quant_config["quantization"]["kv_cache_quant_algo"] and pass that value into
postprocess_state_dict; update the code around where quantized_state_dict and
kv_cache_max_bound are set so kv_cache_format is defined (e.g., read from
quant_config before calling postprocess_state_dict) to avoid the NameError
involving kv_cache_format and ensure postprocess_state_dict receives the correct
format.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/torch/export/unified_export_hf.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
```python
# We define kv cache scale as amax / 448 for both FP8 and NVFP4 KV cache quantization.
kv_cache_max_bound = 448
quantized_state_dict = postprocess_state_dict(
    quantized_state_dict, kv_cache_max_bound, kv_cache_format, is_modelopt_qlora
)
```
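To make the call above concrete, the sketch below shows the kind of transformation postprocess_state_dict is responsible for; the key naming scheme is an assumption for illustration and not the real implementation:

```python
import torch

def postprocess_kv_cache_scales(state_dict, kv_cache_max_bound=448, kv_cache_format="FP8"):
    """Sketch only: turn calibrated KV cache amax entries into scaling factors."""
    out = {}
    for key, value in state_dict.items():
        # Hypothetical key pattern; the real exporter may use different names.
        if kv_cache_format is not None and key.endswith("kv_cache_amax"):
            out[key.replace("kv_cache_amax", "kv_cache_scale")] = value / kv_cache_max_bound
        else:
            out[key] = value
    return out

# Example: an amax of 224.0 exports a scale of 0.5, whether the KV cache is FP8 or NVFP4.
sd = {"model.layers.0.self_attn.kv_cache_amax": torch.tensor(224.0)}
print(postprocess_kv_cache_scales(sd))
```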
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's read the function containing lines 656-660 to understand the context
cat -n modelopt/torch/export/unified_export_hf.py | sed -n '630,680p'
```
Repository: NVIDIA/Model-Optimizer
Length of output: 2295
🏁 Script executed:
```shell
# Check if kv_cache_format is defined anywhere in the function before line 659
rg -n 'kv_cache_format\s*=' modelopt/torch/export/unified_export_hf.py
```
Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
```shell
# Find the postprocess_state_dict function definition to see its signature
rg -n 'def postprocess_state_dict' -A 5 --type py
```
Repository: NVIDIA/Model-Optimizer
Length of output: 460
🏁 Script executed:
```shell
# Search for all usages of kv_cache_format in the codebase
rg -n 'kv_cache_format' --type py -C 2
```
Repository: NVIDIA/Model-Optimizer
Length of output: 2059
🏁 Script executed:
```shell
# Check how get_quant_config returns data and what it contains regarding kv_cache
rg -n 'quant_config\[' modelopt/torch/export/unified_export_hf.py
```
Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
```shell
# Look at the structure returned by get_quant_config
cat -n modelopt/torch/export/quant_utils.py | sed -n '1220,1290p'
```
Repository: NVIDIA/Model-Optimizer
Length of output: 3458
Critical: kv_cache_format is undefined and will cause a NameError at runtime.
The variable kv_cache_format is passed to postprocess_state_dict at line 659, but it is never defined in the _export_transformers_checkpoint function. The value should be extracted from quant_config, which is obtained at line 645 and contains the kv cache quantization format in quant_config["quantization"]["kv_cache_quant_algo"].
🔧 Suggested fix
```diff
  # We define kv cache scale as amax / 448 for both FP8 and NVFP4 KV cache quantization.
  kv_cache_max_bound = 448
+ kv_cache_format = quant_config["quantization"].get("kv_cache_quant_algo")
  quantized_state_dict = postprocess_state_dict(
      quantized_state_dict, kv_cache_max_bound, kv_cache_format, is_modelopt_qlora
  )
```
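For reference, a minimal, hypothetical example of the quant_config shape the suggested fix relies on; using .get() keeps kv_cache_format as None when KV cache quantization is not enabled:

```python
# Hypothetical quant_config contents; the actual keys come from get_quant_config.
quant_config = {
    "quantization": {
        "quant_algo": "NVFP4",
        "kv_cache_quant_algo": "FP8",  # may be absent when the KV cache is unquantized
    }
}

kv_cache_format = quant_config["quantization"].get("kv_cache_quant_algo")
print(kv_cache_format)  # "FP8" here; None if the key were missing
```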
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
```
@@           Coverage Diff           @@
##             main     #790   +/-   ##
=======================================
  Coverage   74.20%   74.20%
=======================================
  Files         192      192
  Lines       19238    19238
=======================================
  Hits        14276    14276
  Misses       4962     4962
```
☔ View full report in Codecov by Sentry.
What does this PR do?
Overview:
Unified the FP8 and NVFP4 kv cache scaling factor definition so the same checkpoint can be used for both FP8 and NVFP4 kv cache quantization deployment
Testing
Unit test
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Release Notes
Refactor
Chores