Add support for Qwen3Omni30B thinking model #856
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough
Adds Qwen3Omni multimodal support across PTQ, dataloaders, vLLM inference, and export: new text/image/video processors and dataloaders, model-type mapping, quantization adjustments, calibration-batch propagation through pre/post-quantize, a vLLM runner, and a CLI option to save quantization summaries.
Sequence Diagram(s)
sequenceDiagram
participant CLI as User/CLI
participant Main as hf_ptq (quantize_main)
participant Proc as ProcessorSelector
participant DLFactory as DataLoaderFactory
participant PreQ as pre_quantize
participant Quant as quantize_model
participant PostQ as post_quantize
participant Export as unified_export
CLI->>Main: start with --model_name=qwen3omni
Main->>Proc: get_processor(model_type="qwen3omni")
Proc-->>Main: Qwen3OmniImageProcessor / TextProcessor / VideoProcessor
Main->>DLFactory: select dataloader (video / vl / text)
DLFactory-->>Main: calib_dataloader
Main->>PreQ: pre_quantize(calib_dataloader,...)
PreQ-->>Main: preview_ids, generated_before_ptq, calib_batch
Main->>Quant: quantize_model(calib_batch,...)
Quant->>Quant: route inference via _should_use_generate / forward
Quant-->>Main: quantized_model
Main->>PostQ: post_quantize(..., calib_batch)
PostQ-->>Main: post-PTQ results
Main->>Export: export_quantized(...)
Export->>Export: patch_config_for_unified_export
Export-->>Main: exported_checkpoint
alt --quant_summary_path provided
Main->>Main: write quant summary to file + print confirmation
else
Main->>Main: print quant summary to stdout
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 219-221: The current code mutates processor.dtype (processor.dtype
= language_model.dtype) which can have unintended side effects; instead avoid
changing the existing processor and either pass dtype explicitly into
get_vlm_dataset_dataloader/collate function or construct a new processor
instance with the desired dtype (e.g., create a new Qwen3OmniImageProcessor
using processor.tokenizer and language_model.dtype) and pass that to
get_vlm_dataset_dataloader so the original processor remains unchanged.
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1001-1009: The code currently mutates model.generation_config
in-place (via gen_config and clearing temperature/top_p/top_k) which persists
after export; instead, create a non-mutating copy of model.generation_config (or
record the original values for "temperature","top_p","top_k"), apply the
sampling fixes to that copy (or to the saved serialization data) and use the
copy for writing the checkpoint (e.g., before calling save_pretrained), then
restore the original model.generation_config (or simply never mutate it) so the
caller's model is unchanged; reference symbols: model.generation_config,
gen_config, and the sampling attrs ["temperature","top_p","top_k"] when locating
where to implement the copy/restore behavior.
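The copy/restore behavior this prompt describes can be sketched as a small context manager. This is a generic illustration with illustrative names (`temporarily_set`, `GenConfig`), not the actual modelopt API:

```python
from contextlib import contextmanager

@contextmanager
def temporarily_set(obj, attr, value):
    """Set obj.attr to value for the duration of the block, then restore
    the original, so the caller's object is never left mutated."""
    original = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield obj
    finally:
        setattr(obj, attr, original)

class GenConfig:  # stand-in for model.generation_config
    do_sample = False

cfg = GenConfig()
with temporarily_set(cfg, "do_sample", True):
    pass  # save_pretrained(...) would run here with the patched config
print(cfg.do_sample)  # False — original value restored
```

In the export path, the checkpoint would be written inside the `with` block, so the caller's `generation_config` is unchanged afterward.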
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 470-472: The code computes batch_size using tensor_data but
iterates keys from batch_data (batch_size =
tensor_data[next(iter(batch_data.keys()))].shape[0]), which can pick a
non-tensor key (e.g., max_new_tokens) and cause .shape access to fail; change
the key source to tensor_data so you select a tensor key (e.g., batch_size =
tensor_data[next(iter(tensor_data.keys()))].shape[0]) or explicitly find the
first value in tensor_data that is a tensor and use its shape[0]; update the
logic around the batch_size variable and references to tensor_data/batch_data to
ensure batch_size is derived from actual tensors (functions/variables to check:
tensor_data, batch_data, batch_size).
In `@modelopt/torch/utils/image_processor.py`:
- Around line 264-277: The code uses torch.tensor(...) for "pixel_values" and
"audio_features" which can change/infer dtypes inconsistently; change those to
preserve original dtype by using
torch.as_tensor(first["pixel_values"]).to(self.device) and
torch.as_tensor(first["audio_features"]).to(self.device) (keeping the existing
.to(self.device) pattern), so the dtype of the incoming data is preserved;
update the handling in image_processor.py where result["pixel_values"] and
result["audio_features"] are set (using the local variables first and result) to
use torch.as_tensor instead of torch.tensor.
🧹 Nitpick comments (5)
modelopt/torch/export/model_utils.py (1)
145-148: Minor: Update comment for consistency. The comment "Pattern 3: No language_model found" at line 148 is now misleading since a new pattern was inserted above it. Consider renumbering or removing pattern numbering.
💡 Suggested fix
 if hasattr(model, "thinker"):
     return [model, model.thinker]
-# Pattern 3: No language_model found
+# No language_model found
 return None

examples/llm_ptq/example_utils.py (1)
284-286: Update docstring to reflect new return types. The docstring states it returns a `MllamaImageProcessor` object, but now it can also return `Qwen3OmniImageProcessor` for the new model type.
📝 Suggested fix
 def get_processor(
     ckpt_path,
     model_type,
     device: torch.device = "auto",
     trust_remote_code=False,
     attn_implementation=None,
 ) -> BaseImageProcessor | ProcessorMixin | None:
     """
-    Returns a :class:`modelopt.torch.utils.image_processor.MllamaImageProcessor` object.
+    Returns an image processor for multimodal models (MllamaImageProcessor, Qwen3OmniImageProcessor)
+    or a ProcessorMixin for models like Whisper.
     """

modelopt/torch/utils/image_processor.py (1)
115-172: Consider interface consistency for `preprocess_function`.
`Qwen3OmniTextProcessor.preprocess_function(text: str)` has a different signature than the base class and sibling classes, which use `preprocess_function(examples: dict)`. This could cause issues if callers expect a uniform interface. Additionally, the `dtype` attribute is stored but never used in the processor methods.
💡 Suggested approach
Consider either:
- Aligning the signature with the base class by wrapping text in a dict:
  def preprocess_function(self, examples):
      text = examples.get("text", examples) if isinstance(examples, dict) else examples
      # ... rest of implementation
- Or documenting the intentional deviation clearly in the class docstring.
For the unused `dtype`, either utilize it in tensor conversion or remove it if not needed.
modelopt/torch/utils/dataset_utils.py (1)
136-146: Replace assertions with proper exceptions for public API.
Using `assert` for input validation in a public API function (`get_qwen3omni_text_dataloader`) is discouraged because assertions can be disabled with the `-O` flag. Use `ValueError` or `TypeError` instead.
💡 Suggested fix
-assert processor is not None, "Please provide a Qwen3OmniTextProcessor."
+if processor is None:
+    raise ValueError("Please provide a Qwen3OmniTextProcessor.")
 if isinstance(num_samples, int):
     num_samples = [num_samples]
 if isinstance(dataset_name, str):
     dataset_name = [dataset_name]
-assert len(dataset_name) == len(num_samples), (
-    "dataset_name and num_samples must be the same length"
-)
+if len(dataset_name) != len(num_samples):
+    raise ValueError("dataset_name and num_samples must be the same length")

examples/llm_ptq/hf_ptq.py (1)
756-766: Code duplication in Qwen3Omni generation handling.
The logic for handling Qwen3Omni's tuple return from `generate()` is duplicated between `pre_quantize` (lines 756-766) and `post_quantize` (lines 817-827). Consider extracting it to a helper function.
♻️ Suggested refactor
def _extract_qwen3omni_generated_ids(result):
    """Extract generated IDs from Qwen3Omni generate() output."""
    if isinstance(result, tuple):
        text_ids, _ = result
        return text_ids.sequences if hasattr(text_ids, "sequences") else text_ids
    return result
Then use it in both places:
# In pre_quantize and post_quantize:
result = full_model.generate(**calib_batch, max_new_tokens=100)
generated_ids = _extract_qwen3omni_generated_ids(result)
Also applies to: 817-827
cjluo-nv
left a comment
@Edwardf0t1 could you review it as well?
Edwardf0t1
left a comment
@ajrasane I'm wondering if the exported checkpoint can be deployed on vLLM/SGLang/TRT-LLM and how's the accuracy?
@Edwardf0t1, I was able to deploy the thinking checkpoint on vLLM and get good outputs. I have also generated accuracy results with MMLU. PTAL.
return None


def ensure_tokenizer_files(model_path: str, source_model_id: str) -> None:
This was required by vLLM for running the model. If the tokenizer files are not saved, then we are unable to deploy the checkpoint with vLLM.
I think we should be able to export the tokenizer files with https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/llm_ptq/hf_ptq.py#L691-L696
print("No custom model files found to copy")


def patch_config_for_unified_export(model_type: str, export_path: str) -> None:
 else:
-    calibrate_loop = create_forward_loop(dataloader=calib_dataloader)
+    calibrate_loop = create_forward_loop(
+        dataloader=calib_dataloader, generation_kwargs=generation_kwargs
what will happen if generation_kwargs is empty?
The generation_kwargs are ultimately passed to the model's infer_method (forward/generate). If the kwargs are empty, there are no issues: unpacking an empty dict is a no-op, so only the batch_data is passed.
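The empty-kwargs point can be checked with a minimal sketch. The function and batch names here are illustrative, not the real hf_ptq code:

```python
def run_inference(infer_method, batch_data, generation_kwargs=None):
    """Dispatch a calibration batch; an empty kwargs dict unpacks to nothing,
    so the call degrades to infer_method(**batch_data)."""
    generation_kwargs = generation_kwargs or {}
    return infer_method(**batch_data, **generation_kwargs)

def fake_forward(input_ids, max_new_tokens=0):
    # Toy stand-in for model.forward/model.generate.
    return len(input_ids) + max_new_tokens

print(run_inference(fake_forward, {"input_ids": [1, 2, 3]}))                         # 3
print(run_inference(fake_forward, {"input_ids": [1, 2, 3]}, {"max_new_tokens": 4}))  # 7
```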
    generated_ids_before_ptq,
    is_nemotron_vl_model,
    first_text_speech_dataset,
    calib_batch: dict | None = None,
please document why we need to add calib_batch
-    return processor.tokenizer.batch_decode(input_ids)
+    # BaseImageProcessor covers MllamaImageProcessor and Qwen3OmniImageProcessor
+    if processor is not None and isinstance(processor, BaseImageProcessor):
+        return processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
I have updated the code since the last review. Is this comment still relevant?
@@ -0,0 +1,136 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
is this qwen3 omni specific?
Currently this is qwen3omni-specific, but we can later extend it to the other models supported by hf_ptq.py.
8740db3 to e202164 (force-push)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
modelopt/torch/export/unified_export_hf.py (1)
360-384: ⚠️ Potential issue | 🔴 Critical — Unreachable code: lines 380-382 are dead code.
The condition at line 380 (`getattr(model.config, "is_encoder_decoder", False)`) is identical to line 360. Since line 360 already handles all encoder-decoder models, the elif branch at line 380 can never be reached.
Control flow analysis:
- If `is_encoder_decoder` is `True` → the line 360 branch executes
- If `is_encoder_decoder` is `False` → skip line 360, check line 363
- If line 363 is also `False` → check line 380, but this is the same condition as line 360, which we already know is `False`
If the intent was to handle encoder-decoder models that are also VL (Nemotron/Qwen3Omni), consider restructuring the logic.
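A stripped-down illustration of why the second check is unreachable (a toy routing function, not the real export code):

```python
def route(is_enc_dec, is_vl):
    """Mirrors the control flow above: a later elif that repeats the first
    condition can never fire, because that condition was already consumed."""
    if is_enc_dec:
        return "enc-dec"
    elif is_vl:
        return "vl"
    elif is_enc_dec:  # dead: identical to the first condition
        return "dead"
    else:
        return "default"

# The "dead" branch is unreachable for every input combination:
print({route(e, v) for e in (True, False) for v in (True, False)})
```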
🐛 Proposed fix: Remove dead code
 if getattr(model.config, "is_encoder_decoder", False):
     # For encoder-decoder models, we need to pass both the encoder and decoder input ids
     model(fake_input, decoder_input_ids=decoder_fake_input)
 elif (is_vl_model and "nemotron" in model_type) or model_type.startswith("qwen3omni"):
     # For Nemotron VL models, try to run optimization on just the language model part
     language_model_lineage = get_language_model_from_vl(model)
     if language_model_lineage is not None:
         language_model = language_model_lineage[-1]
         print(
             f"Running optimization on language model with fake_input shape: {fake_input.shape}"
         )
         # Pass use_cache=False to avoid KV cache issues in encoder-decoder models
         language_model(fake_input, use_cache=False)
     else:
         raise ValueError(
             f"Cannot extract language_model from VL model (type: {model_type}). "
             "This is required for requantization/resmoothing optimization. "
             "Please ensure the model architecture is supported or file an issue."
         )
-elif getattr(model.config, "is_encoder_decoder", False):
-    # For other encoder-decoder models (non-VL), pass both encoder and decoder input ids
-    model(fake_input, decoder_input_ids=decoder_fake_input)
 else:
     model(fake_input)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/unified_export_hf.py` around lines 360 - 384, The duplicate check for getattr(model.config, "is_encoder_decoder", False) creates unreachable code; remove the second identical elif block (the branch starting with that getattr at the end) and merge its behavior into the first encoder-decoder handling if needed, or ensure encoder-decoder VL cases are handled in the earlier VL-specific branch (the logic around is_vl_model, "nemotron", model_type.startswith("qwen3omni"), and get_language_model_from_vl). Update the control flow so only one encoder-decoder check remains (keep the first if that calls model(fake_input, decoder_input_ids=decoder_fake_input)) and adjust the VL branch to fall through or delegate to that encoder-decoder handling where appropriate.
modelopt/torch/utils/dataset_utils.py (2)
626-628: ⚠️ Potential issue | 🟠 Major — Missing `generation_kwargs` in recursive `_process_batch` call.
When splitting batches due to a known max_working_batch_size, the recursive call on lines 626-628 doesn't pass `generation_kwargs`, causing it to default to an empty dict. This means generation parameters won't be propagated during batch splitting.
🐛 Proposed fix
 max_working_batch_size = _process_batch(
-    split_data, infer_method, max_working_batch_size
+    split_data, infer_method, generation_kwargs, max_working_batch_size
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 626 - 628, The recursive call to _process_batch when splitting into split_data currently omits the generation_kwargs parameter, so generation options aren't propagated; update the call to include generation_kwargs (i.e., call _process_batch(split_data, infer_method, max_working_batch_size, generation_kwargs)) so the same generation settings are passed through; ensure you reference the _process_batch function signature and the generation_kwargs variable used earlier in the enclosing scope to maintain behavior consistency.
658-659: ⚠️ Potential issue | 🟠 Major — Missing `generation_kwargs` in OOM recovery recursive calls.
The recursive `_process_batch` calls on lines 658-659 (for OOM recovery by splitting in half) also don't pass `generation_kwargs`.
🐛 Proposed fix
 # Recursively process each half and track max working batch size
-max_working_batch_size = _process_batch(split_data_1, infer_method)
-max_working_batch_size = _process_batch(split_data_2, infer_method, max_working_batch_size)
+max_working_batch_size = _process_batch(split_data_1, infer_method, generation_kwargs)
+max_working_batch_size = _process_batch(split_data_2, infer_method, generation_kwargs, max_working_batch_size)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 658 - 659, The OOM recovery recursion calling _process_batch for split_data_1 and split_data_2 omits passing generation_kwargs, so ensure both recursive calls include the generation_kwargs parameter (i.e., call _process_batch(split_data_1, infer_method, generation_kwargs=generation_kwargs, ...) and _process_batch(split_data_2, infer_method, generation_kwargs=generation_kwargs, ...) while preserving the existing max_working_batch_size argument) so that generation settings are preserved during batch-splitting; update the calls referencing _process_batch and variables split_data_1 and split_data_2 accordingly.
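Both findings amount to the same rule: thread generation_kwargs through every recursive call. A toy version of the splitting logic (illustrative names and a simulated OOM, not the actual `_process_batch`):

```python
def process_batch(batch, infer, generation_kwargs=None, max_working=0):
    """Run infer on a batch; on (simulated) OOM, split in half and recurse,
    forwarding the same generation_kwargs to both halves."""
    generation_kwargs = generation_kwargs or {}
    try:
        infer(batch, **generation_kwargs)
        return max(max_working, len(batch))
    except MemoryError:
        if len(batch) <= 1:
            raise
        mid = len(batch) // 2
        max_working = process_batch(batch[:mid], infer, generation_kwargs, max_working)
        return process_batch(batch[mid:], infer, generation_kwargs, max_working)

seen_kwargs = []
def infer(batch, max_new_tokens=0):
    seen_kwargs.append(max_new_tokens)
    if len(batch) > 2:  # pretend batches above size 2 run out of memory
        raise MemoryError

print(process_batch([1, 2, 3, 4, 5], infer, {"max_new_tokens": 8}))  # 2
print(all(k == 8 for k in seen_kwargs))  # True — kwargs reached every split
```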
♻️ Duplicate comments (3)
modelopt/torch/export/unified_export_hf.py (1)
1176-1186: ⚠️ Potential issue | 🟡 Minor — Model's `generation_config` mutation persists after export returns.
This block mutates `model.generation_config.do_sample` in-place. After `export_hf_checkpoint` returns, callers retain the modified model with `do_sample=True`, which may surprise users who continue to use the model object. Consider saving the original value and restoring it after `save_pretrained()` completes.
♻️ Proposed fix: Restore original do_sample after save
 # Fix generation_config conflicts before saving
 # Some models have temperature/top_p/top_k set but do_sample=False which causes validation errors
+_orig_do_sample = None
 if hasattr(model, "generation_config") and model.generation_config is not None:
     gen_config = model.generation_config
     if not getattr(gen_config, "do_sample", True):
         # Enable sampling if sampling params are present
         if any(
             getattr(gen_config, attr, None) is not None
             for attr in ["temperature", "top_p", "top_k"]
         ):
+            _orig_do_sample = gen_config.do_sample
             gen_config.do_sample = True

 # Save model
 # Temporarily disable revert_weight_conversion if available — it doesn't handle
 # quantized state dicts (scalar scale tensors have 0 dimensions, causing IndexError).
 # We must patch both the source module and the importing module since
 # modeling_utils does `from core_model_loading import revert_weight_conversion`.
 _patches = _patch_revert_weight_conversion()
 try:
     model.save_pretrained(
         export_dir,
         state_dict={**post_state_dict, **(extra_state_dict or {})},
         save_modelopt_state=save_modelopt_state,
     )
 finally:
     _unpatch_revert_weight_conversion(_patches)
+# Restore original do_sample if we modified it
+if _orig_do_sample is not None and hasattr(model, "generation_config"):
+    model.generation_config.do_sample = _orig_do_sample
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/unified_export_hf.py` around lines 1176 - 1186, The code currently mutates model.generation_config.do_sample in-place causing the change to leak to the caller; capture the original value (e.g., orig_do_sample = gen_config.do_sample) before setting gen_config.do_sample = True, proceed with the existing save flow (save_pretrained()), and then restore gen_config.do_sample = orig_do_sample after saving completes; reference the same objects/variables used in the diff (model, model.generation_config, gen_config, and save_pretrained()) so the temporary change does not persist after export_hf_checkpoint returns.
modelopt/torch/utils/dataset_utils.py (1)
608-610: ⚠️ Potential issue | 🟠 Major — Bug: `batch_data.keys()` used instead of `tensor_data.keys()` when computing batch_size.
Line 610 retrieves batch_size from `tensor_data` but iterates keys from `batch_data`. If the first key in `batch_data` is a scalar parameter (like `max_new_tokens`), accessing `.shape[0]` will fail because scalars don't have a `.shape` attribute.
🐛 Proposed fix
 # Get the batch size of current data
-batch_size = tensor_data[next(iter(batch_data.keys()))].shape[0]
+batch_size = tensor_data[next(iter(tensor_data.keys()))].shape[0]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 608 - 610, The code computes batch_size by indexing tensor_data with a key from batch_data, which can pick up non-tensor scalar keys (e.g., max_new_tokens) and cause .shape access to fail; update the batch_size computation to use a key from tensor_data instead (e.g., use the first key/value from tensor_data or its values iterator) so you always read a tensor to call .shape[0]; ensure tensor_data is non-empty before accessing to avoid StopIteration.
modelopt/torch/utils/image_processor.py (1)
264-277: ⚠️ Potential issue | 🟡 Minor — Inconsistent tensor creation may cause dtype issues.
`torch.tensor()` (lines 265, 277) infers dtype from input data, while `torch.LongTensor()` explicitly creates int64 tensors. For `pixel_values` and `audio_features`, the original dtype (likely float32/bfloat16) should be preserved or explicitly set.
🔧 Suggested fix
 # Handle pixel values for images
 if first.get("pixel_values") is not None:
-    result["pixel_values"] = torch.tensor(first["pixel_values"]).to(self.device)
+    result["pixel_values"] = torch.as_tensor(first["pixel_values"]).to(self.device)
 ...
 if first.get("audio_features") is not None:
-    result["audio_features"] = torch.tensor(first["audio_features"]).to(self.device)
+    result["audio_features"] = torch.as_tensor(first["audio_features"]).to(self.device)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/image_processor.py` around lines 264 - 277, The code is inconsistently creating tensors with torch.tensor(...) for pixel_values and audio_features while using torch.LongTensor(...) for integer lengths, which can change or infer dtypes incorrectly; update the pixel_values and audio_features creation to preserve their original dtypes and place them on the device (e.g., use torch.as_tensor(...) or torch.tensor(..., dtype=first_dtype, device=self.device)) instead of torch.tensor(...). Keep image_grid_thw and audio_feature_lens as explicit LongTensors (result["image_grid_thw"], result["audio_feature_lens"]) and ensure all tensor constructions reference self.device consistently.
🧹 Nitpick comments (2)
modelopt/torch/utils/dataset_utils.py (1)
676-678: Unused `_should_use_generate` function - incomplete refactoring?
The `_should_use_generate` function is defined at lines 850-859 but the call site at line 677 is commented out, using `model_type_is_enc_dec` instead. Either remove the dead `_should_use_generate` function or uncomment the intended usage. This appears to be leftover from incomplete refactoring.
♻️ If _should_use_generate is intended, use it
 with torch.no_grad():
-    # use_generate = _should_use_generate(model)
-    use_generate = model_type_is_enc_dec(model)
+    use_generate = _should_use_generate(model)
     infer_method = model.generate if use_generate else model.forward
Or remove the unused function if `model_type_is_enc_dec` is the correct behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 676 - 678, The code currently comments out the call to _should_use_generate and instead uses model_type_is_enc_dec to set use_generate (affecting infer_method), leaving the _should_use_generate function unused; either restore the intended behavior by uncommenting/using _should_use_generate where use_generate is assigned (so infer_method = model.generate if _should_use_generate(model) else model.forward), or delete the dead _should_use_generate function (defined at lines 850-859) to remove dead code and ensure consistency between the use_generate logic and the actual helper used (model_type_is_enc_dec or _should_use_generate).
examples/llm_ptq/hf_ptq.py (1)
436-443: Clarify: `processor = None` forces text-only calibration for Qwen3Omni.
Setting `processor = None` at line 441 means `get_qwen3omni_dataloader` will use the text-only fallback path. If multimodal calibration (video/VLM datasets) is ever needed, the processor would need to be preserved. Consider adding a comment clarifying this is intentional for text-only calibration.
📝 Suggested documentation
 if model_type == "qwen3omni":
     print("Disabling talker for Qwen3Omni model")
     full_model.disable_talker()
     language_model = full_model.thinker.model
     tokenizer = processor.tokenizer.tokenizer
+    # Use text-only calibration for Qwen3Omni; set processor=None to skip multimodal path
     processor = None
     default_padding_side = tokenizer.padding_side
     default_pad_token = tokenizer.pad_token
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 436 - 443, The assignment processor = None inside the model_type == "qwen3omni" branch intentionally forces text-only calibration (so get_qwen3omni_dataloader will take the text-only fallback); make this explicit by adding a clarifying comment next to processor = None that states it disables multimodal processing and forces text-only calibration for Qwen3Omni, and note that to enable multimodal/video/VLM calibration the processor must be preserved (i.e., remove the processor = None line), referencing the surrounding symbols full_model.disable_talker, language_model = full_model.thinker.model, tokenizer = processor.tokenizer.tokenizer, and get_qwen3omni_dataloader to guide where the change applies.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 1097-1099: The code mutates processor.dtype directly
(processor.dtype = model_dtype) which can leak state; instead avoid altering the
passed-in Processor instance: create a shallow copy of the processor before
changing dtype (e.g., copy/clone the Processor and set dtype on the clone) or
modify get_vlm_dataset_dataloader/collate_function to accept an explicit dtype
parameter and pass model_dtype through that path; update the call sites around
get_vlm_dataset_dataloader and any collate_function usage to use the cloned
processor or the explicit dtype argument so the original processor instance is
not mutated.
- Around line 1078-1079: The current check "if dataset_name in
get_supported_video_datasets()" fails when dataset_name is a list because a list
cannot be a member of a list of strings; update the condition to handle both str
and list[str] by first checking isinstance(dataset_name, str) and comparing it
to get_supported_video_datasets(), or if isinstance(dataset_name, list) iterate
(or check dataset_name[0]) against get_supported_video_datasets(); adjust the
subsequent assert to validate the element type(s) (e.g., assert
isinstance(dataset_name, str) for the string branch or assert all(isinstance(x,
str) for x in dataset_name) for the list branch) and ensure the branch logic
uses the correct element(s).
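A sketch of the str-or-list handling this prompt asks for. The function name and dataset names are illustrative, not the actual example_utils.py code:

```python
def normalize_dataset_names(dataset_name, supported):
    """Accept a single name or a list of names, validate element types,
    and return the subset that names supported video datasets."""
    names = [dataset_name] if isinstance(dataset_name, str) else list(dataset_name)
    if not all(isinstance(n, str) for n in names):
        raise TypeError("dataset_name entries must be strings")
    return [n for n in names if n in supported]

supported = {"finevideo", "llava-video"}  # hypothetical dataset names
print(normalize_dataset_names("finevideo", supported))             # ['finevideo']
print(normalize_dataset_names(["finevideo", "other"], supported))  # ['finevideo']
```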
In `@examples/llm_ptq/run_vllm.py`:
- Around line 104-111: The LLM is being instantiated with a hardcoded
trust_remote_code=True; update the LLM(...) call to use the CLI flag instead
(e.g., replace trust_remote_code=True with
trust_remote_code=args.trust_remote_code or the appropriate parsed flag name) so
the vLLM LLM class respects the command-line setting.
- Around line 68-76: The code currently hardcodes trust_remote_code=True in
AutoConfig.from_pretrained and AutoProcessor.from_pretrained; add a CLI flag
(e.g., args.trust_remote_code, default False) and use that value instead of the
hardcoded True when calling AutoConfig.from_pretrained and
AutoProcessor.from_pretrained (also update any parser help text and default
behavior), ensuring the examples/***.py exposes the option to callers and
defaults to False to avoid executing remote code by default.
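The flag wiring could look like this (flag name and default as the prompt suggests; the parser setup is a sketch, not the actual run_vllm.py):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--trust-remote-code",
    action="store_true",
    default=False,  # safe default: never execute remote model code implicitly
    help="Allow executing code from the model repository (off by default).",
)

args = parser.parse_args([])
print(args.trust_remote_code)  # False
print(parser.parse_args(["--trust-remote-code"]).trust_remote_code)  # True
```

The parsed value would then replace the hardcoded `True` in the `AutoConfig.from_pretrained`, `AutoProcessor.from_pretrained`, and `LLM(...)` calls.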
In `@modelopt/torch/utils/video_dataset_utils.py`:
- Around line 121-129: Add an inline comment next to the torch.load(cache_path,
weights_only=False) call explaining why weights_only is set to False and why
this is safe: note that the cache file (loaded into processed_samples and later
turned into processed_dataset via Dataset.from_list) is produced internally by
this module, trusted, and not coming from untrusted external sources, so loading
non-weight tensors is acceptable; include mention that the cache format is
controlled by the code that writes it to ensure compatibility and security.
- Around line 281-325: The collate_function currently grabs only the first item
(first = batch[0]) and discards other samples, so either enforce batch_size==1
or properly aggregate the batch; update collate_function to iterate over batch
and stack/concat per-key (e.g., for "input_ids" and "attention_mask" create a
LongTensor from a list of all items and .to(self.device), for
"pixel_values_videos" and "input_features" build a torch.tensor for each item
then torch.stack and apply self.dtype and device, and similarly stack
"video_grid_thw", "video_second_per_grid" and "feature_attention_mask"), or
alternatively add an explicit validation in
collate_function/get_video_dataset_dataloader that raises a clear error if
len(batch) != 1 when multi-sample batching is not supported; reference
collate_function, batch, first, and get_video_dataset_dataloader when making the
change.
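The per-key aggregation described above, sketched with plain lists standing in for tensors (the real code would `torch.stack` each list and move results to the device):

```python
def collate_function(batch):
    """Aggregate a list of per-sample dicts into one dict of lists,
    instead of silently keeping only batch[0]."""
    if not batch:
        raise ValueError("collate_function received an empty batch")
    keys = batch[0].keys()
    return {key: [item[key] for item in batch] for key in keys}

samples = [
    {"input_ids": [1, 2], "attention_mask": [1, 1]},
    {"input_ids": [3, 4], "attention_mask": [1, 0]},
]
print(collate_function(samples)["input_ids"])  # [[1, 2], [3, 4]]
```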
---
Outside diff comments:
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 360-384: The duplicate check for getattr(model.config,
"is_encoder_decoder", False) creates unreachable code; remove the second
identical elif block (the branch starting with that getattr at the end) and
merge its behavior into the first encoder-decoder handling if needed, or ensure
encoder-decoder VL cases are handled in the earlier VL-specific branch (the
logic around is_vl_model, "nemotron", model_type.startswith("qwen3omni"), and
get_language_model_from_vl). Update the control flow so only one encoder-decoder
check remains (keep the first if that calls model(fake_input,
decoder_input_ids=decoder_fake_input)) and adjust the VL branch to fall through
or delegate to that encoder-decoder handling where appropriate.
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 626-628: The recursive call to _process_batch when splitting into
split_data currently omits the generation_kwargs parameter, so generation
options aren't propagated; update the call to include generation_kwargs (i.e.,
call _process_batch(split_data, infer_method, max_working_batch_size,
generation_kwargs)) so the same generation settings are passed through; ensure
you reference the _process_batch function signature and the generation_kwargs
variable used earlier in the enclosing scope to maintain behavior consistency.
- Around line 658-659: The OOM recovery recursion calling _process_batch for
split_data_1 and split_data_2 omits passing generation_kwargs, so ensure both
recursive calls include the generation_kwargs parameter (i.e., call
_process_batch(split_data_1, infer_method, generation_kwargs=generation_kwargs,
...) and _process_batch(split_data_2, infer_method,
generation_kwargs=generation_kwargs, ...) while preserving the existing
max_working_batch_size argument) so that generation settings are preserved
during batch-splitting; update the calls referencing _process_batch and
variables split_data_1 and split_data_2 accordingly.
---
Duplicate comments:
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1176-1186: The code currently mutates
model.generation_config.do_sample in-place causing the change to leak to the
caller; capture the original value (e.g., orig_do_sample = gen_config.do_sample)
before setting gen_config.do_sample = True, proceed with the existing save flow
(save_pretrained()), and then restore gen_config.do_sample = orig_do_sample
after saving completes; reference the same objects/variables used in the diff
(model, model.generation_config, gen_config, and save_pretrained()) so the
temporary change does not persist after export_hf_checkpoint returns.
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 608-610: The code computes batch_size by indexing tensor_data with
a key from batch_data, which can pick up non-tensor scalar keys (e.g.,
max_new_tokens) and cause .shape access to fail; update the batch_size
computation to use a key from tensor_data instead (e.g., use the first key/value
from tensor_data or its values iterator) so you always read a tensor to call
.shape[0]; ensure tensor_data is non-empty before accessing to avoid
StopIteration.
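A torch-free sketch of the safer lookup (duck-typing on `.shape` here so it runs anywhere; the real code would test `isinstance(v, torch.Tensor)`):

```python
def infer_batch_size(batch_data):
    """Read batch size from a tensor-like value, never from a scalar option."""
    tensor_data = {k: v for k, v in batch_data.items() if hasattr(v, "shape")}
    if not tensor_data:
        raise ValueError("batch_data contains no tensor-like values")
    return next(iter(tensor_data.values())).shape[0]
```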
In `@modelopt/torch/utils/image_processor.py`:
- Around line 264-277: The code is inconsistently creating tensors with
torch.tensor(...) for pixel_values and audio_features while using
torch.LongTensor(...) for integer lengths, which can change or infer dtypes
incorrectly; update the pixel_values and audio_features creation to preserve
their original dtypes and place them on the device (e.g., use
torch.as_tensor(...) or torch.tensor(..., dtype=first_dtype,
device=self.device)) instead of torch.tensor(...). Keep image_grid_thw and
audio_feature_lens as explicit LongTensors (result["image_grid_thw"],
result["audio_feature_lens"]) and ensure all tensor constructions reference
self.device consistently.
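As a sketch of the dtype rules this comment describes (illustrative field names, a standalone function rather than the actual collate method):

```python
import torch

def collate_media(pixel_values, image_grid_thw, device="cpu"):
    """Convert features while preserving float dtypes and forcing int64 metadata."""
    return {
        # torch.as_tensor keeps the source dtype (and avoids a copy for tensors)
        "pixel_values": torch.as_tensor(pixel_values).to(device),
        # grid metadata is explicitly int64, matching the LongTensor convention
        "image_grid_thw": torch.as_tensor(image_grid_thw, dtype=torch.long).to(device),
    }
```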
---
Nitpick comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 436-443: The assignment processor = None inside the model_type ==
"qwen3omni" branch intentionally forces text-only calibration (so
get_qwen3omni_dataloader will take the text-only fallback); make this explicit
by adding a clarifying comment next to processor = None that states it disables
multimodal processing and forces text-only calibration for Qwen3Omni, and note
that to enable multimodal/video/VLM calibration the processor must be preserved
(i.e., remove the processor = None line), referencing the surrounding symbols
full_model.disable_talker, language_model = full_model.thinker.model, tokenizer
= processor.tokenizer.tokenizer, and get_qwen3omni_dataloader to guide where the
change applies.
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 676-678: The code currently comments out the call to
_should_use_generate and instead uses model_type_is_enc_dec to set use_generate
(affecting infer_method), leaving the _should_use_generate function unused;
either restore the intended behavior by uncommenting/using _should_use_generate
where use_generate is assigned (so infer_method = model.generate if
_should_use_generate(model) else model.forward), or delete the dead
_should_use_generate function (defined at lines 850-859) to remove dead code and
ensure consistency between the use_generate logic and the actual helper used
(model_type_is_enc_dec or _should_use_generate).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d986f022-58ef-43cb-80f2-82592aabdf00
📒 Files selected for processing (12)
- examples/llm_eval/run_lm_eval_vllm.sh
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/run_vllm.py
- modelopt/torch/export/model_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/model_quant.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/utils/__init__.py
- modelopt/torch/utils/dataset_utils.py
- modelopt/torch/utils/image_processor.py
- modelopt/torch/utils/video_dataset_utils.py
🚧 Files skipped from review as they are similar to previous changes (3)
- modelopt/torch/utils/__init__.py
- modelopt/torch/export/model_utils.py
- modelopt/torch/quantization/plugins/huggingface.py
```python
# Set the dtype for proper tensor conversion in collate_function
processor.dtype = model_dtype
calib_dataloader = get_vlm_dataset_dataloader(
```
Side effect: Mutating processor.dtype directly.
Line 1098 mutates processor.dtype = model_dtype, which could affect other callers if the processor is reused. The past review flagged this but the fix (creating a new processor) wasn't applied here.
💡 Suggested approach
Consider creating a copy or documenting that this mutation is intentional:
```diff
  # Set the dtype for proper tensor conversion in collate_function
+ # Note: This mutates the processor instance
  processor.dtype = model_dtype
```

Or pass dtype through the dataloader/collate function rather than mutating the processor.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Set the dtype for proper tensor conversion in collate_function
# Note: This mutates the processor instance
processor.dtype = model_dtype
calib_dataloader = get_vlm_dataset_dataloader(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/llm_ptq/example_utils.py` around lines 1097 - 1099, The code mutates
processor.dtype directly (processor.dtype = model_dtype) which can leak state;
instead avoid altering the passed-in Processor instance: create a shallow copy
of the processor before changing dtype (e.g., copy/clone the Processor and set
dtype on the clone) or modify get_vlm_dataset_dataloader/collate_function to
accept an explicit dtype parameter and pass model_dtype through that path;
update the call sites around get_vlm_dataset_dataloader and any collate_function
usage to use the cloned processor or the explicit dtype argument so the original
processor instance is not mutated.
Actionable comments posted: 5
♻️ Duplicate comments (2)
modelopt/torch/utils/video_dataset_utils.py (1)
286-288: ⚠️ Potential issue | 🟠 Major

`collate_function` drops all items except the first. With `batch_size > 1`, this silently discards samples. Either enforce `batch_size == 1` or implement real batch collation.

🐛 Proposed fix (enforce supported behavior):

```diff
 def get_video_dataset_dataloader(
 @@
 ) -> DataLoader:
 @@
     assert processor is not None, "Please provide a valid processor."
+    if batch_size != 1:
+        raise ValueError(
+            "Qwen3OmniVideoProcessor currently supports batch_size=1 only. "
+            "collate_function consumes only one sample."
+        )
```

Also applies to: 156-160
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/video_dataset_utils.py` around lines 286 - 288, The collate_function currently takes only batch[0] (variable first) which silently drops additional samples; update collate_function to either enforce batch_size==1 by checking len(batch) and raising a clear exception if >1, or implement proper collation logic that merges per-sample multimodal fields (e.g., stack tensors where possible, keep lists for variable-length modalities like frames/audio, or pad/pack sequences) so batches are not discarded; locate the collate_function and the use of variable first to apply the chosen fix.

examples/llm_ptq/example_utils.py (1)
1078-1081: ⚠️ Potential issue | 🟡 Minor

`dataset_name` list handling is still incomplete for multimodal paths. Only single-item lists are normalized. Longer lists silently fail membership checks and fall into unsupported-dataset errors.

💡 Proposed fix:

```diff
-    if isinstance(dataset_name, list) and len(dataset_name) == 1:
-        dataset_name = dataset_name[0]
+    if isinstance(dataset_name, list):
+        if len(dataset_name) != 1:
+            raise ValueError(
+                "For Qwen3Omni multimodal calibration with a processor, provide exactly one dataset."
+            )
+        dataset_name = dataset_name[0]
```

Also applies to: 1094-1095
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/example_utils.py` around lines 1078 - 1081, The code only normalizes single-item lists for dataset_name then checks membership against get_supported_video_datasets(), which causes multi-item lists to fail; update the handling so that if dataset_name is a list and len==1 you unwrap, and if len>1 you either (a) validate that all entries are supported (e.g., all(name in get_supported_video_datasets() for name in dataset_name)) or (b) if multimodal paths accept mixed datasets, validate that at least one/required subset is supported—then branch accordingly instead of asserting dataset_name is a str; update the checks around dataset_name and get_supported_video_datasets() (also apply the same fix at the other occurrence around lines 1094-1095) so multi-item lists are correctly recognized and validated.
🧹 Nitpick comments (2)
modelopt/torch/utils/video_dataset_utils.py (1)
179-205: Temporary video files rely on manual cleanup only.
`_save_video_bytes_to_file()` accumulates temp files, but `cleanup()` is opt-in. Consider auto-registering cleanup to avoid disk buildup on normal script exits.

♻️ Proposed refactor:

```diff
 import os
 import tempfile
+import atexit
 @@
         self._temp_dir = tempfile.mkdtemp(prefix="qwen3omni_video_")
         self._video_counter = 0
+        atexit.register(self.cleanup)
```

Also applies to: 329-334
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/video_dataset_utils.py` around lines 179 - 205, The temporary video files created by _save_video_bytes_to_file() are only removed by an opt-in cleanup(), so register automatic cleanup to avoid disk buildup: modify the class initializer where _temp_dir is created to either use tempfile.TemporaryDirectory() or call atexit.register(self.cleanup) (and import atexit), and ensure the existing cleanup() method removes self._temp_dir and handles repeated calls safely; update references to _temp_dir, _video_counter, and cleanup() (and any temp creation at lines ~329-334) so automatic teardown runs on process exit.modelopt/torch/utils/image_processor.py (1)
124-133: `device="auto"` should be normalized before `tensor.to(device)` calls.

The new processors pass `"auto"` through to `BaseImageProcessor`; downstream tensor moves may fail unless this is converted to a real torch device string.

♻️ Proposed fix:

```diff
 class Qwen3OmniTextProcessor(BaseImageProcessor):
 @@
     def __init__(self, processor, device="auto", dtype=None):
 @@
-        super().__init__(processor, device)
+        if device == "auto":
+            device = "cuda" if torch.cuda.is_available() else "cpu"
+        super().__init__(processor, device)
         self.dtype = dtype
 @@
 class Qwen3OmniImageProcessor(BaseImageProcessor):
 @@
     def __init__(self, tokenizer, device="auto", dtype=None, use_audio_in_video=False):
         """Constructor."""
-        super().__init__(tokenizer, device)
+        if device == "auto":
+            device = "cuda" if torch.cuda.is_available() else "cpu"
+        super().__init__(tokenizer, device)
```

Also applies to: 178-182
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/image_processor.py` around lines 124 - 133, The constructor currently accepts device="auto" but leaves it unnormalized, causing later tensor.to(device) calls to fail; update the __init__ in the image processor (and the other similar constructor around lines 178-182) to normalize device when device == "auto" by resolving to a concrete torch.device (e.g., torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")), store that normalized value in self.device, and ensure dtype (if provided) is a torch.dtype; use torch.cuda.is_available() and torch.device to perform the resolution so downstream .to(self.device) calls always receive a real device object or string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 167-170: The code calls snapshot_download(...) unconditionally and
fails for local tokenizer paths; update ensure_tokenizer_files to detect when
source_model_id is a local directory (use os.path.isdir or
Path(source_model_id).is_dir()) and in that case set cache_dir to the local path
instead of calling snapshot_download; otherwise keep the existing
snapshot_download call using TOKENIZER_FILES and assign the result to cache_dir
so downstream logic uses the correct directory.
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 534-536: The code determining whether to call model.generate still
uses model_type_is_enc_dec(model) instead of the new routing helper; update the
two call sites (the infer_method selection around the use_generate variable and
the similar block at the later occurrence) to call _should_use_generate(model)
rather than model_type_is_enc_dec(model) so Qwen3Omni and other generation-only
models route to generate(); keep the rest of the logic (infer_method =
model.generate if use_generate else model.forward) unchanged.
- Around line 651-652: When building split_data_1 and split_data_2 from
tensor_data, guard against None (or non-sliceable) fields so the OOM split path
doesn't call __getitem__ on None; update the comprehensions that create
split_data_1 and split_data_2 to check each tensor_data[key] (e.g., use key
existence and "if value is None" or "if not sliceable") and assign None for
those keys, otherwise slice using [:mid, ...] and [mid:, ...]; ensure you
reference the same symbols (tensor_data, split_data_1, split_data_2, mid) so the
behavior matches existing logic.
- Line 590: The recursive calls to _process_batch are passing
max_working_batch_size as the third positional argument so it lands in
generation_kwargs (breaking infer_method(**batch_data, **generation_kwargs));
update those recursive calls inside _process_batch to pass generation_kwargs and
max_working_batch_size by name (e.g., generation_kwargs=generation_kwargs,
max_working_batch_size=max_working_batch_size) or reorder arguments to match the
signature so generation_kwargs is always a mapping when forwarded to
infer_method; locate calls to _process_batch within the same function and change
them to use keyword arguments.
In `@modelopt/torch/utils/image_processor.py`:
- Around line 162-172: The collate_function currently only uses batch[0],
dropping other samples and losing data; update it to either (preferred) stack
supported fields across all samples or (alternative) raise a clear error when
len(batch) > 1; specifically, for collate_function gather all "input_ids" and
"attention_mask" values from each item in batch (skip None entries), convert the
collected lists into a single tensor via torch.LongTensor(torch.stack(...) or
torch.LongTensor(list_of_lists)) and move to self.device, or if you choose
fail-fast, raise a ValueError when len(batch) > 1 with a clear message; apply
the same fix to the other collator implementation referenced (the
collate_function at the later block around lines 255-263) so both handle
multi-sample batches consistently.
---
Duplicate comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 1078-1081: The code only normalizes single-item lists for
dataset_name then checks membership against get_supported_video_datasets(),
which causes multi-item lists to fail; update the handling so that if
dataset_name is a list and len==1 you unwrap, and if len>1 you either (a)
validate that all entries are supported (e.g., all(name in
get_supported_video_datasets() for name in dataset_name)) or (b) if multimodal
paths accept mixed datasets, validate that at least one/required subset is
supported—then branch accordingly instead of asserting dataset_name is a str;
update the checks around dataset_name and get_supported_video_datasets() (also
apply the same fix at the other occurrence around lines 1094-1095) so multi-item
lists are correctly recognized and validated.
In `@modelopt/torch/utils/video_dataset_utils.py`:
- Around line 286-288: The collate_function currently takes only batch[0]
(variable first) which silently drops additional samples; update
collate_function to either enforce batch_size==1 by checking len(batch) and
raising a clear exception if >1, or implement proper collation logic that merges
per-sample multimodal fields (e.g., stack tensors where possible, keep lists for
variable-length modalities like frames/audio, or pad/pack sequences) so batches
are not discarded; locate the collate_function and the use of variable first to
apply the chosen fix.
---
Nitpick comments:
In `@modelopt/torch/utils/image_processor.py`:
- Around line 124-133: The constructor currently accepts device="auto" but
leaves it unnormalized, causing later tensor.to(device) calls to fail; update
the __init__ in the image processor (and the other similar constructor around
lines 178-182) to normalize device when device == "auto" by resolving to a
concrete torch.device (e.g., torch.device("cuda") if torch.cuda.is_available()
else torch.device("cpu")), store that normalized value in self.device, and
ensure dtype (if provided) is a torch.dtype; use torch.cuda.is_available() and
torch.device to perform the resolution so downstream .to(self.device) calls
always receive a real device object or string.
In `@modelopt/torch/utils/video_dataset_utils.py`:
- Around line 179-205: The temporary video files created by
_save_video_bytes_to_file() are only removed by an opt-in cleanup(), so register
automatic cleanup to avoid disk buildup: modify the class initializer where
_temp_dir is created to either use tempfile.TemporaryDirectory() or call
atexit.register(self.cleanup) (and import atexit), and ensure the existing
cleanup() method removes self._temp_dir and handles repeated calls safely;
update references to _temp_dir, _video_counter, and cleanup() (and any temp
creation at lines ~329-334) so automatic teardown runs on process exit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 74edc7ca-e9dc-402b-a2fe-869d8c4cb6ae
📒 Files selected for processing (6)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/run_vllm.py
- modelopt/torch/quantization/model_quant.py
- modelopt/torch/utils/dataset_utils.py
- modelopt/torch/utils/image_processor.py
- modelopt/torch/utils/video_dataset_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/quantization/model_quant.py
cjluo-nv
left a comment
Review: Code Simplicity & Avoiding Rebuilt Logic
1. Duplicated collate logic across 3 processor classes
Qwen3OmniImageProcessor.collate_function, Qwen3OmniVideoProcessor.collate_function, and Qwen3OmniTextProcessor.collate_function all repeat the same pattern: take batch[0], convert lists to tensors by key, move to device, handle dtype. The only difference is which keys they handle.
Suggestion: Extract a shared helper in BaseImageProcessor (or a standalone function) that takes a key-to-dtype mapping:
```python
def _collate_first_item(self, batch, float_keys=(), long_keys=()):
    first = batch[0]
    result = {}
    for key in long_keys:
        if first.get(key) is not None:
            result[key] = torch.LongTensor(first[key]).to(self.device)
    for key in float_keys:
        if first.get(key) is not None:
            t = torch.tensor(first[key])
            if self.dtype is not None:
                t = t.to(self.dtype)
            result[key] = t.to(self.device)
    return result
```

Each subclass then just calls this with different key lists.
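Made runnable, a subclass override reduces to a single call; the class and key names here are illustrative, not the real field set:

```python
import torch

class Qwen3OmniImageCollator:
    """Illustrative collator: the shared helper plus one subclass-style call."""

    def __init__(self, device="cpu", dtype=None):
        self.device = device
        self.dtype = dtype

    def _collate_first_item(self, batch, float_keys=(), long_keys=()):
        # Shared helper: integer keys become int64, float keys keep/override dtype.
        first = batch[0]
        result = {}
        for key in long_keys:
            if first.get(key) is not None:
                result[key] = torch.LongTensor(first[key]).to(self.device)
        for key in float_keys:
            if first.get(key) is not None:
                t = torch.tensor(first[key])
                if self.dtype is not None:
                    t = t.to(self.dtype)
                result[key] = t.to(self.device)
        return result

    def collate_function(self, batch):
        return self._collate_first_item(
            batch, float_keys=("pixel_values",), long_keys=("input_ids", "attention_mask")
        )
```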
2. Duplicated preprocess_function between Qwen3OmniImageProcessor and Qwen3OmniVideoProcessor
Both build a conversation, call apply_chat_template, call process_mm_info, call self.tokenizer(...), then convert to lists with the same all_keys / tolist() pattern. The video processor adds video-byte-to-file logic but the rest is identical.
Suggestion: Factor the shared conversation-build + tokenize + serialize logic into a base method, with subclasses only overriding how they extract media content from examples.
3. get_qwen3omni_text_dataloader rebuilds get_dataset_dataloader
dataset_utils.py already has get_dataset_dataloader() which loads text samples, tokenizes, and returns a DataLoader. The new get_qwen3omni_text_dataloader does essentially the same thing but with a different tokenization step (applying chat template first). Rather than creating a whole new function with an inline _Qwen3OmniTextDataset class, consider passing a custom preprocess_fn to the existing get_dataset_dataloader or composing with it.
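The composition idea can be sketched as a factory that wraps the chat-template step; `apply_chat_template` follows the HF processor convention, and the idea of handing the result to `get_dataset_dataloader` as a `preprocess_fn` hook is an assumption about how that function could be extended:

```python
def make_chat_preprocess(tokenizer, max_length=512):
    """Build a preprocess step that applies the chat template before tokenizing."""
    def preprocess(sample_text):
        messages = [{"role": "user", "content": sample_text}]
        prompt = tokenizer.apply_chat_template(messages, tokenize=False)
        return tokenizer(prompt, truncation=True, max_length=max_length)
    return preprocess
```

The existing text dataloader would then consume `make_chat_preprocess(tokenizer)` instead of a new dataset class.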
4. _should_use_generate is defined but never used
dataset_utils.py adds _should_use_generate() but it's commented out in _forward_loop (# use_generate = _should_use_generate(model)), and the old model_type_is_enc_dec is used instead. This is dead code — either use it or remove it.
5. Duplicated Qwen3Omni generate+unpack logic (pre/post quantize)
In hf_ptq.py, the pattern:
```python
result = full_model.generate(**calib_batch, return_audio=False, thinker_max_new_tokens=100)
if isinstance(result, tuple):
    text_ids, _ = result
    generated_ids = text_ids.sequences if hasattr(text_ids, "sequences") else text_ids
else:
    generated_ids = result
```

appears identically in both pre_quantize() and post_quantize(). Extract this into a helper like _qwen3omni_generate(model, batch).
6. get_model_type_from_config duplicates MODEL_NAME_TO_TYPE matching
example_utils.py adds get_model_type_from_config() which reads config.json, then loops over MODEL_NAME_TO_TYPE doing case-insensitive substring matching. This is essentially what model_utils.py:get_model_type() already does when given a model. If the model is already loaded, just use the existing infrastructure. If you need it pre-load, consider adding a from_config classmethod to the existing detection code rather than reimplementing the matching logic in example code.
7. patch_config_for_unified_export is a post-hoc fixup for a root-cause issue
This function patches hf_quant_config.json and config.json after export to add missing exclusion patterns. This suggests the export itself isn't producing the right exclusions. The cleaner fix would be to ensure unified_export_hf.py emits the correct exclusion list in the first place, rather than patching config files after the fact.
8. Mutable default argument generation_kwargs={}
In _process_batch, _forward_loop, and create_forward_loop, the default argument is generation_kwargs={}. Mutable defaults are a well-known Python footgun — use generation_kwargs=None with if generation_kwargs is None: generation_kwargs = {} inside the function body.
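The footgun is easy to demonstrate in isolation:

```python
def leaky(value, generation_kwargs={}):
    # The default dict is created once and shared across all calls.
    generation_kwargs.setdefault("calls", 0)
    generation_kwargs["calls"] += 1
    return generation_kwargs["calls"]

def safe(value, generation_kwargs=None):
    # A fresh dict per call: the standard None-default idiom.
    if generation_kwargs is None:
        generation_kwargs = {}
    generation_kwargs.setdefault("calls", 0)
    generation_kwargs["calls"] += 1
    return generation_kwargs["calls"]
```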
9. pre_quantize return value change is a breaking API change
pre_quantize previously returned a 2-tuple (preview_input_ids, generated_ids_before_ptq) and now returns a 3-tuple adding calib_batch. Any external caller would break. Consider making calib_batch an optional part of the return or using a dataclass/named tuple.
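One hedged option is a `NamedTuple` with a defaulted third field, so attribute-based callers are unaffected and the return shape is explicit (field names here follow the review text, not the actual modelopt API):

```python
from typing import Any, NamedTuple

class PreQuantizeResult(NamedTuple):
    """Sketch of an extensible return type for pre_quantize()."""
    preview_input_ids: Any
    generated_ids_before_ptq: Any
    calib_batch: Any = None  # new field, optional for older call sites
```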
10. print_quant_summary silently drops the try/except
The old code had a try/except around print_quant_summary + save_expert_token_count_table. The new code removes it. If save_expert_token_count_table can fail (e.g., on non-MOE models), this will now crash the pipeline instead of continuing gracefully.
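The graceful-degradation pattern can be kept generic (the step names below are stand-ins, not the actual hf_ptq functions):

```python
import logging

def report_quantization(steps):
    """Run each (name, fn) reporting step; log and continue on failure."""
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # e.g. expert-table export on a non-MoE model
            logging.warning("Skipping %s: %s", name, exc)
```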
11. Qwen3OmniVideoProcessor temp file cleanup never called
The processor creates temp files in __init__ via tempfile.mkdtemp() but cleanup() is never called anywhere in the PR. This will leak temp files. Consider using __del__ or a context manager, or at minimum call cleanup() after dataloader creation.
Summary
The core model support additions (MoE block registration, MODEL_NAME_TO_TYPE entry, get_language_model_from_vl thinker support, layer_utils) are clean and minimal. The main concerns are around duplicated patterns in the processor/dataloader code and rebuilding existing infrastructure rather than extending it. Consolidating the collate, preprocess, and generate-unpack logic would meaningfully reduce the +1271 line count and maintenance burden.
a5146cc to 201fbae
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
201fbae to 3938f2e
133ee7b to 63d0af5
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
63d0af5 to 20a4b33
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Review Comment Status Check

Fully Addressed (7/12)
Still Need Follow-up (3/12)
Please address the above before I can approve.
```python
    default=None,
    help="Max model length (auto-detected from config if not specified)",
)
parser.add_argument("--prompt", type=str, default="What in Nvidia?", help="Text prompt")
```
Supposed to be "What is Nvidia?"?
```python
"batch_data values must be tensors"
if generation_kwargs is None:
    generation_kwargs = {}
# Separate tensor values from scalar parameters (like max_new_tokens)
```
Would max_new_tokens be in generation_kwargs? Why is it in batch_data? Are there other non-tensors in batch_data?
```python
split_data_2.update(scalar_data)

# Recursively process each half and track max working batch size
max_working_batch_size = _process_batch(split_data_1, infer_method)
```
Do you need to pass the generation_kwargs through here too?
```diff
@@ -0,0 +1,277 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
[nit] bump copyright year from 2024
```python
"""
assert processor is not None, "Please provide a valid processor."

# Default cache_dir to temp directory
```
Do we need some additional modification on top of the Huggingface ~/.cache/huggingface cache?
```python
processed_dataset = None

# Try to load from cache (use torch.save/load to avoid Arrow 32-bit offset overflow)
```
Is this Arrow 32-bit offset issue specific to the Qwen3OmniVideoProcessor processor and/or the finevideo dataset? If so, can we special case it?
```python
super().__init__(tokenizer, device)
self.dtype = dtype
self.use_audio_in_video = use_audio_in_video
self._temp_dir = tempfile.mkdtemp(prefix="qwen3omni_video_")
```
If you can switch to tempfile.TemporaryDirectory, then the clean up is automatic and you can remove the cleanup() and __del__()
https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory
```python
import os
import tempfile

def demo_cleanup():
    tmp = tempfile.TemporaryDirectory(prefix="demo_")
    print(f"Created: {tmp.name}")
    print(f"Exists before crash: {os.path.exists(tmp.name)}")
    raise RuntimeError("crash!")

if __name__ == "__main__":
    demo_cleanup()
```
What does this PR do?
Type of change:
Model support
Overview:
Usage
Testing
Able to quantize model, export hf_checkpoint and run inference with vLLM:
Before your PR is "Ready for review"
Summary by CodeRabbit
New Features
Bug Fixes
Chores