RuntimeError during INT4 AWQ quantization of Qwen3-Next-80B-A3B-Instruct: probability tensor contains inf/nan #850

@alexschexc

Description

When attempting to quantize Qwen3-Next-80B-A3B-Instruct using the HF PTQ example with INT4 AWQ quantization, the calibration process appears to complete successfully, but the post-quantization generation step fails with a RuntimeError indicating that the probability tensor contains invalid values (inf, nan, or negative elements).

The error occurs during the post_quantize() function's preview generation call, specifically when torch.multinomial() attempts to sample from the model's output probabilities.
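
If it helps, this kind of failure can usually be confirmed without going through generate() at all, since torch.multinomial only surfaces non-finite values that are already present in the logits. A rough check we can run after quantization (model/tokenizer loading omitted; the prompt and function name below are placeholders, not taken from hf_ptq.py):

import torch

def check_logits_finite(model, tokenizer, prompt="Hello"):
    # One forward pass over a trivial prompt; no torch.multinomial involved.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits  # [batch, seq_len, vocab_size]
    bad = ~torch.isfinite(logits)
    if bad.any():
        print(f"{bad.sum().item()} / {logits.numel()} logits are inf/nan")
    return not bad.any().item()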
Environment

  • nvidia-modelopt: 0.41.0

  • PyTorch: 2.9.0a0+145a3a7bda.nv25.10

  • CUDA: 13.0

  • cuDNN: 91400

  • transformers: 4.57.6

  • tensorrt: 10.13.3.9

  • tensorrt_llm: 1.2.0rc6

  • accelerate: 1.12.0

  • Python: 3.12.3

  • GPU: NVIDIA A100 80GB PCIe (compute capability 8.0)

  • OS: Ubuntu 24.04.3 LTS (running in Docker container)

  • ModelOpt Git commit: d39cf45

Model Details

  • Model: Qwen3-Next-80B-A3B-Instruct

  • Architecture: Sparse MoE with 512 experts, 10 experts per token

  • Layers: 48

  • Special features: hybrid attention (full attention every 4th layer, full_attention_interval: 4) and MoE in every decoder layer (decoder_sparse_step: 1)

  • Source: Local model directory (originally from Hugging Face)

Reproduction Steps

  1. Clone/install NVIDIA Model Optimizer (commit d39cf45)

  2. Download Qwen3-Next-80B-A3B-Instruct model weights

  3. Run the following command:

cd TensorRT-Model-Optimizer/examples/llm_ptq/scripts

./huggingface_example.sh \
  --model path/to/my/model/Qwen3-Next-80B-A3B-Instruct/ \
  --quant int4_awq \
  --calib 32 \
  --calib_batch_size 8 \
  --batch 8 \
  --calib_dataset cnn_dailymail
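
For reference, the script ultimately drives ModelOpt's PTQ API; a rough Python-level sketch of what we are asking it to do is below. This is simplified and hedged: the calibration texts, dtype handling, and model loading differ from what hf_ptq.py actually does, and only mtq.quantize and the stock INT4_AWQ_CFG config are taken from modelopt.torch.quantization.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

model_path = "path/to/my/model/Qwen3-Next-80B-A3B-Instruct/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="cpu"
)

# Stand-in for the 32 cnn_dailymail calibration samples used above.
calib_texts = ["The quick brown fox jumps over the lazy dog."] * 32

def forward_loop(m):
    # Run the calibration data through the model so AWQ can collect statistics.
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        m(ids)

model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)

# The failure then shows up on the first generation after quantization.
out = model.generate(tokenizer("Hello", return_tensors="pt").input_ids, max_new_tokens=100)
print(tokenizer.decode(out[0]))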

Expected Behavior
The post-quantization generation should complete successfully, producing text output (even if the quality is degraded compared to the full-precision model), and the quantized model should not produce inf or nan values in its logits/probabilities. Note that, due to hardware constraints, we have modified huggingface_example.sh to pass --device=cpu to hf_ptq.py: the server has a single NVIDIA A100 with 80 GB of VRAM, but the CPU has access to 216 GB of system RAM.

Actual Behavior
The quantization process appears to complete calibration, but fails during the post-quantization preview generation step with the following traceback:

Traceback (most recent call last):
  File "/storage/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1025, in <module>
    main(args)
  File "/storage/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1004, in main
    quantize_main(
  File "/storage/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 797, in quantize_main
    post_quantize(
  File "/storage/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 623, in post_quantize
    generated_ids_after_ptq = full_model.generate(preview_input_ids, max_new_tokens=100)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 2566, in generate
    result = decoding_method(
             ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 2831, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
######## GPU 0: Peak memory usage = 1.17 GB for all processes on the GPU ########

Additional Context

  • The relatively low peak memory usage (1.17 GB) suggests the model loaded and ran through calibration, but the quantized inference is producing numerically invalid outputs.

  • This is a production use case where INT4 quantization is needed to fit the 80B MoE model efficiently on available hardware.

  • I'm happy to provide additional debugging information, test alternative configurations, or run instrumented versions of the code if that would help diagnose the root cause; for example, I can run a calibration-scale check like the one sketched just after this list.
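
One such check: scanning the calibrated model for non-finite quantizer scales before generation is attempted. This is a hedged sketch; it assumes the quantizer submodules that ModelOpt inserts expose an amax buffer, which may vary across versions.

import torch

def report_nonfinite_amax(model):
    # Walk all submodules and flag any quantizer whose calibrated amax
    # contains inf/nan (the `amax` attribute name is assumed from ModelOpt
    # internals and may differ in other versions).
    for name, module in model.named_modules():
        amax = getattr(module, "amax", None)
        if isinstance(amax, torch.Tensor) and not torch.isfinite(amax).all():
            print(f"non-finite amax in {name}")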

Workarounds Attempted

As noted above, we are quantizing on the CPU to work around the VRAM limit.
We arrived at this setup after numerous attempted workarounds that involved modifying code in both ModelOpt and TensorRT-LLM in a handful of places to force quantization to complete. We are not sure where to go from here.
