Describe the bug
I used ParoQuant to quantize Qwen3.5 27B, but the log reports that JIT compilation of the rotation kernel failed and the fallback path was used. Quantization on a single H100 then takes two days.
GPU Info
Show output of:
nvidia-smi
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 36C P0 71W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 57C P0 699W / 700W | 58333MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 56C P0 699W / 700W | 58333MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 63C P0 698W / 700W | 58333MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 65C P0 697W / 700W | 58333MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 57C P0 698W / 700W | 58333MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 67C P0 699W / 700W | 58333MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 55C P0 699W / 700W | 58333MiB / 81559MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
Software Info
Operating System/Version + Python Version
Show output of:
pip show gptqmodel torch transformers accelerate triton
Name: GPTQModel
Version: 7.0.0
---
Name: torch
Version: 2.11.0
---
Name: transformers
Version: 5.8.0
---
Name: accelerate
Version: 1.13.0
---
Name: triton
Version: 3.6.0
To Reproduce
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json
import os
from typing import List, Dict, Any
import argparse
from gptqmodel import GPTQModel
from gptqmodel.quantization.config import ParoConfig
from gptqmodel.quantization import QuantizeConfig, FORMAT, METHOD
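def parse_args():
    p = argparse.ArgumentParser()
    # The next three arguments are not shown in the original report; they are
    # reconstructed from how main() uses args.model_id, args.jsonl_path and args.output_dir.
    p.add_argument("--model-id", required=True)
    p.add_argument("--jsonl-path", required=True)
    p.add_argument("--output-dir", required=True)
    p.add_argument("--bits", type=int, default=4, help="Quantization bit width")
    p.add_argument("--group-size", type=int, default=128, help="Group size")
    p.add_argument("--max-samples", type=int, default=128, help="Maximum number of samples used for calibration")
    p.add_argument("--batch-size", type=int, default=1, help="Batch size during calibration")
    p.add_argument("--max-length", type=int, default=4096, help="Maximum token length of each calibration sample")
    # Redirect Function
    p.add_argument("--keep-system", action="store_true", help="Keep system messages")
    p.add_argument("--drop-assistant", action="store_true", help="Drop assistant replies")
    return p.parse_args()
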
# load_conversations() and conversations_to_calibration_texts() are helper
# functions that are not included in this report.
def main():
    args = parse_args()
    qcfg = ParoConfig(bits=args.bits, group_size=args.group_size)
    model = GPTQModel.load(args.model_id, qcfg, trust_remote_code=True)
    conversations = load_conversations(
        args.jsonl_path,
        keep_system=args.keep_system,
        drop_assistant=args.drop_assistant,
        max_samples=args.max_samples,
    )
    calibration_dataset = conversations_to_calibration_texts(
        conversations,
        model.tokenizer,
        max_length=args.max_length,
    )
    if not calibration_dataset:
        raise RuntimeError("No usable text after applying chat_template; please check the messages format.")
    model.quantize(calibration_dataset, batch_size=args.batch_size)
    os.makedirs(args.output_dir, exist_ok=True)
    model.save(args.output_dir)

if __name__ == "__main__":
    main()
Expected behavior
Why did the kernel compilation fail? I updated both nvcc and gcc, but neither change fixed it.
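A failure reported after 0.0s, against an estimate of about 78s, may mean the build stopped before any compiler was invoked, although that is only a guess from the timing. As a minimal sketch for narrowing this down, the script below collects the things a PyTorch JIT C++/CUDA extension build commonly depends on. The function name `check_build_prereqs` is hypothetical and not part of GPTQModel, and the list of checks is an assumption about typical requirements, not a statement of what ParoQuant actually needs.

```python
import os
import shutil
import sysconfig


def check_build_prereqs():
    """Collect facts that a PyTorch JIT C++/CUDA extension build commonly relies on.

    This is an illustrative checklist, not GPTQModel's own logic.
    """
    include_dir = sysconfig.get_paths()["include"]
    report = {
        # Executables looked up on PATH; None means "not found".
        "nvcc": shutil.which("nvcc"),
        "ninja": shutil.which("ninja"),
        "c++": shutil.which("c++") or shutil.which("g++"),
        # Environment hints for the CUDA toolkit location.
        "CUDA_HOME": os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH"),
        # Python development headers (the grep in this report also looks for Python.h).
        "Python.h": os.path.isfile(os.path.join(include_dir, "Python.h")),
    }
    try:
        import torch
        from torch.utils import cpp_extension

        report["torch_cuda"] = torch.version.cuda          # CUDA version torch was built with
        report["torch_CUDA_HOME"] = cpp_extension.CUDA_HOME  # toolkit path torch detected
    except Exception:
        # torch is missing or failed to import; record that instead of crashing.
        report["torch_cuda"] = None
        report["torch_CUDA_HOME"] = None
    return report


if __name__ == "__main__":
    for key, value in check_build_prereqs().items():
        status = "OK" if value else "MISSING"
        print(f"{key:16} {status:8} {value}")
```

Any row marked MISSING is a candidate cause worth ruling out first. A mismatch between the `nvcc` on PATH and the CUDA version torch was built with (cu130 in this report) is another thing to compare by hand.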
Model/Datasets
Qwen3.5 27B (dense)
Screenshots
Additional context
grep -n "fatal error|error:|Python.h|nvcc|CUDA|fallback" /tmp/paroquant_run.log | head -200
INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO ParoQuant rotation: torch.ops JIT compilation failed in 0.0s (estimated ~78s, -78s); using fallback path.
WARN Python GIL is enabled: Multi-gpu quant acceleration for MoE models is sub-optimal and multi-core accelerated cpu packing is also disabled. We recommend Python >= 3.13.3t with Pytorch > 2.8 for mult-gpu quantization and multi-cpu packing with env PYTHON_GIL=0.
INFO ENV: Auto setting PYTORCH_ALLOC_CONF='expandable_segments:True,max_split_size_mb:256,garbage_collection_threshold:0.7' for memory saving.
INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO
┌─────────────┐ ┌────────────────────────┐ ┌────────────┐ ┌─────────┐
│ GPT-QModel │ -> │ ▓▓▓▓▓▓▓▓▓▓▓▓ 16bit │ -> │ ▒▒▒▒ 8bit │ -> │ ░░ 4bit │
└─────────────┘ └────────────────────────┘ └────────────┘ └─────────┘
GPT-QModel : 7.0.0
Transformers : 5.8.0
Torch : 2.11.0+cu130
Triton : 3.6.0
Loading model ...
INFO QuantizeConfig: offload_to_disk_path auto set to temporary dir /tmp/gptqmodel__jszh3ej
WARNING:fla.utils:Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
HF: overriding trust_remote_code=True to False for /qwen35_27B_v2 because model_type qwen3_5 is integrated in installed transformers as Qwen3_5ForCausalLM.
INFO Loader: Auto dtype (native bfloat16): torch.bfloat16
INFO Estimated Quantization BPW (bits per weight): 4.2875 bpw, based on [bits: 4, group_size: 128]
INFO Loader: using checkpoint-backed lazy turtle source for /qwen35_27B_v2
INFO:tokenicer.tokenicer:Tokenicer: Auto fixed pad_token_id=248044 (token='<|endoftext|>').
INFO Model: Loaded generation_config: GenerationConfig {
"eos_token_id": 248044,
"output_attentions": false,
"output_hidden_states": false,
"use_cache": true
}
INFO Model: generation_config.json not found. Skipped checking.
INFO Kernel: loaded -> []
Number of raw conversations: 128
[template] usable=128 bad=0
INFO Packing Kernel: selected: ParoLinear
INFO Packing Kernel: selected: ParoLinear
WARN Calibration dataset size should be more than 256. Current: 128.
INFO Calibration: Sort in descending order by length
INFO Calibration: Total padded tokens: 0
INFO Calibration: Total non-padded tokens: 524236
INFO Calibration: Total tokens: 524236
INFO Disk subsystem write throughput detected at 832.0 MB/s.
INFO ModuleLooper: capturing layer inputs from 128 calibration batches
INFO Offloading base modules to disk...
INFO ParoQuant: prewarming fused rotation extension...
INFO ParoQuant rotation: compiling torch.ops JIT extension in /root/.cache/gptqmodel/torch_extensions/paroquant/34bedede7df97941.
INFO ParoQuant rotation: torch.ops JIT compilation failed in 0.0s (estimated ~78s, -78s); using fallback path.
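The grep output in this report only surfaces the summary line, not whatever was logged around it. As a rough sketch, the same search can be done in Python with explicit regex alternation and a few lines of context on each side of every match. The names `find_with_context` and `scan_log` are hypothetical helpers, and whether the log actually contains a more detailed error near the failure line is not known from this report.

```python
import re

# Same alternatives as the grep pattern in this report, written as a Python regex.
PATTERN = re.compile(r"fatal error|error:|Python\.h|nvcc|CUDA|fallback")


def find_with_context(lines, pattern=PATTERN, before=2, after=5):
    """Return (line_number, text) pairs for matching lines plus surrounding context."""
    keep = set()
    for i, line in enumerate(lines):
        if pattern.search(line):
            start = max(0, i - before)
            stop = min(len(lines), i + after + 1)
            keep.update(range(start, stop))
    # Line numbers are 1-based, like grep -n.
    return [(i + 1, lines[i].rstrip("\n")) for i in sorted(keep)]


def scan_log(path, **kwargs):
    """Read a log file and print every match with its context lines."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for number, text in find_with_context(fh.readlines(), **kwargs):
            print(f"{number}: {text}")
```

For example, `scan_log("/tmp/paroquant_run.log", before=5, after=20)` would print the lines surrounding the "JIT compilation failed" message, which is where a compiler or loader error would be expected if one was logged.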