
[BUG] ParoQuant Quant takes too long, Jit Compilation failed #2898

@Jealousc11gx

Describe the bug

I used ParoQuant to quantize Qwen3.5 27B, but the log reports that JIT compilation of the rotation kernel failed and the fallback path was used. Quantization on a single H100 takes two days.

GPU Info

Show output of:

nvidia-smi
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   36C    P0             71W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   57C    P0            699W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   56C    P0            699W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   63C    P0            698W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   65C    P0            697W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   57C    P0            698W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   67C    P0            699W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   55C    P0            699W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
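In the table above, GPU 0 is idle while GPUs 1 to 7 are at 100% utilization with about 58 GiB in use. As a minimal sketch (not part of the original report), the helper below picks an idle GPU from `nvidia-smi` CSV output so the quantization job can be pinned to it. The function names and thresholds are hypothetical; the `--query-gpu` fields are standard `nvidia-smi` options.

```python
import csv
import io
import subprocess


def pick_idle_gpu(csv_text, max_util=5, max_mem_mib=1024):
    """Return the index of the first GPU under both thresholds, or None."""
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, util, mem = (int(field.strip()) for field in row)
        if util <= max_util and mem <= max_mem_mib:
            return index
    return None


def query_gpus():
    """Ask nvidia-smi for index, utilization (%), and used memory (MiB)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Values copied by hand from the nvidia-smi table in this report.
SAMPLE = "\n".join(
    ["0, 0, 4"] + [f"{i}, 100, 58333" for i in range(1, 8)]
)

print("idle GPU in the reported table:", pick_idle_gpu(SAMPLE))
```

On a live machine, `pick_idle_gpu(query_gpus())` would replace the hard-coded sample. The chosen index can then be exported as `CUDA_VISIBLE_DEVICES` before `torch` is imported. The indices line up with `nvidia-smi` only when `CUDA_DEVICE_ORDER=PCI_BUS_ID`, which the log in this report shows GPTQModel setting automatically.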

Software Info

Operating System/Version + Python Version

Show output of:

pip show gptqmodel torch transformers accelerate triton

Name: GPTQModel
Version: 7.0.0

---
Name: torch
Version: 2.11.0

---
Name: transformers
Version: 5.8.0

---
Name: accelerate
Version: 1.13.0

---
Name: triton
Version: 3.6.0

To Reproduce

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import json
import os
from typing import List, Dict, Any
import argparse

from gptqmodel import GPTQModel
from gptqmodel.quantization.config import ParoConfig
from gptqmodel.quantization import QuantizeConfig, FORMAT, METHOD

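def parse_args():
    p = argparse.ArgumentParser()
    # The arguments behind args.model_id, args.jsonl_path, and args.output_dir were not included in this report.
    p.add_argument("--bits", type=int, default=4, help="Quantization bit width")
    p.add_argument("--group-size", type=int, default=128, help="Group size")
    p.add_argument("--max-samples", type=int, default=128, help="Maximum number of samples used for calibration")
    p.add_argument("--batch-size", type=int, default=1, help="Batch size during calibration")
    p.add_argument("--max-length", type=int, default=4096, help="Maximum token length per calibration sample")

    # Redirect Function

    p.add_argument("--keep-system", action="store_true", help="Keep system messages")
    p.add_argument("--drop-assistant", action="store_true", help="Drop assistant replies")

    return p.parse_args()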

def main():
    args = parse_args()

    qcfg = ParoConfig(bits=args.bits, group_size=args.group_size)
    model = GPTQModel.load(args.model_id, qcfg, trust_remote_code=True)

    # load_conversations() and conversations_to_calibration_texts() are helpers that were not included in this report.
    conversations = load_conversations(
        args.jsonl_path,
        keep_system=args.keep_system,
        drop_assistant=args.drop_assistant,
        max_samples=args.max_samples,
    )

    calibration_dataset = conversations_to_calibration_texts(
        conversations,
        model.tokenizer,
        max_length=args.max_length,
    )
    if not calibration_dataset:
        raise RuntimeError("No usable text after applying chat_template; check the messages format.")

    model.quantize(calibration_dataset, batch_size=args.batch_size)
    os.makedirs(args.output_dir, exist_ok=True)
    model.save(args.output_dir)


if __name__ == "__main__":
    main()

Expected behavior

Why did kernel compilation fail? I updated nvcc and gcc, but neither helped.

Model/Datasets

Qwen3.5 27B (dense)

Screenshots

Image

Additional context

grep -n "fatal error|error:|Python.h|nvcc|CUDA|fallback" /tmp/paroquant_run.log | head -200
INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO ParoQuant rotation: torch.ops JIT compilation failed in 0.0s (estimated ~78s, -78s); using fallback path.
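A failure reported after 0.0s may mean the build aborted before any compiler ran, although the log does not say so. The sketch below checks, using only the standard library, whether the tools a JIT-compiled CUDA extension typically needs are visible to the Python process. It assumes the extension is built through PyTorch's `torch.utils.cpp_extension`, which requires `ninja` and locates CUDA through `CUDA_HOME` or `nvcc` on `PATH`; the function names are hypothetical and this is not GPTQModel's own diagnostic.

```python
import os
import shutil
import sysconfig


def check_jit_toolchain():
    """Collect facts about the tools a JIT-compiled CUDA extension usually needs."""
    include_dir = sysconfig.get_paths()["include"]
    return {
        "nvcc": shutil.which("nvcc"),
        "ninja": shutil.which("ninja"),
        "c++": shutil.which("c++") or shutil.which("g++"),
        # Informational only: CUDA_HOME may be unset when nvcc is on PATH.
        "cuda_home": os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH"),
        # Python.h ships with the Python development headers.
        "python_h": os.path.isfile(os.path.join(include_dir, "Python.h")),
    }


def missing_items(report):
    """Return the required entries that were not found, sorted by name."""
    required = ("nvcc", "ninja", "c++", "python_h")
    return sorted(name for name in required if not report.get(name))


report = check_jit_toolchain()
for name, value in report.items():
    print(f"{name:10s} {value}")
print("missing:", missing_items(report) or "nothing")
```

Running this in the same environment and shell as the quantization script matters, because `PATH` and `CUDA_HOME` can differ between an interactive shell and a job launcher. A clean result here would point away from a missing tool and toward something else, such as a mismatch between the system `nvcc` and the CUDA version of the installed `torch` wheel (`2.11.0+cu130` in this report).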

WARN Python GIL is enabled: Multi-gpu quant acceleration for MoE models is sub-optimal and multi-core accelerated cpu packing is also disabled. We recommend Python >= 3.13.3t with Pytorch > 2.8 for mult-gpu quantization and multi-cpu packing with env PYTHON_GIL=0.
INFO ENV: Auto setting PYTORCH_ALLOC_CONF='expandable_segments:True,max_split_size_mb:256,garbage_collection_threshold:0.7' for memory saving.
INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO

┌─────────────┐ ┌────────────────────────┐ ┌────────────┐ ┌─────────┐
│ GPT-QModel │ -> │ ▓▓▓▓▓▓▓▓▓▓▓▓ 16bit │ -> │ ▒▒▒▒ 8bit │ -> │ ░░ 4bit │
└─────────────┘ └────────────────────────┘ └────────────┘ └─────────┘
GPT-QModel : 7.0.0
Transformers : 5.8.0
Torch : 2.11.0+cu130
Triton : 3.6.0
Loading model ...
INFO QuantizeConfig: offload_to_disk_path auto set to temporary dir /tmp/gptqmodel__jszh3ej
WARNING:fla.utils:Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
HF: overriding trust_remote_code=True to False for /qwen35_27B_v2 because model_type qwen3_5 is integrated in installed transformers as Qwen3_5ForCausalLM.
INFO Loader: Auto dtype (native bfloat16): torch.bfloat16
INFO Estimated Quantization BPW (bits per weight): 4.2875 bpw, based on [bits: 4, group_size: 128]
INFO Loader: using checkpoint-backed lazy turtle source for /qwen35_27B_v2
INFO:tokenicer.tokenicer:Tokenicer: Auto fixed pad_token_id=248044 (token='<|endoftext|>').
INFO Model: Loaded generation_config: GenerationConfig {
"eos_token_id": 248044,
"output_attentions": false,
"output_hidden_states": false,
"use_cache": true
}

INFO Model: generation_config.json not found. Skipped checking.
INFO Kernel: loaded -> []
Number of raw conversations: 128
[template] usable=128 bad=0
INFO Packing Kernel: selected: ParoLinear
INFO Packing Kernel: selected: ParoLinear
WARN Calibration dataset size should be more than 256. Current: 128.
INFO Calibration: Sort in descending order by length
INFO Calibration: Total padded tokens: 0
INFO Calibration: Total non-padded tokens: 524236
INFO Calibration: Total tokens: 524236
INFO Disk subsystem write throughput detected at 832.0 MB/s.
INFO ModuleLooper: capturing layer inputs from 128 calibration batches
INFO Offloading base modules to disk...
INFO ParoQuant: prewarming fused rotation extension...
INFO ParoQuant rotation: compiling torch.ops JIT extension in /root/.cache/gptqmodel/torch_extensions/paroquant/34bedede7df97941.
INFO ParoQuant rotation: torch.ops JIT compilation failed in 0.0s (estimated ~78s, -78s); using fallback path.
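Separately from the compilation failure, the token counts in the log give a rough sense of the calibration workload. The arithmetic below uses only numbers from the log and the script's `--max-length` default; it assumes that default was not overridden, and the helper `tokens_at` is a hypothetical name for illustration.

```python
total_tokens = 524236  # "Calibration: Total tokens" from the log
num_samples = 128      # calibration batches reported by ModuleLooper
max_length = 4096      # --max-length default in the reproduction script

# Average sample length and how close the dataset is to the truncation cap.
avg_tokens = total_tokens / num_samples
fill_ratio = total_tokens / (num_samples * max_length)


def tokens_at(new_max_length):
    """Upper bound on calibration tokens if samples were truncated at new_max_length."""
    return num_samples * new_max_length


print(f"average tokens per sample: {avg_tokens:.2f}")
print(f"fraction of the {max_length}-token cap in use: {fill_ratio:.4f}")
print(f"token upper bound at max_length=2048: {tokens_at(2048)}")
```

Nearly every sample sits at the 4096-token cap, so a lower `--max-length` would shrink the calibration token count roughly in proportion. Whether that shortens the two-day runtime materially, or affects quantization quality, is not something this report establishes.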
