[BUG] ParoQuant quantization takes too long, JIT compilation failed #2898

**Describe the bug**

I used ParoQuant to quantize Qwen3.5 27B. The log reports that the ParoQuant rotation `torch.ops` JIT kernel compilation failed and that the fallback path is used. Quantization on a single H100 takes two days.

**GPU Info**

Show output of:

```
nvidia-smi
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   36C    P0             71W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   57C    P0            699W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   56C    P0            699W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   63C    P0            698W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   65C    P0            697W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   57C    P0            698W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   67C    P0            699W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   55C    P0            699W /  700W |   58333MiB /  81559MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
```

**Software Info**

Operating System/Version + Python Version: Python 3.10 (reported by the `fla.utils` warning in the log); operating system not given.

Show output of:
```
pip show gptqmodel torch transformers accelerate triton

Name: GPTQModel
Version: 7.0.0

---
Name: torch
Version: 2.11.0

---
Name: transformers
Version: 5.8.0

---
Name: accelerate
Version: 1.13.0

---
Name: triton
Version: 3.6.0

```


**To Reproduce**

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import json
import os
from typing import List, Dict, Any
import argparse

from gptqmodel import GPTQModel
from gptqmodel.quantization.config import ParoConfig
from gptqmodel.quantization import QuantizeConfig, FORMAT, METHOD


# Omitted from this report: the helpers load_conversations() and
# conversations_to_calibration_texts() that main() calls.


def parse_args():
    p = argparse.ArgumentParser()

    # Omitted from this report: the arguments that populate args.model_id,
    # args.jsonl_path, and args.output_dir.

    p.add_argument("--bits", type=int, default=4, help="Quantization bit width")
    p.add_argument("--group-size", type=int, default=128, help="Group size")
    p.add_argument("--max-samples", type=int, default=128, help="Maximum number of samples used for calibration")
    p.add_argument("--batch-size", type=int, default=1, help="Batch size during calibration")
    p.add_argument("--max-length", type=int, default=4096, help="Maximum token length of each calibration sample")

    # Redirect Function

    p.add_argument("--keep-system", action="store_true", help="Keep system messages")
    p.add_argument("--drop-assistant", action="store_true", help="Drop assistant replies")

    return p.parse_args()


def main():
    args = parse_args()

    # 4-bit, group size 128 by default (see parse_args).
    qcfg = ParoConfig(bits=args.bits, group_size=args.group_size)
    model = GPTQModel.load(args.model_id, qcfg, trust_remote_code=True)

    conversations = load_conversations(
        args.jsonl_path,
        keep_system=args.keep_system,
        drop_assistant=args.drop_assistant,
        max_samples=args.max_samples,
    )

    # Render each conversation through the tokenizer's chat template.
    calibration_dataset = conversations_to_calibration_texts(
        conversations,
        model.tokenizer,
        max_length=args.max_length,
    )
    if not calibration_dataset:
        raise RuntimeError("No usable text after applying chat_template; please check the messages format.")

    model.quantize(calibration_dataset, batch_size=args.batch_size)
    os.makedirs(args.output_dir, exist_ok=True)
    model.save(args.output_dir)


if __name__ == "__main__":
    main()
```


**Expected behavior**

Why does the kernel compilation fail? I updated both nvcc and gcc, but neither fixed it.
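
For anyone triaging this, a minimal sketch of a toolchain check that could be run in the same environment. It assumes the ParoQuant extension is built through PyTorch's C++/CUDA extension loader, which needs `ninja`, a host compiler, `nvcc`, and the Python development headers; that assumption is not confirmed by the log. The function name `check_jit_prerequisites` is hypothetical and is not part of GPTQModel.

```python
import shutil
import sysconfig
from pathlib import Path


def check_jit_prerequisites():
    """Return a dict mapping each build prerequisite to the path found, or None."""
    # Python.h lives in the interpreter's include directory when dev headers are installed.
    python_h = Path(sysconfig.get_paths()["include"]) / "Python.h"
    report = {
        "nvcc": shutil.which("nvcc"),
        "g++": shutil.which("g++"),
        "ninja": shutil.which("ninja"),
        "Python.h": str(python_h) if python_h.is_file() else None,
    }
    try:
        # CUDA_HOME is the CUDA toolkit root that PyTorch resolved for building extensions.
        from torch.utils.cpp_extension import CUDA_HOME
        report["CUDA_HOME"] = CUDA_HOME
    except Exception:
        # torch is missing or could not be imported in this environment.
        report["CUDA_HOME"] = None
    return report


if __name__ == "__main__":
    for name, found in check_jit_prerequisites().items():
        print(f"{name:10s} {found or 'MISSING'}")
```

Any entry printed as `MISSING` would be a candidate explanation for a build that fails before the compiler starts, but this is only a hint and not a diagnosis.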

**Model/Datasets**

Qwen3.5 27B (dense)

**Screenshots**

<img width="1008" height="549" alt="Image" src="https://github.com/user-attachments/assets/aa1b0080-a10f-4145-bc06-156b5a517e3d" />

**Additional context**

Filtering the run log for compiler errors:

`grep -n "fatal error\|error:\|Python.h\|nvcc\|CUDA\|fallback" /tmp/paroquant_run.log | head -200`

INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.

INFO ParoQuant rotation: torch.ops JIT compilation failed in 0.0s (estimated ~78s, -78s); using fallback path.

WARN  Python GIL is enabled: Multi-gpu quant acceleration for MoE models is sub-optimal and multi-core accelerated cpu packing is also disabled. We recommend Python >= 3.13.3t with Pytorch > 2.8 for mult-gpu quantization and multi-cpu packing with env `PYTHON_GIL=0`.
INFO  ENV: Auto setting PYTORCH_ALLOC_CONF='expandable_segments:True,max_split_size_mb:256,garbage_collection_threshold:0.7' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.          
INFO  

┌─────────────┐    ┌────────────────────────┐    ┌────────────┐    ┌─────────┐
│ GPT-QModel  │ -> │ ▓▓▓▓▓▓▓▓▓▓▓▓ 16bit     │ -> │ ▒▒▒▒ 8bit  │ -> │ ░░ 4bit │
└─────────────┘    └────────────────────────┘    └────────────┘    └─────────┘
GPT-QModel   : 7.0.0
Transformers : 5.8.0
Torch        : 2.11.0+cu130
Triton       : 3.6.0
Loading model ...
INFO  QuantizeConfig: offload_to_disk_path auto set to temporary dir `/tmp/gptqmodel__jszh3ej`
WARNING:fla.utils:Current Python version 3.10 is below the recommended 3.11 version. It is recommended to upgrade to Python 3.11 or higher for the best experience.
HF: overriding trust_remote_code=True to False for `/qwen35_27B_v2` because model_type `qwen3_5` is integrated in installed transformers as `Qwen3_5ForCausalLM`.
INFO  Loader: Auto dtype (native bfloat16): `torch.bfloat16`                   
INFO  Estimated Quantization BPW (bits per weight): 4.2875 bpw, based on [bits: 4, group_size: 128]
INFO  Loader: using checkpoint-backed lazy turtle source for `/qwen35_27B_v2`   
INFO:tokenicer.tokenicer:Tokenicer: Auto fixed pad_token_id=248044 (token='<|endoftext|>').
INFO  Model: Loaded `generation_config`: GenerationConfig {
  "eos_token_id": 248044,
  "output_attentions": false,
  "output_hidden_states": false,
  "use_cache": true
}

INFO  Model: `generation_config.json` not found. Skipped checking.             
INFO  Kernel: loaded -> `[]`                                                   
Number of raw conversations: 128
[template] usable=128 bad=0
INFO  Packing Kernel: selected: `ParoLinear`                                   
INFO  Packing Kernel: selected: `ParoLinear`                                   
WARN  Calibration dataset size should be more than 256. Current: 128.          
INFO  Calibration: Sort in descending order by length                          
INFO  Calibration: Total padded tokens: 0                                      
INFO  Calibration: Total non-padded tokens: 524236                             
INFO  Calibration: Total tokens: 524236                                        
INFO  Disk subsystem write throughput detected at 832.0 MB/s.                  
INFO  ModuleLooper: capturing layer inputs from 128 calibration batches        
INFO  Offloading base modules to disk...                                        
INFO  ParoQuant: prewarming fused rotation extension...                        
INFO  ParoQuant rotation: compiling torch.ops JIT extension in `/root/.cache/gptqmodel/torch_extensions/paroquant/34bedede7df97941`.
INFO  ParoQuant rotation: torch.ops JIT compilation failed in 0.0s (estimated ~78s, -78s); using fallback path.
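
A failure reported after 0.0s may mean the compiler was never invoked, though the log does not say so. One low-risk way to look into it is to list what the build directory named in the log actually contains. The sketch below uses only the standard library; the helper name `list_build_artifacts` is hypothetical, and which files a successful build would leave behind is not something the log tells us.

```python
from pathlib import Path


def list_build_artifacts(build_dir):
    """Return sorted (relative_path, size_in_bytes) pairs for files under build_dir."""
    root = Path(build_dir).expanduser()
    if not root.is_dir():
        # The directory was never created or has been removed.
        return []
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # Directory copied from the "compiling torch.ops JIT extension" log line.
    build_dir = "/root/.cache/gptqmodel/torch_extensions/paroquant/34bedede7df97941"
    for name, size in list_build_artifacts(build_dir):
        print(f"{size:>12d}  {name}")
```

An empty listing would be consistent with the build aborting before any source or build file was written, but other explanations are possible.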
