# Quantizing Transformers Models: GPTQ

## 1. Environment Setup and Dependency Check

In [3]:
import os

os.environ['http_proxy'] = 'http://127.0.0.1:1087'
os.environ['https_proxy'] = 'http://127.0.0.1:1087'

In [1]:
import torch

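# Basic CUDA availability check
print("CUDA available:", torch.cuda.is_available())  # must be True
print("CUDA version PyTorch was built with:", torch.version.cuda)  # should match your PyTorch build (12.1 in this run)

# Exercise cholesky, the call that previously raised an error (a key operation during quantization)
try:
    # Build a symmetric positive-definite matrix on the GPU (cholesky requires one)
    x = torch.randn(5, 5).cuda()
    x = x @ x.T + 5 * torch.eye(5).cuda()  # guarantees positive definiteness
    y = torch.linalg.cholesky(x)  # the operation that failed before
    print("CUDA linear algebra test passed!")
except Exception as e:
    print("Still failing:", e)

CUDA available: True
CUDA version PyTorch was built with: 12.1
CUDA linear algebra test passed!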
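
Besides CUDA itself, the quantization path depends on several Python packages. The sketch below lists which ones are installed and at what version. The package list is an assumption for illustration (adjust it to your setup), and `installed_versions` is a hypothetical helper, not part of any library.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list; edit to match the backend you actually use.
for name, version in installed_versions(
    ["torch", "transformers", "optimum", "auto-gptq"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

A `not installed` entry is a hint to install the missing package before running the quantization cells.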


## 2. GPTQ Quantization with a Standard Dataset (Wikitext2)

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

model_name_or_path = "/home/cc/models/opt/facebook-opt-6.7b"

# Configure the GPTQ quantization parameters
quantization_config = GPTQConfig(
    bits=4, 
    group_size=128,  
    dataset="wikitext2", 
    desc_act=False, 
)
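
To get a feel for how `bits` and `group_size` trade off, the sketch below estimates the average storage cost per weight. It assumes one fp16 scale and one `bits`-wide zero point per group, and it ignores `g_idx` and any layers left unquantized. `effective_bits_per_weight` is a hypothetical helper, not a Transformers API.

```python
def effective_bits_per_weight(bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Average bits stored per weight, including per-group scale and zero point."""
    # Assumption: each group of `group_size` weights shares one scale and one zero point.
    per_group_overhead = scale_bits + bits
    return bits + per_group_overhead / group_size


print(effective_bits_per_weight(4, 128))  # 4 + 20/128 = 4.15625
print(effective_bits_per_weight(4, 32))   # 4 + 20/32  = 4.625
```

Under these assumptions, a smaller `group_size` gives finer-grained scales at the cost of more metadata per weight.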

In [4]:
# Load the model and run quantization
quant_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    quantization_config=quantization_config,
    device_map='auto',
    trust_remote_code=True)

GPTQModel has been merged into Transformers/Optimum and full deprecation of AutoGPTQ within HF frameworks is planned in the near-future.


Quantizing model.decoder.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/6 [00:00<?, ?it/s]

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


In [5]:
quant_model.model.decoder.layers[0].self_attn.q_proj.__dict__

{'training': True,
 '_parameters': OrderedDict(),
 '_buffers': OrderedDict([('qweight',
               tensor([[-1533367701,  2022282410, -1182414405,  ..., -1537760937,
                          968190105, -1218869078],
                       [-1973762678, -1990888091,  1451669112,  ...,  1767327094,
                         1485273242, -1769109111],
                       [ -890521657,  1705355194,  2042256023,  ...,  1401453177,
                         -963081656, -1212573545],
                       ...,
                       [-1192544404,   697191045,  1432856694,  ...,  1967820506,
                        -1482119368, -1787262823],
                       [-1736931943, -1753576812,  2027985786,  ..., -1757504693,
                         2090308806, -1987483739],
                       [ 1549191317,  1151064006, -1735993498,  ..., -1317428394,
                        -1182375288, -1199925157]], device='cuda:0', dtype=torch.int32)),
              ('qzeros',
               tensor(


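- `qweight`: the 4-bit quantized weights, packed into int32 words (eight 4-bit values per word) to save space;
- `scales`: the scale factors used to map quantized weights back to floating point;
- `qzeros`: the zero points that offset the quantized values, reducing quantization error;
- `g_idx`: the group index, which records the grouping implied by group_size=128.

These tensors live on cuda:0 (the GPU), which indicates that quantization succeeded and the model is loaded on the GPU.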
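
The large signed integers printed for `qweight` come from packing several 4-bit values into one 32-bit word. The sketch below illustrates that with plain Python integers. The low-nibble-first ordering and the simplified dequantization formula `scale * (q - zero)` are assumptions for illustration. The actual kernel layout may differ, and all three helpers are hypothetical.

```python
def pack_4bit(values):
    """Pack eight 4-bit integers (0..15) into one signed 32-bit integer."""
    assert len(values) == 8 and all(0 <= v <= 15 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)  # value i occupies bits [4i, 4i + 3]
    # Reinterpret the unsigned 32-bit pattern as a signed int32.
    return word - (1 << 32) if word >= (1 << 31) else word


def unpack_4bit(word):
    """Recover the eight 4-bit integers from a signed 32-bit integer."""
    word &= 0xFFFFFFFF  # back to the unsigned bit pattern
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize(q, scale, zero):
    """Simplified affine dequantization of one quantized value."""
    return scale * (q - zero)


packed = pack_4bit([1, 2, 3, 4, 5, 6, 7, 15])
print(packed)               # negative, because the top nibble sets the sign bit
print(unpack_4bit(packed))  # [1, 2, 3, 4, 5, 6, 7, 15]
```

Packing this way is why a 4-bit layer shows up as an `int32` buffer rather than as a dedicated 4-bit dtype.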


## 3. Saving the Quantized Model

In [6]:
quant_model.save_pretrained("/home/cc/models/quant_models/opt-6.7b-gptq")
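
A quick way to confirm the save worked is to inspect the `config.json` written to the output directory, where the quantization settings are expected to appear under `quantization_config`. The sketch below reads that section with the standard library only. `read_quantization_config` is a hypothetical helper.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` section of config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")


saved_dir = "/home/cc/models/quant_models/opt-6.7b-gptq"  # path used in the cell above
print(read_quantization_config(saved_dir))
```

Note that `quant_model.save_pretrained(...)` saves only the model. To make the directory self-contained, the tokenizer can be saved alongside it with `tokenizer.save_pretrained(...)`.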

## 4. Verifying Inference with the Quantized Model

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)  # load the matching tokenizer
text = "Merry Christmas! I'm glad to"  # prompt
inputs = tokenizer(text, return_tensors="pt").to(0)  # tokenize and move the tensors to GPU 0
out = quant_model.generate(**inputs, max_new_tokens=64)  # generate a continuation
print(tokenizer.decode(out[0], skip_special_tokens=True))  # decode the token ids back to text

NameError: name 'quant_model' is not defined

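The NameError above means this cell ran in a session where `quant_model` had not been defined (for example, after a kernel restart). Rerun the quantization cell or reload the saved checkpoint first. With the model in scope, the output begins "Merry Christmas! I'm glad to see you're still around...", which suggests the model keeps basic semantic coherence and that quantization did not cause severe degradation.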
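
One way to avoid the NameError without repeating the quantization is to reload the checkpoint saved in section 3. The sketch below wraps that in a hypothetical helper, `load_quantized_model`. It assumes the saved directory exists and that the GPTQ backend packages are installed.

```python
def load_quantized_model(quantized_dir, tokenizer_dir):
    """Load a saved GPTQ checkpoint and a tokenizer (a sketch; the caller supplies the paths)."""
    # Imported lazily so that defining this helper does not itself need a GPU environment.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(quantized_dir, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    return model, tokenizer


# Example call with the paths used in this notebook
# (commented out because it needs the saved checkpoint on disk):
# quant_model, tokenizer = load_quantized_model(
#     "/home/cc/models/quant_models/opt-6.7b-gptq",
#     "/home/cc/models/opt/facebook-opt-6.7b",
# )
```

The tokenizer is loaded from the original model directory here because the save step above stored only the model.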

## 5. Quantizing with a Custom Dataset (Flexible Extension)

In [11]:
import torch  # needed for torch.float16 below
from transformers import AutoModelForCausalLM, GPTQConfig, AutoTokenizer

model_name_or_path = "/home/cc/models/opt/facebook-opt-6.7b"
custom_dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]

custom_quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    desc_act=False,
    dataset=custom_dataset
)

custom_quant_model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                                          quantization_config=custom_quantization_config,
                                                          torch_dtype=torch.float16,
                                                          device_map="auto")

Quantizing model.decoder.layers blocks :   0%|          | 0/24 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/6 [00:00<?, ?it/s]


In [12]:
text = "Merry Christmas! I'm glad to"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = custom_quant_model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Merry Christmas! I'm glad to be.



 is a implementation of of.



.

 is a implementation of.

.
 is a implementation of.
 is a implementation.
 is a implementation. is a implementation. is an implementation. is an implementation. is an implementation. is an implementation. is


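The output "Merry Christmas! I'm glad to be. is a implementation ..." is far less coherent and collapses into repetition. A standard dataset such as Wikitext2 suits general-purpose calibration, whereas the custom dataset here is a single sentence, which is too little data to calibrate on. In practice, a custom dataset needs to be large and diverse enough to preserve quantization quality.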
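
As a minimal sketch of preparing a larger custom calibration set, the helper below cleans a list of raw texts, drops very short and duplicate entries, and caps the sample count. The thresholds (`min_words=20`, `max_samples=128`) are illustrative assumptions rather than recommended values, and `build_calibration_samples` is a hypothetical helper.

```python
import random


def build_calibration_samples(texts, min_words=20, max_samples=128, seed=0):
    """Filter, deduplicate, and shuffle raw texts into a list of calibration strings."""
    seen = set()
    samples = []
    for text in texts:
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if len(cleaned.split()) < min_words or cleaned in seen:
            continue  # skip fragments that are too short, and exact duplicates
        seen.add(cleaned)
        samples.append(cleaned)
    random.Random(seed).shuffle(samples)  # seeded shuffle for reproducibility
    return samples[:max_samples]
```

The resulting list of strings can be passed as `dataset=` to `GPTQConfig`, in the same way as `custom_dataset` above.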