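# Transformers Model Quantization: AWQ (OPT-6.7B)

## Quantizing a Model with AutoAWQ

Below we take the `facebook/opt-6.7b` model as an example and quantize it with the AWQ algorithm as implemented in the `AutoAWQ` library.

## 1. Preprocessing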

In [1]:
import os

os.environ['http_proxy'] = 'http://127.0.0.1:1087'
os.environ['https_proxy'] = 'http://127.0.0.1:1087'

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "/home/cc/models/opt/facebook-opt-6.7b"
quant_model_dir = "/home/cc/models/quant_models/opt-6.7b-awq"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

I have left this message as the final dev message to help you transition.

Important Notice:
- AutoAWQ is officially deprecated and will no longer be maintained.
- The last tested configuration used Torch 2.6.0 and Transformers 4.51.3.
- If future versions of Transformers break AutoAWQ compatibility, please report the issue to the Transformers project.

Alternative:
- AutoAWQ has been adopted by the vLLM Project: https://github.com/vllm-project/llm-compressor

For further inquiries, feel free to reach out:
- X: https://x.com/casper_hansen_
- LinkedIn: https://www.linkedin.com/in/casper-hansen-804005170/
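
The `quant_config` fields above describe a simple storage scheme: `w_bit` is the number of bits per weight, `q_group_size` is how many consecutive weights share one scale, and `zero_point` selects asymmetric quantization with an integer offset. The following is a minimal pure-Python sketch of that scheme using plain round-to-nearest. It is not AutoAWQ's implementation, and the helper names `quantize_group`, `dequantize_group`, and `quantize_row` are hypothetical, chosen only for illustration.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) round-to-nearest quantization of one group."""
    q_max = 2 ** w_bit - 1                       # 15 levels above zero for 4-bit
    # Simplifying assumption: widen the range so that 0.0 is always representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / q_max or 1.0       # fall back to 1.0 for an all-zero group
    zero = round(-w_min / scale)                 # integer zero point in [0, q_max]
    q = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the stored integers back to approximate floating-point weights."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Split one weight row into groups and quantize each group independently."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


# Toy row of 256 weights: two groups of 128, each with its own scale and zero point.
row = [(i % 37) / 37 - 0.5 for i in range(256)]
groups = quantize_row(row, q_group_size=128, w_bit=4)
```

Each group stores 128 small integers plus one scale and one zero point, which is why a smaller `q_group_size` tends to improve accuracy at the cost of more metadata.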



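The model loader expects to find a `model.safetensors` file, but the directory actually contains sharded safetensors files (such as `pytorch_model-00001-of-00002.safetensors` and `pytorch_model-00002-of-00002.safetensors`), so the shards first have to be merged into a single `model.safetensors`.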

In [6]:
from safetensors.torch import load_file, save_file
import torch

# Load the shard files
shard1 = load_file("/home/cc/models/opt/facebook-opt-6.7b/pytorch_model-00001-of-00002.safetensors")
shard2 = load_file("/home/cc/models/opt/facebook-opt-6.7b/pytorch_model-00002-of-00002.safetensors")

# Merge the tensors into one dict
merged_tensors = {**shard1, **shard2}

# Save as a single file
save_file(merged_tensors, "/home/cc/models/opt/facebook-opt-6.7b/model.safetensors")
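
One caveat with the `{**shard1, **shard2}` merge: if the same key appeared in both shards, Python would silently keep the later value. The sketch below shows a stricter merge that raises instead. `merge_shards` is a hypothetical helper, and the plain dicts stand in for the tensors that `load_file` would return.

```python
def merge_shards(*shards):
    """Merge tensor dicts, refusing to silently overwrite a duplicated key."""
    merged = {}
    for index, shard in enumerate(shards):
        duplicates = merged.keys() & shard.keys()
        if duplicates:
            raise ValueError(f"shard {index} repeats keys: {sorted(duplicates)}")
        merged.update(shard)
    return merged


# Stand-in data; real shards would be the dicts returned by safetensors.torch.load_file.
shard_a = {"decoder.embed_tokens.weight": [0.1, 0.2]}
shard_b = {"decoder.layers.0.fc1.weight": [0.3]}
merged = merge_shards(shard_a, shard_b)
```

In the notebook this would replace the dict-unpacking line with `merged_tensors = merge_shards(shard1, shard2)`.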

## 2. Load the Model

In [2]:
model = AutoAWQForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

## 3. Quantize the Model

In [3]:
model.quantize(tokenizer, quant_config=quant_config)

Repo card metadata block was not found. Setting CardData to empty.
AWQ: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [06:35<00:00, 12.36s/it]
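
For intuition about what the quantization pass is optimizing: AWQ (activation-aware weight quantization) protects weights on channels that see large activations by scaling them up before rounding and compensating with the inverse scale on the activation side. The toy example below illustrates only that one idea, with a hand-picked scale `s` and a simple symmetric quantizer. It is a sketch under those assumptions, not the search procedure AutoAWQ runs, and `fake_quantize` is a hypothetical helper.

```python
def fake_quantize(weights, w_bit=4):
    """Symmetric round-to-nearest: quantize, then dequantize, with one shared step size."""
    q_max = 2 ** (w_bit - 1) - 1                     # 7 for 4-bit symmetric
    delta = max(abs(w) for w in weights) / q_max     # step size set by the largest weight
    return [round(w / delta) * delta for w in weights]


weights = [1.0, 0.1]   # pretend the second weight multiplies a large ("salient") activation
s = 2.0                # hand-picked scale for the salient channel (an assumption)

# Baseline: quantize the weights as they are.
plain = fake_quantize(weights)

# Scaled: multiply the salient weight by s before quantizing, then undo it.
# In a real model the 1/s factor is folded into the activation, not the weight.
scaled = fake_quantize([weights[0], weights[1] * s])
restored = [scaled[0], scaled[1] / s]

err_plain = abs(plain[1] - weights[1])
err_scaled = abs(restored[1] - weights[1])
```

Because the group's largest weight, and therefore the step size, is unchanged here, the rounding error on the scaled weight shrinks once it is divided by `s` again.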


### Measured AWQ quantization: peak GPU memory usage exceeds 16 GB

```bash
$ nvidia-smi  
Tue Aug 19 18:32:28 2025         
+---------------------------------------------------------------------------------------+  
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |  
|-----------------------------------------+----------------------+----------------------+  
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |  
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |  
|                                         |                      |               MIG M. |  
|=========================================+======================+======================|  
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0  On |                  Off |  
| 73%   63C    P2             350W / 450W |  16112MiB / 24564MiB |     97%      Default |  
|                                         |                      |                  N/A |  
```


In [4]:
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
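
As a rough back-of-the-envelope check on what this configuration buys, the sketch below estimates the weight footprint before and after quantization. It assumes one 16-bit scale and one 4-bit zero point per group, treats all 6.7 billion parameters as quantized, and ignores packing overhead; these are simplifying assumptions rather than properties confirmed by the notebook, and `bits_per_weight` and `weight_gib` are hypothetical helpers.

```python
def bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16, zero_bits=4):
    """Average storage per weight: the packed value plus its share of group metadata."""
    return w_bit + (scale_bits + zero_bits) / q_group_size


def weight_gib(n_params, bits):
    """Convert a parameter count and a per-parameter bit width to GiB."""
    return n_params * bits / 8 / 2 ** 30


N_PARAMS = 6.7e9  # nominal size of OPT-6.7B (assumption: every parameter is quantized)

fp16_gib = weight_gib(N_PARAMS, 16)
awq_gib = weight_gib(N_PARAMS, bits_per_weight())

print(f"fp16 weights: {fp16_gib:.1f} GiB")
print(f"4-bit, group 128: {awq_gib:.1f} GiB")
```

The real checkpoint will be somewhat larger than this estimate, since layers such as the embeddings and layer norms are typically left unquantized.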

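## 4. Configure and Save the Model

#### Transformers compatibility configuration

To make `quant_config` compatible with Transformers, we need to adjust the configuration by instantiating the quantization config with `transformers.AwqConfig`.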

In [5]:
from transformers import AwqConfig, AutoConfig

# Rebuild the config so that it is compatible with the Transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained Transformers model is stored in the `model` attribute; we need to pass it a dict
model.model.config.quantization_config = quantization_config
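
The cell above is essentially a key rename between the two libraries' naming schemes. The sketch below spells that mapping out without importing either library. `AUTOAWQ_TO_TRANSFORMERS` and `to_awq_config_kwargs` are hypothetical names, and the mapping covers only the four fields used in this notebook.

```python
# AutoAWQ key -> keyword argument used for AwqConfig in the cell above
AUTOAWQ_TO_TRANSFORMERS = {
    "w_bit": "bits",
    "q_group_size": "group_size",
    "zero_point": "zero_point",
    "version": "version",
}


def to_awq_config_kwargs(autoawq_config):
    """Rename AutoAWQ-style keys and lower-case the version string."""
    kwargs = {}
    for src_key, dst_key in AUTOAWQ_TO_TRANSFORMERS.items():
        value = autoawq_config[src_key]
        kwargs[dst_key] = value.lower() if src_key == "version" else value
    return kwargs


example_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
awq_kwargs = to_awq_config_kwargs(example_config)
```

With such a helper, the call in the notebook could be written as `AwqConfig(**to_awq_config_kwargs(quant_config)).to_dict()`.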

In [6]:
# Save the quantized model weights
model.save_quantized(quant_model_dir)
# Save the tokenizer
tokenizer.save_pretrained(quant_model_dir)

('/home/cc/models/quant_models/opt-6.7b-awq/tokenizer_config.json',
 '/home/cc/models/quant_models/opt-6.7b-awq/special_tokens_map.json',
 '/home/cc/models/quant_models/opt-6.7b-awq/vocab.json',
 '/home/cc/models/quant_models/opt-6.7b-awq/merges.txt',
 '/home/cc/models/quant_models/opt-6.7b-awq/added_tokens.json',
 '/home/cc/models/quant_models/opt-6.7b-awq/tokenizer.json')
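
Before loading the saved directory elsewhere, it can be useful to confirm that the quantization settings were actually written out. The sketch below assumes the directory contains a standard `config.json` with a `quantization_config` entry; `read_quantization_config` is a hypothetical helper, and the temporary directory only stands in for `quant_model_dir`.

```python
import json
import tempfile
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the quantization_config dict stored in config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as fh:
        return json.load(fh).get("quantization_config")


# Demonstration against a throwaway directory with stand-in contents.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = {"model_type": "opt", "quantization_config": {"bits": 4, "group_size": 128}}
    Path(tmp, "config.json").write_text(json.dumps(stand_in), encoding="utf-8")
    found = read_quantization_config(tmp)
    missing = read_quantization_config(Path(tmp) / "no-such-dir")
```

In the notebook the call would be `read_quantization_config(quant_model_dir)`; a `None` result would suggest the compatibility step above did not take effect.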

In [7]:
model.eval()

OptAWQForCausalLM(
  (model): OPTForCausalLM(
    (model): OPTModel(
      (decoder): OPTDecoder(
        (embed_tokens): Embedding(50272, 4096, padding_idx=1)
        (embed_positions): OPTLearnedPositionalEmbedding(2050, 4096)
        (final_layer_norm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (layers): ModuleList(
          (0-31): 32 x OPTDecoderLayer(
            (self_attn): OPTAttention(
              (k_proj): WQLinear_GEMM(in_features=4096, out_features=4096, bias=True, w_bit=4, group_size=128)
              (v_proj): WQLinear_GEMM(in_features=4096, out_features=4096, bias=True, w_bit=4, group_size=128)
              (q_proj): WQLinear_GEMM(in_features=4096, out_features=4096, bias=True, w_bit=4, group_size=128)
              (out_proj): WQLinear_GEMM(in_features=4096, out_features=4096, bias=True, w_bit=4, group_size=128)
            )
            (activation_fn): ReLU()
            (self_attn_layer_norm): LayerNorm((4096,), eps=1e-05, elementwise_affin

## 5. Load the Model on the GPU

In [9]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)


In [8]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(quant_model_dir)
# device_map="cuda" already places the model on the GPU, so no extra .to(0) is needed
model = AutoModelForCausalLM.from_pretrained(quant_model_dir, device_map="cuda")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
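
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal way to do that from inside the notebook, assuming the cell runs before the tokenizer is first used, is:

```python
import os

# setdefault keeps any value already exported in the shell and only fills in a missing one.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```

Choosing `"false"` trades tokenizer-side parallelism for a quiet log, which is usually acceptable for single-prompt generation like the examples here.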


In [10]:
result = generate_text("Merry Christmas! I'm glad to")
print(result)


Merry Christmas! I'm glad to know you put in a good chunk of time at least.
Thanks! Ya I put in enough time I'll be satisfied with it.



In [11]:
result = generate_text("The woman worked as a")
print(result)

The woman worked as a journalist with the Times of India

The police had said the man held the journalist in a car, before raping her in a field

New Delhi:

Four days after a 24-year-old student was gang-raped on a moving bus across the capital, a senior police officer confirmed to NDTV



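Judging from these two samples, the quantized model performs adequately on basic language ability (coherence, grammar, relevance), and its output is particularly natural and fluent in everyday conversational settings such as the Christmas greeting.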