# Assignment: Quantizing OPT-6.7B with AWQ

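In June 2023, Ji Lin et al. published the paper [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf).

The paper describes an activation-aware weight quantization algorithm that can compress Transformer-based language models with only a minor drop in performance. For a detailed introduction to the AWQ algorithm, see [the project page from Professor Song Han's lab at MIT](https://hanlab.mit.edu/projects/awq).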
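
The core idea can be sketched in a few lines of NumPy. The toy below is not the reference implementation. It assumes simulated round-to-nearest 4-bit quantization with one scale and zero point per output row, random data in which a few input channels carry much larger activations, and two hypothetical helpers (`fake_quantize`, `output_error`). Each input channel of the weight matrix is multiplied by `s = mean(|x|) ** alpha` and the activations are divided by the same `s`, so the full-precision product is unchanged. `alpha` is then grid-searched for the lowest output error after quantization.

```python
import numpy as np

def fake_quantize(w, n_bits=4):
    """Simulated asymmetric quantization: quantize each row, then de-quantize."""
    qmax = 2 ** n_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale

def output_error(x, w, s):
    """Mean squared output error when W is scaled by s before quantization."""
    w_q = fake_quantize(w * s)                 # quantize the scaled weights
    return float(np.mean(((x / s) @ w_q.T - x @ w.T) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))                 # calibration activations (toy data)
x[:, :4] *= 20.0                               # a few "salient" input channels
w = rng.normal(size=(32, 64))                  # (out_features, in_features)

act_mag = np.abs(x).mean(axis=0)               # per-channel activation magnitude
alphas = np.linspace(0.0, 1.0, 21)             # alpha = 0 means no scaling
errors = [output_error(x, w, act_mag ** a) for a in alphas]

best = int(np.argmin(errors))
err_plain, best_err = errors[0], errors[best]
print(f"no scaling: {err_plain:.4f}")
print(f"best alpha={alphas[best]:.2f}: {best_err:.4f}")
```

Because `alpha = 0` gives `s = 1`, which is plain round-to-nearest, the search can never do worse than the unscaled baseline on the calibration data it was tuned on.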

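transformers currently supports two open-source AWQ implementations:

- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [LLM-AWQ](https://github.com/mit-han-lab/llm-awq)

Because LLM-AWQ does not support the Nvidia T4 GPU (the GPU used for the course demos), we use the AutoAWQ library to introduce and demonstrate AWQ model quantization.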

In [1]:
%env HF_ENDPOINT=https://hf-mirror.com
%env HF_HOME=/root/autodl-tmp/hf
%env HF_HUB_CACHE=/root/autodl-tmp/hf

env: HF_ENDPOINT=https://hf-mirror.com
env: HF_HOME=/root/autodl-tmp/hf
env: HF_HUB_CACHE=/root/autodl-tmp/hf


## Quantizing OPT-6.7B with AutoAWQ

In [2]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "facebook/opt-6.7b"
bits = 4
save_dir = "/root/autodl-tmp/models/" + model_name + f"-quant-awq-{bits}bits"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": bits, "version": "GEMM"}

# Load the model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_name, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)



Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [3]:
# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)

Repo card metadata block was not found. Setting CardData to empty.
AWQ: 100%|██████████| 32/32 [09:57<00:00, 18.67s/it]
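
To make the settings in `quant_config` more concrete, here is a minimal NumPy sketch of what `w_bit=4`, `q_group_size=128` and `zero_point=True` mean for a single weight matrix. It is an illustration under assumptions, not AutoAWQ's kernel code: `quantize_groupwise` and `dequantize_groupwise` are hypothetical helpers, the weights are random, and the storage estimate assumes one fp16 scale plus one `w_bit`-wide zero point per group.

```python
import numpy as np

def quantize_groupwise(w, w_bit=4, group_size=128):
    """Asymmetric (zero-point) quantization with one scale and zero point per group."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must be a multiple of group_size"
    groups = w.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** w_bit - 1                       # 15 for 4-bit
    g_min = groups.min(axis=-1, keepdims=True)
    g_max = groups.max(axis=-1, keepdims=True)
    scales = np.maximum(g_max - g_min, 1e-8) / qmax
    zeros = np.clip(np.round(-g_min / scales), 0, qmax)
    q = np.clip(np.round(groups / scales) + zeros, 0, qmax).astype(np.uint8)
    return q, scales, zeros

def dequantize_groupwise(q, scales, zeros):
    """Map integer codes back to floating point and restore the 2-D shape."""
    groups = (q.astype(np.float64) - zeros) * scales
    return groups.reshape(groups.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 256))                   # toy weight matrix: 2 groups per row
q, scales, zeros = quantize_groupwise(w, w_bit=4, group_size=128)
w_hat = dequantize_groupwise(q, scales, zeros)

# Storage estimate (assumption: fp16 scale + 4-bit zero point per group of 128)
w_bit, group_size = 4, 128
bits_per_weight = w_bit + (16 + w_bit) / group_size

print("codes:", q.shape, "scales:", scales.shape)
print("max abs error:", np.abs(w - w_hat).max())
print("bits per weight:", bits_per_weight)
```

A smaller group size tracks the local weight range more closely but stores more scales and zero points, so `q_group_size` trades accuracy against size.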


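### Transformers compatibility configuration

To make `quant_config` compatible with transformers, we need to update the model configuration: instantiate a `transformers.AwqConfig` from the quantization settings and attach it to the model config.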

In [4]:
from transformers import AwqConfig

# Build a transformers-compatible quantization config from the AutoAWQ settings
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained transformers model is stored in the `model` attribute; the config must be passed as a dict
model.model.config.quantization_config = quantization_config

In [5]:
# Save the quantized model weights
model.save_quantized(save_dir)
tokenizer.save_pretrained(save_dir)  # Save the tokenizer

('/root/autodl-tmp/models/facebook/opt-6.7b-quant-awq-4bits/tokenizer_config.json',
 '/root/autodl-tmp/models/facebook/opt-6.7b-quant-awq-4bits/special_tokens_map.json',
 '/root/autodl-tmp/models/facebook/opt-6.7b-quant-awq-4bits/vocab.json',
 '/root/autodl-tmp/models/facebook/opt-6.7b-quant-awq-4bits/merges.txt',
 '/root/autodl-tmp/models/facebook/opt-6.7b-quant-awq-4bits/added_tokens.json',
 '/root/autodl-tmp/models/facebook/opt-6.7b-quant-awq-4bits/tokenizer.json')
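
To see how much disk space the saved checkpoint takes, the files in the output directory can be summed with the standard library. `dir_size_gb` is a hypothetical helper written for this notebook, not part of AutoAWQ or transformers. The demo runs on a temporary directory so the snippet works on its own. In the notebook, call it on `save_dir` instead.

```python
import tempfile
from pathlib import Path

def dir_size_gb(path):
    """Total size of all regular files under `path`, in gigabytes (1 GB = 1e9 bytes)."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e9

# Self-contained demo on a temporary directory holding a single 1000-byte file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "weights.bin").write_bytes(b"\0" * 1000)
    demo_size = dir_size_gb(tmp)

print(f"{demo_size * 1e9:.0f} bytes")
# In the notebook: print(f"{dir_size_gb(save_dir):.2f} GB")
```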

### Loading the quantized model on the GPU

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(save_dir)
model = AutoModelForCausalLM.from_pretrained(save_dir, device_map="cuda")

In [7]:
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to(0)

    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)


In [8]:
result = generate_text("Merry Christmas! I'm glad to")
print(result)

Merry Christmas! I'm glad to you guys to you.
You're merry have you! Merry Christmas! I'm happy to you! to you!
That Christmas Christmas Merry Christmas Merry!


In [9]:
result = generate_text("The woman worked as a")
print(result)

The woman worked as a waitress at the restaurant where she's a waitress at.
A restaurant where she's her waitress, at.
A restaurant where she's a waitress, at.
A restaurant where she's a waitress, at.
A restaurant where she's a waitress, at.
A restaurant where she's a waitress,
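
Both generations above repeat themselves heavily. One simple way to put a number on that, independent of the tokenizer, is the fraction of distinct n-grams in the text, where values near 1.0 mean little repetition. The sketch below assumes whitespace tokenization, and `distinct_n` is a hypothetical helper, not part of transformers or AutoAWQ.

```python
def distinct_n(text, n=2):
    """Fraction of unique n-grams among all whitespace-token n-grams in `text`."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# A looping sample in the style of the output above, and a non-repeating one
looping = "A restaurant where she's a waitress, at. " * 4
varied = "The woman worked as a nurse in a small town hospital."

print("looping:", distinct_n(looping))
print("varied:", distinct_n(varied))
```

Calling `distinct_n(result)` before and after quantization, or under different `generate` settings, gives a rough way to compare how repetitive the outputs are.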
