Quantized loading of chatglm3 fails with error: "round_vml_cpu" not implemented for 'Half' #1217
On-the-fly quantization can't run on CPU — it relies on the CUDA kernels. Check how the latest code loads the model; both the HF and GitHub repos need to be updated.
Can this be fixed just by updating the relevant Python packages?
The quantization step uses CUDA kernels.
Using the old version of quantization.py solves it, provided you have enough CPU RAM. Alternatively, load an already-quantized model at startup.
If GPU memory is enough to hold the full model, you can use the latest code: load the full model, then quantize it. If neither RAM nor VRAM is enough to hold the full model, you can't use `model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4).cuda()`. In that case, your only option is to load an already-quantized model.
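The advice above boils down to a simple decision procedure. A minimal sketch, assuming rough model sizes (a ~6B fp16 checkpoint around 13 GB and an int4 checkpoint around 5 GB — these figures are illustrative assumptions, not measurements), with a hypothetical helper name:

```python
# Hypothetical helper summarizing the loading advice from the comments above.
# The size thresholds are rough assumptions for a ~6B-parameter model.
def choose_loading_strategy(gpu_mem_gb: float, cpu_mem_gb: float,
                            full_model_gb: float = 13.0) -> str:
    if gpu_mem_gb >= full_model_gb:
        # Enough VRAM: load the full model, then quantize on the GPU.
        return "load full model, then .quantize(4).cuda()"
    if cpu_mem_gb >= full_model_gb:
        # Enough RAM but not VRAM: quantize on CPU with the old quantization.py.
        return "use old quantization.py: quantize on CPU, then move to GPU"
    # Neither fits the full model: only a pre-quantized checkpoint will work.
    return "load a pre-quantized checkpoint"

print(choose_loading_strategy(24, 16))
print(choose_loading_strategy(8, 32))
print(choose_loading_strategy(8, 8))
```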
System Info / 系統信息
(omitted)
Who can help? / 谁可以帮助到您?
(omitted)
Information / 问题信息
Reproduction / 复现过程
1. Loading chatglm3: GPU memory is insufficient, so I tried loading with quantization, using the following statement:
model = AutoModel.from_pretrained("./chatglm3/",trust_remote_code=True).quantize(4).cuda()
2. Error log:
RuntimeError Traceback (most recent call last)
Cell In[1], line 18
14 tokenizer = AutoTokenizer.from_pretrained("./chatglm3/", trust_remote_code=True)
15 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).half().cuda()
16 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(8).half().cuda()
17 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(8).float().cuda()
---> 18 model = AutoModel.from_pretrained("./chatglm3/",trust_remote_code=True).quantize(4).cuda()
19 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).cuda().quantize(4)
20 model = model.eval()
File ~/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py:1212, in ChatGLMForConditionalGeneration.quantize(self, bits, empty_init, device, **kwargs)
1208 self.quantized = True
1210 self.config.quantization_bit = bits
-> 1212 self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
1213 **kwargs)
1214 return self
File ~/.cache/huggingface/modules/transformers_modules/quantization.py:155, in quantize(model, weight_bit_width, empty_init, device)
153 """Replace fp16 linear with quantized linear"""
154 for layer in model.layers:
--> 155 layer.self_attention.query_key_value = QuantizedLinear(
156 weight_bit_width=weight_bit_width,
157 #weight=layer.self_attention.query_key_value.weight,
158 weight=layer.self_attention.query_key_value.weight.to(torch.cuda.current_device()),
159 bias=layer.self_attention.query_key_value.bias,
160 dtype=layer.self_attention.query_key_value.weight.dtype,
161 device=layer.self_attention.query_key_value.weight.device if device is None else device,
162 empty_init=empty_init
163 )
164 layer.self_attention.dense = QuantizedLinear(
165 weight_bit_width=weight_bit_width,
166 # weight=layer.self_attention.dense.weight,
(...)
171 empty_init=empty_init
172 )
173 layer.mlp.dense_h_to_4h = QuantizedLinear(
174 weight_bit_width=weight_bit_width,
175 # weight=layer.mlp.dense_h_to_4h.weight,
(...)
180 empty_init=empty_init
181 )
File ~/.cache/huggingface/modules/transformers_modules/quantization.py:137, in QuantizedLinear.__init__(self, weight_bit_width, weight, bias, device, dtype, empty_init)
135 self.weight_scale = weight.abs().max(dim=-1).values / ((2 ** (weight_bit_width - 1)) - 1)
136 # self.weight = torch.round(weight / self.weight_scale[:, None]).to(torch.int8)
--> 137 self.weight = torch.round(weight.cpu() / self.weight_scale.cpu()[:, None]).cpu()
138 if weight_bit_width == 4:
139 self.weight = compress_int4_weight(self.weight)
RuntimeError: "round_vml_cpu" not implemented for 'Half'
3. Searching Google and Baidu turned up no workable solution.
Expected behavior / 期待表现
Low-precision quantized loading completes normally, in a reasonable amount of time.