Quantized loading of chatglm3 fails with "round_vml_cpu" not implemented for 'Half' #1217

Closed · 1 of 2 tasks
imempty opened this issue May 14, 2024 · 5 comments

imempty commented May 14, 2024

System Info / 系統信息

Who can help? / 谁可以帮助到您?

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

1. I load chatglm3; because there is not enough GPU memory, I try to load it with quantization. The loading statement is:
model = AutoModel.from_pretrained("./chatglm3/",trust_remote_code=True).quantize(4).cuda()
2. Error log (a minimal sketch isolating the failing operation follows this list):


```
RuntimeError Traceback (most recent call last)
Cell In[1], line 18
14 tokenizer = AutoTokenizer.from_pretrained("./chatglm3/", trust_remote_code=True)
15 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).half().cuda()
16 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(8).half().cuda()
17 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(8).float().cuda()
---> 18 model = AutoModel.from_pretrained("./chatglm3/",trust_remote_code=True).quantize(4).cuda()
19 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).cuda().quantize(4)
20 model = model.eval()

File ~/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py:1212, in ChatGLMForConditionalGeneration.quantize(self, bits, empty_init, device, **kwargs)
1208 self.quantized = True
1210 self.config.quantization_bit = bits
-> 1212 self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
1213 **kwargs)
1214 return self

File ~/.cache/huggingface/modules/transformers_modules/quantization.py:155, in quantize(model, weight_bit_width, empty_init, device)
153 """Replace fp16 linear with quantized linear"""
154 for layer in model.layers:
--> 155 layer.self_attention.query_key_value = QuantizedLinear(
156 weight_bit_width=weight_bit_width,
157 #weight=layer.self_attention.query_key_value.weight,
158 weight=layer.self_attention.query_key_value.weight.to(torch.cuda.current_device()),
159 bias=layer.self_attention.query_key_value.bias,
160 dtype=layer.self_attention.query_key_value.weight.dtype,
161 device=layer.self_attention.query_key_value.weight.device if device is None else device,
162 empty_init=empty_init
163 )
164 layer.self_attention.dense = QuantizedLinear(
165 weight_bit_width=weight_bit_width,
166 # weight=layer.self_attention.dense.weight,
(...)
171 empty_init=empty_init
172 )
173 layer.mlp.dense_h_to_4h = QuantizedLinear(
174 weight_bit_width=weight_bit_width,
175 # weight=layer.mlp.dense_h_to_4h.weight,
(...)
180 empty_init=empty_init
181 )

File ~/.cache/huggingface/modules/transformers_modules/quantization.py:137, in QuantizedLinear.__init__(self, weight_bit_width, weight, bias, device, dtype, empty_init)
135 self.weight_scale = weight.abs().max(dim=-1).values / ((2 ** (weight_bit_width - 1)) - 1)
136 # self.weight = torch.round(weight / self.weight_scale[:, None]).to(torch.int8)
--> 137 self.weight = torch.round(weight.cpu() / self.weight_scale.cpu()[:, None]).cpu()
138 if weight_bit_width == 4:
139 self.weight = compress_int4_weight(self.weight)

RuntimeError: "round_vml_cpu" not implemented for 'Half'

3. I searched Google and Baidu but found no working solution.
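
The traceback above fails inside quantization.py line 137, where torch.round is applied to an fp16 tensor that has just been moved to the CPU. Below is a minimal sketch of that failing operation and a cast-to-float32 workaround, assuming a PyTorch build where the CPU rounding kernel for Half is missing, as the error message indicates:

```python
import torch

# fp16 weight on the CPU, as produced by weight.cpu() in quantization.py:137
weight = torch.randn(8, 8, dtype=torch.float16)
# per-row 4-bit scale, mirroring the QuantizedLinear formula (2 ** (4 - 1) - 1 == 7)
weight_scale = weight.float().abs().max(dim=-1).values / 7

# This is the failing call: the division stays in float16, and the CPU
# rounding kernel for Half does not exist in this PyTorch build:
# torch.round(weight / weight_scale.half()[:, None])
# -> RuntimeError: "round_vml_cpu" not implemented for 'Half'

# Casting to float32 before rounding sidesteps the missing kernel
# (a workaround sketch only, not the fix shipped in the repository):
weight_int = torch.round(weight.float() / weight_scale[:, None]).to(torch.int8)
print(weight_int.dtype)  # torch.int8
```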

Expected behavior / 期待表现

Low-precision quantized loading completes normally, in a reasonable amount of time.

@zRzRzRzRzRzRzR (Collaborator)

Online quantization cannot run on the CPU; it relies on CUDA kernels. Take a look at how the latest code does the loading; both the Hugging Face and GitHub copies need to be updated.

zRzRzRzRzRzRzR self-assigned this on May 15, 2024
imempty (Author) commented May 15, 2024

> Online quantization cannot run on the CPU; it relies on CUDA kernels. Take a look at how the latest code does the loading; both the Hugging Face and GitHub copies need to be updated.

Can this be fixed just by updating the relevant Python packages?
I am using the loading code from the official example: https://github.com/THUDM/ChatGLM3?tab=readme-ov-file#%E6%A8%A1%E5%9E%8B%E9%87%8F%E5%8C%96

@zRzRzRzRzRzRzR (Collaborator)

The quantization uses CUDA kernels.

mingyue0094 commented May 27, 2024

Using the old version of quantization.py solves it:
https://huggingface.co/THUDM/chatglm3-6b/discussions/47#663cb3c8a4d7c8c9038c5312

The prerequisite is that CPU RAM is large enough. Alternatively, load a model that has already been quantized.

mingyue0094 commented May 27, 2024

If GPU memory is large enough to hold the full model: use the latest code, load the full model, and then run model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4).cuda(). The 4-bit quantization is performed on the GPU, and the quantized model stays on the GPU for subsequent use.

If CPU RAM is large enough to hold the full model: replace quantization.py with the old version, load the full model, and then run the same line. The 4-bit quantization is performed on the CPU, after which the model is moved to the GPU, where the quantized model runs.

If neither CPU RAM nor GPU memory can hold the full model: you cannot use `model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4).cuda()`. In that case your only option is to load a model that has already been quantized.
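
A hedged sketch of the three cases described above, using the standard transformers AutoModel API; the int4 checkpoint name in case 3 is an assumption, not something stated in this thread:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./chatglm3/", trust_remote_code=True)

# Case 1: GPU memory can hold the full fp16 model.
# With the latest code, the 4-bit quantization runs on the GPU.
model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4).cuda()

# Case 2: only CPU RAM can hold the full model.
# With the older quantization.py, the same call quantizes on the CPU first,
# then moves the int4 model to the GPU:
# model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4).cuda()

# Case 3: neither CPU RAM nor GPU memory fits the full model.
# Load a checkpoint that was already quantized offline (repo name assumed):
# model = AutoModel.from_pretrained("THUDM/chatglm3-6b-int4", trust_remote_code=True).cuda()

model = model.eval()
```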

zRzRzRzRzRzRzR closed this as not planned on May 29, 2024