Quantized loading of chatglm3 fails with error: "round_vml_cpu" not implemented for 'Half' #1217
On-the-fly quantization can't run on CPU — it relies on the CUDA kernels. Check how the latest code loads the model; both the HF and GitHub repos need to be updated.
Can this be fixed just by updating the relevant Python packages?
The quantization step uses CUDA kernels.
Using the old version of quantization.py solves it, provided you have enough CPU RAM. Alternatively, load an already-quantized model at startup.
If GPU memory is enough to hold the full model, you can use the latest code: load the full model, then quantize it. If neither RAM nor VRAM is enough to hold the full model, you can't use `model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(4).cuda()`. In that case, your only option is to load an already-quantized model.
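The advice above boils down to a simple decision procedure. A minimal sketch, assuming rough model sizes (a ~6B fp16 checkpoint around 13 GB and an int4 checkpoint around 5 GB — these figures are illustrative assumptions, not measurements), with a hypothetical helper name:

```python
# Hypothetical helper summarizing the loading advice from the comments above.
# The size thresholds are rough assumptions for a ~6B-parameter model.
def choose_loading_strategy(gpu_mem_gb: float, cpu_mem_gb: float,
                            full_model_gb: float = 13.0) -> str:
    if gpu_mem_gb >= full_model_gb:
        # Enough VRAM: load the full model, then quantize on the GPU.
        return "load full model, then .quantize(4).cuda()"
    if cpu_mem_gb >= full_model_gb:
        # Enough RAM but not VRAM: quantize on CPU with the old quantization.py.
        return "use old quantization.py: quantize on CPU, then move to GPU"
    # Neither fits the full model: only a pre-quantized checkpoint will work.
    return "load a pre-quantized checkpoint"

print(choose_loading_strategy(24, 16))
print(choose_loading_strategy(8, 32))
print(choose_loading_strategy(8, 8))
```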
System Info / 系統信息
(omitted)
Who can help? / 谁可以帮助到您?
(omitted)
Information / 问题信息
Reproduction / 复现过程
1. Loading chatglm3: GPU memory is insufficient, so I tried loading with quantization, using the following statement:
model = AutoModel.from_pretrained("./chatglm3/",trust_remote_code=True).quantize(4).cuda()
2. Error log:
RuntimeError Traceback (most recent call last)
Cell In[1], line 18
14 tokenizer = AutoTokenizer.from_pretrained("./chatglm3/", trust_remote_code=True)
15 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).half().cuda()
16 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(8).half().cuda()
17 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).quantize(8).float().cuda()
---> 18 model = AutoModel.from_pretrained("./chatglm3/",trust_remote_code=True).quantize(4).cuda()
19 #model = AutoModel.from_pretrained("./chatglm3/", trust_remote_code=True).cuda().quantize(4)
20 model = model.eval()
File ~/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py:1212, in ChatGLMForConditionalGeneration.quantize(self, bits, empty_init, device, **kwargs)
1208 self.quantized = True
1210 self.config.quantization_bit = bits
-> 1212 self.transformer.encoder = quantize(self.transformer.encoder, bits, empty_init=empty_init, device=device,
1213 **kwargs)
1214 return self
File ~/.cache/huggingface/modules/transformers_modules/quantization.py:155, in quantize(model, weight_bit_width, empty_init, device)
153 """Replace fp16 linear with quantized linear"""
154 for layer in model.layers:
--> 155 layer.self_attention.query_key_value = QuantizedLinear(
156 weight_bit_width=weight_bit_width,
157 #weight=layer.self_attention.query_key_value.weight,
158 weight=layer.self_attention.query_key_value.weight.to(torch.cuda.current_device()),
159 bias=layer.self_attention.query_key_value.bias,
160 dtype=layer.self_attention.query_key_value.weight.dtype,
161 device=layer.self_attention.query_key_value.weight.device if device is None else device,
162 empty_init=empty_init
163 )
164 layer.self_attention.dense = QuantizedLinear(
165 weight_bit_width=weight_bit_width,
166 # weight=layer.self_attention.dense.weight,
(...)
171 empty_init=empty_init
172 )
173 layer.mlp.dense_h_to_4h = QuantizedLinear(
174 weight_bit_width=weight_bit_width,
175 # weight=layer.mlp.dense_h_to_4h.weight,
(...)
180 empty_init=empty_init
181 )
File ~/.cache/huggingface/modules/transformers_modules/quantization.py:137, in QuantizedLinear.__init__(self, weight_bit_width, weight, bias, device, dtype, empty_init)
135 self.weight_scale = weight.abs().max(dim=-1).values / ((2 ** (weight_bit_width - 1)) - 1)
136 # self.weight = torch.round(weight / self.weight_scale[:, None]).to(torch.int8)
--> 137 self.weight = torch.round(weight.cpu() / self.weight_scale.cpu()[:, None]).cpu()
138 if weight_bit_width == 4:
139 self.weight = compress_int4_weight(self.weight)
RuntimeError: "round_vml_cpu" not implemented for 'Half'
3. Searching Google and Baidu turned up no workable solution.
Expected behavior / 期待表现
Low-precision quantized loading completes normally, in a reasonable amount of time.