
Fix result dtype conversion in QuantLinear.forward() #390

Closed

Conversation

vivekkhandelwal1
Contributor

Fixes: AutoGPTQ#385 (comment)

Signed-Off By: Vivek Khandelwal <vivek@nod-labs.com>
@vivekkhandelwal1
Contributor Author

@fxmarty, can you please review this PR?

@@ -268,8 +268,8 @@ def forward(self, x: torch.Tensor):
 g_idx_i = self.g_idx[i*num_dim:(i+1)*num_dim]
 weights.append(scale_i[g_idx_i.long()] * (weight_i - zeros_i[g_idx_i.long()]))
 weights = torch.cat(weights,dim=1)
-out = torch.matmul(x, weights)
-out = out.to(dtype=weights.dtype).reshape(out_shape)
+out = torch.matmul(x, weights).to(dtype=weights.dtype)
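
For context on the hunk above: the visible change folds the dtype cast into the matmul result instead of applying it together with the reshape. A minimal standalone sketch of that pattern follows; the shapes, the 2-D flattening, and the trailing reshape are illustrative assumptions about the surrounding forward() code, not lines taken from this PR.

import torch

x = torch.randn(2, 3, 4)                         # (batch, seq, in_features), same dtype as weights
weights = torch.randn(4, 8)                      # dequantized weight, (in_features, out_features)
out_shape = x.shape[:-1] + (weights.shape[-1],)  # leading dims of x plus out_features

x2d = x.reshape(-1, x.shape[-1])                          # flatten leading dims for the matmul
out = torch.matmul(x2d, weights).to(dtype=weights.dtype)  # cast the result right after the matmul
out = out.reshape(out_shape)                              # then restore the original leading dims
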
Collaborator

I don't remember why these casts are needed in the first place. Shouldn't the activation & weight be of the same dtype (either both fp32 or both fp16)?
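
For reference, a small standalone snippet (not part of this PR) illustrating the question above: torch.matmul does not promote mixed float dtypes, so if dequantization yields fp16 weights while the activations are fp32 (or vice versa), one side has to be cast explicitly. The exact error text varies by PyTorch version.

import torch

x = torch.randn(2, 4, dtype=torch.float32)   # activation in fp32
w = torch.randn(4, 8, dtype=torch.float16)   # dequantized weight in fp16

try:
    torch.matmul(x, w)                       # mismatched dtypes: matmul raises
except RuntimeError as e:
    print(e)

out = torch.matmul(x.to(w.dtype), w)         # casting one operand makes the dtypes match
print(out.dtype)                             # torch.float16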

@vivekkhandelwal1
Contributor Author

vivekkhandelwal1 commented Nov 1, 2023

EDIT: The error is happening because of this change: a7d61ca#diff-c4c2bf0dd8440248a29510131f06affa3c2ab00d1bd7ca507dc0b7125a04f825R20

@fxmarty, I'm getting the following error:

File "/home/vivek/work/vivek-AutoGPTQ/repro_gptq.py", line 13, in <module>
    model = AutoModelForCausalLM.from_pretrained(checkpoint, low_cpu_mem_usage=True, device_map="cpu", quantization_config=quantization_config, torch_dtype=torch.float32)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vivek/work/shark-vivekkhandelwal1/shark.venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vivek/work/shark-vivekkhandelwal1/shark.venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2713, in from_pretrained
    from optimum.gptq import GPTQQuantizer
  File "/home/vivek/work/shark-vivekkhandelwal1/shark.venv/lib/python3.11/site-packages/optimum/gptq/__init__.py", line 15, in <module>
    from .quantizer import GPTQQuantizer, load_quantized_model
  File "/home/vivek/work/shark-vivekkhandelwal1/shark.venv/lib/python3.11/site-packages/optimum/gptq/quantizer.py", line 44, in <module>
    from auto_gptq import exllama_set_max_input_length
  File "/home/vivek/work/vivek-AutoGPTQ/auto_gptq/__init__.py", line 4, in <module>
    from .utils.peft_utils import get_gptq_peft_model
  File "/home/vivek/work/vivek-AutoGPTQ/auto_gptq/utils/peft_utils.py", line 20, in <module>
    from ..nn_modules.qlinear.qlinear_exllama import QuantLinear as QuantLinearExllama
  File "/home/vivek/work/vivek-AutoGPTQ/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 14, in <module>
    from exllama_kernels import make_q4, q4_matmul
ModuleNotFoundError: No module named 'exllama_kernels'

For the following code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

checkpoint = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
quantization_config = GPTQConfig(bits=4, disable_exllama=True)

model = AutoModelForCausalLM.from_pretrained(checkpoint, low_cpu_mem_usage=True, device_map="cpu", quantization_config=quantization_config, torch_dtype=torch.float32)

inputs = tokenizer.encode("Hello how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=4, do_sample=False)
print(tokenizer.decode(outputs[0]))

Is this happening because of this commit: bcd1406?
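
The traceback above stops in auto_gptq/nn_modules/qlinear/qlinear_exllama.py, which imports exllama_kernels unconditionally at module import time, so any CPU-only install without the compiled extension fails before the model is even loaded. A minimal sketch of one common way to guard such an optional extension import is shown below; the EXLLAMA_KERNELS_AVAILABLE flag is a name introduced here for illustration, not the fix adopted in the repository.

# Hypothetical guard around the optional CUDA extension import.
try:
    from exllama_kernels import make_q4, q4_matmul
    EXLLAMA_KERNELS_AVAILABLE = True
except ImportError:
    make_q4 = None
    q4_matmul = None
    EXLLAMA_KERNELS_AVAILABLE = False

# Callers can check EXLLAMA_KERNELS_AVAILABLE before constructing the exllama
# QuantLinear instead of failing at import time on CPU-only installs.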

@fxmarty
Collaborator

fxmarty commented Nov 2, 2023

Hi, thank you - superseded by #393

Note this bug in accelerate: huggingface/accelerate#2116

fxmarty closed this Nov 2, 2023