[core / QLinear] Support CPU inference #376
Conversation
out = torch.matmul(x.to(weights.dtype), weights)

# To support CPU inference
if weight.dtype == torch.float16 and weight.device.type == "cpu":
When does this case arise?
if you run:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
checkpoint = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
device = "cpu" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
quantization_config = GPTQConfig(bits=4, disable_exllama=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, low_cpu_mem_usage=True, quantization_config=quantization_config, torch_dtype=torch.float32).to(device)
inputs = tokenizer.encode("Hello how are you?", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=4, do_sample=False)
print(tokenizer.decode(outputs[0]))
with the patch huggingface/transformers#26719 applied to transformers.
@@ -266,7 +266,12 @@ def forward(self, x):
weight = (scales * (weight - zeros))
To add more context, the weights are in fp16 because the scales are in fp16.
I'm not keen on merging this - scales should be in fp32 in the first place.
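For reference, a minimal sketch of the workaround under discussion (the helper name is illustrative, not AutoGPTQ's actual code): upcast the dequantized fp16 weight and the hidden states to fp32 before the matmul when on CPU, since fp16 matmul on CPU is unsupported or slow on many PyTorch builds.

import torch

def cpu_safe_matmul(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper mirroring the guard shown in this PR's diff.
    if weight.dtype == torch.float16 and weight.device.type == "cpu":
        # Temporary upcast; as noted above, the cleaner fix would be to keep
        # the scales (and hence the dequantized weight) in fp32 from the start.
        return torch.matmul(x.to(torch.float32), weight.to(torch.float32))
    return torch.matmul(x.to(weight.dtype), weight)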
Hi @fxmarty, can we get this in with any changes?
@vivekkhandelwal1 Yes, happy to have it in the next release, but what is proposed in this PR is not the correct solution. What is needed is to correctly dispatch the module parameters / buffers (remove hard-coded …)
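As a rough illustration of that direction (a sketch under assumed buffer names, not AutoGPTQ's actual module), registering the quantization tensors as buffers lets model.to(device) dispatch them, so no device needs to be hard-coded in forward():

import torch
import torch.nn as nn

class QuantLinearSketch(nn.Module):
    # Illustrative only: the buffer names and packing layout are assumptions.
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Registered buffers move with the module on .to("cpu") / .to("cuda").
        self.register_buffer("qweight", torch.zeros(in_features // 8, out_features, dtype=torch.int32))
        self.register_buffer("scales", torch.zeros(out_features, dtype=torch.float32))
        self.register_buffer("qzeros", torch.zeros(out_features, dtype=torch.int32))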
Yeah, you're correct. @younesbelkada, can you make the changes accordingly? Otherwise I can do that.
Hi @vivekkhandelwal1, thanks for offering to help, it would be great if you could quickly do that if possible 🙏
Hi @younesbelkada, I don't have push access to your repo, can you please grant me access?
@vivekkhandelwal1 @PanQiWei is the owner and would need to give you that. In the meantime, if you open a PR, I can review it and merge.
Closing as superseded by #385
On par with: huggingface/transformers#26719

This PR simply proposes to temporarily upcast the weights and hidden states to fp32 before performing the matmul when users are on CPU. I can confirm that, with huggingface/transformers#26719 and this PR applied, the script above runs fine on CPU (but is slow).
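For context, a quick way to see why the upcast is needed (behavior varies by PyTorch version: on older CPU builds fp16 matmul raises, on newer ones it is merely slow):

import torch

a = torch.randn(4, 8).half()
b = torch.randn(8, 4).half()
try:
    torch.matmul(a, b)  # historically raised "not implemented for 'Half'" on CPU
    print("fp16 CPU matmul supported on this build")
except RuntimeError as err:
    print("fp16 CPU matmul unsupported:", err)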