
Modify qlinear_cuda for tracing the GPTQ model #367

Merged: 1 commit merged into AutoGPTQ:main on Oct 20, 2023

Conversation

vivekkhandelwal1
Contributor

Changes:
-- The change to torch.bitwise_and is made because, during tracing of this model, the current usage of torch.bitwise_and results in an in-place variant of the op, which causes an issue in the downstream lowering pipeline of the traced model via Torch-MLIR and IREE-SHARK. The op usage is therefore changed so that it does not result in an in-place variant.

-- The change to the torch.matmul call in the forward function is made because it currently assumes that the weights will always be of fp16 type, so executing the model with float32 weights results in an error. The change casts the LHS of the matmul to the same type as the RHS.

Neither of the above changes affects the model in any way.

Signed-off-by: Vivek Khandelwal <vivek@nod-labs.com>
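For illustration, a minimal sketch of the bitwise_and change described above, assuming the in-place behaviour came from writing the result back into the input tensor; the names and shapes are placeholders, not the actual qlinear_cuda code:

```python
import torch

# Placeholder tensors; illustrative only, not the real qlinear_cuda internals.
bits = 4
zeros = torch.randint(0, 2**8, (4, 8), dtype=torch.int32)

# In-place style: writing the result back into the input (e.g. via `out=`)
# traces to an in-place variant of the op, which the downstream
# Torch-MLIR / IREE-SHARK lowering cannot handle.
# torch.bitwise_and(zeros, (2**bits) - 1, out=zeros)

# Functional style: the trace records the pure op instead.
zeros = torch.bitwise_and(zeros, (2**bits) - 1)
```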

@vivekkhandelwal1
Contributor Author

@PanQiWei, can you please take a look at this PR?

@vivekkhandelwal1
Contributor Author

@qwopqwop200 @TheBloke @fxmarty @PanQiWei Can you please review this PR? I'm blocked on this PR right now. I would be really grateful if we could get this merged soon.

@fxmarty
Collaborator

fxmarty commented Oct 20, 2023

Apologies for the delay.

> Neither of the above changes affects the model in any way.

Well, it kind of does; I guess there's an additional memory allocation instead of doing the operation in-place? It doesn't sound like a big deal, though.

I am wondering, though, what the point of using AutoGPTQ is if self.autogptq_cuda_available is False? It feels like the Python implementation must be very slow. Does Torch-MLIR need to be able to lower every branch of the control flow?

@@ -267,10 +267,10 @@ def forward(self, x: torch.Tensor):
 g_idx_i = self.g_idx[i*num_dim:(i+1)*num_dim]
 weights.append(scale_i[g_idx_i.long()] * (weight_i - zeros_i[g_idx_i.long()]))
 weights = torch.cat(weights,dim=1)
-out = torch.matmul(x.half(), weights)
+out = torch.matmul(x.to(weights.dtype), weights)
 out = out.half().reshape(out_shape)
Collaborator

Why is there still a .half() here?

Contributor Author

I'm not sure about this. I think this was left by mistake. Should I remove this?

Collaborator

I don't mind leaving it if this .half() is not a blocker for you. But I was wondering, given that you replaced some of the other .half() calls to remove the assumption of fp16.
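As a standalone illustration of the dtype issue the diff above addresses (shapes and dtypes are illustrative, not the real layer sizes):

```python
import torch

weights = torch.randn(8, 4, dtype=torch.float32)  # dequantized fp32 weights
x = torch.randn(2, 8, dtype=torch.float32)

# Old form: hard-codes fp16 on the LHS, which mismatches fp32 weights.
# torch.matmul(x.half(), weights)  # raises a dtype-mismatch RuntimeError

# New form: match the LHS dtype to whatever dtype the weights are in.
out = torch.matmul(x.to(weights.dtype), weights)
assert out.dtype == torch.float32
```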

@vivekkhandelwal1
Contributor Author

> Apologies for the delay.
>
> > Neither of the above changes affects the model in any way.
>
> Well, it kind of does; I guess there's an additional memory allocation instead of doing the operation in-place? It doesn't sound like a big deal, though.
>
> I am wondering, though, what the point of using AutoGPTQ is if self.autogptq_cuda_available is False? It feels like the Python implementation must be very slow. Does Torch-MLIR need to be able to lower every branch of the control flow?

So, the model would be run on a GPU, but only after being lowered through Torch-MLIR and compiled via IREE. The changes are needed because Torch-MLIR doesn't support tensors on a CUDA device, so the model is traced on the CPU, lowered via Torch-MLIR, and then, after compilation, run on GPUs through the CUDA, Vulkan, and ROCm backends.
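Roughly, the flow described here looks like the sketch below; the GPTQ model is stood in for by a placeholder module, and the Torch-MLIR / IREE steps are only indicated in comments since their exact invocation is outside the scope of this thread:

```python
import torch

class PlaceholderBlock(torch.nn.Module):
    """Stand-in for the real GPTQ model; kept on CPU for tracing."""
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(8, 8))

    def forward(self, x):
        # Cast the LHS to the weight dtype, as in the patched qlinear_cuda.
        return torch.matmul(x.to(self.weight.dtype), self.weight)

model = PlaceholderBlock().eval()
example_input = torch.randn(1, 8)  # illustrative shape

# Trace on CPU, since Torch-MLIR does not support tensors on a CUDA device.
# The traced graph is then lowered via Torch-MLIR, compiled with IREE, and
# the compiled artifact is run on the CUDA / Vulkan / ROCm backends.
traced = torch.jit.trace(model, example_input)
```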

@fxmarty
Collaborator

fxmarty commented Oct 20, 2023

That's neat! Have you been able to run AutoGPTQ this way? Is it competitive on, say, CUDA compared to the homemade kernel?

@vivekkhandelwal1
Contributor Author

> That's neat! Have you been able to run AutoGPTQ this way? Is it competitive on, say, CUDA compared to the homemade kernel?

Yeah! I have been able to run the falcon-180b-GPTQ on the CPU and the falcon-7b-GPTQ on the CPU and CUDA.

@vivekkhandelwal1
Contributor Author

We have been blocked on this patch for weeks. My changes in huggingface/transformers#26719 are also dependent on this PR getting merged. Please let me know if I need to make any further changes to get this in.

@fxmarty left a comment (Collaborator)

LGTM, thanks a lot!

Apologies for the delay and inconvenience; I am not maintaining the repo, so I am not actively checking the PRs, I just happen to have rights.

fxmarty merged commit e4b2493 into AutoGPTQ:main on Oct 20, 2023
@vivekkhandelwal1
Contributor Author

> LGTM, thanks a lot!
>
> Apologies for the delay and inconvenience; I am not maintaining the repo, so I am not actively checking the PRs, I just happen to have rights.

Thanks for merging this patch!

 out = out.half().reshape(out_shape)
 out = out + self.bias if self.bias is not None else out
-return out
+return out.to(x.dtype)
Collaborator

This introduces a bug when using the CUDA kernel in fp32.

Contributor Author

What kind of bug? Can you please explain?

Collaborator

When use_cuda_fp16=False, there is a cast x = x.to(torch.float32), which results in the output dtype being wrong with the change above. This is fixed in #382.
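A condensed sketch of the dtype bug being described, with the control flow simplified and the names hypothetical (not the actual kernel path):

```python
import torch

def forward_sketch(x: torch.Tensor, use_cuda_fp16: bool = False) -> torch.Tensor:
    # The caller passes fp16 activations and expects fp16 back.
    if not use_cuda_fp16:
        x = x.to(torch.float32)  # the kernel path rebinds x to fp32 ...
    out = x * 2.0                # placeholder for the quantized matmul
    # ... so `out.to(x.dtype)` now yields fp32 instead of the caller's
    # original dtype. #382 restores the intended behaviour.
    return out.to(x.dtype)

y = forward_sketch(torch.randn(2, 2, dtype=torch.float16))
print(y.dtype)  # torch.float32, not the expected torch.float16
```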

Contributor Author

Thanks @fxmarty!
