AutoGPTQ or AutoGPTQ-bugfix? #57

Open
Alvant opened this issue Jan 22, 2024 · 7 comments
Comments

@Alvant
Contributor

Alvant commented Jan 22, 2024

Some time ago, the README linked to a "fixed version" of AutoGPTQ: AutoGPTQ-bugfix. However, the current README links to the original repo: AutoGPTQ.

So, does this mean that real quantization now works correctly with the original AutoGPTQ, and the fixed repo is no longer needed?

I am asking because, for example, the fix for qlinear_triton was the following (link1, link2):

# qlinear_triton.py
# ...

qweight = qweight.astype(np.int32)
self.qweight = torch.from_numpy(qweight)

# zeros -= 1  # This line removed in the fix
zeros = zeros.numpy().astype(np.uint32)
qzeros = np.zeros((zeros.shape[0], zeros.shape[1] // 32 * self.bits), dtype=np.uint32)
i = 0

# ...

However, AutoGPTQ still contains this zeros modification (link). So it seems that the original AutoGPTQ might still have some problems?..
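
To show concretely why that removed line matters, here is a minimal numpy sketch (my own toy code, not AutoGPTQ or vLLM code; the names are made up) of packing a 4-bit zero point with and without the offset, and then unpacking it the way a kernel that expects the offset would:

```python
# Minimal illustration (not AutoGPTQ/vLLM code): pack a 4-bit zero point with
# and without the "zeros -= 1" line, then unpack it the way a kernel that
# expects the offset would, i.e. by adding 1 back.
import numpy as np

bits = 4
mask = (1 << bits) - 1
zeros = np.array([[7]], dtype=np.uint32)  # the "true" asymmetric zero point

def pack(z, subtract_one):
    z = z.copy()
    if subtract_one:
        z -= 1               # the line that the bugfix repo removes
    return z & mask          # real code packs 32 // bits values per int32 word

def kernel_unpack(packed):
    return (packed + 1) & mask   # kernel assumes the -1 was applied at pack time

print(kernel_unpack(pack(zeros, subtract_one=True)))   # [[7]] -> zero point recovered
print(kernel_unpack(pack(zeros, subtract_one=False)))  # [[8]] -> off by one
```

So the packing code and the dequantization kernel only give consistent results if they agree on whether the offset is applied.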

@Alvant
Contributor Author

Alvant commented Jan 22, 2024

Actually, I also tried to quantize some models with AutoGPTQ from https://github.com/AutoGPTQ/AutoGPTQ, and it seemed like the quality was worse.

@Alvant
Contributor Author

Alvant commented Feb 2, 2024

Whether or not the PPL quality is better with AutoGPTQ-bugfix, the following is worth noting. If a checkpoint is saved with AutoGPTQ-bugfix, the model will not work properly with vLLM, because vLLM's GPTQ kernels seem to rely on this "zeros ± 1" trick: https://github.com/vllm-project/vllm/blob/main/csrc/quantization/gptq/q_gemm.cu#L172-L175.
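
To illustrate what I mean (a rough Python paraphrase of my reading of those CUDA lines, not the actual kernel; the values below are illustrative): the kernel appears to dequantize roughly as scale * (q - (z_stored + 1)), so a checkpoint that stores the raw zero point (the bugfix convention) ends up shifted by one quantization step:

```python
# Rough paraphrase in Python (not the actual CUDA kernel).
import numpy as np

scale, z_true = 0.1, 7          # illustrative values
q = np.arange(16)               # all possible 4-bit codes

# Mainline AutoGPTQ stores z - 1, and the kernel adds 1 back: consistent.
dequant_mainline_ckpt = scale * (q - ((z_true - 1) + 1))
# The bugfix repo stores z itself, but the kernel still adds 1: constant bias.
dequant_bugfix_ckpt = scale * (q - (z_true + 1))

print(np.max(np.abs(dequant_mainline_ckpt - dequant_bugfix_ckpt)))  # ~ scale
```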

@ChenMnZ
Collaborator

ChenMnZ commented Feb 5, 2024

AutoGPTQ-bugfix is OK. Sorry for the previous confusion: the official AutoGPTQ repo had merged the "zeros ± 1" solution before. However, the solution was reverted due to some incompatibility; please refer to AutoGPTQ/AutoGPTQ#354 for more details.

@Alvant
Contributor Author

Alvant commented Mar 2, 2024

@ChenMnZ Ok! Thank you.

Well, the picture became clearer, but still not quite 😅 Which is better: the fixed version or not? Why was there even a need for such a fix?

As far as I understand, the situation is as follows. AutoGPTQ assumes that quantization is symmetric. However, that may not be the case (OmniQuant uses asymmetric quantization by default). What is more, there is no way to tell AutoGPTQ's QuantLinear that the quantization is not symmetric.
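
To make the distinction concrete, here is a rough numpy sketch of the two schemes (my own toy code; the function and variable names are not from OmniQuant or AutoGPTQ):

```python
# Toy illustration of symmetric vs asymmetric weight quantization (numpy only).
import numpy as np

def quantize(w, bits=4, symmetric=True):
    qmax = 2**bits - 1
    if symmetric:
        scale = np.abs(w).max() / (qmax / 2)
        zero = (qmax + 1) // 2                  # fixed midpoint, e.g. 8 for 4 bits
    else:
        scale = (w.max() - w.min()) / qmax
        zero = int(np.round(-w.min() / scale))  # data-dependent zero point
    q = np.clip(np.round(w / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero

w = np.random.randn(16).astype(np.float32)
for sym in (True, False):
    q, scale, zero = quantize(w, symmetric=sym)
    err = np.abs(w - (q.astype(np.float32) - zero) * scale).max()
    print(f"symmetric={sym}, zero point={zero}, max reconstruction error={err:.4f}")
```

With the symmetric scheme the zero point is always the same fixed value, so packing code and kernels that hard-code an offset still behave consistently; with the asymmetric scheme the zero point is data-dependent, so any mismatch in how it is stored and unpacked directly corrupts the dequantized weights.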

All public GPTQ-packed real-quantized models that I have come across are symmetrically quantized (for example, Llama-2-13B by TheBloke).

Personally, I tested OmniQuant on the Phi-2 model. Symmetric quantization resulted in a good-quality GPTQ model, whereas the asymmetric one led to a broken GPTQ real-quant model.

So it seems that this "AutoGPTQ or AutoGPTQ-bugfix" question may be more of a "symmetric or asymmetric quantization" question. At least, it seems that the real-quant (GPTQ) backend may not always be compatible with the way a model is actually quantized.

P.S. Sorry for my late response 😅

@lqzzy

lqzzy commented Mar 19, 2024

> Whether or not the PPL quality is better with AutoGPTQ-bugfix, the following is worth noting. If a checkpoint is saved with AutoGPTQ-bugfix, the model will not work properly with vLLM, because vLLM's GPTQ kernels seem to rely on this "zeros ± 1" trick: https://github.com/vllm-project/vllm/blob/main/csrc/quantization/gptq/q_gemm.cu#L172-L175.

Could you please tell me how to make the model produced by running ./scripts/Llama-2/Llama-2-7b/w4a4.sh work properly with vLLM? I want to accelerate inference by combining W4A4 with vLLM.

@lqzzy

lqzzy commented Mar 19, 2024

@Alvant @ChenMnZ

@Alvant
Contributor Author

Alvant commented Mar 26, 2024

@lqzzy Hello! I am afraid I can't help you with this 😅 Personally, I have used vLLM only with W4A16 quantized models (and there were no problems with those; vLLM handles them). As far as I know, GPTQ does not quantize activations at all. For example, in the OmniQuant code there is also no way to obtain a GPTQ model with real-quantized weights and activations (OmniQuant uses GPTQ for real quantization).

GPTQ reduces model size, while vLLM speeds up inference (it accelerates even FP-precision models). So maybe there is no need to quantize activations if you use vLLM? However, if you really want to use OmniQuant W4A4 models, as far as I understand, there is no easy way to make them compatible with vLLM...
