Supporting uint4 inference of pre-quantized models in HPU #689
Conversation
* Supporting llama int4 quantization using AutoGPTQ
* Running only PT code (similar to cuda_old) on HPU
* Testing convert_from_int4
* Started cleanup
* Code cleanup
* Added weight reshape in preprocessing; added llama7b generation HPU test
* Changed reshape to match matmul (still not accurate) and fixed q4 test
* Fixing zero points
* Update pack function
* Fixed accuracy
* Uncommented exllama
* Marlin test fix + added HPU bias test
* Review comments
* Removed HPU pack until we implement it in HPU

Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>
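The commit list above mentions testing `convert_from_int4`, i.e. turning GPTQ-style packed int32 weights back into individual 4-bit values before the HPU matmul. A minimal sketch of that unpacking step is shown below; the function name and exact layout are illustrative assumptions, not the PR's actual API.

```python
import torch

def unpack_int4(qweight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Unpack a GPTQ-style packed int32 tensor into individual 4-bit values.

    Each int32 element holds 32 // bits = 8 quantized values, least-significant
    nibble first. NOTE: illustrative sketch only; the PR's real conversion
    kernel may use a different layout.
    """
    pack_factor = 32 // bits                               # 8 values per int32
    shifts = torch.arange(0, 32, bits, dtype=torch.int32)  # [0, 4, ..., 28]
    # Broadcast-shift each packed word, then mask off the low `bits` bits.
    unpacked = (qweight.unsqueeze(-1) >> shifts) & ((1 << bits) - 1)
    # Fold the unpack dimension into the last weight dimension.
    return unpacked.reshape(*qweight.shape[:-1], qweight.shape[-1] * pack_factor)
```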
@HolyFalafel For various reasons my team has forked AutoGPTQ into the GPTQModel project and would like to integrate this PR as soon as possible. If you can push this PR there as well, that would be great. We can also cherry-pick your commits and create a PR there, but we are running into a validation issue: how can we get our hands on an Intel Habana HPU for testing/validation? Is there an Intel cloud or team we can connect with to borrow an HPU that can run these validation tests? Thanks. I don't want to pollute the message space with topics unrelated to this repo, so if you can connect with me via a new issue at GPTQModel, that would be best.
Thank you!
@HolyFalafel This PR apparently breaks for people not having
@fxmarty working on it
* fix pack() thread regression via code
* stream pytest output
* backport h100 fixed marlin kernel from vllm
* Revert "backport h100 fixed marlin kernel from vllm" (reverts commit 8ac1b87)
* revert
* fix h100
* revert debug code
* now that h100 is validated, remove hopper check
* Supporting uint4 inference of pre-quantized models in HPU (AutoGPTQ#689)
  * Supporting llama uint4 quantization using AutoGPTQ (#1) (squashed commit list identical to the one above)
  * Added assert when g_idx is not trivial (#2)
* Update qlinear_hpu.py
* Update test_q4.py

Co-authored-by: Qubitium <417764+Qubitium@users.noreply.github.com>
Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
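One of the follow-up commits above adds an assert when `g_idx` is not trivial, i.e. when the model was quantized with activation reordering (`desc_act`), which the HPU path does not yet handle. A hedged sketch of such a guard is below; the function name is hypothetical and only the intent matches the commit log.

```python
import torch

def assert_trivial_g_idx(g_idx: torch.Tensor, group_size: int) -> None:
    """Reject act-order (desc_act) checkpoints.

    A "trivial" g_idx maps input channel i to quantization group i // group_size.
    Per the commit log, the HPU kernel only supports this layout; the exact
    check in the PR may differ (this is an illustrative sketch).
    """
    expected = torch.arange(g_idx.numel(), dtype=g_idx.dtype) // group_size
    if not torch.equal(g_idx, expected):
        raise ValueError("HPU kernel requires trivial g_idx (no desc_act).")
```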
Added native uint4 inference support on HPU, using a conversion kernel.
Currently we only support inference on a pre-loaded HF model.
This feature will be usable starting with Synapse v1.17.