
Supporting uint4 inference of pre-quantized models in HPU #689

Merged
3 commits merged into AutoGPTQ:main on Jun 24, 2024

Conversation

HolyFalafel
Contributor

Added native support in HPU, using a conversion kernel.
Currently, only inference on a preloaded HF model is supported.
This feature will be usable starting with Synapse v1.17.

  • Supporting llama uint4 inference using AutoGPTQ in HPU
  • Removed the hpu pack function until it is implemented in HPU
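For readers of this thread: the approach described above avoids a custom int4 matmul kernel; the packed uint4 weights are converted to the activation dtype on the HPU and fed into a regular matmul. A minimal sketch of that forward path, assuming the Habana conversion op is exposed as `torch.ops.hpu.convert_from_uint4` (the op name and signature here are assumptions based on this PR's description, not a documented API):

```python
import torch

def hpu_uint4_forward(x, qweight, scales, qzeros, bias=None):
    # Sketch only: dequantize the packed uint4 weights to the activation
    # dtype on-device, then run a plain matmul. The op name and signature
    # below are assumed, not guaranteed by habana_frameworks docs.
    weight = torch.ops.hpu.convert_from_uint4(qweight, scales, qzeros, x.dtype)
    out = torch.matmul(x, weight)
    if bias is not None:
        out = out + bias
    return out
```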

HolyFalafel and others added 2 commits June 11, 2024 10:18
* Supporting llama int4 quantization using AutoGPTQ

* Running only PT code (similar to cuda_old) on HPU

* Testing convert_from_int4

* Started cleanup

* code cleanup

* Added weight reshape in preprocessing
Added llama7b generation hpu test

* Changed reshape to match matmul (still not accurate) and fixed q4 test

* Fixing zero points

* Update pack function

* Fixed accuracy

* Uncommented exllama

* Marlin test fix + added hpu bias test

* Review comments

* Removed hpu pack until we implement it in HPU

---------

Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>
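Several of the commits above ("Running only PT code (similar to cuda_old) on HPU", "Added weight reshape in preprocessing", "Fixing zero points") revolve around unpacking and dequantizing the GPTQ int4 layout in plain PyTorch. A rough sketch of that step, assuming the usual GPTQ packing of eight 4-bit values per int32 along the input dimension; this is illustrative, not the exact code from the PR:

```python
import torch

def unpack_uint4(qweight: torch.Tensor) -> torch.Tensor:
    # qweight: (in_features // 8, out_features) int32, with eight 4-bit
    # values packed into each element along dim 0 (typical GPTQ layout).
    shifts = torch.arange(0, 32, 4, device=qweight.device)
    unpacked = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
    return unpacked.reshape(-1, qweight.shape[1])  # (in_features, out_features)

def dequantize(unpacked, scales, zeros, group_size):
    # scales, zeros: (in_features // group_size, out_features); each group
    # of `group_size` input rows shares one scale and one zero point.
    groups = torch.arange(unpacked.shape[0], device=unpacked.device) // group_size
    return (unpacked - zeros[groups]) * scales[groups]
```

The dequantized floating-point weight can then go through a regular torch.matmul, which is what a cuda_old-style pure-PyTorch path amounts to.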
@HolyFalafel HolyFalafel marked this pull request as draft June 17, 2024 09:15
@HolyFalafel HolyFalafel marked this pull request as ready for review June 20, 2024 08:04
@Qubitium
Collaborator

Qubitium commented Jun 20, 2024

@HolyFalafel For various reasons my team has forked AutoGPTQ into the GPTQModel project, and we would like to integrate this PR as soon as possible. If you can push this PR there as well, that would be great. We can also cherry-pick your commits and create a PR there, but we are running into a validation issue: how can we get our hands on an Intel Habana HPU for testing/validation? Is there an Intel cloud or team we can contact to borrow an HPU that can run these validation tests? Thanks. I don't want to pollute this repo with unrelated messages, so if you can connect with me via a new issue at GPTQModel, that would be best.

@fxmarty
Collaborator

fxmarty commented Jun 24, 2024

Thank you!

@fxmarty fxmarty merged commit b57bea0 into AutoGPTQ:main Jun 24, 2024
@fxmarty
Collaborator

fxmarty commented Jun 28, 2024

@HolyFalafel This PR apparently breaks for users who do not have habana_frameworks installed. It would be awesome if you could submit a fix! #695

@HolyFalafel
Contributor Author

@fxmarty working on it
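The usual shape of such a fix is to make the habana_frameworks import optional so that non-HPU installs are unaffected; a minimal sketch (the flag name is illustrative, and the actual patch for #695 may differ):

```python
try:
    import habana_frameworks.torch.core  # noqa: F401  # HPU runtime
    HPU_AVAILABLE = True
except ImportError:
    # No HPU stack installed; keep the HPU QuantLinear path disabled.
    HPU_AVAILABLE = False
```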

HolyFalafel added a commit to HabanaAI/AutoGPTQ that referenced this pull request on Jul 4, 2024
* fix pack() thread regression via code

* stream pytest output

* backport h100 fixed marlin kernel from vllm

* Revert "backport h100 fixed marlin kernel from vllm"

This reverts commit 8ac1b87.

* revert

* fix h100

* revert debug code

* now that h100 is validated, remove hopper check

* Supporting uint4 inference of pre-quantized models in HPU (AutoGPTQ#689)

* Supporting llama uint4 quantization using AutoGPTQ (#1)

(squashed history identical to the commit list quoted above, including the same co-author credit)

* Added assert when g_idx is not trivial (#2)

---------

Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>

* Update qlinear_hpu.py

* Update test_q4.py

---------

Co-authored-by: Qubitium <417764+Qubitium@users.noreply.github.com>
Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
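On the "Added assert when g_idx is not trivial" commit above: with act-order (desc_act), g_idx reorders input rows into arbitrary quantization groups, which the HPU conversion path does not handle, so only the trivial ordering is accepted. A hedged sketch of such a check, with the helper name and message invented for illustration:

```python
import torch

def assert_trivial_g_idx(g_idx: torch.Tensor, group_size: int) -> None:
    # Trivial ordering: input row i belongs to group i // group_size.
    expected = torch.arange(g_idx.numel(), device=g_idx.device) // group_size
    assert torch.equal(g_idx, expected.to(g_idx.dtype)), \
        "HPU uint4 path does not support act-order (non-trivial g_idx)"
```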