Supporting uint4 inference of pre-quantized models in HPU #689
Conversation
* Supporting llama int4 quantization using AutoGPTQ
* Running only PT code (similar to cuda_old) on HPU
* Testing convert_from_int4
* Started cleanup
* Code cleanup
* Added weight reshape in preprocessing; added llama7b generation HPU test
* Changed reshape to match matmul (still not accurate) and fixed q4 test
* Fixing zero points
* Update pack function
* Fixed accuracy
* Uncommented exllama
* Marlin test fix + added HPU bias test
* Review comments
* Removed HPU pack until we implement it in HPU

Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>
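The commit list above mentions testing `convert_from_int4`, i.e. turning GPTQ-style packed int32 weights back into individual 4-bit values before the HPU matmul. A minimal sketch of that unpacking step is shown below; the function name and exact layout are illustrative assumptions, not the PR's actual API.

```python
import torch

def unpack_int4(qweight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Unpack a GPTQ-style packed int32 tensor into individual 4-bit values.

    Each int32 element holds 32 // bits = 8 quantized values, least-significant
    nibble first. NOTE: illustrative sketch only; the PR's real conversion
    kernel may use a different layout.
    """
    pack_factor = 32 // bits                               # 8 values per int32
    shifts = torch.arange(0, 32, bits, dtype=torch.int32)  # [0, 4, ..., 28]
    # Broadcast-shift each packed word, then mask off the low `bits` bits.
    unpacked = (qweight.unsqueeze(-1) >> shifts) & ((1 << bits) - 1)
    # Fold the unpack dimension into the last weight dimension.
    return unpacked.reshape(*qweight.shape[:-1], qweight.shape[-1] * pack_factor)
```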
@HolyFalafel For various reasons my team has forked AutoGPTQ into the GPTQModel project and would like to integrate this PR as soon as possible. If you can push this PR there as well, that would be great. We can also cherry-pick your commits and create a PR there, but we are running into a validation issue: how can we get our hands on an Intel Habana HPU for testing/validation? Is there an Intel cloud or team we can connect with to borrow an HPU that can run these validation tests? Thanks. I don't want to pollute the message space with topics unrelated to this repo, so if you can connect with me via a new issue at GPTQModel, that would be best.
Thank you!
@HolyFalafel This PR apparently breaks for people not having
@fxmarty working on it
* fix pack() thread regression via code
* stream pytest output
* backport h100 fixed marlin kernel from vllm
* Revert "backport h100 fixed marlin kernel from vllm" (reverts commit 8ac1b87)
* revert
* fix h100
* revert debug code
* now that h100 is validated, remove hopper check
* Supporting uint4 inference of pre-quantized models in HPU (AutoGPTQ#689)
  * Supporting llama uint4 quantization using AutoGPTQ (#1) (squashed commit list identical to the one above)
  * Added assert when g_idx is not trivial (#2)
* Update qlinear_hpu.py
* Update test_q4.py

Co-authored-by: Qubitium <417764+Qubitium@users.noreply.github.com>
Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
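One of the follow-up commits above adds an assert when `g_idx` is not trivial, i.e. when the model was quantized with activation reordering (`desc_act`), which the HPU path does not yet handle. A hedged sketch of such a guard is below; the function name is hypothetical and only the intent matches the commit log.

```python
import torch

def assert_trivial_g_idx(g_idx: torch.Tensor, group_size: int) -> None:
    """Reject act-order (desc_act) checkpoints.

    A "trivial" g_idx maps input channel i to quantization group i // group_size.
    Per the commit log, the HPU kernel only supports this layout; the exact
    check in the PR may differ (this is an illustrative sketch).
    """
    expected = torch.arange(g_idx.numel(), dtype=g_idx.dtype) // group_size
    if not torch.equal(g_idx, expected):
        raise ValueError("HPU kernel requires trivial g_idx (no desc_act).")
```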
Added native uint4 inference support on HPU, using a conversion kernel.
Currently we only support inference on a pre-loaded HF model.
This feature will be usable starting with Synapse v1.17.