
Fix Sym=False, new checkpoint_format = gptq_v2 #9

Merged: 69 commits into main on Jun 15, 2024

Conversation

Qubitium (Owner) commented on Jun 15, 2024

@qwopqwop200 This is the rebase of your PR at AutoGPTQ#559 with some modifications. Should be ready soon after we verify quantize, inference, and add some tests.

Reason For PR:

sym=False was practically unusable: post-quantization avg_loss per layer and PPL were clearly worse than with sym=True. @qwopqwop200 fixed the bad/suboptimal math. Now sym=False will most likely match or decrease avg_loss per layer vs sym=True and improve post-quant PPL for many models.

Core Changes:

  1. Rebase of the "remove (zeros -= 1)" PR AutoGPTQ/AutoGPTQ#559 onto main: make sym=False quantization usable and store new checkpoints with checkpoint_format=gptq_v2. For compatibility, checkpoints saved as checkpoint_format=gptq are dynamically converted to gptq_v2 at load time (see the sketch below).
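
For reference, here is a minimal sketch of the idea behind that load-time conversion (illustrative only, not this PR's actual implementation; the function name is made up). The old gptq (v1) format stores zero points with a -1 offset that the kernels undo at runtime, while gptq_v2 stores the true zero points:

```python
import torch

def convert_v1_zeros_to_v2(unpacked_zeros: torch.Tensor) -> torch.Tensor:
    # v1 checkpoints store (zero_point - 1); add the 1 back so gptq_v2 kernels
    # can use the zero points as-is, with no "+ 1" left in the dequant path.
    # Note: real qzeros tensors are bit-packed (e.g. eight 4-bit values per
    # int32), so an actual converter has to unpack, add 1, and repack.
    return unpacked_zeros + 1
```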

Misc Changes not directly related to sym=False code:

  1. Complete TODO: use accelerate 0.29.2 to load checkpoints.
  2. Consistency: move cohere/starcoder2 to the transformers 4.39.0 release check.
  3. Usability: catch the quant/torch error caused by low damp/nsamples ([BUG] torch._C._LinAlgError: linalg.cholesky always raised, AutoGPTQ/AutoGPTQ#572) and alert the user on how to fix it, as this happens much more frequently than I had expected. I ran into 2-3 instances of this error on multiple models during testing for this PR when using low nsamples + low damp=0.005 to speed up quants. (See the first sketch after this list.)
  4. Simplify: optimized the packing regression alert message/code shown to the user.
  5. Feature: Quant Stat Log 1/2: store per-layer quant stats (layer #, module name, avg loss, duration) in a dict/slice and return them to the user via quant_log = model.quantize().
  6. Feature: Quant Stat Log 2/2: pass a saved quant_log to quantize(quant_log=saved_quant_log) to generate an automatic avg_loss diff as quantization progresses (see the second sketch after this list). Sample diff output appears in later messages of this discussion.
  7. Usability: use tqdm for the layer loop so users get an estimate of remaining quantization time.
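
A hedged sketch of item 3 above (illustrative only; the exact message and config field names used by this PR may differ). torch._C._LinAlgError is the exception PyTorch raises when linalg.cholesky fails:

```python
import torch

try:
    quant_log = model.quantize(calibration_examples)  # model/examples are placeholders
except torch._C._LinAlgError as e:
    # The Hessian was not positive-definite, usually because of too few
    # calibration samples and/or too small a damp value.
    raise RuntimeError(
        "linalg.cholesky failed during quantization. Try increasing damp "
        "(e.g. damp_percent) and/or the number of calibration samples."
    ) from e
```

And a rough usage sketch of the quant-log features in items 5 and 6 (argument and variable names are assumptions based on the description above, not the verbatim API):

```python
# First run: quantize() returns per-layer stats (layer #, module name, avg loss, duration).
quant_log = model.quantize(calibration_examples)

# Later run: pass the saved log back in so an avg_loss diff against the previous
# run is reported per layer while quantization progresses.
new_quant_log = model.quantize(calibration_examples, quant_log=quant_log)
```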

TODO:

  • Add sym=False tests
  • Validate and fix failing tests
  • Failed: check whether third-party vllm/sglang kernels need modification for the new gptq_v2 format
  • Check whether third-party vllm/sglang kernels need modification for gptq (v1) checkpoints generated with sym=False by this PR

PASSING TESTS:

  • Compat: vllm/sglang are compatible with gptq (v1) checkpoints generated with this PR
  • Compat: load of checkpoint_format=gptq (v1)
  • sym=False consistently generates lower avg_loss than sym=True
  • Regression test: sym=True in this PR generates the same math/avg_loss per layer as sym=True in main
  • test_serialization.py
  • test_quantization.py
  • test_shared_loading.py
  • test_awq_compatibility_generation.py (note: the awq cache generated by main is not compatible with this PR; fixed by adding v2 to the cache version file name)
  • test_q4.py

FAILING TESTS:

  • Compat: vllm/sglang are not compatible with gptq_v2
  • test_triton.py (never got this to work on main)
  • test_repacking.py (never got this to work on main)

Original PR AutoGPTQ#559 notes, duplicated here for reference:

  • check that marlin works
  • check that exllama works
  • check that exllama2 works
  • check that qigen works
  • check that triton works
  • check that cuda works
  • check that cuda-old works (there is a bug in AutoGPTQ's main unrelated to this PR, so it cannot be confirmed)
  • check that cuda pytorch works
  • check that cuda-old pytorch works
  • check that old-version save is supported
  • check that old-version load is supported
  • check that new-version save is supported
  • check that new-version load is supported

I am removing this line because it is not only computationally unnecessary, but also makes sym=False impossible. However, this breaks backwards compatibility, so I am making the old save format the default.

Related PRs:

AutoGPTQ#354
AutoGPTQ#325

I removed the following line: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_cuda.py#L169. This line is unnecessary, and it makes sym=False impossible. With sym=False I get a reduction in PPL.

opt-125m (act-order)   Bits   group-size   Wikitext2
sym=True               4      128          29.875
sym=False              4      128          29.221

llama2 (act-order)     Bits   group-size   Wikitext2
sym=True               4      128          5.254
sym=False              4      128          5.214
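
To connect the numbers above to the new options, here is a hedged example of requesting a 4-bit, group-size 128, act-order, asymmetric quant (AutoGPTQ-style names are used purely for illustration; this repo's config fields, including where checkpoint_format is selected, may differ):

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig  # assumed AutoGPTQ-style API

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,  # act-order, matching the table above
    sym=False,      # asymmetric quantization, usable after this fix
)

# Per this PR, newly quantized models are saved as checkpoint_format=gptq_v2,
# while existing checkpoint_format=gptq models are converted to v2 at load time.
model = AutoGPTQForCausalLM.from_pretrained("facebook/opt-125m", quantize_config)
```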

Qubitium and others added 27 commits April 16, 2024 08:21
@Qubitium changed the title from "Sym false lm head" to "Fix Sym=False, new checkpoint_format = gptq_v2" on Jun 15, 2024
@Qubitium merged commit c80855e into main on Jun 15, 2024
@Qubitium deleted the sym-false-lm-head branch on June 17, 2024 02:28