
Fix Sym=False, new checkpoint_format = gptq_v2 #9

Merged: 69 commits into main on Jun 15, 2024

Conversation

Qubitium (Owner) commented on Jun 15, 2024

@qwopqwop200 This is the rebase of your PR at AutoGPTQ#559 with some modifications. Should be ready soon after we verify quantize, inference, and add some tests.

Reason For PR:

sym=False was practically unusable: post-quantization avg_loss per layer and PPL were clearly worse than with sym=True. @qwopqwop200 fixed the bad/suboptimal math. Now sym=False will most likely match or decrease avg_loss per layer vs sym=True and improve post-quant PPL for many models.

Core Changes:

  1. Rebase of the "remove (zeros -= 1)" PR AutoGPTQ/AutoGPTQ#559 onto main: make sym=False quantization usable and store new checkpoints with checkpoint_format=gptq_v2. For compatibility, checkpoints saved as checkpoint_format=gptq are dynamically converted to gptq_v2 at load time (see the sketch below).
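
For reference, here is a minimal sketch of the idea behind that load-time conversion (illustrative only, not this PR's actual implementation; the function name is made up). The old gptq (v1) format stores zero points with a -1 offset that the kernels undo at runtime, while gptq_v2 stores the true zero points:

```python
import torch

def convert_v1_zeros_to_v2(unpacked_zeros: torch.Tensor) -> torch.Tensor:
    # v1 checkpoints store (zero_point - 1); add the 1 back so gptq_v2 kernels
    # can use the zero points as-is, with no "+ 1" left in the dequant path.
    # Note: real qzeros tensors are bit-packed (e.g. eight 4-bit values per
    # int32), so an actual converter has to unpack, add 1, and repack.
    return unpacked_zeros + 1
```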

Misc Changes not directly related to sym=False code:

  1. Complete TODO: use accelerate 0.29.2 to load checkpoints.
  2. Consistency: move cohere/starcoder2 to the transformers 4.39.0 release check.
  3. Usability: catch the quant/torch error caused by low damp/nsamples ([BUG] torch._C._LinAlgError: linalg.cholesky always raised, AutoGPTQ/AutoGPTQ#572) and alert the user on how to fix it, as this happens much more frequently than I had expected. I ran into 2-3 instances of this error on multiple models during testing for this PR when using low nsamples + low damp=0.005 to speed up quants. (See the first sketch after this list.)
  4. Simplify: optimized the packing regression alert message/code shown to the user.
  5. Feature: Quant Stat Log 1/2: store per-layer quant stats (layer #, module name, avg loss, duration) in a dict/slice and return them to the user via quant_log = model.quantize().
  6. Feature: Quant Stat Log 2/2: pass a saved quant_log to quantize(quant_log=saved_quant_log) to generate an automatic avg_loss diff as quantization progresses (see the second sketch after this list). Sample diff output appears in later messages of this discussion.
  7. Usability: use tqdm for the layer loop so users get an estimate of remaining quantization time.
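
A hedged sketch of item 3 above (illustrative only; the exact message and config field names used by this PR may differ). torch._C._LinAlgError is the exception PyTorch raises when linalg.cholesky fails:

```python
import torch

try:
    quant_log = model.quantize(calibration_examples)  # model/examples are placeholders
except torch._C._LinAlgError as e:
    # The Hessian was not positive-definite, usually because of too few
    # calibration samples and/or too small a damp value.
    raise RuntimeError(
        "linalg.cholesky failed during quantization. Try increasing damp "
        "(e.g. damp_percent) and/or the number of calibration samples."
    ) from e
```

And a rough usage sketch of the quant-log features in items 5 and 6 (argument and variable names are assumptions based on the description above, not the verbatim API):

```python
# First run: quantize() returns per-layer stats (layer #, module name, avg loss, duration).
quant_log = model.quantize(calibration_examples)

# Later run: pass the saved log back in so an avg_loss diff against the previous
# run is reported per layer while quantization progresses.
new_quant_log = model.quantize(calibration_examples, quant_log=quant_log)
```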

TODO:

  • Add sym=False tests
  • Validate and fix failing tests
  • Failed: check whether third-party vllm/sglang kernels need modification for the new gptq_v2 format
  • Check whether third-party vllm/sglang kernels need modification for gptq (v1) checkpoints generated with sym=False by this PR

PASSING TESTS:

  • Compat: vllm/sglang are compatible with gptq (v1) checkpoints generated with this PR
  • Compat: load of checkpoint_format=gptq (v1)
  • sym=False consistently generates lower avg_loss than sym=True
  • Regression test: sym=True in this PR generates the same math/avg_loss per layer as sym=True in main
  • test_serialization.py
  • test_quantization.py
  • test_shared_loading.py
  • test_awq_compatibility_generation.py (note: the awq cache generated by main is not compatible with this PR; fixed by adding v2 to the cache version file name)
  • test_q4.py

FAILING TESTS:

  • Compat: vllm/sglang are not compatible with gptq_v2
  • test_triton.py (never got this to work on main)
  • test_repacking.py (never got this to work on main)

Original PR AutoGPTQ#559 notes, duplicated here for reference:

  • check that marlin works
  • check that exllama works
  • check that exllama2 works
  • check that qigen works
  • check that triton works
  • check that cuda works
  • check that cuda-old works (there is a bug in AutoGPTQ's main unrelated to this PR, so it cannot be confirmed)
  • check that cuda pytorch works
  • check that cuda-old pytorch works
  • check that old-version save is supported
  • check that old-version load is supported
  • check that new-version save is supported
  • check that new-version load is supported

I am removing this line because it is not only computationally unnecessary, but also makes sym=False impossible. However, this breaks backwards compatibility, so I am making the old save format the default.

Related PRs:

AutoGPTQ#354
AutoGPTQ#325

I removed the following line: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_cuda.py#L169. This line is unnecessary, and it makes sym=False impossible. With sym=False I get a reduction in PPL.

opt-125m (act-order)   Bits   group-size   Wikitext2
sym=True               4      128          29.875
sym=False              4      128          29.221

llama2 (act-order)     Bits   group-size   Wikitext2
sym=True               4      128          5.254
sym=False              4      128          5.214
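
To connect the numbers above to the new options, here is a hedged example of requesting a 4-bit, group-size 128, act-order, asymmetric quant (AutoGPTQ-style names are used purely for illustration; this repo's config fields, including where checkpoint_format is selected, may differ):

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig  # assumed AutoGPTQ-style API

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,  # act-order, matching the table above
    sym=False,      # asymmetric quantization, usable after this fix
)

# Per this PR, newly quantized models are saved as checkpoint_format=gptq_v2,
# while existing checkpoint_format=gptq models are converted to v2 at load time.
model = AutoGPTQForCausalLM.from_pretrained("facebook/opt-125m", quantize_config)
```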

Qubitium and others added 27 commits April 16, 2024 08:21
@Qubitium changed the title from "Sym false lm head" to "Fix Sym=False, new checkpoint_format = gptq_v2" on Jun 15, 2024
@Qubitium merged commit c80855e into main on Jun 15, 2024
@Qubitium deleted the sym-false-lm-head branch on June 17, 2024 02:28