[BUG/FEATURE] Fix Sym=False, new checkpoint_format = gptq_v2 #640
Conversation
We no longer need
Latest quant test result (model: command-r-v01):

- bfloat16, vllm ppl: 1.8202
- sym=False checkpoint_format=gptq (v1), vllm ppl: 1.8313
- sym=False checkpoint_format=gptq_v2, autogptq ppl: 1.8282

vllm and hf/autogptq have slightly different log_prob so the numbers are not directly 1:1, and even vllm vs vllm is not an entirely safe comparison.
Removed all
@fxmarty Feel free to edit at will and revert the last commit if you want to go the conservative route. 528a8fc
@fxmarty So after all the back-and-forth with underflow/overflow, the current changes since your last review boil down to 3 changes:
Intel/auto-round by default uses:

```json
{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": false,
  "static_groups": false,
  "sym": false,
  "true_sequential": false,
  "model_name_or_path": null,
  "model_file_base_name": "model",
  "quant_method": "gptq",
  "checkpoint_format": "gptq",
  "meta": {
    "quantizer": "intel/auto-round:0.1",
    "iters": 10,
    "lr": 0.1,
    "minmax_lr": 0.1,
    "enable_minmax_tuning": true,
    "use_quant_input": true,
    "scale_dtype": "torch.float16"
  }
}
```
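For context, loading a checkpoint that ships a quantize_config.json like the one above does not change from the user's side; a minimal sketch, assuming a hypothetical local checkpoint path:

```python
from auto_gptq import AutoGPTQForCausalLM

# Hypothetical path to a checkpoint whose quantize_config.json matches the sample above.
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/auto-round-sym-false-checkpoint",
    device="cuda:0",
)
# Per this PR, a checkpoint_format="gptq" (v1) checkpoint is converted to the
# gptq_v2 layout in memory at load time; the files on disk are left untouched.
```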
Add autoround model config with sym=False:

```json
{
  "bits": 4,
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": false,
  "static_groups": false,
  "sym": false,
  "true_sequential": false,
  "model_name_or_path": null,
  "model_file_base_name": "model",
  "quant_method": "gptq",
  "checkpoint_format": "gptq",
  "meta": {
    "quantizer": "intel/auto-round:0.1",
    "packer": "autogptq:0.8.0.dev1",
    "iters": 20,
    "lr": 0.05,
    "minmax_lr": 0.05,
    "enable_minmax_tuning": true,
    "use_quant_input": true,
    "scale_dtype": "torch.float16"
  }
}
```
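To show how a config like this maps onto AutoGPTQ's quantization API, here is a minimal sketch; the `checkpoint_format` keyword is assumed to be the new option this PR exposes on the config, and the base model name is only illustrative:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Mirrors the sym=False config above; checkpoint_format is assumed to be a new
# BaseQuantizeConfig field from this PR, all other fields are long-standing options.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.01,
    desc_act=False,
    static_groups=False,
    sym=False,                    # asymmetric quantization, the point of this PR
    true_sequential=False,
    checkpoint_format="gptq_v2",  # or "gptq" (v1) for max compatibility
)

# Illustrative base model; any causal LM supported by AutoGPTQ works the same way.
model = AutoGPTQForCausalLM.from_pretrained("facebook/opt-125m", quantize_config)
```

Quantizing and saving then go through the usual `model.quantize(...)` and `model.save_quantized(...)` calls.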
Commit/Refactor 59be4b3: Meta tooling fingerprints are now split into separate quantizer and packer fields:

```
# sample auto-round meta field
"meta": {
  "quantizer": "intel/auto-round:0.1",
  "packer": "autogptq:0.8.0.dev1"
}

# sample autogptq meta field
"meta": {
  "quantizer": "autogptq:0.8.0.dev1"
}
```
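As an illustration of how a downstream tool might consume the split fingerprints, a small sketch; the field names come from the samples above, while the helper itself is hypothetical:

```python
import json

def read_tooling_fingerprints(quantize_config_path: str) -> dict:
    """Return the quantizer/packer fingerprints from a quantize_config.json meta block."""
    with open(quantize_config_path) as f:
        cfg = json.load(f)
    meta = cfg.get("meta", {})
    return {
        # Present in both autogptq- and auto-round-produced checkpoints.
        "quantizer": meta.get("quantizer"),
        # Only present when a different tool (e.g. autogptq) packed the weights.
        "packer": meta.get("packer"),
    }

# Example result: {'quantizer': 'intel/auto-round:0.1', 'packer': 'autogptq:0.8.0.dev1'}
```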
We found a significant discrepancy between the real quantized model and the QDQ model at W2G32 asym. However, for W2G32 sym and W4G128 asym there doesn't seem to be a severe issue, although sym is much worse than asym at W2G32 on the QDQ model. As this PR also supports the v2 format and we are not familiar with the details, we're unsure if we can test it directly or if we need to merge the PR in AutoRound.
@wenhuach21 Can you elaborate with more details? Which model/weights and code reproduce the quantization discrepancies, so that @qwopqwop200 can better look at the potential sym=True vs sym=False math discrepancies?
These are the results. We used auto-round to generate the QDQ model and the real AutoGPTQ model via `--deployment_device fake,gpu`, so we could evaluate both models. lm-eval 0.4.2 is used. Reference command (sym at W2G32):

`python3 main.py --model_name /data5/llama3_8b_instruct/ --bits 2 --group_size 32 --n_samples 512 --iters 200 --deployment_device fake,gpu --disable_eval --seqlen 512 --minmax_lr 0.01 --scale_dtype fp32 --train_bs 4 --output_dir "./tmp_signround"`
We also found a large discrepancy for phi-2 at W4G128 asym. Shall I create a new issue for this? I finally know the reason why autogptq sets sym as the default~~
@qwopqwop200 For various reasons my team has forked and refactored AutoGPTQ into the new GPTQModel project. This bug fix/feature PR has been merged there. I have done plenty of testing and validation for this PR and it works, not perfectly, but it works. Default save is v1 for max compat.
@wenhuach21 Please raise an issue at GPTQModel, as this PR has been merged there. I also want to merge the intel/auto-round integration into GPTQModel so that the package can depend directly on the auto-round package, do the intel auto-round quantization with lm_head support, and save to the gptq v1 format. Would really love for you to help out. Thanks.
@qwopqwop200 This is the rebase of your PR at #559 with some modifications. Should be ready soon after we verify quantization and inference and add some tests.

Reason For PR:

`sym=False` was practically unusable due to post-quantization avg_loss per layer/PPL vs `sym=True`. @qwopqwop200 fixed the bad/suboptimal math. Now `sym=False` will most likely match or decrease avg_loss/layer vs `sym=True` and improve post-quant PPL for many models.
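To make the sym=True vs sym=False difference concrete, a minimal, hedged sketch of per-group quantization error in the spirit of the GPTQ quantizer (not the PR's actual code): with a fixed mid-grid zero-point, `sym=True` wastes grid levels on weight groups that are not centered on zero, while `sym=False` derives the zero-point from the group's actual minimum.

```python
import torch

def quantize_group(w: torch.Tensor, bits: int = 4, sym: bool = False) -> torch.Tensor:
    """Quantize then dequantize one weight group; a minimal sketch of sym vs asym grids."""
    maxq = 2**bits - 1
    xmin, xmax = w.min(), w.max()
    if sym:
        # Symmetric: force a zero-centered range and pin the zero-point at mid-grid.
        xmax = torch.maximum(xmin.abs(), xmax)
        xmin = -xmax
        zero = torch.tensor((maxq + 1) / 2)
        scale = (xmax - xmin) / maxq
    else:
        # Asymmetric: the grid spans the group's actual [min, max] range.
        scale = (xmax - xmin).clamp(min=1e-8) / maxq
        zero = torch.round(-xmin / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, maxq)
    return (q - zero) * scale

# Example: a weight group not centered on zero. The asymmetric grid tracks the
# real range, so its reconstruction error is lower.
w = torch.randn(256) * 0.1 + 0.3
for sym in (True, False):
    err = (w - quantize_group(w, sym=sym)).pow(2).mean().item()
    print(f"sym={sym}: group MSE {err:.3e}")
```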
Core Changes:

- `main`: allow usable `sym=False` quantization and use `checkpoint_format=gptq_v2` to store the new checkpoint format. Compat runtime dynamic convert of all `checkpoint_format=gptq` to `gptq_v2` on load (see the sketch below).
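As a rough idea of what the on-load v1 to v2 conversion involves, a hedged sketch rather than the PR's actual implementation: historical `gptq` (v1) kernels add +1 to every stored zero-point at runtime, so converting a packed `qzeros` tensor to `gptq_v2` means baking that +1 into the stored values. The helper name and the unpack/repack loop are illustrative only.

```python
import torch

def qzeros_v1_to_v2(qzeros: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Hypothetical helper: fold the +1 offset that v1 kernels applied at runtime
    into the packed zero-points so a v2-style kernel can use them directly."""
    pack_factor = 32 // bits           # zero-points packed per int32
    mask = (1 << bits) - 1
    # Work in int64 on the raw 32-bit pattern to avoid signed-shift surprises.
    packed = qzeros.to(torch.int64) & 0xFFFFFFFF
    out = torch.zeros_like(packed)
    for i in range(pack_factor):
        shift = bits * i
        z = (packed >> shift) & mask   # unpack one zero-point lane
        z = (z + 1) & mask             # v2 stores the true zero-point (assumes no bit-width overflow)
        out |= z << shift              # repack the lane
    return out.to(torch.int32)         # back to the packed int32 storage dtype
```

Unpacking and repacking per lane keeps the sketch obviously correct; a production path could add a packed constant instead, at the cost of reasoning about carries between lanes.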
Misc Changes not directly related to sym=False code:

- Lower `nsamples` + low `damp=0.005` to speed up quants.
- Quant log (`layer #`, `module name`, `avg loss`, `duration`) stored in dict/slice and returned to user via `quant_log = model.quantize()`.
- Pass `quantize(quant_log=saved_quant_log)` to auto-generate an avg_loss diff while quantization is in progress (see the sketch after this list). Sample diff output in later messages of this discussion.
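A hedged sketch of how the returned quant log might be used; the record keys are taken from the list above but their exact names are assumptions, and the diff helper is hypothetical, not the PR's built-in auto-diff.

```python
# Assumed record shape, based on the fields listed above:
# {"layer": 0, "module": "self_attn.k_proj", "avg_loss": 0.0123, "duration": 1.7}

def diff_avg_loss(old_log, new_log):
    """Compare avg_loss per (layer, module) between two quantization runs."""
    old = {(r["layer"], r["module"]): r["avg_loss"] for r in old_log}
    for r in new_log:
        key = (r["layer"], r["module"])
        if key in old:
            delta = r["avg_loss"] - old[key]
            print(f"layer {key[0]} {key[1]}: avg_loss {r['avg_loss']:.4f} ({delta:+.4f} vs previous run)")

# Usage sketch (model is an AutoGPTQ model instance; calibration examples omitted):
# quant_log = model.quantize(examples)
# diff_avg_loss(saved_quant_log, quant_log)
```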
TODO:

- `sym=False` tests
- `gptq_v2` format
- `gptq` (v1) format using `sym=False` in this PR
PASSING TESTS:

- `checkpoint_format=gptq` (v1) `sym=False` consistently generates lower `avg_loss` than `sym=True`
- `sym=True` in PR generates the same math/`avg_loss` for layers as `sym=True` in `main`
- test_serialization.py
- test_quantization.py
- test_shared_loading.py
- test_awq_compatibility_generation.py (note: awq cache generated by `main` is not compatible with the PR; fixed with the version file name adding `v2`)
- test_q4.py

FAILING TESTS:

- test_triton.py (never got this to work on main)
- test_repacking.py (never got this to work on main)

Original PR #559 notes duplicated here for ref: