
remove (zeors -= 1) #559

Closed · wants to merge 10 commits
Conversation

@qwopqwop200 (Collaborator) commented Feb 21, 2024

  • check work marlin
  • check work exllama
  • check work exllama2
  • check work qigen
  • check work triton
  • check work cuda
  • check work cuda old (there is a bug in AutoGPTQ's main branch, unrelated to this PR, so this one cannot be confirmed)
  • check work cuda pytorch
  • check work cuda old pytorch
  • check support old version save
  • check support old version load
  • check support new version save
  • check support new version load

I am removing this line because it is not only computationally unnecessary, but also makes sym=False impossible.
However, this breaks backwards compatibility, so I am making the old save format the default.

Related PRs:
#354
#325

I removed the following line: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_cuda.py#L169
This line is unnecessary, and it is what makes sym=False impossible.
With sym=False I get a reduction in perplexity:
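To illustrate why (a toy sketch, not the actual packing or kernel code): the old convention stores zeros - 1 in the packed qzeros and the kernels add 1 back when unpacking. With sym=True the zero-point is always 2**(bits-1), so the round trip is safe; with sym=False a zero-point of 0 is legal, wraps inside the packed field, and comes back wrong.

```python
bits = 4
mask = (1 << bits) - 1  # one packed zero-point field (0xF for 4-bit)

def legacy_roundtrip(zero: int) -> int:
    """Emulate the old convention: store (zero - 1) in a 4-bit field at pack
    time, then add 1 back when the kernel unpacks it."""
    stored = (zero - 1) & mask  # the `zeros -= 1` this PR removes
    return stored + 1           # what the kernel reconstructs

# sym=True: the zero-point is fixed at 2**(bits - 1) = 8 and round-trips fine.
print(legacy_roundtrip(8))  # -> 8

# sym=False: a group whose zero-point is 0 wraps to 15 in storage and comes
# back as 16, so that group is dequantized against the wrong zero-point.
print(legacy_roundtrip(0))  # -> 16
```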

| Model (act-order) | sym | Bits | group-size | Wikitext2 |
|---|---|---|---|---|
| opt-125m | True | 4 | 128 | 29.875 |
| opt-125m | False | 4 | 128 | 29.221 |
| llama2 | True | 4 | 128 | 5.254 |
| llama2 | False | 4 | 128 | 5.214 |

@qwopqwop200 marked this pull request as draft February 21, 2024 13:21
@fxmarty (Collaborator) commented Feb 21, 2024

Thanks a lot @qwopqwop200 for looking into this; it looks to be a longstanding issue. Feel free to ping when you'd like a review.

@qwopqwop200 marked this pull request as ready for review February 25, 2024 11:07
@qwopqwop200 (Collaborator, Author) commented Feb 25, 2024

I have tested on opt-125m and checked that all kernels work fine except for the cuda old kernel.
I don't know why, but the cuda old kernel is not working, independently of this PR.
All the other kernels are working fine, so I think it is a good time to get it reviewed now.
@fxmarty

@fxmarty (Collaborator) commented Feb 26, 2024

@qwopqwop200 Can you share code or add a test for me to test this?

@qwopqwop200 (Collaborator, Author) replied:

> @qwopqwop200 Can you share code or add a test for me to test this?

```python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    sym=True,
    desc_act=False,  # setting to False can significantly speed up inference, but perplexity may be slightly worse
    new_checkpoint_format=True
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized('./opt-125m-4bit/', device="cuda:0", disable_exllama=True, disable_exllamav2=True)

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
```

@fxmarty (Collaborator) commented Feb 26, 2024

I think we need to add tests for:

  • Previously quantized models still run fine with the modified kernels (test_q4.py is fine for that)
  • Newly quantized models with sym=True achieve perplexity comparable to that obtained without these changes under the same quantization settings / seed / etc.
  • Models quantized with sym=False achieve good perplexity (maybe higher than sym=True).
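A minimal sketch of the kind of perplexity check described in the list above, assuming the AutoGPTQ API used in the script earlier in this thread plus the Hugging Face datasets package; the helper, output path, and the 40.0 threshold are illustrative, not the repository's actual test code:

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

def wikitext2_ppl(model, tokenizer, seqlen=2048, device="cuda:0"):
    """Standard chunked perplexity on the wikitext-2 test split."""
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
    nsamples = enc.input_ids.shape[1] // seqlen
    nlls = []
    for i in range(nsamples):
        batch = enc.input_ids[:, i * seqlen : (i + 1) * seqlen].to(device)
        with torch.no_grad():
            loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen)).item()

def test_asymmetric_quantization_ppl():
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=True)
    examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.")]
    cfg = BaseQuantizeConfig(bits=4, group_size=128, sym=False, desc_act=False)
    model = AutoGPTQForCausalLM.from_pretrained("facebook/opt-125m", cfg)
    model.quantize(examples)
    model.save_quantized("opt-125m-4bit-asym")  # illustrative output path

    model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit-asym", device="cuda:0")
    ppl = wikitext2_ppl(model, tokenizer)
    # With the fix, sym=False should land near the sym=True baseline (~30 on
    # opt-125m in the numbers reported below), not in the hundreds or thousands.
    assert ppl < 40.0
```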

@qwopqwop200 (Collaborator, Author) commented Mar 2, 2024

  • Previously quantized models still run fine with the modified kernels (verified with test_q4.py).
  • Models quantized with sym=False achieve good perplexity (maybe higher than sym=True).
  • Newly quantized models with sym=True achieve perplexity comparable to that obtained without these changes under the same quantization settings / seed / etc.
| opt-125m | Bits | group-size | Wikitext2 |
|---|---|---|---|
| AutoGPTQ: sym=True | 4 | 128 | 29.8402 |
| AutoGPTQ: sym=False | 4 | 128 | 1079.1795 |
| This PR: sym=True | 4 | 128 | 29.8402 |
| This PR: sym=False | 4 | 128 | 29.5254 |
| AutoGPTQ: desc_act=True, sym=True | 4 | 128 | 29.8747 |
| AutoGPTQ: desc_act=True, sym=False | 4 | 128 | 405.2904 |
| This PR: desc_act=True, sym=True | 4 | 128 | 29.8760 |
| This PR: desc_act=True, sym=False | 4 | 128 | 29.2197 |

@qwopqwop200 (Collaborator, Author) commented:

@fxmarty I've done all the testing.

@Qubitium (Contributor) commented Mar 7, 2024

@fxmarty This PR really solves a huge problem: sym=False is unusable, but that is not documented anywhere. Also, why is the CI failing?

@fxmarty (Collaborator) left a review:

Great work! Left a few comments.

CUDA_VISIBLE_DEVICES=0 pytest test_awq_compatibility_generation.py -s -vvvvvv does not pass on this PR (but does on main), but test_q4.py does indeed pass (I suspect this is due to the overflow).

Comment on lines +538 to +545:

```python
if self.now_format == "old":
    self.model = convert_new_checkpoint_format(
        self.model,
        True,
        self.quantize_config,
        self.kerenl_backend_type
    )
    self.now_format = "new"
```
@fxmarty:

This should be in the from_quantized method, not in forward or generate. Basically, there should be three cases:

  • is_legacy_format missing from the config or argument: assume is_legacy_format=True and update accordingly.
  • is_legacy_format=True (either in config or argument): update accordingly.
  • is_legacy_format=False: the new default, do nothing.
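A rough sketch of that resolution (is_legacy_format is the name proposed in this review; the PR itself currently uses new_checkpoint_format / now_format, and convert_new_checkpoint_format is the PR's conversion helper):

```python
def resolve_legacy_format(quantize_config, is_legacy_format_arg=None) -> bool:
    """Hypothetical helper for from_quantized mirroring the three cases above:
    use the config value when present, else the user-passed argument, else
    assume an old-style checkpoint."""
    config_value = getattr(quantize_config, "is_legacy_format", None)
    if config_value is not None:
        return config_value
    if is_legacy_format_arg is not None:
        return is_legacy_format_arg
    return True

# Inside from_quantized (sketch): convert once at load time so that forward()
# and generate() never have to care about the checkpoint layout.
#
# if resolve_legacy_format(quantize_config, is_legacy_format):
#     model = convert_new_checkpoint_format(model, True, quantize_config, kerenl_backend_type)
#     quantize_config.is_legacy_format = False
```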

@fxmarty:

As this is already in from_quantized, I am not sure why it is here as well?

Comment on lines +163 to +180:

```python
else:
    if use_qigen:
        submodule.zeros.data -= 1
    elif use_marlin:
        pass
    else:
        if quantize_config.bits == 2:
            submodule.qzeros.data -= 0b01010101010101010101010101010101
        elif quantize_config.bits == 3:
            submodule.qzeros.data[:,range(0,submodule.qzeros.data.shape[1],3)] -= 0b00100100100100100100100100100100
            submodule.qzeros.data[:,range(1,submodule.qzeros.data.shape[1],3)] -= 0b10010010010010010010010010010010
            submodule.qzeros.data[:,range(2,submodule.qzeros.data.shape[1],3)] -= 0b01001001001001001001001001001001
        elif quantize_config.bits == 4:
            submodule.qzeros.data -= 0b00010001000100010001000100010001
        elif quantize_config.bits == 8:
            submodule.qzeros.data -= 0b00000001000000010000000100000001
        else:
            raise NotImplementedError("Only 2,3,4,8 bits are supported.")
```
@fxmarty:

Why do we need this case? Inference should always be done with is_legacy_format=False.

```diff
@@ -90,6 +91,7 @@ class BaseQuantizeConfig(PushToHubMixin):
     model_name_or_path: Optional[str] = field(default=None)
     model_file_base_name: Optional[str] = field(default=None)
     awq_gemm_checkpoint: Optional[bool] = field(default=False)
+    new_checkpoint_format: Optional[bool] = field(default=False)
```
@fxmarty:

Can we name this is_legacy_format with the default being True when loading models that don't have the config attribute, and default to False when quantizing new models?

```diff
@@ -194,6 +199,7 @@ def to_dict(self):
             "model_file_base_name": self.model_file_base_name,
             "is_marlin_format": self.is_marlin_format,
             "quant_method": "gptq",
+            "new_checkpoint_format": self.new_checkpoint_format,
```
@fxmarty:

let's call this is_legacy_format

```diff
@@ -216,6 +222,8 @@ def __init__(
         injected_fused_attention: bool = False,
         injected_fused_mlp: bool = False,
         trainable: bool = False,
+        kerenl_backend_type: Optional[str] = None,
```
@fxmarty:

Maybe call this qlinear_kernel_name

```python
if device_map:
    self.model = remove_hook_from_module(self.model, recurse=True)
    self.model = simple_dispatch_model(self.model, device_map)
self.model.config.use_cache = forward_pass_use_cache

self._quantized = True
self.now_format = "new"
```
@fxmarty:

This should be handled in the config, no?

Comment on lines +662 to +672:

```python
if self.quantize_config.new_checkpoint_format:
    logger.warning("New checkpoint format is enabled, the saved model is not supported by older versions of AutoGPTQ(<= 0.7.0).")

if not self.quantize_config.new_checkpoint_format and self.now_format == "new":
    self.model = convert_new_checkpoint_format(
        self.model,
        False,
        self.quantize_config,
        self.kerenl_backend_type
    )
    self.now_format = "old"
```
@fxmarty:

At this point we should always have self.quantize_config.is_legacy_format to be False (if we always convert to the new format).

```diff
@@ -297,6 +346,7 @@ def pack_model(
             "using autotune_warmup will move model to GPU, make sure you have enough VRAM to load the whole model."
         )
         QuantLinear.warmup(model.to(CUDA_0), seqlen=model.seqlen)
+    return QuantLinear
```
@fxmarty:

maybe return the name directly here

Comment on lines +1359 to +1365:

```python
if not quantize_config.new_checkpoint_format:
    model = convert_new_checkpoint_format(
        model,
        True,
        quantize_config,
        kerenl_backend_type
    )
```
@fxmarty:

Can you also add an argument is_legacy_format: Optional[bool] = None to from_quantized that allows specifying it when it is not in the config, defaulting first to the quantization config value if any, else to the user-specified value if any, else to True?

For example, if somebody is quantizing models with an external library, the quantization config may not contain an is_legacy_format.
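A hypothetical call, assuming the suggested argument is added (it is not part of the current API, and the path is a placeholder):

```python
from auto_gptq import AutoGPTQForCausalLM

# An externally quantized checkpoint whose config lacks the format field:
# the caller states the layout explicitly instead of relying on the config.
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/external-gptq-model",   # placeholder path
    device="cuda:0",
    is_legacy_format=True,           # proposed argument, not yet in the API
)
```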

Comment on lines +151 to +160:

```python
if quantize_config.bits == 2:
    submodule.qzeros.data += 0b01010101010101010101010101010101
elif quantize_config.bits == 3:
    submodule.qzeros.data[:,range(0,submodule.qzeros.data.shape[1],3)] += 0b00100100100100100100100100100100
    submodule.qzeros.data[:,range(1,submodule.qzeros.data.shape[1],3)] += 0b10010010010010010010010010010010
    submodule.qzeros.data[:,range(2,submodule.qzeros.data.shape[1],3)] += 0b01001001001001001001001001001001
elif quantize_config.bits == 4:
    submodule.qzeros.data += 0b00010001000100010001000100010001
elif quantize_config.bits == 8:
    submodule.qzeros.data += 0b00000001000000010000000100000001
```
@fxmarty:

This does not check for overflows, which is an issue. We used to check for overflows in the kernels.

1111 will become 0000.
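To make the overflow concrete (a plain-Python illustration of the packed arithmetic above, not the kernel code): each 32-bit qzeros word packs eight 4-bit zero-points, and the constant adds 1 to every field with a single integer add, so a field that is already 1111 wraps to 0000 and its carry corrupts the neighbouring field.

```python
PLUS_ONE_4BIT = 0b00010001000100010001000100010001  # +1 in each 4-bit field

def unpack4(word: int) -> list[int]:
    """Split a 32-bit word into its eight packed 4-bit zero-points."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

# Pack some example zero-points, including one field already at the 4-bit max.
fields = [3, 7, 0, 15, 8, 1, 2, 4]
word = 0
for i, z in enumerate(fields):
    word |= z << (4 * i)

adjusted = (word + PLUS_ONE_4BIT) & 0xFFFFFFFF
print(unpack4(word))      # [3, 7, 0, 15, 8, 1, 2, 4]
print(unpack4(adjusted))  # [4, 8, 1, 0, 10, 2, 3, 5] -- the 15 wrapped to 0
                          # and its carry bumped the next field from 8 to 10
```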

@qwopqwop200 (Collaborator, Author) commented:

I'm going to be busy for a while and won't be able to take care of this PR, so feel free to edit it yourself if you need to.

For now, I've designed it to use only the modified kernels, which means the new weight format is used during inference and is converted back to the old weight format when saving (the current default).

@Qubitium (Contributor) commented:

I think it would be good to forward-plan the variables that get introduced into quantize_config.

With marlin and the pending new gptq v2 format, we need a generic format property, not a new flag for every format that may come in the future.

Right now, with this PR, we have:

```python
new_checkpoint_format: bool
is_marlin_format: bool
```

Change to something forward-compatible:

```python
format: str  # one of "gptq", "gptqv2", "marlin"
```

@fxmarty What do you think? I can cook up a PR to use the new format property and make sure it's backward compatible.

@fxmarty (Collaborator) commented Mar 20, 2024

@qwopqwop200 No worries, thank you for the great work!

@Qubitium Good suggestion, yes, this makes more sense; maybe checkpoint_format. I think the best way forward would probably be for you to branch off this branch and resubmit a PR with the changes. Thanks a lot!
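A possible shape for that field (a toy sketch of the idea only; not code from this PR or from the follow-up PR):

```python
from dataclasses import dataclass

CHECKPOINT_FORMATS = ("gptq", "gptqv2", "marlin")  # "gptq" = today's legacy layout

@dataclass
class QuantizeConfigSketch:
    """Stand-in for BaseQuantizeConfig with a single checkpoint_format string
    replacing per-format booleans like new_checkpoint_format / is_marlin_format."""
    bits: int = 4
    group_size: int = 128
    sym: bool = True
    checkpoint_format: str = "gptq"  # default keeps old checkpoints loadable

    def __post_init__(self):
        if self.checkpoint_format not in CHECKPOINT_FORMATS:
            raise ValueError(f"unknown checkpoint_format: {self.checkpoint_format!r}")
```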

@qwopqwop200 (Collaborator, Author) commented:

See this PR: #640
