
remove (zeors -= 1) #559

Closed · wants to merge 10 commits
Conversation

@qwopqwop200 (Collaborator) commented Feb 21, 2024

  • check work marlin
  • check work exllama
  • check work exllama2
  • check work qigen
  • check work triton
  • check work cuda
  • check work cuda old (there is a bug in AutoGPTQ's main branch, unrelated to this PR, so this one cannot be confirmed)
  • check work cuda pytorch
  • check work cuda old pytorch
  • check support old version save
  • check support old version load
  • check support new version save
  • check support new version load

I am removing this line because it is not only computationally unnecessary, but also makes sym=False impossible.
However, this breaks backwards compatibility, so I am making the old save format the default.

Related PRs:
#354
#325

I removed the following line: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_cuda.py#L169
This line is unnecessary, and it is what makes sym=False impossible.
With sym=False I get a reduction in perplexity:
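To illustrate why (a toy sketch, not the actual packing or kernel code): the old convention stores zeros - 1 in the packed qzeros and the kernels add 1 back when unpacking. With sym=True the zero-point is always 2**(bits-1), so the round trip is safe; with sym=False a zero-point of 0 is legal, wraps inside the packed field, and comes back wrong.

```python
bits = 4
mask = (1 << bits) - 1  # one packed zero-point field (0xF for 4-bit)

def legacy_roundtrip(zero: int) -> int:
    """Emulate the old convention: store (zero - 1) in a 4-bit field at pack
    time, then add 1 back when the kernel unpacks it."""
    stored = (zero - 1) & mask  # the `zeros -= 1` this PR removes
    return stored + 1           # what the kernel reconstructs

# sym=True: the zero-point is fixed at 2**(bits - 1) = 8 and round-trips fine.
print(legacy_roundtrip(8))  # -> 8

# sym=False: a group whose zero-point is 0 wraps to 15 in storage and comes
# back as 16, so that group is dequantized against the wrong zero-point.
print(legacy_roundtrip(0))  # -> 16
```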

| Model (act-order) | sym | Bits | group-size | Wikitext2 |
|---|---|---|---|---|
| opt-125m | True | 4 | 128 | 29.875 |
| opt-125m | False | 4 | 128 | 29.221 |
| llama2 | True | 4 | 128 | 5.254 |
| llama2 | False | 4 | 128 | 5.214 |

@qwopqwop200 marked this pull request as draft February 21, 2024 13:21
@fxmarty (Collaborator) commented Feb 21, 2024

Thanks a lot @qwopqwop200 for looking into this; it looks to be a longstanding issue. Feel free to ping when you'd like a review.

@qwopqwop200 marked this pull request as ready for review February 25, 2024 11:07
@qwopqwop200 (Collaborator, Author) commented Feb 25, 2024

I have tested on opt-125m and checked that all kernels work fine except for the cuda old kernel.
I don't know why, but the cuda old kernel is not working, independently of this PR.
All the other kernels are working fine, so I think it is a good time to get it reviewed now.
@fxmarty

@fxmarty (Collaborator) commented Feb 26, 2024

@qwopqwop200 Can you share code or add a test for me to test this?

@qwopqwop200 (Collaborator, Author) replied:

> @qwopqwop200 Can you share code or add a test for me to test this?

```python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    sym=True,
    desc_act=False,  # setting to False can significantly speed up inference, but perplexity may be slightly worse
    new_checkpoint_format=True
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized('./opt-125m-4bit/', device="cuda:0", disable_exllama=True, disable_exllamav2=True)

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
```

@fxmarty (Collaborator) commented Feb 26, 2024

I think we need to add tests for:

  • Previously quantized models still run fine with the modified kernels (test_q4.py is fine for that)
  • Newly quantized models with sym=True achieve perplexity comparable to that obtained without these changes under the same quantization settings / seed / etc.
  • Models quantized with sym=False achieve good perplexity (maybe higher than sym=True).
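A minimal sketch of the kind of perplexity check described in the list above, assuming the AutoGPTQ API used in the script earlier in this thread plus the Hugging Face datasets package; the helper, output path, and the 40.0 threshold are illustrative, not the repository's actual test code:

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

def wikitext2_ppl(model, tokenizer, seqlen=2048, device="cuda:0"):
    """Standard chunked perplexity on the wikitext-2 test split."""
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
    nsamples = enc.input_ids.shape[1] // seqlen
    nlls = []
    for i in range(nsamples):
        batch = enc.input_ids[:, i * seqlen : (i + 1) * seqlen].to(device)
        with torch.no_grad():
            loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen)).item()

def test_asymmetric_quantization_ppl():
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", use_fast=True)
    examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.")]
    cfg = BaseQuantizeConfig(bits=4, group_size=128, sym=False, desc_act=False)
    model = AutoGPTQForCausalLM.from_pretrained("facebook/opt-125m", cfg)
    model.quantize(examples)
    model.save_quantized("opt-125m-4bit-asym")  # illustrative output path

    model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit-asym", device="cuda:0")
    ppl = wikitext2_ppl(model, tokenizer)
    # With the fix, sym=False should land near the sym=True baseline (~30 on
    # opt-125m in the numbers reported below), not in the hundreds or thousands.
    assert ppl < 40.0
```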

@qwopqwop200 (Collaborator, Author) commented Mar 2, 2024

  • Previously quantized models still run fine with the modified kernels (verified with test_q4.py).
  • Models quantized with sym=False achieve good perplexity (maybe higher than sym=True).
  • Newly quantized models with sym=True achieve perplexity comparable to that obtained without these changes under the same quantization settings / seed / etc.
| opt-125m | Bits | group-size | Wikitext2 |
|---|---|---|---|
| AutoGPTQ: sym=True | 4 | 128 | 29.8402 |
| AutoGPTQ: sym=False | 4 | 128 | 1079.1795 |
| This PR: sym=True | 4 | 128 | 29.8402 |
| This PR: sym=False | 4 | 128 | 29.5254 |
| AutoGPTQ: desc_act=True, sym=True | 4 | 128 | 29.8747 |
| AutoGPTQ: desc_act=True, sym=False | 4 | 128 | 405.2904 |
| This PR: desc_act=True, sym=True | 4 | 128 | 29.8760 |
| This PR: desc_act=True, sym=False | 4 | 128 | 29.2197 |

@qwopqwop200 (Collaborator, Author) commented:

@fxmarty I've done all the testing.

@Qubitium (Contributor) commented Mar 7, 2024

@fxmarty This PR really solves a huge problem: sym=False is unusable, but that is not documented anywhere. Also, why is the CI failing?

@fxmarty (Collaborator) left a review:

Great work! Left a few comments.

CUDA_VISIBLE_DEVICES=0 pytest test_awq_compatibility_generation.py -s -vvvvvv does not pass on this PR (but does on main), but test_q4.py does indeed pass (I suspect this is due to the overflow).

Comment on lines +538 to +545:

```python
if self.now_format == "old":
    self.model = convert_new_checkpoint_format(
        self.model,
        True,
        self.quantize_config,
        self.kerenl_backend_type
    )
    self.now_format = "new"
```
@fxmarty:

This should be in the from_quantized method, not in forward or generate. Basically, there should be three cases:

  • is_legacy_format missing from the config or argument: assume is_legacy_format=True and update accordingly.
  • is_legacy_format=True (either in config or argument): update accordingly.
  • is_legacy_format=False: the new default, do nothing.
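A rough sketch of that resolution (is_legacy_format is the name proposed in this review; the PR itself currently uses new_checkpoint_format / now_format, and convert_new_checkpoint_format is the PR's conversion helper):

```python
def resolve_legacy_format(quantize_config, is_legacy_format_arg=None) -> bool:
    """Hypothetical helper for from_quantized mirroring the three cases above:
    use the config value when present, else the user-passed argument, else
    assume an old-style checkpoint."""
    config_value = getattr(quantize_config, "is_legacy_format", None)
    if config_value is not None:
        return config_value
    if is_legacy_format_arg is not None:
        return is_legacy_format_arg
    return True

# Inside from_quantized (sketch): convert once at load time so that forward()
# and generate() never have to care about the checkpoint layout.
#
# if resolve_legacy_format(quantize_config, is_legacy_format):
#     model = convert_new_checkpoint_format(model, True, quantize_config, kerenl_backend_type)
#     quantize_config.is_legacy_format = False
```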

@fxmarty:

As this is already in from_quantized, I am not sure why it is here as well?

Comment on lines +163 to +180:

```python
else:
    if use_qigen:
        submodule.zeros.data -= 1
    elif use_marlin:
        pass
    else:
        if quantize_config.bits == 2:
            submodule.qzeros.data -= 0b01010101010101010101010101010101
        elif quantize_config.bits == 3:
            submodule.qzeros.data[:,range(0,submodule.qzeros.data.shape[1],3)] -= 0b00100100100100100100100100100100
            submodule.qzeros.data[:,range(1,submodule.qzeros.data.shape[1],3)] -= 0b10010010010010010010010010010010
            submodule.qzeros.data[:,range(2,submodule.qzeros.data.shape[1],3)] -= 0b01001001001001001001001001001001
        elif quantize_config.bits == 4:
            submodule.qzeros.data -= 0b00010001000100010001000100010001
        elif quantize_config.bits == 8:
            submodule.qzeros.data -= 0b00000001000000010000000100000001
        else:
            raise NotImplementedError("Only 2,3,4,8 bits are supported.")
```
@fxmarty:

Why do we need this case? Inference should always be done with is_legacy_format=False.

```diff
@@ -90,6 +91,7 @@ class BaseQuantizeConfig(PushToHubMixin):
     model_name_or_path: Optional[str] = field(default=None)
     model_file_base_name: Optional[str] = field(default=None)
     awq_gemm_checkpoint: Optional[bool] = field(default=False)
+    new_checkpoint_format: Optional[bool] = field(default=False)
```
@fxmarty:

Can we name this is_legacy_format with the default being True when loading models that don't have the config attribute, and default to False when quantizing new models?

```diff
@@ -194,6 +199,7 @@ def to_dict(self):
             "model_file_base_name": self.model_file_base_name,
             "is_marlin_format": self.is_marlin_format,
             "quant_method": "gptq",
+            "new_checkpoint_format": self.new_checkpoint_format,
```
@fxmarty:

let's call this is_legacy_format

```diff
@@ -216,6 +222,8 @@ def __init__(
         injected_fused_attention: bool = False,
         injected_fused_mlp: bool = False,
         trainable: bool = False,
+        kerenl_backend_type: Optional[str] = None,
```
@fxmarty:

Maybe call this qlinear_kernel_name

```python
if device_map:
    self.model = remove_hook_from_module(self.model, recurse=True)
    self.model = simple_dispatch_model(self.model, device_map)
self.model.config.use_cache = forward_pass_use_cache

self._quantized = True
self.now_format = "new"
```
@fxmarty:

This should be handled in the config, no?

Comment on lines +662 to +672:

```python
if self.quantize_config.new_checkpoint_format:
    logger.warning("New checkpoint format is enabled, the saved model is not supported by older versions of AutoGPTQ(<= 0.7.0).")

if not self.quantize_config.new_checkpoint_format and self.now_format == "new":
    self.model = convert_new_checkpoint_format(
        self.model,
        False,
        self.quantize_config,
        self.kerenl_backend_type
    )
    self.now_format = "old"
```
@fxmarty:

At this point we should always have self.quantize_config.is_legacy_format to be False (if we always convert to the new format).

```diff
@@ -297,6 +346,7 @@ def pack_model(
             "using autotune_warmup will move model to GPU, make sure you have enough VRAM to load the whole model."
         )
         QuantLinear.warmup(model.to(CUDA_0), seqlen=model.seqlen)
+    return QuantLinear
```
@fxmarty:

maybe return the name directly here

Comment on lines +1359 to +1365:

```python
if not quantize_config.new_checkpoint_format:
    model = convert_new_checkpoint_format(
        model,
        True,
        quantize_config,
        kerenl_backend_type
    )
```
@fxmarty:

Can you also add an argument is_legacy_format: Optional[bool] = None to from_quantized that allows specifying it when it is not in the config, defaulting first to the quantization config value if any, else to the user-specified value if any, else to True?

For example, if somebody is quantizing models with an external library, the quantization config may not contain an is_legacy_format.
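A hypothetical call, assuming the suggested argument is added (it is not part of the current API, and the path is a placeholder):

```python
from auto_gptq import AutoGPTQForCausalLM

# An externally quantized checkpoint whose config lacks the format field:
# the caller states the layout explicitly instead of relying on the config.
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/external-gptq-model",   # placeholder path
    device="cuda:0",
    is_legacy_format=True,           # proposed argument, not yet in the API
)
```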

Comment on lines +151 to +160:

```python
if quantize_config.bits == 2:
    submodule.qzeros.data += 0b01010101010101010101010101010101
elif quantize_config.bits == 3:
    submodule.qzeros.data[:,range(0,submodule.qzeros.data.shape[1],3)] += 0b00100100100100100100100100100100
    submodule.qzeros.data[:,range(1,submodule.qzeros.data.shape[1],3)] += 0b10010010010010010010010010010010
    submodule.qzeros.data[:,range(2,submodule.qzeros.data.shape[1],3)] += 0b01001001001001001001001001001001
elif quantize_config.bits == 4:
    submodule.qzeros.data += 0b00010001000100010001000100010001
elif quantize_config.bits == 8:
    submodule.qzeros.data += 0b00000001000000010000000100000001
```
@fxmarty:

This does not check for overflows, which is an issue. We used to check for overflows in the kernels.

1111 will become 0000.
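To make the overflow concrete (a plain-Python illustration of the packed arithmetic above, not the kernel code): each 32-bit qzeros word packs eight 4-bit zero-points, and the constant adds 1 to every field with a single integer add, so a field that is already 1111 wraps to 0000 and its carry corrupts the neighbouring field.

```python
PLUS_ONE_4BIT = 0b00010001000100010001000100010001  # +1 in each 4-bit field

def unpack4(word: int) -> list[int]:
    """Split a 32-bit word into its eight packed 4-bit zero-points."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

# Pack some example zero-points, including one field already at the 4-bit max.
fields = [3, 7, 0, 15, 8, 1, 2, 4]
word = 0
for i, z in enumerate(fields):
    word |= z << (4 * i)

adjusted = (word + PLUS_ONE_4BIT) & 0xFFFFFFFF
print(unpack4(word))      # [3, 7, 0, 15, 8, 1, 2, 4]
print(unpack4(adjusted))  # [4, 8, 1, 0, 10, 2, 3, 5] -- the 15 wrapped to 0
                          # and its carry bumped the next field from 8 to 10
```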

@qwopqwop200 (Collaborator, Author) commented:

I'm going to be busy for a while and won't be able to take care of this PR, so feel free to edit it yourself if you need to.

For now, I've designed it to use only the modified kernels, which means the new weight format is used during inference and is converted back to the old weight format when saving (the current default).

@Qubitium (Contributor) commented:

I think it would be good to forward-plan the variables that get introduced into quantize_config.

With marlin and the pending new gptq v2 format, we need a generic format property, not a new flag for every format that may come in the future.

Right now, with this PR, we have:

```python
new_checkpoint_format: bool
is_marlin_format: bool
```

Change to something forward-compatible:

```python
format: str  # one of "gptq", "gptqv2", "marlin"
```

@fxmarty What do you think? I can cook up a PR to use the new format property and make sure it's backward compatible.

@fxmarty (Collaborator) commented Mar 20, 2024

@qwopqwop200 No worries, thank you for the great work!

@Qubitium Good suggestion, yes, this makes more sense; maybe checkpoint_format. I think the best way forward would probably be for you to branch off this branch and resubmit a PR with the changes. Thanks a lot!
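A possible shape for that field (a toy sketch of the idea only; not code from this PR or from the follow-up PR):

```python
from dataclasses import dataclass

CHECKPOINT_FORMATS = ("gptq", "gptqv2", "marlin")  # "gptq" = today's legacy layout

@dataclass
class QuantizeConfigSketch:
    """Stand-in for BaseQuantizeConfig with a single checkpoint_format string
    replacing per-format booleans like new_checkpoint_format / is_marlin_format."""
    bits: int = 4
    group_size: int = 128
    sym: bool = True
    checkpoint_format: str = "gptq"  # default keeps old checkpoints loadable

    def __post_init__(self):
        if self.checkpoint_format not in CHECKPOINT_FORMATS:
            raise ValueError(f"unknown checkpoint_format: {self.checkpoint_format!r}")
```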

@qwopqwop200 (Collaborator, Author) commented:

See this PR: #640
