remove (zeors -= 1) #559
Conversation
thanks a lot @qwopqwop200 for looking into this, it looks to be a longstanding issue. Feel free to ping when you'd like a review.
I have tested on opt-125m and checked that all kernels work fine except for the old CUDA kernel.
@qwopqwop200 Can you share code or add a test for me to test this?

I think we need to add tests for:
@fxmarty I've done all the testing.
@fxmarty This PR really solves a huge problem: sym=False is unusable, and that is documented nowhere. Also, why is the CI failing?
Great work! Left a few comments.
The following does not pass on this PR (but does on main), while test_q4.py does pass (I suspect this is due to the overflow):

    CUDA_VISIBLE_DEVICES=0 pytest test_awq_compatibility_generation.py -s -vvvvvv
if self.now_format == "old":
    self.model = convert_new_checkpoint_format(
        self.model,
        True,
        self.quantize_config,
        self.kerenl_backend_type
    )
    self.now_format = "new"
This should be in the from_quantized method, not in forward or generate. Basically, there should be three cases (see the sketch after this list):

- is_legacy_format missing from the config or argument: assume is_legacy_format=True and update accordingly.
- is_legacy_format=True (either in config or argument): update accordingly.
- is_legacy_format=False: the new default, do nothing.
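A minimal sketch of how that could look inside from_quantized (hypothetical glue code, reusing this PR's convert_new_checkpoint_format helper):

# Resolve the flag, then convert legacy checkpoints exactly once at load time.
is_legacy = getattr(quantize_config, "is_legacy_format", None)
if is_legacy is None:
    # Case 1: flag missing from config and argument -> assume a legacy checkpoint.
    is_legacy = True
if is_legacy:
    # Case 2: legacy checkpoint -> convert to the new format on load.
    model = convert_new_checkpoint_format(model, True, quantize_config, kerenl_backend_type)
    quantize_config.is_legacy_format = False
# Case 3: is_legacy_format=False -> the new default, nothing to do.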
As this is already in from_quantized, I am not sure why it is here as well?
else:
    if use_qigen:
        submodule.zeros.data -= 1
    elif use_marlin:
        pass
    else:
        if quantize_config.bits == 2:
            submodule.qzeros.data -= 0b01010101010101010101010101010101
        elif quantize_config.bits == 3:
            submodule.qzeros.data[:, range(0, submodule.qzeros.data.shape[1], 3)] -= 0b00100100100100100100100100100100
            submodule.qzeros.data[:, range(1, submodule.qzeros.data.shape[1], 3)] -= 0b10010010010010010010010010010010
            submodule.qzeros.data[:, range(2, submodule.qzeros.data.shape[1], 3)] -= 0b01001001001001001001001001001001
        elif quantize_config.bits == 4:
            submodule.qzeros.data -= 0b00010001000100010001000100010001
        elif quantize_config.bits == 8:
            submodule.qzeros.data -= 0b00000001000000010000000100000001
        else:
            raise NotImplementedError("Only 2,3,4,8 bits are supported.")
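For bit-widths that divide 32, these constants are simply a 1 in the lowest bit of each packed field, so the single subtraction lowers every packed zero-point by 1 at once. A quick sanity check (illustrative, not part of the PR):

def ones_mask(bits: int) -> int:
    # A 1 in the lowest bit of each bits-wide field of a 32-bit word.
    return sum(1 << (bits * i) for i in range(32 // bits))

assert ones_mask(2) == 0b01010101010101010101010101010101
assert ones_mask(4) == 0b00010001000100010001000100010001
assert ones_mask(8) == 0b00000001000000010000000100000001
# 3 does not divide 32, so 3-bit fields straddle word boundaries; hence the
# three interleaved constants applied to every third column above.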
Why do we need this case? Inference should always be done with is_legacy_format=False.
@@ -90,6 +91,7 @@ class BaseQuantizeConfig(PushToHubMixin):
     model_name_or_path: Optional[str] = field(default=None)
     model_file_base_name: Optional[str] = field(default=None)
     awq_gemm_checkpoint: Optional[bool] = field(default=False)
+    new_checkpoint_format: Optional[bool] = field(default=False)
Can we name this is_legacy_format, with the default being True when loading models that don't have the config attribute, and defaulting to False when quantizing new models?
@@ -194,6 +199,7 @@ def to_dict(self):
     "model_file_base_name": self.model_file_base_name,
     "is_marlin_format": self.is_marlin_format,
     "quant_method": "gptq",
+    "new_checkpoint_format": self.new_checkpoint_format,
Let's call this is_legacy_format.
@@ -216,6 +222,8 @@ def __init__(
     injected_fused_attention: bool = False,
     injected_fused_mlp: bool = False,
     trainable: bool = False,
+    kerenl_backend_type: Optional[str] = None,
Maybe call this qlinear_kernel_name.
if device_map:
    self.model = remove_hook_from_module(self.model, recurse=True)
    self.model = simple_dispatch_model(self.model, device_map)
self.model.config.use_cache = forward_pass_use_cache

self._quantized = True
self.now_format = "new"
This should be handled in the config, no?
if self.quantize_config.new_checkpoint_format:
    logger.warning("New checkpoint format is enabled, the saved model is not supported by older versions of AutoGPTQ (<= 0.7.0).")

if not self.quantize_config.new_checkpoint_format and self.now_format == "new":
    self.model = convert_new_checkpoint_format(
        self.model,
        False,
        self.quantize_config,
        self.kerenl_backend_type
    )
    self.now_format = "old"
At this point we should always have self.quantize_config.is_legacy_format be False (if we always convert to the new format).
@@ -297,6 +346,7 @@ def pack_model(
         "using autotune_warmup will move model to GPU, make sure you have enough VRAM to load the whole model."
     )
     QuantLinear.warmup(model.to(CUDA_0), seqlen=model.seqlen)
+    return QuantLinear
Maybe return the name directly here.
if not quantize_config.new_checkpoint_format:
    model = convert_new_checkpoint_format(
        model,
        True,
        quantize_config,
        kerenl_backend_type
    )
Can you also add the argument is_legacy_format: Optional[bool] = None to from_quantized, which allows specifying it in case it is not in the config, defaulting first to the quantization config value if any, else to the user-specified value if any, and else to True?

For example, if somebody is quantizing models with an external library, the quantization config may not contain an is_legacy_format.
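A minimal sketch of that resolution order (hypothetical helper, not code from this PR):

from typing import Optional

def resolve_is_legacy_format(quantize_config, is_legacy_format: Optional[bool] = None) -> bool:
    # Quantization config value wins, then the user argument, then True.
    config_value = getattr(quantize_config, "is_legacy_format", None)
    if config_value is not None:
        return config_value
    if is_legacy_format is not None:
        return is_legacy_format
    return True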
if quantize_config.bits == 2:
    submodule.qzeros.data += 0b01010101010101010101010101010101
elif quantize_config.bits == 3:
    submodule.qzeros.data[:, range(0, submodule.qzeros.data.shape[1], 3)] += 0b00100100100100100100100100100100
    submodule.qzeros.data[:, range(1, submodule.qzeros.data.shape[1], 3)] += 0b10010010010010010010010010010010
    submodule.qzeros.data[:, range(2, submodule.qzeros.data.shape[1], 3)] += 0b01001001001001001001001001001001
elif quantize_config.bits == 4:
    submodule.qzeros.data += 0b00010001000100010001000100010001
elif quantize_config.bits == 8:
    submodule.qzeros.data += 0b00000001000000010000000100000001
This does not check for overflows, which is an issue; we used to check for overflows in the kernels. For example, adding 1 to a 4-bit field that already holds 1111 overflows: 1111 + 1 = 10000, so the field becomes 0000 and the carry spills into the neighboring packed value.
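A sketch of per-field arithmetic that at least keeps the carry from corrupting neighboring packed values (illustrative only, for bits in {2, 4, 8}; a saturated field still wraps to 0, which is exactly the open question raised here):

import torch

def add_one_per_field(qzeros: torch.Tensor, bits: int) -> torch.Tensor:
    # Unpack each bits-wide field of the int32 words, add 1 modulo 2**bits,
    # and repack. Unlike a plain `qzeros += mask`, an overflowing field
    # cannot carry into its neighbor.
    mask = (1 << bits) - 1
    out = torch.zeros_like(qzeros)
    for i in range(32 // bits):
        field = (qzeros >> (bits * i)) & mask
        field = (field + 1) & mask  # 1111 still wraps to 0000 here
        out |= field << (bits * i)
    return out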
I'm going to be busy for a while and won't be able to take care of this PR, so feel free to edit it yourself if you need to. For now, I've designed it to only use the modified kernel, which means it uses the current weight format during inference and converts to the old weight format when saving.
I think it would be good to forward-plan the vars that get introduced to quantize_config. With Marlin and the pending new version of the GPTQ v2 format, we need something generic. Right now with this PR we have separate booleans (is_marlin_format, new_checkpoint_format):

Change to something forward compatible, e.g. a single format property:

@fxmarty What do you think? I can cook up a PR to use the new format property and make sure it's backward compatible.
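One hypothetical shape for that suggestion (illustrative names, not from this PR): a single string field instead of a growing set of booleans, with older configs that lack the attribute treated as the legacy GPTQ format.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BaseQuantizeConfig:
    # e.g. "gptq" (legacy), "gptq_v2" (a new format), "marlin", ...
    checkpoint_format: Optional[str] = field(default=None)

    def resolved_checkpoint_format(self) -> str:
        # Backward compatibility: configs without the attribute are legacy.
        return self.checkpoint_format or "gptq"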
@qwopqwop200 No worries, thank you for the great work! @Qubitium Good suggestion yes, this makes more sense, maybe
See this PR: #640
- check work cuda old (there is a bug in AutoGPTQ's main unrelated to this PR, so it cannot be confirmed)

I am removing this line because it is not only computationally unnecessary, but it also makes sym=False impossible. However, it breaks backwards compatibility, so I am making the old save format the default.

Related PRs: #354, #325

I removed the following line: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_cuda.py#L169

This line is unnecessary, and it makes sym=False impossible. With sym=False I get a reduction in ppl (perplexity).
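A small numeric illustration of my reading of why that line breaks sym=False (assumption: the legacy format stores zero_point - 1 in an unsigned bits-wide field, and the kernels add the 1 back at dequantization):

bits = 4

# Symmetric quantization always uses the midpoint zero point, so storing
# zero - 1 stays inside the unsigned 4-bit range:
zero_sym = 2 ** (bits - 1)                    # 8
stored_sym = zero_sym - 1                     # 7 -> round-trips fine

# Asymmetric (sym=False) zero points can be 0; zero - 1 then underflows
# the unsigned field and wraps, corrupting the checkpoint:
zero_asym = 0
stored_asym = (zero_asym - 1) % (2 ** bits)   # 15 instead of -1

Storing the true zero point, as this PR's new format does, avoids the wrap entirely.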