Support loading sharded quantized checkpoints. #425
Conversation
Thank you @LaaZa for the support. Could you add a test to load sharded quantized checkpoints?
I guess, but it would be good to have a tiny sharded model for a test. Also, there are differences between a local model and a hub model (the actual loading is the same). So what would you suggest?
@LaaZa Yes, it would be good, but I couldn't find such a model. I only found https://huggingface.co/ranchlai/Llama-2-70b-chat-gptq-4bit-128g/tree/main & https://huggingface.co/TheBloke/Falcon-180B-Chat-GPTQ/tree/main Maybe just having a test loading one of these is fine.
Hi guys, I just made this, if it's of use: https://huggingface.co/TheBlokeAI/llama-68m-GPTQ-sharded It's Llama 68M, sharded with max_shard_size="10MB". Though only one shard is actually 10MB and the other two are bigger - not sure why that is. Total weights size is 104MiB. I used Transformers GPTQ to create it, as I don't have the PR set up yet. But I think that's a better test anyway, as it confirms AutoGPTQ can load Transformers-produced GPTQs (I'm sure it can).
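For context, a minimal sketch of how such a tiny sharded model can be produced with Transformers; the base model name and output path are assumptions, and the GPTQ quantization step is omitted here. The uneven shard sizes are likely because Transformers never splits a single tensor across shards, so any tensor larger than max_shard_size (for example the embedding and lm_head matrices, even in a 68M model) ends up in an oversized shard.

```python
# Sketch only: produce a small sharded checkpoint for testing.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("JackFram/llama-68m")  # assumed tiny base model
model.save_pretrained(
    "llama-68m-sharded",
    max_shard_size="10MB",     # split the weights into several small shards
    safe_serialization=True,   # write .safetensors files plus an index.json
)
```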
I just made a sharded tinyllama, but it also fails loading. Edit: Okay, the one @TheBloke made works.
Add sharded loading test case.
@fxmarty added a test and fixed the hub loading. Is the test okay? It's not very comprehensive; I can add more permutations if you want.
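For reference, a minimal sketch of what such a sharded-loading test might look like (not the exact test added in this PR); it uses the tiny sharded repo mentioned above, and the device and arguments are assumptions.

```python
import unittest

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM


class TestShardedLoading(unittest.TestCase):
    def test_load_sharded_from_hub(self):
        model_id = "TheBlokeAI/llama-68m-GPTQ-sharded"
        # Loading resolves the index.json and pulls in the individual shards.
        model = AutoGPTQForCausalLM.from_quantized(
            model_id,
            device="cuda:0",
            use_safetensors=True,
        )
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
        output = model.generate(**inputs, max_new_tokens=8)
        # The model should produce at least one token beyond the prompt.
        self.assertGreater(output.shape[-1], inputs["input_ids"].shape[-1])
```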
Will this go into 0.6, @fxmarty? Would love to have sharding support in AutoGPTQ.
# Conflicts:
#   auto_gptq/modeling/_base.py
I merged with main.
#364 unfortunately had some issues. I think the produced models wouldn't work with fused attention for some reason.
Thanks @LaaZa for your quick response. I've seen @TheBloke uploading GPTQ models with sharded safetensors files instead of a single file. Is this different from the way he does it? (Maybe he does it as a resharding step after quantization.)
I do it with Transformers, which supports making GPTQs with sharding. Or occasionally, if I accidentally make a >50GB GPTQ with AutoGPTQ, then yes, I post-shard before upload. But if I know a file will be >50GB, e.g. for 120B models, I will use Transformers so I get the sharding.

When I make GPTQs with Transformers, I set the shard size to 49GB. This means it only shards if it has to for the 50GB limit; otherwise it uploads one file, so it remains compatible with AutoGPTQ. It'd be great to have the sharded saving in AutoGPTQ so I could use that always.

I wonder if we could think about merging it in its current state, with fused attention disabled when loading sharded models? Because some models can only be uploaded to HF if sharded, so any support in AutoGPTQ would be better than no support. (And TBH most people use Transformers, TGI or ExLlama for GPTQ loading anyway, not AutoGPTQ any more - all of which support loading with shards.)
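A rough sketch of that Transformers-based workflow (the model name, calibration dataset, and output directory are placeholders): quantize with GPTQ through Transformers, then save with a 49GB shard limit so the output is only sharded when a single file would otherwise exceed the Hub's 50GB upload cap.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-70b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,  # quantization happens during loading
)
# Shards are only created if a single file would exceed 49GB.
model.save_pretrained("llama-2-70b-chat-gptq", max_shard_size="49GB")
tokenizer.save_pretrained("llama-2-70b-chat-gptq")
```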
They are done with Transformers instead of AutoGPTQ directly.
Many thanks @TheBloke for the detailed explanation. I do agree that any support in AutoGPTQ would be much appreciated at this time.
@LaaZa Let me review & merge this this week.
I didn't test it after merging with main, but I didn't have to do any manual merging. It would be nice to have this. Too bad we don't have sharding for the quantization yet. As I found out when testing this together with that PR, it has some weird issue that breaks fused_attention.
# Conflicts:
#   auto_gptq/modeling/_base.py
I tried to format my code by hand to follow the new style.
@LaaZa sorry for the headaches. I added linting to avoid having style changes show up in PR diffs. The easiest way for you is:
and then merge main and resolve potential conflicts.
@fxmarty do you want me to do that, or is it okay now? I'll be using that in the future.
@LaaZa Can you do it now on this branch? Maybe it's already good, but just to be sure.
@fxmarty will do in an hour.
Didn't do much, but I simplified one line.
Thanks a lot!
@@ -1220,6 +1194,8 @@ def skip(*args, **kwargs):
    safe_save(new_state_dict, model_save_name)

    if use_marlin:
        if is_sharded:
why?
@xunfeng1980 it is just not tested/implemented, but should be.
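A hedged sketch of the kind of guard being discussed here (not the exact code in the PR): Marlin repacking of sharded checkpoints simply is not implemented or tested yet, so loading fails fast instead of producing a silently broken model. The helper name is hypothetical.

```python
def _check_marlin_sharding_support(use_marlin: bool, is_sharded: bool) -> None:
    # Sketch only: bail out early when Marlin is requested for a sharded checkpoint.
    if use_marlin and is_sharded:
        raise ValueError(
            "Loading sharded checkpoints with Marlin is currently not supported. "
            "Load the unsharded checkpoint or disable Marlin."
        )
```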
Adds support for loading sharded quantized models with accelerate. qigen is not supported.
This is the loading counterpart of #364, which is not required.
Checks for [model].index.json to detect shards.
fused_attention may not work (errors out) for tiny shards, but real checkpoints are typically not small enough to cause this.
Fixes #319
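A hedged sketch (not the PR's exact code) of the approach the description outlines: look for a "<weights file>.index.json" next to the checkpoint to detect sharding, and let accelerate resolve and load the individual shard files. The function name and the default basename are assumptions for illustration.

```python
import os

from accelerate import load_checkpoint_in_model


def resolve_checkpoint(model_dir: str, basename: str = "model.safetensors"):
    """Return (checkpoint_path, is_sharded) for a local model directory."""
    index_path = os.path.join(model_dir, basename + ".index.json")
    if os.path.isfile(index_path):
        # The index file maps each tensor name to the shard that contains it.
        return index_path, True
    return os.path.join(model_dir, basename), False


# Usage, assuming `model` has already been constructed (e.g. on the meta device):
# checkpoint, is_sharded = resolve_checkpoint("/path/to/quantized-model")
# load_checkpoint_in_model(model, checkpoint, device_map={"": "cpu"})
```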