Support loading sharded quantized checkpoints. #425

Merged: 12 commits merged into AutoGPTQ:main on Feb 19, 2024

Conversation

LaaZa (Contributor) commented Nov 11, 2023

Adds support for loading sharded quantized models with accelerate. qigen is not supported.

This is the loading counterpart to #364, which is not required for this PR.

Checks for [model].index.json to detect shards.

fused_attention may not work (it errors out) with tiny shards, but real shards are typically not small enough to cause this.

fixes #319
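
For illustration, a minimal sketch of how detecting shards via a [model].index.json file next to the weights can work, assuming a local model directory; the helper name and return shape are illustrative, not the actual PR code:

```python
import json
import os


def find_checkpoint_files(model_dir: str, model_basename: str, ext: str = ".safetensors"):
    """Return the weight files to load and whether the checkpoint is sharded."""
    index_path = os.path.join(model_dir, model_basename + ext + ".index.json")
    if os.path.isfile(index_path):
        # Sharded checkpoint: the index maps every tensor name to the shard holding it.
        with open(index_path, "r", encoding="utf-8") as f:
            weight_map = json.load(f)["weight_map"]
        shards = sorted(set(weight_map.values()))
        return [os.path.join(model_dir, shard) for shard in shards], True
    # Single-file checkpoint.
    return [os.path.join(model_dir, model_basename + ext)], False
```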

fxmarty (Collaborator) commented Nov 14, 2023

Thank you @LaaZa for adding this support. Could you add a test that loads sharded quantized checkpoints?

LaaZa (Contributor, Author) commented Nov 14, 2023

I guess, but it would be good to have a tiny sharded model for a test. Also, there are differences between a local model and a Hub model (the actual loading is the same). What would you suggest?

fxmarty (Collaborator) commented Nov 14, 2023

@LaaZa Yes, that would be good, but I couldn't find such a model. I only found https://huggingface.co/ranchlai/Llama-2-70b-chat-gptq-4bit-128g/tree/main and https://huggingface.co/TheBloke/Falcon-180B-Chat-GPTQ/tree/main.

Maybe just having a test that loads one of these is fine.

LaaZa (Contributor, Author) commented Nov 14, 2023

@fxmarty I can use #364 to make a small one and upload it to the Hub.

TheBloke (Contributor) commented:

Hi guys

I just made this, if it's of use: https://huggingface.co/TheBlokeAI/llama-68m-GPTQ-sharded

It's Llama 68M, sharded with max_shard_size="10MB". Though only one shard is actually 10MB and the other two are bigger; not sure why that is. The total weights size is 104 MiB.

I used Transformers GPTQ to create it, as I don't have the PR set up yet. But I think that's a better test anyway, as it confirms AutoGPTQ can load Transformers-produced GPTQs (I'm sure it can).
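
For reference, a hedged sketch of how a tiny sharded GPTQ like this can be produced with the Transformers GPTQ integration; the base model ID, calibration dataset, and output directory are assumptions for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model = "JackFram/llama-68m"  # tiny Llama, used here purely as an example
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Quantize with the Transformers GPTQ integration (requires optimum and auto-gptq).
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=gptq_config, device_map="auto"
)

# save_pretrained splits the safetensors weights into shards above max_shard_size
# and writes a model.safetensors.index.json next to them.
model.save_pretrained("llama-68m-GPTQ-sharded", max_shard_size="10MB")
tokenizer.save_pretrained("llama-68m-GPTQ-sharded")
```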

LaaZa (Contributor, Author) commented Nov 14, 2023

I just made a sharded TinyLlama, but it also fails to load with fused_attention; it might be that #364 somehow produces models that do not work with it. I'll test the 68M one.

Edit: Okay, the one @TheBloke made works with fused_attention, but I noticed that normal filename resolution fails for a Hub model when the filename is not in the gptq_* format and no basename is provided.
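
As an illustration only (not the actual fix in this PR), a hypothetical sketch of what falling back from the gptq_* naming to the standard Transformers weight names could look like for a Hub repo; the helper name and logic are assumptions:

```python
from huggingface_hub import list_repo_files


def resolve_basename(model_name_or_path: str, extension: str = ".safetensors") -> str:
    """Pick a weight basename for a Hub repo when none is provided."""
    files = list_repo_files(model_name_or_path)
    # Prefer AutoGPTQ-style names such as gptq_model-4bit-128g.safetensors.
    for f in files:
        if f.startswith("gptq_") and f.endswith(extension):
            return f[: -len(extension)]
    # Fall back to the Transformers convention, which also covers sharded
    # checkpoints through model.safetensors.index.json.
    if f"model{extension}" in files or f"model{extension}.index.json" in files:
        return "model"
    raise FileNotFoundError(f"No quantized weights found in {model_name_or_path}")
```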

Commit: Add sharded loading test case.
LaaZa (Contributor, Author) commented Nov 14, 2023

@fxmarty I added a test and fixed the Hub loading. Is the test okay? It's not very comprehensive; I can add more permutations if you want.
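
A rough sketch of what such a load test can exercise, using the sharded 68M repo mentioned above; the parameters and generation call are illustrative, not the test added in this PR:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBlokeAI/llama-68m-GPTQ-sharded"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# from_quantized should detect model.safetensors.index.json and load all shards.
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0", use_safetensors=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output[0]))
```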

TheBloke (Contributor) commented:

Will this go into 0.6, @fxmarty? I would love to have sharding support in AutoGPTQ.

Merge commit. Conflicts: auto_gptq/modeling/_base.py
LaaZa (Contributor, Author) commented Dec 15, 2023

I merged with main.*

Edit: *Again, because I had forgotten to update; no issues.

maziyarpanahi commented:

Thanks @LaaZa for the implementation. Would it be possible to merge this PR and #364 into the main branch? I would really love to have the save_quantized method with the max_shard_size feature:

model.save_quantized(output_dir, max_shard_size=shard_size)

LaaZa (Contributor, Author) commented Jan 14, 2024

#364 unfortunately had an issue. I think the produced models wouldn't work with fused attention for some reason.

maziyarpanahi commented:

> #364 unfortunately had an issue. I think the produced models wouldn't work with fused attention for some reason.

Thanks @LaaZa for your quick response. I've seen @TheBloke uploading GPTQ models with sharded safetensors files instead of one file. Is this different from the way he does it? (Maybe he does post-resharding.)

TheBloke (Contributor) commented:

I do it with Transformers, which supports making GPTQs with sharding.

Or occasionally, if I accidentally make a >50GB GPTQ with AutoGPTQ, then yes, I post-shard before upload. But if I know a file will be >50GB, e.g. for 120B models, I will use Transformers so I get the sharding.

When I make GPTQs with Transformers, I set the shard size to 49GB. This means it only shards if it has to for the 50GB limit, otherwise it uploads one file so it remains compatible with AutoGPTQ.

It'd be great to have sharded saving in AutoGPTQ so I could always use that. I wonder if we could think about merging it in its current state, with fused attention disabled when loading sharded models? Because some models can only be uploaded to HF if sharded, so any support in AutoGPTQ would be better than no support.

(And TBH, most people use Transformers, TGI or ExLlama for GPTQ loading these days rather than AutoGPTQ, all of which support loading with shards.)

LaaZa (Contributor, Author) commented Jan 14, 2024

> I've seen @TheBloke uploading GPTQ models with sharded safetensors files instead of one file. Is this different from the way he does it? (Maybe he does post-resharding.)

They are made with Transformers rather than with AutoGPTQ directly.

maziyarpanahi commented:

Many thanks @TheBloke for the detailed explanation. I do agree that any support in AutoGPTQ would be much appreciated at this time.

fxmarty (Collaborator) commented Jan 30, 2024

@LaaZa Let me review and merge this PR this week.

LaaZa (Contributor, Author) commented Jan 30, 2024

I didn't test it after merging with main, but I didn't have to do any manual merging. It would be nice to have this. Too bad we don't have sharding for quantization yet; as I found out while testing this together with that PR, it has some weird issue that breaks fused_attention.

Merge commit. Conflicts: auto_gptq/modeling/_base.py
LaaZa (Contributor, Author) commented Feb 13, 2024

I tried to format my code by hand to follow the new style.

fxmarty (Collaborator) commented Feb 14, 2024

@LaaZa sorry for the headaches. I added linting so as to avoid having style changes in PR diffs. The easiest way for you is:

pip install ruff==0.1.5
ruff auto_gptq examples tests setup.py --fix (on your branch)

and then merge main and resolve potential conflicts.

LaaZa (Contributor, Author) commented Feb 14, 2024

@fxmarty do you want me to do that, or is it okay now? I'll use it in the future.

fxmarty (Collaborator) commented Feb 14, 2024

@LaaZa Can you do it now on this branch? Maybe it's already good, but just to be sure.

LaaZa (Contributor, Author) commented Feb 14, 2024

@fxmarty will do in an hour.

LaaZa (Contributor, Author) commented Feb 14, 2024

It didn't do much, but I simplified one line.

fxmarty (Collaborator) left a comment:

Thanks a lot!

fxmarty merged commit 6906ce8 into AutoGPTQ:main on Feb 19, 2024 (1 check failed).

xunfeng1980 commented on the diff:

@@ -1220,6 +1194,8 @@ def skip(*args, **kwargs):
safe_save(new_state_dict, model_save_name)

if use_marlin:
if is_sharded:

why?

fxmarty (Collaborator) replied:

@xunfeng1980 it is just not tested/implemented, but it should be.

Successfully merging this pull request may close the following issue: Support sharded quantized model files in from_quantized (#319)

5 participants