Support loading sharded quantized checkpoints. #425
Conversation
Thank you @LaaZa for the support. Could you add a test to load sharded quantized checkpoints?
I guess, but it would be good to have a tiny sharded model for a test. Also, there are differences between a local model and a hub model (the actual loading is the same). So what would you suggest?
@LaaZa Yes, it would be good, but I couldn't find such a model. I only found https://huggingface.co/ranchlai/Llama-2-70b-chat-gptq-4bit-128g/tree/main & https://huggingface.co/TheBloke/Falcon-180B-Chat-GPTQ/tree/main Maybe just having a test loading one of these is fine.
Hi guys, I just made this, if it's of use: https://huggingface.co/TheBlokeAI/llama-68m-GPTQ-sharded It's Llama 68M, sharded with max_shard_size="10MB". Though only one shard is actually 10MB and the other two are bigger - not sure why that is. Total weights size is 104MiB. I used Transformers GPTQ to create it, as I don't have the PR set up yet. But I think that's a better test anyway, as it confirms AutoGPTQ can load Transformers-produced GPTQs (I'm sure it can).
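For context, a minimal sketch of how such a tiny sharded model can be produced with Transformers; the base model name and output path are assumptions, and the GPTQ quantization step is omitted here. The uneven shard sizes are likely because Transformers never splits a single tensor across shards, so any tensor larger than max_shard_size (for example the embedding and lm_head matrices, even in a 68M model) ends up in an oversized shard.

```python
# Sketch only: produce a small sharded checkpoint for testing.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("JackFram/llama-68m")  # assumed tiny base model
model.save_pretrained(
    "llama-68m-sharded",
    max_shard_size="10MB",     # split the weights into several small shards
    safe_serialization=True,   # write .safetensors files plus an index.json
)
```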
I just made a sharded tinyllama, but it also fails loading. Edit: Okay, the one @TheBloke made works.
Add sharded loading test case.
@fxmarty added a test and fixed the hub loading. Is the test okay? It's not very comprehensive; I can add more permutations if you want.
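For reference, a minimal sketch of what such a sharded-loading test might look like (not the exact test added in this PR); it uses the tiny sharded repo mentioned above, and the device and arguments are assumptions.

```python
import unittest

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM


class TestShardedLoading(unittest.TestCase):
    def test_load_sharded_from_hub(self):
        model_id = "TheBlokeAI/llama-68m-GPTQ-sharded"
        # Loading resolves the index.json and pulls in the individual shards.
        model = AutoGPTQForCausalLM.from_quantized(
            model_id,
            device="cuda:0",
            use_safetensors=True,
        )
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
        output = model.generate(**inputs, max_new_tokens=8)
        # The model should produce at least one token beyond the prompt.
        self.assertGreater(output.shape[-1], inputs["input_ids"].shape[-1])
```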
Will this go into 0.6, @fxmarty? Would love to have sharding support in AutoGPTQ.
# Conflicts:
#   auto_gptq/modeling/_base.py
I merged with main.
#364 unfortunately had some issues. I think the produced models wouldn't work with fused attention for some reason.
Thanks @LaaZa for your quick response. I've seen @TheBloke uploading GPTQ models with sharded safetensors files instead of a single file. Is this different from the way he does it? (Maybe he does it as a resharding step after quantization.)
I do it with Transformers, which supports making GPTQs with sharding. Or occasionally, if I accidentally make a >50GB GPTQ with AutoGPTQ, then yes, I post-shard before upload. But if I know a file will be >50GB, e.g. for 120B models, I will use Transformers so I get the sharding.

When I make GPTQs with Transformers, I set the shard size to 49GB. This means it only shards if it has to for the 50GB limit; otherwise it uploads one file, so it remains compatible with AutoGPTQ. It'd be great to have the sharded saving in AutoGPTQ so I could use that always.

I wonder if we could think about merging it in its current state, with fused attention disabled when loading sharded models? Because some models can only be uploaded to HF if sharded, so any support in AutoGPTQ would be better than no support. (And TBH most people use Transformers, TGI or ExLlama for GPTQ loading anyway, not AutoGPTQ any more - all of which support loading with shards.)
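A rough sketch of that Transformers-based workflow (the model name, calibration dataset, and output directory are placeholders): quantize with GPTQ through Transformers, then save with a 49GB shard limit so the output is only sharded when a single file would otherwise exceed the Hub's 50GB upload cap.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-70b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,  # quantization happens during loading
)
# Shards are only created if a single file would exceed 49GB.
model.save_pretrained("llama-2-70b-chat-gptq", max_shard_size="49GB")
tokenizer.save_pretrained("llama-2-70b-chat-gptq")
```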
They are done with Transformers instead of AutoGPTQ directly.
Many thanks @TheBloke for the detailed explanation. I do agree that any support in AutoGPTQ would be much appreciated at this time.
@LaaZa Let me review & merge this this week.
I didn't test it after merging with main, but I didn't have to do any manual merging. It would be nice to have this. Too bad we don't have sharding for the quantization yet. As I found out when testing this together with that PR, it has some weird issue that breaks fused_attention.
# Conflicts:
#   auto_gptq/modeling/_base.py
I tried to format my code by hand to follow the new style.
@LaaZa sorry for the headaches. I added linting to avoid having style changes show up in PR diffs. The easiest way for you is:
and then merge main and resolve potential conflicts.
@fxmarty do you want me to do that, or is it okay now? I'll be using that in the future.
@LaaZa Can you do it now on this branch? Maybe it's already good, but just to be sure.
@fxmarty will do in an hour.
Didn't do much, but I simplified one line.
Thanks a lot!
@@ -1220,6 +1194,8 @@ def skip(*args, **kwargs):
    safe_save(new_state_dict, model_save_name)

    if use_marlin:
        if is_sharded:
why?
@xunfeng1980 it is just not tested/implemented, but should be.
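A hedged sketch of the kind of guard being discussed here (not the exact code in the PR): Marlin repacking of sharded checkpoints simply is not implemented or tested yet, so loading fails fast instead of producing a silently broken model. The helper name is hypothetical.

```python
def _check_marlin_sharding_support(use_marlin: bool, is_sharded: bool) -> None:
    # Sketch only: bail out early when Marlin is requested for a sharded checkpoint.
    if use_marlin and is_sharded:
        raise ValueError(
            "Loading sharded checkpoints with Marlin is currently not supported. "
            "Load the unsharded checkpoint or disable Marlin."
        )
```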
Adds support for loading sharded quantized models with accelerate. qigen is not supported.
This is the loading counterpart of #364, which is not required.
Checks for [model].index.json to detect shards.
fused_attention may not work (errors out) for tiny shards, but real checkpoints are typically not small enough to cause this.
Fixes #319
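A hedged sketch (not the PR's exact code) of the approach the description outlines: look for a "<weights file>.index.json" next to the checkpoint to detect sharding, and let accelerate resolve and load the individual shard files. The function name and the default basename are assumptions for illustration.

```python
import os

from accelerate import load_checkpoint_in_model


def resolve_checkpoint(model_dir: str, basename: str = "model.safetensors"):
    """Return (checkpoint_path, is_sharded) for a local model directory."""
    index_path = os.path.join(model_dir, basename + ".index.json")
    if os.path.isfile(index_path):
        # The index file maps each tensor name to the shard that contains it.
        return index_path, True
    return os.path.join(model_dir, basename), False


# Usage, assuming `model` has already been constructed (e.g. on the meta device):
# checkpoint, is_sharded = resolve_checkpoint("/path/to/quantized-model")
# load_checkpoint_in_model(model, checkpoint, device_map={"": "cpu"})
```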