
Support sharded quantized model files in from_quantized #319

Closed
shakealeg opened this issue Sep 3, 2023 · 6 comments · Fixed by #425
Labels: enhancement, help wanted

Comments

@shakealeg

shakealeg commented Sep 3, 2023

I've been using 0cc4m's GPTQ for a while and it's been smooth sailing, no errors. But when I try AutoGPTQ, it just doesn't work. I'm trying to load Pygmalion 7b 4bit 32g from TehVenom, and I get the following error:

Traceback (most recent call last):
  File "/home/XXX/Documents/Projects/Sapphire/main.py", line 63, in <module>
    main()
  File "/home/XXX/Documents/Projects/Sapphire/main.py", line 30, in main
    model = AutoGPTQForCausalLM.from_quantized(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/XXX/Documents/Projects/Sapphire/venv/lib/python3.11/site-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
    return quant_func(
           ^^^^^^^^^^^
  File "/home/XXX/Documents/Projects/Sapphire/venv/lib/python3.11/site-packages/auto_gptq/modeling/_base.py", line 791, in from_quantized
    raise FileNotFoundError(f"Could not find model in {model_name_or_path}")
FileNotFoundError: Could not find model in models/pygmalion-7b-4bit-32g

My file structure looks like this:

Project:
    models/
        pygmalion-7b-4bit-32g/:
            4bit-32g.safetensors
            config.json
            generation_config.json
            special_tokens_map.json
            tokenizer_config.json
            tokenizer.json
            tokenizer.model
    venv/
    main.py

My code is the following:

from transformers import (
    AutoTokenizer,
    pipeline,
    logging
)
from auto_gptq import (
    AutoGPTQForCausalLM,
    BaseQuantizeConfig
)

USERNAME = "Frantic"
VERSION = "1.0.0"
MODEL_DIR = "models/pygmalion-7b-4bit-32g"
MODEL_BASENAME = "Pygmalion-7b-4bit-GPTQ-Safetensors"
USE_TRITON = True

def main():
    print(f"Sapphire | Version {VERSION}")

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=True)
    quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=32,
        desc_act=False
    )

    model = AutoGPTQForCausalLM.from_quantized(
        MODEL_DIR,
        use_safetensors=True,
        model_basename=MODEL_BASENAME,
        device="cuda:0",
        use_triton=USE_TRITON,
        quantize_config=quantize_config
    )

    logging.set_verbosity(logging.CRITICAL)

    prompt = "Hello, how are you today?"
    prompt_template = f"""### {USERNAME}: {prompt}
### Assistant:"""

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        temperature=0.9,
        top_p=0.9,
        repetition_penalty=1.15
    )

    print(pipe(prompt_template)[0]["generated_text"])

    input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.cuda()
    output = model.generate(inputs=input_ids, temperature=0.9, max_new_tokens=512)

    print(tokenizer.decode(output[0]))

if __name__ == "__main__":
    main()

Overall, I'm confused about what I need to do to fix this and about how AutoGPTQ expects models to be laid out. If anyone could help, that would be appreciated.

@thunderamental

Try changing your MODEL_BASENAME to 4bit-32g (the weights file name without the .safetensors extension)? Sometimes these paths are fickle.
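
For example, applying that change to the code from the original post (this is just the suggested edit, not a verified fix):

MODEL_BASENAME = "4bit-32g"  # must match 4bit-32g.safetensors on disk, minus the extension

model = AutoGPTQForCausalLM.from_quantized(
    MODEL_DIR,
    use_safetensors=True,
    model_basename=MODEL_BASENAME,
    device="cuda:0",
    use_triton=USE_TRITON,
    quantize_config=quantize_config
)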

@CrazyBrick


I met the same problem when running inference with Qwen-VL-Chat-Int4 via AutoGPTQForCausalLM.from_quantized:

  File "/root/miniconda3/lib/python3.8/site-packages/auto_gptq/modeling/_base.py", line 802, in from_quantized
    raise FileNotFoundError(f"Could not find model in {model_name_or_path}")
FileNotFoundError: Could not find model in ../Qwen-VL-Chat-Int4

I checked the quantize_config.json file:

"model_name_or_path": "model"
"model_file_base_name": "model"

and the quantized model files are named model-00001-of-00005.safetensors, model-00002-of-00005.safetensors, ....
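
For illustration, a minimal check of what seems to go wrong here, assuming the loader builds a single file name from model_file_base_name (the path is the one from the traceback above):

import os

model_dir = "../Qwen-VL-Chat-Int4"
# The loader looks for one file named after the basename in quantize_config.json.
expected = os.path.join(model_dir, "model.safetensors")
print(os.path.isfile(expected))  # False: only model-00001-of-00005.safetensors ... model-00005-of-00005.safetensors exist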

How should I solve the problem?

@fxmarty
Collaborator

fxmarty commented Oct 27, 2023

Hi @shakealeg, model loading was improved in #383. Please pass the argument model_basename for custom model names; otherwise we greedily search quantize_config.model_file_base_name and then models with the basename f"gptq_model-{quantize_config.bits}bit-{quantize_config.group_size}g" or "model".

use_safetensors used to default to False; the above PR changed it to default to True.
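
In other words, roughly this lookup order (a sketch of the search described above, not the actual AutoGPTQ code):

def candidate_basenames(model_basename, quantize_config):
    # An explicit model_basename argument wins.
    if model_basename is not None:
        return [model_basename]
    # Then the basename recorded in quantize_config.json.
    if quantize_config.model_file_base_name:
        return [quantize_config.model_file_base_name]
    # Finally the default basenames.
    return [
        f"gptq_model-{quantize_config.bits}bit-{quantize_config.group_size}g",
        "model",
    ]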

@CrazyBrick There is a WIP PR to support sharded models in AutoGPTQ: #364

@fxmarty fxmarty changed the title Confused on why this isn't working, confused about quantization. Support sharded quantized model files in from_quantized Oct 27, 2023
@fxmarty fxmarty added enhancement New feature or request help wanted Extra attention is needed labels Oct 27, 2023
@CrazyBrick

CrazyBrick commented Oct 27, 2023

Hi @shakealeg, the model loading was improved in #383. Please pass the argument model_basename for custom model names, otherwise we greedily search quantize_config.model_file_base_name and models with basename f"gptq_model-{quantize_config.bits}bit-{quantize_config.group_size}g" or "model".

use_safetensors used to be False by default, which was turned to True by default in the above PR.

@CrazyBrick There is a WIP PR to support sharded models in AutoGPTQ: #364

Hi @fxmarty, thank you for your reply. I changed the code according to PR #364 (mainly in ``), but it doesn't work.
I printed some variables while debugging:

model_name_or_path:Qwen/Qwen-VL-Chat-Int4
isdir(model_name_or_path):True
model_save_name:Qwen/Qwen-VL-Chat-Int4/model

But what I actually have is Qwen/Qwen-VL-Chat-Int4/model-00001-of-00005.safetensors through Qwen/Qwen-VL-Chat-Int4/model-00005-of-00005.safetensors (00001~00005).

It won't find the whole model from a single shard name. When I change "model_file_base_name": "model" to "model_file_base_name": "model-00001-of-00005", I get:

NotImplementedError: Cannot copy out of meta tensor; no data!

The following is the code I am using, with the model from [Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4/tree/main):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "Qwen/Qwen-VL-Chat-Int4"
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)

# use cuda device
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device_map="auto",
    use_safetensors=True,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
print(model.hf_device_map)

what should I do?

@fxmarty
Collaborator

fxmarty commented Oct 27, 2023

Hi @CrazyBrick, the model you are trying to use has sharded checkpoints. This is unfortunately not currently supported in AutoGPTQ; there is a WIP PR open for it: #364
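
A possible stop-gap while that PR is pending (not suggested in this thread, and assuming enough CPU RAM to hold the merged weights) is to merge the shards into the single model.safetensors file that the basename search described above can find:

import glob
from safetensors.torch import load_file, save_file

tensors = {}
for shard in sorted(glob.glob("Qwen/Qwen-VL-Chat-Int4/model-*-of-*.safetensors")):
    tensors.update(load_file(shard))  # each shard holds a disjoint set of tensors

save_file(tensors, "Qwen/Qwen-VL-Chat-Int4/model.safetensors")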

@xiayq1

xiayq1 commented Apr 25, 2024


@CrazyBrick did you ever solve this? I've run into the same situation...
