
[BUG] Qwen-7B-Chat-Int4 fine-tuning error: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU #385

Closed
2 tasks done
studyhardstudyhard opened this issue Sep 28, 2023 · 18 comments

Comments

@studyhardstudyhard

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • I have searched the FAQ

Current Behavior

Fine-tuning Qwen-7B-Chat-Int4 fails with the following error. What is the cause?
File "/home/llm/qwen/fine-tune/finetune.py", line 353, in
train()
File "/home/llm/qwen/fine-tune/finetune.py", line 294, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/home/.conda/envs/qwen_env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 511, in from_pretrained
return model_class.from_pretrained(
File "/home/.conda/envs/qwen_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3161, in from_pretrained
model = quantizer.post_init_model(model)
File "/home/jing.yu/.conda/envs/qwen_env/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 482, in post_init_model
raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config object

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

@Moemu

Moemu commented Sep 29, 2023

Same issue here.

@omega-accelr

I also faced this error when I tried to quantize the model

@Tejaswgupta

You just need to add disable_exllama=True in the config.json

@zhuqiangqiangqiang

You just need to add disable_exllama=True in the config.json
May I ask where config.json is located?

@csuer411

csuer411 commented Oct 4, 2023

Have you solved it? I'm facing the same issue.

@dragove

dragove commented Oct 5, 2023

You just need to add disable_exllama=True in the config.json

Hello, I've tried the following code to change this config, but the error still remains.

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4",
                                    trust_remote_code=True)
config.disable_exllama = True
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat-Int4",
                                             config=config,
                                             device_map="cpu",
                                             trust_remote_code=True).eval()

Could you tell me if I'm doing something wrong?

Edit: the following code skips this error, but as JustinLin610 said, int4 does not work on CPU.

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4",
                                    trust_remote_code=True)
# set the flag inside the quantization config rather than on the top-level config
config.quantization_config["disable_exllama"] = True
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat-Int4",
                                             config=config,
                                             device_map="cpu",
                                             trust_remote_code=True).eval()

@ilovesouthpark

Change config.json in your model folder and add "disable_exllama": true to its quantization_config section, for example as in the excerpt below.
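
For illustration only, a minimal sketch of what that edit might look like; the model's existing quantization_config keys are elided as "......", and note that newer transformers versions read "use_exllama": false instead, as discussed further down in this thread:

{
  ......
  "quantization_config": {
    ......
    "disable_exllama": true
  },
  ......
}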

@JustinLin610
Member

I think you guys are using int4 models on CPU. It is not supported! If you would like to use it on CPU, I advise you to check our new project qwen.cpp

@jklj077 jklj077 changed the title from "[BUG] <title>" to "[BUG] Qwen-7B-Chat-Int4 fine-tuning error: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU" Oct 8, 2023
@x1ngzai

x1ngzai commented Oct 8, 2023

Same problem. Just add disable_exllama=True in the quantization_config field of the config.json file.

@sjhm131

sjhm131 commented Oct 26, 2023

I also face this problem. I added disable_exllama: true to 7B-int4/quantize_config.json, but the problem is still there. How can I fix this?

@sjhm131

sjhm131 commented Oct 26, 2023

I also face this problem. I added disable_exllama: true to 7B-int4/quantize_config.json, but the problem is still there. How can I fix this?

Sorry, I made a mistake in the file name. I guess adding disable_exllama: true to the quantization_config section of config.json is the right way to fix this. Am I right?

@tigerinus

Same problem. Just add disable_exllama=True in the quantization_config field of the config.json file.

This setting makes the whole inference much slower.

The correct approach is: AutoGPTQ/AutoGPTQ#406

@jklj077
Contributor

jklj077 commented Nov 8, 2023

Everyone, several different problems are mixed together here:

  1. Quantized models on CPU: older AutoGPTQ (< 0.5.0) does not support CPU inference at all; newer AutoGPTQ has experimental support for it.
  2. Quantized models on GPU, but exllama raises an error:
    • exllama provides an efficient kernel implementation. It only supports int4 models quantized with GPTQ on modern GPUs, and it requires all model parameters to be on the GPU. Older AutoGPTQ supports the exllama kernel, and the newer version (0.5.0) supports the exllama v2 kernel; it can be toggled on or off, which affects speed and GPU memory usage, see AutoGPTQ's [benchmark](https://github.com/huggingface/optimum/tree/main/tests/benchmark#batch-size--1).
    • Some of our example code uses device_map="auto" to relieve RAM/VRAM pressure when loading large models, which may place part of the parameters in CPU memory (check the model's hf_device_map to confirm the placement is reasonable). Change it to device_map="cuda:0" so that everything is loaded onto the first GPU (or use a similar approach; the point is to keep the whole model on GPU, see the sketch after this list). If the hardware is supported and the software versions match, this resolves the "Found modules on cpu/disk" error.
      • The device_map argument is supported via Hugging Face Accelerate; its semantics are not equivalent to device, and it is not easy to move the model parameters afterwards.
      • If your RAM/VRAM really is not enough, keep device_map="auto" and turn exllama off.
    • If exllama is not supported in your setup, for example int8 models or older GPUs, turn it off; inference still runs on GPU. Set it in the quantization_config field of config.json or in code: older transformers use disable_exllama=True, newer transformers use use_exllama=False, depending on your transformers version.
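
Not official example code, just a minimal sketch of the two options above; it reuses the loading pattern from the earlier snippets in this thread and assumes a single visible GPU (cuda:0):

from transformers import AutoConfig, AutoModelForCausalLM

# Option 1: put every module on one GPU so the exllama kernels can be used.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="cuda:0",      # instead of "auto"; nothing ends up on cpu/disk
    trust_remote_code=True,
).eval()
print(model.hf_device_map)    # verify that all modules were placed on cuda:0

# Option 2: if VRAM is insufficient, keep device_map="auto" and turn exllama off.
config = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
config.quantization_config["disable_exllama"] = True    # older transformers
# config.quantization_config["use_exllama"] = False     # newer transformers
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()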

@jklj077 jklj077 closed this as completed Nov 8, 2023
@CrazyBrick

Everyone, several different problems are mixed together here: […]

Hello, Chat-Int4 runs fine on an A100, but on a Tesla T4 it fails with "no kernel image is available for execution on the device", regardless of whether I use one GPU or several. What could be the cause? (Both machines use torch 2.0.1, the drivers support CUDA 11.7, and torch itself works.) The error only occurs on the T4.

@ColdCodeCool

How can I export the int4 weights? When I load the model and print the weights, they are all float16.

@PlanetesDDH

May I ask where config.json is located?

It should be in the model folder.

@danjuan-77

Everyone, several different problems are mixed together here: […]

If I want to fine-tune the int4 model on GPU, which part should I modify? I can't find where to change this in finetune.py.

@feb-cloud

I tried modifying quantization_config in config.json to set {"use_exllama": false}, which solved the problem.
In config.json the default value of use_exllama is true; change it to false as in the following example.

{
  ......
  "quantization_config": {
    "use_exllama": false
  },
  ......
}
