
Model quantize #19

Closed · bensonbs opened this issue Aug 21, 2023 · 10 comments

Comments

@bensonbs

bensonbs commented Aug 21, 2023

I get the following error on a single RTX 3090 (24 GB):

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB (GPU 0; 24.00 GiB total capacity; 23.85 GiB already allocated; 0 bytes free; 24.00 GiB allowed; 23.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is there any way to quantize the model?

@gary1003

The VRAM needed is roughly the size of the model files, so it apparently won't fit in 24 GB. If you want to shrink the model, you could look at the quantize tool in https://github.com/ggerganov/llama.cpp.
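
For reference, a GGML file produced by that quantize tool can also be run from Python through the llama-cpp-python bindings. A rough sketch, assuming pip install llama-cpp-python; the file name below is a placeholder for whatever the tool produced:

from llama_cpp import Llama

# model_path is a placeholder for the output of llama.cpp's quantize tool
llm = Llama(model_path="./taiwan-llama-v1.0.ggmlv3.q4_0.bin", n_ctx=2048)
out = llm("什麼是深度學習?", max_tokens=100)
print(out["choices"][0]["text"])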

@hiwudery

Please refer to the resources below:
https://huggingface.co/audreyt/Taiwan-LLaMa-v1.0-GGML
https://huggingface.co/weiren119/Taiwan-LLaMa-v1.0-4bits-GPTQ
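
For what it's worth, outside of TGI the GPTQ checkpoint above can be loaded directly with the auto-gptq library. A minimal sketch, assuming pip install auto-gptq transformers and that the repo ships safetensors weights:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "weiren119/Taiwan-LLaMa-v1.0-4bits-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# from_quantized loads the pre-quantized 4-bit weights onto the GPU
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,  # assumption: the repo provides .safetensors files
)

inputs = tokenizer("什麼是深度學習?", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))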

@bensonbs
Author

bensonbs commented Aug 22, 2023

#19 (comment)

I run the model with the following command:

docker run --gpus all -p 8080:80 `
-v C:\Users\BBS\code\Taiwan-LLaMa\data:/data  `
--name Taiwan-LLaMa-v1.0-4bits-GPTQ `
ghcr.io/huggingface/text-generation-inference:latest `
--model-id weiren119/Taiwan-LLaMa-v1.0-4bits-GPTQ `
--quantize gptq

but the output is not ideal:

from text_generation import Client

# Stream tokens from the local TGI endpoint and collect the generated text
client = Client("http://127.0.0.1:8080")
text = ""
for response in client.generate_stream("什麼是深度學習?", max_new_tokens=100):
    if not response.token.special:
        text += response.token.text
print(text)

- 網路資訊討論區 - 網路開店購物車論壇
網路開店購物車論壇»論壇 › 網路開店資訊 › 網路資訊討論區 › 什麼是深度

@PenutChen

PenutChen commented Aug 22, 2023

You could probably just try bnb (bitsandbytes) instead:

docker run --gpus all -p 8080:80 `
-v C:\Users\BBS\code\Taiwan-LLaMa\data:/data  `
--name Taiwan-LLaMa-v1.0-4bits-GPTQ `
ghcr.io/huggingface/text-generation-inference:latest `
--model-id yentinglin/Taiwan-LLaMa-v1.0 `
--quantize bitsandbytes
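
Outside of TGI, roughly the same thing can be done with the transformers bitsandbytes integration. A minimal sketch, assuming pip install transformers accelerate bitsandbytes (TGI's plain bitsandbytes flag is 8-bit; load_in_4bit=True would be the 4-bit NF4 variant):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yentinglin/Taiwan-LLaMa-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Quantize the full-precision weights to 8-bit on the fly while loading
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
)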

@PenutChen

PenutChen commented Aug 22, 2023

#19 (comment)

Your poor generation quality is most likely because you are not using the prompt template provided by the author.
The full prompt should look like this:

prompt_template = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"""
prompt = prompt_template.format("什麼是深度學習?")

The full prompt template is described here.
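
Putting that together with the earlier streaming snippet, the request would look something like this (a sketch; only the prompt changes):

from text_generation import Client

prompt_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

client = Client("http://127.0.0.1:8080")
text = ""
# Wrap the question in the template before sending it to TGI
for response in client.generate_stream(prompt_template.format("什麼是深度學習?"), max_new_tokens=100):
    if not response.token.special:
        text += response.token.text
print(text)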

@bensonbs
Author

#19 (comment)

Thank you very much. Using --quantize gptq together with the prompt template gives very good results.

However, --quantize bitsandbytes fails with this error:

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 53, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
 rank=0
Error: ShardCannotStart
2023-08-22T01:37:21.850382Z ERROR text_generation_launcher: Shard 0 failed to start
2023-08-22T01:37:21.850414Z  INFO text_generation_launcher: Shutting down shards

@PenutChen

It may be a TGI version issue; try the pinned image ghcr.io/huggingface/text-generation-inference:1.0.1 instead of latest.

@bensonbs
Author

bensonbs commented Aug 22, 2023

#19 (comment)

Sorry, that error turned out to be unrelated to TGI.

I made a simple mistake: the model-id should have been yentinglin/Taiwan-LLaMa-v1.0, but I had used weiren119/Taiwan-LLaMa-v1.0-4bits-GPTQ.

docker run --gpus all -p 8080:80 `
-v C:\Users\BBS\code\Taiwan-LLaMa\data:/data  `
ghcr.io/huggingface/text-generation-inference:latest `
--model-id yentinglin/Taiwan-LLaMa-v1.0 `
--quantize bitsandbytes

Now it runs correctly.

One more question: bitsandbytes-foundation/bitsandbytes#539 mentions that GPTQ is more accurate than bnb. Is there any difference between them in speed?

@bensonbs changed the title from 模型需要使用多少VRAM? ("How much VRAM does the model need?") to Model quantize on Aug 22, 2023
@PenutChen

PenutChen commented Aug 22, 2023

On a GPU in the 3090's class, the speed gap between GPTQ and bnb should not be large. In my own tests, GPTQ (4-bit) and bnb (4-bit) are usually a touch slower than bnb (8-bit), but the differences among the three are so small they are practically negligible. You can benchmark it yourself.
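
One way to measure it: time a non-streaming request against each container and divide by the token count reported by the server. A minimal sketch with the same text_generation client (the URL points at whichever TGI instance you are testing):

import time
from text_generation import Client

client = Client("http://127.0.0.1:8080")

start = time.perf_counter()
response = client.generate("什麼是深度學習?", max_new_tokens=100)
elapsed = time.perf_counter() - start

# details.generated_tokens is the number of tokens the server produced
tokens = response.details.generated_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tokens/s")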

@adamlin120
Collaborator

BTW, the demo site currently runs on two 3090s with TGI (no quantization), in case you want to compare the difference.
