
Model quantize #19

Closed · bensonbs opened this issue Aug 21, 2023 · 10 comments

Comments

@bensonbs

bensonbs commented Aug 21, 2023

I get the following error on a single RTX 3090 (24 GB):

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB (GPU 0; 24.00 GiB total capacity; 23.85 GiB already allocated; 0 bytes free; 24.00 GiB allowed; 23.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is there any way to quantize the model?

@gary1003

The VRAM needed is roughly the size of the model files, so it apparently won't fit in 24 GB. If you want to shrink the model, you could look at the quantize tool in https://github.com/ggerganov/llama.cpp.
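
For reference, a GGML file produced by that quantize tool can also be run from Python through the llama-cpp-python bindings. A rough sketch, assuming pip install llama-cpp-python; the file name below is a placeholder for whatever the tool produced:

from llama_cpp import Llama

# model_path is a placeholder for the output of llama.cpp's quantize tool
llm = Llama(model_path="./taiwan-llama-v1.0.ggmlv3.q4_0.bin", n_ctx=2048)
out = llm("什麼是深度學習?", max_tokens=100)
print(out["choices"][0]["text"])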

@hiwudery

Please refer to the resources below:
https://huggingface.co/audreyt/Taiwan-LLaMa-v1.0-GGML
https://huggingface.co/weiren119/Taiwan-LLaMa-v1.0-4bits-GPTQ
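
For what it's worth, outside of TGI the GPTQ checkpoint above can be loaded directly with the auto-gptq library. A minimal sketch, assuming pip install auto-gptq transformers and that the repo ships safetensors weights:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "weiren119/Taiwan-LLaMa-v1.0-4bits-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# from_quantized loads the pre-quantized 4-bit weights onto the GPU
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,  # assumption: the repo provides .safetensors files
)

inputs = tokenizer("什麼是深度學習?", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))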

@bensonbs
Author

bensonbs commented Aug 22, 2023

#19 (comment)

I run the model with the following command:

docker run --gpus all -p 8080:80 `
-v C:\Users\BBS\code\Taiwan-LLaMa\data:/data  `
--name Taiwan-LLaMa-v1.0-4bits-GPTQ `
ghcr.io/huggingface/text-generation-inference:latest `
--model-id weiren119/Taiwan-LLaMa-v1.0-4bits-GPTQ `
--quantize gptq

but the output is not ideal:

from text_generation import Client

# Stream tokens from the local TGI endpoint and collect the generated text
client = Client("http://127.0.0.1:8080")
text = ""
for response in client.generate_stream("什麼是深度學習?", max_new_tokens=100):
    if not response.token.special:
        text += response.token.text
print(text)

- 網路資訊討論區 - 網路開店購物車論壇
網路開店購物車論壇»論壇 › 網路開店資訊 › 網路資訊討論區 › 什麼是深度

@PenutChen

PenutChen commented Aug 22, 2023

You could probably just try bnb (bitsandbytes) instead:

docker run --gpus all -p 8080:80 `
-v C:\Users\BBS\code\Taiwan-LLaMa\data:/data  `
--name Taiwan-LLaMa-v1.0-4bits-GPTQ `
ghcr.io/huggingface/text-generation-inference:latest `
--model-id yentinglin/Taiwan-LLaMa-v1.0 `
--quantize bitsandbytes
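
Outside of TGI, roughly the same thing can be done with the transformers bitsandbytes integration. A minimal sketch, assuming pip install transformers accelerate bitsandbytes (TGI's plain bitsandbytes flag is 8-bit; load_in_4bit=True would be the 4-bit NF4 variant):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yentinglin/Taiwan-LLaMa-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Quantize the full-precision weights to 8-bit on the fly while loading
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
)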

@PenutChen

PenutChen commented Aug 22, 2023

#19 (comment)

Your poor generation quality is most likely because you are not using the prompt template provided by the author.
The full prompt should look like this:

prompt_template = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"""
prompt = prompt_template.format("什麼是深度學習?")

The full prompt template is described here.
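
Putting that together with the earlier streaming snippet, the request would look something like this (a sketch; only the prompt changes):

from text_generation import Client

prompt_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"

client = Client("http://127.0.0.1:8080")
text = ""
# Wrap the question in the template before sending it to TGI
for response in client.generate_stream(prompt_template.format("什麼是深度學習?"), max_new_tokens=100):
    if not response.token.special:
        text += response.token.text
print(text)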

@bensonbs
Author

#19 (comment)

Thank you very much. Using --quantize gptq together with the prompt template gives very good results.

However, --quantize bitsandbytes fails with this error:

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 53, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
 rank=0
Error: ShardCannotStart
2023-08-22T01:37:21.850382Z ERROR text_generation_launcher: Shard 0 failed to start
2023-08-22T01:37:21.850414Z  INFO text_generation_launcher: Shutting down shards

@PenutChen

It may be a TGI version issue; try the pinned image ghcr.io/huggingface/text-generation-inference:1.0.1 instead of latest.

@bensonbs
Author

bensonbs commented Aug 22, 2023

#19 (comment)

Sorry, that error turned out to be unrelated to TGI.

I made a simple mistake: the model-id should have been yentinglin/Taiwan-LLaMa-v1.0, but I had used weiren119/Taiwan-LLaMa-v1.0-4bits-GPTQ.

docker run --gpus all -p 8080:80 `
-v C:\Users\BBS\code\Taiwan-LLaMa\data:/data  `
ghcr.io/huggingface/text-generation-inference:latest `
--model-id yentinglin/Taiwan-LLaMa-v1.0 `
--quantize bitsandbytes

Now it runs correctly.

One more question: bitsandbytes-foundation/bitsandbytes#539 mentions that GPTQ is more accurate than bnb. Is there any difference between them in speed?

@bensonbs changed the title from 模型需要使用多少VRAM? ("How much VRAM does the model need?") to Model quantize on Aug 22, 2023
@PenutChen

PenutChen commented Aug 22, 2023

On a GPU in the 3090's class, the speed gap between GPTQ and bnb should not be large. In my own tests, GPTQ (4-bit) and bnb (4-bit) are usually a touch slower than bnb (8-bit), but the differences among the three are so small they are practically negligible. You can benchmark it yourself.
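
One way to measure it: time a non-streaming request against each container and divide by the token count reported by the server. A minimal sketch with the same text_generation client (the URL points at whichever TGI instance you are testing):

import time
from text_generation import Client

client = Client("http://127.0.0.1:8080")

start = time.perf_counter()
response = client.generate("什麼是深度學習?", max_new_tokens=100)
elapsed = time.perf_counter() - start

# details.generated_tokens is the number of tokens the server produced
tokens = response.details.generated_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tokens/s")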

@adamlin120
Collaborator

BTW, the demo site currently runs on two 3090s with TGI (no quantization), in case you want to compare the difference.
