
Is it possible to speed up polyglot-12.8b-koalpaca-v1.1b? #34

Closed
SoroorMa opened this issue Apr 30, 2023 · 5 comments

@SoroorMa

Hi there,
I tried your newly released model (polyglot-12.8b-koalpaca-v1.1b) on my local system (with a single GPU), but it's rather slow. Is there any way to speed up generation?

Thank you!

@Beomi
Owner

Beomi commented Apr 30, 2023

To clarify, could you give me some information so I can take a guess?

  1. How 'slow' is the generation? 3-4 tokens/s is expected on most GPUs.
  2. Which tensor type are you using: fp16, or int8 quantized? 8-bit quantization makes it possible to load the 12B model on an RTX-series card (like a 3090/4090), but it adds 30-40% overhead, which could be why it feels slow (see the sketch below).
  3. Could you check https://chat.koalpaca.com and compare its generation speed with yours? It currently runs on a single RTX 3090, so if your GPU is better and you are still slower than the demo web UI, there may be an issue with your configuration (maybe the model is running on the CPU rather than the GPU...).
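
Roughly, 8-bit loading looks like this (a minimal sketch, assuming a transformers version with bitsandbytes and accelerate installed; the device check at the end is just to rule out CPU execution):

import torch
from transformers import AutoModelForCausalLM

MODEL = 'beomi/KoAlpaca-Polyglot-12.8B'

# 8-bit quantized load: fits the 12.8B model on a 24GB RTX 3090/4090,
# at the cost of the 30-40% generation overhead mentioned above.
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    load_in_8bit=True,   # requires bitsandbytes
    device_map="auto",   # requires accelerate; places weights on the GPU(s)
)
model.eval()

# Sanity check: the weights should live on a GPU, not the CPU.
print(next(model.parameters()).device)  # expect something like cuda:0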

@SoroorMa
Author

SoroorMa commented May 1, 2023

  1. I load the model with Hugging Face transformers and serve it with the Gradio SDK,
    but compared to the base version of this model it takes about 30 seconds before it starts generating text, while the original version took less than 10 seconds.
    I use a GPU (NVIDIA A100-SXM-80GB) and set max_new_tokens to 512 (no additional configuration).

  2. Hmm..., so it might be because of my GPU.

  3. Yes, I already tried it; I think it's fast enough at generating text.
    However, it sometimes generates additional information beyond the answer to the question.

I would appreciate any advice on fixing these issues.

@Beomi
Owner

Beomi commented May 9, 2023

Have you tried low_cpu_mem_usage=True and torch_dtype=torch.float16?

If torch_dtype is torch.float32, generation will be much slower.

import torch
from transformers import AutoModelForCausalLM

MODEL = 'beomi/KoAlpaca-Polyglot-12.8B'

# fp16 weights + low_cpu_mem_usage avoid the slow fp32 default load path.
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(device="cuda", non_blocking=True)
model.eval()
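
On top of that, a minimal generation call as a sketch (the prompt string below is only an example in the KoAlpaca instruction style, the sampling settings are illustrative, and max_new_tokens mirrors the 512 you mentioned):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL)

prompt = "### 질문: 딥러닝이 뭐야?\n\n### 답변:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        eos_token_id=tokenizer.eos_token_id,  # stop at end-of-text to limit extra output
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))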

@SoroorMa
Author

Oh, thank you!
What about the output quality, though? Would float16 have any effect on it?

@Beomi
Owner

Beomi commented Jun 13, 2023

fp16 has exactly half the precision of fp32, but for a generation model this does not harm generation quality.
So I suggest using float16, for both speed and quality :)
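
For rough, back-of-the-envelope numbers on why half precision helps (illustrative arithmetic, not a benchmark):

# Approximate weight-memory footprint of a 12.8B-parameter model.
params = 12.8e9
print(f"fp32: {params * 4 / 1e9:.1f} GB")  # ~51.2 GB
print(f"fp16: {params * 2 / 1e9:.1f} GB")  # ~25.6 GB: half the memory traffic per generated token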

Beomi closed this as completed on Jun 13, 2023