
Is it possible to speed up polyglot-12.8b-koalpaca-v1.1b? #34

Closed
SoroorMa opened this issue Apr 30, 2023 · 5 comments

@SoroorMa

Hi there,
I tried your newly released model (polyglot-12.8b-koalpaca-v1.1b) on my local system (with a single GPU), but it's rather slow. Is there any way to speed up generation?

Thank you!

@Beomi
Owner

Beomi commented Apr 30, 2023

To clarify, could you give me some information so I can take a guess?

  1. How 'slow' is the generation? 3-4 tokens/s is expected on most GPUs.
  2. Which tensor type are you using: fp16, or int8 quantized? 8-bit quantization makes it possible to load the 12B model on an RTX-series card (like a 3090/4090), but it adds 30-40% overhead, which could be why it feels slow (see the sketch below).
  3. Could you check https://chat.koalpaca.com and compare its generation speed with yours? It currently runs on a single RTX 3090, so if your GPU is better and you are still slower than the demo web UI, there may be an issue with your configuration (maybe the model is running on the CPU rather than the GPU...).
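
Roughly, 8-bit loading looks like this (a minimal sketch, assuming a transformers version with bitsandbytes and accelerate installed; the device check at the end is just to rule out CPU execution):

import torch
from transformers import AutoModelForCausalLM

MODEL = 'beomi/KoAlpaca-Polyglot-12.8B'

# 8-bit quantized load: fits the 12.8B model on a 24GB RTX 3090/4090,
# at the cost of the 30-40% generation overhead mentioned above.
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    load_in_8bit=True,   # requires bitsandbytes
    device_map="auto",   # requires accelerate; places weights on the GPU(s)
)
model.eval()

# Sanity check: the weights should live on a GPU, not the CPU.
print(next(model.parameters()).device)  # expect something like cuda:0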

@SoroorMa
Author

SoroorMa commented May 1, 2023

  1. I load the model with Hugging Face transformers and serve it with the Gradio SDK,
    but compared to the base version of this model it takes about 30 seconds before it starts generating text, while the original version took less than 10 seconds.
    I use a GPU (NVIDIA A100-SXM-80GB) and set max_new_tokens to 512 (no additional configuration).

  2. Hmm..., so it might be because of my GPU.

  3. Yes, I already tried it; I think it's fast enough at generating text.
    However, it sometimes generates additional information beyond the answer to the question.

I would appreciate any advice on fixing these issues.

@Beomi
Owner

Beomi commented May 9, 2023

Have you tried low_cpu_mem_usage=True and torch_dtype=torch.float16?

If torch_dtype is torch.float32, generation will be much slower.

import torch
from transformers import AutoModelForCausalLM

MODEL = 'beomi/KoAlpaca-Polyglot-12.8B'

# fp16 weights + low_cpu_mem_usage avoid the slow fp32 default load path.
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(device="cuda", non_blocking=True)
model.eval()
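
On top of that, a minimal generation call as a sketch (the prompt string below is only an example in the KoAlpaca instruction style, the sampling settings are illustrative, and max_new_tokens mirrors the 512 you mentioned):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL)

prompt = "### 질문: 딥러닝이 뭐야?\n\n### 답변:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        eos_token_id=tokenizer.eos_token_id,  # stop at end-of-text to limit extra output
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))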

@SoroorMa
Author

Oh, thank you!
What about the output quality, though? Would float16 have any effect on it?

@Beomi
Owner

Beomi commented Jun 13, 2023

fp16 has exactly half the precision of fp32, but for a generation model this does not harm generation quality.
So I suggest using float16, for both speed and quality :)
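
For rough, back-of-the-envelope numbers on why half precision helps (illustrative arithmetic, not a benchmark):

# Approximate weight-memory footprint of a 12.8B-parameter model.
params = 12.8e9
print(f"fp32: {params * 4 / 1e9:.1f} GB")  # ~51.2 GB
print(f"fp16: {params * 2 / 1e9:.1f} GB")  # ~25.6 GB: half the memory traffic per generated token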

Beomi closed this as completed on Jun 13, 2023