
benchmark stuck #20

Closed
leiwen83 opened this issue Aug 7, 2023 · 18 comments

Comments

@leiwen83

leiwen83 commented Aug 7, 2023

Hi,

I tried benchmark_serving.py to check the throughput of lightllm, but the benchmark process seems to get stuck after the server prints "freed all gpu mem"; after that, no more HTTP POST lines are printed except the last one.

Any idea?

current batch size: 1 token used ratio: 0.31983333333333336
freed all gpu mem size 6000
INFO:     127.0.0.1:34050 - "POST /generate HTTP/1.1" 200 OK
@hiworldwzj
Collaborator

@leiwen83 "freed all gpu mem size 6000" means that all requests have been finished. Did you have other requests left to run?

@hiworldwzj
Collaborator

@leiwen83 I guess your arg "batch_max_tokens" is at its default of 1/6 * 6000 = 1000, so requests with req_input_len > 1000 cannot be handled by the router. You can set batch_max_tokens larger. Also, 6000 is too small a max_total_token_num for throughput testing; please read https://github.com/ModelTC/lightllm/blob/main/docs/ApiServerArgs.md to choose a better value for max_total_token_num.
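
For example, a launch command along these lines might avoid the stall (the flag name simply mirrors the arg discussed above, and the numbers are only illustrative, not tuned recommendations):

python -m lightllm.server.api_server --model_dir /llama/7B/ --host 0.0.0.0 --port 8000 --max_total_token_num 48000 --batch_max_tokens 8000 --tokenizer_mode auto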

@leiwen83
Author

leiwen83 commented Aug 7, 2023

@hiworldwzj you're right, after changing batch_max_tokens the benchmark itself works.

python -m lightllm.server.api_server --model_dir  /llama/7B/ --host 0.0.0.0 --port 8000   --max_total_token_num 48000 --tokenizer_mode auto

But on an A100@40G I get a total time of 866.94 s, which is far from the reported 188.85 s. Is 866.94 s a reasonable value for an A100@40G? Running the same dataset with vLLM, I get a benchmark result of 449.97 s.

The test command is:

python benchmark_serving.py --tokenizer //llama/7B/ --dataset /ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000 --request-rate 200

@hiworldwzj
Collaborator

@leiwen83 I tried your setting on an A800 80G, but only using 40G of GPU memory. The result is:

total tokens: 986423
Total time: 227.69 s
Throughput: 8.78 requests/s
Average latency: 97.93 s
Average latency per token: 0.36 s
Average latency per output token: 2.41 s

I guess there could be two reasons. One possibility is that a slow tokenizer was loaded, and another possibility is that there may be issues related to the Triton version. I haven't tested the performance of the kernel I wrote on Triton version 2.1.0, so I'm not sure if the operator's performance would degrade on different GPUs. Sometimes, I feel very disappointed with the instability of Triton.

@hiworldwzj
Collaborator

@leiwen83 I will try to borrow an A100@40G to reproduce this and fix it. Thanks for reporting it. Alternatively, could you help identify the cause and fix this performance issue? Thanks.

@leiwen83
Author

leiwen83 commented Aug 7, 2023

I think that once you have tuned the result, a Docker image could be published together with the Hugging Face URL used for testing, so that people could better align with your results.

For now, it is very hard to say which component causes such a big performance drop.

@hiworldwzj
Collaborator

@leiwen83 Yes, you are right. I will try it.

@shihaobai
Collaborator

@leiwen83 You can try triton==2.0.0.dev20221202. Here is my result on a 40G A100:
total tokens: 986423
Total time: 249.52 s
Throughput: 8.02 requests/s
Average latency: 106.31 s
Average latency per token: 0.40 s
Average latency per output token: 2.61 s
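
For reference, that exact pre-release can normally be installed with pip (assuming the build is still available on your package index):

pip install triton==2.0.0.dev20221202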

@leiwen83
Author

leiwen83 commented Aug 8, 2023

Hi @shihaobai ,

After using triton==2.0.0.dev20221202, I get 801.74 s for the benchmark.
I also noticed one phenomenon: after "freed all gpu mem size" is printed, 1204 "POST /generate HTTP/1.1" 200 OK" lines are printed at a slow pace, while before "freed all gpu mem size" there are 796 such "HTTP OK" messages.

So it seems to me that after the first 796, the prompts get processed serially rather than in batches?

Would you mind packing your environment into a Docker image, together with the lightllm setup and the test case, so that I can reproduce your 249 s result on my machine?

Thx

@hiworldwzj
Collaborator

> After using triton==2.0.0.dev20221202, I get 801.74 s for the benchmark. After "freed all gpu mem size" is printed, 1204 "POST /generate HTTP/1.1" 200 OK" lines are printed at a slow pace, while before it there are 796 such messages. So it seems that after the first 796, the prompts get processed serially rather than in batches?

It very much sounds like you have loaded a slow tokenizer. The "freed all gpu mem size" print means "all requests have finished, but detokenization has not finished".

@leiwen83
Author

leiwen83 commented Aug 8, 2023

> It very much sounds like you have loaded a slow tokenizer. The "freed all gpu mem size" print means "all requests have finished, but detokenization has not finished".

How do I switch to a fast tokenizer?

@XHPlus
Contributor

XHPlus commented Aug 9, 2023

Do you have a tokenizer.json in your model folder? If that file is present, the fast tokenizer will be loaded.
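
A quick way to check which tokenizer you end up with (a minimal sketch; the model path is just the one used earlier in this thread):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/llama/7B/")
# is_fast is True only when the Rust-backed fast tokenizer (tokenizer.json) is loaded
print(type(tok).__name__, tok.is_fast)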

@leiwen83
Author

leiwen83 commented Aug 9, 2023

> Do you have a tokenizer.json in your model folder? If that file is present, the fast tokenizer will be loaded.

There is no such file...
Which llama-7b did you test with? Do you have a suggested repo on Hugging Face to test against?

Thx

@shihaobai
Collaborator

You can use the pre-trained model at https://huggingface.co/huggyllama/llama-7b/tree/main, which includes a fast tokenizer.
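
In case it helps, one way to fetch that repo is via huggingface_hub (a sketch; the local path is just an example):

from huggingface_hub import snapshot_download

# downloads the model weights together with tokenizer.json into a local folder
snapshot_download(repo_id="huggyllama/llama-7b", local_dir="/llama/7B-hf/")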

@leiwen83
Author

leiwen83 commented Aug 9, 2023

After switching to this model with a fast tokenizer, the result gets very close to yours:

Total time: 268.24 s
Throughput: 7.46 requests/s
Average latency: 115.63 s

Do you know how to generate a "tokenizer.json" for those llama models that lack it?
Another thing that surprises me is that, whether or not a llama model has "tokenizer.json", vLLM gets nearly the same benchmark value. Does that mean lightllm is more sensitive to having a fast tokenizer?

Thx

@hiworldwzj
Collaborator

> Do you know how to generate a "tokenizer.json" for those llama models that lack it? Another thing that surprises me is that, whether or not a llama model has "tokenizer.json", vLLM gets nearly the same benchmark value. Does that mean lightllm is more sensitive to having a fast tokenizer?

@shihaobai Can you help with this?

@XHPlus
Contributor

XHPlus commented Aug 10, 2023

> Do you know how to generate a "tokenizer.json" for those llama models that lack it?

You can export a tokenizer.json by following https://huggingface.co/docs/transformers/fast_tokenizers.
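
For a LLaMA checkpoint that only ships the SentencePiece tokenizer.model, one possible conversion (a sketch based on those docs, not a recipe verified by the maintainers) is to load it as a fast tokenizer and save it back, which writes tokenizer.json next to the model files:

from transformers import LlamaTokenizerFast

tok = LlamaTokenizerFast.from_pretrained("/llama/7B/")  # converts the slow SentencePiece tokenizer on load
tok.save_pretrained("/llama/7B/")                        # writes tokenizer.json alongside the existing files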

@leiwen83
Author

Got it. Thx.
