
benchmark stuck #20

Closed
leiwen83 opened this issue Aug 7, 2023 · 18 comments

Comments

@leiwen83

leiwen83 commented Aug 7, 2023

Hi,

I tried benchmark_serving.py to check the throughput of lightllm, but the benchmark process seems to get stuck after the server prints "freed all gpu mem"; after that, no more HTTP POST lines are printed except the last one.

Any idea?

current batch size: 1 token used ratio: 0.31983333333333336
freed all gpu mem size 6000
INFO:     127.0.0.1:34050 - "POST /generate HTTP/1.1" 200 OK
@hiworldwzj
Collaborator

@leiwen83 "freed all gpu mem size 6000" means that all requests have been finished. Did you have other requests left to run?

@hiworldwzj
Collaborator

@leiwen83 I guess your arg "batch_max_tokens" is at its default of 1/6 * 6000 = 1000, so requests with req_input_len > 1000 cannot be handled by the router. You can set batch_max_tokens larger. Also, 6000 is too small a max_total_token_num for throughput testing; please read https://github.com/ModelTC/lightllm/blob/main/docs/ApiServerArgs.md to choose a better value for max_total_token_num.
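
For example, a launch command along these lines might avoid the stall (the flag name simply mirrors the arg discussed above, and the numbers are only illustrative, not tuned recommendations):

python -m lightllm.server.api_server --model_dir /llama/7B/ --host 0.0.0.0 --port 8000 --max_total_token_num 48000 --batch_max_tokens 8000 --tokenizer_mode auto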

@leiwen83
Author

leiwen83 commented Aug 7, 2023

@hiworldwzj you're right, after changing batch_max_tokens the benchmark itself works.

python -m lightllm.server.api_server --model_dir  /llama/7B/ --host 0.0.0.0 --port 8000   --max_total_token_num 48000 --tokenizer_mode auto

But on an A100@40G I get a total time of 866.94 s, which is far from the reported 188.85 s. Is 866.94 s a reasonable value for an A100@40G? Running the same dataset with vLLM, I get a benchmark result of 449.97 s.

The test command is:

python benchmark_serving.py --tokenizer //llama/7B/ --dataset /ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000 --request-rate 200

@hiworldwzj
Collaborator

@leiwen83 I tried your setting on an A800 80G, but only using 40G of GPU memory. The result is:

total tokens: 986423
Total time: 227.69 s
Throughput: 8.78 requests/s
Average latency: 97.93 s
Average latency per token: 0.36 s
Average latency per output token: 2.41 s

I guess there could be two reasons. One possibility is that a slow tokenizer was loaded, and another possibility is that there may be issues related to the Triton version. I haven't tested the performance of the kernel I wrote on Triton version 2.1.0, so I'm not sure if the operator's performance would degrade on different GPUs. Sometimes, I feel very disappointed with the instability of Triton.

@hiworldwzj
Collaborator

@leiwen83 I will try to borrow an A100@40G to reproduce this and fix it. Thanks for reporting it. Alternatively, could you help identify the cause and fix this performance issue? Thanks.

@leiwen83
Author

leiwen83 commented Aug 7, 2023

I think that once you have tuned the result, a Docker image could be published together with the Hugging Face URL used for testing, so that people could better align with your results.

For now, it is very hard to say which component causes such a big performance drop.

@hiworldwzj
Collaborator

@leiwen83 Yes, you are right. I will try it.

@shihaobai
Collaborator

@leiwen83 You can try triton==2.0.0.dev20221202. Here is my result on a 40G A100:
total tokens: 986423
Total time: 249.52 s
Throughput: 8.02 requests/s
Average latency: 106.31 s
Average latency per token: 0.40 s
Average latency per output token: 2.61 s
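
For reference, that exact pre-release can normally be installed with pip (assuming the build is still available on your package index):

pip install triton==2.0.0.dev20221202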

@leiwen83
Author

leiwen83 commented Aug 8, 2023

Hi @shihaobai ,

After using triton==2.0.0.dev20221202, I get 801.74 s for the benchmark.
I also noticed one phenomenon: after "freed all gpu mem size" is printed, 1204 "POST /generate HTTP/1.1" 200 OK" lines are printed at a slow pace, while before "freed all gpu mem size" there are 796 such "HTTP OK" messages.

So it seems to me that after the first 796, the prompts get processed serially rather than in batches?

Would you mind packing your environment into a Docker image, together with the lightllm setup and the test case, so that I can reproduce your 249 s result on my machine?

Thx

@hiworldwzj
Collaborator

> After using triton==2.0.0.dev20221202, I get 801.74 s for the benchmark. After "freed all gpu mem size" is printed, 1204 "POST /generate HTTP/1.1" 200 OK" lines are printed at a slow pace, while before it there are 796 such messages. So it seems that after the first 796, the prompts get processed serially rather than in batches?

It very much sounds like you have loaded a slow tokenizer. The "freed all gpu mem size" print means "all requests have finished, but detokenization has not finished".

@leiwen83
Author

leiwen83 commented Aug 8, 2023

> It very much sounds like you have loaded a slow tokenizer. The "freed all gpu mem size" print means "all requests have finished, but detokenization has not finished".

How do I switch to a fast tokenizer?

@XHPlus
Contributor

XHPlus commented Aug 9, 2023

Do you have a tokenizer.json in your model folder? If that file is present, the fast tokenizer will be loaded.
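
A quick way to check which tokenizer you end up with (a minimal sketch; the model path is just the one used earlier in this thread):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/llama/7B/")
# is_fast is True only when the Rust-backed fast tokenizer (tokenizer.json) is loaded
print(type(tok).__name__, tok.is_fast)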

@leiwen83
Author

leiwen83 commented Aug 9, 2023

> Do you have a tokenizer.json in your model folder? If that file is present, the fast tokenizer will be loaded.

There is no such file...
Which llama-7b did you test with? Do you have a suggested repo on Hugging Face to test against?

Thx

@shihaobai
Collaborator

You can use the pre-trained model at https://huggingface.co/huggyllama/llama-7b/tree/main, which includes a fast tokenizer.
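
In case it helps, one way to fetch that repo is via huggingface_hub (a sketch; the local path is just an example):

from huggingface_hub import snapshot_download

# downloads the model weights together with tokenizer.json into a local folder
snapshot_download(repo_id="huggyllama/llama-7b", local_dir="/llama/7B-hf/")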

@leiwen83
Author

leiwen83 commented Aug 9, 2023

After switching to this model with a fast tokenizer, the result gets very close to yours:

Total time: 268.24 s
Throughput: 7.46 requests/s
Average latency: 115.63 s

Do you know how to generate a "tokenizer.json" for those llama models that lack it?
Another thing that surprises me is that, whether or not a llama model has "tokenizer.json", vLLM gets nearly the same benchmark value. Does that mean lightllm is more sensitive to having a fast tokenizer?

Thx

@hiworldwzj
Collaborator

> Do you know how to generate a "tokenizer.json" for those llama models that lack it? Another thing that surprises me is that, whether or not a llama model has "tokenizer.json", vLLM gets nearly the same benchmark value. Does that mean lightllm is more sensitive to having a fast tokenizer?

@shihaobai Can you help with this?

@XHPlus
Contributor

XHPlus commented Aug 10, 2023

> Do you know how to generate a "tokenizer.json" for those llama models that lack it?

You can export a tokenizer.json by following https://huggingface.co/docs/transformers/fast_tokenizers.
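
For a LLaMA checkpoint that only ships the SentencePiece tokenizer.model, one possible conversion (a sketch based on those docs, not a recipe verified by the maintainers) is to load it as a fast tokenizer and save it back, which writes tokenizer.json next to the model files:

from transformers import LlamaTokenizerFast

tok = LlamaTokenizerFast.from_pretrained("/llama/7B/")  # converts the slow SentencePiece tokenizer on load
tok.save_pretrained("/llama/7B/")                        # writes tokenizer.json alongside the existing files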

@leiwen83
Author

Got it. Thx.
