Fix performance issue of chatbot #1295

Merged (1 commit, Mar 20, 2024)

Conversation

ispobock
Contributor

Motivation

As mentioned in #1280, the throughput of the Triton inference server is only 3.4 RPS, which is unexpectedly low.

On further investigation, we found two factors that caused the misleadingly low measured throughput:

  1. The gRPC result format conversion in the chatbot client is time-consuming. After fixing it, the throughput increased to 5.0 RPS.
  2. In the default setting, the prompt is decorated by the chat template, so the actual number of prompt tokens is larger than the given input. For profiling, we should use fixed prompt and output lengths (a toy sketch of this effect follows the list). After disabling the prompt decoration, the throughput increased to 5.9 RPS, which is much closer to the real throughput of the Triton inference server.
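
A minimal sketch of the second point, using a toy whitespace tokenizer and a made-up chat template rather than the real lmdeploy model templates: the decorated prompt carries extra template tokens, so the measured prompt length no longer matches the fixed length the benchmark asked for.

```python
# Toy illustration (not lmdeploy code): chat-template decoration inflates the
# effective prompt length in a fixed-length benchmark.

def toy_tokenize(text):
    # stand-in for a real tokenizer; only the relative counts matter here
    return text.split()

raw_prompt = ' '.join(['word'] * 128)            # the benchmark intends exactly 128 prompt tokens
decorated = f'<|User|>: {raw_prompt} <|Bot|>:'   # made-up chat-template decoration

print(len(toy_tokenize(raw_prompt)))   # 128, the requested fixed prompt length
print(len(toy_tokenize(decorated)))    # 130 here; with a real template and tokenizer the overhead is larger
```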

Modification

  1. Avoid the gRPC result format conversion (a toy sketch of the idea follows this list).
  2. Use capability='completion' to disable the prompt decoration.
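
The following is only a toy illustration of the first modification's idea, with made-up payloads rather than the chatbot's actual gRPC handling: doing a full format conversion on every streamed result in the hot path costs far more than reading just the field the client needs.

```python
# Toy timing sketch (made-up payloads, not the chatbot's gRPC code): per-result
# full conversion versus reading only the needed field.
import copy
import time

# pretend each streamed result carries several tensors besides the token ids
results = [{'OUTPUT_IDS': list(range(64)), 'LOGITS': [0.0] * 4096} for _ in range(2000)]

t0 = time.perf_counter()
for r in results:
    _ = copy.deepcopy(r)        # stand-in for a per-result full format conversion
t1 = time.perf_counter()
for r in results:
    _ = r['OUTPUT_IDS']         # read only what is actually needed
t2 = time.perf_counter()

print(f'full conversion per result: {t1 - t0:.3f} s')
print(f'field access only:          {t2 - t1:.3f} s')
```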

@zhyncs
Contributor

zhyncs commented Mar 16, 2024

Does this mean that preprocessing and postprocessing are actually not time-consuming, but that the Triton backend itself is not implemented well, so its throughput is worse than the API server's?

@zhyncs
Contributor

zhyncs commented Mar 16, 2024

And without this PR's fix, the performance would be even worse.

@ispobock
Contributor Author

> Does this mean that preprocessing and postprocessing are actually not time-consuming, but that the Triton backend itself is not implemented well, so its throughput is worse than the API server's?

Yes, in my test setting (server and client on the same machine), the effect of preprocessing and postprocessing is very small. Whether preprocessing and postprocessing are enabled or disabled, the throughput stays at 5.9 RPS. That's the actual performance of the Triton inference server with the TurboMind backend.

For the same 1000 sample prompts, the throughput of api_server is 6.6 RPS, so there is still an almost 12% performance gap.

@zhyncs
Contributor

zhyncs commented Mar 16, 2024

> Does this mean that preprocessing and postprocessing are actually not time-consuming, but that the Triton backend itself is not implemented well, so its throughput is worse than the API server's?
>
> Yes, in my test setting (server and client on the same machine), the effect of preprocessing and postprocessing is very small. Whether preprocessing and postprocessing are enabled or disabled, the throughput stays at 5.9 RPS. That's the actual performance of the Triton inference server with the TurboMind backend.
>
> For the same 1000 sample prompts, the throughput of api_server is 6.6 RPS, so there is still an almost 12% performance gap.

It seems that whether or not the Triton ensemble mode is used is irrelevant: even with the ensemble, the throughput limit is still the same as that of a single TurboMind model.

@ispobock
Contributor Author

@AllentDan could you help review?

Collaborator

@AllentDan left a comment


LGTM

@zhyncs
Contributor

zhyncs commented Mar 19, 2024

After completing this fix, the RPS of the TurboMind model with Triton is still lower than that of the API server, which means that even when the pre- and post-processing of token ids is left out of consideration, the performance is still worse. @ispobock has done some troubleshooting and has found so far that forward is slower. However, this is very strange because Python and C++ call the same forward function. More time is needed for investigation. Besides that, we will also complete the POC for #1309.

@zhyncs
Contributor

zhyncs commented Mar 19, 2024

LGTM

@lvhan028 merged commit 82e14db into InternLM:main on Mar 20, 2024
5 checks passed