Fix performance issue of chatbot #1295

Merged (1 commit, Mar 20, 2024)

Conversation

ispobock
Contributor

Motivation

As mentioned in #1280, the throughput of the Triton inference server is only 3.4 RPS, which is unexpectedly low.

On further investigation, we found two factors that caused the misleadingly low measured throughput:

  1. The gRPC result format conversion in the chatbot client is time-consuming. After fixing it, the throughput increased to 5.0 RPS.
  2. In the default setting, the prompt is decorated by the chat template, so the actual number of prompt tokens is larger than the given input. For profiling, we should use fixed prompt and output lengths (a toy sketch of this effect follows the list). After disabling the prompt decoration, the throughput increased to 5.9 RPS, which is much closer to the real throughput of the Triton inference server.
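
A minimal sketch of the second point, using a toy whitespace tokenizer and a made-up chat template rather than the real lmdeploy model templates: the decorated prompt carries extra template tokens, so the measured prompt length no longer matches the fixed length the benchmark asked for.

```python
# Toy illustration (not lmdeploy code): chat-template decoration inflates the
# effective prompt length in a fixed-length benchmark.

def toy_tokenize(text):
    # stand-in for a real tokenizer; only the relative counts matter here
    return text.split()

raw_prompt = ' '.join(['word'] * 128)            # the benchmark intends exactly 128 prompt tokens
decorated = f'<|User|>: {raw_prompt} <|Bot|>:'   # made-up chat-template decoration

print(len(toy_tokenize(raw_prompt)))   # 128, the requested fixed prompt length
print(len(toy_tokenize(decorated)))    # 130 here; with a real template and tokenizer the overhead is larger
```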

Modification

  1. Avoid the gRPC result format conversion (a toy sketch of the idea follows this list).
  2. Use capability='completion' to disable the prompt decoration.
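
The following is only a toy illustration of the first modification's idea, with made-up payloads rather than the chatbot's actual gRPC handling: doing a full format conversion on every streamed result in the hot path costs far more than reading just the field the client needs.

```python
# Toy timing sketch (made-up payloads, not the chatbot's gRPC code): per-result
# full conversion versus reading only the needed field.
import copy
import time

# pretend each streamed result carries several tensors besides the token ids
results = [{'OUTPUT_IDS': list(range(64)), 'LOGITS': [0.0] * 4096} for _ in range(2000)]

t0 = time.perf_counter()
for r in results:
    _ = copy.deepcopy(r)        # stand-in for a per-result full format conversion
t1 = time.perf_counter()
for r in results:
    _ = r['OUTPUT_IDS']         # read only what is actually needed
t2 = time.perf_counter()

print(f'full conversion per result: {t1 - t0:.3f} s')
print(f'field access only:          {t2 - t1:.3f} s')
```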

@zhyncs
Contributor

zhyncs commented Mar 16, 2024

Does this mean that preprocessing and postprocessing are actually not time-consuming, but that the Triton backend itself is not implemented well, so its throughput is worse than the API server's?

@zhyncs
Contributor

zhyncs commented Mar 16, 2024

And without this PR's fix, the performance would be even worse.

@ispobock
Contributor Author

> Does this mean that preprocessing and postprocessing are actually not time-consuming, but that the Triton backend itself is not implemented well, so its throughput is worse than the API server's?

Yes, in my test setting (server and client on the same machine), the effect of preprocessing and postprocessing is very small. Whether preprocessing and postprocessing are enabled or disabled, the throughput stays at 5.9 RPS. That's the actual performance of the Triton inference server with the TurboMind backend.

For the same 1000 sample prompts, the throughput of api_server is 6.6 RPS, so there is still an almost 12% performance gap.

@zhyncs
Contributor

zhyncs commented Mar 16, 2024

> Does this mean that preprocessing and postprocessing are actually not time-consuming, but that the Triton backend itself is not implemented well, so its throughput is worse than the API server's?
>
> Yes, in my test setting (server and client on the same machine), the effect of preprocessing and postprocessing is very small. Whether preprocessing and postprocessing are enabled or disabled, the throughput stays at 5.9 RPS. That's the actual performance of the Triton inference server with the TurboMind backend.
>
> For the same 1000 sample prompts, the throughput of api_server is 6.6 RPS, so there is still an almost 12% performance gap.

It seems that whether or not the Triton ensemble mode is used is irrelevant: even with the ensemble, the throughput limit is still the same as that of a single TurboMind model.

@ispobock
Contributor Author

@AllentDan could you help review?

Collaborator

@AllentDan left a comment


LGTM

@zhyncs
Contributor

zhyncs commented Mar 19, 2024

After completing this fix, the RPS of the TurboMind model with Triton is still lower than that of the API server, which means that even when the pre- and post-processing of token ids is left out of consideration, the performance is still worse. @ispobock has done some troubleshooting and has found so far that forward is slower. However, this is very strange because Python and C++ call the same forward function. More time is needed for investigation. Besides that, we will also complete the POC for #1309.

@zhyncs
Contributor

zhyncs commented Mar 19, 2024

LGTM

@lvhan028 merged commit 82e14db into InternLM:main on Mar 20, 2024
5 checks passed