Fix performance issue of chatbot #1295
Conversation
Does this mean that preprocess and postprocess are actually not time-consuming, but that the Triton backend itself is not well implemented, so its throughput is worse than the API server's?
And without this PR's fix, the performance would be even worse.
Yes, in my test setting (server and client on the same machine), the effect of preprocess and postprocess is very small: whether preprocessing and postprocessing are added or removed, the throughput stays at 5.9 RPS. That is the actual performance of Triton inference server with the TurboMind backend. For the same 1000 sample prompts, the throughput of
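For context, here is a minimal sketch of how a throughput number like 5.9 RPS can be measured. The endpoint URL, payload fields, and prompts file are assumptions for illustration, not the benchmark script actually used in this test.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party HTTP client

SERVER_URL = 'http://localhost:8000/v1/completions'  # assumed endpoint


def send_request(prompt: str) -> None:
    # Fire one completion request; the payload fields are placeholders.
    requests.post(SERVER_URL, json={'prompt': prompt, 'max_tokens': 128})


def measure_rps(prompts: list[str], concurrency: int = 64) -> float:
    # Send all prompts with a fixed level of concurrency and report
    # end-to-end requests per second.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(send_request, prompts))
    return len(prompts) / (time.perf_counter() - start)


if __name__ == '__main__':
    with open('prompts.json') as f:  # assumed: a JSON list of 1000 prompts
        prompts = json.load(f)
    print(f'throughput: {measure_rps(prompts):.1f} RPS')
```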
It seems that whether the Triton ensemble mode is used is irrelevant: even with the ensemble, the throughput limit is the same as that of a single TurboMind model.
@AllentDan could you help review?
LGTM
After completing this fix, the RPS of the TurboMind model served through Triton is still lower than that of the API server, meaning the performance gap remains even when the pre- and post-processing of token ids is excluded. @ispobock has done some troubleshooting and currently found that forward is slower. This is very strange, because Python and C++ call the same forward function. More time is needed for investigation. Besides that, we will also complete the POC for #1309.
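To illustrate the kind of troubleshooting mentioned above, a generic timing wrapper can be placed around the suspect call on both paths; model.forward here is a hypothetical stand-in, not the actual TurboMind entry point.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Print wall-clock time for the wrapped block, in milliseconds."""
    # For GPU work, a synchronization call before reading the clock is
    # needed so queued kernels are included in the measurement.
    start = time.perf_counter()
    yield
    print(f'{label}: {(time.perf_counter() - start) * 1e3:.2f} ms')


# Hypothetical usage: wrap the same call on the Python path and the
# Triton path, then compare the numbers.
# with timed('forward'):
#     model.forward(input_ids)  # `model.forward` is a placeholder name
```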
LGTM
Motivation
As mentioned in #1280, the throughput of Triton inference server is only 3.4 RPS, which is unexpected. By further investigation, we found two factors that caused the incorrect throughput. Fixing the first raised the throughput to 5.0 RPS; fixing the second raised it to 5.9 RPS, which is much closer to the real throughput of Triton inference server.
Modification
Set capability='completion' to disable the prompt decoration.
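As a rough sketch of what this change looks like from the client side — assuming the Triton client's Chatbot forwards a capability keyword to the underlying chat template; the import path, constructor arguments, and server address reflect typical lmdeploy usage and should be checked against the actual code rather than read as the exact diff of this PR:

```python
# Hedged sketch, not the exact diff from this PR. Assumption: Chatbot
# accepts a `capability` keyword and forwards it to the chat template.
# With capability='completion' the template returns the prompt as-is
# instead of wrapping it in chat markup, so the token ids sent to the
# server match what the plain completion path produces.
from lmdeploy.serve.turbomind.chatbot import Chatbot  # path may vary by version

chatbot = Chatbot(
    tritonserver_addr='0.0.0.0:33337',  # assumed Triton server address
    capability='completion',            # disable the prompt decoration
)

# stream_infer yields (status, partial_text, generated_token_count).
for status, res, n_token in chatbot.stream_infer(session_id=1,
                                                 prompt='Hello, world'):
    print(res)
```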