remove owned_session #1097
Conversation
I've tested concurrency for both the pytorch and turbomind backends, and both work OK. However, there is a VRAM increase while benchmarking the RESTful API with the pytorch backend.
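For reference, a minimal sketch of how one might track the VRAM growth during the benchmark; `report_vram` is a hypothetical helper, not part of lmdeploy:

```python
import torch

# Hypothetical helper: sample CUDA memory before and after a benchmark
# run to quantify the VRAM increase on the pytorch backend.
def report_vram(tag: str, device: int = 0) -> None:
    allocated = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    print(f'[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB')

# e.g. call report_vram('before') and report_vram('after') around the
# RESTful API benchmark to see whether allocated memory keeps growing.
```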
Besides, the performance gap between …
I found that for api_server, if I send requests at 64 concurrency, the actual concurrency on the pytorch backend is only 10+. Do you have any clue about this?
A blocking queue get stalls the CPU inside a coroutine. I have updated a version with async recv in 6b6dcae. But since all tokenization is processed on the same CPU as the engine, it still cannot reach the performance of profile throughput.
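To illustrate the blocking-get issue, here is a minimal sketch, assuming a `multiprocessing.Queue` between the engine process and the server; the names are illustrative, and this is not the actual 6b6dcae change:

```python
import asyncio
import multiprocessing as mp

def recv_blocking(queue: mp.Queue):
    # queue.get() blocks the whole thread, so the event loop cannot
    # schedule any other request coroutine while we wait here. This is
    # how 64 client-side requests can collapse to ~10 effective ones.
    return queue.get()

async def recv_async(queue: mp.Queue):
    # Pushing the blocking get onto a worker thread keeps the event
    # loop free, so concurrent requests can actually run concurrently.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, queue.get)
```

This only removes the event-loop stall; tokenization still competes with the engine for the same CPU, which matches the remaining gap described above.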
LGTM
Conflicts:
  lmdeploy/serve/async_engine.py
  lmdeploy/tokenizer.py
Sessions are no longer bound to EngineInstance.
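A rough sketch of what this decoupling can look like; the class and method names below are hypothetical, not lmdeploy's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical registry: sessions are keyed by id and owned by the
# server rather than by a particular EngineInstance, so any engine
# instance can serve any session.
@dataclass
class Session:
    session_id: int
    history: list = field(default_factory=list)

class SessionRegistry:
    def __init__(self):
        self._sessions: dict[int, Session] = {}

    def get_or_create(self, session_id: int) -> Session:
        return self._sessions.setdefault(session_id, Session(session_id))

    def end(self, session_id: int) -> None:
        self._sessions.pop(session_id, None)
```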
Tested on `chat` and `profile_pytorch_benchmark` with llama-7b.