Fix concurrency issues for inference #1371
Conversation
Check TrtContextWrapper: we now move trt_context into TrtContextWrapper, so it will not block inference, and you can set a higher trt_concurrent. |
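For readers following along, the idea is roughly this: keep a pool of TensorRT execution contexts sized by trt_concurrent and hand them out through a blocking queue, so one long inference does not block the others. The snippet below is only a minimal sketch of that idea, not the code in common.py:

```python
# Minimal sketch (illustrative, not the repo's implementation): a wrapper that
# owns trt_concurrent execution contexts and lends them out via a blocking queue.
import queue
from contextlib import contextmanager


class TrtContextWrapper:
    def __init__(self, trt_engine, trt_concurrent=1):
        self.context_pool = queue.Queue()
        for _ in range(trt_concurrent):
            # Each slot gets its own execution context, so several requests
            # can run TRT inference in parallel.
            self.context_pool.put(trt_engine.create_execution_context())

    @contextmanager
    def acquire(self):
        # Block until a free context is available, and always hand it back,
        # even if the inference using it raises.
        trt_context = self.context_pool.get()
        try:
            yield trt_context
        finally:
            self.context_pool.put(trt_context)
```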
@aluminumbox great! When are you pushing the fix? |
It is already in our main branch; check https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/utils/common.py#L171 |
@aluminumbox I saw it now that you mentioned it, but I think it still needs some work. As I said, neither the HTTP server nor the gRPC server works without the suggested fix, which adjusts how you acquire the lock on the TRT context and then release it. |
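The acquire/release pattern being argued about looks roughly like this (run_trt_inference and the attribute names are hypothetical placeholders, not the actual CosyVoice API): the context is checked out before the TRT call and returned in a finally block, so an exception or a dropped client cannot leave it locked:

```python
# Illustrative caller-side pattern: borrow a context, always give it back.
def forward_estimator(wrapper, feed_dict):
    trt_context = wrapper.context_pool.get()   # blocks until a slot is free
    try:
        # run_trt_inference is a hypothetical helper standing in for the
        # actual TensorRT execute/enqueue call on the borrowed context.
        return run_trt_inference(trt_context, feed_dict)
    finally:
        wrapper.context_pool.put(trt_context)  # release even on error
```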
Check https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/flow/flow_matching.py#L129 — it should work during concurrency. You need to set a higher trt_concurrent when using vLLM. |
It doesn't work because you don't even propagate the […]. I was not using vLLM for this particular case, and I used […]. |
Check https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/cli/model.py#L94 — we pass trt_concurrent when initializing trt_wrapper. |
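In practice the knob for users is the value chosen at construction time. A hedged usage sketch, assuming the CLI class exposes load_trt and trt_concurrent arguments as described in this thread (check the current CosyVoice2 signature before copying):

```python
# Usage sketch (argument names assumed from this thread): ask for more pooled
# TRT execution contexts when several requests may stream at the same time.
from cosyvoice.cli.cosyvoice import CosyVoice2

cosyvoice = CosyVoice2(
    'pretrained_models/CosyVoice2-0.5B',
    load_trt=True,
    trt_concurrent=4,   # number of TensorRT execution contexts in the pool
)
```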
@aluminumbox can you please run the server and try to make 2 consecutive GET requests on the […] endpoint? I was not able to get a response from the second request; the server just never responded after waiting 5+ minutes. I was running this on a 4090, on which the first request took about 1 second. I ran it with: […]
Then I sent a GET request via Postman. |
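For anyone trying to reproduce this without Postman, the same check can be scripted by firing the two requests concurrently and verifying that both finish; the endpoint path and query parameters below are placeholders, not the server's real ones:

```python
# Reproduction sketch: two simultaneous GET requests against the FastAPI server.
# URL and query parameters are placeholders; substitute the server's real ones.
import threading
import requests


def hit(i):
    resp = requests.get(
        "http://127.0.0.1:50000/inference_endpoint",     # placeholder path
        params={"tts_text": "hello", "spk_id": "demo"},  # placeholder params
        stream=True,
        timeout=300,
    )
    # Drain the streamed body so the server has to finish the whole request.
    total = sum(len(chunk) for chunk in resp.iter_content(chunk_size=16000))
    print(f"request {i}: status={resp.status_code}, bytes={total}")


threads = [threading.Thread(target=hit, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```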
We are sure that gRPC can do concurrent inference. The FastAPI server is contributed by the community, and I do not know whether it supports streaming concurrency; you can try the gRPC example. |
Hi, I was using the FastAPI server, and I could only make one request before I needed to restart the server.
And given that it worked before the latest changes with vLLM, I knew there were some newly introduced issues.
Basically, the existing code is wrapped in a try/except/finally clause, and after yielding the final result, there's some cleaning up, as sketched below :)
Keep up the great work 👋
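As a concrete illustration of that last point, the cleanup would sit in the finally of the streaming generator, so the TRT context is released only after the last chunk has been yielded; all names here are assumed for the sketch, not the merged code:

```python
# Sketch of the generator-side cleanup: hold one pooled TRT context for the
# whole stream and return it in finally, whether the stream finishes, raises,
# or the client disconnects mid-way.
def generate_audio_stream(model, request):
    trt_context = model.trt_context_wrapper.context_pool.get()  # assumed attribute
    try:
        for chunk in model.inference_stream(request, trt_context):  # assumed API
            yield chunk  # stream partial audio back to the client
    finally:
        # Runs on normal exhaustion, on an exception, and on generator close,
        # so the next request can always obtain a context.
        model.trt_context_wrapper.context_pool.put(trt_context)
```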