Fix concurrency issues for inference #1371

Closed
pr3mar wants to merge 1 commit

Conversation


@pr3mar pr3mar commented Jun 12, 2025

Hi, I was using the FastAPI server, and I could only make one request before I needed to restart the server.
And given that it worked before the latest vLLM changes, I knew some newly introduced issue was responsible.

Basically, the existing code is wrapped in a try/except/finally clause, and after the final result is yielded, some cleanup is done :)
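
For reference, the shape of the change is roughly the following (a minimal sketch; acquire_trt_context/release_trt_context are hypothetical names standing in for however the lock is actually obtained and returned):

# Minimal sketch of the pattern described above: wrap the streaming inference
# generator in try/finally so the TRT context (or lock) is always released,
# even after the last chunk has been yielded or an exception interrupts the
# stream. Names are illustrative, not the repo's actual API.
def generate_audio(request, engine):
    context = engine.acquire_trt_context()  # hypothetical: check out a context/lock
    try:
        for chunk in engine.inference(request, context):
            yield chunk  # stream each audio chunk back to the client
    finally:
        # without this, the context is never returned and the next request hangs
        engine.release_trt_context(context)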

Keep up the great work 👋

@aluminumbox
Collaborator

Check TrtContextWrapper. We have now moved trt_context into TrtContextWrapper, so it no longer blocks inference, and you can set a higher trt_concurrent.
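
The idea, roughly, is a pool of execution contexts sized by trt_concurrent, so concurrent requests each check one out instead of serializing on a single context. A minimal sketch with illustrative names (not the actual TrtContextWrapper code):

import queue

class TrtContextPool:
    # Illustrative stand-in for the wrapper described above, not the code in
    # cosyvoice/utils/common.py: hold trt_concurrent execution contexts in a
    # thread-safe queue and hand them out one per in-flight request.
    def __init__(self, make_context, trt_concurrent=1):
        self.contexts = queue.Queue()
        for _ in range(trt_concurrent):
            self.contexts.put(make_context())  # e.g. engine.create_execution_context()

    def acquire(self):
        return self.contexts.get()  # blocks while all contexts are in use

    def release(self, ctx):
        self.contexts.put(ctx)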

@pr3mar
Author

pr3mar commented Jun 17, 2025

@aluminumbox great! When are you pushing the fix?
As it stands right now, it doesn't work without the proposed fix

@aluminumbox
Collaborator

> @aluminumbox great! When are you pushing the fix? As it stands right now, it doesn't work without the proposed fix

It is already in our main branch; check https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/utils/common.py#L171.
Now we only use this trt_context in flow_matching.py during TensorRT inference.
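
In other words, the flow-matching code is meant to check a context out of the wrapper only for the duration of the TensorRT call, along these lines (a sketch building on the pool idea above; run_trt_inference is a hypothetical helper, not the code in flow_matching.py):

def estimate_with_trt(pool, feeds):
    # Sketch only; the real logic lives in cosyvoice/flow/flow_matching.py.
    ctx = pool.acquire()
    try:
        return run_trt_inference(ctx, feeds)  # hypothetical TensorRT call
    finally:
        pool.release(ctx)  # return the context so other requests can proceed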

@pr3mar
Author

pr3mar commented Jun 17, 2025

@aluminumbox I see it now that you've mentioned it, but I think it also needs some work.

As I said, neither the HTTP server nor the gRPC server works without the suggested fix, which changes how the TRT lock is acquired and then released.

@aluminumbox
Collaborator

> @aluminumbox I see it now that you've mentioned it, but I think it also needs some work.
>
> As I said, neither the HTTP server nor the gRPC server works without the suggested fix, which changes how the TRT lock is acquired and then released.

Check https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/flow/flow_matching.py#L129; it should work under concurrency. You need to set a higher trt_concurrent when using vLLM.
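
For example (hedged: this assumes the CosyVoice2 constructor exposes the trt_concurrent argument being discussed here):

from cosyvoice.cli.cosyvoice import CosyVoice2

# Assumed usage based on this thread: raise trt_concurrent so that several
# requests can run TensorRT inference at the same time. Whether the argument
# is actually wired through like this is exactly what is questioned below.
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_trt=True, trt_concurrent=4)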

@pr3mar
Author

pr3mar commented Jun 17, 2025

It doesn't work because trt_concurrent isn't even propagated properly; see:
https://github.com/FunAudioLLM/CosyVoice/pull/1371/files#diff-6091622429a7f519ed255b9a7061725feaf46abb6b442979e486e376debb48b3R284
and
https://github.com/FunAudioLLM/CosyVoice/pull/1371/files#diff-6091622429a7f519ed255b9a7061725feaf46abb6b442979e486e376debb48b3R294

I was not using vLLM in this particular case, and I used CosyVoice2, not CosyVoice.

@aluminumbox
Collaborator

> trt_concurrent

Check https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/cli/model.py#L94; we pass trt_concurrent when initializing the trt wrapper.

@pr3mar
Author

pr3mar commented Jun 17, 2025

@aluminumbox can you please run the server and try to make 2 consecutive GET requests to the inference_zero_shot endpoint?

I was not able to get a response to the second request; the server just never responded after 5+ minutes of waiting. I was running this on a 4090, on which the first request took about 1 second.

I ran it with:

conda activate cosyvoice
cd runtime/python/fastapi
python server.py

Then I sent a GET request via Postman.
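
For reference, the same two-request reproduction without Postman looks roughly like this (the form field names and port are assumptions based on the example client in runtime/python/fastapi/; adjust as needed):

import requests

# Send the same zero-shot request twice in a row; the report above is that the
# second call never returns because the first stream never releases the TRT
# context. Field names and port are assumptions, not the server's confirmed
# API; check runtime/python/fastapi/client.py.
url = 'http://127.0.0.1:50000/inference_zero_shot'
for i in range(2):
    with open('prompt.wav', 'rb') as prompt_wav:
        data = {'tts_text': 'Hello there.', 'prompt_text': 'Prompt transcript.'}
        with requests.get(url, data=data, files={'prompt_wav': prompt_wav}, stream=True) as resp:
            audio = b''.join(resp.iter_content(chunk_size=16000))
            print(f'request {i + 1}: received {len(audio)} bytes')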

@aluminumbox
Collaborator

> @aluminumbox can you please run the server and try to make 2 consecutive GET requests to the inference_zero_shot endpoint?
>
> I was not able to get a response to the second request; the server just never responded after 5+ minutes of waiting. I was running this on a 4090, on which the first request took about 1 second.
>
> I ran it with:
>
> conda activate cosyvoice
> cd runtime/python/fastapi
> python server.py
>
> Then I sent a GET request via Postman.

We are sure that gRPC can do concurrent inference. The FastAPI runtime is contributed by the community; I do not know whether it supports streaming concurrency. You can try the gRPC example.
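
For anyone following along, the gRPC example referred to here sits next to the FastAPI one and can be tried the same way (a sketch; check runtime/python/grpc/ for the exact server and client arguments):

conda activate cosyvoice
cd runtime/python/grpc
python server.py

Then, in a second shell, run the example client twice in a row to check concurrency.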
