Fix concurrency issues for inference #1371

Closed
pr3mar wants to merge 1 commit

Conversation


@pr3mar pr3mar commented Jun 12, 2025

Hi, I was using the FastAPI server, and I could only make one request before I needed to restart the server.
And given that it worked before the latest vLLM changes, I knew some newly introduced issue was responsible.

Basically, the existing code is wrapped in a try/except/finally clause, and after the final result is yielded, some cleanup is done :)
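
For reference, the shape of the change is roughly the following (a minimal sketch; acquire_trt_context/release_trt_context are hypothetical names standing in for however the lock is actually obtained and returned):

# Minimal sketch of the pattern described above: wrap the streaming inference
# generator in try/finally so the TRT context (or lock) is always released,
# even after the last chunk has been yielded or an exception interrupts the
# stream. Names are illustrative, not the repo's actual API.
def generate_audio(request, engine):
    context = engine.acquire_trt_context()  # hypothetical: check out a context/lock
    try:
        for chunk in engine.inference(request, context):
            yield chunk  # stream each audio chunk back to the client
    finally:
        # without this, the context is never returned and the next request hangs
        engine.release_trt_context(context)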

Keep up the great work 👋

@aluminumbox
Collaborator

Check TrtContextWrapper. We have now moved trt_context into TrtContextWrapper, so it no longer blocks inference, and you can set a higher trt_concurrent.
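
The idea, roughly, is a pool of execution contexts sized by trt_concurrent, so concurrent requests each check one out instead of serializing on a single context. A minimal sketch with illustrative names (not the actual TrtContextWrapper code):

import queue

class TrtContextPool:
    # Illustrative stand-in for the wrapper described above, not the code in
    # cosyvoice/utils/common.py: hold trt_concurrent execution contexts in a
    # thread-safe queue and hand them out one per in-flight request.
    def __init__(self, make_context, trt_concurrent=1):
        self.contexts = queue.Queue()
        for _ in range(trt_concurrent):
            self.contexts.put(make_context())  # e.g. engine.create_execution_context()

    def acquire(self):
        return self.contexts.get()  # blocks while all contexts are in use

    def release(self, ctx):
        self.contexts.put(ctx)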

@pr3mar
Author

pr3mar commented Jun 17, 2025

@aluminumbox great! When are you pushing the fix?
As it stands right now, it doesn't work without the proposed fix

@aluminumbox
Collaborator

> @aluminumbox great! When are you pushing the fix? As it stands right now, it doesn't work without the proposed fix

It is already in our main branch; check https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/utils/common.py#L171.
Now we only use this trt_context in flow_matching.py during TensorRT inference.
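
In other words, the flow-matching code is meant to check a context out of the wrapper only for the duration of the TensorRT call, along these lines (a sketch building on the pool idea above; run_trt_inference is a hypothetical helper, not the code in flow_matching.py):

def estimate_with_trt(pool, feeds):
    # Sketch only; the real logic lives in cosyvoice/flow/flow_matching.py.
    ctx = pool.acquire()
    try:
        return run_trt_inference(ctx, feeds)  # hypothetical TensorRT call
    finally:
        pool.release(ctx)  # return the context so other requests can proceed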

@pr3mar
Author

pr3mar commented Jun 17, 2025

@aluminumbox I see it now that you've mentioned it, but I think it also needs some work.

As I said, neither the HTTP server nor the gRPC server works without the suggested fix, which changes how the TRT lock is acquired and then released.

@aluminumbox
Collaborator

> @aluminumbox I see it now that you've mentioned it, but I think it also needs some work.
>
> As I said, neither the HTTP server nor the gRPC server works without the suggested fix, which changes how the TRT lock is acquired and then released.

Check https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/flow/flow_matching.py#L129; it should work under concurrency. You need to set a higher trt_concurrent when using vLLM.
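
For example (hedged: this assumes the CosyVoice2 constructor exposes the trt_concurrent argument being discussed here):

from cosyvoice.cli.cosyvoice import CosyVoice2

# Assumed usage based on this thread: raise trt_concurrent so that several
# requests can run TensorRT inference at the same time. Whether the argument
# is actually wired through like this is exactly what is questioned below.
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_trt=True, trt_concurrent=4)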

@pr3mar
Author

pr3mar commented Jun 17, 2025

It doesn't work because trt_concurrent isn't even propagated properly; see:
https://github.com/FunAudioLLM/CosyVoice/pull/1371/files#diff-6091622429a7f519ed255b9a7061725feaf46abb6b442979e486e376debb48b3R284
and
https://github.com/FunAudioLLM/CosyVoice/pull/1371/files#diff-6091622429a7f519ed255b9a7061725feaf46abb6b442979e486e376debb48b3R294

I was not using vLLM in this particular case, and I used CosyVoice2, not CosyVoice.

@aluminumbox
Collaborator

> trt_concurrent

Check https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/cli/model.py#L94; we pass trt_concurrent when initializing the trt wrapper.

@pr3mar
Author

pr3mar commented Jun 17, 2025

@aluminumbox can you please run the server and try to make 2 consecutive GET requests to the inference_zero_shot endpoint?

I was not able to get a response to the second request; the server just never responded after 5+ minutes of waiting. I was running this on a 4090, on which the first request took about 1 second.

I ran it with:

conda activate cosyvoice
cd runtime/python/fastapi
python server.py

Then I sent a GET request via Postman.
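
For reference, the same two-request reproduction without Postman looks roughly like this (the form field names and port are assumptions based on the example client in runtime/python/fastapi/; adjust as needed):

import requests

# Send the same zero-shot request twice in a row; the report above is that the
# second call never returns because the first stream never releases the TRT
# context. Field names and port are assumptions, not the server's confirmed
# API; check runtime/python/fastapi/client.py.
url = 'http://127.0.0.1:50000/inference_zero_shot'
for i in range(2):
    with open('prompt.wav', 'rb') as prompt_wav:
        data = {'tts_text': 'Hello there.', 'prompt_text': 'Prompt transcript.'}
        with requests.get(url, data=data, files={'prompt_wav': prompt_wav}, stream=True) as resp:
            audio = b''.join(resp.iter_content(chunk_size=16000))
            print(f'request {i + 1}: received {len(audio)} bytes')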

@aluminumbox
Collaborator

> @aluminumbox can you please run the server and try to make 2 consecutive GET requests to the inference_zero_shot endpoint?
>
> I was not able to get a response to the second request; the server just never responded after 5+ minutes of waiting. I was running this on a 4090, on which the first request took about 1 second.
>
> I ran it with:
>
> conda activate cosyvoice
> cd runtime/python/fastapi
> python server.py
>
> Then I sent a GET request via Postman.

We are sure that gRPC can do concurrent inference. The FastAPI runtime is contributed by the community; I do not know whether it supports streaming concurrency. You can try the gRPC example.
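
For anyone following along, the gRPC example referred to here sits next to the FastAPI one and can be tried the same way (a sketch; check runtime/python/grpc/ for the exact server and client arguments):

conda activate cosyvoice
cd runtime/python/grpc
python server.py

Then, in a second shell, run the example client twice in a row to check concurrency.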
