Is your feature request related to a problem? Please describe.
On a machine with an NVIDIA A10 (24 GB) GPU, with 8 processes started to handle concurrent requests, GPU utilization reaches full capacity at just 2 concurrent requests, and request latency spikes to around 10 seconds.
Describe the solution you'd like
Running the processes under MPS (Multi-Process Service), managed through nvidia-cuda-mps-control, keeps the performance bottleneck caused by increased concurrency within a controllable range. In a test with up to 8 concurrent requests, latency increased by only about 100 ms for each additional concurrent request.
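For illustration, here is a minimal sketch of how the worker processes might be started under MPS. It is not the project's implementation. The directory paths, the helper names (`mps_env`, `start_mps`, `stop_mps`, `launch_workers`), and the worker command are all assumptions made for this example. The only parts taken from NVIDIA's tooling are `nvidia-cuda-mps-control -d`, the `quit` command, and the `CUDA_MPS_PIPE_DIRECTORY` / `CUDA_MPS_LOG_DIRECTORY` environment variables.

```python
import os
import subprocess


def mps_env(pipe_dir="/tmp/nvidia-mps", log_dir="/tmp/nvidia-mps-log", base=None):
    """Return an environment that points CUDA clients at one MPS daemon.

    The directory paths are assumptions for this sketch; the daemon and
    every worker must see the same CUDA_MPS_PIPE_DIRECTORY.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    env["CUDA_MPS_LOG_DIRECTORY"] = log_dir
    return env


def start_mps(env):
    """Start the MPS control daemon in the background (-d)."""
    subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)


def stop_mps(env):
    """Ask the control daemon to shut down by sending it the `quit` command."""
    subprocess.run(
        ["nvidia-cuda-mps-control"],
        input="quit\n",
        text=True,
        env=env,
        check=True,
    )


def launch_workers(worker_cmd, count, env):
    """Start `count` worker processes that share the GPU through MPS.

    `worker_cmd` is a hypothetical command list, e.g. ["python", "worker.py"].
    """
    return [subprocess.Popen(worker_cmd, env=env) for _ in range(count)]
```

A typical sequence would be `env = mps_env()`, then `start_mps(env)`, then `launch_workers(["python", "worker.py"], 8, env)`, and finally `stop_mps(env)` once the workers have exited. Whether this reproduces the latency figures above depends on the workload and driver setup.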