The GPU utilization reaches full capacity with 2 concurrent requests #623

**Is your feature request related to a problem? Please describe.**
On a machine with an NVIDIA A10 GPU (24 GB), 8 processes are started to handle concurrent requests. GPU utilization already reaches full capacity at 2 concurrent requests, causing request latency to spike to around 10 seconds.

**Describe the solution you'd like**
By using `nvidia-cuda-mps-control` to run the processes' GPU work through MPS (Multi-Process Service), the performance bottleneck caused by increased concurrency can be kept within a controllable range. With 8 concurrent requests, latency increases by approximately 100 ms for each additional concurrent request.
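
For illustration, here is a minimal sketch of how the worker processes might be launched under MPS. It assumes `nvidia-cuda-mps-control` is on `PATH` and that the workers are started from Python. The directory paths, the helper names (`mps_env`, `start_mps`, `stop_mps`, `launch_workers`), and the `worker.py` command are hypothetical and not taken from this project.

```python
import os
import subprocess

MPS_CONTROL = "nvidia-cuda-mps-control"  # NVIDIA's MPS control CLI


def mps_env(pipe_dir="/tmp/nvidia-mps", log_dir="/tmp/nvidia-mps-log", base_env=None):
    """Return an environment pointing CUDA clients at one MPS daemon.

    The directory values are assumptions for this sketch; the daemon and
    every worker must share the same CUDA_MPS_PIPE_DIRECTORY.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    env["CUDA_MPS_LOG_DIRECTORY"] = log_dir
    return env


def start_mps(env):
    """Start the MPS control daemon in the background (-d)."""
    os.makedirs(env["CUDA_MPS_PIPE_DIRECTORY"], exist_ok=True)
    os.makedirs(env["CUDA_MPS_LOG_DIRECTORY"], exist_ok=True)
    subprocess.run([MPS_CONTROL, "-d"], env=env, check=True)


def stop_mps(env):
    """Shut the daemon down by sending 'quit' on the control CLI's stdin."""
    subprocess.run([MPS_CONTROL], input="quit\n", text=True, env=env, check=True)


def launch_workers(worker_cmd, count, env):
    """Start `count` worker processes that inherit the MPS environment."""
    return [subprocess.Popen(worker_cmd, env=env) for _ in range(count)]


def main():
    env = mps_env()
    start_mps(env)
    try:
        # Hypothetical worker command; replace with the real server process.
        workers = launch_workers(["python", "worker.py"], 8, env)
        for worker in workers:
            worker.wait()
    finally:
        stop_mps(env)
```

Calling `main()` on a GPU host would start the daemon, run the 8 workers as MPS clients, and stop the daemon once they exit. Whether this reproduces the latency figures above depends on the model and workload.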

