We observe poor latency when running concurrent `gdr_copy_to_mapping` calls against a single GPU, and want to understand the cause (is this a known limitation?) before digging in further.
- Environment: x86 8163, 2 sockets (AVX2 supported), NVIDIA T4 x2, PCIe 3.0 x16, CUDA driver 450.82, latest gdrcopy (2022-10)
- Test: 2 processes (pinned to different cores) concurrently running test/copylat against GPU0; each process allocates its own host and device memory, of course.
- Result: e.g. at 32KB, each process averages 6.2usec per `gdr_copy_to_mapping`, vs. 3.2usec when run as a single process. We see a similar problem at other block sizes (2KB ~ 256KB; I focused only on small blocks).
By the way, if the 2 processes target different GPUs, performance is fine.
Question 1: what is the major cause of such heavy contention / performance degradation with concurrent `gdr_copy_to_mapping`? Since 32KB is not large, I don't think PCIe bandwidth is saturated.
Question 2: is there any plan, or is it possible, to optimize concurrent `gdr_copy_to_mapping`?
Thanks for any feedback.