We observe poor latency when running concurrent `gdr_copy_to_mapping` calls against a single GPU, and want to understand the cause (is this a known limitation?) before digging in further.
- Environment: x86 8163, 2 sockets (AVX2 supported), NVIDIA T4 x2, PCIe 3.0 x16, CUDA driver 450.82, latest gdrcopy (2022-10)
- Test: 2 processes (pinned to different cores) concurrently running test/copylat against GPU0; each process allocates its own host and device memory, of course.
- Result: e.g. at 32KB, each process averages 6.2usec per `gdr_copy_to_mapping`, vs. 3.2usec when run as a single process. We see a similar problem at other block sizes (2KB ~ 256KB; I focused only on small blocks).
By the way, if the 2 processes target different GPUs, performance is fine.
Question 1: what is the major cause of such heavy contention / performance degradation with concurrent `gdr_copy_to_mapping`? Since 32KB is not large, I don't think PCIe bandwidth is saturated.
Question 2: is there any plan, or is it possible, to optimize concurrent `gdr_copy_to_mapping`?
Thanks for any feedback.