Question on bad perf when concurrent copy on single GPU #237

@Zhaojp-Frank

Description

We observe bad latency when running concurrent gdr_copy_to_mapping calls on a single GPU, and want to understand the cause (is it a known limitation?) before we dive in.

  • env: x86 8163, 2-socket (AVX2 supported), 2x NVIDIA T4, PCIe 3.0 x16, CUDA driver 450.82, latest gdrcopy (2022.10)
  • Tests: 2 processes (bound to different cores) concurrently running test/copylat against GPU0; each process of course allocates its own host and device memory.
  • Result: e.g. at 32KB, each process averages 6.2 usec per gdr_copy_to_mapping, vs. 3.2 usec with a single process. Similar degradation at other block sizes (2KB ~ 256KB; I focus only on small blocks).
    Btw, if the 2 processes target different GPUs, performance is fine.

Question 1: what is the major cause of such heavy contention / performance degradation with concurrent gdr_copy_to_mapping? Since 32KB is not large, I don't think PCIe bandwidth is saturated.
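A quick sanity check on the bandwidth assumption, using only the latencies reported above. This is a simplification: it assumes each copy's latency is dominated by the data transfer itself, with no fixed per-call overhead.

```python
# Back-of-envelope throughput implied by the reported copylat numbers.
size = 32 * 1024  # 32KB block, in bytes

single = size / 3.2e-6        # one process: 3.2 usec per copy
per_proc = size / 6.2e-6      # each of two concurrent processes: 6.2 usec
aggregate = 2 * per_proc

print(f"single process:        {single / 1e9:.2f} GB/s")
print(f"two-process aggregate: {aggregate / 1e9:.2f} GB/s")
# single process:        10.24 GB/s
# two-process aggregate: 10.57 GB/s
```

The aggregate throughput of the two concurrent processes is roughly the same as the single-process figure, which would be consistent with the two processes sharing one bottleneck (e.g. the CPU-to-BAR write path) rather than with per-process software overhead; whether that bottleneck is the PCIe link itself is exactly the open question.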

Question 2: is there any plan, or any possibility, to optimize concurrent gdr_copy_to_mapping?

Thanks for any feedback.
