GPU Bandwidth #18

Closed
TheFl0w opened this issue May 10, 2017 · 13 comments

TheFl0w commented May 10, 2017

When we ran the memory bandwidth test on your NVIDIA TITAN Black at PSI, we got some unexpected results. If I remember correctly, we measured about 6000 MiB/s for data transfers between host and GPU. PCIe 3.0 should actually give us twice that bandwidth. I ran the same tests on the GPU I use at home (GTX 780) to find out whether consumer-grade GPUs are more limited when it comes to data transfer rates. It turned out that transfers on my card are as fast as those on the Tesla K80x cards we use in our HPC cluster. Can you post the results for your TITAN Black, please?

Here is the output of the bandwidth test:

Pinned (page-locked) memory

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 780
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12172.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12454.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     213145.1

Pageable memory

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 780
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6657.6

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6474.5

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     212451.2
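For reference, the numbers above come from NVIDIA's bandwidthTest sample. Below is a minimal sketch of the same host-to-device measurement done directly with the CUDA runtime; the file name, buffer size and repetition count are illustrative choices, not taken from the sample.

    // bandwidth_sketch.cu (hypothetical name) -- rough reproduction of the
    // host-to-device part of the bandwidthTest sample.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Time `reps` host-to-device copies and return the rate in MB/s
    // (bytes / 1e6 per second, matching how bandwidthTest reports it).
    static float measure_h2d(void* host, void* dev, size_t bytes, int reps) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return (bytes * (float)reps / 1e6f) / (ms / 1e3f);
    }

    int main() {
        const size_t bytes = 32u << 20;    // 33554432 bytes, as in the quick-mode run
        const int reps = 20;

        void* dev = nullptr;
        cudaMalloc(&dev, bytes);

        void* pageable = malloc(bytes);    // ordinary pageable allocation
        void* pinned = nullptr;
        cudaMallocHost(&pinned, bytes);    // page-locked (pinned) allocation

        printf("pageable H2D: %8.1f MB/s\n", measure_h2d(pageable, dev, bytes, reps));
        printf("pinned   H2D: %8.1f MB/s\n", measure_h2d(pinned, dev, bytes, reps));

        free(pageable);
        cudaFreeHost(pinned);
        cudaFree(dev);
        return 0;
    }

On a healthy PCIe 3.0 x16 link the pinned path should land near the ~12 GB/s shown above, while the pageable path is limited by the extra host-side staging copy, which is roughly what the numbers above show.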

mbrueckner-psi commented May 10, 2017 via email

TheFl0w commented May 10, 2017

According to /proc/cpuinfo, the GPUs on our HPC cluster are paired with 32 cores of type Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz.

The test with my GPU at home was done on an Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz.

I get the same results in both cases.

TheFl0w commented May 10, 2017

As far as I know, CPU clock frequency only matters for pageable memory anyway. Transfers from pinned memory are usually done with DMA, so the CPU is not involved in copying the data.
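A minimal sketch of that behaviour, assuming a page-locked buffer from cudaMallocHost, a stream, and an arbitrary bit of CPU busy-work (all of it illustrative):

    // overlap_sketch.cu (hypothetical name) -- a copy from pinned memory is
    // handled by the GPU's copy (DMA) engine, so the CPU is free to do other
    // work while the transfer is in flight.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 32u << 20;

        float* pinned = nullptr;
        cudaMallocHost((void**)&pinned, bytes);  // page-locked: eligible for async DMA
        float* dev = nullptr;
        cudaMalloc((void**)&dev, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Kick off the transfer; with pinned memory this call returns immediately.
        cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, stream);

        // The CPU keeps working while the copy engine moves the data.
        double acc = 0.0;
        for (int i = 0; i < 1000000; ++i) acc += i * 0.5;

        cudaStreamSynchronize(stream);           // wait for the copy to finish
        printf("copy done, CPU-side result: %f\n", acc);

        cudaStreamDestroy(stream);
        cudaFreeHost(pinned);
        cudaFree(dev);
        return 0;
    }

With a pageable source buffer the runtime first copies the data into an internal pinned staging area, which is where the CPU and its clock frequency come back into play.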

lopez-c commented May 10, 2017

Hi,
We need to find out why the transfers are so slow. We will keep you up to date.

mbrueckner-psi commented May 10, 2017 via email

TheFl0w commented May 10, 2017

@mbrueckner-psi
To be honest, I have no idea how to fix this. Off the top of my head, possible causes are:

  • a bug in the driver; make sure you are using the latest version
  • a firmware problem on the mainboard
  • a PSU power connector that is no longer working
  • a faulty PCIe slot; maybe try another slot
  • a faulty GPU

If you have physical access to the system, check the PSU connectors and reseat the GPU in its PCIe slot.

TheFl0w commented May 10, 2017

I would like to gather some additional information about your GPU. Can you run the program I attached and post the results?

benchmark.zip
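(Not the attached program, but for anyone who wants to reproduce this: a sketch of how the same fields can be read with the CUDA runtime API.)

    // device_info_sketch.cu (hypothetical name)
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int driver = 0, runtime = 0;
        cudaDriverGetVersion(&driver);
        cudaRuntimeGetVersion(&runtime);
        printf("CUDA Driver version: %d\nCUDA Runtime version: %d\n\n", driver, runtime);

        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, i);
            printf("%s\n", p.name);
            printf("  Compute capability: %d.%d\n", p.major, p.minor);
            printf("  Global memory: %.2f MiB\n", p.totalGlobalMem / (1024.0 * 1024.0));
            printf("  DMA engines: %d\n", p.asyncEngineCount);
            printf("  Multi processors: %d\n", p.multiProcessorCount);
            printf("  Warp size: %d\n", p.warpSize);
            printf("  Concurrent kernels: %d\n", p.concurrentKernels);
            printf("  Max grid size: %d, %d, %d\n",
                   p.maxGridSize[0], p.maxGridSize[1], p.maxGridSize[2]);
            printf("  Max block size: %d, %d, %d\n",
                   p.maxThreadsDim[0], p.maxThreadsDim[1], p.maxThreadsDim[2]);
            printf("  Max threads per block: %d\n", p.maxThreadsPerBlock);
        }
        return 0;
    }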

lopez-c commented May 10, 2017

Hi,
This is the result of the benchmark:

CUDA Driver version: 8000
CUDA Runtime version: 8000

Devices:
GeForce GTX TITAN Black
Compute capability: 3.5
Global memory: 6082.31 MiB
DMA engines: 1
Multi processors: 15
Warp size: 32
Max concurrent kernels: 1
Max grid size: 2147483647, 65535, 65535
Max block size: 1024, 1024, 64
Max threads per block: 1024

For some reason we are still trying to understand, it looks like the link between the CPU and the GPU is PCIe 2.0 instead of PCIe 3.0.

In theory both the GPU and the slot it is connected to are PCIe 3.0 capable, so yes, a bit strange.
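One way to confirm the negotiated link from software is NVML, which ships with the driver (nvidia-smi -q reports the same values in its PCIe section). A minimal sketch, with the device index hard-coded to 0 and error handling mostly omitted; build with -lnvidia-ml.

    // pcie_link_sketch.c (hypothetical name)
    #include <stdio.h>
    #include <nvml.h>

    int main(void) {
        if (nvmlInit() != NVML_SUCCESS) {
            fprintf(stderr, "failed to initialise NVML\n");
            return 1;
        }
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);          // first GPU

        unsigned int curGen = 0, curWidth = 0, maxGen = 0, maxWidth = 0;
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &curGen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &curWidth);
        nvmlDeviceGetMaxPcieLinkGeneration(dev, &maxGen);
        nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);

        printf("current link: PCIe gen %u x%u\n", curGen, curWidth);
        printf("maximum link: PCIe gen %u x%u\n", maxGen, maxWidth);

        nvmlShutdown();
        return 0;
    }

One caveat: many GPUs drop the link to a lower generation when idle to save power, so the current values should be read while a transfer or kernel is running; the maximum values reflect what the device and the system it sits in can support.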

mbrueckner-psi commented May 10, 2017 via email

TheFl0w commented May 11, 2017

Okay, I have tried to find more possible explanations for our GPU bandwidth problem. A processor supports a certain number of PCIe lanes; in your case the maximum should be 40. However, different motherboard chipsets support different numbers of PCIe lanes, and sometimes the number of lanes available per PCIe slot depends on how many devices are connected. For example, if devices are plugged into slots 1 and 3, only 8 lanes each might be available. I can look into this, but I need to know which motherboard is used, which PCIe slots are occupied, and how many lanes are (theoretically) taken by those devices.

TL;DR: The maximum number of PCIe lanes could be the issue. For now I would like to know the model of the motherboard.
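On Linux the negotiated link can also be read straight from sysfs. A sketch, assuming the first CUDA device and the standard PCI attributes current_link_speed, current_link_width, max_link_speed and max_link_width:

    // sysfs_link_sketch.cu (hypothetical name) -- look up the GPU's PCI address
    // with the CUDA runtime and read the negotiated link parameters from sysfs.
    #include <stdio.h>
    #include <ctype.h>
    #include <string.h>
    #include <cuda_runtime.h>

    static void print_attr(const char* pci, const char* attr) {
        char path[256], value[64] = {0};
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/%s", pci, attr);
        FILE* f = fopen(path, "r");
        if (f && fgets(value, sizeof(value), f)) {
            value[strcspn(value, "\n")] = '\0';
            printf("%-20s %s\n", attr, value);
        }
        if (f) fclose(f);
    }

    int main(void) {
        char pci[32] = {0};
        if (cudaDeviceGetPCIBusId(pci, (int)sizeof(pci), 0) != cudaSuccess) {
            fprintf(stderr, "no CUDA device found\n");
            return 1;
        }
        for (char* c = pci; *c; ++c) *c = (char)tolower(*c);  // sysfs uses lowercase hex

        printf("GPU at %s\n", pci);
        print_attr(pci, "current_link_speed");   // "5 GT/s" would mean PCIe 2.0
        print_attr(pci, "current_link_width");   // negotiated lane count
        print_attr(pci, "max_link_speed");
        print_attr(pci, "max_link_width");
        return 0;
    }

"8 GT/s" corresponds to PCIe 3.0 and "5 GT/s" to PCIe 2.0; if current_link_width comes back as 8 rather than 16, the lane-sharing explanation above would fit.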

mbrueckner-psi commented May 11, 2017 via email

TheFl0w commented May 11, 2017

This is what the manual says about the expansion slots.

Slot #  Technology  Bus Width  Connector Width  Bus Number  Form Factor           Notes
9       PCIe 3.0    x4         x8               32          Full Length / Height  For processor 2
8       PCIe 3.0    x16        x16              32          Full Length / Height  For processor 2
7       PCIe 3.0    x4         x8               32          Full Length / Height  For processor 2
6       PCIe 3.0    x16        x16              32          Full Length / Height  For processor 2
5       PCIe 2.0    x4         x8               0           Full Length / Height  For processor 2
4       PCIe 3.0    x4         x8               0           Full Length / Height  For processor 1
3       PCIe 3.0    x16        x16              0           Full Length / Height  For processor 1
2       PCIe 3.0    x4         x8               0           Full Length / Height  For processor 1
1       PCIe 3.0    x8         x16              0           Full Length / Height  For processor 1

dmidecode reported:

Designation: PCI-E Slot 8 
Type: x16 PCI Express 3

Expansion slot 1 has a connector width of x16 but only supports x8 electrically. Please make sure the card is not plugged into slot 1; if it is, consider moving it to slot 3 instead.

TheFl0w commented May 11, 2017

If Linux labels the PCIe slots correctly, I am out of ideas for now. I will ask around at work tomorrow; maybe this is a common problem.
