Error: failed to retrieve CPU affinity #65

Closed
joelacrisp opened this issue Apr 5, 2016 · 12 comments

@joelacrisp

Host: AWS g2.8xlarge
This works on a g2.2xlarge.
It fails on a g2.8xlarge using the version built from master/HEAD.

ubuntu@ip-172-31-67-39:~/nvidia-docker$ sudo ./tools/bin/nvidia-docker-plugin
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Loading NVIDIA management library
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Loading NVIDIA unified memory
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Discovering GPU devices

0 [65535 0 0 0] [4294967295 0 0 0] <-- Extra debug from me, dump of the CPU and GPU masks
1 [4294901760 0 0 0] [4294967295 0 0 0]
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Error: failed to retrieve CPU affinity
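
For context, the mask words above decode to CPU index ranges as follows: 65535 (0xFFFF) is CPUs 0-15, 4294901760 (0xFFFF0000) is CPUs 16-31, and 4294967295 (0xFFFFFFFF) is CPUs 0-31. Below is a minimal Go sketch (illustrative only, not the plugin's code) that decodes such a vector of 64-bit mask words:

package main

import "fmt"

// cpusFromMask turns an array of 64-bit affinity words into a list of CPU indices.
func cpusFromMask(words []uint64) []int {
	var cpus []int
	for w, word := range words {
		for bit := 0; bit < 64; bit++ {
			if word&(1<<uint(bit)) != 0 {
				cpus = append(cpus, w*64+bit)
			}
		}
	}
	return cpus
}

func main() {
	fmt.Println(cpusFromMask([]uint64{65535, 0, 0, 0}))      // CPUs 0-15
	fmt.Println(cpusFromMask([]uint64{4294901760, 0, 0, 0})) // CPUs 16-31
	fmt.Println(cpusFromMask([]uint64{4294967295, 0, 0, 0})) // CPUs 0-31
}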

root@ip-172-31-67-39:~/nvidia-docker# for x in /proc/driver/nvidia/gpus/*/information; do cat $x; done
Model: GRID K520
IRQ: 261
GPU UUID: GPU-2ce780af-c15a-a966-f0b8-98c9320b34a2
Video BIOS: 80.04.d4.00.03
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:03.0
Model: GRID K520
IRQ: 262
GPU UUID: GPU-0bc9a139-4747-41a7-c24a-7c871dc4809e
Video BIOS: 80.04.d4.00.04
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:04.0
Model: GRID K520
IRQ: 263
GPU UUID: GPU-6ca92b1b-6acf-cdbe-e61f-25387f27518d
Video BIOS: 80.04.d4.00.03
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:05.0
Model: GRID K520
IRQ: 264
GPU UUID: GPU-3ca9085f-66a6-30c2-d544-907e00dde280
Video BIOS: 80.04.d4.00.04
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:06.0

root@ip-172-31-67-39:~/nvidia-docker# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 CPU Affinity
GPU0 X PHB PHB PHB 0-31
GPU1 PHB X PHB PHB 0-31
GPU2 PHB PHB X PHB 0-31
GPU3 PHB PHB PHB X 0-31

Legend:

X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch

@joelacrisp
Author

Editing numa.go to remove this error return and always return zero seems to be a workaround, although it clearly isn't a correct fix.

@3XX0
Member

3XX0 commented Apr 5, 2016

Cross-socket affinity for all 4 GPUs looks suspicious.
I suspect PCI passthrough doesn't play nice here. I'll look into it.

@joelacrisp
Author

Thanks. Let me know if you need any more info

@3XX0 3XX0 added the bug label Apr 5, 2016
@3XX0
Member

3XX0 commented Apr 9, 2016

Finally had time to look at it closely.
Xen is definitely at fault: hwloc and sysfs can't figure out the CPU affinity either.
I'll probably change the code to rely solely on sysfs and default the NUMA node to 0 if we can't figure it out.
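
For illustration, a hedged sketch of that approach (not the actual nvidia-docker code): read the device's numa_node from sysfs and fall back to node 0 when the file is missing or reports -1. The numaNode helper and its busID parameter are made up for this example; the path format matches the bus locations shown in this thread.

package main

import (
	"fmt"
	"io/ioutil"
	"strconv"
	"strings"
)

// numaNode reads /sys/bus/pci/devices/<busID>/numa_node and defaults to 0
// when the value is unavailable (hypothetical helper, for illustration only).
func numaNode(busID string) int {
	b, err := ioutil.ReadFile(fmt.Sprintf("/sys/bus/pci/devices/%s/numa_node", busID))
	if err != nil {
		return 0 // no numa_node file (e.g. under some hypervisors): assume node 0
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(b)))
	if err != nil || n < 0 {
		return 0 // -1 means the kernel has no NUMA information for this device
	}
	return n
}

func main() {
	fmt.Println(numaNode("0000:00:03.0"))
}

Defaulting to node 0 would keep the plugin usable on virtualized hosts (such as Xen on EC2) where no NUMA topology is exposed for the device.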

@jsmith50500

Hi there,
I got this CPU affinity error message too; thanks for any help you can provide.
Host: CentOS 7.1
NVIDIA driver 361.42
Docker 1.10
Small development box with a GeForce GTX 750 Ti card, which works with the latest Linux drivers.
Should the nvidia-docker plugin work with non-Quadro cards too? My information is below.

I ran:
[root@bccentos7 bin]# sudo -b nohup nvidia-docker-plugin > /tmp/nvidia-docker.log

Log output:
cat nvidia-docker.log
nvidia-docker-plugin | 2016/04/12 14:51:50 Loading NVIDIA management library
nvidia-docker-plugin | 2016/04/12 14:51:50 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/04/12 14:51:50 Discovering GPU devices
nvidia-docker-plugin | 2016/04/12 14:51:51 Error: failed to retrieve CPU affinity

[root@bccentos7 tmp]# for x in /proc/driver/nvidia/gpus/*/information; do cat $x; done
Model: GeForce GTX 750 Ti
IRQ: 102
GPU UUID: GPU-4 95xxxx-9534- e8b-a6-0xxxxx428v
Video BIOS: 82.07.55.00.35
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:07:00.0
Device Minor: 0

[root@bccentos7 tmp]# nvidia-smi topo -m
GPU0 CPU Affinity
GPU0 X 0-23

Legend:

X = Self
SOC = PCI path traverses a socket-level link (e.g. QPI)
PHB = PCI path traverses a host bridge
PXB = PCI path traverses multiple internal switches
PIX = PCI path traverses an internal switch
NV# = Path traverses # NVLinks


[root@bccentos7 tmp]# nvidia-smi

+------------------------------------------------------+
| NVIDIA-SMI 361.42 Driver Version: 361.42 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 750 Ti Off | 0000:07:00.0 On | N/A |
| 40% 36C P8 1W / 38W | 97MiB / 2047MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1485 G /usr/bin/Xorg 48MiB |
| 0 2803 G gnome-shell 38MiB |
+-----------------------------------------------------------------------------+

@3XX0 3XX0 closed this as completed in 641a8bc Apr 13, 2016
@jsmith50500

Hi there,
I think the CPU affinity issue is still there. Can you help? Thanks.
I downloaded the nvidia-docker master branch after the update. The nvml.go file was time-stamped April 12th, 16:51, so I believe it had your fix for the CPU affinity issue (#65). I ran a build, then started nvidia-docker-plugin.

[root@bccentos7 nvidia-docker-master]# make
make -C /tmp/nvidia-docker-master/nvidia-docker-master/tools
make[1]: Entering directory `/tmp/nvidia-docker-master/nvidia-docker-master/tools'
Sending build context to Docker daemon 109.6 kB
Step 1 : FROM golang
.
/tmp/nvidia-docker-master/nvidia-docker-master/tools/bin

Info here:


[root@bccentos7 bin]# sudo -b nohup nvidia-docker-plugin > /tmp/nvidia-docker.log
[root@bccentos7 bin]# nohup: ignoring input and redirecting stderr to stdout

[root@bccentos7 tmp]# cat nvidia-docker.log
nvidia-docker-plugin | 2016/04/13 10:33:05 Loading NVIDIA management library
nvidia-docker-plugin | 2016/04/13 10:33:05 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/04/13 10:33:05 Discovering GPU devices
nvidia-docker-plugin | 2016/04/13 10:33:05 Error: failed to retrieve CPU affinity

More info here:


[root@bccentos7 nvml]# ls -l
total 20
-rw-rw-r--. 1 1181 Apr 12 16:51 nvml_dl.c
-rw-rw-r--. 1 397 Apr 12 16:51 nvml_dl.h
-rw-rw-r--. 1 9006 Apr 12 16:51 nvml.go
[root@bccentos7 nvml]# for x in /proc/driver/nvidia/gpus/*/information; do cat $x; done
Model: GeForce GTX 750 Ti
IRQ: 102
GPU UUID: GPU-4 95xxxx-9534- e8b-a6-0xxxxx428v
Video BIOS: 82.07.55.00.35
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:07:00.0
Device Minor: 0

[root@bccentos7 nvml]# nvidia-smi topo -m
GPU0 CPU Affinity
GPU0 X 0-23

Legend:

X = Self
SOC = PCI path traverses a socket-level link (e.g. QPI)
PHB = PCI path traverses a host bridge
PXB = PCI path traverses multiple internal switches
PIX = PCI path traverses an internal switch
NV# = Path traverses # NVLinks

[root@bccentos7 nvml]# nvidia-smi
Wed Apr 13 11:02:45 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.42 Driver Version: 361.42 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 750 Ti Off | 0000:07:00.0 On | N/A |
| 40% 33C P8 1W / 38W | 96MiB / 2047MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1820 G /usr/bin/Xorg 48MiB |
| 0 2804 G gnome-shell 38MiB |
+-----------------------------------------------------------------------------+

@3XX0
Member

3XX0 commented Apr 13, 2016

What's the output of:

cat /sys/bus/pci/devices/0000:07:00.0/numa_node

@jsmith50500

Hi @3XX0, thanks for replying.

I ran the command and the output was -1:

[root@bccentos7 deviceQuery]# cat /sys/bus/pci/devices/0000:07:00.0/numa_node
-1

@3XX0
Member

3XX0 commented Apr 15, 2016

That's weird; that's exactly what the code does.
What about the permissions on this file?

@jsmith50500

@3XX0
Thanks for getting back again. The permissions in /sys/bus/pci/devices/0000:07:00.0 are:

-r--r--r--. 1 root root 4096 Apr 15 10:10 numa_node

@3XX0
Member

3XX0 commented Apr 15, 2016

I just added better error reporting; the message should be more explicit now.

@cnd

cnd commented Sep 12, 2016

@3XX0 I've got no numa_node file at all. What is it, even?
