Error: failed to retrieve CPU affinity #65

Closed
joelacrisp opened this issue Apr 5, 2016 · 12 comments

@joelacrisp

Host: AWS g2.8xlarge
This works on a g2.2xlarge.
It fails on a g2.8xlarge using the version built from master/HEAD.

ubuntu@ip-172-31-67-39:~/nvidia-docker$ sudo ./tools/bin/nvidia-docker-plugin
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Loading NVIDIA management library
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Loading NVIDIA unified memory
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Discovering GPU devices

0 [65535 0 0 0] [4294967295 0 0 0] <-- Extra debug from me, dump of the CPU and GPU masks
1 [4294901760 0 0 0] [4294967295 0 0 0]
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Error: failed to retrieve CPU affinity
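
For context, the mask words above decode to CPU index ranges as follows: 65535 (0xFFFF) is CPUs 0-15, 4294901760 (0xFFFF0000) is CPUs 16-31, and 4294967295 (0xFFFFFFFF) is CPUs 0-31. Below is a minimal Go sketch (illustrative only, not the plugin's code) that decodes such a vector of 64-bit mask words:

package main

import "fmt"

// cpusFromMask turns an array of 64-bit affinity words into a list of CPU indices.
func cpusFromMask(words []uint64) []int {
	var cpus []int
	for w, word := range words {
		for bit := 0; bit < 64; bit++ {
			if word&(1<<uint(bit)) != 0 {
				cpus = append(cpus, w*64+bit)
			}
		}
	}
	return cpus
}

func main() {
	fmt.Println(cpusFromMask([]uint64{65535, 0, 0, 0}))      // CPUs 0-15
	fmt.Println(cpusFromMask([]uint64{4294901760, 0, 0, 0})) // CPUs 16-31
	fmt.Println(cpusFromMask([]uint64{4294967295, 0, 0, 0})) // CPUs 0-31
}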

root@ip-172-31-67-39:~/nvidia-docker# for x in /proc/driver/nvidia/gpus/*/information; do cat $x; done
Model: GRID K520
IRQ: 261
GPU UUID: GPU-2ce780af-c15a-a966-f0b8-98c9320b34a2
Video BIOS: 80.04.d4.00.03
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:03.0
Model: GRID K520
IRQ: 262
GPU UUID: GPU-0bc9a139-4747-41a7-c24a-7c871dc4809e
Video BIOS: 80.04.d4.00.04
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:04.0
Model: GRID K520
IRQ: 263
GPU UUID: GPU-6ca92b1b-6acf-cdbe-e61f-25387f27518d
Video BIOS: 80.04.d4.00.03
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:05.0
Model: GRID K520
IRQ: 264
GPU UUID: GPU-3ca9085f-66a6-30c2-d544-907e00dde280
Video BIOS: 80.04.d4.00.04
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:06.0

root@ip-172-31-67-39:~/nvidia-docker# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 CPU Affinity
GPU0 X PHB PHB PHB 0-31
GPU1 PHB X PHB PHB 0-31
GPU2 PHB PHB X PHB 0-31
GPU3 PHB PHB PHB X 0-31

Legend:

X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch

@joelacrisp
Author

Editing numa.go to remove this error return and always return zero seems to be a workaround, although it clearly isn't a correct fix.

@3XX0
Member

3XX0 commented Apr 5, 2016

Cross-socket affinity for all 4 GPUs looks suspicious.
I suspect PCI passthrough doesn't play nice here. I'll look into it.

@joelacrisp
Author

Thanks. Let me know if you need any more info

@3XX0 3XX0 added the bug label Apr 5, 2016
@3XX0
Member

3XX0 commented Apr 9, 2016

Finally had time to look at it closely.
Xen is definitely at fault: hwloc and sysfs can't figure out the CPU affinity either.
I'll probably change the code to rely solely on sysfs and default the NUMA node to 0 if we can't figure it out.
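
For illustration, a hedged sketch of that approach (not the actual nvidia-docker code): read the device's numa_node from sysfs and fall back to node 0 when the file is missing or reports -1. The numaNode helper and its busID parameter are made up for this example; the path format matches the bus locations shown in this thread.

package main

import (
	"fmt"
	"io/ioutil"
	"strconv"
	"strings"
)

// numaNode reads /sys/bus/pci/devices/<busID>/numa_node and defaults to 0
// when the value is unavailable (hypothetical helper, for illustration only).
func numaNode(busID string) int {
	b, err := ioutil.ReadFile(fmt.Sprintf("/sys/bus/pci/devices/%s/numa_node", busID))
	if err != nil {
		return 0 // no numa_node file (e.g. under some hypervisors): assume node 0
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(b)))
	if err != nil || n < 0 {
		return 0 // -1 means the kernel has no NUMA information for this device
	}
	return n
}

func main() {
	fmt.Println(numaNode("0000:00:03.0"))
}

Defaulting to node 0 would keep the plugin usable on virtualized hosts (such as Xen on EC2) where no NUMA topology is exposed for the device.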

@jsmith50500

Hi there,
I got this CPU affinity error message too; thanks for any help you can provide.
Host: CentOS 7.1
NVIDIA driver 361.42
Docker 1.10
Small development box with a GeForce GTX 750 Ti card, which works with the latest Linux drivers.
Should the nvidia-docker plugin work with non-Quadro cards too? My information is below.

I ran:
[root@bccentos7 bin]# sudo -b nohup nvidia-docker-plugin > /tmp/nvidia-docker.log

Log output:
cat nvidia-docker.log
nvidia-docker-plugin | 2016/04/12 14:51:50 Loading NVIDIA management library
nvidia-docker-plugin | 2016/04/12 14:51:50 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/04/12 14:51:50 Discovering GPU devices
nvidia-docker-plugin | 2016/04/12 14:51:51 Error: failed to retrieve CPU affinity

[root@bccentos7 tmp]# for x in /proc/driver/nvidia/gpus/*/information; do cat $x; done
Model: GeForce GTX 750 Ti
IRQ: 102
GPU UUID: GPU-4 95xxxx-9534- e8b-a6-0xxxxx428v
Video BIOS: 82.07.55.00.35
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:07:00.0
Device Minor: 0

[root@bccentos7 tmp]# nvidia-smi topo -m
GPU0 CPU Affinity
GPU0 X 0-23

Legend:

X = Self
SOC = PCI path traverses a socket-level link (e.g. QPI)
PHB = PCI path traverses a host bridge
PXB = PCI path traverses multiple internal switches
PIX = PCI path traverses an internal switch
NV# = Path traverses # NVLinks


[root@bccentos7 tmp]# nvidia-smi

+------------------------------------------------------+
| NVIDIA-SMI 361.42 Driver Version: 361.42 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 750 Ti Off | 0000:07:00.0 On | N/A |
| 40% 36C P8 1W / 38W | 97MiB / 2047MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1485 G /usr/bin/Xorg 48MiB |
| 0 2803 G gnome-shell 38MiB |
+-----------------------------------------------------------------------------+

@3XX0 3XX0 closed this as completed in 641a8bc Apr 13, 2016
@jsmith50500

Hi there,
I think the CPU affinity issue is still there. Can you help? Thanks.
I downloaded the nvidia-docker master branch after the update. The nvml.go file was time-stamped April 12th, 16:51, so I believe it had your fix for the CPU affinity issue (#65). I ran a build, then started nvidia-docker-plugin.

[root@bccentos7 nvidia-docker-master]# make
make -C /tmp/nvidia-docker-master/nvidia-docker-master/tools
make[1]: Entering directory `/tmp/nvidia-docker-master/nvidia-docker-master/tools'
Sending build context to Docker daemon 109.6 kB
Step 1 : FROM golang
.
/tmp/nvidia-docker-master/nvidia-docker-master/tools/bin

Info here:


[root@bccentos7 bin]# sudo -b nohup nvidia-docker-plugin > /tmp/nvidia-docker.log
[root@bccentos7 bin]# nohup: ignoring input and redirecting stderr to stdout

[root@bccentos7 tmp]# cat nvidia-docker.log
nvidia-docker-plugin | 2016/04/13 10:33:05 Loading NVIDIA management library
nvidia-docker-plugin | 2016/04/13 10:33:05 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/04/13 10:33:05 Discovering GPU devices
nvidia-docker-plugin | 2016/04/13 10:33:05 Error: failed to retrieve CPU affinity

More info here:


[root@bccentos7 nvml]# ls -l
total 20
-rw-rw-r--. 1 1181 Apr 12 16:51 nvml_dl.c
-rw-rw-r--. 1 397 Apr 12 16:51 nvml_dl.h
-rw-rw-r--. 1 9006 Apr 12 16:51 nvml.go
[root@bccentos7 nvml]# for x in /proc/driver/nvidia/gpus/*/information; do cat $x; done
Model: GeForce GTX 750 Ti
IRQ: 102
GPU UUID: GPU-4 95xxxx-9534- e8b-a6-0xxxxx428v
Video BIOS: 82.07.55.00.35
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:07:00.0
Device Minor: 0

[root@bccentos7 nvml]# nvidia-smi topo -m
GPU0 CPU Affinity
GPU0 X 0-23

Legend:

X = Self
SOC = PCI path traverses a socket-level link (e.g. QPI)
PHB = PCI path traverses a host bridge
PXB = PCI path traverses multiple internal switches
PIX = PCI path traverses an internal switch
NV# = Path traverses # NVLinks

[root@bccentos7 nvml]# nvidia-smi
Wed Apr 13 11:02:45 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.42 Driver Version: 361.42 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 750 Ti Off | 0000:07:00.0 On | N/A |
| 40% 33C P8 1W / 38W | 96MiB / 2047MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1820 G /usr/bin/Xorg 48MiB |
| 0 2804 G gnome-shell 38MiB |
+-----------------------------------------------------------------------------+

@3XX0
Member

3XX0 commented Apr 13, 2016

What's the output of:

cat /sys/bus/pci/devices/0000:07:00.0/numa_node

@jsmith50500

Hi @3XX0, thanks for replying.

I ran the command and the output was -1:

[root@bccentos7 deviceQuery]# cat /sys/bus/pci/devices/0000:07:00.0/numa_node
-1

@3XX0
Member

3XX0 commented Apr 15, 2016

That's weird; that's exactly what the code does.
What about the permissions on this file?

@jsmith50500

@3XX0
Thanks for getting back again. The permissions in /sys/bus/pci/devices/0000:07:00.0 are:

-r--r--r--. 1 root root 4096 Apr 15 10:10 numa_node

@3XX0
Member

3XX0 commented Apr 15, 2016

I just added better error reporting; the message should be more explicit now.

@cnd

cnd commented Sep 12, 2016

@3XX0 I've got no numa_node file at all. What is it, even?
