Error: failed to retrieve CPU affinity #65
Editing numa.go to remove this error return and always return zero seems to be a workaround, although it clearly isn't a correct fix.
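For illustration only, here is a minimal sketch of what such a workaround could look like. The helper name and layout are hypothetical, not the actual numa.go code; it simply falls back to node 0 when sysfs reports no NUMA locality instead of returning an error:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// numaNode looks up the NUMA node of a PCI device (e.g. "0000:00:03.0")
// through sysfs. Instead of failing when the kernel reports no locality,
// it falls back to node 0, mirroring the workaround described above.
func numaNode(busID string) (int, error) {
	path := fmt.Sprintf("/sys/bus/pci/devices/%s/numa_node", busID)
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, nil // no numa_node attribute at all: assume node 0
	}
	node, err := strconv.Atoi(strings.TrimSpace(string(b)))
	if err != nil {
		return 0, err
	}
	if node < 0 {
		return 0, nil // kernel reports -1 (no NUMA information): assume node 0
	}
	return node, nil
}

func main() {
	fmt.Println(numaNode("0000:00:03.0"))
}
```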
Cross socket affinity for all 4 GPUs looks suspicious.
Thanks. Let me know if you need any more info.
Finally had time to look at it closely.
Hi there, I ran:
[root@bccentos7 tmp]# for x in /proc/driver/nvidia/gpus/*/information; do cat $x; done
[root@bccentos7 tmp]# nvidia-smi topo -m
[root@bccentos7 tmp]# nvidia-smi
Hi there,
[root@bccentos7 nvidia-docker-master]# make
[root@bccentos7 bin]# sudo -b nohup nvidia-docker-plugin > /tmp/nvidia-docker.log
[root@bccentos7 tmp]# cat nvidia-docker.log
[root@bccentos7 nvml]# ls -l
[root@bccentos7 nvml]# nvidia-smi topo -m
[root@bccentos7 nvml]# nvidia-smi
What's the output of:
cat /sys/bus/pci/devices/0000:07:00.0/numa_node
Hi @3XX0, thanks for replying. I ran the command and the output was -1:
[root@bccentos7 deviceQuery]# cat /sys/bus/pci/devices/0000:07:00.0/numa_node
-1
That's weird, that's exactly what the code does.
@3XX0
-r--r--r--. 1 root root 4096 Apr 15 10:10 numa_node
I just added better error reporting; the error should be more explicit now.
@3XX0 I've got no numa_node file, what is it even?
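For background: numa_node is a sysfs attribute that reports which NUMA node a PCI device is attached to; -1 means the kernel has no locality information for the device, and on some kernels or virtualized hosts the file can be missing entirely. A small standalone sketch (not part of nvidia-docker) that dumps the value for every PCI device:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Print the reported NUMA node for every PCI device the kernel knows about.
	files, err := filepath.Glob("/sys/bus/pci/devices/*/numa_node")
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		b, err := os.ReadFile(f)
		if err != nil {
			continue
		}
		// A value of -1 means no NUMA affinity is reported for this device.
		busID := filepath.Base(filepath.Dir(f))
		fmt.Printf("%s: %s\n", busID, strings.TrimSpace(string(b)))
	}
}
```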
Host: AWS g2.8xlarge
This works on a g2.2xlarge.
Fails on a g2.8xlarge using the version from master/HEAD.
ubuntu@ip-172-31-67-39:~/nvidia-docker$ sudo ./tools/bin/nvidia-docker-plugin
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Loading NVIDIA management library
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Loading NVIDIA unified memory
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Discovering GPU devices
0 [65535 0 0 0] [4294967295 0 0 0] <-- Extra debug from me, dump of the CPU and GPU masks
1 [4294901760 0 0 0] [4294967295 0 0 0]
./tools/bin/nvidia-docker-plugin | 2016/04/04 23:54:05 Error: failed to retrieve CPU affinity
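To make the mask dump above easier to read, here is a small sketch (mine, not part of the plugin) that expands 32-bit mask words into CPU indices: [65535 0 0 0] covers CPUs 0-15, [4294901760 0 0 0] covers CPUs 16-31, and [4294967295 0 0 0] covers CPUs 0-31, which matches the 0-31 affinity nvidia-smi reports below.

```go
package main

import "fmt"

// cpus expands an array of 32-bit bitmask words into the set CPU indices.
// Word 0 covers CPUs 0-31, word 1 covers CPUs 32-63, and so on.
func cpus(words []uint32) []int {
	var set []int
	for w, word := range words {
		for bit := 0; bit < 32; bit++ {
			if word&(uint32(1)<<uint(bit)) != 0 {
				set = append(set, w*32+bit)
			}
		}
	}
	return set
}

func main() {
	fmt.Println(cpus([]uint32{65535, 0, 0, 0}))      // CPUs 0-15
	fmt.Println(cpus([]uint32{4294901760, 0, 0, 0})) // CPUs 16-31
	fmt.Println(cpus([]uint32{4294967295, 0, 0, 0})) // CPUs 0-31
}
```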
root@ip-172-31-67-39:~/nvidia-docker# for x in /proc/driver/nvidia/gpus/*/information; do cat $x; done
Model: GRID K520
IRQ: 261
GPU UUID: GPU-2ce780af-c15a-a966-f0b8-98c9320b34a2
Video BIOS: 80.04.d4.00.03
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:03.0
Model: GRID K520
IRQ: 262
GPU UUID: GPU-0bc9a139-4747-41a7-c24a-7c871dc4809e
Video BIOS: 80.04.d4.00.04
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:04.0
Model: GRID K520
IRQ: 263
GPU UUID: GPU-6ca92b1b-6acf-cdbe-e61f-25387f27518d
Video BIOS: 80.04.d4.00.03
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:05.0
Model: GRID K520
IRQ: 264
GPU UUID: GPU-3ca9085f-66a6-30c2-d544-907e00dde280
Video BIOS: 80.04.d4.00.04
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:00:06.0
root@ip-172-31-67-39:~/nvidia-docker# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 CPU Affinity
GPU0 X PHB PHB PHB 0-31
GPU1 PHB X PHB PHB 0-31
GPU2 PHB PHB X PHB 0-31
GPU3 PHB PHB PHB X 0-31
Legend:
X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch