
GPU faults keep popping up during training with OpenCL build #5285

Closed
nucleargod opened this issue Feb 15, 2017 · 6 comments

@nucleargod

Issue summary

I built Caffe from the opencl branch, but it keeps popping GPU faults whenever I train any net.
It can still output models and snapshots.
The kernel messages:
...
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807191] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d88880c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807197] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010506C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807200] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0818800C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807203] VM fault (0x0c, vmid 4) at page 1069164, read from 'TC4' (0x54433400) (392)
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814219] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d08840c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814224] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00106368
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814226] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0808400C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814229] VM fault (0x0c, vmid 4) at page 1074024, read from 'TC5' (0x54433500) (132)
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934987] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d82880c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934993] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010506C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934995] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0218800C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934999] VM fault (0x0c, vmid 1) at page 1069164, read from 'TC4' (0x54433400) (392)
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942522] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d02040c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942527] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00106368
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942530] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200400C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942533] VM fault (0x0c, vmid 1) at page 1074024, read from 'TC3' (0x54433300) (4)
...

Steps to reproduce

Train LeNet with the MNIST dataset from examples/ (a minimal sequence is sketched below), then watch the kernel messages.
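
For reference, this reproduces with the standard scripts shipped in the Caffe tree (run from the source root; the paths below are the upstream ones):

# fetch MNIST, convert it to LMDB, and train LeNet on the GPU
./data/mnist/get_mnist.sh
./examples/mnist/create_mnist.sh
./examples/mnist/train_lenet.sh
# in a second terminal, follow the kernel log for amdgpu faults:
dmesg -wH | grep -i 'gpu fault'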

Your system configuration

Operating system: Ubuntu 16.04
Compiler: gcc
OpenCL Platform: AMDAPP 3.0
BLAS: ATLAS
GPU: R9 390
GPU driver: AMDGPU-PRO 16.40-348864
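
To double-check which OpenCL platform and device the build actually picks up, something like this should work (clinfo from the distro package; device_query is part of the caffe tools binary, flag usage as in upstream Caffe, so verify against ./build/tools/caffe --help):

clinfo | grep -E 'Platform Name|Device Name'
./build/tools/caffe device_query -gpu 0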

@naibaf7
Member

naibaf7 commented Feb 15, 2017

Huh, are you using your GPU in a VM with pass-through?
Does it still work despite the kernel messages, or does training/testing abort at any point?

@naibaf7 naibaf7 self-assigned this Feb 15, 2017
@naibaf7 naibaf7 added the OpenCL label Feb 15, 2017
@nucleargod
Author

No, there is no VM on the system.
It still works despite the kernel messages most of the time, but the OS may crash when training a larger task. For example, training U-Net crashes roughly every 10,000 iterations.

And everything is fine with testing; no kernel messages there.

@naibaf7
Member

naibaf7 commented Feb 15, 2017

Ok. Did you check GPU memory usage? How much VRAM does the GPU have?
I have a similar setup without the error messages; however, I'm using a W9100 with 16 GB (the GPU is a Hawaii, identical to your R9 390X).

@nucleargod
Author

My GPU has 8 GB of VRAM.
Usage for LeNet:
I0215 17:04:24.371548 29539 net.cpp:281] Memory required for data: 8086808
For U-Net:
I0215 16:45:42.087606 29315 net.cpp:281] Memory required for data: 2879010116
These two cases give me the same messages, so I don't think I've run out of memory.
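
(As a side note, the net.cpp:281 figure only counts blob data; scratch buffers from the BLAS and convolution kernels add to the real footprint. Assuming a kernel new enough to expose the amdgpu sysfs counters, actual VRAM usage can be watched like this; otherwise a tool such as radeontop gives a similar view:)

# total and currently used VRAM in bytes (card0 = the R9 390 here)
cat /sys/class/drm/card0/device/mem_info_vram_total
watch -n1 cat /sys/class/drm/card0/device/mem_info_vram_used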

@nucleargod
Author

Ok, I found the difference.
I originally built with GreenTea & ViennaCL, and that build produces the error messages.
When I build with clBLAS or LibDNN, it no longer gives me such kernel messages (relevant build switches sketched below).
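
For anyone hitting the same thing: the backend choice lives in Makefile.config on the opencl branch. The switches below follow the names in Makefile.config.example; verify them against your checkout, as they may differ by revision:

# Makefile.config (opencl branch) -- backend selection
USE_GREENTEA := 1   # OpenCL (GreenTea) backend
USE_LIBDNN := 1     # LibDNN convolution kernels
USE_CLBLAS := 1     # clBLAS for the BLAS calls
# with neither USE_CLBLAS nor USE_CLBLAST enabled, the build falls back to
# the ViennaCL BLAS kernels -- the configuration that triggered the faults here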

@naibaf7
Member

naibaf7 commented Feb 20, 2017

Ok, good to know, thanks. I'll have a look. I did not test it with the ViennaCL BLAS kernels; I too use LibDNN and clBLAS.
