Keep popping GPU fault on training with opencl build #5285
Comments
Huh, are you using your GPU in a VM with pass-through?
No, there is no VM on the system. And everything is fine with testing; no kernel messages appear there.
Ok. Did you check GPU memory usage? How much VRAM does the GPU have?
My GPU has 8GB VRAM.
Ok, I found the difference.
Ok, good to know, thanks. I'll have a look. I did not test it with the ViennaCL BLAS kernels; I also use LibDNN and clBLAS.
Issue summary
I built Caffe from the opencl branch, but the kernel log keeps reporting GPU faults whenever I train any net.
Training still outputs models and snapshots.
The kernel message:
...
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807191] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d88880c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807197] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010506C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807200] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0818800C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807203] VM fault (0x0c, vmid 4) at page 1069164, read from 'TC4' (0x54433400) (392)
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814219] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d08840c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814224] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00106368
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814226] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0808400C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814229] VM fault (0x0c, vmid 4) at page 1074024, read from 'TC5' (0x54433500) (132)
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934987] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d82880c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934993] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010506C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934995] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0218800C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934999] VM fault (0x0c, vmid 1) at page 1069164, read from 'TC4' (0x54433400) (392)
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942522] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d02040c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942527] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00106368
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942530] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200400C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942533] VM fault (0x0c, vmid 1) at page 1074024, read from 'TC3' (0x54433300) (4)
...
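For reading the log above: in each `VM fault` line the page number is the decimal form of the preceding `VM_CONTEXT1_PROTECTION_FAULT_ADDR` value, and the hex number after the client name is that name in ASCII. Below is a minimal Python sketch that extracts these fields from one line. The helper name `parse_vm_fault` and the regular expression are illustrative assumptions based only on the lines shown here; they are not part of Caffe or the amdgpu driver.

```python
import re

# Pattern for the summary line seen in the log above (an assumption drawn
# from these samples, not an official format definition).
VM_FAULT = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)' \((0x[0-9a-fA-F]+)\)"
)

def parse_vm_fault(line):
    """Return the fields of a 'VM fault' kernel line, or None if no match."""
    m = VM_FAULT.search(line)
    if m is None:
        return None
    _status, vmid, page, access, client, client_hex = m.groups()
    # The client ID is four ASCII bytes packed into one 32-bit value.
    raw = int(client_hex, 16).to_bytes(4, "big")
    return {
        "vmid": int(vmid),
        "page": int(page),
        # Same value in hex, for comparison with the FAULT_ADDR line.
        "page_hex": f"0x{int(page):08X}",
        "access": access,
        "client": client,
        "client_decoded": raw.rstrip(b"\x00").decode("ascii", "replace"),
    }

sample = (
    "Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807203] "
    "VM fault (0x0c, vmid 4) at page 1069164, read from 'TC4' (0x54433400) (392)"
)
fault = parse_vm_fault(sample)
print(fault)
```

For the sample line, `page_hex` matches the `0x0010506C` fault address reported just before it, so the faults in this excerpt repeat on two pages (`0x0010506C` and `0x00106368`), all of them reads.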
Steps to reproduce
Train LeNet on the MNIST dataset from examples/, then check the kernel log for the messages shown above.
Your system configuration
Operating system: Ubuntu 16.04
Compiler: gcc
OpenCL Platform: AMDAPP 3.0
BLAS: atlas
GPU: R9 390
GPU driver: AMDGPU-PRO 16.40-348864