
GPU faults keep popping up during training with OpenCL build #5285

Closed
nucleargod opened this issue Feb 15, 2017 · 6 comments

@nucleargod

Issue summary

I built Caffe from the opencl branch, but it keeps popping GPU faults whenever I train any net.
It can still output models and snapshots.
The kernel messages:
...
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807191] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d88880c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807197] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010506C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807200] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0818800C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.807203] VM fault (0x0c, vmid 4) at page 1069164, read from 'TC4' (0x54433400) (392)
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814219] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d08840c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814224] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00106368
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814226] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0808400C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.814229] VM fault (0x0c, vmid 4) at page 1074024, read from 'TC5' (0x54433500) (132)
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934987] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d82880c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934993] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010506C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934995] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0218800C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.934999] VM fault (0x0c, vmid 1) at page 1069164, read from 'TC4' (0x54433400) (392)
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942522] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d02040c
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942527] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00106368
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942530] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200400C
Feb 15 16:49:21 nucleargod-KVM kernel: [772365.942533] VM fault (0x0c, vmid 1) at page 1074024, read from 'TC3' (0x54433300) (4)
...

Steps to reproduce

Train LeNet with the MNIST dataset from examples/ (a minimal sequence is sketched below), then watch the kernel messages.
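
For reference, this reproduces with the standard scripts shipped in the Caffe tree (run from the source root; the paths below are the upstream ones):

# fetch MNIST, convert it to LMDB, and train LeNet on the GPU
./data/mnist/get_mnist.sh
./examples/mnist/create_mnist.sh
./examples/mnist/train_lenet.sh
# in a second terminal, follow the kernel log for amdgpu faults:
dmesg -wH | grep -i 'gpu fault'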

Your system configuration

Operating system: Ubuntu 16.04
Compiler: gcc
OpenCL Platform: AMDAPP 3.0
BLAS: ATLAS
GPU: R9 390
GPU driver: AMDGPU-PRO 16.40-348864
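
To double-check which OpenCL platform and device the build actually picks up, something like this should work (clinfo from the distro package; device_query is part of the caffe tools binary, flag usage as in upstream Caffe, so verify against ./build/tools/caffe --help):

clinfo | grep -E 'Platform Name|Device Name'
./build/tools/caffe device_query -gpu 0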

@naibaf7
Member

naibaf7 commented Feb 15, 2017

Huh, are you using your GPU in a VM with pass-through?
Does it still work despite the kernel messages, or does training/testing abort at any point?

@naibaf7 naibaf7 self-assigned this Feb 15, 2017
@naibaf7 naibaf7 added the OpenCL label Feb 15, 2017
@nucleargod
Author

No, there is no VM on the system.
It still works despite the kernel messages most of the time, but the OS may crash when training a larger task. For example, training U-Net crashes roughly every 10,000 iterations.

And everything is fine with testing; no kernel messages there.

@naibaf7
Member

naibaf7 commented Feb 15, 2017

Ok. Did you check GPU memory usage? How much VRAM does the GPU have?
I have a similar setup without the error messages; however, I'm using a W9100 with 16 GB (the GPU is a Hawaii, identical to your R9 390X).

@nucleargod
Author

My GPU has 8 GB of VRAM.
Usage for LeNet:
I0215 17:04:24.371548 29539 net.cpp:281] Memory required for data: 8086808
For U-Net:
I0215 16:45:42.087606 29315 net.cpp:281] Memory required for data: 2879010116
These two cases give me the same messages, so I don't think I've run out of memory.
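
(As a side note, the net.cpp:281 figure only counts blob data; scratch buffers from the BLAS and convolution kernels add to the real footprint. Assuming a kernel new enough to expose the amdgpu sysfs counters, actual VRAM usage can be watched like this; otherwise a tool such as radeontop gives a similar view:)

# total and currently used VRAM in bytes (card0 = the R9 390 here)
cat /sys/class/drm/card0/device/mem_info_vram_total
watch -n1 cat /sys/class/drm/card0/device/mem_info_vram_used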

@nucleargod
Author

Ok, I found the difference.
I originally built with GreenTea & ViennaCL, and that build produces the error messages.
When I build with clBLAS or LibDNN, it no longer gives me such kernel messages (relevant build switches sketched below).
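
For anyone hitting the same thing: the backend choice lives in Makefile.config on the opencl branch. The switches below follow the names in Makefile.config.example; verify them against your checkout, as they may differ by revision:

# Makefile.config (opencl branch) -- backend selection
USE_GREENTEA := 1   # OpenCL (GreenTea) backend
USE_LIBDNN := 1     # LibDNN convolution kernels
USE_CLBLAS := 1     # clBLAS for the BLAS calls
# with neither USE_CLBLAS nor USE_CLBLAST enabled, the build falls back to
# the ViennaCL BLAS kernels -- the configuration that triggered the faults here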

@naibaf7
Member

naibaf7 commented Feb 20, 2017

Ok, good to know, thanks. I'll have a look. I did not test it with the ViennaCL BLAS kernels; I too use LibDNN and clBLAS.
