Memory access fault with rocm 2.9 #125

Closed
csuji opened this issue Oct 17, 2019 · 5 comments

@csuji

csuji commented Oct 17, 2019

With the PyTorch Docker image rocm2.9_ubuntu16.04_py3.6_pytorch, I get the following error
after starting training:

Memory access fault by GPU node-1 (Agent handle: 0x75e0030) on address 0x7f4cf87ff000. Reason: Page not present or supervisor privilege.

With MIOPEN_ENABLE_LOGGING_CMD=1 MIOPEN_LOG_LEVEL=5 I was able to track it down to the following convolution, which reproduces the fault in MIOpenDriver:
/opt/rocm/miopen/bin/MIOpenDriver conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 384 -p 0 -q 128 -u 1 -v 64 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

MIOpenDriver: conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 384 -p 0 -q 128 -u 1 -v 64 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
Memory access fault by GPU node-1 (Agent handle: 0x20f0010) on address 0x7f228a9ff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
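
A minimal sketch of how the logged driver commands could be pulled out of a captured log for replay, assuming the log written with MIOPEN_ENABLE_LOGGING_CMD=1 contains the `MIOpenDriver conv ...` command line somewhere in each relevant line. The log prefix in `sample_log` and the helper name `extract_driver_commands` are hypothetical; only the command itself comes from this report.

```python
import re

# Matches "MIOpenDriver conv <args>" anywhere in a line; whatever precedes it
# (log prefix, install path) is deliberately ignored.
DRIVER_RE = re.compile(r"MIOpenDriver:?\s+(conv\s.*\S)")


def extract_driver_commands(log_text):
    """Return unique 'conv ...' argument strings in order of first appearance."""
    seen = []
    for line in log_text.splitlines():
        match = DRIVER_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Hypothetical excerpt: the prefix is an assumption, the command is from this report.
sample_log = """\
MIOpen(HIP): Command [LogCmdConvolution] ./bin/MIOpenDriver conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 384 -p 0 -q 128 -u 1 -v 64 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen(HIP): Command [LogCmdConvolution] ./bin/MIOpenDriver conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 384 -p 0 -q 128 -u 1 -v 64 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
"""

for args in extract_driver_commands(sample_log):
    # Each entry can be replayed as: /opt/rocm/miopen/bin/MIOpenDriver <args>
    print(args)
```

The last command logged before the abort is a reasonable first candidate to replay, but the earlier ones are worth replaying too.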

Config:
GPU gfx900
Card: Vega 10 XTX [Radeon Vega Frontier Edition]

hipconfig:
HIP version : 2.8.19361-cbe6b65

== hipconfig
HIP_PATH : /opt/rocm
HIP_PLATFORM : hcc
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -I/opt/rocm/include -I/opt/rocm/hcc/include -I/opt/rocm/hsa/include

== hcc
HSA_PATH : /opt/rocm/hsa
HCC_HOME : /opt/rocm/hcc
HCC clang version 10.0.0 (/data/jenkins_workspace/compute-rocm-rel-2.9/external/hcc-tot/clang fa40706d8ba0b8b958d42f579120eb9b89babc00) (/data/jenkins_workspace/compute-rocm-rel-2.9/external/hcc-tot/compiler b7f876231af7fdaf52e419088b8ba9e0c3a61845) (based on HCC 2.9.19392-75835c3-fa40706-b7f8762 )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/hcc/bin
LLVM (http://llvm.org/):
LLVM version 10.0.0svn
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake

Registered Targets:
amdgcn - AMD GCN GPUs
r600 - AMD GPUs HD2XXX-HD6XXX
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
HCC-cxxflags : -hc -std=c++amp -I/opt/rocm/hcc/include -I/opt/rocm/include
HCC-ldflags : -hc -std=c++amp -L/opt/rocm/hcc/lib -Wl,--rpath=/opt/rocm/hcc/lib -ldl -lm -lpthread -lhc_am -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive

=== Environment Variables
PATH=/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HIP_PLATFORM=hcc

== Linux Kernel
Hostname : 08c9b8b666c9
Linux 08c9b8b666c9 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.6 LTS
Release: 16.04
Codename: xenial

@iotamudelta

Can you report that failure in https://github.com/ROCmSoftwarePlatform/MIOpen ? Thanks!

@csuji
Author

csuji commented Oct 17, 2019

Sure, but can't someone from AMD move it?
https://help.github.com/en/articles/transferring-an-issue-to-another-repository

@iotamudelta iotamudelta transferred this issue from ROCm/pytorch Oct 17, 2019
@iotamudelta

Transferred over to MIOpen for now.

@csuji
Author

csuji commented Oct 17, 2019

I checked the other convolutions that were written to the log file before the core dump.
These also produce a memory fault with MIOpenDriver:

conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 258 -p 0 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MEM FAULT
conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 260 -p 0 -q 4 -u 1 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MEM FAULT
conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 264 -p 0 -q 8 -u 1 -v 4 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MEM FAULT
conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 272 -p 0 -q 16 -u 1 -v 8 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MEM FAULT
conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 288 -p 0 -q 32 -u 1 -v 16 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MEM FAULT
conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 320 -p 0 -q 64 -u 1 -v 32 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MEM FAULT
conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 384 -p 0 -q 128 -u 1 -v 64 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 MEM FAULT
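
The seven failing commands share a pattern that is easier to see when computed. The sketch below assumes the usual MIOpenDriver flag meanings (-W input width, -x filter width, -q width padding, -v width stride, -j width dilation) and the standard convolution output-size formula. The helper names are hypothetical, and the thread does not confirm that this pattern is the cause of the fault.

```python
def parse_conv_args(command):
    """Return {flag letter: value} for a 'conv -n 32 -c 64 ...' string."""
    tokens = command.split()
    args = {}
    for flag, value in zip(tokens, tokens[1:]):
        if len(flag) == 2 and flag[0] == "-" and flag[1].isalpha():
            args[flag[1]] = int(value) if value.isdigit() else value
    return args


def out_size(in_size, pad, filt, stride, dilation):
    """Standard convolution output-size formula (floor division)."""
    return (in_size + 2 * pad - dilation * (filt - 1) - 1) // stride + 1


# (filter width -x, pad -q, stride -v) for the seven failing commands above
FAILING = [(258, 2, 1), (260, 4, 2), (264, 8, 4), (272, 16, 8),
           (288, 32, 16), (320, 64, 32), (384, 128, 64)]
TEMPLATE = ("conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x {x} -p 0 -q {q} "
            "-u 1 -v {v} -l 1 -j 1 -m conv -g 1 -F 1 -t 1")


def summarize():
    """Return (filter width, filter width minus input width, output width) rows."""
    rows = []
    for x, q, v in FAILING:
        a = parse_conv_args(TEMPLATE.format(x=x, q=q, v=v))
        wo = out_size(a["W"], a["q"], a["x"], a["v"], a["j"])
        rows.append((a["x"], a["x"] - a["W"], wo))
    return rows


for x, excess, wo in summarize():
    print(f"x={x}: filter exceeds unpadded width by {excess}, output width {wo}")
```

In every failing case the filter is wider than the unpadded input (256) by exactly the padding value, the stride is half the padding, and the output width comes out as 3.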

@csuji
Author

csuji commented Jan 1, 2020

This seems to be fixed with ROCm 3.0-6:
/opt/rocm/miopen/bin/MIOpenDriver conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 384 -p 0 -q 128 -u 1 -v 64 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver: conv -n 32 -c 64 -H 1 -W 256 -k 32 -y 1 -x 384 -p 0 -q 128 -u 1 -v 64 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 0, Solution: 33/gemm
GPU Kernel Time Forward Conv. Elapsed: 8.201511 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x384u1, 32, 64, 1, 3, 1, 384, 32, 150994944, 5242880, 12288, 18, 1, 8.201511
Forward Convolution Verifies on CPU and GPU (1.73756e-07)
Happy New Year!
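
As a sanity check on the stats row in the output above, the counters can be reproduced from the command-line parameters. This is a sketch under assumptions: the formulas below are inferred (two FLOPs per multiply-accumulate, float32 tensors at 4 bytes per element) rather than taken from the MIOpen source, and `conv_stats` is a hypothetical helper.

```python
# Values from the command line and the stats row above
n, c, H, W = 32, 64, 1, 256   # batch, input channels, input height and width
k, y, x = 32, 1, 384          # output channels, filter height and width
ho, wo = 1, 3                 # output height and width reported by the driver
time_ms = 8.201511
BYTES_PER_ELEM = 4            # assumption: float32 tensors


def conv_stats():
    """Return (flopCnt, bytesRead, bytesWritten, GFLOPs) for the forward pass."""
    flop_cnt = 2 * n * c * k * ho * wo * y * x
    bytes_read = BYTES_PER_ELEM * (n * c * H * W + k * c * y * x)  # input + weights
    bytes_written = BYTES_PER_ELEM * n * k * ho * wo               # output tensor
    gflops = flop_cnt / (time_ms * 1e-3) / 1e9
    return flop_cnt, bytes_read, bytes_written, gflops


flop_cnt, bytes_read, bytes_written, gflops = conv_stats()
print(flop_cnt, bytes_read, bytes_written, round(gflops, 1))
```

Under these assumptions the three counters match the reported 150994944, 5242880 and 12288, and the throughput works out to roughly 18 GFLOPs, consistent with the stats row.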

@csuji csuji closed this as completed Jan 1, 2020