Intermittent memory access error #2097

Closed
PhilipDeegan opened this issue Jun 4, 2020 · 15 comments

Comments

@PhilipDeegan

Every few executions I get the following error:

Memory access fault by GPU node-1 (Agent handle: 0x37b9900) on address 0x7f18ada2e000. Reason: Unknown

What I find surprising is that my code works at all yet sometimes fails.
My memory structure is perhaps somewhat complex, but I don't see why this should be causing problems.

I have no async memory calls.

@PhilipDeegan
Author

I cannot reproduce this under gdb so I would guess there's a race condition somewhere.
I use no async calls and have no threads so I don't know how it could be my code.

Please advise.

@gargrahul
Contributor

Are you using rocm-3.5 branch?

@PhilipDeegan
Author

Yes indeed, this was occurring on 3.3 and is now occurring on 3.5 as well.

@PhilipDeegan
Author

I absolutely understand that this may be due to my own buggy code.
So I appreciate any assistance, especially if this turns out to be the case!

@gargrahul
Contributor

You could try serializing kernels and memcpys. With ROCm 3.5, you can use the AMD_SERIALIZE_KERNEL and AMD_SERIALIZE_COPY env vars as described at https://github.com/ROCm-Developer-Tools/ROCclr/blob/roc-3.5.x/utils/flags.hpp#L220-#L225

@PhilipDeegan
Author

Will give it a shot, thanks.

@PhilipDeegan
Author

I'm not really sure what these env vars are doing, but they don't make much of a difference.

LOG_LEVEL=14 HIP_DB=api+copy+mem AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3 ./bin

This results in both of these outcomes:
A) working - https://gist.github.com/Dekken/f527c1499ea362075a03807f31c18031
B) not working - https://gist.github.com/Dekken/9d5f831f3ce6a7ae860cf9035adbed17

I think there are binary characters being logged somewhere in there too, which isn't ideal!
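
For an intermittent failure like this one, it can help to run the binary many times under a fixed environment and count how often the fault message appears. The following is a minimal sketch and not part of the original report: the helper name `count_faults` is made up for illustration, and it assumes the fault text reaches stdout or stderr. The environment variables in the usage note are the ones from the command above.

```python
import os
import subprocess

# Text taken from the reported error message.
FAULT_MARKER = "Memory access fault by GPU"


def count_faults(cmd, runs, env_overrides=None):
    """Run `cmd` `runs` times and return (faulted, other_failures, clean).

    Hypothetical helper: assumes the fault message is written to
    stdout or stderr of the process being run.
    """
    env = dict(os.environ)
    env.update(env_overrides or {})
    faulted = other = clean = 0
    for _ in range(runs):
        # errors="replace" keeps decoding from failing if the log
        # contains stray binary characters.
        proc = subprocess.run(
            cmd, env=env, capture_output=True, text=True, errors="replace"
        )
        output = proc.stdout + proc.stderr
        if FAULT_MARKER in output:
            faulted += 1
        elif proc.returncode != 0:
            other += 1
        else:
            clean += 1
    return faulted, other, clean
```

A call such as `count_faults(["./bin"], 20, {"AMD_SERIALIZE_KERNEL": "3", "AMD_SERIALIZE_COPY": "3"})` would then give a rough failure rate with the serialization settings applied, which can be compared against a run without them.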

@PhilipDeegan
Author

Got it to happen under gdb:

Memory access fault by GPU node-1 (Agent handle: 0x1276a70) on address 0x7fffa1ec4000. Reason: Unknown.

Thread 33 "binary" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffa1dbf700 (LWP 58336)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff6763535 in __GI_abort () at abort.c:79
#2  0x00007ffff5a3cf43 in core::Runtime::VMFaultHandler(long, void*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#3  0x00007ffff5a3b505 in core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#4  0x00007ffff59fa1c7 in os::ThreadTrampoline(void*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#5  0x00007ffff7f7afa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#6  0x00007ffff683a4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@PhilipDeegan
Author

After building https://github.com/RadeonOpenCompute/ROCR-Runtime with debug symbols:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff6763535 in __GI_abort () at abort.c:79
#2  0x00007ffff6bf2ffd in roc::callbackQueue(hsa_status_t, hsa_queue_s*, void*) () from /opt/rocm/lib/libamdhip64.so.3
#3  0x00007ffff5b8b0b4 in AMD::callback_t<void (*)(hsa_status_t, hsa_queue_s*, void*)>::operator() (this=0x7ffe90001930, args#0=41, args#1=0x7fffa1710000, args#2=0x12b4f80) at /path/to/rocm/hsa-rocr/src/core/inc/exceptions.h:87
#4  0x00007ffff5b88b53 in amd::AqlQueue::DynamicScratchHandler (error_code=536870912, arg=0x7ffe90001860) at /path/to/rocm/hsa-rocr/src/core/runtime/amd_aql_queue.cpp:866
#5  0x00007ffff5bcf8e8 in core::Runtime::AsyncEventsLoop () at /path/to/rocm/hsa-rocr/src/core/runtime/runtime.cpp:1035
#6  0x00007ffff5b5c362 in os::ThreadTrampoline (arg=0x12b3c20) at /path/to/rocm/hsa-rocr/src/core/util/lnx/os_linux.cpp:75
#7  0x00007ffff7f7afa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#8  0x00007ffff683a4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@PhilipDeegan
Author

and


Memory access fault by GPU node-1 (Agent handle: 0x1276c00) on address 0x7fffa0e51000. Reason: Unknown.
Nearby memory map:
0x7fffa0db4000, 0x1000, VRAM
0x7fffa0db5000, 0x3000, VRAM
0x7fffa0e80000, 0x60000, System

PtrInfo:
  Address: 0x7fffa0db4000-0x7fffa0db5000/0x7fffa0db4000-0x7fffa0db5000
  Size: 0x1000
  Type: 1
  Owner: 0x1276c00
  CanAccess: 1
    0x1276c00
  In block: 0x7fffa0c00000, 0x200000
PtrInfo:
  Address: 0x7fffa0db5000-0x7fffa0db8000/0x7fffa0db5000-0x7fffa0db8000
  Size: 0x3000
  Type: 1
  Owner: 0x1276c00
  CanAccess: 1
    0x1276c00
  In block: 0x7fffa0c00000, 0x200000
PtrInfo:
  Address: 0x7fffa0e80000-0x7fffa0ee0000/0x7fffa0e80000-0x7fffa0ee0000
  Size: 0x60000
  Type: 1
  Owner: 0x1275880
  CanAccess: 1
    0x1276c00
  In block: 0x7fffa0e80000, 0x60000
phare: /path/to/rocm/hsa-rocr/src/core/runtime/runtime.cpp:1264: static bool core::Runtime::VMFaultHandler(hsa_signal_value_t, void*): Assertion `false && "GPU memory access fault."' failed.

Thread 33 "phare" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffa1f1f700 (LWP 62364)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff6763535 in __GI_abort () at abort.c:79
#2  0x00007ffff676340f in __assert_fail_base (fmt=0x7ffff68c5ee0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7ffff5c3c710 "false && \"GPU memory access fault.\"", 
    file=0x7ffff5c3c168 "/path/to/rocm/hsa-rocr/src/core/runtime/runtime.cpp", line=1264, function=<optimized out>) at assert.c:92
#3  0x00007ffff6771102 in __GI___assert_fail (assertion=0x7ffff5c3c710 "false && \"GPU memory access fault.\"", file=0x7ffff5c3c168 "/path/to/rocm/hsa-rocr/src/core/runtime/runtime.cpp", line=1264, 
    function=0x7ffff5c3c080 <core::Runtime::VMFaultHandler(long, void*)::__PRETTY_FUNCTION__> "static bool core::Runtime::VMFaultHandler(hsa_signal_value_t, void*)") at assert.c:101
#4  0x00007ffff5bd0ab8 in core::Runtime::VMFaultHandler (val=1, arg=0x12b38c0) at /path/to/rocm/hsa-rocr/src/core/runtime/runtime.cpp:1264
#5  0x00007ffff5bcf8e8 in core::Runtime::AsyncEventsLoop () at /path/to/rocm/hsa-rocr/src/core/runtime/runtime.cpp:1035
#6  0x00007ffff5b5c362 in os::ThreadTrampoline (arg=0x12b3c20) at /path/to/rocm/hsa-rocr/src/core/util/lnx/os_linux.cpp:75
#7  0x00007ffff7f7afa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#8  0x00007ffff683a4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
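
One way to read the dump above is to check whether the faulting address lies inside any of the regions listed under "Nearby memory map". The sketch below does that arithmetic with values copied from the log. The helper names `locate` and `gap_around` are made up for illustration, and the interpretation (an address in a gap was not part of any listed allocation) is an assumption about how to read the dump, not something stated by the runtime.

```python
# (start, size, kind) triples copied from the "Nearby memory map" section.
REGIONS = [
    (0x7fffa0db4000, 0x1000, "VRAM"),
    (0x7fffa0db5000, 0x3000, "VRAM"),
    (0x7fffa0e80000, 0x60000, "System"),
]
# Address from the "Memory access fault" line of the same log.
FAULT_ADDRESS = 0x7fffa0e51000


def locate(address, regions):
    """Return the region containing `address`, or None if it lies in a gap."""
    for start, size, kind in regions:
        if start <= address < start + size:
            return (start, size, kind)
    return None


def gap_around(address, regions):
    """Return (bytes past the end of the closest region below,
    bytes before the start of the closest region above)."""
    ends_below = [s + n for s, n, _ in regions if s + n <= address]
    starts_above = [s for s, _, _ in regions if s > address]
    past_end = address - max(ends_below) if ends_below else None
    before_next = min(starts_above) - address if starts_above else None
    return past_end, before_next


print(locate(FAULT_ADDRESS, REGIONS))  # None
past_end, before_next = gap_around(FAULT_ADDRESS, REGIONS)
print(hex(past_end), hex(before_next))  # 0x99000 0x2f000
```

With these numbers the address falls in a gap, 0x99000 bytes past the end of the second VRAM region and 0x2f000 bytes before the System region, which is at least consistent with an access outside every listed allocation.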

@PhilipDeegan
Author

I have my code running on Nvidia hardware with no issues.

@PhilipDeegan
Author

Any chance I can get a comment on this?

If I can get access to a machine with an "official setup", and if I can't reproduce this, I'll close the ticket (and reconsider my life choices)

@PhilipDeegan
Author

I don't know if it's relevant, but the error appears to happen less often when the binary is executed just after compiling/linking.

@PhilipDeegan
Author

fault.Failure.ErrorType is 0

@PhilipDeegan
Author

ok, I'm an idiot.
My code shouldn't ever work, yet it does.
