gasnetex: crash and assertion failure #1118

Closed
rohany opened this issue Jul 5, 2021 · 14 comments

@rohany
Contributor

rohany commented Jul 5, 2021

I recently switched from GASNet to GASNet-EX, and I frequently see this error when running on 8 nodes of Lassen:

cannonMM-cuda: /g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc:1601: void Realm::GenEventImpl::process_update(Realm::EventImpl::gen_t, const gen_t*, int, Realm::TimeLimit): Assertion `num_poisoned_generations.load() == 0' failed.
Signal 6 received by node 5, process 109319 (thread 20002c70f8b0) - obtaining backtrace
Signal 6 received by process 109319 (thread 20002c70f8b0) at: stack trace: 14 frames
  [0] = [0x2000000504d8]
  [1] = /lib64/libc.so.6(abort+0x2b4) [0x200006092134]
  [2] = /lib64/libc.so.6(+0x357d4) [0x2000060857d4]
  [3] = /lib64/libc.so.6(__assert_fail+0x64) [0x2000060858c4]
  [4] = bin/cannonMM-cuda() [0x12710c78]
  [5] = bin/cannonMM-cuda() [0x12711504]
  [6] = bin/cannonMM-cuda() [0x1271bf54]
  [7] = bin/cannonMM-cuda(Realm::IncomingMessageManager::do_work(Realm::TimeLimit)+0x1b4) [0x12945a38]
  [8] = bin/cannonMM-cuda() [0x12703b74]
  [9] = bin/cannonMM-cuda() [0x12700e1c]
  [10] = bin/cannonMM-cuda() [0x12705fe0]
  [11] = bin/cannonMM-cuda() [0x1290b4bc]
  [12] = /lib64/libpthread.so.0(+0x8cd4) [0x2000000f8cd4]
  [13] = /lib64/libc.so.6(clone+0xe4) [0x200006177e14]
mlx5: lassen128: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000005 00000000 00000000 00000000
00000000 00008813 10037173 36a112d2
@ 1> snd status=10 opcode=2 dst_node=5 dst_qp=2
@ 1> - rcv CQ contains impossibly large WCE count with status 10
*** FATAL ERROR (proc 1): in gasnetc_snd_reap() at ld/embed-gasnet/source/ibv-conduit/gasnet_core_sndrcv.c:815: aborting on reap of failed send
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
Signal 6 received by node 1, process 159318 (thread 20002c5df8b0) - obtaining backtrace
@ 1> rcv comp->status=5
mlx5: lassen128: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000005 00000000 00000000 00000000
00000000 00008813 10034dd4 3d7c14d3
*** FATAL ERROR (proc 1): in gasnetc_rcv_reap() at ld/embed-gasnet/source/ibv-conduit/gasnet_core_sndrcv.c:1076: aborting on reap of failed recv
mlx5: lassen129: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 10000377 53df76d3
@ 2> snd status=11 opcode=2 dst_node=1 dst_qp=0
@ 2> - rcv CQ contains impossibly large WCE count with status 11
*** FATAL ERROR (proc 2): in gasnetc_snd_reap() at ld/embed-gasnet/source/ibv-conduit/gasnet_core_sndrcv.c:815: aborting on reap of failed send
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
Signal 6 received by node 2, process 26632 (thread 20002c5df8b0) - obtaining backtrace
@ 2> rcv comp->status=5
mlx5: lassen129: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008813 1003a2f3 559524d3
*** FATAL ERROR (proc 2): in gasnetc_rcv_reap() at ld/embed-gasnet/source/ibv-conduit/gasnet_core_sndrcv.c:1076: aborting on reap of failed recv

I'm on commit e7d513f51a39df73e9a7ca31f9a631a82777c022. I can't reproduce the error on Sapling, so let me know what information I can provide to help debug it.
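
For readers triaging the log above, here is a small sketch that pulls the `status=` values out of the `@ N>` lines and attaches names to them. It assumes the numbers are libibverbs work-completion status codes (`enum ibv_wc_status`), which the log itself does not state; the table is deliberately a small subset, and `decode_status_lines` is a hypothetical helper, not part of GASNet or Realm.

```python
import re

# Subset of libibverbs `enum ibv_wc_status` values. ASSUMPTION: the conduit
# prints these codes as `status=`; only codes relevant to this log are listed.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
}

# Matches lines such as "@ 1> snd status=10 opcode=2 dst_node=5 dst_qp=2"
# and "@ 1> rcv comp->status=5".
LINE_RE = re.compile(r"@ (\d+)> (snd|rcv) (?:comp->)?status=(\d+)")


def decode_status_lines(log_text):
    """Return (proc, direction, code, name) for each status line found."""
    results = []
    for match in LINE_RE.finditer(log_text):
        proc = int(match.group(1))
        direction = match.group(2)
        code = int(match.group(3))
        results.append((proc, direction, code, IBV_WC_STATUS.get(code, "unknown")))
    return results


sample = """@ 1> snd status=10 opcode=2 dst_node=5 dst_qp=2
@ 1> rcv comp->status=5
@ 2> snd status=11 opcode=2 dst_node=1 dst_qp=0
"""
for row in decode_status_lines(sample):
    print(row)
```

Grouping the decoded rows by process can make it easier to see which nodes reported send failures and which only reported flushed receives.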

@lightsighter
Contributor

Start by getting a backtrace.

@rohany
Contributor Author

rohany commented Jul 5, 2021

The backtrace is included at the top of the paste -- it was collected with LEGION_BACKTRACE=1 on a Debug build

@lightsighter
Contributor

We'll probably need to wait for @streichler to get back from vacation. It would still be good to get a backtrace with symbols. Try using GASNET_BACKTRACE=1 instead.

@rohany
Contributor Author

rohany commented Jul 5, 2021

I got a different error with GASNET_BACKTRACE set. Do you think this is the same error or a different one? If it's different, I can open a separate issue. In general, when should I be using LEGION_BACKTRACE vs GASNET_BACKTRACE (or REALM_BACKTRACE)?

[6] Thread 12 (Thread 0x20002c11f8b0 (LWP 125194)):
[6] #0  0x0000200005c1e4e8 in waitpid () at ../sysdeps/unix/syscall-template.S:81
[6] #1  0x0000200005b8fd3c in do_system (line=0x13ea56a0 <cmd.7513> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_e9dmpA '/g/g15/yadav2/taco/build/bin/cannonMM-cuda' 124997") at ../sysdeps/posix/system.c:148
[6] #2  0x0000200005b90424 in __libc_system (line=0x13ea56a0 <cmd.7513> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_e9dmpA '/g/g15/yadav2/taco/build/bin/cannonMM-cuda' 124997") at ../sysdeps/posix/system.c:189
[6] #3  0x0000200000105348 in system (line=<optimized out>) at pt-system.c:28
[6] #4  0x00000000112ae99c in gasneti_system_redirected ()
[6] #5  0x00000000131858c8 in gasneti_bt_gdb ()
[6] #6  0x000000001318ad98 in gasneti_print_backtrace ()
[6] #7  0x00000000112afb98 in gasneti_defaultSignalHandler ()
[6] #8  <signal handler called>
[6] #9  0x0000200005b7fcb0 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
[6] #10 0x0000200005b8200c in __GI_abort () at abort.c:90
[6] #11 0x000000001312ef08 in Realm::verify_packet_crc (arg0=1050599796, header_base=0x20001c270108, header_size=112, payload_base=0x0, payload_size=0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:118
[6] #12 0x000000001313fad0 in Realm::GASNetEXInternal::handle_short (this=0x13f69940, srcrank=7, arg0=1050599796, hdr=0x20001c270108, hdr_bytes=112) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:4253
[6] #13 0x0000000013140940 in Realm::GASNetEXInternal::handle_batch (this=0x13f69940, srcrank=7, arg0=16, data=0x20001c270010, data_bytes=1280, comps=0x20002c11ab48) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:4465
[6] #14 0x000000001315398c in Realm::handle_request_batch (token=0x14b6d380, buf=0x20001c270010, nbytes=1280, arg0=16) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_handlers.cc:1640
[6] #15 0x0000000013173420 in gasnetc_processPacket ()
[6] #16 0x00000000131758c4 in gasnetc_poll_rcv_hca ()
[6] #17 0x0000000013176a20 in gasnetc_do_poll ()
[6] #18 0x000000001316565c in gasnetc_AM_PrepareRequestMedium ()
[6] #19 0x000000001315382c in Realm::GASNetEXHandlers::prepare_request_batch (src_ep=0x14b251a0, tgt_rank=5, tgt_ep_index=0, data=0x20009c081900, min_data_bytes=80, max_data_bytes=1280, lc_opt=0x20002c11d4a8, flags=0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_handlers.cc:1607
[6] #20 0x00000000131355fc in Realm::XmitSrcDestPair::push_packets (this=0x197de470, immediate_mode=false, work_until=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1735
[6] #21 0x00000000131385cc in Realm::GASNetEXPoller::do_work (this=0x13f699c0, work_until=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2417
[6] #22 0x0000000012703b74 in Realm::BackgroundWorkManager::Worker::do_work (this=0x20002c11e930, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/bgwork.cc:611
[6] #23 0x0000000012700e1c in Realm::BackgroundWorkThread::main_loop (this=0x19080e80) at /g/g15/yadav2/taco/legion/legion/runtime/realm/bgwork.cc:158
[6] #24 0x0000000012705fe0 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x19080e80) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.inl:97
[6] #25 0x000000001290b4bc in Realm::KernelThread::pthread_entry (data=0x19084e90) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.cc:774
[6] #26 0x00002000000f8cd4 in start_thread (arg=0x20002c11f8b0) at pthread_create.c:309
[6] #27 0x0000200005c67e14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

Another backtrace for this error:

[1] Thread 14 (Thread 0x200027dcf8b0 (LWP 72636)):
[1] #0  0x0000200005c1e4e8 in waitpid () at ../sysdeps/unix/syscall-template.S:81
[1] #1  0x0000200005b8fd3c in do_system (line=0x13ea56a0 <cmd.7513> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_MnRejt '/g/g15/yadav2/taco/build/bin/cannonMM-cuda' 72423") at ../sysdeps/posix/system.c:148
[1] #2  0x0000200005b90424 in __libc_system (line=0x13ea56a0 <cmd.7513> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_MnRejt '/g/g15/yadav2/taco/build/bin/cannonMM-cuda' 72423") at ../sysdeps/posix/system.c:189
[1] #3  0x0000200000105348 in system (line=<optimized out>) at pt-system.c:28
[1] #4  0x00000000112ae99c in gasneti_system_redirected ()
[1] #5  0x00000000131858c8 in gasneti_bt_gdb ()
[1] #6  0x000000001318ad98 in gasneti_print_backtrace ()
[1] #7  0x00000000112afb98 in gasneti_defaultSignalHandler ()
[1] #8  <signal handler called>
[1] #9  0x0000200005b7fcb0 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
[1] #10 0x0000200005b8200c in __GI_abort () at abort.c:90
[1] #11 0x000000001312ef08 in Realm::verify_packet_crc (arg0=4, header_base=0x20001b310240, header_size=48, payload_base=0x0, payload_size=0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:118
[1] #12 0x000000001313fad0 in Realm::GASNetEXInternal::handle_short (this=0x13f69940, srcrank=0, arg0=4, hdr=0x20001b310240, hdr_bytes=48) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:4253
[1] #13 0x0000000013140940 in Realm::GASNetEXInternal::handle_batch (this=0x13f69940, srcrank=0, arg0=16, data=0x20001b310008, data_bytes=1248, comps=0x200027dcaae8) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:4465
[1] #14 0x000000001315398c in Realm::handle_request_batch (token=0x14b3c600, buf=0x20001b310008, nbytes=1248, arg0=16) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_handlers.cc:1640
[1] #15 0x0000000013173420 in gasnetc_processPacket ()
[1] #16 0x0000000013176204 in gasnetc_poll_rcv_all ()
[1] #17 0x0000000013155368 in gasnetc_am_sema_poll ()
[1] #18 0x00000000131554cc in gasnetc_am_get_credit ()
[1] #19 0x00000000131654dc in gasnetc_AM_PrepareRequestMedium ()
[1] #20 0x000000001315382c in Realm::GASNetEXHandlers::prepare_request_batch (src_ep=0x14af7b20, tgt_rank=4, tgt_ep_index=0, data=0x20009c742d00, min_data_bytes=80, max_data_bytes=1280, lc_opt=0x200027dcd4a8, flags=0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_handlers.cc:1607
[1] #21 0x00000000131355fc in Realm::XmitSrcDestPair::push_packets (this=0x198be7d0, immediate_mode=false, work_until=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1735
[1] #22 0x00000000131385cc in Realm::GASNetEXPoller::do_work (this=0x13f699c0, work_until=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2417
[1] #23 0x0000000012703b74 in Realm::BackgroundWorkManager::Worker::do_work (this=0x200027dce930, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/bgwork.cc:611
[1] #24 0x0000000012700e1c in Realm::BackgroundWorkThread::main_loop (this=0x1a3510e0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/bgwork.cc:158
[1] #25 0x0000000012705fe0 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x1a3510e0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.inl:97
[1] #26 0x000000001290b4bc in Realm::KernelThread::pthread_entry (data=0x1a355390) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.cc:774
[1] #27 0x00002000000f8cd4 in start_thread (arg=0x200027dcf8b0) at pthread_create.c:309
[1] #28 0x0000200005c67e14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

@rohany
Contributor Author

rohany commented Jul 5, 2021

Here's a backtrace for the initial assertion failure:

[2] Thread 11 (Thread 0x20002c24f8b0 (LWP 172385)):
[2] #0  0x0000200005c1e4e8 in waitpid () at ../sysdeps/unix/syscall-template.S:81
[2] #1  0x0000200005b8fd3c in do_system (line=0x13ea56a0 <cmd.7513> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_7r4cXf '/g/g15/yadav2/taco/build/bin/cannonMM-cuda' 172167") at ../sysdeps/posix/system.c:148
[2] #2  0x0000200005b90424 in __libc_system (line=0x13ea56a0 <cmd.7513> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_7r4cXf '/g/g15/yadav2/taco/build/bin/cannonMM-cuda' 172167") at ../sysdeps/posix/system.c:189
[2] #3  0x0000200000105348 in system (line=<optimized out>) at pt-system.c:28
[2] #4  0x00000000112ae99c in gasneti_system_redirected ()
[2] #5  0x00000000131858c8 in gasneti_bt_gdb ()
[2] #6  0x000000001318ad98 in gasneti_print_backtrace ()
[2] #7  0x00000000112afb98 in gasneti_defaultSignalHandler ()
[2] #8  <signal handler called>
[2] #9  0x0000200005b7fcb0 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
[2] #10 0x0000200005b8200c in __GI_abort () at abort.c:90
[2] #11 0x0000200005b757d4 in __assert_fail_base (fmt=0x200005cdb6d0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x13420eb0 "num_poisoned_generations.load() == 0", file=0x134207f8 "/g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc", line=<optimized out>, function=<optimized out>) at assert.c:92
[2] #12 0x0000200005b758c4 in __GI___assert_fail (assertion=0x13420eb0 "num_poisoned_generations.load() == 0", file=0x134207f8 "/g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc", line=<optimized out>, function=0x13423260 <Realm::GenEventImpl::process_update(unsigned int, unsigned int const*, int, Realm::TimeLimit)::__PRETTY_FUNCTION__> "void Realm::GenEventImpl::process_update(Realm::EventImpl::gen_t, const gen_t*, int, Realm::TimeLimit)") at assert.c:101
[2] #13 0x0000000012710c78 in Realm::GenEventImpl::process_update (this=0x2034ac26ad60, current_gen=6, new_poisoned_generations=0x0, new_poisoned_count=0, work_until=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc:1601
[2] #14 0x0000000012711504 in Realm::EventUpdateMessage::handle_message (sender=4, args=..., data=0x0, datalen=0, work_until=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc:1689
[2] #15 0x000000001271bf54 in Realm::HandlerWrappers::wrap_handler<Realm::EventUpdateMessage, Realm::EventUpdateMessage::handle_message> (sender=4, header=0x15cbebe0, payload=0x0, payload_size=0, work_until=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/activemsg.inl:588
[2] #16 0x0000000012945a38 in Realm::IncomingMessageManager::do_work (this=0x15b24a00, work_until=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/activemsg.cc:744
[2] #17 0x0000000012703b74 in Realm::BackgroundWorkManager::Worker::do_work (this=0x20002c24e930, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/bgwork.cc:611
[2] #18 0x0000000012700e1c in Realm::BackgroundWorkThread::main_loop (this=0x190ca6b0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/bgwork.cc:158
[2] #19 0x0000000012705fe0 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x190ca6b0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.inl:97
[2] #20 0x000000001290b4bc in Realm::KernelThread::pthread_entry (data=0x19091640) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.cc:774
[2] #21 0x00002000000f8cd4 in start_thread (arg=0x20002c24f8b0) at pthread_create.c:309
[2] #22 0x0000200005c67e14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

@lightsighter
Contributor

I suspect that they are variations of the same error. A CRC checksum mismatch indicates packet data corruption, which could manifest in different ways. Since the original error also manifested in an active message handler, that message's packet was likely corrupted as well.
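
To illustrate why a checksum mismatch points at corruption, here is a minimal sketch of a per-packet CRC check. It is not Realm's `verify_packet_crc`: the function names, the use of `zlib.crc32`, and the byte layout (a 4-byte `arg0`, then header, then payload) are all assumptions made for the example.

```python
import zlib


def packet_crc(arg0, header, payload=b""):
    """CRC-32 over arg0 (4 bytes, little-endian), then header, then payload.

    ASSUMPTION: this field order and encoding are illustrative only.
    """
    crc = zlib.crc32(arg0.to_bytes(4, "little"))
    crc = zlib.crc32(header, crc)
    return zlib.crc32(payload, crc)


def verify_packet(arg0, header, payload, expected_crc):
    """Return True when the recomputed CRC matches the one sent with the packet."""
    return packet_crc(arg0, header, payload) == expected_crc


header = bytes(range(48))          # stand-in for a 48-byte message header
sent_crc = packet_crc(4, header)   # sender computes the CRC before transmit

corrupted = bytearray(header)
corrupted[10] ^= 0x01              # flip a single bit "in flight"

print(verify_packet(4, header, b"", sent_crc))            # intact packet
print(verify_packet(4, bytes(corrupted), b"", sent_crc))  # corrupted packet
```

A receiver that skips this check would hand the corrupted header straight to a message handler, where the damage could surface as an unrelated-looking assertion.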

@rohany
Contributor Author

rohany commented Jul 6, 2021

Another stack trace, which doesn't appear in a debug build (or at least I haven't been able to reproduce it there):

[4] Thread 20 (Thread 0x200027b6f8b0 (LWP 163876)):
[4] #0  0x0000200005c1e4e8 in waitpid () at ../sysdeps/unix/syscall-template.S:81
[4] #1  0x0000200005b8fd3c in do_system (line=0x11bb5040 <cmd.7513> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_JcbO9L '/g/g15/yadav2/taco/build/bin/cannonMM-cuda' 163644") at ../sysdeps/posix/system.c:148
[4] #2  0x0000200005b90424 in __libc_system (line=0x11bb5040 <cmd.7513> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_JcbO9L '/g/g15/yadav2/taco/build/bin/cannonMM-cuda' 163644") at ../sysdeps/posix/system.c:189
[4] #3  0x0000200000105348 in system (line=<optimized out>) at pt-system.c:28
[4] #4  0x000000001032c8f8 in gasneti_system_redirected ()
[4] #5  0x0000000011475f68 in gasneti_bt_gdb ()
[4] #6  0x000000001147b438 in gasneti_print_backtrace ()
[4] #7  0x000000001032daf4 in gasneti_defaultSignalHandler ()
[4] #8  <signal handler called>
[4] #9  0x0000200005b7fcb0 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
[4] #10 0x0000200005b8200c in __GI_abort () at abort.c:90
[4] #11 0x0000200005b757d4 in __assert_fail_base (fmt=0x200005cdb6d0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x11771b38 "prev_count > 0", file=0x11771858 "/g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc", line=<optimized out>, function=<optimized out>) at assert.c:92
[4] #12 0x0000200005b758c4 in __GI___assert_fail (assertion=0x11771b38 "prev_count > 0", file=0x11771858 "/g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc", line=<optimized out>, function=0x11770cf0 <Realm::PendingCompletion::invoke_remote_completions()::__PRETTY_FUNCTION__> "bool Realm::PendingCompletion::invoke_remote_completions()") at assert.c:101
[4] #13 0x000000001142deac in Realm::PendingCompletion::invoke_remote_completions() ()
[4] #14 0x000000001142e6cc in Realm::PendingCompletionManager::invoke_completions(Realm::PendingCompletion*, bool, bool) ()
[4] #15 0x00000000114350dc in Realm::GASNetEXInternal::handle_completion_reply(unsigned int, int const*, unsigned long) ()
[4] #16 0x00000000114402ac in Realm::handle_completion_reply(gasneti_token_s*, int const*, unsigned long) ()
[4] #17 0x0000000011440364 in Realm::handle_completion_reply_16(gasneti_token_s*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int) ()
[4] #18 0x0000000011464014 in gasnetc_processPacket ()
[4] #19 0x00000000114668a4 in gasnetc_poll_rcv_all ()
[4] #20 0x0000000011445a08 in gasnetc_am_sema_poll ()
[4] #21 0x0000000011445b6c in gasnetc_am_get_credit ()
[4] #22 0x0000000011452e24 in gasnetc_AMRequestShortM ()
[4] #23 0x00000000114423b4 in Realm::GASNetEXHandlers::send_completion_reply(gasneti_endpoint_s*, unsigned int, unsigned short, int const*, unsigned long, unsigned int) ()
[4] #24 0x0000000011435fc4 in Realm::XmitSrcDestPair::push_packets(bool, Realm::TimeLimit) ()
[4] #25 0x00000000114385b0 in Realm::GASNetEXPoller::do_work(Realm::TimeLimit) ()
[4] #26 0x0000000010d78850 in Realm::BackgroundWorkManager::Worker::do_work(long long, Realm::atomic<bool>*) ()
[4] #27 0x0000000010d78ef0 in Realm::BackgroundWorkThread::main_loop() ()
[4] #28 0x0000000010e5fc84 in Realm::KernelThread::pthread_entry(void*) ()
[4] #29 0x00002000000f8cd4 in start_thread (arg=0x200027b6f8b0) at pthread_create.c:309
[4] #30 0x0000200005c67e14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

@streichler streichler self-assigned this Jul 12, 2021
@streichler
Contributor

@rohany can you try running with -gex:batch 0 on the command line and see if the behavior is any different?
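
For anyone scripting the comparison runs, here is a rough sketch of assembling the command line and environment. The `-gex:batch 0` flag and `GASNET_BACKTRACE=1` are taken from this thread; `build_launch` and its parameters are hypothetical, and the site's job launcher is intentionally not modeled.

```python
import os


def build_launch(binary, app_args=(), batching=True, gasnet_backtrace=False):
    """Return (argv, env) for one run; the job launcher itself is not modeled."""
    argv = [binary, *app_args]
    if not batching:
        # Flag exactly as written in the comment above.
        argv += ["-gex:batch", "0"]
    env = dict(os.environ)
    if gasnet_backtrace:
        # Ask GASNet to print a backtrace when it catches a fatal signal.
        env["GASNET_BACKTRACE"] = "1"
    return argv, env


argv, env = build_launch("bin/cannonMM-cuda", batching=False, gasnet_backtrace=True)
print(" ".join(argv))
```

The returned pair can be passed to `subprocess.run(argv, env=env)` or prefixed with whatever launcher the cluster requires.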

@rohany
Contributor Author

rohany commented Jul 12, 2021

Running with -gex:batch 0 seems to have fixed the packet-corruption errors, but I still see errors like this at high node counts (64):

[46] Thread 24 (Thread 0x20002cc6f8b0 (LWP 178623)):
[46] #0  0x000020000612e4e8 in waitpid () at ../sysdeps/unix/syscall-template.S:81
[46] #1  0x000020000609fd3c in do_system (line=0x11ba5040 <cmd.7513> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_DUUBIR '/g/g15/yadav2/taco/build/bin/cannonMM-cuda' 178390") at ../sysdeps/posix/system.c:148
[46] #2  0x00002000060a0424 in __libc_system (line=0x11ba5040 <cmd.7513> "/usr/bin/gdb -nx -batch -x /var/tmp/gasnet_DUUBIR '/g/g15/yadav2/taco/build/bin/cannonMM-cuda' 178390") at ../sysdeps/posix/system.c:189
[46] #3  0x0000200000105348 in system (line=<optimized out>) at pt-system.c:28
[46] #4  0x000000001032c978 in gasneti_system_redirected ()
[46] #5  0x0000000011476048 in gasneti_bt_gdb ()
[46] #6  0x000000001147b518 in gasneti_print_backtrace ()
[46] #7  0x000000001032db74 in gasneti_defaultSignalHandler ()
[46] #8  <signal handler called>
[46] #9  0x000020000608fcb0 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
[46] #10 0x000020000609200c in __GI_abort () at abort.c:90
[46] #11 0x00002000060857d4 in __assert_fail_base (fmt=0x2000061eb6d0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x1166ae00 "num_poisoned_generations.load() == 0", file=0x1166a4f0 "/g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc", line=<optimized out>, function=<optimized out>) at assert.c:92
[46] #12 0x00002000060858c4 in __GI___assert_fail (assertion=0x1166ae00 "num_poisoned_generations.load() == 0", file=0x1166a4f0 "/g/g15/yadav2/taco/legion/legion/runtime/realm/event_impl.cc", line=<optimized out>, function=0x11669f08 <Realm::GenEventImpl::process_update(unsigned int, unsigned int const*, int, Realm::TimeLimit)::__PRETTY_FUNCTION__> "void Realm::GenEventImpl::process_update(Realm::EventImpl::gen_t, const gen_t*, int, Realm::TimeLimit)") at assert.c:101
[46] #13 0x0000000010d889d8 in Realm::GenEventImpl::process_update(unsigned int, unsigned int const*, int, Realm::TimeLimit) ()
[46] #14 0x0000000010d88c00 in Realm::EventUpdateMessage::handle_message(int, Realm::EventUpdateMessage const&, void const*, unsigned long, Realm::TimeLimit) ()
[46] #15 0x0000000010e7d7e4 in Realm::IncomingMessageManager::do_work(Realm::TimeLimit) ()
[46] #16 0x0000000010d78930 in Realm::BackgroundWorkManager::Worker::do_work(long long, Realm::atomic<bool>*) ()
[46] #17 0x0000000010d78fd0 in Realm::BackgroundWorkThread::main_loop() ()
[46] #18 0x0000000010e5fd64 in Realm::KernelThread::pthread_entry(void*) ()
[46] #19 0x00002000000f8cd4 in start_thread (arg=0x20002cc6f8b0) at pthread_create.c:309
[46] #20 0x0000200006177e14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

@rohany
Contributor Author

rohany commented Jul 28, 2021

I have also seen this error:

cosma-cuda: /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:3276: Realm::PreparedMessage* Realm::GASNetEXInternal::prepare_message(gex_Rank_t, gex_EP_Index_t, short unsigned int, void*&, size_t, void*&, size_t, uintptr_t): Assertion `0' failed.
cosma-cuda: /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:3276: Realm::PreparedMessage* Realm::GASNetEXInternal::prepare_message(gex_Rank_t, gex_EP_Index_t, short unsigned int, void*&, size_t, void*&, size_t, uintptr_t): Assertion `0' failed.
*** Caught a fatal signal (proc 2): SIGABRT(6)

I'm trying to get a backtrace, will update if I can get one.

@rohany
Contributor Author

rohany commented Aug 11, 2021

I ran into this error again in a different application. Here's the backtrace:

#7  0x00002000060858c4 in __GI___assert_fail (assertion=0x13560808 "0", file=0x13560798 "/g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc", line=<optimized out>,
    function=0x13562928 <Realm::GASNetEXInternal::prepare_message(unsigned int, unsigned short, unsigned short, void*&, unsigned long, void*&, unsigned long, unsigned long)::__PRETTY_FUNCTION__> "Realm::PreparedMessage* Realm::GASNetEXInternal::prepare_message(gex_Rank_t, gex_EP_Index_t, short unsigned int, void*&, size_t, void*&, size_t, uintptr_t)") at assert.c:101
#8  0x000000001313b4b4 in Realm::GASNetEXInternal::prepare_message (this=0x13f65640, target=0, target_ep_index=4, msgid=372, header_base=@0x20002ad2d5b8: 0x20002ad2d630, header_size=28, payload_base=@0x20002ad2d5c0: 0x0, payload_size=16384, dest_payload_addr=35387500962064)
    at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_internal.cc:3276
#9  0x0000000013129578 in Realm::GASNetEXMessageImpl::GASNetEXMessageImpl (this=0x20002ad2d5b0, _internal=0x13f65640, _target=0, _msgid=372, _header_size=28, _max_payload_size=16384, _src_payload_addr=0x0, _src_payload_lines=0, _src_payload_line_stride=0,
    _dest_payload_addr=35387500962064, _dest_ep_index=4) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_module.cc:236
#10 0x000000001312ba60 in Realm::GASNetEXModule::create_active_message_impl (this=0x13f655b0, target=0, msgid=372, header_size=28, max_payload_size=16384, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, dest_payload_addr=..., storage_base=0x20002ad2d5b0,
    storage_size=256) at /g/g15/yadav2/taco/legion/legion/runtime/realm/gasnetex/gasnetex_module.cc:686
#11 0x00000000130b09f4 in Realm::Network::create_active_message_impl (target=0, msgid=372, header_size=24, max_payload_size=16384, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, dest_payload_addr=..., storage_base=0x20002ad2d5b0, storage_size=256)
    at /g/g15/yadav2/taco/legion/legion/runtime/realm/network.inl:135
#12 0x00000000130b9798 in Realm::ActiveMessage<Realm::RemoteWriteXferDes::Write1DMessage, 256ul>::init (this=0x20002ad2d590, _target=0, _max_payload_size=16384, _dest_payload_addr=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/activemsg.inl:87
#13 0x00000000130b45c8 in Realm::ActiveMessage<Realm::RemoteWriteXferDes::Write1DMessage, 256ul>::ActiveMessage (this=0x20002ad2d590, _target=0, _max_payload_size=16384, _dest_payload_addr=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/activemsg.inl:70
#14 0x00000000130a58b4 in Realm::RemoteWriteXferDes::progress_xd (this=0x2034a48880a0, channel=0x16b92a60, work_until=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/transfer/channel.cc:3910
#15 0x00000000130c7c9c in Realm::XDQueue<Realm::RemoteWriteChannel, Realm::RemoteWriteXferDes>::do_work (this=0x16b92a98, work_until=...) at /g/g15/yadav2/taco/legion/legion/runtime/realm/transfer/channel.inl:161
#16 0x0000000012700808 in Realm::BackgroundWorkManager::Worker::do_work (this=0x20002ad2e930, max_time_in_ns=-1, interrupt_flag=0x0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/bgwork.cc:611
#17 0x00000000126fdab0 in Realm::BackgroundWorkThread::main_loop (this=0x16ac65e0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/bgwork.cc:158
#18 0x0000000012702c74 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x16ac65e0) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.inl:97
#19 0x0000000012908150 in Realm::KernelThread::pthread_entry (data=0x16ac2260) at /g/g15/yadav2/taco/legion/legion/runtime/realm/threads.cc:774
#20 0x00002000000f8cd4 in start_thread (arg=0x20002ad2f8b0) at pthread_create.c:309
#21 0x0000200006177e14 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104

@streichler
Contributor

@rohany can you try cherry-picking this commit and see if the behavior with batching enabled changes (for better or for worse)?
https://gitlab.com/StanfordLegion/legion/-/commit/1791b66e0908bdc5c0e77fe3cc022bd17114fa5b

@rohany
Contributor Author

rohany commented Sep 20, 2021

These bugs look like they have been fixed! Closing this for now.

@rohany rohany closed this as completed Sep 20, 2021
@rohany
Contributor Author

rohany commented Sep 20, 2021

Well, it needs 58cc02f97e2e6d4e4d6f67c03454712304600b15 to land first, but it's basically fixed.
