cuda-gdb internal-error caused on google colab #5

Closed
sakaia opened this issue Jun 6, 2019 · 2 comments

Comments

sakaia commented Jun 6, 2019

On Google Colab, with GPU enabled.

!git clone http://github.com/pytorch/examples
!cuda-gdb python3

and typing

run examples/mnist/main.py

generates the following error.

cuda-gdb/7.12/gdb/block.c:456: internal-error: set_block_compunit_symtab: Assertion `gb->compunit_symtab == NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n

This is a bug, please report it.  For instructions, see:
<http://www.gnu.org/software/gdb/bugs/>.

cuda-gdb/7.12/gdb/block.c:456: internal-error: set_block_compunit_symtab: Assertion `gb->compunit_symtab == NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n

When I use gdb instead of cuda-gdb, nothing happened.

@maphysart

I met the same problem when trying to debug a CUDA C++ extension for PyTorch 1.1. Have you solved it? Thanks.

@andyljones

This is fixed in CUDA toolkit 10.1.243.

agontarek pushed a commit that referenced this issue Dec 3, 2022
A following patch adds an exists_non_stop_target call in the
target_terminal routines, and that surprisingly caused a weird
regression / GDB crash:

 $ make check RUNTESTFLAGS="--target_board=native-extended-gdbserver" TESTS="gdb.base/signest.exp"
 ...
 configuring for gdbserver local testing (extended-remote)
 Using src/gdb/testsuite/config/extended-gdbserver.exp as tool-and-target-specific interface file.
 Running src/gdb/testsuite/gdb.base/signest.exp ...
 ERROR: GDB process no longer exists

Debugging the core, we see infinite recursion:

 (top-gdb) bt 20
 #0  0x0000561d6a1bfeff in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #1  0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #2  0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #3  0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #4  0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #5  0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #6  0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #7  0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #8  0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #9  0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #10 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #11 0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #12 0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #13 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #14 0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #15 0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #16 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #17 0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #18 0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #19 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 (More stack frames follow...)

 (top-gdb) bt -30
 #157054 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #157055 0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #157056 0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #157057 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #157058 0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #157059 0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #157060 0x0000561d6a1bbc65 in frame_unwind_pc (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:970
 #157061 0x0000561d6a1bf54c in get_frame_pc (frame=0x561d6b19fa90) at src/gdb/frame.c:2625
 #157062 0x0000561d6a1bf63e in get_frame_address_in_block (this_frame=0x561d6b19fa90) at src/gdb/frame.c:2655
 #157063 0x0000561d6a0cae7f in dwarf2_frame_cache (this_frame=0x561d6b19fa90, this_cache=0x561d6b19faa8) at src/gdb/dwarf2/frame.c:1010
 #157064 0x0000561d6a0cb928 in dwarf2_frame_this_id (this_frame=0x561d6b19fa90, this_cache=0x561d6b19faa8, this_id=0x561d6b19faf0) at src/gdb/dwarf2/frame.c:1227
 #157065 0x0000561d6a1baf72 in compute_frame_id (fi=0x561d6b19fa90) at src/gdb/frame.c:588
 #157066 0x0000561d6a1bb16e in get_frame_id (fi=0x561d6b19fa90) at src/gdb/frame.c:636
 #157067 0x0000561d6a1bb224 in get_stack_frame_id (next_frame=0x561d6b19fa90) at src/gdb/frame.c:650
 #157068 0x0000561d6a26ecd0 in insert_hp_step_resume_breakpoint_at_frame (return_frame=0x561d6b19fa90) at src/gdb/infrun.c:7809
 #157069 0x0000561d6a26b88a in handle_signal_stop (ecs=0x7ffc67022830) at src/gdb/infrun.c:6428
 #157070 0x0000561d6a269d81 in handle_inferior_event (ecs=0x7ffc67022830) at src/gdb/infrun.c:5741
 #157071 0x0000561d6a265bd0 in fetch_inferior_event () at src/gdb/infrun.c:4120
 #157072 0x0000561d6a244c24 in inferior_event_handler (event_type=INF_REG_EVENT) at src/gdb/inf-loop.c:41
 #157073 0x0000561d6a435cc4 in remote_async_serial_handler (scb=0x561d6b4a8990, context=0x561d6b4a4c48) at src/gdb/remote.c:14403
 #157074 0x0000561d6a460bc5 in run_async_handler_and_reschedule (scb=0x561d6b4a8990) at src/gdb/ser-base.c:138
 #157075 0x0000561d6a460cae in fd_event (error=0, context=0x561d6b4a8990) at src/gdb/ser-base.c:189
 #157076 0x0000561d6a76a191 in handle_file_event (file_ptr=0x561d6b233ae0, ready_mask=1) at src/gdbsupport/event-loop.cc:575
 #157077 0x0000561d6a76a743 in gdb_wait_for_event (block=1) at src/gdbsupport/event-loop.cc:701
 #157078 0x0000561d6a7694ee in gdb_do_one_event () at src/gdbsupport/event-loop.cc:237
 #157079 0x0000561d6a2df16b in start_event_loop () at src/gdb/main.c:421
 #157080 0x0000561d6a2df2b6 in captured_command_loop () at src/gdb/main.c:481
 #157081 0x0000561d6a2e0d16 in captured_main (data=0x7ffc67022bd0) at src/gdb/main.c:1353
 #157082 0x0000561d6a2e0da8 in gdb_main (args=0x7ffc67022bd0) at src/gdb/main.c:1370
 #157083 0x0000561d69eb3d82 in main (argc=13, argv=0x7ffc67022cf8, envp=0x7ffc67022d68) at src/gdb/gdb.c:33

This was caused by exists_non_stop_target flushing the frame cache via
scoped_restore_current_thread/switch_to_thread, while we're in the
middle of unwinding.

Fix this by making exists_non_stop_target only switch the inferior,
like done in target_pass_ctrlc.
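
In sketch form, the fixed helper looks roughly like this (illustrative
only, not the verbatim patch; it uses the helpers named in the
ChangeLog below):

  /* Sketch: switch only the inferior, never the thread, so the frame
     cache is not flushed while an unwind may be in progress.  */
  static bool
  exists_non_stop_target ()
  {
    if (target_is_non_stop_p ())
      return true;

    scoped_restore_current_inferior restore_inferior;
    for (inferior *inf : all_inferiors ())
      {
        set_current_inferior (inf);
        if (target_is_non_stop_p ())
          return true;
      }

    return false;
  }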

The annota1.exp change is necessary because we'd get a regression
otherwise:

  @@ -238,8 +238,6 @@ Continuing.

   \032\032breakpoints-invalid

  -\032\032frames-invalid
  -
   \032\032breakpoint 3

   Breakpoint 3,
  @@ -276,7 +274,7 @@ printf.c
   \032\032pre-prompt
   (gdb)
   \032\032prompt
  -PASS: gdb.base/annota1.exp: continue to printf
  +FAIL: gdb.base/annota1.exp: continue to printf

... because this patch avoids flushing the frame cache, which is what
led to that missing frames-invalid.  We still need to match frames-invalid
because against gdbserver + "maint set target non-stop on", some other
code path flushes the frame cache resulting in the annotation being
emitted anyway.

gdb/ChangeLog:
yyyy-mm-dd  Pedro Alves  <pedro@palves.net>

	* target.c (exists_non_stop_target): Use
	scoped_restore_current_inferior and set_current_inferior instead
	of scoped_restore_current_thread / switch_to_inferior_no_thread.

Change-Id: I8402483ee755e64e54d8b7c4a67c177557f569bd
agontarek pushed a commit that referenced this issue Dec 3, 2022
After loading a core file, you're supposed to be able to use "detach"
to unload the core file.  That unfortunately regressed starting with
GDB 11, with these commits:

 1192f12 - gdb: generalize commit_resume, avoid commit-resuming when threads have pending statuses
 408f668 - detach in all-stop with threads running

resulting in a GDB crash:

 ...
 Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
 0x0000555555e842bf in maybe_set_commit_resumed_all_targets () at ../../src/gdb/infrun.c:2899
 2899          if (proc_target->commit_resumed_state)
 (top-gdb) bt
 #0  0x0000555555e842bf in maybe_set_commit_resumed_all_targets () at ../../src/gdb/infrun.c:2899
 #1  0x0000555555e848bf in scoped_disable_commit_resumed::reset (this=0x7fffffffd440) at ../../src/gdb/infrun.c:3023
 #2  0x0000555555e84a0c in scoped_disable_commit_resumed::reset_and_commit (this=0x7fffffffd440) at ../../src/gdb/infrun.c:3049
 #3  0x0000555555e739cd in detach_command (args=0x0, from_tty=1) at ../../src/gdb/infcmd.c:2791
 #4  0x0000555555c0ba46 in do_simple_func (args=0x0, from_tty=1, c=0x55555662a600) at ../../src/gdb/cli/cli-decode.c:95
 #5  0x0000555555c112b0 in cmd_func (cmd=0x55555662a600, args=0x0, from_tty=1) at ../../src/gdb/cli/cli-decode.c:2514
 #6  0x0000555556173b1f in execute_command (p=0x5555565c5916 "", from_tty=1) at ../../src/gdb/top.c:699

The code that crashes looks like:

 static void
 maybe_set_commit_resumed_all_targets ()
 {
   scoped_restore_current_thread restore_thread;

   for (inferior *inf : all_non_exited_inferiors ())
     {
       process_stratum_target *proc_target = inf->process_target ();

       if (proc_target->commit_resumed_state)
           ^^^^^^^^^^^

With 'proc_target' above being null.  all_non_exited_inferiors filters
out inferiors that have pid==0.  We get here at the end of
detach_command, after core_target::detach has already run, at which
point the inferior _should_ have pid==0 and no process target.  It is
clear it no longer has a process target, but it still has a pid!=0
somehow.

The reason the inferior still has pid!=0, is that core_target::detach
just unpushes, and relies on core_target::close to actually do the
getting rid of the core and exiting the inferior.  The problem with
that is that detach_command grabs an extra strong reference to the
process stratum target, so the unpush_target inside
core_target::detach doesn't actually result in a call to
core_target::close.

Fix this by moving the cleanup of the core inferior to a shared
routine called by both core_target::close and core_target::detach.  We
still need to clean up the inferior from within core_target::close
because there are paths to it that want to get rid of the core without
going through detach, e.g., "core-file" -> "run".
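
In sketch form, the shape of the fix is roughly the following
(illustrative only, not the verbatim patch; the helper name and the
exact cleanup calls are assumed):

  /* Sketch: one cleanup routine, reached from both detach and close,
     so the inferior loses its pid even when the extra target
     reference held by detach_command delays close ().  The helper
     name is hypothetical.  */
  void
  core_target::clear_core ()
  {
    if (inferior_ptid != null_ptid)
      exit_inferior (current_inferior ());  /* pid goes back to 0 */
  }

  void
  core_target::detach (inferior *inf, int from_tty)
  {
    clear_core ();              /* don't rely on close () running now */
    inf->unpush_target (this);  /* may or may not reach close ()      */
  }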

This commit includes a new test added to gdb.base/corefile.exp to
cover the "core-file core" -> "detach" scenario.

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29275

Change-Id: Ic42bdd03182166b19f598428b0dbc2ce6f67c893
agontarek pushed a commit that referenced this issue Dec 3, 2022
New in this version: add a dedicated test.

When I do this:

    $ ./gdb -nx --data-directory=data-directory -q \
        /bin/sleep \
	-ex "maint set target-non-stop on" \
	-ex "tar ext :1234" \
	-ex "set remote exec-file /bin/sleep" \
	-ex "run 1231 &" \
	-ex add-inferior \
	-ex "inferior 2"
    Reading symbols from /bin/sleep...
    (No debugging symbols found in /bin/sleep)
    Remote debugging using :1234
    Starting program: /bin/sleep 1231
    Reading /lib64/ld-linux-x86-64.so.2 from remote target...
    warning: File transfers from remote targets can be slow. Use "set sysroot" to access files locally instead.
    Reading /lib64/ld-linux-x86-64.so.2 from remote target...
    Reading /usr/lib/debug/.build-id/a6/7a1408f18db3576757eea210d07ba3fc560dff.debug from remote target...
    [New inferior 2]
    Added inferior 2 on connection 1 (extended-remote :1234)
    [Switching to inferior 2 [<null>] (<noexec>)]
    (gdb) Reading /lib/x86_64-linux-gnu/libc.so.6 from remote target...
    attach 3659848
    Attaching to process 3659848
    /home/smarchi/src/binutils-gdb/gdb/thread.c:85: internal-error: inferior_thread: Assertion `current_thread_ != nullptr' failed.

Note the "attach" command just above.  When doing it on the command-line
with a -ex switch, the bug doesn't trigger.

The internal error of GDB is actually caused by GDBserver crashing, and
the error recovery of GDB is not on point.  This patch aims to fix just
the GDBserver crash, not the GDB problem.

GDBserver crashes with a segfault here:

    (gdb) bt
    #0  0x00005555557fb3f4 in find_one_thread (ptid=...) at /home/smarchi/src/binutils-gdb/gdbserver/thread-db.cc:177
    #1  0x00005555557fd5cf in thread_db_thread_handle (ptid=<error reading variable: Cannot access memory at address 0xffffffffffffffa0>, handle=0x7fffffffc400, handle_len=0x7fffffffc3f0)
        at /home/smarchi/src/binutils-gdb/gdbserver/thread-db.cc:461
    #2  0x000055555578a0b6 in linux_process_target::thread_handle (this=0x5555558a64c0 <the_x86_target>, ptid=<error reading variable: Cannot access memory at address 0xffffffffffffffa0>, handle=0x7fffffffc400,
        handle_len=0x7fffffffc3f0) at /home/smarchi/src/binutils-gdb/gdbserver/linux-low.cc:6905
    #3  0x00005555556dfcc6 in handle_qxfer_threads_worker (thread=0x60b000000510, buffer=0x7fffffffc8a0) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:1645
    #4  0x00005555556e00e6 in operator() (__closure=0x7fffffffc5e0, thread=0x60b000000510) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:1696
    #5  0x00005555556f54be in for_each_thread<handle_qxfer_threads_proper(buffer*)::<lambda(thread_info*)> >(struct {...}) (func=...) at /home/smarchi/src/binutils-gdb/gdbserver/gdbthread.h:159
    #6  0x00005555556e0242 in handle_qxfer_threads_proper (buffer=0x7fffffffc8a0) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:1694
    #7  0x00005555556e04ba in handle_qxfer_threads (annex=0x629000000213 "", readbuf=0x621000019100 '\276' <repeats 200 times>..., writebuf=0x0, offset=0, len=4097)
        at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:1732
    #8  0x00005555556e1989 in handle_qxfer (own_buf=0x629000000200 "qXfer:threads", packet_len=26, new_packet_len_p=0x7fffffffd630) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:2045
    #9  0x00005555556e720a in handle_query (own_buf=0x629000000200 "qXfer:threads", packet_len=26, new_packet_len_p=0x7fffffffd630) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:2685
    #10 0x00005555556f1a01 in process_serial_event () at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:4176
    #11 0x00005555556f4457 in handle_serial_event (err=0, client_data=0x0) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:4514
    #12 0x0000555555820f56 in handle_file_event (file_ptr=0x607000000250, ready_mask=1) at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:573
    #13 0x0000555555821895 in gdb_wait_for_event (block=1) at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:694
    #14 0x000055555581f533 in gdb_do_one_event (mstimeout=-1) at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:264
    #15 0x00005555556ec9fb in start_event_loop () at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:3512
    #16 0x00005555556f0769 in captured_main (argc=4, argv=0x7fffffffe0d8) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:3992
    #17 0x00005555556f0e3f in main (argc=4, argv=0x7fffffffe0d8) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:4078

The reason is a wrong current process when find_one_thread is called.
The current process is the 2nd one, which was just attached.  It does
not yet have thread_db data (proc->priv->thread_db is nullptr).  As we
iterate over all threads of all processes to fulfill the qxfer:threads:read
request, we get to a thread of process 1 for which we haven't read
thread_db information yet (lwp_info::thread_known is false), so we get
into find_one_thread.  find_one_thread uses
`current_process ()->priv->thread_db`, assuming the current process
matches the ptid passed as a parameter, which is wrong.  A segfault
happens when trying to dereference that thread_db pointer.

Fix this by making find_one_thread not assume what the current process /
current thread is.  If it needs to call into libthread_db, which we know
will try to read memory from the current process, then temporarily set
the current process.

In the case where the thread is already known and we return early, we
don't need to switch processes.
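
The shape of that change, in sketch form (illustrative only; the
switching helpers are assumed to behave as their names suggest):

  /* Sketch: decide from PTID which process libthread_db will read
     memory from, instead of trusting the caller's current process.  */
  static int
  find_one_thread (ptid_t ptid)
  {
    thread_info *thread = find_thread_ptid (ptid);
    lwp_info *lwp = get_thread_lwp (thread);
    if (lwp->thread_known)
      return 1;                    /* already known: no switch needed */

    /* Temporarily make the thread's own process current, since
       libthread_db reads memory through the current process.  */
    scoped_restore_current_thread restore_thread;   /* assumed helper */
    switch_to_thread (thread);

    /* ... td_ta_map_lwp2thr () etc. now consult the right process.  */
    return 1;
  }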

Add a test to reproduce this specific situation.

Change-Id: I09b00883e8b73b7e5f89d0f47cb4e9c0f3d6caaa
Approved-By: Andrew Burgess <aburgess@redhat.com>
agontarek pushed a commit that referenced this issue Dec 3, 2022
New in this version:

 - Better comment in target_kill
 - Uncomment line in test to avoid hanging when exiting, when testing on
   native-extended-gdbserver

PR 28275 shows that doing a sequence of:

 - Run inferior in background (run &)
 - kill that inferior
 - Run again

We get into this assertion:

    /home/smarchi/src/binutils-gdb/gdb/target.c:2590: internal-error: target_wait: Assertion `!proc_target->commit_resumed_state' failed.

    #0  internal_error_loc (file=0x5606b344e740 "/home/smarchi/src/binutils-gdb/gdb/target.c", line=2590, fmt=0x5606b344d6a0 "%s: Assertion `%s' failed.") at /home/smarchi/src/binutils-gdb/gdbsupport/errors.cc:54
    #1  0x00005606b6296475 in target_wait (ptid=..., status=0x7fffb9390630, options=...) at /home/smarchi/src/binutils-gdb/gdb/target.c:2590
    #2  0x00005606b5767a98 in startup_inferior (proc_target=0x5606bfccb2a0 <the_amd64_linux_nat_target>, pid=3884857, ntraps=1, last_waitstatus=0x0, last_ptid=0x0) at /home/smarchi/src/binutils-gdb/gdb/nat/fork-inferior.c:482
    #3  0x00005606b4e6c9c5 in gdb_startup_inferior (pid=3884857, num_traps=1) at /home/smarchi/src/binutils-gdb/gdb/fork-child.c:132
    #4  0x00005606b50f14a5 in inf_ptrace_target::create_inferior (this=0x5606bfccb2a0 <the_amd64_linux_nat_target>, exec_file=0x604000039f50 "/home/smarchi/build/binutils-gdb/gdb/test", allargs="", env=0x61500000a580, from_tty=1)
        at /home/smarchi/src/binutils-gdb/gdb/inf-ptrace.c:105
    #5  0x00005606b53b6d23 in linux_nat_target::create_inferior (this=0x5606bfccb2a0 <the_amd64_linux_nat_target>, exec_file=0x604000039f50 "/home/smarchi/build/binutils-gdb/gdb/test", allargs="", env=0x61500000a580, from_tty=1)
        at /home/smarchi/src/binutils-gdb/gdb/linux-nat.c:978
    #6  0x00005606b512b79b in run_command_1 (args=0x0, from_tty=1, run_how=RUN_NORMAL) at /home/smarchi/src/binutils-gdb/gdb/infcmd.c:468
    #7  0x00005606b512c236 in run_command (args=0x0, from_tty=1) at /home/smarchi/src/binutils-gdb/gdb/infcmd.c:526

When running the kill command, commit_resumed_state for the
process_stratum_target (linux-nat, here) is true.  After the kill, when
there are no more threads, commit_resumed_state is still true, as
nothing touches this flag during the kill operation.  During the
subsequent run command, run_command_1 does:

    scoped_disable_commit_resumed disable_commit_resumed ("running");

We would think that this would clear the commit_resumed_state flag of
our native target, but that's not the case, because
scoped_disable_commit_resumed iterates on non-exited inferiors in order
to find active process targets.  And after the kill, the inferior is
exited, and the native target was unpushed from it anyway.  So
scoped_disable_commit_resumed doesn't touch the commit_resumed_state
flag of the native target, it stays true.  When reaching target_wait, in
startup_inferior (to consume the initial expected stop events while the
inferior is starting up and working its way through the shell),
commit_resumed_state is true, breaking the contract saying that
commit_resumed_state is always false when calling the targets' wait
method.

(note: to be correct, I think that startup_inferior should toggle
commit_resumed between the target_wait and target_resume calls, but I'll
ignore that for now)

I can see multiple ways to fix this.  In the end, we need
commit_resumed_state to be cleared by the time we get to that
target_wait.  It could be done at the end of the kill command, or at the
beginning of the run command.

To keep things in a coherent state, I'd like to make it so that after
the kill command, when the target is left with no threads, its
commit_resumed_state flag is left to false.  This way, we can keep
working with the assumption that a target with no threads (and therefore
no running threads) has commit_resumed_state == false.

Do this by adding a scoped_disable_commit_resumed in target_kill.  It
clears the target's commit_resumed_state on entry, and leaves it false
if the target does not have any resumed thread on exit.  That means,
even if the target has another inferior with stopped threads,
commit_resumed_state will be left to false, which makes sense.
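
In sketch form, the change amounts to roughly this (illustrative only,
not the verbatim patch):

  /* Sketch: wrap the kill in the same guard run_command_1 uses.  On
     entry it forces commit-resumed off; on exit it only re-enables it
     for targets that still have resumed threads, so a target whose
     last thread was just killed ends up with
     commit_resumed_state == false.  */
  void
  target_kill ()
  {
    scoped_disable_commit_resumed disable_commit_resumed ("killing");
    current_inferior ()->top_target ()->kill ();
  }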

Add a test that tries to cover various combinations of actions done
while an inferior is running (and therefore while commit_resumed_state
is true on the process target).

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=28275
Change-Id: I8e6fe6dc1f475055921520e58cab68024039a1e9
Approved-By: Andrew Burgess <aburgess@redhat.com>
agontarek pushed a commit that referenced this issue Dec 3, 2022
I noticed that after a following patch ("Step over clone syscall w/
breakpoint, TARGET_WAITKIND_THREAD_CLONED"), the
gdb.threads/step-over-exec.exp was passing cleanly, but still, we'd
end up with four new unexpected GDB core dumps:

		 === gdb Summary ===

 # of unexpected core files      4
 # of expected passes            48

That said patch is making the pre-existing
gdb.threads/step-over-exec.exp testcase (almost silently) expose a
latent problem in gdb/linux-nat.c, resulting in a GDB crash when:

 #1 - a non-leader thread execs
 #2 - the post-exec program stops somewhere
 #3 - you kill the inferior

Instead of #3 directly, the testcase just returns, which ends up in
gdb_exit, tearing down GDB, which kills the inferior, and is thus
equivalent to #3 above.

Vis:

 $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
 ...
 (top-gdb) r
 ...
 (gdb) b main
 ...
 (gdb) r
 ...
 Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
 69        argv0 = argv[0];
 (gdb) c
 Continuing.
 [New Thread 0x7ffff7d89700 (LWP 2506975)]
 Other going in exec.
 Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
 process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd

 Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
 28        foo ();
 (gdb) k
 ...
 Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
 0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
 393         return m_suspend.waitstatus_pending_p;
 (top-gdb) bt
 #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
 #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
 #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
 #3  0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
 #4  0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
 #5  0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
 #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
 #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
 #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
 ...

The root of the problem is that when a non-leader LWP execs, it just
changes its tid to the tgid, replacing the pre-exec leader thread,
becoming the new leader.  There's no thread exit event for the execing
thread.  It's as if the old pre-exec LWP vanishes without trace.  The
ptrace man page says:

"PTRACE_O_TRACEEXEC (since Linux 2.5.46)
	Stop the tracee at the next execve(2).  A waitpid(2) by the
	tracer will return a status value such that

	  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))

	If the execing thread is not a thread group leader, the thread
	ID is reset to thread group leader's ID before this stop.
	Since Linux 3.0, the former thread ID can be retrieved with
	PTRACE_GETEVENTMSG."

When the core of GDB processes an exec event, it deletes all the
threads of the inferior.  But that is too late -- deleting a thread
does not delete the corresponding LWP, so we end up leaving the
pre-exec non-leader LWP stale in the LWP list.  That's what leads to
the crash above -- linux_nat_target::kill iterates over all LWPs, and
after the patch in question, that code will look for the corresponding
thread_info for each LWP.  For the pre-exec non-leader LWP still
listed, it won't find one.

This patch fixes it by deleting the pre-exec non-leader LWP (and
thread) from the LWP/thread lists as soon as we get an exec event out
of ptrace.
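
A rough sketch of what that looks like in linux-nat's extended-event
handling (illustrative only; the iteration and deletion helpers are
assumed, not quoted from the patch):

  /* Sketch: on PTRACE_EVENT_EXEC, the kernel has already folded the
     execing non-leader into the leader's thread id, so any other LWP
     of this process is gone and will never report an exit event --
     drop those entries (and their thread_infos) right away.  */
  if (event == PTRACE_EVENT_EXEC)
    {
      for (lwp_info *other : all_lwps_safe ())   /* assumed iterator */
        if (other != lp && other->ptid.pid () == lp->ptid.pid ())
          exit_lwp (other);                      /* assumed helper   */
    }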

GDBserver does not need an equivalent fix, because it is already doing
this, as side effect of mourning the pre-exec process, in
gdbserver/linux-low.cc:

  else if (event == PTRACE_EVENT_EXEC && cs.report_exec_events)
    {
...
      /* Delete the execing process and all its threads.  */
      mourn (proc);
      switch_to_thread (nullptr);

Change-Id: I21ec18072c7750f3a972160ae6b9e46590376643
agontarek pushed a commit that referenced this issue Dec 3, 2022
Running the
gdb.threads/step-over-thread-exit-while-stop-all-threads.exp testcase
added later in the series against gdbserver, after the
TARGET_WAITKIND_NO_RESUMED fix from the following patch, would run
into an infinite loop in stop_all_threads, leading to a timeout:

  FAIL: gdb.threads/step-over-thread-exit-while-stop-all-threads.exp: displaced-stepping=off: target-non-stop=on: iter 0: continue (timeout)

This is really a latent bug, and it is about the fact that
stop_all_threads stops listening to events from a target as soon as it
sees a TARGET_WAITKIND_NO_RESUMED, ignoring that
TARGET_WAITKIND_NO_RESUMED may be delayed.  handle_no_resumed knows
how to handle delayed no-resumed events, but stop_all_threads was
never taught to.

In more detail, here's what happens with that testcase:

#1 - Multiple threads report breakpoint hits to gdb.

#2 - gdb picks one event, and it's for thread 1.  All other stops are
     left pending.  Thread 1 needs to move past a breakpoint, so gdb
     stops all threads to start an inline step over for thread 1.
     While stopping threads, some of the threads that were still
     running report events that are also left pending.

#3 - gdb steps thread 1

#4 - Thread 1 exits while stepping (it steps over an exit syscall),
     gdbserver reports thread exit for thread 1

#5 - Thread 1 was the last resumed thread, so gdbserver also reports
     no-resumed:

    [remote]   Notification received: Stop:w0;p3445d0.3445d3
    [remote] Sending packet: $vStopped#55
    [remote] Packet received: N
    [remote] Sending packet: $vStopped#55
    [remote] Packet received: OK

#6 - gdb processes the thread exit for thread 1, finishes the step
     over and restarts threads.

#7 - gdb picks the next event to process out of one of the resumed
     threads with pending events:

    [infrun] random_resumed_with_pending_wait_status: Found 32 events, selecting #11

#8 - This is again a breakpoint hit and the breakpoint needs to be
     stepped over too, so gdb starts a step-over dance again.

#9 - We reach stop_all_threads, which finds that some threads need to
     be stopped.

#10 - wait_one finally consumes the no-resumed event queued by #5.
      Seeing this, wait_one disables target async, to stop listening
      for events out of the remote target.

#11 - We still haven't seen all the stops expected, so
      stop_all_threads tries another iteration.

#12 - Because the remote target is no longer async, and there are no
      other targets, wait_one returns no-resumed immediately without
      polling the remote target.

#13 - We still haven't seen all the stops expected, so
      stop_all_threads tries another iteration.  goto #12, looping
      forever.

Fix this by explicitly enabling/re-enabling target async on targets
that can async, before waiting for stops.
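
In sketch form, stop_all_threads gains something like this before each
pass that waits for stops (illustrative only; helper names assumed,
not quoted from the patch):

  /* Sketch: make sure every process target that can do async is
     actually listening again before waiting for more stops; a delayed
     no-resumed event in a previous pass may have turned async off.  */
  for (process_stratum_target *target : all_non_exited_process_targets ())
    {
      switch_to_target_no_thread (target);
      if (target_can_async_p ())
        target_async (true);
    }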

Change-Id: Ie3ffb0df89635585a6631aa842689cecc989e33f
agontarek pushed a commit that referenced this issue Dec 3, 2022
… gdb/23377)

This fixes a gdb.base/multi-forks.exp regression with GDBserver.

Git commit f2ffa92 ("gdb: Eliminate the 'stop_pc' global") caused
the regression by exposing a latent bug in gdbserver.

The bug is that GDBserver's implementation of the D;PID packet
incorrectly assumes that the selected thread points to the process
being detached.  This happens via the any_persistent_commands call,
which calls current_process:

  (gdb) bt
  #0  0x000000000040a57e in internal_error(char const*, int, char const*, ...)
  (file=0x4a53c0 "src/gdb/gdbserver/inferiors.c", line=212, fmt=0x4a539e "%s:
  Assertion `%s' failed.") at src/gdb/gdbserver/../common/errors.c:54
  #1  0x0000000000420acf in current_process() () at
  src/gdb/gdbserver/inferiors.c:212
  #2  0x00000000004226a0 in any_persistent_commands() () at
  gdb/gdbserver/mem-break.c:308
  #3  0x000000000042cb43 in handle_detach(char*) (own_buf=0x6f0280 "D;62ea") at
  src/gdb/gdbserver/server.c:1210
  #4  0x0000000000433af3 in process_serial_event() () at
  src/gdb/gdbserver/server.c:4055
  #5  0x0000000000434878 in handle_serial_event(int, void*) (err=0,
  client_data=0x0)

The "eliminate stop_pc" commit exposes the problem because before that
commit, GDB's switch_to_thread always read the newly-selected thread's
PC, and that would end up forcing GDBserver's selected thread to
change accordingly as side effect.  After that commit, GDB no longer
reads the thread's PC, and GDBserver does not switch the thread.

Fix this by removing the assumption from GDBserver.
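
In sketch form, the GDBserver side of the fix looks roughly like this
(illustrative only; it follows the ChangeLog below rather than quoting
the patch):

  static void
  handle_detach (char *own_buf)
  {
    /* "D;pid" -- the pid to detach arrives in the packet, in hex.
       Look the process up explicitly instead of trusting whatever
       thread happens to be selected.  */
    int pid = strtol (&own_buf[2], NULL, 16);
    process_info *process = find_process_pid (pid);

    if (process == NULL)
      {
        write_enn (own_buf);   /* unknown process: error back to GDB */
        return;
      }

    /* From here on, any_persistent_commands (process) and the rest of
       the detach path work on PROCESS rather than on whatever process
       happens to be current.  */
  }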

gdb/gdbserver/ChangeLog:
2018-07-11  Pedro Alves  <palves@redhat.com>

	PR gdb/23377
	* mem-break.c (any_persistent_commands): Add process_info
	parameter and use it instead of relying on the current process.
	Change return type to bool.
	* mem-break.h (any_persistent_commands): Add process_info
	parameter and change return type to bool.
	* server.c (handle_detach): Remove require_running_or_return call.
	Look up the process_info for the process we're about to detach.
	If not found, return back error to GDB.  Adjust
	any_persistent_commands call to pass down a process pointer.
agontarek pushed a commit that referenced this issue Dec 3, 2022
…b/23379)

This commit fixes an 8.1->8.2 regression exposed by
gdb.python/py-evthreads.exp when testing with
--target_board=native-gdbserver.

gdb.log shows:

  src/gdb/thread.c:93: internal-error: thread_info* inferior_thread(): Assertion `tp' failed.
  A problem internal to GDB has been detected,
  further debugging may prove unreliable.
  Quit this debugging session? (y or n) FAIL: gdb.python/py-evthreads.exp: run to breakpoint 1 (GDB internal error)

A backtrace shows (frames #2 and #10 highlighted) that the assertion
fails when GDB is setting up the connection to the remote target, in
non-stop mode:

  #0  0x0000000000622ff0 in internal_error(char const*, int, char const*, ...) (file=0xc1ad98 "src/gdb/thread.c", line=93, fmt=0xc1ad20 "%s: Assertion `%s' failed.") at src/gdb/common/errors.c:54
  #1  0x000000000089567e in inferior_thread() () at src/gdb/thread.c:93
= #2  0x00000000004da91d in get_event_thread() () at src/gdb/python/py-threadevent.c:38
  #3  0x00000000004da9b7 in create_thread_event_object(_typeobject*, _object*) (py_type=0x11574c0 <continue_event_object_type>, thread=0x0)
      at src/gdb/python/py-threadevent.c:60
  #4  0x00000000004bf6fe in create_continue_event_object() () at src/gdb/python/py-continueevent.c:27
  #5  0x00000000004bf738 in emit_continue_event(ptid_t) (ptid=...) at src/gdb/python/py-continueevent.c:40
  #6  0x00000000004c7d47 in python_on_resume(ptid_t) (ptid=...) at src/gdb/python/py-inferior.c:108
  #7  0x0000000000485bfb in std::_Function_handler<void (ptid_t), void (*)(ptid_t)>::_M_invoke(std::_Any_data const&, ptid_t&&) (__functor=..., __args#0=...) at /usr/include/c++/7/bits/std_function.h:316
  #8  0x000000000089b416 in std::function<void (ptid_t)>::operator()(ptid_t) const (this=0x12aa600, __args#0=...)
      at /usr/include/c++/7/bits/std_function.h:706
  #9  0x000000000089aa0e in gdb::observers::observable<ptid_t>::notify(ptid_t) const (this=0x118a7a0 <gdb::observers::target_resumed>, args#0=...)
      at src/gdb/common/observable.h:106
= #10 0x0000000000896fbe in set_running(ptid_t, int) (ptid=..., running=1) at src/gdb/thread.c:880
  #11 0x00000000007f750f in remote_target::remote_add_thread(ptid_t, bool, bool) (this=0x12c5440, ptid=..., running=true, executing=true) at src/gdb/remote.c:2434
  #12 0x00000000007f779d in remote_target::remote_notice_new_inferior(ptid_t, int) (this=0x12c5440, currthread=..., executing=1)
      at src/gdb/remote.c:2515
  #13 0x00000000007f9c44 in remote_target::update_thread_list() (this=0x12c5440) at src/gdb/remote.c:3831
  #14 0x00000000007fb922 in remote_target::start_remote(int, int) (this=0x12c5440, from_tty=0, extended_p=0)
      at src/gdb/remote.c:4655
  #15 0x00000000007fd102 in remote_target::open_1(char const*, int, int) (name=0x1a4f45e "localhost:2346", from_tty=0, extended_p=0)
      at src/gdb/remote.c:5638
  #16 0x00000000007fbec1 in remote_target::open(char const*, int) (name=0x1a4f45e "localhost:2346", from_tty=0)
      at src/gdb/remote.c:4862

So on frame #10, we're marking a newly-discovered thread as running,
and that causes the Python API to emit a gdb.ContinueEvent.
gdb.ContinueEvent is a gdb.ThreadEvent, and as such includes the event
thread as the "inferior_thread" attribute.  The problem is that when
we get to frame #3/#4, we have lost all references to the thread that is
being marked as running.  create_continue_event_object assumes that it
is the current thread, which is not true in this case.

Fix this by passing down the right thread in
create_continue_event_object.  Also remove
create_thread_event_object's default argument and have the only other
caller left pass down the right thread explicitly too.
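
In sketch form, the fix threads the ptid down roughly like this
(illustrative only; it follows the ChangeLog below rather than quoting
the patch):

  gdbpy_ref<>
  create_continue_event_object (ptid_t ptid)
  {
    /* Find the right thread from PTID instead of assuming the current
       thread; py_get_event_thread is the renamed helper described in
       the ChangeLog.  */
    gdbpy_ref<> py_thr = py_get_event_thread (ptid);
    if (py_thr == nullptr)
      return nullptr;

    return create_thread_event_object (&continue_event_object_type,
                                       py_thr.get ());
  }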

gdb/ChangeLog:
2018-08-24  Pedro Alves  <palves@redhat.com>
	    Simon Marchi  <simon.marchi@ericsson.com>

	PR gdb/23379
	* python/py-continueevent.c: Include "gdbthread.h".
	(create_continue_event_object): Add intro comment.  Add 'ptid'
	parameter.  Use it to find thread to pass to
	create_thread_event_object.
	(emit_continue_event): Pass PTID down to
	create_continue_event_object.
	* python/py-event.h (py_get_event_thread): Declare.
	(create_thread_event_object): Remove default from 'thread'
	parameter.
	* python/py-stopevent.c (create_stop_event_object): Use
	py_get_event_thread.
	* python/py-threadevent.c (get_event_thread): Rename to ...
	(py_get_event_thread): ... this, make extern, add 'ptid' parameter
	and use it to find the thread.
	(create_thread_event_object): Assert that THREAD isn't null.
	Don't find the event thread here.
agontarek pushed a commit that referenced this issue Mar 13, 2024
This commit fixes an issue that was discovered while writing the tests
for the previous commit.

I noticed that, when GDB restarts an inferior, the executable_changed
event would trigger twice.  The first notification would originate
from:

  #0  exec_file_attach (filename=0x4046680 "/tmp/hello.x", from_tty=0) at ../../src/gdb/exec.c:513
  #1  0x00000000006f3adb in reopen_exec_file () at ../../src/gdb/corefile.c:122
  #2  0x0000000000e6a3f2 in generic_mourn_inferior () at ../../src/gdb/target.c:3682
  #3  0x0000000000995121 in inf_child_target::mourn_inferior (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/inf-child.c:192
  #4  0x0000000000995cff in inf_ptrace_target::mourn_inferior (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/inf-ptrace.c:125
  #5  0x0000000000a32472 in linux_nat_target::mourn_inferior (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3609
  #6  0x0000000000e68a40 in target_mourn_inferior (ptid=...) at ../../src/gdb/target.c:2761
  #7  0x0000000000a323ec in linux_nat_target::kill (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3593
  #8  0x0000000000e64d1c in target_kill () at ../../src/gdb/target.c:924
  #9  0x00000000009a19bc in kill_if_already_running (from_tty=1) at ../../src/gdb/infcmd.c:328
  #10 0x00000000009a1a6f in run_command_1 (args=0x0, from_tty=1, run_how=RUN_STOP_AT_MAIN) at ../../src/gdb/infcmd.c:381
  #11 0x00000000009a20a5 in start_command (args=0x0, from_tty=1) at ../../src/gdb/infcmd.c:527
  #12 0x000000000068dc5d in do_simple_func (args=0x0, from_tty=1, c=0x35c7200) at ../../src/gdb/cli/cli-decode.c:95

While the second originates from:

  #0  exec_file_attach (filename=0x3d7a1d0 "/tmp/hello.x", from_tty=0) at ../../src/gdb/exec.c:513
  #1  0x0000000000dfe525 in reread_symbols (from_tty=1) at ../../src/gdb/symfile.c:2517
  #2  0x00000000009a1a98 in run_command_1 (args=0x0, from_tty=1, run_how=RUN_STOP_AT_MAIN) at ../../src/gdb/infcmd.c:398
  #3  0x00000000009a20a5 in start_command (args=0x0, from_tty=1) at ../../src/gdb/infcmd.c:527
  #4  0x000000000068dc5d in do_simple_func (args=0x0, from_tty=1, c=0x35c7200) at ../../src/gdb/cli/cli-decode.c:95

In the first case the call to exec_file_attach first passes through
reopen_exec_file.  The reopen_exec_file function performs a modification time
check on the executable file, and only calls exec_file_attach if the
executable has changed on disk since it was last loaded.

However, in the second case things work a little differently.  In this
case GDB is really trying to reread the debug symbol.  As such, we
iterate over the objfiles list, and for each of those we check the
modification time, if the file on disk has changed then we reload the
debug symbols from that file.

However, there is an additional check: if the objfile has the same
name as the executable, then we will call exec_file_attach, but we do
so without checking the cached modification time that indicates when
the executable was last reloaded; as a result, we reload the
executable twice.

In this commit I propose that reread_symbols be changed to
unconditionally call reopen_exec_file before performing the objfile
iteration.  This will ensure that, if the executable has changed, then
the executable will be reloaded; however, if the executable has
already been recently reloaded, we will not reload it for a second
time.

After handling the executable, GDB can then iterate over the objfiles
list and reload them in the normal way.
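
In sketch form (illustrative only, not the verbatim patch):

  void
  reread_symbols (int from_tty)
  {
    /* Check the executable exactly once, through the path that
       honours the cached modification time; this is the check the
       run_command_1 path previously bypassed.  */
    reopen_exec_file ();

    /* Then reread debug symbols for any objfile that changed on disk,
       comparing each objfile's recorded mtime as before.  */
    for (objfile *objfile : current_program_space->objfiles ())
      {
        /* ... unchanged per-objfile mtime check and re-read ...  */
      }
  }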

With this done I now see the executable reloaded only once when GDB
restarts an inferior, which means I can remove the kfail that I added
to the gdb.python/py-exec-file.exp test in the previous commit.

Approved-By: Tom Tromey <tom@tromey.com>
agontarek pushed a commit that referenced this issue Mar 13, 2024
It was pointed out on the mailing list that a recently added
test (gdb.python/py-progspace-events.exp) was failing when run with
the native-extended-gdbserver board.  This test was added with this
commit:

  commit 59912fb
  Date:   Tue Sep 19 11:45:36 2023 +0100

      gdb: add Python events for program space addition and removal

It turns out, though, that the test is failing due to an existing bug
in GDB; the new test just exposes the problem.  Additionally, the
failure really doesn't even rely on the new functionality added in the
above commit.  I reduced the test to a simple set of steps that
reproduced the failure and tested against GDB 13, and the test passes;
so the bug was introduced since then.  In fact, the bug was introduced
with this commit:

  commit a282736
  Date:   Fri Sep 8 15:48:16 2023 +0100

      gdb: remove final user of the executable_changed observer

This commit changed how the per-inferior auxv data cache is managed,
specifically, when the cache is cleared, and it is this that leads to
the failure.

This bug is interesting because it exposes a number of issues with
GDB.  I'll explain all of the problems I see, though ultimately I only
propose fixing one problem in this commit, which is enough to resolve
the crash we are currently seeing.

The crash that we are seeing manifests like this:

  ...
  [Inferior 2 (process 3970384) exited normally]
  +inferior 1
  [Switching to inferior 1 [process 3970383] (/tmp/build/gdb/testsuite/outputs/gdb.python/py-progspace-events/py-progspace-events)]
  [Switching to thread 1.1 (Thread 3970383.3970383)]
  #0  breakpt () at /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.python/py-progspace-events.c:28
  28	{ /* Nothing.  */ }
  (gdb) step
  +step
  terminate called after throwing an instance of 'gdb_exception_error'

  Fatal signal: Aborted
  ... etc ...

What's happening is that GDB attempts to refill the auxv cache as a
result of the gdbarch_has_shared_address_space call in
program_space::~program_space, the backtrace looks like this:

  #0  0x00007fb4f419a9a5 in raise () from /lib64/libpthread.so.0
  #1  0x00000000008b635d in handle_fatal_signal (sig=6) at ../../src/gdb/event-top.c:912
  #2  <signal handler called>
  #3  0x00007fb4f38e3625 in raise () from /lib64/libc.so.6
  #4  0x00007fb4f38cc8d9 in abort () from /lib64/libc.so.6
  #5  0x00007fb4f3c70756 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
  #6  0x00007fb4f3c7c6dc in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
  #7  0x00007fb4f3c7b6e9 in __cxa_call_terminate () from /lib64/libstdc++.so.6
  #8  0x00007fb4f3c7c094 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
  #9  0x00007fb4f3a80c63 in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
  #10 0x00007fb4f3a8154e in _Unwind_Resume () from /lib64/libgcc_s.so.1
  #11 0x0000000000e8832d in target_read_alloc_1<unsigned char> (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2266
  #12 0x0000000000e73dea in target_read_alloc (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2315
  #13 0x000000000058248c in target_read_auxv_raw (ops=0x408a3a0) at ../../src/gdb/auxv.c:379
  #14 0x000000000058243d in target_read_auxv () at ../../src/gdb/auxv.c:368
  #15 0x000000000058255c in target_auxv_search (match=0x0, valp=0x7ffdee17c598) at ../../src/gdb/auxv.c:415
  #16 0x0000000000a464bb in linux_is_uclinux () at ../../src/gdb/linux-tdep.c:433
  #17 0x0000000000a464f6 in linux_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/linux-tdep.c:440
  #18 0x0000000000510eae in gdbarch_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/gdbarch.c:4889
  #19 0x0000000000bc7558 in program_space::~program_space (this=0x4544aa0, __in_chrg=<optimized out>) at ../../src/gdb/progspace.c:124
  #20 0x00000000009b245d in delete_inferior (inf=0x47b3de0) at ../../src/gdb/inferior.c:290
  #21 0x00000000009b2c10 in prune_inferiors () at ../../src/gdb/inferior.c:480
  #22 0x00000000009c5e3e in fetch_inferior_event () at ../../src/gdb/infrun.c:4558
  #23 0x000000000099b4dc in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:42
  #24 0x0000000000cbc64f in remote_async_serial_handler (scb=0x4090a30, context=0x408a6b0) at ../../src/gdb/remote.c:14859
  #25 0x0000000000d83d3a in run_async_handler_and_reschedule (scb=0x4090a30) at ../../src/gdb/ser-base.c:138
  #26 0x0000000000d83e1f in fd_event (error=0, context=0x4090a30) at ../../src/gdb/ser-base.c:189

So this is problem #1, if we throw an exception while deleting a
program_space then this is not caught, and is going to crash GDB.

Problem #2 becomes evident when we ask why GDB is throwing an error in
this case; the error is thrown because the remote target, operating in
non-async mode, can't read the auxv data while an inferior is running
and GDB is waiting for a stop reply.  The problem here then, is why
does GDB get into a position where it tries to interact with the
remote target in this way, at this time?  The problem is caused by the
prune_inferiors call which can be seen in the above backtrace.

In prune_inferiors we check if the inferior is deletable, and if it
is, we delete it.  The problem is, I think, we should also check if
the target is currently in a state that would allow us to delete the
inferior.  We don't currently have such a check available, we'd need
to add one, but for the remote target, this would return false if the
remote is in async mode and the remote is currently waiting for a stop
reply.  With this change in place GDB would defer deleting the
inferior until the remote target has stopped, at which point GDB would
be able to refill the auxv cache successfully.

And then, problem #3 becomes evident when we ask why GDB is needing to
refill the auxv cache now when it didn't need to for GDB 13.  This is
where the second commit mentioned above (a282736) comes in.
Prior to this commit, the auxv cache was cleared by the
executable_changed observer, while after that commit the auxv cache
was cleared by the new_objfile observer -- but only when the
new_objfile observer is used in the special mode that actually means
that all objfiles have been unloaded (I know, the overloading of the
new_objfile observer is horrible, and unnecessary, but it's not really
important for this bug).

The difference arises because the new_objfile observer is triggered
from clear_symtab_users, which in turn is called from
program_space::~program_space.  The new_objfile observer for auxv does
this:

  static void
  auxv_new_objfile_observer (struct objfile *objfile)
  {
    if (objfile == nullptr)
      invalidate_auxv_cache_inf (current_inferior ());
  }

That is, when all the objfiles are unloaded, we clear the auxv cache
for the current inferior.

The problem, then, is that when we look at the prune_inferiors ->
delete_inferior -> ~program_space path, we see that the current
inferior is not going to be an inferior that exists within the
program_space being deleted; delete_inferior removes the deleted
inferior from the global inferior list, and then only deletes the
program_space if program_space::empty() returns true, which is only
the case if the current inferior isn't within the program_space to
delete, and no other inferior exists within that program_space
either.

What this means is that when the new_objfile observer is called we
can't rely on the current inferior having any relationship with the
program space in which the objfiles were removed.  This was an error
in the commit a282736, the only thing we can rely on is the
current program space.  As a result of this mistake, after commit
a282736, GDB was sometimes clearing the auxv cache for a random
inferior.  In the native target case this was harmless as we can
always refill the cache when needed, but in the remote target case, if
we need to refill the cache when the remote target is executing, then
we get the crash we observed.

And additionally, if we think about this a little more, we see that
commit a282736 made another mistake.  When all the objfiles are
removed, they are removed from a program_space; a program_space might
contain multiple inferiors, so surely we should clear the auxv cache
for all of the matching inferiors?

Given these two insights, that the current_inferior is not relevant,
only the current_program_space, and that we should be clearing the
cache for all inferiors in the current_program_space, we can update
auxv_new_objfile_observer to:

  if (objfile == nullptr)
    {
      for (inferior *inf : all_inferiors ())
	{
	  if (inf->pspace == current_program_space)
	    invalidate_auxv_cache_inf (inf);
	}
    }

With this change we now correctly clear the auxv cache for the correct
inferiors, and GDB no longer needs to refill the cache at an
inconvenient time, this avoids the crash we were seeing.

And finally, we reach problem #4.  Inspired by the observation that
using the current_inferior from within the ~program_space function was
not correct, I added some debug to see if current_inferior() was
called anywhere else (below ~program_space), and the answer is yes,
it's called often.  Mostly the culprit is GDB doing:

  current_inferior ()->top_target ()-> ....

But I think all of these calls are most likely doing the wrong thing,
and only work because the top target in all these cases is shared
between all inferiors, e.g. it's the native target, or the remote
target for all inferiors.  But if we had a truly multi-connection
setup, then we might start to see odd behaviour.

Problem #1 I'm just ignoring for now; I guess at some point we might
run into this again, and then we'd need to solve it.  But in this
case I wasn't sure what a "good" solution would look like.  We need
the auxv data in order to implement the linux_is_uclinux() function.
If we can't get the auxv data then what should we do, assume yes, or
assume no?  The right answer would probably be to propagate the error
back up the stack, but then we reach ~program_space, and throwing
exceptions from a destructor is problematic, so we'd need to catch and
deal at this point.  The linux_is_uclinux() call is made from within
gdbarch_has_shared_address_space(), which is used like:

  if (!gdbarch_has_shared_address_space (target_gdbarch ()))
    delete this->aspace;

So, we would have to choose; delete the address space or not.  If we
delete it on error, then we might delete an address space that is
shared within another program space.  If we don't delete the address
space, then we might leak it.  Neither choice is great.

A better solution might be to have the address spaces be reference
counted, then we could remove the gdbarch_has_shared_address_space
call completely, and just rely on the reference count to auto-delete
the address space when appropriate.

The solution for problem #2 I already hinted at above, we should have
a new target_can_delete_inferiors() call, which should be called from
prune_inferiors, this would prevent GDB from trying to delete
inferiors when a (remote) target is in a state where we know it can't
delete the inferior.  Deleting an inferior often (always?) requires
sending packets to the remote, and if the remote is waiting for a stop
reply then this will never work, so the pruning should be deferred in
this case.

The solution for problem #3 is included in this commit.

And, for problem #4, I'm not sure what the right solution is.  Maybe
delete_inferior should ensure the inferior to be deleted is in place
when ~program_space is called?  But that seems a little weird, as the
current inferior would, in theory, still be using the current
program_space...

Anyway, after this commit, the gdb.python/py-progspace-events.exp test
now passes when run with the native-extended-remote board.

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30935
Approved-By: Simon Marchi <simon.marchi@efficios.com>
Change-Id: I41f0e6e2d7ecc1e5e55ec170f37acd4052f46eaf
agontarek pushed a commit that referenced this issue Mar 13, 2024
I noticed that on an Ubuntu 20.04 system, after a following patch
("Step over clone syscall w/ breakpoint,
TARGET_WAITKIND_THREAD_CLONED"), the gdb.threads/step-over-exec.exp
was passing cleanly, but still, we'd end up with four new unexpected
GDB core dumps:

		 === gdb Summary ===

 # of unexpected core files      4
 # of expected passes            48

That said patch is making the pre-existing
gdb.threads/step-over-exec.exp testcase (almost silently) expose a
latent problem in gdb/linux-nat.c, resulting in a GDB crash when:

 #1 - a non-leader thread execs
 #2 - the post-exec program stops somewhere
 #3 - you kill the inferior

Instead of #3 directly, the testcase just returns, which ends up in
gdb_exit, tearing down GDB, which kills the inferior, and is thus
equivalent to #3 above.

Vis (after said patch is applied):

 $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
 ...
 (top-gdb) r
 ...
 (gdb) b main
 ...
 (gdb) r
 ...
 Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
 69        argv0 = argv[0];
 (gdb) c
 Continuing.
 [New Thread 0x7ffff7d89700 (LWP 2506975)]
 Other going in exec.
 Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
 process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd

 Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
 28        foo ();
 (gdb) k
 ...
 Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
 0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
 393         return m_suspend.waitstatus_pending_p;
 (top-gdb) bt
 #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
 #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
 #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
 #3  0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
 #4  0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
 #5  0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
 #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
 #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
 #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
 ...

The root of the problem is that when a non-leader LWP execs, it just
changes its tid to the tgid, replacing the pre-exec leader thread,
becoming the new leader.  There's no thread exit event for the execing
thread.  It's as if the old pre-exec LWP vanishes without trace.  The
ptrace man page says:

 "PTRACE_O_TRACEEXEC (since Linux 2.5.46)
	Stop the tracee at the next execve(2).  A waitpid(2) by the
	tracer will return a status value such that

	  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))

	If the execing thread is not a thread group leader, the thread
	ID is reset to thread group leader's ID before this stop.
	Since Linux 3.0, the former thread ID can be retrieved with
	PTRACE_GETEVENTMSG."
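
As an aside, the behavior described in that man page excerpt is easy to
observe with a small stand-alone ptrace tracer.  The sketch below is not
GDB code and omits error handling; it assumes the tracee identified by
TGID has already been attached and had PTRACE_O_TRACEEXEC set via
PTRACE_SETOPTIONS:

  /* Minimal sketch: wait for the tracee's exec stop and recover the
     pre-exec thread ID of the execing LWP.  */

  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <csignal>
  #include <cstdio>

  static void
  wait_for_exec (pid_t tgid)
  {
    for (;;)
      {
        int status;
        pid_t ret = waitpid (tgid, &status, 0);
        if (ret == -1 || WIFEXITED (status))
          break;

        if (status >> 8 == (SIGTRAP | (PTRACE_EVENT_EXEC << 8)))
          {
            /* The execing thread now reports itself under the leader's
               ID (TGID).  The thread it used to be is only identifiable
               via PTRACE_GETEVENTMSG.  */
            unsigned long former_tid = 0;
            ptrace (PTRACE_GETEVENTMSG, ret, nullptr, &former_tid);
            printf ("exec by LWP %lu, now reported as %d\n",
                    former_tid, (int) ret);
            /* A debugger has to drop its record of FORMER_TID here --
               that LWP is gone, even though no exit event was (or ever
               will be) reported for it.  */
            break;
          }

        /* Some other stop; resume and keep waiting (a real tracer would
           forward the stop signal instead of discarding it).  */
        ptrace (PTRACE_CONT, ret, nullptr, 0);
      }
  }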

When the core of GDB processes an exec event, it deletes all the
threads of the inferior.  But that is too late -- deleting the thread
does not delete the corresponding LWP, so we end up leaving the
pre-exec non-leader LWP stale in the LWP list.  That's what leads to
the crash above -- linux_nat_target::kill iterates over all LWPs, and
after the patch in question, that code looks for the corresponding
thread_info for each LWP; for the pre-exec non-leader LWP still
listed, it won't find one.

This patch fixes it, by deleting the pre-exec non-leader LWP (and
thread) from the LWP/thread lists as soon as we get an exec event out
of ptrace.

GDBserver does not need an equivalent fix, because it is already doing
this, as side effect of mourning the pre-exec process, in
gdbserver/linux-low.cc:

  else if (event == PTRACE_EVENT_EXEC && cs.report_exec_events)
    {
...
      /* Delete the execing process and all its threads.  */
      mourn (proc);
      switch_to_thread (nullptr);


The crash with gdb.threads/step-over-exec.exp is not observable on
newer systems, which postdate the glibc change to move "libpthread.so"
internals to "libc.so.6", because right after the exec, GDB traps a
load event for "libc.so.6", which leads to GDB trying to open
libthread_db for the post-exec inferior, and, on such systems, that
succeeds.  When we load libthread_db, we call
linux_stop_and_wait_all_lwps, which, as the name suggests, stops all
lwps, and then waits to see their stops.  While doing this, GDB
detects that the pre-exec stale LWP is gone, and deletes it.

If we use "catch exec" to stop right at the exec before the
"libc.so.6" load event ever happens, and issue "kill" right there,
then GDB crashes on newer systems as well.  So instead of tweaking
gdb.threads/step-over-exec.exp to cover the fix, add a new
gdb.threads/threads-after-exec.exp testcase that uses "catch exec".
The test also uses the new "maint info linux-lwps" command if testing
on Linux native, which also exposes the stale LWP problem with an
unfixed GDB.
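
For the record, the by-hand equivalent of what the new test exercises
looks roughly like this on a native Linux GDB, against any program
whose non-leader thread execs (output elided):

 (gdb) catch exec
 (gdb) run
 ... the catchpoint triggers once the non-leader thread execs ...
 (gdb) maint info linux-lwps
 ... an unfixed GDB still lists the stale pre-exec LWP here ...
 (gdb) kill
 ... which crashes an unfixed GDB ...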

Also tweak a comment in infrun.c:follow_exec referring to how
linux-nat.c used to behave, as it would become stale otherwise.

Reviewed-By: Andrew Burgess <aburgess@redhat.com>
Change-Id: I21ec18072c7750f3a972160ae6b9e46590376643
agontarek pushed a commit that referenced this issue Mar 13, 2024
Running the
gdb.threads/step-over-thread-exit-while-stop-all-threads.exp testcase
added later in the series against gdbserver, after the
TARGET_WAITKIND_NO_RESUMED fix from the following patch, would run
into an infinite loop in stop_all_threads, leading to a timeout:

  FAIL: gdb.threads/step-over-thread-exit-while-stop-all-threads.exp: displaced-stepping=off: target-non-stop=on: iter 0: continue (timeout)

This is really a latent bug: stop_all_threads stops listening to
events from a target as soon as it sees a TARGET_WAITKIND_NO_RESUMED,
ignoring that a TARGET_WAITKIND_NO_RESUMED may be delayed.
handle_no_resumed knows how to handle delayed no-resumed events, but
stop_all_threads was never taught to.

In more detail, here's what happens with that testcase:

#1 - Multiple threads report breakpoint hits to gdb.

#2 - gdb picks one event, and it's for thread 1.  All other stops are
     left pending.  thread 1 needs to move past a breakpoint, so gdb
     stops all threads to start an inline step over for thread 1.
     While stopping threads, some of the threads that were still
     running report events that are also left pending.

#3 - gdb steps thread 1.

#4 - Thread 1 exits while stepping (it steps over an exit syscall),
     gdbserver reports thread exit for thread 1

#5 - Thread 1 was the last resumed thread, so gdbserver also reports
     no-resumed:

    [remote]   Notification received: Stop:w0;p3445d0.3445d3
    [remote] Sending packet: $vStopped#55
    [remote] Packet received: N
    [remote] Sending packet: $vStopped#55
    [remote] Packet received: OK

#6 - gdb processes the thread exit for thread 1, finishes the step
     over and restarts threads.

#7 - gdb picks the next event to process out of one of the resumed
     threads with pending events:

    [infrun] random_resumed_with_pending_wait_status: Found 32 events, selecting #11

#8 - This is again a breakpoint hit and the breakpoint needs to be
     stepped over too, so gdb starts a step-over dance again.

#9 - We reach stop_all_threads, which finds that some threads need to
     be stopped.

#10 - wait_one finally consumes the no-resumed event queued in #5.
      Seeing this, wait_one disables target async, to stop listening
      for events out of the remote target.

#11 - We still haven't seen all the stops expected, so
      stop_all_threads tries another iteration.

#12 - Because the remote target is no longer async, and there are no
      other targets, wait_one returns no-resumed immediately without
      polling the remote target.

#13 - We still haven't seen all the stops expected, so
      stop_all_threads tries another iteration; goto #12, looping
      forever.

Fix this by explicitly enabling/re-enabling target async on targets
that can async, before waiting for stops.
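
In code, the shape of the fix is roughly the following, run inside
stop_all_threads before each round of waiting.  This is a paraphrased
sketch of the idea rather than the literal patch, built on existing GDB
internals such as all_non_exited_process_targets and target_async:

  /* Sketch: before waiting for more stop events, make sure every live
     process target is async again -- a previously consumed
     TARGET_WAITKIND_NO_RESUMED may have switched it off.  */
  for (process_stratum_target *target : all_non_exited_process_targets ())
    {
      switch_to_target_no_thread (target);

      if (!target_is_async_p () && target_can_async_p ())
        target_async (true);
    }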

Reviewed-By: Andrew Burgess <aburgess@redhat.com>
Change-Id: Ie3ffb0df89635585a6631aa842689cecc989e33f
agontarek pushed a commit that referenced this issue Mar 13, 2024
On aarch64-linux, with gcc 13.2.1, I run into:
...
 (gdb) backtrace^M
 #0  break_here () at solib-search.c:30^M
 #1  0x0000fffff7f20194 in lib2_func4 () at solib-search-lib2.c:50^M
 #2  0x0000fffff7f70194 in lib1_func3 () at solib-search-lib1.c:50^M
 #3  0x0000fffff7f20174 in lib2_func2 () at solib-search-lib2.c:30^M
 #4  0x0000fffff7f70174 in lib1_func1 () at solib-search-lib1.c:30^M
 #5  0x00000000004101b4 in main () at solib-search.c:23^M
 (gdb) PASS: gdb.base/solib-search.exp: \
   backtrace (with wrong libs) (data collection)
 FAIL: gdb.base/solib-search.exp: backtrace (with wrong libs)
...

The FAIL is generated by this code in the test-case:
...
    if { $expect_fail } {
	# If the backtrace output is correct the test isn't sufficiently
	# testing what it should.
	if { $count == $total_expected } {
	    set fail 1
	}
...

The test-case:
- builds two versions of two shared libs, a "right" and "wrong" version, the
  difference being an additional dummy function (called spacer function),
- uses the "right" version to generate a core file,
- uses the "wrong" version to interpret the core file, and
- generates a backtrace.

The intent is that the backtrace is incorrect due to using the "wrong"
version, but actually it's correct.  This is because the spacer functions
aren't large enough.

Fix this by increasing the size of the spacer functions by adding a dummy
loop, after which we have, as expected, an incorrect backtrace:
...
 (gdb) backtrace^M
 #0  break_here () at solib-search.c:30^M
 #1  0x0000fffff7f201c0 in ?? ()^M
 #2  0x0000fffff7f20174 in lib2_func2 () at solib-search-lib2.c:30^M
 #3  0x0000fffff7f20174 in lib2_func2 () at solib-search-lib2.c:30^M
 #4  0x0000fffff7f70174 in lib1_func1 () at solib-search-lib1.c:30^M
 #5  0x00000000004101b4 in main () at solib-search.c:23^M
 (gdb) PASS: gdb.base/solib-search.exp: \
   backtrace (with wrong libs) (data collection)
 PASS: gdb.base/solib-search.exp: backtrace (with wrong libs)
...
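
The commit text does not quote the source change itself, but the spacer
adjustment amounts to something like the following in each library
(illustrative only; the function name here is made up and the real
solib-search-lib1.c/solib-search-lib2.c spacers use their own names):

  /* Illustrative spacer: a dummy loop bulky enough that the "right"
     and "wrong" builds of the library lay out the following functions
     at different addresses.  'volatile' keeps the compiler from
     optimizing the loop away.  */
  int
  lib1_spacer_function (void)
  {
    volatile int sum = 0;
    for (volatile int i = 0; i < 1000; i++)
      sum += i;
    return sum;
  }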

Tested on aarch64-linux.
agontarek pushed a commit that referenced this issue Aug 23, 2024
Commit b5661ff ("gdb: fix possible use-after-free when
executing commands") attempted to fix a possible use-after-free
in case a command redefines itself.

Commit 37e5833 ("gdb: fix command lookup in execute_command ()")
updated the previous fix to handle subcommands as well by using the
original command string to look up the command again after its execution.

This fixed the test in gdb.base/define.exp, but it turned out that it
does not work (at least) for "target remote" and "target extended-remote".

The problem is that the command buffer P passed to execute_command ()
gets overwritten in dont_repeat () while executing the "target remote"
command itself:

	#0  dont_repeat () at top.c:822
	#1  0x000055555730982a in target_preopen (from_tty=1) at target.c:2483
	#2  0x000055555711e911 in remote_target::open_1 (name=0x55555881c7fe ":1234", from_tty=1, extended_p=0)
	    at remote.c:5946
	#3  0x000055555711d577 in remote_target::open (name=0x55555881c7fe ":1234", from_tty=1) at remote.c:5272
	#4  0x00005555573062f2 in open_target (args=0x55555881c7fe ":1234", from_tty=1, command=0x5555589d0490)
	    at target.c:853
	#5  0x0000555556ad22fa in cmd_func (cmd=0x5555589d0490, args=0x55555881c7fe ":1234", from_tty=1)
	    at cli/cli-decode.c:2737
	#6  0x00005555573487fd in execute_command (p=0x55555881c802 "4", from_tty=1) at top.c:688

Therefore, the second call to lookup_cmd () at line 697 fails to find
the command, because the original command string is gone.

This commit addresses this particular problem by creating a *copy* of
the original command string, for the sole purpose of using it after
command execution to look up the command again.  It may not be the most
efficient way, but it's safer given that the command buffer is shared
and overwritten in hard-to-foresee situations.
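
The pattern is easy to demonstrate outside of GDB.  The toy program
below is not GDB code (all of the names in it are made up); it just
mimics a shared command buffer that gets clobbered while the command
runs, and shows why any post-execution lookup has to be driven from a
private copy:

  #include <cstring>
  #include <iostream>
  #include <string>

  /* Stand-in for GDB's shared, reusable command line buffer.  */
  static char line_buffer[64];

  /* Stand-in for cmd_func (): running the command clobbers the shared
     buffer, the way dont_repeat () does for "target remote".  */
  static void
  run_command (const std::string &cmd)
  {
    (void) cmd;  /* unused in this toy  */
    std::strcpy (line_buffer, "something else entirely");
  }

  static void
  execute_line (const char *p)
  {
    std::string copy = p;   /* private copy taken up front  */

    run_command (copy);     /* may overwrite the buffer behind P  */

    /* Any post-execution re-lookup must use COPY, not P.  */
    std::cout << "buffer now reads: " << p << "\n";
    std::cout << "saved copy reads: " << copy << "\n";
  }

  int
  main ()
  {
    std::strcpy (line_buffer, "target remote :1234");
    execute_line (line_buffer);
    return 0;
  }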

Tested on x86_64-linux.

PR 30249
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30249

Approved-By: Tom Tromey <tom@tromey.com>
(cherry picked from commit b69378c)