cuda-gdb internal-error caused on google colab #5

Closed
sakaia opened this issue Jun 6, 2019 · 2 comments

Comments

sakaia commented Jun 6, 2019

On Google Colab, with GPU enabled.

!git clone http://github.com/pytorch/examples
!cuda-gdb python3

and typing

run examples/mnist/main.py

generates the following error.

cuda-gdb/7.12/gdb/block.c:456: internal-error: set_block_compunit_symtab: Assertion `gb->compunit_symtab == NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n

This is a bug, please report it.  For instructions, see:
<http://www.gnu.org/software/gdb/bugs/>.

cuda-gdb/7.12/gdb/block.c:456: internal-error: set_block_compunit_symtab: Assertion `gb->compunit_symtab == NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) n

When I use gdb instead of cuda-gdb, nothing happened.

@maphysart

I met the same problem when trying to debug a CUDA C++ extension for PyTorch 1.1. Have you solved it? Thanks.

@andyljones

This is fixed in CUDA toolkit 10.1.243.

agontarek pushed a commit that referenced this issue Dec 3, 2022
A following patch adds an exists_non_stop_target call in the
target_terminal routines, and that surprisingly caused a weird
regression / GDB crash:

 $ make check RUNTESTFLAGS="--target_board=native-extended-gdbserver" TESTS="gdb.base/signest.exp"
 ...
 configuring for gdbserver local testing (extended-remote)
 Using src/gdb/testsuite/config/extended-gdbserver.exp as tool-and-target-specific interface file.
 Running src/gdb/testsuite/gdb.base/signest.exp ...
 ERROR: GDB process no longer exists

Debugging the core, we see infinite recursion:

 (top-gdb) bt 20
 #0  0x0000561d6a1bfeff in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #1  0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #2  0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #3  0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #4  0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #5  0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #6  0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #7  0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #8  0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #9  0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #10 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #11 0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #12 0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #13 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #14 0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #15 0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #16 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #17 0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #18 0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #19 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 (More stack frames follow...)

 (top-gdb) bt -30
 #157054 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #157055 0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #157056 0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #157057 0x0000561d6a1bfeb8 in get_frame_arch (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:2939
 #157058 0x0000561d6a1b989f in frame_unwind_find_by_frame (this_frame=0x561d6b19f9c0, this_cache=0x561d6b19f9d8) at src/gdb/frame-unwind.c:174
 #157059 0x0000561d6a1bff04 in frame_unwind_arch (next_frame=0x561d6b19f9c0) at src/gdb/frame.c:2950
 #157060 0x0000561d6a1bbc65 in frame_unwind_pc (this_frame=0x561d6b19f9c0) at src/gdb/frame.c:970
 #157061 0x0000561d6a1bf54c in get_frame_pc (frame=0x561d6b19fa90) at src/gdb/frame.c:2625
 #157062 0x0000561d6a1bf63e in get_frame_address_in_block (this_frame=0x561d6b19fa90) at src/gdb/frame.c:2655
 #157063 0x0000561d6a0cae7f in dwarf2_frame_cache (this_frame=0x561d6b19fa90, this_cache=0x561d6b19faa8) at src/gdb/dwarf2/frame.c:1010
 #157064 0x0000561d6a0cb928 in dwarf2_frame_this_id (this_frame=0x561d6b19fa90, this_cache=0x561d6b19faa8, this_id=0x561d6b19faf0) at src/gdb/dwarf2/frame.c:1227
 #157065 0x0000561d6a1baf72 in compute_frame_id (fi=0x561d6b19fa90) at src/gdb/frame.c:588
 #157066 0x0000561d6a1bb16e in get_frame_id (fi=0x561d6b19fa90) at src/gdb/frame.c:636
 #157067 0x0000561d6a1bb224 in get_stack_frame_id (next_frame=0x561d6b19fa90) at src/gdb/frame.c:650
 #157068 0x0000561d6a26ecd0 in insert_hp_step_resume_breakpoint_at_frame (return_frame=0x561d6b19fa90) at src/gdb/infrun.c:7809
 #157069 0x0000561d6a26b88a in handle_signal_stop (ecs=0x7ffc67022830) at src/gdb/infrun.c:6428
 #157070 0x0000561d6a269d81 in handle_inferior_event (ecs=0x7ffc67022830) at src/gdb/infrun.c:5741
 #157071 0x0000561d6a265bd0 in fetch_inferior_event () at src/gdb/infrun.c:4120
 #157072 0x0000561d6a244c24 in inferior_event_handler (event_type=INF_REG_EVENT) at src/gdb/inf-loop.c:41
 #157073 0x0000561d6a435cc4 in remote_async_serial_handler (scb=0x561d6b4a8990, context=0x561d6b4a4c48) at src/gdb/remote.c:14403
 #157074 0x0000561d6a460bc5 in run_async_handler_and_reschedule (scb=0x561d6b4a8990) at src/gdb/ser-base.c:138
 #157075 0x0000561d6a460cae in fd_event (error=0, context=0x561d6b4a8990) at src/gdb/ser-base.c:189
 #157076 0x0000561d6a76a191 in handle_file_event (file_ptr=0x561d6b233ae0, ready_mask=1) at src/gdbsupport/event-loop.cc:575
 #157077 0x0000561d6a76a743 in gdb_wait_for_event (block=1) at src/gdbsupport/event-loop.cc:701
 #157078 0x0000561d6a7694ee in gdb_do_one_event () at src/gdbsupport/event-loop.cc:237
 #157079 0x0000561d6a2df16b in start_event_loop () at src/gdb/main.c:421
 #157080 0x0000561d6a2df2b6 in captured_command_loop () at src/gdb/main.c:481
 #157081 0x0000561d6a2e0d16 in captured_main (data=0x7ffc67022bd0) at src/gdb/main.c:1353
 #157082 0x0000561d6a2e0da8 in gdb_main (args=0x7ffc67022bd0) at src/gdb/main.c:1370
 #157083 0x0000561d69eb3d82 in main (argc=13, argv=0x7ffc67022cf8, envp=0x7ffc67022d68) at src/gdb/gdb.c:33

This was caused by exists_non_stop_target flushing the frame cache via
scoped_restore_current_thread/switch_to_thread, while we're in the
middle of unwinding.

Fix this by making exists_non_stop_target only switch the inferior,
like done in target_pass_ctrlc.
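
In sketch form, the fixed helper looks roughly like this (illustrative
only, not the verbatim patch; it uses the helpers named in the
ChangeLog below):

  /* Sketch: switch only the inferior, never the thread, so the frame
     cache is not flushed while an unwind may be in progress.  */
  static bool
  exists_non_stop_target ()
  {
    if (target_is_non_stop_p ())
      return true;

    scoped_restore_current_inferior restore_inferior;
    for (inferior *inf : all_inferiors ())
      {
        set_current_inferior (inf);
        if (target_is_non_stop_p ())
          return true;
      }

    return false;
  }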

The annota1.exp change is necessary because we'd get a regression
otherwise:

  @@ -238,8 +238,6 @@ Continuing.

   \032\032breakpoints-invalid

  -\032\032frames-invalid
  -
   \032\032breakpoint 3

   Breakpoint 3,
  @@ -276,7 +274,7 @@ printf.c
   \032\032pre-prompt
   (gdb)
   \032\032prompt
  -PASS: gdb.base/annota1.exp: continue to printf
  +FAIL: gdb.base/annota1.exp: continue to printf

... because this patch avoids flushing the frame cache, which is what
led to that missing frames-invalid.  We still need to match frames-invalid
because against gdbserver + "maint set target non-stop on", some other
code path flushes the frame cache resulting in the annotation being
emitted anyway.

gdb/ChangeLog:
yyyy-mm-dd  Pedro Alves  <pedro@palves.net>

	* target.c (exists_non_stop_target): Use
	scoped_restore_current_inferior and set_current_inferior instead
	of scoped_restore_current_thread / switch_to_inferior_no_thread.

Change-Id: I8402483ee755e64e54d8b7c4a67c177557f569bd
agontarek pushed a commit that referenced this issue Dec 3, 2022
After loading a core file, you're supposed to be able to use "detach"
to unload the core file.  That unfortunately regressed starting with
GDB 11, with these commits:

 1192f12 - gdb: generalize commit_resume, avoid commit-resuming when threads have pending statuses
 408f668 - detach in all-stop with threads running

resulting in a GDB crash:

 ...
 Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
 0x0000555555e842bf in maybe_set_commit_resumed_all_targets () at ../../src/gdb/infrun.c:2899
 2899          if (proc_target->commit_resumed_state)
 (top-gdb) bt
 #0  0x0000555555e842bf in maybe_set_commit_resumed_all_targets () at ../../src/gdb/infrun.c:2899
 #1  0x0000555555e848bf in scoped_disable_commit_resumed::reset (this=0x7fffffffd440) at ../../src/gdb/infrun.c:3023
 #2  0x0000555555e84a0c in scoped_disable_commit_resumed::reset_and_commit (this=0x7fffffffd440) at ../../src/gdb/infrun.c:3049
 #3  0x0000555555e739cd in detach_command (args=0x0, from_tty=1) at ../../src/gdb/infcmd.c:2791
 #4  0x0000555555c0ba46 in do_simple_func (args=0x0, from_tty=1, c=0x55555662a600) at ../../src/gdb/cli/cli-decode.c:95
 #5  0x0000555555c112b0 in cmd_func (cmd=0x55555662a600, args=0x0, from_tty=1) at ../../src/gdb/cli/cli-decode.c:2514
 #6  0x0000555556173b1f in execute_command (p=0x5555565c5916 "", from_tty=1) at ../../src/gdb/top.c:699

The code that crashes looks like:

 static void
 maybe_set_commit_resumed_all_targets ()
 {
   scoped_restore_current_thread restore_thread;

   for (inferior *inf : all_non_exited_inferiors ())
     {
       process_stratum_target *proc_target = inf->process_target ();

       if (proc_target->commit_resumed_state)
           ^^^^^^^^^^^

With 'proc_target' above being null.  all_non_exited_inferiors filters
out inferiors that have pid==0.  We get here at the end of
detach_command, after core_target::detach has already run, at which
point the inferior _should_ have pid==0 and no process target.  It is
clear it no longer has a process target, but it still has a pid!=0
somehow.

The reason the inferior still has pid!=0, is that core_target::detach
just unpushes, and relies on core_target::close to actually do the
getting rid of the core and exiting the inferior.  The problem with
that is that detach_command grabs an extra strong reference to the
process stratum target, so the unpush_target inside
core_target::detach doesn't actually result in a call to
core_target::close.

Fix this by moving the cleanup of the core inferior to a shared
routine called by both core_target::close and core_target::detach.  We
still need to clean up the inferior from within core_target::close
because there are paths to it that want to get rid of the core without
going through detach, e.g., "core-file" -> "run".
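
In sketch form, the shape of the fix is roughly the following
(illustrative only, not the verbatim patch; the helper name and the
exact cleanup calls are assumed):

  /* Sketch: one cleanup routine, reached from both detach and close,
     so the inferior loses its pid even when the extra target
     reference held by detach_command delays close ().  The helper
     name is hypothetical.  */
  void
  core_target::clear_core ()
  {
    if (inferior_ptid != null_ptid)
      exit_inferior (current_inferior ());  /* pid goes back to 0 */
  }

  void
  core_target::detach (inferior *inf, int from_tty)
  {
    clear_core ();              /* don't rely on close () running now */
    inf->unpush_target (this);  /* may or may not reach close ()      */
  }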

This commit includes a new test added to gdb.base/corefile.exp to
cover the "core-file core" -> "detach" scenario.

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=29275

Change-Id: Ic42bdd03182166b19f598428b0dbc2ce6f67c893
agontarek pushed a commit that referenced this issue Dec 3, 2022
New in this version: add a dedicated test.

When I do this:

    $ ./gdb -nx --data-directory=data-directory -q \
        /bin/sleep \
	-ex "maint set target-non-stop on" \
	-ex "tar ext :1234" \
	-ex "set remote exec-file /bin/sleep" \
	-ex "run 1231 &" \
	-ex add-inferior \
	-ex "inferior 2"
    Reading symbols from /bin/sleep...
    (No debugging symbols found in /bin/sleep)
    Remote debugging using :1234
    Starting program: /bin/sleep 1231
    Reading /lib64/ld-linux-x86-64.so.2 from remote target...
    warning: File transfers from remote targets can be slow. Use "set sysroot" to access files locally instead.
    Reading /lib64/ld-linux-x86-64.so.2 from remote target...
    Reading /usr/lib/debug/.build-id/a6/7a1408f18db3576757eea210d07ba3fc560dff.debug from remote target...
    [New inferior 2]
    Added inferior 2 on connection 1 (extended-remote :1234)
    [Switching to inferior 2 [<null>] (<noexec>)]
    (gdb) Reading /lib/x86_64-linux-gnu/libc.so.6 from remote target...
    attach 3659848
    Attaching to process 3659848
    /home/smarchi/src/binutils-gdb/gdb/thread.c:85: internal-error: inferior_thread: Assertion `current_thread_ != nullptr' failed.

Note the "attach" command just above.  When doing it on the command-line
with a -ex switch, the bug doesn't trigger.

The internal error of GDB is actually caused by GDBserver crashing, and
the error recovery of GDB is not on point.  This patch aims to fix just
the GDBserver crash, not the GDB problem.

GDBserver crashes with a segfault here:

    (gdb) bt
    #0  0x00005555557fb3f4 in find_one_thread (ptid=...) at /home/smarchi/src/binutils-gdb/gdbserver/thread-db.cc:177
    #1  0x00005555557fd5cf in thread_db_thread_handle (ptid=<error reading variable: Cannot access memory at address 0xffffffffffffffa0>, handle=0x7fffffffc400, handle_len=0x7fffffffc3f0)
        at /home/smarchi/src/binutils-gdb/gdbserver/thread-db.cc:461
    #2  0x000055555578a0b6 in linux_process_target::thread_handle (this=0x5555558a64c0 <the_x86_target>, ptid=<error reading variable: Cannot access memory at address 0xffffffffffffffa0>, handle=0x7fffffffc400,
        handle_len=0x7fffffffc3f0) at /home/smarchi/src/binutils-gdb/gdbserver/linux-low.cc:6905
    #3  0x00005555556dfcc6 in handle_qxfer_threads_worker (thread=0x60b000000510, buffer=0x7fffffffc8a0) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:1645
    #4  0x00005555556e00e6 in operator() (__closure=0x7fffffffc5e0, thread=0x60b000000510) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:1696
    #5  0x00005555556f54be in for_each_thread<handle_qxfer_threads_proper(buffer*)::<lambda(thread_info*)> >(struct {...}) (func=...) at /home/smarchi/src/binutils-gdb/gdbserver/gdbthread.h:159
    #6  0x00005555556e0242 in handle_qxfer_threads_proper (buffer=0x7fffffffc8a0) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:1694
    #7  0x00005555556e04ba in handle_qxfer_threads (annex=0x629000000213 "", readbuf=0x621000019100 '\276' <repeats 200 times>..., writebuf=0x0, offset=0, len=4097)
        at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:1732
    #8  0x00005555556e1989 in handle_qxfer (own_buf=0x629000000200 "qXfer:threads", packet_len=26, new_packet_len_p=0x7fffffffd630) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:2045
    #9  0x00005555556e720a in handle_query (own_buf=0x629000000200 "qXfer:threads", packet_len=26, new_packet_len_p=0x7fffffffd630) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:2685
    #10 0x00005555556f1a01 in process_serial_event () at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:4176
    #11 0x00005555556f4457 in handle_serial_event (err=0, client_data=0x0) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:4514
    #12 0x0000555555820f56 in handle_file_event (file_ptr=0x607000000250, ready_mask=1) at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:573
    #13 0x0000555555821895 in gdb_wait_for_event (block=1) at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:694
    #14 0x000055555581f533 in gdb_do_one_event (mstimeout=-1) at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:264
    #15 0x00005555556ec9fb in start_event_loop () at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:3512
    #16 0x00005555556f0769 in captured_main (argc=4, argv=0x7fffffffe0d8) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:3992
    #17 0x00005555556f0e3f in main (argc=4, argv=0x7fffffffe0d8) at /home/smarchi/src/binutils-gdb/gdbserver/server.cc:4078

The reason is a wrong current process when find_one_thread is called.
The current process is the 2nd one, which was just attached.  It does
not yet have thread_db data (proc->priv->thread_db is nullptr).  As we
iterate over all threads of all processes to fulfill the qxfer:threads:read
request, we get to a thread of process 1 for which we haven't read
thread_db information yet (lwp_info::thread_known is false), so we get
into find_one_thread.  find_one_thread uses
`current_process ()->priv->thread_db`, assuming the current process
matches the ptid passed as a parameter, which is wrong.  A segfault
happens when trying to dereference that thread_db pointer.

Fix this by making find_one_thread not assume what the current process /
current thread is.  If it needs to call into libthread_db, which we know
will try to read memory from the current process, then temporarily set
the current process.

In the case where the thread is already known and we return early, we
don't need to switch processes.
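
The shape of that change, in sketch form (illustrative only; the
switching helpers are assumed to behave as their names suggest):

  /* Sketch: decide from PTID which process libthread_db will read
     memory from, instead of trusting the caller's current process.  */
  static int
  find_one_thread (ptid_t ptid)
  {
    thread_info *thread = find_thread_ptid (ptid);
    lwp_info *lwp = get_thread_lwp (thread);
    if (lwp->thread_known)
      return 1;                    /* already known: no switch needed */

    /* Temporarily make the thread's own process current, since
       libthread_db reads memory through the current process.  */
    scoped_restore_current_thread restore_thread;   /* assumed helper */
    switch_to_thread (thread);

    /* ... td_ta_map_lwp2thr () etc. now consult the right process.  */
    return 1;
  }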

Add a test to reproduce this specific situation.

Change-Id: I09b00883e8b73b7e5f89d0f47cb4e9c0f3d6caaa
Approved-By: Andrew Burgess <aburgess@redhat.com>
agontarek pushed a commit that referenced this issue Dec 3, 2022
New in this version:

 - Better comment in target_kill
 - Uncomment line in test to avoid hanging when exiting, when testing on
   native-extended-gdbserver

PR 28275 shows that doing a sequence of:

 - Run inferior in background (run &)
 - kill that inferior
 - Run again

We get into this assertion:

    /home/smarchi/src/binutils-gdb/gdb/target.c:2590: internal-error: target_wait: Assertion `!proc_target->commit_resumed_state' failed.

    #0  internal_error_loc (file=0x5606b344e740 "/home/smarchi/src/binutils-gdb/gdb/target.c", line=2590, fmt=0x5606b344d6a0 "%s: Assertion `%s' failed.") at /home/smarchi/src/binutils-gdb/gdbsupport/errors.cc:54
    #1  0x00005606b6296475 in target_wait (ptid=..., status=0x7fffb9390630, options=...) at /home/smarchi/src/binutils-gdb/gdb/target.c:2590
    #2  0x00005606b5767a98 in startup_inferior (proc_target=0x5606bfccb2a0 <the_amd64_linux_nat_target>, pid=3884857, ntraps=1, last_waitstatus=0x0, last_ptid=0x0) at /home/smarchi/src/binutils-gdb/gdb/nat/fork-inferior.c:482
    #3  0x00005606b4e6c9c5 in gdb_startup_inferior (pid=3884857, num_traps=1) at /home/smarchi/src/binutils-gdb/gdb/fork-child.c:132
    #4  0x00005606b50f14a5 in inf_ptrace_target::create_inferior (this=0x5606bfccb2a0 <the_amd64_linux_nat_target>, exec_file=0x604000039f50 "/home/smarchi/build/binutils-gdb/gdb/test", allargs="", env=0x61500000a580, from_tty=1)
        at /home/smarchi/src/binutils-gdb/gdb/inf-ptrace.c:105
    #5  0x00005606b53b6d23 in linux_nat_target::create_inferior (this=0x5606bfccb2a0 <the_amd64_linux_nat_target>, exec_file=0x604000039f50 "/home/smarchi/build/binutils-gdb/gdb/test", allargs="", env=0x61500000a580, from_tty=1)
        at /home/smarchi/src/binutils-gdb/gdb/linux-nat.c:978
    #6  0x00005606b512b79b in run_command_1 (args=0x0, from_tty=1, run_how=RUN_NORMAL) at /home/smarchi/src/binutils-gdb/gdb/infcmd.c:468
    #7  0x00005606b512c236 in run_command (args=0x0, from_tty=1) at /home/smarchi/src/binutils-gdb/gdb/infcmd.c:526

When running the kill command, commit_resumed_state for the
process_stratum_target (linux-nat, here) is true.  After the kill, when
there are no more threads, commit_resumed_state is still true, as
nothing touches this flag during the kill operation.  During the
subsequent run command, run_command_1 does:

    scoped_disable_commit_resumed disable_commit_resumed ("running");

We would think that this would clear the commit_resumed_state flag of
our native target, but that's not the case, because
scoped_disable_commit_resumed iterates on non-exited inferiors in order
to find active process targets.  And after the kill, the inferior is
exited, and the native target was unpushed from it anyway.  So
scoped_disable_commit_resumed doesn't touch the commit_resumed_state
flag of the native target, it stays true.  When reaching target_wait, in
startup_inferior (to consume the initial expected stop events while the
inferior is starting up and working its way through the shell),
commit_resumed_state is true, breaking the contract saying that
commit_resumed_state is always false when calling the targets' wait
method.

(note: to be correct, I think that startup_inferior should toggle
commit_resumed between the target_wait and target_resume calls, but I'll
ignore that for now)

I can see multiple ways to fix this.  In the end, we need
commit_resumed_state to be cleared by the time we get to that
target_wait.  It could be done at the end of the kill command, or at the
beginning of the run command.

To keep things in a coherent state, I'd like to make it so that after
the kill command, when the target is left with no threads, its
commit_resumed_state flag is left to false.  This way, we can keep
working with the assumption that a target with no threads (and therefore
no running threads) has commit_resumed_state == false.

Do this by adding a scoped_disable_commit_resumed in target_kill.  It
clears the target's commit_resumed_state on entry, and leaves it false
if the target does not have any resumed thread on exit.  That means,
even if the target has another inferior with stopped threads,
commit_resumed_state will be left to false, which makes sense.
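
In sketch form, the change amounts to roughly this (illustrative only,
not the verbatim patch):

  /* Sketch: wrap the kill in the same guard run_command_1 uses.  On
     entry it forces commit-resumed off; on exit it only re-enables it
     for targets that still have resumed threads, so a target whose
     last thread was just killed ends up with
     commit_resumed_state == false.  */
  void
  target_kill ()
  {
    scoped_disable_commit_resumed disable_commit_resumed ("killing");
    current_inferior ()->top_target ()->kill ();
  }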

Add a test that tries to cover various combinations of actions done
while an inferior is running (and therefore while commit_resumed_state
is true on the process target).

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=28275
Change-Id: I8e6fe6dc1f475055921520e58cab68024039a1e9
Approved-By: Andrew Burgess <aburgess@redhat.com>
agontarek pushed a commit that referenced this issue Dec 3, 2022
I noticed that after a following patch ("Step over clone syscall w/
breakpoint, TARGET_WAITKIND_THREAD_CLONED"), the
gdb.threads/step-over-exec.exp was passing cleanly, but still, we'd
end up with four new unexpected GDB core dumps:

		 === gdb Summary ===

 # of unexpected core files      4
 # of expected passes            48

That said patch is making the pre-existing
gdb.threads/step-over-exec.exp testcase (almost silently) expose a
latent problem in gdb/linux-nat.c, resulting in a GDB crash when:

 #1 - a non-leader thread execs
 #2 - the post-exec program stops somewhere
 #3 - you kill the inferior

Instead of #3 directly, the testcase just returns, which ends up in
gdb_exit, tearing down GDB, which kills the inferior, and is thus
equivalent to #3 above.

Vis:

 $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
 ...
 (top-gdb) r
 ...
 (gdb) b main
 ...
 (gdb) r
 ...
 Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
 69        argv0 = argv[0];
 (gdb) c
 Continuing.
 [New Thread 0x7ffff7d89700 (LWP 2506975)]
 Other going in exec.
 Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
 process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd

 Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
 28        foo ();
 (gdb) k
 ...
 Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
 0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
 393         return m_suspend.waitstatus_pending_p;
 (top-gdb) bt
 #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
 #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
 #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
 #3  0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
 #4  0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
 #5  0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
 #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
 #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
 #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
 ...

The root of the problem is that when a non-leader LWP execs, it just
changes its tid to the tgid, replacing the pre-exec leader thread,
becoming the new leader.  There's no thread exit event for the execing
thread.  It's as if the old pre-exec LWP vanishes without trace.  The
ptrace man page says:

"PTRACE_O_TRACEEXEC (since Linux 2.5.46)
	Stop the tracee at the next execve(2).  A waitpid(2) by the
	tracer will return a status value such that

	  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))

	If the execing thread is not a thread group leader, the thread
	ID is reset to thread group leader's ID before this stop.
	Since Linux 3.0, the former thread ID can be retrieved with
	PTRACE_GETEVENTMSG."

When the core of GDB processes an exec event, it deletes all the
threads of the inferior.  But that is too late -- deleting a thread
does not delete the corresponding LWP, so we end up leaving the
pre-exec non-leader LWP stale in the LWP list.  That's what leads to
the crash above -- linux_nat_target::kill iterates over all LWPs, and
after the patch in question, that code will look for the corresponding
thread_info for each LWP.  For the pre-exec non-leader LWP still
listed, it won't find one.

This patch fixes it by deleting the pre-exec non-leader LWP (and
thread) from the LWP/thread lists as soon as we get an exec event out
of ptrace.
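
A rough sketch of what that looks like in linux-nat's extended-event
handling (illustrative only; the iteration and deletion helpers are
assumed, not quoted from the patch):

  /* Sketch: on PTRACE_EVENT_EXEC, the kernel has already folded the
     execing non-leader into the leader's thread id, so any other LWP
     of this process is gone and will never report an exit event --
     drop those entries (and their thread_infos) right away.  */
  if (event == PTRACE_EVENT_EXEC)
    {
      for (lwp_info *other : all_lwps_safe ())   /* assumed iterator */
        if (other != lp && other->ptid.pid () == lp->ptid.pid ())
          exit_lwp (other);                      /* assumed helper   */
    }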

GDBserver does not need an equivalent fix, because it is already doing
this, as side effect of mourning the pre-exec process, in
gdbserver/linux-low.cc:

  else if (event == PTRACE_EVENT_EXEC && cs.report_exec_events)
    {
...
      /* Delete the execing process and all its threads.  */
      mourn (proc);
      switch_to_thread (nullptr);

Change-Id: I21ec18072c7750f3a972160ae6b9e46590376643
agontarek pushed a commit that referenced this issue Dec 3, 2022
Running the
gdb.threads/step-over-thread-exit-while-stop-all-threads.exp testcase
added later in the series against gdbserver, after the
TARGET_WAITKIND_NO_RESUMED fix from the following patch, would run
into an infinite loop in stop_all_threads, leading to a timeout:

  FAIL: gdb.threads/step-over-thread-exit-while-stop-all-threads.exp: displaced-stepping=off: target-non-stop=on: iter 0: continue (timeout)

This is really a latent bug, and it is about the fact that
stop_all_threads stops listening to events from a target as soon as it
sees a TARGET_WAITKIND_NO_RESUMED, ignoring that
TARGET_WAITKIND_NO_RESUMED may be delayed.  handle_no_resumed knows
how to handle delayed no-resumed events, but stop_all_threads was
never taught to.

In more detail, here's what happens with that testcase:

#1 - Multiple threads report breakpoint hits to gdb.

#2 - gdb picks one event, and it's for thread 1.  All other stops are
     left pending.  Thread 1 needs to move past a breakpoint, so gdb
     stops all threads to start an inline step over for thread 1.
     While stopping threads, some of the threads that were still
     running report events that are also left pending.

#3 - gdb steps thread 1

#4 - Thread 1 exits while stepping (it steps over an exit syscall),
     gdbserver reports thread exit for thread 1

#5 - Thread 1 was the last resumed thread, so gdbserver also reports
     no-resumed:

    [remote]   Notification received: Stop:w0;p3445d0.3445d3
    [remote] Sending packet: $vStopped#55
    [remote] Packet received: N
    [remote] Sending packet: $vStopped#55
    [remote] Packet received: OK

#6 - gdb processes the thread exit for thread 1, finishes the step
     over and restarts threads.

#7 - gdb picks the next event to process out of one of the resumed
     threads with pending events:

    [infrun] random_resumed_with_pending_wait_status: Found 32 events, selecting #11

#8 - This is again a breakpoint hit and the breakpoint needs to be
     stepped over too, so gdb starts a step-over dance again.

#9 - We reach stop_all_threads, which finds that some threads need to
     be stopped.

#10 - wait_one finally consumes the no-resumed event queued by #5.
      Seeing this, wait_one disables target async, to stop listening
      for events out of the remote target.

#11 - We still haven't seen all the stops expected, so
      stop_all_threads tries another iteration.

#12 - Because the remote target is no longer async, and there are no
      other targets, wait_one returns no-resumed immediately without
      polling the remote target.

#13 - We still haven't seen all the stops expected, so
      stop_all_threads tries another iteration.  goto #12, looping
      forever.

Fix this by explicitly enabling/re-enabling target async on targets
that can async, before waiting for stops.
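
In sketch form, stop_all_threads gains something like this before each
pass that waits for stops (illustrative only; helper names assumed,
not quoted from the patch):

  /* Sketch: make sure every process target that can do async is
     actually listening again before waiting for more stops; a delayed
     no-resumed event in a previous pass may have turned async off.  */
  for (process_stratum_target *target : all_non_exited_process_targets ())
    {
      switch_to_target_no_thread (target);
      if (target_can_async_p ())
        target_async (true);
    }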

Change-Id: Ie3ffb0df89635585a6631aa842689cecc989e33f
agontarek pushed a commit that referenced this issue Dec 3, 2022
… gdb/23377)

This fixes a gdb.base/multi-forks.exp regression with GDBserver.

Git commit f2ffa92 ("gdb: Eliminate the 'stop_pc' global") caused
the regression by exposing a latent bug in gdbserver.

The bug is that GDBserver's implementation of the D;PID packet
incorrectly assumes that the selected thread points to the process
being detached.  This happens via the any_persistent_commands call,
which calls current_process:

  (gdb) bt
  #0  0x000000000040a57e in internal_error(char const*, int, char const*, ...)
  (file=0x4a53c0 "src/gdb/gdbserver/inferiors.c", line=212, fmt=0x4a539e "%s:
  Assertion `%s' failed.") at src/gdb/gdbserver/../common/errors.c:54
  #1  0x0000000000420acf in current_process() () at
  src/gdb/gdbserver/inferiors.c:212
  #2  0x00000000004226a0 in any_persistent_commands() () at
  gdb/gdbserver/mem-break.c:308
  #3  0x000000000042cb43 in handle_detach(char*) (own_buf=0x6f0280 "D;62ea") at
  src/gdb/gdbserver/server.c:1210
  #4  0x0000000000433af3 in process_serial_event() () at
  src/gdb/gdbserver/server.c:4055
  #5  0x0000000000434878 in handle_serial_event(int, void*) (err=0,
  client_data=0x0)

The "eliminate stop_pc" commit exposes the problem because before that
commit, GDB's switch_to_thread always read the newly-selected thread's
PC, and that would end up forcing GDBserver's selected thread to
change accordingly as side effect.  After that commit, GDB no longer
reads the thread's PC, and GDBserver does not switch the thread.

Fix this by removing the assumption from GDBserver.
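
In sketch form, the GDBserver side of the fix looks roughly like this
(illustrative only; it follows the ChangeLog below rather than quoting
the patch):

  static void
  handle_detach (char *own_buf)
  {
    /* "D;pid" -- the pid to detach arrives in the packet, in hex.
       Look the process up explicitly instead of trusting whatever
       thread happens to be selected.  */
    int pid = strtol (&own_buf[2], NULL, 16);
    process_info *process = find_process_pid (pid);

    if (process == NULL)
      {
        write_enn (own_buf);   /* unknown process: error back to GDB */
        return;
      }

    /* From here on, any_persistent_commands (process) and the rest of
       the detach path work on PROCESS rather than on whatever process
       happens to be current.  */
  }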

gdb/gdbserver/ChangeLog:
2018-07-11  Pedro Alves  <palves@redhat.com>

	PR gdb/23377
	* mem-break.c (any_persistent_commands): Add process_info
	parameter and use it instead of relying on the current process.
	Change return type to bool.
	* mem-break.h (any_persistent_commands): Add process_info
	parameter and change return type to bool.
	* server.c (handle_detach): Remove require_running_or_return call.
	Look up the process_info for the process we're about to detach.
	If not found, return back error to GDB.  Adjust
	any_persistent_commands call to pass down a process pointer.
agontarek pushed a commit that referenced this issue Dec 3, 2022
…b/23379)

This commit fixes an 8.1->8.2 regression exposed by
gdb.python/py-evthreads.exp when testing with
--target_board=native-gdbserver.

gdb.log shows:

  src/gdb/thread.c:93: internal-error: thread_info* inferior_thread(): Assertion `tp' failed.
  A problem internal to GDB has been detected,
  further debugging may prove unreliable.
  Quit this debugging session? (y or n) FAIL: gdb.python/py-evthreads.exp: run to breakpoint 1 (GDB internal error)

A backtrace shows (frames #2 and #10 highlighted) that the assertion
fails when GDB is setting up the connection to the remote target, in
non-stop mode:

  #0  0x0000000000622ff0 in internal_error(char const*, int, char const*, ...) (file=0xc1ad98 "src/gdb/thread.c", line=93, fmt=0xc1ad20 "%s: Assertion `%s' failed.") at src/gdb/common/errors.c:54
  #1  0x000000000089567e in inferior_thread() () at src/gdb/thread.c:93
= #2  0x00000000004da91d in get_event_thread() () at src/gdb/python/py-threadevent.c:38
  #3  0x00000000004da9b7 in create_thread_event_object(_typeobject*, _object*) (py_type=0x11574c0 <continue_event_object_type>, thread=0x0)
      at src/gdb/python/py-threadevent.c:60
  #4  0x00000000004bf6fe in create_continue_event_object() () at src/gdb/python/py-continueevent.c:27
  #5  0x00000000004bf738 in emit_continue_event(ptid_t) (ptid=...) at src/gdb/python/py-continueevent.c:40
  #6  0x00000000004c7d47 in python_on_resume(ptid_t) (ptid=...) at src/gdb/python/py-inferior.c:108
  #7  0x0000000000485bfb in std::_Function_handler<void (ptid_t), void (*)(ptid_t)>::_M_invoke(std::_Any_data const&, ptid_t&&) (__functor=..., __args#0=...) at /usr/include/c++/7/bits/std_function.h:316
  #8  0x000000000089b416 in std::function<void (ptid_t)>::operator()(ptid_t) const (this=0x12aa600, __args#0=...)
      at /usr/include/c++/7/bits/std_function.h:706
  #9  0x000000000089aa0e in gdb::observers::observable<ptid_t>::notify(ptid_t) const (this=0x118a7a0 <gdb::observers::target_resumed>, args#0=...)
      at src/gdb/common/observable.h:106
= #10 0x0000000000896fbe in set_running(ptid_t, int) (ptid=..., running=1) at src/gdb/thread.c:880
  #11 0x00000000007f750f in remote_target::remote_add_thread(ptid_t, bool, bool) (this=0x12c5440, ptid=..., running=true, executing=true) at src/gdb/remote.c:2434
  #12 0x00000000007f779d in remote_target::remote_notice_new_inferior(ptid_t, int) (this=0x12c5440, currthread=..., executing=1)
      at src/gdb/remote.c:2515
  #13 0x00000000007f9c44 in remote_target::update_thread_list() (this=0x12c5440) at src/gdb/remote.c:3831
  #14 0x00000000007fb922 in remote_target::start_remote(int, int) (this=0x12c5440, from_tty=0, extended_p=0)
      at src/gdb/remote.c:4655
  #15 0x00000000007fd102 in remote_target::open_1(char const*, int, int) (name=0x1a4f45e "localhost:2346", from_tty=0, extended_p=0)
      at src/gdb/remote.c:5638
  #16 0x00000000007fbec1 in remote_target::open(char const*, int) (name=0x1a4f45e "localhost:2346", from_tty=0)
      at src/gdb/remote.c:4862

So on frame #10, we're marking a newly-discovered thread as running,
and that causes the Python API to emit a gdb.ContinueEvent.
gdb.ContinueEvent is a gdb.ThreadEvent, and as such includes the event
thread as the "inferior_thread" attribute.  The problem is that when
we get to frame #3/#4, we have lost all references to the thread that is
being marked as running.  create_continue_event_object assumes that it
is the current thread, which is not true in this case.

Fix this by passing down the right thread in
create_continue_event_object.  Also remove
create_thread_event_object's default argument and have the only other
caller left pass down the right thread explicitly too.
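
In sketch form, the fix threads the ptid down roughly like this
(illustrative only; it follows the ChangeLog below rather than quoting
the patch):

  gdbpy_ref<>
  create_continue_event_object (ptid_t ptid)
  {
    /* Find the right thread from PTID instead of assuming the current
       thread; py_get_event_thread is the renamed helper described in
       the ChangeLog.  */
    gdbpy_ref<> py_thr = py_get_event_thread (ptid);
    if (py_thr == nullptr)
      return nullptr;

    return create_thread_event_object (&continue_event_object_type,
                                       py_thr.get ());
  }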

gdb/ChangeLog:
2018-08-24  Pedro Alves  <palves@redhat.com>
	    Simon Marchi  <simon.marchi@ericsson.com>

	PR gdb/23379
	* python/py-continueevent.c: Include "gdbthread.h".
	(create_continue_event_object): Add intro comment.  Add 'ptid'
	parameter.  Use it to find thread to pass to
	create_thread_event_object.
	(emit_continue_event): Pass PTID down to
	create_continue_event_object.
	* python/py-event.h (py_get_event_thread): Declare.
	(create_thread_event_object): Remove default from 'thread'
	parameter.
	* python/py-stopevent.c (create_stop_event_object): Use
	py_get_event_thread.
	* python/py-threadevent.c (get_event_thread): Rename to ...
	(py_get_event_thread): ... this, make extern, add 'ptid' parameter
	and use it to find the thread.
	(create_thread_event_object): Assert that THREAD isn't null.
	Don't find the event thread here.
agontarek pushed a commit that referenced this issue Mar 13, 2024
This commit fixes an issue that was discovered while writing the tests
for the previous commit.

I noticed that, when GDB restarts an inferior, the executable_changed
event would trigger twice.  The first notification would originate
from:

  #0  exec_file_attach (filename=0x4046680 "/tmp/hello.x", from_tty=0) at ../../src/gdb/exec.c:513
  #1  0x00000000006f3adb in reopen_exec_file () at ../../src/gdb/corefile.c:122
  #2  0x0000000000e6a3f2 in generic_mourn_inferior () at ../../src/gdb/target.c:3682
  #3  0x0000000000995121 in inf_child_target::mourn_inferior (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/inf-child.c:192
  #4  0x0000000000995cff in inf_ptrace_target::mourn_inferior (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/inf-ptrace.c:125
  #5  0x0000000000a32472 in linux_nat_target::mourn_inferior (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3609
  #6  0x0000000000e68a40 in target_mourn_inferior (ptid=...) at ../../src/gdb/target.c:2761
  #7  0x0000000000a323ec in linux_nat_target::kill (this=0x2fe95c0 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3593
  #8  0x0000000000e64d1c in target_kill () at ../../src/gdb/target.c:924
  #9  0x00000000009a19bc in kill_if_already_running (from_tty=1) at ../../src/gdb/infcmd.c:328
  #10 0x00000000009a1a6f in run_command_1 (args=0x0, from_tty=1, run_how=RUN_STOP_AT_MAIN) at ../../src/gdb/infcmd.c:381
  #11 0x00000000009a20a5 in start_command (args=0x0, from_tty=1) at ../../src/gdb/infcmd.c:527
  #12 0x000000000068dc5d in do_simple_func (args=0x0, from_tty=1, c=0x35c7200) at ../../src/gdb/cli/cli-decode.c:95

While the second originates from:

  #0  exec_file_attach (filename=0x3d7a1d0 "/tmp/hello.x", from_tty=0) at ../../src/gdb/exec.c:513
  #1  0x0000000000dfe525 in reread_symbols (from_tty=1) at ../../src/gdb/symfile.c:2517
  #2  0x00000000009a1a98 in run_command_1 (args=0x0, from_tty=1, run_how=RUN_STOP_AT_MAIN) at ../../src/gdb/infcmd.c:398
  #3  0x00000000009a20a5 in start_command (args=0x0, from_tty=1) at ../../src/gdb/infcmd.c:527
  #4  0x000000000068dc5d in do_simple_func (args=0x0, from_tty=1, c=0x35c7200) at ../../src/gdb/cli/cli-decode.c:95

In the first case the call to exec_file_attach first passes through
reopen_exec_file.  The reopen_exec_file function performs a modification time
check on the executable file, and only calls exec_file_attach if the
executable has changed on disk since it was last loaded.

However, in the second case things work a little differently.  In this
case GDB is really trying to reread the debug symbol.  As such, we
iterate over the objfiles list, and for each of those we check the
modification time, if the file on disk has changed then we reload the
debug symbols from that file.

However, there is an additional check: if the objfile has the same
name as the executable, then we will call exec_file_attach, but we do
so without checking the cached modification time that indicates when
the executable was last reloaded; as a result, we reload the
executable twice.

In this commit I propose that reread_symbols be changed to
unconditionally call reopen_exec_file before performing the objfile
iteration.  This will ensure that, if the executable has changed, then
the executable will be reloaded; however, if the executable has
already been recently reloaded, we will not reload it for a second
time.

After handling the executable, GDB can then iterate over the objfiles
list and reload them in the normal way.
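
In sketch form (illustrative only, not the verbatim patch):

  void
  reread_symbols (int from_tty)
  {
    /* Check the executable exactly once, through the path that
       honours the cached modification time; this is the check the
       run_command_1 path previously bypassed.  */
    reopen_exec_file ();

    /* Then reread debug symbols for any objfile that changed on disk,
       comparing each objfile's recorded mtime as before.  */
    for (objfile *objfile : current_program_space->objfiles ())
      {
        /* ... unchanged per-objfile mtime check and re-read ...  */
      }
  }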

With this done I now see the executable reloaded only once when GDB
restarts an inferior, which means I can remove the kfail that I added
to the gdb.python/py-exec-file.exp test in the previous commit.

Approved-By: Tom Tromey <tom@tromey.com>
agontarek pushed a commit that referenced this issue Mar 13, 2024
It was pointed out on the mailing list that a recently added
test (gdb.python/py-progspace-events.exp) was failing when run with
the native-extended-gdbserver board.  This test was added with this
commit:

  commit 59912fb
  Date:   Tue Sep 19 11:45:36 2023 +0100

      gdb: add Python events for program space addition and removal

It turns out, though, that the test is failing due to an existing bug
in GDB; the new test just exposes the problem.  Additionally, the
failure really doesn't even rely on the new functionality added in the
above commit.  I reduced the test to a simple set of steps that
reproduced the failure and tested against GDB 13, and the test passes;
so the bug was introduced since then.  In fact, the bug was introduced
with this commit:

  commit a282736
  Date:   Fri Sep 8 15:48:16 2023 +0100

      gdb: remove final user of the executable_changed observer

This commit changed how the per-inferior auxv data cache is managed,
specifically, when the cache is cleared, and it is this that leads to
the failure.

This bug is interesting because it exposes a number of issues with
GDB.  I'll explain all of the problems I see, though ultimately I only
propose fixing one problem in this commit, which is enough to resolve
the crash we are currently seeing.

The crash that we are seeing manifests like this:

  ...
  [Inferior 2 (process 3970384) exited normally]
  +inferior 1
  [Switching to inferior 1 [process 3970383] (/tmp/build/gdb/testsuite/outputs/gdb.python/py-progspace-events/py-progspace-events)]
  [Switching to thread 1.1 (Thread 3970383.3970383)]
  #0  breakpt () at /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.python/py-progspace-events.c:28
  28	{ /* Nothing.  */ }
  (gdb) step
  +step
  terminate called after throwing an instance of 'gdb_exception_error'

  Fatal signal: Aborted
  ... etc ...

What's happening is that GDB attempts to refill the auxv cache as a
result of the gdbarch_has_shared_address_space call in
program_space::~program_space, the backtrace looks like this:

  #0  0x00007fb4f419a9a5 in raise () from /lib64/libpthread.so.0
  #1  0x00000000008b635d in handle_fatal_signal (sig=6) at ../../src/gdb/event-top.c:912
  #2  <signal handler called>
  #3  0x00007fb4f38e3625 in raise () from /lib64/libc.so.6
  #4  0x00007fb4f38cc8d9 in abort () from /lib64/libc.so.6
  #5  0x00007fb4f3c70756 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
  #6  0x00007fb4f3c7c6dc in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
  #7  0x00007fb4f3c7b6e9 in __cxa_call_terminate () from /lib64/libstdc++.so.6
  #8  0x00007fb4f3c7c094 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
  #9  0x00007fb4f3a80c63 in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
  #10 0x00007fb4f3a8154e in _Unwind_Resume () from /lib64/libgcc_s.so.1
  #11 0x0000000000e8832d in target_read_alloc_1<unsigned char> (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2266
  #12 0x0000000000e73dea in target_read_alloc (ops=0x408a3a0, object=TARGET_OBJECT_AUXV, annex=0x0) at ../../src/gdb/target.c:2315
  #13 0x000000000058248c in target_read_auxv_raw (ops=0x408a3a0) at ../../src/gdb/auxv.c:379
  #14 0x000000000058243d in target_read_auxv () at ../../src/gdb/auxv.c:368
  #15 0x000000000058255c in target_auxv_search (match=0x0, valp=0x7ffdee17c598) at ../../src/gdb/auxv.c:415
  #16 0x0000000000a464bb in linux_is_uclinux () at ../../src/gdb/linux-tdep.c:433
  #17 0x0000000000a464f6 in linux_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/linux-tdep.c:440
  #18 0x0000000000510eae in gdbarch_has_shared_address_space (gdbarch=0x409a2d0) at ../../src/gdb/gdbarch.c:4889
  #19 0x0000000000bc7558 in program_space::~program_space (this=0x4544aa0, __in_chrg=<optimized out>) at ../../src/gdb/progspace.c:124
  #20 0x00000000009b245d in delete_inferior (inf=0x47b3de0) at ../../src/gdb/inferior.c:290
  #21 0x00000000009b2c10 in prune_inferiors () at ../../src/gdb/inferior.c:480
  #22 0x00000000009c5e3e in fetch_inferior_event () at ../../src/gdb/infrun.c:4558
  #23 0x000000000099b4dc in inferior_event_handler (event_type=INF_REG_EVENT) at ../../src/gdb/inf-loop.c:42
  #24 0x0000000000cbc64f in remote_async_serial_handler (scb=0x4090a30, context=0x408a6b0) at ../../src/gdb/remote.c:14859
  #25 0x0000000000d83d3a in run_async_handler_and_reschedule (scb=0x4090a30) at ../../src/gdb/ser-base.c:138
  #26 0x0000000000d83e1f in fd_event (error=0, context=0x4090a30) at ../../src/gdb/ser-base.c:189

So this is problem #1, if we throw an exception while deleting a
program_space then this is not caught, and is going to crash GDB.

Problem #2 becomes evident when we ask why GDB is throwing an error in
this case; the error is thrown because the remote target, operating in
non-async mode, can't read the auxv data while an inferior is running
and GDB is waiting for a stop reply.  The problem here then, is why
does GDB get into a position where it tries to interact with the
remote target in this way, at this time?  The problem is caused by the
prune_inferiors call which can be seen in the above backtrace.

In prune_inferiors we check if the inferior is deletable, and if it
is, we delete it.  The problem is, I think, we should also check if
the target is currently in a state that would allow us to delete the
inferior.  We don't currently have such a check available, we'd need
to add one, but for the remote target, this would return false if the
remote is in async mode and the remote is currently waiting for a stop
reply.  With this change in place GDB would defer deleting the
inferior until the remote target has stopped, at which point GDB would
be able to refill the auxv cache successfully.

And then, problem #3 becomes evident when we ask why GDB is needing to
refill the auxv cache now when it didn't need to for GDB 13.  This is
where the second commit mentioned above (a282736) comes in.
Prior to this commit, the auxv cache was cleared by the
executable_changed observer, while after that commit the auxv cache
was cleared by the new_objfile observer -- but only when the
new_objfile observer is used in the special mode that actually means
that all objfiles have been unloaded (I know, the overloading of the
new_objfile observer is horrible, and unnecessary, but it's not really
important for this bug).

The difference arises because the new_objfile observer is triggered
from clear_symtab_users, which in turn is called from
program_space::~program_space.  The new_objfile observer for auxv does
this:

  static void
  auxv_new_objfile_observer (struct objfile *objfile)
  {
    if (objfile == nullptr)
      invalidate_auxv_cache_inf (current_inferior ());
  }

That is, when all the objfiles are unloaded, we clear the auxv cache
for the current inferior.

The problem, then, is that when we look at the prune_inferiors ->
delete_inferior -> ~program_space path, we see that the current
inferior is not going to be an inferior that exists within the
program_space being deleted; delete_inferior removes the deleted
inferior from the global inferior list, and then only deletes the
program_space if program_space::empty() returns true, which is only
the case if the current inferior isn't within the program_space to
delete, and no other inferior exists within that program_space
either.

What this means is that when the new_objfile observer is called we
can't rely on the current inferior having any relationship with the
program space in which the objfiles were removed.  This was an error
in the commit a282736, the only thing we can rely on is the
current program space.  As a result of this mistake, after commit
a282736, GDB was sometimes clearing the auxv cache for a random
inferior.  In the native target case this was harmless as we can
always refill the cache when needed, but in the remote target case, if
we need to refill the cache when the remote target is executing, then
we get the crash we observed.

And additionally, if we think about this a little more, we see that
commit a282736 made another mistake.  When all the objfiles are
removed, they are removed from a program_space; a program_space might
contain multiple inferiors, so surely we should clear the auxv cache
for all of the matching inferiors?

Given these two insights, that the current_inferior is not relevant,
only the current_program_space, and that we should be clearing the
cache for all inferiors in the current_program_space, we can update
auxv_new_objfile_observer to:

  if (objfile == nullptr)
    {
      for (inferior *inf : all_inferiors ())
	{
	  if (inf->pspace == current_program_space)
	    invalidate_auxv_cache_inf (inf);
	}
    }

With this change we now correctly clear the auxv cache for the correct
inferiors, and GDB no longer needs to refill the cache at an
inconvenient time, this avoids the crash we were seeing.

And finally, we reach problem #4.  Inspired by the observation that
using the current_inferior from within the ~program_space function was
not correct, I added some debug to see if current_inferior() was
called anywhere else (below ~program_space), and the answer is yes,
it's called often.  Mostly the culprit is GDB doing:

  current_inferior ()->top_target ()-> ....

But I think all of these calls are most likely doing the wrong thing,
and only work because the top target in all these cases is shared
between all inferiors, e.g. it's the native target, or the remote
target for all inferiors.  But if we had a truly multi-connection
setup, then we might start to see odd behaviour.

Problem #1 I'm just ignoring for now; I guess at some point we might
run into this again, and then we'd need to solve it.  But in this
case I wasn't sure what a "good" solution would look like.  We need
the auxv data in order to implement the linux_is_uclinux() function.
If we can't get the auxv data then what should we do, assume yes, or
assume no?  The right answer would probably be to propagate the error
back up the stack, but then we reach ~program_space, and throwing
exceptions from a destructor is problematic, so we'd need to catch and
deal at this point.  The linux_is_uclinux() call is made from within
gdbarch_has_shared_address_space(), which is used like:

  if (!gdbarch_has_shared_address_space (target_gdbarch ()))
    delete this->aspace;

So, we would have to choose; delete the address space or not.  If we
delete it on error, then we might delete an address space that is
shared within another program space.  If we don't delete the address
space, then we might leak it.  Neither choice is great.

A better solution might be to have the address spaces be reference
counted, then we could remove the gdbarch_has_shared_address_space
call completely, and just rely on the reference count to auto-delete
the address space when appropriate.

The solution for problem #2 I already hinted at above, we should have
a new target_can_delete_inferiors() call, which should be called from
prune_inferiors, this would prevent GDB from trying to delete
inferiors when a (remote) target is in a state where we know it can't
delete the inferior.  Deleting an inferior often (always?) requires
sending packets to the remote, and if the remote is waiting for a stop
reply then this will never work, so the pruning should be deferred in
this case.

The solution for problem #3 is included in this commit.

And, for problem #4, I'm not sure what the right solution is.  Maybe
delete_inferior should ensure the inferior to be deleted is in place
when ~program_space is called?  But that seems a little weird, as the
current inferior would, in theory, still be using the current
program_space...

Anyway, after this commit, the gdb.python/py-progspace-events.exp test
now passes when run with the native-extended-remote board.

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30935
Approved-By: Simon Marchi <simon.marchi@efficios.com>
Change-Id: I41f0e6e2d7ecc1e5e55ec170f37acd4052f46eaf
agontarek pushed a commit that referenced this issue Mar 13, 2024
I noticed that on an Ubuntu 20.04 system, after a following patch
("Step over clone syscall w/ breakpoint,
TARGET_WAITKIND_THREAD_CLONED"), the gdb.threads/step-over-exec.exp
was passing cleanly, but still, we'd end up with four new unexpected
GDB core dumps:

		 === gdb Summary ===

 # of unexpected core files      4
 # of expected passes            48

That said patch is making the pre-existing
gdb.threads/step-over-exec.exp testcase (almost silently) expose a
latent problem in gdb/linux-nat.c, resulting in a GDB crash when:

 #1 - a non-leader thread execs
 #2 - the post-exec program stops somewhere
 #3 - you kill the inferior

Instead of #3 directly, the testcase just returns, which ends up in
gdb_exit, tearing down GDB, which kills the inferior, and is thus
equivalent to #3 above.

Vis (after said patch is applied):

 $ gdb --args ./gdb /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true
 ...
 (top-gdb) r
 ...
 (gdb) b main
 ...
 (gdb) r
 ...
 Breakpoint 1, main (argc=1, argv=0x7fffffffdb88) at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec.c:69
 69        argv0 = argv[0];
 (gdb) c
 Continuing.
 [New Thread 0x7ffff7d89700 (LWP 2506975)]
 Other going in exec.
 Exec-ing /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd
 process 2506769 is executing new program: /home/pedro/gdb/build/gdb/testsuite/outputs/gdb.threads/step-over-exec/step-over-exec-execr-thread-other-diff-text-segs-true-execd

 Thread 1 "step-over-exec-" hit Breakpoint 1, main () at /home/pedro/gdb/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/step-over-exec-execd.c:28
 28        foo ();
 (gdb) k
 ...
 Thread 1 "gdb" received signal SIGSEGV, Segmentation fault.
 0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
 393         return m_suspend.waitstatus_pending_p;
 (top-gdb) bt
 #0  0x000055555574444c in thread_info::has_pending_waitstatus (this=0x0) at ../../src/gdb/gdbthread.h:393
 #1  0x0000555555a884d1 in get_pending_child_status (lp=0x5555579b8230, ws=0x7fffffffd130) at ../../src/gdb/linux-nat.c:1345
 #2  0x0000555555a8e5e6 in kill_unfollowed_child_callback (lp=0x5555579b8230) at ../../src/gdb/linux-nat.c:3564
 #3  0x0000555555a92a26 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::operator()(gdb::fv_detail::erased_callable, lwp_info*) const (this=0x0, ecall=..., args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:284
 #4  0x0000555555a92a51 in gdb::function_view<int (lwp_info*)>::bind<int, lwp_info*>(int (*)(lwp_info*))::{lambda(gdb::fv_detail::erased_callable, lwp_info*)#1}::_FUN(gdb::fv_detail::erased_callable, lwp_info*) () at ../../src/gdb/../gdbsupport/function-view.h:278
 #5  0x0000555555a91f84 in gdb::function_view<int (lwp_info*)>::operator()(lwp_info*) const (this=0x7fffffffd210, args#0=0x5555579b8230) at ../../src/gdb/../gdbsupport/function-view.h:247
 #6  0x0000555555a87072 in iterate_over_lwps(ptid_t, gdb::function_view<int (lwp_info*)>) (filter=..., callback=...) at ../../src/gdb/linux-nat.c:864
 #7  0x0000555555a8e732 in linux_nat_target::kill (this=0x55555653af40 <the_amd64_linux_nat_target>) at ../../src/gdb/linux-nat.c:3590
 #8  0x0000555555cfdc11 in target_kill () at ../../src/gdb/target.c:911
 ...

The root of the problem is that when a non-leader LWP execs, it just
changes its tid to the tgid, replacing the pre-exec leader thread,
becoming the new leader.  There's no thread exit event for the execing
thread.  It's as if the old pre-exec LWP vanishes without trace.  The
ptrace man page says:

 "PTRACE_O_TRACEEXEC (since Linux 2.5.46)
	Stop the tracee at the next execve(2).  A waitpid(2) by the
	tracer will return a status value such that

	  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))

	If the execing thread is not a thread group leader, the thread
	ID is reset to thread group leader's ID before this stop.
	Since Linux 3.0, the former thread ID can be retrieved with
	PTRACE_GETEVENTMSG."
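
As an aside, the behavior described in that man page excerpt is easy to
observe with a small stand-alone ptrace tracer.  The sketch below is not
GDB code and omits error handling; it assumes the tracee identified by
TGID has already been attached and had PTRACE_O_TRACEEXEC set via
PTRACE_SETOPTIONS:

  /* Minimal sketch: wait for the tracee's exec stop and recover the
     pre-exec thread ID of the execing LWP.  */

  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <csignal>
  #include <cstdio>

  static void
  wait_for_exec (pid_t tgid)
  {
    for (;;)
      {
        int status;
        pid_t ret = waitpid (tgid, &status, 0);
        if (ret == -1 || WIFEXITED (status))
          break;

        if (status >> 8 == (SIGTRAP | (PTRACE_EVENT_EXEC << 8)))
          {
            /* The execing thread now reports itself under the leader's
               ID (TGID).  The thread it used to be is only identifiable
               via PTRACE_GETEVENTMSG.  */
            unsigned long former_tid = 0;
            ptrace (PTRACE_GETEVENTMSG, ret, nullptr, &former_tid);
            printf ("exec by LWP %lu, now reported as %d\n",
                    former_tid, (int) ret);
            /* A debugger has to drop its record of FORMER_TID here --
               that LWP is gone, even though no exit event was (or ever
               will be) reported for it.  */
            break;
          }

        /* Some other stop; resume and keep waiting (a real tracer would
           forward the stop signal instead of discarding it).  */
        ptrace (PTRACE_CONT, ret, nullptr, 0);
      }
  }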

When the core of GDB processes an exec event, it deletes all the
threads of the inferior.  But that is too late -- deleting the thread
does not delete the corresponding LWP, so we end up leaving the
pre-exec non-leader LWP stale in the LWP list.  That's what leads to
the crash above -- linux_nat_target::kill iterates over all LWPs, and
after the patch in question, that code looks for the corresponding
thread_info for each LWP; for the pre-exec non-leader LWP still
listed, it won't find one.

This patch fixes it, by deleting the pre-exec non-leader LWP (and
thread) from the LWP/thread lists as soon as we get an exec event out
of ptrace.

GDBserver does not need an equivalent fix, because it is already doing
this, as side effect of mourning the pre-exec process, in
gdbserver/linux-low.cc:

  else if (event == PTRACE_EVENT_EXEC && cs.report_exec_events)
    {
...
      /* Delete the execing process and all its threads.  */
      mourn (proc);
      switch_to_thread (nullptr);


The crash with gdb.threads/step-over-exec.exp is not observable on
newer systems, which postdate the glibc change to move "libpthread.so"
internals to "libc.so.6", because right after the exec, GDB traps a
load event for "libc.so.6", which leads to GDB trying to open
libthread_db for the post-exec inferior, and, on such systems, that
succeeds.  When we load libthread_db, we call
linux_stop_and_wait_all_lwps, which, as the name suggests, stops all
lwps, and then waits to see their stops.  While doing this, GDB
detects that the pre-exec stale LWP is gone, and deletes it.

If we use "catch exec" to stop right at the exec before the
"libc.so.6" load event ever happens, and issue "kill" right there,
then GDB crashes on newer systems as well.  So instead of tweaking
gdb.threads/step-over-exec.exp to cover the fix, add a new
gdb.threads/threads-after-exec.exp testcase that uses "catch exec".
The test also uses the new "maint info linux-lwps" command if testing
on Linux native, which also exposes the stale LWP problem with an
unfixed GDB.
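
For the record, the by-hand equivalent of what the new test exercises
looks roughly like this on a native Linux GDB, against any program
whose non-leader thread execs (output elided):

 (gdb) catch exec
 (gdb) run
 ... the catchpoint triggers once the non-leader thread execs ...
 (gdb) maint info linux-lwps
 ... an unfixed GDB still lists the stale pre-exec LWP here ...
 (gdb) kill
 ... which crashes an unfixed GDB ...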

Also tweak a comment in infrun.c:follow_exec referring to how
linux-nat.c used to behave, as it would become stale otherwise.

Reviewed-By: Andrew Burgess <aburgess@redhat.com>
Change-Id: I21ec18072c7750f3a972160ae6b9e46590376643
agontarek pushed a commit that referenced this issue Mar 13, 2024
Running the
gdb.threads/step-over-thread-exit-while-stop-all-threads.exp testcase
added later in the series against gdbserver, after the
TARGET_WAITKIND_NO_RESUMED fix from the following patch, would run
into an infinite loop in stop_all_threads, leading to a timeout:

  FAIL: gdb.threads/step-over-thread-exit-while-stop-all-threads.exp: displaced-stepping=off: target-non-stop=on: iter 0: continue (timeout)

This is really a latent bug: stop_all_threads stops listening to
events from a target as soon as it sees a TARGET_WAITKIND_NO_RESUMED,
ignoring that a TARGET_WAITKIND_NO_RESUMED may be delayed.
handle_no_resumed knows how to handle delayed no-resumed events, but
stop_all_threads was never taught to.

In more detail, here's what happens with that testcase:

#1 - Multiple threads report breakpoint hits to gdb.

#2 - gdb picks one event, and it's for thread 1.  All other stops are
     left pending.  thread 1 needs to move past a breakpoint, so gdb
     stops all threads to start an inline step over for thread 1.
     While stopping threads, some of the threads that were still
     running report events that are also left pending.

#3 - gdb steps thread 1.

#4 - Thread 1 exits while stepping (it steps over an exit syscall),
     gdbserver reports thread exit for thread 1

#5 - Thread 1 was the last resumed thread, so gdbserver also reports
     no-resumed:

    [remote]   Notification received: Stop:w0;p3445d0.3445d3
    [remote] Sending packet: $vStopped#55
    [remote] Packet received: N
    [remote] Sending packet: $vStopped#55
    [remote] Packet received: OK

#6 - gdb processes the thread exit for thread 1, finishes the step
     over and restarts threads.

#7 - gdb picks the next event to process out of one of the resumed
     threads with pending events:

    [infrun] random_resumed_with_pending_wait_status: Found 32 events, selecting #11

#8 - This is again a breakpoint hit and the breakpoint needs to be
     stepped over too, so gdb starts a step-over dance again.

#9 - We reach stop_all_threads, which finds that some threads need to
     be stopped.

#10 - wait_one finally consumes the no-resumed event queued in #5.
      Seeing this, wait_one disables target async, to stop listening
      for events out of the remote target.

#11 - We still haven't seen all the stops expected, so
      stop_all_threads tries another iteration.

#12 - Because the remote target is no longer async, and there are no
      other targets, wait_one returns no-resumed immediately without
      polling the remote target.

#13 - We still haven't seen all the stops expected, so
      stop_all_threads tries another iteration; goto #12, looping
      forever.

Fix this by explicitly enabling/re-enabling target async on targets
that can async, before waiting for stops.
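
In code, the shape of the fix is roughly the following, run inside
stop_all_threads before each round of waiting.  This is a paraphrased
sketch of the idea rather than the literal patch, built on existing GDB
internals such as all_non_exited_process_targets and target_async:

  /* Sketch: before waiting for more stop events, make sure every live
     process target is async again -- a previously consumed
     TARGET_WAITKIND_NO_RESUMED may have switched it off.  */
  for (process_stratum_target *target : all_non_exited_process_targets ())
    {
      switch_to_target_no_thread (target);

      if (!target_is_async_p () && target_can_async_p ())
        target_async (true);
    }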

Reviewed-By: Andrew Burgess <aburgess@redhat.com>
Change-Id: Ie3ffb0df89635585a6631aa842689cecc989e33f
agontarek pushed a commit that referenced this issue Mar 13, 2024
On aarch64-linux, with gcc 13.2.1, I run into:
...
 (gdb) backtrace^M
 #0  break_here () at solib-search.c:30^M
 #1  0x0000fffff7f20194 in lib2_func4 () at solib-search-lib2.c:50^M
 #2  0x0000fffff7f70194 in lib1_func3 () at solib-search-lib1.c:50^M
 #3  0x0000fffff7f20174 in lib2_func2 () at solib-search-lib2.c:30^M
 #4  0x0000fffff7f70174 in lib1_func1 () at solib-search-lib1.c:30^M
 #5  0x00000000004101b4 in main () at solib-search.c:23^M
 (gdb) PASS: gdb.base/solib-search.exp: \
   backtrace (with wrong libs) (data collection)
 FAIL: gdb.base/solib-search.exp: backtrace (with wrong libs)
...

The FAIL is generated by this code in the test-case:
...
    if { $expect_fail } {
	# If the backtrace output is correct the test isn't sufficiently
	# testing what it should.
	if { $count == $total_expected } {
	    set fail 1
	}
...

The test-case:
- builds two versions of two shared libs, a "right" and "wrong" version, the
  difference being an additional dummy function (called spacer function),
- uses the "right" version to generate a core file,
- uses the "wrong" version to interpret the core file, and
- generates a backtrace.

The intent is that the backtrace is incorrect due to using the "wrong"
version, but actually it's correct.  This is because the spacer functions
aren't large enough.

Fix this by increasing the size of the spacer functions by adding a dummy
loop, after which we have, as expected, an incorrect backtrace:
...
 (gdb) backtrace^M
 #0  break_here () at solib-search.c:30^M
 #1  0x0000fffff7f201c0 in ?? ()^M
 #2  0x0000fffff7f20174 in lib2_func2 () at solib-search-lib2.c:30^M
 #3  0x0000fffff7f20174 in lib2_func2 () at solib-search-lib2.c:30^M
 #4  0x0000fffff7f70174 in lib1_func1 () at solib-search-lib1.c:30^M
 #5  0x00000000004101b4 in main () at solib-search.c:23^M
 (gdb) PASS: gdb.base/solib-search.exp: \
   backtrace (with wrong libs) (data collection)
 PASS: gdb.base/solib-search.exp: backtrace (with wrong libs)
...
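
The commit text does not quote the source change itself, but the spacer
adjustment amounts to something like the following in each library
(illustrative only; the function name here is made up and the real
solib-search-lib1.c/solib-search-lib2.c spacers use their own names):

  /* Illustrative spacer: a dummy loop bulky enough that the "right"
     and "wrong" builds of the library lay out the following functions
     at different addresses.  'volatile' keeps the compiler from
     optimizing the loop away.  */
  int
  lib1_spacer_function (void)
  {
    volatile int sum = 0;
    for (volatile int i = 0; i < 1000; i++)
      sum += i;
    return sum;
  }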

Tested on aarch64-linux.
agontarek pushed a commit that referenced this issue Aug 23, 2024
Commit b5661ff ("gdb: fix possible use-after-free when
executing commands") attempted to fix a possible use-after-free
in case a command redefines itself.

Commit 37e5833 ("gdb: fix command lookup in execute_command ()")
updated the previous fix to handle subcommands as well by using the
original command string to look up the command again after its execution.

This fixed the test in gdb.base/define.exp, but it turned out that it
does not work (at least) for "target remote" and "target extended-remote".

The problem is that the command buffer P passed to execute_command ()
gets overwritten in dont_repeat () while executing the "target remote"
command itself:

	#0  dont_repeat () at top.c:822
	#1  0x000055555730982a in target_preopen (from_tty=1) at target.c:2483
	#2  0x000055555711e911 in remote_target::open_1 (name=0x55555881c7fe ":1234", from_tty=1, extended_p=0)
	    at remote.c:5946
	#3  0x000055555711d577 in remote_target::open (name=0x55555881c7fe ":1234", from_tty=1) at remote.c:5272
	#4  0x00005555573062f2 in open_target (args=0x55555881c7fe ":1234", from_tty=1, command=0x5555589d0490)
	    at target.c:853
	#5  0x0000555556ad22fa in cmd_func (cmd=0x5555589d0490, args=0x55555881c7fe ":1234", from_tty=1)
	    at cli/cli-decode.c:2737
	#6  0x00005555573487fd in execute_command (p=0x55555881c802 "4", from_tty=1) at top.c:688

Therefore, the second call to lookup_cmd () at line 697 fails to find
the command, because the original command string is gone.

This commit addresses this particular problem by creating a *copy* of
the original command string, for the sole purpose of using it after
command execution to look up the command again.  It may not be the most
efficient way, but it's safer given that the command buffer is shared
and overwritten in hard-to-foresee situations.
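
The pattern is easy to demonstrate outside of GDB.  The toy program
below is not GDB code (all of the names in it are made up); it just
mimics a shared command buffer that gets clobbered while the command
runs, and shows why any post-execution lookup has to be driven from a
private copy:

  #include <cstring>
  #include <iostream>
  #include <string>

  /* Stand-in for GDB's shared, reusable command line buffer.  */
  static char line_buffer[64];

  /* Stand-in for cmd_func (): running the command clobbers the shared
     buffer, the way dont_repeat () does for "target remote".  */
  static void
  run_command (const std::string &cmd)
  {
    (void) cmd;  /* unused in this toy  */
    std::strcpy (line_buffer, "something else entirely");
  }

  static void
  execute_line (const char *p)
  {
    std::string copy = p;   /* private copy taken up front  */

    run_command (copy);     /* may overwrite the buffer behind P  */

    /* Any post-execution re-lookup must use COPY, not P.  */
    std::cout << "buffer now reads: " << p << "\n";
    std::cout << "saved copy reads: " << copy << "\n";
  }

  int
  main ()
  {
    std::strcpy (line_buffer, "target remote :1234");
    execute_line (line_buffer);
    return 0;
  }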

Tested on x86_64-linux.

PR 30249
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=30249

Approved-By: Tom Tromey <tom@tromey.com>
(cherry picked from commit b69378c)