Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in tool.drcacheoff.burst* tests on Github Actions #5570

Closed
abhinav92003 opened this issue Jul 20, 2022 · 27 comments
Closed

Regression in tool.drcacheoff.burst* tests on Github Actions #5570

abhinav92003 opened this issue Jul 20, 2022 · 27 comments

Comments

@abhinav92003
Copy link
Contributor

The following tests have recently started failing on the Github Actions workflow for x86-64:

	code_api|tool.drcacheoff.burst_static 
	code_api|tool.drcacheoff.burst_replace 
	code_api|tool.drcacheoff.burst_replaceall 
	code_api|tool.drcacheoff.burst_noreach 
	code_api|tool.drcacheoff.burst_maps 
	code_api|tool.drcacheoff.burst_traceopts 
	code_api|tool.drcacheoff.burst_threads 
	code_api|tool.drcacheoff.burst_malloc 
	code_api|tool.drcacheoff.burst_reattach 
	code_api|tool.drcacheoff.burst_threadL0filter 
	code_api|tool.drcacheoff.burst_client 

This is blocking some PRs, including #5569, #5568 and #5562.

Logs: https://github.com/DynamoRIO/dynamorio/runs/7430951767?check_suite_focus=true

2022-07-20T14:16:23.9265812Z 349:   <Application
2022-07-20T14:16:23.9266263Z 349:   /home/runner/work/dynamorio/dynamorio/build_debug-internal-64/clients/bin64/tool.drcacheoff.burst_threads
2022-07-20T14:16:23.9266655Z 349:   (24791).  Internal Error: DynamoRIO debug check failure:
2022-07-20T14:16:23.9266991Z 349:   /home/runner/work/dynamorio/dynamorio/core/unix/signal.c:5861
2022-07-20T14:16:23.9267422Z 349:   syscall_signal || safe_is_in_fcache(dcontext, pc, (byte *)sc->SC_XSP)
@abhinav92003
Copy link
Contributor Author

I couldn't reproduce the failure on my machine. All of them pass locally. I'll try using tmate to debug the issue on the GA runner.

@semihalf-kardach-stanislaw
Copy link
Contributor

Out of curiosity, what versions of Linux kernel, gcc and libc do you have on your machine?

I have a VM (base image: https://cloud-images.ubuntu.com/focal/20220715/):

  • Kernel: Linux ubuntu 5.4.0-122-generic
  • gcc: 9.4.0-1ubuntu1~20.04.1
  • libc6: 2.31-0ubuntu9.9
  • libunwind8: 1.2.1-9build1

On this VM I can see failures in the same set of tests with the same message.

@abhinav92003
Copy link
Contributor Author

My primary workstation is running kernel 5.17.

I was able to reproduce the failure in an Ubuntu 20 VM, running kernel 5.15

@abhinav92003
Copy link
Contributor Author

I grabbed some details from gdb.

Program received signal SIGSEGV, Segmentation fault.
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:238
238     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:238
#1  0x00005555558ae336 in d_r_safe_read (base=0xffffffffff600000, size=64, out_buf=0x7fffffffd820) at /home/abhinav92003/dynamorio/core/drlibc/drlibc_notdr_saferead.c:44
#2  0x00005555558ade15 in safe_read_if_fast (base=0xffffffffff600000, size=64, out_buf=0x7fffffffd820) at /home/abhinav92003/dynamorio/core/drlibc/drlibc.c:59
#3  0x00005555558af4c3 in is_elf_so_header_common (base=0xffffffffff600000 <error: Cannot access memory at address 0xffffffffff600000>, size=4096, memory=true)
    at /home/abhinav92003/dynamorio/core/drlibc/drlibc_module_elf.c:79
#4  0x00005555558af7d0 in is_elf_so_header (base=0xffffffffff600000 <error: Cannot access memory at address 0xffffffffff600000>, size=4096) at /home/abhinav92003/dynamorio/core/drlibc/drlibc_module_elf.c:148
#5  0x0000555555879c03 in module_is_header (base=0xffffffffff600000 <error: Cannot access memory at address 0xffffffffff600000>, size=4096) at /home/abhinav92003/dynamorio/core/unix/module_elf.c:573
#6  0x000055555584f8c4 in os_walk_address_space (iter=0x7fffffffd950, add_modules=true) at /home/abhinav92003/dynamorio/core/unix/os.c:9722
#7  0x00005555558501b1 in find_executable_vm_areas () at /home/abhinav92003/dynamorio/core/unix/os.c:9884
#8  0x000055555570a087 in vm_areas_init () at /home/abhinav92003/dynamorio/core/vmareas.c:1675
#9  0x00005555555c8118 in dynamorio_app_init_part_two_finalize () at /home/abhinav92003/dynamorio/core/dynamo.c:678
#10 0x00005555555c79a5 in dynamorio_app_init () at /home/abhinav92003/dynamorio/core/dynamo.c:390
#11 0x00005555555cbac1 in dr_app_setup () at /home/abhinav92003/dynamorio/core/dynamo.c:2743
#12 0x00005555555b6a3b in main (argc=1, argv=0x7fffffffe458) at /home/abhinav92003/dynamorio/clients/drcachesim/tests/burst_static.cpp:85
(gdb) p $pc
$1 = (void (*)()) 0x7ffff7bbd6f7 <__memmove_avx_unaligned_erms+23>
(gdb) x/10i $pc
=> 0x7ffff7bbd6f7 <__memmove_avx_unaligned_erms+23>:    vmovdqu (%rsi),%ymm0
   0x7ffff7bbd6fb <__memmove_avx_unaligned_erms+27>:    vmovdqu -0x20(%rsi,%rdx,1),%ymm1

@derekbruening
Copy link
Contributor

That's just a safe-read -- continue on to the unhandled signal.

@abhinav92003
Copy link
Contributor Author

abhinav92003 commented Jul 20, 2022

It seemed weird to me too that a safe-read would be unhandled. But this is indeed the unhandled signal. It leads to the following assert.

computing memory target for 0x00007ff3dc7c76f7 causing SIGSEGV, kernel claims it is 0xffffffffff600000
compute_memory_target: falling back to racy protection checks
For SIGSEGV at cache pc 0x00007ff3dc7c76f7, computed target read 0xffffffffff600000
        faulting instr: vmovdqu (%rsi)[32byte] -> %ymm0
** Received SIGSEGV at cache pc 0x00007ff3dc7c76f7 in thread 23251
add_process_lock: 0 lock 0x000055afe5f42a40: name=report_buf_lock(mutex)@/home/abhinav92003/dynamorio/core/utils.c:2005
rank=90 owner=23251 owning_dc=0x00007ff1dc6d0080 contended_event=0xffffffff prev=0x0000000000000000
lock count_times_acquired=       1                              0                               0                              0                               0+2 report_buf_lock(mutex)@/home/abhinav92003/dynamorio/core/utils.c:2005
SYSLOG_ERROR: Application /home/abhinav92003/build/clients/bin64/tool.drcacheoff.burst_static (23251).  Internal Error: DynamoRIO debug check failure: /home/abhinav92003/dynamorio/core/unix/signal.c:5861 syscall_signal || safe_is_in_fcache(dcontext, pc, (byte *)sc->SC_XSP)

Looking at the instrs at the faulting cache pc, they're the same as the stack trace:

(gdb) x/10i 0x00007ffff7bbd6f7
   0x7ffff7bbd6f7 <__memmove_avx_unaligned_erms+23>:    vmovdqu (%rsi),%ymm0
   0x7ffff7bbd6fb <__memmove_avx_unaligned_erms+27>:    vmovdqu -0x20(%rsi,%rdx,1),%ymm1
   0x7ffff7bbd701 <__memmove_avx_unaligned_erms+33>:    vmovdqu %ymm0,(%rdi)
   0x7ffff7bbd705 <__memmove_avx_unaligned_erms+37>:    vmovdqu %ymm1,-0x20(%rdi,%rdx,1)
   0x7ffff7bbd70b <__memmove_avx_unaligned_erms+43>:    vzeroupper 

@abhinav92003
Copy link
Contributor Author

Looks like the following is_safe_read_ucxt doesn't recognize the signal as such:

if (is_safe_read_ucxt(ucxt) ||

@abhinav92003
Copy link
Contributor Author

I see that the version of d_r_safe_read that's invoked in this crash is the one with the WEAK linkage:

Whereas main_signal_handler attempts to detect the other one:

d_r_safe_read(const void *base, size_t size, void *out_buf)

Only the "burst" tests are failing, so I wonder if this has something to do with the start-stop API.

@abhinav92003
Copy link
Contributor Author

I think the problem might be the following entry in /proc//maps:

ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

The vsyscall segment start address faults when we try to read it.

@derekbruening
Copy link
Contributor

I would guess that DR trying to read an un-readable page (though still +x -- on older x86 hardware that would not be possible) is a secondary issue. Given the couple of faults we seem to see on every startup (I think there's an issue on trying to eliminate them all) I would have expected some on these burst tests the whole time and that something breaking the safe read fault identification would be the trigger: so a toolchain change that caused the weak one to fail to be thrown out?

@abhinav92003
Copy link
Contributor Author

I discovered something interesting: even on my machine (where the burst tests pass) looks like the core/unix/os.c version of d_r_safe_read is not being used, but the drlibc_notdr_saferead one is used. (However, thecore/unix/os.c one is invoked if I use drrun.)

So there's a larger issue here: the wrong d_r_safe_read is being used in statically linked mode in general, and not only on Ubuntu 20. The reason why this regression shows up on Ubuntu 20 is the non-readable vsyscall entry. If the correct d_r_safe_read is used, DR should be able to handle that case just fine.

I'm using verbose make commands and this interesting article (https://stackoverflow.com/questions/51656838/attribute-weak-and-static-libraries) to build a fix. I'll continue tomorrow.

@abhinav92003
Copy link
Contributor Author

abhinav92003 commented Jul 21, 2022

The issue seems to be that d_r_safe_read and safe_read_if_fast, though non-weak symbols, are not exported from libdynamorio_static.

$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep "d_r_safe_read"
00000000002962e8 t d_r_safe_read
$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep "safe_read_if_fast"
0000000000296272 t safe_read_if_fast

Exporting them seems to fix the crash.

abhinav92003 added a commit that referenced this issue Jul 21, 2022
d_r_safe_read and safe_read_if_fast in core/unix/os.c are both non-weak
symbols, but they are not exported.
$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep d_r_safe_read
00000000002962e8 t d_r_safe_read
$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep safe_read_if_fast
0000000000296272 t safe_read_if_fast

This causes drlibc code to use the wrong routines in is_elf_so_header.

On Ubuntu 20, there's a non-readable vsyscall entry in maps. When drlibc
tries to read it, it crashes, and our main_signal_handler isn't able to
recognize it as a safe_read crash.

After this fix, the correct d_r_safe_read is used, which helps the DR
signal handler to recover as intended.

Fixes: #5570
@abhinav92003
Copy link
Contributor Author

I verified that this fix works, in the x86-64 suite on #5573.

I believe we need to do the same also for other WEAK definitions in drlibc, since we'd prefer their respective definitions in libdynamorio when available. Or else their invocations in drlibc would use the "fake" version E.g.

os_map_file
kernel_is_64bit
d_r_safe_read
os_unmap_file
d_r_internal_error
safe_read_if_fast
... others

Maybe it's better to submit #5573 first to resolve the immediate issue though.

@derekbruening
Copy link
Contributor

I'm not sure about this:

  • Exported and private generally applies to shared libraries. We do not want to export these internal symbols when building as a shared library.

  • Static libraries generally have all global symbols in the same visibility pool is my understanding, and when you link them into something all the symbols are munged together is what you would expect. We do have the -fvisibility flag: does it affect our static library builds? But even if it did, why would we have even needed weak symbols if these libdynamorio symbols were really private and hidden? Something doesn't add up here.

@abhinav92003
Copy link
Contributor Author

Thanks Derek. I dug through the make log, and found that one of the build steps for libdynamorio_static.a uses -D localize_hidden=ON. I verified that it is after this is applied to the previously generated libdynamorio_static.a that d_r_safe_read becomes local to the library.

@abhinav92003
Copy link
Contributor Author

We should probably be using the "nohide" version of libdynamorio_static for the failing tests.

@derekbruening
Copy link
Contributor

Xref #3348 where on some toolchains and configs we cannot hide symbols in static libraries and started renaming variables to avoid collisions when they're all global.

Trying to understand this: the static library symbol hiding process still allows cross-object-file references to find the right symbols, as otherwise none of our code would work (we have lots of cross-file references). Yet it somehow causes a weak symbol to be more visible than the non-weak duplicate? Is drlibc not built with symbol hiding and visible trumps weak?

@derekbruening
Copy link
Contributor

Is drlibc not built with symbol hiding and visible trumps weak?

I.e., if we add the same symbol hiding to drlibc, does that solve the problem.

@abhinav92003
Copy link
Contributor Author

Yet it somehow causes a weak symbol to be more visible than the non-weak duplicate?

Note that the drlibc d_r_safe_read (the incorrect one) is being invoked from within drlibc itself. So maybe a weak symbol in the same library as the reference is more visible than the non-weak non-visible duplicate from a different library.

Is drlibc not built with symbol hiding and visible trumps weak? if we add the same symbol hiding to drlibc, does that solve the problem.

drlibc is not built with symbol hiding. I'm trying adding it.

@derekbruening
Copy link
Contributor

Note that the drlibc d_r_safe_read (the incorrect one) is being invoked from within drlibc itself.

My understanding is that static library boundaries are non-existent/meaningless: all the compilation unit object files are just munged together with all the ones from all the other static libraries. So if it's a different compilation unit (file) in drlibc that should be the same as a different CU in the libdynamorio static lib.

@derekbruening
Copy link
Contributor

derekbruening commented Jul 21, 2022

Note that the drlibc d_r_safe_read (the incorrect one) is being invoked from within drlibc itself.

My understanding is that static library boundaries are non-existent/meaningless: all the compilation unit object files are just munged together with all the ones from all the other static libraries. So if it's a different compilation unit (file) in drlibc that should be the same as a different CU in the libdynamorio static lib.

That said, maybe that changes with whatever magic the symbol hiding step is doing -- not sure I ever knew exactly how it was hiding symbols. Maybe it renames them across one library, making the library boundaries matter.

@abhinav92003
Copy link
Contributor Author

I attempted to localize the hidden symbols in libdrlibc.a (manually following the steps from CMake_finalize_static_lib.cmake). I don't understand why but in addition to getting localized, d_r_safe_read also became a non-weak symbol:

$ nm --defined libdrlibc.o | grep "safe_read_if_fast"
00000000000000c5 W safe_read_if_fast
$ objcopy --localize-hidden libdrlibc.o
$ nm --defined libdrlibc.o | grep "safe_read_if_fast"
00000000000000c5 t safe_read_if_fast

Anyway, I archived it and used it to link with the burst_static binary, and I'm getting lots of undefined references in core DR to symbols that are also defined in drlibc, e.g. get_process_id, os_page_size, dynamorio_syscall, is_elf_so_header, dup_syscall, os_seek.

abhinav92003 added a commit that referenced this issue Jul 21, 2022
Switches to dynamorio_static_unhide for configuring static DR so
that DR's symbols are visible when building static binaries.

Various symbols in dynamorio_static, like d_r_safe_read and
safe_read_if_fast in core/unix/os.c are non-weak symbols, but
they are not exported by the static DR library.
$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep d_r_safe_read
00000000002962e8 t d_r_safe_read
$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep safe_read_if_fast
0000000000296272 t safe_read_if_fast

This causes drlibc code to use the wrong routines in is_elf_so_header.
The same would happen for other weakly linked routines in drlibc which
are actually are supposed to be suppressed by their respective DR
definitions.

On Ubuntu 20, there's a non-readable vsyscall entry in maps. When drlibc
tries to read it, it crashes, and our main_signal_handler isn't able to
recognize it as a safe_read crash.

After this fix, the correct d_r_safe_read is used, which helps the DR
signal handler to recover as intended.

Fixes: #5570
@derekbruening
Copy link
Contributor

OK, so --localize-hidden converts both STB_GLOBAL and STB_WEAK into STB_LOCAL. There is no such thing as weak and local.

@abhinav92003
Copy link
Contributor Author

This means that we need to either drop support for the --localize-hidden variant (#5574), or explicitly export the symbols that may be used in other libraries such as drlibc (#5573). The latter options seems fragile to me as it would silently use the wrong routines and we wouldn't know. As the nohide variant is being built successfully by tobuild_static_nohide_api tests, perhaps it's okay to make that the default?

@derekbruening
Copy link
Contributor

Still wondering how the static libdynamorio worked all these years, given this issue with hiding the symbols: is it really possible there were zero safe read faults until this week? When was the drlibc split: wasn't that also years ago?

If we didn't already have a major use case where the toolchain no longer supports the objcopy --localize-hidden feature (that's why I added the nohide variant for tests to ensure we didn't break it), I would push harder to keep the symbols hidden (not sure the best way: refactor drlibc code; make some symbols global again; hand-implement our own symbol picking) and avoid unnecessary conflicts from all our globals. But given that significant use case: I think we have to abandon the symbol hiding. Sigh. It's too bad because it really feels like the cleanest way to go instead of dumping all our global names in the app namespace.

@abhinav92003
Copy link
Contributor Author

Okay. I'll go ahead and change static DR to use the nohide version.

So maybe a weak symbol in the same library as the reference is more visible than the non-weak non-visible duplicate from a different library.

I also found that in api.static_signal both the version of d_r_safe_read are being invoked. The drlibc one is invoked when the callsite is in drlibc.

abhinav92003 added a commit that referenced this issue Jul 22, 2022
Switches to dynamorio_static_unhide for configuring static DR so
that DR's symbols are visible when building static binaries.

Various symbols in dynamorio_static, like d_r_safe_read and
safe_read_if_fast in core/unix/os.c are non-weak symbols, but
they are not exported by the static DR library because we use
--localize_hidden during build.

$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep d_r_safe_read
00000000002962e8 t d_r_safe_read
$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep safe_read_if_fast
0000000000296272 t safe_read_if_fast

This causes drlibc code to use the wrong routines in is_elf_so_header.
The same would happen for other weakly linked routines in drlibc which
are actually supposed to be suppressed by their respective DR
definitions.

There's an existing version of static DR, libdynamorio_static_nohide,
which does not use --localize_hidden. Now, we use that instead
while configuring static DR.

This issue revealed itself on the recent Ubuntu 20 update which has a
non-readable vsyscall entry in maps. When drlibc tries to read it, it crashes,
and our main_signal_handler isn't able to recognize it as a safe_read crash
because the incorrect d_r_safe read is used. After this fix, the correct one
is used, which helps the DR signal handler to recover as intended.

Some cleanup will follow in the next PR: renaming the nohide version to
make it clear that it is the default, evaluating whether we still need the
static_nohide_api tests.

Issue: #5570
abhinav92003 added a commit that referenced this issue Jul 28, 2022
Renames dynamorio_static_nohide to dynamorio_static and deprecates
the version where we hide DR symbols using hide_symbols.

Removes tobuild_static_nohide_api, since now all static libs use
the nohide variant.

Documents hide_symbols as unsafe because it may lead to
confusing linking behavior.

Issue: #5570, #3348
abhinav92003 added a commit that referenced this issue Jul 29, 2022
Renames dynamorio_static_nohide to dynamorio_static and deprecates
the version where we hide DR symbols using hide_symbols.

Removes tobuild_static_nohide_api, since now all static libs use
the nohide variant.

Documents hide_symbols as unsafe because it may lead to
confusing linking behavior.

Issue: #5570, #3348
dolanzhao pushed a commit that referenced this issue Aug 1, 2022
Renames dynamorio_static_nohide to dynamorio_static and deprecates
the version where we hide DR symbols using hide_symbols.

Removes tobuild_static_nohide_api, since now all static libs use
the nohide variant.

Documents hide_symbols as unsafe because it may lead to
confusing linking behavior.

Issue: #5570, #3348
@derekbruening
Copy link
Contributor

For the vsyscall read fault xref https://groups.google.com/g/DynamoRIO-Users/c/2JZESY4KLgs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants