Regression in tool.drcacheoff.burst* tests on Github Actions #5570
I couldn't reproduce the failure on my machine. All of them pass locally. I'll try using |
Out of curiosity, what versions of Linux kernel, gcc and libc do you have on your machine? I have a VM (base image: https://cloud-images.ubuntu.com/focal/20220715/):
On this VM I can see failures in the same set of tests with the same message.
My primary workstation is running kernel 5.17. I was able to reproduce the failure in an Ubuntu 20 VM, running kernel 5.15.
I grabbed some details from gdb.
That's just a safe-read -- continue on to the unhandled signal.
It seemed weird to me too that a safe-read would be unhandled. But this is indeed the unhandled signal. It leads to the following assert.
Looking at the instrs at the faulting cache pc, they're the same as the stack trace:
Looks like the following is_safe_read_ucxt doesn't recognize the signal as such: Line 5683 in b0848b5
I see that the version of
Whereas Line 4672 in b0848b5
Only the "burst" tests are failing, so I wonder if this has something to do with the start-stop API.
I think the problem might be the following entry in /proc/<pid>/maps:
The vsyscall segment start address faults when we try to read it.
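One way to tell whether a maps entry such as the vsyscall one is readable, without touching the memory, is to parse its permission field. Below is a minimal sketch in C, assuming the usual `start-end perms offset dev inode path` layout of a maps line. The names `maps_entry_t`, `parse_maps_line`, and `entry_is_readable` are hypothetical and made up for illustration; they are not DynamoRIO routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parsed subset of one /proc/<pid>/maps line (hypothetical helper type). */
typedef struct {
    unsigned long long start;
    unsigned long long end;
    char perms[5];    /* four permission characters plus NUL, e.g. "r-xp" */
    bool is_vsyscall; /* true if the line carries the [vsyscall] label */
} maps_entry_t;

/* Parses "start-end perms ..." from a single maps line.
 * Returns false if the line does not match that layout.
 */
static bool
parse_maps_line(const char *line, maps_entry_t *out)
{
    assert(line != NULL && out != NULL);
    if (sscanf(line, "%llx-%llx %4s", &out->start, &out->end, out->perms) != 3)
        return false;
    out->is_vsyscall = strstr(line, "[vsyscall]") != NULL;
    return true;
}

/* The first permission character is 'r' for readable mappings, '-' otherwise. */
static bool
entry_is_readable(const maps_entry_t *entry)
{
    return entry->perms[0] == 'r';
}
```

For an illustrative line such as `ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]` (not copied from the failing VM), `parse_maps_line` succeeds, `is_vsyscall` is set, and `entry_is_readable` returns false.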
I would guess that DR trying to read an unreadable page (though still +x, which would not be possible on older x86 hardware) is a secondary issue. Given the couple of faults we seem to see on every startup (I think there's an issue on trying to eliminate them all), I would have expected some faults on these burst tests the whole time, with the trigger being something that broke the safe-read fault identification. So: a toolchain change that caused the weak one to fail to be thrown out?
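For readers unfamiliar with the term, a "safe read" is a read that is allowed to fault: the signal handler recognizes the fault as expected and resumes at a recovery point instead of treating it as a crash. The sketch below illustrates the idea with `sigsetjmp`/`siglongjmp`. This is a stand-in technique chosen for brevity and is an assumption of this example, not how DR implements `d_r_safe_read`; `toy_safe_read` and `read_from_unreadable_page` are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static sigjmp_buf recover_point;
static volatile sig_atomic_t in_safe_read;

/* Recovers only when the fault happened inside toy_safe_read. */
static void
fault_handler(int sig)
{
    if (in_safe_read)
        siglongjmp(recover_point, 1);
    /* Not a safe-read fault: fall back to the default action. */
    signal(sig, SIG_DFL);
    raise(sig);
}

/* Copies size bytes from src to dst. Returns false if the read faulted. */
static bool
toy_safe_read(const void *src, size_t size, void *dst)
{
    struct sigaction sa, old_segv, old_bus;
    const volatile unsigned char *from = src; /* volatile: reads cannot be elided */
    unsigned char *to = dst;
    bool ok;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = fault_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    if (sigsetjmp(recover_point, 1) == 0) {
        in_safe_read = 1;
        for (size_t i = 0; i < size; i++)
            to[i] = from[i];
        ok = true;
    } else {
        ok = false; /* reached via siglongjmp from fault_handler */
    }
    in_safe_read = 0;
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return ok;
}

/* Usage: reading a PROT_NONE page fails cleanly instead of crashing. */
static bool
read_from_unreadable_page(void)
{
    unsigned char byte = 0;
    void *page = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(page != MAP_FAILED);
    bool ok = toy_safe_read(page, 1, &byte);
    munmap(page, 4096);
    return ok;
}
```

The point relevant to this issue is that recovery depends on the handler recognizing the fault as a safe-read fault. If the faulting read comes from a routine the handler does not know about, the same fault becomes an unhandled signal.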
I discovered something interesting: even on my machine (where the burst tests pass) looks like the So there's a larger issue here: the wrong I'm using verbose make commands and this interesting article (https://stackoverflow.com/questions/51656838/attribute-weak-and-static-libraries) to build a fix. I'll continue tomorrow. |
The issue seems to be that
Exporting them seems to fix the crash.
d_r_safe_read and safe_read_if_fast in core/unix/os.c are both non-weak symbols, but they are not exported.

$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep d_r_safe_read
00000000002962e8 t d_r_safe_read
$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep safe_read_if_fast
0000000000296272 t safe_read_if_fast

This causes drlibc code to use the wrong routines in is_elf_so_header.

On Ubuntu 20, there's a non-readable vsyscall entry in maps. When drlibc tries to read it, it crashes, and our main_signal_handler isn't able to recognize it as a safe_read crash. After this fix, the correct d_r_safe_read is used, which helps the DR signal handler to recover as intended.

Fixes: #5570
I verified that this fix works, in the x86-64 suite on #5573. I believe we need to do the same also for other WEAK definitions in drlibc, since we'd prefer their respective definitions in libdynamorio when available. Or else their invocations in drlibc would use the "fake" version E.g.
Maybe it's better to submit #5573 first to resolve the immediate issue though.
I'm not sure about this:
Thanks Derek. I dug through the make log, and found that one of the build steps for |
We should probably be using the "nohide" version of libdynamorio_static for the failing tests.
Xref #3348 where on some toolchains and configs we cannot hide symbols in static libraries and started renaming variables to avoid collisions when they're all global. Trying to understand this: the static library symbol hiding process still allows cross-object-file references to find the right symbols, as otherwise none of our code would work (we have lots of cross-file references). Yet it somehow causes a weak symbol to be more visible than the non-weak duplicate? Is drlibc not built with symbol hiding, so that visible trumps weak?
I.e., if we add the same symbol hiding to drlibc, does that solve the problem?
Note that the drlibc
drlibc is not built with symbol hiding. I'm trying adding it. |
My understanding is that static library boundaries are non-existent/meaningless: all the compilation unit object files are just munged together with all the ones from all the other static libraries. So if it's a different compilation unit (file) in drlibc that should be the same as a different CU in the libdynamorio static lib.
That said, maybe that changes with whatever magic the symbol hiding step is doing -- not sure I ever knew exactly how it was hiding symbols. Maybe it renames them across one library, making the library boundaries matter.
I attempted to localize the hidden symbols in
Anyway, I archived it and used it to link with the burst_static binary, and I'm getting lots of undefined references in core DR to symbols that are also defined in drlibc, e.g. get_process_id, os_page_size, dynamorio_syscall, is_elf_so_header, dup_syscall, os_seek.
Switches to dynamorio_static_nohide for configuring static DR so that DR's symbols are visible when building static binaries.

Various symbols in dynamorio_static, like d_r_safe_read and safe_read_if_fast in core/unix/os.c, are non-weak symbols, but they are not exported by the static DR library.

$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep d_r_safe_read
00000000002962e8 t d_r_safe_read
$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep safe_read_if_fast
0000000000296272 t safe_read_if_fast

This causes drlibc code to use the wrong routines in is_elf_so_header. The same would happen for other weakly linked routines in drlibc which are actually supposed to be suppressed by their respective DR definitions.

On Ubuntu 20, there's a non-readable vsyscall entry in maps. When drlibc tries to read it, it crashes, and our main_signal_handler isn't able to recognize it as a safe_read crash. After this fix, the correct d_r_safe_read is used, which helps the DR signal handler to recover as intended.

Fixes: #5570
OK, so |
This means that we need to either drop support for the |
Still wondering how the static libdynamorio worked all these years, given this issue with hiding the symbols: is it really possible there were zero safe read faults until this week? When was the drlibc split: wasn't that also years ago? If we didn't already have a major use case where the toolchain no longer supports the |
Okay. I'll go ahead and change static DR to use the nohide version.
I also found that in |
Switches to dynamorio_static_nohide for configuring static DR so that DR's symbols are visible when building static binaries.

Various symbols in dynamorio_static, like d_r_safe_read and safe_read_if_fast in core/unix/os.c, are non-weak symbols, but they are not exported by the static DR library because we use --localize-hidden during build.

$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep d_r_safe_read
00000000002962e8 t d_r_safe_read
$ nm --defined ../../lib64/debug/libdynamorio_static.a | grep safe_read_if_fast
0000000000296272 t safe_read_if_fast

This causes drlibc code to use the wrong routines in is_elf_so_header. The same would happen for other weakly linked routines in drlibc which are actually supposed to be suppressed by their respective DR definitions.

There's an existing version of static DR, libdynamorio_static_nohide, which does not use --localize-hidden. Now, we use that instead while configuring static DR.

This issue revealed itself on the recent Ubuntu 20 update, which has a non-readable vsyscall entry in maps. When drlibc tries to read it, it crashes, and our main_signal_handler isn't able to recognize it as a safe_read crash because the incorrect d_r_safe_read is used. After this fix, the correct one is used, which helps the DR signal handler to recover as intended.

Some cleanup will follow in the next PR: renaming the nohide version to make it clear that it is the default, and evaluating whether we still need the static_nohide_api tests.

Issue: #5570
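To make the failure mode concrete, here is a toy model in C of why a duplicate routine can defeat fault recognition. It assumes that the handler classifies a fault as a safe-read fault by checking whether the faulting pc falls inside the code of the one safe-read routine it knows about; that is an assumption of this sketch, not a description of the actual is_safe_read_ucxt logic. The type `pc_range_t`, the helper names, and all addresses are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, end) covering one routine's code. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} pc_range_t;

static bool
pc_in_range(uintptr_t pc, pc_range_t range)
{
    assert(range.start <= range.end);
    return pc >= range.start && pc < range.end;
}

/* Models a handler that knows about exactly one safe-read routine. */
static bool
handler_recognizes_fault(uintptr_t fault_pc, pc_range_t known_safe_read)
{
    return pc_in_range(fault_pc, known_safe_read);
}

/* Hypothetical layout: two copies of the "same" routine end up in the binary.
 * The handler was built against the strong copy, but callers in the other
 * library were resolved to the weak copy.
 */
static const pc_range_t strong_copy = { 0x1000, 0x1080 };
static const pc_range_t weak_copy = { 0x5000, 0x5040 };

/* Returns true if a fault at fault_pc would end up as an unhandled signal. */
static bool
fault_is_unhandled(uintptr_t fault_pc)
{
    return !handler_recognizes_fault(fault_pc, strong_copy);
}
```

Under this model, a fault at a pc inside `strong_copy` is recovered, while the identical read issued through `weak_copy` is reported as unhandled, which matches the symptom described in this commit message.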
Renames dynamorio_static_nohide to dynamorio_static and deprecates the version where we hide DR symbols using hide_symbols. Removes tobuild_static_nohide_api, since now all static libs use the nohide variant. Documents hide_symbols as unsafe because it may lead to confusing linking behavior. Issue: #5570, #3348
For the vsyscall read fault xref https://groups.google.com/g/DynamoRIO-Users/c/2JZESY4KLgs
The following tests have recently started failing on the GitHub Actions workflow for x86-64:
This is blocking some PRs, including #5569, #5568 and #5562.
Logs: https://github.com/DynamoRIO/dynamorio/runs/7430951767?check_suite_focus=true