add proper signal support while threads are native #1921

Closed
derekbruening opened this issue Apr 3, 2016 · 5 comments

Comments

@derekbruening
Contributor

xref #38

DR_EMIT_GO_NATIVE is considered experimental on Linux and one feature missing is:

    /* For proper native execution with re-takeover we need to propagate
     * signals to app handlers while native.  For now we do not support re-takeover
     * and we give up our handles via signal_remove_handlers().
     */
@derekbruening
Contributor Author

I'm considering this issue to also cover signal support for start/stop where all threads are either native or under DR. Because handling signals for just one thread being native is difficult (we aren't tracking syscalls, so we'd have to query for the proper handler and mask), the plan for whole-process start/stop is to remove our handler while the app is native.
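
A minimal sketch of that plan, using hypothetical bookkeeping (dr_intercepts[], recorded_app_action[]) rather than DR's real data structures: when the whole process goes native, hand each intercepted signal's recorded app disposition back to the kernel.

    #include <signal.h>
    #include <stdbool.h>

    #define MAX_SIGNUM 64
    /* Hypothetical bookkeeping filled in when DR first intercepts each signal. */
    static bool dr_intercepts[MAX_SIGNUM + 1];
    static struct sigaction recorded_app_action[MAX_SIGNUM + 1];

    /* Called once all threads are (about to be) native: give the kernel back the
     * app's handler, mask, and flags so native signals bypass DR entirely. */
    static void
    restore_app_handlers_for_native(void)
    {
        for (int sig = 1; sig <= MAX_SIGNUM; sig++) {
            if (!dr_intercepts[sig])
                continue;
            sigaction(sig, &recorded_app_action[sig], NULL);
        }
    }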

@derekbruening
Contributor Author

The api.detach_spawn test I'm trying to add for #2601 is revealing a bunch of races, some of which overlap with this issue.

Today there are race windows where we haven't removed our handler completely: e.g., during detach we remove the alarm handlers after the synchall, before we let any thread go native, but we leave the other signal handlers in place.

All of this has corner cases with no guarantees. For alarms we can be reasonably confident that the app's handler will come right back, but that may not be true for other signals. This affects us whether we keep our handler and implement the feature covered here (sending native control to the app's handler), or drop our handler and let the kernel send native control to the app's handler. The only way to get guarantees is to delay all signals, which gets complicated when trying to clean up and detach.

For native delivery, the kernel mask should be the native one, so we should be able to just undo the master_signal_handler_C frame and jump to the app's handler if it exists; if not, we'll have to decide what to do. If the native action is to terminate and we're mid-detach, it would be best to wait so that any client can run its exit-time code.
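
A rough sketch of that decision, with illustrative names (deliver_natively, mid_detach); it assumes the kernel mask is already the app's, so control can go straight to the handler:

    #include <signal.h>
    #include <stdbool.h>

    /* Sketch only: called after unwinding the master_signal_handler_C frame. */
    static void
    deliver_natively(int sig, siginfo_t *info, void *ucxt,
                     const struct sigaction *app_act, bool mid_detach)
    {
        if (app_act->sa_flags & SA_SIGINFO) {
            /* Transfer control directly to the app's 3-arg handler. */
            app_act->sa_sigaction(sig, info, ucxt);
            return;
        }
        if (app_act->sa_handler == SIG_IGN)
            return; /* The app ignores this signal. */
        if (app_act->sa_handler != SIG_DFL) {
            app_act->sa_handler(sig);
            return;
        }
        /* Default action with no app handler: if we're mid-detach, prefer to
         * delay so clients can run exit-time code first; otherwise re-raise
         * with the default disposition. */
        if (!mid_detach) {
            signal(sig, SIG_DFL);
            raise(sig);
        }
        /* else: queue for delivery after detach finishes (not shown). */
    }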

One option is to not let any thread go native until we've cleaned them all up: then we remove our handler, and then we let them all go at once. Performance will be worse, though, with 600 threads stacking up for much longer than before.

@hgreving2304

xref #2941

@derekbruening
Contributor Author

A particular instance of the detach race described above is the following scenario:

The application generates synchronous signals which it handles during normal execution (maybe SIGFPE, or a safe-read SIGSEGV, or something). We detach and send a thread native, but before we remove our handler, the thread raises one of its handled signals. Today our handler will panic and not cope well. This is similar to the asynchronous timer-signal races, but there is no (hacky?) fix like the one we have for alarms, where we ignore them until we remove our handler.

It should be easy to write a test for this case, and use it to drive developing the native-delivery mechanism.
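
One possible shape for such a test, assuming DR's start/stop API (dr_app_setup_and_start() / dr_app_stop_and_cleanup()); the thread count, loop bound, and structure here are placeholders, not the eventual test:

    #include <pthread.h>
    #include <signal.h>
    #include <setjmp.h>
    #include <string.h>
    #include <stdio.h>
    #include "dr_api.h"

    static __thread sigjmp_buf recover_point;

    static void
    handle_segv(int sig, siginfo_t *info, void *ucxt)
    {
        (void)sig; (void)info; (void)ucxt;
        siglongjmp(recover_point, 1); /* Recover from the intentional fault. */
    }

    static void *
    fault_loop(void *arg)
    {
        (void)arg;
        struct sigaction act;
        memset(&act, 0, sizeof(act));
        act.sa_sigaction = handle_segv;
        act.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &act, NULL);
        for (int i = 0; i < 100000; i++) {
            if (sigsetjmp(recover_point, 1) == 0)
                *(volatile int *)0 = 42; /* Synchronous, app-handled fault. */
        }
        return NULL;
    }

    int
    main(void)
    {
        pthread_t thread;
        pthread_create(&thread, NULL, fault_loop, NULL);
        dr_app_setup_and_start();  /* Run under DR while faults are handled... */
        dr_app_stop_and_cleanup(); /* ...then detach while the loop continues. */
        pthread_join(thread, NULL);
        printf("all done\n");
        return 0;
    }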

derekbruening added a commit that referenced this issue Dec 14, 2020
…4603)

Adds a new execute_native_handler() feature where DR delivers a signal
to a currently-native thread.  For a native thread, a signal frame is
set up on the appropriate stack, blocked signals are set natively
rather than for emulation, and control is directly sent to the
application's handler.

Adds handling of asynchronous signals by not recording them for
delivery later but delivering immediately.

Adds a test that uses the client interface DR_EMIT_GO_NATIVE to send a
thread native.  The app then generates a synchronous and an
asynchronous signal.

Initial attempts are made to also handle a pure-native thread with no
dcontext, such as late in detach, but that portion is not fully
finished and will be completed separately.  Handling default-action
signals is also not fully tested and likely still has some issues to
be worked out.

Fixes several small bugs in the go-native path hit by the test:
+ Moves dynamo_thread_not_under_dynamo() to dispatch_enter_native() to
  avoid problems unlocking bb_building_lock on the path back through
  dispatch.
+ Relaxes a TRY_EXCEPT assert to support a currently-native thread
  with no dcontext in TLS.
+ Suppresses a curiosity from a still-temporarily-native thread at exit.

Issue: #1921
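
A sketch of the "appropriate stack" choice this commit describes, with assumed names and x86-64-specific details (REG_RSP, the 128-byte red zone) chosen only for illustration:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <ucontext.h>

    /* Pick where to build the native frame: the app's sigaltstack if the handler
     * requested SA_ONSTACK and we're not already on it, else just below the
     * interrupted stack pointer (skipping the x86-64 red zone). */
    static uint8_t *
    choose_native_frame_base(const struct sigaction *app_act,
                             const stack_t *app_altstack, const ucontext_t *interrupted)
    {
        uintptr_t sp = (uintptr_t)interrupted->uc_mcontext.gregs[REG_RSP];
        bool have_altstack = app_altstack->ss_sp != NULL &&
            !(app_altstack->ss_flags & SS_DISABLE);
        bool already_on_altstack = have_altstack &&
            sp >= (uintptr_t)app_altstack->ss_sp &&
            sp < (uintptr_t)app_altstack->ss_sp + app_altstack->ss_size;
        if ((app_act->sa_flags & SA_ONSTACK) && have_altstack && !already_on_altstack)
            return (uint8_t *)app_altstack->ss_sp + app_altstack->ss_size;
        return (uint8_t *)(sp - 128);
    }
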
derekbruening added a commit that referenced this issue Dec 23, 2020
When a signal arrives in a completely unmanaged thread with no
dcontext, typically because DR is detaching, we now deliver the signal
if the application has a handler for it.  This requires adding support
for no dcontext to several parts of the frame setup code even beyond
what was added in PR #4603 for temporarily-native threads.

We have to save the app's handler when we detach a thread so we know
where to send a native signal.  Full support is complex when we're
cleaning up and have no dynamic storage, so we use a single global
handler per signal.  We detect whether multiple handlers are in
operation in this single DR instance (quite rare: only custom
non-pthread clones have this behavior) and in that case we abort like
before on a native signal.  Adds ATOMIC_READ_1BYTE() to complement the
existing atomic operations for a cleaner read of the new multi-handler
flag.

Delivering the frame often overlaps with DR's frame and even DR's
stack usage while delivering, if the app has no sigaltstack.  We add
logic to detect this overlap and avoid clobbering the stack memory.

Alarm signals are still dropped, since they can arrive mid-thread-init
when it is even harder to deliver.

Adds a new test, api.detach_signal, which creates 10 threads that all sit
in a loop sending 4 different alternating signals (SIGSEGV, SIGBUS,
SIGURG, SIGALRM) while the main thread attaches and then detaches.
When run in debug build, a great many signals arrive in post-detach
threads, since detach takes a while to do debug cleanup, exercising
the new code.

Adds a new RSTAT for native signals so we can identify when this
happens in release build.  Exports the stat to the API and uses it to
check that at least some signals were delivered natively in the new
test.

Removes the fatal error on a signal arriving with no dcontext.  But,
non-ignore default signal actions when no handler is present are still
not fully handled, along with multi-sighand-processes as mentioned,
and the fatal error remains in those cases.  For default actions,
since the process is going to terminate anyway, the only shortcoming
of this is whether a core is generated and whether the proper process
exit code is raised.

Issue: #1921
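
A sketch of the single-global-handler scheme this commit describes, under assumed names; the real code uses DR's ATOMIC_READ_1BYTE() and related primitives rather than a plain volatile flag:

    #include <signal.h>
    #include <stdbool.h>

    #define MAX_SIGNUM 64
    /* One recorded app handler per signal for the whole process: usable with no
     * dynamic storage late in detach. */
    static struct sigaction detached_app_handler[MAX_SIGNUM + 1];
    static bool handler_recorded[MAX_SIGNUM + 1];
    /* Set if threads ever disagree (rare: custom non-pthread clones); native
     * delivery then aborts as before. */
    static volatile bool multiple_handlers_present;

    static void
    record_detached_handler(int sig, const struct sigaction *thread_handler)
    {
        if (!handler_recorded[sig]) {
            detached_app_handler[sig] = *thread_handler;
            handler_recorded[sig] = true;
        } else if (detached_app_handler[sig].sa_sigaction !=
                   thread_handler->sa_sigaction) {
            multiple_handlers_present = true;
        }
    }
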
derekbruening added a commit that referenced this issue Dec 23, 2020
When a signal arrives in an completely unmanaged thread with no
dcontext, typically because DR is detaching, we now deliver the signal
if the application has a handler for it.  This requires adding support
for no dcontext to several parts of the frame setup code even beyond
what was added in PR #4603 for temporarily-native threads.

We have to save the app's handler when we detach a thread so we know
where to send a native signal.  Full support is complex when we're
cleaning up and have no dynamic storage, so we use a single global
handler per signal.  We detect whether multiple handlers are in
operation in this single DR instance (quite rare: only custom
non-pthread clones have this behavior) and in that case we abort like
before on a native signal.  Adds ATOMIC_READ_1BYTE() to complement the
existing atomic operations for a cleaner read of the new multi-handler
flag.

Delivering the frame often overlaps with DR's frame and even DR's
stack usage while delivering, if the app has no sigaltstack.  We add
logic to detect this overlap and avoid clobbering the stack memory.

Alarm signals are still dropped, since they can arrive mid-thread-init
when it is even harder to deliver.

Adds a new test api.detach_signal which creates 10 threads who all sit
in a loop sending 4 different alternating signals (SIGSEGV, SIGBUS,
SIGURG, SIGALRM) while the main thread attaches and then detaches.
When run in debug build, many many signals arrive in post-detach
threads, since detach takes a while to do debug cleanup, exercising
the new code.

Adds a new RSTAT for native signals so we can identify when this
happens in release build.  Exports the stat to the API and uses it to
check that at least some signals were delivered natively in the new
test.

Removes the fatal error on a signal arriving with no dcontext.  But,
non-ignore default signal actions when no handler is present are still
not fully handled, along with multi-sighand-processes as mentioned,
and the fatal error remains in those cases.  For default actions,
since the process is going to terminate anyway, the only shortcoming
of this is whether a core is generated and whether the proper process
exit code is raised.

Issue: #1921
derekbruening added a commit that referenced this issue Dec 30, 2020
Adds proper signal mask setting for native signals, which requires
adding handler-masked signals to the DR frame's mask instead of
setting the mask in the kernel.

Adds tests to api.detach_signal.  These mask tests also detect when a
DR-caused crash from a DR bug during detach shows up in the app's
handler post-detach due to detach races, including from #3535.

Issue: #1921, #3535
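
A sketch of the frame-mask approach from this commit, with illustrative names: instead of a sigprocmask() syscall, fold the app handler's sa_mask (and the signal itself, absent SA_NODEFER) into the frame's saved mask so it takes effect when the frame is consumed:

    #define _GNU_SOURCE /* for sigorset() */
    #include <signal.h>
    #include <ucontext.h>

    /* Add the signals the app's handler wants blocked to the frame's saved
     * mask, rather than setting the mask in the kernel directly. */
    static void
    add_handler_mask_to_frame(ucontext_t *frame_ucxt, int sig,
                              const struct sigaction *app_act)
    {
        sigset_t merged;
        sigorset(&merged, &frame_ucxt->uc_sigmask, &app_act->sa_mask);
        if (!(app_act->sa_flags & SA_NODEFER))
            sigaddset(&merged, sig); /* The delivered signal is blocked by default. */
        frame_ucxt->uc_sigmask = merged;
    }
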
derekbruening added a commit that referenced this issue Dec 31, 2020
…#4645)

The native frame adjustments to handle the DR handler being on the
native stack previously failed to handle an application with a
sigaltstack.  We correct that here.

Adds sigaltstack testing to the api.detach_signal test.  With a
smaller stack, this exercises the stack-too-small error message path
from DR.

Issue: #1921
derekbruening added a commit that referenced this issue Jan 9, 2021
Puts in place 6 fixes for handling signals during DR initialization,
typically in a start/stop setup where other threads are alive.

1) Copies the app handler at init time for delivering native signals
during init.

2) Reorders signal_arch_init(), which obtains the signal frame size, to
run before DR installs its handler.

3) Obtains the app handler before installing DR's handler, eliminating
a (narrow) race window.

4) Until DR starts executing the app, continues delivering native
signals and using the globally recorded app handler, to match how DR
init works.

5) Sets detacher_tid between init and setup to avoid races, like the one
in DR's handler at the end of init that was filed under #3535.

6) Handles a race where the init thread has set the global try_except,
causing master_signal_handler_C to think an app thread's signal is
DR's.  We add a global_try_tid and check the thread id to solve this
(see the sketch after this commit message).

Augments api.detach_signal with signals sent during init.
This causes the test to hit a DR bug in the vsyscall hook (#4664) on
32-bit Linux.  For now we disable the test for 32-bit.

Also tested on a large proprietary application.

Issue: #1921, #3535, #4664
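
A sketch of fix 6, with assumed names (global_try_except_cxt, global_try_tid) standing in for DR's actual try_except state:

    #include <stdbool.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Hypothetical globals mirroring the global try_except state described above. */
    static void *global_try_except_cxt;        /* non-NULL while a global try is active */
    static volatile pid_t global_try_tid = -1; /* thread that set the global try */

    /* Only treat a fault as DR's own recoverable TRY_EXCEPT fault if the global
     * try region was set by the faulting thread itself. */
    static bool
    fault_belongs_to_dr_try(void)
    {
        pid_t my_tid = (pid_t)syscall(SYS_gettid);
        return global_try_except_cxt != NULL && global_try_tid == my_tid;
    }
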
derekbruening added a commit that referenced this issue Jan 11, 2021
…has-dcontext-but-during-init case from PR #4662 (issue #1921) we need to use the stored app sigstack and not query the kernel and find DR's
derekbruening added a commit that referenced this issue Jan 11, 2021
A fault in an executable that statically links a client and DR
is currently reported as a client crash, even though we have no way to
distinguish client code from app code in such a situation.  We change
that here to invoke execute_native_handler(), which solves problems
where the application has a fault handler in place and a tool or other
mechanism generates faults during DR init or at other points.

Updates the has-dcontext-but-during-init case from PR #4662 (issue #1921)
to use the stored app sigstack and not query the kernel and find DR's sigstack.

The api.static_crash test exercises this path, and we updated its template here,
but it uses -unsafe_crash_process to trigger it.
I tried to reproduce an incoming-signal trigger by making a static version of
api.detach_signal, but it requires a fault-generating allocator
replacement (hit by droption and other places that use the system
allocator) or something similar, which is non-trivial to replicate in
a small test.  I did test on the original proprietary application.

Fixes #4640
@derekbruening
Contributor Author

I think it's time to close this. Nearly everything is covered. There are a few cases that are not, such as a sigaltstack too small for the extra space we take to copy the frame, or default actions; but those are going to exit anyway. Let's file separate issues on anything that becomes important enough to spend time on; I think we can certainly live with the current shortcomings.
