HANG: thread_suspend() waits forever if target thread in signal handler waiting on lock #184

Closed
derekbruening opened this issue Nov 27, 2014 · 1 comment

Comments

@derekbruening
Contributor

From derek.br...@gmail.com on July 31, 2009 13:32:28

If thread A is trying to synch with thread B, A holds thread_initexit_lock
when it calls thread_suspend(). On Linux, thread_suspend() waits forever for
B to confirm the suspend, so if B is itself waiting for thread_initexit_lock
(say, to translate a context for a prior signal), we have a deadlock. We
need a max count in the thread_suspend() wait.
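
To make the "max count" idea concrete, here is a minimal sketch of a bounded wait. It is not DynamoRIO's code: the helper name wait_for_suspend_ack(), the acked flag, and the 1 ms pause are all assumptions made for illustration. The only point is that the wait gives up after max_tries polls instead of spinning forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper (not a DynamoRIO function): poll a flag that the
 * target thread's suspend handler would set, giving up after max_tries
 * polls rather than waiting forever. */
bool wait_for_suspend_ack(const volatile sig_atomic_t *acked, int max_tries)
{
    const struct timespec pause = { 0, 1000000 }; /* assumed 1 ms per poll */
    for (int i = 0; i < max_tries; i++) {
        if (*acked)
            return true;          /* target confirmed the suspend */
        nanosleep(&pause, NULL);  /* back off before polling again */
    }
    return *acked != 0;           /* false: hit the max count */
}
```

A false return only tells the caller that the target has not confirmed yet. The suspend request itself is still outstanding.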

But if we hit the max count, it is impossible to back out: we cannot
decrement suspend_count (and the caller cannot call resume) because there is
no way to synchronize with the target thread and keep the signal from being
passed to the app. So the target thread is still going to be suspended, but
the caller is going to have to back out of its locks first.

However:

  • It is complex and fragile to have thread_suspend() time out: in the
    state where the signal has been sent but not yet confirmed as received,
    it is unsafe to call thread_resume(), and there is no way to retract the
    suspend request. Thus, all callers would have to handle the situation.
    It is simpler to let our suspend signal interrupt our own handler. We
    never send more than one before resuming, so there is no danger to stack
    usage. We have two real dangers:

    1. A SIGUSR2 from us interrupts DR in a way that causes the handling
      of the SIGUSR2 to deadlock or crash. A scan of the code makes it seem
      safe, but it is easy to miss things.

    2. Ditto for a SIGUSR2 from the app, but here handling of the signal
      does a lot more. I don't see any locks grabbed when interrupting DR,
      and even if we interrupt queue-to-pending, the worst I see is losing a
      signal due to the two writes it takes to insert a new pending entry,
      or messing up the special heap alloc: either re-using the same data
      struct (so delivering the 2nd after re-using the first after free,
      double-free, etc.) or losing a free list entry. I could easily be
      missing something, though. Is there any way to reduce the risk by
      watching SYS_kill(SIGUSR2), "stealing" SIGUSR2 (app sends 2 => really
      send 1, then convert), etc.? The catch is that we can't mangle signals
      sent externally.

    Given our existing bugs with interrupting DR, and given that we need to
    handle nested SIGSEGV (PR 287309) and thus move toward more re-entrancy
    anyway, I'm going forward with SIGUSR2 not being blocked in our handler.
    If we receive an app's SIGUSR2 while recording a prior signal, for now
    we just drop the SIGUSR2.
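
As a rough illustration of "SIGUSR2 not being blocked in our handler", the sketch below installs a handler whose sa_mask blocks every signal except SIGUSR2, so a suspend signal can be delivered while that handler is still running. This is not the r192 change itself: install_handlers(), the recording flag, and the use of SIGUSR1 as the stand-in "prior signal" are assumptions for the example, and raise() stands in for another thread sending the suspend signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical state, named for this example only. */
static volatile sig_atomic_t recording = 0;       /* inside the outer handler */
static volatile sig_atomic_t nested_suspends = 0; /* SIGUSR2 seen while nested */

static void suspend_handler(int sig)
{
    (void)sig;
    if (recording)
        nested_suspends++; /* arrived while the outer handler was running */
}

static void prior_signal_handler(int sig)
{
    (void)sig;
    recording = 1;
    /* Stand-in for a suspend request arriving mid-handler. SIGUSR2 is not
     * in this handler's mask, so it is delivered before raise() returns. */
    raise(SIGUSR2);
    recording = 0;
}

int install_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));

    /* Outer handler: block everything except the suspend signal. */
    sa.sa_handler = prior_signal_handler;
    sigfillset(&sa.sa_mask);
    sigdelset(&sa.sa_mask, SIGUSR2);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* Suspend handler: block everything while it runs. */
    sa.sa_handler = suspend_handler;
    sigfillset(&sa.sa_mask);
    if (sigaction(SIGUSR2, &sa, NULL) != 0)
        return -1;
    return 0;
}
```

After install_handlers() succeeds, raise(SIGUSR1) runs the outer handler and the nested SIGUSR2 is handled inside it rather than staying pending until the outer handler returns. Had SIGUSR2 been left in the outer handler's sa_mask, it would only be delivered after that handler finished, which is the blocking behavior the design above moves away from.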

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=184

@derekbruening
Contributor Author

From derek.br...@gmail.com on July 31, 2009 10:36:18

Fixed using the design above in r192.

Status: Verified
