Skip to content
This repository was archived by the owner on Oct 12, 2022. It is now read-only.

fix deadlock when creating a thread while executing DllMain in another thread #86

Merged
merged 2 commits into from
Jul 10, 2012

Conversation

rainers
Copy link
Member

@rainers rainers commented Nov 21, 2011

A recent bug report for Visual D exhibited a problem with the thread creation
inside a DLL: if a thread is created while another thread is just excecuting
DllMain(THREAD_ATTACH), a dead lock is almost inevitable:

Main thread, that creates the new thread executes:

Thread.start()
{
    //....
    synchronized( slock )
    {
   -> m_hndl = cast(HANDLE) _beginthreadex( null, m_sz, &thread_entryPoint, cast(void*) this, 0, &m_addr );
      add( this );
    }
}

While the other thread executes:

DllMain()
{
    dll_thread_attach()
    {
        //....
        thread_findByAddr()
        {
        -> synchronized( Thread.slock )
           {
              foreach( t; Thread ) ...

The problem is: the OS serializes calls to CreateThread and DllMain, so the
threads hang at the marked lines.

The solution is to create the thread suspended without the slock, and just add and resume it
with the lock acquired.

@MartinNowak
Copy link
Member

I think we should rather fix this by calling dll_thread_attach and
dll_attach not from DllMain.

@rainers
Copy link
Member Author

rainers commented Jan 18, 2012

These are different use cases. The deadlock appears in a DLL that does
not rely on other DLLs and includes the runtime. The stuff in 4071 deals
with sharing the runtime in a phobos.dll. The problem mentioned here
will actually also appear in phobos.dll if it exists.

I don't think that you can move dll_attach/detach_* elsewhere, it is the
THE callback if the DLL is loaded into a host that has no idea of the
DLLs' implementation or the host creates new threads and calls the DLL
in these threads.

@MartinNowak
Copy link
Member

The idea is to use DLLMain only to register the library and then
let the one that loaded the library start the initialize afterwards.

The host should either explicitly initialize D's runtime or if that is not
possible due to a limited interface one could lazy initialize it on first use.
Another idea to solve this is scheduling an async callback, but it can't be triggered reliably.

A small subset will remain where you can't avoid it and we should provide
the right subset of functions so that all of these scenarios can be implemented.

Making DLLMain the default place for initialization is a mistake as it rules
out basically everything you want to do in a module constructor.

IIUC a solution to the above deadlock would be to do the initialization from the
thread entry point. This required to know all libraries and it would rule out
initialization of foreign threads.

@rainers
Copy link
Member Author

rainers commented Jan 18, 2012

Avoiding the loader lock might work with an explicite initialization
call, but you don't have that in a host that is D agnostic. You would
also need that initialization call for any thread created in the process
if it might call into the DLL.

Both described workarounds are not feasable:

  • lazy initialization would need checks not only in exported functions,
    but also in any interface method of objects that have been passed to the
    calling application in other threads.
  • Queing an APC for initialization ignores the fact that the DLL loading
    application will probably call an exported function immediately after
    loading it, and the APC is probably not run yet. Adding a
    synchronization would result in the same thing as lazy initialization.

I am not sure we have to solve an issue that no other language has yet
solved. When writing a DLL, your startup code should not rely on
anything that relies on passing the DLL loader lock.

So I think, the patch solves a problem of the current runtime, but not
that of any additional init code added by the user. What's wrong with
applying it until we have a better solution?

@MartinNowak
Copy link
Member

I'm really breaking my head about a good DLL thread model, and while Windows
at least has one there is none for POSIX, so we need to define one.
Hopefully we can find a common model for all platforms.

You would also need that initialization call for any thread created in the process if it might call into the DLL.

I'd argue this must be done explicitly, e.g. by providing an init_thread
function in your library (not applicable for VisualD, I know).
Stealing threads from the host is by no means a solution
especially since we are sending them to sleep for GC collection.

DLL_THREAD_ATTACH won't get called for already running threads
when the library is loaded. What do you do about them?

For initialisation outside of DllMain

As a plugin writer you still control the doors to your library,
so it should be possible to defer initialisation up to the first
call into your library.

m_hndl = cast(HANDLE) _beginthreadex( null, m_sz, &thread_entryPoint, cast(void*) this, 0, &m_addr );
if( cast(size_t) m_hndl == 0 )
throw new ThreadException( "Error creating thread" );
ResumeThread( m_hndl );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the return value.

Why is it safe to call ResumeThread?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. I will add the return code check.

When I debugged it, it was CreateThread that acquired the loader lock, but not ResumeThread. So the lock hierarchy is fine with slock never acquired before the loader lock (the order is given by DllMain now).

@rainers
Copy link
Member Author

rainers commented Jan 19, 2012

DLL_THREAD_ATTACH won't get called for already running threads
when the library is loaded. What do you do about them?

dll_attach_process iterates over all existing threads of the process and initializes TLS by "impersonating" the thread (switching a pointer in the thread environment block).

I'll have to think about how the initialization could work in general, but I'm not familiar with Posix. I guess it has some init function to be called when the shared library is loaded, but probably none to get notified when a thread is created.

@MartinNowak
Copy link
Member

dll_attach_process iterates over all existing threads of the process and initializes TLS by "impersonating" the >thread (switching a pointer in the thread environment block).
You can't do that in Posix because the TLS base is stored in a segment register.
It may also lead to TLS data races.

I'll have to think about how the initialization could work in general, but I'm not familiar with Posix. I guess it has >some init function to be called when the shared library is loaded, but probably none to get notified when a thread >is created.
Exactly.

As the pull request does fix the lock order we should go for it.

@rainers
Copy link
Member Author

rainers commented Jan 24, 2012

As the pull request does fix the lock order we should go for it.

I've rebased, resolved the conflict and added the return value check for
ResumeThread.

@dnadlinger
Copy link
Contributor

Huh? Two identical (as for the actual diff) show now up in the history for me.

@rainers
Copy link
Member Author

rainers commented Jan 24, 2012

Huh? Two identical (as for the actual diff) show now up in the
history for me.

I don't understand. What is duplicated for you? Maybe something related
to rebasing? Am I supposed not to do this in a pull request? I had to
resolve a merge conflict...

@MartinNowak
Copy link
Member

No, rebasing is not really an issue with pull requests. But you managed to merge the
changes into your already rebased version of them.
Try "git reset 43525e9" to get rid of the merge commit and things should be fine.

@rainers
Copy link
Member Author

rainers commented Jan 24, 2012

After reset, I cannot push anymore (error: failed to push some refs). When I pull as the error messgae tells me, there is nothing to push. Arrrrggh, that's why I hate git, whenever you try to do something it gets in the way and steals all your time.

BTW: the diff in github looks fine to me.

@MartinNowak
Copy link
Member

Even simpler:
git push -f rainers 43525e9:threadstart_deadlock

@rainers
Copy link
Member Author

rainers commented Jan 24, 2012

Thanks, that might have done it.

@dnadlinger
Copy link
Contributor

Yes, looks great now! As for the actual commit, though I'm not a Windows expert, I see the problem and the fix seems reasonable, so +1 from me.

@rainers
Copy link
Member Author

rainers commented Jan 31, 2012

dll_attach_process iterates over all existing threads of the process and initializes TLS by "impersonating" the >thread (switching a pointer in the thread environment block).

You can't do that in Posix because the TLS base is stored in a segment register.

I've been reading about TLS on linux in this document from 2005: http://www.akkadia.org/drepper/tls.pdf

Is the __tls_get_addr mechanism still used? If yes, a TLS initialisation callback could be added by replacing __tls_get_addr with a statically linked replacement implementation.

@MartinNowak
Copy link
Member

I've dismissed that idea, because initialization of TLS is done lazily but the static this thread initialization
is not restricted to TLS data, i.e. it must be run before executing any code from that library.

When loading libraries through core.runtime we should do it right away, which requires to traverse
the library DAG.

@andralex
Copy link
Member

andralex commented Jul 8, 2012

@dawgfoto Is this good to go? There's rumor on the street it blocks making druntime a shared lib in the future. Please advise, thanks!

@MartinNowak
Copy link
Member

There's rumor on the street it blocks making druntime a shared lib in the future.

Which street?

We're doing way too much in DllMain and we'll keep getting deadlocks (static initialization with the Windows loader lock held).
But this solves a particular issue.

@MartinNowak MartinNowak merged commit 43525e9 into dlang:master Jul 10, 2012
@rainers
Copy link
Member Author

rainers commented Jul 10, 2012

There's rumor on the street it blocks making druntime a shared lib in the future.

I think the patch is vital for the DLL that contains the thread creation code i.e. druntime.dll.

I agree, that there can be too much going on inside DllMain, but it isn't a good idea to put heavyweight code into thread-local static constructors to start with, as they slow down every thread created in the process. Its a bit more critical for shared static constructors, but I'd suggest lazy initialization for the win!

Maybe it was intentional, but I think there was a merge error: the patch moved "add(this)" before ResumeThread (reasoning in the comment before), but it is below in the commit.

@MartinNowak
Copy link
Member

Maybe it was intentional, but I think there was a merge error

It was intentionally. I fixed a bug that was introduced by moving the add. If an exception were thrown on error you'd end up with a stale Thread entry. Because this is synchronized(slock) I don't see how changing the add order would make a difference to another thread.

@rainers
Copy link
Member Author

rainers commented Jul 10, 2012

You are right, the order should be irrelevant because of the
synchronization. IIRC I had changed the order before the patch with
ResumeThread.

@MartinNowak MartinNowak mentioned this pull request Mar 5, 2013
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants