Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tckmap hang on Windows #255

Closed
Lestropie opened this issue May 13, 2015 · 37 comments
Closed

tckmap hang on Windows #255

Lestropie opened this issue May 13, 2015 · 37 comments
Assignees

Comments

@Lestropie
Copy link
Member

On my Windows machine, tckmap is hanging at random percentage markers.

  • Percentage at which it hangs changes every time, or sometimes it completes without issue.
  • Doesn't occur in debug mode.
  • Not fixed using -quiet.
  • Fixed using -nthreads 0 or -nthreads 1; seems to be more frequent with more threads than less.
@Lestropie Lestropie added the bug label May 13, 2015
@jdtournier
Copy link
Member

Sounds nasty - debugging multi-threading issues is really hard (and this does sound like some kind of race condition issue). I tend to find it's easier to inspect the code than trying to debug it any other way.

What surprises my is that this wouldn't be an issue on Linux. I'd expect if this was a race condition, it would manifest on all platforms. Strange...

In any case, you're in for some fun... I typically find the only real way to debug race conditions like this is to inspect the code very carefully to spot which of the objects I'm writing to might not have copy-constructed as a fully independent copy, or something like that.... Happy hunting!

@Lestropie
Copy link
Member Author

Yeah, not looking forward to it. My initial thought was that the ProgressBar fix hadn't propagated, but it's not that.

@thijsdhollander
Copy link
Contributor

I'm also a bit stumped as to why it's not an issue on Linux... I'd be looking for exotic -rather than generic- explanations first. It's really tckmap fully hanging, right? (like, not only the progressbar not updating or something, while the track mapping is still running)

If it is a multi-threading thing gone haywire though, the fact that it hangs rather than crashes at least naively points to a deadlock (rather than a race condition). This is where I can (finally!) bring out the ✨ formally trained informatician ✨ in me. As far as I can remember from a course in parallel systems and distributed computing, we were once taught/drilled deadlocks are the "easiest" from all parallel problems to fix (as in: find the cause, not necessarily fix for real), because you can "simply" inspect a stack trace when deadlocked, which should provide you with important clues as to the cause of the deadlock...

Apart from that, I only remember the course coming with 2 absolutely massive books that resulted in a 10 day studying marathon that resulted in a trauma that resulted in me banning most other useful information from my memory after the exam. So far for the formal training. 😐

@Lestropie
Copy link
Member Author

... You really just wanted an excuse to use ✨ , didn't you? :-P

Trouble is, it refuses to lock when running in debug mode, making the stack untraceable. I think I just need to throw a whole lot of TRACE macros everywhere, and hope to place one inside the erroneous loop.

@jdtournier
Copy link
Member

More to the point: the reason there might be a deadlock in the first place is more than likely related to corruption of the internal structures managing the queues - and the likely cause of said corruption is probably a rare race condition on a non mutex-protected variable somewhere...

So it's either a bug in the Thread::Queue (unlikely given that it only seems to impact tckmap, but not impossible), or some form of memory corruption impacting on the queue - and this all assumes the process hangs somewhere within the queue...

So first off, is tckmap still consuming a lot of CPU when it hangs? If so, then this is unlikely to a due to a deadlock in the queue - the threads would just sit idle waiting for the mutex to be released. In that case, I'd look closely at any loop within tckmap's operators for any potential infinite loop.

Otherwise, see if valgrind shows up anything. I'm guessing you'll have to run this on Linux, but if there is a bug at that level, it should be present on all platforms.

If that doesn't help, we should try to set up a stress test app for the Thread::Queue (and include that in the testing repo), essentially running with no-op operators to maximise the amount of thread contention... If that refuses to lock up on any platform, at least you'll know the problem is elsewhere...

But as I said before, the most likely problem is a race condition on a unprotected shared variable. And this can be subtle - for example, I had trouble a while back passing RefPtr objects through the queue - didn't think that any copying/destruction in one thread will be accessing the shared reference counter for that object, which is not thread-safe... Basically, I'd have a long hard look at the objects passed through the queue for any hidden issues like that.

@jdtournier
Copy link
Member

By the way, if this is triggered by a rare race condition, sticking TRACE macros everywhere will probably slow things down and drastically reduce the amount of thread contention, making it really unlikely to trigger the bug...

@jdtournier
Copy link
Member

Another idea: if you can use GDB on Windows, see if you can attach GDB to the running process once it hangs. You won't be able to get a detailed breakdown of where in the source your process might be stuck, but hopefully the backtrace will at least tell you which part of the queue is hanging...

@jdtournier
Copy link
Member

Just came across the poor man's profiler page, with some really useful & easy tips. In your case, if you get the process to hang, you can inspect what each thread is doing with this command:

gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof tckmap)

I had to run it with admin rights, but it works great. I got dwi2fod to infinite-loop on one of the threads so it would hang, then ran the above:

$ sudo gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof dwi2fod)
[New LWP 16134]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f78a5f4355d in pthread_join () from /usr/lib/libpthread.so.0

Thread 2 (Thread 0x7f787b7fe700 (LWP 16134)):
#0  0x000000000042ca8f in void Processor::operator()<MR::Image<float>, MR::Image<float> >(MR::Image<float>&, MR::Image<float>&) ()
#1  0x000000000041735e in MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > >::operator()(MR::Iterator const&) ()
#2  0x00000000004175da in MR::(anonymous namespace)::__Outer<MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > > >::execute() ()
#3  0x0000000000419b1e in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&) ()
#4  0x0000000000419de2 in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&) ()
#5  0x00007f78a5f48e3b in __pthread_once_slow () from /usr/lib/libpthread.so.0
#6  0x0000000000416e7c in std::thread::_Impl<std::_Bind_simple<std::__future_base::_Async_state_impl<std::_Bind_simple<std::_Mem_fn<void (MR::(anonymous namespace)::__Outer<MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > > >::*)()> (MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > >*)>, void>::_Async_state_impl(std::_Mem_fn<void (MR::(anonymous namespace)::__Outer<MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > > >::*)()> (&&)(MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > >*))::{lambda()#1} ()> >::_M_run() ()
#7  0x00007f78a672ddf0 in execute_native_thread_routine () from /usr/lib/libstdc++.so.6
#8  0x00007f78a5f42374 in start_thread () from /usr/lib/libpthread.so.0
#9  0x00007f78a5c8027d in clone () from /usr/lib/libc.so.6

Thread 1 (Thread 0x7f78a77b5740 (LWP 16111)):
#0  0x00007f78a5f4355d in pthread_join () from /usr/lib/libpthread.so.0
#1  0x00007f78a672dc87 in std::thread::join() () from /usr/lib/libstdc++.so.6
#2  0x00007f78a5f48e3b in __pthread_once_slow () from /usr/lib/libpthread.so.0
#3  0x000000000041b2e6 in void std::call_once<void (std::thread::*)(), std::reference_wrapper<std::thread> >(std::once_flag&, void (std::thread::*&&)(), std::reference_wrapper<std::thread>&&) ()
#4  0x000000000041b332 in std::__future_base::_Async_state_commonV2::_M_complete_async() ()
#5  0x0000000000415dde in MR::Thread::(anonymous namespace)::__multi_thread<MR::(anonymous namespace)::__Outer<MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > > > >::wait() ()
#6  0x0000000000418926 in run() ()
#7  0x00000000004087b6 in main ()

How good is that? 😁

@Lestropie
Copy link
Member Author

(gdb) thread apply all bt

Thread 12 (Thread 6076.0x1d44):
#0  0x0000000076dbce31 in ntdll!DbgBreakPoint () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x0000000076e63d18 in ntdll!DbgUiRemoteBreakin () from C:\Windows\SYSTEM32\ntdll.dll
#2  0x0000000076b659cd in KERNEL32!BaseThreadInitThunk () from C:\Windows\system32\kernel32.dll
#3  0x0000000076d9b981 in ntdll!RtlUserThreadStart () from C:\Windows\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()

Thread 11 (Thread 6076.0x1328):
#0  0x0000000076dbe16a in ntdll!ZwWaitForMultipleObjects () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x000007fefce81430 in KERNELBASE!GetCurrentProcess () from C:\Windows\system32\KernelBase.dll
#2  0x00000000005b9090 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 10 (Thread 6076.0x1fbc):
#0  0x0000000076dbdbfa in ntdll!ZwWaitForSingleObject () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x0000000076dbadd8 in ntdll!RtlDeNormalizeProcessParams () from C:\Windows\SYSTEM32\ntdll.dll
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 9 (Thread 6076.0x1ea0):
#0  0x0000000076dbe16a in ntdll!ZwWaitForMultipleObjects () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x000007fefce81430 in KERNELBASE!GetCurrentProcess () from C:\Windows\system32\KernelBase.dll
#2  0x000000000036fb50 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 8 (Thread 6076.0x234c):
#0  0x0000000076dbe16a in ntdll!ZwWaitForMultipleObjects () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x000007fefce81430 in KERNELBASE!GetCurrentProcess () from C:\Windows\system32\KernelBase.dll
#2  0x00000000002ea5c0 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 7 (Thread 6076.0x1db4):
#0  0x0000000076dbe16a in ntdll!ZwWaitForMultipleObjects () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x000007fefce81430 in KERNELBASE!GetCurrentProcess () from C:\Windows\system32\KernelBase.dll
#2  0x00000000002d8530 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 6 (Thread 6076.0x23dc):
#0  0x0000000076dbdbfa in ntdll!ZwWaitForSingleObject () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x000007fefce810dc in WaitForSingleObjectEx () from C:\Windows\system32\KernelBase.dll
#2  0x0000000064941eec in pthread_cond_init () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#3  0x00000000649430f7 in pthread_mutex_init () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#4  0x000000000047ba96 in MR::Thread::Queue<MR::DWI::Tractography::Streamline<float> >::pop(MR::DWI::Tractography::Streamline<float>*&) ()
#5  0x000000000042e74c in MR::Thread::(anonymous namespace)::__Pipe<MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel>::execute() ()
#6  0x000000000048d57c in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&) ()
#7  0x000000000048ca7f in std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&) ()
#8  0x0000000064945191 in pthread_once () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#9  0x000000000043176e in std::thread::_Impl<std::_Bind_simple<std::__future_base::_Async_state_impl<std::_Bind_simple<std::_Mem_fn<void (MR::Thread::(anonymous namespace)::__Pipe<MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel>::*)()> (MR::Thread::(anonymous namespace)::__Pipe<MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel>*)>, void>::_Async_state_impl(std::_Bind_simple<std::_Mem_fn<void (MR::Thread::(anonymous namespace)::__Pipe<MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel>::*)()> (MR::Thread::(anonymous namespace)::__Pipe<MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel>*)>&&)::{lambda()#1} ()> >::_M_run() ()
#10 0x000000006fcb4ee0 in libstdc++-6!execute_native_thread_routine () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libstdc++-6.dll
#11 0x0000000064944dfa in pthread_create_wrapper () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#12 0x000007fefe4c415f in srand () from C:\Windows\system32\msvcrt.dll
#13 0x000007fefe4c6ebd in msvcrt!_ftime64_s () from C:\Windows\system32\msvcrt.dll
#14 0x0000000076b659cd in KERNEL32!BaseThreadInitThunk () from C:\Windows\system32\kernel32.dll
#15 0x0000000076d9b981 in ntdll!RtlUserThreadStart () from C:\Windows\SYSTEM32\ntdll.dll
#16 0x0000000000000000 in ?? ()

Thread 5 (Thread 6076.0x1cfc):
#0  0x0000000076dbe16a in ntdll!ZwWaitForMultipleObjects () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x000007fefce81430 in KERNELBASE!GetCurrentProcess () from C:\Windows\system32\KernelBase.dll
#2  0x00000000002e4590 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 4 (Thread 6076.0x1c18):
#0  0x0000000076dbe16a in ntdll!ZwWaitForMultipleObjects () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x000007fefce81430 in KERNELBASE!GetCurrentProcess () from C:\Windows\system32\KernelBase.dll
#2  0x00000000002de560 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 3 (Thread 6076.0x564):
#0  0x0000000076dbdbfa in ntdll!ZwWaitForSingleObject () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x000007fefce810dc in WaitForSingleObjectEx () from C:\Windows\system32\KernelBase.dll
#2  0x0000000064941eec in pthread_cond_init () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#3  0x00000000649430f7 in pthread_mutex_init () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#4  0x000000000047ba96 in MR::Thread::Queue<MR::DWI::Tractography::Streamline<float> >::pop(MR::DWI::Tractography::Streamline<float>*&) ()
#5  0x000000000042e74c in MR::Thread::(anonymous namespace)::__Pipe<MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel>::execute() ()
#6  0x000000000048d57c in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&) ()
#7  0x000000000048ca7f in std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&) ()
#8  0x0000000064945191 in pthread_once () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#9  0x000000000043176e in std::thread::_Impl<std::_Bind_simple<std::__future_base::_Async_state_impl<std::_Bind_simple<std::_Mem_fn<void (MR::Thread::(anonymous namespace)::__Pipe<MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel>::*)()> (MR::Thread::(anonymous namespace)::__Pipe<MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel>*)>, void>::_Async_state_impl(std::_Bind_simple<std::_Mem_fn<void (MR::Thread::(anonymous namespace)::__Pipe<MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel>::*)()> (MR::Thread::(anonymous namespace)::__Pipe<MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel>*)>&&)::{lambda()#1} ()> >::_M_run() ()
#10 0x000000006fcb4ee0 in libstdc++-6!execute_native_thread_routine () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libstdc++-6.dll
#11 0x0000000064944dfa in pthread_create_wrapper () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#12 0x000007fefe4c415f in srand () from C:\Windows\system32\msvcrt.dll
#13 0x000007fefe4c6ebd in msvcrt!_ftime64_s () from C:\Windows\system32\msvcrt.dll
#14 0x0000000076b659cd in KERNEL32!BaseThreadInitThunk () from C:\Windows\system32\kernel32.dll
#15 0x0000000076d9b981 in ntdll!RtlUserThreadStart () from C:\Windows\SYSTEM32\ntdll.dll
#16 0x0000000000000000 in ?? ()

Thread 2 (Thread 6076.0x14a0):
#0  0x0000000076dbdbfa in ntdll!ZwWaitForSingleObject () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x000007fefce810dc in WaitForSingleObjectEx () from C:\Windows\system32\KernelBase.dll
#2  0x0000000064941eec in pthread_cond_init () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#3  0x000000006494217e in pthread_cond_init () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#4  0x00000000649423ff in pthread_cond_signal () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#5  0x000000006fca1009 in libstdc++-6!_ZNSt18condition_variable10notify_oneEv () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libstdc++-6.dll
#6  0x000000000042cebc in MR::Thread::(anonymous namespace)::__Source<MR::DWI::Tractography::Streamline<float>, MR::DWI::Tractography::Mapping::TrackLoader&>::execute() ()
#7  0x000000000048d57c in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&) ()
#8  0x000000000048ca7f in std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&) ()
#9  0x0000000064945191 in pthread_once () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#10 0x000000000042fd8e in std::thread::_Impl<std::_Bind_simple<std::__future_base::_Async_state_impl<std::_Bind_simple<std::_Mem_fn<void (MR::Thread::(anonymous namespace)::__Source<MR::DWI::Tractography::Streamline<float>, MR::DWI::Tractography::Mapping::TrackLoader&>::*)()> (MR::Thread::(anonymous namespace)::__Source<MR::DWI::Tractography::Streamline<float>, MR::DWI::Tractography::Mapping::TrackLoader&>*)>, void>::_Async_state_impl(std::_Bind_simple<std::_Mem_fn<void (MR::Thread::(anonymous namespace)::__Source<MR::DWI::Tractography::Streamline<float>, MR::DWI::Tractography::Mapping::TrackLoader&>::*)()> (MR::Thread::(anonymous namespace)::__Source<MR::DWI::Tractography::Streamline<float>, MR::DWI::Tractography::Mapping::TrackLoader&>*)>&&)::{lambda()#1} ()> >::_M_run() ()
#11 0x000000006fcb4ee0 in libstdc++-6!execute_native_thread_routine () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libstdc++-6.dll
#12 0x0000000064944dfa in pthread_create_wrapper () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#13 0x000007fefe4c415f in srand () from C:\Windows\system32\msvcrt.dll
#14 0x000007fefe4c6ebd in msvcrt!_ftime64_s () from C:\Windows\system32\msvcrt.dll
#15 0x0000000076b659cd in KERNEL32!BaseThreadInitThunk () from C:\Windows\system32\kernel32.dll
#16 0x0000000076d9b981 in ntdll!RtlUserThreadStart () from C:\Windows\SYSTEM32\ntdll.dll
#17 0x0000000000000000 in ?? ()

Thread 1 (Thread 6076.0x1700):
#0  0x0000000076dbdbfa in ntdll!ZwWaitForSingleObject () from C:\Windows\SYSTEM32\ntdll.dll
#1  0x000007fefce810dc in WaitForSingleObjectEx () from C:\Windows\system32\KernelBase.dll
#2  0x00000000649462e2 in pthread_join () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#3  0x000000006fca6f4a in libstdc++-6!_ZNSt6thread4joinEv () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libstdc++-6.dll
#4  0x0000000064945191 in pthread_once () from c:\Qt\482\qt-5.3.1-x64-mingw482r4-seh-opengl\bin\libwinpthread-1.dll
#5  0x0000000000495eea in void std::call_once<void (std::thread::*)(), std::reference_wrapper<std::thread> >(std::once_flag&, void (std::thread::*&&)(), std::reference_wrapper<std::thread>&&) ()
#6  0x000000000048cb35 in std::__future_base::_Async_state_common::_M_run_deferred() ()
#7  0x000000000041ac9d in MR::Thread::(anonymous namespace)::__single_thread::wait() ()
#8  0x0000000000423bca in void MR::Thread::run_queue<MR::DWI::Tractography::Mapping::TrackLoader&, MR::DWI::Tractography::Streamline<float>, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>, MR::DWI::Tractography::Mapping::SetVoxel, MR::DWI::Tractography::Mapping::MapWriterBase&>(MR::DWI::Tractography::Mapping::TrackLoader&, MR::DWI::Tractography::Streamline<float> const&, MR::Thread::(anonymous namespace)::__Multi<MR::DWI::Tractography::Mapping::TrackMapperTWI>&&, MR::DWI::Tractography::Mapping::SetVoxel const&, MR::DWI::Tractography::Mapping::MapWriterBase&, unsigned long long) [clone .constprop.3691] ()
#9  0x0000000000428125 in run() ()
#10 0x0000000000497807 in main ()
(gdb)

Looks like everybody's waiting for everybody else... Unfortunately even running the debug compiled binary outside of gdb refuses to lock. Might have a better chance writing a stress test app.

@Lestropie
Copy link
Member Author

Hmm... I need to check some code on my work system. From memory, while I was working on the solution to #36 (which I really should finish...), I did some reading on condition variables, spurious wakeups etc., and basically found that the whole thing is not overly robust. I need to remind myself of the code, see if there's anything useful in there.

@Lestropie
Copy link
Member Author

OK. From memory, the issue I had with CVs was that notify_one() gives no guarantee of actually waking a thread, and even notify_all() runs the risk of having a race-condition with a spurious wakeup and so a thread may be left sleeping. My solution was to be extra-careful with my predicates, and also to have an atomic counter for the number of 'active' threads, repeatedly calling notify_all() in the master thread until I received notice that my workers had all woken up.

Looking like this is where the problem is. If I change lib/thread_queue.cpp:631,647 from notify_one() to notify_all() calls, the frequency of lockup goes down by about a factor of 10, but doesn't disappear entirely...

@Lestropie
Copy link
Member Author

... Huh. Might actually be a compiler / library issue.

If I replace:

more_data.wait (lock, [this]{ return !(empty() && writer_count); });

with:

if (!more_data.wait_for (lock, std::chrono::seconds(10), [this]{ return !(empty() && writer_count); })) {
  std::cerr << "pop() timed out.\n";
  return false;              
}

, the problem goes away.

... Go figure.

@jdtournier
Copy link
Member

Mmmm.... Is that a fix then...?

@Lestropie
Copy link
Member Author

Not really... you don't want to assume that no queue will ever be stagnant for more than some fixed period of time. Could set it to the lifetime of the universe, I guess. But I'd rather try a newer library first, see if the problem goes away.

@jdtournier
Copy link
Member

Sure. I'm guessing by 'newer library', you mean newer version of GCC or a different threading model...?

By the way, not sure I understand the issue you mention in your previous comment re. notify_one() / notify_all(). Why do you need all threads to wake up? The more_data CV calls notify_one() every time an item is pushed to the queue, it should mean that as long as there's enough data coming through, all threads will eventually be woken up and used to capacity. I certainly don't see why this would introduce any scope for lock-ups. Am I missing something...?

It looks more likely to be a buggy threading implementation than anything else, we've never experienced this issue on Linux...

@Lestropie
Copy link
Member Author

I'm guessing by 'newer library', you mean newer version of GCC or a different threading model...?

Yeah.

By the way, not sure I understand the issue you mention in your previous comment

My previous experience suggests that calling notify_one() doesn't guarantee to wake up a thread, even if there's threads waiting. So if reader threads go to sleep, the writer thread calls notify_one(), but the reader threads don't wake up, eventually the queue is full and everything stagnates. By calling notify_all() instead, it's just giving the reader threads more messages and therefore making it harder for them to hang erroneously.

@jdtournier
Copy link
Member

I've never witnessed this behaviour... Are you able to replicate this...? I ran quite a few tests when I was implementing this, and it never locked up on me - once I had it working, that is...

According to the doc, the implementation will wake up one of the waiting threads. The thread might chose to carry on waiting if its predicate is not true - in our case, the thread goes back to waiting if there is actually no data to process, but at least one writer is still registered with the queue (might happen if one of the non-waiting threads happened to grab the data just at that moment). Maybe this is where things are getting stuck...?

@Lestropie
Copy link
Member Author

May have to rig up a MWE if everything else falls short. Don't suppose you still have any test cmd's lying around?

I may be extrapolating a little here; what I found with my own work (on Linux) was that calling notify_all() would fail to wake all threads, even though they would be expected to all be asleep at the time; so I had to add an atomic counter, which each worker thread would increment as it woke up, and the master thread would keep calling notify_all() until the workers had all indicated that they were awake. Based on that, it wouldn't surprise me if notify_one() is failing to wake a thread.

If anyone's got a better theory, I'm all ears. I might also look into using a different mutex for the CV as opposed to sharing the one used for data access...

@jdtournier
Copy link
Member

Sounds a bit odd...

No commands lying around, I'm afraid, but you might find some in the git history (although they would have since been deleted, so not easy to find). But all it was was a few no-op operators passing lightweights objects around - basically going for maximum contention to get things to fall over as much as possible.

As to your last point, I was a bit perplexed by that statement too, but reading it again, it says:

The notifying thread does not need to hold the lock on the same mutex as the one held by the waiting thread(s);

(emphasis mine). Looks like we don't want to be holding the lock when calling notify_one() (which we don't), but it's otherwise OK - unless we're going to lock it again immediately after. But since both reader & writer need to access the same structure, they really do need to use the same mutex...

@jdtournier
Copy link
Member

By the way, not sure what you were doing with notify_all(), but if it was riding on top of the current Thread::Queue class, it wouldn't be too surprising - if you only push one or two items down the queue, only that number of threads would appear to wake up, the rest of them would find their predicate being false (no data left) and go back to waiting...

But like I said, that depends on exactly what it was you were doing...

@steso
Copy link

steso commented Aug 25, 2015

I also had similar issues with tckmap if not using -nthreads 0 and also with dwi2response with a particular dataset, even more reproducible. Lowering the threads helped as well. (Using 12 instead of 16 now), just to let u know.

Lestropie added a commit that referenced this issue Aug 25, 2015
May assist in solving #255.
@Lestropie
Copy link
Member Author

Was just thinking that being a race condition, increasing the batch size may help to reduce the problem. Had a look, and tckmap wasn't batching. :-/

@steso Can you try updating your code to what I just pushed, and see if you can get tckmap to hang? If just setting up the multi-threading properly can get rid of the problem, I'll have a closer dig at dwi2response and hopefully fix that too.

@steso
Copy link

steso commented Aug 25, 2015

ok testing right now, by the way there is some M_PI usage in fixelcfestats.cpp and fixel2tsf.cpp giving me some trouble compiling. I saw a commit fixing this issue in other functions as well using Math::pi from math.h.
tckmap looks promising so far, I'll let you know if it hangs again, just testing with 17 datasets.

@steso
Copy link

steso commented Aug 25, 2015

Worked for me! (64x no hang, different data)

Lestropie added a commit that referenced this issue Aug 25, 2015
This seems to prevent hangs on Windows, as discussed in #255.
@Lestropie
Copy link
Member Author

OK, so my guess is that the implementation of std::condition_variable is slightly fragile on Windows, and that flogging a multi-threaded queue is how it comes out. Solution is to use the beautiful Thread::batch() to reduce the load (much better than my earlier wait_for() suggestion).

@steso Thanks for the report and the testing. Since I very rarely use my Windows system for work, and the regular testing framework we're putting in place isn't Windows-compatible, can you let me know if you encounter further hangs in any other commands? I just modified dwi2response and haven't been able to hang it since, but given it looks like a library implementation problem there's a chance it could still pop up elsewhere.

@Lestropie Lestropie self-assigned this Aug 26, 2015
@jdtournier
Copy link
Member

Ok, this does sound like a slightly buggy implementation of the C++11 threads API on MinGW. The stack trace posted earlier mentioned a lot of potential stack corruption, the behaviour is reported to depend on the particular version of the compiler, and there's no issues on other platforms. It's hard to believe that the windows native threading primitives would be buggy (and they are being used, according to the stack trace), but not too much of a stretch to believe that the interface with MinGW might still need a bit of polish...

So while the current fix is probably just sweeping the issue under the rug (I'd be very surprised if this has actually fixed it, my guess it'll only reduce its incidence to negligible levels), it's likely this will magically fix itself with future versions of MinGW, so definitely not worth investing much effort into it...

@steso
Copy link

steso commented Aug 27, 2015

The previous hangs could not been reproduced with the new code, but I'll keep you posted if there are some more hangs due to multi threading. The code also freezes in sh2peaks, if I remember correctly -nthreads 0 doesn't even help. But I have another compiled version from a different system to solve this issue temporarily. So if you could point me into the right direction, I might find the difference in library / MinGW version between the two systems.

Lestropie added a commit that referenced this issue Aug 27, 2015
Prevents hanging on Windows systems as discussed in #255.
@Lestropie
Copy link
Member Author

I did make a change to sh2peaks only a couple of days ago, which would have increased the load on the threading queue so made deadlocks more likely. Just fixed that one; might do a full search today to look for any Thread::run_queue() instances that aren't using batching and possibly should be.

@steso
Copy link

steso commented Aug 28, 2015

That fix did it for me so far, again! Keep up the good work ;-) I'll let you know if something else hangs.

Lestropie added a commit that referenced this issue Oct 4, 2015
Make use of Thread::batch() to reduce the load on multi-threading queues, which can hang in heavy processing on Windows.
Related to #255.
@steso
Copy link

steso commented Jan 21, 2016

I had some more hangs, this time in tckgen, probably also related to the thread problematic. I must admit I'm not using the newest version right now, so if it's already fixed sorry for bothering you. If I check in my build.log it seems like I'm still using 0.3.12-1027-g14c576b4-dirty. The hang occurs during the segmenting FOD step which I think is only needed if seed_dynamic is given as an argument.

@jdtournier
Copy link
Member

Can you confirm how you set up your installation? Is this using the pre-built Qt download instructions or the more recent MSYS2 instructions? I was hoping the MSYS2 installation would work OK by virtue of providing much more up to date components (in particular the compiler and/or POSIX threads library implementation, which seems to be the issue here)...

@steso
Copy link

steso commented Jan 21, 2016

I did set up my installation using the (old) Qt instructions, but not sure if there would be a difference in threads libraries? Unfortunately I can't really mess with my installation right now in order to check if the issue is persistent if the MSYS2 installation instructions are used...

@jdtournier
Copy link
Member

Sure, I understand you don't necessarily want to mess up your install in the process... There will be a difference in the compiler and libraries used, the MSYS2 version allows you to install all kinds of other packages, including gcc and Qt. And they seem to keep things very much up to date. The compiler used in the Qt download was 4.8 or so, whereas MSYS2 provides version 5.3.0... That hopefully will bring in fixes related to the threading library too - but of course we won't know till we try. 😉

@steso
Copy link

steso commented Jan 21, 2016

At least I have a newer Qt version, 5.2.something, I can check the exact version tomorrow

@jdtournier
Copy link
Member

Sure, but the Qt version really has nothing to do with this - it's the compiler supplied with the download that matters, and specifically the threading libraries it provides. As far as I remember, even the most recent Qt download used GCC 4.8.6 or so, which is very unlikely to have had that particular bug fixed (assuming that is indeed where the problem is...).

@steso
Copy link

steso commented Jan 22, 2016

GCC version bundled with my MinGW installation is 4.8.3, so this might be an issue...

@Lestropie
Copy link
Member Author

I can't say that I've thoroughly tested the MSYS2 Windows install for these hanging cases, but it wouldn't surprise me if the underlying condition variable bug hasn't yet been found and fixed even in the most up-to-date gcc.

Once the automated tests are completed for #435, I'll merge it to master: @steso If you then pull the latest changes and re-compile, hopefully that'll prevent the hanging. I don't do heavy processing on my Windows systems so continue to report any other issues you encounter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants