tckmap hang on Windows #255
Comments
Sounds nasty - debugging multi-threading issues is really hard (and this does sound like some kind of race condition issue). I tend to find it's easier to inspect the code than trying to debug it any other way. What surprises me is that this wouldn't be an issue on Linux. I'd expect that if this was a race condition, it would manifest on all platforms. Strange... In any case, you're in for some fun... I typically find the only real way to debug race conditions like this is to inspect the code very carefully to spot which of the objects I'm writing to might not have been copy-constructed as a fully independent copy, or something like that.... Happy hunting! |
Yeah, not looking forward to it. My initial thought was that the ProgressBar fix hadn't propagated, but it's not that. |
I'm also a bit stumped as to why it's not an issue on Linux... I'd be looking for exotic -rather than generic- explanations first. If it is a multi-threading thing gone haywire though, the fact that it hangs rather than crashes at least naively points to a deadlock (rather than a race condition). This is where I can (finally!) bring out the ✨ formally trained informatician ✨ in me. As far as I can remember from a course in parallel systems and distributed computing, we were once taught/drilled that deadlocks are the "easiest" of all parallel problems to fix (as in: find the cause, not necessarily fix for real), because you can "simply" inspect a stack trace when deadlocked, which should provide you with important clues as to the cause of the deadlock... Apart from that, I only remember the course coming with 2 absolutely massive books that resulted in a 10 day studying marathon that resulted in a trauma that resulted in me banning most other useful information from my memory after the exam. So much for the formal training. 😐 |
... You really just wanted an excuse to use ✨ , didn't you? :-P Trouble is, it refuses to lock when running in debug mode, making the stack untraceable. I think I just need to throw a whole lot of TRACE statements in there... |
More to the point: the reason there might be a deadlock in the first place is more than likely related to corruption of the internal structures managing the queues - and the likely cause of said corruption is probably a rare race condition on a non mutex-protected variable somewhere... So it's either a bug in the Thread::Queue (unlikely given that it only seems to impact tckmap, but not impossible), or some form of memory corruption impacting on the queue - and this all assumes the process hangs somewhere within the queue... So first off, is tckmap still consuming a lot of CPU when it hangs? If so, then this is unlikely to be due to a deadlock in the queue - the threads would just sit idle waiting for the mutex to be released. In that case, I'd look closely at any loop within tckmap's operators for any potential infinite loop. Otherwise, see if valgrind shows up anything. I'm guessing you'll have to run this on Linux, but if there is a bug at that level, it should be present on all platforms. If that doesn't help, we should try to set up a stress test app for the Thread::Queue (and include that in the testing repo), essentially running with no-op operators to maximise the amount of thread contention... If that refuses to lock up on any platform, at least you'll know the problem is elsewhere... But as I said before, the most likely problem is a race condition on an unprotected shared variable. And this can be subtle - for example, I had trouble a while back passing RefPtr objects through the queue - hadn't considered that any copying/destruction in one thread would be accessing the shared reference counter for that object, which is not thread-safe... Basically, I'd have a long hard look at the objects passed through the queue for any hidden issues like that. |
By the way, if this is triggered by a rare race condition, sticking TRACE macros everywhere will probably slow things down and drastically reduce the amount of thread contention, making it really unlikely to trigger the bug... |
Another idea: if you can use GDB on Windows, see if you can attach GDB to the running process once it hangs. You won't be able to get a detailed breakdown of where in the source your process might be stuck, but hopefully the backtrace will at least tell you which part of the queue is hanging... |
Just came across the poor man's profiler page, with some really useful & easy tips. In your case, if you get the process to hang, you can inspect what each thread is doing with this command:

gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof tckmap)

I had to run it with admin rights, but it works great. I got dwi2fod to infinite-loop on one of the threads so it would hang, then ran the above:

$ sudo gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof dwi2fod)
[New LWP 16134]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f78a5f4355d in pthread_join () from /usr/lib/libpthread.so.0
Thread 2 (Thread 0x7f787b7fe700 (LWP 16134)):
#0 0x000000000042ca8f in void Processor::operator()<MR::Image<float>, MR::Image<float> >(MR::Image<float>&, MR::Image<float>&) ()
#1 0x000000000041735e in MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > >::operator()(MR::Iterator const&) ()
#2 0x00000000004175da in MR::(anonymous namespace)::__Outer<MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > > >::execute() ()
#3 0x0000000000419b1e in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, void> >::_M_invoke(std::_Any_data const&) ()
#4 0x0000000000419de2 in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&) ()
#5 0x00007f78a5f48e3b in __pthread_once_slow () from /usr/lib/libpthread.so.0
#6 0x0000000000416e7c in std::thread::_Impl<std::_Bind_simple<std::__future_base::_Async_state_impl<std::_Bind_simple<std::_Mem_fn<void (MR::(anonymous namespace)::__Outer<MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > > >::*)()> (MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > >*)>, void>::_Async_state_impl(std::_Mem_fn<void (MR::(anonymous namespace)::__Outer<MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > > >::*)()> (&&)(MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > >*))::{lambda()#1} ()> >::_M_run() ()
#7 0x00007f78a672ddf0 in execute_native_thread_routine () from /usr/lib/libstdc++.so.6
#8 0x00007f78a5f42374 in start_thread () from /usr/lib/libpthread.so.0
#9 0x00007f78a5c8027d in clone () from /usr/lib/libc.so.6
Thread 1 (Thread 0x7f78a77b5740 (LWP 16111)):
#0 0x00007f78a5f4355d in pthread_join () from /usr/lib/libpthread.so.0
#1 0x00007f78a672dc87 in std::thread::join() () from /usr/lib/libstdc++.so.6
#2 0x00007f78a5f48e3b in __pthread_once_slow () from /usr/lib/libpthread.so.0
#3 0x000000000041b2e6 in void std::call_once<void (std::thread::*)(), std::reference_wrapper<std::thread> >(std::once_flag&, void (std::thread::*&&)(), std::reference_wrapper<std::thread>&&) ()
#4 0x000000000041b332 in std::__future_base::_Async_state_commonV2::_M_complete_async() ()
#5 0x0000000000415dde in MR::Thread::(anonymous namespace)::__multi_thread<MR::(anonymous namespace)::__Outer<MR::(anonymous namespace)::__RunFunctor<2, Processor<MR::Image<float>, MR::Image<float> > > > >::wait() ()
#6 0x0000000000418926 in run() ()
#7 0x00000000004087b6 in main ()

How good is that? 😁 |
Looks like everybody's waiting for everybody else... Unfortunately even running the debug compiled binary outside of |
Hmm... I need to check some code on my work system. From memory, while I was working on the solution to #36 (which I really should finish...), I did some reading on condition variables, spurious wakeups etc., and basically found that the whole thing is not overly robust. I need to remind myself of the code, see if there's anything useful in there. |
OK. From memory, the issue I had with CVs was that ... Looking like this is where the problem is. If I change ... |
... Huh. Might actually be a compiler / library issue. If I replace:
with:
, the problem goes away. ... Go figure. |
Mmmm.... Is that a fix then...? |
Not really... you don't want to assume that no queue will ever be stagnant for more than some fixed period of time. Could set it to the lifetime of the universe, I guess. But I'd rather try a newer library first, see if the problem goes away. |
Sure. I'm guessing by 'newer library', you mean a newer version of GCC or a different threading model...? By the way, not sure I understand the issue you mention in your previous comment re. ... It looks more likely to be a buggy threading implementation than anything else; we've never experienced this issue on Linux... |
Yeah.
My previous experience suggests that calling |
I've never witnessed this behaviour... Are you able to replicate this...? I ran quite a few tests when I was implementing this, and it never locked up on me - once I had it working, that is... According to the doc, the implementation will wake up one of the waiting threads. The thread might choose to carry on waiting if its predicate is not true - in our case, the thread goes back to waiting if there is actually no data to process, but at least one writer is still registered with the queue (which might happen if one of the non-waiting threads happened to grab the data just at that moment). Maybe this is where things are getting stuck...? |
May have to rig up a MWE if everything else falls short. Don't suppose you still have any test cmd's lying around? I may be extrapolating a little here; what I found with my own work (on Linux) was that calling ... If anyone's got a better theory, I'm all ears. I might also look into using a different mutex for the CV as opposed to sharing the one used for data access... |
Sounds a bit odd... No commands lying around, I'm afraid, but you might find some in the git history (although they would have since been deleted, so not easy to find). But all it was was a few no-op operators passing lightweight objects around - basically going for maximum contention to get things to fall over as much as possible. As to your last point, I was a bit perplexed by that statement too, but reading it again, it says:
(emphasis mine). Looks like we don't want to be holding the lock when calling notify... |
By the way, not sure what you were doing with ... But like I said, that depends on exactly what it was you were doing... |
I also had similar issues with tckmap if not using -nthreads 0, and also with dwi2response with a particular dataset, where it's even more reproducible. Lowering the number of threads helped as well (using 12 instead of 16 now) - just to let you know. |
Was just thinking that, this being a race condition, increasing the batch size may help to reduce the problem. Had a look, and ... @steso Can you try updating your code to what I just pushed, and see if you can get tckmap to hang? |
OK, testing right now. By the way, there is some M_PI usage in fixelcfestats.cpp and fixel2tsf.cpp giving me some trouble compiling. I saw a commit fixing this issue in other functions as well, using Math::pi from math.h. |
Worked for me! (64x no hang, different data) |
This seems to prevent hangs on Windows, as discussed in #255.
OK, so my guess is that the implementation of ... @steso Thanks for the report and the testing. Since I very rarely use my Windows system for work, and the regular testing framework we're putting in place isn't Windows-compatible, can you let me know if you encounter further hangs in any other commands? I just modified ... |
Ok, this does sound like a slightly buggy implementation of the C++11 threads API on MinGW. The stack trace posted earlier mentioned a lot of potential stack corruption, the behaviour is reported to depend on the particular version of the compiler, and there are no issues on other platforms. It's hard to believe that the Windows native threading primitives would be buggy (and they are being used, according to the stack trace), but not too much of a stretch to believe that the interface with MinGW might still need a bit of polish... So while the current fix is probably just sweeping the issue under the rug (I'd be very surprised if this has actually fixed it; my guess is it'll only reduce its incidence to negligible levels), it's likely this will magically fix itself with future versions of MinGW, so definitely not worth investing much effort into it... |
The previous hangs could not be reproduced with the new code, but I'll keep you posted if there are some more hangs due to multi-threading. The code also freezes in sh2peaks; if I remember correctly, -nthreads 0 doesn't even help there. But I have another compiled version from a different system to solve this issue temporarily. So if you could point me in the right direction, I might find the difference in library / MinGW version between the two systems. |
Prevents hanging on Windows systems as discussed in #255.
I did make a change to |
That fix did it for me so far, again! Keep up the good work ;-) I'll let you know if something else hangs. |
Make use of Thread::batch() to reduce the load on multi-threading queues, which can hang in heavy processing on Windows. Related to #255.
I had some more hangs, this time in tckgen, probably also related to the threading problem. I must admit I'm not using the newest version right now, so if it's already fixed, sorry for bothering you. If I check my build.log, it seems I'm still using 0.3.12-1027-g14c576b4-dirty. The hang occurs during the FOD segmenting step, which I think is only needed if seed_dynamic is given as an argument. |
Can you confirm how you set up your installation? Is this using the pre-built Qt download instructions or the more recent MSYS2 instructions? I was hoping the MSYS2 installation would work OK by virtue of providing much more up to date components (in particular the compiler and/or POSIX threads library implementation, which seems to be the issue here)... |
I did set up my installation using the (old) Qt instructions, but I'm not sure if there would be a difference in threading libraries? Unfortunately I can't really mess with my installation right now in order to check whether the issue persists if the MSYS2 installation instructions are used... |
Sure, I understand you don't necessarily want to mess up your install in the process... There will be a difference in the compiler and libraries used, the MSYS2 version allows you to install all kinds of other packages, including gcc and Qt. And they seem to keep things very much up to date. The compiler used in the Qt download was 4.8 or so, whereas MSYS2 provides version 5.3.0... That hopefully will bring in fixes related to the threading library too - but of course we won't know till we try. 😉 |
At least I have a newer Qt version, 5.2.something, I can check the exact version tomorrow |
Sure, but the Qt version really has nothing to do with this - it's the compiler supplied with the download that matters, and specifically the threading libraries it provides. As far as I remember, even the most recent Qt download used GCC 4.8.6 or so, which is very unlikely to have had that particular bug fixed (assuming that is indeed where the problem is...). |
GCC version bundled with my MinGW installation is 4.8.3, so this might be an issue... |
I can't say that I've thoroughly tested the MSYS2 Windows install for these hanging cases, but it wouldn't surprise me if the underlying condition variable bug hasn't yet been found and fixed, even in the most up-to-date GCC. Once the automated tests are completed for #435, I'll merge it to master. @steso If you then pull the latest changes and re-compile, hopefully that'll prevent the hanging. I don't do heavy processing on my Windows systems, so please continue to report any other issues you encounter. |
On my Windows machine, tckmap is hanging at random percentage markers. This happens even with -quiet. It doesn't seem to occur with -nthreads 0 or -nthreads 1; seems to be more frequent with more threads than less.