Scheduler hangs when user code attempts to "block" OS-threads #1036
FWIW, I'm able to get the test to hang only if it is run on more than one NUMA domain. The reason is that only one specific thread per NUMA domain is allowed to steal from a neighboring NUMA domain. If this particular thread is blocked, no work can be stolen from the neighboring NUMA domain, even if some is available. I'm not sure what we can do about that. I'm inclined to close this ticket as "won't fix", as it relates to a corner case which I doubt is interesting in general. Comments?
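For illustration, the restriction can be boiled down to the following self-contained sketch (not the actual HPX scheduler code; the two-domains-of-two layout and helper names are made up for the example):

```cpp
// Minimal sketch of the stealing restriction described above: only one
// designated thread per NUMA domain may steal across domain boundaries.
#include <cstddef>
#include <iostream>

std::size_t const threads_per_domain = 2;

std::size_t domain_of(std::size_t t) { return t / threads_per_domain; }

// Thread 0 of each domain is the designated cross-domain stealer.
bool is_designated_stealer(std::size_t t) { return t % threads_per_domain == 0; }

// May thread `thief` steal from thread `victim`'s queue?
bool may_steal(std::size_t thief, std::size_t victim)
{
    if (domain_of(thief) == domain_of(victim))
        return thief != victim;              // same domain: always allowed
    return is_designated_stealer(thief);     // cross-domain: one thread only
}

int main()
{
    // If thread 0 (the designated stealer of domain 0) is blocked in user
    // code, thread 1 cannot reach any work queued in domain 1: the hang.
    for (std::size_t thief = 0; thief != 4; ++thief)
        for (std::size_t victim = 0; victim != 4; ++victim)
            std::cout << "thread " << thief
                      << (may_steal(thief, victim) ? " may " : " may not ")
                      << "steal from thread " << victim << "\n";
    return 0;
}
```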
I believe that the thread-affinity/NUMA-aware code that has been added to the scheduler is broken. I attached the debugger to a "hung" instance of the application.

In GDB, I observed three of the OS-threads spinning on one of my atomic variables (as expected). The last OS-thread was spinning, looking for work; it /should/ have been executing one of the remaining "blocker" threads. I managed to catch that last OS-thread in GDB inside the scheduler's work-finding loop, and I inspected the relevant scheduler code and state in the debugger.

To summarize what we are seeing: the work is there, but the searching OS-thread never finds it. Something is very wrong here.
In reply to your comment:
Okay; I see this even without multiple NUMA domains. I have used a variety of tools (listed above), as well as --hpx:bind and --hpx:print-bind, to verify that this issue occurs on ariel00 with a single socket. It does not show up with great frequency, but it does show up.
I strongly believe that this policy is application specific and should be tunable; I would like to have the ability to disable this "throttling" mechanism. My rationale: this policy has the potential to negatively impact our ability to steal work. I think we should use NUMA/thread-affinity information to decide where to steal work from first, but it is a bad idea to prevent certain threads from stealing work from other threads altogether. If a scheduling thread cannot find any work in its NUMA domain, and it is not the "special" thread that is allowed to steal from the other domain, why should we prevent that thread from stealing work? Additionally, we may create unnecessary contention this way: if a NUMA domain is work-starved, and only one OS-thread in that domain is allowed to steal from other NUMA domains, then all the other threads in that domain will fight over the work it brings in. And what about systems with more than two NUMA domains? If I have 4 NUMA domains, is only one thread from each NUMA domain allowed to steal?

I would suggest that the "one stealing OS-thread per NUMA domain" policy be made configurable; I would like to be able to disable it. I believe this may be causing problems with the HTTS benchmark results.
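To make the alternative concrete, here is a self-contained sketch of the ordering-based approach argued for above (hypothetical, not HPX code; the helper names and two-threads-per-domain layout are illustrative):

```cpp
// Affinity information decides where to steal *first*, but no thread is
// ever forbidden from stealing: every thread eventually tries every victim.
#include <cstddef>
#include <iostream>
#include <vector>

std::size_t const num_threads = 4;
std::size_t const threads_per_domain = 2;

std::size_t domain_of(std::size_t t) { return t / threads_per_domain; }

// Victims ordered by preference: local NUMA domain first, then the rest.
std::vector<std::size_t> steal_order(std::size_t thief)
{
    std::vector<std::size_t> order;
    for (std::size_t v = 0; v != num_threads; ++v)
        if (v != thief && domain_of(v) == domain_of(thief))
            order.push_back(v);
    for (std::size_t v = 0; v != num_threads; ++v)
        if (domain_of(v) != domain_of(thief))
            order.push_back(v);
    return order;
}

int main()
{
    for (std::size_t t = 0; t != num_threads; ++t)
    {
        std::cout << "thread " << t << " steals in order:";
        for (std::size_t v : steal_order(t))
            std::cout << ' ' << v;
        std::cout << '\n';
    }
    return 0;
}
```

The point of the design is that affinity only biases the victim order; no thread is ever left with zero victims while work exists elsewhere.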
I'd be willing to accept a pull request for this.
I don't think the particular test in question is correct. It makes various assumptions about the underlying thread scheduling mechanism that are not guaranteed, for example that threads are scheduled in a round-robin fashion. Nevertheless, it did reveal bugs in the existing schedulers. However, the intended behavior should be achievable in a more scheduler-independent manner.
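One conceivable way to make the blocking technique less scheduler-dependent (an illustrative idea, not what the test or HPX implements) is to have each blocker claim the OS-thread it actually landed on, yielding whenever that OS-thread is already held by another blocker. Whether a yielded HPX thread really migrates to another OS-thread still depends on the scheduler, so this fragment is only a sketch:

```cpp
// Hypothetical scheduler-independent blocker: no assumption about where
// the scheduler places each blocker. Supports up to 64 OS-threads.
#include <hpx/hpx.hpp>
#include <boost/atomic.hpp>
#include <cstdint>

boost::atomic<std::uint64_t> blocked_mask(0); // bit i set: OS-thread i claimed
boost::atomic<bool> released(false);

void blocker()
{
    for (;;)
    {
        std::uint64_t const bit =
            std::uint64_t(1) << hpx::get_worker_thread_num();

        // Claim the OS-thread we are currently running on, if it is free.
        if (!(blocked_mask.fetch_or(bit) & bit))
            break;

        // Some other blocker already holds this OS-thread; yield so the
        // scheduler has a chance to resume us elsewhere.
        hpx::this_thread::yield();
    }

    while (!released.load())
        /* spin without yielding, keeping the claimed OS-thread occupied */;
}
```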
What is the status of this? |
This is resolved now; I agree that it wasn't really a bug.
Referenced commit: "…OS-threads" (merge conflicts: hpx/runtime/threads/policies/local_priority_queue_scheduler.hpp)
I am attempting to "block" all OS-threads except for one, so that I can control when the execution of tasks begins in the HTTS benchmark.
My strategy for blocking OS-threads is to spawn an HPX thread for each one of the OS-threads I'd like to block. E.g., if I have 4 OS-threads and I want to block 3 of them, I spawn 3 "blocker" threads. I created a unit test (tests.regressions.threads.block_os_threads) that implements this technique (a sketch of the overall approach follows below).
Each blocker thread will:

- signal that it has started running (the "entered" synchronization point), and then
- spin until it is released (the second synchronization point).

I've tried the following, in order:

- boost::barrier for both synchronization points.
- boost::condition and a boost::mutex for both synchronization points.
- boost::atomic for both synchronization points (the committed version of the unit test uses an atomic).

Also, I tried handling the "entered" synchronization in a few different ways.
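For reference, the overall technique looks roughly like this (a hypothetical sketch using boost::atomic, as in the committed version; the real test is tests.regressions.threads.block_os_threads, and the HPX headers and helpers named here are approximate):

```cpp
// Block all OS-threads except the one executing main(), then release them.
#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <hpx/include/threads.hpp>
#include <boost/atomic.hpp>
#include <cstddef>
#include <vector>

boost::atomic<std::size_t> entered(0);  // "entered" synchronization point
boost::atomic<bool> released(false);    // second synchronization point

void blocker()
{
    entered.fetch_add(1);    // signal that this blocker is running
    while (!released.load()) // spin without yielding: the OS-thread is blocked
        ;
}

int main()
{
    std::size_t const num_os_threads = hpx::get_os_thread_count();

    // Spawn one blocker per OS-thread we want to occupy (all but one).
    std::vector<hpx::future<void>> blockers;
    for (std::size_t i = 0; i != num_os_threads - 1; ++i)
        blockers.push_back(hpx::async(&blocker));

    // Wait until every blocker has entered.
    while (entered.load() != num_os_threads - 1)
        hpx::this_thread::yield();

    // ... the timed portion of the benchmark would run here ...

    released.store(true);    // release the blockers
    hpx::wait_all(blockers);
    return 0;
}
```

With more than 4 OS-threads, this is exactly the setup that triggers the behavior described next.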
I run into trouble when I use this technique with more than 4 OS-threads. The test doesn't always fail, but it fails pretty frequently on ariel00. It appears that sometimes the scheduler gets "stuck": a few of the OS-threads will be properly blocked, but the one remaining OS-thread will spin, looking for work. I am fairly certain that the work is actually there (I've used performance counters and the debugger to verify this); I think that, for some reason, blocking the other OS-threads prevents the last OS-thread from finding the "blocker" that it is meant to execute.