
Scheduler hangs when user code attempts to "block" OS-threads #1036

Closed · brycelelbach opened this issue Dec 24, 2013 · 7 comments

@brycelelbach (Member)

I am attempting to "block" all OS-threads except for one, so that I can control when the execution of tasks begins in the HTTS benchmark.

My strategy for blocking OS-threads is to spawn an HPX thread for each one of the
OS-threads I'd like to block. E.g. if I have 4 OS-threads, I want to block 3 of them,
so I spawn 3 "blocker" threads. I created a unit test (tests.regressions.threads.block_os_threads) that implements this technique.

Each blocker thread will:

  • Synchronize with the master/controller thread when it is executing; this synchronization must happen without suspension at the HPX-thread level. In the unit test, this is handled by the "entered" atomic.
  • Block the OS-thread and wait for a "start" signal from the master/controller thread. In the unit test, this is handled by the "started" atomic.

I've tried the following, in order:

  • Using a boost::barrier for both synchronization points.
  • Using boost::condition and a boost::mutex for both synchronization points.
  • Using a boost::atomic for both synchronization points (the committed version of the unit test uses an atomic).

Also, I tried handling the "entered" synchronization in a few different ways:

  • Creating all the "blocker" threads at once and using one synchronization primitive for all of them.
  • Creating the "blocker" threads one at a time and using one synchronization primitive for each of them (this is what is done in the committed version).
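
For concreteness, here is a condensed sketch of the technique described above. The real code is the committed unit test (tests.regressions.threads.block_os_threads); the structure below is illustrative only, assuming hpx::async and hpx::get_os_thread_count behave as usual:

    #include <hpx/hpx.hpp>
    #include <hpx/hpx_init.hpp>

    #include <boost/atomic.hpp>

    #include <cstddef>
    #include <vector>

    // A blocker signals "entered" and then spins on "started". Because it
    // never suspends at the HPX-thread level, it keeps its OS-thread busy
    // for as long as the controller withholds the start signal.
    void blocker(boost::atomic<bool>* entered, boost::atomic<bool>* started)
    {
        entered->store(true);       // tell the controller we are running

        while (!started->load())    // "block" the OS-thread
            ;
    }

    int hpx_main()
    {
        std::size_t const num_os_threads = hpx::get_os_thread_count();

        boost::atomic<bool> started(false);
        std::vector<boost::atomic<bool>*> entered;
        std::vector<hpx::future<void> > blockers;

        // Create the blockers one at a time, each with its own "entered"
        // flag (as in the committed version of the test). Note that
        // hpx_main itself occupies one OS-thread while spinning here.
        for (std::size_t i = 0; i != num_os_threads - 1; ++i)
        {
            entered.push_back(new boost::atomic<bool>(false));
            blockers.push_back(hpx::async(&blocker, entered.back(), &started));

            while (!entered.back()->load())   // wait until it holds a core
                ;
        }

        started.store(true);        // release all blockers at once
        hpx::wait_all(blockers);    // ensure no blocker still reads 'started'

        for (std::size_t i = 0; i != entered.size(); ++i)
            delete entered[i];

        return hpx::finalize();
    }

    int main(int argc, char* argv[])
    {
        return hpx::init(argc, argv);
    }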

I run into trouble when I use this technique with more than 4 OS-threads. The
test doesn't always fail, but it fails pretty frequently on ariel00. It appears
that sometimes, the scheduler gets "stuck" - a few of the OS-threads will be
properly blocked, but the 1 remaining OS-thread will spin, looking for work. I
am fairly certain that the work is actually there (I've used performance
counters and the debugger to verify this); I think that for some reason,
blocking the other OS-threads prevents the last OS-thread from finding the
"blocker" that it is meant to execute.

@hkaiser (Member) commented Dec 24, 2013

FWIW, I'm able to get the test to hang only if it is run on more than one NUMA domain. The reason is that only one specific thread per NUMA domain is allowed to steal from a neighboring NUMA domain. If this particular thread is blocked then no work can be stolen from the neighboring NUMA domain, even if some is available. I'm not sure what we can do about that. I'm very inclined to close this ticket as "won't fix" as it relates to a border case which I doubt is interesting in general. Comments?

hkaiser added a commit that referenced this issue Dec 24, 2013
@brycelelbach (Member, Author)

I believe that the thread-affinity/NUMA-aware code that has been added to the scheduler is broken.

I opened the debugger onto a "hung" instance of the application
(block_os_threads).

  • Used a debug build of the application.
  • The application was running on 4 threads on ariel00.
  • I verified, using htop, hwloc-ls, and mpstat, that the
    4 OS-threads running the scheduler loops were bound to four cores on the first
    socket of the machine.

In GDB, I observed three of the OS-threads spinning on one of my atomic variables (as
they should be).

The last OS-thread was spinning, looking for work. It /should/ have been executing
hpx_main. When the application hung, hpx_main was either suspended or pending.

In GDB, I managed to catch the last OS-thread (the one that was looking for work)
at the following line: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/runtime/threads/policies/local_priority_queue_scheduler.hpp#L290.

The relevant code from the scheduler is:

            // Try to steal a thread from one of the other high-priority queues.
            for (std::size_t i = 0; i < high_priority_queue_size; ++i)
            {
                if (i == num_thread)
                    continue; // don't steal from ourselves

                // Skip queue i unless it is marked in one of the two NUMA
                // affinity masks; this is the check caught failing below.
                if (!test(this_numa_domain, i) && !test(numa_domain, i))
                    continue;

                if (high_priority_queues_[i]->get_next_thread(thrd, queue_size + i))
                {
                    // Stolen successfully; account for it and run the thread.
                    high_priority_queues_[i]->increment_num_stolen_threads();
                    return true;
                }
            }

Here's what I see in the debugger:

#0  0x00002aaaacec683e in hpx::threads::policies::local_priority_queue_scheduler<boost::mutex>::get_next_thread (this=0x2aaaaac38438, num_thread=2, running=true, 
    idle_loop_count=@0x2aaab53a1ec0: 100226, thrd=@0x2aaab53a1eb0: 0x0) at /home/wash/hpx/hpx/runtime/threads/policies/local_priority_queue_scheduler.hpp:290
290                 if (!test(this_numa_domain, i) && !test(numa_domain, i))
(gdb) print this_numa_domain
$3 = 21845
(gdb) print numa_domain
$4 = 0
(gdb) print i
$5 = 3
(gdb) print num_thread
$6 = 2
(gdb) print !test(this_numa_domain, i)
$7 = true
(gdb) print !test(numa_domain, i)
$8 = true
(gdb)
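
These values explain the skip, assuming test(mask, i) simply checks whether bit i of mask is set (my reading; the helper itself is not shown above). 21845 is 0x5555, a mask with only the even bits set, and numa_domain is an empty mask, so queue 3 passes neither check:

    #include <cassert>
    #include <cstddef>

    // Assumed semantics of the scheduler's test(): is bit i set in mask?
    inline bool test(std::size_t mask, std::size_t i)
    {
        return (mask & (std::size_t(1) << i)) != 0;
    }

    int main()
    {
        std::size_t const this_numa_domain = 21845; // 0x5555: bits 0, 2, 4, ... set
        std::size_t const numa_domain = 0;          // empty mask

        // Bit 3 is clear in both masks, so both checks fail and the stealing
        // loop skips queue 3, matching $7 and $8 in the GDB session above.
        assert(!test(this_numa_domain, 3) && !test(numa_domain, 3));
        return 0;
    }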

To summarize what we are seeing:

  • We are in a thread stealing loop.
  • The scheduler is checking whether the active OS-thread (thread 2) can steal from thread 3 (for NUMA reasons).
  • This check is /FAILING/.
  • Since the check fails, the OS-thread doesn't steal work from its neighbor.

Something is very wrong here.

  • For one thing, why is this thread affinity/NUMA-sensitive code always enabled?
  • Shouldn't we only use thread affinity/NUMA-sensitive scheduling when --hpx:numa-sensitive is enabled?
  • I think it's reasonable to have the NUMA-sensitive scheduling enabled by default, but I would like a way to disable it. Maybe --hpx:numa-sensitive=false.
  • I want to be able to completely disable all NUMA/thread affinity related scheduling from the command line. Should I just create a new scheduler for now?

@brycelelbach (Member, Author)

In reply to your comment:

FWIW, I'm able to get the test to hang only if it is run on more than one NUMA domain.

Okay, but I don't see this only on multiple NUMA domains. I have used a variety of tools (listed above) as well as --hpx:bind and --hpx:print-bind to verify that this issue occurs on ariel00 with a single socket. It does not show up with great frequency, but it does show up.

The reason is that only one specific thread per NUMA domain is allowed to steal from a neighboring NUMA domain.

I strongly believe that this policy is application specific and should be tunable. I would like to have the ability to disable this "throttling" mechanism.

My rationale: I believe that this policy has the potential to negatively impact our ability to steal work. I think that we should use NUMA/thread affinity information to decide where to steal work from first. However, I think it is a bad idea to prevent certain threads from stealing work from other threads.

If a scheduling thread cannot find any work in its NUMA domain, and it is not the "special" thread that is allowed to steal from the other domain, why should we prevent that thread from stealing work?

Additionally, we may create unnecessary contention this way. If a NUMA domain is work starved, and only one OS-thread in that domain is allowed to steal from other NUMA domains, then all the other threads in that NUMA domain will fight over the work.

What about systems with more than 1 NUMA domain? If I have 4 NUMA domains, is only one thread from each NUMA domain allowed to steal?

If this particular thread is blocked then no work can be stolen from the neighboring NUMA domain, even if some is available. I'm not sure what we can do about that.

I would suggest that the "one stealing OS-thread per NUMA domain" policy be made configurable. I would like to be able to disable this. I believe this may be causing problems with the HTTS benchmark results.
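
To make the alternative concrete: the stealing loop quoted earlier could be restructured along the following lines, preferring NUMA-local victims but falling back to any queue instead of reserving cross-domain stealing for one designated thread. This is only a sketch of a scheduler member function; same_numa_domain() is a hypothetical predicate standing in for the affinity-mask tests:

    // Hypothetical: true if the queues of OS-threads a and b live on the
    // same NUMA domain (the real scheduler encodes this in affinity masks).
    bool same_numa_domain(std::size_t a, std::size_t b);

    bool steal_high_priority_work(std::size_t num_thread, threads::thread_data*& thrd)
    {
        // First pass: prefer NUMA-local victims to keep memory traffic local.
        for (std::size_t i = 0; i < high_priority_queue_size; ++i)
        {
            if (i == num_thread || !same_numa_domain(num_thread, i))
                continue;
            if (high_priority_queues_[i]->get_next_thread(thrd, queue_size + i))
            {
                high_priority_queues_[i]->increment_num_stolen_threads();
                return true;
            }
        }

        // Second pass: the local domain had nothing, so let *any* thread
        // steal cross-domain rather than throttling to one thread per domain.
        for (std::size_t i = 0; i < high_priority_queue_size; ++i)
        {
            if (i == num_thread || same_numa_domain(num_thread, i))
                continue; // local queues were already tried above
            if (high_priority_queues_[i]->get_next_thread(thrd, queue_size + i))
            {
                high_priority_queues_[i]->increment_num_stolen_threads();
                return true;
            }
        }

        return false; // nothing to steal anywhere
    }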

@hkaiser (Member) commented Dec 25, 2013

I would suggest that the "one stealing OS-thread per NUMA domain" policy be made
configurable. I would like to be able to disable this. I believe this may be causing
problems with the HTTS benchmark results.

I'd be willing to accept a pull request for this.

@sithhell (Member)

I don't think the particular test in question is correct. It makes various assumptions about the underlying thread scheduling mechanism that are not guaranteed, for example that threads are scheduled in a round-robin fashion. Nevertheless, bugs in the existing schedulers were revealed. However, the intended behavior should be achievable in a more scheduler-independent manner.

@hkaiser (Member) commented Jan 5, 2014

What is the status of this?

@brycelelbach (Member, Author)

This is resolved now; I agree that it wasn't really a bug.

@ghost assigned hkaiser Jan 7, 2014
hkaiser added a commit that referenced this issue Jan 14, 2014
hkaiser added a commit that referenced this issue Jan 15, 2014