
WIP: parallel task runtime #22631

Open: wants to merge 4 commits into master from kp/partr
Conversation

kpamnany (Contributor) commented Jun 30, 2017

This replaces the existing fork-join threading infrastructure with a parallel task runtime (partr) that implements parallel depth first scheduling. This model fully supports nested parallelism.

The default remains the original threading code. Enable partr by setting JULIA_PARTR := 1 in your Make.user.
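
For example, a minimal Make.user containing just this setting:

JULIA_PARTR := 1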

The core idea is simple -- Julia tasks can now be run by any thread. The task scheduler attempts to order task execution depth-first for provably better cache efficiency, and for true nested parallelism.

However, as tasks are an existing feature in Julia and are used in a number of places, this PR first introduces the infrastructure that will enable parallel tasks, while (hopefully) preserving the serial semantics of the existing task interface. This PR does not introduce any new interface calls for parallel tasks -- those will come in future PRs.

All test cases pass with JULIA_PARTR off (as they should). With JULIA_PARTR on, all test cases are currently passing on Linux and OS X.

Cc: @JeffBezanson, @vtjnash, @yuyichao, @ViralBShah, @vchuravy, @anton-malakhov.

jtravs (Contributor) commented Jun 30, 2017

Any chance you could give a very simple example of what the interface would look like in user code?

kpamnany (Contributor, Author) commented Jun 30, 2017

The Julia interface is not designed yet. While elaborations are possible, the essence of the interface is similar to Cilk's, so something like:

t1 = @spawn foo(1, 2) # foo(1, 2) will run asynchronously, possibly in another thread
res1 = @sync t1 # res1 will get the return value of foo(1, 2)
t2 = @parfor (+) i = 1:10 # iterations may run in parallel
  i-1
end
res2 = @sync t2 # res2 = 45
tknopp (Contributor) commented Jun 30, 2017

@kpamnany: Does this mean I can have a background thread/task running in parallel with the foreground thread? In other words: can I run a thread asynchronously?

kpamnany (Contributor, Author) commented Jun 30, 2017

@tknopp: yes, that's what spawn will do. If you spawn a task and there's more than one thread, it will start running right away. It will continue to run until it reaches a yield point (another spawn, a sync, a parfor, or an explicit yield).

while (jl_atomic_load_acquire(&tiarg->state) == TI_THREAD_INIT)
    jl_cpu_pause();  // spin-wait until the new thread finishes initializing

// Assuming the functions called below doesn't contain unprotected GC

Sacha0 (Member) Jun 30, 2017

"doesn't" -> "don't"?

STATIC_INLINE uint64_t cong(uint64_t max, uint64_t unbias, uint64_t *seed)
{
    // LCG step; loop rejects draws above `unbias` (rejection sampling to avoid modulo bias)
    while ((*seed = 69069 * (*seed) + 362437) > unbias)
        ;

Sacha0 (Member) Jun 30, 2017

Semicolon on separate line as a ghost of the empty loop body or unintentional extra space?

src/partr.c Outdated

init_started_thread();

// Assuming the functions called below doesn't contain unprotected GC

Sacha0 (Member) Jun 30, 2017

"doesn't" -> "don't" here as well?

ViralBShah (Member) commented Jul 2, 2017

Seems worthwhile to experiment with the Projects feature for this one.

davidanthoff (Contributor) commented Jul 5, 2017

Just out of curiosity, is this something that might make it into 1.0?

ViralBShah (Member) commented Jul 10, 2017

We will try to get it into 1.0 if it is ready before the feature freeze, but we will not hold up 1.0 for it. I am personally hopeful that it will be ready by 1.0. Hope that helps.

StefanKarpinski (Member) commented Jul 11, 2017

If it can't make it into 1.0 in complete form, we can at least include it in experimental form and try our damnedest to leave room for it in the 1.x series – we really don't want to have to wait until 2.0 for full-on threading support.

amitmurthy (Member) commented Jul 18, 2017

@kpamnany, will the spawned tasks be able to perform asynchronous I/O via the libuv event loop? Is the plan to have the main Julia thread run the event loop and to run all compute-only tasks on separate threads?

Currently, in the Distributed module, incoming requests are executed in separate tasks, and each invocation can also make additional remote calls. This means any I/O is blocked while a worker is busy with a computation. Will it be possible for incoming requests to be executed via a threadpool, with any I/O calls internally routed to the main Julia thread running the event loop?

kpamnany (Contributor, Author) commented Jul 18, 2017

@amitmurthy: who runs the event loop (and how) is a good question. As a general statement, irrespective of Julia, unless you reserve a thread for I/O, requests can be serviced late or very slowly. But you don't always have, or want to spare, a thread to reserve. Ideally, this should be the program's choice.

Executing tasks are not preempted. The API entry points (spawn, sync, and parfor) may cause the calling task to yield. However, this runtime allows for sticky tasks, i.e. tasks that only run on the thread that started them. Sticky tasks do not yield in spawn and parfor. So, you can create a sticky I/O task and drive the event loop from it. It's pretty straightforward to allow tasks to perform asynchronous I/O requests, but it isn't obvious how to get completion notifications. I'm not entirely sure how to do this right now but @vtjnash and @JeffBezanson have probably thought this out in greater detail (they suggested sticky tasks).

Clearly it would be a useful enhancement to this runtime to add the ability to trigger a task based on an event. But that gets us into having to define events and decide semantics for event mux/demux, and that opens many questions -- are there system events? Can multiple tasks be triggered by the same event? What about the conjunction or disjunction of multiple events? I'm not sure we should go down this rabbit hole right now.

JeffBezanson (Member) commented Jul 18, 2017

Our existing Tasks can already be triggered by events, so we're already in the rabbit hole. We can't fully leave this up to applications; we need to make some default choice for people.

StefanKarpinski (Member) commented Jul 18, 2017

It seems like the default should probably be to have a sticky I/O thread since most applications don't need all of the threads. For really high performance situations where one wants to defer I/O until the I/O thread wakes up, we should probably have people opt into that.

kpamnany force-pushed the kp/partr branch from 28a825b to 8384384 on Jul 19, 2017

amitmurthy (Member) commented Jul 20, 2017

I googled a bit on integrating libuv and multithreading; see http://docs.libuv.org/en/v1.x/threading.html, http://docs.libuv.org/en/v1.x/async.html, and https://nikhilm.github.io/uvbook/threads.html#threads.

I would like to try out the following simple model in parallel with the work being done in this PR:

  1. The main Julia thread continues to handle all I/O and the event infrastructure.
  2. Provide an API to spawn a Julia 0-arg function on a thread selected from a threadpool. Let's call this a compute thread.
  3. The compute thread forwards all I/O and event-handling calls (sleep, Timers, notify, etc.) to the main Julia thread.
  4. I/O request forwarding from compute_thread -> main_thread is done via a multiple-writer, single-reader queue, with uv_async_send used to notify the event loop. All compute threads push their I/O requests onto this queue, which is processed by the main_thread running the event loop (see the sketch after this list).
  5. compute_threads are notified of I/O completion events via a regular system condition variable (uv_cond_t and uv_cond_signal), one condition variable per compute_thread.
  6. Julia code running on a compute_thread is therefore not a Julia Task in the regular sense; it is just Julia code running in a separate thread. All calls requiring libuv facilities are forwarded to the main_thread, and the compute_thread waits for their completion (on its condition variable).
  7. The fact that the Julia I/O API is designed as a blocking interface (while being fully event-driven and asynchronous under the hood) makes this model much easier to implement.

At the very least, it will help in getting a handle on libuv event-loop integration in a multi-threaded environment.
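
A rough sketch of steps 4-6 in plain libuv C, to make the queue/notify handshake concrete. The names here (io_request_t, on_notify, submit_and_wait) are hypothetical, not from this PR; uv_async_send is the one libuv call documented as safe to call from any thread, and the mutex, condition, and async handles are assumed to have been initialized at startup (uv_mutex_init, uv_cond_init, and uv_async_init(uv_default_loop(), &queue_notify, on_notify)). For brevity it uses one condition variable per request rather than per compute thread, and signals completion as soon as the request has been run rather than from a completion callback:

#include <uv.h>

typedef struct io_request {
    void (*run)(struct io_request *);   /* the libuv work, executed on the loop thread */
    uv_mutex_t done_lock;
    uv_cond_t done;
    int done_flag;
    struct io_request *next;
} io_request_t;

static uv_mutex_t queue_lock;           /* guards the multiple-writer, single-reader queue */
static io_request_t *queue_head;
static uv_async_t queue_notify;         /* wakes the main thread's event loop */

/* Main thread (uv_async callback): drain the queue and service each request. */
static void on_notify(uv_async_t *handle)
{
    (void)handle;
    uv_mutex_lock(&queue_lock);
    io_request_t *req = queue_head;
    queue_head = NULL;
    uv_mutex_unlock(&queue_lock);
    while (req) {
        io_request_t *next = req->next; /* read before the owner may free req */
        req->run(req);                  /* perform the libuv operation */
        uv_mutex_lock(&req->done_lock);
        req->done_flag = 1;
        uv_cond_signal(&req->done);     /* step 5: wake the waiting compute thread */
        uv_mutex_unlock(&req->done_lock);
        req = next;
    }
}

/* Compute thread: push a request (step 4) and block until it is serviced (step 6). */
static void submit_and_wait(io_request_t *req)
{
    req->done_flag = 0;
    uv_mutex_lock(&queue_lock);
    req->next = queue_head;             /* LIFO push; ordering doesn't matter for the sketch */
    queue_head = req;
    uv_mutex_unlock(&queue_lock);
    uv_async_send(&queue_notify);       /* thread-safe wakeup of the event loop */
    uv_mutex_lock(&req->done_lock);
    while (!req->done_flag)             /* predicate guards against spurious wakeups */
        uv_cond_wait(&req->done, &req->done_lock);
    uv_mutex_unlock(&req->done_lock);
}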

JeffBezanson (Member) commented Jul 20, 2017

Thanks @amitmurthy, that sounds basically good. I suspect this can work with normal Tasks, though. When a Task (running on any thread) wants to do I/O, it queues its request and yields. When the I/O completes, the requesting Task can be restarted as usual.

int last_arriver(arriver_t *, int);
void *reduce(arriver_t *, reducer_t *, void *(*rf)(void *, void *), void *, int);
#endif // JULIA_ENABLE_PARTR

JeffBezanson (Member) Jul 21, 2017

All of these should maybe be in a scheduler.h; they probably shouldn't be called by miscellaneous run time system code.

JeffBezanson (Member) commented Jul 28, 2017

@kpamnany jl_switchto(task) should work for starting and resuming a task. Fortunately we are already storing per-thread stack address information, so if all tasks are put in sticky queues this might work now.

kpamnany force-pushed the kp/partr branch from 0fa4042 to 095f5ce on Aug 2, 2017

kpamnany force-pushed the kp/partr branch from 095f5ce to 0d693ef on Aug 17, 2017

kpamnany force-pushed the kp/partr branch from 0d693ef to 06c7e2f on Sep 16, 2017

rveltz commented Sep 27, 2017

Any chance of making this into a package first?

anton-malakhov commented Oct 3, 2017

Hi folks! #iamintel here to help Kiran push multi-threading forward, as he has transitioned to other projects. He suggested that I work on the libuv-related parts while he finishes some others.

@amitmurthy, are you working on the approach you suggested on Jul 19th?
It looks good enough, though it is still vulnerable to the situation where the main thread gets blocked in compute-intensive user code, stalling all I/O and event handling for that time. Moving the event loop to a separate, dedicated I/O thread would work, but it would add overhead to the single-threaded case if requests must always be packed and sent to another thread.
Studying the Go runtime, I'd like to follow its flexibility in scheduling event polling onto whatever threads are available. Unfortunately, that is not possible with the current libuv implementation, but it is still possible to run multiple loops in parallel, as libuv recommends for multi-threading. libuv may also evolve to support multithreading better, in which case an implementation that explicitly communicates with a single uv_loop would need deep refactoring all over again. Thus, I'd suggest starting with a uv_loop per thread and, as a next step, implementing a 'loop stealing/syncing' mechanism that lets a request migrate from its originating thread to another by stealing or mailing the whole uv_loop instance or the underlying handles. Do you see any issues with this design? A sketch of the per-thread-loop starting point follows.
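
For concreteness, a minimal sketch of the loop-per-thread starting point, using only documented libuv calls (uv_thread_create, uv_loop_init, uv_run); this is purely illustrative, not code from this PR:

#include <uv.h>

#define NUM_WORKERS 4

/* Each worker owns a private uv_loop_t, as libuv recommends for multi-threaded
   use; handles would be created and used only on the owning thread's loop. */
static void worker(void *arg)
{
    (void)arg;
    uv_loop_t loop;
    uv_loop_init(&loop);
    /* ... register this thread's handles/timers/requests on `loop` ... */
    uv_run(&loop, UV_RUN_DEFAULT);   /* returns once no active handles remain */
    uv_loop_close(&loop);
}

int main(void)
{
    uv_thread_t threads[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        uv_thread_create(&threads[i], worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        uv_thread_join(&threads[i]);
    return 0;
}

Since libuv handles are bound to the loop they were created on and libuv objects are not thread-safe across threads, the 'stealing/syncing' step would have to migrate a whole uv_loop (or mail its handles) between threads rather than share one loop concurrently.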
Of course, if you have already implemented your idea, I can work on something more important for the next release. We can also continue this on Slack's #parallel.

ViralBShah (Member) commented Oct 3, 2017

@amitmurthy is off the grid for the next couple of weeks.

src/task.c Outdated
jl_condition_type = (jl_datatype_t*)
    jl_new_datatype(jl_symbol("Condition"), NULL, jl_any_type, jl_emptysvec,
                    jl_perm_symsvec(3, "head", "lock_owner", "lock_count"),
                    jl_svec(3, jl_task_type, jl_int64_type, jl_int32_type),

c42f (Contributor) Oct 20, 2018

Oop, I think the same bug exists here with the jl_int64_type, given that jl_condition_t is really a _jl_taskq_t on the C side.

c42f (Contributor) Oct 20, 2018

Well, I just went ahead and fixed this directly to see what would happen in CI. Hope that's ok!

kpamnany (Author, Contributor) Oct 21, 2018

👍

{
    heap_p = heap_c * jl_n_threads;
    heaps = (taskheap_t *)calloc(heap_p, sizeof(taskheap_t));
    for (int16_t i = 0; i < heap_p; ++i) {

c42f (Contributor) Oct 20, 2018

Why the int16_t for the iterator here? Perhaps a native width int (or size_t) would make more sense?

Same comment applies to various iterators within partr.c - some are int16_t, some int, there's even an int64_t.

kpamnany (Author, Contributor) Oct 21, 2018

I think this is because we'd chosen int16_t for thread IDs. So some of these are int16_ts, and the iterators use the same type to avoid warnings.

c42f (Contributor) Oct 21, 2018

Right. I think it should mainly be necessary to match signed vs. unsigned in comparisons (not so much bit width) to avoid surprises and related warnings arising from the likes of:

#include <stdio.h>
int main() {
    if (-1 > (size_t)0)        /* -1 converts to a huge unsigned value here, so this is true */
        printf("Surprise!\n");
    return 0;
}

I mainly noticed this because I was having a read through for where 32 and 64 bit versions might differ.

kpamnany (Contributor, Author) commented Oct 22, 2018

Looks like the generate_precompile step now passes fine. 🎆

Now, all platforms seem to get to the same point:

┌ Warning: TerminalMenus: Unable to enter raw mode: ArgumentError("stream is closed or unusable")
└ @ REPL.TerminalMenus /usr/home/julia/julia-fbsd-buildbot/worker/freebsdci/build/usr/share/julia/stdlib/v1.1/REPL/src/TerminalMenus/util.jl:21

Running ./julia test/runtests.jl REPL interactively (on OS X) gives me:

REPL: Test Failed at /Users/kpamnany/julia/usr/share/julia/stdlib/v1.1/REPL/test/replcompletions.jl:107
  Expression: count(isequal("REPL"), c) == 1
   Evaluated: 2 == 1                 

Which seems unrelated? 😕

KristofferC (Contributor) commented Oct 22, 2018

┌ Warning: TerminalMenus: Unable to enter raw mode: ArgumentError("stream is closed or unusable")

This is a warning that is always printed (we should fix that), but it is unrelated to this PR. It also doesn't fail the tests.

The actual problem is

No output has been received in the last 30m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

which means that one of the tests likely got stuck.

The testing framework itself uses tasks, so could it be that we are stuck at the end of this block (after all parallel tests are done)?

julia/test/runtests.jl

Lines 135 to 173 in c162219

@sync begin
    for p in workers()
        @async begin
            push!(all_tasks, current_task())
            while length(tests) > 0
                test = popfirst!(tests)
                local resp
                wrkr = p
                try
                    resp = remotecall_fetch(runtests, wrkr, test, test_path(test); seed=seed)
                catch e
                    isa(e, InterruptException) && return
                    resp = [e]
                end
                push!(results, (test, resp))
                if resp[1] isa Exception
                    if exit_on_error
                        skipped = length(tests)
                        empty!(tests)
                    end
                elseif resp[end] > max_worker_rss
                    if n > 1
                        rmprocs(wrkr, waitfor=30)
                        p = addprocs_with_testenv(1)[1]
                        remotecall_fetch(include, p, "testdefs.jl")
                    else # single process testing
                        error("Halting tests. Memory limit reached : $resp > $max_worker_rss")
                    end
                end
                !isa(resp[1], Exception) && print_testworker_stats(test, wrkr, resp)
            end
            if p != 1
                # Free up memory =)
                rmprocs(p, waitfor=30)
            end
        end
    end
end

c42f (Contributor) commented Oct 24, 2018

Shall I merge #29791 in here?

kpamnany (Contributor, Author) commented Oct 25, 2018

Looks like a big patch, and this is a big patch too with some time pressure to merge. @JeffBezanson can make the call.

Keno (Member) commented Oct 25, 2018

@kpamnany note that this branch has conflicts with master. @c42f's PR resolves those conflicts.

kpamnany (Contributor, Author) commented Oct 25, 2018

Ah, I see now: it is already on master, so it has to be merged. It'd be best if we could get runtests to complete successfully on this branch before adding new code into the mix, though. Unless the new code can help?

c42f (Contributor) commented Oct 25, 2018

The new code is likely to help in the circumstance that you're losing stack traces of exceptions thrown in tasks due to context switching. Other than that, it only resolves the conflicts with master.

JeffBezanson (Member) commented Oct 30, 2018

OK, I think I found the next problem: wait/isready/n_avail depend on the length of c.putters, but the partr code doesn't use that array any more. I will change them to use the state of cond_put.

JeffBezanson (Member) commented Oct 30, 2018

Now other tests pass but there is a mystery crash in the embedding test. 😡

src/task.c Outdated
static void record_backtrace(jl_ptls_t ptls) JL_NOTSAFEPOINT
{
    // storing bt_size in ptls ensures roots in bt_data will be found
    ptls->bt_size = rec_backtrace(ptls->bt_data, JL_MAX_BT_SIZE);

c42f (Contributor) Oct 31, 2018

I agree this is nice to have factored out (especially for having a place to put the note about rooting). I only removed it for symmetry with the equivalent code which had removed it in partr.c. We should probably just call record_backtrace there as well.

JeffBezanson (Member) Oct 31, 2018

Either is fine, I was just minimizing the diff for now.

kpamnany (Contributor, Author) commented Oct 31, 2018

Such good progress! The Win* failures are mystifying -- I see a lot of fatal: cannot change to '/cygdrive/c/projects/julia': No such file or directory messages?

We're still going to turn JULIA_PARTR off before merge, I would think?

kcajf (Contributor) commented Dec 5, 2018

I don't understand all the complexity and current state of this PR, so please forgive me for sounding impatient or pushy, but I was just wondering what the chances are of seeing some part of this released in the upcoming 1.1? Are there any easy tasks that someone (like me) who isn't very familiar with the language-implementation side of Julia could work on, to help push things along?

StefanKarpinski (Member) commented Dec 5, 2018

The remaining blocker is unresolved bugs, mainly on Windows. So if you have a Windows system (or anything else, actually), you could clone this branch, build it, and run all the tests. If there are crashes, try to debug them. I don't expect that will be particularly easy, but help is welcome.

Note that you probably also need to be on #30186 or one of the other Channel API revision branches. I'm a bit unclear on which one to be on at this point.

JeffBezanson (Member) commented Dec 5, 2018

This branch still ought to work on its own, at least with 1 thread.

kpamnany and others added some commits Jun 27, 2017

threading: Integrating partr (almost done!)
Added partr code. Abstracted interface to threading infrastructure.

JeffBezanson force-pushed the kp/partr branch from 7a6d4ba to 8714c98 on Dec 5, 2018

JeffBezanson (Member) commented Dec 6, 2018

win32: generate_precompile hanging
win64: a node 1 test is hanging, either precompile, SharedArrays, or Distributed
