
WIP: parallel task runtime #22631

Open: wants to merge 4 commits into master from kp/partr
Conversation

kpamnany (Contributor) commented Jun 30, 2017

This replaces the existing fork-join threading infrastructure with a parallel task runtime (partr) that implements parallel depth first scheduling. This model fully supports nested parallelism.

The default remains the original threading code. Enable partr by setting JULIA_PARTR := 1 in your Make.user.
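
For example, a minimal Make.user containing just this setting:

JULIA_PARTR := 1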

The core idea is simple -- Julia tasks can now be run by any thread. The task scheduler attempts to order task execution depth-first for provably better cache efficiency, and for true nested parallelism.

However, as tasks are an existing feature in Julia and are used in a number of places, this PR first introduces the infrastructure that will enable parallel tasks, while (hopefully) preserving the serial semantics of the existing task interface. This PR does not introduce any new interface calls for parallel tasks -- those will come in future PRs.

All test cases pass with JULIA_PARTR off (as they should). With JULIA_PARTR on, all test cases are currently passing on Linux and OS X.

Cc: @JeffBezanson, @vtjnash, @yuyichao, @ViralBShah, @vchuravy, @anton-malakhov.

jtravs (Contributor) commented Jun 30, 2017

Any chance you could give a very simple example of what the interface would look like in user code?

kpamnany (Contributor, Author) commented Jun 30, 2017

The Julia interface is not designed yet. While elaborations are possible, the essence of the interface is similar to Cilk's, so something like:

t1 = @spawn foo(1, 2) # foo(1, 2) will run asynchronously, possibly in another thread
res1 = @sync t1 # res1 will get the return value of foo(1, 2)
t2 = @parfor (+) i = 1:10 # iterations may run in parallel
  i-1
end
res2 = @sync t2 # res2 = 45
tknopp (Contributor) commented Jun 30, 2017

@kpamnany: Does this mean I can have a background thread/task running in parallel with the foreground thread? In other words: can I run a thread asynchronously?

kpamnany (Contributor, Author) commented Jun 30, 2017

@tknopp: yes, that's what spawn will do. If you spawn a task and there's more than one thread, it will start running right away. It will continue to run until it reaches a yield point (another spawn, a sync, a parfor, or an explicit yield).

while (jl_atomic_load_acquire(&tiarg->state) == TI_THREAD_INIT)
    jl_cpu_pause();  // spin-wait until the new thread finishes initializing

// Assuming the functions called below doesn't contain unprotected GC

Sacha0 (Member) Jun 30, 2017

"doesn't" -> "don't"?

STATIC_INLINE uint64_t cong(uint64_t max, uint64_t unbias, uint64_t *seed)
{
    // LCG step; loop rejects draws above `unbias` (rejection sampling to avoid modulo bias)
    while ((*seed = 69069 * (*seed) + 362437) > unbias)
        ;

Sacha0 (Member) Jun 30, 2017

Semicolon on separate line as a ghost of the empty loop body or unintentional extra space?

src/partr.c Outdated

init_started_thread();

// Assuming the functions called below doesn't contain unprotected GC

Sacha0 (Member) Jun 30, 2017

"doesn't" -> "don't" here as well?

ViralBShah (Member) commented Jul 2, 2017

Seems worthwhile to experiment with the Projects feature for this one.

davidanthoff (Contributor) commented Jul 5, 2017

Just out of curiosity, is this something that might make it into 1.0?

ViralBShah (Member) commented Jul 10, 2017

We will try to get it into 1.0 if it is ready before the feature freeze, but we will not hold up 1.0 for it. I am personally hopeful that it will be ready by 1.0. Hope that helps.

StefanKarpinski (Member) commented Jul 11, 2017

If it can't make it into 1.0 in complete form, we can at least include it in experimental form and try our damnedest to leave room for it in the 1.x series – we really don't want to have to wait until 2.0 for full-on threading support.

amitmurthy (Member) commented Jul 18, 2017

@kpamnany, will the spawned tasks be able to perform asynchronous I/O via the libuv event loop? Is the plan to have the main Julia thread run the event loop and to run all compute-only tasks on separate threads?

Currently, in the Distributed module, incoming requests are executed in separate tasks, and each invocation can also make additional remote calls. This means any I/O is blocked while a worker is busy with a computation. Will it be possible for incoming requests to be executed via a threadpool, with any I/O calls internally routed to the main Julia thread running the event loop?

kpamnany (Contributor, Author) commented Jul 18, 2017

@amitmurthy: who runs the event loop (and how) is a good question. As a general statement, irrespective of Julia, unless you reserve a thread for I/O, requests can be serviced late or very slowly. But you don't always have, or want to spare, a thread to reserve. Ideally, this should be the program's choice.

Executing tasks are not preempted. The API entry points (spawn, sync, and parfor) may cause the calling task to yield. However, this runtime allows for sticky tasks, i.e. tasks that only run on the thread that started them. Sticky tasks do not yield in spawn and parfor. So, you can create a sticky I/O task and drive the event loop from it. It's pretty straightforward to allow tasks to perform asynchronous I/O requests, but it isn't obvious how to get completion notifications. I'm not entirely sure how to do this right now but @vtjnash and @JeffBezanson have probably thought this out in greater detail (they suggested sticky tasks).

Clearly it would be a useful enhancement to this runtime to add the ability to trigger a task based on an event. But that gets us into having to define events and decide semantics for event mux/demux, and that opens many questions -- are there system events? Can multiple tasks be triggered by the same event? What about the conjunction or disjunction of multiple events? I'm not sure we should go down this rabbit hole right now.

JeffBezanson (Member) commented Jul 18, 2017

Our existing Tasks can already be triggered by events, so we're already in the rabbit hole. We can't fully leave this up to applications; we need to make some default choice for people.

StefanKarpinski (Member) commented Jul 18, 2017

It seems like the default should probably be to have a sticky I/O thread since most applications don't need all of the threads. For really high performance situations where one wants to defer I/O until the I/O thread wakes up, we should probably have people opt into that.

kpamnany force-pushed the kp/partr branch from 28a825b to 8384384 on Jul 19, 2017

amitmurthy (Member) commented Jul 20, 2017

I googled a bit on integrating libuv and multithreading; see http://docs.libuv.org/en/v1.x/threading.html, http://docs.libuv.org/en/v1.x/async.html, and https://nikhilm.github.io/uvbook/threads.html#threads.

I would like to try out the following simple model in parallel with the work being done in this PR:

  1. The main Julia thread continues to handle all I/O and the event infrastructure.
  2. Provide an API to spawn a Julia 0-arg function on a thread selected from a threadpool. Let's call this a compute thread.
  3. The compute thread forwards all I/O and event-handling calls (sleep, Timers, notify, etc.) to the main Julia thread.
  4. I/O request forwarding from compute_thread -> main_thread is done via a multiple-writer, single-reader queue, with uv_async_send used to notify the event loop. All compute threads push their I/O requests onto this queue, which is processed by the main_thread running the event loop (see the sketch after this list).
  5. compute_threads are notified of I/O completion events via a regular system condition variable (uv_cond_t and uv_cond_signal), one condition variable per compute_thread.
  6. Julia code running on a compute_thread is therefore not a Julia Task in the regular sense; it is just Julia code running in a separate thread. All calls requiring libuv facilities are forwarded to the main_thread, and the compute_thread waits for their completion (on its condition variable).
  7. The fact that the Julia I/O API is designed as a blocking interface (while being fully event-driven and asynchronous under the hood) makes this model much easier to implement.

At the very least, it will help in getting a handle on libuv event-loop integration in a multi-threaded environment.
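
A rough sketch of steps 4-6 in plain libuv C, to make the queue/notify handshake concrete. The names here (io_request_t, on_notify, submit_and_wait) are hypothetical, not from this PR; uv_async_send is the one libuv call documented as safe to call from any thread, and the mutex, condition, and async handles are assumed to have been initialized at startup (uv_mutex_init, uv_cond_init, and uv_async_init(uv_default_loop(), &queue_notify, on_notify)). For brevity it uses one condition variable per request rather than per compute thread, and signals completion as soon as the request has been run rather than from a completion callback:

#include <uv.h>

typedef struct io_request {
    void (*run)(struct io_request *);   /* the libuv work, executed on the loop thread */
    uv_mutex_t done_lock;
    uv_cond_t done;
    int done_flag;
    struct io_request *next;
} io_request_t;

static uv_mutex_t queue_lock;           /* guards the multiple-writer, single-reader queue */
static io_request_t *queue_head;
static uv_async_t queue_notify;         /* wakes the main thread's event loop */

/* Main thread (uv_async callback): drain the queue and service each request. */
static void on_notify(uv_async_t *handle)
{
    (void)handle;
    uv_mutex_lock(&queue_lock);
    io_request_t *req = queue_head;
    queue_head = NULL;
    uv_mutex_unlock(&queue_lock);
    while (req) {
        io_request_t *next = req->next; /* read before the owner may free req */
        req->run(req);                  /* perform the libuv operation */
        uv_mutex_lock(&req->done_lock);
        req->done_flag = 1;
        uv_cond_signal(&req->done);     /* step 5: wake the waiting compute thread */
        uv_mutex_unlock(&req->done_lock);
        req = next;
    }
}

/* Compute thread: push a request (step 4) and block until it is serviced (step 6). */
static void submit_and_wait(io_request_t *req)
{
    req->done_flag = 0;
    uv_mutex_lock(&queue_lock);
    req->next = queue_head;             /* LIFO push; ordering doesn't matter for the sketch */
    queue_head = req;
    uv_mutex_unlock(&queue_lock);
    uv_async_send(&queue_notify);       /* thread-safe wakeup of the event loop */
    uv_mutex_lock(&req->done_lock);
    while (!req->done_flag)             /* predicate guards against spurious wakeups */
        uv_cond_wait(&req->done, &req->done_lock);
    uv_mutex_unlock(&req->done_lock);
}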

JeffBezanson (Member) commented Jul 20, 2017

Thanks @amitmurthy, that sounds basically good. I suspect this can work with normal Tasks, though. When a Task (running on any thread) wants to do I/O, it queues its request and yields. When the I/O completes, the requesting Task can be restarted as usual.

int last_arriver(arriver_t *, int);
void *reduce(arriver_t *, reducer_t *, void *(*rf)(void *, void *), void *, int);
#endif // JULIA_ENABLE_PARTR

JeffBezanson (Member) Jul 21, 2017

All of these should maybe be in a scheduler.h; they probably shouldn't be called by miscellaneous run time system code.

JeffBezanson (Member) commented Jul 28, 2017

@kpamnany jl_switchto(task) should work for starting and resuming a task. Fortunately we are already storing per-thread stack address information, so if all tasks are put in sticky queues this might work now.

kpamnany force-pushed the kp/partr branch from 0fa4042 to 095f5ce on Aug 2, 2017

kpamnany force-pushed the kp/partr branch from 095f5ce to 0d693ef on Aug 17, 2017

kpamnany force-pushed the kp/partr branch from 0d693ef to 06c7e2f on Sep 16, 2017

rveltz commented Sep 27, 2017

Any chance of making this into a package first?

anton-malakhov commented Oct 3, 2017

Hi folks! #iamintel here to help Kiran push multi-threading forward, as he has transitioned to other projects. He suggested that I work on the libuv-related parts while he finishes some others.

@amitmurthy, are you working on the approach you suggested on Jul 19th?
It looks good enough, though it is still vulnerable to the situation where the main thread gets blocked in compute-intensive user code, stalling all I/O and event handling for that time. Moving the event loop to a separate, dedicated I/O thread would work, but it would add overhead to the single-threaded case if requests must always be packed and sent to another thread.
Studying the Go runtime, I'd like to follow its flexibility in scheduling event polling onto whatever threads are available. Unfortunately, that is not possible with the current libuv implementation, but it is still possible to run multiple loops in parallel, as libuv recommends for multi-threading. libuv may also evolve to support multithreading better, in which case an implementation that explicitly communicates with a single uv_loop would need deep refactoring all over again. Thus, I'd suggest starting with a uv_loop per thread and, as a next step, implementing a 'loop stealing/syncing' mechanism that lets a request migrate from its originating thread to another by stealing or mailing the whole uv_loop instance or the underlying handles. Do you see any issues with this design? A sketch of the per-thread-loop starting point follows.
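
For concreteness, a minimal sketch of the loop-per-thread starting point, using only documented libuv calls (uv_thread_create, uv_loop_init, uv_run); this is purely illustrative, not code from this PR:

#include <uv.h>

#define NUM_WORKERS 4

/* Each worker owns a private uv_loop_t, as libuv recommends for multi-threaded
   use; handles would be created and used only on the owning thread's loop. */
static void worker(void *arg)
{
    (void)arg;
    uv_loop_t loop;
    uv_loop_init(&loop);
    /* ... register this thread's handles/timers/requests on `loop` ... */
    uv_run(&loop, UV_RUN_DEFAULT);   /* returns once no active handles remain */
    uv_loop_close(&loop);
}

int main(void)
{
    uv_thread_t threads[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        uv_thread_create(&threads[i], worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        uv_thread_join(&threads[i]);
    return 0;
}

Since libuv handles are bound to the loop they were created on and libuv objects are not thread-safe across threads, the 'stealing/syncing' step would have to migrate a whole uv_loop (or mail its handles) between threads rather than share one loop concurrently.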
Of course, if you have already implemented your idea, I can work on something more important for the next release. We can also continue this on Slack's #parallel.

ViralBShah (Member) commented Oct 3, 2017

@amitmurthy is off the grid for the next couple of weeks.

src/task.c Outdated
jl_condition_type = (jl_datatype_t*)
    jl_new_datatype(jl_symbol("Condition"), NULL, jl_any_type, jl_emptysvec,
                    jl_perm_symsvec(3, "head", "lock_owner", "lock_count"),
                    jl_svec(3, jl_task_type, jl_int64_type, jl_int32_type),

c42f (Contributor) Oct 20, 2018

Oop, I think the same bug exists here with the jl_int64_type, given that jl_condition_t is really a _jl_taskq_t on the C side.

c42f (Contributor) Oct 20, 2018

Well, I just went ahead and fixed this directly to see what would happen in CI. Hope that's ok!

kpamnany (Author, Contributor) Oct 21, 2018

👍

{
    heap_p = heap_c * jl_n_threads;
    heaps = (taskheap_t *)calloc(heap_p, sizeof(taskheap_t));
    for (int16_t i = 0; i < heap_p; ++i) {

c42f (Contributor) Oct 20, 2018

Why the int16_t for the iterator here? Perhaps a native width int (or size_t) would make more sense?

Same comment applies to various iterators within partr.c - some are int16_t, some int, there's even an int64_t.

kpamnany (Author, Contributor) Oct 21, 2018

I think this is because we'd chosen int16_t for thread IDs. So some of these are int16_ts, and the iterators use the same type to avoid warnings.

c42f (Contributor) Oct 21, 2018

Right. I think it should mainly be necessary to match signed vs. unsigned in comparisons (not so much bit width) to avoid surprises and related warnings arising from the likes of:

#include <stdio.h>
int main() {
    if (-1 > (size_t)0)        /* -1 converts to a huge unsigned value here, so this is true */
        printf("Surprise!\n");
    return 0;
}

I mainly noticed this because I was having a read through for where 32 and 64 bit versions might differ.

kpamnany (Contributor, Author) commented Oct 22, 2018

Looks like the generate_precompile step now passes fine. 🎆

Now, all platforms seem to get to the same point:

┌ Warning: TerminalMenus: Unable to enter raw mode: ArgumentError("stream is closed or unusable")
└ @ REPL.TerminalMenus /usr/home/julia/julia-fbsd-buildbot/worker/freebsdci/build/usr/share/julia/stdlib/v1.1/REPL/src/TerminalMenus/util.jl:21

Running ./julia test/runtests.jl REPL interactively (on OS X) gives me:

REPL: Test Failed at /Users/kpamnany/julia/usr/share/julia/stdlib/v1.1/REPL/test/replcompletions.jl:107
  Expression: count(isequal("REPL"), c) == 1
   Evaluated: 2 == 1                 

Which seems unrelated? 😕

KristofferC (Contributor) commented Oct 22, 2018

┌ Warning: TerminalMenus: Unable to enter raw mode: ArgumentError("stream is closed or unusable")

This is a warning that is always printed (we should fix that), but it is unrelated to this PR. It also doesn't fail the tests.

The actual problem is

No output has been received in the last 30m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

which means that one of the tests likely got stuck.

The testing framework itself uses tasks, so could it be that we are stuck at the end of this block (after all parallel tests are done)?

julia/test/runtests.jl

Lines 135 to 173 in c162219

@sync begin
    for p in workers()
        @async begin
            push!(all_tasks, current_task())
            while length(tests) > 0
                test = popfirst!(tests)
                local resp
                wrkr = p
                try
                    resp = remotecall_fetch(runtests, wrkr, test, test_path(test); seed=seed)
                catch e
                    isa(e, InterruptException) && return
                    resp = [e]
                end
                push!(results, (test, resp))
                if resp[1] isa Exception
                    if exit_on_error
                        skipped = length(tests)
                        empty!(tests)
                    end
                elseif resp[end] > max_worker_rss
                    if n > 1
                        rmprocs(wrkr, waitfor=30)
                        p = addprocs_with_testenv(1)[1]
                        remotecall_fetch(include, p, "testdefs.jl")
                    else # single process testing
                        error("Halting tests. Memory limit reached : $resp > $max_worker_rss")
                    end
                end
                !isa(resp[1], Exception) && print_testworker_stats(test, wrkr, resp)
            end
            if p != 1
                # Free up memory =)
                rmprocs(p, waitfor=30)
            end
        end
    end
end

c42f (Contributor) commented Oct 24, 2018

Shall I merge #29791 in here?

kpamnany (Contributor, Author) commented Oct 25, 2018

Looks like a big patch, and this is a big patch too with some time pressure to merge. @JeffBezanson can make the call.

Keno (Member) commented Oct 25, 2018

@kpamnany note that this branch has conflicts with master. @c42f's PR resolves those conflicts.

kpamnany (Contributor, Author) commented Oct 25, 2018

Ah, I see now: it is already on master, so it has to be merged. It'd be best if we could get runtests to complete successfully on this branch before adding new code into the mix, though. Unless the new code can help?

c42f (Contributor) commented Oct 25, 2018

The new code is likely to help in the circumstance that you're losing stack traces of exceptions thrown in tasks due to context switching. Other than that, it only resolves the conflicts with master.

JeffBezanson (Member) commented Oct 30, 2018

OK, I think I found the next problem: wait/isready/n_avail depend on the length of c.putters, but the partr code doesn't use that array any more. I will change them to use the state of cond_put.

JeffBezanson (Member) commented Oct 30, 2018

Now other tests pass but there is a mystery crash in the embedding test. 😡

src/task.c Outdated
static void record_backtrace(jl_ptls_t ptls) JL_NOTSAFEPOINT
{
    // storing bt_size in ptls ensures roots in bt_data will be found
    ptls->bt_size = rec_backtrace(ptls->bt_data, JL_MAX_BT_SIZE);

c42f (Contributor) Oct 31, 2018

I agree this is nice to have factored out (especially for having a place to put the note about rooting). I only removed it for symmetry with the equivalent code which had removed it in partr.c. We should probably just call record_backtrace there as well.

JeffBezanson (Member) Oct 31, 2018

Either is fine, I was just minimizing the diff for now.

kpamnany (Contributor, Author) commented Oct 31, 2018

Such good progress! The Win* failures are mystifying -- I see a lot of fatal: cannot change to '/cygdrive/c/projects/julia': No such file or directory messages?

We're still going to turn JULIA_PARTR off before merge, I would think?

kcajf (Contributor) commented Dec 5, 2018

I don't understand all the complexity and current state of this PR, so please forgive me for sounding impatient or pushy, but I was just wondering what the chances are of seeing some part of this released in the upcoming 1.1? Are there any easy tasks that someone (like me) who isn't very familiar with the language-implementation side of Julia could work on, to help push things along?

StefanKarpinski (Member) commented Dec 5, 2018

The remaining blocker is unresolved bugs, mainly on Windows. So if you have a Windows system (or anything else, actually), you could clone this branch, build it, and run all the tests. If there are crashes, try to debug them. I don't expect that will be particularly easy, but help is welcome.

Note that you probably also need to be on #30186 or one of the other Channel API revision branches. I'm a bit unclear on which one to be on at this point.

JeffBezanson (Member) commented Dec 5, 2018

This branch still ought to work on its own, at least with 1 thread.

kpamnany and others added some commits Jun 27, 2017

threading: Integrating partr (almost done!)
Added partr code. Abstracted interface to threading infrastructure.

JeffBezanson force-pushed the kp/partr branch from 7a6d4ba to 8714c98 on Dec 5, 2018

JeffBezanson (Member) commented Dec 6, 2018

win32: generate_precompile hanging
win64: a node 1 test is hanging, either precompile, SharedArrays, or Distributed
