
Add fault tolerance to scheduler #105

Merged
jpsamaroo merged 3 commits into JuliaParallel:master from jpsamaroo:jps/fault-tolerance on Jul 6, 2019

Conversation

jpsamaroo
Member

@jpsamaroo jpsamaroo commented Apr 23, 2019

When a worker is killed due to OS signals (such as the Linux OOM killer), a ProcessExitedException is likely to be thrown to the head node. When such an exception is detected, we reschedule the failed thunk on a new, non-dead worker (randomly for now).

Todo:

  • Reschedule dependent thunks that were cached on the dead worker
  • Recursively reschedule all dependent inputs of dead thunks if not cached on a live worker
  • Fix various KeyErrors due to not getting state just right...
  • Add some tests
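
For illustration, here is a minimal sketch of the recovery idea described above (not the actual scheduler code; `run_with_retry` is a hypothetical helper): if running work on a worker throws a `ProcessExitedException`, pick another live worker and try again there.

```julia
using Distributed

# Hypothetical helper, not part of Dagger: retry `f(args...)` on a randomly
# chosen live worker whenever the current worker dies underneath us.
function run_with_retry(f, args...)
    while true
        pid = rand(workers())   # pick any currently-alive worker
        try
            return remotecall_fetch(f, pid, args...)
        catch err
            # Only handle unexpected worker death; rethrow everything else.
            err isa ProcessExitedException || rethrow()
            @warn "Worker $pid died; rescheduling" exception=err
        end
    end
end
```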

@jpsamaroo jpsamaroo force-pushed the jps/fault-tolerance branch 2 times, most recently from 79242e2 to f801242 Compare May 1, 2019 01:09
@jpsamaroo jpsamaroo force-pushed the jps/fault-tolerance branch 3 times, most recently from f114ee6 to 672efe8 Compare May 25, 2019 00:37
@jpsamaroo
Member Author

@shashi @aviks please excuse the @debug statements, but are you otherwise happy with the (locally passing) tests I've added to test/fault-tolerance.jl? I won't claim that it can recover from every possible DAG and worker-failure scenario, but it seems to work correctly for some common patterns.
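
For context, the tests follow roughly this shape (a hypothetical sketch, not the actual contents of test/fault-tolerance.jl): build a small DAG, kill a worker while it is running, and check that the final result still comes out correct.

```julia
using Distributed, Test
addprocs(2)
@everywhere using Dagger

@testset "recover from a dying worker" begin
    a = Dagger.delayed(x -> (sleep(2); x + 1))(1)
    b = Dagger.delayed(x -> x * 2)(a)
    # Ask one worker to exit shortly after the computation starts,
    # roughly simulating an OOM kill.
    victim = last(workers())
    @async (sleep(0.5); remote_do(exit, victim))
    @test collect(b) == 4
end
```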

This should also have no effect on normal (non-failure) situations, since the fault handling code only triggers when a worker dies unexpectedly.

@jpsamaroo
Member Author

Also, I will be moving most of this code out into its own file so that I don't make scheduler.jl hard to read.

@jpsamaroo
Member Author

bump

@jpsamaroo
Member Author

I also think it would be good to add an option to the scheduler to disable this new recovery functionality, in cases where it is undesired. I don't want to mess with this PR any more than I need to, so I will be doing that in a new PR soon (which will also expose some other unrelated options for the scheduler).

@shashi @aviks @tanmaykm any chance one of you can review this PR and let me know if there is anything else desired before merge?

@jpsamaroo jpsamaroo changed the title [WIP] Add fault tolerance to scheduler Add fault tolerance to scheduler Jun 6, 2019
@jpsamaroo
Member Author

I'm going to assume the single Travis failure on nightly is not my fault 🙂

@versipellis

I tested this with the dataset that was giving me trouble on JuliaDB, which spawned the initial conversations around worker fault tolerance with @jpsamaroo, and can verify that it resolves the Linux OOM killer issue. There does still seem to be a memory leak in either JuliaDB or Dagger (unrelated to the fault tolerance) where roughly 2 GB of memory remains allocated after the processes have completed.

@shashi
Collaborator

shashi commented Jun 6, 2019

Going to review it tonight! This is a lot of code for this package haha.

@jpsamaroo
Member Author

Hi @shashi, did you get the chance to review this PR? I'm happy to answer any questions you may have about how it works, since not everything I included is obvious (or possibly even necessary) 🙂

@jpsamaroo
Member Author

@shashi @andreasnoack if there are no objections, I'm going to merge this.

@coveralls

Coverage Status

Coverage increased (+2.1%) to 54.15% when pulling 59e57bb on jpsamaroo:jps/fault-tolerance into 777b002 on JuliaParallel:master.

@jpsamaroo jpsamaroo merged commit 309b9d9 into JuliaParallel:master Jul 6, 2019
@jpsamaroo
Member Author

Thanks!

@aviks

aviks commented Jul 7, 2019

Thanks for the work here, @jpsamaroo. Do the JuliaDB tests pass on this code? Would you please verify that before tagging a release here?

@jpsamaroo
Member Author

I had the exact same thought 🙂 I'm running the tests locally as we speak. Are there any other packages that rely on Dagger that I should test? Also, maybe we should set up some reverse-dependency testing in CI, since JuliaDB is a pretty important consumer of Dagger. A sketch of what such a step could run is below.
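
A reverse-dependency step in CI could be as simple as the following snippet (a hypothetical sketch, not an existing script in this repo): develop the local Dagger checkout, then run JuliaDB's test suite against it.

```julia
# Hypothetical CI step: test JuliaDB against the current Dagger checkout.
using Pkg
Pkg.develop(PackageSpec(path=pwd()))                # this Dagger checkout
Pkg.add(PackageSpec(name="JuliaDB", rev="master"))  # JuliaDB master
Pkg.test("JuliaDB")
```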

@jpsamaroo
Member Author

JuliaDB master passes with Dagger master, against both the latest MemPool release and MemPool master.
