
Add fault tolerance to scheduler #105

Merged
jpsamaroo merged 3 commits into JuliaParallel:master from jpsamaroo:jps/fault-tolerance on Jul 6, 2019

Conversation

jpsamaroo
Member

@jpsamaroo jpsamaroo commented Apr 23, 2019

When a worker is killed due to OS signals (such as the Linux OOM killer), a ProcessExitedException is likely to be thrown to the head node. When such an exception is detected, we reschedule the failed thunk on a new, non-dead worker (randomly for now).

Todo:

  • Reschedule dependent thunks that were cached on the dead worker
  • Recursively reschedule all dependent inputs of dead thunks if not cached on a live worker
  • Fix various KeyErrors due to not getting state just right...
  • Add some tests
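
For illustration, here is a minimal sketch of the recovery idea described above (not the actual scheduler code; `run_with_retry` is a hypothetical helper): if running work on a worker throws a `ProcessExitedException`, pick another live worker and try again there.

```julia
using Distributed

# Hypothetical helper, not part of Dagger: retry `f(args...)` on a randomly
# chosen live worker whenever the current worker dies underneath us.
function run_with_retry(f, args...)
    while true
        pid = rand(workers())   # pick any currently-alive worker
        try
            return remotecall_fetch(f, pid, args...)
        catch err
            # Only handle unexpected worker death; rethrow everything else.
            err isa ProcessExitedException || rethrow()
            @warn "Worker $pid died; rescheduling" exception=err
        end
    end
end
```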

@jpsamaroo jpsamaroo force-pushed the jps/fault-tolerance branch 2 times, most recently from 79242e2 to f801242 Compare May 1, 2019 01:09
@jpsamaroo jpsamaroo force-pushed the jps/fault-tolerance branch 3 times, most recently from f114ee6 to 672efe8 Compare May 25, 2019 00:37
@jpsamaroo
Member Author

@shashi @aviks please excuse the @debug statements, but are you otherwise happy with the (locally passing) tests I've added to test/fault-tolerance.jl? I won't claim that it can recover from every possible DAG and worker-failure scenario, but it seems to work correctly for some common patterns.
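
For context, the tests follow roughly this shape (a hypothetical sketch, not the actual contents of test/fault-tolerance.jl): build a small DAG, kill a worker while it is running, and check that the final result still comes out correct.

```julia
using Distributed, Test
addprocs(2)
@everywhere using Dagger

@testset "recover from a dying worker" begin
    a = Dagger.delayed(x -> (sleep(2); x + 1))(1)
    b = Dagger.delayed(x -> x * 2)(a)
    # Ask one worker to exit shortly after the computation starts,
    # roughly simulating an OOM kill.
    victim = last(workers())
    @async (sleep(0.5); remote_do(exit, victim))
    @test collect(b) == 4
end
```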

This should also have no effect on normal (non-failure) situations, since the fault handling code only triggers when a worker dies unexpectedly.

@jpsamaroo
Member Author

Also, I will be moving most of this code out into its own file so that I don't make scheduler.jl hard to read.

@jpsamaroo
Member Author

bump

@jpsamaroo
Member Author

I also think it would be good to add an option to the scheduler to disable this new recovery functionality, in cases where it is undesired. I don't want to mess with this PR any more than I need to, so I will be doing that in a new PR soon (which will also expose some other unrelated options for the scheduler).

@shashi @aviks @tanmaykm any chance one of you can review this PR and let me know if there is anything else desired before merge?

@jpsamaroo jpsamaroo changed the title [WIP] Add fault tolerance to scheduler Add fault tolerance to scheduler Jun 6, 2019
@jpsamaroo
Member Author

I'm going to assume the single Travis failure on nightly is not my fault 🙂

@versipellis

I tested this with the dataset that was giving me trouble on JuliaDB, which spawned the initial conversations around worker fault tolerance with @jpsamaroo, and can verify that it resolves the Linux OOM killer issue. There does still seem to be a memory leak in either JuliaDB or Dagger (unrelated to the fault tolerance) where roughly 2 GB of memory remains allocated after the processes have completed.

@shashi
Collaborator

shashi commented Jun 6, 2019

Going to review it tonight! This is a lot of code for this package haha.

@jpsamaroo
Member Author

Hi @shashi, did you get the chance to review this PR? I'm happy to answer any questions you may have about how it works, since not everything I included is obvious (or possibly even necessary) 🙂

@jpsamaroo
Member Author

@shashi @andreasnoack if there are no objections, I'm going to merge this.

@coveralls

Coverage Status

Coverage increased (+2.1%) to 54.15% when pulling 59e57bb on jpsamaroo:jps/fault-tolerance into 777b002 on JuliaParallel:master.

@jpsamaroo jpsamaroo merged commit 309b9d9 into JuliaParallel:master Jul 6, 2019
@jpsamaroo
Member Author

Thanks!

@aviks

aviks commented Jul 7, 2019

Thanks for the work here, @jpsamaroo. Do the JuliaDB tests pass on this code? Would you please verify that before tagging a release here?

@jpsamaroo
Member Author

I had the exact same thought 🙂 I'm running the tests locally as we speak. Are there any other packages that rely on Dagger that I should test? Also, maybe we should set up some reverse-dependency testing in CI, since JuliaDB is a pretty important consumer of Dagger. A sketch of what such a step could run is below.
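
A reverse-dependency step in CI could be as simple as the following snippet (a hypothetical sketch, not an existing script in this repo): develop the local Dagger checkout, then run JuliaDB's test suite against it.

```julia
# Hypothetical CI step: test JuliaDB against the current Dagger checkout.
using Pkg
Pkg.develop(PackageSpec(path=pwd()))                # this Dagger checkout
Pkg.add(PackageSpec(name="JuliaDB", rev="master"))  # JuliaDB master
Pkg.test("JuliaDB")
```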

@jpsamaroo
Member Author

JuliaDB master passes with Dagger master, against both the latest MemPool release and MemPool master.
