Add fault tolerance to scheduler #105
Conversation
Force-pushed 79242e2 to f801242
Force-pushed f801242 to b271329
Force-pushed f114ee6 to 672efe8
@shashi @aviks please excuse the force-pushes. This should also have no effect on normal (non-failure) situations, since the fault handling code only triggers when a worker dies unexpectedly.
Also, I will be moving most of this code out into its own file so that I don't make scheduler.jl hard to read.
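For readers unfamiliar with the failure mode, a minimal sketch of the triggering condition described above might look like the following (`fetch_result`, `do_task`, and `handle_fault!` are illustrative names, not the PR's actual functions):

```julia
using Distributed

# Illustrative sketch only: the point is the shape of the trigger. Only a
# ProcessExitedException (a worker process that died) enters the recovery
# path; every other error propagates exactly as before.
function fetch_result(ctx, pid, thunk)
    try
        return remotecall_fetch(do_task, pid, thunk)
    catch err
        if err isa ProcessExitedException
            return handle_fault!(ctx, thunk, pid)  # worker died unexpectedly
        else
            rethrow()  # normal (non-failure) path is untouched
        end
    end
end
```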
Force-pushed 82b3967 to 8c26c69
bump
I also think it would be good to add an option to the scheduler to disable this new recovery functionality, in cases where it is undesired. I don't want to mess with this PR any more than I need to, so I will be doing that in a new PR soon (which will also expose some other unrelated options for the scheduler). @shashi @aviks @tanmaykm any chance one of you can review this PR and let me know if there is anything else desired before merge?
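As a sketch of what such an opt-out could look like (purely an illustration; `SchedulerOptions`, `fault_tolerant`, and `reschedule!` are hypothetical names, not the interface the follow-up PR adds):

```julia
# Hypothetical opt-out for the recovery path; all names are illustrative.
Base.@kwdef struct SchedulerOptions
    fault_tolerant::Bool = true  # set false to disable worker-failure recovery
end

function on_worker_death(opts::SchedulerOptions, err, thunk)
    opts.fault_tolerant || throw(err)  # opted out: surface the failure as-is
    reschedule!(thunk)                 # otherwise retry on a surviving worker
end
```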
I'm going to assume the single Travis failure on nightly is not my fault 🙂
I tested this with the dataset that was giving me trouble on JuliaDB, which spawned the initial conversations around worker fault tolerance with @jpsamaroo, and can verify that it resolves the Linux OOM killer issue. There does still seem to be a memory leak, either in JuliaDB or in Dagger (unrelated to the fault tolerance), where roughly 2 GB of memory remains allocated after the processes have completed.
Going to review it tonight! This is a lot of code for this package haha.
Hi @shashi, did you get the chance to review this PR? I'm happy to answer any questions you may have about how it works, since not everything I included is obvious (or possibly even necessary) 🙂
Force-pushed 580f9f3 to 59e57bb
@shashi @andreasnoack if there are no objections, I'm going to merge this.
Thanks!
Thanks for the work here, @jpsamaroo. Do the JuliaDB tests pass on this code? Would you please verify that before tagging a release here?
I had the exact same thought 🙂 I'm running tests locally as we speak. Are there any other packages that rely on Dagger that I should test? Also, maybe we should set up some reverse-dependency testing in CI, since JuliaDB is a pretty important consumer of Dagger.
JuliaDB master is passing with Dagger master, and with both the latest MemPool release and MemPool master.
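On the reverse-dependency testing idea above: a CI job could boil down to a short Julia script along these lines (a sketch assuming the standard Pkg API; the paths are illustrative):

```julia
# Sketch of a reverse-dependency test step: resolve JuliaDB against the
# local Dagger checkout under test, then run JuliaDB's own test suite.
using Pkg
Pkg.develop(path=pwd())  # use the Dagger working copy being tested
Pkg.add("JuliaDB")       # pull in the downstream consumer
Pkg.test("JuliaDB")      # CI fails if JuliaDB breaks on this Dagger
```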
When a worker is killed due to OS signals (such as the Linux OOM killer), a ProcessExitedException is likely to be thrown to the head node. When such an exception is detected, we reschedule the failed thunk on a new, non-dead worker (randomly for now).
Todo:
- `KeyError`s due to not getting `state` just right...
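For concreteness, a condensed sketch of the recovery step described above (`handle_fault!` and the `state` fields are illustrative; the PR's real bookkeeping is more involved, as the Todo suggests):

```julia
using Distributed

# Illustrative only: requeue a thunk whose worker died on a randomly
# chosen surviving worker, mirroring the description above.
function handle_fault!(state, thunk, dead_pid)
    alive = filter(p -> p != dead_pid, workers())
    isempty(alive) && error("all workers are dead; cannot recover")
    new_pid = rand(alive)          # random choice of live worker, for now
    delete!(state.running, thunk)  # drop the failed attempt
    push!(state.ready, thunk)      # requeue for scheduling on `new_pid`
    return new_pid
end
```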