Sometimes worker ProcessExittedException kills manager even with pmap with retry_check #36709
Comments
This patch should show where the error is thrown, so we can hopefully find what isn't covered by the try block:

```diff
--- a/base/asyncmap.jl
+++ b/base/asyncmap.jl
@@ -237,6 +237,7 @@ function start_worker_task!(worker_tasks, exec_func, chnl, batch_size=nothing)
             end
         catch e
             close(chnl)
+            display_error(stderr, catch_stack())
             retval = e
         end
         retval
```
A duplicate of Jeff's comment above for showing the error, but expressed as a monkey-patch so that we don't have to recompile julia to test:

```julia
function Base.start_worker_task!(worker_tasks, exec_func, chnl, batch_size=nothing)
    t = @async begin
        retval = nothing
        try
            if isa(batch_size, Number)
                while isopen(chnl)
                    # The mapping function expects an array of input args, as it processes
                    # elements in a batch.
                    batch_collection = Any[]
                    n = 0
                    for exec_data in chnl
                        push!(batch_collection, exec_data)
                        n += 1
                        (n == batch_size) && break
                    end
                    if n > 0
                        exec_func(batch_collection)
                    end
                end
            else
                for exec_data in chnl
                    exec_func(exec_data...)
                end
            end
        catch e
            close(chnl)
            Base.display_error(stderr, Base.catch_stack())
            retval = e
        end
        retval
    end
    push!(worker_tasks, t)
end
```
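For what it's worth, exercising the patched method could look roughly like the sketch below. This is not from the issue itself: the `exit()` call is just a stand-in for the OOM killer removing a worker, so that the instrumented catch block gets a chance to print a stack trace to stderr.

```julia
using Distributed
addprocs(2)

# Sketch: evaluate the monkey-patch above in the manager session first, then
# deliberately lose a worker mid-pmap and watch stderr for the printed stack.
try
    pmap(1:10) do i
        i == 3 && exit()   # stand-in for the OOM killer taking this worker down
        i^2
    end
catch e
    @info "pmap threw" exception = e
end
```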
I managed to reproduce this. Here is the output:
I have half a dozen different occurrences of this. If this is not the one that indicates the right thing …
Sometimes I seem to see the error being thrown without hitting the …
Thanks for the update. Is the manager process running on the same machine or a different one?
I'm not sure I understand --- if it stops the crash, what does it mean to reproduce the problem? (Obviously adding the printing code shouldn't change the problem, but that's a separate issue).
Are they the same, or different-looking stack traces? If different, it might help to see more of them. Neither of the traces looks like it is from the place we instrumented --- am I right about that?
Same machine.
I was totally wrong with that comment. It definitely does not stop the crash; the process still dies.
The one headed with "The Bad Thing Happened". They all look pretty similar.
In that case, isn't it possible the OOM killer is killing the manager process?
In theory yes, but I am pretty sure that isn't what is happening in this case.
But this time it seemingly didn't kill it, as I have a lot more logs.
This happens about 4 or 5 more times over about 10 minutes; the stack trace is always identical. No other errors. Looking at another:
Then it kept running for almost 2 hours.
What is interesting here is it was … Other places in the logs we do see it retrying successfully after a worker dies. It does seem possible that we are instrumenting the wrong place.
I wonder if it is seeing some wrapped version of the exception. Let's print out …
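To illustrate the kind of wrapping being suspected here: a minimal sketch, assuming the exception may arrive as a `RemoteException` carrying a `CapturedException`. The helper names are made up for the example and are not part of Distributed or Parallelism.jl.

```julia
using Distributed

# Hypothetical helpers for the sketch: peel off the wrappers Distributed can add.
unwrap(e) = e
unwrap(e::RemoteException)   = unwrap(e.captured)
unwrap(e::CapturedException) = unwrap(e.ex)

worker_died(e) = unwrap(e) isa ProcessExitedException

# A retry_check predicate built on this (see Base.retry for the check signature):
# pmap(f, xs; retry_delays = ExponentialBackOff(n = 3),
#             retry_check  = (state, e) -> worker_died(e))
```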
I have encountered this again, with Parallelism 1.2, which prints the error if …
Can this change be made to Base? It seems like it would fix #41030. In case it needs any more motivation: I'm debugging a network failure that occurred 3.5 hours into training a model, and it's frustrating to not have a real stack trace in my logs so I know where I'm missing the …
I have a weird thing happening.
I am using `pmap` for local distributed parallelism on very large machines (96 CPU cores, 384GB of RAM). For my task it is hard to know how much memory it will need; the memory bounds how much I can parallelize it.
To handle this I have taken the approach of starting 1 worker per core (96), and then letting the out-of-memory killer kill them off until it has enough memory.
This generally works great: I normally achieve 90% memory utilization, with 20-70 parallel workers remaining depending on the version of the task.
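Concretely, that setup might look roughly like the following sketch (not the author's actual code; the task body is a stand-in for the real memory-hungry work):

```julia
using Distributed

# One local worker per CPU core. If the OOM killer removes some workers,
# the survivors keep processing.
addprocs(Sys.CPU_THREADS)

# Stand-in for the real task; in practice each call may allocate a lot of memory.
@everywhere process_item(x) = sum(abs2, randn(10_000)) + x
```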
I use Parallelism.jl's `robust_pmap`, which is just a thin wrapper around `pmap` with `retry_check` set to retry on a bunch of different error conditions, including `ProcessExitedException`.
Recently I have started to see, a few times, a `ProcessExitedException` for one of the workers take down the manager, killing my whole program. That shouldn't be possible, since I retry on those.
Things that have changed recently include: using `CachingPool`s in `robust_pmap`, even though I am not actually using large closures this time (only small ones).
Preliminary testing suggests that if I go back to not using a caching pool, and cut down so fewer workers have to be killed, the problem goes away.
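For reference, a minimal sketch of the kind of wiring described here, using Distributed's documented `pmap` keywords. This is illustrative only, not the Parallelism.jl source, and the set of retryable errors is an assumption for the sketch.

```julia
using Distributed

# Illustrative only -- not the Parallelism.jl implementation.
retryable(e) = e isa ProcessExitedException || e isa Base.IOError

function sketch_robust_pmap(f, pool::AbstractWorkerPool, xs; retries = 10)
    pmap(f, pool, xs;
         retry_delays = ExponentialBackOff(n = retries),
         retry_check  = (state, e) -> retryable(e))
end

# Used with a CachingPool, as described above:
# results = sketch_robust_pmap(my_task, CachingPool(workers()), inputs)
```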
The error is being thrown in https://github.com/JuliaLang/julia/blob/release-1.3/base/asyncmap.jl#L178, from the anonymous function in the `foreach`. It is being returned from the `fetch` (not thrown by the `fetch`, or the stack trace would show it) and then thrown.
So I am wondering if the error is happening somewhere else in the `pmap` machinery, outside of the `retry_wrapper` that `retry_check` sets up, since the OOM killer can strike at any time.
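To make the "returned from the fetch, then thrown" observation concrete, here is a minimal, self-contained illustration of that pattern (not the actual asyncmap.jl code): the worker task catches the exception and returns it as its value, and the caller rethrows whatever it fetched, so the visible stack trace starts at the rethrow site rather than where the error originated.

```julia
# Minimal illustration of the return-then-throw pattern described above.
t = @async begin
    retval = nothing
    try
        error("original failure inside the task")
    catch e
        retval = e       # swallowed and returned instead of rethrown
    end
    retval
end

v = fetch(t)                     # fetch succeeds; it just hands back the exception object
isa(v, Exception) && throw(v)    # the visible stack trace starts here, not at error(...)
```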