
Sometimes worker ProcessExitedException kills manager even with pmap with retry_check #36709

Closed
oxinabox opened this issue Jul 17, 2020 · 12 comments
Labels: domain:parallelism (Parallel or distributed computation), needs more info (Clarification or a reproducible example is required)

Comments

@oxinabox
Contributor

I have a weird thing happening.
I am using pmap for local distributed parallelism on very large machines. (96 CPU cores, 384GB of RAM)

For my task it is hard to know how much memory it will need.
The memory bounds how much I can parallelize it.
To handle this I have taken the approach of starting 1 worker per core (96), and then letting the out-of-memory (OOM) killer kill them off until there is enough memory.
This generally works great: I normally achieve 90% memory utilization, with 20-70 parallel workers remaining depending on the version of the task.
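For reference, the worker setup described above can be sketched roughly like this (an illustrative sketch only; the actual launch code is not shown in this issue):

using Distributed

# Start one worker per CPU core; the OOM killer later prunes whichever workers
# run out of memory, and the survivors carry the remaining work.
addprocs(Sys.CPU_THREADS)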

I use Parallelism.jl's robust_pmap,
which is just a thin wrapper around pmap with retry_check set to retry on a bunch of different error conditions, including ProcessExitedException.
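Roughly, robust_pmap behaves like the sketch below. This is an illustration of the idea only, not the actual Parallelism.jl source; the name robust_pmap_sketch and the simplified predicate are mine:

using Distributed

# Sketch: forward to pmap with retry_delays and a retry_check that asks for a
# retry when (for example) a worker has died mid-task.
function robust_pmap_sketch(f, xs; num_retries=5)
    retry_check(state, err) = err isa ProcessExitedException  # real version checks more error types
    return pmap(f, xs;
                retry_delays = ExponentialBackOff(n = num_retries),
                retry_check = retry_check)
end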

Recently I have started to see, a few times, a ProcessExitedException for one of the workers take down the manager,
killing my whole program.
That shouldn't be possible, since I retry on those exceptions.

Things that have changed recently include:

  • I am now working on a more memory-intensive version of the problem, so a lot more workers get killed. So if this is down to chance, then I now have a lot more lottery tickets.
  • I started using CachingPools in robust_pmap (see the sketch just after this list), even though I am not actually using large closures this time (only small ones).
    Preliminary testing suggests that if I go back to not using a CachingPool, and cut things down so fewer workers have to be killed, the problem goes away.
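For reference, the CachingPool usage looks roughly like the following (a hedged sketch; process_chunk and the date partitioning are hypothetical stand-ins for the real, blinded work):

using Distributed, Dates
addprocs(4)                                   # example worker count
@everywhere using Dates
@everywhere process_chunk(ds) = sum(Dates.dayofyear, ds)   # hypothetical stand-in for the real work

pool  = CachingPool(workers())                # caches closures on the workers between calls
dates = collect(Date(2020, 1, 1):Day(1):Date(2020, 12, 31))
results = pmap(process_chunk, pool, Iterators.partition(dates, 30);
               retry_delays = ExponentialBackOff(n = 5),
               retry_check  = (state, err) -> err isa ProcessExitedException)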

The error is being thrown at https://github.com/JuliaLang/julia/blob/release-1.3/base/asyncmap.jl#L178
from the anonymous function in the foreach. It is being returned from the fetch (not thrown by the fetch, or the stacktrace would show it) and then thrown.
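For context, that foreach follows this pattern (a paraphrase of the code at the link above, not a verbatim quote; the function wrapper is mine):

# Each worker task catches its exception and returns it as its result;
# the manager then rethrows any task result that turns out to be an exception.
function rethrow_task_errors(worker_tasks)
    foreach(worker_tasks) do task
        v = fetch(task)                 # the exception comes back as the task's return value
        isa(v, Exception) && throw(v)   # ...and is only thrown here, on the manager side
    end
end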

So I am wondering if the error is happening somewhere else in the pmap machinery, outside of the retry wrapper that retry_check sets up.
Since the OOM killer can strike at any time.

@ViralBShah ViralBShah added the domain:parallelism Parallel or distributed computation label Jul 22, 2020
@JeffBezanson JeffBezanson self-assigned this Jul 27, 2020
@JeffBezanson
Sponsor Member

This patch should show where the error is thrown, so we can hopefully find what isn't covered by the try block:

--- a/base/asyncmap.jl
+++ b/base/asyncmap.jl
@@ -237,6 +237,7 @@ function start_worker_task!(worker_tasks, exec_func, chnl, batch_size=nothing)
             end
         catch e
             close(chnl)
+            display_error(stderr, catch_stack())
             retval = e
         end
         retval

@oxinabox
Contributor Author

oxinabox commented Feb 4, 2021

A duplicate of Jeff's patch above for showing the error, but expressed as a monkey-patch so that we don't have to recompile Julia to test it:

function Base.start_worker_task!(worker_tasks, exec_func, chnl, batch_size=nothing)
    t = @async begin
        retval = nothing

        try
            if isa(batch_size, Number)
                while isopen(chnl)
                    # The mapping function expects an array of input args, as it processes
                    # elements in a batch.
                    batch_collection=Any[]
                    n = 0
                    for exec_data in chnl
                        push!(batch_collection, exec_data)
                        n += 1
                        (n == batch_size) && break
                    end
                    if n > 0
                        exec_func(batch_collection)
                    end
                end
            else
                for exec_data in chnl
                    exec_func(exec_data...)
                end
            end
        catch e
            close(chnl)
            Base.display_error(stderr, Base.catch_stack())
            retval = e
        end
        retval
    end
    push!(worker_tasks, t)
end

@oxinabox
Contributor Author

oxinabox commented Feb 26, 2021

I managed to reproduce this.
Actually, I managed to reproduce it a ton of times (☹️), but the above code stops it crashing the process, so it took a while to realize we had reproduced it.

Here is the output

************* The bad thing happened Report this at https://github.com/JuliaLang/julia/issues/36709

ERROR: ProcessExitedException(14)

[1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
[2] wait at ./task.jl:713 [inlined] 
[3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106 
[4] take_buffered(::Channel{Any}) at ./channels.jl:387 
[5] take!(::Channel{Any}) at ./channels.jl:381 
[6] take!(::Distributed.RemoteValue) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:599 
[7] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:390 
[8] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386 
[9] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421 
[10] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
[11] remotecall_pool(::Function, ::Function, ::CachingPool, ::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:352 
[12] remotecall_pool at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:347 [inlined] 
[13] #remotecall_fetch#182 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined] 
[14] remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined] 
[15] #190#191 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined] 
[16] #190 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined] 
[17] (::Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,BLINDED}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/pmap.jl:176 
[18] (::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,BLINDED,1}}}}}}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./error.jl:301 
[19] #56 at ./error.jl:284 [inlined] 
[20] (::Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,BLINDED,Array{Pair{StepRange{Date,Day},String},1}}}}}}}})(::Base.RefValue{Any}, ::Tuple{SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}}) at ./asyncmap.jl:100 
[21] macro expansion at /opt/ml/code/BLINDED.jl:26 [inlined] 
[22] (::var"#1#2"{Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#6BLINDEDArray{Pair{StepRange{Date,Day},String},1}}}}}}}},Channel{Any},Nothing})() at ./task.jl:356 

I have half a dozen different occurrences of this. If this is not the one that indicates the right thing, I can post the others.

@oxinabox
Contributor Author

oxinabox commented Mar 8, 2021

Sometimes I still seem to see the error being thrown without hitting the display_error we added manually,
though I might just not be spotting it in the massive logs.

ERROR: LoadError: ProcessExitedException(51)

Stacktrace:
 [1] (::Base.var"#770#772")(::Task) at ./asyncmap.jl:178
 [2] foreach(::Base.var"#770#772", ::Array{Any,1}) at ./abstractarray.jl:2009
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Base.Iterators.PartitionIterator{Array{Date,1}}) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Distributed.var"#206#209"{CachingPool}, ::Function, ::Base.Iterators.PartitionIterator{Array{Date,1}}) at ./asyncmap.jl:154
 [5] async_usemap(::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLINDED,Array{Pair{StepRange{Date,Day},String},1}}}}}}}, ::Base.Iterators.PartitionIterator{Array{Date,1}}; ntasks::Function, batch_size::Nothing) at ./asyncmap.jl:103
 [6] #asyncmap#754 at ./asyncmap.jl:81 [inlined]
 [7] pmap(::Function, ::CachingPool, ::Base.Iterators.PartitionIterator{Array{Date,1}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::ExponentialBackOff, retry_check::Function) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/pmap.jl:126
 [8] robust_pmap(::Function, ::Base.Iterators.PartitionIterator{Array{Date,1}}; num_retries::Int64) at /root/.julia/packages/Parallelism/vEmN1/src/robust_pmap.jl:27
 [9] robust_pmap at /root/.julia/packages/Parallelism/vEmN1/src/robust_pmap.jl:9 [inlined]

@JeffBezanson
Sponsor Member

Thanks for the update.

Is the manager process running on the same machine or a different one?

but the above code stops it crashing the process

I'm not sure I understand --- if it stops the crash, what does it mean to reproduce the problem? (Obviously adding the printing code shouldn't change the problem, but that's a separate issue).

I have half a dozen different occurrences of this. If this is not the one that indicates the right thing

Are they the same, or different-looking stack traces? If different, it might help to see more of them. Neither of the traces looks like it is from the place we instrumented --- am I right about that?

@oxinabox
Contributor Author

oxinabox commented Mar 8, 2021

Is the manager process running on the same machine or a different one?

Same machine

but the above code stops it crashing the process

I was totally wrong with that comment. It definitely does not. The process still dies.

Neither of the traces look like they are from the place we instrumented --- am I right about that?

The one headed with "The bad thing happened" is from the instrumented point, because I added that header as a println to help me find it.

They all look pretty similar.
But I will double check tomorrow and post a bunch of them if they are different.

@JeffBezanson
Sponsor Member

Same machine

In that case isn't it possible the oom killer is killing the manager process?

@oxinabox
Contributor Author

oxinabox commented Mar 8, 2021

In that case isn't it possible the oom killer is killing the manager process?

In theory yes, but I am pretty sure that isn't what is happening in this case.

  1. The OOM killer in my experience kills with SIGKILL (I have seen that far too often in another project). SIGKILL kills instantly; you don't get a stack trace because you are dead before you can print one. Whereas here I see not just the stack trace printed, but also atexit running -- sometimes it complains that it couldn't shut down one of the workers.

  2. Also, the manager holds almost no memory, so it wouldn't make sense to kill it.

@oxinabox
Contributor Author

oxinabox commented Mar 9, 2021

************* The bad thing happened Report this at https://github.com/JuliaLang/julia/issues/36709
ERROR: ProcessExitedException(14)
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
 [4] take_buffered(::Channel{Any}) at ./channels.jl:387
 [5] take!(::Channel{Any}) at ./channels.jl:381
 [6] take!(::Distributed.RemoteValue) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:599
 [7] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:390
 [8] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386
 [9] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
 [10] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
 [11] remotecall_pool(::Function, ::Function, ::CachingPool, ::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:352
 [12] remotecall_pool at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:347 [inlined]
 [13] #remotecall_fetch#182 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined]
 [14] remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined]
 [15] #190#191 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined]
 [16] #190 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined]
 [17] (::Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/pmap.jl:176
 [18] (::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./error.jl:301
 [19] #56 at ./error.jl:284 [inlined]
 [20] (::Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}}})(::Base.RefValue{Any}, ::Tuple{SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}}) at ./asyncmap.jl:100
 [21] macro expansion at /opt/ml/code/backrun.jl:26 [inlined]
 [22] (::var"#1#2"{Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}}},Channel{Any},Nothing})() at ./task.jl:356

But this time it seemingly didn't kill the process, as I have a lot more logs.
In fact, it even happened again several more times.

Worker 27 terminated.
ERROR: ProcessExitedException(27)
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
 [4] take_buffered(::Channel{Any}) at ./channels.jl:387
 [5] take!(::Channel{Any}) at ./channels.jl:381
 [6] take!(::Distributed.RemoteValue) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:599
 [7] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:390
 [8] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386
 [9] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
 [10] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
 [11] remotecall_pool(::Function, ::Function, ::CachingPool, ::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:352
 [12] remotecall_pool at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:347 [inlined]
 [13] #remotecall_fetch#182 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined]
 [14] remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined]
 [15] #190#191 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined]
 [16] #190 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined]
 [17] (::Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/pmap.jl:176
 [18] (::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./error.jl:301
 [19] #56 at ./error.jl:284 [inlined]
 [20] (::Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}}})(::Base.RefValue{Any}, ::Tuple{SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}}) at ./asyncmap.jl:100
 [21] macro expansion at /opt/ml/code/backrun.jl:26 [inlined]
 [22] (::var"#1#2"{Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}}},Channel{Any},Nothing})() at ./task.jl:356

This happened about 4 or 5 more times over about 10 minutes; the stacktrace is always identical,
with a few other things happening in between according to the logs.
Then the logs end with:

[2021-02-26 16:35:06 | warn | Distributed]: Forcibly interrupting busy workers
[2021-02-26 16:35:06 | warn | Distributed]: rmprocs: process 1 not removed

No other errors.


Looking at another run, I got the same "bad thing" error message.

************* The bad thing happened Report this at https://github.com/JuliaLang/julia/issues/36709
ERROR: ProcessExitedException(13)
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
 [4] take_buffered(::Channel{Any}) at ./channels.jl:387
... etc

Then it kept running for almost 2 hours, then exited after:

ERROR: LoadError: ProcessExitedException(13)
Stacktrace:
 [1] (::Base.var"#770#772")(::Task) at ./asyncmap.jl:178
 [2] foreach(::Base.var"#770#772", ::Array{Any,1}) at ./abstractarray.jl:2009
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Base.Iterators.PartitionIterator{Array{Date,1}}) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Distributed.var"#206#209"{CachingPool}, ::Function, ::Base.Iterators.PartitionIterator{Array{Date,1}}) at ./asyncmap.jl:154
 [5] async_usemap(::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}}, ::Base.Iterators.PartitionIterator{Array{Date,1}}; ntasks::Function, batch_size::Nothing) at ./asyncmap.jl:103
 [6] #asyncmap#754 at ./asyncmap.jl:81 [inlined]
 [7] pmap(::Function, ::CachingPool, ::Base.Iterators.PartitionIterator{Array{Date,1}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::ExponentialBackOff, retry_check::Function) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/pmap.jl:126
 [8] robust_pmap(::Function, ::Base.Iterators.PartitionIterator{Array{Date,1}}; num_retries::Int64) at /root/.julia/packages/Parallelism/vEmN1/src/robust_pmap.jl:27
 [9] robust_pmap at /root/.julia/packages/Parallelism/vEmN1/src/robust_pmap.jl:9 [inlined]
 [10] sagemaker_backrun(::BLIND, ::Array{Date,1}) at /opt/ml/code/HyperRun.jl:274
 [11] sagemaker_run(::BLIND, ::RandomSelector) at /opt/ml/code/HyperRun.jl:325
 [12] top-level scope at /opt/ml/code/backrun.jl:115
 [13] include(::Function, ::Module, ::String) at ./Base.jl:380
 [14] include(::Module, ::String) at ./Base.jl:368
 [15] exec_options(::Base.JLOptions) at ./client.jl:296
 [16] _start() at ./client.jl:506

What is interesting here is that it was ProcessExitedException(13) that killed it in the end.
It just took 2 hours to realize that and die?


Other places in the logs we do see it retrying successfully after a worker dies.


It does seem possible that we are instrumenting the wrong place; I don't know.
All I know is that it shouldn't die with a ProcessExitedException inside robust_pmap.
Its whole point is to handle that one.

@JeffBezanson
Sponsor Member

I wonder if it is seeing some wrapped version of the exception. Let's print out typeof(err) if should_retry ends up false in robust_pmap.
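Something like the following inside robust_pmap's check function would do it (a sketch only; the real predicate in Parallelism.jl covers more error types and may be structured differently):

function retry_check(state, err)
    should_retry = err isa ProcessExitedException   # plus the other retryable conditions
    if !should_retry
        @warn "robust_pmap not retrying" exception_type = typeof(err)
    end
    return should_retry
end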

@oxinabox
Contributor Author

oxinabox commented Apr 9, 2021

I have encountered this again, with Parallelism 1.2, which prints the error whether should_retry is false or true.
It did not print anything, except the message as per #36709 (comment).

@ericphanson
Contributor

ericphanson commented Jul 18, 2021

This patch should show where the error is thrown, so we can hopefully find what isn't covered by the try block:

--- a/base/asyncmap.jl
+++ b/base/asyncmap.jl
@@ -237,6 +237,7 @@ function start_worker_task!(worker_tasks, exec_func, chnl, batch_size=nothing)
             end
         catch e
             close(chnl)
+            display_error(stderr, catch_stack())
             retval = e
         end
         retval

Can this change be made to Base? It seems like it would fix #41030

In case it needs any more motivation: I'm debugging a network failure that occurred 3.5 hours into training a model, and it's frustrating not to have a real stacktrace in my logs so I can see where I'm missing the @retry statement.

@vtjnash vtjnash added the needs more info Clarification or a reproducible example is required label Feb 11, 2024
@vtjnash vtjnash closed this as completed Feb 11, 2024