
Sometimes worker ProcessExitedException kills manager even with pmap with retry_check #36709

Closed
oxinabox opened this issue Jul 17, 2020 · 12 comments
Labels: domain:parallelism (Parallel or distributed computation), needs more info (Clarification or a reproducible example is required)

Comments

@oxinabox
Contributor

I have a weird thing happening.
I am using pmap for local distributed parallelism on very large machines. (96 CPU cores, 384GB of RAM)

For my task it is hard to know how much memory it will need.
The memory bounds how much I can parallelize it.
To handle this I have taken the approach of starting 1 worker per core (96), and then letting the out-of-memory (OOM) killer kill them off until there is enough memory.
This generally works great: I normally achieve 90% memory utilization, with 20-70 parallel workers remaining depending on the version of the task.
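For reference, the worker setup described above can be sketched roughly like this (an illustrative sketch only; the actual launch code is not shown in this issue):

using Distributed

# Start one worker per CPU core; the OOM killer later prunes whichever workers
# run out of memory, and the survivors carry the remaining work.
addprocs(Sys.CPU_THREADS)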

I use Parallelism.jl's robust_pmap,
which is just a thin wrapper around pmap with retry_check set to retry on a bunch of different error conditions, including ProcessExitedException.
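Roughly, robust_pmap behaves like the sketch below. This is an illustration of the idea only, not the actual Parallelism.jl source; the name robust_pmap_sketch and the simplified predicate are mine:

using Distributed

# Sketch: forward to pmap with retry_delays and a retry_check that asks for a
# retry when (for example) a worker has died mid-task.
function robust_pmap_sketch(f, xs; num_retries=5)
    retry_check(state, err) = err isa ProcessExitedException  # real version checks more error types
    return pmap(f, xs;
                retry_delays = ExponentialBackOff(n = num_retries),
                retry_check = retry_check)
end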

Recently I have started to see, a few times, a ProcessExitedException for one of the workers take down the manager,
killing my whole program.
That shouldn't be possible, since I retry on those exceptions.

Things that have changed recently include:

  • I am now working on a more memory-intensive version of the problem, so a lot more workers get killed. So if this is down to chance, then I now have a lot more lottery tickets.
  • I started using CachingPools in robust_pmap (see the sketch just after this list), even though I am not actually using large closures this time (only small ones).
    Preliminary testing suggests that if I go back to not using a CachingPool, and cut things down so fewer workers have to be killed, the problem goes away.
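For reference, the CachingPool usage looks roughly like the following (a hedged sketch; process_chunk and the date partitioning are hypothetical stand-ins for the real, blinded work):

using Distributed, Dates
addprocs(4)                                   # example worker count
@everywhere using Dates
@everywhere process_chunk(ds) = sum(Dates.dayofyear, ds)   # hypothetical stand-in for the real work

pool  = CachingPool(workers())                # caches closures on the workers between calls
dates = collect(Date(2020, 1, 1):Day(1):Date(2020, 12, 31))
results = pmap(process_chunk, pool, Iterators.partition(dates, 30);
               retry_delays = ExponentialBackOff(n = 5),
               retry_check  = (state, err) -> err isa ProcessExitedException)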

The error is being thrown at https://github.com/JuliaLang/julia/blob/release-1.3/base/asyncmap.jl#L178
from the anonymous function in the foreach. It is being returned from the fetch (not thrown by the fetch, or the stacktrace would show it) and then thrown.
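For context, that foreach follows this pattern (a paraphrase of the code at the link above, not a verbatim quote; the function wrapper is mine):

# Each worker task catches its exception and returns it as its result;
# the manager then rethrows any task result that turns out to be an exception.
function rethrow_task_errors(worker_tasks)
    foreach(worker_tasks) do task
        v = fetch(task)                 # the exception comes back as the task's return value
        isa(v, Exception) && throw(v)   # ...and is only thrown here, on the manager side
    end
end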

So I am wondering if the error is happening somewhere else in the pmap machinery, outside of the retry wrapper that retry_check sets up.
Since the OOM killer can strike at any time.

@ViralBShah ViralBShah added the domain:parallelism Parallel or distributed computation label Jul 22, 2020
@JeffBezanson JeffBezanson self-assigned this Jul 27, 2020
@JeffBezanson
Sponsor Member

This patch should show where the error is thrown, so we can hopefully find what isn't covered by the try block:

--- a/base/asyncmap.jl
+++ b/base/asyncmap.jl
@@ -237,6 +237,7 @@ function start_worker_task!(worker_tasks, exec_func, chnl, batch_size=nothing)
             end
         catch e
             close(chnl)
+            display_error(stderr, catch_stack())
             retval = e
         end
         retval

@oxinabox
Contributor Author

oxinabox commented Feb 4, 2021

A duplicate of Jeff's patch above for showing the error, but expressed as a monkey-patch so that we don't have to recompile Julia to test it:

function Base.start_worker_task!(worker_tasks, exec_func, chnl, batch_size=nothing)
    t = @async begin
        retval = nothing

        try
            if isa(batch_size, Number)
                while isopen(chnl)
                    # The mapping function expects an array of input args, as it processes
                    # elements in a batch.
                    batch_collection=Any[]
                    n = 0
                    for exec_data in chnl
                        push!(batch_collection, exec_data)
                        n += 1
                        (n == batch_size) && break
                    end
                    if n > 0
                        exec_func(batch_collection)
                    end
                end
            else
                for exec_data in chnl
                    exec_func(exec_data...)
                end
            end
        catch e
            close(chnl)
            Base.display_error(stderr, Base.catch_stack())
            retval = e
        end
        retval
    end
    push!(worker_tasks, t)
end

@oxinabox
Contributor Author

oxinabox commented Feb 26, 2021

I managed to reproduce this.
Actually, I managed to reproduce it a ton of times (☹️), but the above code stops it crashing the process, so it took a while to realize we had reproduced it.

Here is the output

************* The bad thing happened Report this at https://github.com/JuliaLang/julia/issues/36709

ERROR: ProcessExitedException(14)

[1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
[2] wait at ./task.jl:713 [inlined] 
[3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106 
[4] take_buffered(::Channel{Any}) at ./channels.jl:387 
[5] take!(::Channel{Any}) at ./channels.jl:381 
[6] take!(::Distributed.RemoteValue) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:599 
[7] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:390 
[8] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386 
[9] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421 
[10] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
[11] remotecall_pool(::Function, ::Function, ::CachingPool, ::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:352 
[12] remotecall_pool at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:347 [inlined] 
[13] #remotecall_fetch#182 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined] 
[14] remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined] 
[15] #190#191 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined] 
[16] #190 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined] 
[17] (::Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,BLINDED}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/pmap.jl:176 
[18] (::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,BLINDED,1}}}}}}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./error.jl:301 
[19] #56 at ./error.jl:284 [inlined] 
[20] (::Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,BLINDED,Array{Pair{StepRange{Date,Day},String},1}}}}}}}})(::Base.RefValue{Any}, ::Tuple{SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}}) at ./asyncmap.jl:100 
[21] macro expansion at /opt/ml/code/BLINDED.jl:26 [inlined] 
[22] (::var"#1#2"{Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#6BLINDEDArray{Pair{StepRange{Date,Day},String},1}}}}}}}},Channel{Any},Nothing})() at ./task.jl:356 

I have half a dozen different occurrences of this. If this is not the one that indicates the right thing, I can post the others.

@oxinabox
Contributor Author

oxinabox commented Mar 8, 2021

Sometimes I still seem to see the error being thrown without hitting the display_error we added manually,
though I might just not be spotting it in the massive logs.

ERROR: LoadError: ProcessExitedException(51)

Stacktrace:
 [1] (::Base.var"#770#772")(::Task) at ./asyncmap.jl:178
 [2] foreach(::Base.var"#770#772", ::Array{Any,1}) at ./abstractarray.jl:2009
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Base.Iterators.PartitionIterator{Array{Date,1}}) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Distributed.var"#206#209"{CachingPool}, ::Function, ::Base.Iterators.PartitionIterator{Array{Date,1}}) at ./asyncmap.jl:154
 [5] async_usemap(::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLINDED,Array{Pair{StepRange{Date,Day},String},1}}}}}}}, ::Base.Iterators.PartitionIterator{Array{Date,1}}; ntasks::Function, batch_size::Nothing) at ./asyncmap.jl:103
 [6] #asyncmap#754 at ./asyncmap.jl:81 [inlined]
 [7] pmap(::Function, ::CachingPool, ::Base.Iterators.PartitionIterator{Array{Date,1}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::ExponentialBackOff, retry_check::Function) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/pmap.jl:126
 [8] robust_pmap(::Function, ::Base.Iterators.PartitionIterator{Array{Date,1}}; num_retries::Int64) at /root/.julia/packages/Parallelism/vEmN1/src/robust_pmap.jl:27
 [9] robust_pmap at /root/.julia/packages/Parallelism/vEmN1/src/robust_pmap.jl:9 [inlined]

@JeffBezanson
Sponsor Member

Thanks for the update.

Is the manager process running on the same machine or a different one?

but the above code stops it crashing the process

I'm not sure I understand --- if it stops the crash, what does it mean to reproduce the problem? (Obviously adding the printing code shouldn't change the problem, but that's a separate issue).

I have half a dozen different occurrences of this. If this is not the one that indicates the right thing

Are they the same, or different-looking stack traces? If different, it might help to see more of them. Neither of the traces looks like it is from the place we instrumented --- am I right about that?

@oxinabox
Contributor Author

oxinabox commented Mar 8, 2021

Is the manager process running on the same machine or a different one?

Same machine

but the above code stops it crashing the process

I was totally wrong with that comment. It definitely does not. The process still dies.

Neither of the traces look like they are from the place we instrumented --- am I right about that?

The one headed with "The bad thing happened" is from the instrumented point, because I added that header as a println to help me find it.

They all look pretty similar.
But I will double check tomorrow and post a bunch of them if they are different.

@JeffBezanson
Sponsor Member

Same machine

In that case isn't it possible the oom killer is killing the manager process?

@oxinabox
Contributor Author

oxinabox commented Mar 8, 2021

In that case isn't it possible the oom killer is killing the manager process?

In theory yes, but I am pretty sure that isn't what is happening in this case.

  1. The OOM killer in my experience kills with SIGKILL (I have seen that far too often in another project). SIGKILL kills instantly; you don't get a stack trace because you are dead before you can print one. Whereas here I see not just the stack trace printed, but also atexit running -- sometimes it complains that it couldn't shut down one of the workers.

  2. Also, the manager holds almost no memory, so it wouldn't make sense to kill it.

@oxinabox
Contributor Author

oxinabox commented Mar 9, 2021

************* The bad thing happened Report this at https://github.com/JuliaLang/julia/issues/36709
ERROR: ProcessExitedException(14)
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
 [4] take_buffered(::Channel{Any}) at ./channels.jl:387
 [5] take!(::Channel{Any}) at ./channels.jl:381
 [6] take!(::Distributed.RemoteValue) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:599
 [7] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:390
 [8] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386
 [9] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
 [10] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
 [11] remotecall_pool(::Function, ::Function, ::CachingPool, ::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:352
 [12] remotecall_pool at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:347 [inlined]
 [13] #remotecall_fetch#182 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined]
 [14] remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined]
 [15] #190#191 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined]
 [16] #190 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined]
 [17] (::Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/pmap.jl:176
 [18] (::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./error.jl:301
 [19] #56 at ./error.jl:284 [inlined]
 [20] (::Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}}})(::Base.RefValue{Any}, ::Tuple{SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}}) at ./asyncmap.jl:100
 [21] macro expansion at /opt/ml/code/backrun.jl:26 [inlined]
 [22] (::var"#1#2"{Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}}},Channel{Any},Nothing})() at ./task.jl:356

But this time it seemingly didn't kill the process, as I have a lot more logs.
In fact, it even happened again several more times.

Worker 27 terminated.
ERROR: ProcessExitedException(27)
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
 [4] take_buffered(::Channel{Any}) at ./channels.jl:387
 [5] take!(::Channel{Any}) at ./channels.jl:381
 [6] take!(::Distributed.RemoteValue) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:599
 [7] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:390
 [8] remotecall_fetch(::Function, ::Distributed.Worker, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386
 [9] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
 [10] remotecall_fetch(::Function, ::Int64, ::RemoteChannel{Channel{Any}}, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
 [11] remotecall_pool(::Function, ::Function, ::CachingPool, ::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:352
 [12] remotecall_pool at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:347 [inlined]
 [13] #remotecall_fetch#182 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined]
 [14] remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:229 [inlined]
 [15] #190#191 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined]
 [16] #190 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/workerpool.jl:274 [inlined]
 [17] (::Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/pmap.jl:176
 [18] (::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}})(::SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./error.jl:301
 [19] #56 at ./error.jl:284 [inlined]
 [20] (::Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}}})(::Base.RefValue{Any}, ::Tuple{SubArray{Date,1,Array{Date,1},Tuple{UnitRange{Int64}},true}}) at ./asyncmap.jl:100
 [21] macro expansion at /opt/ml/code/backrun.jl:26 [inlined]
 [22] (::var"#1#2"{Base.var"#760#765"{Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}}},Channel{Any},Nothing})() at ./task.jl:356

This happened about 4 or 5 more times over about 10 minutes; the stacktrace is always identical,
with a few other things happening in between according to the logs.
Then the logs end with:

[2021-02-26 16:35:06 | warn | Distributed]: Forcibly interrupting busy workers
[2021-02-26 16:35:06 | warn | Distributed]: rmprocs: process 1 not removed

No other errors.


Looking at another run, I got the same "bad thing" error message.

************* The bad thing happened Report this at https://github.com/JuliaLang/julia/issues/36709
ERROR: ProcessExitedException(13)
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
 [4] take_buffered(::Channel{Any}) at ./channels.jl:387
... etc

Then it kept running for almost 2 hours, then exited after:

ERROR: LoadError: ProcessExitedException(13)
Stacktrace:
 [1] (::Base.var"#770#772")(::Task) at ./asyncmap.jl:178
 [2] foreach(::Base.var"#770#772", ::Array{Any,1}) at ./abstractarray.jl:2009
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Base.Iterators.PartitionIterator{Array{Date,1}}) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Distributed.var"#206#209"{CachingPool}, ::Function, ::Base.Iterators.PartitionIterator{Array{Date,1}}) at ./asyncmap.jl:154
 [5] async_usemap(::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,Parallelism.var"#retry_check#2",Distributed.var"#222#223"{Distributed.var"#190#192"{Distributed.var"#190#191#193"{CachingPool,var"#67#68"{BLIND,Array{Pair{StepRange{Date,Day},String},1}}}}}}}, ::Base.Iterators.PartitionIterator{Array{Date,1}}; ntasks::Function, batch_size::Nothing) at ./asyncmap.jl:103
 [6] #asyncmap#754 at ./asyncmap.jl:81 [inlined]
 [7] pmap(::Function, ::CachingPool, ::Base.Iterators.PartitionIterator{Array{Date,1}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::ExponentialBackOff, retry_check::Function) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/pmap.jl:126
 [8] robust_pmap(::Function, ::Base.Iterators.PartitionIterator{Array{Date,1}}; num_retries::Int64) at /root/.julia/packages/Parallelism/vEmN1/src/robust_pmap.jl:27
 [9] robust_pmap at /root/.julia/packages/Parallelism/vEmN1/src/robust_pmap.jl:9 [inlined]
 [10] sagemaker_backrun(::BLIND, ::Array{Date,1}) at /opt/ml/code/HyperRun.jl:274
 [11] sagemaker_run(::BLIND, ::RandomSelector) at /opt/ml/code/HyperRun.jl:325
 [12] top-level scope at /opt/ml/code/backrun.jl:115
 [13] include(::Function, ::Module, ::String) at ./Base.jl:380
 [14] include(::Module, ::String) at ./Base.jl:368
 [15] exec_options(::Base.JLOptions) at ./client.jl:296
 [16] _start() at ./client.jl:506

What is interesting here is that it was ProcessExitedException(13) that killed it in the end.
It just took 2 hours to realize that and die?


Other places in the logs we do see it retrying successfully after a worker dies.


It does seem possible that we are instrumenting the wrong place; I don't know.
All I know is that it shouldn't die with a ProcessExitedException inside robust_pmap.
Its whole point is to handle that one.

@JeffBezanson
Sponsor Member

I wonder if it is seeing some wrapped version of the exception. Let's print out typeof(err) if should_retry ends up false in robust_pmap.
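Something like the following inside robust_pmap's check function would do it (a sketch only; the real predicate in Parallelism.jl covers more error types and may be structured differently):

function retry_check(state, err)
    should_retry = err isa ProcessExitedException   # plus the other retryable conditions
    if !should_retry
        @warn "robust_pmap not retrying" exception_type = typeof(err)
    end
    return should_retry
end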

@oxinabox
Contributor Author

oxinabox commented Apr 9, 2021

I have encountered this again, with Parallelism 1.2, which prints the error whether should_retry is false or true.
It did not print anything, except the message as per #36709 (comment).

@ericphanson
Contributor

ericphanson commented Jul 18, 2021

This patch should show where the error is thrown, so we can hopefully find what isn't covered by the try block:

--- a/base/asyncmap.jl
+++ b/base/asyncmap.jl
@@ -237,6 +237,7 @@ function start_worker_task!(worker_tasks, exec_func, chnl, batch_size=nothing)
             end
         catch e
             close(chnl)
+            display_error(stderr, catch_stack())
             retval = e
         end
         retval

Can this change be made to Base? It seems like it would fix #41030

In case it needs any more motivation: I'm debugging a network failure that occurred 3.5 hours into training a model, and it's frustrating not to have a real stacktrace in my logs so I can see where I'm missing the @retry statement.

@vtjnash vtjnash added the needs more info Clarification or a reproducible example is required label Feb 11, 2024
@vtjnash vtjnash closed this as completed Feb 11, 2024