
Error with addprocs(manager) with manager::MPIWorkerManager #37

Closed

fverdugo opened this issue Feb 2, 2023 · 4 comments


fverdugo commented Feb 2, 2023

I am having this error. Is this the expected behavior?

Thanks in advance!

reproducer$ export JULIA_PROJECT=`pwd`
reproducer$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.4 (2022-12-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(reproducer) pkg> st
Status `~/reproducer/Project.toml`
  [e7922434] MPIClusterManagers v0.2.4
  [8ba89e20] Distributed

julia> using MPIClusterManagers

julia> using Distributed

julia> manager = MPIWorkerManager(4)
MPIWorkerManager(4, Dict{Int64, Int64}(), Dict{Int64, Int64}(), false, false, Condition(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.AlwaysLockedST(1)), IO[])

julia> addprocs(manager)
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:345 [inlined]
 [2] addprocs_locked(manager::MPIWorkerManager; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed ~/apps/julia/1.8.4/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:507
 [3] addprocs_locked
   @ ~/apps/julia/1.8.4/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:456 [inlined]
 [4] addprocs(manager::MPIWorkerManager; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed ~/apps/julia/1.8.4/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:450
 [5] addprocs(manager::MPIWorkerManager)
   @ Distributed ~/apps/julia/1.8.4/share/julia/stdlib/v1.8/Distributed/src/cluster.jl:443
 [6] top-level scope
   @ REPL[5]:1

    nested task error: Could not connect to workers
    Stacktrace:
     [1] error(s::String)
       @ Base ./error.jl:35
     [2] launch(mgr::MPIWorkerManager, params::Dict{Symbol, Any}, instances::Vector{WorkerConfig}, cond::Condition)
       @ MPIClusterManagers ~/.julia/packages/MPIClusterManagers/RQVkV/src/workermanager.jl:170
     [3] (::Distributed.var"#43#46"{MPIWorkerManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:484
    
    caused by: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:345 [inlined]
     [2] launch(mgr::MPIWorkerManager, params::Dict{Symbol, Any}, instances::Vector{WorkerConfig}, cond::Condition)
       @ MPIClusterManagers ~/.julia/packages/MPIClusterManagers/RQVkV/src/workermanager.jl:168
     [3] (::Distributed.var"#43#46"{MPIWorkerManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:484
    
        nested task error: MethodError: no method matching +(::Symbol, ::Int64)
        Closest candidates are:
          +(::Any, ::Any, ::Any, ::Any...) at operators.jl:591
          +(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:87
          +(::LinearAlgebra.UniformScaling, ::Number) at ~/apps/julia/1.8.4/share/julia/stdlib/v1.8/LinearAlgebra/src/uniformscaling.jl:144
          ...
        Stacktrace:
         [1] macro expansion
           @ ~/.julia/packages/MPIClusterManagers/RQVkV/src/workermanager.jl:140 [inlined]
         [2] (::MPIClusterManagers.var"#1#4"{MPIWorkerManager, Dict{Symbol, Any}, Sockets.TCPServer})()
           @ MPIClusterManagers ./task.jl:484


simonbyrne (Member) commented

I am having this error. Is this the expected behavior?

Well, no....

simonbyrne (Member) commented

Unfortunately I can't replicate it. It appears that at setup time a process is sending a symbol instead of the rank. Can you add an @show to this line:

rank = Serialization.deserialize(io)

and see what happens?
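For reference, a minimal sketch of that debugging change (the surrounding reader task in MPIClusterManagers/src/workermanager.jl is assumed; only the @show line is new):

rank = Serialization.deserialize(io)
@show rank   # should print an Int rank; a Symbol here would explain the +(::Symbol, ::Int64) MethodError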

fverdugo (Author) commented

I have run the reproducer again and I don't get the error anymore (perhaps the dependencies resolve differently now; unfortunately, I no longer have the original Manifest.toml).

I am closing the issue. If I run into the problem again I can always re-open it.

Thanks @simonbyrne for your quick input anyway!

dalviebenu commented

I ran into a similar problem, but I was able to resolve it by turning off the Wi-Fi on my machine. For context, this was executed in a Julia notebook while connected to the eduroam Wi-Fi network.

using MPIClusterManagers
using Distributed
if procs() == workers()
    nranks = 3
    manager = MPIWorkerManager(nranks)
    addprocs(manager)
end
TaskFailedException

    nested task error: Could not connect to workers
    Stacktrace:
     [1] error(s::String)
       @ Base ./error.jl:35
     [2] launch(mgr::MPIWorkerManager, params::Dict{Symbol, Any}, instances::Vector{WorkerConfig}, cond::Condition)
       @ MPIClusterManagers ~/.julia/packages/MPIClusterManagers/RQVkV/src/workermanager.jl:170
     [3] (::Distributed.var"#43#46"{MPIWorkerManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:514
    
    caused by: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:349 [inlined]
     [2] launch(mgr::MPIWorkerManager, params::Dict{Symbol, Any}, instances::Vector{WorkerConfig}, cond::Condition)
       @ MPIClusterManagers ~/.julia/packages/MPIClusterManagers/RQVkV/src/workermanager.jl:168
     [3] (::Distributed.var"#43#46"{MPIWorkerManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:514
    
        nested task error: MethodError: no method matching +(::Type{Any}, ::Int64)
        
        Closest candidates are:
          +(::Any, ::Any, ::Any, ::Any...)
           @ Base operators.jl:578
          +(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8}
           @ Base int.jl:87
          +(::T, ::Integer) where T<:AbstractChar
           @ Base char.jl:237
          ...
        
        Stacktrace:
         [1] macro expansion
           @ ~/.julia/packages/MPIClusterManagers/RQVkV/src/workermanager.jl:140 [inlined]
         [2] (::MPIClusterManagers.var"#1#4"{MPIWorkerManager, Dict{Symbol, Any}, Sockets.TCPServer})()
           @ MPIClusterManagers ./task.jl:514

Stacktrace:
 [1] wait
   @ ./task.jl:349 [inlined]
 [2] addprocs_locked(manager::MPIWorkerManager; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:507
 [3] addprocs_locked
   @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:456 [inlined]
 [4] addprocs(manager::MPIWorkerManager; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:450
 [5] addprocs(manager::MPIWorkerManager)
   @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:443
 [6] top-level scope
   @ In[2]:6
