I tried to run a Singularity container on an HPC cluster; my goal is to run across many nodes. However, I cannot even create a single process. The lines below show how I submit my job:
(...)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=slow
(...)
srun singularity exec \
--bind=/scratch:/scratch \
--bind=/var/spool/slurm:/var/spool/slurm \
work.simg /opt/julia/bin/julia julia_parallel_test.jl
work.simg is my Singularity image; on my own computer I am sure that everything works fine.
/opt/julia/bin/julia is the path of the Julia executable inside the image.
julia_parallel_test.jl is the simplest code I could come up with:
using ClusterManagers
addprocs_slurm(2, nodes=1, partition="slow")
@everywhere println("process: $(myid()) on host $(gethostname())")
The message that concerns us is this one:
nested task error: IOError: could not spawn `srun -J julia-31840 -n 2 -o /home/u770885/Noel/./julia-31840-16189562422-%4t.out -D /home/u770885/Noel --partition=slow --nodes=1 /opt/julia/bin/julia --worker=hPfJL8h1rY8YWH7D`: no such file or directory (ENOENT)
The complete error is at the end.
What I think is happening is this: ClusterManagers.jl tries to run /opt/julia/bin/julia on the node itself, not inside my Singularity image, so the path does not exist and srun fails with ENOENT.
First of all, am I right?
Second, does anybody have a simple solution? A sketch of the kind of workaround I have in mind is below.
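
Something I have not tested, just to illustrate what I am hoping for: point the worker launch at a wrapper that re-enters the container. This assumes addprocs_slurm forwards Distributed's standard exename keyword into the srun command it builds; the wrapper path /scratch/julia_in_container.sh is only a made-up name.

using ClusterManagers

# Untested sketch: start the workers through a wrapper script instead of the
# bare /opt/julia/bin/julia path (which only exists inside the image).
# Assumes `exename` is passed through to the generated
# `srun ... <exename> --worker=...` command.
addprocs_slurm(2, nodes=1, partition="slow",
               exename="/scratch/julia_in_container.sh")

# where /scratch/julia_in_container.sh would contain something like:
#   #!/bin/bash
#   exec singularity exec --bind=/scratch:/scratch work.simg /opt/julia/bin/julia "$@"

@everywhere println("process: $(myid()) on host $(gethostname())")
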
Thank you for your attention.
Error launching Slurm job:
ERROR: LoadError: TaskFailedException
Stacktrace:
[1] wait
@ ./task.jl:317 [inlined]
[2] addprocs_locked(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:nodes, :partition), Tuple{Int64, String}}})
@ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:497
[3] addprocs(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:nodes, :partition), Tuple{Int64, String}}})
@ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
[4] #addprocs_slurm#14
@ ~/.julia/packages/ClusterManagers/Mq0H0/src/slurm.jl:100 [inlined]
[5] top-level scope
@ ~/Noel/julia_parallel_test.jl:2
nested task error: IOError: could not spawn `srun -J julia-31840 -n 2 -o /home/u770885/Noel/./julia-31840-16189562422-%4t.out -D /home/u770885/Noel --partition=slow --nodes=1 /opt/julia/bin/julia --worker=hPfJL8h1rY8YWH7D`: no such file or directory (ENOENT)
Stacktrace:
[1] _spawn_primitive(file::String, cmd::Cmd, stdio::Vector{Any})
@ Base ./process.jl:99
[2] #637
@ ./process.jl:112 [inlined]
[3] setup_stdios(f::Base.var"#637#638"{Cmd}, stdios::Vector{Any})
@ Base ./process.jl:196
[4] _spawn
@ ./process.jl:111 [inlined]
[5] open(cmds::Cmd, stdio::Base.DevNull; write::Bool, read::Bool)
@ Base ./process.jl:370
[6] open (repeats 2 times)
@ ./process.jl:361 [inlined]
[7] launch(manager::SlurmManager, params::Dict{Symbol, Any}, instances_arr::Vector{Distributed.WorkerConfig}, c::Condition)
@ ClusterManagers ~/.julia/packages/ClusterManagers/Mq0H0/src/slurm.jl:55
[8] (::Distributed.var"#39#42"{SlurmManager, Condition, Vector{Distributed.WorkerConfig}, Dict{Symbol, Any}})()
@ Distributed ./task.jl:406
in expression starting at /home/u770885/Noel/julia_parallel_test.jl:2
srun: error: c2: task 0: Exited with exit code 1