
Singularity images do not work with SLURM #57

@NoelAraujo

Description

I tried to run a Singularity container on an HPC cluster; my goal is to run across many nodes. However, I cannot even create a single process. The lines below show how I submit my job:

(...)
#SBATCH --nodes=1     
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=slow

(...)
srun singularity exec \
     --bind=/scratch:/scratch \
     --bind=/var/spool/slurm:/var/spool/slurm \
     work.simg /opt/julia/bin/julia julia_parallel_test.jl

work.simg is my Singularity image; on my own computer I am sure everything works fine.
/opt/julia/bin/julia is the path to the Julia executable inside the image.
julia_parallel_test.jl is the simplest code I could come up with:

using Distributed, ClusterManagers  # Distributed provides @everywhere and myid
addprocs_slurm(2, nodes=1, partition="slow")
@everywhere println("process: $(myid()) on host $(gethostname())")

The message that concerns us is here:

nested task error: IOError: could not spawn `srun -J julia-31840 -n 2 -o /home/u770885/Noel/./julia-31840-16189562422-%4t.out -D /home/u770885/Noel --partition=slow --nodes=1 /opt/julia/bin/julia --worker=hPfJL8h1rY8YWH7D`: no such file or directory (ENOENT)

The complete error is at the end.

What I think is happening is this: ClusterManagers.jl wants to run /opt/julia/bin/julia on the node itself, not inside my Singularity image.
First of all, am I right?
Second, does anybody have a simple solution?
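
To make the question concrete, here is a minimal sketch of the kind of workaround I imagine, assuming that addprocs_slurm passes the standard exename keyword from Distributed.addprocs through to the srun command it builds (the wrapper path below is hypothetical and would have to exist on the shared filesystem):

using Distributed, ClusterManagers

# Hypothetical wrapper script, e.g. /scratch/julia_in_container.sh, marked executable:
#   #!/bin/bash
#   exec singularity exec --bind /scratch:/scratch work.simg /opt/julia/bin/julia "$@"
# The idea: srun launches the wrapper on each node, and the wrapper starts Julia
# inside the container instead of looking for /opt/julia/bin/julia on the host.

addprocs_slurm(2;
               nodes=1,
               partition="slow",
               exename="/scratch/julia_in_container.sh")  # assumed to replace the default Julia path in the srun command

@everywhere println("process: $(myid()) on host $(gethostname())")

I do not know whether this is the intended way to combine ClusterManagers.jl with a container, which is why I am asking.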

Thank you for your attention.


Error launching Slurm job:
ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] addprocs_locked(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:nodes, :partition), Tuple{Int64, String}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:497
 [3] addprocs(manager::SlurmManager; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:nodes, :partition), Tuple{Int64, String}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [4] #addprocs_slurm#14
   @ ~/.julia/packages/ClusterManagers/Mq0H0/src/slurm.jl:100 [inlined]
 [5] top-level scope
   @ ~/Noel/julia_parallel_test.jl:2

    nested task error: IOError: could not spawn `srun -J julia-31840 -n 2 -o /home/u770885/Noel/./julia-31840-16189562422-%4t.out -D /home/u770885/Noel --partition=slow --nodes=1 /opt/julia/bin/julia --worker=hPfJL8h1rY8YWH7D`: no such file or directory (ENOENT)
    Stacktrace:
     [1] _spawn_primitive(file::String, cmd::Cmd, stdio::Vector{Any})
       @ Base ./process.jl:99
     [2] #637
       @ ./process.jl:112 [inlined]
     [3] setup_stdios(f::Base.var"#637#638"{Cmd}, stdios::Vector{Any})
       @ Base ./process.jl:196
     [4] _spawn
       @ ./process.jl:111 [inlined]
     [5] open(cmds::Cmd, stdio::Base.DevNull; write::Bool, read::Bool)
       @ Base ./process.jl:370
     [6] open (repeats 2 times)
       @ ./process.jl:361 [inlined]
     [7] launch(manager::SlurmManager, params::Dict{Symbol, Any}, instances_arr::Vector{Distributed.WorkerConfig}, c::Condition)
       @ ClusterManagers ~/.julia/packages/ClusterManagers/Mq0H0/src/slurm.jl:55
     [8] (::Distributed.var"#39#42"{SlurmManager, Condition, Vector{Distributed.WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:406
in expression starting at /home/u770885/Noel/julia_parallel_test.jl:2
srun: error: c2: task 0: Exited with exit code 1
