-o argument in addprocs_slurm leads to an error #185

Open
stasis0 opened this issue Nov 10, 2022 · 0 comments

Hello everyone,

To add workers and schedule jobs on the cluster, I'm using the addprocs_slurm functionality (SlurmManager) from ClusterManagers:

slurm_cpus = 4
@async addprocs(SlurmManager(slurm_cpus), partition="all", t="00:10:0")

It works as intended:

Task (runnable) @0x00002b8be08c5cd0connecting to worker 1 out of 4

srun: job 13332841 queued and waiting for resources

julia> srun: job 13332841 has been allocated resources
connecting to worker 2 out of 4
connecting to worker 3 out of 4
connecting to worker 4 out of 4

However, each worker writes its own output file, so with many workers the working directory fills up with output files. To avoid this, I added the -o argument so that everything is logged into a single file:

slurm_cpus = 4
@async addprocs(SlurmManager(slurm_cpus), partition="all", t="00:10:0", o="log.out")

This does create the log file:

julia_worker:9007#131.169.193.109
julia_worker:9006#131.169.193.109
julia_worker:9008#131.169.193.109
julia_worker:9009#131.169.193.109
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

but I do not get any workers:

Task (runnable) @0x00002b8be01f7260connecting to worker 1 out of 4

srun: job 13332876 queued and waiting for resources

julia> srun: job 13332876 has been allocated resources
srun: error: max-wn009: tasks 0-3: Exited with exit code 1

I had a look at the source code. If I understand correctly, it always passes its own values for -o and -D to srun, regardless of what I specify. Maybe that is what causes the trouble:

jobname = "julia-$(getpid())"
job_output_name = "$(jobname)-$(trunc(Int, Base.time() * 10))"
make_job_output_path(task_num) = joinpath(job_file_loc, "$(job_output_name)-$(task_num).out")
job_output_template = make_job_output_path("%4t")
srun_cmd = `srun -J $jobname -n $np -o "$(job_output_template)" -D $exehome $(srunargs) $exename $exeflags $(worker_arg())`
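
To illustrate what I suspect happens, here is a minimal sketch that builds the same kind of command outside the package. The values of np, exehome, job_file_loc and the contents of srunargs are my assumptions (in particular, I am guessing that the keyword o="log.out" is turned into the pair -o log.out); nothing is submitted to Slurm, the command is only constructed and inspected.

```julia
# Stand-in values; these are assumptions, not taken from ClusterManagers itself.
jobname = "julia-$(getpid())"
np = 4
job_file_loc = pwd()
exehome = pwd()
job_output_template = joinpath(job_file_loc, "$(jobname)-%4t.out")

# Assumed expansion of partition="all", t="00:10:0", o="log.out" into srun flags.
srunargs = ["--partition=all", "-t", "00:10:0", "-o", "log.out"]

# Same shape as the srun_cmd line above; the worker invocation is a placeholder.
srun_cmd = `srun -J $jobname -n $np -o $job_output_template -D $exehome $srunargs julia --worker`

args = collect(srun_cmd)                 # the individual command-line words
n_output_flags = count(==("-o"), args)   # how many times -o is passed
last_o = findlast(==("-o"), args)
last_output_target = args[last_o + 1]    # the value that follows the last -o

println(srun_cmd)
println("number of -o flags: ", n_output_flags)
println("last -o target: ", last_output_target)
```

With these assumed values the command contains -o twice, and the last one points to log.out. Since log.out is created and contains the julia_worker:port#host lines, it looks like my -o wins, and the output no longer lands in the per-task files named by job_output_template. If the manager expects to find the worker lines in those files, that might explain why the master never connects, but I have not verified this.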