Failing to rmprocs in distributed environment #37526

Closed
juliohm opened this issue Sep 11, 2020 · 1 comment
Labels
domain:parallelism Parallel or distributed computation

Comments

juliohm commented Sep 11, 2020

I have an issue that is hard to reproduce locally, but I would like to share it here in case it can be addressed without an MWE. The issue occurs on an HPC cluster that uses the LSF job scheduler. To distribute Julia processes on this cluster, we had to work around the SSH communication that Julia uses, as follows:

We needed to use the blaunch drop-in replacement for ssh. We created a file ~/bin/ssh with the following contents:

#!/bin/sh

exec /opt/share/lsf-9.1.3/10.1/linux3.10-glibc2.17-x86_64/bin/blaunch -use-login-shell "$@"

and made it executable with chmod +x ~/bin/ssh. Finally, we put its directory on the PATH ahead of the system's actual ssh command:

export PATH=$HOME/bin:$PATH
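
One way to confirm that the wrapper, rather than the system ssh, is the one that gets picked up is to check how the shell resolves `ssh` after the PATH change. The following is a minimal sketch, not part of the original setup: it uses a throwaway directory and a dummy wrapper instead of the real `~/bin/ssh`, and the helper name `first_in_path` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the path the shell would execute for a command name.
first_in_path() {
    command -v "$1"
}

# Demonstration in a throwaway directory, so the real ~/bin is left untouched.
demo_dir=$(mktemp -d)

# Dummy stand-in for the blaunch wrapper; it only echoes its arguments.
printf '#!/bin/sh\necho "wrapper called with: $*"\n' > "$demo_dir/ssh"
chmod +x "$demo_dir/ssh"

# Same pattern as above: the wrapper's directory goes first on PATH.
PATH="$demo_dir:$PATH"
export PATH

# Prints the wrapper's path if it shadows the system ssh.
first_in_path ssh
```

Running `first_in_path ssh` in the job environment itself (after the real `export PATH=$HOME/bin:$PATH`) should print the path of the wrapper if the shadowing works as intended.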

To run the distributed app, we created a submission script launch.sh:

#!/bin/bash
#BSUB -J TESTING
#BSUB -q x86_6h
#BSUB -n 50
#BSUB -M 300000
#BSUB -o log.txt

julia --machine-file=$LSB_DJOB_HOSTFILE main.jl
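
Because the hostfile is passed straight to `--machine-file`, it may help to check how many worker slots each host contributes before launching Julia. The sketch below assumes that the file named by `LSB_DJOB_HOSTFILE` lists one hostname per line, repeated once per slot; that is an assumption about the LSF setup, not something stated in this report. The helper name `summarize_hostfile` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count how many times each host appears in a hostfile.
# Assumes one hostname per line, repeated once per allocated slot.
summarize_hostfile() {
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 " slot(s)" }'
}

# Only run inside a job where the scheduler has set the variable.
if [ -n "${LSB_DJOB_HOSTFILE:-}" ] && [ -r "$LSB_DJOB_HOSTFILE" ]; then
    summarize_hostfile "$LSB_DJOB_HOSTFILE"
fi
```

Adding this before the `julia` line in launch.sh would record the host layout in log.txt, which makes it possible to see later whether the workers that fail to terminate share a host.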

The script works fine with 50 processes, but always prints a warning at the end:

┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [2, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17, 18, 19, 21, 22, 24, 29] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1234
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1030
      From worker 4:	 Activating environment at `~/dataprep/Project.toml`
      From worker 4:	Precompiling      From worker 22:	 Activating environment at `~/dataprep/Project.toml`
      From worker 22:	Precompiling project...
      From worker 20:	 Activating environment at `~/dataprep/Project.toml`
      From worker 20:	Precompiling project...
      From worker 23:	 Activating environment at `~/dataprep/Project.toml`

Julia attempts to remove the processes from the pool and fails to do so. If we request more Julia processes, for example 100, the problem is more serious and Julia crashes with the following bus error:

From worker 97:	signal (7): Bus error
      From worker 97:	in expression starting at none:0
      From worker 97:	__memmove_ssse3_back at /lib64/libc.so.6 (unknown line)
      From worker 97:	_ZN4llvm15RuntimeDyldImpl11emitSectionERKNS_6object10ObjectFileERKNS1_10SectionRefEb at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
      From worker 97:	_ZN4llvm15RuntimeDyldImpl17findOrEmitSectionERKNS_6object10ObjectFileERKNS1_10SectionRefEbRSt3mapIS5_jSt4lessIS5_ESaISt4pairIS6_jEEE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
      From worker 97:	_ZN4llvm15RuntimeDyldImpl14loadObjectImplERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
      From worker 97:	_ZN4llvm14RuntimeDyldELF10loadObjectERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
      From worker 97:	_ZN4llvm11RuntimeDyld10loadObjectERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
      From worker 97:	finalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:242
      From worker 97:	emitAndFinalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:462 [inlined]
      From worker 97:	emitAndFinalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/IRCompileLayer.h:127 [inlined]
      From worker 97:	addModule at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:651
      From worker 97:	jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:893 [inlined]
      From worker 97:	jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:955
      From worker 97:	jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:977 [inlined]
      From worker 97:	_jl_compile_codeinst at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:126
      From worker 97:	jl_generate_fptr at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:302
      From worker 97:	jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1964
      From worker 97:	jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1919 [inlined]
      From worker 97:	_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2224 [inlined]
      From worker 97:	jl_gf_invoke_by_method at /buildworker/worker/package_linux64/build/src/gf.c:2482 [inlined]
      From worker 97:	jl_gf_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2447
      From worker 97:	jl_f_invoke at /buildworker/worker/package_linux64/build/src/builtins.c:1019
      From worker 97:	_invoked_shouldlog at ./logging.jl:78 [inlined]
      From worker 97:	macro expansion at ./logging.jl:327 [inlined]
      From worker 97:	#17 at /u/juliohm/dataprep/era5.jl:96
      From worker 97:	#43 at /u/juliohm/.julia/packages/ProgressMeter/OUQkp/src/ProgressMeter.jl:810
      From worker 97:	unknown function (ip: 0x2adbdc4ec0af)
      From worker 97:	_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
      From worker 97:	jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
      From worker 97:	jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
      From worker 97:	do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:655
      From worker 97:	#106 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294
      From worker 97:	run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
      From worker 97:	macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294 [inlined]
      From worker 97:	#105 at ./task.jl:356
      From worker 97:	unknown function (ip: 0x2adbdc4bdc2c)
      From worker 97:	_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
      From worker 97:	jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
      From worker 97:	jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
      From worker 97:	start_task at /buildworker/worker/package_linux64/build/src/task.c:707
      From worker 97:	unknown function (ip: (nil))
      From worker 97:	Allocations: 47054843 (Pool: 47048744; Big: 6099); GC: 63
      From worker 90:	
      From worker 90:	signal (7): Bus error
      From worker 90:	in expression starting at none:0
      From worker 90:	__memmove_ssse3_back at /lib64/libc.so.6 (unknown line)
      From worker 90:	_ZN4llvm15RuntimeDyldImpl11emitSectionERKNS_6object10ObjectFileERKNS1_10SectionRefEb at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
      From worker 90:	_ZN4llvm15RuntimeDyldImpl17findOrEmitSectionERKNS_6object10ObjectFileERKNS1_10SectionRefEbRSt3mapIS5_jSt4lessIS5_ESaISt4pairIS6_jEEE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
      From worker 90:	_ZN4llvm15RuntimeDyldImpl14loadObjectImplERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
      From worker 90:	_ZN4llvm14RuntimeDyldELF10loadObjectERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
      From worker 90:	_ZN4llvm11RuntimeDyld10loadObjectERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
      From worker 90:	finalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:242
      From worker 90:	emitAndFinalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:462 [inlined]
      From worker 90:	emitAndFinalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/IRCompileLayer.h:127 [inlined]
      From worker 90:	addModule at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:651
      From worker 90:	jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:893 [inlined]
      From worker 90:	jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:955
      From worker 90:	jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:977 [inlined]
      From worker 90:	_jl_compile_codeinst at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:126
      From worker 90:	jl_generate_fptr at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:302
      From worker 90:	jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1964
      From worker 90:	jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1919 [inlined]
      From worker 90:	_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2224 [inlined]
      From worker 90:	jl_gf_invoke_by_method at /buildworker/worker/package_linux64/build/src/gf.c:2482 [inlined]
      From worker 90:	jl_gf_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2447
      From worker 90:	jl_f_invoke at /buildworker/worker/package_linux64/build/src/builtins.c:1019
      From worker 90:	_invoked_shouldlog at ./logging.jl:78 [inlined]
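
With output from many workers interleaved, it can be easier to first list which workers reported the bus error. This is a minimal sketch, assuming the job output lands in the log.txt named by `#BSUB -o` and keeps the `From worker N:` prefix shown in the trace above; the helper name `crashed_workers` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the ids of workers that reported a bus error.
# Relies on the "From worker N:" prefix that precedes forwarded worker output.
crashed_workers() {
    grep 'signal (7): Bus error' "$1" |
        sed -n 's/.*From worker \([0-9][0-9]*\):.*/\1/p' |
        sort -n -u
}

# Example: inspect the job log if it exists in the current directory.
if [ -r log.txt ]; then
    crashed_workers log.txt
fi
```

Comparing this list with the host layout of the job could show whether the crashes cluster on particular nodes.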

This cluster uses a GPFS filesystem, and I was told that writing NetCDF (HDF5) files in parallel can be problematic (mmap?). I was also told that xarray patched their IO to handle parallel writing of NetCDF, so this bus error specifically may not be related to Julia's Distributed.

However, the warning about rmprocs is still there, and I wonder whether the blaunch workaround may be causing these warnings and errors. I would appreciate your input, and sorry again for not being able to produce an MWE. The issue only happens at scale, with 100 processes in parallel and terabytes of data to process.

cc: @vchuravy @simonbyrne

juliohm commented Jul 23, 2023

Cannot reproduce anymore. Lost access to the cluster a long time ago. Closing it.

juliohm closed this as completed Jul 23, 2023