I have an issue that is hard to reproduce locally, but that I would like to share here in case it can be addressed without an MWE. The issue occurs in an HPC cluster that uses the LSF job scheduler. In order to distribute Julia processes in this cluster, we had to work around the SSH communication that Julia uses as follows:
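We needed to use the `blaunch` drop-in replacement for `ssh`. We created a file `~/bin/ssh` for this purpose, gave it execution permission with `chmod +x ~/bin/ssh`, and added it to the `PATH` before the actual `ssh` command in the system. To run the distributed app, we created a submission script `launch.sh`.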
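For reference, shadowing `ssh` with a file placed earlier in the `PATH` can be sketched as follows. This is only an illustration under stated assumptions, not the original file: the wrapper body (`exec blaunch "$@"`), the temporary directory used in place of `~/bin`, and the `install_ssh_wrapper` helper are hypothetical. A real wrapper may need to drop or translate `ssh`-specific options before calling `blaunch`.

```shell
#!/usr/bin/env bash
# Sketch: make a wrapper named "ssh" win the PATH lookup over the system ssh.
# ASSUMPTION: the wrapper body is illustrative, not the file from the report.

install_ssh_wrapper() {
    local bindir="$1"
    mkdir -p "$bindir"
    # Write a wrapper that forwards all of its arguments to blaunch.
    cat > "$bindir/ssh" <<'EOF'
#!/usr/bin/env bash
# Hypothetical wrapper: hand every argument to blaunch unchanged.
exec blaunch "$@"
EOF
    # Give the wrapper execution permission.
    chmod +x "$bindir/ssh"
}

# A temporary directory stands in for ~/bin so the sketch has no side effects.
demo_dir="$(mktemp -d)"
install_ssh_wrapper "$demo_dir"

# Prepend the directory so it is searched before the system ssh.
PATH="$demo_dir:$PATH"
hash -r  # forget any previously cached location of ssh

resolved_ssh="$(command -v ssh)"
echo "ssh now resolves to: $resolved_ssh"
```

The final `command -v ssh` check confirms which file is picked up first; the same check on a compute node would show whether the batch environment inherits the modified `PATH`.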
The script works fine with 50 processes, but always prints a warning at the end:
┌ Warning: Forcibly interrupting busy workers
│ exception = rmprocs: pids [2, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17, 18, 19, 21, 22, 24, 29] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1234
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1030
From worker 4: Activating environment at `~/dataprep/Project.toml`
From worker 4: Precompiling From worker 22: Activating environment at `~/dataprep/Project.toml`
From worker 22: Precompiling project...
From worker 20: Activating environment at `~/dataprep/Project.toml`
From worker 20: Precompiling project...
From worker 23: Activating environment at `~/dataprep/Project.toml`
Julia attempts to remove processes from the pool and fails to do so. If we request more Julia processes, for example 100, the issue is more serious: Julia crashes with the following bus error:
From worker 97: signal (7): Bus error
From worker 97:in expression starting at none:0
From worker 97: __memmove_ssse3_back at /lib64/libc.so.6 (unknown line)
From worker 97: _ZN4llvm15RuntimeDyldImpl11emitSectionERKNS_6object10ObjectFileERKNS1_10SectionRefEb at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
From worker 97: _ZN4llvm15RuntimeDyldImpl17findOrEmitSectionERKNS_6object10ObjectFileERKNS1_10SectionRefEbRSt3mapIS5_jSt4lessIS5_ESaISt4pairIS6_jEEE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
From worker 97: _ZN4llvm15RuntimeDyldImpl14loadObjectImplERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
From worker 97: _ZN4llvm14RuntimeDyldELF10loadObjectERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
From worker 97: _ZN4llvm11RuntimeDyld10loadObjectERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
From worker 97: finalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:242
From worker 97: emitAndFinalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:462 [inlined]
From worker 97: emitAndFinalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/IRCompileLayer.h:127 [inlined]
From worker 97: addModule at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:651
From worker 97: jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:893 [inlined]
From worker 97: jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:955
From worker 97: jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:977 [inlined]
From worker 97: _jl_compile_codeinst at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:126
From worker 97: jl_generate_fptr at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:302
From worker 97: jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1964
From worker 97: jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1919 [inlined]
From worker 97: _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2224 [inlined]
From worker 97: jl_gf_invoke_by_method at /buildworker/worker/package_linux64/build/src/gf.c:2482 [inlined]
From worker 97: jl_gf_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2447
From worker 97: jl_f_invoke at /buildworker/worker/package_linux64/build/src/builtins.c:1019
From worker 97: _invoked_shouldlog at ./logging.jl:78 [inlined]
From worker 97:macro expansion at ./logging.jl:327 [inlined]
From worker 97:#17 at /u/juliohm/dataprep/era5.jl:96
From worker 97:#43 at /u/juliohm/.julia/packages/ProgressMeter/OUQkp/src/ProgressMeter.jl:810
From worker 97: unknown function (ip:0x2adbdc4ec0af)
From worker 97: _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
From worker 97: jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
From worker 97: jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
From worker 97: do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:655
From worker 97:#106 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294
From worker 97: run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
From worker 97:macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294 [inlined]
From worker 97:#105 at ./task.jl:356
From worker 97: unknown function (ip:0x2adbdc4bdc2c)
From worker 97: _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
From worker 97: jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
From worker 97: jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
From worker 97: start_task at /buildworker/worker/package_linux64/build/src/task.c:707
From worker 97: unknown function (ip: (nil))
From worker 97: Allocations:47054843 (Pool:47048744; Big:6099); GC:63
From worker 90:
From worker 90: signal (7): Bus error
From worker 90:in expression starting at none:0
From worker 90: __memmove_ssse3_back at /lib64/libc.so.6 (unknown line)
From worker 90: _ZN4llvm15RuntimeDyldImpl11emitSectionERKNS_6object10ObjectFileERKNS1_10SectionRefEb at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
From worker 90: _ZN4llvm15RuntimeDyldImpl17findOrEmitSectionERKNS_6object10ObjectFileERKNS1_10SectionRefEbRSt3mapIS5_jSt4lessIS5_ESaISt4pairIS6_jEEE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
From worker 90: _ZN4llvm15RuntimeDyldImpl14loadObjectImplERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
From worker 90: _ZN4llvm14RuntimeDyldELF10loadObjectERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
From worker 90: _ZN4llvm11RuntimeDyld10loadObjectERKNS_6object10ObjectFileE at /u/juliohm/julia-1.5.0/bin/../lib/julia/libLLVM-9jl.so (unknown line)
From worker 90: finalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:242
From worker 90: emitAndFinalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:462 [inlined]
From worker 90: emitAndFinalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/IRCompileLayer.h:127 [inlined]
From worker 90: addModule at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:651
From worker 90: jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:893 [inlined]
From worker 90: jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:955
From worker 90: jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:977 [inlined]
From worker 90: _jl_compile_codeinst at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:126
From worker 90: jl_generate_fptr at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:302
From worker 90: jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1964
From worker 90: jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1919 [inlined]
From worker 90: _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2224 [inlined]
From worker 90: jl_gf_invoke_by_method at /buildworker/worker/package_linux64/build/src/gf.c:2482 [inlined]
From worker 90: jl_gf_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2447
From worker 90: jl_f_invoke at /buildworker/worker/package_linux64/build/src/builtins.c:1019
From worker 90: _invoked_shouldlog at ./logging.jl:78 [inlined]
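With many workers interleaving their output, it can help to list which workers reported a fatal signal. The following is a rough sketch, assuming the job output was captured to a file; the `summarize_signals` helper and the sample log are hypothetical and only imitate the `From worker N: signal (S): ...` lines shown above.

```shell
#!/usr/bin/env bash
# Sketch: list the workers that reported a fatal signal in captured job output.
# ASSUMPTION: output was saved to a file; the sample imitates the log format.

summarize_signals() {
    local logfile="$1"
    # Keep only the "signal (N)" lines, reduce each one to
    # "<worker id> <signal text>", then sort by worker id and drop duplicates.
    grep -E 'From worker [0-9]+:[[:space:]]*signal \([0-9]+\)' "$logfile" \
        | sed -E 's/.*From worker ([0-9]+):[[:space:]]*(signal \([0-9]+\): .*)$/\1 \2/' \
        | sort -n | uniq
}

# Small stand-in for the real job output.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
      From worker 97: signal (7): Bus error
      From worker 97: in expression starting at none:0
      From worker 90: signal (7): Bus error
      From worker 90: in expression starting at none:0
EOF

signal_summary="$(summarize_signals "$sample_log")"
echo "$signal_summary"
rm -f "$sample_log"
```

Comparing the resulting worker list with the hosts those workers ran on may show whether the crashes cluster on particular nodes.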
In this cluster we have a GPFS filesystem, and I was told that writing NetCDF (HDF5) files in parallel can be problematic (mmap?). I was also told that xarray patched their IO to handle parallel writing of NetCDF, so maybe this bus error specifically is not related to Julia's Distributed.
However, the warning about `rmprocs` is still there, and I wonder if the workaround with `blaunch` may somehow be causing these warnings and errors. I appreciate your input, and sorry again for not being able to produce an MWE. The issue only happens at scale, with 100 processes in parallel and terabytes of data to process.
cc: @vchuravy @simonbyrne