Deadlock during Julia image generation #54200

Open
vchuravy opened this issue Apr 22, 2024 · 2 comments
Labels
codegen Generation of LLVM IR and native code


@vchuravy

I recently observed a deadlock that seems to occur when we attempt to JIT-compile a function during the emission of Julia code.

LLVM.jl installs an error handler that roughly looks like this:

function handle_error(reason::Cstring)
    throw(LLVMException(unsafe_string(reason)))
end

function _install_handlers()
    handler = @cfunction(handle_error, Cvoid, (Cstring,))
    ccall((:LLVMInstallFatalErrorHandler, libllvm), Cvoid, (Ptr{Cvoid},), handler)
end

Using the profiler to get a backtrace:

cmd: /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/julia 18641 running 2 of 2

signal (10): User defined signal 1
unknown function (ip: 0x7c1c1496f10e)
pthread_mutex_lock at /usr/lib/libc.so.6 (unknown line)
__gthread_mutex_lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/x86_64-linux-gnu/bits/gthr-default.h:749 [inlined]
__gthread_recursive_mutex_lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/x86_64-linux-gnu/bits/gthr-default.h:811 [inlined]
lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/mutex:106 [inlined]
lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/unique_lock.h:141 [inlined]
unique_lock at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/unique_lock.h:71 [inlined]
Lock at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/include/llvm/ExecutionEngine/Orc/ThreadSafeModule.h:42 [inlined]
getLock at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/include/llvm/ExecutionEngine/Orc/ThreadSafeModule.h:69
jl_codegen_params_t at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jitlayers.h:258 [inlined]
_jl_compile_codeinst at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jitlayers.cpp:213
jl_generate_fptr_impl at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jitlayers.cpp:528
jl_compile_method_internal at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:2534 [inlined]
jl_compile_method_internal at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:2421
_jl_invoke at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:2938 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/gf.c:3123
handle_error at /home/vchuravy/.julia/packages/LLVM/bzSzE/src/core/context.jl:168
jfptr_handle_error_5213 at /home/vchuravy/.julia/compiled/v1.11/LLVM/e8NBy_INkA2.so (unknown line)
jlcapi_handle_error_5773 at /home/vchuravy/.julia/compiled/v1.11/LLVM/e8NBy_INkA2.so (unknown line)
_ZN4llvm18report_fatal_errorERKNS_5TwineEb at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel15CannotYetSelectEPNS_6SDNodeE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel16SelectCodeCommonEPNS_6SDNodeEPKhj at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN12_GLOBAL__N_115X86DAGToDAGISel6SelectEPN4llvm6SDNodeE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel22DoInstructionSelectionEv at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel17CodeGenAndEmitDAGEv at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20SelectAllBasicBlocksERKNS_8FunctionE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE.part.0 at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN12_GLOBAL__N_115X86DAGToDAGISel20runOnMachineFunctionERN4llvm15MachineFunctionE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm19MachineFunctionPass13runOnFunctionERNS_8FunctionE.part.0 at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at /home/vchuravy/.julia/juliaup/julia-1.11.0-beta1+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
add_output_impl at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1171
operator() at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1477
operator() at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/std_function.h:690 [inlined]
lambda_trampoline at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1347
unknown function (ip: 0x7c1c14972559)
unknown function (ip: 0x7c1c149efa3b)
unknown function (ip: (nil))
unknown function (ip: 0x7c1c1496eebc)
unknown function (ip: 0x7c1c149740e2)
uv_thread_join at /workspace/srcdir/libuv/src/unix/thread.c:294
add_output<jl_dump_native_impl(void*, char const*, char const*, char const*, char const*, ios_t*, ios_t*, jl_emission_params_t*)::<lambda(llvm::Module&)> > at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1485
operator()<jl_dump_native_impl(void*, char const*, char const*, char const*, char const*, ios_t*, ios_t*, jl_emission_params_t*)::<lambda(llvm::Module&)> > at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1645 [inlined]
jl_dump_native_impl at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/aotcompile.cpp:1790
ijl_write_compiler_output at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/precompile.c:168
ijl_atexit_hook at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/init.c:285
jl_repl_entrypoint at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/jlapi.c:1060
main at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
unknown function (ip: 0x7c1c1490cccf)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
unknown function (ip: (nil))

My hypothesis is that the two locks involved are:

Lock at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/include/llvm/ExecutionEngine/Orc/ThreadSafeModule.h:42 

and

julia/src/aotcompile.cpp, lines 1785 to 1786 in 08e1fc0:

auto lock = TSCtx.getLock();
auto dataM = data->M.getModuleUnlocked();

and that we end up re-using the context and therefore the lock.
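To make the hypothesis concrete, here is a minimal sketch of the suspected pattern, not Julia's actual code. The trace shows a recursive mutex, which only allows re-entry from the owning thread, so a worker thread still blocks if the thread that joins it holds the lock. The names `context_lock`, `worker_try_compile`, and `emit_image` are hypothetical, and a timed `try_lock_for` stands in for the real blocking `lock()` so that the sketch terminates instead of hanging.

```cpp
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the lock that guards a shared LLVM context.
// A recursive mutex still blocks when a *different* thread owns it.
static std::recursive_timed_mutex context_lock;

// Stand-in for the error-handler path: the worker needs the context lock
// before it can JIT-compile the handler. Returns true if the lock was taken.
bool worker_try_compile(std::chrono::milliseconds timeout) {
    std::unique_lock<std::recursive_timed_mutex> guard(context_lock, std::defer_lock);
    return guard.try_lock_for(timeout);
}

// Stand-in for the emitting thread: optionally holds the context lock
// while it joins the worker, as hypothesized in this issue.
bool emit_image(bool hold_lock_during_join) {
    bool worker_got_lock = false;
    std::unique_lock<std::recursive_timed_mutex> guard(context_lock, std::defer_lock);
    if (hold_lock_during_join) {
        guard.lock();
    }
    std::thread worker([&] {
        worker_got_lock = worker_try_compile(std::chrono::milliseconds(100));
    });
    // With a plain lock() in the worker, this join would never return
    // whenever the emitting thread still owns the lock.
    worker.join();
    return worker_got_lock;
}
```

Calling `emit_image(true)` reports that the worker could not take the lock, which is the situation that would hang with an untimed `lock()`. Calling `emit_image(false)` lets the worker proceed.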

@pchintalapudi any thoughts?

@vchuravy vchuravy added the codegen Generation of LLVM IR and native code label Apr 22, 2024
@gbaraldi

The thing I'm a bit puzzled about is that we sometimes use getContext and sometimes acquireContext, and acquireContext seems more correct?

@pchintalapudi

getContext will automatically return the context to the pool of contexts when its object is destroyed, while acquireContext should be paired with a releaseContext.
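The difference between the two styles can be sketched roughly as follows. This is an illustration only, not the code in jitlayers: `ContextPool`, `Context`, and `Handle` are hypothetical names, the pool is single-threaded, and it assumes it is never empty when a context is requested.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a thread-safe LLVM context.
struct Context { int id; };

// Hypothetical pool; not thread-safe, and acquire() assumes it is non-empty.
class ContextPool {
    std::vector<Context> free_;
public:
    explicit ContextPool(int n) {
        for (int i = 0; i < n; ++i) free_.push_back(Context{i});
    }
    std::size_t available() const { return free_.size(); }

    // Manual style: every acquire() must be paired with a release().
    Context acquire() {
        Context c = free_.back();
        free_.pop_back();
        return c;
    }
    void release(Context c) { free_.push_back(c); }

    // RAII style: the handle returns its context to the pool on destruction.
    class Handle {
        ContextPool &pool_;
        Context ctx_;
    public:
        explicit Handle(ContextPool &p) : pool_(p), ctx_(p.acquire()) {}
        ~Handle() { pool_.release(ctx_); }
        Handle(const Handle &) = delete;
        Handle &operator=(const Handle &) = delete;
        const Context &get() const { return ctx_; }
    };
};
```

In this sketch, a `ContextPool::Handle h(pool);` declared in a scope gives the context back when the scope ends, including on early return or exception. The manual pair leaks the context if `release` is skipped.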

Also, I think it's wrong to trigger additional compilation from within ORC itself; I'm pretty sure there are some assumptions made about not touching the runtime within the addModule/lookup calls for thread-safety purposes.
