Deadlock with autodiff_deferred #1516

Open
vchuravy opened this issue Jun 6, 2024 · 0 comments

vchuravy commented Jun 6, 2024

@michel2323 has the best luck: autodiff_deferred can deadlock the whole process.

The two locks involved are:

  • Julia GC safepoint lock
  • LLVM trampoline lock

Backtraces of all threads at the time of the hang:

Thread 12 (Thread 0x7840dfe006c0 (LWP 45441) "julia"):
#0  0x000078411e0524e9 in ?? () from /usr/lib/libc.so.6
#1  0x000078411e054ed9 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x000078411d132fca in uv_cond_wait (cond=0x78411d52dfa0 <gc_threads_cond>, mutex=0x78411d52dfe0 <gc_threads_lock>) at src/unix/thread.c:883
#3  0x000078411d0979e5 in jl_gc_mark_threadfun (arg=<optimized out>) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/partr.c:135
#4  0x000078411e055ded in ?? () from /usr/lib/libc.so.6
(More stack frames follow...)

Thread 11 (Thread 0x7840ea2006c0 (LWP 45440) "julia"):
#0  0x000078411e0524e9 in ?? () from /usr/lib/libc.so.6
#1  0x000078411e054ed9 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x000078411d132fca in uv_cond_wait (cond=0x78411d52dfa0 <gc_threads_cond>, mutex=0x78411d52dfe0 <gc_threads_lock>) at src/unix/thread.c:883
#3  0x000078411d0979e5 in jl_gc_mark_threadfun (arg=<optimized out>) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/partr.c:135
#4  0x000078411e055ded in ?? () from /usr/lib/libc.so.6
(More stack frames follow...)

Thread 10 (Thread 0x7840eac006c0 (LWP 45439) "julia"):
#0  0x000078411e0524e9 in ?? () from /usr/lib/libc.so.6
#1  0x000078411e054ed9 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x000078411d132fca in uv_cond_wait (cond=0x78411d52dfa0 <gc_threads_cond>, mutex=0x78411d52dfe0 <gc_threads_lock>) at src/unix/thread.c:883
#3  0x000078411d0979e5 in jl_gc_mark_threadfun (arg=<optimized out>) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/partr.c:135
#4  0x000078411e055ded in ?? () from /usr/lib/libc.so.6
(More stack frames follow...)

Thread 9 (Thread 0x7840ebe006c0 (LWP 45438) "julia"):
#0  jl_gc_wait_for_the_world (gc_n_threads=<optimized out>, gc_all_tls_states=<optimized out>) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gc.c:241
#1  ijl_gc_collect (collection=collection@entry=JL_GC_AUTO) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gc.c:3508
#2  0x000078411d0a97ed in maybe_collect (ptls=0x7840d8000b70) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gc.c:937
#3  jl_gc_pool_alloc_inner (ptls=0x7840d8000b70, pool_offset=pool_offset@entry=752, osize=osize@entry=16) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gc.c:1293
#4  0x000078411d0a9925 in jl_gc_pool_alloc_noinline (ptls=<optimized out>, pool_offset=pool_offset@entry=752, osize=osize@entry=16) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gc.c:1350
(More stack frames follow...)

Thread 8 (Thread 0x7840fbe006c0 (LWP 45437) "julia"):
#0  0x000078411e0d6e9d in syscall () from /usr/lib/libc.so.6
#1  0x000078411d6da9e4 in std::__atomic_futex_unsigned_base::_M_futex_wait_until (this=<optimized out>, __addr=0x7840e0018160, __val=2147483648, __has_timeout=<optimized out>, __s=..., __ns=...) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/futex.cc:122
#2  0x0000784119a7fb42 in std::future<unsigned long>::get() () from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#3  0x0000784119a8040a in llvm::orc::LocalTrampolinePool<llvm::orc::OrcX86_64_SysV>::reenter(void*, void*) () from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#4  0x000078411e1bb044 in ?? ()
(More stack frames follow...)

Thread 7 (Thread 0x784101e006c0 (LWP 45436) "julia"):
#0  0x000078411e0d6e9d in syscall () from /usr/lib/libc.so.6
#1  0x000078411d6da9e4 in std::__atomic_futex_unsigned_base::_M_futex_wait_until (this=<optimized out>, __addr=0x7840e4003ff0, __val=2147483648, __has_timeout=<optimized out>, __s=..., __ns=...) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/futex.cc:122
#2  0x0000784119a7fb42 in std::future<unsigned long>::get() () from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#3  0x0000784119a8040a in llvm::orc::LocalTrampolinePool<llvm::orc::OrcX86_64_SysV>::reenter(void*, void*) () from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#4  0x000078411e1bb044 in ?? ()
(More stack frames follow...)

Thread 6 (Thread 0x7841028006c0 (LWP 45435) "julia"):
#0  0x000078411e0524e9 in ?? () from /usr/lib/libc.so.6
#1  0x000078411e054ed9 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x000078411d132fca in uv_cond_wait (cond=0x78411d5ca8c0 <safepoint_cond>, mutex=0x78411d5ca900 <safepoint_lock>) at src/unix/thread.c:883
#3  0x000078411d0b4735 in jl_safepoint_wait_gc () at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/safepoint.c:173
#4  0x000078411d0b455b in jl_set_gc_and_wait () at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/julia_internal.h:956
(More stack frames follow...)

Thread 5 (Thread 0x7841032006c0 (LWP 45434) "julia"):
#0  0x000078411e0524e9 in ?? () from /usr/lib/libc.so.6
#1  0x000078411e054ed9 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x000078411d132fca in uv_cond_wait (cond=0x78411d5ca8c0 <safepoint_cond>, mutex=0x78411d5ca900 <safepoint_lock>) at src/unix/thread.c:883
#3  0x000078411d0b4735 in jl_safepoint_wait_gc () at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/safepoint.c:173
#4  0x000078411d0b455b in jl_set_gc_and_wait () at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/julia_internal.h:956
(More stack frames follow...)

Thread 4 (Thread 0x784103c006c0 (LWP 45433) "julia"):
#0  0x000078411e0524e9 in ?? () from /usr/lib/libc.so.6
#1  0x000078411e054ed9 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x000078411d132fca in uv_cond_wait (cond=0x78411d5ca8c0 <safepoint_cond>, mutex=0x78411d5ca900 <safepoint_lock>) at src/unix/thread.c:883
#3  0x000078411d0b4735 in jl_safepoint_wait_gc () at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/safepoint.c:173
#4  0x000078411d0b455b in jl_set_gc_and_wait () at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/julia_internal.h:956
(More stack frames follow...)

Thread 3 (Thread 0x7841046006c0 (LWP 45432) "julia"):
#0  0x000078411e0524e9 in ?? () from /usr/lib/libc.so.6
#1  0x000078411e054ed9 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x000078411d132fca in uv_cond_wait (cond=0x78411d5ca8c0 <safepoint_cond>, mutex=0x78411d5ca900 <safepoint_lock>) at src/unix/thread.c:883
#3  0x000078411d0b4735 in jl_safepoint_wait_gc () at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/safepoint.c:173
#4  0x000078411d0b455b in jl_set_gc_and_wait () at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/julia_internal.h:956
(More stack frames follow...)

Thread 2 (Thread 0x784116a006c0 (LWP 45431) "julia"):
#0  0x000078411e000768 in sigtimedwait () from /usr/lib/libc.so.6
#1  0x000078411d0b2c3c in signal_listener (arg=<optimized out>) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/signals-unix.c:765
#2  0x000078411e055ded in ?? () from /usr/lib/libc.so.6
#3  0x000078411e0d90dc in ?? () from /usr/lib/libc.so.6

Thread 1 (Thread 0x78411df9cd00 (LWP 45429) "julia"):
#0  0x000078411e0524e9 in ?? () from /usr/lib/libc.so.6
#1  0x000078411e054ed9 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x000078411d132fca in uv_cond_wait (cond=0x78411d5ca8c0 <safepoint_cond>, mutex=0x78411d5ca900 <safepoint_lock>) at src/unix/thread.c:883
#3  0x000078411d0b4735 in jl_safepoin

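The thread triggering GC is holding the trampoline lock. Its backtrace enters through LocalTrampolinePool::reenter, materializes the symbol by calling back into Julia (jl_compile_method_internal), and ends up in ijl_gc_collect waiting for all other threads. The threads blocked in std::future::get inside reenter presumably never reach a safepoint, so neither side can make progress: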

(gdb) bt
#0  jl_gc_wait_for_the_world (gc_n_threads=<optimized out>, gc_all_tls_states=<optimized out>)
    at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gc.c:241
#1  ijl_gc_collect (collection=collection@entry=JL_GC_AUTO) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gc.c:3508
#2  0x0000786a056aa2e3 in maybe_collect (ptls=0x7869dc000b70) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gc.c:937
...
#156 jl_compile_method_internal (mi=<optimized out>, world=31686) at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gf.c:2368
#157 0x0000786a05646c3e in _jl_invoke (world=31686, mfunc=0x78689cac4c40, nargs=1, args=0x78693f7fec98, F=0x78689e80c470)
    at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gf.c:2887
#158 ijl_apply_generic (F=<optimized out>, args=0x78693f7fec98, nargs=<optimized out>)
    at /cache/build/builder-amdci4-2/julialang/julia-release-1-dot-10/src/gf.c:3077
#159 0x00007869ca168c10 in julia___materialize_7661 () at /home/vchuravy/.julia/packages/LLVM/ShACK/src/orc.jl:292
#160 0x00007869ca169ffd in jlcapi___materialize_7663 ()
#161 0x0000786a0203c9bb in llvm::orc::MaterializationTask::run() ()
   from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#162 0x0000786a02022d1d in void llvm::detail::UniqueFunctionBase<void, std::unique_ptr<llvm::orc::Task, std::default_delete<llvm::orc::Task> > >::CallImpl<void (*)(std::unique_ptr<llvm::orc::Task, std::default_delete<llvm::orc::Task> >)>(void*, std::unique_ptr<llvm::orc::Task, std::default_delete<llvm::orc::Task> >&) ()
   from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#163 0x0000786a0203cac8 in llvm::orc::ExecutionSession::dispatchOutstandingMUs() ()
   from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#164 0x0000786a02042381 in llvm::orc::ExecutionSession::OL_completeLookup(std::unique_ptr<llvm::orc::InProgressLookupState, std::default_delete<llvm::orc::InProgressLookupState> >, std::shared_ptr<llvm::orc::AsynchronousSymbolQuery>, std::function<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> >, llvm::DenseMapInfo<llvm::orc::JITDylib*, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> > > > const&)>) ()
   from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#165 0x0000786a02042a0d in llvm::orc::InProgressFullLookupState::complete(std::unique_ptr<llvm::orc::InProgressLookupState, std::default_delete<llvm::orc::InProgressLookupState> >) () from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#166 0x0000786a02031202 in llvm::orc::ExecutionSession::OL_applyQueryPhase1(std::unique_ptr<llvm::orc::InProgressLookupState, std::default_delete<llvm::orc::InProgressLookupState> >, llvm::Error) () from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#167 0x0000786a0203ce8f in llvm::orc::ExecutionSession::lookup(llvm::orc::LookupKind, std::vector<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags>, std::allocator<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags> > > const&, llvm::orc::SymbolLookupSet, llvm::orc::SymbolState, llvm::unique_function<void (llvm::Expected<llvm::DenseMap<llvm::orc::SymbolStringPtr, llvm::JITEvaluatedSymbol, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>, llvm::detail::DenseMapPair<llvm::orc::SymbolStringPtr, llvm::JITEvaluatedSymbol> > >)>, std::function<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> >, llvm::DenseMapInfo<llvm::orc::JITDylib*, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> > > > const&)>) ()
   from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#168 0x0000786a0208e456 in llvm::orc::LazyCallThroughManager::resolveTrampolineLandingAddress(unsigned long, llvm::unique_function<void (unsigned long) const>)
    () from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#169 0x0000786a0208ea30 in void llvm::detail::UniqueFunctionBase<void, unsigned long, llvm::unique_function<void (unsigned long) const> >::CallImpl<llvm::orc::LocalLazyCallThroughManager::init<llvm::orc::OrcX86_64_SysV>()::{lambda(unsigned long, llvm::unique_function<void (unsigned long) const>)#1} const>(void*, unsigned long, llvm::unique_function<void (unsigned long) const>&) () from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#170 0x0000786a020803fa in llvm::orc::LocalTrampolinePool<llvm::orc::OrcX86_64_SysV>::reenter(void*, void*) ()
   from /home/vchuravy/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/../lib/julia/libLLVM-15jl.so
#171 0x0000786a06809044 in ?? ()
#172 0x000000000000037f in ?? ()
#173 0x0000000000000000 in ?? ()

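We most likely need to mark ourselves as GC safe around this call, so that a collection on another thread does not wait for us while we block in the trampoline:

r = call!(builder, FT, lfn, callparams)

and then re-enter the GC-unsafe state (like @cfunction does) inside the wrapper function.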
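
As a rough illustration of the state transition discussed above, here is a minimal Julia-level sketch. It assumes the runtime entry points jl_gc_safe_enter and jl_gc_safe_leave (exported by libjulia and used by some packages to bracket blocking foreign calls) and a Linux libc usleep. The helper name usleep_gc_safe is made up for this example. In the runtime's naming, a thread in the GC-safe state is one the collector does not have to wait for. An actual fix would have to emit equivalent transitions in the generated LLVM IR around the call! above rather than in Julia code.

```julia
# Hypothetical helper for illustration only; not part of Enzyme or LLVM.jl.
function usleep_gc_safe(microseconds::Integer)
    # Convert before the transition: nothing between enter and leave
    # may allocate or touch Julia objects.
    us = Cuint(microseconds)
    # Enter the GC-safe state; the call returns the previous state.
    old_state = ccall(:jl_gc_safe_enter, Int8, ())
    # Blocking foreign call; a GC started on another thread can run meanwhile.
    ret = ccall(:usleep, Cint, (Cuint,), us)
    # Restore the previous state; this should act as a safepoint, so the
    # thread waits here if a collection is still in progress.
    ccall(:jl_gc_safe_leave, Cvoid, (Int8,), old_state)
    return ret
end

usleep_gc_safe(1_000)
```

The wrapper side would be the mirror image: switch back to the GC-unsafe state on entry (jl_gc_unsafe_enter) before touching any Julia objects, and restore the saved state on exit.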
