
jl_print_task_backtraces() is missing some tasks #47928

Closed
NHDaly opened this issue Dec 19, 2022 · 8 comments

NHDaly commented Dec 19, 2022

We believe that we have seen cases where jl_print_task_backtraces() (introduced by #46845) is missing Tasks. We see tasks present in the short profile printed by SIGUSR1 that aren't present in jl_print_task_backtraces(), no matter which order we do them in.

Additionally, here is a very simple experiment, where you can see that the REPL's task is missing, which should be waiting in foo(cond) -> wait(cond), but isn't present at all:

 11:46:29 $ ./julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.0-DEV.176 (2022-12-14)
 _/ |\__'_|_|_|\__'_|  |  pv-compiler-timings-lock/5dce6f233d* (fork: 158 commits, 31 days)
|__/                   |

julia> cond = Threads.Condition()
Base.GenericCondition{ReentrantLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), ReentrantLock(nothing, 0x00000000, 0x00, Base.GenericCondition{Base.Threads.SpinLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), (0, 0, 0)))

julia> foo(cond) = begin lock(cond); wait(cond) end
foo (generic function with 1 method)

julia> foo(cond)


^Z
[1]  + 66990 suspended  ./julia
 11:46:43 $ lldb -p 66990
(lldb) process attach --pid 66990
Process 66990 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007ff81a69b34e libsystem_kernel.dylib`kevent + 10
libsystem_kernel.dylib`kevent:
->  0x7ff81a69b34e <+10>: jae    0x7ff81a69b358            ; <+20>
    0x7ff81a69b350 <+12>: movq   %rax, %rdi
    0x7ff81a69b353 <+15>: jmp    0x7ff81a6971c5            ; cerror_nocancel
    0x7ff81a69b358 <+20>: retq
Target 0: (julia) stopped.

Executable module set to "/Users/nathandaly/src/julia/usr/bin/julia".
Architecture set to: x86_64h-apple-macosx-.
(lldb) p (void)jl_print_task_backtraces()
warning: could not find Objective-C class data in the process. This may reduce the quality of type information available.
==== Thread 1 created 3 live tasks
     ---- Root task (0x10b970010)
          (sticky: 1, started: 1, state: 0, tid: 1)
jl_rec_backtrace at /Users/nathandaly/src/julia/src/stackwalk.c:863
jlbacktracet at /Users/nathandaly/src/julia/src/stackwalk.c:1113 [inlined]
jl_print_task_backtraces at /Users/nathandaly/src/julia/src/stackwalk.c:1144
     ---- Task 1 (0x10d4d7340)
          (sticky: 1, started: 1, state: 0, tid: 1)
jl_swap_fiber at /Users/nathandaly/src/julia/src/task.c:1200
ctx_switch at /Users/nathandaly/src/julia/src/task.c:570
ijl_switch at /Users/nathandaly/src/julia/src/task.c:655
try_yieldto at ./task.jl:899
wait at ./task.jl:973
#wait#625 at ./condition.jl:130
wait at ./condition.jl:125 [inlined]
wait at /Users/nathandaly/src/julia/usr/share/julia/stdlib/v1.10/FileWatching/src/FileWatching.jl:655
watch_file at /Users/nathandaly/src/julia/usr/share/julia/stdlib/v1.10/FileWatching/src/FileWatching.jl:766
watch_file at /Users/nathandaly/src/julia/usr/share/julia/stdlib/v1.10/FileWatching/src/FileWatching.jl:758 [inlined]
wait_changed at /Users/nathandaly/.julia/packages/Revise/do2nH/src/packagedef.jl:34
watch_manifest at /Users/nathandaly/.julia/packages/Revise/do2nH/src/pkgs.jl:413
unknown function (ip: 0x11861e302)
_jl_invoke at /Users/nathandaly/src/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/nathandaly/src/julia/src/gf.c:2829
jl_apply at /Users/nathandaly/src/julia/src/./julia.h:1870 [inlined]
do_apply at /Users/nathandaly/src/julia/src/builtins.c:734
TaskThunk at /Users/nathandaly/.julia/packages/Revise/do2nH/src/types.jl:268
unknown function (ip: 0x11861959f)
_jl_invoke at /Users/nathandaly/src/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/nathandaly/src/julia/src/gf.c:2829
jl_apply at /Users/nathandaly/src/julia/src/./julia.h:1870 [inlined]
start_task at /Users/nathandaly/src/julia/src/task.c:1092
     ---- End task 1
     ---- Task 2 (0x118efe230)
          (sticky: 1, started: 1, state: 0, tid: 1)
jl_swap_fiber at /Users/nathandaly/src/julia/src/task.c:1200
ctx_switch at /Users/nathandaly/src/julia/src/task.c:570
ijl_switch at /Users/nathandaly/src/julia/src/task.c:655
try_yieldto at ./task.jl:899
wait at ./task.jl:973
#wait#625 at ./condition.jl:130
wait at ./condition.jl:125 [inlined]
take_buffered at ./channels.jl:456
take! at ./channels.jl:450 [inlined]
eval_with_backend at /Users/nathandaly/src/julia/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:874
do_respond at /Users/nathandaly/src/julia/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:888
jfptr_do_respond_59919 at /Users/nathandaly/src/julia/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/nathandaly/src/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/nathandaly/src/julia/src/gf.c:2829
jl_apply at /Users/nathandaly/src/julia/src/./julia.h:1870 [inlined]
jl_f__call_latest at /Users/nathandaly/src/julia/src/builtins.c:778
#invokelatest#2 at ./essentials.jl:816 [inlined]
invokelatest at ./essentials.jl:813 [inlined]
run_interface at /Users/nathandaly/src/julia/usr/share/julia/stdlib/v1.10/REPL/src/LineEdit.jl:2644
jfptr_run_interface_58905 at /Users/nathandaly/src/julia/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/nathandaly/src/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/nathandaly/src/julia/src/gf.c:2829
run_frontend at /Users/nathandaly/src/julia/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:1293
#62 at ./task.jl:514
jfptr_YY.62_59949 at /Users/nathandaly/src/julia/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/nathandaly/src/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/nathandaly/src/julia/src/gf.c:2829
jl_apply at /Users/nathandaly/src/julia/src/./julia.h:1870 [inlined]
start_task at /Users/nathandaly/src/julia/src/task.c:1092
     ---- End task 2
==== End thread 1
==== Thread 2 created 2 live tasks
     ---- Root task (0x10ba68010)
          (sticky: 1, started: 1, state: 1, tid: 2)
     ---- Task 1 (0x10b9702b0)
          (sticky: 0, started: 1, state: 0, tid: 2)
     ---- End task 1
==== End thread 2
==== Thread 3 created 1 live tasks
     ---- Root task (0x10ba70010)
          (sticky: 1, started: 1, state: 1, tid: 3)
==== End thread 3
==== Thread 4 created 2 live tasks
     ---- Root task (0x10ba6c010)
          (sticky: 1, started: 1, state: 1, tid: 4)
     ---- Task 1 (0x10b970550)
          (sticky: 0, started: 1, state: 0, tid: 4)
     ---- End task 1
==== End thread 4
==== Done
(lldb)

The above experiment is on a Mac, but we believe we have seen missing Tasks in prod.

Additionally, there are two tasks in the printout above that are started and not completed (started: 1, state: 0), yet they do not print a stack trace. Why is that?

Those states are defined in julia/src/julia.h, lines 1947 to 1949 at 437ebe1:

#define JL_TASK_STATE_RUNNABLE 0
#define JL_TASK_STATE_DONE 1
#define JL_TASK_STATE_FAILED 2

Thanks


NHDaly commented Dec 19, 2022

CC: @kpamnany

@kpamnany (Contributor) commented:

@vtjnash can you verify my understanding of the logic? jl_alloc_fiber calls jl_malloc_stack, which adds the task to ptls->heap.live_tasks. Either jl_new_task calls jl_alloc_fiber or ctx_switch does, but essentially a task cannot start running unless it gets a stack from jl_malloc_stack. That should mean this issue is impossible, and yet...

Also, why are the root tasks on the other threads showing state DONE? Root tasks should never terminate, I thought...


NHDaly commented Jan 6, 2023

One possible explanation for some of the missing tasks: a task that is currently scheduled doesn't have a recorded stack buffer, since its stack is owned by the pthread that's running it, so we need to do something different to print those.

But we don't think this explains all the missing tasks we've seen.

@tveldhui commented:

Just to underscore the importance of this: I'm currently trying to figure out why a parallel computation for a major client, running on 128 nodes / 1024 vCPUs, gets stuck after four hours and >100 TB of data movement, and I cannot get usable stack traces to diagnose the problem.


NHDaly commented Jan 12, 2023

Thanks Todd. 👍 This got stuck; thanks for the reminder to unstick it.

@kpamnany (Contributor) commented:

Discussed with @vtjnash. On Linux, a thread cannot inspect another thread's registers; this capability was removed some years back. Thus we cannot get another thread's %rsp, and so we cannot unwind its stack.

One solution is to use GC-style stop-the-world and have each thread backtrace its own stack.

Another solution is to use clone at Julia startup for the main Julia process and then use the parent process to receive signals/requests to do things like query task backtraces in the child process.


NHDaly commented Apr 25, 2023

Leaving this note here:

We believe there still seems to be at least one other case of missing stacktraces that cannot currently be explained.

The above explanation should account for exactly one Task with a missing backtrace per thread (the one currently running on it). However, we believe we have seen dumps with more than one Task missing a stacktrace, despite being started and having state=0 (not yet finished).

We still need to investigate that. Having a consistent reproducer would be extremely helpful.

@kpamnany (Contributor) commented:

Fixed by #51430.
