jl_print_task_backtraces() is missing some tasks #47928
Comments
CC: @kpamnany
@vtjnash can you verify my understanding of the logic? Also, why are the root tasks on the other threads showing state
One possible explanation for some of the missing tasks is that the tasks that are currently scheduled don't have a stack buffer since it's owned by the pthread that's running them, so we need to do something different to print those. But we don't think this explains all the missing tasks we've seen.
Just to underscore the importance of this, I'm currently trying to figure out why a parallel computation for a major client running on 128 nodes / 1024 vcpus gets stuck after four hours and > 100 TB of data movement, and I cannot get usable stack traces to diagnose the problem.
Thanks Todd. 👍 This got stuck; thanks for the reminder to unstick it.
Discussed with @vtjnash. On Linux, a thread cannot inspect another thread's registers; this capability was removed some years back. Thus we cannot get One solution is to use GC-style stop-the-world and have each thread backtrace its own stack. Another solution is to use
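A minimal sketch of the first option (each thread backtraces its own stack when asked to), to make the idea concrete. This is not Julia runtime code: the polling loop stands in for real safepoints, all names (`worker`, `collect_self_backtraces`, `dump_requested`, ...) are hypothetical, and it assumes a platform that provides `backtrace()` in `<execinfo.h>` (glibc, macOS).

```c
#include <execinfo.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define NWORKERS   4
#define MAX_FRAMES 64

/* Hypothetical shared state: a "please dump" flag, a completion counter, a quit flag. */
static atomic_int dump_requested;
static atomic_int dumps_done;
static atomic_int quit;

/* Each worker writes only its own slot, so no locking is needed here. */
static void *frames[NWORKERS][MAX_FRAMES];
static int frame_counts[NWORKERS];

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;
    int dumped = 0;
    while (!atomic_load(&quit)) {
        /* "Safepoint": the thread notices the request and walks its OWN stack,
           so nobody has to read another thread's registers. */
        if (!dumped && atomic_load(&dump_requested)) {
            frame_counts[id] = backtrace(frames[id], MAX_FRAMES);
            dumped = 1;
            atomic_fetch_add(&dumps_done, 1);
        }
        sched_yield();
    }
    return NULL;
}

/* Starts the workers, requests a dump, waits until every worker has recorded
   its own backtrace, then returns how many workers captured at least one frame.
   Intended to be called once; the globals are not reset. */
static int collect_self_backtraces(void)
{
    pthread_t tids[NWORKERS];
    int captured = 0;

    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tids[i], NULL, worker, (void *)(intptr_t)i);

    atomic_store(&dump_requested, 1);
    while (atomic_load(&dumps_done) < NWORKERS)
        sched_yield();

    atomic_store(&quit, 1);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);

    /* Safe to read after the joins. */
    for (int i = 0; i < NWORKERS; i++)
        if (frame_counts[i] > 0)
            captured++;
    return captured;
}
```

The captured addresses in `frames[i]` could then be printed with `backtrace_symbols_fd()`. A real implementation would have to reach threads that are blocked or running foreign code, which a polling loop like this does not cover.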
Leaving this note here: There still seems to be at least one other case of missing stacktraces that cannot currently be explained. The above explanation should result in exactly one Task with a missing trace in each Thread. However, we believe we have seen dumps that have more than one Task missing a stacktrace, despite being started and having We still need to investigate that. Having a consistent reproducer would be extremely helpful.
Fixed by #51430.
We believe that we have seen cases where jl_print_task_backtraces() (introduced by #46845) is missing Tasks. We see tasks present in the short profile printed by SIGUSR1 that aren't present in jl_print_task_backtraces(), no matter which order we do them in.
Additionally, here is a very simple experiment, where you can see that the REPL's task is missing, which should be waiting in foo(cond) -> wait(cond), but isn't present at all:
The above experiment is on a Mac, but we believe we have seen missing Tasks in prod.
Additionally, there are two tasks in the printout above that are started, and not completed (started: 1, state: 0) yet they do not print a stack trace. Why is that?
Those states are defined here:
julia/src/julia.h
Lines 1947 to 1949 in 437ebe1
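For readers without the source open, a small sketch of how the started/state pair in the printout reads. The numeric values are our recollection of the `JL_TASK_STATE_*` macros at the referenced lines (an assumption; check julia.h), and `task_state_name`, `expects_backtrace` and `describe_task` are hypothetical helpers, not part of the Julia runtime.

```c
#include <stdio.h>

/* Assumed to mirror the JL_TASK_STATE_* macros in julia.h;
   verify against the lines referenced in the issue. */
enum {
    TASK_STATE_RUNNABLE = 0,
    TASK_STATE_DONE     = 1,
    TASK_STATE_FAILED   = 2
};

/* Hypothetical helper: human-readable name for a task state value. */
static const char *task_state_name(int state)
{
    switch (state) {
    case TASK_STATE_RUNNABLE: return "runnable";
    case TASK_STATE_DONE:     return "done";
    case TASK_STATE_FAILED:   return "failed";
    default:                  return "unknown";
    }
}

/* Hypothetical helper: a task that has started and has neither finished
   nor failed is one the issue expects to see a backtrace for. */
static int expects_backtrace(int started, int state)
{
    return started && state == TASK_STATE_RUNNABLE;
}

/* Prints one summary line for a (started, state) pair. */
static void describe_task(int started, int state)
{
    printf("started: %d, state: %d (%s), backtrace expected: %s\n",
           started, state, task_state_name(state),
           expects_backtrace(started, state) ? "yes" : "no");
}
```

Under these assumptions, `describe_task(1, 0)` describes the two tasks in question: started, still runnable, so a backtrace would be expected.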
Thanks