try changes backtraces on FreeBSD #30233
Comments
#29695 can only corrupt about one frame per 1000 or so because that's the chunk size used in
I can't repro this on Linux, but here are some things which might help a little:
|
Likewise, nor on macOS; I've only seen it on FreeBSD. If it occurred elsewhere the tests would be failing on CI, but oddly enough the tests pass even on FreeBSD CI.
Without
? I only see one method for |
Ah, I see, it's |
Sorry, as you found out I meant Anyway, is this the difference between the system deciding to run this code in the interpreter vs not? Maybe interpreter stack traces are broken on FreeBSD? There's certainly enough system-specific black magic in interpreter-stacktrace.c that it wouldn't be too surprising. What happens without the |
Well, |
I guess the root cause is not
julia> begin
           f(c) = g(c + 1)
           g(c) = c > 10000 ? (return backtrace()) : f(c + 1)
           bt = f(1)
           io = IOBuffer()
           Base.show_backtrace(io, bt)
           println(String(take!(io)))
       end
Stacktrace:
[1] g(::Int64) at ./REPL[14]:3
[2] f(::Int64) at ./REPL[14]:2
... (the last 2 lines are repeated 5000 more times)
[10003] top-level scope at REPL[14]:4
[10004] eval(::Module, ::Any) at ./boot.jl:319
[10005] eval_user_input(::Any, ::REPL.REPLBackend) at /usr/home/iblis/git/julia/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:85
[10006] macro expansion at /usr/home/iblis/git/julia/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:117 [inlined]
julia> let
           f(c) = g(c + 1)
           g(c) = c > 10000 ? (return backtrace()) : f(c + 1)
           bt = f(1)
           io = IOBuffer()
           Base.show_backtrace(io, bt)
           println(String(take!(io)))
       end
Stacktrace:
[1] (::getfield(Main, Symbol("#g#10")){getfield(Main, Symbol("#f#9"))})(::Int64) at ./REPL[15]:3
[2] (::getfield(Main, Symbol("#f#9")))(::Int64) at ./REPL[15]:2
... (the last 2 lines are repeated 145 more times)
[293] (::getfield(Main, Symbol("#g#10")){getfield(Main, Symbol("#f#9"))})(::Int64) at ./REPL[15]:3 |
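The two transcripts above differ only in whether f and g are top-level functions (begin) or local closures (let). A minimal sketch for comparing the two variants by raw frame count rather than by reading the printed trace follows. The names f_top and g_top are hypothetical, and the raw count returned by backtrace() includes runtime frames and does not expand inlined frames, so it will not match the numbered entries printed by Base.show_backtrace exactly.

```julia
# Hypothetical names for illustration; the depth of 10000 mirrors the thread.

# Variant 1: top-level (global) functions, as in the `begin` transcript.
f_top(c) = g_top(c + 1)
g_top(c) = c > 10000 ? backtrace() : f_top(c + 1)
n_top = length(f_top(1))

# Variant 2: local closures, as in the `let` transcript.
n_closure = let
    f(c) = g(c + 1)
    g(c) = c > 10000 ? backtrace() : f(c + 1)
    length(f(1))
end

println("top-level functions: ", n_top, " raw frames")
println("local closures:      ", n_closure, " raw frames")
```

On an unaffected platform both counts should grow with the recursion depth. In the report above, the printed trace for the closure variant on FreeBSD stopped at entry 293.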
Interesting note about
I've also noticed that I get the same backtrace with
f = c->g(c + 1)
g = c->c > 10000 ? (return backtrace()) : f(c + 1)
both on 11.2 and 12.0. |
Running in the interpreter I get the same backtrace for both cases: |
I can repro this with the FreeBSD image
I thought it was a bug in interpreter backtraces, but it looks like it might be a bug in walking the native stack instead. Here's a repro in global scope showing the edge of the failure condition:
f = c-> g(c + 1)
g = c-> c > 10000 ? (return backtrace()) : f(c + 1)
bt = f(1)
@show length(f(9833))
display(stacktrace(f(9833), true))
println()
@show length(f(9831))
display(stacktrace(f(9831), true))
which prints
Note the stack trace has been truncated in |
A more minimalistic reproduction, stripping away the mutually recursive functions and any
@noinline function raw_bt()
    bt, bt2 = ccall(:jl_backtrace_from_here, Any, (Int32,), false)
    collect(bt)
end
g = c-> c <= 0 ? (return raw_bt()) : g(c - 1)
@show length(g(167))
@show length(g(168))
Which produces
|
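The g(167) versus g(168) comparison can be generalized into a scan for the first recursion depth at which the captured backtrace gets shorter instead of longer. This is a rough sketch, assuming the public backtrace() is an acceptable stand-in for the jl_backtrace_from_here ccall used above. The names raw_bt_len, rec, and first_truncated_depth are hypothetical.

```julia
# Hypothetical helper: number of raw frames captured at the current point.
@noinline raw_bt_len() = length(backtrace())

# Self-recursive closure stored in a non-constant global, as in the thread.
rec = c -> c <= 0 ? raw_bt_len() : rec(c - 1)

# Return the first depth whose backtrace is shorter than the one captured at
# the previous depth, or `nothing` if the length never decreases up to `maxdepth`.
function first_truncated_depth(maxdepth::Integer)
    prev = rec(0)
    for d in 1:maxdepth
        n = rec(d)
        n < prev && return d
        prev = n
    end
    return nothing
end

println(first_truncated_depth(500))
```

Where unwinding works, the length should keep growing with depth and the scan should report nothing. A returned integer marks the depth at which frames start going missing, which is the kind of edge the 167/168 pair above was probing.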
By inserting a bunch of printf logging, I've traced this down to a call to Line 415 in 7a5042a
So that would suggest a FreeBSD specific bug in libunwind, I suppose? |
Or, looking into libunwind, a bug in the way that the DWARF info is emitted by the julia JIT? Turning on libunwind
vs
The confusing thing is that it's giving up while unwinding a recursive self-call (which actually turns into a couple of mutually recursive calls as we can see in the alternating addresses here), so I would have thought the structure of the native stack would be consistent all the way down. |
Ah, looks like DWARF lookup failed, so libunwind is falling back to a heuristic:
which of course can get confused. So I guess this is down to missing, or somehow broken, DWARF info emitted by the JIT on FreeBSD. |
I just have to say: absolutely amazing work tracking this down! |
libunwind has really great debug logging. Here's part of the output of
UNW_DEBUG_LEVEL=100 ./julia -e 'g(c) = c <= 0 ? (return backtrace()) : g(c - 1); g(100)' 2> unw.log
which shows that the DWARF info is not found on FreeBSD, even for this normal (non-anonymous) function
It doesn't even seem to be required that we define any function in particular. You can get a "dwarf [...] not found" from libunwind on FreeBSD even with
UNW_DEBUG_LEVEL=100 ./julia -e 'backtrace()' |
More info. Comparing Linux and FreeBSD for the following command
UNW_DEBUG_LEVEL=100 ./julia -e 'g() = backtrace(); g()' 2> unw.log
and comparing the first frame which is not found on FreeBSD but is found on Linux, we have:
Linux:
FreeBSD:
Not sure what to make of this yet. |
Oh, okay. It cannot find the debug info. I think the reason CI passes is |
It cannot find debug info for specific frames, including JITted ones. But other frames are found just fine, including ones from the sysimg.
|
The errorshow tests failed locally for me on FreeBSD 12.0, and when I investigated, I found that the code that fails works fine if run outside of a testset. As it turns out, the try inserted by @testset is what's causing the difference in stack traces. The following is adapted from the tests:
I'm also seeing this on FreeBSD 11.2, which is what our CI is running. This happens with both Julia 1.0 and master. Julia 0.6 produces the correct number of frames (10006 instead of 293) but without the collapsing that we now have.
Note that this issue seems similar to #29695, but Julia is not running with --compile=min here.
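One Julia detail that may help connect the try observation with the let reproduction in the comments: both constructs introduce a local scope, so functions defined inside them become local closures rather than global functions. A small sketch of that scoping behaviour follows. The name scope_probe is hypothetical, and this only illustrates scoping, not the FreeBSD unwinding failure itself.

```julia
# `scope_probe` is a hypothetical name used only for this illustration.
closure_type = try
    scope_probe(c) = c + 1   # defined inside `try`, so it is local to the block
    typeof(scope_probe)
catch
    nothing
end

# The type is compiler-generated for the local function.
println(closure_type)

# No global binding named `scope_probe` was created in the enclosing module.
println(isdefined(@__MODULE__, :scope_probe))
```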