Julia 1.5.1 segfaults, rr trace attached #37524
Comments
@Keno This recording consistently breaks
|
Can reproduce the rr issue.
This pointer is odd. Stopping slightly before that crash, the pointer without the high bit set is |
Note that that's the lowest bit of the pointer's high byte. I'm not really sure what would be setting that. For our GC marking we use the low bits, and I don't think we do anything with the high bits anywhere. |
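For readers following along, the check described above (comparing a bad pointer against the value it should have had) can be sketched as follows. This is only an illustration: the two pointer values are hypothetical placeholders, not the ones from the trace, and the canonical-address check assumes x86-64 with 48-bit virtual addresses.

```python
def flipped_bits(expected: int, observed: int) -> list[int]:
    """Return the positions (0 = least significant) where two 64-bit values differ."""
    diff = (expected ^ observed) & 0xFFFF_FFFF_FFFF_FFFF
    return [i for i in range(64) if (diff >> i) & 1]


def is_canonical_x86_64(addr: int, va_bits: int = 48) -> bool:
    """Check whether bits va_bits-1 through 63 are all equal (assumes 48-bit addressing)."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


# Hypothetical values for illustration only; the real pointers are not shown in the thread.
expected = 0x0000_7F3A_1234_5678   # a plausible user-space heap pointer
observed = expected | (1 << 56)    # same pointer with the low bit of the high byte set

print(flipped_bits(expected, observed))
print(is_canonical_x86_64(expected), is_canonical_x86_64(observed))
```

Under these assumptions the two values differ in exactly one bit, bit 56, and the corrupted value is no longer a canonical address, which is consistent with a fault on dereference. The low bits that GC marking uses are untouched in this scenario.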
@hh0rva1h I'd be interested if you see the same rr failure mode or whether it replays correctly for you. Could you try replaying your upload and seeing if it replays ok (BugReporting has a helper if you need it: https://github.com/JuliaLang/BugReporting.jl/blob/master/src/BugReporting.jl#L133). |
I tried
but that never gets hit. So whatever the divergence is, it starts earlier than that and this code path is never hit. |
Also, I'd be interested in another recording of the same crash to see if it breaks in the same way. |
@hh0rva1h As a general comment, you appear to be allocating super long tuples of tiny arrays. That's exactly the wrong way around. You'll want to use arrays for large, variable-sized things and tuples for small, structural things. |
Right, so it's looking at the |
Yet another trace: Regarding the machine: It's a Dell Precision M4700, so I guess no ECC memory since that's a consumer device, no unusual device drivers, kernel info:
I was not able to reproduce this issue on another computer (Dell XPS 13 9360), and a colleague also failed to reproduce it, so this seems to happen only on my machine. I did check journalctl for any mce messages, but I did not notice anything suspicious. The system is otherwise quite stable: no kernel panics or other segfaults ... |
Can you try doing the replay yourself on your machine and seeing if you observe the same failure to replay? |
@Keno Thx for all your effort, could you outline how I should replay? I have no experience with |
|
As for the new trace, the symptoms are the same:
r8 is an otherwise valid pointer that has the low bit of its high byte flipped. I'm still replaying to see if that was caused by any architecturally executed instruction, but I'm assuming not. My best guess at this point is that one of your RAM DIMMs has a sticky bit that gets stuck when the DIMM gets hot. Unless you can reproduce on another machine, I'm not sure there's much for us to do here. I'd recommend maybe running memtest on that machine to see if you can catch the memory error. |
Alright, the replay just finished, and yeah, same deal, somehow that bit gets flipped and things break. |
@Keno Ah thx, I figured it out from the link you gave a few comments above, but thanks anyway:
I'll run memtest on my machine. Thanks very much for your great support! |
Yeah, that's the same behavior we're seeing. That's good. That indicates that there is indeed something transient on your machine that is flipping bits, rather than, say, some weird architectural behavior of your particular CPU. Since I don't think there's anything actionable here, I'm gonna close the issue, but do let us know what you find. And of course if you do end up reproducing on another machine, feel free to ask for the issue to be reopened. |
@Keno You were absolutely right, thanks very much again for your amazing support and sorry for the noise! |
Julia Version: 1.5.1, official binary
Environment: Ubuntu 20.04
Julia is segfaulting for us repeatedly with a trace that looks like the following:
The rr trace of the crash can be found here:
https://s3.amazonaws.com/julialang-dumps/reports/2020-09-11T08-25-23-hh0rva1h.tar.zst
We are able to reproduce the crash, however the triggering code involves an unpublished reinforcement learning library (rl_framework) from our university. We could give access to a dev looking into it privately, just tell us what you need or we could try debugging instructions ourselves.