Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Julia 1.5.1 segfaults, rr trace attached #37524

Closed
hh0rva1h opened this issue Sep 11, 2020 · 17 comments
Closed

Julia 1.5.1 segfaults, rr trace attached #37524

hh0rva1h opened this issue Sep 11, 2020 · 17 comments

Comments

@hh0rva1h
Copy link

Julia Version: 1.5.1, official binary
Environment: Ubuntu 20.04

Julia is segfaulting for us repeatedly with a trace that looks like the following:

signal (11): Segmentation fault
in expression starting at /home/user/RL-Carla/rl-carla/proofofconcepts/crashes/julia-crash.jl:76
jl_is_concrete_type at /buildworker/worker/package_linux64/build/src/julia.h:1222 [inlined]
jl_apply_tuple_type_v_ at /buildworker/worker/package_linux64/build/src/jltypes.c:1381 [inlined]
jl_apply_tuple_type at /buildworker/worker/package_linux64/build/src/jltypes.c:1389
arg_type_tuple at /buildworker/worker/package_linux64/build/src/gf.c:1853
jl_lookup_generic_ at /buildworker/worker/package_linux64/build/src/gf.c:2361 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2394
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:655
update! at /home/user/RL-Carla/rl_framework/src/approximators/Knn2.jl:48
update! at /home/user/RL-Carla/rl_framework/src/approximators/Knn2.jl:106
unknown function (ip: 0x7f0483dd8205)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
update! at /home/user/RL-Carla/rl_framework/src/learners/QLearner2.jl:40
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
#101#102 at /home/user/RL-Carla/rl_framework/src/main.jl:127
101#104##kw at /home/user/RL-Carla/rl_framework/src/main.jl:89
unknown function (ip: 0x7f0483dd308c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
#simulate!#1 at /home/user/RL-Carla/rl-carla/proofofconcepts/crashes/julia-crash.jl:55
simulate!##kw at /home/user/RL-Carla/rl-carla/proofofconcepts/crashes/julia-crash.jl:49
unknown function (ip: 0x7f0483dc76b4)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
top-level scope at /home/user/RL-Carla/rl-carla/proofofconcepts/crashes/julia-crash.jl:78
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:834
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:913
jl_load_rewrite at /buildworker/worker/package_linux64/build/src/toplevel.c:914
include at ./client.jl:457
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2231 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:117
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:206
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:157 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:566
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:660
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:840
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:790
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:883
eval at ./boot.jl:331
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:134
repl_backend_loop at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:195
start_repl_backend at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:180
#run_repl#37 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:292
run_repl at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2231 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
#806 at ./client.jl:399
jfptr_YY.806_33777.clone_1 at /home/user/RL-Carla/julia/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:655
jl_f__apply_latest at /buildworker/worker/package_linux64/build/src/builtins.c:705
#invokelatest#1 at ./essentials.jl:710 [inlined]
invokelatest at ./essentials.jl:709 [inlined]
run_main_repl at ./client.jl:383
exec_options at ./client.jl:313
_start at ./client.jl:506
jfptr__start_51675.clone_1 at /home/user/RL-Carla/julia/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/ui/../src/julia.h:1690 [inlined]
true_main at /buildworker/worker/package_linux64/build/ui/repl.c:106
main at /buildworker/worker/package_linux64/build/ui/repl.c:227
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/user/RL-Carla/julia/bin/julia (unknown line)
Allocations: 943748757 (Pool: 943641655; Big: 107102); GC: 472

The rr trace of the crash can be found here:
https://s3.amazonaws.com/julialang-dumps/reports/2020-09-11T08-25-23-hh0rva1h.tar.zst

We are able to reproduce the crash, however the triggering code involves an unpublished reinforcement learning library (rl_framework) from our university. We could give access to a dev looking into it privately, just tell us what you need or we could try debugging instructions ourselves.

@maleadt
Copy link
Member

maleadt commented Sep 11, 2020

@Keno This recording consistently breaks rr (on antarctic with your build of rr):

Episode: 2605
[FATAL /home/keno/rr/src/ReplaySession.cc:636:check_pending_sig()] 
 (task 33780 (rec:7760) at time 51911)
 -> Assertion `false' failed to hold. Replaying `SIGNAL: SIGSEGV(det)': expecting tracee signal or trap, but instead at `write' (ticks: 144040993972)
Tail of trace dump:
{
  real_time:38257.508732 global_time:51891, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:7760, ticks:143952193134
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x4 rsi:0x7f046af10e38 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f046af10e38 r14:0x4 r15:0x7f046af10e38 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.508734 global_time:51892, event:`SYSCALLBUF_RESET' tid:7760, ticks:143952193134
}
{
  real_time:38257.508760 global_time:51893, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7760, ticks:143952193134
rax:0x4 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x4 rsi:0x7f046af10e38 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f046af10e38 r14:0x4 r15:0x7f046af10e38 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.508810 global_time:51894, event:`SYSCALLBUF_FLUSH' tid:7760, ticks:143952193901
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:38257.508813 global_time:51895, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:7760, ticks:143952193901
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x7f047b1f43f0 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f047b1f43f0 r14:0x1 r15:0x7f047b1f43f0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.508815 global_time:51896, event:`SYSCALLBUF_RESET' tid:7760, ticks:143952193901
}
{
  real_time:38257.508841 global_time:51897, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7760, ticks:143952193901
rax:0x1 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x7f047b1f43f0 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f047b1f43f0 r14:0x1 r15:0x7f047b1f43f0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562196 global_time:51898, event:`SYSCALLBUF_FLUSH' tid:7760, ticks:143983370693
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:38257.562212 global_time:51899, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:7760, ticks:143983370693
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x9 rsi:0x7f04771a4078 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0492d83b20 r11:0x246 r12:0x0 r13:0x7f04771a4078 r14:0x9 r15:0x7f04771a4078 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562214 global_time:51900, event:`SYSCALLBUF_RESET' tid:7760, ticks:143983370693
}
{
  real_time:38257.562268 global_time:51901, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7760, ticks:143983370693
rax:0x9 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x9 rsi:0x7f04771a4078 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0492d83b20 r11:0x246 r12:0x0 r13:0x7f04771a4078 r14:0x9 r15:0x7f04771a4078 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562363 global_time:51902, event:`SYSCALLBUF_FLUSH' tid:7760, ticks:143983372177
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:38257.562367 global_time:51903, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:7760, ticks:143983372177
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x4 rsi:0x7f046a9a0338 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f046a9a0338 r14:0x4 r15:0x7f046a9a0338 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562369 global_time:51904, event:`SYSCALLBUF_RESET' tid:7760, ticks:143983372177
}
{
  real_time:38257.562395 global_time:51905, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7760, ticks:143983372177
rax:0x4 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x4 rsi:0x7f046a9a0338 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f046a9a0338 r14:0x4 r15:0x7f046a9a0338 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562444 global_time:51906, event:`SYSCALLBUF_FLUSH' tid:7760, ticks:143983372944
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:38257.562448 global_time:51907, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:7760, ticks:143983372944
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x7f047ba39840 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f047ba39840 r14:0x1 r15:0x7f047ba39840 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562449 global_time:51908, event:`SYSCALLBUF_RESET' tid:7760, ticks:143983372944
}
{
  real_time:38257.562475 global_time:51909, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7760, ticks:143983372944
rax:0x1 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x7f047ba39840 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f047ba39840 r14:0x1 r15:0x7f047ba39840 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.585211 global_time:51910, event:`SYSCALLBUF_FLUSH' tid:7760, ticks:143998759917
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:38257.585221 global_time:51911, event:`SIGNAL: SIGSEGV(det)' tid:7760, ticks:143998759917
rax:0x436487c0 rbx:0x7f0490b74330 rcx:0xb479 rdx:0x435f2dc8 rsi:0x435f2dc0 rdi:0x435f2dc0 rbp:0x7fff6ca71250 rsp:0x7fff6ca71230 r8:0x1 r9:0x7f0490b74330 r10:0x1007f0490b784d0 r11:0x4364d190 r12:0x0 r13:0x7f0490b88c70 r14:0x7f047daf36d0 r15:0xb479 rip:0x7f049e2ba8e3 eflags:0x10287 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.585223 global_time:51912, event:`SYSCALLBUF_RESET' tid:7760, ticks:143998759917
}
=== Start rr backtrace:
/home/keno/rr-build/bin/rr(_ZN2rr13dump_rr_stackEv+0x44)[0x55ea25994b6c]
/home/keno/rr-build/bin/rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x1a2)[0x55ea257db142]
/home/keno/rr-build/bin/rr(+0x364d00)[0x55ea2580cd00]
/home/keno/rr-build/bin/rr(_ZN2rr21EmergencyDebugOstreamD1Ev+0x62)[0x55ea2580cf5a]
/home/keno/rr-build/bin/rr(_ZN2rr13ReplaySession17check_pending_sigEPNS_10ReplayTaskE+0x156)[0x55ea258ec748]
/home/keno/rr-build/bin/rr(_ZN2rr13ReplaySession16continue_or_stepEPNS_10ReplayTaskERKNS0_15StepConstraintsENS_12TicksRequestENS_13ResumeRequestE+0x3a1)[0x55ea258ecb35]
/home/keno/rr-build/bin/rr(_ZN2rr13ReplaySession28emulate_deterministic_signalEPNS_10ReplayTaskEiRKNS0_15StepConstraintsE+0xda)[0x55ea258ee7ba]
/home/keno/rr-build/bin/rr(_ZN2rr13ReplaySession18try_one_trace_stepEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x134)[0x55ea258f0872]
/home/keno/rr-build/bin/rr(_ZN2rr13ReplaySession11replay_stepERKNS0_15StepConstraintsE+0x169)[0x55ea258f1cd7]
/home/keno/rr-build/bin/rr(_ZN2rr14ReplayTimeline19replay_step_forwardENS_10RunCommandEl+0x106)[0x55ea2590e13a]
/home/keno/rr-build/bin/rr(_ZN2rr9GdbServer14debug_one_stepERNS_10GdbRequestE+0x59b)[0x55ea257d840f]
/home/keno/rr-build/bin/rr(_ZN2rr9GdbServer12serve_replayERKNS0_15ConnectionFlagsE+0x548)[0x55ea257da278]
/home/keno/rr-build/bin/rr(+0x43f75e)[0x55ea258e775e]
/home/keno/rr-build/bin/rr(_ZN2rr13ReplayCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x400)[0x55ea258e819c]
/home/keno/rr-build/bin/rr(main+0x21b)[0x55ea259b04fa]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f0199aa2b97]
/home/keno/rr-build/bin/rr(_start+0x2a)[0x55ea25741d2a]
=== End rr backtrace

@Keno
Copy link
Member

Keno commented Sep 12, 2020

Can reproduce the rr issue.

r10:0x1007f0490b784d0

This pointer is odd. Stopping slightly before that crash, the pointer without the high bit set is Vector{Any}. The most straightforward explanation would be that something unrecorded set the high bit of that pointer, but I'm not sure what that would be.

@Keno
Copy link
Member

Keno commented Sep 12, 2020

Note that that's the lowest bit of the pointer's high byte. I'm not really sure what would be setting that. For our GC marking we use the low bits, and I don't think we do anything with the high bits anywhere.

@Keno
Copy link
Member

Keno commented Sep 12, 2020

@hh0rva1h I'd be interested if you see the same rr failure mode or whether it replays correctly for you. Could you try replaying your upload and seeing if it replays ok (BugReporting has a helper if you need it: https://github.com/JuliaLang/BugReporting.jl/blob/master/src/BugReporting.jl#L133).

@Keno
Copy link
Member

Keno commented Sep 12, 2020

I tried

(rr) hbreak *0x00007f049e2ba8e3

but that never gets hit. So whatever the divergence is, it starts earlier than that and this code path is never hit.

@Keno
Copy link
Member

Keno commented Sep 12, 2020

Also, I'd be interested in another recording of the same crash to see if it breaks in the same way.

@Keno
Copy link
Member

Keno commented Sep 12, 2020

@hh0rva1h As a general comment, you appear to be allocating super long tuples of tiny arrays. That's exactly the wrong way around. You'll want to use arrays for large variable sized things and tuples for small structural things.

@Keno
Copy link
Member

Keno commented Sep 12, 2020

Right, so it's looking at the 43840'th element of the array of types that it allocates the tuple from. In the recording that that's just Array{Any, 1}, but something on your machine flipped the low bit of the high byte of that pointer, which first caused the type cache lookup to fail and then enventually caused it to fail to allocate the new tuple when it tried to look at the corrupted pointer. More recordings of the same failure would be helpful, as would some information about the machine doing the recording (kernel version, any unusual device drivers, ECC memory?). Also, if possible try a different machine.

@hh0rva1h
Copy link
Author

Yet another trace:
https://s3.amazonaws.com/julialang-dumps/reports/2020-09-12T14-59-59-hh0rva1h.tar.zst

Regarding the machine: It's a Dell Precision M4700, so I guess no ECC memory since thats a consumer device, no unusual device drivers, kernel info:

Linux pc 5.6.0-1023-oem #23-Ubuntu SMP Tue Aug 25 06:16:32 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

I was not able to reproduce this issue on another Computer (Dell XPS 13 9360), another colleague also failed to reproduce, so this seems to happen only on my machine. I did check journalctl for any mce messages, however I did not notice anything suspicious. The system is behaving quite stable, no kernel panics or other segfaults ...

@Keno
Copy link
Member

Keno commented Sep 12, 2020

Can you try doing the replay yourself on your machine and seeing if you observe the same failure to replay?

@hh0rva1h
Copy link
Author

@Keno Thx for all your effort, could you outline me how I should replay? I have not experience with rr yet.

@Keno
Copy link
Member

Keno commented Sep 12, 2020

julia> using BugReporting
[ Info: Precompiling BugReporting [bcf9a6e7-4020-453c-b88e-690564246bb8]

julia> BugReporting.replay("https://s3.amazonaws.com/julialang-dumps/reports/2020-09-11T08-25-23-hh0rva1h.tar.zst")
[snip]
Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.31.so...
0x00007f049e9f2100 in _start () from /lib64/ld-linux-x86-64.so.2
(rr) c
Continuing.

@Keno
Copy link
Member

Keno commented Sep 12, 2020

As for the new trace, the symptoms are the same:

{
  real_time:27024.688030 global_time:211981, event:`SIGNAL_DELIVERY: SIGSEGV(det)' tid:6029, ticks:759211181056
rax:0x7fc8075f0c90 rbx:0x6caf rcx:0x8a2 rdx:0xc12 rsi:0x2c8 rdi:0x7fc8239a0920 rbp:0x7ffc2d501210 rsp:0x7ffc2d501160 r8:0x1007fc7f047b040 r9:0x1 r10:0x7fc806371490 r11:0x2 r12:0x7fc828e304d8 r13:0x7ffc2d5012a0 r14:0x7fc8239a0910 r15:0x7fc82d5edc60 rip:0x7fc82d1b3cdc eflags:0x10202 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7fc828e34240 gs_base:0x0
}

r8 is an otherwise valid pointer that has the low bit of its high byte flipped. I'm still replaying to see if that was caused by any architecturally executed instruction, but I'm assuming not. My best guess that this point is that one of your RAM DIMMs has a sticky bit that gets stuck when the DIMM gets hot. Unless you can reproduce on another machine, I'm not sure there's much for us to do here. I'd recommend maybe running memtest on that machine to see if you can catch the memory error.

@Keno
Copy link
Member

Keno commented Sep 12, 2020

Alright, the replay just finished, and yeah, same deal, somehow that bit gets flipped and things break.

@hh0rva1h
Copy link
Author

@Keno Ah thx, I figured it out from the link you gave a few comments above, but thanks anyway:

Episode: 2605
[FATAL /workspace/srcdir/rr/src/ReplaySession.cc:613:check_pending_sig()] 
 (task 2636 (rec:7760) at time 51911)
 -> Assertion `false' failed to hold. Replaying `SIGNAL: SIGSEGV(det)': expecting tracee signal or trap, but instead at `write' (ticks: 144040993972)
Tail of trace dump:
{
  real_time:38257.508732 global_time:51891, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:7760, ticks:143952193134
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x4 rsi:0x7f046af10e38 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f046af10e38 r14:0x4 r15:0x7f046af10e38 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.508734 global_time:51892, event:`SYSCALLBUF_RESET' tid:7760, ticks:143952193134
}
{
  real_time:38257.508760 global_time:51893, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7760, ticks:143952193134
rax:0x4 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x4 rsi:0x7f046af10e38 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f046af10e38 r14:0x4 r15:0x7f046af10e38 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.508810 global_time:51894, event:`SYSCALLBUF_FLUSH' tid:7760, ticks:143952193901
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:38257.508813 global_time:51895, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:7760, ticks:143952193901
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x7f047b1f43f0 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f047b1f43f0 r14:0x1 r15:0x7f047b1f43f0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.508815 global_time:51896, event:`SYSCALLBUF_RESET' tid:7760, ticks:143952193901
}
{
  real_time:38257.508841 global_time:51897, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7760, ticks:143952193901
rax:0x1 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x7f047b1f43f0 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f047b1f43f0 r14:0x1 r15:0x7f047b1f43f0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562196 global_time:51898, event:`SYSCALLBUF_FLUSH' tid:7760, ticks:143983370693
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:38257.562212 global_time:51899, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:7760, ticks:143983370693
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x9 rsi:0x7f04771a4078 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0492d83b20 r11:0x246 r12:0x0 r13:0x7f04771a4078 r14:0x9 r15:0x7f04771a4078 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562214 global_time:51900, event:`SYSCALLBUF_RESET' tid:7760, ticks:143983370693
}
{
  real_time:38257.562268 global_time:51901, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7760, ticks:143983370693
rax:0x9 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x9 rsi:0x7f04771a4078 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0492d83b20 r11:0x246 r12:0x0 r13:0x7f04771a4078 r14:0x9 r15:0x7f04771a4078 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562363 global_time:51902, event:`SYSCALLBUF_FLUSH' tid:7760, ticks:143983372177
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:38257.562367 global_time:51903, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:7760, ticks:143983372177
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x4 rsi:0x7f046a9a0338 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f046a9a0338 r14:0x4 r15:0x7f046a9a0338 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562369 global_time:51904, event:`SYSCALLBUF_RESET' tid:7760, ticks:143983372177
}
{
  real_time:38257.562395 global_time:51905, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7760, ticks:143983372177
rax:0x4 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x4 rsi:0x7f046a9a0338 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f046a9a0338 r14:0x4 r15:0x7f046a9a0338 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562444 global_time:51906, event:`SYSCALLBUF_FLUSH' tid:7760, ticks:143983372944
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:38257.562448 global_time:51907, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:7760, ticks:143983372944
rax:0xffffffffffffffda rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x7f047ba39840 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f047ba39840 r14:0x1 r15:0x7f047ba39840 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.562449 global_time:51908, event:`SYSCALLBUF_RESET' tid:7760, ticks:143983372944
}
{
  real_time:38257.562475 global_time:51909, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:7760, ticks:143983372944
rax:0x1 rbx:0x681fffa0 rcx:0xffffffffffffffff rdx:0x1 rsi:0x7f047ba39840 rdi:0xd rbp:0xd rsp:0x681ffe10 r8:0x0 r9:0x7f0490095f90 r10:0x7f0475ad7bb0 r11:0x246 r12:0x0 r13:0x7f047ba39840 r14:0x1 r15:0x7f047ba39840 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.585211 global_time:51910, event:`SYSCALLBUF_FLUSH' tid:7760, ticks:143998759917
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
  { syscall:'epoll_pwait', ret:0x0, size:0x10 }
  { syscall:'clock_gettime', ret:0x0, size:0x20 }
}
{
  real_time:38257.585221 global_time:51911, event:`SIGNAL: SIGSEGV(det)' tid:7760, ticks:143998759917
rax:0x436487c0 rbx:0x7f0490b74330 rcx:0xb479 rdx:0x435f2dc8 rsi:0x435f2dc0 rdi:0x435f2dc0 rbp:0x7fff6ca71250 rsp:0x7fff6ca71230 r8:0x1 r9:0x7f0490b74330 r10:0x1007f0490b784d0 r11:0x4364d190 r12:0x0 r13:0x7f0490b88c70 r14:0x7f047daf36d0 r15:0xb479 rip:0x7f049e2ba8e3 eflags:0x10287 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f0499f49240 gs_base:0x0
}
{
  real_time:38257.585223 global_time:51912, event:`SYSCALLBUF_RESET' tid:7760, ticks:143998759917
}
=== Start rr backtrace:
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x249)[0x503349]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr[0x51d664]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr(_ZN2rr13ReplaySession17check_pending_sigEPNS_10ReplayTaskE+0xc0)[0x47f8e0]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr[0x541de1]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr(_ZN2rr13ReplaySession28emulate_deterministic_signalEPNS_10ReplayTaskEiRKNS0_15StepConstraintsE+0x7c)[0x47f3ac]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr(_ZN2rr13ReplaySession18try_one_trace_stepEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x154)[0x480f74]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr(_ZN2rr13ReplaySession11replay_stepERKNS0_15StepConstraintsE+0xf1)[0x4852e1]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr(_ZN2rr14ReplayTimeline19replay_step_forwardENS_10RunCommandEl+0x94)[0x55b7a4]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr(_ZN2rr9GdbServer14debug_one_stepERNS_10GdbRequestE+0xcd9)[0x504329]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr(_ZN2rr9GdbServer12serve_replayERKNS0_15ConnectionFlagsE+0x3fb)[0x5049db]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr[0x44d2f4]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr(_ZN2rr13ReplayCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x478)[0x44da98]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr(main+0x2c1)[0x443131]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f84fd2fb0b3]
/home/user/.julia/artifacts/8d4aa64cc85b308b4619a66038db4d983df98b9b/bin/rr[0x44322f]
=== End rr backtrace
Launch gdb with
  gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:2636' /home/user/RL-Carla/.julia/artifacts/f5de3f7dbf1c5c1908dc7a4c978252cf18d45076/julia-0/mmap_clone_3_julia

I'll run memtest on my machine. Thanks very much for your great support!

@Keno
Copy link
Member

Keno commented Sep 12, 2020

Yeah, that's the same behavior we're seeing. That's good. That indicates there there is indeed something transient on your machine that is flipping bits, rather than say some weird architectural behavior of your particular CPU. Since I don't think there's anything actionable here, I'm gonna close the issue, but do let us know what you find. And of course if you do end up reproducing on another machine feel free to ask for the issue to be reopened.

@Keno Keno closed this as completed Sep 12, 2020
@hh0rva1h
Copy link
Author

memerrors

@Keno You were absolutely right, thanks very much again for your amazing support and sorry for the noise!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants