Valgrind failure due to multidimensional.jl stagedfunction change #10171

Closed
garrison opened this issue Feb 12, 2015 · 14 comments
@garrison

While working to get all tests to pass under valgrind (PR #10015), I noticed that running test/strings.jl fails. While the precise error is not constant across revisions, git-bisect tells me that 373e61b (part of PR #9098) introduced the change that leads to the failure. With this commit, the error is as follows:

==25386== Warning: set address range perms: large range [0x3a048000, 0x23ad5c000) (defined)
==25386== Warning: client switching stacks?  SP change: 0x9543c88 --> 0xffeffda80
==25386==          to suppress, use: --max-stackframe=68546174456 or greater

valgrind: m_mallocfree.c:304 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 104226312, hi = 0.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.


host stacktrace:
==25386==    at 0x3805DAB6: show_sched_status_wrk (in /nix/store/3a2mk27v0l6h8zvgg1l1pl6dfcsxw3kb-valgrind-3.10.1/lib/valgrind/memcheck-amd64-linux)
==25386==    by 0x3805DBC4: report_and_quit (in /nix/store/3a2mk27v0l6h8zvgg1l1pl6dfcsxw3kb-valgrind-3.10.1/lib/valgrind/memcheck-amd64-linux)
==25386==    by 0x3805DD46: vgPlain_assert_fail (in /nix/store/3a2mk27v0l6h8zvgg1l1pl6dfcsxw3kb-valgrind-3.10.1/lib/valgrind/memcheck-amd64-linux)
==25386==    by 0x3806B4A1: vgPlain_arena_free (in /nix/store/3a2mk27v0l6h8zvgg1l1pl6dfcsxw3kb-valgrind-3.10.1/lib/valgrind/memcheck-amd64-linux)
==25386==    by 0x3802CAA5: create_MC_Chunk (in /nix/store/3a2mk27v0l6h8zvgg1l1pl6dfcsxw3kb-valgrind-3.10.1/lib/valgrind/memcheck-amd64-linux)
==25386==    by 0x3802CC9B: vgMemCheck_new_block (in /nix/store/3a2mk27v0l6h8zvgg1l1pl6dfcsxw3kb-valgrind-3.10.1/lib/valgrind/memcheck-amd64-linux)
==25386==    by 0x3802CE6A: vgMemCheck___builtin_new (in /nix/store/3a2mk27v0l6h8zvgg1l1pl6dfcsxw3kb-valgrind-3.10.1/lib/valgrind/memcheck-amd64-linux)
==25386==    by 0x380AF155: vgPlain_scheduler (in /nix/store/3a2mk27v0l6h8zvgg1l1pl6dfcsxw3kb-valgrind-3.10.1/lib/valgrind/memcheck-amd64-linux)
==25386==    by 0x380BDC1C: run_a_thread_NORETURN (in /nix/store/3a2mk27v0l6h8zvgg1l1pl6dfcsxw3kb-valgrind-3.10.1/lib/valgrind/memcheck-amd64-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable
==25386==    at 0x4C28085: operator new(unsigned long) (in /nix/store/3a2mk27v0l6h8zvgg1l1pl6dfcsxw3kb-valgrind-3.10.1/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==25386==    by 0x662BE98: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20)
==25386==    by 0x662BF13: std::string::_M_mutate(unsigned long, unsigned long, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20)
==25386==    by 0x662C52D: std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20)
==25386==    by 0x4F40D0F: emit_function(_jl_lambda_info_t*, bool) (codegen.cpp:3661)
==25386==    by 0x4F1F5CB: to_function(_jl_lambda_info_t*, bool) (codegen.cpp:581)
==25386==    by 0x4F1FA9B: jl_compile (codegen.cpp:709)
==25386==    by 0x4F10ED2: jl_trampoline_compile_function (builtins.c:918)
==25386==    by 0x4F10FF6: jl_trampoline (builtins.c:933)
==25386==    by 0x4F0141D: jl_apply (julia.h:1066)
==25386==    by 0x4F07306: jl_apply_generic (gf.c:1685)
==25386==    by 0x8A4DDB0: ???
==25386==    by 0x8A4DCFA: ???
==25386==    by 0x4F0141D: jl_apply (julia.h:1066)
==25386==    by 0x4F07203: jl_apply_generic (gf.c:1665)
==25386==    by 0x8A4DDB0: ???
==25386==    by 0x8A4DCFA: ???
==25386==    by 0x4F0D156: jl_apply (julia.h:1066)
==25386==    by 0x4F1100C: jl_trampoline (builtins.c:934)
==25386==    by 0x4F0141D: jl_apply (julia.h:1066)
==25386==    by 0x4F07306: jl_apply_generic (gf.c:1685)
==25386==    by 0x8A4DC41: ???
==25386==    by 0x4F0D156: jl_apply (julia.h:1066)
==25386==    by 0x4F1100C: jl_trampoline (builtins.c:934)
==25386==    by 0x4F0D156: jl_apply (julia.h:1066)
==25386==    by 0x4F0F08C: jl_f_apply (builtins.c:457)
==25386==    by 0x8A4D983: ???
==25386==    by 0x4F0D156: jl_apply (julia.h:1066)
==25386==    by 0x4F1100C: jl_trampoline (builtins.c:934)
==25386==    by 0x4F0141D: jl_apply (julia.h:1066)
==25386==    by 0x4F07306: jl_apply_generic (gf.c:1685)
==25386==    by 0x8A4D66D: ???
==25386==    by 0x4F0D156: jl_apply (julia.h:1066)
==25386==    by 0x4F1100C: jl_trampoline (builtins.c:934)
==25386==    by 0x4F0141D: jl_apply (julia.h:1066)
==25386==    by 0x4F07306: jl_apply_generic (gf.c:1685)
==25386==    by 0x7B11D0F: julia__start_41098 (client.jl:449)
==25386==    by 0x7B129A8: jlcall__start_41098 (in /home/garrison/julia/usr/lib/julia/sys.so)
==25386==    by 0x4F0D156: jl_apply (julia.h:1066)
==25386==    by 0x4F1100C: jl_trampoline (builtins.c:934)
==25386==    by 0x4F0141D: jl_apply (julia.h:1066)
==25386==    by 0x4F07203: jl_apply_generic (gf.c:1665)
==25386==    by 0x40145F: jl_apply (julia.h:1066)
==25386==    by 0x402163: true_main (repl.c:340)
==25386==    by 0x402331: main (repl.c:392)

Thread 2: status = VgTs_WaitSys
==25386==    at 0x635C08F: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.S:185)
==25386==    by 0xA3ADCAA: blas_thread_server (in /home/garrison/julia/usr/lib/libopenblas.so)
==25386==    by 0x63580A3: start_thread (pthread_create.c:309)
==25386==    by 0x6E74CCC: clone (clone.S:111)

Thread 3: status = VgTs_WaitSys
==25386==    at 0x635C08F: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.S:185)
==25386==    by 0xA3ADCAA: blas_thread_server (in /home/garrison/julia/usr/lib/libopenblas.so)
==25386==    by 0x63580A3: start_thread (pthread_create.c:309)
==25386==    by 0x6E74CCC: clone (clone.S:111)

Thread 4: status = VgTs_WaitSys
==25386==    at 0x635C08F: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.S:185)
==25386==    by 0xA3ADCAA: blas_thread_server (in /home/garrison/julia/usr/lib/libopenblas.so)
==25386==    by 0x63580A3: start_thread (pthread_create.c:309)
==25386==    by 0x6E74CCC: clone (clone.S:111)

Thread 5: status = VgTs_WaitSys
==25386==    at 0x635C08F: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.S:185)
==25386==    by 0xA3ADCAA: blas_thread_server (in /home/garrison/julia/usr/lib/libopenblas.so)
==25386==    by 0x63580A3: start_thread (pthread_create.c:309)
==25386==    by 0x6E74CCC: clone (clone.S:111)

Thread 6: status = VgTs_WaitSys
==25386==    at 0x635C08F: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.S:185)
==25386==    by 0xA3ADCAA: blas_thread_server (in /home/garrison/julia/usr/lib/libopenblas.so)
==25386==    by 0x63580A3: start_thread (pthread_create.c:309)
==25386==    by 0x6E74CCC: clone (clone.S:111)

Thread 7: status = VgTs_WaitSys
==25386==    at 0x635C08F: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.S:185)
==25386==    by 0xA3ADCAA: blas_thread_server (in /home/garrison/julia/usr/lib/libopenblas.so)
==25386==    by 0x63580A3: start_thread (pthread_create.c:309)
==25386==    by 0x6E74CCC: clone (clone.S:111)

Thread 8: status = VgTs_WaitSys
==25386==    at 0x635C08F: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.S:185)
==25386==    by 0xA3ADCAA: blas_thread_server (in /home/garrison/julia/usr/lib/libopenblas.so)
==25386==    by 0x63580A3: start_thread (pthread_create.c:309)
==25386==    by 0x6E74CCC: clone (clone.S:111)


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

At first I thought this might be a valgrind bug, but the "client switching stacks?" warning does not appear on the commit before this one, c8c5b6f. (In both cases I applied the change in 1296c04 so that there would be no known memory errors in the program, in case that was a confounding factor.) I am using the latest valgrind, version 3.10.1.

CC: @timholy

@timholy

timholy commented Feb 12, 2015

You're doing some pretty heroic detective work.

Unfortunately, I fear it's a bug exposed by that commit rather than introduced by it. Other than @inbounds (can I ask you to run with --check-bounds=yes?), there's nothing obvious in that commit that messes with memory in anything other than what are supposed to be safe ways.

I have to say, in the past valgrind seemed almost like a magic bullet for zeroing in on memory bugs, but I've been dismayed that these errors are not easier to track down in Julia. Does anyone have any insights about why?

@garrison

I get the same exact behavior when I run with --check-bounds=yes, so I don't think it is due to @inbounds. I was thinking perhaps this commit is exposing an issue with stagedfunctions. If so, it may be possible to reduce the size of the test case.

I'm hoping that once the entire testsuite runs (and passes) under Valgrind, we will be able to detect, identify and fix memory errors much more quickly than now. Hopefully this ends up working out :).

For now I will probably proceed by trying to find another test file (other than strings.jl) that fails under valgrind and bisecting to find the bad commit there. At the moment I expect there may be more than one bug preventing the full test suite from running.

@timholy

timholy commented Feb 12, 2015

I was thinking perhaps this commit is exposing an issue with stagedfunctions

That's certainly my thinking as well, given that #9098 could be subtitled "invasion of the stagedfunctions."

@timholy

timholy commented Feb 12, 2015

To answer my own question above, I bet we need to teach valgrind about julia's mempools.

@garrison

@timholy that is a great idea, one that should probably take higher priority than a bunch of us running around trying to find memory bugs here and there ;). Python takes an alternative approach, which is to suggest that people disable memory pools when running under valgrind. The obvious disadvantage is that running under valgrind then requires a separate build (and would then probably be done less frequently), but depending on the current structure of the code, this may get us more information sooner than teaching valgrind about julia's mempools.

@garrison

Actually, it seems that defining MEMDEBUG will disable memory pools; in fact, src/options.h explicitly mentions that it's good to define MEMDEBUG when using valgrind; I'll try this and add a note to the existing Valgrind docs. There's also MEMDEBUG2, which is referenced in flisp, but seems to be undocumented. My best guess is that it's a second MEMDEBUG flag, but just for flisp.
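For anyone wanting to reproduce this, a minimal sketch of enabling MEMDEBUG before rebuilding. The sed pattern assumes the define appears commented out in src/options.h, which may not match every Julia revision; check the file by hand if the edit doesn't take:

```shell
# Hypothetical sketch: enable MEMDEBUG so Julia's pooled allocator is
# bypassed and valgrind sees each allocation individually.
# Assumption: src/options.h contains a commented-out "//#define MEMDEBUG".
sed -i 's|// *#define MEMDEBUG|#define MEMDEBUG|' src/options.h

# A full rebuild is needed for the define to take effect everywhere.
make cleanall && make
```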

@timholy

timholy commented Feb 13, 2015

Nice, thanks @garrison.

@garrison

I'm going to close this since I believe I figured it out. The key is that we need to pass --smc-check=all-non-file to valgrind for it to behave properly with a JIT compiler. In pull request #10015 I will redo things such that the valgrind docs are improved and this is mentioned there.
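For the record, an invocation along these lines is what worked for me (paths are illustrative; run from the top of the Julia tree). --smc-check=all-non-file tells valgrind to re-check non-file-backed code, i.e. JIT-generated buffers, for self-modification, which is exactly what a JIT compiler needs:

```shell
# Run one test file under valgrind with JIT-aware self-modifying-code
# checking. Without --smc-check=all-non-file, valgrind may keep stale
# translations of JITted code and crash or corrupt the run.
valgrind --smc-check=all-non-file ./julia test/runtests.jl strings
```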

@Keno

Keno commented Feb 17, 2015

I'm surprised LLVM doesn't use the client request to invalidate the code cache.

@garrison

@Keno I'm surprised too, but passing this flag makes the full test suite work under Nulgrind (valgrind --tool=none) for me. Without this flag, I couldn't get the test suite to work even on various releases of Julia (e.g. v0.3.5, v0.2.0, etc).

It is mentioned in the Valgrind docs that only some architectures require programs to flush the instruction cache, so perhaps that is what is going on.

@Keno

Keno commented Feb 17, 2015

Posted this in the wrong thread earlier:
LLVM has llvm::sys::ValgrindDiscardTranslations. If that's not being run, that's an llvm bug.

@garrison

@Keno Or perhaps HAVE_VALGRIND_VALGRIND_H is false sometimes (or even always) when LLVM is built as part of the Julia build.

@garrison

On second thought, HAVE_VALGRIND_VALGRIND_H is true for me (at least) in build_Release/include/llvm/Config/config.h, so I must be wrong.

@garrison

@Keno Actually, LLVM might not be the problem (or might not be the only problem). In fact, the regex tests were some of the ones that failed the most consistently under valgrind. I was just watching a slow build of julia's dependencies and noticed that "Valgrind support" is turned off in the build of pcre. The README suggests that --enable-valgrind is only helpful for informational purposes (and this seems to be the case from looking at the source), but this got me wondering whether pcre itself might contain self-modifying code.

I looked at the pcre source, and in sljit/sljitLir.h it mentions:

For valgrind users: pass --smc-check=all argument to valgrind, since JIT is a "self-modifying code"

So I think this is one of the issues.
