tests.unit.python.execution_tree.eval test fails on POWER8/Clang #584
The Release and Debug errors seem to be unrelated. While the release error comes out of an actual action invocation, the Debug error happens in Spirit during parsing (presumably a PhySL expression). Both actually could be stack overflows :/ |
I think I eliminated the stack overflow issue by doubling the stack size (changing |
@khuck I don't think |
OK, trying with |
Running with
#0 0x00003fffaf14cdd0 in hpx::threads::thread_data::set_description(hpx::util::thread_description) ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#1 0x00003fffaf149d4c in hpx::threads::set_thread_description(hpx::threads::thread_id_type const&, hpx::util::thread_description const&, hpx::error_code&) ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#2 0x00003fffb027ce40 in hpx::util::annotate_function::annotate_function(char const*) ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#3 0x00003fffb027b294 in phylanx::execution_tree::primitives::primitive_component_base::do_eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#4 0x00003fffb0244a90 in phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
Could it be that there is an operation that is just missing an annotation, or is getting mis-annotated in some way? |
After compiling with Clang 6.0 on an x86_64 machine, I think I confirmed it's a POWER8-specific problem. Is there something specific about this particular primitive that does something unusual? |
@hkaiser - another clue... as you pointed out, the crash is in:
#367 0x00003fffb0609664 in phylanx::bindings::expression_evaluator(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, phylanx::bindings::compiler_state&, pybind11::args)::{lambda()#1}::operator()() const (this=0x3fffffffcd88)
at /home/users/khuck/src/phylanx/python/src/bindings/binding_helpers.hpp:181
...but the expression it is parsing is not that crazy:
181 auto xexpr = phylanx::ast::generate_ast(xexpr_str);
(gdb) print xexpr_str
$1 = "\nblock(\n define(fib,n,\n if(n<2,n,\n fib(n-1)+fib(n-2))),\n fib)"
except for the fact that it is a recursive definition. Also, this didn't crash when I built it on an x86_64 machine (HPX and Phylanx were built with Clang 5.0) that used the Ubuntu Boost package (built by GCC, I assume), whereas the machine that is crashing was using a Boost built by Clang 5.0. Also, I built the test that crashed with |
@hkaiser The stack stuff might be a red herring. I have another clue. I tried running a RelWithDebInfo build. It crashes, but in a different way. 2 steps up the stack, the program is in the "eval" method of the primitive_component base class. When I dereference the "this" pointer, I get this back:
...which seems OK, except for the RTTI warning. Then, stepping down the stack things get interesting:
...which also seems OK. But then, taking one more step into the concrete instance of the object:
...you'll notice the "this" pointer is null! So for some reason, this object is either corrupted, or...? Is something missing from the implementation of |
@hkaiser any thoughts on the above? I thought maybe it was because access_function didn't inherit from |
@khuck: I don't think this causes the issue we're seeing. All primitives are kept alive by a |
@hkaiser ok. I started playing with the code in eval.py, and it's crashing on the definition of fib10 (compressed here):
BUT if I change it to fib9, it works:
...and the same is true of the fib() function defined later, if I call it with fib(9) it's OK, but fib(10) crashes. So it is stack related, but it's the stack of the AST that is the problem. Reminder, this is Clang 5.0 on POWER8, so different beast than GCC on x86_64. |
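One way to read the fib(9) versus fib(10) observation: the naive recursion gets exactly one level deeper per increment of n. If each PhySL-level call costs a roughly fixed number of native frames (an assumption, not something measured here), the native stack grows linearly with n and crosses any fixed limit at some threshold. The Python sketch below only counts recursion depth and total calls for the same definition. It does not model Phylanx or HPX, and the helper name `fib_profile` is made up for illustration.

```python
def fib_profile(n):
    """Return (value, max_depth, calls) for the naive recursive Fibonacci."""
    stats = {"calls": 0, "max_depth": 0}

    def fib(k, depth):
        # Record every invocation and the deepest nesting level reached.
        stats["calls"] += 1
        stats["max_depth"] = max(stats["max_depth"], depth)
        if k < 2:
            return k
        return fib(k - 1, depth + 1) + fib(k - 2, depth + 1)

    value = fib(n, 1)
    return value, stats["max_depth"], stats["calls"]


for n in (9, 10, 16):
    value, depth, calls = fib_profile(n)
    print(f"fib({n}) = {value}, max depth = {depth}, calls = {calls}")
```

The depth differs by a single level between n = 9 and n = 10, which fits a fixed stack budget being exceeded rather than anything special about the argument itself.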
Recursive functions (fibonacci) eventually crash on power8/clang with direct actions, so this change will prevent that crash from happening. Issue #584 still needs to be fixed.
Yup, stack related. This issue will stay open, but a work-around for that platform has been committed. See pull request #601 |
This PR enables stack overflow prevention in HPX on Power platforms: STEllAR-GROUP/hpx#3469. Please verify. |
@khuck I believe the calculation of the remaining amount of stack space in my original patch was wrong. Could you try again, please? |
@hkaiser nope, same error. I have asked for someone to send me the instructions for getting an account on our system if you want to test it yourself... |
@khuck can you try a build with address sanitizer? It is usually very accurate at pinpointing issues |
@sithhell I did. I ran into so many linker issues I couldn't figure out how to fix them. I tried with valgrind, but after 3-4 hours building a suppression file, I was no closer to the cause of the problem. |
@khuck for the linker errors, configure your HPX build with |
@sithhell IIRC, building HPX wasn't the problem, but building Phylanx was. The address sanitizer library was supposed to be first in the link order, but it wasn't. Besides, I built Clang 5.0 myself for this machine, and it's possible I didn't configure/build the sanitizer libraries correctly. |
@sithhell yes it would - are you volunteering? :) |
OK, I attempted to make a Phylanx docker image that uses sanitize. I fail at the Phylanx link step. https://gist.github.com/stevenrbrandt/56cc36a9c9cb0375ae264c398d0e3431 Setting However, setting Regardless, however, I can't run build]# bin/physl --doc
==15==Your application is linked against incompatible ASan runtimes.
Not sure how that comes about, since I only have the default Clang / libasan installed. @sithhell Any idea what I'm doing wrong? |
Address Sanitizer is really temperamental sometimes… Instead of adding `-lasan`, can you add the specific library? i.e. `/path/to/compiler/lib/libasan.so` instead of `-lasan` to make sure you get the right one.
Kevin
|
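As a rough illustration of the suggestion above, the substitution amounts to swapping the `-lasan` flag for one explicit library path in the link arguments. The helper name `use_explicit_asan` and the example argument list are hypothetical, and the path is the same placeholder used in the comment, not a real location.

```python
def use_explicit_asan(link_args, asan_library):
    """Replace every `-lasan` in a linker argument list with an explicit library path."""
    return [asan_library if arg == "-lasan" else arg for arg in link_args]


# Hypothetical example values; the real path depends on the compiler install.
args = ["-fsanitize=address", "-lhpx", "-lasan"]
print(use_explicit_asan(args, "/path/to/compiler/lib/libasan.so"))
```

Naming the file directly removes the linker's library search from the picture, so only one runtime can be picked up from that flag.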
@khuck I've discovered the |
@khuck @sithhell Ok, this uses
==27==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==27==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==27==This might be related to ELF_ET_DYN_BASE change in Linux 4.12.
==27==See https://github.com/google/sanitizers/issues/856 for possible workarounds.
==27==Process memory map follows:
0x000000400000-0x0000007be000 /usr/bin/python3.6
0x0000009bd000-0x0000009be000 /usr/bin/python3.6
0x0000009be000-0x000000a5b000 /usr/bin/python3.6
0x000000a5b000-0x000000a8f000
Not sure what to do at this point. |
@stevenrbrandt just curious - are you using the system allocator or tcmalloc/jemalloc? |
@stevenrbrandt also, what happens if you run an example without python involved? like lra_csv or something like that? |
@khuck I'm using the system allocator; see the Docker file I linked. You can't even run "physl --doc" without problems:
# bin/physl --doc
terminate called after throwing an instance of 'std::runtime_error'
what(): Cannot instantiate more than one affinity data instance
Aborted |
So, a small success (I think). The problem seems to have partly been the 80-core cluster I built it on... Running on a smaller machine, I get the output below. You can try out stevenrbrandt/phylanx.sanitized from Docker yourself.
# ./bin/physl --doc
=================================================================
==27==ERROR: AddressSanitizer: odr-violation (0x7fb8b739c940):
[1] size=32 'hpx::util::detail::global_fixture' /hpx/src/util/lightweight_test.cpp:56:13
[2] size=32 'hpx::util::detail::global_fixture' /hpx/src/util/lightweight_test.cpp:56:13
These globals were registered at these points:
[1]:
#0 0x7fb8cd3385c8 (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0x675c8)
#1 0x7fb8b46582dd in asan.module_ctor (/usr/local/lib/phylanx/libphylanx_solversd.so+0x39b2dd)
[2]:
#0 0x7fb8cd3385c8 (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0x675c8)
#1 0x7fb8b6e14f7d in asan.module_ctor (/usr/local/lib/phylanx/libphylanx_arithmeticsd.so+0x2503f7d)
==27==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
SUMMARY: AddressSanitizer: odr-violation: global 'hpx::util::detail::global_fixture' at /hpx/src/util/lightweight_test.cpp:56:13
==27==ABORTING |
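The HINT line in the report names `ASAN_OPTIONS=detect_odr_violation=0`. A minimal sketch of applying it when launching the binary from a script follows. The helper names `with_sanitizer_options` and `run_physl_doc` are invented for this example, and the default binary path is just the one shown in the log. Note that this silences the ODR report rather than fixing the duplicated global.

```python
import os
import subprocess


def with_sanitizer_options(base_env, variable, options):
    """Return a copy of base_env with `options` merged into a sanitizer variable.

    Sanitizer option strings are colon-separated key=value pairs,
    e.g. verbosity=1:log_threads=1.
    """
    merged = {}
    for item in base_env.get(variable, "").split(":"):
        if "=" in item:
            key, value = item.split("=", 1)
            merged[key] = value
    merged.update(options)
    env = dict(base_env)
    env[variable] = ":".join(f"{k}={v}" for k, v in merged.items())
    return env


def run_physl_doc(binary="./bin/physl"):
    """Run `physl --doc` with ODR-violation detection turned off (not called here)."""
    env = with_sanitizer_options(
        os.environ, "ASAN_OPTIONS", {"detect_odr_violation": "0"}
    )
    return subprocess.run([binary, "--doc"], env=env)
```

Merging instead of overwriting keeps any options that are already set in the calling environment.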
Did you try |
@khuck using that setting, the
==96==Could not attach to thread 72 (errno 1).
==96==Failed suspending threads.
==72==LeakSanitizer has encountered a fatal error.
==72==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==72==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc) |
OK, so Python doesn't work, but we can generate PhySL code from Python on another machine and run the PhySL code inside the phylanx.sanitized image. Trying LRA with the PhySL interpreter (./examples/interpreter/lra.physl), I get this:
==191==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f4897636550 at pc 0x7f48d0a1017c bp 0x7f4897636540 sp 0x7f4897635cf0
WRITE of size 16 at 0x7f4897636550 thread T15
#0 0x7f48d0a1017b (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xd717b)
#1 0x7f48c2deb290 in std::chrono::_V2::steady_clock::now() (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xb5290)
#2 0x7f48cfa4a8b7 in hpx::util::high_resolution_clock::now() /usr/local/include/hpx/util/high_resolution_clock.hpp:30:17
#3 0x7f48cfa4843a in phylanx::util::scoped_timer<long>::~scoped_timer() /phylanx/phylanx/util/scoped_timer.hpp:37:25
#4 0x7f48cfa42918 in phylanx::execution_tree::primitives::primitive_component_base::do_eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const /phylanx/src/execution_tree/primitives/primitive_component_base.cpp:103:5
#5 0x7f48cf9687da in phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const /phylanx/src/execution_tree/primitives/primitive_component.cpp:123:28 |
OK, using the Address Sanitizer
block(
    define(fib,n,
        if(n<2,n,
            fib(n-1)+fib(n-2))),
    cout(fib(16))
)
runs without difficulty. This was the code that originally prompted the ticket (correct me if I'm wrong). |
Of course, this was clang 7 not 8 |
@khuck I've updated stevenrbrandt/phylanx.sanitized so that it's only 6.6GB. Still not tiny. |
@khuck should we get this sanitized phylanx image running in your test framework? |
@stevenrbrandt yes, if you could send me the cmake configuration steps, I would appreciate it. |
@khuck |
The Release build call stack is massive (318 functions deep) and the test fails this way:
The Debug build fails in a different location, but with an equally massive call stack (in the ~436 range). In the Debug build, it appears a boost "unused" type is passed as an attribute/context somewhere deep in boost: