tests.unit.python.execution_tree.eval test fails on POWER8/Clang #584

Open
khuck opened this issue Sep 5, 2018 · 38 comments

@khuck
Contributor

khuck commented Sep 5, 2018

The Release build call stack is massive (318 functions deep) and the test fails this way:

[khuck@centaur phylanx-Release]$ gdb --args /usr/local/packages/python3/3.6.3/bin/python3 "/home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tests/unit/python/execution_tree/eval.py"
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /storage/packages/python3/3.6.3/bin/python3.6...done.
(gdb) run
Starting program: /usr/local/packages/python3/3.6.3/bin/python3 /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tests/unit/python/execution_tree/eval.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Detaching after fork from child process 148026.
Detaching after fork from child process 148031.
[New Thread 0x3fffa815e990 (LWP 148036)]
[New Thread 0x3fffa762e990 (LWP 148037)]
[New Thread 0x3fffa5d6e990 (LWP 148038)]
[New Thread 0x3fffa555e990 (LWP 148039)]
[New Thread 0x3fffa4d4e990 (LWP 148040)]
[New Thread 0x3fff8fffe990 (LWP 148041)]
[New Thread 0x3fff8f7ee990 (LWP 148042)]
[New Thread 0x3fff8efde990 (LWP 148043)]
[New Thread 0x3fff8e7ce990 (LWP 148044)]
[New Thread 0x3fff8dfbe990 (LWP 148045)]
[Thread 0x3fffa4d4e990 (LWP 148040) exited]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3fff8dfbe990 (LWP 148045)]
0x00003fffb7fca7ac in _dl_update_slotinfo () from /lib64/ld64.so.2
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.ppc64le elfutils-libelf-0.170-4.el7.ppc64le elfutils-libs-0.170-4.el7.ppc64le glibc-2.17-222.el7.ppc64le keyutils-libs-1.5.8-3.el7.ppc64le krb5-libs-1.15.1-19.el7.ppc64le libattr-2.4.46-13.el7.ppc64le libcap-2.22-9.el7.ppc64le libcom_err-1.42.9-12.el7_5.ppc64le libffi-3.0.13-18.el7.ppc64le libicu-50.1.2-15.el7.ppc64le libselinux-2.5-12.el7.ppc64le libselinux-2.5-6.el7.ppc64le openssl-libs-1.0.2k-12.el7.ppc64le pcre-8.32-17.el7.ppc64le systemd-libs-219-57.el7.ppc64le xz-libs-5.2.2-1.el7.ppc64le zlib-1.2.7-17.el7.ppc64le
(gdb) bt
#0  0x00003fffb7fca7ac in _dl_update_slotinfo () from /lib64/ld64.so.2
#1  0x00003fffb7fb16f0 in update_get_addr () from /lib64/ld64.so.2
#2  0x00003fffaeee2350 in hpx::threads::coroutines::detail::coroutine_self::get_self() ()
   from /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#3  0x00003fffaf03ae18 in hpx::threads::get_self_ptr() ()
   from /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#4  0x00003fffb02561e8 in hpx::util::annotate_function::annotate_function(char const*) ()
   from /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#5  0x00003fffb02546f4 in phylanx::execution_tree::primitives::primitive_component_base::do_eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
   from /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#6  0x00003fffb021e270 in phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
   from /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-release/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#7  0x00003fffb00b1fe0 in hpx::actions::basic_action_impl<hpx::lcos::future<phylanx::execution_tree::primitive_argument_type> (phylanx::execution_tree::primitives::primitive_component::*)(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const, hpx::lcos::future<phylanx::execution_tree::primitive_argument_type> (phylanx::execution_tree::primitives::primitive_component::*)(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const, &(phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const), phylanx::execution_tree::primitives::primitive_component::eval_action>::invoke_helper<hpx::lcos::future<phylanx::execution_tree::primitive_argument_type>, std::vector<phylanx::---Type <return> to continue, or q <return> to quit---

The Debug build fails in a different location, but with an equally massive call stack (in the ~436 range). In the Debug build, it appears a boost "unused" type is passed as an attribute/context somewhere deep in boost:

(gdb) bt
#0  0x00003fffaf64dd44 in boost::fusion::vector<>::vector() (this=0x3fff96eb0180)
    at /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-debug/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/boost-1.65.0/include/boost/fusion/container/vector/vector.hpp:288
#1  0x00003fffaf64da64 in boost::spirit::context<boost::fusion::cons<boost::spirit::unused_type&, boost::fusion::nil_>, boost::fusion::vector<> >::context(boost::spirit::unused_type&) (this=0x3fff96eb0160, 
    attribute=...)
    at /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-debug/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/boost-1.65.0/include/boost/spirit/home/support/context.hpp:101
#2  0x00003fffaf65a498 in boost::spirit::qi::rule<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type>::parse<boost::spirit::unused_type const, boost::spirit::unused_type, boost::spirit::unused_type const> (this=0x3fff96ebfe70, first=110 'n', 
    last=0 '\000', skipper=..., attr_param=...)
    at /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-debug/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/boost-1.65.0/include/boost/spirit/home/qi/nonterminal/rule.hpp:298
#3  0x00003fffaf65a3cc in boost::spirit::qi::reference<boost::spirit::qi::rule<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type> const>::parse<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::spirit::unused_type const, boost::spirit::unused_type, boost::spirit::unused_type const> (this=0x3fff96ebfc38, first=110 'n', last=0 '\000', context=..., skipper=..., attr_=...)
    at /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-debug/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/boost-1.65.0/include/boost/spirit/home/qi/reference.hpp:43
#4  0x00003fffaf653b24 in boost::spirit::qi::skip_over<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::spirit::qi::reference<boost::spirit::qi::rule<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type, boost::spirit::unused_type> const> > (first=110 'n', last=0 '\000', skipper=...)
    at /home/users/khuck/buildbot/slaves/phylanx/ppc64le-clang5-debug/build/tools/buildbot/build-centaur-ppc64le-Linux-clang/boost-1.65.0/include/boost/spirit/home/qi/skip_over.hpp:27
#5  0x00003fffaf6e8778 in boost::spirit::qi::lexeme_directive<boost::spirit::qi::sequence<boost::fusion::cons<---Type <return> to continue, or q <return> to quit---
@hkaiser
Member

hkaiser commented Sep 5, 2018

The Release and Debug errors seem to be unrelated. While the release error comes out of an actual action invocation, the Debug error happens in Spirit during parsing (presumably a PhySL expression). Both actually could be stack overflows :/

@khuck
Contributor Author

khuck commented Sep 5, 2018

I think I ruled out a stack overflow: after doubling the stack size (via ulimit -s), I get the same crash in the same location.

@hkaiser
Member

hkaiser commented Sep 5, 2018

@khuck I don't think ulimit -s has any bearing on the stack size used by HPX for its threads. I wouldn't rule out a stack overflow for this problem.
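
As a rough analogy only (this is Python's threading module, not how HPX allocates its thread stacks), the sketch below shows that the process-wide limit reported by ulimit -s and the stack size a runtime picks for the threads it creates are two separate settings. The names process_stack_limit, run_with_private_stack and the 512 KiB value are hypothetical, chosen for illustration.

```python
import resource
import threading

# Hypothetical stack size a runtime might choose for its own threads.
THREAD_STACK_BYTES = 512 * 1024


def process_stack_limit():
    """Soft RLIMIT_STACK in bytes; this is the value `ulimit -s` controls."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return soft


def run_with_private_stack(fn, stack_bytes=THREAD_STACK_BYTES):
    """Run fn on a new thread whose stack size is set explicitly."""
    # stack_size() applies to threads created afterwards and returns the old value.
    previous = threading.stack_size(stack_bytes)
    try:
        result = []
        worker = threading.Thread(target=lambda: result.append(fn()))
        worker.start()
        worker.join()
        return result[0]
    finally:
        # Restore the previous setting so later threads are unaffected.
        threading.stack_size(previous)
```

Changing one of the two settings leaves the other untouched, which is why raising ulimit -s alone may not rule out an overflow on a runtime-managed thread stack.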

@khuck
Contributor Author

khuck commented Sep 5, 2018

OK, trying with HPX_WITH_STACKOVERFLOW_DETECTION_DEFAULT=On

@khuck
Contributor Author

khuck commented Sep 5, 2018

Running with HPX_WITH_STACKOVERFLOW_DETECTION_DEFAULT=On didn't change anything - it still crashed in roughly the same location, but slightly different:

#0  0x00003fffaf14cdd0 in hpx::threads::thread_data::set_description(hpx::util::thread_description) ()
   from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#1  0x00003fffaf149d4c in hpx::threads::set_thread_description(hpx::threads::thread_id_type const&, hpx::util::thread_description const&, hpx::error_code&) ()
   from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/hpx-Release/lib/libhpx.so.1
#2  0x00003fffb027ce40 in hpx::util::annotate_function::annotate_function(char const*) ()
   from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#3  0x00003fffb027b294 in phylanx::execution_tree::primitives::primitive_component_base::do_eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
   from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0
#4  0x00003fffb0244a90 in phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const ()
   from /home/users/khuck/src/phylanx/tools/buildbot/build-centaur-ppc64le-Linux-clang/phylanx-Release/lib/libhpx_phylanx.so.0

Could it be that there is an operation that is just missing an annotation, or is getting mis-annotated in some way?

@khuck
Contributor Author

khuck commented Sep 7, 2018

After compiling with Clang 6.0 on an x86_64 machine, I think I confirmed it's a POWER8-specific problem. Is there something specific about this particular primitive that does something unusual?

@khuck
Contributor Author

khuck commented Sep 7, 2018

@hkaiser - another clue... as you pointed out, the crash is in:

#367 0x00003fffb0609664 in phylanx::bindings::expression_evaluator(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, phylanx::bindings::compiler_state&, pybind11::args)::{lambda()#1}::operator()() const (this=0x3fffffffcd88)
    at /home/users/khuck/src/phylanx/python/src/bindings/binding_helpers.hpp:181

...but the expression it is parsing is not that crazy:

181	                auto xexpr = phylanx::ast::generate_ast(xexpr_str);
(gdb) print xexpr_str
$1 = "\nblock(\n    define(fib,n,\n    if(n<2,n,\n        fib(n-1)+fib(n-2))),\n    fib)"

except for the fact that it is a recursive definition.

Also, this didn't crash when I built it on an x86_64 machine (HPX and Phylanx were built with Clang 5.0) that used the Ubuntu boost package (built by GCC, I assume), whereas the machine that is crashing uses a boost built by Clang 5.0.

Also, I built the test that crashed with -fstack-protector-all -fstack-protector-strong and didn't see any difference.

@khuck
Contributor Author

khuck commented Sep 21, 2018

@hkaiser The stack stuff might be a red herring. I have another clue. I tried running a RelWithDebInfo build. It crashes, but in a different way. 2 steps up the stack, the program is in the "eval" method of the primitive_component base class. When I dereference the "this" pointer, I get this back:

#2  0x00003fffb0259fa0 in phylanx::execution_tree::primitives::primitive_component::eval (this=0x10fc05a0, 
    params=..., mode=<optimized out>)
    at /home/users/khuck/src/phylanx/src/execution_tree/primitives/primitive_component.cpp:123
123	        return primitive_->do_eval(params, mode);
(gdb) print this
$10 = (const phylanx::execution_tree::primitives::primitive_component *) 0x10fc05a0
(gdb) print *this
$11 = {<hpx::components::component_base<phylanx::execution_tree::primitives::primitive_component>> = {<hpx::components::detail::base_component> = {<hpx::traits::detail::component_tag> = {<No data fields>}, gid_ = {
        static credit_base_mask = 31, static credit_shift = 24, static credit_mask = 520093696, 
        static was_split_mask = 2147483648, static has_credits_mask = 1073741824, 
        static is_locked_mask = 536870912, static locality_id_mask = 18446744069414584320, 
        static locality_id_shift = 32, static virtual_memory_mask = 4194303, 
        static dont_cache_mask = 8388608, static is_migratable = 4194304, static dynamically_assigned = 1, 
        static component_type_base_mask = 1048575, static component_type_shift = 1, 
        static component_type_mask = 2097150, static credit_bits_mask = 3741319168, 
        static internal_bits_mask = 4290772992, static special_bits_mask = 18446744073707454462, 
        id_msb_ = 4294967376, id_lsb_ = 284951968}}, <No data fields>}, primitive_ = warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<phylanx::execution_tree::primitives::access_function, std::allocator<phylanx::execution_tree::primitives::access_function>, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<phylanx::execution_tree::primitives::access_function, std::allocator<phylanx::execution_tree::primitives::access_function>, (__gnu_cxx::_Lock_policy)2>'

std::shared_ptr (count 1, weak 0) 0x10fb77e0}

...which seems OK, except for the RTTI warning. Then, stepping down the stack things get interesting:

(gdb) down
#1  0x00003fffb028adb4 in phylanx::execution_tree::primitives::primitive_component_base::do_eval (
    this=0x10fb77e0, params=std::vector of length 1, capacity 1 = {...}, 
    mode=(phylanx::execution_tree::eval_dont_wrap_functions | phylanx::execution_tree::eval_dont_evaluate_partials | phylanx::execution_tree::eval_dont_evaluate_lambdas))
    at /home/users/khuck/src/phylanx/src/execution_tree/primitives/primitive_component_base.cpp:89
89	        auto f = this->eval(params, mode);

...which also seems OK. But taking one more step down, into the concrete instance of the object:

(gdb) down
#0  0x00003fffb009d230 in phylanx::execution_tree::primitives::access_function::eval (this=0x0, 
    params=std::vector of length 1, capacity 1 = {...}, 
    mode=(phylanx::execution_tree::eval_dont_wrap_functions | phylanx::execution_tree::eval_dont_evaluate_partials | phylanx::execution_tree::eval_dont_evaluate_lambdas))
    at /home/users/khuck/src/phylanx/src/execution_tree/primitives/access_function.cpp:57
57	    {

...you'll notice the "this" pointer is null! So for some reason, this object is either corrupted, or...? Is something missing from the implementation of phylanx::execution_tree::primitives::access_function so that it isn't getting handled like the other primitives?

@khuck
Contributor Author

khuck commented Sep 25, 2018

@hkaiser any thoughts on the above? I thought maybe it was because access_function didn't inherit from public std::enable_shared_from_this<access_function> like the other primitives. Could that be the case?

@hkaiser
Member

hkaiser commented Sep 25, 2018

> @hkaiser any thoughts on the above? I thought maybe it was because access_function didn't inherit from public std::enable_shared_from_this<access_function> like the other primitives. Could that be the case?

@khuck: I don't think this causes the issue we're seeing. All primitives are kept alive by a shared_ptr in any case. Most of them, however, additionally need to stay alive for 'delayed' operation (which requires enable_shared_from_this); access_variable is not one of those, iirc.

@khuck
Contributor Author

khuck commented Sep 26, 2018

@hkaiser ok. I started playing with the code in eval.py, and it's crashing on the definition of fib10 (compressed here):

fib10 = et.eval(" block( define(fib,n, if(n<2,n, fib(n-1)+fib(n-2))), fib) ", cs, 10)

BUT if I change it to fib9, it works:

fib9 = et.eval(" block( define(fib,n, if(n<2,n, fib(n-1)+fib(n-2))), fib) ", cs, 9)

...and the same is true of the fib() function defined later, if I call it with fib(9) it's OK, but fib(10) crashes. So it is stack related, but it's the stack of the AST that is the problem. Reminder, this is Clang 5.0 on POWER8, so different beast than GCC on x86_64.
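
To put numbers on the fib(9) versus fib(10) difference, here is a minimal Python sketch of the same naive recursion. It is not PhySL and not the Phylanx evaluator; fib_profile is a hypothetical helper that just counts calls and the deepest nesting. The assumption is that, when evaluated synchronously, each nested fib call adds a fixed number of native frames on top of its caller.

```python
def fib_profile(n):
    """Return (value, total_calls, max_nesting) for naive recursive fib(n)."""
    calls = 0
    max_depth = 0

    def fib(k, depth):
        nonlocal calls, max_depth
        calls += 1
        max_depth = max(max_depth, depth)
        if k < 2:
            return k
        # Same shape as the PhySL definition: fib(n-1) + fib(n-2).
        return fib(k - 1, depth + 1) + fib(k - 2, depth + 1)

    return fib(n, 1), calls, max_depth


for n in (9, 10):
    value, calls, depth = fib_profile(n)
    print(f"fib({n}) = {value}: {calls} calls, max nesting {depth}")
# fib(9) = 34: 109 calls, max nesting 9
# fib(10) = 55: 177 calls, max nesting 10
```

The nesting grows by only one level from 9 to 10, so under that assumption the crash would point to a stack budget sitting right at that threshold rather than to runaway recursion.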

khuck added a commit that referenced this issue Sep 26, 2018
Recursive functions (fibonacci) eventually crash on power8/clang
with direct actions, so this change will prevent that crash from
happening.  Issue #584 still needs to be fixed.
@khuck
Contributor Author

khuck commented Sep 26, 2018

Yup, stack related. This issue will stay open, but a work-around for that platform has been committed. See pull request #601

@hkaiser
Member

hkaiser commented Sep 27, 2018

This PR enables stack overflow prevention in HPX on Power platforms: STEllAR-GROUP/hpx#3469. Please verify.

@hkaiser
Member

hkaiser commented Sep 27, 2018

@khuck I believe the calculation of the remaining amount of stack space in my original patch was wrong. Could you try again, please?

@khuck
Contributor Author

khuck commented Sep 27, 2018

@hkaiser nope, same error. I have asked for someone to send me the instructions for getting an account on our system if you want to test it yourself...

@sithhell
Member

sithhell commented Oct 1, 2018

@khuck can you try a build with address sanitizer? It is usually very accurate at pinpointing issues.

@khuck
Contributor Author

khuck commented Oct 1, 2018

@sithhell I did. I ran into so many linker issues I couldn't figure out how to fix them. I tried with valgrind, but after 3-4 hours building a suppression file, I was no closer to the cause of the problem.

@sithhell
Member

sithhell commented Oct 1, 2018

@khuck for the linker errors, configure your HPX build with -DHPX_WITH_SANITIZERS=On. This should solve most of them.

@khuck
Contributor Author

khuck commented Oct 1, 2018

@sithhell IIRC, building HPX wasn't the problem, but building Phylanx was. The address sanitizer library was supposed to be first in the link order, but it wasn't. Besides, I built Clang 5.0 myself for this machine, and it's possible I didn't configure/build the sanitizer libraries correctly.

@stevenrbrandt
Member

@sithhell @khuck it would be great to have a docker image with the address sanitizer enabled and working correctly.

@khuck
Contributor Author

khuck commented Oct 11, 2018

@sithhell yes it would - are you volunteering? :)

@stevenrbrandt
Member

stevenrbrandt commented Oct 30, 2018

OK, I attempted to make a Phylanx docker image that uses the address sanitizer. It fails at the Phylanx link step.
Here's the Dockerfile I attempted to use:

https://gist.github.com/stevenrbrandt/56cc36a9c9cb0375ae264c398d0e3431

Setting -lasan in CMAKE_EXE_LINKER for Phylanx seems to do nothing.

However, setting -lasan in CMAKE_CXX_FLAGS allows Phylanx to link, though it gives a bunch of spurious warning messages about using a link flag while not linking.

Regardless, however, I can't run bin/physl because I get this error:

build]# bin/physl --doc
==15==Your application is linked against incompatible ASan runtimes.

Not sure how that comes about, since I only have the default Clang / libasan installed.

@sithhell Any idea what I'm doing wrong?

@khuck
Contributor Author

khuck commented Oct 30, 2018 via email

@stevenrbrandt
Member

stevenrbrandt commented Oct 30, 2018

@khuck I've discovered the -shared-libasan flag. I'm experimenting with that.

@stevenrbrandt
Member

stevenrbrandt commented Oct 31, 2018

@khuck @sithhell
Current Dockerfile: https://gist.github.com/stevenrbrandt/27e1d4eb5fd86a4b57697567c3964697

Ok, this uses -shared-libasan and everything compiles, but when I try to run Phylanx Hello World, I get this:

==27==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==27==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==27==This might be related to ELF_ET_DYN_BASE change in Linux 4.12.
==27==See https://github.com/google/sanitizers/issues/856 for possible workarounds.
==27==Process memory map follows:
        0x000000400000-0x0000007be000   /usr/bin/python3.6
        0x0000009bd000-0x0000009be000   /usr/bin/python3.6
        0x0000009be000-0x000000a5b000   /usr/bin/python3.6
        0x000000a5b000-0x000000a8f000

Not sure what to do at this point.

@khuck
Contributor Author

khuck commented Oct 31, 2018

@stevenrbrandt just curious - are you using the system allocator or tcmalloc/jemalloc?

@khuck
Contributor Author

khuck commented Oct 31, 2018

@stevenrbrandt also, what happens if you run an example without python involved? like lra_csv or something like that?

@stevenrbrandt
Member

stevenrbrandt commented Nov 2, 2018

@khuck I'm using the System Allocator, see the docker file I linked.

You can't even run "physl --doc" without problems:

# bin/physl --doc
terminate called after throwing an instance of 'std::runtime_error'
  what():  Cannot instantiate more than one affinity data instance
Aborted

@stevenrbrandt
Member

stevenrbrandt commented Nov 2, 2018

So, a small success (I think). The problem seems to have partly been the 80 core cluster I built it on...

Running on a smaller machine, I get this. You can try out stevenrbrandt/phylanx.sanitized from Docker yourself.

# ./bin/physl --doc
=================================================================
==27==ERROR: AddressSanitizer: odr-violation (0x7fb8b739c940):
  [1] size=32 'hpx::util::detail::global_fixture' /hpx/src/util/lightweight_test.cpp:56:13
  [2] size=32 'hpx::util::detail::global_fixture' /hpx/src/util/lightweight_test.cpp:56:13
These globals were registered at these points:
  [1]:
    #0 0x7fb8cd3385c8  (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0x675c8)
    #1 0x7fb8b46582dd in asan.module_ctor (/usr/local/lib/phylanx/libphylanx_solversd.so+0x39b2dd)

  [2]:
    #0 0x7fb8cd3385c8  (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0x675c8)
    #1 0x7fb8b6e14f7d in asan.module_ctor (/usr/local/lib/phylanx/libphylanx_arithmeticsd.so+0x2503f7d)

==27==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
SUMMARY: AddressSanitizer: odr-violation: global 'hpx::util::detail::global_fixture' at /hpx/src/util/lightweight_test.cpp:56:13
==27==ABORTING

@khuck
Contributor Author

khuck commented Nov 2, 2018

Did you try export ASAN_OPTIONS=detect_odr_violation=0 before running?
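
If the run is driven from a script rather than an interactive shell, the same setting can be applied per command. The sketch below is one possible way to do it, assuming the colon-separated key=value format seen in the LSAN_OPTIONS hint; with_sanitizer_option and run_without_odr_check are hypothetical helper names, not part of any sanitizer tooling.

```python
import os
import subprocess


def with_sanitizer_option(env, variable, key, value):
    """Return a copy of env with key=value set inside a colon-separated options variable."""
    options = [opt for opt in env.get(variable, "").split(":") if opt]
    # Drop any earlier setting of the same key so the new value wins.
    options = [opt for opt in options if opt.split("=", 1)[0] != key]
    options.append(f"{key}={value}")
    new_env = dict(env)
    new_env[variable] = ":".join(options)
    return new_env


def run_without_odr_check(command):
    """Run command with ASAN_OPTIONS=detect_odr_violation=0 added to the environment."""
    env = with_sanitizer_option(os.environ, "ASAN_OPTIONS", "detect_odr_violation", 0)
    return subprocess.run(command, env=env, check=False)
```

For example, run_without_odr_check(["./bin/physl", "--doc"]) would start the interpreter with the ODR check disabled while keeping any other ASAN_OPTIONS already set.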

@stevenrbrandt
Member

stevenrbrandt commented Nov 2, 2018

@khuck using that setting, the physl --doc does run. I get this at the end:

==96==Could not attach to thread 72 (errno 1).
==96==Failed suspending threads.
==72==LeakSanitizer has encountered a fatal error.
==72==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==72==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

@stevenrbrandt
Member

stevenrbrandt commented Nov 2, 2018

OK, so Python doesn't work, but we can generate PhySL code from Python on another machine and run the PhySL code inside the phylanx.sanitized image.

Trying LRA with the PhySL interpreter (./examples/interpreter/lra.physl), I get this:

==191==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7f4897636550 at pc 0x7f48d0a1017c bp 0x7f4897636540 sp 0x7f4897635cf0
WRITE of size 16 at 0x7f4897636550 thread T15
    #0 0x7f48d0a1017b  (/usr/local/clang_7.0.0/lib/clang/7.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xd717b)
    #1 0x7f48c2deb290 in std::chrono::_V2::steady_clock::now() (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xb5290)
    #2 0x7f48cfa4a8b7 in hpx::util::high_resolution_clock::now() /usr/local/include/hpx/util/high_resolution_clock.hpp:30:17
    #3 0x7f48cfa4843a in phylanx::util::scoped_timer<long>::~scoped_timer() /phylanx/phylanx/util/scoped_timer.hpp:37:25
    #4 0x7f48cfa42918 in phylanx::execution_tree::primitives::primitive_component_base::do_eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const /phylanx/src/execution_tree/primitives/primitive_component_base.cpp:103:5
    #5 0x7f48cf9687da in phylanx::execution_tree::primitives::primitive_component::eval(std::vector<phylanx::execution_tree::primitive_argument_type, std::allocator<phylanx::execution_tree::primitive_argument_type> > const&, phylanx::execution_tree::eval_mode) const /phylanx/src/execution_tree/primitives/primitive_component.cpp:123:28

@stevenrbrandt
Member

stevenrbrandt commented Nov 2, 2018

OK, using the Address Sanitizer

block(
    define(fib,n,
    if(n<2,n,
        fib(n-1)+fib(n-2))),
    cout(fib(16))
)

runs without difficulty. This was the code that originally prompted the ticket (correct me if I'm wrong).

@stevenrbrandt
Member

Of course, this was clang 7 not 8

@stevenrbrandt
Member

@khuck I've updated stevenrbrandt/phylanx.sanitized so that it's only 6.6GB. Still not tiny.

@stevenrbrandt
Member

@khuck should we get this sanitized phylanx image running in your test framework?

@khuck
Contributor Author

khuck commented Jan 4, 2019

@stevenrbrandt yes, if you could send me the cmake configuration steps, I would appreciate it.

@stevenrbrandt
Member

@khuck docker pull stevenrbrandt/phylanx.sanitized to get the image. The Dockerfile itself is inside the image as /Dockerfile.
