"negative entry in reference count table" running octotiger on 32 nodes on queenbee #2171

Closed
dmarce1 opened this issue May 20, 2016 · 28 comments · Fixed by #2206

Comments

@dmarce1
Member

dmarce1 commented May 20, 2016

With the latest commit (4324def), I get a "negative entry in reference count table" error when running octotiger on 32 nodes on queenbee. The error output is here: https://gist.github.com/dmarce1/75c82c798733a9dcaa75b99bb3ca9c45

The core dump, executable, and all binaries used in the run are on SuperMIC at /project/dmarce1/core_dump_3

@hkaiser
Member

hkaiser commented May 20, 2016

This just means that our latest changes have not fixed the original problem you were having. It is still the very same reference-counting issue that keeps popping up for you. Thanks for letting us know.

See also #2122 and #2108, which are different errors that are likely caused by the same problem.
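
For readers unfamiliar with this failure mode, the following is a minimal sketch of how a credit table can end up reporting a negative entry. It is purely illustrative and is not HPX's AGAS implementation: the class `credit_table`, its member names, and the use of plain integer ids are all assumptions made for the example. The idea is only that every decrement must be matched by an earlier increment, so a decrement that arrives too often (or after the entry was already released) is detected as an error.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>

// Hypothetical, simplified credit table keyed by an object id.
// Not HPX code; it only illustrates the bookkeeping invariant.
class credit_table {
public:
    // Record that `credit` additional references to `id` exist.
    void increment(std::uint64_t id, std::int64_t credit) {
        assert(credit > 0);    // precondition of this sketch
        counts_[id] += credit;
    }

    // Return `credit` references. Yields true when the last reference
    // was returned, i.e. when the object may now be destroyed.
    bool decrement(std::uint64_t id, std::int64_t credit) {
        assert(credit > 0);    // precondition of this sketch
        auto it = counts_.find(id);
        if (it == counts_.end() || it->second < credit) {
            // More credit returned than was ever handed out.
            throw std::logic_error("negative entry in reference count table");
        }
        if (it->second == credit) {
            counts_.erase(it);
            return true;
        }
        it->second -= credit;
        return false;
    }

    bool contains(std::uint64_t id) const {
        return counts_.count(id) != 0;
    }

private:
    std::map<std::uint64_t, std::int64_t> counts_;
};
```

In this sketch, a second `decrement` for an id whose entry has already been erased throws instead of silently freeing the object twice.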

@hkaiser
Member

hkaiser commented May 25, 2016

@sithhell said that he thinks this is caused by stack overflows. From his email:

Ok, that makes me believe that it is a stack overflow that I can't
reproduce at the moment. However, I think I know where to add the
necessary code:

  1. parcelhandler::put_parcel
  2. apply_helper<Action, true> (in apply_helper.hpp), the thing that
    executes direct actions.

What we need here is the same code that we have in the handle_completion
function in detail/future_data.hpp.
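
The general technique referred to here can be sketched as follows: before running a piece of work inline, check whether it is still safe to go deeper, and if not, hand the work to a queue so that it starts again from a shallow stack. This is a minimal sketch under stated assumptions, not the code from `handle_completion`: the class `deferred_executor`, the function `run_chain`, and the use of a nesting-depth counter in place of a real remaining-stack-space measurement are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a scheduler: work that cannot safely run
// inline is queued and executed later from a shallow stack.
class deferred_executor {
public:
    explicit deferred_executor(std::size_t max_inline_depth)
      : max_inline_depth_(max_inline_depth) {}

    // Run `f` inline while the nesting depth is small; otherwise queue it.
    // The depth counter is an assumption standing in for a stack-space check.
    void run(std::function<void()> f) {
        if (depth_ < max_inline_depth_) {
            ++depth_;
            if (depth_ > deepest_) deepest_ = depth_;
            f();
            --depth_;
        } else {
            pending_.push_back(std::move(f));
        }
    }

    // Drain queued work; each item starts again from depth 0.
    void drain() {
        assert(depth_ == 0);
        while (!pending_.empty()) {
            std::function<void()> f = std::move(pending_.front());
            pending_.pop_front();
            run(std::move(f));
        }
    }

    // Deepest inline nesting level observed so far.
    std::size_t deepest() const { return deepest_; }

private:
    std::size_t max_inline_depth_;
    std::size_t depth_ = 0;
    std::size_t deepest_ = 0;
    std::deque<std::function<void()>> pending_;
};

// A chain of `n` continuations, each triggering the next one. Without the
// depth check this would recurse n levels deep. Returns the number of
// continuations that ran.
inline std::size_t run_chain(std::size_t n, std::size_t max_inline_depth) {
    deferred_executor exec(max_inline_depth);
    std::size_t completed = 0;
    std::function<void(std::size_t)> step = [&](std::size_t remaining) {
        ++completed;
        if (remaining > 1)
            exec.run([&step, remaining] { step(remaining - 1); });
    };
    exec.run([&step, n] { step(n); });
    exec.drain();
    return completed;
}
```

Calling `run_chain(200000, 64)` runs all continuations while never nesting more than 64 levels deep. A real runtime would compare the remaining stack space against a threshold and schedule a new lightweight thread instead of using a local queue.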

@hkaiser
Member

hkaiser commented May 29, 2016

@dmarce1 We have fixed another of those bugs that make you wonder how things ever worked before. The fix has been merged to master. I wouldn't be surprised if that has fixed your issues as well.

hkaiser added a commit that referenced this issue May 29, 2016
… small

- apply_helper
- put_parcel

This should solve the stack overflows that are assumed to cause problems like those reported in #2171

@hkaiser
Member

hkaiser commented May 29, 2016

@sithhell I have implemented the stack checks for apply_helper and put_parcel as suggested.
@dmarce1 Please see the branch https://github.com/STEllAR-GROUP/hpx/tree/fixing_2171; this might help with your issues as well.

@dmarce1
Member Author

dmarce1 commented Jun 1, 2016

When I compile octotiger with fixing_2171, it freezes on startup; it doesn't even get to the first line of main().

@hkaiser
Member

hkaiser commented Jun 1, 2016

@dmarce1 If that's the case then something else is amiss - some stale binaries getting in the way, perhaps.

@dmarce1
Member Author

dmarce1 commented Jun 1, 2016

I removed all the binaries and rebuilt everything (HPX and Octo-tiger) using the master HPX branch (the one with the merge of fixing_2171). I get the same problem: a freeze on startup.

@sithhell
Member

sithhell commented Jun 1, 2016

How many nodes?

@dmarce1
Member Author

dmarce1 commented Jun 1, 2016

I have tried on 128 and 32.

@dmarce1
Member Author

dmarce1 commented Jun 2, 2016

It seems to work on 1 core, 1 full node, or 16 nodes, but it froze on 32 and 128 nodes.

@hkaiser
Member

hkaiser commented Jun 2, 2016

@dmarce1 The startup hang should be fixed on master (tm)

@dmarce1
Member Author

dmarce1 commented Jun 2, 2016

It looks like the hang is fixed.

@dmarce1
Member Author

dmarce1 commented Jun 2, 2016

Ok, I'm still getting freezes, just after startup.

@dmarce1
Member Author

dmarce1 commented Jun 3, 2016

I get freezes in the early stages of execution, during grid setup, for anything on 16 nodes or more, it seems. (I rebuilt everything: HPX from master, and octotiger.)

@hkaiser
Member

hkaiser commented Jun 3, 2016

@dmarce1 We're working on this. Please go back to 306f128, which is the last commit before the changes that broke the runs.

@sithhell
Member

sithhell commented Jun 6, 2016

@dmarce1 #2200 is trying to address this issue. Please try again.

@dmarce1
Member Author

dmarce1 commented Jun 7, 2016

OK, the hangs are definitely gone, but I am still getting the same sorts of crashes.

> 0x2ad9336b1c57  : hpx::termination_handler(int) + 0x267 in /work/dmarce1/release/lib/libhpx.so.0
> 0x353f20f7e0    : ??? + 0x353f20f7e0 in /lib64/libpthread.so.0
> 0x2ad9339aeb95  : hpx::components::server::destroy_base_lco(hpx::naming::gid_type const&, hpx::naming::address const&, hpx::util::one_size_heap_list_base*, int, hpx::error_code&) + 0xa5 in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad9339bb501  : hpx::components::server::runtime_support::free_component(hpx::agas::gva const&, hpx::naming::gid_type const&, unsigned long) + 0x12a1 in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad934078113  : hpx::agas::server::primary_namespace::free_components_sync(std::list<hpx::agas::server::primary_namespace::free_entry, std::allocator<hpx::agas::server::primary_namespace::free_entry> >&, hpx::naming::gid_type const&, hpx::naming::gid_type const&, hpx::error_code&) + 0xae3 in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad93408395f  : hpx::agas::server::primary_namespace::decrement_credit(hpx::agas::request const&, hpx::error_code&) + 0xbaf in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad93408d7e3  : hpx::agas::server::primary_namespace::service(hpx::agas::request const&, hpx::error_code&) + 0x453 in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad93408d036  : hpx::agas::server::primary_namespace::bulk_service(std::vector<hpx::agas::request, std::allocator<hpx::agas::request> > const&, hpx::error_code&) + 0x86 in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad93416edea  : ??? + 0x2ad93416edea in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad93397c16a  : hpx::threads::coroutines::detail::coroutine_impl::operator()() + 0x18a in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad9335bdbb6  : ??? + 0x2ad9335bdbb6 in /work/dmarce1/release/lib/libhpx.so.0
> 12 frames:
> 0x2ad9336b1c57  : hpx::termination_handler(int) + 0x267 in /work/dmarce1/release/lib/libhpx.so.0
> 0x353f20f7e0    : ??? + 0x353f20f7e0 in /lib64/libpthread.so.0
> 0x2ad9339aeb95  : hpx::components::server::destroy_base_lco(hpx::naming::gid_type const&, hpx::naming::address const&, hpx::util::one_size_heap_list_base*, int, hpx::error_code&) + 0xa5 in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad9339bb501  : hpx::components::server::runtime_support::free_component(hpx::agas::gva const&, hpx::naming::gid_type const&, unsigned long) + 0x12a1 in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad934078113  : hpx::agas::server::primary_namespace::free_components_sync(std::list<hpx::agas::server::primary_namespace::free_entry, std::allocator<hpx::agas::server::primary_namespace::free_entry> >&, hpx::naming::gid_type const&, hpx::naming::gid_type const&, hpx::error_code&) + 0xae3 in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad93408395f  : hpx::agas::server::primary_namespace::decrement_credit(hpx::agas::request const&, hpx::error_code&) + 0xbaf in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad93408d7e3  : hpx::agas::server::primary_namespace::service(hpx::agas::request const&, hpx::error_code&) + 0x453 in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad93408d036  : hpx::agas::server::primary_namespace::bulk_service(std::vector<hpx::agas::request, std::allocator<hpx::agas::request> > const&, hpx::error_code&) + 0x86 in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad93416edea  : ??? + 0x2ad93416edea in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad93397c16a  : hpx::threads::coroutines::detail::coroutine_impl::operator()() + 0x18a in /work/dmarce1/release/lib/libhpx.so.0
> 0x2ad9335bdbb6  : ??? + 0x2ad9335bdbb6 in /work/dmarce1/release/lib/libhpx.so.0
> {what}: Segmentation fault
> 

@sithhell
Member

sithhell commented Jun 8, 2016

@dmarce1 could you please try the (new) fixing_2171 branch and see if that fixes your problem?

@dmarce1
Member Author

dmarce1 commented Jun 8, 2016

OK. I'm on it now.

@dmarce1
Member Author

dmarce1 commented Jun 9, 2016

I am still getting similar errors:
{what}: primary_namespace::resolve_free_list, failed to resolve gid, gid({0000005d00000001, 000000002e9013f8}): HPX(internal_server_error)

@hkaiser
Member

hkaiser commented Jun 9, 2016

@dmarce1 Are you at least using 0aa5be3 (see https://github.com/STEllAR-GROUP/hpx/pull/2206/commits)?

@dmarce1
Member Author

dmarce1 commented Jun 10, 2016

I was using f8b9580 (and just got the same error). I will try with 0aa5be3

@dmarce1
Member Author

dmarce1 commented Jun 10, 2016

Wait, never mind, I see what you're saying. f8b9580 supersedes 0aa5be3, so yes, I was at least using 0aa5be3, with a clean install of HPX.

@dmarce1
Member Author

dmarce1 commented Jun 10, 2016

OK, there is a remote possibility I was using the wrong binaries. I found some old HPX binaries in a directory I'm pretty sure wasn't used, but I am removing and rebuilding everything from scratch to make sure.

@dmarce1
Member Author

dmarce1 commented Jun 10, 2016

Similar bug:

{stack-trace}: {stack-trace}: 12 frames:
0x2ba0e4f6d127  : hpx::termination_handler(int) + 0x267 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x357f80f7e0    : ??? + 0x357f80f7e0 in /lib64/libpthread.so.0
0x2ba0e526a945  : hpx::components::server::destroy_base_lco(hpx::naming::gid_type const&, hpx::naming::address const&, hpx::util::one_size_heap_list_base*, int, hpx::error_code&) + 0xa5 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e52772b1  : hpx::components::server::runtime_support::free_component(hpx::agas::gva const&, hpx::naming::gid_type const&, unsigned long) + 0x12a1 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e593b143  : hpx::agas::server::primary_namespace::free_components_sync(std::list<hpx::agas::server::primary_namespace::free_entry, std::allocator<hpx::agas::server::primary_namespace::free_entry> >&, hpx::naming::gid_type const&, hpx::naming::gid_type const&, hpx::error_code&) + 0xae3 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e594698f  : hpx::agas::server::primary_namespace::decrement_credit(hpx::agas::request const&, hpx::error_code&) + 0xbaf in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e5950813  : hpx::agas::server::primary_namespace::service(hpx::agas::request const&, hpx::error_code&) + 0x453 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e5950066  : hpx::agas::server::primary_namespace::bulk_service(std::vector<hpx::agas::request, std::allocator<hpx::agas::request> > const&, hpx::error_code&) + 0x86 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e5a3383a  : ??? + 0x2ba0e5a3383a in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e52375ea  : hpx::threads::coroutines::detail::coroutine_impl::operator()() + 0x18a in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e4e790a6  : ??? + 0x2ba0e4e790a6 in /work/dmarce1/release/hpx/lib/libhpx.so.0
{what}: Segmentation fault
12 frames:
0x2ba0e4f6d127  : hpx::termination_handler(int) + 0x267 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x357f80f7e0    : ??? + 0x357f80f7e0 in /lib64/libpthread.so.0
0x2ba0e526a948  : hpx::components::server::destroy_base_lco(hpx::naming::gid_type const&, hpx::naming::address const&, hpx::util::one_size_heap_list_base*, int, hpx::error_code&) + 0xa8 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e52772b1  : hpx::components::server::runtime_support::free_component(hpx::agas::gva const&, hpx::naming::gid_type const&, unsigned long) + 0x12a1 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e593b143  : hpx::agas::server::primary_namespace::free_components_sync(std::list<hpx::agas::server::primary_namespace::free_entry, std::allocator<hpx::agas::server::primary_namespace::free_entry> >&, hpx::naming::gid_type const&, hpx::naming::gid_type const&, hpx::error_code&) + 0xae3 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e594698f  : hpx::agas::server::primary_namespace::decrement_credit(hpx::agas::request const&, hpx::error_code&) + 0xbaf in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e5950813  : hpx::agas::server::primary_namespace::service(hpx::agas::request const&, hpx::error_code&) + 0x453 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e5950066  : hpx::agas::server::primary_namespace::bulk_service(std::vector<hpx::agas::request, std::allocator<hpx::agas::request> > const&, hpx::error_code&) + 0x86 in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e5a3383a  : ??? + 0x2ba0e5a3383a in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e52375ea  : hpx::threads::coroutines::detail::coroutine_impl::operator()() + 0x18a in /work/dmarce1/release/hpx/lib/libhpx.so.0
0x2ba0e4e790a6  : ??? + 0x2ba0e4e790a6 in /work/dmarce1/release/hpx/lib/libhpx.so.0

@hkaiser
Member

hkaiser commented Jun 10, 2016

Still the same old problem... sigh

@dmarce1 Thomas has reimplemented part of the remote promise/future architecture (responsible for those issues) and claims that all of our tests pass for him. If you're feeling adventurous, try running off the branch simplify_promise (https://github.com/STEllAR-GROUP/hpx/tree/simplify_promise).

@dmarce1
Member Author

dmarce1 commented Jun 13, 2016

I tried running off the branch simplify_promise. So far, 67+ hours and no crash. Compared to the longest run to date (which ended with a crash), it has so far completed more than 5 times as many timesteps.

@hkaiser
Member

hkaiser commented Jun 13, 2016

@dmarce1 We will merge the changes from the simplify_promise branch later as well. This ticket got closed automatically by merging #2206. I'll leave this ticket closed for now.
