Cannot construct component with large vector on a remote locality #2334

Closed
DavidPfander-UniStuttgart opened this issue Sep 16, 2016 · 14 comments

@DavidPfander-UniStuttgart (Contributor) commented Sep 16, 2016

I created a component with a data member of size 512 MB. The component is created on the first remote locality (so the code example requires 2 localities). Running the example leads to two different errors, depending on whether a debug or a release build of HPX is used (see below). If the matrix size is decreased (set N to 4096), everything seems to be fine.

Is this a bug in the example or a problem with HPX?

#include <hpx/hpx_init.hpp>
#include <hpx/include/util.hpp>
#include <hpx/include/components.hpp>

#include <cstddef>
#include <vector>

struct matrix_multiply_multiplier
  : hpx::components::component_base<matrix_multiply_multiplier> {
  std::size_t N;
  std::vector<double> A;

  // why does this get called?
  matrix_multiply_multiplier() : N(0) {}

  matrix_multiply_multiplier(std::size_t N, const std::vector<double>& A_)
    : N(N), A(A_) {}
};

HPX_REGISTER_COMPONENT(hpx::components::component<matrix_multiply_multiplier>,
                       matrix_multiply_multiplier);

int hpx_main() {
  // works on my computer for N = 4096
  std::size_t N = 8192;
  std::vector<double> A(N * N);

  std::vector<hpx::id_type> remote_ids = hpx::find_remote_localities();

  hpx::components::client<matrix_multiply_multiplier> comp =
      hpx::new_<matrix_multiply_multiplier>(remote_ids[0], N, A);

  return hpx::finalize();
}

int main(int argc, char** argv) {
  return hpx::init(argc, argv);
}
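
For reference, a two-locality run of this example could look as follows (the binary name is hypothetical; -l2 is the short form for requesting two localities and -0/-1 select the node number, matching the invocations shown in the outputs below):

./large_matrix -l2 -0
./large_matrix -l2 -1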

Output with release build of HPX:

pfandedd@dpfanderLSU ~ $ ./large_matrix_release -l2 -0
tcmalloc: large alloc 1073750016 bytes == 0x6172c000 @ 
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<hpx::exception> >'
  what():  null thread id encountered: HPX(null_thread_id)
Aborted

Important: depending on debug or release, the error occurs on a different locality!

Output with debug build of HPX:

pfandedd@dpfanderLSU ~ $ ./large_matrix_debug -l2 -1
{stack-trace}: 39 frames:
0x7f6eea13edd7  : ??? + 0x7f6eea13edd7 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea13ef24  : ??? + 0x7f6eea13ef24 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea1344c6  : hpx::termination_handler(int) + 0x19b in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6ee93fb330  : ??? + 0x7f6ee93fb330 in /lib/x86_64-linux-gnu/libpthread.so.0
0x7f6eea790d03  : hpx::threads::detail::thread_pool<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::get_state() const + 0x37 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea711f78  : hpx::threads::threadmanager_impl<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::status() const + 0x1c in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea160b34  : hpx::threads::threadmanager_is(hpx::state) + 0x5b in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea518be8  : hpx::parcelset::parcelport::add_received_parcel(hpx::parcelset::parcel, unsigned long) + 0x6e in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaabbb55  : ??? + 0x7f6eeaabbb55 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaab9116  : ??? + 0x7f6eeaab9116 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaab64b5  : ??? + 0x7f6eeaab64b5 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaab3346  : ??? + 0x7f6eeaab3346 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaac1999  : ??? + 0x7f6eeaac1999 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaac0327  : ??? + 0x7f6eeaac0327 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaabe790  : ??? + 0x7f6eeaabe790 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaabc832  : ??? + 0x7f6eeaabc832 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaab953f  : ??? + 0x7f6eeaab953f in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaab6b57  : ??? + 0x7f6eeaab6b57 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaac37a5  : ??? + 0x7f6eeaac37a5 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaac369c  : ??? + 0x7f6eeaac369c in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaac3431  : ??? + 0x7f6eeaac3431 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaac2f5b  : ??? + 0x7f6eeaac2f5b in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaac2760  : ??? + 0x7f6eeaac2760 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eeaac1877  : ??? + 0x7f6eeaac1877 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b5e3a  : ??? + 0x7f6eea9b5e3a in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b67ff  : ??? + 0x7f6eea9b67ff in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b653f  : ??? + 0x7f6eea9b653f in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b6a13  : ??? + 0x7f6eea9b6a13 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b50a5  : hpx::util::io_service_pool::thread_run(unsigned long) + 0xaf in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b9131  : ??? + 0x7f6eea9b9131 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b909d  : ??? + 0x7f6eea9b909d in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b902a  : ??? + 0x7f6eea9b902a in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b8f75  : ??? + 0x7f6eea9b8f75 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b8f09  : ??? + 0x7f6eea9b8f09 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6eea9b8ed0  : ??? + 0x7f6eea9b8ed0 in /home/pfandedd/git/hpx/build_debug/lib/libhpxd.so.1
0x7f6ee9614e7a  : ??? + 0x7f6ee9614e7a in /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.55.0
0x7f6ee93f3184  : ??? + 0x7f6ee93f3184 in /lib/x86_64-linux-gnu/libpthread.so.0
0x7f6ee89ac37d  : clone + 0x6d in /lib/x86_64-linux-gnu/libc.so.6
{what}: Floating point exception
{config}:
  HPX_HAVE_NATIVE_TLS=ON
  HPX_HAVE_STACKTRACES=ON
  HPX_HAVE_COMPRESSION_BZIP2=OFF
  HPX_HAVE_COMPRESSION_SNAPPY=OFF
  HPX_HAVE_COMPRESSION_ZLIB=OFF
  HPX_HAVE_PARCEL_COALESCING=ON
  HPX_HAVE_PARCELPORT_TCP=ON
  HPX_HAVE_PARCELPORT_MPI=OFF
  HPX_HAVE_PARCELPORT_IPC=OFF
  HPX_HAVE_PARCELPORT_IBVERBS=OFF
  HPX_HAVE_VERIFY_LOCKS=ON
  HPX_HAVE_HWLOC=ON
  HPX_HAVE_ITTNOTIFY=OFF
  HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_AGAS_LOCAL_CACHE_SIZE=4096
  HPX_HAVE_MALLOC=tcmalloc
  HPX_PREFIX (configured)=/home/pfandedd/git/hpx/build_debug
  HPX_PREFIX=/home/pfandedd/git/hpx/build_debug
{version}: V1.0.0-trunk (AGAS: V3.0), Git: a36ab93eab
{boost}: V1.55.0
{build-type}: debug
{date}: Sep 13 2016 09:34:29
{platform}: linux
{compiler}: GNU C++ version 4.8.5
{stdlib}: GNU libstdc++ version 20150623
Aborted
@hkaiser (Member) commented Sep 19, 2016

I have added the test from above here: https://github.com/STEllAR-GROUP/hpx/tree/fixing_2334. Unfortunately, the problem is not reproducible for me.

@sithhell (Member) commented Sep 19, 2016

Could you please check again with the latest master? If my suspicion is correct, this has been fixed by merging #2302. The relevant change should be this one: https://github.com/STEllAR-GROUP/hpx/pull/2302/files#diff-a8aad89cda0f631a36250f65a21f11d2R96

@DavidPfander-UniStuttgart (Contributor, Author) commented Sep 19, 2016

I checked again with the most recent master and the problem seems to be gone. I'm therefore closing this issue. Thank you very much!

@DavidPfander-UniStuttgart (Contributor, Author) commented Sep 19, 2016

OK, there is still an issue: my code still crashes for large matrices (N=8192), and it still looks related to the matrix transfer. The error says "broken pipe" (see attachment), which again could be a bug in my program or a bug in HPX. As I haven't created a minimal example and it might be a bug in my code, I'll leave this issue closed.
error_large_matrix.txt

@sithhell (Member) commented Sep 19, 2016

Broken pipe errors are usually caused by not closing a socket correctly (for example, piping the output to a pager and killing the pager while the application still wants to write to stdout/stderr). The TCP parcelport uses sockets as well.



@DavidPfander-UniStuttgart (Contributor, Author) commented Sep 19, 2016

Following your hint in the ste||ar chat, I tried using commit ba57370, which was the commit before the merge of parcel_optimizations.
Unfortunately, I'm still getting errors with my debug build:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<hpx::exception> >'
  what():  null thread id encountered: HPX(null_thread_id)

sithhell reopened this Sep 19, 2016

@sithhell (Member) commented Sep 19, 2016

Sorry, I was not being clear. You should use the mentioned commit and manually apply these changes:
https://github.com/STEllAR-GROUP/hpx/blob/master/src/runtime/threads/detail/thread_pool.cpp#L96-L102

@DavidPfander-UniStuttgart (Contributor, Author) commented Sep 19, 2016

I actually didn't read your instructions carefully enough, sorry for that.

Now with the change applied and HPX recompiled, I unfortunately still get the same error:

info: root node is not used for computation
computing on id: {0000000200000000, 0000000000000000}
using pseudodynamic distributed algorithm
tcmalloc: large alloc 1073750016 bytes == 0xa1c8c000 @ 
tcmalloc: large alloc 2147491840 bytes == 0xe1c8e000 @ 
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<hpx::exception> >'
  what():  null thread id encountered: HPX(null_thread_id)

I still cannot rule out that there is some problem in my code. I had problems with the lifetime of components before (which led to memory corruption). I will try to create a minimal example as soon as I find the time.

@DavidPfander-UniStuttgart (Contributor, Author) commented Sep 24, 2016

After quite extensive testing, it turned out that my error is again related to creating a component with a large memory footprint.

I had to add

  comp.get_id();

and I had to increase the size of the matrix A from N=8192 to N=16384 (which leads to a 2 GB matrix). I suspect that this is related to the total number of bytes, because the problem also appears if I add a second matrix and both matrices are 8192x8192.

If I slightly modify my example and run it only on a single locality, the problem also disappears.

The complete example is appended (the .txt file is a .cpp; the binary has to be run with 2 localities). Additionally, I appended the console log for a debug build of the example.

large_matrix.txt
error.txt
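
For reference, a minimal sketch of the modification described above (a reconstruction from the description, not the attached code; the actual large_matrix.txt may differ):

int hpx_main() {
  // 16384 * 16384 doubles == 2 GB, the size that triggers the crash
  std::size_t N = 16384;
  std::vector<double> A(N * N);

  std::vector<hpx::id_type> remote_ids = hpx::find_remote_localities();

  hpx::components::client<matrix_multiply_multiplier> comp =
      hpx::new_<matrix_multiply_multiplier>(remote_ids[0], N, A);

  // block until the remote component has actually been created
  comp.get_id();

  return hpx::finalize();
}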

@hkaiser (Member) commented Sep 26, 2016

While the error message is misleading, it is expected that things fail for arrays of this size. The default maximum inbound message size is currently 1 GByte. This value is configurable through the configuration database entry hpx.parcel.max_message_size=<new_max_message_size> (for instance, using the -I command line option).
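
For illustration, a hypothetical invocation raising the limit to 4 GBytes (this assumes the long --hpx:ini spelling of the option, of which -I is the short form; see the next comment for the spelling that actually worked here):

./large_matrix -l2 -0 --hpx:ini=hpx.parcel.max_message_size=4294967296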

@DavidPfander-UniStuttgart (Contributor, Author) commented Sep 27, 2016

I'm getting correct results with max_message_size increased. (Although I had to use --hpx:config; -I wouldn't work.)

hkaiser added a commit that referenced this issue Sep 29, 2016

Merge pull request #2348 from STEllAR-GROUP/fixing_2334
Adding test to verify #2334 is fixed
@sithhell (Member) commented Oct 7, 2016

The regression test covering this issue is still showing a failure. The failure is due to a bug in the termination detection algorithm.

sithhell reopened this Oct 7, 2016

sithhell added a commit that referenced this issue Oct 7, 2016

Fixing shutdown when parcels are still in flight
 - Making sure a receiver connection in the TCP parcelports waits until
   the connection has been completed to get completely closed
 - Make sure that no parcels are still in flight when starting the termination
   detection

This completely fixes #2334

hkaiser closed this in #2359 on Oct 9, 2016
