Hang in wait_all() in distributed run #1946

Closed
hkaiser opened this issue Jan 7, 2016 · 3 comments
hkaiser (Member) commented Jan 7, 2016

Jan-Tobias Sohns wrote:

I further simplified my code to extract my problem. I run this code on 8 nodes of our cluster with 1 locality per node.

#include <hpx/hpx_init.hpp>
#include <hpx/hpx.hpp>
#include <hpx/include/actions.hpp>
#include <hpx/runtime/serialization/serialize.hpp>
#include <hpx/include/iostreams.hpp>

#include <math.h>
#include <vector>
#include <list>
#include <set>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#include <boost/ref.hpp>
#include <boost/format.hpp>
#include <boost/thread/locks.hpp>
#include <boost/serialization/vector.hpp>

void out(std::vector<uint> vec)
{
     hpx::cout << "out called " << hpx::find_here() << std::endl << 
hpx::flush;
}
HPX_PLAIN_ACTION(out, out_action);

int main(int argc, char* argv[])
{
     // Initialize and run HPX.
     return hpx::init(argc, argv);
}

int hpx_main(boost::program_options::variables_map& vm)
{
     // find locality info
     std::vector<hpx::naming::id_type> locs = hpx::find_all_localities();
     uint locid = hpx::get_locality_id();
     // create data
     std::vector<uint> vec;
     for (unsigned long j=0; j < 300000; j++)
     {
         vec.push_back(1);
     }
     // send out data
     for (uint j = 0; j < 8; j++)
     {
         std::vector<hpx::future<void> > fut1;
         for (uint i = 0; i < locs.size(); i++)
         {
             typedef out_action out_act;
             fut1.push_back(hpx::async<out_act>(locs.at(i), vec));
             hpx::cout << "Scheduled out to " << i+1 << std::endl << 
hpx::flush;
         }
         hpx::wait_all(fut1);
         hpx::cout << j+1 << ". round finished " << std::endl << hpx::flush;
     }
     hpx::cout << "program finished!!!" << std::endl << hpx::flush;
     return hpx::finalize();
}

And this is my output:

Scheduled out to 1
out called {0000000300000000, 0000000000000000}
out called {0000000500000000, 0000000000000000}
Scheduled out to 2
Scheduled out to 3
Scheduled out to 4
Scheduled out to 5
Scheduled out to 6
Scheduled out to 7
Scheduled out to 8
out called {0000000400000000, 0000000000000000}
out called {0000000100000000, 0000000000000000}
out called {0000000800000000, 0000000000000000}
out called {0000000600000000, 0000000000000000}
out called {0000000700000000, 0000000000000000}
out called {0000000200000000, 0000000000000000}
1. round finished
Scheduled out to 1
Scheduled out to 2
Scheduled out to 3
Scheduled out to 4
Scheduled out to 5
Scheduled out to 6
Scheduled out to 7
Scheduled out to 8
out called {0000000100000000, 0000000000000000}

Then I get stuck in an endless loop until my job times out.

The same code runs to completion if I use 8 localities on a single node, or if I decrease the size of "vec" to 3000. I need to send data of this size because I'm trying to do image compositing.
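
A minimal sketch of the same send/wait loop with per-future error checks added, so a remote exception would surface instead of the program silently blocking; the loop mirrors the reproducer above, and the has_exception()/get() checks are a diagnostic addition, not part of the original report:

std::vector<hpx::future<void> > futures;
for (uint i = 0; i < locs.size(); i++)
{
    // same remote call as in the reproducer
    futures.push_back(hpx::async<out_action>(locs.at(i), vec));
}

// block until every future is ready
hpx::wait_all(futures);

// rethrow any exception a remote out_action may have produced
for (uint i = 0; i < futures.size(); i++)
{
    if (futures.at(i).has_exception())
        futures.at(i).get();
}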

hkaiser (Member, Author) commented Jan 7, 2016

Our regression tests show that the code you supplied runs fine on master (see: here).

sithhell (Member) commented Jan 8, 2016

I could not reproduce this issue on SuperMIC using multiple MPI versions (mvapich2/2.0 and impi/5.0.1.035), in both debug and release builds, compiled with Intel compiler 15.0.0 and Boost 1.55.0.

hkaiser (Member, Author) commented Jan 9, 2016

Tim Biedert wrote:

Now I have “good” news: I have recompiled HPX 0.9.12 (git) using OpenMPI 1.8.5 (instead of Intel MPI 2016), and now the example seems to work as expected. As Jan already said, using 0.9.12 with Intel MPI 2016 before did not resolve the issue. So the problem seems to be MPI related.
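
Since the behavior points at the MPI parcelport, a minimal sketch for checking which parcelports a given build/run has enabled, assuming the standard hpx.parcel.mpi.enable / hpx.parcel.tcp.enable configuration keys and hpx::get_config_entry():

#include <hpx/hpx.hpp>
#include <hpx/include/iostreams.hpp>

// Print which parcelports the runtime configuration enables; calling this
// from hpx_main() shows whether traffic actually goes through MPI.
void print_parcelport_config()
{
    hpx::cout << "mpi parcelport enabled: "
              << hpx::get_config_entry("hpx.parcel.mpi.enable", "0") << "\n"
              << "tcp parcelport enabled: "
              << hpx::get_config_entry("hpx.parcel.tcp.enable", "1")
              << std::endl << hpx::flush;
}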

Closing this...

hkaiser closed this as completed Jan 9, 2016
hkaiser added a commit that referenced this issue Jan 11, 2016
Adding regression test for #1946: Hang in wait_all() in distributed run