Hang in wait_all() in distributed run #1946

Closed
hkaiser opened this issue Jan 7, 2016 · 3 comments
hkaiser (Member) commented Jan 7, 2016

Jan-Tobias Sohns wrote:

I further simplified my code to extract my problem. I run this code on 8 nodes of our cluster with 1 locality per node.

#include <hpx/hpx_init.hpp>
#include <hpx/hpx.hpp>
#include <hpx/include/actions.hpp>
#include <hpx/runtime/serialization/serialize.hpp>
#include <hpx/include/iostreams.hpp>

#include <math.h>
#include <vector>
#include <list>
#include <set>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#include <boost/ref.hpp>
#include <boost/format.hpp>
#include <boost/thread/locks.hpp>
#include <boost/serialization/vector.hpp>

void out(std::vector<uint> vec)
{
     hpx::cout << "out called " << hpx::find_here() << std::endl << 
hpx::flush;
}
HPX_PLAIN_ACTION(out, out_action);

int main(int argc, char* argv[])
{
     // Initialize and run HPX.
     return hpx::init(argc, argv);
}

int hpx_main(boost::program_options::variables_map& vm)
{
     // find locality info
     std::vector<hpx::naming::id_type> locs = hpx::find_all_localities();
     uint locid = hpx::get_locality_id();
     // create data
     std::vector<uint> vec;
     for (unsigned long j=0; j < 300000; j++)
     {
         vec.push_back(1);
     }
     // send out data
     for (uint j = 0; j < 8; j++)
     {
         std::vector<hpx::future<void> > fut1;
         for (uint i = 0; i < locs.size(); i++)
         {
             typedef out_action out_act;
             fut1.push_back(hpx::async<out_act>(locs.at(i), vec));
             hpx::cout << "Scheduled out to " << i+1 << std::endl << 
hpx::flush;
         }
         hpx::wait_all(fut1);
         hpx::cout << j+1 << ". round finished " << std::endl << hpx::flush;
     }
     hpx::cout << "program finished!!!" << std::endl << hpx::flush;
     return hpx::finalize();
}

And this is my output:

Scheduled out to 1
out called {0000000300000000, 0000000000000000}
out called {0000000500000000, 0000000000000000}
Scheduled out to 2
Scheduled out to 3
Scheduled out to 4
Scheduled out to 5
Scheduled out to 6
Scheduled out to 7
Scheduled out to 8
out called {0000000400000000, 0000000000000000}
out called {0000000100000000, 0000000000000000}
out called {0000000800000000, 0000000000000000}
out called {0000000600000000, 0000000000000000}
out called {0000000700000000, 0000000000000000}
out called {0000000200000000, 0000000000000000}
1. round finished
Scheduled out to 1
Scheduled out to 2
Scheduled out to 3
Scheduled out to 4
Scheduled out to 5
Scheduled out to 6
Scheduled out to 7
Scheduled out to 8
out called {0000000100000000, 0000000000000000}

Then I get stuck in an endless loop until my job times out.

The same code runs to completion if I use 8 localities on a single node, or if I decrease the size of "vec" to 3000. I need to send data of this size because I'm trying to do image compositing.
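
A minimal sketch of the same send/wait loop with per-future error checks added, so a remote exception would surface instead of the program silently blocking; the loop mirrors the reproducer above, and the has_exception()/get() checks are a diagnostic addition, not part of the original report:

std::vector<hpx::future<void> > futures;
for (uint i = 0; i < locs.size(); i++)
{
    // same remote call as in the reproducer
    futures.push_back(hpx::async<out_action>(locs.at(i), vec));
}

// block until every future is ready
hpx::wait_all(futures);

// rethrow any exception a remote out_action may have produced
for (uint i = 0; i < futures.size(); i++)
{
    if (futures.at(i).has_exception())
        futures.at(i).get();
}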

hkaiser (Member, Author) commented Jan 7, 2016

Our regression tests show that the code you supplied runs fine on master (see: here).

sithhell (Member) commented Jan 8, 2016

I could not reproduce this issue on SuperMIC using multiple MPI versions (mvapich2/2.0 and impi/5.0.1.035), in both debug and release builds, compiled with Intel compiler 15.0.0 and Boost 1.55.0.

hkaiser (Member, Author) commented Jan 9, 2016

Tim Biedert wrote:

Now I have “good” news: I have recompiled HPX 0.9.12 (git) using OpenMPI 1.8.5 (instead of Intel MPI 2016), and now the example seems to work as expected. As Jan already said, using 0.9.12 with Intel MPI 2016 before did not resolve the issue. So the problem seems to be MPI related.
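
Since the behavior points at the MPI parcelport, a minimal sketch for checking which parcelports a given build/run has enabled, assuming the standard hpx.parcel.mpi.enable / hpx.parcel.tcp.enable configuration keys and hpx::get_config_entry():

#include <hpx/hpx.hpp>
#include <hpx/include/iostreams.hpp>

// Print which parcelports the runtime configuration enables; calling this
// from hpx_main() shows whether traffic actually goes through MPI.
void print_parcelport_config()
{
    hpx::cout << "mpi parcelport enabled: "
              << hpx::get_config_entry("hpx.parcel.mpi.enable", "0") << "\n"
              << "tcp parcelport enabled: "
              << hpx::get_config_entry("hpx.parcel.tcp.enable", "1")
              << std::endl << hpx::flush;
}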

Closing this...

hkaiser closed this as completed Jan 9, 2016
hkaiser added a commit that referenced this issue Jan 11, 2016
Adding regression test for #1946: Hang in wait_all() in distributed run