I found a problem where calling an HPX server action from a client never actually gets executed, so the returned future never gets triggered.
I tried really hard to reduce it to a minimal example, and this is as small as I could get it:
https://www.dropbox.com/s/6mkwkfrbnsnx9c7/deadlock_example.zip?dl=1
The way this program works:
Every worker requests work (stripped down to a bool(void) call in this example) and hands in the work after calculating it (a void(void) call in this example).
In the first iteration, the worker 'initializes' its component.
A total of 1000 packets get generated; after that, the workers get shut down.
I have the following counters set up:
number of packets requested
different points in the initialize if-clause
number of packets requested
At the end of every worker iteration I print these counters to hpx::cout.
In my example, I ran 4 nodes (MPI parcelport), which gives a total of 32 workers.
That should give me the following output:
or in a different order, as it is executed in parallel.
But it should have (1000/32/32/32/1000) in there somewhere, and then synchronize and finish the program.
What I get, though, is:
or in a different order.
This indicates that multiple workers (15 in this case; the number varies from 0 to 20) got stuck between counters 3 and 4.
To be precise, at:
worker.cpp(line 101): kernel.set_arg(buf).get();
This only happens in distributed execution. (So far only tried on 4+ nodes; it doesn't happen reproducibly on two nodes.)
I do not know what causes this lock.
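To make the expected totals concrete, here is a minimal single-process sketch of the worker loop and counters described above. It uses std::thread and std::atomic rather than HPX components, so it cannot reproduce the hang. The names Shared, request_work, hand_in_work, worker and run_workers are hypothetical, and treating the fifth counter as "packets handed in" is an assumption.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sizes taken from the report: 32 workers, 1000 packets.
constexpr int kWorkers = 32;
constexpr int kPackets = 1000;

// Hypothetical shared state standing in for the server component.
struct Shared {
    std::atomic<int> packets_left{kPackets};
    // counters[0]: packets requested
    // counters[1..3]: points in the initialize if-clause
    // counters[4]: assumed here to be packets handed in
    std::array<std::atomic<int>, 5> counters{};
};

// Stand-in for the bool(void) request: true while packets remain.
bool request_work(Shared& s) {
    if (s.packets_left.fetch_sub(1) <= 0) return false;
    ++s.counters[0];
    return true;
}

// Stand-in for the void(void) hand-in.
void hand_in_work(Shared& s) { ++s.counters[4]; }

void worker(Shared& s) {
    bool initialized = false;
    for (;;) {
        if (!initialized) {
            ++s.counters[1];
            ++s.counters[2];
            // In the report, stuck workers never get past this point
            // (kernel.set_arg(buf).get() sits between counters 3 and 4).
            ++s.counters[3];
            initialized = true;
        }
        if (!request_work(s)) break;
        hand_in_work(s);
    }
}

// Runs all workers to completion and returns the final counter values.
std::array<int, 5> run_workers() {
    Shared shared;
    std::vector<std::thread> pool;
    for (int i = 0; i < kWorkers; ++i) pool.emplace_back(worker, std::ref(shared));
    for (auto& t : pool) t.join();
    std::array<int, 5> result{};
    for (std::size_t i = 0; i < result.size(); ++i) result[i] = shared.counters[i].load();
    return result;
}
```

Calling run_workers() and printing the five values separated by slashes gives 1000/32/32/32/1000, the totals the distributed run is expected to reach before it synchronizes and exits.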
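One way to confirm which call never returns would be to use a timed wait instead of a bare .get(). The sketch below uses std::future as a stand-in for the future returned at worker.cpp line 101. It assumes the HPX future offers a comparable timed wait, which is not verified here. ready_within, demo_never_triggered and demo_triggered are hypothetical helper names.

```cpp
#include <chrono>
#include <future>

// Returns true if the future became ready within the timeout, false if it
// is still pending, which is the symptom described for the stuck call.
template <typename T>
bool ready_within(std::future<T>& f, std::chrono::milliseconds timeout) {
    return f.wait_for(timeout) == std::future_status::ready;
}

// A promise that is never fulfilled models an action that never executes.
bool demo_never_triggered() {
    std::promise<void> p;
    std::future<void> f = p.get_future();
    return ready_within(f, std::chrono::milliseconds(50));
}

// A fulfilled promise models an action that completed normally.
bool demo_triggered() {
    std::promise<void> p;
    std::future<void> f = p.get_future();
    p.set_value();
    return ready_within(f, std::chrono::milliseconds(50));
}
```

Logging the worker id whenever such a check returns false would show which workers are stuck without blocking the whole run.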