Deadlock .. somewhere? (probably serialization) #1189

Closed
Finomnis opened this Issue Jul 10, 2014 · 1 comment

Finomnis commented Jul 10, 2014

I found a problem where a call to an HPX server action from a client never actually gets executed, so the returned future never gets triggered.

I tried really hard to reduce it to a minimal problem, and this is as small as I could get it:

https://www.dropbox.com/s/6mkwkfrbnsnx9c7/deadlock_example.zip?dl=1

The way this program works:

  • It creates 8 worker threads per locality.
  • Every worker requests work (stripped down to a bool(void) action call in this example) and hands the work in after calculating it (a void(void) action in this example).
  • In the first iteration each worker 'initializes' the component.
  • A total of 1000 packets get generated; after that the workers get shut down.

I have the following counters set up:

  1. number of packets requested
  2.–4. different points at the initialize-if-clause
  5. number of packets handed in

At the end of every worker iteration I print these counters to hpx::cout.

In my example, I ran 4 nodes (mpi parcelport), that is a total of 32 workers.
That should give me the following output:

... cut away ...
(995/32/32/32/995)
(996/32/32/32/996)
(997/32/32/32/997)
(998/32/32/32/998)
(999/32/32/32/999)
(1000/32/32/32/1000)

or in a different order, since it executes in parallel.
But it should contain (1000/32/32/32/1000) somewhere, then synchronize and finish the program.

What I get, though, is:

... cut away ...
(998/32/32/17/983)
(999/32/32/17/984)
(1000/32/32/17/985)
<deadlock>

or in a different order, which indicates that multiple workers (15 in this case; the number varies between 0 and 20) got stuck between counter 3 and 4.
To be precise, at:

worker.cpp(line 101): kernel.set_arg(buf).get();

This only happens in distributed execution. (So far I have only tried it on 4+ nodes; it doesn't happen reproducibly on two nodes.)

I do not know what causes this lock.

@hkaiser hkaiser added this to the 0.9.9 milestone Jul 10, 2014

@hkaiser hkaiser self-assigned this Jul 10, 2014

@hkaiser hkaiser referenced this issue Jul 12, 2014

Merged

Fixing 1189 #1191


hkaiser commented Jul 13, 2014

This has been fixed by c3f50f1

@hkaiser hkaiser closed this Jul 13, 2014
