Problem with distributing factory #430

Closed
brycelelbach opened this issue Jul 10, 2012 · 15 comments

@brycelelbach
Member

[reported by manderson] [Trac time Mon Jun 11 23:05:23 2012] The distributing factory hangs when running on distributed localities.

To reproduce: compile adaptive_dataflow r8210

Simulate distributed as follows:

./ad_client -1 -l2 --np 20

./ad_client -0 -l2 --np 20

prints:

Number of components: 20
Number of timesteps: 10
Number of localities: 2
BEFORE DISTRIBUTING FACTORY

and then hangs.

Strangely, if the number of components is reduced to, say, 12, it works:

./ad_client -1 -l2 --np 12

./ad_client -0 -l2 --np 12

init time: 0.0529384
gid time: 8.079e-06
compute time: 0.358329
rhs time: 0.267704
remove time: 0
Elapsed time: 0.787214 [s]

Suggestions?

@brycelelbach
Member Author

[comment by manderson] [Trac time Tue Jun 12 15:35:47 2012] I have duplicated the problem on ithaca:

./ad_client --np 18 -0 -l2 --hpx:pu-offset 0

./ad_client --np 18 -1 -l2 --hpx:pu-offset 1

Output:
Number of components: 18
Number of timesteps: 10
Number of localities: 2
BEFORE DISTRIBUTING FACTORY
(hangs...)

@brycelelbach
Member Author

[comment by manderson] [Trac time Tue Jun 12 17:11:29 2012] In c92b72b, I have commented out everything in the code except factory.create_components, which hangs in distributed if the number of components is large enough.

@brycelelbach
Member Author

[comment by blelbach] [Trac time Tue Jun 12 23:36:08 2012] Reproduced on Hermione, 43a47e4, GCC 4.6.2, Boost trunk, release build. Ran with:

bin/ad_client --np 20 -0 -l2
bin/ad_client -1 -l2

Doesn't seem to be a problem with Matt's code.

@brycelelbach
Member Author

[comment by blelbach] [Trac time Tue Jun 12 23:40:45 2012] Looks like this is deadlock, possibly some sort of issue arising from locking an HPX mutex, then trying to lock an OS-level mutex and causing the entire OS-thread to be suspended (and preventing the scheduler from executing the HPX-thread that would unlock the OS-level mutex).

I suspect this is deadlock because this does not reproduce if you run the application with multiple threads on each locality (which gives more leeway, because a few of the OS-threads scheduling HPX-threads can be suspended at the kernel level without choking the threadmanager completely).

The following results in a successful run on Ithaca:

bin/ad_client --np 20 -0 -l2 -t4
bin/ad_client -1 -l2 -t4

@brycelelbach
Member Author

[comment by blelbach] [Trac time Tue Jun 12 23:45:14 2012] Looks like a problem on the locality where the distributing factory lives/is invoked. The following does work (single threaded worker, multithreaded console/AGAS):

bin/ad_client --np 20 -0 -l2 -t4
bin/ad_client -1 -l2

However, this does NOT work:

bin/ad_client --np 20 -0 -l2
bin/ad_client -1 -l2 -t4

@brycelelbach
Member Author

[comment by blelbach] [Trac time Tue Jun 12 23:56:06 2012] Possible culprit: 28daa45 (adds a lock of an OS-thread mutex to create_one_component, fits the profile described above).

@brycelelbach
Member Author

[comment by blelbach] [Trac time Wed Jun 13 00:00:18 2012] Correction: 28daa45 locks a local spinlock rather than an OS-thread mutex, but it still looks like the most likely cause of the issue.

@brycelelbach
Member Author

[comment by blelbach] [Trac time Wed Jun 13 00:11:11 2012] Also, runs fine if logging is enabled, at least with 20 components.

@brycelelbach
Member Author

[comment by blelbach] [Trac time Wed Jun 13 00:11:58 2012] Also works with 200 components and logging enabled.

@brycelelbach
Member Author

[comment by blelbach] [Trac time Wed Jun 13 00:45:55 2012] Can reproduce it with level 3 logs:

(T00000000/000000000272e000.01/----------------) P--------/----------------.-- 00:12.36.750 [0000000000000001]   <error> [ERR] created exception: the sine component is not enabled on the commandline (--sine), bailing out: HPX(component_load_failure)
(T00000000/000000000272e000.01/----------------) P--------/----------------.-- 00:12.36.768 [0000000000000002] <warning>  [RT] caught exception while loading sine_counter, HPX(component_load_failure): the sine component is not enabled on the commandline (--sine), bailing out: HPX(component_load_failure)
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.36.774 [0000000000000003]   <error>  [TM] Listing suspended threads while queue (0) is empty:
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.36.774 [0000000000000004]   <error>  [TM] queue(0): suspended(0x272e000.03/00000000) P00000000: pre_main: barrier::set_event
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.36.774 [0000000000000005]   <error>  [TM] queue(0): no new work available, are we deadlocked?
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.38.608 [0000000000000006]   <error>  [TM] Listing suspended threads while queue (0) is empty:
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.38.608 [0000000000000007]   <error>  [TM] queue(0): suspended(0x272e000.06/00000000) P00000000: pre_main: barrier::set_event
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.38.608 [0000000000000008]   <error>  [TM] queue(0): no new work available, are we deadlocked?
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.38.641 [0000000000000009]   <error>  [TM] Listing suspended threads while queue (0) is empty:
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.38.641 [000000000000000a]   <error>  [TM] queue(0): suspended(0x272e000.08/00000000) P00000000: hpx_main: full_empty_entry::enqueue_full_full
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.38.641 [000000000000000b]   <error>  [TM] queue(0): suspended(0x272e040.04/028c2110) P0x272e000: distributing_factory_create_components_action: full_empty_entry::enqueue_full_full
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.38.641 [000000000000000c]   <error>  [TM] queue(0): no new work available, are we deadlocked?
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.38.767 [000000000000000d]   <error>  [TM] Listing suspended threads while queue (0) is empty:
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.38.767 [000000000000000e]   <error>  [TM] queue(0): suspended(0x272e040.06/028c2110) P0x272e000: distributing_factory_create_components_action: full_empty_entry::enqueue_full_full
(T00000000/----------------.--/----------------) P--------/----------------.-- 00:12.38.767 [000000000000000f]   <error>  [TM] queue(0): no new work available, are we deadlocked?

@brycelelbach
Member Author

[comment by blelbach] [Trac time Wed Jun 13 02:35:12 2012] The cause wasn't r8201.

@brycelelbach
Member Author

[comment by blelbach] [Trac time Wed Jun 13 05:30:03 2012] Can get full logs with:

bin/ad_client -0 -l2 --np 4000 -t2 --hpx:debug-hpx-log='file(/tmp/hpx.log)'
bin/ad_client -1 -l2

@brycelelbach
Member Author

[comment by blelbach] [Trac time Wed Jun 13 05:51:13 2012] Problem diagnosed, it's the locking issue described in comment 4 - the lock causing the problems is the lock on the connection_cache. Fix pending in the morning.

@brycelelbach
Member Author

[comment by blelbach] [Trac time Wed Jun 13 21:49:49 2012] Resolved in 2fb6c1b. The diagnosis above is close but not entirely correct - it was actually a locking problem in the parcelport. See the commit message for more details.

@brycelelbach
Member Author

[comment by blelbach] [Trac time Tue Jul 3 00:15:41 2012] Milestone 0.9.0-rc2 deleted
