Problem with distributing factory #430
Comments
[comment by manderson] [Trac time Tue Jun 12 15:35:47 2012] I have duplicated the problem on Ithaca:

```
./ad_client --np 18 -0 -l2 --hpx:pu-offset 0
./ad_client --np 18 -1 -l2 --hpx:pu-offset 1
```

Output: *(not preserved in the Trac import)*
[comment by manderson] [Trac time Tue Jun 12 17:11:29 2012] In c92b72b, I have commented out everything in the code except factory.create_components, which hangs in a distributed run if the number of components is large enough.
[comment by blelbach] [Trac time Tue Jun 12 23:36:08 2012] Reproduced on Hermione, 43a47e4, GCC 4.6.2, Boost trunk, release build. Ran with: *(command not preserved in the Trac import)*

Doesn't seem to be a problem with Matt's code.
[comment by blelbach] [Trac time Tue Jun 12 23:40:45 2012] Looks like a deadlock, possibly arising from locking an HPX mutex and then trying to lock an OS-level mutex, which suspends the entire OS-thread and prevents the scheduler from executing the HPX-thread that would unlock the OS-level mutex. I suspect deadlock because the problem does not reproduce when the application runs multiple threads on each locality, which gives more leeway: a few of the OS-threads scheduling HPX-threads can be suspended at the kernel level without choking the threadmanager completely. The following results in a successful run on Ithaca:

*(command not preserved in the Trac import)*
[comment by blelbach] [Trac time Tue Jun 12 23:45:14 2012] Looks like a problem on the locality where the distributing factory lives/is invoked. The following does work (single-threaded worker, multithreaded console/AGAS):

*(command not preserved in the Trac import)*

However, this **doesn't** work:

*(command not preserved in the Trac import)*
[comment by blelbach] [Trac time Tue Jun 12 23:56:06 2012] Possible culprit: 28daa45, which adds a lock of an OS-thread mutex to create_one_component and fits the profile described above.
[comment by blelbach] [Trac time Wed Jun 13 00:00:18 2012] Actually, 28daa45 locks a local spinlock, but it still looks like the most likely cause of the issue.
[comment by blelbach] [Trac time Wed Jun 13 00:11:11 2012] Also, it runs fine if logging is enabled, at least with 20 components.
[comment by blelbach] [Trac time Wed Jun 13 00:11:58 2012] Also works with 200 components and logging enabled.
[comment by blelbach] [Trac time Wed Jun 13 00:45:55 2012] Can reproduce it with level 3 logs:

*(log output not preserved in the Trac import)*
[comment by blelbach] [Trac time Wed Jun 13 02:35:12 2012] Wasn't r8201.
[comment by blelbach] [Trac time Wed Jun 13 05:30:03 2012] Can get full logs with:

*(command not preserved in the Trac import)*
[comment by blelbach] [Trac time Wed Jun 13 05:51:13 2012] Problem diagnosed: it's the locking issue described in comment 4. The lock causing the problems is the lock on the connection_cache. Fix pending in the morning.
[comment by blelbach] [Trac time Wed Jun 13 21:49:49 2012] Resolved in 2fb6c1b. The diagnosis above is close but not entirely correct: it was actually a locking problem in the parcelport. See the commit message for more details.
[comment by blelbach] [Trac time Tue Jul 3 00:15:41 2012] Milestone 0.9.0-rc2 deleted
[reported by manderson] [Trac time Mon Jun 11 23:05:23 2012] The distributing factory hangs on distributed localities.
To reproduce: compile adaptive_dataflow r8210.
Simulate a distributed run as follows:

```
./ad_client -1 -l2 --np 20
./ad_client -0 -l2 --np 20
```

This prints:

```
Number of components: 20
Number of timesteps: 10
Number of localities: 2
BEFORE DISTRIBUTING FACTORY
```

and then hangs.
Strangely, if the number of components is reduced to, say, 12, it works:

```
./ad_client -1 -l2 --np 12
./ad_client -0 -l2 --np 12
```

```
init time: 0.0529384
gid time: 8.079e-06
compute time: 0.358329
rhs time: 0.267704
remove time: 0
Elapsed time: 0.787214 [s]
```
Suggestions?