Components are destructed too early #932
I'll take a look to see if any of the AGAS reference counting unit tests are failing.
Bryce Adelstein-Lelbach aka wash, STE||AR Group, Center for Computation and Technology, LSU |
Erik, that's anything but a small test case. Please reduce it to the absolute minimum so that we can understand and fix the issue. |
I reduced the test case; see https://bitbucket.org/eschnett/block-matrix/branch/issue-932. |
Thanks, that's much appreciated. A quick analysis shows that you're running into #588: Continuations do not keep object alive. I'll raise the priority of solving this. |
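The lifetime problem behind "continuations do not keep object alive" can be sketched in plain standard C++. This is only an analogy and uses no HPX API: `Component`, `make_observing_continuation`, `make_owning_continuation`, and `run_after_release` are hypothetical names, and `std::shared_ptr` stands in for a reference-counted id. The first continuation merely observes the component, so the component dies once the caller drops its reference; the second co-owns it and keeps it alive until the continuation has run.

```cpp
#include <functional>
#include <memory>

// Hypothetical stand-in for a component; not an HPX type.
struct Component {
    int value = 42;
};

// Builds a continuation that only observes the component (non-owning).
// It yields -1 if the component was destroyed before the continuation ran.
std::function<int()> make_observing_continuation(const std::shared_ptr<Component>& c) {
    std::weak_ptr<Component> weak = c;
    return [weak] {
        auto locked = weak.lock();
        return locked ? locked->value : -1;
    };
}

// Builds a continuation that co-owns the component, so the component
// stays alive for as long as the continuation object exists.
std::function<int()> make_owning_continuation(std::shared_ptr<Component> c) {
    return [c] { return c->value; };
}

// Drops the caller's only reference before running the continuation.
int run_after_release(bool owning) {
    auto comp = std::make_shared<Component>();
    std::function<int()> cont = owning ? make_owning_continuation(comp)
                                       : make_observing_continuation(comp);
    comp.reset();   // the caller's reference goes away here
    return cont();  // 42 if the continuation kept the component alive, else -1
}
```

In this sketch the non-owning case fails safely because `std::weak_ptr` detects the destroyed object; the segfaults reported in this issue correspond to the unchecked variant, where the continuation would touch freed memory.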
To work around this issue, I now capture all id_type from components that are created, and future<id_type> from clients that are generated. This avoids segfaults at run time. However, the following problems occur with this approach:
|
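The keep-alive work-around described above might look roughly like the following in standard C++. Again this is an analogy under stated assumptions, not HPX code: `Block`, `KeepAlive`, and `alive_after_local_release` are hypothetical names, and `std::shared_ptr` plays the role of a reference-counted id. Every handle that is created is also stored in a pool, so the object survives even after the local handle goes out of scope.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical payload; stands in for the data owned by a component.
struct Block {
    double data = 1.5;
};

// Holds one extra owning reference per registered object.
class KeepAlive {
public:
    // Stores a type-erased copy of the handle and returns the original.
    template <typename T>
    std::shared_ptr<T> hold(std::shared_ptr<T> p) {
        held_.push_back(p);  // implicit conversion to shared_ptr<void>
        return p;
    }

    std::size_t size() const { return held_.size(); }

    // Releases every reference held by the pool.
    void release_all() { held_.clear(); }

private:
    std::vector<std::shared_ptr<void>> held_;
};

// Creates a block, lets the local handle go out of scope, and reports
// whether the block is still alive thanks to the pool.
bool alive_after_local_release(KeepAlive& pool) {
    std::weak_ptr<Block> probe;
    {
        auto b = pool.hold(std::make_shared<Block>());
        probe = b;
    }  // local handle destroyed here; the pool still owns the block
    return !probe.expired();
}
```

In this sketch nothing is freed until `release_all()` is called, so the pool trades correctness for memory that grows with every object created.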
As said I'm working on it. Please see #588 for any progress on this. |
Erik, if you have the time, I'd encourage you to try out the branch https://github.com/STEllAR-GROUP/hpx/tree/fixing_588, where I have committed a first functional version of HPX with #588 fixed. There is still some work to be done (mainly optimizations and cleanup). You should now be able to remove all the hacks you introduced to keep things alive. |
I get a build error:
|
The following log shows that and a few other errors: hpx_clang33_x8664_boost154_debug |
Ok, that compiles now. Please try again. |
Things now build fine, and small tests work. In a larger test (still running on a single locality, but with multiple threads), HPX hangs in a wait_all call that waits for several actions to complete. Each of these actions runs a single, non-communicating thread that performs the same matrix multiplication. The hangs are non-deterministic. |
These are the last lines in my HPX log file when the application hangs. The log file grows very quickly while the application is hanging.
|
I just committed a possible fix for this (master branch): ed9aa59 |
I now see a deadlock in my application. I am using the MPI parcelport and am running with 2 processes. The problem does not appear when I use only a single MPI process. I am using c3dcb1c. I start with:
The last lines in the HPX log file are
|
You can use this https://bitbucket.org/eschnett/block-matrix (current head, 0751a51) to reproduce the problem. |
I ran the job in a debugger. The backtraces are: Process 0:
Process 1:
|
When I build and install "plugins.parcel.coalescing", the application hangs at a later point. It appears that a call to MKL does not finish, although I use only a single thread in MKL. |
This problem was caused by using MKL. Without MKL, things run fine. |
I'll close this because this is related to #588, which I'm actively working on. |
My application crashes because components are destructed although they are still accessible via an id_type. Since their data may then have been overwritten, this leads to segfaults or similar symptoms.
In discussions on #stellar, we assumed the issue was related to future<id_type> being returned from actions. That assumption is wrong; all actions in my application return id_type directly.
A presumed work-around, namely waiting for clients to have their futures ready before returning from actions, helps in many cases, but does not really solve the problem; when using multiple threads, the issue still appears from time to time.
A rather reliable test case is in https://bitbucket.org/eschnett/block-matrix, tagged with "future<id_type>". Run it on one locality with one thread. With valgrind, the code should abort with "Invalid read of size 8", likely in or near "matrix_t_component::faxpy" in "matrix_hpx.cc", line 48.