"negative entry in reference count table" running octotiger on 32 nodes on queenbee #2171
Comments
This just means that our latest changes have not fixed the original problem you were having. It is still the very same reference-counting issue that keeps popping up for you. Thanks for letting us know. See also #2122 and #2108, which are different errors likely caused by the same problem. |
@sithhell said that he thinks this is caused by stack overflows. From his email:
|
@dmarce1 We have fixed another of those bugs which makes you wonder why things have worked before at all. The fix has been merged to master. I wouldn't be surprised if that has fixed your issues as well. |
… small - apply_helper - put_parcel This should solve the stack-overflows which are assumed to cause problems like reported in #2171
@sithhell I have implemented the stack checks for |
When I compile octotiger with fixing_2171 it freezes on startup, it doesn't even get to the first line of main(). |
@dmarce1 If that's the case then something else is amiss - some stale binaries getting in the way, perhaps. |
I removed all the binaries and rebuilt everything (HPX and Octo-tiger) using the master HPX branch (the one with the merge of fixing_2171) - I get the same problem, freeze on startup. |
How many nodes? |
I have tried on 128 and 32. |
It seems to work on 1 core, 1 full node, or 16 nodes, but it froze on 32 and 128 nodes. |
@dmarce1 The startup hang should be fixed on master (tm) |
It looks like the hang is fixed. |
Ok, I'm still getting freezes, just after startup. |
I get freezes in the early stages of execution, during grid setup, for anything at 16 nodes or more, it seems. (I rebuilt everything: HPX from master, and octotiger.) |
OK, the hangs are definitely gone, but I am still getting the same sorts of crashes.
|
@dmarce1 could you please try the (new) fixing_2171 branch and see if that fixes your problem? |
OK. I'm on it now. |
I am still getting similar errors: |
@dmarce1 Are you at least using 0aa5be3 (see https://github.com/STEllAR-GROUP/hpx/pull/2206/commits)? |
OK, there is a remote possibility I was using the wrong binaries. I found some old HPX binaries in a directory I'm pretty sure wasn't used, but I am removing and rebuilding everything from scratch to make sure. |
Similar bug:
|
Still the same old problem... sigh @dmarce1 Thomas has reimplemented part of the remote promise/future architecture (responsible for those issues) and claims that all of our tests pass for him. If you're feeling adventurous, try running off the branch |
I tried running off the branch simplify_promise. So far, 67+ hours and no crash. Compared to the longest previous run (which ended with a crash), it has already gone through more than 5x as many timesteps. |
With the latest commit (4324def), I get a "negative entry in reference count table" error when running octotiger on 32 nodes on queenbee. The error output is here: https://gist.github.com/dmarce1/75c82c798733a9dcaa75b99bb3ca9c45
The core dump, executable, and all binaries used in the run are on SuperMIC at /project/dmarce1/core_dump_3