Exception in primary_namespace::resolve_free_list #2122

Closed
dmarce1 opened this issue Apr 26, 2016 · 11 comments

@dmarce1
Member

commented Apr 26, 2016

When running octotiger on 128 nodes at high resolution, I get this bug:

{what}: primary_namespace::resolve_free_list, failed to resolve gid, \
    gid({0000000700000001, 000000000b9815bc}): HPX(internal_server_error)

Full output is here:

https://gist.github.com/dmarce1/ff9b28e9b8f0d0d288ee6e557ec6fe8a

@hkaiser added this to the 0.9.12 milestone Apr 26, 2016

@hkaiser

Member

commented Apr 26, 2016

So something is definitely off with the garbage collection code. This error means something is freed twice.

@sithhell

Member

commented Apr 27, 2016

> So something is definitely off with the garbage collection code. This error means something is freed twice.

I was seeing a double free as well with the split_credit test (different symptoms though).

@dmarce1

Member Author

commented Apr 29, 2016

I just ran the same thing over again and it failed with the same error on the first timestep.

This tarball has the executable and the core dump from the process HPX says caused the fault:
https://drive.google.com/file/d/0B_Hf1bEwvJEkbDMtMnRwRFFNbHc/view?usp=sharing

@hkaiser

Member

commented Apr 29, 2016

@dmarce1 Is this reproducible now?

@dmarce1

Member Author

commented Apr 29, 2016

This is a 128 node run. I've been able to run it twice - the first time this error turned up after a couple hundred timesteps - the second time it turned up immediately. I will try and get a 128 node interactive session to see how reliably it comes up. I could run it in a loop, but not with core dumps, as the core dump files are huge.

@dmarce1

Member Author

commented May 1, 2016

I think this issue and the one here: #2108 must be the same bug

@hkaiser

Member

commented May 1, 2016

@dmarce1 I agree, I think that all the problems you're seeing are variations of the same cause. I've been thinking a lot about this lately and I now believe that it is not a race condition causing this. It's caused by an occasional out-of-order execution of credit-increment and credit-decrement operations. This would explain essentially all of the errors we're seeing: negative ref-counts, apparently duplicate object deletions, apparently non-existing gid mappings, etc. The question is: under what circumstances does this happen? I will continue staring at the relevant pieces of code hoping to spot the problem. As long as we're not able to make it reproducible, that's the only choice we have.
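
To illustrate the hypothesis above, here is a minimal toy sketch of how reordering credit operations alone could produce both a negative ref-count and an apparent double deletion. It is not HPX's AGAS code: `CreditEntry`, `Op`, and `apply` are hypothetical names invented for this example, and the "free when the count reaches zero" rule is an assumption about how a simple credit-based collector might behave.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of one credit-counted global object (hypothetical, not HPX code).
struct CreditEntry {
    std::int64_t credits;  // outstanding credits for the object
    int frees;             // how many times the object was released
    bool went_negative;    // true if the count ever dropped below zero
};

enum class Op { Increment, Decrement };

// Apply the credit operations in the order given. The object is released
// every time the count reaches zero, as a naive collector would do.
CreditEntry apply(std::int64_t initial, const std::vector<Op>& ops) {
    assert(initial > 0);  // a live object starts with at least one credit
    CreditEntry e{initial, 0, false};
    for (Op op : ops) {
        e.credits += (op == Op::Increment) ? 1 : -1;
        if (e.credits < 0) e.went_negative = true;
        if (e.credits == 0) ++e.frees;
    }
    return e;
}
```

Starting from one credit, the intended order `{Increment, Decrement, Decrement}` releases the object exactly once. The same three operations delivered as `{Decrement, Increment, Decrement}` release it twice, and `{Decrement, Decrement, Increment}` additionally drives the count negative. The set of operations is identical in all three cases, so no operation is lost or duplicated; only the order differs.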

@dmarce1

Member Author

commented May 2, 2016

Is it useful to have more core dumps? The core dump files can be very large, ~5 gigabytes for each node = 640 GB for a 128 node dump.

@hkaiser

Member

commented May 17, 2016

@dmarce1 Please try again using master. We have just merged a fix for a problem which affected two other applications suffering from similarly rare race conditions in their code (see #2165).

@sithhell

Member

commented Jun 22, 2016

We can close this once #2223 is merged.

@hkaiser

Member

commented Jun 24, 2016

#2223 has been merged. This should be fixed; please reopen if necessary.

@hkaiser closed this Jun 24, 2016
