
GC bug on skyline benchmark? #156

Closed
shwestrick opened this issue May 31, 2022 · 3 comments
@shwestrick
Collaborator

Skyline benchmark from mpllang/parallel-ml-bench.

The bug sometimes causes a segfault, though I've also seen it hang. It appears to occur only at small core counts.

To reproduce:

$ cd parallel-ml-bench/mpl
$ make skyline.mpl.bin
$ bin/skyline.mpl.bin @mpl procs 4 -- -repeat 20

The bug is still present as of the current commit, b69ca19.

I think the bug is somewhere in CC. (If I disable forkGC in the scheduler, then the bug appears to go away. Or, at least, I haven't been able to trigger it after making that change.)

@shwestrick shwestrick added the bug label May 31, 2022
@shwestrick
Collaborator Author

Some notes:

  • Using @mpl max-cc-depth 1 -- (see PR Some runtime controls for CGC #163) seems to make the bug go away. This restricts CGC to be very shallow: it is allowed only on the root heap.
  • In gdb, I was able to trigger assertion failures in the debug build using a smaller input size (-size 100000).
  • The bug appears to be a dangling pointer originating from within the work-stealing deque.
  • Perhaps the work-stealing deque is not properly tracked by remembered sets in CGC?
    • On shwestrick/mpl/gc-debug, I tried additionally snapshotting the contents of the deque when CGC is spawned. But this didn't seem to change anything.
  • Perhaps LGC is forwarding an object reachable from deque but not updating the down-pointer? (Then later CGC takes over, and witnesses a dangling down-pointer.)
    • TODO: double check that LGC handles objects reachable from deque properly.

@shwestrick
Collaborator Author

Some possible progress on this.

I've discovered a race between LGC and scheduler steals. The problem is in the implementation of the ABP concurrent deque: on a steal, the read of the stolen value is performed optimistically, before the CAS that confirms the steal. In between the optimistic read and the CAS, a concurrent LGC could relocate the object, leaving the thief holding a dangling pointer. To fix this, I think all down-pointers from the work-stealing deque need to be pinned.

In our implementation so far, we've been handling the work-stealing deque specially. Its updates are not subjected to the standard write barrier, because this would cause all down-pointers from the deque to stay live forever.

But here's an interesting point: if I subject the deque to the standard write barrier, the bug seems to go away. (At least, I haven't been able to trigger it in this case yet.)

So, the interesting challenge now is to figure out how to pin deque down-pointers while also allowing these to be unpinned appropriately at a later time. Our current unpin-depth trick won't work, because the deques live in the global heap (depth 0), and after scheduler initialization, the program will never again return to depth 0.

Proposal: we could use a hybrid remembered set strategy, delimited by depth. For objects x at depth 0 and down-pointers x[i] := y, we would use remembered set entries of the form (x,i,y), enabling us to later invalidate the entry when x[i] != y. For all other objects (at depth 1 or deeper), we would continue to use unpin depths.

@shwestrick
Collaborator Author

This appears to be fixed in d1646cf, which was merged as part of #180.
