
GC bug on skyline benchmark? #156

Closed
shwestrick opened this issue May 31, 2022 · 3 comments
@shwestrick
Collaborator

Skyline benchmark from mpllang/parallel-ml-bench.

The bug sometimes causes a segfault, though I've also seen it hang. It appears to occur only at small core counts.

To reproduce:

$ cd parallel-ml-bench/mpl
$ make skyline.mpl.bin
$ bin/skyline.mpl.bin @mpl procs 4 -- -repeat 20

The bug is still present as of the current commit, b69ca19.

I think the bug is somewhere in CC. (If I disable forkGC in the scheduler, then the bug appears to go away. Or, at least, I haven't been able to trigger it after making that change.)

@shwestrick shwestrick added the bug label May 31, 2022
@shwestrick
Collaborator Author

Some notes:

  • Using @mpl max-cc-depth 1 -- (see PR Some runtime controls for CGC #163) seems to make the bug go away. This restricts CGC to be very shallow: it is allowed only on the root heap.
  • In gdb, I was able to trigger assertion failures in the debug build using a smaller input size (-size 100000).
  • The bug appears to be a dangling pointer originating from within the work-stealing deque.
  • Perhaps the work-stealing deque is not properly tracked by remembered sets in CGC?
    • On shwestrick/mpl/gc-debug, I tried additionally snapshotting the contents of the deque when CGC is spawned. But this didn't seem to change anything.
  • Perhaps LGC is forwarding an object reachable from deque but not updating the down-pointer? (Then later CGC takes over, and witnesses a dangling down-pointer.)
    • TODO: double check that LGC handles objects reachable from deque properly.

@shwestrick
Collaborator Author

Some possible progress on this.

I've discovered a race between LGC and scheduler steals. The problem is in the implementation of the ABP concurrent deque: on a steal, the read of the stolen value is performed optimistically, before the CAS that confirms the steal. In between the optimistic read and the CAS, a concurrent LGC could relocate the object, leaving the thief holding a dangling pointer. To fix this, I think all down-pointers from the work-stealing deque need to be pinned.

In our implementation so far, we've been handling the work-stealing deque specially. Its updates are not subjected to the standard write barrier, because this would cause all down-pointers from the deque to stay live forever.

But here's an interesting point: if I subject the deque to the standard write barrier, the bug seems to go away. (At least, I haven't been able to trigger it in this case yet.)

So, the interesting challenge now is to figure out how to pin deque down-pointers while also allowing these to be unpinned appropriately at a later time. Our current unpin-depth trick won't work, because the deques live in the global heap (depth 0), and after scheduler initialization, the program will never again return to depth 0.

Proposal: we could use a hybrid remembered set strategy, delimited by depth. For objects x at depth 0 and down-pointers x[i] := y, we would use remembered set entries of the form (x,i,y), enabling us to later invalidate the entry when x[i] != y. For all other objects (at depth 1 or deeper), we would continue to use unpin depths.

@shwestrick
Collaborator Author

This appears to be fixed in d1646cf, which was merged as part of #180.
