
Implement a mark-and-sweep garbage collector #3072

Closed
lemmih opened this issue Jun 26, 2023 · 6 comments · Fixed by #3504 or #3689
Labels
Ready Issue is ready for work and anyone can freely assign it to themselves

Comments

@lemmih
Contributor

lemmih commented Jun 26, 2023

Issue summary

Context: We previously investigated the feasibility of using a mark-and-sweep garbage collector, but it was prohibitively expensive to scan values with RocksDB. New data shows that ParityDB can efficiently scan through all values. This opens the door for a more space-efficient mark-and-sweep GC.

Important caveats: We require state-roots going back 2 x chain_finality = 1800 epochs. Any data that is reachable within the last 1800 epochs must not be garbage collected.

The Forest database contains a persistent (i.e. immutable), directed acyclic graph. This makes implementing a mark-and-sweep garbage collector reasonably straightforward. The algorithm could look like this:

  1. Scan through all IPLD values in the database and mark their hash value in a hash set. Collisions don't affect correctness, so the hash size matters little.
  2. Wait for at least 1800 epochs after the scan.
  3. Traverse the reachable graph from HEAD (including at least the last 1800 epochs) and remove reachable nodes from the hash set.
  4. Scan the database again and delete every value whose hash is still in the hash set (a sketch in Rust follows this list).
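A minimal sketch of these four phases, assuming a hypothetical `Blockstore` trait whose `for_each_value`, `for_each_reachable` and `delete_value` methods are illustration-only stand-ins for the real Forest/ParityDB interfaces:

```rust
use std::collections::HashSet;

/// Hypothetical minimal view of the store, for illustration only.
trait Blockstore {
    /// Visit every raw IPLD value in the database.
    fn for_each_value(&self, f: &mut dyn FnMut(&[u8]));
    /// Visit every value reachable from HEAD within `depth` epochs.
    fn for_each_reachable(&self, depth: u64, f: &mut dyn FnMut(&[u8]));
    /// Delete a value (keyed by its hash) from the database.
    fn delete_value(&self, value: &[u8]);
}

/// Cheap, possibly-colliding 32-bit hash used only for marking.
fn short_hash(value: &[u8]) -> u32 {
    use std::hash::{Hash, Hasher};
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish() as u32
}

fn mark_and_sweep(db: &impl Blockstore, depth: u64) {
    // 1. Mark: record a short hash of every value currently in the store.
    let mut candidates = HashSet::new();
    db.for_each_value(&mut |v| {
        candidates.insert(short_hash(v));
    });

    // 2. Wait at least `depth` (>= 1800) epochs here, so that anything
    //    written after the scan is never considered for deletion.

    // 3. Filter: unmark everything reachable from HEAD.
    db.for_each_reachable(depth, &mut |v| {
        candidates.remove(&short_hash(v));
    });

    // 4. Sweep: delete values whose short hash is still marked. A collision
    //    can only keep garbage alive; it can never delete reachable data.
    db.for_each_value(&mut |v| {
        if candidates.contains(&short_hash(v)) {
            db.delete_value(v);
        }
    });
}
```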

Notes:

  • If there is a hash collision, some data may not be garbage collected. Hash collisions will never cause reachable data to be garbage collected.
  • ParityDB does not store keys. This is fine in Forest since our keys are always the hash of the IPLD value.
  • Make sure we no longer use buffered writes and use this approach instead if applicable.

Other information and links

Some knowledge regarding CID construction (a reconstruction sketch follows the list):

  1. For the GraphDagCborBlake2b256 parityDB column we currently use this approach (V1, DAG_CBOR codec, Blake2b256 hash). We need to construct the CID manually because this column does not have a btree_index, so we have no way to iterate the keys. This is done for performance reasons, as DAG_CBOR is the most common type of key.
  2. For the GraphFull column we have the btree_index, so we can iterate the keys. Otherwise there would be no way to get the CID hash.
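A sketch of that manual CID reconstruction, assuming the `cid` and `multihash-codetable` crates (the exact crate names and APIs used in Forest may differ):

```rust
use cid::Cid;
use multihash_codetable::{Code, MultihashDigest};

/// DAG-CBOR multicodec code.
const DAG_CBOR: u64 = 0x71;

/// Rebuild the CID of a value stored in the GraphDagCborBlake2b256 column:
/// CIDv1, DAG_CBOR codec, Blake2b256 digest of the raw value.
fn reconstruct_cid(value: &[u8]) -> Cid {
    Cid::new_v1(DAG_CBOR, Code::Blake2b256.digest(value))
}
```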
@lemmih lemmih added the Ready Issue is ready for work and anyone can freely assign it to themselves label Jun 26, 2023
@ruseinov ruseinov self-assigned this Jun 27, 2023
@ruseinov
Contributor

After looking into this, am I correct to assume that:

  1. With this approach we can ditch the whole old and new db concept, as we can just operate on the current db, deleting only unreachable stuff from the previously collected hash set.
  2. We'll need to calculate hashes of IPLD values according to the specified algorithm, which is already implemented in forest.
  3. We need to traverse the graph from HEAD to detect all the reachable nodes within 1800 epochs. The rest can be removed. Are there cases when we need more than 1800 epochs of history?
  4. We aren't currently aiming at parallel DAG traversal; as far as I can see, the current implementation is a recursive function that takes care of visiting everything that's reachable.

Additionally, once the above has been implemented we could look into parallel DAG traversal if that'd make sense performance-wise.

Please correct me if I'm wrong.

@hanabi1224
Contributor

IIRC, the major blocker with parity-db is that it's impossible to retrieve the raw (CID) key with its iteration API; see paritytech/parity-db#187

@lemmih
Contributor Author

lemmih commented Jun 28, 2023

After looking into this, am I correct to assume that:

1. With this approach we can ditch the whole `old and new` db concept, as we can just operate on the current db, deleting only unreachable stuff from the previously collected hash set.

Yes.

2. We'll need to calculate hashes of IPLD values according to the specified algorithm, which is already implemented in forest.

Yes. In essence, a CID is the hash of an IPLD value. But the devil is in the details. We can talk through the details in slack.

3. We need to traverse the graph from HEAD to detect all the reachable nodes within 1800 epochs. The rest can be removed. Are there cases when we need more than 1800 epochs of history?

Yes, 1800 is the absolute minimum, but many node operators want more history than that. The exact number of epochs to keep should be configurable (but still with 1800 as the minimum).

4. We aren't currently aiming at parallel DAG traversal; as far as I can see, the current implementation is a recursive function that takes care of visiting everything that's reachable.

Indeed. When exporting a snapshot, the key-value pairs must be emitted in depth-first order. But for the GC, a parallel traversal would be great.
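A rough sketch of what such a level-parallel walk could look like, assuming `rayon` and treating `get` and `links_of` as placeholders for the blockstore read and DAG-CBOR link extraction:

```rust
use std::collections::HashSet;

use rayon::prelude::*;

/// Level-parallel (breadth-first) walk over the DAG: load each level from the
/// store in parallel, then gather the next level's links. The visit order is
/// unspecified, which is fine for GC marking but not for snapshot export.
fn walk_parallel(
    roots: Vec<Vec<u8>>,
    get: impl Fn(&[u8]) -> Option<Vec<u8>> + Sync,
    links_of: impl Fn(&[u8]) -> Vec<Vec<u8>> + Sync,
    mut visit: impl FnMut(&[u8]),
) {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut frontier = roots;
    while !frontier.is_empty() {
        // Drop keys that were already queued in an earlier level.
        frontier.retain(|key| seen.insert(key.clone()));
        // Fetch the whole level from the store in parallel.
        let values: Vec<Vec<u8>> = frontier.par_iter().filter_map(|key| get(key)).collect();
        // Visit sequentially and collect the links forming the next level.
        frontier = values
            .iter()
            .flat_map(|value| {
                visit(value);
                links_of(value)
            })
            .collect();
    }
}
```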

Additionally, once the above has been implemented we could look into parallel DAG traversal if that'd make sense performance-wise.

Please correct me if I'm wrong.

Looks like we're on the same page.

@lemmih
Contributor Author

lemmih commented Jun 28, 2023

IIRC, the major blocker with parity-db is that it's impossible to retrieve the raw (CID) key with its iteration API; see paritytech/parity-db#187

We don't need to retrieve them. We can re-generate them from the IPLD value. It costs a bit of CPU time but isn't too bad.

@ruseinov
Contributor

I have got two questions:

  1. How are we going to deal with the GC progress endpoint and manual runs? I have currently opted to have EITHER manual OR automatic GC; with the current approach there is little sense in allowing both at the same time. But we can if needed.
  2. We probably need to re-think the whole progress report thing too, as reporting GC progress now makes very little sense. What we could do is something like: 30% when we have marked all the keys, 60% after filtering, and 100% after sweeping, but it's pretty inaccurate. As far as I understand, the GC progress report is part of the Filecoin RPC spec; I would love some guidance here (a rough sketch of this staged mapping follows the list).
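For illustration only, the staged mapping suggested above could be as coarse as this (the percentages are the ones proposed in this comment, not taken from any spec):

```rust
/// Coarse GC progress: one bump per completed phase.
#[derive(Clone, Copy, Debug)]
enum GcPhase {
    Idle,
    Marked,
    Filtered,
    Swept,
}

fn progress_percent(phase: GcPhase) -> u8 {
    match phase {
        GcPhase::Idle => 0,
        GcPhase::Marked => 30,
        GcPhase::Filtered => 60,
        GcPhase::Swept => 100,
    }
}
```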

@LesnyRumcajs
Member

LesnyRumcajs commented Nov 7, 2023

Re-opening; the feature was temporarily reverted in #3682.
