Implement a mark-and-sweep garbage collector #3072
Comments
After looking into this, am I correct to assume that:
Additionally, once the above has been implemented, we could look into parallel DAG traversal if that would make sense performance-wise. Please correct me if I'm wrong.
IIRC, the major blocker with parity-db is that it's impossible to retrieve the raw (CID) key with its iteration API; see paritytech/parity-db#187
Yes.
Yes. In essence, a CID is the hash of an IPLD value. But the devil is in the details. We can talk through the details in slack.
Yes, 1800 is the absolute minimum, but many node operators want more history than that. The exact number of epochs to keep should be configurable (but still with 1800 as the minimum).
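A configurable retention depth with a hard floor could be sketched as below. This is only an illustration: `CHAIN_FINALITY = 900` is inferred from the "2 x chain_finality = 1800" figure in the issue, the function names are hypothetical, and silently clamping a too-small value (rather than rejecting it) is an assumption.

```python
from typing import Optional

# Inferred from the issue: 2 x chain_finality = 1800 epochs.
CHAIN_FINALITY = 900
MIN_RETAINED_EPOCHS = 2 * CHAIN_FINALITY


def retained_epochs(configured: Optional[int]) -> int:
    """Return the retention depth, never below the 1800-epoch minimum."""
    if configured is None:
        return MIN_RETAINED_EPOCHS
    # Assumption: clamp instead of raising on a too-small setting.
    return max(configured, MIN_RETAINED_EPOCHS)


def oldest_protected_epoch(head_epoch: int, configured: Optional[int] = None) -> int:
    """Oldest epoch whose state-root must remain reachable after a GC run."""
    return max(0, head_epoch - retained_epochs(configured))
```

With these definitions, an operator asking for 100 epochs still gets 1800, while asking for 5000 keeps 5000.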
Indeed. When exporting a snapshot, the key-value pairs must be emitted in depth-first order. But for the GC, a parallel traversal would be great.
Looks like we're on the same page.
We don't need to retrieve them. We can re-generate them from the IPLD value. It costs a bit of CPU time but isn't too bad.
I have got two questions:
Re-opening; the feature was temporarily reverted in #3682
Issue summary
Context: We previously investigated the feasibility of using a mark-and-sweep garbage collector, but it was prohibitively expensive to scan values with RocksDB. New data shows that ParityDB can efficiently scan through all values. This opens the door for a more space-efficient mark-and-sweep GC.
Important caveats: We require state-roots going back 2 x chain_finality = 1800 epochs. Any data that is reachable within 1800 epochs must not be garbage collected.
The Forest database contains a persistent (i.e. immutable), directed acyclic graph. This makes implementing a mark-and-sweep garbage collector reasonably straightforward. The algorithm could look like this:
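As a generic illustration of mark-and-sweep over an immutable DAG (not necessarily the exact steps intended in this issue), here is a minimal Python sketch. Everything in it is an assumption made for the example: the database is a plain `dict` mapping a key to `(payload, child_keys)`, and `mark`, `sweep`, and `links` are hypothetical names, not Forest or ParityDB APIs.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

Key = Hashable
Node = Tuple[bytes, List[Key]]  # (payload, keys of child nodes)


def mark(roots: Iterable[Key], links: Callable[[Key], Iterable[Key]]) -> Set[Key]:
    """Collect every key reachable from the given roots."""
    marked: Set[Key] = set()
    stack = list(roots)
    while stack:
        key = stack.pop()
        if key in marked:
            continue  # shared sub-DAGs are visited only once
        marked.add(key)
        stack.extend(links(key))
    return marked


def sweep(db: Dict[Key, Node], marked: Set[Key]) -> int:
    """Delete every entry that was not marked; return how many were removed."""
    dead = [key for key in db if key not in marked]
    for key in dead:
        del db[key]
    return len(dead)


def collect(db: Dict[Key, Node], roots: Iterable[Key]) -> int:
    """Run one mark phase followed by one sweep phase."""
    # Missing keys are treated as leaves so a dangling link does not crash the walk.
    marked = mark(roots, lambda key: db.get(key, (b"", []))[1])
    return sweep(db, marked)
```

Here `roots` would stand for the state-roots of the epochs that must be retained. Because the graph is immutable, a node marked once never gains new children, which is what keeps the mark phase simple. A real implementation would also have to protect entries written between the mark and the sweep.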
Notes:
Other information and links
Some knowledge in regards to CID construction:
In the GraphDagCborBlake2b256 parityDB column we currently use this approach (V1, DAG_CBOR codec, Blake2b256 hash). We need to construct the CID manually because this column does not have a btree_index, so we have no way to iterate the keys. That's done for performance reasons, as DAG_CBOR is the most common type of key.
In the GraphFull column we have the btree-index, so we can iterate the keys. Otherwise there is no way to know how to get the CID hash.
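Re-generating a CID from a stored value for the (V1, DAG_CBOR, Blake2b256) case can be sketched as follows. This is a hedged illustration in Python, not Forest's actual code. The multicodec values (`0x71` for dag-cbor, `0xb220` for blake2b-256) come from the public multicodec table, while the function names are hypothetical. The result is the binary CID, without any multibase text encoding.

```python
import hashlib

CID_V1 = 0x01
DAG_CBOR = 0x71        # multicodec code for dag-cbor
BLAKE2B_256 = 0xB220   # multicodec code for blake2b-256
DIGEST_LEN = 32        # bytes in a 256-bit digest


def varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def cid_from_dag_cbor(value: bytes) -> bytes:
    """Build the binary CIDv1 for an already-encoded DAG-CBOR value."""
    digest = hashlib.blake2b(value, digest_size=DIGEST_LEN).digest()
    multihash = varint(BLAKE2B_256) + varint(DIGEST_LEN) + digest
    return varint(CID_V1) + varint(DAG_CBOR) + multihash
```

Every CID built this way shares the same 6-byte prefix (`01 71 a0 e4 02 20`) followed by the 32-byte digest, which is why the key can be recomputed from the value alone when the codec and hash function are fixed for the whole column.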