[Enhancement] Zero-copy DAG_CBOR into Vec<Cid> parser. #3379

Merged on Aug 18, 2023 (25 commits)

Contributor

@ruseinov ruseinov commented Aug 15, 2023

Summary of changes

Changes introduced in this pull request:

  • Introduced a deserializer to speed up parsing of DAG_CBOR into Vec<Cid>.

Reference issue to close (if applicable)

Closes

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

Contributor

@aatifsyed aatifsyed left a comment

I'm confused - the title says zero-copy serialization of Vec<Cid>, but I'd expect to see a lifetime in that case, deserializing into a &'serial [Cid] or something.

Plus, the in-memory representation of a CID is different to the CBOR serialization, so I'd expect to see a new struct, but I don't - what am I missing?

@lemmih
Contributor

lemmih commented Aug 15, 2023

> I'm confused - the title says zero-copy serialization of Vec<Cid>, but I'd expect to see a lifetime in that case, deserializing into a &'serial [Cid] or something.
>
> Plus, the in-memory representation of a CID is different to the CBOR serialization, so I'd expect to see a new struct, but I don't - what am I missing?

The CIDs are copied (which can be done without any heap allocations.) Other data, such as bytes and strings, are not copied.

@ruseinov
Contributor Author

ruseinov commented Aug 15, 2023

> I'm confused - the title says zero-copy serialization of Vec<Cid>, but I'd expect to see a lifetime in that case, deserializing into a &'serial [Cid] or something.
>
> Plus, the in-memory representation of a CID is different to the CBOR serialization, so I'd expect to see a new struct, but I don't - what am I missing?

The title does not say anything about serialization whatsoever. It's deserialization of a CBOR-encoded blob into a Vec<Cid>.

  1. We are recursively traversing the DAG_CBOR-encoded blob without parsing everything into a recursive Ipld struct.
  2. All the results are saved into a Vec<Cid>.
  3. That yields a roughly 3x performance boost over normal parsing.
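
To make the list above concrete, here is a minimal, std-only sketch of the idea: walk the encoded bytes once, recurse only into lists and maps, skip every scalar, and keep just the payload of each tag-42 link. It is not the code in this PR, which goes through serde and `serde_ipld_dagcbor`. The names `extract_link_bytes`, `read_head`, `take` and `walk` are hypothetical, and the sketch returns the raw link payloads as borrowed slices instead of decoding them into `Cid` values.

```rust
/// Reads one CBOR head: the major type (top 3 bits) and its argument.
fn read_head(input: &mut &[u8]) -> Result<(u8, u64), String> {
    let (&first, rest) = input.split_first().ok_or("unexpected end of input")?;
    *input = rest;
    let (major, info) = (first >> 5, first & 0x1f);
    let arg = match info {
        // Small arguments are stored directly in the head byte.
        0..=23 => u64::from(info),
        24..=27 => {
            // The argument follows in 1, 2, 4 or 8 big-endian bytes.
            let width = 1usize << (info - 24);
            if input.len() < width {
                return Err("truncated argument".to_string());
            }
            let (bytes, rest) = input.split_at(width);
            *input = rest;
            bytes.iter().fold(0u64, |acc, &b| (acc << 8) | u64::from(b))
        }
        _ => return Err("reserved or indefinite-length head".to_string()),
    };
    Ok((major, arg))
}

/// Splits `len` bytes off the front of `input` without copying them.
fn take<'a>(input: &mut &'a [u8], len: u64) -> Result<&'a [u8], String> {
    if (input.len() as u64) < len {
        return Err("truncated payload".to_string());
    }
    let (payload, rest) = input.split_at(len as usize);
    *input = rest;
    Ok(payload)
}

/// Visits one value, recursing into lists and maps and recording links.
fn walk<'a>(input: &mut &'a [u8], links: &mut Vec<&'a [u8]>) -> Result<(), String> {
    let (major, arg) = read_head(input)?;
    match major {
        // Integers, floats and simple values are fully consumed by the head.
        0 | 1 | 7 => {}
        // Byte and text strings: skip over the payload.
        2 | 3 => {
            take(input, arg)?;
        }
        // List: `arg` items follow.
        4 => {
            for _ in 0..arg {
                walk(input, links)?;
            }
        }
        // Map: `arg` key/value pairs follow.
        5 => {
            for _ in 0..arg.saturating_mul(2) {
                walk(input, links)?;
            }
        }
        // Tag: DAG-CBOR only allows tag 42, wrapping a byte string.
        _ => {
            if arg != 42 {
                return Err(format!("unsupported tag {}", arg));
            }
            let (inner, len) = read_head(input)?;
            if inner != 2 {
                return Err("tag 42 must wrap a byte string".to_string());
            }
            links.push(take(input, len)?);
        }
    }
    Ok(())
}

/// Hypothetical entry point: returns the raw payload of every link in `blob`.
pub fn extract_link_bytes(mut blob: &[u8]) -> Result<Vec<&[u8]>, String> {
    let mut links = Vec::new();
    walk(&mut blob, &mut links)?;
    if !blob.is_empty() {
        return Err("trailing bytes after the top-level value".to_string());
    }
    Ok(links)
}
```

Nothing here allocates except the output vector, and no intermediate tree is ever built. Recursion depth is unbounded in this sketch, so a real implementation would likely want a depth limit for untrusted input.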

@ruseinov
Contributor Author

Single-run results for `cargo forest-tool benchmark car-streaming --inspect forest_snapshot_calibnet_2023-08-02_height_788380.forest.car`:

  • main: traversed 87.43 GiB at 163.18 MiB/s in 00:09:08
  • this branch: traversed 87.43 GiB at 456.14 MiB/s in 00:03:16

ruseinov and others added 2 commits August 15, 2023 17:32
@ruseinov
Contributor Author

> I'm confused - the title says zero-copy serialization of Vec<Cid>, but I'd expect to see a lifetime in that case, deserializing into a &'serial [Cid] or something.
> Plus, the in-memory representation of a CID is different to the CBOR serialization, so I'd expect to see a new struct, but I don't - what am I missing?

> The CIDs are copied (which can be done without any heap allocations.) Other data, such as bytes and strings, are not copied.

Yeah, zero-copy here is a bit misleading, I'll make sure this is documented properly.

@lemmih
Contributor

lemmih commented Aug 15, 2023

> I'm confused - the title says zero-copy serialization of Vec<Cid>, but I'd expect to see a lifetime in that case, deserializing into a &'serial [Cid] or something.
> Plus, the in-memory representation of a CID is different to the CBOR serialization, so I'd expect to see a new struct, but I don't - what am I missing?

> The CIDs are copied (which can be done without any heap allocations.) Other data, such as bytes and strings, are not copied.

> Yeah, zero-copy here is a bit misleading, I'll make sure this is documented properly.

I don't think it's inaccurate to call this a zero-copy parser. The PR is doing two things: a) parsing an ipld structure, b) extracting CIDs from Link branches. The parsing part is definitely zero-copy. It's not a generalized zero-copy Ipld parser, though.

@ruseinov ruseinov marked this pull request as ready for review August 15, 2023 21:54
@ruseinov ruseinov requested a review from a team as a code owner August 15, 2023 21:54
@ruseinov ruseinov requested review from elmattic and aatifsyed and removed request for a team August 15, 2023 21:54
@aatifsyed
Contributor

Ahh I understand now!

> The CIDs are copied (which can be done without any heap allocations.) Other data, such as bytes and strings, are not copied.
> We are recursively traversing the DAG_CBOR-encoded blob without parsing everything into a recursive Ipld struct.
> The PR is doing two things: a) parsing an ipld structure, b) extracting CIDs from Link branches. The parsing part is definitely zero-copy. It's not a generalized zero-copy Ipld parser, though.

Zero-copy in my head means holding references into buffers. Copying CIDs out of it is not zero copy ;)

I think (pre?) filtered deserialization could be a better term here?

}

#[inline]
fn visit_str<E>(self, _value: &str) -> Result<Self::Value, E>
Contributor

nonblocking: can you macro these away with a comment? Or at least lift the most relevant fns to the top?
"the default visitor errors when visiting these, but we want to skip over them, so return Ok(())"
Out of curiosity why have you only done that for some of the visited types?

Contributor Author

We only care about list, map and newtype_struct, because a list and a map could contain more of those and the only type that converts to a CID is the newtype_struct.
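
As a rough illustration of the point about default visitor methods: serde's `Visitor` supplies default `visit_*` methods that return an error, so a visitor that wants to ignore a scalar has to override the method and return `Ok(())`. The sketch below mimics only that shape with a hypothetical, std-only `MiniVisitor` trait. It is not serde's real trait, and `Strict` and `SkipScalars` are made-up names.

```rust
/// Hypothetical, much-reduced stand-in for a serde-style visitor.
pub trait MiniVisitor: Sized {
    type Value;

    /// Describes what this visitor is looking for, used in error messages.
    fn expecting(&self) -> &'static str;

    /// Default behaviour: any scalar that was not asked for is an error.
    fn visit_str(self, _value: &str) -> Result<Self::Value, String> {
        Err(format!("invalid type: string, expected {}", self.expecting()))
    }

    fn visit_u64(self, _value: u64) -> Result<Self::Value, String> {
        Err(format!("invalid type: integer, expected {}", self.expecting()))
    }
}

/// Keeps the defaults, so every scalar is rejected.
pub struct Strict;

impl MiniVisitor for Strict {
    type Value = ();

    fn expecting(&self) -> &'static str {
        "a list, a map or a link"
    }
}

/// Overrides the defaults so scalars are silently skipped.
pub struct SkipScalars;

impl MiniVisitor for SkipScalars {
    type Value = ();

    fn expecting(&self) -> &'static str {
        "a list, a map or a link"
    }

    fn visit_str(self, _value: &str) -> Result<(), String> {
        Ok(())
    }

    fn visit_u64(self, _value: u64) -> Result<(), String> {
        Ok(())
    }
}
```

With these definitions, `Strict.visit_str("x")` returns an error while `SkipScalars.visit_str("x")` returns `Ok(())`, which is the difference the comment above is asking to have documented.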

Contributor Author

I don't think that introducing a macro here is going to help anything, it's just going to obfuscate the obvious. I'll re-arrange the order, however.

Comment on lines 16 to 22
impl Deref for CidVec {
    type Target = Vec<Cid>;

    fn deref(&self) -> &Self::Target {
        &self.0
    }
}
Contributor

> Implementing Deref for smart pointers makes accessing the data behind them convenient, which is why they implement Deref. On the other hand, the rules regarding Deref and DerefMut were designed specifically to accommodate smart pointers. Because of this, Deref should only be implemented for smart pointers to avoid confusion.

Not a smart pointer

Contributor Author

Not a smart pointer, but since we have a thin wrapper here strictly for deserialization that also explicitly states it is a ‘Vec’, it’s not surprising that we want to use a ‘CidVec’ as its underlying ‘Vec’ without any explicit conversions.

Contributor Author

I guess it’s a “choose your poison” kind of thing and not using helpful abstractions in this case seems wasteful.

Contributor

I would have gone with into_inner here.

Contributor Author

Alright, into_inner it is then.
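
For reference, a minimal sketch of what the `into_inner` approach could look like for a thin wrapper like this one. The `Cid` type below is a hypothetical stand-in so the snippet compiles on its own (the real one comes from the `cid` crate), and `decode_stub` is a made-up placeholder for the deserialization step.

```rust
/// Hypothetical stand-in for the real `Cid` type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Cid(pub Vec<u8>);

/// Thin wrapper that exists only to carry a custom deserializer.
#[derive(Debug, Default)]
pub struct CidVec(Vec<Cid>);

impl CidVec {
    /// Consumes the wrapper and hands back the underlying vector.
    pub fn into_inner(self) -> Vec<Cid> {
        self.0
    }
}

/// Placeholder for the step that would fill the wrapper during decoding.
pub fn decode_stub() -> CidVec {
    CidVec(vec![Cid(vec![1, 2, 3])])
}
```

Callers then write `decode_stub().into_inner()` and get a plain `Vec<Cid>`. The conversion is explicit at the call site, which is the trade-off against the implicit access that `Deref` would give.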

Contributor

> not using helpful abstractions in this case seems wasteful.

It's the wrong abstraction, and a commonly misused one at that, which is why it's called out in the docs

Contributor Author

> It's the wrong abstraction, and a commonly misused one at that, which is why it's called out in the docs

Sure is, though I'm still of the opinion that thin wrappers could benefit a lot from this. I would go further and say that the need for such wrappers to override certain behaviours calls for something magic/transparent to make the wrapper adhere to the same interface as whatever is wrapped. I wish there was a more eloquent way of implementing custom ser/de for types, e.g. the ability to have an implementation for a type alias.

Personal opinions aside - this has been fixed.

src/utils/encoding/cid_de_cbor.rs Outdated
Co-authored-by: Aatif Syed <38045910+aatifsyed@users.noreply.github.com>
@lemmih
Contributor

lemmih commented Aug 16, 2023

> Ahh I understand now!
>
> The CIDs are copied (which can be done without any heap allocations.) Other data, such as bytes and strings, are not copied.
> We are recursively traversing the DAG_CBOR-encoded blob without parsing everything into a recursive Ipld struct.
> The PR is doing two things: a) parsing an ipld structure, b) extracting CIDs from Link branches. The parsing part is definitely zero-copy. It's not a generalized zero-copy Ipld parser, though.
>
> Zero-copy in my head means holding references into buffers. Copying CIDs out of it is not zero copy ;)

Imagine a zero-copy parser that returns an Ipld structure with references instead of owned data. Then imagine traversing that structure to extract the CIDs. This PR does both at once, with the intermediate structure deforested out.
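
The "deforested" argument can be sketched on a toy input. The snippet below is only an illustration under stated assumptions: it works on a hypothetical pre-tokenised `Event` stream in place of real DAG_CBOR bytes, and `Node`, `parse`, `links_of` and `links_fused` are made-up names. It contrasts the two-pass version (build a borrowed tree, then traverse it) with the fused version that never builds the tree.

```rust
use std::slice::Iter;

/// Hypothetical token stream; `List(n)` means the next `n` values are its items.
#[derive(Debug, Clone, Copy)]
pub enum Event<'a> {
    Int(i64),
    Str(&'a str),
    Link(&'a [u8]),
    List(usize),
}

/// Borrowed tree, as a zero-copy parser might return it.
#[derive(Debug)]
pub enum Node<'a> {
    Int(i64),
    Str(&'a str),
    Link(&'a [u8]),
    List(Vec<Node<'a>>),
}

/// Pass 1 of the two-pass approach: materialise the whole tree.
pub fn parse<'a>(events: &mut Iter<'_, Event<'a>>) -> Option<Node<'a>> {
    Some(match *events.next()? {
        Event::Int(i) => Node::Int(i),
        Event::Str(s) => Node::Str(s),
        Event::Link(l) => Node::Link(l),
        Event::List(n) => {
            let mut items = Vec::new();
            for _ in 0..n {
                items.push(parse(events)?);
            }
            Node::List(items)
        }
    })
}

/// Pass 2 of the two-pass approach: walk the tree and collect the links.
pub fn links_of<'a>(node: &Node<'a>, out: &mut Vec<&'a [u8]>) {
    match node {
        Node::Link(l) => out.push(*l),
        Node::List(items) => {
            for item in items {
                links_of(item, out);
            }
        }
        Node::Int(_) | Node::Str(_) => {}
    }
}

/// Fused version: same result, but no `Node` is ever allocated.
pub fn links_fused<'a>(events: &mut Iter<'_, Event<'a>>, out: &mut Vec<&'a [u8]>) -> Option<()> {
    match *events.next()? {
        Event::Link(l) => out.push(l),
        Event::List(n) => {
            for _ in 0..n {
                links_fused(events, out)?;
            }
        }
        Event::Int(_) | Event::Str(_) => {}
    }
    Some(())
}
```

Both versions see the same borrowed link slices. The fused one simply skips the per-list `Vec<Node>` allocations, which is where a saving of the kind discussed in this thread would be expected to come from.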

CHANGELOG.md Outdated
Comment on lines 37 to 38
- [#3379](https://github.com/ChainSafe/forest/pull/3379): Zero-copy DAG_CBOR
into Vec<Cid> parser.
Contributor

This is not a user-visible addition. I don't think it should be in the CHANGELOG. If we want a CHANGELOG entry, it should be under Changed, and it should say the performance was improved when walking the state graph.

Contributor Author

Sounds good.

/// [`CidVec`] allows for efficient zero-copy de-serialization of `DAG_CBOR`-encoded nodes into a
/// vector of [`Cid`].
#[derive(Default)]
pub struct CidVec(Vec<Cid>);
Contributor

I don't really like this name - it's too general - it was the original name for what is now FrozenCids, for example.

Can you think of a better name, or maybe a different API? pub fn filter_cids(_: ???)

Contributor Author

Yeah, it's a good point. My idea behind this name is exactly that - this represents a Vec<Cid>. If we had a better way of dealing with ser/de without a wrapper - this wouldn't be needed.

Perhaps a good middle ground is to do something like: pub fn extract_cids(cbor_blob: &[u8]) -> Vec<Cid>.

Contributor Author

It's done.

Member

@LesnyRumcajs LesnyRumcajs left a comment

Is this feature covered by unit tests?

@ruseinov
Contributor Author

> Is this feature covered by unit tests?

Good point, it actually isn't.

@ruseinov ruseinov marked this pull request as draft August 17, 2023 14:02
@ruseinov ruseinov marked this pull request as ready for review August 17, 2023 14:16
Comment on lines 218 to 221
// Cleaning up Integer and Float in order to avoid parser mistakes that result
// in tag detection and a subsequent Cid decoding failure.
// See https://github.com/ipld/serde_ipld_dagcbor/blob/master/src/de.rs#L178 and
// https://github.com/ipld/serde_ipld_dagcbor/blob/master/src/de.rs#L119 .
Member

This could use more explanation; for someone fairly fresh to this subject, it's a bit shady to replace all numeric occurrences with zeroes.

Also, let's use permalinks, otherwise those links will get quickly outdated and point to random pieces of code (or 404).

Contributor Author

Sounds good.

Contributor Author

done

@ruseinov
Contributor Author

Follow-up: create an issue to tackle roundtrip ser/de issues mentioned here:

// Cleaning up Integer and Float in order to avoid parser mistakes that result
// in tag detection and a subsequent Cid decoding failure.
// Otherwise the `serde_ipld_dagcbor` library incorrectly treats some of those
// values as [`cbor4ii::core::major::TAG`] and tries to deserialize a [`Cid`]
// from it.
// See https://github.com/ipld/serde_ipld_dagcbor/blob/37f6f00408331b76c6dac8ec4dc08a85d7764cec/src/de.rs#L178 and
// https://github.com/ipld/serde_ipld_dagcbor/blob/37f6f00408331b76c6dac8ec4dc08a85d7764cec/src/de.rs#L119.
//
// Note that we don't actually care about what integer or float contain for
// these tests, because our deserializer ignores those as it only cares about
// maps, lists and [`Cid`]s.

Member

@LesnyRumcajs LesnyRumcajs left a comment

rock-solid

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 17, 2023
@ruseinov ruseinov added this pull request to the merge queue Aug 17, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 17, 2023
@ruseinov ruseinov added this pull request to the merge queue Aug 17, 2023
Merged via the queue into main with commit 7e2f19a Aug 18, 2023
20 checks passed
@ruseinov ruseinov deleted the ru/feature/streaming-inspect-bench branch August 18, 2023 00:00
4 participants