Write our own serde format #340

Open
Keats opened this Issue Sep 20, 2018 · 15 comments

@Keats
Owner

Keats commented Sep 20, 2018

serde-json was chosen as the simplest format I could think of, but it would be interesting to see what writing our own format would do. We only use serde-json for easy serialization of user data in the context into Value nodes, not really for anything else.

Advantages:

  • we can add built-in support for dates
  • separation of int and float types, right now it's just Number
  • no parsing from strings, so the size will be smaller than serde-json
  • probably quite a bit faster
  • no need for users to have serde-json, all you need would be in Tera
  • only need to impl serialization I think

Cons:

  • more code
  • might not be that much faster but it's pretty unlikely
  • would need to impl something equivalent to the JSON pointers
  • a big breaking change
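
To make the first two advantages listed above concrete, here is a minimal sketch of what a Tera-owned value type could look like. The `Value` enum, its variants, the `Date` struct and `type_name` are all hypothetical names chosen for illustration; none of them is existing Tera API.

```rust
use std::collections::BTreeMap;

/// Hypothetical calendar date; a real format would likely carry time and zone too.
#[derive(Debug, Clone, PartialEq)]
pub struct Date {
    pub year: i32,
    pub month: u8,
    pub day: u8,
}

/// Hypothetical Tera-owned value type (an assumption, not current Tera API).
/// Unlike a single `Number` variant, integers and floats are kept apart.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    String(String),
    Date(Date),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

impl Value {
    /// Name of the variant, e.g. for error messages in filters.
    pub fn type_name(&self) -> &'static str {
        match self {
            Value::Null => "null",
            Value::Bool(_) => "bool",
            Value::Int(_) => "int",
            Value::Float(_) => "float",
            Value::String(_) => "string",
            Value::Date(_) => "date",
            Value::Array(_) => "array",
            Value::Object(_) => "object",
        }
    }
}

impl From<i64> for Value {
    fn from(v: i64) -> Self {
        Value::Int(v)
    }
}

impl From<f64> for Value {
    fn from(v: f64) -> Self {
        Value::Float(v)
    }
}
```

With a type like this, a filter can tell `3` from `3.0` and a date from a string without re-parsing text.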

If anyone is interested in picking this up, please do! I won't have time to touch that for quite some time.

@Keats

Owner Author

Keats commented Sep 20, 2018

@dtolnay hopefully it's OK to ping you

Before I or anyone else starts working on it, do you think it makes sense, or am I overestimating the potential perf gains?

Tera uses the Serialize trait in https://docs.rs/tera/0.11.16/tera/struct.Context.html#method.insert, and serde_json::{to_value, Value} is used a bit everywhere while rendering.

@dtolnay


dtolnay commented Sep 20, 2018

I would guess roughly 3x performance improvement from serializing directly rather than passing everything through serde_json::Value.

@Keats

Owner Author

Keats commented Sep 20, 2018

Sounds like a nice win then!

@Keats

Owner Author

Keats commented Sep 21, 2018

And thanks for the comment!

@Keats Keats referenced this issue Sep 21, 2018

Open

v1 release #331

11 of 16 tasks complete
@Keats

Owner Author

Keats commented Sep 21, 2018

Took the code from serde_json that seems needed to serialize to Value for Tera: https://github.com/Keats/serde-tera (minus dates)

I believe Value::String could be a Cow but I'm not sure where the rest of the perf gains would be?
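
A minimal sketch of the Cow idea, using only the standard library. The `Value<'a>` enum and the `upper` and `is_borrowed` helpers are hypothetical and exist only to show where an allocation can be skipped; this is not code from serde-tera.

```rust
use std::borrow::Cow;

/// Hypothetical value type whose string variant can borrow from the caller.
#[derive(Debug, Clone, PartialEq)]
pub enum Value<'a> {
    Int(i64),
    String(Cow<'a, str>),
}

impl<'a> From<&'a str> for Value<'a> {
    /// Borrowing: no allocation and no copy of the text.
    fn from(s: &'a str) -> Self {
        Value::String(Cow::Borrowed(s))
    }
}

impl<'a> From<String> for Value<'a> {
    /// Taking ownership of an existing String: also no copy.
    fn from(s: String) -> Self {
        Value::String(Cow::Owned(s))
    }
}

impl<'a> Value<'a> {
    /// True when the string data still points into the caller's memory.
    pub fn is_borrowed(&self) -> bool {
        matches!(self, Value::String(Cow::Borrowed(_)))
    }

    /// A filter in the style of `upper`: it allocates only when the text
    /// contains a lowercase character, otherwise the value is passed through.
    pub fn upper(self) -> Value<'a> {
        match self {
            Value::String(s) if s.chars().any(|c| c.is_lowercase()) => {
                Value::String(Cow::Owned(s.to_uppercase()))
            }
            other => other,
        }
    }
}
```

The gain is limited to strings that are read but never modified during rendering, which would be consistent with the question of where the rest of the speedup would come from.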

@dtolnay


dtolnay commented Sep 21, 2018

If you don't use serde_json::to_value but you copy all the code for serde_json::to_value out of serde_json and use your copy of it, that isn't going to be a performance improvement. 😜

I haven't looked at how the implementation currently works but in the case of:

{% for user in users %}
  <li><a href="{{ user.url }}">{{ user.username }}</a></li>
{% endfor %}

what I would expect for serializing directly with no Value involved would be a Serializer that handles this loop inside of Serializer::serialize_seq.

struct LoopSerializer<W> {
    out: W,
    body: /* some representation of the loop body */,
}

impl<W: io::Write> serde::Serializer for LoopSerializer<W> {
    /* ... */

    fn serialize_seq(self, _len: Option<usize>) -> Result<Self::SerializeSeq, Self::Error> {
        Ok(self)
    }
}

impl<W: io::Write> serde::ser::SerializeSeq for LoopSerializer<W> {
    /* ... */

    fn serialize_element<T>(&mut self, value: &T) -> Result<(), Self::Error>
    where
        T: ?Sized + Serialize,
    {
        /* render `value` according to `self.body` and write to `self.out` */
    }
}
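
For readers who want something that compiles without serde, the same idea can be sketched with a std-only stand-in. This is not a serde Serializer: the `WriteField` trait, the `Piece` enum and `render_loop` are hypothetical names, and HTML escaping is left out. The point is only that each element is written straight to the output, with no intermediate Value tree.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for `serde::Serialize`: a type writes one named field
/// straight into the output, so no intermediate value is allocated.
/// Returns Ok(false) when the field does not exist.
pub trait WriteField {
    fn write_field(&self, name: &str, out: &mut dyn Write) -> io::Result<bool>;
}

/// One piece of a pre-parsed loop body: literal text or a `{{ user.<field> }}` lookup.
pub enum Piece<'t> {
    Text(&'t str),
    Field(&'t str),
}

/// Renders `body` once per item; this is the work `serialize_element` would do.
pub fn render_loop<T: WriteField>(
    items: &[T],
    body: &[Piece<'_>],
    out: &mut dyn Write,
) -> io::Result<()> {
    for item in items {
        for piece in body {
            match piece {
                Piece::Text(text) => out.write_all(text.as_bytes())?,
                Piece::Field(name) => {
                    // Unknown fields are silently skipped in this sketch.
                    item.write_field(name, &mut *out)?;
                }
            }
        }
    }
    Ok(())
}

pub struct User {
    pub url: String,
    pub username: String,
}

impl WriteField for User {
    fn write_field(&self, name: &str, out: &mut dyn Write) -> io::Result<bool> {
        let value = match name {
            "url" => &self.url,
            "username" => &self.username,
            _ => return Ok(false),
        };
        out.write_all(value.as_bytes())?;
        Ok(true)
    }
}
```

In a serde-based version, the derive macro would play the role of the hand-written `WriteField` impl.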
@Keats

Owner Author

Keats commented Sep 21, 2018

If you don't use serde_json::to_value but you copy all the code for serde_json::to_value out of serde_json and use your copy of it, that isn't going to be a performance improvement

I was mostly curious to see how creating a format looks in practice as I will probably write one for https://github.com/Keats/scl and wanted to see if Cow for Value::String would be possible and GoodEnough ™️
I didn't expect the code to be so short, serde is amazing!

what I would expect for serializing directly with no Value involved would be a Serializer that handles this loop inside of Serializer::serialize_seq.

Whoa that's interesting, I didn't think of it that way. I'm not sure how that would work in practice since we currently use json pointers and a few other Value things but that's a problem for future me or (hopefully) someone smarter than me.

@Evrey


Evrey commented Nov 19, 2018

If you have ideas on how to improve on that side [serde performance, kill JSON], I am very happy to hear them!

Have you guys considered just writing schemas for Cap'n Proto or FlatBuffers? No copying, no parsing, just dumping a binary blob and applying an initial bounds check on the contents. You do already know the data structures you wish to exchange, after all.

Those are the two fastest and still very robust serialisation formats I know of. Last time I checked, the Cap'n Proto crates were more mature and the FlatBuffers crate lacked the bounds checking code. On the other hand, if you have the patience to wait for the bounds checking code to land, note that FlatBuffers has the simpler and more compact wire format. Neither of them supports #![no_std], however, although technically they could.

As a streamable format, something like MessagePack could do the job. It is much more compact and very similar to JSON, has extension type support for low level custom data types, but it is much more complicated to parse than the two above-mentioned formats.

Refs:

Direct comparison against the pros/cons:

we can add built-in support for dates

In Cap'n Proto and FlatBuffers you'd just define your run-of-the-mill (08/15) time struct with seconds and nanoseconds. MessagePack recently-ish standardised timestamp extensions of varying precision.

separation of int and float types, right now it's just Number

All three formats do this. Also integer and floating-point types of different sizes/precision where needed.

no parsing from strings, so the size will be smaller than serde-json

All three formats are binary. MessagePack interleaves the data with type metadata, i.e. it is a self-describing format. Cap'n Proto and FlatBuffers are statically typed and zero-copy, and therefore require offset bounds checking to be safe. Can't get faster than that.

probably quite a bit faster

Cap'n Proto and FlatBuffers are both insanely fast and still very robust. MessagePack is still much faster than text formats, but it is noticeably slower than the other two formats.

no need for users to have serde-json, all you need would be in Tera

All three require runtime crates. However, the runtime code size for both Cap'n Proto and FlatBuffers is very small. And if you want to be really strict about foreign code: MessagePack and FlatBuffers are very easy to implement.

more code

Again, FlatBuffers has a very small runtime code footprint. I think Cap'n Proto as well.

might not be that much faster but it's pretty unlikely

Depends very much on the amount of data exchanged.

would need to impl something equivalent to the JSON pointers

Dunno about that, never heard JSON and pointers in the same sentence. Are they like YAML pointers?

a big breaking change

Better early than late. =)

@Keats

Owner Author

Keats commented Nov 19, 2018

Dunno about that, never heard JSON and pointers in the same sentence. Are they like YAML pointers?

I don't know YAML pointers x) But JSON pointers are something to access data. For example if we have this JSON:

{
  "foo": 1,
  "bar": { "baz": 2 },
  "qux": [3, 4, 5]
}

We can get the first value of the qux array with /qux/0 or the baz property with /bar/baz. This is what Tera uses to grab data from the context and shouldn't be too hard to implement as long as you get some kind of Value.
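
A minimal sketch of such a lookup over a custom value type, assuming a hypothetical `Value` enum with only the variants needed for the example above. serde_json offers the same operation as `Value::pointer`; the version here is a simplified stand-in that skips the `~0` and `~1` escapes defined in RFC 6901.

```rust
use std::collections::BTreeMap;

/// Hypothetical value type, reduced to what the example needs.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

impl Value {
    /// Follows a `/`-separated path such as `/qux/0`.
    /// The empty pointer refers to the whole value.
    pub fn pointer(&self, pointer: &str) -> Option<&Value> {
        if pointer.is_empty() {
            return Some(self);
        }
        if !pointer.starts_with('/') {
            return None;
        }
        let mut current = self;
        for token in pointer[1..].split('/') {
            current = match current {
                // Objects are indexed by key.
                Value::Object(map) => map.get(token)?,
                // Arrays are indexed by position.
                Value::Array(items) => items.get(token.parse::<usize>().ok()?)?,
                // Scalars have no children.
                _ => return None,
            };
        }
        Some(current)
    }
}
```

Any tree-shaped value type can support this kind of lookup, so the pointer syntax does not tie Tera to JSON itself.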

Part of the issue is that I don't want people to have to create a .protobuf or something related, as sometimes the context is very dynamic. For example in https://github.com/Keats/kickstart the context is defined in a .toml file and the user might not have Rust installed at all. Same for https://github.com/getzola/zola

MessagePack does look interesting though, is there some benchmark between its serde implementation and serde-json? From what I remember, MessagePack is not much more compact than JSON

@Evrey


Evrey commented Nov 19, 2018

This is what Tera uses to grab data from the context

Ah, so basically just some path syntax to walk a JSON data structure.

Part of the issue is that I don't want people to have to create a .protobuf or something related, as sometimes the context is very dynamic.

In that case definitely prefer MessagePack, UBJSON, BSON, and whatever else they are called. That's where self-describing formats shine.

is there some benchmark between its serde implementation and serde-json?

Not that I know. But tera would be a nice field test for making one.

From what I remember, MessagePack is not much more compact than JSON

That depends a lot on the kind of data moved around. If a lot of it consists of string identifiers, then the lower memory usage becomes unnoticeable. One could fix that by storing and sending 32-bit FNV-1a hashes of identifiers instead.
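
For reference, 32-bit FNV-1a is only a few lines. The function name `fnv1a_32` is made up for this sketch; the offset basis and prime are the standard FNV constants. Note that a 32-bit hash can collide, so two different identifiers may map to the same value.

```rust
/// 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
pub fn fnv1a_32(bytes: &[u8]) -> u32 {
    const OFFSET_BASIS: u32 = 0x811c_9dc5;
    const PRIME: u32 = 0x0100_0193;
    bytes
        .iter()
        .fold(OFFSET_BASIS, |hash, &b| (hash ^ u32::from(b)).wrapping_mul(PRIME))
}
```

An identifier such as `username` would then travel as 4 bytes instead of 8 plus framing.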

Edit: Still, even if MessagePack were not much smaller than JSON when used in Tera, parsing and serialising would still potentially be much faster. Strings are length-delimited and have no escape sequences, floats are stored using their very bits, etc.
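
A small sketch of what those two properties look like on the wire. The helper names `write_str` and `write_f64` are hypothetical and this is not the API of any MessagePack crate; the marker bytes follow the MessagePack specification (fixstr, str 8, str 16, str 32 and float 64).

```rust
/// Appends `s` as a MessagePack str: a length header followed by the raw
/// UTF-8 bytes. Nothing is escaped, so a reader never scans for quotes.
pub fn write_str(out: &mut Vec<u8>, s: &str) {
    let len = s.len();
    if len < 32 {
        out.push(0xa0 | len as u8); // fixstr: length lives in the marker byte
    } else if len <= u8::MAX as usize {
        out.extend_from_slice(&[0xd9, len as u8]); // str 8
    } else if len <= u16::MAX as usize {
        out.push(0xda); // str 16
        out.extend_from_slice(&(len as u16).to_be_bytes());
    } else {
        // str 32; this sketch assumes the length fits in a u32.
        out.push(0xdb);
        out.extend_from_slice(&(len as u32).to_be_bytes());
    }
    out.extend_from_slice(s.as_bytes());
}

/// Appends `v` as a MessagePack float 64: marker plus the 8 IEEE 754 bytes,
/// big-endian. No decimal formatting or parsing is involved.
pub fn write_f64(out: &mut Vec<u8>, v: f64) {
    out.push(0xcb);
    out.extend_from_slice(&v.to_be_bytes());
}
```

A string containing a double quote is written byte for byte, where JSON would need a backslash escape.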

@andy128k

Contributor

andy128k commented Nov 19, 2018

Does Tera really need a format? It basically consumes data. Why not introduce a trait and allow users to implement it?

pub trait TeraValue {
    fn as_str(&self) -> Option<&str>;
    fn as_int(&self) -> Option<i64>;
    fn as_uint(&self) -> Option<u64>;
    fn as_float(&self) -> Option<f64>;
    fn get(&self, index: usize) -> Option<&dyn TeraValue>;
    fn get_prop(&self, prop: &str) -> Option<&dyn TeraValue>;
    // ...
}
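
A sketch of how implementing and consuming such a trait might look. It restates the trait with default method bodies returning None, which is an assumption made here to keep the impls short; the `User` struct and the `lookup` helper are likewise hypothetical.

```rust
/// The trait from above, with default bodies added (an assumption of this sketch).
pub trait TeraValue {
    fn as_str(&self) -> Option<&str> { None }
    fn as_int(&self) -> Option<i64> { None }
    fn as_uint(&self) -> Option<u64> { None }
    fn as_float(&self) -> Option<f64> { None }
    fn get(&self, _index: usize) -> Option<&dyn TeraValue> { None }
    fn get_prop(&self, _prop: &str) -> Option<&dyn TeraValue> { None }
}

impl TeraValue for String {
    fn as_str(&self) -> Option<&str> { Some(&self[..]) }
}

impl TeraValue for i64 {
    fn as_int(&self) -> Option<i64> { Some(*self) }
}

impl<T: TeraValue> TeraValue for Vec<T> {
    fn get(&self, index: usize) -> Option<&dyn TeraValue> {
        // Go through the slice so this does not call itself recursively.
        self.as_slice().get(index).map(|item| item as &dyn TeraValue)
    }
}

/// Hypothetical user type; each field is exposed by name, by hand.
pub struct User {
    pub username: String,
    pub age: i64,
    pub tags: Vec<String>,
}

impl TeraValue for User {
    fn get_prop(&self, prop: &str) -> Option<&dyn TeraValue> {
        match prop {
            "username" => Some(&self.username as &dyn TeraValue),
            "age" => Some(&self.age as &dyn TeraValue),
            "tags" => Some(&self.tags as &dyn TeraValue),
            _ => None,
        }
    }
}

/// Walks a dotted path such as `tags.0`; numeric segments are treated as indices.
pub fn lookup<'a>(root: &'a dyn TeraValue, path: &str) -> Option<&'a dyn TeraValue> {
    let mut current = root;
    for segment in path.split('.') {
        current = match segment.parse::<usize>() {
            Ok(index) => current.get(index)?,
            Err(_) => current.get_prop(segment)?,
        };
    }
    Some(current)
}
```

The `get_prop` match has to be written, or generated by a derive macro, for every user type.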

What am I missing?

@Keats

Owner Author

Keats commented Nov 19, 2018

How does it work when the person defining the schema doesn't have Rust on the machine? Like in Zola or kickstart? The automatic Serde serialization makes it very good from a UX point of view for users, compare with the example in https://github.com/cobalt-org/liquid-rust#usage which would be really really tiring when you have dozens of fields

@Keats

Owner Author

Keats commented Nov 20, 2018

(Keep in mind I might be wrong, if you think this can be done without degrading UX please try!)

@epage


epage commented Dec 7, 2018

Got curious to see your issues :)

The automatic Serde serialization makes it very good from a UX point of view for users, compare with the example in https://github.com/cobalt-org/liquid-rust#usage which would be really really tiring when you have dozens of fields

Yes, if you have data already in a serde struct, then that is easiest. Some things I have done to improve usability:

  • A conversion from serde to liquid Value function
  • A value literal macro

Taking the data by-reference means clients can control how the data is created and avoid a conversion cost during render.

In addition, I've been modifying liquid to accept a trait instead. I still need to iterate on this design, but for now it helps when the user is composing data from multiple sources (e.g. in cobalt, data that is the same for every page vs. per-page data): you no longer need to put them all in the same liquid Value and can use a struct of liquid Values instead.

I'd like to go a step further with the trait and have a completely custom trait for walking the entire data structure so no conversion is needed except when non-leaf nodes are accessed. This would allow the user to better optimize things.

@mitchtbaum


mitchtbaum commented Jan 14, 2019

If SCL has built-in support for dates, integers, and floats, and serde-scl could go beyond whatever limits serde-json had in its approach to leveraging serde, then what about using that?
