Write our own serde format #340

Keats · 2018-09-20T18:57:09Z

serde-json was chosen as the simplest format I could think of but it would be interesting to see what writing our own format would do. We only use serde-json to have easy serialization of user data in the context in Value nodes, not really for anything else.

Advantages:

we can add built-in support for dates
separation of int and float types, right now it's just Number
no parsing from string, so size will be smaller than serde-json
probably quite a bit faster
no need for users to have serde-json, all you need would be in Tera
only need to impl serialization I think

Cons:

more code
might not be that much faster but it's pretty unlikely
would need to impl something equivalent to the JSON pointers
a big breaking change

If anyone is interested in picking this up, please do! I won't have time to touch that for quite some time.


Keats · 2018-09-20T19:15:11Z

@dtolnay hopefully it's OK to ping you

Before I or anyone start working on it, do you think it makes sense or am I overestimating potential perf gains?

Tera uses the Serialize trait in https://docs.rs/tera/0.11.16/tera/struct.Context.html#method.insert and serde_json::{to_value, Value}; a bit everywhere while rendering.

dtolnay · 2018-09-20T20:20:28Z

I would guess roughly 3x performance improvement from serializing directly rather than passing everything through serde_json::Value.

Keats · 2018-09-20T21:16:19Z

Sounds like a nice win then!

Keats · 2018-09-21T06:40:16Z

And thanks for the comment!

Keats · 2018-09-21T14:36:26Z

Took the code from serde_json that seems needed to serialize to Value for Tera: https://github.com/Keats/serde-tera (minus dates)

I believe Value::String could be a Cow but I'm not sure where the rest of the perf gains would be?
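To make the `Cow` idea concrete, here is a minimal std-only sketch. The `Value` enum below is hypothetical (it is not the serde-tera or serde_json type), and the variant and method names are assumptions chosen for illustration. It also keeps integers and floats in separate variants, as suggested in the first comment.

```rust
use std::borrow::Cow;

/// Hypothetical value type; NOT the actual serde-tera or serde_json definition.
#[derive(Debug, Clone, PartialEq)]
pub enum Value<'a> {
    Null,
    Bool(bool),
    Int(i64),   // integers and floats are kept apart instead of one Number
    Float(f64),
    String(Cow<'a, str>),
    Array(Vec<Value<'a>>),
}

impl<'a> Value<'a> {
    /// Wraps a string slice without copying it.
    pub fn borrowed(s: &'a str) -> Self {
        Value::String(Cow::Borrowed(s))
    }

    /// Takes ownership of an already allocated string.
    pub fn owned(s: String) -> Self {
        Value::String(Cow::Owned(s))
    }

    /// True when the string variant holds a borrow rather than its own allocation.
    pub fn is_borrowed(&self) -> bool {
        matches!(self, Value::String(Cow::Borrowed(_)))
    }

    /// Returns the string contents for the `String` variant, `None` otherwise.
    pub fn as_str(&self) -> Option<&str> {
        match self {
            Value::String(s) => Some(s.as_ref()),
            _ => None,
        }
    }
}
```

With this shape, a context built from data that outlives the render could use `Value::borrowed` and skip the string copy, while computed strings would go through `Value::owned`. Whether that saves much in practice would need measuring.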

dtolnay · 2018-09-21T16:01:44Z

If you don't use serde_json::to_value but you copy all the code for serde_json::to_value out of serde_json and use your copy of it, that isn't going to be a performance improvement. 😜

I haven't looked at how the implementation currently works but in the case of:

{% for user in users %}
  <li><a href="{{ user.url }}">{{ user.username }}</a></li>
{% endfor %}

what I would expect for serializing directly with no Value involved would be a Serializer that handles this loop inside of Serializer::serialize_seq.

struct LoopSerializer<W> {
    out: W,
    body: /* some representation of the loop body */,
}

impl<W: io::Write> serde::Serializer for LoopSerializer<W> {
    /* ... */

    fn serialize_seq(self, _len: Option<usize>) -> Result<Self::SerializeSeq, Self::Error> {
        Ok(self)
    }
}

impl<W: io::Write> serde::ser::SerializeSeq for LoopSerializer<W> {
    /* ... */

    fn serialize_element<T>(&mut self, value: &T) -> Result<(), Self::Error>
    where
        T: ?Sized + Serialize,
    {
        /* render `value` according to `self.body` and write to `self.out` */
    }
}

Keats · 2018-09-21T18:05:02Z

If you don't use serde_json::to_value but you copy all the code for serde_json::to_value out of serde_json and use your copy of it, that isn't going to be a performance improvement

I was mostly curious to see how creating a format looks in practice as I will probably write one for https://github.com/Keats/scl and wanted to see if Cow for Value::String would be possible and GoodEnough ™️
I didn't expect the code to be so short, serde is amazing!

what I would expect for serializing directly with no Value involved would be a Serializer that handles this loop inside of Serializer::serialize_seq.

Whoa that's interesting, I didn't think of it that way. I'm not sure how that would work in practice since we currently use json pointers and a few other Value things but that's a problem for future me or (hopefully) someone smarter than me.

Evrey · 2018-11-19T18:02:44Z

If you have ideas on how to improve on that side [serde performance, kill JSON], I am very happy to hear them!

src

Have you guys considered just writing schemas for Cap'n Proto or FlatBuffers? No copying, no parsing, just dumping a binary blob and applying an initial bounds check on the contents. You do already know the data structures you wish to exchange, after all.

Those are the two fastest and still very robust serialisation formats I know of. Last time I checked, the Cap'n Proto crates are more mature and FlatBuffers lacks the bounds checking code. On the other hand, if you have the patience for the bounds checking code to land, then note that FlatBuffers has the simpler and more compact wire format. Neither of them supports #![no_std], however, although they technically could.

As a streamable format, something like MessagePack could do the job. It is much more compact and very similar to JSON, has extension type support for low level custom data types, but it is much more complicated to parse than the two above-mentioned formats.

Refs:

Cap'n Proto Runtime Crate
Cap'n Proto Compiletime Crate
FlatBuffers Runtime Crate
FlatBuffers Compiletime thing is the official binary.
MessagePack, addons like serde through other crates prefixed rmp_

Direct comparison against the pros/cons:

we can add built-in support for dates

In Cap'n Proto and FlatBuffers you'd just define your run-of-the-mill time struct with seconds and nanoseconds. MessagePack recently-ish standardised timestamp extensions of varying precision.

separation of int and float types, right now it's just Number

All three formats do this. Also integer and floating-point types of different sizes/precision where needed.

no parsing from string, so size will be smaller than serde-json

All three formats are binary. MessagePack interleaves the data with type metadata, i.e. it is a self-describing format. Cap'n Proto and FlatBuffers are statically typed, zero copy, and therefore require offset bounds checking to be safe. Can't get faster than that.

probably quite a bit faster

Cap'n Proto and FlatBuffers are both insanely fast and still very robust. MessagePack is still much faster than text formats, but it is noticeably slower than the other two formats.

no need for users to have serde-json, all you need would be in Tera

All three require runtime crates. However, the runtime code size for both Cap'n Proto and FlatBuffers is very small. And if you want to be really strict about foreign code: MessagePack and FlatBuffers are very easy to implement.

more code

Again, FlatBuffers has a very small runtime code footprint. I think Cap'n Proto as well.

might not be that much faster but it's pretty unlikely

Depends very much on the amount of data exchanged.

would need to impl something equivalent to the JSON pointers

Dunno about that, never heard JSON and pointers in the same sentence. Are they like YAML pointers?

a big breaking change

Better early than late. =)

Keats · 2018-11-19T19:09:18Z

Dunno about that, never heard JSON and pointers in the same sentence. Are they like YAML pointers?

I don't know YAML pointers x) But JSON pointers are something to access data. For example if we have this JSON:

{
  "foo": 1,
  "bar": { "baz": 2 },
  "qux": [3, 4, 5]
}

We can get the first value of the qux array with /qux/0 or the baz property with /bar/baz. This is what Tera uses to grab data from the context and shouldn't be too hard to implement as long as you get some kind of Value.
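For a feel of what "something equivalent to the JSON pointers" might involve, here is a small std-only sketch of the lookup. The `Value` enum and the `example` helper are hypothetical stand-ins, not Tera's or serde_json's types (serde_json offers this lookup as `Value::pointer`). The `~0` and `~1` escapes from RFC 6901 are left out to keep it short.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON-like value; hypothetical, not serde_json::Value.
#[derive(Debug, PartialEq)]
pub enum Value {
    Int(i64),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

impl Value {
    /// Resolves a pointer such as "/qux/0". An empty pointer returns `self`.
    /// Escape sequences (`~0`, `~1`) are not handled in this sketch.
    pub fn pointer(&self, pointer: &str) -> Option<&Value> {
        if pointer.is_empty() {
            return Some(self);
        }
        if !pointer.starts_with('/') {
            return None;
        }
        let mut current = self;
        for token in pointer[1..].split('/') {
            current = match current {
                // Objects are indexed by key, arrays by decimal position.
                Value::Object(map) => map.get(token)?,
                Value::Array(items) => items.get(token.parse::<usize>().ok()?)?,
                // A leaf has nothing to descend into.
                Value::Int(_) => return None,
            };
        }
        Some(current)
    }
}

/// Builds the example document from the comment above.
pub fn example() -> Value {
    let mut bar = BTreeMap::new();
    bar.insert("baz".to_string(), Value::Int(2));
    let mut root = BTreeMap::new();
    root.insert("foo".to_string(), Value::Int(1));
    root.insert("bar".to_string(), Value::Object(bar));
    root.insert(
        "qux".to_string(),
        Value::Array(vec![Value::Int(3), Value::Int(4), Value::Int(5)]),
    );
    Value::Object(root)
}
```

Calling `example().pointer("/qux/0")` walks one object key and one array index, and any missing key or out-of-range index simply yields `None`.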

Part of the issue is that I don't want people to create .protobuf or something related as sometimes the context is very dynamic. For example in https://github.com/Keats/kickstart the context is defined in a .toml file and the user might not have Rust installed at all. Same for https://github.com/getzola/zola

MessagePack does look interesting though, is there some benchmark between its serde implementation and serde-json? From what I remember, MessagePack is not much more compact than JSON

Evrey · 2018-11-19T19:23:26Z

This is what Tera uses to grab data from the context

Ah, so basically just some path syntax to walk a JSON data structure.

Part of the issue is that I don't want people to create .protobuf or something related as sometimes the context is very dynamic.

In that case definitely prefer MessagePack, UBJSON, BSON, and what else they are called. That's where self-describing formats shine.

is there some benchmark between its serde implementation and serde-json?

Not that I know. But tera would be a nice field test for making one.

From what I remember, MessagePack is not much more compact than JSON

That depends a lot on the kind of data moved around. If a lot of it consists of string identifiers, then the lower memory usage becomes unnoticeable. One could fix that by storing and sending 32-bit FNV1a hashes of identifiers instead.

Edit: Still, even if MessagePack would not be much smaller compared to JSON when used in tera, parsing and serialising is still potentially much faster. Strings are length-delimited and have no escape sequences, floats are stored using their very bits, etc.
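To illustrate the points about length-delimited strings, raw float bits, and hashed identifiers, here is a hand-rolled std-only sketch. It is not the rmp crate's API. The function names are made up for this example, and only the two shortest MessagePack string headers are covered.

```rust
/// Encodes a string the MessagePack way: a length header, then the raw UTF-8 bytes.
/// Only `fixstr` (up to 31 bytes) and `str 8` (up to 255 bytes) are handled here.
pub fn msgpack_str(s: &str) -> Option<Vec<u8>> {
    let len = s.len();
    let mut out = Vec::with_capacity(len + 2);
    if len <= 31 {
        out.push(0xa0 | len as u8); // fixstr: marker and length share one byte
    } else if len <= 255 {
        out.push(0xd9); // str 8 marker
        out.push(len as u8);
    } else {
        return None; // str 16 and str 32 are omitted from this sketch
    }
    // No escaping pass: the bytes are copied as they are.
    out.extend_from_slice(s.as_bytes());
    Some(out)
}

/// Encodes an f64 as MessagePack `float 64`: marker 0xcb plus the eight
/// IEEE 754 bytes in big-endian order, with no decimal formatting involved.
pub fn msgpack_f64(x: f64) -> [u8; 9] {
    let mut out = [0xcb; 9];
    out[1..].copy_from_slice(&x.to_be_bytes());
    out
}

/// 32-bit FNV-1a hash, as one way to replace repeated string identifiers
/// with fixed-size keys.
pub fn fnv1a_32(data: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for &byte in data {
        hash ^= u32::from(byte);
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}
```

A string containing a quote, such as `a"b`, takes 4 bytes here (1 header byte plus 3 payload bytes), while JSON needs 6 characters once the quote is escaped and the delimiters are added. Hashed identifiers trade readability and collision safety for size, so they would need care in a real format.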

andy128k · 2018-11-19T19:29:21Z

Does Tera really need a format? It basically consumes data. Why not introduce a trait and allow users to implement it?

pub trait TeraValue {
    fn as_str(&self) -> Option<&str>;
    fn as_int(&self) -> Option<i64>;
    fn as_uint(&self) -> Option<u64>;
    fn as_float(&self) -> Option<f64>;
    fn get(&self, index: usize) -> Option<&dyn TeraValue>;
    fn get_prop(&self, prop: &str) -> Option<&dyn TeraValue>;
    // ...
}

What am I missing?
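As a rough sketch of how such a trait could be used, the block below implements a reduced copy of it for a user-defined struct and walks a dotted path through it. The default method bodies, the `User` struct, and the `lookup` helper are assumptions added for illustration. Nothing here is Tera's actual API.

```rust
/// Reduced copy of the proposed trait. The default bodies returning `None`
/// are an assumption so that each type only overrides what it supports.
pub trait TeraValue {
    fn as_str(&self) -> Option<&str> { None }
    fn as_int(&self) -> Option<i64> { None }
    fn get(&self, _index: usize) -> Option<&dyn TeraValue> { None }
    fn get_prop(&self, _prop: &str) -> Option<&dyn TeraValue> { None }
}

impl TeraValue for String {
    fn as_str(&self) -> Option<&str> { Some(&self[..]) }
}

impl TeraValue for i64 {
    fn as_int(&self) -> Option<i64> { Some(*self) }
}

impl<T: TeraValue> TeraValue for Vec<T> {
    fn get(&self, index: usize) -> Option<&dyn TeraValue> {
        // Go through the slice so this does not call the trait method recursively.
        self.as_slice().get(index).map(|item| item as &dyn TeraValue)
    }
}

/// Hypothetical user type exposed to templates without any conversion step.
pub struct User {
    pub username: String,
    pub age: i64,
    pub tags: Vec<String>,
}

impl TeraValue for User {
    fn get_prop(&self, prop: &str) -> Option<&dyn TeraValue> {
        match prop {
            "username" => Some(&self.username as &dyn TeraValue),
            "age" => Some(&self.age as &dyn TeraValue),
            "tags" => Some(&self.tags as &dyn TeraValue),
            _ => None,
        }
    }
}

/// Walks a dotted path such as "tags.0"; numeric segments are treated as indices.
pub fn lookup<'a>(root: &'a dyn TeraValue, path: &str) -> Option<&'a dyn TeraValue> {
    let mut current = root;
    for segment in path.split('.') {
        current = match segment.parse::<usize>() {
            Ok(index) => current.get(index)?,
            Err(_) => current.get_prop(segment)?,
        };
    }
    Some(current)
}
```

The `get_prop` match is the part a user would have to write by hand for every struct unless a derive macro generated it.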

Keats · 2018-11-19T21:35:18Z

How does it work when the person defining the schema doesn't have Rust on the machine? Like in Zola or kickstart? The automatic Serde serialization makes it very good from a UX point of view for users, compare with the example in https://github.com/cobalt-org/liquid-rust#usage which would be really really tiring when you have dozens of fields

Keats · 2018-11-20T23:42:38Z

(Keep in mind I might be wrong, if you think this can be done without degrading UX please try!)

epage · 2018-12-07T16:32:01Z

Got curious to see your issues :)

The automatic Serde serialization makes it very good from a UX point of view for users, compare with the example in https://github.com/cobalt-org/liquid-rust#usage which would be really really tiring when you have dozens of fields

Yes, if you have data already in a serde struct, then that is easiest. Some things I have done to improve usability:

A conversion from serde to liquid Value function
A value literal macro

Taking the data by-reference means clients can control how the data is created and avoid a conversion cost during render.

In addition, I've been modifying liquid to instead accept a trait. I still need to iterate on this design more but for now it helps in the case where the user is composing data from multiple sources (e.g. in cobalt, data that is the same for every page vs per-page), you no longer need to put them all in the same liquid Value (instead a struct of liquid Values)

I'd like to go a step further with the trait and have a completely custom trait for walking the entire data structure so no conversion is needed except when non-leaf nodes are accessed. This would allow the user to better optimize things.

naturallymitchell · 2019-01-14T18:07:05Z

If SCL has built-in support for dates, integers, and floats, and serde-scl could go beyond whatever limits serde-json had in its approach to leveraging serde, then what about using that?

Keats · 2019-12-07T16:24:29Z

I'm closing this issue. It would still be a very welcome feature, but it might not use serde and could be built on some custom traits instead.
A more generic issue is listed above.

Keats added help wanted For next major version labels Sep 20, 2018

Keats mentioned this issue Sep 20, 2018

Review usage of Rayon & improve performance getzola/zola#420

Closed

Keats mentioned this issue Sep 21, 2018

v1 release #331

Closed

16 tasks

Keats mentioned this issue Sep 24, 2018

Streamable templates #211

Closed

Keats mentioned this issue Oct 11, 2018

Issues with iterating over grandchild content getzola/zola#478

Closed

Keats mentioned this issue Nov 23, 2018

Out of memory, process killed getzola/zola#536

Closed

andy128k mentioned this issue Nov 27, 2018

Avoid allocations #364

Closed

Keats mentioned this issue Dec 7, 2019

Avoid the JSON serialization step #469

Open

Keats closed this as completed Dec 7, 2019
