Add high-level Msgpack deserialization API #671

jarmuszz · 2025-04-16T20:47:09Z

This PR adds the high-level MessagePack deserialization API.

The high-level API is split into two modules:

high which allows for deserialization into scala's native data-types.
high.ast which deserializes data into wrapper objects, similarly to how cbor.high works.

The high module performs better than high.ast in terms of speed but is far less flexible, e.g. it currently lacks support for heterogeneous data.

The core of the implementation resides in the MsgpackDeserializer[A] typeclass which represents a process of extracting values of type A from a MsgpackItem stream. Deserializers can be ran via API endpoints specified in the package file. Instances of MsgpackDeserializer for common built-in types are available via the high.* import. The high.ast module exports an instance of the typeclass for MsgpackValue and some methods to reduce boilerplate.

MsgpackDeserializer is also monadic:

case class Record(id: Long, description: String, created: Instant)

implicit val recordDeserializer: MsgpackDeserializer[Record] =
  for {
    id      <- deserializer[Long]
    desc    <- deserializer[String]
    created <- deserializer[Instant]
  } yield Record(id, desc, created)

val stream: Stream[IO, Byte] = ???
val records: IO[List[Records]] =
  stream
    .through(deserialize[Record])
    .compile
    .toList

I have also decided to ditch the old benchmark data because it consisted primarily of strings. The new bench file is a bit bigger (~32k vs ~27k MsgpackItems), hence the result difference for previous benchmarks.

MsgpackDeserializerBenchmarks.deserializeValues  avgt   10  4256.090 ± 38.211  us/op
MsgpackDeserializerBenchmarks.deserialize        avgt   10  3801.777 ± 36.186  us/op
MsgpackItemParserBenchmarks.parseMsgpackItems    avgt   10  4268.732 ± 58.199  us/op
MsgpackItemSerializerBenchmarks.serialize        avgt   10  4130.059 ±  9.869  us/op
MsgpackItemSerializerBenchmarks.withValidation   avgt   10  5653.779 ± 71.014  us/op

Benchmarks were done on i7-8550U.

I'm sure it would be possible to add typeclass derivation of the MsgpackDeserializer. Sadly, I'm not really familiar with this part of Scala and I currently lack time to really dig into it.

The high-level API is split into two modules: - `high.dynamic` which deserialises data into wrapper objects, similarly to how `cbor.high` works. - `high.static` which allows for deserialization into scala's native data-types. I have also decided to ditch the old test data because it consisted primarily of strings which made decoders harder to analyse.

jarmuszz · 2025-04-19T09:45:10Z

Converting to draft as a couple of changes are needed to fix building for scalajs

java.time.Instant is not available in scalajs

scala.collection.mutable.ArrayDeque is not present in scala 2.12

satabin · 2025-04-30T19:38:50Z

Converting to draft as a couple of changes are needed to fix building for scalajs

Is it now ready to be reviewed?

jarmuszz · 2025-05-01T07:58:44Z

Yes it is, thank you. I forgot to put up a comment.

satabin

A had a first look and it looks generally ok, great work!!!!

I just am not sure either about the static/dynamic naming, and every time I looked at your PR I needed to re-read what you meant in the comment.

If we look at the JSON module, my understanding is that

dynamic is what we have with the values pipe in JSON. It produces a stream of JSON AST values out of the tokens (items in the msgpack implementation). It creates a structured representation of the input stream, which is a flat list of "events" describing the structure. What do you think about using the values name as well?
static is what we have with the deserialize pipe in JSON. It produces a stream of homogeneous typed values, out of the tokens (items in your case). What do you think about using the deserialize as well?

An other possibility for values could be ast, since this is what it produces. But I am not sure it helps with readability.

Note: In XML I used the DOM naming since this is a quite accepted term in this format. values would be documents in XML, since this is the name of the DOM root, and deserialize is called elements for XML.

Naming is hard 😅

jarmuszz · 2025-05-05T13:12:05Z

I think it would be best to merge high.static into high and rename dynamic to ast.

high.static -> high
high.dynamic -> high.ast

Methods in high cannot be actually used without decoder instances which are provided via static or dynamic imports. While the former package relies on user calling high's methods, the latter comes with it's own set of endpoints so it makes sense to keep it as a separate module. Merging high.static into high would simplify imports and make the package structure a bit cleaner.

As for the dynamic, renaming it into values could confuse users as there is a method with the same name inside the package. People would end-up calling msgpack.high.values.values. I think that ast works better in this case.

Also, changing wording from "decoding" to "deserialization" thought the code seems like a good idea to me.

Are these changes okay? If so, I'll push a commit in the following days.

Thank you for the feedback :)

satabin · 2025-05-05T16:28:24Z

I think it would be best to merge high.static into high and rename dynamic to ast.
* `high.static` -> `high`

* `high.dynamic` -> `high.ast`
Methods in high cannot be actually used without decoder instances which are provided via static or dynamic imports. While the former package relies on user calling high's methods, the latter comes with it's own set of endpoints so it makes sense to keep it as a separate module. Merging high.static into high would simplify imports and make the package structure a bit cleaner.

As for the dynamic, renaming it into values could confuse users as there is a method with the same name inside the package. People would end-up calling msgpack.high.values.values. I think that ast works better in this case.

Also, changing wording from "decoding" to "deserialization" thought the code seems like a good idea to me.

Are these changes okay? If so, I'll push a commit in the following days.

Thank you for the feedback :)

This sounds reasonable, yes.

Merge `high.static` into `high` and rename `high.dynamic` to `high.ast`. Change wording from "decoding" to "deserialization". Fix capitalization of benchmark files.

jarmuszz · 2025-05-07T16:43:01Z

I have restructured the module as described above.

satabin · 2025-05-07T16:34:30Z

msgpack/shared/src/main/scala/fs2/data/msgpack/high/package.scala

+
+  /** Summons a [[fs2.data.msgpack.high.MsgpackDeserializer]] instance for the desired type
+    */
+  def deserializer[A](implicit ev: MsgpackDeserializer[A]) = ev


A more idiomatic way of doing so is usually to have an apply method in the companion object of the typeclass. Like:

object MsgpackDeserializer { def apply[A](implicit ev: MsgpackDeserializer[A]): MsgpackDeserializer[A]= ev }

satabin · 2025-05-07T16:59:21Z

msgpack/shared/src/main/scala/fs2/data/msgpack/high/MsgpackDeserializer.scala

+  * Built-in instances are accessbile via an import `fs2.data.msgpack.high.*`.
+  */
+trait MsgpackDeserializer[A] { outer =>
+  private[high] def run[F[_]: RaiseThrowable](ctx: DeserializationContext[F]): DeserializationResult[F, A]


I am not a fan of this signature. Writing a deserializer for a new type requires knowing a lot of the internals, if possible at all. Generally speaking I like deserializer that have zero knowledge of the context in which they are called, and that do not need to mutate anything (in you case you mutate the context). For me a deserializer should be a pure function from its inputs without any size effect, and with a public deserialize method, that can be tested and called without a big context.

Can you write it with this signature (or a slight variant thereof if it fits better)? That would not involved any fs2 specific thing.

trait MsgpackDeserializer[A] { def deserialize(bytes: ByteVector): DeserializationResult[A] } sealed trait DeserializationResult[+A] object DeserializationResult { final case class Ok[A](value: A, reminder: ByteVector) extends DeserializationResult[A] final case class Error(msg: String) extends DeserializationResult[Nothing] final case class NeedsMoreBytes(hint: Option[Int]) extends DeserializationResult[Nothing] }

The results models the three possible outcome:

We could deserialize and return the value and bytes that were not read

An error occurred with a message

We need more input to deserialize and want to be called again with more data (nothing was consumed) with a hint of how much we need

This is similar to the approach taken by scodec or by framing and decoding in tokio (hint is given by resizing the buffer). This way we decorrelate the decoding itself from acquiring the bytes, and make it easier to write custom deserializer (you do not need a Pull to decode)

What do you think about writing it this way?

I think it should be possible to make the deserializer context-free the way you described. I have thought about design like this previously but I ended up with a conclusion that restarting the decoder over and over would be inefficient. Decoders would have to count the number of items in a buffer on every call which takes some time because we are working with structured data. Do you think there is a better way to handle this?

Correct me if I'm wrong, but msgpack encodes the length at the beginning for its compound types, right? If so, we can ensure that we read all the bytes before sending the data to decode to the decoder, limiting the roundtrips. What do you think?

Yes, you are right about the length encoding. It did not occur to me that we could check before sending the data to the decoder, it will probably solve the issue. I think this is a good approach, starting rewrite now :)

jarmuszz and others added 2 commits April 16, 2025 22:19

Merge branch 'gnieh:main' into main

2ebef73

jarmuszz requested a review from a team as a code owner April 16, 2025 20:47

jarmuszz marked this pull request as draft April 19, 2025 09:42

jarmuszz added 5 commits April 21, 2025 18:19

Remove dep of MsgpackValue on java.time.Instant

c40187e

java.time.Instant is not available in scalajs

Use java's ArrayDeque instead of scala's

88dc29e

scala.collection.mutable.ArrayDeque is not present in scala 2.12

Fix doc for valuesFromItems

7555857

Remove ambiguous references to fromItems

f806e79

Remove doc links pointing to fromItems and decode

240e056

jarmuszz marked this pull request as ready for review April 21, 2025 21:48

satabin reviewed May 3, 2025

View reviewed changes

satabin added enhancement msgpack labels May 6, 2025

jarmuszz and others added 2 commits May 7, 2025 18:24

Refactor fs2.data.msgpack.high module structure

cac186b

Merge `high.static` into `high` and rename `high.dynamic` to `high.ast`. Change wording from "decoding" to "deserialization". Fix capitalization of benchmark files.

Merge branch 'gnieh:main' into main

611516a

jarmuszz changed the title ~~Add high-level Msgpack decoding API~~ Add high-level Msgpack deserialization API May 7, 2025

satabin requested changes May 7, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add high-level Msgpack deserialization API #671

Add high-level Msgpack deserialization API #671

jarmuszz commented Apr 16, 2025 •

edited

Loading

jarmuszz commented Apr 19, 2025

satabin commented Apr 30, 2025

jarmuszz commented May 1, 2025

satabin left a comment

jarmuszz commented May 5, 2025

satabin commented May 5, 2025

jarmuszz commented May 7, 2025

satabin May 7, 2025

satabin May 7, 2025 •

edited

Loading

jarmuszz May 8, 2025

satabin May 8, 2025

jarmuszz May 9, 2025

Add high-level Msgpack deserialization API #671

Are you sure you want to change the base?

Add high-level Msgpack deserialization API #671

Conversation

jarmuszz commented Apr 16, 2025 • edited Loading

jarmuszz commented Apr 19, 2025

satabin commented Apr 30, 2025

jarmuszz commented May 1, 2025

satabin left a comment

Choose a reason for hiding this comment

jarmuszz commented May 5, 2025

satabin commented May 5, 2025

jarmuszz commented May 7, 2025

satabin May 7, 2025

Choose a reason for hiding this comment

satabin May 7, 2025 • edited Loading

Choose a reason for hiding this comment

jarmuszz May 8, 2025

Choose a reason for hiding this comment

satabin May 8, 2025

Choose a reason for hiding this comment

jarmuszz May 9, 2025

Choose a reason for hiding this comment

jarmuszz commented Apr 16, 2025 •

edited

Loading

satabin May 7, 2025 •

edited

Loading