Skip to content

Add high-level Msgpack deserialization API #671

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

jarmuszz
Copy link
Contributor

@jarmuszz jarmuszz commented Apr 16, 2025

This PR adds the high-level MessagePack deserialization API.

The high-level API is split into two modules:

  • high which allows for deserialization into scala's native data-types.
  • high.ast which deserializes data into wrapper objects, similarly to how cbor.high works.

The high module performs better than high.ast in terms of speed but is far less flexible, e.g. it currently lacks support for heterogeneous data.

The core of the implementation resides in the MsgpackDeserializer[A] typeclass which represents a process of extracting values of type A from a MsgpackItem stream. Deserializers can be ran via API endpoints specified in the package file. Instances of MsgpackDeserializer for common built-in types are available via the high.* import. The high.ast module exports an instance of the typeclass for MsgpackValue and some methods to reduce boilerplate.

MsgpackDeserializer is also monadic:

case class Record(id: Long, description: String, created: Instant)

implicit val recordDeserializer: MsgpackDeserializer[Record] =
  for {
    id      <- deserializer[Long]
    desc    <- deserializer[String]
    created <- deserializer[Instant]
  } yield Record(id, desc, created)

val stream: Stream[IO, Byte] = ???
val records: IO[List[Records]] =
  stream
    .through(deserialize[Record])
    .compile
    .toList

I have also decided to ditch the old benchmark data because it consisted primarily of strings. The new bench file is a bit bigger (~32k vs ~27k MsgpackItems), hence the result difference for previous benchmarks.

MsgpackDeserializerBenchmarks.deserializeValues  avgt   10  4256.090 ± 38.211  us/op
MsgpackDeserializerBenchmarks.deserialize        avgt   10  3801.777 ± 36.186  us/op
MsgpackItemParserBenchmarks.parseMsgpackItems    avgt   10  4268.732 ± 58.199  us/op
MsgpackItemSerializerBenchmarks.serialize        avgt   10  4130.059 ±  9.869  us/op
MsgpackItemSerializerBenchmarks.withValidation   avgt   10  5653.779 ± 71.014  us/op

Benchmarks were done on i7-8550U.

I'm sure it would be possible to add typeclass derivation of the MsgpackDeserializer. Sadly, I'm not really familiar with this part of Scala and I currently lack time to really dig into it.

jarmuszz and others added 2 commits April 16, 2025 22:19
The high-level API is split into two modules:
- `high.dynamic` which deserialises  data into wrapper objects,
  similarly to how `cbor.high` works.
- `high.static` which allows for deserialization into scala's native
  data-types.

I have also decided to ditch the old test data because it consisted
primarily of strings which made decoders harder to analyse.
@jarmuszz jarmuszz requested a review from a team as a code owner April 16, 2025 20:47
@jarmuszz jarmuszz marked this pull request as draft April 19, 2025 09:42
@jarmuszz
Copy link
Contributor Author

Converting to draft as a couple of changes are needed to fix building for scalajs

@jarmuszz jarmuszz marked this pull request as ready for review April 21, 2025 21:48
@satabin
Copy link
Member

satabin commented Apr 30, 2025

Converting to draft as a couple of changes are needed to fix building for scalajs

Is it now ready to be reviewed?

@jarmuszz
Copy link
Contributor Author

jarmuszz commented May 1, 2025

Yes it is, thank you. I forgot to put up a comment.

Copy link
Member

@satabin satabin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A had a first look and it looks generally ok, great work!!!!

I just am not sure either about the static/dynamic naming, and every time I looked at your PR I needed to re-read what you meant in the comment.

If we look at the JSON module, my understanding is that

  • dynamic is what we have with the values pipe in JSON. It produces a stream of JSON AST values out of the tokens (items in the msgpack implementation). It creates a structured representation of the input stream, which is a flat list of "events" describing the structure. What do you think about using the values name as well?
  • static is what we have with the deserialize pipe in JSON. It produces a stream of homogeneous typed values, out of the tokens (items in your case). What do you think about using the deserialize as well?

An other possibility for values could be ast, since this is what it produces. But I am not sure it helps with readability.

Note: In XML I used the DOM naming since this is a quite accepted term in this format. values would be documents in XML, since this is the name of the DOM root, and deserialize is called elements for XML.

Naming is hard 😅

@jarmuszz
Copy link
Contributor Author

jarmuszz commented May 5, 2025

I think it would be best to merge high.static into high and rename dynamic to ast.

  • high.static -> high
  • high.dynamic -> high.ast

Methods in high cannot be actually used without decoder instances which are provided via static or dynamic imports. While the former package relies on user calling high's methods, the latter comes with it's own set of endpoints so it makes sense to keep it as a separate module. Merging high.static into high would simplify imports and make the package structure a bit cleaner.

As for the dynamic, renaming it into values could confuse users as there is a method with the same name inside the package. People would end-up calling msgpack.high.values.values. I think that ast works better in this case.

Also, changing wording from "decoding" to "deserialization" thought the code seems like a good idea to me.

Are these changes okay? If so, I'll push a commit in the following days.

Thank you for the feedback :)

@satabin
Copy link
Member

satabin commented May 5, 2025

I think it would be best to merge high.static into high and rename dynamic to ast.

* `high.static` -> `high`

* `high.dynamic` -> `high.ast`

Methods in high cannot be actually used without decoder instances which are provided via static or dynamic imports. While the former package relies on user calling high's methods, the latter comes with it's own set of endpoints so it makes sense to keep it as a separate module. Merging high.static into high would simplify imports and make the package structure a bit cleaner.

As for the dynamic, renaming it into values could confuse users as there is a method with the same name inside the package. People would end-up calling msgpack.high.values.values. I think that ast works better in this case.

Also, changing wording from "decoding" to "deserialization" thought the code seems like a good idea to me.

Are these changes okay? If so, I'll push a commit in the following days.

Thank you for the feedback :)

This sounds reasonable, yes.

@satabin satabin added enhancement New feature or request msgpack labels May 6, 2025
jarmuszz and others added 2 commits May 7, 2025 18:24
Merge `high.static` into `high` and rename `high.dynamic` to `high.ast`.

Change wording from "decoding" to "deserialization".

Fix capitalization of benchmark files.
@jarmuszz jarmuszz changed the title Add high-level Msgpack decoding API Add high-level Msgpack deserialization API May 7, 2025
@jarmuszz
Copy link
Contributor Author

jarmuszz commented May 7, 2025

I have restructured the module as described above.


/** Summons a [[fs2.data.msgpack.high.MsgpackDeserializer]] instance for the desired type
*/
def deserializer[A](implicit ev: MsgpackDeserializer[A]) = ev
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more idiomatic way of doing so is usually to have an apply method in the companion object of the typeclass. Like:

object MsgpackDeserializer {
  def apply[A](implicit ev: MsgpackDeserializer[A]): MsgpackDeserializer[A]=  ev
}

* Built-in instances are accessbile via an import `fs2.data.msgpack.high.*`.
*/
trait MsgpackDeserializer[A] { outer =>
private[high] def run[F[_]: RaiseThrowable](ctx: DeserializationContext[F]): DeserializationResult[F, A]
Copy link
Member

@satabin satabin May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of this signature. Writing a deserializer for a new type requires knowing a lot of the internals, if possible at all. Generally speaking I like deserializer that have zero knowledge of the context in which they are called, and that do not need to mutate anything (in you case you mutate the context). For me a deserializer should be a pure function from its inputs without any size effect, and with a public deserialize method, that can be tested and called without a big context.

Can you write it with this signature (or a slight variant thereof if it fits better)? That would not involved any fs2 specific thing.

trait MsgpackDeserializer[A] {
  def deserialize(bytes: ByteVector): DeserializationResult[A]
}

sealed trait DeserializationResult[+A]
object DeserializationResult {
  final case class Ok[A](value: A, reminder: ByteVector) extends DeserializationResult[A]
  final case class Error(msg: String) extends DeserializationResult[Nothing]
  final case class NeedsMoreBytes(hint: Option[Int]) extends DeserializationResult[Nothing]
}

The results models the three possible outcome:

  • We could deserialize and return the value and bytes that were not read
  • An error occurred with a message
  • We need more input to deserialize and want to be called again with more data (nothing was consumed) with a hint of how much we need

This is similar to the approach taken by scodec or by framing and decoding in tokio (hint is given by resizing the buffer). This way we decorrelate the decoding itself from acquiring the bytes, and make it easier to write custom deserializer (you do not need a Pull to decode)

What do you think about writing it this way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be possible to make the deserializer context-free the way you described. I have thought about design like this previously but I ended up with a conclusion that restarting the decoder over and over would be inefficient. Decoders would have to count the number of items in a buffer on every call which takes some time because we are working with structured data. Do you think there is a better way to handle this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but msgpack encodes the length at the beginning for its compound types, right? If so, we can ensure that we read all the bytes before sending the data to decode to the decoder, limiting the roundtrips. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right about the length encoding. It did not occur to me that we could check before sending the data to the decoder, it will probably solve the issue. I think this is a good approach, starting rewrite now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request msgpack
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants