-
-
Notifications
You must be signed in to change notification settings - Fork 27
Add high-level Msgpack deserialization API #671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The high-level API is split into two modules: - `high.dynamic` which deserialises data into wrapper objects, similarly to how `cbor.high` works. - `high.static` which allows for deserialization into scala's native data-types. I have also decided to ditch the old test data because it consisted primarily of strings which made decoders harder to analyse.
Converting to draft as a couple of changes are needed to fix building for scalajs |
java.time.Instant is not available in scalajs
scala.collection.mutable.ArrayDeque is not present in scala 2.12
Is it now ready to be reviewed? |
Yes it is, thank you. I forgot to put up a comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A had a first look and it looks generally ok, great work!!!!
I just am not sure either about the static/dynamic naming, and every time I looked at your PR I needed to re-read what you meant in the comment.
If we look at the JSON module, my understanding is that
dynamic
is what we have with thevalues
pipe in JSON. It produces a stream of JSON AST values out of the tokens (items in the msgpack implementation). It creates a structured representation of the input stream, which is a flat list of "events" describing the structure. What do you think about using thevalues
name as well?static
is what we have with thedeserialize
pipe in JSON. It produces a stream of homogeneous typed values, out of the tokens (items in your case). What do you think about using thedeserialize
as well?
An other possibility for values
could be ast
, since this is what it produces. But I am not sure it helps with readability.
Note: In XML I used the DOM naming since this is a quite accepted term in this format. values
would be documents
in XML, since this is the name of the DOM root, and deserialize
is called elements
for XML.
Naming is hard 😅
I think it would be best to merge
Methods in As for the Also, changing wording from "decoding" to "deserialization" thought the code seems like a good idea to me. Are these changes okay? If so, I'll push a commit in the following days. Thank you for the feedback :) |
This sounds reasonable, yes. |
Merge `high.static` into `high` and rename `high.dynamic` to `high.ast`. Change wording from "decoding" to "deserialization". Fix capitalization of benchmark files.
I have restructured the module as described above. |
|
||
/** Summons a [[fs2.data.msgpack.high.MsgpackDeserializer]] instance for the desired type | ||
*/ | ||
def deserializer[A](implicit ev: MsgpackDeserializer[A]) = ev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more idiomatic way of doing so is usually to have an apply
method in the companion object of the typeclass. Like:
object MsgpackDeserializer {
def apply[A](implicit ev: MsgpackDeserializer[A]): MsgpackDeserializer[A]= ev
}
* Built-in instances are accessbile via an import `fs2.data.msgpack.high.*`. | ||
*/ | ||
trait MsgpackDeserializer[A] { outer => | ||
private[high] def run[F[_]: RaiseThrowable](ctx: DeserializationContext[F]): DeserializationResult[F, A] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a fan of this signature. Writing a deserializer for a new type requires knowing a lot of the internals, if possible at all. Generally speaking I like deserializer that have zero knowledge of the context in which they are called, and that do not need to mutate anything (in you case you mutate the context). For me a deserializer should be a pure function from its inputs without any size effect, and with a public deserialize
method, that can be tested and called without a big context.
Can you write it with this signature (or a slight variant thereof if it fits better)? That would not involved any fs2 specific thing.
trait MsgpackDeserializer[A] {
def deserialize(bytes: ByteVector): DeserializationResult[A]
}
sealed trait DeserializationResult[+A]
object DeserializationResult {
final case class Ok[A](value: A, reminder: ByteVector) extends DeserializationResult[A]
final case class Error(msg: String) extends DeserializationResult[Nothing]
final case class NeedsMoreBytes(hint: Option[Int]) extends DeserializationResult[Nothing]
}
The results models the three possible outcome:
- We could deserialize and return the value and bytes that were not read
- An error occurred with a message
- We need more input to deserialize and want to be called again with more data (nothing was consumed) with a hint of how much we need
This is similar to the approach taken by scodec or by framing and decoding in tokio (hint is given by resizing the buffer). This way we decorrelate the decoding itself from acquiring the bytes, and make it easier to write custom deserializer (you do not need a Pull
to decode)
What do you think about writing it this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be possible to make the deserializer context-free the way you described. I have thought about design like this previously but I ended up with a conclusion that restarting the decoder over and over would be inefficient. Decoders would have to count the number of items in a buffer on every call which takes some time because we are working with structured data. Do you think there is a better way to handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but msgpack encodes the length at the beginning for its compound types, right? If so, we can ensure that we read all the bytes before sending the data to decode to the decoder, limiting the roundtrips. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right about the length encoding. It did not occur to me that we could check before sending the data to the decoder, it will probably solve the issue. I think this is a good approach, starting rewrite now :)
This PR adds the high-level MessagePack deserialization API.
The high-level API is split into two modules:
high
which allows for deserialization into scala's native data-types.high.ast
which deserializes data into wrapper objects, similarly to howcbor.high
works.The
high
module performs better thanhigh.ast
in terms of speed but is far less flexible, e.g. it currently lacks support for heterogeneous data.The core of the implementation resides in the
MsgpackDeserializer[A]
typeclass which represents a process of extracting values of typeA
from aMsgpackItem
stream. Deserializers can be ran via API endpoints specified in the package file. Instances ofMsgpackDeserializer
for common built-in types are available via thehigh.*
import. Thehigh.ast
module exports an instance of the typeclass forMsgpackValue
and some methods to reduce boilerplate.MsgpackDeserializer
is also monadic:I have also decided to ditch the old benchmark data because it consisted primarily of strings. The new bench file is a bit bigger (~32k vs ~27k MsgpackItems), hence the result difference for previous benchmarks.
Benchmarks were done on i7-8550U.
I'm sure it would be possible to add typeclass derivation of the
MsgpackDeserializer
. Sadly, I'm not really familiar with this part of Scala and I currently lack time to really dig into it.