Skip to content
This repository
branch: master
Fetching contributors…

Octocat-spinner-32-eaf2f5

Cannot retrieve contributors at this time

file 508 lines (395 sloc) 23.971 kb

NAME

Sereal - Protocol definition

SYNOPSIS

This document describes the format and encoding of a Sereal data packet.

VERSION

This is the Sereal specification version 2.02.

The integer part of the document version corresponds to the Sereal protocol version. For details on incompatible changes between major protocol versions, see the "PROTOCOL CHANGES" below.

DESCRIPTION

A serialized structure is converted into a "document". A document is made up of two parts, the document header and the document body.

General Points

Strictness

A compliant Sereal decoder must detect invalid documents and handle them as a safe exception in the respective implementation language. That is to say, without a crash or worse.

Little Endian

All numeric data is in little endian format.

IEEE Floats

Floating points types are in IEEE format.

Varints

Heavy use is made of a variable length integer encoding commonly called a "varint" (Google calls it a Varint128). This encoding uses the high bit of each byte to signal there is another byte worth of data coming, and the last byte always having the high bit off. The data is in little endian order with the low seven bits in the first byte, and the next 7 in the next etc.

See Google's description.

Document Header Format

A header consists of multiple components:

   <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>

MAGIC

A "magic string" that identifies a document as being in the Sereal format. The value of this string is "=srl", and when decoded as an unsigned 32 bit integer on a little endian machine has a value of 0x6c72733d.

VERSION-TYPE

A single byte, of which the high 4 bits are used to represent the "type" of the document, and the low 4 bits used to represent the version of the Sereal protocol the document complies with.

Up until now there have been versions 1 and 2 of the Sereal protocol. So the low four bits will be 1 or 2.

Currently only three types are defined:

0

Raw Sereal format. The data can be processed verbatim.

1

This is not a valid document type for Sereal protocol version 2!

In Sereal protocol version 1, this used to be "Compressed Sereal format, using Google's Snappy compression internally." It has long been advised to prefer 2, "incremental-decoding-enabled compressed Sereal," wherever possible.

2

Compressed Sereal format, using Google's Snappy compression internally as format 1, but supporting incremental-parsing. Long preferred over 1 as this is considered a bug fix in the Snappy compression support.

The format is:

  <Varint><Snappy Blob>

where the varint signifies the length of the Snappy-compressed blob following it. See "NOTES ON IMPLEMENTATION" below for a discussion on how to implement this efficiently.

Additional compression types are envisaged and will be assigned type numbers by the maintainers of the protocol.

HEADER-SUFFIX-SIZE

The structure of the header includes support for embedding additional data. This is accomplished by specifying the length of the suffix in the header with a varint. Headers with no suffix will set this to a binary 0. This is intended for future format extensions that retain some level of compatibility for old decoders (which know how to skip the extended header due to the embedded length).

OPT-SUFFIX

The suffix may contain whatever data the encoder wishes to embed in the header. In version 1 of the protocol the decoder never looked inside this data. Later versions may introduce additional rules for this field. In version 2 of the protocol, this variable-length part of the header may be empty or have the following format:

    <8bit-BITFIELD> <OPT-USER-META-DATA>
8bit-BITFIELD

If not present, all bits are assumed off. In version 2 of the protocol, only the least significant bit is meaningful: If set, the bitfield is followed by the <USER-META-DATA>. If not set, there is no user meta data.

OPT-USER-META-DATA

If the least significant bit of the preceding bitfield is set, this may be an arbitrary Sereal document body. Like any other Sereal document body, it is self-contained and can be deserialized independently from any other document bodies in the Sereal document. This document body is NEVER compressed.

This is intended for embedding small amounts of meta data, such as routing information, in a document that allows users to avoid deserializing very large document bodies needlessly or having to call into decompression logic.

Document Body Format

The body is made up of tagged data items:

    <TAG> <OPT-DATA>

Tagged items can be containers that hold other tagged items. At the top level, the body holds only ONE tagged item (often an array or hash) that holds others.

TAG

A tag is a single byte which specifies the type of the data being decoded.

The high bit of each tag is used to signal to the decoder that the deserialized data needs to be stored and tracked and will be reused again elsewhere in the serialization. This is sometimes called the "track flag" or the "F-bit" in code and documentation. Its status should be ignored when processing a tag, meaning code should mask off the high bit and only use the low 7 bits.

Some tags, such as POS, NEG and SHORT_BINARY contain embedded in them either the data (in the case of POS and NEG) or the length of the OPT-DATA section (in the case of SHORT_BINARY).

OPT-DATA

This field may contain an arbitrary set of bytes, either determined implicitly by the tag (such as for FLOAT), explicitly in the tag (as in SHORT_BINARY) or in a varint following the tag (such as for STRING).

When referring to an offset below, what's meant is a varint encoded absolute integer byte position in the document body. That is, an offset of 10 refers to the tenth byte in the Sereal document body (ie. excluding its header). Sereal version 1 used to mandate offsets from the start of the document header.

Tags

                  Tag | Char | Dec |  Hex |     Binary | Follow
    ------------------+------+-----+------+----------- |-----------------------------------------
    POS_0             |      |   0 | 0x00 | 0b00000000 | small positive integer - value in low 4 bits (identity)
    POS_1             |      |   1 | 0x01 | 0b00000001 |
    POS_2             |      |   2 | 0x02 | 0b00000010 |
    POS_3             |      |   3 | 0x03 | 0b00000011 |
    POS_4             |      |   4 | 0x04 | 0b00000100 |
    POS_5             |      |   5 | 0x05 | 0b00000101 |
    POS_6             |      |   6 | 0x06 | 0b00000110 |
    POS_7             | "\a" |   7 | 0x07 | 0b00000111 |
    POS_8             | "\b" |   8 | 0x08 | 0b00001000 |
    POS_9             | "\t" |   9 | 0x09 | 0b00001001 |
    POS_10            | "\n" |  10 | 0x0a | 0b00001010 |
    POS_11            |      |  11 | 0x0b | 0b00001011 |
    POS_12            | "\f" |  12 | 0x0c | 0b00001100 |
    POS_13            | "\r" |  13 | 0x0d | 0b00001101 |
    POS_14            |      |  14 | 0x0e | 0b00001110 |
    POS_15            |      |  15 | 0x0f | 0b00001111 | small positive integer - value in low 4 bits (identity)
    NEG_16            |      |  16 | 0x10 | 0b00010000 | small negative integer - value in low 4 bits (k+32)
    NEG_15            |      |  17 | 0x11 | 0b00010001 |
    NEG_14            |      |  18 | 0x12 | 0b00010010 |
    NEG_13            |      |  19 | 0x13 | 0b00010011 |
    NEG_12            |      |  20 | 0x14 | 0b00010100 |
    NEG_11            |      |  21 | 0x15 | 0b00010101 |
    NEG_10            |      |  22 | 0x16 | 0b00010110 |
    NEG_9             |      |  23 | 0x17 | 0b00010111 |
    NEG_8             |      |  24 | 0x18 | 0b00011000 |
    NEG_7             |      |  25 | 0x19 | 0b00011001 |
    NEG_6             |      |  26 | 0x1a | 0b00011010 |
    NEG_5             | "\e" |  27 | 0x1b | 0b00011011 |
    NEG_4             |      |  28 | 0x1c | 0b00011100 |
    NEG_3             |      |  29 | 0x1d | 0b00011101 |
    NEG_2             |      |  30 | 0x1e | 0b00011110 |
    NEG_1             |      |  31 | 0x1f | 0b00011111 | small negative integer - value in low 4 bits (k+32)
    VARINT            | " "  |  32 | 0x20 | 0b00100000 | <VARINT> - Varint variable length integer
    ZIGZAG            | "!"  |  33 | 0x21 | 0b00100001 | <ZIGZAG-VARINT> - Zigzag variable length integer
    FLOAT             | "\"" |  34 | 0x22 | 0b00100010 | <IEEE-FLOAT>
    DOUBLE            | "#"  |  35 | 0x23 | 0b00100011 | <IEEE-DOUBLE>
    LONG_DOUBLE       | "\$" |  36 | 0x24 | 0b00100100 | <IEEE-LONG-DOUBLE>
    UNDEF             | "%"  |  37 | 0x25 | 0b00100101 | None - Perl undef
    BINARY            | "&"  |  38 | 0x26 | 0b00100110 | <LEN-VARINT> <BYTES> - binary/(latin1) string
    STR_UTF8          | "'"  |  39 | 0x27 | 0b00100111 | <LEN-VARINT> <UTF8> - utf8 string
    REFN              | "("  |  40 | 0x28 | 0b00101000 | <ITEM-TAG>    - ref to next item
    REFP              | ")"  |  41 | 0x29 | 0b00101001 | <OFFSET-VARINT> - ref to previous item stored at offset
    HASH              | "*"  |  42 | 0x2a | 0b00101010 | <COUNT-VARINT> [<KEY-TAG> <ITEM-TAG> ...] - count followed by key/value pairs
    ARRAY             | "+"  |  43 | 0x2b | 0b00101011 | <COUNT-VARINT> [<ITEM-TAG> ...] - count followed by items
    OBJECT            | ","  |  44 | 0x2c | 0b00101100 | <STR-TAG> <ITEM-TAG> - class, object-item
    OBJECTV           | "-"  |  45 | 0x2d | 0b00101101 | <OFFSET-VARINT> <ITEM-TAG> - offset of previously used classname tag - object-item
    ALIAS             | "."  |  46 | 0x2e | 0b00101110 | <OFFSET-VARINT> - alias to item defined at offset
    COPY              | "/"  |  47 | 0x2f | 0b00101111 | <OFFSET-VARINT> - copy of item defined at offset
    WEAKEN            | "0"  |  48 | 0x30 | 0b00110000 | <REF-TAG> - Weaken the following reference
    REGEXP            | "1"  |  49 | 0x31 | 0b00110001 | <PATTERN-STR-TAG> <MODIFIERS-STR-TAG>
    OBJECT_FREEZE     | "2"  |  50 | 0x32 | 0b00110010 | <STR-TAG> <ITEM-TAG> - class, object-item. Need to call "THAW" method on class after decoding
    OBJECTV_FREEZE    | "3"  |  51 | 0x33 | 0b00110011 | <OFFSET-VARINT> <ITEM-TAG> - (OBJECTV_FREEZE is to OBJECT_FREEZE as OBJECTV is to OBJECT)
    RESERVED_0        | "4"  |  52 | 0x34 | 0b00110100 | reserved
    RESERVED_1        | "5"  |  53 | 0x35 | 0b00110101 |
    RESERVED_2        | "6"  |  54 | 0x36 | 0b00110110 |
    RESERVED_3        | "7"  |  55 | 0x37 | 0b00110111 |
    RESERVED_4        | "8"  |  56 | 0x38 | 0b00111000 |
    RESERVED_5        | "9"  |  57 | 0x39 | 0b00111001 | reserved
    FALSE             | ":"  |  58 | 0x3a | 0b00111010 | false (PL_sv_no)
    TRUE              | ";"  |  59 | 0x3b | 0b00111011 | true  (PL_sv_yes)
    MANY              | "<"  |  60 | 0x3c | 0b00111100 | <LEN-VARINT> <TYPE-BYTE> <TAG-DATA> - repeated tag (not done yet, will be implemented in version 3)
    PACKET_START      | "="  |  61 | 0x3d | 0b00111101 | (first byte of magic string in header)
    EXTEND            | ">"  |  62 | 0x3e | 0b00111110 | <BYTE> - for additional tags
    PAD               | "?"  |  63 | 0x3f | 0b00111111 | (ignored tag, skip to next byte)
    ARRAYREF_0        | "\@" |  64 | 0x40 | 0b01000000 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    ARRAYREF_1        | "A"  |  65 | 0x41 | 0b01000001 |
    ARRAYREF_2        | "B"  |  66 | 0x42 | 0b01000010 |
    ARRAYREF_3        | "C"  |  67 | 0x43 | 0b01000011 |
    ARRAYREF_4        | "D"  |  68 | 0x44 | 0b01000100 |
    ARRAYREF_5        | "E"  |  69 | 0x45 | 0b01000101 |
    ARRAYREF_6        | "F"  |  70 | 0x46 | 0b01000110 |
    ARRAYREF_7        | "G"  |  71 | 0x47 | 0b01000111 |
    ARRAYREF_8        | "H"  |  72 | 0x48 | 0b01001000 |
    ARRAYREF_9        | "I"  |  73 | 0x49 | 0b01001001 |
    ARRAYREF_10       | "J"  |  74 | 0x4a | 0b01001010 |
    ARRAYREF_11       | "K"  |  75 | 0x4b | 0b01001011 |
    ARRAYREF_12       | "L"  |  76 | 0x4c | 0b01001100 |
    ARRAYREF_13       | "M"  |  77 | 0x4d | 0b01001101 |
    ARRAYREF_14       | "N"  |  78 | 0x4e | 0b01001110 |
    ARRAYREF_15       | "O"  |  79 | 0x4f | 0b01001111 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    HASHREF_0         | "P"  |  80 | 0x50 | 0b01010000 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    HASHREF_1         | "Q"  |  81 | 0x51 | 0b01010001 |
    HASHREF_2         | "R"  |  82 | 0x52 | 0b01010010 |
    HASHREF_3         | "S"  |  83 | 0x53 | 0b01010011 |
    HASHREF_4         | "T"  |  84 | 0x54 | 0b01010100 |
    HASHREF_5         | "U"  |  85 | 0x55 | 0b01010101 |
    HASHREF_6         | "V"  |  86 | 0x56 | 0b01010110 |
    HASHREF_7         | "W"  |  87 | 0x57 | 0b01010111 |
    HASHREF_8         | "X"  |  88 | 0x58 | 0b01011000 |
    HASHREF_9         | "Y"  |  89 | 0x59 | 0b01011001 |
    HASHREF_10        | "Z"  |  90 | 0x5a | 0b01011010 |
    HASHREF_11        | "["  |  91 | 0x5b | 0b01011011 |
    HASHREF_12        | "\\" |  92 | 0x5c | 0b01011100 |
    HASHREF_13        | "]"  |  93 | 0x5d | 0b01011101 |
    HASHREF_14        | "^"  |  94 | 0x5e | 0b01011110 |
    HASHREF_15        | "_"  |  95 | 0x5f | 0b01011111 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    SHORT_BINARY_0    | "`"  |  96 | 0x60 | 0b01100000 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
    SHORT_BINARY_1    | "a"  |  97 | 0x61 | 0b01100001 |
    SHORT_BINARY_2    | "b"  |  98 | 0x62 | 0b01100010 |
    SHORT_BINARY_3    | "c"  |  99 | 0x63 | 0b01100011 |
    SHORT_BINARY_4    | "d"  | 100 | 0x64 | 0b01100100 |
    SHORT_BINARY_5    | "e"  | 101 | 0x65 | 0b01100101 |
    SHORT_BINARY_6    | "f"  | 102 | 0x66 | 0b01100110 |
    SHORT_BINARY_7    | "g"  | 103 | 0x67 | 0b01100111 |
    SHORT_BINARY_8    | "h"  | 104 | 0x68 | 0b01101000 |
    SHORT_BINARY_9    | "i"  | 105 | 0x69 | 0b01101001 |
    SHORT_BINARY_10   | "j"  | 106 | 0x6a | 0b01101010 |
    SHORT_BINARY_11   | "k"  | 107 | 0x6b | 0b01101011 |
    SHORT_BINARY_12   | "l"  | 108 | 0x6c | 0b01101100 |
    SHORT_BINARY_13   | "m"  | 109 | 0x6d | 0b01101101 |
    SHORT_BINARY_14   | "n"  | 110 | 0x6e | 0b01101110 |
    SHORT_BINARY_15   | "o"  | 111 | 0x6f | 0b01101111 |
    SHORT_BINARY_16   | "p"  | 112 | 0x70 | 0b01110000 |
    SHORT_BINARY_17   | "q"  | 113 | 0x71 | 0b01110001 |
    SHORT_BINARY_18   | "r"  | 114 | 0x72 | 0b01110010 |
    SHORT_BINARY_19   | "s"  | 115 | 0x73 | 0b01110011 |
    SHORT_BINARY_20   | "t"  | 116 | 0x74 | 0b01110100 |
    SHORT_BINARY_21   | "u"  | 117 | 0x75 | 0b01110101 |
    SHORT_BINARY_22   | "v"  | 118 | 0x76 | 0b01110110 |
    SHORT_BINARY_23   | "w"  | 119 | 0x77 | 0b01110111 |
    SHORT_BINARY_24   | "x"  | 120 | 0x78 | 0b01111000 |
    SHORT_BINARY_25   | "y"  | 121 | 0x79 | 0b01111001 |
    SHORT_BINARY_26   | "z"  | 122 | 0x7a | 0b01111010 |
    SHORT_BINARY_27   | "{"  | 123 | 0x7b | 0b01111011 |
    SHORT_BINARY_28   | "|"  | 124 | 0x7c | 0b01111100 |
    SHORT_BINARY_29   | "}"  | 125 | 0x7d | 0b01111101 |
    SHORT_BINARY_30   | "~"  | 126 | 0x7e | 0b01111110 |
    SHORT_BINARY_31   |      | 127 | 0x7f | 0b01111111 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag

The Track Bit And Cyclic Data Structures

The protocol uses a combination of the offset of a tracked tag and the flag bit to be able to encode and reconstruct cyclic structures in a single pass.

An encoder must track duplicated items and generate the appropriate ALIAS or REFP tags to reconstruct them, and when it does so ensure that the high bit of the original tag has been set.

When a decoder encounters a tag with its flag set it will remember the offset of the tag in the output packet and the item that was decoded from that tag. At a later point in the packet there will be an ALIAS or REFP instruction which will refer to the item by its offset, and the decoder will reuse it as needed.

The COPY Tag

Sometimes it is convenient to be able to reuse a previously emitted sequence in the packet to reduce duplication. For instance a data structure with many hashes with the same keys. The COPY tag is used for this. Its argument is a varint which is the offset of a previously emitted tag, and decoders are to behave as though the tag it references was inserted into the packet stream as a replacement for the COPY tag.

Note, that in this case the track flag is not set. It is assumed the decoder can jump back to reread the tag from its location alone.

COPY tags are forbidden from referring to another COPY tag, and are also forbidden from referring to anything containing a COPY tag, with the exception that a COPY tag used as a value may refer to an tag that uses a COPY tag for a classname or hash key.

String Types

Sereal supports three string representations. Two are "encodingless" and are SHORT_BINARY and BINARY, where binary means "raw bytes". The other is STR_UTF8 which is expected to contain valid canonical UTF8 encoded unicode text data. Under normal circumstances a decoder is not expected to validate that this is actually the case, and is allowed to simply extract the data verbatim.

SHORT_BINARY stores the length of the string in the tag itself and is used for strings of less than 32 characters long. Both BINARY and STR_UTF8 use a varint to indicate the number of bytes (octets) in the string.

Hash Keys

Hash keys are always one of the string types, or a COPY tag referencing a string.

Handling Objects

Objects are serialized as a class name and a tag which represents the objects data. In Perl land this will always be a reference. Mapping Perl objects to other languages is left to the future, but the OBJECT_FREEZE and OBJECTV_FREEZE tags provide a basic method of doing that, see below.

Note that classnames MUST be a string, or a COPY tag referencing a string.

OBJECTV varints MUST reference a previously used classname, and not an arbitrary string.

Sereal implementations may choose to allow authors of classes to provide hooks for custom object serialization. Depending on the Sereal implementation, this feature may require enabling with an encoder option on the encoding side, but compliant decoders must at least recognize the OBJECT_FREEZE and OBJECTV_FREEZE tags. The interface shall be such that if enabled in the encoder, for each object in the input that has a FREEZE method, the encoder will invoke said FREEZE method on the object and pass in the string Sereal to allow distinguishing from other serializers (this is inspired by the CBOR::XS CBOR implementation). If there is no FREEZE method available, then a normal OBJECT or OBJECTV tag is emitted, serializing the object content deeply. If invoked, the FREEZE method must return a list of data structures that are serializable by Sereal. The encoder shall emit an OBJECT_FREEZE or OBJECTV_FREEZE tag followed by a reference (REFN) to an array (ARRAY) of the Sereal-encoded data structures that were returned from FREEZE.

Upon decoding OBJECT_FREEZE or OBJECTV_FREEZE, a compliant decoder (unless explicitly instructed not to) will invoke the THAW class method of the given class. (Likely, implementations should throw a fatal error if no such method exists for a class referenced by OBJECT(V)_FREEZE.) Arguments to that method will be the string Sereal as first argument, and then the decoded data structures that were returned from the FREEZE call. The return value of that THAW call needs to be included in the final output structure. See the documentation of the Perl Sereal implemenation for examples of FREEZE/THAW methods.

PROTOCOL CHANGES

Protocol Version 2

In Sereal protocol version 2, offsets were changed from being relative to the start of the document (including header) to being relative to the start of the document body (ie. excluding the document header). This means that Sereal document bodies are now self-contained - relocatable within the document. Note that the offset is 1-based, which means that to point the first byte of the body its value must be 1.

Additionally, protocol version 2 introduced the 8bit bit-field (8bit-BITFIELD) in the variable-length/optional header part (OPT-SUFFIX) of the document and the user-meta-data section (OPT-USER-META-DATA) of the variable-length header.

Protocol version 2 introduces the OBJECT_FREEZE and OBJECTV_FREEZE tags in place of two previously reserved tags. The meaning and implementation of these two tags is described in the "Handling Objects" section of this document. In a nutshell, it allows application developers to have custom hooks for serializing and deserializing the instances of their classes.

NOTES ON IMPLEMENTATION

Encoding the Length of Compressed Documents

With Sereal body format type 2 (see above), you need to encode (as a varint) the length of the Snappy-compressed document as a prefix to the document body. This is somewhat tricky to do efficiently since at first sight, the amount of space required to encode a varint depends on the size of the output. This means that you need to do the normal Sereal-encoding of the document body, then compress the output of that, then append the varint encoded length of the compressed data to a Sereal header, then append the compressed data. In this naive way of implementing this Snappy compression support, you may end up having to copy around the entire document up to three times (and may end up having to allocate 3x the space, too). That is very inefficient.

There is a better way, though, that's just a tiny bit subtle. Thankfully, you have an upper bound on the size of the compressed blob. It's the uncompressed blob plus the size of the Snappy header (a Snappy library call can tell you what that is in practice). What you can do is before compressing, you allocate a varint that is long enough to encode an integer that is big enough to represent the upper limit on the compressed output size. Then you proceed to point the compressor into the buffer right after the thusly preallocated varint. After compression, you'll know the real size of the compressed blob. Now, you go back to the varint and fill it in. If the reserved space for the varint is larger than what you actually need, then thanks to the way varints work, you can simply set the high bit on the last byte of the varint, and continue to set the high bits of all following padding bytes except the last, which you set to 0 (NUL). For details on why that works, please refer to the Google ProtoBuf documentation referenced earlier. With this specially crafted varint, any normal varint parsing function will treat it as a single varint and skip right to the start of theSnappy-compressed blob. The varint is a correct varint, just not in the canonical form. With this modified plan, you should only need one extra malloc, and (beyond that which the Snappy implementation does), no extra, large memcpy operations.

AUTHOR

Yves Orton <demerphq@gmail.com>

Damian Gryski

Steffen Mueller <smueller@cpan.org>

Rafaël Garcia-Suarez

Ævar Arnfjörð Bjarmason <avar@cpan.org>

ACKNOWLEDGMENT

This protocol was originally developed for Booking.com. With approval from Booking.com, this document was generalized and published on github and CPAN, for which the authors would like to express their gratitude.

COPYRIGHT AND LICENSE

Copyright (C) 2012, 2013 by Steffen Mueller

Copyright (C) 2012, 2013 by Yves Orton

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Something went wrong with that request. Please try again.