
Fast, compact, schema-less, binary serialization and deserialization oriented towards dynamic languages


Sereal - A binary serialization format

This repository is the home of the Sereal data serialization format. This format was started because the authors had technical reasons for producing a better Storable.

Before we embarked on this project we had a look at various prior art. This included a review of Google Protocol Buffers and of the MessagePack protocol. Neither suited our needs so we designed this, liberally borrowing ideas from the other projects.

We wanted to be able to serialize shared references properly. Many serialization formats do not support this out of the box.

Perl has a special type of reference called a "weakref", which is used to create cyclic reference structures that do not leak memory. We needed to handle these structures.

Perl supports aliases which are a special kind of reference which is effectively a C level pointer instead of a Perl level RV. We needed to be able to represent these as well.

Blessing a reference can be dangerous in some circumstances. We needed to be able to serialize objects safely and reliably, and we wanted a sane control mechanism for doing so.

In Perl, a regexp is a native type. We wanted to be able to serialize these at a native level without losing data such as modifiers.

We want to be able to represent common structures as compactly as is reasonable, although not to the extent that this makes the protocol error prone. This includes such things as automatically removing redundancy from the serialized structure (such as repeated hash keys or class names).

We want to be able to serialize and deserialize quickly. Some of the design decisions and trade-offs were aimed squarely at performance.

We wanted to separate the functions of serializing from deserializing so they could be upgraded independently.

We wanted the protocol to be robust against forward and backward compatibility issues. It should be possible to partially read new formats with an old decoder, and possibly to output old formats with a new encoder.

We want the format to be usable by other languages, especially dynamic languages. We hope to have a Java port soon, right Eric?

A serialized structure is converted into a "document". A document is made up of two parts, the header and the body.

Sereal is a little-endian format. Floating-point values are stored as IEEE 754 floats.

A header consists of multiple components:

   <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>

<MAGIC> is a "magic string" that identifies a document as being in the Sereal format. The value of this string is "=srl"; when decoded as an unsigned 32 bit integer on a little-endian machine it has the value 0x6c72733d.
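The integer value quoted above can be checked with a few lines of Python. This is only an illustrative sketch; the names `MAGIC` and `has_sereal_magic` are invented for the example and do not come from any Sereal implementation.

```python
import struct

MAGIC = b"=srl"  # the four magic bytes, as given in the text


def has_sereal_magic(data: bytes) -> bool:
    """Return True if data starts with the Sereal magic string."""
    return data[:4] == MAGIC


# Reading the four bytes as a little-endian unsigned 32-bit integer ("<I")
# yields the value quoted in the text.
magic_as_u32 = struct.unpack("<I", MAGIC)[0]
print(hex(magic_as_u32))  # 0x6c72733d
```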

<VERSION-TYPE> is a single byte, of which the high 4 bits represent the "type" of the document and the low 4 bits represent the version of the Sereal protocol the document complies with.

So far only one version of Sereal has been released, so the low 4 bits will hold the value 1.

Currently only two types are defined:

0: Raw Sereal format. The data can be processed verbatim.

1: Compressed Sereal format, using Google's Snappy compression internally.

Additional compression types are envisaged and will be assigned type numbers by the maintainers of the protocol.
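Separating the two 4-bit fields of the version-type byte is a matter of one shift and one mask. The following is a minimal sketch; `split_version_type` is a hypothetical helper name chosen for this example.

```python
from typing import Tuple


def split_version_type(byte: int) -> Tuple[int, int]:
    """Split the VERSION-TYPE byte into (doc_type, version)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value")
    doc_type = byte >> 4    # high 4 bits: document type
    version = byte & 0x0F   # low 4 bits: protocol version
    return doc_type, version


print(split_version_type(0x01))  # (0, 1)
print(split_version_type(0x11))  # (1, 1)
```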

The header can carry additional arbitrary data. <HEADER-SUFFIX-SIZE> is a varint that specifies the length of this data, the suffix. Headers with no suffix set it to 0.

<OPT-SUFFIX> may contain whatever data the encoder wishes to embed in the header. In version 1 of the protocol the decoder will never look inside this data. Later versions may introduce new rules for this field.
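Putting the header components together, a reader might walk the header as sketched below. Two things here are assumptions rather than statements from this document: the varint is taken to be the Protocol Buffers style base-128 encoding (seven payload bits per byte, least significant group first, high bit set on every byte except the last), and the suffix length is taken to count bytes. The function names `read_varint` and `parse_header` are hypothetical.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint at pos; return (value, next_pos).

    Assumes the Protocol Buffers style encoding (see lead-in).
    """
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # low 7 bits carry payload
        if not byte & 0x80:               # high bit clear: last byte
            return value, pos
        shift += 7


def parse_header(doc: bytes) -> dict:
    """Parse <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>."""
    if doc[:4] != b"=srl":
        raise ValueError("not a Sereal document")
    if len(doc) < 5:
        raise ValueError("truncated header")
    version_type = doc[4]
    suffix_len, pos = read_varint(doc, 5)
    suffix = doc[pos:pos + suffix_len]
    if len(suffix) != suffix_len:
        raise ValueError("truncated header suffix")
    return {
        "type": version_type >> 4,
        "version": version_type & 0x0F,
        "suffix": suffix,                  # opaque to a version 1 decoder
        "body_offset": pos + suffix_len,   # where the body begins
    }
```

For example, `parse_header(b"=srl\x01\x00\x05")` reports version 1, an empty suffix, and a body starting at offset 6.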

The body is made up of one or more tagged data items:

    <TAG> <OPT-DATA>

A tag is a single byte which specifies the type of the data being decoded.

The high bit of each tag signals to the decoder that the deserialized item must be stored and tracked, because it will be referenced again elsewhere in the document.
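Since the high bit is a flag, a decoder would typically mask it off before looking up the tag. The sketch below shows that split; `TRACK_FLAG`, `TAG_MASK` and `split_tag` are names made up for this example.

```python
from typing import Tuple

TRACK_FLAG = 0x80  # high bit: item must be stored and tracked
TAG_MASK = 0x7F    # remaining seven bits identify the tag


def split_tag(byte: int) -> Tuple[int, bool]:
    """Return (tag, tracked) for one tag byte."""
    return byte & TAG_MASK, bool(byte & TRACK_FLAG)


print(split_tag(0x2B))  # (43, False)
print(split_tag(0xAB))  # (43, True)
```

Both bytes decode to tag 43 (ARRAY in the table below); only the second asks the decoder to remember the item.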

Some tags embed information in the tag byte itself: POS and NEG carry the data, and ASCII carries the length of the OPT-DATA section.

The OPT-DATA field may contain an arbitrary sequence of bytes whose length is determined implicitly by the tag (as for FLOAT), explicitly by bits in the tag (as for ASCII), or by a varint following the tag (as for STRING).
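As an illustration of tags that carry their data or length inline, the sketch below decodes POS, NEG and ASCII items using the value ranges from the tag table in this document. It follows the NEG numbering exactly as that table lists it (tag 16 is NEG_1, tag 31 is NEG_16). The function name `decode_small` is hypothetical, and every other tag is deliberately left unhandled.

```python
def decode_small(buf: bytes, pos: int):
    """Decode one POS, NEG or ASCII item at pos; return (value, next_pos)."""
    tag = buf[pos] & 0x7F            # ignore the tracking flag
    pos += 1
    if tag <= 0x0F:                  # POS_0 .. POS_15: value is the tag itself
        return tag, pos
    if tag <= 0x1F:                  # NEG_1 .. NEG_16 as numbered in the table
        return 15 - tag, pos
    if 0x60 <= tag <= 0x7F:          # ASCII_0 .. ASCII_31
        length = tag & 0x1F          # length lives in the low 5 bits
        data = buf[pos:pos + length]
        if len(data) != length:
            raise ValueError("truncated string")
        return data, pos + length
    raise ValueError(f"tag 0x{tag:02x} is not handled by this sketch")


print(decode_small(b"\x07", 0))     # (7, 1)
print(decode_small(b"\x10", 0))     # (-1, 1)
print(decode_small(b"\x63foo", 0))  # (b'foo', 4)
```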

              Tag | Char | Dec |  Hex |     Binary | Follow
    --------------+------+-----+------+------------+-----------------------------------------
    POS_0         |      |   0 | 0x00 | 0b00000000 | small positive integer - value in low 4 bits (identity)
    POS_1         |      |   1 | 0x01 | 0b00000001 |
    POS_2         |      |   2 | 0x02 | 0b00000010 |
    POS_3         |      |   3 | 0x03 | 0b00000011 |
    POS_4         |      |   4 | 0x04 | 0b00000100 |
    POS_5         |      |   5 | 0x05 | 0b00000101 |
    POS_6         |      |   6 | 0x06 | 0b00000110 |
    POS_7         | "\a" |   7 | 0x07 | 0b00000111 |
    POS_8         | "\b" |   8 | 0x08 | 0b00001000 |
    POS_9         | "\t" |   9 | 0x09 | 0b00001001 |
    POS_10        | "\n" |  10 | 0x0a | 0b00001010 |
    POS_11        |      |  11 | 0x0b | 0b00001011 |
    POS_12        | "\f" |  12 | 0x0c | 0b00001100 |
    POS_13        | "\r" |  13 | 0x0d | 0b00001101 |
    POS_14        |      |  14 | 0x0e | 0b00001110 |
    POS_15        |      |  15 | 0x0f | 0b00001111 | small positive integer - value in low 4 bits (identity)
    NEG_1         |      |  16 | 0x10 | 0b00010000 | small negative integer - value is 15 - k, where k is the tag byte
    NEG_2         |      |  17 | 0x11 | 0b00010001 |
    NEG_3         |      |  18 | 0x12 | 0b00010010 |
    NEG_4         |      |  19 | 0x13 | 0b00010011 |
    NEG_5         |      |  20 | 0x14 | 0b00010100 |
    NEG_6         |      |  21 | 0x15 | 0b00010101 |
    NEG_7         |      |  22 | 0x16 | 0b00010110 |
    NEG_8         |      |  23 | 0x17 | 0b00010111 |
    NEG_9         |      |  24 | 0x18 | 0b00011000 |
    NEG_10        |      |  25 | 0x19 | 0b00011001 |
    NEG_11        |      |  26 | 0x1a | 0b00011010 |
    NEG_12        | "\e" |  27 | 0x1b | 0b00011011 |
    NEG_13        |      |  28 | 0x1c | 0b00011100 |
    NEG_14        |      |  29 | 0x1d | 0b00011101 |
    NEG_15        |      |  30 | 0x1e | 0b00011110 |
    NEG_16        |      |  31 | 0x1f | 0b00011111 | small negative integer - value is 15 - k, where k is the tag byte
    VARINT        | " "  |  32 | 0x20 | 0b00100000 | <VARINT> - Varint variable length integer
    ZIGZAG        | "!"  |  33 | 0x21 | 0b00100001 | <ZIGZAG-VARINT> - Zigzag variable length integer
    FLOAT         | "\"" |  34 | 0x22 | 0b00100010 | <IEEE-FLOAT>
    DOUBLE        | "#"  |  35 | 0x23 | 0b00100011 | <IEEE-DOUBLE>
    LONG_DOUBLE   | "\$" |  36 | 0x24 | 0b00100100 | <IEEE-LONG-DOUBLE>
    UNDEF         | "%"  |  37 | 0x25 | 0b00100101 | None - Perl undef
    STRING        | "&"  |  38 | 0x26 | 0b00100110 | <LEN-VARINT> <BYTES> - binary/(latin1) string
    STRING_UTF8   | "'"  |  39 | 0x27 | 0b00100111 | <LEN-VARINT> <UTF8> - utf8 string
    REFN          | "("  |  40 | 0x28 | 0b00101000 | <ITEM-TAG>    - ref to next item
    REFP          | ")"  |  41 | 0x29 | 0b00101001 | <OFFSET-VARINT> - ref to previous item stored at offset
    HASH          | "*"  |  42 | 0x2a | 0b00101010 | <COUNT-VARINT> [<KEY-TAG> <ITEM-TAG> ...] - count followed by key/value pairs
    ARRAY         | "+"  |  43 | 0x2b | 0b00101011 | <COUNT-VARINT> [<ITEM-TAG> ...] - count followed by items
    BLESS         | ","  |  44 | 0x2c | 0b00101100 | <ITEM-TAG> <STR-TAG> - item / class
    BLESSV        | "-"  |  45 | 0x2d | 0b00101101 | <OFFSET-VARINT> <ITEM-TAG> - class at offset - item to bless
    ALIAS         | "."  |  46 | 0x2e | 0b00101110 | <OFFSET-VARINT> - alias to item defined at offset
    COPY          | "/"  |  47 | 0x2f | 0b00101111 | <OFFSET-VARINT> - copy of alias defined at offset
    WEAKEN        | "0"  |  48 | 0x30 | 0b00110000 | <REF-TAG> - Weaken the following reference
    REGEXP        | "1"  |  49 | 0x31 | 0b00110001 | <PATTERN-STR-TAG> <MODIFIERS-STR-TAG>
    INT1          | "2"  |  50 | 0x32 | 0b00110010 | <BYTE>  - one byte integer   #proposed#
    INT2          | "3"  |  51 | 0x33 | 0b00110011 | <BYTES> - two byte integer   #proposed#
    INT3          | "4"  |  52 | 0x34 | 0b00110100 | <BYTES> - three byte integer #proposed#
    INT4          | "5"  |  53 | 0x35 | 0b00110101 | <BYTES> - four byte integer  #proposed#
    UINT1         | "6"  |  54 | 0x36 | 0b00110110 | <BYTE>  - one byte unsigned integer   #proposed#
    UINT2         | "7"  |  55 | 0x37 | 0b00110111 | <BYTES> - two byte unsigned integer   #proposed#
    UINT3         | "8"  |  56 | 0x38 | 0b00111000 | <BYTES> - three byte unsigned integer #proposed#
    UINT4         | "9"  |  57 | 0x39 | 0b00111001 | <BYTES> - four byte unsigned integer  #proposed#
    FALSE         | ":"  |  58 | 0x3a | 0b00111010 | false #proposed#
    TRUE          | ";"  |  59 | 0x3b | 0b00111011 | true  #proposed#
    REPEATED      | "<"  |  60 | 0x3c | 0b00111100 | <LEN-VARINT> <TAG-BYTE> <TAG-DATA> - repeated tag (unimplemented)
    PACKET_START  | "="  |  61 | 0x3d | 0b00111101 | (first byte of magic string in header)
    EXTEND        | ">"  |  62 | 0x3e | 0b00111110 | <BYTE> - for additional tags
    PAD           | "?"  |  63 | 0x3f | 0b00111111 | (ignored tag, skip to next byte)
    ARRAYREF_0    | "\@" |  64 | 0x40 | 0b01000000 | [<ITEM-TAG> ...] - count of items in low 4 bits
    ARRAYREF_1    | "A"  |  65 | 0x41 | 0b01000001 |
    ARRAYREF_2    | "B"  |  66 | 0x42 | 0b01000010 |
    ARRAYREF_3    | "C"  |  67 | 0x43 | 0b01000011 |
    ARRAYREF_4    | "D"  |  68 | 0x44 | 0b01000100 |
    ARRAYREF_5    | "E"  |  69 | 0x45 | 0b01000101 |
    ARRAYREF_6    | "F"  |  70 | 0x46 | 0b01000110 |
    ARRAYREF_7    | "G"  |  71 | 0x47 | 0b01000111 |
    ARRAYREF_8    | "H"  |  72 | 0x48 | 0b01001000 |
    ARRAYREF_9    | "I"  |  73 | 0x49 | 0b01001001 |
    ARRAYREF_10   | "J"  |  74 | 0x4a | 0b01001010 |
    ARRAYREF_11   | "K"  |  75 | 0x4b | 0b01001011 |
    ARRAYREF_12   | "L"  |  76 | 0x4c | 0b01001100 |
    ARRAYREF_13   | "M"  |  77 | 0x4d | 0b01001101 |
    ARRAYREF_14   | "N"  |  78 | 0x4e | 0b01001110 |
    ARRAYREF_15   | "O"  |  79 | 0x4f | 0b01001111 | [<ITEM-TAG> ...] - count of items in low 4 bits
    HASHREF_0     | "P"  |  80 | 0x50 | 0b01010000 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs
    HASHREF_1     | "Q"  |  81 | 0x51 | 0b01010001 |
    HASHREF_2     | "R"  |  82 | 0x52 | 0b01010010 |
    HASHREF_3     | "S"  |  83 | 0x53 | 0b01010011 |
    HASHREF_4     | "T"  |  84 | 0x54 | 0b01010100 |
    HASHREF_5     | "U"  |  85 | 0x55 | 0b01010101 |
    HASHREF_6     | "V"  |  86 | 0x56 | 0b01010110 |
    HASHREF_7     | "W"  |  87 | 0x57 | 0b01010111 |
    HASHREF_8     | "X"  |  88 | 0x58 | 0b01011000 |
    HASHREF_9     | "Y"  |  89 | 0x59 | 0b01011001 |
    HASHREF_10    | "Z"  |  90 | 0x5a | 0b01011010 |
    HASHREF_11    | "["  |  91 | 0x5b | 0b01011011 |
    HASHREF_12    | "\\" |  92 | 0x5c | 0b01011100 |
    HASHREF_13    | "]"  |  93 | 0x5d | 0b01011101 |
    HASHREF_14    | "^"  |  94 | 0x5e | 0b01011110 |
    HASHREF_15    | "_"  |  95 | 0x5f | 0b01011111 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs
    ASCII_0       | "`"  |  96 | 0x60 | 0b01100000 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
    ASCII_1       | "a"  |  97 | 0x61 | 0b01100001 |
    ASCII_2       | "b"  |  98 | 0x62 | 0b01100010 |
    ASCII_3       | "c"  |  99 | 0x63 | 0b01100011 |
    ASCII_4       | "d"  | 100 | 0x64 | 0b01100100 |
    ASCII_5       | "e"  | 101 | 0x65 | 0b01100101 |
    ASCII_6       | "f"  | 102 | 0x66 | 0b01100110 |
    ASCII_7       | "g"  | 103 | 0x67 | 0b01100111 |
    ASCII_8       | "h"  | 104 | 0x68 | 0b01101000 |
    ASCII_9       | "i"  | 105 | 0x69 | 0b01101001 |
    ASCII_10      | "j"  | 106 | 0x6a | 0b01101010 |
    ASCII_11      | "k"  | 107 | 0x6b | 0b01101011 |
    ASCII_12      | "l"  | 108 | 0x6c | 0b01101100 |
    ASCII_13      | "m"  | 109 | 0x6d | 0b01101101 |
    ASCII_14      | "n"  | 110 | 0x6e | 0b01101110 |
    ASCII_15      | "o"  | 111 | 0x6f | 0b01101111 |
    ASCII_16      | "p"  | 112 | 0x70 | 0b01110000 |
    ASCII_17      | "q"  | 113 | 0x71 | 0b01110001 |
    ASCII_18      | "r"  | 114 | 0x72 | 0b01110010 |
    ASCII_19      | "s"  | 115 | 0x73 | 0b01110011 |
    ASCII_20      | "t"  | 116 | 0x74 | 0b01110100 |
    ASCII_21      | "u"  | 117 | 0x75 | 0b01110101 |
    ASCII_22      | "v"  | 118 | 0x76 | 0b01110110 |
    ASCII_23      | "w"  | 119 | 0x77 | 0b01110111 |
    ASCII_24      | "x"  | 120 | 0x78 | 0b01111000 |
    ASCII_25      | "y"  | 121 | 0x79 | 0b01111001 |
    ASCII_26      | "z"  | 122 | 0x7a | 0b01111010 |
    ASCII_27      | "{"  | 123 | 0x7b | 0b01111011 |
    ASCII_28      | "|"  | 124 | 0x7c | 0b01111100 |
    ASCII_29      | "}"  | 125 | 0x7d | 0b01111101 |
    ASCII_30      | "~"  | 126 | 0x7e | 0b01111110 |
    ASCII_31      |      | 127 | 0x7f | 0b01111111 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
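The table splits the seven tag bits into a few contiguous ranges, which makes mapping a byte back to its tag name straightforward. The sketch below mirrors the table as printed here, including the tags marked as proposed or unimplemented; `FIXED_TAGS` and `tag_name` are hypothetical names for this example only.

```python
# Names for tags 0x20 .. 0x3f, in table order.
FIXED_TAGS = [
    "VARINT", "ZIGZAG", "FLOAT", "DOUBLE", "LONG_DOUBLE", "UNDEF",
    "STRING", "STRING_UTF8", "REFN", "REFP", "HASH", "ARRAY",
    "BLESS", "BLESSV", "ALIAS", "COPY", "WEAKEN", "REGEXP",
    "INT1", "INT2", "INT3", "INT4", "UINT1", "UINT2", "UINT3", "UINT4",
    "FALSE", "TRUE", "REPEATED", "PACKET_START", "EXTEND", "PAD",
]


def tag_name(byte: int) -> str:
    """Return the table name for a tag byte, ignoring the tracking flag."""
    tag = byte & 0x7F
    if tag < 0x10:
        return f"POS_{tag}"
    if tag < 0x20:
        return f"NEG_{tag - 15}"          # 0x10 is NEG_1, 0x1f is NEG_16
    if tag < 0x40:
        return FIXED_TAGS[tag - 0x20]
    if tag < 0x50:
        return f"ARRAYREF_{tag & 0x0F}"   # count in low 4 bits
    if tag < 0x60:
        return f"HASHREF_{tag & 0x0F}"    # count in low 4 bits
    return f"ASCII_{tag & 0x1F}"          # length in low 5 bits


print(tag_name(0x2B))  # ARRAY
print(tag_name(0x43))  # ARRAYREF_3
print(tag_name(0x7F))  # ASCII_31
```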

Yves Orton <demerphq@gmail.com>

Damian Gryski

Steffen Mueller <smueller@cpan.org>

Rafaël Garcia-Suarez

Ævar Arnfjörð Bjarmason

This protocol was originally developed for Booking.com. With approval from Booking.com, this module was generalized and published on CPAN, for which the authors would like to express their gratitude.

Copyright (C) 2012 by Steffen Mueller

Copyright (C) 2012 by Yves Orton
