Skip to content

Latest commit

 

History

History
359 lines (273 loc) · 15.7 KB

sereal_spec.pod

File metadata and controls

359 lines (273 loc) · 15.7 KB

NAME

Sereal - Protocol definition

SYNOPSIS

This document describes the format and encoding of a Sereal data packet.

DESCRIPTION

A serialized structure is converted into a "document". A document is made up of two parts, the header and the body.

General Points

Little Endian

All numeric data is in little endian format.

IEEE Floats

Floating points types are in IEEE format.

Varints

Heavy use is made of a variable length integer encoding commonly called a "varint" (Google calls it a Varint128). This encoding uses the high bit of each byte to signal there is another byte worth of data coming, and the last byte always having the high bit off. The data is in little endian order with the low seven bits in the first byte, and the next 7 in the next etc.

See Google's description.

Header Format

A header consists of multiple components:

<MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>
MAGIC

A "magic string" that identifies a document as being in the Sereal format. The value of this string is "=srl", and when decoded as an unsigned 32 bit integer on a little endian machine has a value of 0x6c72733d.

VERSION-TYPE

A single byte, of which the high 4 bits are used to represent the "type" of the document, and the low 4 bits used to represent the version of the Sereal protocol the document complies with.

Up until now there has only been one version of Sereal released so the low bits will be 1.

Currently only two types are defined:

0

Raw Sereal format. The data can be processed verbatim.

1

Compressed Sereal format, using Google's Snappy compression internally.

Additional compression types are envisaged and will be assigned type numbers by the maintainers of the protocol.

HEADER-SUFFIX-SIZE

The structure of the header includes support for embedding additional data. This is accomplished by specifying the length of the suffix in the header with a varint. Headers with no suffix will set this to a binary 0. This is intended for future format extensions that retain some level of compatibility for old decoders (which know how to skip the extended header due to the embedded length).

OPT-SUFFIX

The suffix may contain whatever data the encoder wishes to embed in the header. In version 1 of the protocol the decoder will never look inside this data. Later versions may introduce new rules for this field.

Body Format

The body is made up of one or more tagged data items:

<TAG> <OPT-DATA>
TAG

A tag is a single byte which specifies the type of the data being decoded.

The high bit of each tag is used to signal to the decoder that the deserialized data needs to be stored and tracked and will be reused again elsewhere in the serialization. This is sometimes called the "track flag" or the "F-bit" in code and documentation. Its status should be ignored when processing a tag, meaning code should mask off the high bit and only use the low 7 bits.

Some tags, such as POS, NEG and SHORT_BINARY contain embedded in them either the data (in the case of POS and NEG) or the length of the OPT-DATA section (in the case of SHORT_BINARY).

OPT-DATA

This field may contain an arbitrary set of bytes, either determined implicitly by the tag (such as for FLOAT), explicitly in the tag (as in SHORT_BINARY) or in a varint following the tag (such as for STRING).

Tags

Tag | Char | Dec |  Hex |     Binary | Follow
    ------------------+------+-----+------+----------- |-----------------------------------------
    POS_0             |      |   0 | 0x00 | 0b00000000 | small positive integer - value in low 4 bits (identity)
    POS_1             |      |   1 | 0x01 | 0b00000001 |
    POS_2             |      |   2 | 0x02 | 0b00000010 |
    POS_3             |      |   3 | 0x03 | 0b00000011 |
    POS_4             |      |   4 | 0x04 | 0b00000100 |
    POS_5             |      |   5 | 0x05 | 0b00000101 |
    POS_6             |      |   6 | 0x06 | 0b00000110 |
    POS_7             | "\a" |   7 | 0x07 | 0b00000111 |
    POS_8             | "\b" |   8 | 0x08 | 0b00001000 |
    POS_9             | "\t" |   9 | 0x09 | 0b00001001 |
    POS_10            | "\n" |  10 | 0x0a | 0b00001010 |
    POS_11            |      |  11 | 0x0b | 0b00001011 |
    POS_12            | "\f" |  12 | 0x0c | 0b00001100 |
    POS_13            | "\r" |  13 | 0x0d | 0b00001101 |
    POS_14            |      |  14 | 0x0e | 0b00001110 |
    POS_15            |      |  15 | 0x0f | 0b00001111 | small positive integer - value in low 4 bits (identity)
    NEG_16            |      |  16 | 0x10 | 0b00010000 | small negative integer - value in low 4 bits (k+32)
    NEG_15            |      |  17 | 0x11 | 0b00010001 |
    NEG_14            |      |  18 | 0x12 | 0b00010010 |
    NEG_13            |      |  19 | 0x13 | 0b00010011 |
    NEG_12            |      |  20 | 0x14 | 0b00010100 |
    NEG_11            |      |  21 | 0x15 | 0b00010101 |
    NEG_10            |      |  22 | 0x16 | 0b00010110 |
    NEG_9             |      |  23 | 0x17 | 0b00010111 |
    NEG_8             |      |  24 | 0x18 | 0b00011000 |
    NEG_7             |      |  25 | 0x19 | 0b00011001 |
    NEG_6             |      |  26 | 0x1a | 0b00011010 |
    NEG_5             | "\e" |  27 | 0x1b | 0b00011011 |
    NEG_4             |      |  28 | 0x1c | 0b00011100 |
    NEG_3             |      |  29 | 0x1d | 0b00011101 |
    NEG_2             |      |  30 | 0x1e | 0b00011110 |
    NEG_1             |      |  31 | 0x1f | 0b00011111 | small negative integer - value in low 4 bits (k+32)
    VARINT            | " "  |  32 | 0x20 | 0b00100000 | <VARINT> - Varint variable length integer
    ZIGZAG            | "!"  |  33 | 0x21 | 0b00100001 | <ZIGZAG-VARINT> - Zigzag variable length integer
    FLOAT             | "\"" |  34 | 0x22 | 0b00100010 | <IEEE-FLOAT>
    DOUBLE            | "#"  |  35 | 0x23 | 0b00100011 | <IEEE-DOUBLE>
    LONG_DOUBLE       | "\$" |  36 | 0x24 | 0b00100100 | <IEEE-LONG-DOUBLE>
    UNDEF             | "%"  |  37 | 0x25 | 0b00100101 | None - Perl undef
    BINARY            | "&"  |  38 | 0x26 | 0b00100110 | <LEN-VARINT> <BYTES> - binary/(latin1) string
    STR_UTF8          | "'"  |  39 | 0x27 | 0b00100111 | <LEN-VARINT> <UTF8> - utf8 string
    REFN              | "("  |  40 | 0x28 | 0b00101000 | <ITEM-TAG>    - ref to next item
    REFP              | ")"  |  41 | 0x29 | 0b00101001 | <OFFSET-VARINT> - ref to previous item stored at offset
    HASH              | "*"  |  42 | 0x2a | 0b00101010 | <COUNT-VARINT> [<KEY-TAG> <ITEM-TAG> ...] - count followed by key/value pairs
    ARRAY             | "+"  |  43 | 0x2b | 0b00101011 | <COUNT-VARINT> [<ITEM-TAG> ...] - count followed by items
    OBJECT            | ","  |  44 | 0x2c | 0b00101100 | <STR-TAG> <ITEM-TAG> - class, object-item
    OBJECTV           | "-"  |  45 | 0x2d | 0b00101101 | <OFFSET-VARINT> <ITEM-TAG> - offset of previously used classname tag - object-item
    ALIAS             | "."  |  46 | 0x2e | 0b00101110 | <OFFSET-VARINT> - alias to item defined at offset
    COPY              | "/"  |  47 | 0x2f | 0b00101111 | <OFFSET-VARINT> - copy of item defined at offset
    WEAKEN            | "0"  |  48 | 0x30 | 0b00110000 | <REF-TAG> - Weaken the following reference
    REGEXP            | "1"  |  49 | 0x31 | 0b00110001 | <PATTERN-STR-TAG> <MODIFIERS-STR-TAG>
    RESERVED_0        | "2"  |  50 | 0x32 | 0b00110010 | reserved
    RESERVED_1        | "3"  |  51 | 0x33 | 0b00110011 |
    RESERVED_2        | "4"  |  52 | 0x34 | 0b00110100 |
    RESERVED_3        | "5"  |  53 | 0x35 | 0b00110101 |
    RESERVED_4        | "6"  |  54 | 0x36 | 0b00110110 |
    RESERVED_5        | "7"  |  55 | 0x37 | 0b00110111 |
    RESERVED_6        | "8"  |  56 | 0x38 | 0b00111000 |
    RESERVED_7        | "9"  |  57 | 0x39 | 0b00111001 | reserved
    FALSE             | ":"  |  58 | 0x3a | 0b00111010 | false (PL_sv_no)
    TRUE              | ";"  |  59 | 0x3b | 0b00111011 | true  (PL_sv_yes)
    MANY              | "<"  |  60 | 0x3c | 0b00111100 | <LEN-VARINT> <TYPE-BYTE> <TAG-DATA> - repeated tag (not done yet, will be implemented in version 2)
    PACKET_START      | "="  |  61 | 0x3d | 0b00111101 | (first byte of magic string in header)
    EXTEND            | ">"  |  62 | 0x3e | 0b00111110 | <BYTE> - for additional tags
    PAD               | "?"  |  63 | 0x3f | 0b00111111 | (ignored tag, skip to next byte)
    ARRAYREF_0        | "\@" |  64 | 0x40 | 0b01000000 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    ARRAYREF_1        | "A"  |  65 | 0x41 | 0b01000001 |
    ARRAYREF_2        | "B"  |  66 | 0x42 | 0b01000010 |
    ARRAYREF_3        | "C"  |  67 | 0x43 | 0b01000011 |
    ARRAYREF_4        | "D"  |  68 | 0x44 | 0b01000100 |
    ARRAYREF_5        | "E"  |  69 | 0x45 | 0b01000101 |
    ARRAYREF_6        | "F"  |  70 | 0x46 | 0b01000110 |
    ARRAYREF_7        | "G"  |  71 | 0x47 | 0b01000111 |
    ARRAYREF_8        | "H"  |  72 | 0x48 | 0b01001000 |
    ARRAYREF_9        | "I"  |  73 | 0x49 | 0b01001001 |
    ARRAYREF_10       | "J"  |  74 | 0x4a | 0b01001010 |
    ARRAYREF_11       | "K"  |  75 | 0x4b | 0b01001011 |
    ARRAYREF_12       | "L"  |  76 | 0x4c | 0b01001100 |
    ARRAYREF_13       | "M"  |  77 | 0x4d | 0b01001101 |
    ARRAYREF_14       | "N"  |  78 | 0x4e | 0b01001110 |
    ARRAYREF_15       | "O"  |  79 | 0x4f | 0b01001111 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    HASHREF_0         | "P"  |  80 | 0x50 | 0b01010000 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    HASHREF_1         | "Q"  |  81 | 0x51 | 0b01010001 |
    HASHREF_2         | "R"  |  82 | 0x52 | 0b01010010 |
    HASHREF_3         | "S"  |  83 | 0x53 | 0b01010011 |
    HASHREF_4         | "T"  |  84 | 0x54 | 0b01010100 |
    HASHREF_5         | "U"  |  85 | 0x55 | 0b01010101 |
    HASHREF_6         | "V"  |  86 | 0x56 | 0b01010110 |
    HASHREF_7         | "W"  |  87 | 0x57 | 0b01010111 |
    HASHREF_8         | "X"  |  88 | 0x58 | 0b01011000 |
    HASHREF_9         | "Y"  |  89 | 0x59 | 0b01011001 |
    HASHREF_10        | "Z"  |  90 | 0x5a | 0b01011010 |
    HASHREF_11        | "["  |  91 | 0x5b | 0b01011011 |
    HASHREF_12        | "\\" |  92 | 0x5c | 0b01011100 |
    HASHREF_13        | "]"  |  93 | 0x5d | 0b01011101 |
    HASHREF_14        | "^"  |  94 | 0x5e | 0b01011110 |
    HASHREF_15        | "_"  |  95 | 0x5f | 0b01011111 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    SHORT_BINARY_0    | "`"  |  96 | 0x60 | 0b01100000 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
    SHORT_BINARY_1    | "a"  |  97 | 0x61 | 0b01100001 |
    SHORT_BINARY_2    | "b"  |  98 | 0x62 | 0b01100010 |
    SHORT_BINARY_3    | "c"  |  99 | 0x63 | 0b01100011 |
    SHORT_BINARY_4    | "d"  | 100 | 0x64 | 0b01100100 |
    SHORT_BINARY_5    | "e"  | 101 | 0x65 | 0b01100101 |
    SHORT_BINARY_6    | "f"  | 102 | 0x66 | 0b01100110 |
    SHORT_BINARY_7    | "g"  | 103 | 0x67 | 0b01100111 |
    SHORT_BINARY_8    | "h"  | 104 | 0x68 | 0b01101000 |
    SHORT_BINARY_9    | "i"  | 105 | 0x69 | 0b01101001 |
    SHORT_BINARY_10   | "j"  | 106 | 0x6a | 0b01101010 |
    SHORT_BINARY_11   | "k"  | 107 | 0x6b | 0b01101011 |
    SHORT_BINARY_12   | "l"  | 108 | 0x6c | 0b01101100 |
    SHORT_BINARY_13   | "m"  | 109 | 0x6d | 0b01101101 |
    SHORT_BINARY_14   | "n"  | 110 | 0x6e | 0b01101110 |
    SHORT_BINARY_15   | "o"  | 111 | 0x6f | 0b01101111 |
    SHORT_BINARY_16   | "p"  | 112 | 0x70 | 0b01110000 |
    SHORT_BINARY_17   | "q"  | 113 | 0x71 | 0b01110001 |
    SHORT_BINARY_18   | "r"  | 114 | 0x72 | 0b01110010 |
    SHORT_BINARY_19   | "s"  | 115 | 0x73 | 0b01110011 |
    SHORT_BINARY_20   | "t"  | 116 | 0x74 | 0b01110100 |
    SHORT_BINARY_21   | "u"  | 117 | 0x75 | 0b01110101 |
    SHORT_BINARY_22   | "v"  | 118 | 0x76 | 0b01110110 |
    SHORT_BINARY_23   | "w"  | 119 | 0x77 | 0b01110111 |
    SHORT_BINARY_24   | "x"  | 120 | 0x78 | 0b01111000 |
    SHORT_BINARY_25   | "y"  | 121 | 0x79 | 0b01111001 |
    SHORT_BINARY_26   | "z"  | 122 | 0x7a | 0b01111010 |
    SHORT_BINARY_27   | "{"  | 123 | 0x7b | 0b01111011 |
    SHORT_BINARY_28   | "|"  | 124 | 0x7c | 0b01111100 |
    SHORT_BINARY_29   | "}"  | 125 | 0x7d | 0b01111101 |
    SHORT_BINARY_30   | "~"  | 126 | 0x7e | 0b01111110 |
    SHORT_BINARY_31   |      | 127 | 0x7f | 0b01111111 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag

The Track Bit And Cyclic Data Structures

The protocol uses a combination of the offset of a tracked tag and the flag bit to be able to encode and reconstruct cyclic structures in a single pass.

An encoder must track duplicated items and generate the appropriate ALIAS or REFP tags to reconstruct them, and when it does so ensure that the high bit of the original tag has been set.

When a decoder encounters a tag with its flag set it will remember the offset of the tag in the output packet and the item that was decoded from that tag. At a later point in the packet there will be an ALIAS or REFP instruction which will refer to the item by its offset, and the decoder will reuse it as needed.

The COPY Tag

Sometimes it is convenient to be able to reuse a previously emitted sequence in the packet to reduce duplication. For instance a data structure with many hashes with the same keys. The COPY tag is used for this. Its argument is a varint which is the offset of a previously emitted tag, and decoders are to behave as though the tag it references was inserted into the packet stream as a replacement for the COPY tag.

Note, that in this case the track flag is not set. It is assumed the decoder can jump back to reread the tag from its location alone.

Copy tags are forbidden from referring to another COPY tag, and are also forbidden from referring to anything containing a COPY tag, with the exception that a COPY tag used as a value may refer to an tag that uses a COPY tag for a classname or hash key.

String Types

Sereal supports three string representations. Two are "encodingless" and are SHORT_BINARY and BINARY, where binary means "raw bytes". The other is STR_UTF8 which is expected to contain valid canonical UTF8 encoded unicode text data. Under normal circumstances a decoder is not expected to validate that this is actually the case, and is allowed to simply extract the data verbatim.

SHORT_BINARY stores the length of the string in the tag itself and is used for strings of less than 32 characters long. Both BINARY and STR_UTF8 use a varint to indicate the number of bytes (octets) in the string.

Hash Keys

Hash keys are always one of the string types, or a COPY tag referencing a string.

Handling objects

Objects are serialized as a class name and a tag which represents the objects data. In Perl land this will always be a reference. Mapping perl objects to other languages is left to the future.

Note that classnames MUST be a string, or a COPY tag referencing a string.

OBJECTV varints MUST reference a previously used classname, and not an arbitrary string.

AUTHOR

Yves Orton <demerphq@gmail.com>

Damian Gryski

Steffen Mueller <smueller@cpan.org>

Rafaël Garcia-Suarez

Ævar Arnfjörð Bjarmason

ACKNOWLEDGMENT

This protocol was originally developed for Booking.com. With approval from Booking.com, this document was generalized and published on github and CPAN, for which the authors would like to express their gratitude.

COPYRIGHT AND LICENSE

Copyright (C) 2012 by Steffen Mueller

Copyright (C) 2012 by Yves Orton

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

POD ERRORS

Hey! The above document had some coding errors, which are explained below:

Around line 74:

Expected text after =item, not a number