split out the spec from the README

1 parent 29bd6a4 commit 7d4d8f8fb7e96feb88c64064dca53ecd1e072eaf @demerphq demerphq committed Sep 8, 2012
Showing with 326 additions and 284 deletions.
  1. +1 −284 README.pod
  2. +325 −0 sereal_spec.pod
@@ -83,290 +83,7 @@ languages. We hope to have a Java port soon, right Eric?
=head1 SPECIFICATION
-A serialized structure is converted into a "document". A document is made
-up of two parts, the header and the body.
-
-=head2 General Points
-
-=over 4
-
-=item Little Endian
-
-All numeric data is in little endian format.
-
-=item IEEE Floats
-
-Floating point types are in IEEE 754 format.
-
-=item Varints
-
-Heavy use is made of a variable length integer encoding commonly called
-a "varint" (Google calls it a Varint128). This encoding uses the high bit
-of each byte to signal that another byte of data follows; the last byte
-always has the high bit off. The data is in little endian order, with the
-low seven bits in the first byte, the next seven bits in the second byte,
-and so on.
-
-See L<Google's description|https://developers.google.com/protocol-buffers/docs/encoding#varints>.
-
-=back
-
-=head2 Header Format
-
-A header consists of multiple components:
-
- <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>
-
-=over 4
-
-=item MAGIC
-
-A "magic string" that identifies a document as being in the Sereal format.
-The value of this string is "=srl", which, when decoded as an unsigned
-32 bit integer on a little endian machine, has the value 0x6c72733d.
-
-=item VERSION-TYPE
-
-A single byte, of which the high 4 bits are used to represent the "type"
-of the document, and the low 4 bits used to represent the version of the
-Sereal protocol the document complies with.
-
-So far only one version of Sereal has been released, so the low 4 bits
-will be 1.
-
-Currently only two types are defined:
-
-=over 4
-
-=item 0
-
-Raw Sereal format. The data can be processed verbatim.
-
-=item 1
-
-Compressed Sereal format, using Google's Snappy compression internally.
-
-=back
-
-Additional compression types are envisaged and will be assigned type
-numbers by the maintainers of the protocol.
-
-=item HEADER-SUFFIX-SIZE
-
-The header includes support for additional arbitrary data to be embedded
-in the header. This is accomplished by specifying the length of the suffix
-in the header with a varint. Headers with no suffix will set this to 0.
-
-=item OPT-SUFFIX
-
-The suffix may contain whatever data the encoder wishes to embed in the
-header. In version 1 of the protocol the decoder will never look inside
-this data. Later versions may introduce new rules for this field.
-
-=back
-
-=head2 Body Format
-
-The body is made up of one or more tagged data items:
-
- <TAG> <OPT-DATA>
-
-=over 4
-
-=item TAG
-
-A tag is a single byte which specifies the type of the data being decoded.
-
-The high bit of each tag is used to signal to the decoder that the
-deserialized data needs to be stored and tracked and will be reused again
-elsewhere in the serialization. This is sometimes called the "track flag"
-or the "F-bit" in code and documentation. Its status should be ignored
-when processing a tag, meaning code should mask off the high bit and
-only use the low 7 bits.
-
-Some tags, such as POS, NEG and SHORT_BINARY contain embedded in them
-either the data (in the case of POS and NEG) or the length of the
-OPT-DATA section (in the case of SHORT_BINARY).
-
-=item OPT-DATA
-
-This field may contain an arbitrary set of bytes, whose length is either
-determined implicitly by the tag (such as for FLOAT), explicitly in the tag
-(as in SHORT_BINARY), or by a varint following the tag (such as for BINARY
-and STR_UTF8).
-
-=back
-
-=head3 Tags
-
-=for autoupdater start
-
-
- Tag | Char | Dec | Hex | Binary | Follow
- ------------------+------+-----+------+------------+-----------------------------------------
- POS_0 | | 0 | 0x00 | 0b00000000 | small positive integer - value in low 4 bits (identity)
- POS_1 | | 1 | 0x01 | 0b00000001 |
- POS_2 | | 2 | 0x02 | 0b00000010 |
- POS_3 | | 3 | 0x03 | 0b00000011 |
- POS_4 | | 4 | 0x04 | 0b00000100 |
- POS_5 | | 5 | 0x05 | 0b00000101 |
- POS_6 | | 6 | 0x06 | 0b00000110 |
- POS_7 | "\a" | 7 | 0x07 | 0b00000111 |
- POS_8 | "\b" | 8 | 0x08 | 0b00001000 |
- POS_9 | "\t" | 9 | 0x09 | 0b00001001 |
- POS_10 | "\n" | 10 | 0x0a | 0b00001010 |
- POS_11 | | 11 | 0x0b | 0b00001011 |
- POS_12 | "\f" | 12 | 0x0c | 0b00001100 |
- POS_13 | "\r" | 13 | 0x0d | 0b00001101 |
- POS_14 | | 14 | 0x0e | 0b00001110 |
- POS_15 | | 15 | 0x0f | 0b00001111 | small positive integer - value in low 4 bits (identity)
- NEG_16 | | 16 | 0x10 | 0b00010000 | small negative integer - value in low 4 bits (k+32)
- NEG_15 | | 17 | 0x11 | 0b00010001 |
- NEG_14 | | 18 | 0x12 | 0b00010010 |
- NEG_13 | | 19 | 0x13 | 0b00010011 |
- NEG_12 | | 20 | 0x14 | 0b00010100 |
- NEG_11 | | 21 | 0x15 | 0b00010101 |
- NEG_10 | | 22 | 0x16 | 0b00010110 |
- NEG_9 | | 23 | 0x17 | 0b00010111 |
- NEG_8 | | 24 | 0x18 | 0b00011000 |
- NEG_7 | | 25 | 0x19 | 0b00011001 |
- NEG_6 | | 26 | 0x1a | 0b00011010 |
- NEG_5 | "\e" | 27 | 0x1b | 0b00011011 |
- NEG_4 | | 28 | 0x1c | 0b00011100 |
- NEG_3 | | 29 | 0x1d | 0b00011101 |
- NEG_2 | | 30 | 0x1e | 0b00011110 |
- NEG_1 | | 31 | 0x1f | 0b00011111 | small negative integer - value in low 4 bits (k+32)
- VARINT | " " | 32 | 0x20 | 0b00100000 | <VARINT> - Varint variable length integer
- ZIGZAG | "!" | 33 | 0x21 | 0b00100001 | <ZIGZAG-VARINT> - Zigzag variable length integer
- FLOAT | "\"" | 34 | 0x22 | 0b00100010 | <IEEE-FLOAT>
- DOUBLE | "#" | 35 | 0x23 | 0b00100011 | <IEEE-DOUBLE>
- LONG_DOUBLE | "\$" | 36 | 0x24 | 0b00100100 | <IEEE-LONG-DOUBLE>
- UNDEF | "%" | 37 | 0x25 | 0b00100101 | None - Perl undef
- BINARY | "&" | 38 | 0x26 | 0b00100110 | <LEN-VARINT> <BYTES> - binary/(latin1) string
- STR_UTF8 | "'" | 39 | 0x27 | 0b00100111 | <LEN-VARINT> <UTF8> - utf8 string
- REFN | "(" | 40 | 0x28 | 0b00101000 | <ITEM-TAG> - ref to next item
- REFP | ")" | 41 | 0x29 | 0b00101001 | <OFFSET-VARINT> - ref to previous item stored at offset
- HASH | "*" | 42 | 0x2a | 0b00101010 | <COUNT-VARINT> [<KEY-TAG> <ITEM-TAG> ...] - count followed by key/value pairs
- ARRAY | "+" | 43 | 0x2b | 0b00101011 | <COUNT-VARINT> [<ITEM-TAG> ...] - count followed by items
- OBJECT | "," | 44 | 0x2c | 0b00101100 | <STR-TAG> <ITEM-TAG> - class, object-item
- OBJECTV | "-" | 45 | 0x2d | 0b00101101 | <OFFSET-VARINT> <ITEM-TAG> - class name at offset - object-item
- ALIAS | "." | 46 | 0x2e | 0b00101110 | <OFFSET-VARINT> - alias to item defined at offset
- COPY | "/" | 47 | 0x2f | 0b00101111 | <OFFSET-VARINT> - copy of item defined at offset
- WEAKEN | "0" | 48 | 0x30 | 0b00110000 | <REF-TAG> - Weaken the following reference
- REGEXP | "1" | 49 | 0x31 | 0b00110001 | <PATTERN-STR-TAG> <MODIFIERS-STR-TAG>
- RESERVED_0 | "2" | 50 | 0x32 | 0b00110010 | reserved
- RESERVED_1 | "3" | 51 | 0x33 | 0b00110011 |
- RESERVED_2 | "4" | 52 | 0x34 | 0b00110100 |
- RESERVED_3 | "5" | 53 | 0x35 | 0b00110101 |
- RESERVED_4 | "6" | 54 | 0x36 | 0b00110110 |
- RESERVED_5 | "7" | 55 | 0x37 | 0b00110111 |
- RESERVED_6 | "8" | 56 | 0x38 | 0b00111000 |
- RESERVED_7 | "9" | 57 | 0x39 | 0b00111001 | reserved
- FALSE | ":" | 58 | 0x3a | 0b00111010 | false (PL_sv_no)
- TRUE | ";" | 59 | 0x3b | 0b00111011 | true (PL_sv_yes)
- MANY | "<" | 60 | 0x3c | 0b00111100 | <LEN-VARINT> <TYPE-BYTE> <TAG-DATA> - repeated tag (unimplemented)
- PACKET_START | "=" | 61 | 0x3d | 0b00111101 | (first byte of magic string in header)
- EXTEND | ">" | 62 | 0x3e | 0b00111110 | <BYTE> - for additional tags
- PAD | "?" | 63 | 0x3f | 0b00111111 | (ignored tag, skip to next byte)
- ARRAYREF_0 | "\@" | 64 | 0x40 | 0b01000000 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
- ARRAYREF_1 | "A" | 65 | 0x41 | 0b01000001 |
- ARRAYREF_2 | "B" | 66 | 0x42 | 0b01000010 |
- ARRAYREF_3 | "C" | 67 | 0x43 | 0b01000011 |
- ARRAYREF_4 | "D" | 68 | 0x44 | 0b01000100 |
- ARRAYREF_5 | "E" | 69 | 0x45 | 0b01000101 |
- ARRAYREF_6 | "F" | 70 | 0x46 | 0b01000110 |
- ARRAYREF_7 | "G" | 71 | 0x47 | 0b01000111 |
- ARRAYREF_8 | "H" | 72 | 0x48 | 0b01001000 |
- ARRAYREF_9 | "I" | 73 | 0x49 | 0b01001001 |
- ARRAYREF_10 | "J" | 74 | 0x4a | 0b01001010 |
- ARRAYREF_11 | "K" | 75 | 0x4b | 0b01001011 |
- ARRAYREF_12 | "L" | 76 | 0x4c | 0b01001100 |
- ARRAYREF_13 | "M" | 77 | 0x4d | 0b01001101 |
- ARRAYREF_14 | "N" | 78 | 0x4e | 0b01001110 |
- ARRAYREF_15 | "O" | 79 | 0x4f | 0b01001111 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
- HASHREF_0 | "P" | 80 | 0x50 | 0b01010000 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
- HASHREF_1 | "Q" | 81 | 0x51 | 0b01010001 |
- HASHREF_2 | "R" | 82 | 0x52 | 0b01010010 |
- HASHREF_3 | "S" | 83 | 0x53 | 0b01010011 |
- HASHREF_4 | "T" | 84 | 0x54 | 0b01010100 |
- HASHREF_5 | "U" | 85 | 0x55 | 0b01010101 |
- HASHREF_6 | "V" | 86 | 0x56 | 0b01010110 |
- HASHREF_7 | "W" | 87 | 0x57 | 0b01010111 |
- HASHREF_8 | "X" | 88 | 0x58 | 0b01011000 |
- HASHREF_9 | "Y" | 89 | 0x59 | 0b01011001 |
- HASHREF_10 | "Z" | 90 | 0x5a | 0b01011010 |
- HASHREF_11 | "[" | 91 | 0x5b | 0b01011011 |
- HASHREF_12 | "\\" | 92 | 0x5c | 0b01011100 |
- HASHREF_13 | "]" | 93 | 0x5d | 0b01011101 |
- HASHREF_14 | "^" | 94 | 0x5e | 0b01011110 |
- HASHREF_15 | "_" | 95 | 0x5f | 0b01011111 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
- SHORT_BINARY_0 | "`" | 96 | 0x60 | 0b01100000 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
- SHORT_BINARY_1 | "a" | 97 | 0x61 | 0b01100001 |
- SHORT_BINARY_2 | "b" | 98 | 0x62 | 0b01100010 |
- SHORT_BINARY_3 | "c" | 99 | 0x63 | 0b01100011 |
- SHORT_BINARY_4 | "d" | 100 | 0x64 | 0b01100100 |
- SHORT_BINARY_5 | "e" | 101 | 0x65 | 0b01100101 |
- SHORT_BINARY_6 | "f" | 102 | 0x66 | 0b01100110 |
- SHORT_BINARY_7 | "g" | 103 | 0x67 | 0b01100111 |
- SHORT_BINARY_8 | "h" | 104 | 0x68 | 0b01101000 |
- SHORT_BINARY_9 | "i" | 105 | 0x69 | 0b01101001 |
- SHORT_BINARY_10 | "j" | 106 | 0x6a | 0b01101010 |
- SHORT_BINARY_11 | "k" | 107 | 0x6b | 0b01101011 |
- SHORT_BINARY_12 | "l" | 108 | 0x6c | 0b01101100 |
- SHORT_BINARY_13 | "m" | 109 | 0x6d | 0b01101101 |
- SHORT_BINARY_14 | "n" | 110 | 0x6e | 0b01101110 |
- SHORT_BINARY_15 | "o" | 111 | 0x6f | 0b01101111 |
- SHORT_BINARY_16 | "p" | 112 | 0x70 | 0b01110000 |
- SHORT_BINARY_17 | "q" | 113 | 0x71 | 0b01110001 |
- SHORT_BINARY_18 | "r" | 114 | 0x72 | 0b01110010 |
- SHORT_BINARY_19 | "s" | 115 | 0x73 | 0b01110011 |
- SHORT_BINARY_20 | "t" | 116 | 0x74 | 0b01110100 |
- SHORT_BINARY_21 | "u" | 117 | 0x75 | 0b01110101 |
- SHORT_BINARY_22 | "v" | 118 | 0x76 | 0b01110110 |
- SHORT_BINARY_23 | "w" | 119 | 0x77 | 0b01110111 |
- SHORT_BINARY_24 | "x" | 120 | 0x78 | 0b01111000 |
- SHORT_BINARY_25 | "y" | 121 | 0x79 | 0b01111001 |
- SHORT_BINARY_26 | "z" | 122 | 0x7a | 0b01111010 |
- SHORT_BINARY_27 | "{" | 123 | 0x7b | 0b01111011 |
- SHORT_BINARY_28 | "|" | 124 | 0x7c | 0b01111100 |
- SHORT_BINARY_29 | "}" | 125 | 0x7d | 0b01111101 |
- SHORT_BINARY_30 | "~" | 126 | 0x7e | 0b01111110 |
- SHORT_BINARY_31 | | 127 | 0x7f | 0b01111111 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
-
-=for autoupdater stop
-
-=head4 The Track Bit And Cyclic Data Structures
-
-The protocol uses a combination of the offset of a tracked tag and the
-flag bit to be able to encode and reconstruct cyclic structures in a single
-pass.
-
-An encoder must track duplicated items and generate the appropriate ALIAS or
-REFP tags to reconstruct them, and when it does so ensure that the high
-bit of the original tag has been set.
-
-When a decoder encounters a tag with its flag set it will remember the
-offset of the tag in the output packet and the item that was decoded from
-that tag. At a later point in the packet there will be an ALIAS or REFP
-instruction which will refer to the item by its offset, and the decoder
-will reuse it as needed.
-
-=head4 The COPY Tag
-
-Sometimes it is convenient to be able to reuse a previously emitted
-sequence in the packet to reduce duplication, for instance in a data
-structure with many hashes that share the same keys. The COPY tag is used
-for this. Its argument is a varint which is the offset of a previously
-emitted tag, and decoders are to behave as though the tag it references
-was inserted into the packet stream as a replacement for the COPY tag.
-
-Note that in this case the track flag is B<not> set. It is assumed the
-decoder can jump back to reread the tag from its location alone.
-
-=head4 Handling objects
-
-Objects are serialized as a class name and a tag which represents the
-object's data. In Perl land this will always be a reference. Mapping Perl
-objects to other languages is left to the future.
+You can find the specification at L<sereal_spec.pod|sereal_spec.pod>
=head1 AUTHOR
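
The header layout and tag conventions described in the text this commit moves out of the README can be illustrated with a short Python sketch. It is not taken from any Sereal implementation: the names read_varint, parse_header and classify_tag are hypothetical, error handling for truncated input is omitted, and only the rules stated in the spec text above are assumed (magic string "=srl", type in the high 4 bits and version in the low 4 bits of the next byte, a varint suffix length, and the track flag in the high bit of each tag).

```python
MAGIC = b"=srl"  # magic string given in the spec text


def read_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next_pos).

    Low seven bits come first; a set high bit means another byte follows.
    """
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_header(buf):
    """Split <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX> (hypothetical helper)."""
    if buf[:4] != MAGIC:
        raise ValueError("not a Sereal document")
    version_type = buf[4]
    suffix_len, pos = read_varint(buf, 5)
    return {
        "type": version_type >> 4,       # high 4 bits: 0 = raw, 1 = Snappy
        "version": version_type & 0x0F,  # low 4 bits: protocol version
        "suffix": buf[pos:pos + suffix_len],
        "body_offset": pos + suffix_len,
    }


def classify_tag(byte):
    """Return (tracked, kind, embedded_value) for one tag byte (hypothetical helper)."""
    tracked = bool(byte & 0x80)  # track flag ("F-bit")
    tag = byte & 0x7F            # only the low 7 bits identify the tag
    if tag <= 0x0F:
        return tracked, "POS", tag           # value is the tag itself
    if tag <= 0x1F:
        return tracked, "NEG", tag - 32      # 16 -> -16, 31 -> -1
    if 0x40 <= tag <= 0x4F:
        return tracked, "ARRAYREF", tag & 0x0F
    if 0x50 <= tag <= 0x5F:
        return tracked, "HASHREF", tag & 0x0F
    if tag >= 0x60:
        return tracked, "SHORT_BINARY", tag & 0x1F  # length in low 5 bits
    return tracked, "OTHER", tag


if __name__ == "__main__":
    doc = MAGIC + bytes([0x01, 0x00, 0x03])  # version 1, raw, no suffix, body = POS_3
    header = parse_header(doc)
    print(header, classify_tag(doc[header["body_offset"]]))
```

Returning the next position from read_varint keeps the sketch single-pass, which matches the spec's stated goal of decoding without lookahead; a real decoder would also need bounds checks and the offset table for tracked tags.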