=pod

=encoding utf8

=head1 NAME

Sereal - Protocol definition

=head1 SYNOPSIS

This document describes the format and encoding of a Sereal data packet.

=head1 DESCRIPTION

A serialized structure is converted into a "document". A document is made
up of two parts, the header and the body.

=head2 General Points

=over 4

=item Little Endian

All numeric data is in little endian format.

=item IEEE Floats

Floating point types are in IEEE format.

=item Varints

Heavy use is made of a variable length integer encoding commonly called
a "varint" (Google calls it a Varint128). This encoding uses the high bit
of each byte to signal that another byte of data follows; the last byte
always has the high bit off. The data is in little endian order, with the
low 7 bits in the first byte, the next 7 bits in the second byte, and so on.

See L<Google's description|https://developers.google.com/protocol-buffers/docs/encoding#varints>.

=back
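
The varint rules above can be sketched in a few lines of Python. This is
an illustration only, not code taken from any Sereal implementation; the
function names C<encode_varint> and C<decode_varint> are made up for this
example.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, low 7 bits first."""
    if value < 0:
        raise ValueError("varints encode unsigned integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit on: another byte follows
        else:
            out.append(low7)         # high bit off: this is the last byte
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position of the first byte after the varint)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

For example, 300 is C<0b100101100>, so its low 7 bits (C<0x2c>) go into the
first byte with the high bit set, followed by the remaining bits (C<0x02>).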

=head2 Header Format

A header consists of multiple components:

    <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>

=over 4

=item MAGIC

A "magic string" that identifies a document as being in the Sereal format.
The value of this string is "=srl"; when decoded as an unsigned 32 bit
integer on a little endian machine it has a value of 0x6c72733d.

=item VERSION-TYPE

A single byte, of which the high 4 bits are used to represent the "type"
of the document, and the low 4 bits are used to represent the version of
the Sereal protocol the document complies with.

Up until now only one version of Sereal has been released, so the
low bits will be 1.

Currently only two types are defined:

=over 4

=item 0

Raw Sereal format. The data can be processed verbatim.

=item 1

Compressed Sereal format, using Google's Snappy compression internally.

=back

Additional compression types are envisaged and will be assigned type
numbers by the maintainers of the protocol.

=item HEADER-SUFFIX-SIZE

The structure of the header includes support for embedding additional data.
This is accomplished by specifying the length of the suffix
in the header with a varint. Headers with no suffix will set this to a
binary 0. This is intended for future format extensions that retain some
level of compatibility with old decoders (which know how to skip the
extended header due to the embedded length).

=item OPT-SUFFIX

The suffix may contain whatever data the encoder wishes to embed in the
header. In version 1 of the protocol the decoder will never look inside
this data. Later versions may introduce new rules for this field.

=back
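
As a rough illustration of the header layout, the following Python sketch
splits a document into its header fields. The C<Header> tuple and the
C<parse_header> and C<decode_varint> names are hypothetical and exist only
for this example. The sketch does not decompress a Snappy body; it only
reports the type nibble.

```python
import struct
from typing import NamedTuple

MAGIC = b"=srl"
# Read as a little endian unsigned 32 bit integer, the magic is 0x6c72733d.
assert struct.unpack("<I", MAGIC)[0] == 0x6C72733D


class Header(NamedTuple):
    version: int      # low 4 bits of VERSION-TYPE
    doc_type: int     # high 4 bits of VERSION-TYPE (0 = raw, 1 = Snappy)
    suffix: bytes     # OPT-SUFFIX, opaque to a version 1 decoder
    body_offset: int  # index of the first body byte


def decode_varint(buf, pos):
    result, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_header(doc: bytes) -> Header:
    if doc[:4] != MAGIC:
        raise ValueError("not a Sereal document: bad magic")
    if len(doc) < 6:
        raise ValueError("truncated header")
    version_type = doc[4]
    suffix_len, pos = decode_varint(doc, 5)
    suffix = doc[pos:pos + suffix_len]
    if len(suffix) != suffix_len:
        raise ValueError("truncated header suffix")
    return Header(version_type & 0x0F, version_type >> 4, suffix, pos + suffix_len)
```

A decoder that does not understand the suffix can still find the body by
starting at C<body_offset>, which is the compatibility property that
HEADER-SUFFIX-SIZE is meant to provide.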

=head2 Body Format

The body is made up of one or more tagged data items:

    <TAG> <OPT-DATA>

=over 4

=item TAG

A tag is a single byte which specifies the type of the data being decoded.

The high bit of each tag is used to signal to the decoder that the
deserialized data needs to be stored and tracked because it will be reused
elsewhere in the serialization. This is sometimes called the "track flag"
or the "F-bit" in code and documentation. Its status should be ignored
when determining the type of a tag, meaning code should mask off the high
bit and only use the low 7 bits.

Some tags, such as POS, NEG and SHORT_BINARY, have embedded in them
either the data (in the case of POS and NEG) or the length of the
OPT-DATA section (in the case of SHORT_BINARY).

=item OPT-DATA

This field may contain an arbitrary set of bytes, whose length is either
determined implicitly by the tag (such as for FLOAT), explicitly in the tag
(as in SHORT_BINARY) or in a varint following the tag (such as for BINARY
and STR_UTF8).

=back
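
The three ways of finding the length of OPT-DATA can be illustrated with a
small Python sketch. It handles only FLOAT (0x22), BINARY (0x26) and
SHORT_BINARY (0x60 to 0x7f), and it assumes that FLOAT is a 4 byte IEEE
single precision value. The name C<read_simple_item> is invented for this
example.

```python
import struct


def decode_varint(buf, pos):
    result, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_simple_item(buf, pos):
    """Return (tracked, value, next_pos) for one FLOAT or binary string item."""
    tracked = bool(buf[pos] & 0x80)  # the track flag ("F-bit")
    tag = buf[pos] & 0x7F            # the type lives in the low 7 bits
    pos += 1
    if tag == 0x22:                  # FLOAT: length implied by the tag
        (value,) = struct.unpack_from("<f", buf, pos)
        return tracked, value, pos + 4
    if 0x60 <= tag <= 0x7F:          # SHORT_BINARY: length inside the tag
        length = tag & 0x1F
    elif tag == 0x26:                # BINARY: length in a following varint
        length, pos = decode_varint(buf, pos)
    else:
        raise ValueError("tag 0x%02x is not handled by this sketch" % tag)
    data = bytes(buf[pos:pos + length])
    if len(data) != length:
        raise ValueError("truncated string data")
    return tracked, data, pos + length
```

Note that C<0x63> and C<0xe3> decode to the same SHORT_BINARY item; they
differ only in whether the decoder must remember it.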

=head3 Tags

=for autoupdater start

    Tag             | Char | Dec |  Hex |     Binary | Follow
    ----------------+------+-----+------+------------+-----------------------------------------
    POS_0           |      |   0 | 0x00 | 0b00000000 | small positive integer - value in low 4 bits (identity)
    POS_1           |      |   1 | 0x01 | 0b00000001 |
    POS_2           |      |   2 | 0x02 | 0b00000010 |
    POS_3           |      |   3 | 0x03 | 0b00000011 |
    POS_4           |      |   4 | 0x04 | 0b00000100 |
    POS_5           |      |   5 | 0x05 | 0b00000101 |
    POS_6           |      |   6 | 0x06 | 0b00000110 |
    POS_7           | "\a" |   7 | 0x07 | 0b00000111 |
    POS_8           | "\b" |   8 | 0x08 | 0b00001000 |
    POS_9           | "\t" |   9 | 0x09 | 0b00001001 |
    POS_10          | "\n" |  10 | 0x0a | 0b00001010 |
    POS_11          |      |  11 | 0x0b | 0b00001011 |
    POS_12          | "\f" |  12 | 0x0c | 0b00001100 |
    POS_13          | "\r" |  13 | 0x0d | 0b00001101 |
    POS_14          |      |  14 | 0x0e | 0b00001110 |
    POS_15          |      |  15 | 0x0f | 0b00001111 | small positive integer - value in low 4 bits (identity)
    NEG_16          |      |  16 | 0x10 | 0b00010000 | small negative integer - value in low 4 bits (k+32)
    NEG_15          |      |  17 | 0x11 | 0b00010001 |
    NEG_14          |      |  18 | 0x12 | 0b00010010 |
    NEG_13          |      |  19 | 0x13 | 0b00010011 |
    NEG_12          |      |  20 | 0x14 | 0b00010100 |
    NEG_11          |      |  21 | 0x15 | 0b00010101 |
    NEG_10          |      |  22 | 0x16 | 0b00010110 |
    NEG_9           |      |  23 | 0x17 | 0b00010111 |
    NEG_8           |      |  24 | 0x18 | 0b00011000 |
    NEG_7           |      |  25 | 0x19 | 0b00011001 |
    NEG_6           |      |  26 | 0x1a | 0b00011010 |
    NEG_5           | "\e" |  27 | 0x1b | 0b00011011 |
    NEG_4           |      |  28 | 0x1c | 0b00011100 |
    NEG_3           |      |  29 | 0x1d | 0b00011101 |
    NEG_2           |      |  30 | 0x1e | 0b00011110 |
    NEG_1           |      |  31 | 0x1f | 0b00011111 | small negative integer - value in low 4 bits (k+32)
    VARINT          | " "  |  32 | 0x20 | 0b00100000 | <VARINT> - Varint variable length integer
    ZIGZAG          | "!"  |  33 | 0x21 | 0b00100001 | <ZIGZAG-VARINT> - Zigzag variable length integer
    FLOAT           | "\"" |  34 | 0x22 | 0b00100010 | <IEEE-FLOAT>
    DOUBLE          | "#"  |  35 | 0x23 | 0b00100011 | <IEEE-DOUBLE>
    LONG_DOUBLE     | "\$" |  36 | 0x24 | 0b00100100 | <IEEE-LONG-DOUBLE>
    UNDEF           | "%"  |  37 | 0x25 | 0b00100101 | None - Perl undef
    BINARY          | "&"  |  38 | 0x26 | 0b00100110 | <LEN-VARINT> <BYTES> - binary/(latin1) string
    STR_UTF8        | "'"  |  39 | 0x27 | 0b00100111 | <LEN-VARINT> <UTF8> - utf8 string
    REFN            | "("  |  40 | 0x28 | 0b00101000 | <ITEM-TAG> - ref to next item
    REFP            | ")"  |  41 | 0x29 | 0b00101001 | <OFFSET-VARINT> - ref to previous item stored at offset
    HASH            | "*"  |  42 | 0x2a | 0b00101010 | <COUNT-VARINT> [<KEY-TAG> <ITEM-TAG> ...] - count followed by key/value pairs
    ARRAY           | "+"  |  43 | 0x2b | 0b00101011 | <COUNT-VARINT> [<ITEM-TAG> ...] - count followed by items
    OBJECT          | ","  |  44 | 0x2c | 0b00101100 | <STR-TAG> <ITEM-TAG> - class, object-item
    OBJECTV         | "-"  |  45 | 0x2d | 0b00101101 | <OFFSET-VARINT> <ITEM-TAG> - offset of previously used classname tag - object-item
    ALIAS           | "."  |  46 | 0x2e | 0b00101110 | <OFFSET-VARINT> - alias to item defined at offset
    COPY            | "/"  |  47 | 0x2f | 0b00101111 | <OFFSET-VARINT> - copy of item defined at offset
    WEAKEN          | "0"  |  48 | 0x30 | 0b00110000 | <REF-TAG> - Weaken the following reference
    REGEXP          | "1"  |  49 | 0x31 | 0b00110001 | <PATTERN-STR-TAG> <MODIFIERS-STR-TAG>
    RESERVED_0      | "2"  |  50 | 0x32 | 0b00110010 | reserved
    RESERVED_1      | "3"  |  51 | 0x33 | 0b00110011 |
    RESERVED_2      | "4"  |  52 | 0x34 | 0b00110100 |
    RESERVED_3      | "5"  |  53 | 0x35 | 0b00110101 |
    RESERVED_4      | "6"  |  54 | 0x36 | 0b00110110 |
    RESERVED_5      | "7"  |  55 | 0x37 | 0b00110111 |
    RESERVED_6      | "8"  |  56 | 0x38 | 0b00111000 |
    RESERVED_7      | "9"  |  57 | 0x39 | 0b00111001 | reserved
    FALSE           | ":"  |  58 | 0x3a | 0b00111010 | false (PL_sv_no)
    TRUE            | ";"  |  59 | 0x3b | 0b00111011 | true (PL_sv_yes)
    MANY            | "<"  |  60 | 0x3c | 0b00111100 | <LEN-VARINT> <TYPE-BYTE> <TAG-DATA> - repeated tag (not done yet, will be implemented in version 2)
    PACKET_START    | "="  |  61 | 0x3d | 0b00111101 | (first byte of magic string in header)
    EXTEND          | ">"  |  62 | 0x3e | 0b00111110 | <BYTE> - for additional tags
    PAD             | "?"  |  63 | 0x3f | 0b00111111 | (ignored tag, skip to next byte)
    ARRAYREF_0      | "\@" |  64 | 0x40 | 0b01000000 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    ARRAYREF_1      | "A"  |  65 | 0x41 | 0b01000001 |
    ARRAYREF_2      | "B"  |  66 | 0x42 | 0b01000010 |
    ARRAYREF_3      | "C"  |  67 | 0x43 | 0b01000011 |
    ARRAYREF_4      | "D"  |  68 | 0x44 | 0b01000100 |
    ARRAYREF_5      | "E"  |  69 | 0x45 | 0b01000101 |
    ARRAYREF_6      | "F"  |  70 | 0x46 | 0b01000110 |
    ARRAYREF_7      | "G"  |  71 | 0x47 | 0b01000111 |
    ARRAYREF_8      | "H"  |  72 | 0x48 | 0b01001000 |
    ARRAYREF_9      | "I"  |  73 | 0x49 | 0b01001001 |
    ARRAYREF_10     | "J"  |  74 | 0x4a | 0b01001010 |
    ARRAYREF_11     | "K"  |  75 | 0x4b | 0b01001011 |
    ARRAYREF_12     | "L"  |  76 | 0x4c | 0b01001100 |
    ARRAYREF_13     | "M"  |  77 | 0x4d | 0b01001101 |
    ARRAYREF_14     | "N"  |  78 | 0x4e | 0b01001110 |
    ARRAYREF_15     | "O"  |  79 | 0x4f | 0b01001111 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    HASHREF_0       | "P"  |  80 | 0x50 | 0b01010000 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    HASHREF_1       | "Q"  |  81 | 0x51 | 0b01010001 |
    HASHREF_2       | "R"  |  82 | 0x52 | 0b01010010 |
    HASHREF_3       | "S"  |  83 | 0x53 | 0b01010011 |
    HASHREF_4       | "T"  |  84 | 0x54 | 0b01010100 |
    HASHREF_5       | "U"  |  85 | 0x55 | 0b01010101 |
    HASHREF_6       | "V"  |  86 | 0x56 | 0b01010110 |
    HASHREF_7       | "W"  |  87 | 0x57 | 0b01010111 |
    HASHREF_8       | "X"  |  88 | 0x58 | 0b01011000 |
    HASHREF_9       | "Y"  |  89 | 0x59 | 0b01011001 |
    HASHREF_10      | "Z"  |  90 | 0x5a | 0b01011010 |
    HASHREF_11      | "["  |  91 | 0x5b | 0b01011011 |
    HASHREF_12      | "\\" |  92 | 0x5c | 0b01011100 |
    HASHREF_13      | "]"  |  93 | 0x5d | 0b01011101 |
    HASHREF_14      | "^"  |  94 | 0x5e | 0b01011110 |
    HASHREF_15      | "_"  |  95 | 0x5f | 0b01011111 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    SHORT_BINARY_0  | "`"  |  96 | 0x60 | 0b01100000 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
    SHORT_BINARY_1  | "a"  |  97 | 0x61 | 0b01100001 |
    SHORT_BINARY_2  | "b"  |  98 | 0x62 | 0b01100010 |
    SHORT_BINARY_3  | "c"  |  99 | 0x63 | 0b01100011 |
    SHORT_BINARY_4  | "d"  | 100 | 0x64 | 0b01100100 |
    SHORT_BINARY_5  | "e"  | 101 | 0x65 | 0b01100101 |
    SHORT_BINARY_6  | "f"  | 102 | 0x66 | 0b01100110 |
    SHORT_BINARY_7  | "g"  | 103 | 0x67 | 0b01100111 |
    SHORT_BINARY_8  | "h"  | 104 | 0x68 | 0b01101000 |
    SHORT_BINARY_9  | "i"  | 105 | 0x69 | 0b01101001 |
    SHORT_BINARY_10 | "j"  | 106 | 0x6a | 0b01101010 |
    SHORT_BINARY_11 | "k"  | 107 | 0x6b | 0b01101011 |
    SHORT_BINARY_12 | "l"  | 108 | 0x6c | 0b01101100 |
    SHORT_BINARY_13 | "m"  | 109 | 0x6d | 0b01101101 |
    SHORT_BINARY_14 | "n"  | 110 | 0x6e | 0b01101110 |
    SHORT_BINARY_15 | "o"  | 111 | 0x6f | 0b01101111 |
    SHORT_BINARY_16 | "p"  | 112 | 0x70 | 0b01110000 |
    SHORT_BINARY_17 | "q"  | 113 | 0x71 | 0b01110001 |
    SHORT_BINARY_18 | "r"  | 114 | 0x72 | 0b01110010 |
    SHORT_BINARY_19 | "s"  | 115 | 0x73 | 0b01110011 |
    SHORT_BINARY_20 | "t"  | 116 | 0x74 | 0b01110100 |
    SHORT_BINARY_21 | "u"  | 117 | 0x75 | 0b01110101 |
    SHORT_BINARY_22 | "v"  | 118 | 0x76 | 0b01110110 |
    SHORT_BINARY_23 | "w"  | 119 | 0x77 | 0b01110111 |
    SHORT_BINARY_24 | "x"  | 120 | 0x78 | 0b01111000 |
    SHORT_BINARY_25 | "y"  | 121 | 0x79 | 0b01111001 |
    SHORT_BINARY_26 | "z"  | 122 | 0x7a | 0b01111010 |
    SHORT_BINARY_27 | "{"  | 123 | 0x7b | 0b01111011 |
    SHORT_BINARY_28 | "|"  | 124 | 0x7c | 0b01111100 |
    SHORT_BINARY_29 | "}"  | 125 | 0x7d | 0b01111101 |
    SHORT_BINARY_30 | "~"  | 126 | 0x7e | 0b01111110 |
    SHORT_BINARY_31 |      | 127 | 0x7f | 0b01111111 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag

=for autoupdater stop
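
Because the table is laid out in contiguous ranges, a decoder can classify
a tag byte with a handful of range checks. The Python sketch below is one
possible way to do this; C<describe_tag> and C<FIXED_TAGS> are names
invented for this example, and the tag names are copied from the table
above.

```python
# Names of the tags in the range 0x20..0x3f, in table order.
_NAMES = [
    "VARINT", "ZIGZAG", "FLOAT", "DOUBLE", "LONG_DOUBLE", "UNDEF",
    "BINARY", "STR_UTF8", "REFN", "REFP", "HASH", "ARRAY", "OBJECT",
    "OBJECTV", "ALIAS", "COPY", "WEAKEN", "REGEXP",
    "RESERVED_0", "RESERVED_1", "RESERVED_2", "RESERVED_3",
    "RESERVED_4", "RESERVED_5", "RESERVED_6", "RESERVED_7",
    "FALSE", "TRUE", "MANY", "PACKET_START", "EXTEND", "PAD",
]
FIXED_TAGS = {0x20 + i: name for i, name in enumerate(_NAMES)}


def describe_tag(byte: int):
    """Return (tag family, value embedded in the tag or None)."""
    tag = byte & 0x7F                     # drop the track flag
    if tag <= 0x0F:
        return "POS", tag                 # 0..15, identity
    if tag <= 0x1F:
        return "NEG", tag - 32            # 16..31 map to -16..-1
    if tag <= 0x3F:
        return FIXED_TAGS[tag], None
    if tag <= 0x4F:
        return "ARRAYREF", tag & 0x0F     # item count in low 4 bits
    if tag <= 0x5F:
        return "HASHREF", tag & 0x0F      # pair count in low 4 bits
    return "SHORT_BINARY", tag & 0x1F     # byte length in low 5 bits
```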

=head4 The Track Bit And Cyclic Data Structures

The protocol uses a combination of the offset of a tracked tag and the
flag bit to be able to encode and reconstruct cyclic structures in a single
pass.

An encoder must track duplicated items and generate the appropriate ALIAS or
REFP tags to reconstruct them, and when it does so it must ensure that the
high bit of the original tag has been set.

When a decoder encounters a tag with its flag set it will remember the
offset of the tag in the packet and the item that was decoded from
that tag. At a later point in the packet there will be an ALIAS or REFP
instruction which will refer to the item by its offset, and the decoder
will reuse it as needed.
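
The decoder side of this scheme can be sketched in Python for a tiny subset
of tags (POS, ARRAY, REFP and ALIAS). The sketch makes two simplifying
assumptions that are not part of the specification: offsets are counted
from the start of the buffer passed to the function, and both REFP and
ALIAS are mapped to "the same Python object", which loses the Perl
distinction between a reference and an alias. The name C<decode_item> is
hypothetical.

```python
def decode_varint(buf, pos):
    result, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_item(buf, pos, tracked):
    """Decode one item; 'tracked' maps tag offsets to decoded items."""
    offset = pos
    track = bool(buf[pos] & 0x80)
    tag = buf[pos] & 0x7F
    pos += 1
    if tag in (0x29, 0x2E):               # REFP or ALIAS
        target, pos = decode_varint(buf, pos)
        if target not in tracked:
            raise ValueError("offset %d was never tracked" % target)
        return tracked[target], pos
    if tag <= 0x0F:                       # POS_0 .. POS_15
        if track:
            tracked[offset] = tag
        return tag, pos
    if tag == 0x2B:                       # ARRAY
        count, pos = decode_varint(buf, pos)
        value = []
        if track:
            # Register the container before decoding its children so that
            # a child may refer back to it (a cyclic structure).
            tracked[offset] = value
        for _ in range(count):
            item, pos = decode_item(buf, pos, tracked)
            value.append(item)
        return value, pos
    raise ValueError("tag 0x%02x is not handled by this sketch" % tag)
```

Decoding the bytes C<ab 02 01 29 00> with an empty C<tracked> dictionary
yields a two element list whose second element is the list itself.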

=head4 The COPY Tag

Sometimes it is convenient to be able to reuse a previously emitted
sequence in the packet to reduce duplication, for instance in a data
structure with many hashes with the same keys. The COPY tag is used for
this. Its argument is a varint which is the offset of a previously
emitted tag, and decoders are to behave as though the tag it references
was inserted into the packet stream as a replacement for the COPY tag.

Note that in this case the track flag is B<not> set. It is assumed the
decoder can jump back to reread the tag from its location alone.

COPY tags are forbidden from referring to another COPY tag, and are also
forbidden from referring to anything containing a COPY tag, with the
exception that a COPY tag used as a value may refer to a tag that uses
a COPY tag for a classname or hash key.
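
A minimal Python sketch of COPY handling follows. It supports only
SHORT_BINARY items and COPY tags that refer to them, and it assumes that
offsets are counted from the start of the buffer passed in. The name
C<decode_string_item> is invented for this example.

```python
def decode_varint(buf, pos):
    result, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_string_item(buf, pos, in_copy=False):
    """Return (bytes, next_pos) for a SHORT_BINARY item or a COPY of one."""
    tag = buf[pos] & 0x7F
    if tag == 0x2F:                          # COPY
        if in_copy:
            raise ValueError("COPY must not refer to another COPY")
        target, next_pos = decode_varint(buf, pos + 1)
        # Jump back and reread the referenced tag as if it stood here,
        # then carry on after the COPY tag's own varint.
        value, _ = decode_string_item(buf, target, in_copy=True)
        return value, next_pos
    if 0x60 <= tag <= 0x7F:                  # SHORT_BINARY
        start = pos + 1
        length = tag & 0x1F
        return bytes(buf[start:start + length]), start + length
    raise ValueError("tag 0x%02x is not handled by this sketch" % tag)
```

No lookup table is needed here, which is why the track flag does not have
to be set on the referenced tag: the offset alone is enough to reread it.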

=head4 String Types

Sereal supports three string representations. Two are "encodingless":
SHORT_BINARY and BINARY, where binary means "raw bytes". The other
is STR_UTF8, which is expected to contain valid canonical UTF8 encoded
unicode text data. Under normal circumstances a decoder is not expected
to validate that this is actually the case, and is allowed to simply
extract the data verbatim.

SHORT_BINARY stores the length of the string in the tag itself and is used
for strings less than 32 bytes long. Both BINARY and STR_UTF8
use a varint to indicate the number of B<bytes> (octets) in the string.
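
The choice between the three representations could look like the following
Python sketch. Mapping Python C<str> to STR_UTF8 and C<bytes> to the two
binary tags is an assumption made for this example, as is the name
C<encode_string>.

```python
def encode_varint(value):
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_string(value):
    """Encode a str as STR_UTF8 and bytes as SHORT_BINARY or BINARY."""
    if isinstance(value, str):
        data = value.encode("utf-8")
        # The length is the number of encoded bytes, not of characters.
        return bytes([0x27]) + encode_varint(len(data)) + data
    data = bytes(value)
    if len(data) < 32:
        return bytes([0x60 | len(data)]) + data   # length in low 5 bits
    return bytes([0x26]) + encode_varint(len(data)) + data
```

The single character string "é" is encoded with a length of 2, because its
UTF8 form occupies two octets.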

=head4 Hash Keys

Hash keys are always one of the string types, or a COPY tag referencing a
string.

=head4 Handling objects

Objects are serialized as a class name and a tag which represents the
object's data. In Perl land this will always be a reference. Mapping Perl
objects to other languages is left to the future.

Note that classnames MUST be a string, or a COPY tag referencing a string.

OBJECTV varints MUST reference a previously used classname, and not an
arbitrary string.

=head1 AUTHOR

Yves Orton E<lt>demerphq@gmail.comE<gt>

Damian Gryski

Steffen Mueller E<lt>smueller@cpan.orgE<gt>

Rafaël Garcia-Suarez

Ævar Arnfjörð Bjarmason

=head1 ACKNOWLEDGMENT

This protocol was originally developed for Booking.com. With approval
from Booking.com, this document was generalized and published on GitHub
and CPAN, for which the authors would like to express their gratitude.

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2012 by Steffen Mueller

Copyright (C) 2012 by Yves Orton

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut