=pod

=encoding utf8

=head1 NAME

Sereal - Protocol definition

=head1 SYNOPSIS

This document describes the format and encoding of a Sereal data packet.

=head1 DESCRIPTION

A serialized structure is converted into a "document". A document is made
up of two parts, the header and the body.

=head2 General Points

=over 4

=item Little Endian

All numeric data is in little endian format.

=item IEEE Floats

Floating point types are in IEEE format.

=item Varints

Heavy use is made of a variable length integer encoding commonly called
a "varint" (Google calls it a Varint128). Every byte except the last has
its high bit set to signal that another byte of data follows; the last
byte always has its high bit clear. The data is in little endian order:
the low seven bits are in the first byte, the next seven bits in the
second byte, and so on.

See L<Google's description|https://developers.google.com/protocol-buffers/docs/encoding#varints>.

=back
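
The varint rules above can be sketched in a few lines of Python. This is
an illustration only, not code from any Sereal implementation; the
function names encode_varint and decode_varint are made up for this
example.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a little endian base-128 varint."""
    if n < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: another byte follows
        else:
            out.append(low7)         # high bit clear: this is the last byte
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # low seven bits come first
        if not byte & 0x80:
            return value, pos
        shift += 7
```

For example, 300 is encoded as the two bytes 0xac 0x02: the low seven
bits (0x2c) with the continuation bit set, followed by the remaining
bits (0x02).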

=head2 Header Format

A header consists of multiple components:

    <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>

=over 4

=item MAGIC

A "magic string" that identifies a document as being in the Sereal format.
The value of this string is "=srl"; when read as an unsigned 32 bit
integer on a little endian machine it has the value 0x6c72733d.

=item VERSION-TYPE

A single byte, of which the high 4 bits represent the "type" of the
document and the low 4 bits represent the version of the Sereal protocol
the document complies with.

Up until now there has only been one version of Sereal released, so the
low bits will be 1.

Currently only two types are defined:

=over 4

=item 0

Raw Sereal format. The data can be processed verbatim.

=item 1

Compressed Sereal format, using Google's Snappy compression internally.

=back

Additional compression types are envisaged and will be assigned type
numbers by the maintainers of the protocol.

=item HEADER-SUFFIX-SIZE

The structure of the header includes support for embedding additional data.
This is accomplished by specifying the length of the suffix in the header
with a varint. Headers with no suffix set this to a binary 0. This is
intended for future format extensions that retain some level of
compatibility with old decoders (which know how to skip the extended
header thanks to the embedded length).

=item OPT-SUFFIX

The suffix may contain whatever data the encoder wishes to embed in the
header. In version 1 of the protocol the decoder will never look inside
this data. Later versions may introduce new rules for this field.

=back
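
As a rough sketch of how a decoder might split a document into the
header fields described above, consider the following Python. The names
Header, parse_header and read_varint are hypothetical and exist only for
this example; Snappy decompression of the body is deliberately left out.

```python
import struct
from typing import NamedTuple, Tuple

MAGIC = b"=srl"


class Header(NamedTuple):
    version: int      # low 4 bits of VERSION-TYPE
    doc_type: int     # high 4 bits of VERSION-TYPE (0 = raw, 1 = Snappy)
    suffix: bytes     # OPT-SUFFIX, opaque to a version 1 decoder
    body_offset: int  # index of the first body byte


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint at buf[pos]; return (value, next_pos)."""
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_header(doc: bytes) -> Header:
    if doc[:4] != MAGIC:
        raise ValueError("not a Sereal document")
    if len(doc) < 6:
        raise ValueError("truncated header")
    version_type = doc[4]
    suffix_len, pos = read_varint(doc, 5)
    suffix = doc[pos:pos + suffix_len]
    if len(suffix) != suffix_len:
        raise ValueError("truncated header suffix")
    return Header(version_type & 0x0F, version_type >> 4, suffix, pos + suffix_len)


# The magic string read as a little endian unsigned 32 bit integer.
MAGIC_AS_U32 = struct.unpack("<I", MAGIC)[0]
```

Because the suffix length is always present, a decoder that does not
understand the suffix can still find the body by skipping suffix_len
bytes.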

=head2 Body Format

The body is made up of one or more tagged data items:

    <TAG> <OPT-DATA>

=over 4

=item TAG

A tag is a single byte which specifies the type of the data being decoded.

The high bit of each tag signals to the decoder that the deserialized
item needs to be stored and tracked because it will be reused elsewhere
in the serialization. This is sometimes called the "track flag" or the
"F-bit" in code and documentation. Its state should be ignored when
determining the type of a tag, meaning code should mask off the high bit
and use only the low 7 bits.

Some tags, such as POS, NEG and SHORT_BINARY, have embedded in them
either the data itself (in the case of POS and NEG) or the length of the
OPT-DATA section (in the case of SHORT_BINARY).

=item OPT-DATA

This field may contain an arbitrary set of bytes, whose length is either
determined implicitly by the tag (such as for FLOAT), explicitly in the tag
(as in SHORT_BINARY) or by a varint following the tag (such as for BINARY
and STR_UTF8).

=back

When referring to an offset below, what's meant is a varint encoded
absolute integer byte position. That is, an offset of 10 refers to the
tenth byte in the Sereal document (including its header).

=head3 Tags

=for autoupdater start


    Tag             | Char | Dec | Hex  | Binary     | Follow
    ----------------+------+-----+------+------------+-----------------------------------------
    POS_0           |      |   0 | 0x00 | 0b00000000 | small positive integer - value in low 4 bits (identity)
    POS_1           |      |   1 | 0x01 | 0b00000001 |
    POS_2           |      |   2 | 0x02 | 0b00000010 |
    POS_3           |      |   3 | 0x03 | 0b00000011 |
    POS_4           |      |   4 | 0x04 | 0b00000100 |
    POS_5           |      |   5 | 0x05 | 0b00000101 |
    POS_6           |      |   6 | 0x06 | 0b00000110 |
    POS_7           | "\a" |   7 | 0x07 | 0b00000111 |
    POS_8           | "\b" |   8 | 0x08 | 0b00001000 |
    POS_9           | "\t" |   9 | 0x09 | 0b00001001 |
    POS_10          | "\n" |  10 | 0x0a | 0b00001010 |
    POS_11          |      |  11 | 0x0b | 0b00001011 |
    POS_12          | "\f" |  12 | 0x0c | 0b00001100 |
    POS_13          | "\r" |  13 | 0x0d | 0b00001101 |
    POS_14          |      |  14 | 0x0e | 0b00001110 |
    POS_15          |      |  15 | 0x0f | 0b00001111 | small positive integer - value in low 4 bits (identity)
    NEG_16          |      |  16 | 0x10 | 0b00010000 | small negative integer - value in low 4 bits (k+32)
    NEG_15          |      |  17 | 0x11 | 0b00010001 |
    NEG_14          |      |  18 | 0x12 | 0b00010010 |
    NEG_13          |      |  19 | 0x13 | 0b00010011 |
    NEG_12          |      |  20 | 0x14 | 0b00010100 |
    NEG_11          |      |  21 | 0x15 | 0b00010101 |
    NEG_10          |      |  22 | 0x16 | 0b00010110 |
    NEG_9           |      |  23 | 0x17 | 0b00010111 |
    NEG_8           |      |  24 | 0x18 | 0b00011000 |
    NEG_7           |      |  25 | 0x19 | 0b00011001 |
    NEG_6           |      |  26 | 0x1a | 0b00011010 |
    NEG_5           | "\e" |  27 | 0x1b | 0b00011011 |
    NEG_4           |      |  28 | 0x1c | 0b00011100 |
    NEG_3           |      |  29 | 0x1d | 0b00011101 |
    NEG_2           |      |  30 | 0x1e | 0b00011110 |
    NEG_1           |      |  31 | 0x1f | 0b00011111 | small negative integer - value in low 4 bits (k+32)
    VARINT          | " "  |  32 | 0x20 | 0b00100000 | <VARINT> - Varint variable length integer
    ZIGZAG          | "!"  |  33 | 0x21 | 0b00100001 | <ZIGZAG-VARINT> - Zigzag variable length integer
    FLOAT           | "\"" |  34 | 0x22 | 0b00100010 | <IEEE-FLOAT>
    DOUBLE          | "#"  |  35 | 0x23 | 0b00100011 | <IEEE-DOUBLE>
    LONG_DOUBLE     | "\$" |  36 | 0x24 | 0b00100100 | <IEEE-LONG-DOUBLE>
    UNDEF           | "%"  |  37 | 0x25 | 0b00100101 | None - Perl undef
    BINARY          | "&"  |  38 | 0x26 | 0b00100110 | <LEN-VARINT> <BYTES> - binary/(latin1) string
    STR_UTF8        | "'"  |  39 | 0x27 | 0b00100111 | <LEN-VARINT> <UTF8> - utf8 string
    REFN            | "("  |  40 | 0x28 | 0b00101000 | <ITEM-TAG> - ref to next item
    REFP            | ")"  |  41 | 0x29 | 0b00101001 | <OFFSET-VARINT> - ref to previous item stored at offset
    HASH            | "*"  |  42 | 0x2a | 0b00101010 | <COUNT-VARINT> [<KEY-TAG> <ITEM-TAG> ...] - count followed by key/value pairs
    ARRAY           | "+"  |  43 | 0x2b | 0b00101011 | <COUNT-VARINT> [<ITEM-TAG> ...] - count followed by items
    OBJECT          | ","  |  44 | 0x2c | 0b00101100 | <STR-TAG> <ITEM-TAG> - class, object-item
    OBJECTV         | "-"  |  45 | 0x2d | 0b00101101 | <OFFSET-VARINT> <ITEM-TAG> - offset of previously used classname tag - object-item
    ALIAS           | "."  |  46 | 0x2e | 0b00101110 | <OFFSET-VARINT> - alias to item defined at offset
    COPY            | "/"  |  47 | 0x2f | 0b00101111 | <OFFSET-VARINT> - copy of item defined at offset
    WEAKEN          | "0"  |  48 | 0x30 | 0b00110000 | <REF-TAG> - Weaken the following reference
    REGEXP          | "1"  |  49 | 0x31 | 0b00110001 | <PATTERN-STR-TAG> <MODIFIERS-STR-TAG>
    RESERVED_0      | "2"  |  50 | 0x32 | 0b00110010 | reserved
    RESERVED_1      | "3"  |  51 | 0x33 | 0b00110011 |
    RESERVED_2      | "4"  |  52 | 0x34 | 0b00110100 |
    RESERVED_3      | "5"  |  53 | 0x35 | 0b00110101 |
    RESERVED_4      | "6"  |  54 | 0x36 | 0b00110110 |
    RESERVED_5      | "7"  |  55 | 0x37 | 0b00110111 |
    RESERVED_6      | "8"  |  56 | 0x38 | 0b00111000 |
    RESERVED_7      | "9"  |  57 | 0x39 | 0b00111001 | reserved
    FALSE           | ":"  |  58 | 0x3a | 0b00111010 | false (PL_sv_no)
    TRUE            | ";"  |  59 | 0x3b | 0b00111011 | true (PL_sv_yes)
    MANY            | "<"  |  60 | 0x3c | 0b00111100 | <LEN-VARINT> <TYPE-BYTE> <TAG-DATA> - repeated tag (not done yet, will be implemented in version 2)
    PACKET_START    | "="  |  61 | 0x3d | 0b00111101 | (first byte of magic string in header)
    EXTEND          | ">"  |  62 | 0x3e | 0b00111110 | <BYTE> - for additional tags
    PAD             | "?"  |  63 | 0x3f | 0b00111111 | (ignored tag, skip to next byte)
    ARRAYREF_0      | "\@" |  64 | 0x40 | 0b01000000 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    ARRAYREF_1      | "A"  |  65 | 0x41 | 0b01000001 |
    ARRAYREF_2      | "B"  |  66 | 0x42 | 0b01000010 |
    ARRAYREF_3      | "C"  |  67 | 0x43 | 0b01000011 |
    ARRAYREF_4      | "D"  |  68 | 0x44 | 0b01000100 |
    ARRAYREF_5      | "E"  |  69 | 0x45 | 0b01000101 |
    ARRAYREF_6      | "F"  |  70 | 0x46 | 0b01000110 |
    ARRAYREF_7      | "G"  |  71 | 0x47 | 0b01000111 |
    ARRAYREF_8      | "H"  |  72 | 0x48 | 0b01001000 |
    ARRAYREF_9      | "I"  |  73 | 0x49 | 0b01001001 |
    ARRAYREF_10     | "J"  |  74 | 0x4a | 0b01001010 |
    ARRAYREF_11     | "K"  |  75 | 0x4b | 0b01001011 |
    ARRAYREF_12     | "L"  |  76 | 0x4c | 0b01001100 |
    ARRAYREF_13     | "M"  |  77 | 0x4d | 0b01001101 |
    ARRAYREF_14     | "N"  |  78 | 0x4e | 0b01001110 |
    ARRAYREF_15     | "O"  |  79 | 0x4f | 0b01001111 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    HASHREF_0       | "P"  |  80 | 0x50 | 0b01010000 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    HASHREF_1       | "Q"  |  81 | 0x51 | 0b01010001 |
    HASHREF_2       | "R"  |  82 | 0x52 | 0b01010010 |
    HASHREF_3       | "S"  |  83 | 0x53 | 0b01010011 |
    HASHREF_4       | "T"  |  84 | 0x54 | 0b01010100 |
    HASHREF_5       | "U"  |  85 | 0x55 | 0b01010101 |
    HASHREF_6       | "V"  |  86 | 0x56 | 0b01010110 |
    HASHREF_7       | "W"  |  87 | 0x57 | 0b01010111 |
    HASHREF_8       | "X"  |  88 | 0x58 | 0b01011000 |
    HASHREF_9       | "Y"  |  89 | 0x59 | 0b01011001 |
    HASHREF_10      | "Z"  |  90 | 0x5a | 0b01011010 |
    HASHREF_11      | "["  |  91 | 0x5b | 0b01011011 |
    HASHREF_12      | "\\" |  92 | 0x5c | 0b01011100 |
    HASHREF_13      | "]"  |  93 | 0x5d | 0b01011101 |
    HASHREF_14      | "^"  |  94 | 0x5e | 0b01011110 |
    HASHREF_15      | "_"  |  95 | 0x5f | 0b01011111 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    SHORT_BINARY_0  | "`"  |  96 | 0x60 | 0b01100000 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
    SHORT_BINARY_1  | "a"  |  97 | 0x61 | 0b01100001 |
    SHORT_BINARY_2  | "b"  |  98 | 0x62 | 0b01100010 |
    SHORT_BINARY_3  | "c"  |  99 | 0x63 | 0b01100011 |
    SHORT_BINARY_4  | "d"  | 100 | 0x64 | 0b01100100 |
    SHORT_BINARY_5  | "e"  | 101 | 0x65 | 0b01100101 |
    SHORT_BINARY_6  | "f"  | 102 | 0x66 | 0b01100110 |
    SHORT_BINARY_7  | "g"  | 103 | 0x67 | 0b01100111 |
    SHORT_BINARY_8  | "h"  | 104 | 0x68 | 0b01101000 |
    SHORT_BINARY_9  | "i"  | 105 | 0x69 | 0b01101001 |
    SHORT_BINARY_10 | "j"  | 106 | 0x6a | 0b01101010 |
    SHORT_BINARY_11 | "k"  | 107 | 0x6b | 0b01101011 |
    SHORT_BINARY_12 | "l"  | 108 | 0x6c | 0b01101100 |
    SHORT_BINARY_13 | "m"  | 109 | 0x6d | 0b01101101 |
    SHORT_BINARY_14 | "n"  | 110 | 0x6e | 0b01101110 |
    SHORT_BINARY_15 | "o"  | 111 | 0x6f | 0b01101111 |
    SHORT_BINARY_16 | "p"  | 112 | 0x70 | 0b01110000 |
    SHORT_BINARY_17 | "q"  | 113 | 0x71 | 0b01110001 |
    SHORT_BINARY_18 | "r"  | 114 | 0x72 | 0b01110010 |
    SHORT_BINARY_19 | "s"  | 115 | 0x73 | 0b01110011 |
    SHORT_BINARY_20 | "t"  | 116 | 0x74 | 0b01110100 |
    SHORT_BINARY_21 | "u"  | 117 | 0x75 | 0b01110101 |
    SHORT_BINARY_22 | "v"  | 118 | 0x76 | 0b01110110 |
    SHORT_BINARY_23 | "w"  | 119 | 0x77 | 0b01110111 |
    SHORT_BINARY_24 | "x"  | 120 | 0x78 | 0b01111000 |
    SHORT_BINARY_25 | "y"  | 121 | 0x79 | 0b01111001 |
    SHORT_BINARY_26 | "z"  | 122 | 0x7a | 0b01111010 |
    SHORT_BINARY_27 | "{"  | 123 | 0x7b | 0b01111011 |
    SHORT_BINARY_28 | "|"  | 124 | 0x7c | 0b01111100 |
    SHORT_BINARY_29 | "}"  | 125 | 0x7d | 0b01111101 |
    SHORT_BINARY_30 | "~"  | 126 | 0x7e | 0b01111110 |
    SHORT_BINARY_31 |      | 127 | 0x7f | 0b01111111 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag

=for autoupdater stop
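
To illustrate how the table can be read, here is a minimal Python sketch
that masks off the track flag and classifies a tag byte, extracting the
value or length embedded in the POS, NEG, ARRAYREF, HASHREF and
SHORT_BINARY ranges. The function name describe_tag and the NAMES list
are inventions of this example, not part of the protocol.

```python
TRACK_FLAG = 0x80

# Names of the tags in the range 0x20..0x3f, in table order.
NAMES = (
    ["VARINT", "ZIGZAG", "FLOAT", "DOUBLE", "LONG_DOUBLE", "UNDEF",
     "BINARY", "STR_UTF8", "REFN", "REFP", "HASH", "ARRAY", "OBJECT",
     "OBJECTV", "ALIAS", "COPY", "WEAKEN", "REGEXP"]
    + ["RESERVED_%d" % i for i in range(8)]
    + ["FALSE", "TRUE", "MANY", "PACKET_START", "EXTEND", "PAD"]
)


def describe_tag(byte: int):
    """Return (tracked, name, embedded) for a single tag byte.

    `embedded` is the value or count carried inside the tag itself,
    or None when the tag carries no embedded data.
    """
    tracked = bool(byte & TRACK_FLAG)
    tag = byte & 0x7F                 # ignore the track flag for dispatch
    if tag <= 0x0F:
        return tracked, "POS", tag              # 0 .. 15
    if tag <= 0x1F:
        return tracked, "NEG", tag - 32         # -16 .. -1
    if tag <= 0x3F:
        return tracked, NAMES[tag - 0x20], None
    if tag <= 0x4F:
        return tracked, "ARRAYREF", tag & 0x0F  # item count
    if tag <= 0x5F:
        return tracked, "HASHREF", tag & 0x0F   # key/value pair count
    return tracked, "SHORT_BINARY", tag & 0x1F  # length in bytes
```

Dispatching on ranges rather than on 128 individual values keeps the
decoder small, which is presumably why the embedded-data tags occupy
contiguous, power-of-two aligned blocks.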

=head4 The Track Bit And Cyclic Data Structures

The protocol uses a combination of the offset of a tracked tag and the
flag bit to be able to encode and reconstruct cyclic structures in a single
pass.

An encoder must track duplicated items and generate the appropriate ALIAS or
REFP tags to reconstruct them, and when it does so it must ensure that the
high bit of the original tag has been set.

When a decoder encounters a tag with its flag set it will remember the
offset of the tag in the packet and the item that was decoded from
that tag. At a later point in the packet there will be an ALIAS or REFP
instruction which will refer to the item by its offset, and the decoder
will reuse it as needed.
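
The single-pass reconstruction described above might look roughly like
the following Python sketch. It rests on several assumptions that are
not part of the specification: Python lists stand in for Perl array
references, offsets are treated as 0-based indexes into the buffer, only
the POS, ARRAYREF and REFP tags are handled, and the name decode_item is
hypothetical.

```python
from typing import Any, Dict, Tuple

TRACK_FLAG = 0x80
REFP = 0x29


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint at buf[pos]; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def decode_item(buf: bytes, pos: int, tracked: Dict[int, Any]) -> Tuple[Any, int]:
    """Decode one item at buf[pos]; return (item, next_pos)."""
    start = pos
    byte = buf[pos]
    pos += 1
    tag = byte & 0x7F
    track = bool(byte & TRACK_FLAG)

    if tag <= 0x0F:                      # POS_0 .. POS_15
        if track:
            tracked[start] = tag
        return tag, pos

    if 0x40 <= tag <= 0x4F:              # ARRAYREF_0 .. ARRAYREF_15
        items: list = []
        if track:
            # Register the container before decoding its children so a
            # REFP inside the array can point back at the array itself.
            tracked[start] = items
        for _ in range(tag & 0x0F):
            child, pos = decode_item(buf, pos, tracked)
            items.append(child)
        return items, pos

    if tag == REFP:
        offset, pos = read_varint(buf, pos)
        if offset not in tracked:
            raise ValueError("REFP refers to an untracked offset %d" % offset)
        return tracked[offset], pos

    raise NotImplementedError("tag 0x%02x is not handled in this sketch" % tag)
```

Decoding the four bytes 0xc2 0x05 0x29 0x00 (a tracked ARRAYREF_2
holding POS_5 and a REFP back to offset 0) yields a list whose second
element is the list itself. The buffer in this example has no header,
purely for brevity.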

=head4 The COPY Tag

Sometimes it is convenient to be able to reuse a previously emitted
sequence in the packet to reduce duplication, for instance in a data
structure with many hashes that share the same keys. The COPY tag is used
for this. Its argument is a varint which is the offset of a previously
emitted tag, and decoders are to behave as though the tag it references
was inserted into the packet stream as a replacement for the COPY tag.

Note that in this case the track flag is B<not> set. It is assumed the
decoder can jump back to reread the tag from its location alone.

COPY tags are forbidden from referring to another COPY tag, and are also
forbidden from referring to anything containing a COPY tag, with the
exception that a COPY tag used as a value may refer to a tag that uses
a COPY tag for a classname or hash key.
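
The "jump back and reread" behaviour can be sketched as follows. As
before this is only an illustration under stated assumptions: offsets
are treated as 0-based indexes into the buffer, only SHORT_BINARY,
ARRAYREF and COPY are handled, only the direct COPY-to-COPY restriction
is checked, and decode_item is a name made up for this example.

```python
from typing import Any, Tuple

COPY = 0x2F


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint at buf[pos]; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def decode_item(buf: bytes, pos: int) -> Tuple[Any, int]:
    """Decode one item at buf[pos]; return (item, next_pos)."""
    tag = buf[pos] & 0x7F
    pos += 1

    if tag >= 0x60:                      # SHORT_BINARY_0 .. SHORT_BINARY_31
        length = tag & 0x1F
        return buf[pos:pos + length], pos + length

    if 0x40 <= tag <= 0x4F:              # ARRAYREF_0 .. ARRAYREF_15
        items = []
        for _ in range(tag & 0x0F):
            item, pos = decode_item(buf, pos)
            items.append(item)
        return items, pos

    if tag == COPY:
        offset, pos = read_varint(buf, pos)
        if buf[offset] & 0x7F == COPY:
            raise ValueError("COPY must not refer to another COPY tag")
        item, _ = decode_item(buf, offset)  # reread the earlier tag in place
        return item, pos                    # then resume after the COPY argument

    raise NotImplementedError("tag 0x%02x is not handled in this sketch" % tag)
```

Note that nothing is remembered while decoding the original tag; the
offset alone is enough, which is why the track flag is not needed for
COPY.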

=head4 String Types

Sereal supports three string representations. Two are "encodingless":
SHORT_BINARY and BINARY, where binary means "raw bytes". The other
is STR_UTF8, which is expected to contain valid canonical UTF8 encoded
Unicode text data. Under normal circumstances a decoder is not expected
to validate that this is actually the case, and is allowed to simply
extract the data verbatim.

SHORT_BINARY stores the length of the string in the tag itself and is used
for strings less than 32 bytes long. Both BINARY and STR_UTF8
use a varint to indicate the number of B<bytes> (octets) in the string.
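
A hedged sketch of how an encoder could choose between the three string
tags is shown below. Mapping Python str values to STR_UTF8 and bytes
values to the binary tags is a choice made for this example, not
something the protocol prescribes, and encode_string is a hypothetical
name.

```python
SHORT_BINARY_0 = 0x60
BINARY = 0x26
STR_UTF8 = 0x27


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a little endian base-128 varint."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_string(value) -> bytes:
    """Encode a str as STR_UTF8, or bytes as SHORT_BINARY / BINARY."""
    if isinstance(value, str):
        data = value.encode("utf-8")
        # The length counts encoded bytes, not characters.
        return bytes([STR_UTF8]) + encode_varint(len(data)) + data
    data = bytes(value)
    if len(data) < 32:
        # The length fits in the low 5 bits of the tag itself.
        return bytes([SHORT_BINARY_0 | len(data)]) + data
    return bytes([BINARY]) + encode_varint(len(data)) + data
```

For instance, the three raw bytes "foo" become the four bytes "cfoo",
because SHORT_BINARY_3 is 0x63, the character "c".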

=head4 Hash Keys

Hash keys are always one of the string types, or a COPY tag referencing a
string.

=head4 Handling objects

Objects are serialized as a class name and a tag which represents the
object's data. In Perl land this will always be a reference. Mapping Perl
objects to other languages is left to the future.

Note that classnames MUST be a string, or a COPY tag referencing a string.

OBJECTV varints MUST reference a previously used classname, and not an
arbitrary string.

=head1 AUTHOR

Yves Orton E<lt>demerphq@gmail.comE<gt>

Damian Gryski

Steffen Mueller E<lt>smueller@cpan.orgE<gt>

Rafaël Garcia-Suarez

Ævar Arnfjörð Bjarmason E<lt>avar@cpan.orgE<gt>

=head1 ACKNOWLEDGMENT

This protocol was originally developed for Booking.com. With approval
from Booking.com, this document was generalized and published on github
and CPAN, for which the authors would like to express their gratitude.

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2012 by Steffen Mueller

Copyright (C) 2012 by Yves Orton

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut