=pod

=encoding utf8

=head1 NAME

Sereal - Protocol definition

=head1 SYNOPSIS

This document describes the format and encoding of a Sereal data packet.

=head1 VERSION

This is the Sereal specification version 2.01.

The integer part of the document version corresponds to
the Sereal protocol version. For details on incompatible changes between
major protocol versions, see L</"PROTOCOL CHANGES"> below.

=head1 DESCRIPTION

A serialized structure is converted into a "document". A document is made
up of two parts, the document header and the document body.

=head2 General Points

=head3 Strictness

A compliant Sereal decoder must detect invalid documents and handle them
as a safe exception in the respective implementation language,
that is to say, without a crash or worse.

=head3 Little Endian

All numeric data is in little endian format.

=head3 IEEE Floats

Floating point types are in IEEE 754 format.

=head3 Varints

Heavy use is made of a variable length integer encoding commonly called
a "varint" (Google calls it a Varint128). This encoding uses the high bit
of each byte to signal that another byte of data follows; the last byte
always has the high bit off. The data is in little endian
order with the low seven bits in the first byte, the next seven in the
next byte, and so on.

See L<Google's description|https://developers.google.com/protocol-buffers/docs/encoding#varints>.
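
As an illustration of the encoding described above, here is a minimal
Python sketch of a varint encoder and decoder. The function names
C<encode_varint> and C<decode_varint> are assumptions made for this
example; they are not taken from any Sereal implementation.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a varint (low 7 bits first)."""
    if n < 0:
        raise ValueError("varints encode unsigned integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position of the first byte after the varint)."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7
```

For example, 300 is encoded as the two bytes 0xAC 0x02: the low seven
bits (0x2C) with the high bit set, followed by the remaining bits (0x02).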

=head2 Document Header Format

A header consists of multiple components:

    <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>

=head3 MAGIC

A "magic string" that identifies a document as being in the Sereal format.
The value of this string is "=srl". When decoded as an unsigned 32 bit
little endian integer, it has the value 0x6c72733d.

=head3 VERSION-TYPE

A single byte, of which the high 4 bits are used to represent the "type"
of the document, and the low 4 bits are used to represent the version of the
Sereal protocol the document complies with.

Up until now there have been versions 1 and 2 of the Sereal protocol,
so the low four bits will be 1 or 2.

Currently only three types are defined:

=over 4

=item 0

Raw Sereal format. The data can be processed verbatim.

=item 1

B<This is not a valid document type for Sereal protocol version 2!>

In Sereal protocol version 1, this used to be
"Compressed Sereal format, using Google's Snappy compression internally."
It has long been advised to prefer I<2>, "incremental-decoding-enabled
compressed Sereal," wherever possible.

=item 2

Compressed Sereal format, using Google's Snappy compression internally like
format I<1>, but supporting incremental parsing. Long preferred over
I<1> as this is considered a bug fix in the Snappy compression support.

The format is:

    <Varint><Snappy Blob>

where the varint signifies the length of the Snappy-compressed blob
following it. See L</"NOTES ON IMPLEMENTATION"> below for a discussion on
how to implement this efficiently.

=back

Additional compression types are envisaged and will be assigned type
numbers by the maintainers of the protocol.

=head3 HEADER-SUFFIX-SIZE

The structure of the header includes support for embedding additional data.
This is accomplished by specifying the length of the suffix
in the header with a varint. Headers with no suffix set this to a
single zero byte. This is intended for future format extensions that retain
some level of compatibility with old decoders (which know how to skip the
extended header thanks to the embedded length).

=head3 OPT-SUFFIX

The suffix may contain whatever data the encoder wishes to embed in the
header. In version 1 of the protocol the decoder never looked inside
this data. Later versions may introduce additional rules for this field.
In version 2 of the protocol, this variable-length part of the header
may be empty or have the following format:

    <8bit-BITFIELD> <OPT-USER-META-DATA>

=over 2

=item 8bit-BITFIELD

If not present, all bits are assumed off. In version 2 of the protocol,
only the least significant bit is meaningful: If set, the bitfield is
followed by the C<E<lt>OPT-USER-META-DATAE<gt>>. If not set, there is
no user meta data.

=item OPT-USER-META-DATA

If the least significant bit of the preceding bitfield is set, this
may be an arbitrary Sereal document body. Like any other Sereal
document body, it is self-contained and can be deserialized independently
from any other document bodies in the Sereal document. This document
body is NEVER compressed.

This is intended for embedding small amounts of meta data, such as
routing information, in a document. It allows users to avoid
deserializing very large document bodies needlessly or having to
call into decompression logic.

=back
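
The header layout described above can be sketched in Python as follows.
This is only an illustration under stated assumptions: the names
C<Header> and C<parse_header> are hypothetical, the user meta data flag
is interpreted according to the version 2 rules, and neither the user
meta data nor the body is decoded.

```python
from typing import NamedTuple

MAGIC = b"=srl"


class Header(NamedTuple):
    version: int         # low 4 bits of VERSION-TYPE
    doc_type: int        # high 4 bits of VERSION-TYPE
    suffix: bytes        # raw OPT-SUFFIX bytes (may be empty)
    has_user_meta: bool  # least significant bit of the 8bit-BITFIELD
    body_offset: int     # index of the first byte after the header


def parse_header(doc: bytes) -> Header:
    if doc[:4] != MAGIC:
        raise ValueError("missing Sereal magic string")
    if len(doc) < 6:
        raise ValueError("truncated header")
    version = doc[4] & 0x0F
    doc_type = doc[4] >> 4

    # HEADER-SUFFIX-SIZE is a varint giving the length of OPT-SUFFIX.
    size, shift, pos = 0, 0, 5
    while True:
        if pos >= len(doc):
            raise ValueError("truncated suffix size")
        byte = doc[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7

    if pos + size > len(doc):
        raise ValueError("suffix runs past the end of the document")
    suffix = doc[pos:pos + size]
    # An absent bitfield means all bits are off.
    has_user_meta = bool(suffix) and bool(suffix[0] & 0x01)
    return Header(version, doc_type, suffix, has_user_meta, pos + size)
```

A decoder that does not understand the suffix can still skip it, because
C<body_offset> is derived from the embedded length alone.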

=head2 Document Body Format

The body is made up of tagged data items:

    <TAG> <OPT-DATA>

Tagged items can be containers that hold other tagged items.
At the top level, the body holds only ONE tagged item (often
an array or hash) that holds others.

=head3 TAG

A tag is a single byte which specifies the type of the data being decoded.

The high bit of each tag is used to signal to the decoder that the
deserialized data needs to be stored and tracked because it will be reused
elsewhere in the serialization. This is sometimes called the "track flag"
or the "F-bit" in code and documentation. Its status should be ignored
when determining the type of a tag, meaning code should mask off the high
bit and only use the low 7 bits.

Some tags, such as POS, NEG and SHORT_BINARY, have embedded in them
either the data (in the case of POS and NEG) or the length of the
OPT-DATA section (in the case of SHORT_BINARY).
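
The masking rule and the tags with embedded data can be sketched as
below. The function name C<classify_tag> and its return shape are
assumptions made for this example; the numeric ranges are taken from the
tag table later in this document.

```python
from typing import Optional, Tuple

TRACK_FLAG = 0x80  # high bit of the tag byte


def classify_tag(byte: int) -> Tuple[bool, str, Optional[int]]:
    """Return (tracked, kind, embedded value or length)."""
    tracked = bool(byte & TRACK_FLAG)
    tag = byte & 0x7F                 # only the low 7 bits select the tag
    if tag <= 0x0F:                   # POS_0 .. POS_15: value is the tag
        return tracked, "POS", tag
    if tag <= 0x1F:                   # NEG_16 .. NEG_1: value is tag - 32
        return tracked, "NEG", tag - 32
    if tag >= 0x60:                   # SHORT_BINARY_0 .. SHORT_BINARY_31
        return tracked, "SHORT_BINARY", tag & 0x1F
    return tracked, "OTHER", None
```

For instance, the byte 0xE3 is SHORT_BINARY_3 (0x63) with the track flag
set, so three bytes of string data follow it.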

=head3 OPT-DATA

This field may contain an arbitrary set of bytes, whose length is determined
either implicitly by the tag (such as for FLOAT), explicitly in the tag (as
in SHORT_BINARY) or by a varint following the tag (such as for BINARY and
STR_UTF8).

When referring to an offset below, what's meant is a varint encoded
absolute integer byte position in the document body.
That is, an offset of 10 refers to the
tenth byte in the Sereal document body (i.e. excluding its header).
Sereal version 1 used to mandate offsets from the start of the document
header.

=head3 Tags

=for autoupdater start


    Tag               | Char | Dec |  Hex |     Binary | Follow
    ------------------+------+-----+------+----------- |-----------------------------------------
    POS_0             |      |   0 | 0x00 | 0b00000000 | small positive integer - value in low 4 bits (identity)
    POS_1             |      |   1 | 0x01 | 0b00000001 |
    POS_2             |      |   2 | 0x02 | 0b00000010 |
    POS_3             |      |   3 | 0x03 | 0b00000011 |
    POS_4             |      |   4 | 0x04 | 0b00000100 |
    POS_5             |      |   5 | 0x05 | 0b00000101 |
    POS_6             |      |   6 | 0x06 | 0b00000110 |
    POS_7             | "\a" |   7 | 0x07 | 0b00000111 |
    POS_8             | "\b" |   8 | 0x08 | 0b00001000 |
    POS_9             | "\t" |   9 | 0x09 | 0b00001001 |
    POS_10            | "\n" |  10 | 0x0a | 0b00001010 |
    POS_11            |      |  11 | 0x0b | 0b00001011 |
    POS_12            | "\f" |  12 | 0x0c | 0b00001100 |
    POS_13            | "\r" |  13 | 0x0d | 0b00001101 |
    POS_14            |      |  14 | 0x0e | 0b00001110 |
    POS_15            |      |  15 | 0x0f | 0b00001111 | small positive integer - value in low 4 bits (identity)
    NEG_16            |      |  16 | 0x10 | 0b00010000 | small negative integer - value in low 4 bits (k+32)
    NEG_15            |      |  17 | 0x11 | 0b00010001 |
    NEG_14            |      |  18 | 0x12 | 0b00010010 |
    NEG_13            |      |  19 | 0x13 | 0b00010011 |
    NEG_12            |      |  20 | 0x14 | 0b00010100 |
    NEG_11            |      |  21 | 0x15 | 0b00010101 |
    NEG_10            |      |  22 | 0x16 | 0b00010110 |
    NEG_9             |      |  23 | 0x17 | 0b00010111 |
    NEG_8             |      |  24 | 0x18 | 0b00011000 |
    NEG_7             |      |  25 | 0x19 | 0b00011001 |
    NEG_6             |      |  26 | 0x1a | 0b00011010 |
    NEG_5             | "\e" |  27 | 0x1b | 0b00011011 |
    NEG_4             |      |  28 | 0x1c | 0b00011100 |
    NEG_3             |      |  29 | 0x1d | 0b00011101 |
    NEG_2             |      |  30 | 0x1e | 0b00011110 |
    NEG_1             |      |  31 | 0x1f | 0b00011111 | small negative integer - value in low 4 bits (k+32)
    VARINT            | " "  |  32 | 0x20 | 0b00100000 | <VARINT> - Varint variable length integer
    ZIGZAG            | "!"  |  33 | 0x21 | 0b00100001 | <ZIGZAG-VARINT> - Zigzag variable length integer
    FLOAT             | "\"" |  34 | 0x22 | 0b00100010 | <IEEE-FLOAT>
    DOUBLE            | "#"  |  35 | 0x23 | 0b00100011 | <IEEE-DOUBLE>
    LONG_DOUBLE       | "\$" |  36 | 0x24 | 0b00100100 | <IEEE-LONG-DOUBLE>
    UNDEF             | "%"  |  37 | 0x25 | 0b00100101 | None - Perl undef
    BINARY            | "&"  |  38 | 0x26 | 0b00100110 | <LEN-VARINT> <BYTES> - binary/(latin1) string
    STR_UTF8          | "'"  |  39 | 0x27 | 0b00100111 | <LEN-VARINT> <UTF8> - utf8 string
    REFN              | "("  |  40 | 0x28 | 0b00101000 | <ITEM-TAG> - ref to next item
    REFP              | ")"  |  41 | 0x29 | 0b00101001 | <OFFSET-VARINT> - ref to previous item stored at offset
    HASH              | "*"  |  42 | 0x2a | 0b00101010 | <COUNT-VARINT> [<KEY-TAG> <ITEM-TAG> ...] - count followed by key/value pairs
    ARRAY             | "+"  |  43 | 0x2b | 0b00101011 | <COUNT-VARINT> [<ITEM-TAG> ...] - count followed by items
    OBJECT            | ","  |  44 | 0x2c | 0b00101100 | <STR-TAG> <ITEM-TAG> - class, object-item
    OBJECTV           | "-"  |  45 | 0x2d | 0b00101101 | <OFFSET-VARINT> <ITEM-TAG> - offset of previously used classname tag - object-item
    ALIAS             | "."  |  46 | 0x2e | 0b00101110 | <OFFSET-VARINT> - alias to item defined at offset
    COPY              | "/"  |  47 | 0x2f | 0b00101111 | <OFFSET-VARINT> - copy of item defined at offset
    WEAKEN            | "0"  |  48 | 0x30 | 0b00110000 | <REF-TAG> - Weaken the following reference
    REGEXP            | "1"  |  49 | 0x31 | 0b00110001 | <PATTERN-STR-TAG> <MODIFIERS-STR-TAG>
    OBJECT_FREEZE     | "2"  |  50 | 0x32 | 0b00110010 | <STR-TAG> <ITEM-TAG> - class, object-item. Need to call "THAW" method on class after decoding
    OBJECTV_FREEZE    | "3"  |  51 | 0x33 | 0b00110011 | <OFFSET-VARINT> <ITEM-TAG> - (OBJECTV_FREEZE is to OBJECT_FREEZE as OBJECTV is to OBJECT)
    RESERVED_0        | "4"  |  52 | 0x34 | 0b00110100 | reserved
    RESERVED_1        | "5"  |  53 | 0x35 | 0b00110101 |
    RESERVED_2        | "6"  |  54 | 0x36 | 0b00110110 |
    RESERVED_3        | "7"  |  55 | 0x37 | 0b00110111 |
    RESERVED_4        | "8"  |  56 | 0x38 | 0b00111000 |
    RESERVED_5        | "9"  |  57 | 0x39 | 0b00111001 | reserved
    FALSE             | ":"  |  58 | 0x3a | 0b00111010 | false (PL_sv_no)
    TRUE              | ";"  |  59 | 0x3b | 0b00111011 | true (PL_sv_yes)
    MANY              | "<"  |  60 | 0x3c | 0b00111100 | <LEN-VARINT> <TYPE-BYTE> <TAG-DATA> - repeated tag (not done yet, will be implemented in version 3)
    PACKET_START      | "="  |  61 | 0x3d | 0b00111101 | (first byte of magic string in header)
    EXTEND            | ">"  |  62 | 0x3e | 0b00111110 | <BYTE> - for additional tags
    PAD               | "?"  |  63 | 0x3f | 0b00111111 | (ignored tag, skip to next byte)
    ARRAYREF_0        | "\@" |  64 | 0x40 | 0b01000000 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    ARRAYREF_1        | "A"  |  65 | 0x41 | 0b01000001 |
    ARRAYREF_2        | "B"  |  66 | 0x42 | 0b01000010 |
    ARRAYREF_3        | "C"  |  67 | 0x43 | 0b01000011 |
    ARRAYREF_4        | "D"  |  68 | 0x44 | 0b01000100 |
    ARRAYREF_5        | "E"  |  69 | 0x45 | 0b01000101 |
    ARRAYREF_6        | "F"  |  70 | 0x46 | 0b01000110 |
    ARRAYREF_7        | "G"  |  71 | 0x47 | 0b01000111 |
    ARRAYREF_8        | "H"  |  72 | 0x48 | 0b01001000 |
    ARRAYREF_9        | "I"  |  73 | 0x49 | 0b01001001 |
    ARRAYREF_10       | "J"  |  74 | 0x4a | 0b01001010 |
    ARRAYREF_11       | "K"  |  75 | 0x4b | 0b01001011 |
    ARRAYREF_12       | "L"  |  76 | 0x4c | 0b01001100 |
    ARRAYREF_13       | "M"  |  77 | 0x4d | 0b01001101 |
    ARRAYREF_14       | "N"  |  78 | 0x4e | 0b01001110 |
    ARRAYREF_15       | "O"  |  79 | 0x4f | 0b01001111 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    HASHREF_0         | "P"  |  80 | 0x50 | 0b01010000 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    HASHREF_1         | "Q"  |  81 | 0x51 | 0b01010001 |
    HASHREF_2         | "R"  |  82 | 0x52 | 0b01010010 |
    HASHREF_3         | "S"  |  83 | 0x53 | 0b01010011 |
    HASHREF_4         | "T"  |  84 | 0x54 | 0b01010100 |
    HASHREF_5         | "U"  |  85 | 0x55 | 0b01010101 |
    HASHREF_6         | "V"  |  86 | 0x56 | 0b01010110 |
    HASHREF_7         | "W"  |  87 | 0x57 | 0b01010111 |
    HASHREF_8         | "X"  |  88 | 0x58 | 0b01011000 |
    HASHREF_9         | "Y"  |  89 | 0x59 | 0b01011001 |
    HASHREF_10        | "Z"  |  90 | 0x5a | 0b01011010 |
    HASHREF_11        | "["  |  91 | 0x5b | 0b01011011 |
    HASHREF_12        | "\\" |  92 | 0x5c | 0b01011100 |
    HASHREF_13        | "]"  |  93 | 0x5d | 0b01011101 |
    HASHREF_14        | "^"  |  94 | 0x5e | 0b01011110 |
    HASHREF_15        | "_"  |  95 | 0x5f | 0b01011111 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    SHORT_BINARY_0    | "`"  |  96 | 0x60 | 0b01100000 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
    SHORT_BINARY_1    | "a"  |  97 | 0x61 | 0b01100001 |
    SHORT_BINARY_2    | "b"  |  98 | 0x62 | 0b01100010 |
    SHORT_BINARY_3    | "c"  |  99 | 0x63 | 0b01100011 |
    SHORT_BINARY_4    | "d"  | 100 | 0x64 | 0b01100100 |
    SHORT_BINARY_5    | "e"  | 101 | 0x65 | 0b01100101 |
    SHORT_BINARY_6    | "f"  | 102 | 0x66 | 0b01100110 |
    SHORT_BINARY_7    | "g"  | 103 | 0x67 | 0b01100111 |
    SHORT_BINARY_8    | "h"  | 104 | 0x68 | 0b01101000 |
    SHORT_BINARY_9    | "i"  | 105 | 0x69 | 0b01101001 |
    SHORT_BINARY_10   | "j"  | 106 | 0x6a | 0b01101010 |
    SHORT_BINARY_11   | "k"  | 107 | 0x6b | 0b01101011 |
    SHORT_BINARY_12   | "l"  | 108 | 0x6c | 0b01101100 |
    SHORT_BINARY_13   | "m"  | 109 | 0x6d | 0b01101101 |
    SHORT_BINARY_14   | "n"  | 110 | 0x6e | 0b01101110 |
    SHORT_BINARY_15   | "o"  | 111 | 0x6f | 0b01101111 |
    SHORT_BINARY_16   | "p"  | 112 | 0x70 | 0b01110000 |
    SHORT_BINARY_17   | "q"  | 113 | 0x71 | 0b01110001 |
    SHORT_BINARY_18   | "r"  | 114 | 0x72 | 0b01110010 |
    SHORT_BINARY_19   | "s"  | 115 | 0x73 | 0b01110011 |
    SHORT_BINARY_20   | "t"  | 116 | 0x74 | 0b01110100 |
    SHORT_BINARY_21   | "u"  | 117 | 0x75 | 0b01110101 |
    SHORT_BINARY_22   | "v"  | 118 | 0x76 | 0b01110110 |
    SHORT_BINARY_23   | "w"  | 119 | 0x77 | 0b01110111 |
    SHORT_BINARY_24   | "x"  | 120 | 0x78 | 0b01111000 |
    SHORT_BINARY_25   | "y"  | 121 | 0x79 | 0b01111001 |
    SHORT_BINARY_26   | "z"  | 122 | 0x7a | 0b01111010 |
    SHORT_BINARY_27   | "{"  | 123 | 0x7b | 0b01111011 |
    SHORT_BINARY_28   | "|"  | 124 | 0x7c | 0b01111100 |
    SHORT_BINARY_29   | "}"  | 125 | 0x7d | 0b01111101 |
    SHORT_BINARY_30   | "~"  | 126 | 0x7e | 0b01111110 |
    SHORT_BINARY_31   |      | 127 | 0x7f | 0b01111111 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag

=for autoupdater stop

=head3 The Track Bit And Cyclic Data Structures

The protocol uses a combination of the offset of a tracked tag and the
flag bit to be able to encode and reconstruct cyclic structures in a single
pass.

An encoder must track duplicated items and generate the appropriate ALIAS or
REFP tags to reconstruct them, and when it does so, ensure that the high
bit of the original tag has been set.

When a decoder encounters a tag with its flag set, it will remember the
offset of the tag in the packet and the item that was decoded from
that tag. At a later point in the packet there will be an ALIAS or REFP
instruction which will refer to the item by its offset, and the decoder
will reuse it as needed.

=head3 The COPY Tag

Sometimes it is convenient to be able to reuse a previously emitted
sequence in the packet to reduce duplication, for instance in a data
structure with many hashes with the same keys. The COPY tag is used for
this. Its argument is a varint which is the offset of a previously
emitted tag, and decoders are to behave as though the tag it references
was inserted into the packet stream as a replacement for the COPY tag.

Note that in this case the track flag is B<not> set. It is assumed the
decoder can jump back to reread the tag from its location alone.

COPY tags are forbidden from referring to another COPY tag, and are also
forbidden from referring to anything containing a COPY tag, with the
exception that a COPY tag used as a value may refer to a tag that uses
a COPY tag for a classname or hash key.
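
To illustrate how a decoder might resolve COPY, here is a deliberately
tiny Python sketch. It is not a compliant decoder: it only understands
POS, ARRAYREF, SHORT_BINARY and COPY, the names C<decode_item> and
C<in_copy> are hypothetical, and it assumes the 1-based, body-relative
offsets of protocol version 2. Since it does not handle hashes or
objects, it rejects any COPY reached while resolving another COPY.

```python
from typing import Any, Tuple

COPY = 0x2F


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def decode_item(body: bytes, pos: int, in_copy: bool = False) -> Tuple[Any, int]:
    """Decode one item; return (value, position after the item)."""
    if pos >= len(body):
        raise ValueError("truncated document body")
    tag = body[pos] & 0x7F                # ignore the track flag
    pos += 1
    if tag <= 0x0F:                       # POS_0 .. POS_15
        return tag, pos
    if 0x40 <= tag <= 0x4F:               # ARRAYREF_0 .. ARRAYREF_15
        items = []
        for _ in range(tag & 0x0F):
            item, pos = decode_item(body, pos, in_copy)
            items.append(item)
        return items, pos
    if tag >= 0x60:                       # SHORT_BINARY_0 .. SHORT_BINARY_31
        end = pos + (tag & 0x1F)
        if end > len(body):
            raise ValueError("truncated string")
        return body[pos:end], end
    if tag == COPY:
        if in_copy:
            raise ValueError("COPY target must not contain a COPY")
        offset, pos = decode_varint(body, pos)
        target = offset - 1               # assumed 1-based body offset
        if not 0 <= target < len(body):
            raise ValueError("COPY offset out of range")
        # Decode the referenced tag as if it stood in place of the COPY.
        item, _ = decode_item(body, target, in_copy=True)
        return item, pos
    raise ValueError("tag 0x%02x is not handled in this sketch" % tag)
```

With the body bytes C<0x42 0x63 "foo" 0x2F 0x02>, the array holds the
string "foo" twice: the COPY refers to the SHORT_BINARY_3 tag at offset 2.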

=head3 String Types

Sereal supports three string representations. Two are "encodingless":
SHORT_BINARY and BINARY, where binary means "raw bytes". The other
is STR_UTF8, which is expected to contain valid canonical UTF-8 encoded
Unicode text data. Under normal circumstances a decoder is not expected
to validate that this is actually the case, and is allowed to simply
extract the data verbatim.

SHORT_BINARY stores the length of the string in the tag itself and is used
for strings of fewer than 32 bytes. Both BINARY and STR_UTF8
use a varint to indicate the number of B<bytes> (octets) in the string.
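
One way an encoder might choose between the three representations is
sketched below. The function name C<encode_string> and the policy of
mapping Python C<str> to STR_UTF8 and C<bytes> to the binary tags are
assumptions made for this example; a real encoder may apply different
rules.

```python
from typing import Union

BINARY = 0x26
STR_UTF8 = 0x27
SHORT_BINARY_0 = 0x60


def encode_varint(n: int) -> bytes:
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_string(value: Union[str, bytes]) -> bytes:
    if isinstance(value, str):
        data = value.encode("utf-8")
        # The varint counts bytes (octets), not characters.
        return bytes([STR_UTF8]) + encode_varint(len(data)) + data
    if len(value) < 32:
        # The length fits in the low 5 bits of the tag itself.
        return bytes([SHORT_BINARY_0 | len(value)]) + value
    return bytes([BINARY]) + encode_varint(len(value)) + value
```

Note that a one-character string such as "é" is written with a length of
2, because its UTF-8 encoding occupies two bytes.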

=head3 Hash Keys

Hash keys are always one of the string types, or a COPY tag referencing a
string.

=head3 Handling Objects

Objects are serialized as a class name and a tag which represents the
object's data. In Perl land this will always be a reference. Mapping Perl
objects to other languages is left to the future, but the OBJECT_FREEZE
and OBJECTV_FREEZE tags provide a basic method of doing that, see below.

Note that classnames MUST be a string, or a COPY tag referencing a string.

OBJECTV varints MUST reference a previously used classname, and not an
arbitrary string.

Sereal implementations may choose to allow authors of classes to provide
hooks for custom object serialization. Depending on the Sereal
implementation, this feature may require enabling with an encoder
option on the encoding side, but compliant decoders must
at least recognize the OBJECT_FREEZE and OBJECTV_FREEZE tags. The
interface shall be such that if enabled in the encoder, for each
object in the input, the encoder will invoke a C<FREEZE> method
on the object and pass in the string C<Sereal> to allow distinguishing
from other serializers (this is inspired by the CBOR::XS CBOR
implementation). If there is no C<FREEZE> method available, then
a normal OBJECT or OBJECTV tag is emitted, serializing the object
content normally. If invoked, the C<FREEZE> method must return
a single data structure that is serializable by Sereal. The encoder
shall emit an OBJECT_FREEZE or OBJECTV_FREEZE tag followed by
the Sereal encoding of the returned data structure.

Upon decoding OBJECT_FREEZE or OBJECTV_FREEZE, a compliant decoder
(unless explicitly instructed not to) will invoke the C<THAW>
class method of the given class. (Likely, implementations should
throw a fatal error if no such method exists.) Arguments to that
method will be the string C<Sereal> as first argument, and the
decoded data structure that was returned from the C<FREEZE> call.
The return value of that C<THAW> call needs
to be included in the final output structure. See the documentation
of the Perl Sereal implementation for examples of FREEZE/THAW methods.

=head1 PROTOCOL CHANGES

=head2 Protocol Version 2

In Sereal protocol version 2, offsets were changed from being relative to
the start of the document (including header) to being relative to the start
of the document body (i.e. excluding the document header). This means that
Sereal document bodies are now self-contained and relocatable within the
document. Note that the offset is 1-based, which means that to point to the
first byte of the body its value must be 1.

Additionally, protocol version 2 introduced the 8bit bit-field (8bit-BITFIELD)
in the variable-length/optional header part (OPT-SUFFIX) of the document
and the user-meta-data section (OPT-USER-META-DATA) of the variable-length header.

Protocol version 2 introduces the OBJECT_FREEZE and OBJECTV_FREEZE tags in
place of two previously reserved tags. The meaning and implementation of these
two tags is described in the L</"Handling Objects"> section of this document.
In a nutshell, they allow application developers to have custom hooks for
serializing and deserializing the instances of their classes.

=head1 NOTES ON IMPLEMENTATION

=head2 Encoding the Length of Compressed Documents

With Sereal body format type 2 (see above), you need to encode (as a varint)
the length of the Snappy-compressed document as a prefix to the document body.
This is somewhat tricky to do efficiently since, at first sight,
the amount of space required to encode a varint depends on the size of the
output. This means that you need to do the normal Sereal encoding of the
document body, then compress the output of that, then append the varint
encoded length of the compressed data to a Sereal header, then append the
compressed data. In this naive way of implementing Snappy compression
support, you may end up having to copy around the entire document up to three
times (and may end up having to allocate 3x the space, too). That is very
inefficient.

There is a better way, though, that's just a tiny bit subtle.
Thankfully, you have an upper bound on the
size of the compressed blob. It's the size of the uncompressed blob plus the
size of the Snappy header (a Snappy library call can tell you what that is in
practice). Before compressing, you reserve space for a varint
that is long enough to encode the upper limit on the compressed output size.
Then you point the compressor into the buffer right after the preallocated
varint. After compression, you'll know the real size of the compressed
blob. Now, you go back to the varint and fill it in. If the reserved
space for the varint is B<larger> than what you actually need, then
thanks to the way varints work, you can simply set the high bit on the
last byte of the varint, and continue to set the high bits of all following
padding bytes B<except the last>, which you set to 0 (NUL). For details
on why that works, please refer to the Google ProtoBuf documentation
referenced earlier. With this specially crafted varint, any normal
varint parsing function will treat it as a single varint and skip right
to the start of the Snappy-compressed blob. The varint is a correct
varint, just not in the canonical form. With this modified plan, you
should only need one extra malloc, and (beyond that which the Snappy
implementation does), no extra, large memcpy operations.
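
The padding trick can be sketched in Python as follows. The function
names C<encode_padded_varint> and C<decode_varint> are assumptions made
for this example, and the sketch only shows how the reserved bytes are
filled in; it does not call into Snappy.

```python
from typing import Tuple


def encode_padded_varint(n: int, width: int) -> bytes:
    """Encode n as a varint that occupies exactly `width` bytes."""
    if n < 0 or width < 1:
        raise ValueError("need a non-negative value and a positive width")
    out = bytearray()
    for _ in range(width - 1):
        # Every byte but the last carries the continuation bit, even
        # when its 7 payload bits are all zero (padding).
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    if n > 0x7F:
        raise ValueError("value does not fit in the reserved width")
    out.append(n)  # last byte: high bit clear, 0 (NUL) if only padding
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """An ordinary varint reader; it needs no knowledge of the padding."""
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7
```

For example, writing the length 300 into four reserved bytes yields
0xAC 0x82 0x80 0x00, which an ordinary reader decodes as 300 while
consuming all four bytes.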

=head1 AUTHOR

Yves Orton E<lt>demerphq@gmail.comE<gt>

Damian Gryski

Steffen Mueller E<lt>smueller@cpan.orgE<gt>

Rafaël Garcia-Suarez

Ævar Arnfjörð Bjarmason E<lt>avar@cpan.orgE<gt>

=head1 ACKNOWLEDGMENT

This protocol was originally developed for Booking.com. With approval
from Booking.com, this document was generalized and published on github
and CPAN, for which the authors would like to express their gratitude.

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2012, 2013 by Steffen Mueller

Copyright (C) 2012, 2013 by Yves Orton

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut