1 =pod
2
3 =encoding utf8
4
5 =head1 NAME
6
7 Sereal - Protocol definition
8
9 =head1 SYNOPSIS
10
11 This document describes the format and encoding of a Sereal data packet.
12
13 =head1 VERSION
14
15 This is the Sereal specification version 3.00.
16
17 The integer part of the document version corresponds to
18 the Sereal protocol version. For details on incompatible changes between
19 major protocol versions, see the L</"PROTOCOL CHANGES"> section below.
20
21 =head1 DESCRIPTION
22
23 A serialized structure is converted into a "document". A document is made
24 up of two parts, the document header and the document body.
25
26 =head2 General Points
27
28 =head3 Strictness
29
30 A compliant Sereal decoder must detect invalid documents and handle them
31 as a safe exception in the respective implementation language.
32 That is to say, without a crash or worse.
33
34 =head3 Little Endian
35
36 All numeric data is in little endian format.
37
38 =head3 IEEE Floats
39
40 Floating point types are in IEEE 754 format.
41
42 =head3 Varints
43
44 Heavy use is made of a variable length integer encoding commonly called
45 a "varint" (Google calls it a Varint128). This encoding uses the high bit
46 of each byte to signal that another byte of data follows; the last byte
47 always has the high bit off. The data is in little endian order, with the
48 low seven bits in the first byte, the next seven bits in the next byte,
49 and so on.
50
51 See L<Google's description|https://developers.google.com/protocol-buffers/docs/encoding#varints>.
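
The varint scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not taken from any Sereal implementation; the function names C<encode_varint> and C<decode_varint> are hypothetical.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, low seven bits first."""
    if value < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: another byte follows
        else:
            out.append(low7)         # high bit clear: this is the last byte
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_pos); raise ValueError on truncated input."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7
```

For example, 300 encodes to the two bytes 0xAC 0x02: the low seven bits (0x2C) with the continuation bit set, followed by the remaining bits (0x02).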
52
53 =head2 Document Header Format
54
55 A header consists of multiple components:
56
57 <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>
58
59 =head3 MAGIC
60
61 A "magic string" that identifies a document as being in the Sereal format.
62 In protocol version 1 and 2, the value of this string is "=srl",
63 and when decoded as an unsigned 32 bit integer on a little endian machine
64 has a value of 0x6C72733D. Versions 1 and 2 of the protocol require this
65 magic string.
66
67 In protocol version 3 the magic string has been changed to "=\xF3rl",
68 where \xF3 is "s" with the high bit set. The little endian integer form
69 of this string is 0x6C72F33D. Having a high bit set in the magic string
70 makes it easy to detect when a Sereal document has been accidentally
71 UTF-8 encoded because the \xF3 is translated to \xC3\xB3.
72
73 Decoders are required to support the magic string associated with the
74 protocol versions they can decode, so if a decoder can handle v1, v2, and
75 v3 then it should handle both magic strings.
76
77 It is an error to use a new magic header on a v1 or v2 packet, and it is
78 an error to use the old magic header in v3 or later.
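
As an illustration of the magic string rules, the following Python sketch classifies the first bytes of a candidate document, including the accidental UTF-8 case mentioned above. The constant and function names are hypothetical and the returned labels are arbitrary strings chosen for this example.

```python
import struct

MAGIC_V1_V2 = b"=srl"              # protocol versions 1 and 2
MAGIC_V3 = b"=\xf3rl"              # protocol version 3 ("s" with the high bit set)
MAGIC_V3_AS_UTF8 = b"=\xc3\xb3rl"  # what MAGIC_V3 turns into after UTF-8 encoding


def classify_magic(doc: bytes) -> str:
    """Classify the leading bytes of a candidate Sereal document."""
    if doc.startswith(MAGIC_V3):
        return "v3"
    if doc.startswith(MAGIC_V1_V2):
        return "v1-or-v2"
    if doc.startswith(MAGIC_V3_AS_UTF8):
        return "v3-accidentally-utf8-encoded"
    return "not-sereal"


def magic_as_uint32(magic: bytes) -> int:
    """Interpret the four magic bytes as a little endian unsigned integer."""
    return struct.unpack("<I", magic)[0]
```

C<magic_as_uint32> reproduces the integer forms quoted above: 0x6C72733D for the old magic string and 0x6C72F33D for the new one.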
79
80 =head3 VERSION-TYPE
81
82 A single byte, of which the high 4 bits are used to represent the "type"
83 of the document, and the low 4 bits used to represent the version of the
84 Sereal protocol the document complies with.
85
86 Up until now there have been versions 1, 2, and 3 of the Sereal protocol.
87 So the low four bits will hold one of those values.
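
Splitting the VERSION-TYPE byte and checking it against the magic string could look like the sketch below. The function names are hypothetical, and rejecting version 0 is an assumption of this example rather than something the text above states.

```python
from typing import Tuple


def split_version_type(byte: int) -> Tuple[int, int]:
    """Return (version, doc_type) from the VERSION-TYPE byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("VERSION-TYPE must be a single byte")
    version = byte & 0x0F   # low 4 bits: protocol version
    doc_type = byte >> 4    # high 4 bits: document type
    return version, doc_type


def check_header_start(doc: bytes) -> Tuple[int, int]:
    """Validate that the magic string matches the protocol version."""
    if len(doc) < 5:
        raise ValueError("too short to hold MAGIC and VERSION-TYPE")
    magic = doc[:4]
    version, doc_type = split_version_type(doc[4])
    if magic == b"=srl":
        if version not in (1, 2):
            raise ValueError("old magic string requires protocol version 1 or 2")
    elif magic == b"=\xf3rl":
        if version < 3:
            raise ValueError("new magic string requires protocol version 3 or later")
    else:
        raise ValueError("unknown magic string")
    return version, doc_type
```

A byte of 0x33 therefore means protocol version 3 with document type 3.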
88
89 Currently four types have been assigned, one of which (1) is no longer valid:
90
91 =over 4
92
93 =item 0
94
95 Raw Sereal format. The data can be processed verbatim.
96
97 =item 1
98
99 B<This is not a valid document type for Sereal protocol version 2 and up!>
100
101 In Sereal protocol version 1, this used to be
102 "Compressed Sereal format, using Google's Snappy compression internally."
103 It has long been advised to prefer I<2>, "incremental-decoding-enabled
104 compressed Sereal," wherever possible.
105
106 =item 2
107
108 Compressed Sereal format, using Google's Snappy compression internally as
109 format I<1>, but supporting incremental-parsing. Long preferred over
110 I<1> as this is considered a bug fix in the Snappy compression support.
111
112 The format is:
113
114 <Varint><Snappy Blob>
115
116 where the varint signifies the length of the Snappy-compressed blob
117 following it. See L</"NOTES ON IMPLEMENTATION"> below for a discussion on
118 how to implement this efficiently.
119
120 =item 3
121
122 Compressed Sereal format, using zlib compression. This uses framing similar
123 to that of the incremental Snappy compression (2):
124
125 <Varint><Varint><Zlib Blob>
126
127 where the first varint indicates the length of the uncompressed document,
128 the second varint indicates the length of the compressed document.
129 See L</"NOTES ON IMPLEMENTATION"> below for a discussion on
130 how to implement this efficiently.
131
132 This compression format is new in v3 of the specification.
133
134 =back
135
136 Additional compression types are envisaged and will be assigned type
137 numbers by the maintainers of the protocol.
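
The framing of document type 3 can be sketched with Python's standard C<zlib> module. This sketch assumes the blob is an ordinary zlib stream as produced by C<zlib.compress>; the helper names are hypothetical. Type 2 uses the same idea with a single length varint, but Snappy is not part of the Python standard library and is left out here.

```python
import zlib
from typing import Tuple


def encode_varint(value: int) -> bytes:
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def frame_zlib_body(body: bytes) -> bytes:
    """Build <Varint><Varint><Zlib Blob> for an uncompressed body."""
    blob = zlib.compress(body)
    return encode_varint(len(body)) + encode_varint(len(blob)) + blob


def unframe_zlib_body(buf: bytes, pos: int = 0) -> bytes:
    """Recover the uncompressed body, checking both declared lengths."""
    uncompressed_len, pos = read_varint(buf, pos)
    compressed_len, pos = read_varint(buf, pos)
    blob = buf[pos:pos + compressed_len]
    if len(blob) != compressed_len:
        raise ValueError("compressed blob is truncated")
    try:
        body = zlib.decompress(blob)
    except zlib.error as exc:
        raise ValueError("invalid zlib data") from exc
    if len(body) != uncompressed_len:
        raise ValueError("uncompressed length does not match header")
    return body
```

Knowing the uncompressed length up front lets a decoder allocate its output buffer once, and checking it afterwards is one way to satisfy the strictness requirement for invalid documents.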
138
139 =head3 HEADER-SUFFIX-SIZE
140
141 The structure of the header includes support for embedding additional data.
142 This is accomplished by specifying the length of the suffix
143 in the header with a varint. Headers with no suffix will set this to a
144 binary 0. This is intended for future format extensions that retain some
145 level of compatibility for old decoders (which know how to skip the
146 extended header due to the embedded length).
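
A decoder that does not understand the suffix can still locate the body by using the embedded length. The following Python sketch shows one way to do that; C<split_header> is a hypothetical name, and the function does not validate the magic string or version.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def split_header(doc: bytes) -> Tuple[bytes, int, bytes, int]:
    """Return (magic, version_type, suffix, body_start)."""
    if len(doc) < 6:
        raise ValueError("too short to hold a Sereal header")
    magic = doc[:4]
    version_type = doc[4]
    suffix_len, pos = read_varint(doc, 5)   # HEADER-SUFFIX-SIZE
    body_start = pos + suffix_len           # skip OPT-SUFFIX without parsing it
    if body_start > len(doc):
        raise ValueError("header suffix runs past end of document")
    return magic, version_type, doc[pos:body_start], body_start
```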
147
148 =head3 OPT-SUFFIX
149
150 The suffix may contain whatever data the encoder wishes to embed in the
151 header. In version 1 of the protocol the decoder never looked inside
152 this data. Later versions may introduce additional rules for this field.
153 Starting from version 2 of the protocol, this variable-length part of the header
154 may be empty or have the following format:
155
156 <8bit-BITFIELD> <OPT-USER-META-DATA>
157
158 =over 2
159
160 =item 8bit-BITFIELD
161
162 If not present, all bits are assumed off. In version 2 and 3 of the protocol,
163 only the least significant bit is meaningful: If set, the bitfield is
164 followed by the C<E<lt>OPT-USER-META-DATAE<gt>>. If not set, there is
165 no user meta data.
166
167 =item OPT-USER-META-DATA
168
169 If the least significant bit of the preceding bitfield is set, this
170 may be an arbitrary Sereal document body. Like any other Sereal
171 document body, it is self-contained and can be deserialized independently
172 from any other document bodies in the Sereal document. This document
173 body is NEVER compressed.
174
175 This is intended for embedding small amounts of meta data, such as
176 routing information, in a document that allows users to avoid
177 deserializing very large document bodies needlessly or having to
178 call into decompression logic.
179
180 =back
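
Interpreting the suffix of a version 2 or 3 header could be sketched as follows. The function name is hypothetical, and the sketch assumes that the user meta data occupies the remainder of the suffix after the bitfield.

```python
from typing import Optional, Tuple

USER_META_DATA_BIT = 0x01  # least significant bit of the 8bit-BITFIELD


def parse_suffix(suffix: bytes) -> Tuple[int, Optional[bytes]]:
    """Return (bitfield, user_meta_body); the body is None when absent."""
    if not suffix:
        return 0, None                  # no bitfield: all bits assumed off
    bitfield = suffix[0]
    if bitfield & USER_META_DATA_BIT:
        return bitfield, suffix[1:]     # an uncompressed Sereal document body
    return bitfield, None
```

The returned body can be handed to the same body decoder as the main document, without touching any decompression logic.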
181
182 =head2 Document Body Format
183
184 The body is made up of tagged data items:
185
186 <TAG> <OPT-DATA>
187
188 Tagged items can be containers that hold other tagged items.
189 At the top level, the body holds only ONE tagged item (often
190 an array or hash) that holds others.
191
192 =head3 TAG
193
194 A tag is a single byte which specifies the type of the data being decoded.
195
196 The high bit of each tag is used to signal to the decoder that the
197 deserialized data needs to be stored and tracked and will be reused again
198 elsewhere in the serialization. This is sometimes called the "track flag"
199 or the "F-bit" in code and documentation. Its status should be ignored
200 when processing a tag, meaning code should mask off the high bit and
201 only use the low 7 bits.
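
In code, separating the track flag from the tag is a pair of bit operations. A small Python sketch with hypothetical names:

```python
from typing import Tuple

TRACK_FLAG = 0x80  # high bit of the tag byte
TAG_MASK = 0x7F    # low 7 bits identify the tag


def split_tag(byte: int) -> Tuple[int, bool]:
    """Return (tag, tracked) for one tag byte."""
    return byte & TAG_MASK, bool(byte & TRACK_FLAG)
```

For instance, the byte 0xAB is the ARRAY tag (0x2B) with the track flag set.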
202
203 Some tags, such as POS, NEG and SHORT_BINARY contain embedded in them
204 either the data (in the case of POS and NEG) or the length of the
205 OPT-DATA section (in the case of SHORT_BINARY).
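
The embedded values can be extracted as in the Python sketch below, which follows the tag table in this document: POS tags carry their value directly, NEG tags map 16..31 to -16..-1, and SHORT_BINARY tags carry a length in the low 5 bits. The function name and the returned tuples are hypothetical.

```python
from typing import Optional, Tuple


def decode_embedded(byte: int) -> Optional[Tuple[str, int]]:
    """Return (kind, number) for tags that embed data, else None."""
    tag = byte & 0x7F                        # ignore the track flag
    if tag <= 0x0F:
        return "POS", tag                    # value is the tag itself
    if tag <= 0x1F:
        return "NEG", tag - 32               # NEG_16 (16) .. NEG_1 (31)
    if 0x60 <= tag <= 0x7F:
        return "SHORT_BINARY", tag & 0x1F    # length of the OPT-DATA bytes
    return None
```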
206
207 =head3 OPT-DATA
208
209 This field may contain an arbitrary set of bytes, either determined
210 implicitly by the tag (such as for FLOAT), explicitly in the tag (as in
211 SHORT_BINARY) or in a varint following the tag (such as for STRING).
212
213 When referring to an offset below, what's meant is a varint encoded
214 absolute integer byte position in the document body.
215 That is, an offset of 10 refers to the
216 tenth byte in the Sereal document body (ie. excluding its header).
217 Sereal version 1 used to mandate offsets from the start of the document
218 header.
219
220 =head3 Tags
221
222 =for autoupdater start
223
224
225 Tag | Char | Dec | Hex | Binary | Follow
226 ------------------+------+-----+------+----------- |-----------------------------------------
227 POS_0 | | 0 | 0x00 | 0b00000000 | small positive integer - value in low 4 bits (identity)
228 POS_1 | | 1 | 0x01 | 0b00000001 |
229 POS_2 | | 2 | 0x02 | 0b00000010 |
230 POS_3 | | 3 | 0x03 | 0b00000011 |
231 POS_4 | | 4 | 0x04 | 0b00000100 |
232 POS_5 | | 5 | 0x05 | 0b00000101 |
233 POS_6 | | 6 | 0x06 | 0b00000110 |
234 POS_7 | "\a" | 7 | 0x07 | 0b00000111 |
235 POS_8 | "\b" | 8 | 0x08 | 0b00001000 |
236 POS_9 | "\t" | 9 | 0x09 | 0b00001001 |
237 POS_10 | "\n" | 10 | 0x0a | 0b00001010 |
238 POS_11 | | 11 | 0x0b | 0b00001011 |
239 POS_12 | "\f" | 12 | 0x0c | 0b00001100 |
240 POS_13 | "\r" | 13 | 0x0d | 0b00001101 |
241 POS_14 | | 14 | 0x0e | 0b00001110 |
242 POS_15 | | 15 | 0x0f | 0b00001111 | small positive integer - value in low 4 bits (identity)
243 NEG_16 | | 16 | 0x10 | 0b00010000 | small negative integer - value in low 4 bits (k+32)
244 NEG_15 | | 17 | 0x11 | 0b00010001 |
245 NEG_14 | | 18 | 0x12 | 0b00010010 |
246 NEG_13 | | 19 | 0x13 | 0b00010011 |
247 NEG_12 | | 20 | 0x14 | 0b00010100 |
248 NEG_11 | | 21 | 0x15 | 0b00010101 |
249 NEG_10 | | 22 | 0x16 | 0b00010110 |
250 NEG_9 | | 23 | 0x17 | 0b00010111 |
251 NEG_8 | | 24 | 0x18 | 0b00011000 |
252 NEG_7 | | 25 | 0x19 | 0b00011001 |
253 NEG_6 | | 26 | 0x1a | 0b00011010 |
254 NEG_5 | "\e" | 27 | 0x1b | 0b00011011 |
255 NEG_4 | | 28 | 0x1c | 0b00011100 |
256 NEG_3 | | 29 | 0x1d | 0b00011101 |
257 NEG_2 | | 30 | 0x1e | 0b00011110 |
258 NEG_1 | | 31 | 0x1f | 0b00011111 | small negative integer - value in low 4 bits (k+32)
259 VARINT | " " | 32 | 0x20 | 0b00100000 | <VARINT> - Varint variable length integer
260 ZIGZAG | "!" | 33 | 0x21 | 0b00100001 | <ZIGZAG-VARINT> - Zigzag variable length integer
261 FLOAT | "\"" | 34 | 0x22 | 0b00100010 | <IEEE-FLOAT>
262 DOUBLE | "#" | 35 | 0x23 | 0b00100011 | <IEEE-DOUBLE>
263 LONG_DOUBLE | "\$" | 36 | 0x24 | 0b00100100 | <IEEE-LONG-DOUBLE>
264 UNDEF | "%" | 37 | 0x25 | 0b00100101 | None - Perl undef var; eg my $var= undef;
265 BINARY | "&" | 38 | 0x26 | 0b00100110 | <LEN-VARINT> <BYTES> - binary/(latin1) string
266 STR_UTF8 | "'" | 39 | 0x27 | 0b00100111 | <LEN-VARINT> <UTF8> - utf8 string
267 REFN | "(" | 40 | 0x28 | 0b00101000 | <ITEM-TAG> - ref to next item
268 REFP | ")" | 41 | 0x29 | 0b00101001 | <OFFSET-VARINT> - ref to previous item stored at offset
269 HASH | "*" | 42 | 0x2a | 0b00101010 | <COUNT-VARINT> [<KEY-TAG> <ITEM-TAG> ...] - count followed by key/value pairs
270 ARRAY | "+" | 43 | 0x2b | 0b00101011 | <COUNT-VARINT> [<ITEM-TAG> ...] - count followed by items
271 OBJECT | "," | 44 | 0x2c | 0b00101100 | <STR-TAG> <ITEM-TAG> - class, object-item
272 OBJECTV | "-" | 45 | 0x2d | 0b00101101 | <OFFSET-VARINT> <ITEM-TAG> - offset of previously used classname tag - object-item
273 ALIAS | "." | 46 | 0x2e | 0b00101110 | <OFFSET-VARINT> - alias to item defined at offset
274 COPY | "/" | 47 | 0x2f | 0b00101111 | <OFFSET-VARINT> - copy of item defined at offset
275 WEAKEN | "0" | 48 | 0x30 | 0b00110000 | <REF-TAG> - Weaken the following reference
276 REGEXP | "1" | 49 | 0x31 | 0b00110001 | <PATTERN-STR-TAG> <MODIFIERS-STR-TAG>
277 OBJECT_FREEZE | "2" | 50 | 0x32 | 0b00110010 | <STR-TAG> <ITEM-TAG> - class, object-item. Need to call "THAW" method on class after decoding
278 OBJECTV_FREEZE | "3" | 51 | 0x33 | 0b00110011 | <OFFSET-VARINT> <ITEM-TAG> - (OBJECTV_FREEZE is to OBJECT_FREEZE as OBJECTV is to OBJECT)
279 RESERVED_0 | "4" | 52 | 0x34 | 0b00110100 | reserved
280 RESERVED_1 | "5" | 53 | 0x35 | 0b00110101 |
281 RESERVED_2 | "6" | 54 | 0x36 | 0b00110110 |
282 RESERVED_3 | "7" | 55 | 0x37 | 0b00110111 |
283 RESERVED_4 | "8" | 56 | 0x38 | 0b00111000 | reserved
284 CANONICAL_UNDEF | "9" | 57 | 0x39 | 0b00111001 | undef (PL_sv_undef) - "the" Perl undef (see notes)
285 FALSE | ":" | 58 | 0x3a | 0b00111010 | false (PL_sv_no)
286 TRUE | ";" | 59 | 0x3b | 0b00111011 | true (PL_sv_yes)
287 MANY | "<" | 60 | 0x3c | 0b00111100 | <LEN-VARINT> <TYPE-BYTE> <TAG-DATA> - repeated tag (not done yet, will be implemented in version 3)
288 PACKET_START | "=" | 61 | 0x3d | 0b00111101 | (first byte of magic string in header)
289 EXTEND | ">" | 62 | 0x3e | 0b00111110 | <BYTE> - for additional tags
290 PAD | "?" | 63 | 0x3f | 0b00111111 | (ignored tag, skip to next byte)
291 ARRAYREF_0 | "\@" | 64 | 0x40 | 0b01000000 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
292 ARRAYREF_1 | "A" | 65 | 0x41 | 0b01000001 |
293 ARRAYREF_2 | "B" | 66 | 0x42 | 0b01000010 |
294 ARRAYREF_3 | "C" | 67 | 0x43 | 0b01000011 |
295 ARRAYREF_4 | "D" | 68 | 0x44 | 0b01000100 |
296 ARRAYREF_5 | "E" | 69 | 0x45 | 0b01000101 |
297 ARRAYREF_6 | "F" | 70 | 0x46 | 0b01000110 |
298 ARRAYREF_7 | "G" | 71 | 0x47 | 0b01000111 |
299 ARRAYREF_8 | "H" | 72 | 0x48 | 0b01001000 |
300 ARRAYREF_9 | "I" | 73 | 0x49 | 0b01001001 |
301 ARRAYREF_10 | "J" | 74 | 0x4a | 0b01001010 |
302 ARRAYREF_11 | "K" | 75 | 0x4b | 0b01001011 |
303 ARRAYREF_12 | "L" | 76 | 0x4c | 0b01001100 |
304 ARRAYREF_13 | "M" | 77 | 0x4d | 0b01001101 |
305 ARRAYREF_14 | "N" | 78 | 0x4e | 0b01001110 |
306 ARRAYREF_15 | "O" | 79 | 0x4f | 0b01001111 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
307 HASHREF_0 | "P" | 80 | 0x50 | 0b01010000 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
308 HASHREF_1 | "Q" | 81 | 0x51 | 0b01010001 |
309 HASHREF_2 | "R" | 82 | 0x52 | 0b01010010 |
310 HASHREF_3 | "S" | 83 | 0x53 | 0b01010011 |
311 HASHREF_4 | "T" | 84 | 0x54 | 0b01010100 |
312 HASHREF_5 | "U" | 85 | 0x55 | 0b01010101 |
313 HASHREF_6 | "V" | 86 | 0x56 | 0b01010110 |
314 HASHREF_7 | "W" | 87 | 0x57 | 0b01010111 |
315 HASHREF_8 | "X" | 88 | 0x58 | 0b01011000 |
316 HASHREF_9 | "Y" | 89 | 0x59 | 0b01011001 |
317 HASHREF_10 | "Z" | 90 | 0x5a | 0b01011010 |
318 HASHREF_11 | "[" | 91 | 0x5b | 0b01011011 |
319 HASHREF_12 | "\\" | 92 | 0x5c | 0b01011100 |
320 HASHREF_13 | "]" | 93 | 0x5d | 0b01011101 |
321 HASHREF_14 | "^" | 94 | 0x5e | 0b01011110 |
322 HASHREF_15 | "_" | 95 | 0x5f | 0b01011111 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
323 SHORT_BINARY_0 | "`" | 96 | 0x60 | 0b01100000 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
324 SHORT_BINARY_1 | "a" | 97 | 0x61 | 0b01100001 |
325 SHORT_BINARY_2 | "b" | 98 | 0x62 | 0b01100010 |
326 SHORT_BINARY_3 | "c" | 99 | 0x63 | 0b01100011 |
327 SHORT_BINARY_4 | "d" | 100 | 0x64 | 0b01100100 |
328 SHORT_BINARY_5 | "e" | 101 | 0x65 | 0b01100101 |
329 SHORT_BINARY_6 | "f" | 102 | 0x66 | 0b01100110 |
330 SHORT_BINARY_7 | "g" | 103 | 0x67 | 0b01100111 |
331 SHORT_BINARY_8 | "h" | 104 | 0x68 | 0b01101000 |
332 SHORT_BINARY_9 | "i" | 105 | 0x69 | 0b01101001 |
333 SHORT_BINARY_10 | "j" | 106 | 0x6a | 0b01101010 |
334 SHORT_BINARY_11 | "k" | 107 | 0x6b | 0b01101011 |
335 SHORT_BINARY_12 | "l" | 108 | 0x6c | 0b01101100 |
336 SHORT_BINARY_13 | "m" | 109 | 0x6d | 0b01101101 |
337 SHORT_BINARY_14 | "n" | 110 | 0x6e | 0b01101110 |
338 SHORT_BINARY_15 | "o" | 111 | 0x6f | 0b01101111 |
339 SHORT_BINARY_16 | "p" | 112 | 0x70 | 0b01110000 |
340 SHORT_BINARY_17 | "q" | 113 | 0x71 | 0b01110001 |
341 SHORT_BINARY_18 | "r" | 114 | 0x72 | 0b01110010 |
342 SHORT_BINARY_19 | "s" | 115 | 0x73 | 0b01110011 |
343 SHORT_BINARY_20 | "t" | 116 | 0x74 | 0b01110100 |
344 SHORT_BINARY_21 | "u" | 117 | 0x75 | 0b01110101 |
345 SHORT_BINARY_22 | "v" | 118 | 0x76 | 0b01110110 |
346 SHORT_BINARY_23 | "w" | 119 | 0x77 | 0b01110111 |
347 SHORT_BINARY_24 | "x" | 120 | 0x78 | 0b01111000 |
348 SHORT_BINARY_25 | "y" | 121 | 0x79 | 0b01111001 |
349 SHORT_BINARY_26 | "z" | 122 | 0x7a | 0b01111010 |
350 SHORT_BINARY_27 | "{" | 123 | 0x7b | 0b01111011 |
351 SHORT_BINARY_28 | "|" | 124 | 0x7c | 0b01111100 |
352 SHORT_BINARY_29 | "}" | 125 | 0x7d | 0b01111101 |
353 SHORT_BINARY_30 | "~" | 126 | 0x7e | 0b01111110 |
354 SHORT_BINARY_31 | | 127 | 0x7f | 0b01111111 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
355
356 =for autoupdater stop
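
The table lists a ZIGZAG tag without describing the mapping. The Python sketch below assumes the usual zigzag scheme known from Protocol Buffers, in which signed integers are interleaved (0, -1, 1, -2, ...) so that small negative numbers become small unsigned numbers before varint encoding. That this is the mapping Sereal uses is an assumption of this example, and the function names are hypothetical.

```python
def zigzag_encode(n: int) -> int:
    """Map a signed integer to a non-negative one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode for non-negative input."""
    if z < 0:
        raise ValueError("zigzag values are non-negative")
    return (z >> 1) ^ -(z & 1)
```

The non-negative result would then be written as an ordinary varint after the tag.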
357
358 =head3 The Track Bit And Cyclic Data Structures
359
360 The protocol uses a combination of the offset of a tracked tag and the
361 flag bit to be able to encode and reconstruct cyclic structures in a single
362 pass.
363
364 An encoder must track duplicated items and generate the appropriate ALIAS or
365 REFP tags to reconstruct them, and when it does so ensure that the high
366 bit of the original tag has been set.
367
368 When a decoder encounters a tag with its flag set it will remember the
369 offset of the tag in the output packet and the item that was decoded from
370 that tag. At a later point in the packet there may be an ALIAS or REFP
371 instruction which will refer to the item by its offset, and the decoder
372 will reuse it as needed.
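
The decoder side of this bookkeeping can be sketched as a table from offsets to decoded items. The Python sketch below is deliberately tiny: it handles only POS, SHORT_BINARY, ARRAYREF and ALIAS tags, uses 1-based body offsets as described under Protocol Version 2, and lets Python lists stand in for Perl arrays. The function name is hypothetical, and the example shows plain reuse of a tracked item rather than a full cyclic structure.

```python
def decode_body(body: bytes):
    tracked = {}   # 1-based body offset -> item decoded from a flagged tag
    pos = 0        # 0-based index of the next unread byte

    def read_varint() -> int:
        nonlocal pos
        value = shift = 0
        while True:
            if pos >= len(body):
                raise ValueError("truncated varint")
            byte = body[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return value
            shift += 7

    def read_item():
        nonlocal pos
        if pos >= len(body):
            raise ValueError("truncated body")
        offset = pos + 1                          # offsets are 1-based
        byte = body[pos]
        pos += 1
        tag, track = byte & 0x7F, bool(byte & 0x80)
        if 0x40 <= tag <= 0x4F:                   # ARRAYREF_n
            item = []
            if track:
                tracked[offset] = item            # register before the children
            for _ in range(tag & 0x0F):
                item.append(read_item())
            return item
        if tag == 0x2E:                           # ALIAS <OFFSET-VARINT>
            target = read_varint()
            if target not in tracked:
                raise ValueError("ALIAS refers to an untracked offset")
            return tracked[target]
        if tag <= 0x0F:                           # POS_n
            item = tag
        elif 0x60 <= tag <= 0x7F:                 # SHORT_BINARY_n
            end = pos + (tag & 0x1F)
            if end > len(body):
                raise ValueError("string runs past end of body")
            item = body[pos:end]
            pos = end
        else:
            raise ValueError("tag not handled in this sketch")
        if track:
            tracked[offset] = item
        return item

    return read_item()
```

Decoding the six bytes 0x42 0xE2 "ab" 0x2E 0x02 yields a two-element list whose second element is the very same object as the first, because the ALIAS refers back to the flagged SHORT_BINARY tag at offset 2.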
373
374 =head3 The COPY Tag
375
376 Sometimes it is convenient to be able to reuse a previously emitted
377 sequence in the packet to reduce duplication, for instance in a data
378 structure with many hashes with the same keys. The COPY tag is used for
379 this. Its argument is a varint which is the offset of a previously
380 emitted tag, and decoders are to behave as though the tag it references
381 was inserted into the packet stream as a replacement for the COPY tag.
382
383 Note that in this case the track flag is B<not> set. It is assumed the
384 decoder can jump back to reread the tag from its location alone.
385
386 COPY tags are forbidden from referring to another COPY tag, and are also
387 forbidden from referring to anything containing a COPY tag, with the
388 exception that a COPY tag used as a value may refer to a tag that uses
389 a COPY tag for a classname or hash key.
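
Resolving a COPY tag amounts to jumping back and decoding the referenced tag in place. The Python sketch below does this for the common case of a repeated hash key stored as SHORT_BINARY; the function names are hypothetical, offsets are treated as 1-based positions in the body, and only the "COPY must not refer to another COPY" rule is enforced.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_key(body: bytes, pos: int, allow_copy: bool = True) -> Tuple[bytes, int]:
    """Read a SHORT_BINARY key or a COPY of one; return (key, next_pos)."""
    if pos >= len(body):
        raise ValueError("truncated body")
    tag = body[pos] & 0x7F
    if 0x60 <= tag <= 0x7F:                      # SHORT_BINARY_n
        start = pos + 1
        end = start + (tag & 0x1F)
        if end > len(body):
            raise ValueError("string runs past end of body")
        return body[start:end], end
    if tag == 0x2F:                              # COPY <OFFSET-VARINT>
        if not allow_copy:
            raise ValueError("COPY must not refer to another COPY")
        target, next_pos = read_varint(body, pos + 1)
        if not 1 <= target <= pos:
            raise ValueError("COPY must refer to an earlier tag")
        key, _ = read_key(body, target - 1, allow_copy=False)
        return key, next_pos                     # continue after the COPY itself
    raise ValueError("tag not handled in this sketch")
```

Note that decoding resumes after the COPY tag and its varint, not after the bytes that were reread.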
390
391 =head3 String Types
392
393 Sereal supports three string representations. Two are "encodingless" and
394 are SHORT_BINARY and BINARY, where binary means "raw bytes". The other
395 is STR_UTF8, which is expected to contain valid canonical UTF-8 encoded
396 Unicode text data. Under normal circumstances a decoder is not expected
397 to validate that this is actually the case, and is allowed to simply
398 extract the data verbatim.
399
400 SHORT_BINARY stores the length of the string in the tag itself and is used
401 for strings less than 32 bytes long. Both BINARY and STR_UTF8
402 use a varint to indicate the number of B<bytes> (octets) in the string.
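
Reading the three string representations could be sketched in Python as follows. The function name is hypothetical. Unlike what the specification requires, this sketch does validate STR_UTF8 data, simply because Python's C<bytes.decode> does so when producing a C<str>.

```python
from typing import Tuple, Union


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_string_item(body: bytes, pos: int) -> Tuple[Union[bytes, str], int]:
    """Decode SHORT_BINARY, BINARY or STR_UTF8; return (value, next_pos)."""
    if pos >= len(body):
        raise ValueError("truncated body")
    tag = body[pos] & 0x7F
    pos += 1
    if 0x60 <= tag <= 0x7F:                 # SHORT_BINARY: length in the tag
        length, is_utf8 = tag & 0x1F, False
    elif tag in (0x26, 0x27):               # BINARY / STR_UTF8: length varint
        length, pos = read_varint(body, pos)
        is_utf8 = tag == 0x27
    else:
        raise ValueError("not a string tag")
    end = pos + length                      # length counts bytes, not characters
    if end > len(body):
        raise ValueError("string runs past end of body")
    raw = body[pos:end]
    try:
        return (raw.decode("utf-8") if is_utf8 else raw), end
    except UnicodeDecodeError as exc:
        raise ValueError("invalid UTF-8 in STR_UTF8") from exc
```

A STR_UTF8 item holding the single character U+00E9 has a length varint of 2, since that character occupies two bytes in UTF-8.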
403
404 =head3 Hash Keys
405
406 Hash keys are always one of the string types, or a COPY tag referencing a
407 string.
408
409 =head3 Handling Objects
410
411 Objects are serialized as a class name and a tag which represents the
412 object's data. In Perl land this will always be a reference. Mapping Perl
413 objects to other languages is left to the future, but the OBJECT_FREEZE
414 and OBJECTV_FREEZE tags provide a basic method of doing that, see below.
415
416 Note that classnames MUST be a string, or a COPY tag referencing a string.
417
418 OBJECTV varints MUST reference a previously used classname, and not an
419 arbitrary string.
420
421 Sereal implementations may choose to allow authors of classes to provide
422 hooks for custom object serialization. Depending on the Sereal
423 implementation, this feature may require enabling with an encoder
424 option on the encoding side, but compliant decoders must
425 at least recognize the OBJECT_FREEZE and OBJECTV_FREEZE tags. The
426 interface shall be such that if enabled in the encoder, for each
427 object in the input that has a C<FREEZE> method, the encoder will invoke
428 said C<FREEZE> method on the object and pass in the string C<Sereal>
429 to allow distinguishing from other serializers (this is inspired by
430 the C<CBOR::XS> CBOR implementation). If there is no C<FREEZE> method
431 available, then a normal OBJECT or OBJECTV tag is emitted, serializing
432 the object content deeply. If invoked, the C<FREEZE> method must return
433 a list of data structures that are serializable by Sereal. The encoder
434 shall emit an OBJECT_FREEZE or OBJECTV_FREEZE tag followed by a reference
435 (REFN) to an array (ARRAY) of the Sereal-encoded data structures that
436 were returned from C<FREEZE>.
437
438 Upon decoding OBJECT_FREEZE or OBJECTV_FREEZE, a compliant decoder
439 (unless explicitly instructed not to) will invoke the C<THAW>
440 class method of the given class. (Likely, implementations should
441 throw a fatal error if no such method exists for a class referenced
442 by OBJECT(V)_FREEZE.) Arguments to that method will be the string
443 C<Sereal> as first argument, and then the decoded data structures
444 that were returned from the C<FREEZE> call.
445 The return value of that C<THAW> call needs
446 to be included in the final output structure. See the documentation
447 of the Perl Sereal implementation for examples of FREEZE/THAW methods.
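
The calling convention of the hooks can be illustrated outside Perl as well. The Python sketch below is only an analogy: tuples stand in for the encoded OBJECT and OBJECT_FREEZE items, and the class C<Point>, the helper functions and the C<classes> lookup table are all hypothetical.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def FREEZE(self, serializer):
        # Called by the encoder with the string "Sereal"; returns a list
        # of plain, serializable data structures.
        return [self.x, self.y]

    @classmethod
    def THAW(cls, serializer, *data):
        # Called by the decoder with "Sereal" followed by the data that
        # FREEZE returned; the result goes into the output structure.
        return cls(*data)


def encode_object(obj):
    """Emit an OBJECT_FREEZE record if the object has a FREEZE hook."""
    freeze = getattr(obj, "FREEZE", None)
    if freeze is None:
        return ("OBJECT", type(obj).__name__, dict(vars(obj)))
    return ("OBJECT_FREEZE", type(obj).__name__, list(freeze("Sereal")))


def decode_object(record, classes):
    """Rebuild an object; OBJECT_FREEZE requires a THAW class method."""
    kind, class_name, payload = record
    cls = classes[class_name]
    if kind == "OBJECT_FREEZE":
        thaw = getattr(cls, "THAW", None)
        if thaw is None:
            raise TypeError("class %s has no THAW method" % class_name)
        return thaw("Sereal", *payload)
    obj = cls.__new__(cls)
    obj.__dict__.update(payload)
    return obj
```

Passing the serializer name to both hooks is what allows one class to support several serializers with a single pair of methods.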
448
449 =head3 Dealing with undefined values
450
451 The concept of undef is a little tricky in Perl. A variable may be
452 undefined; in addition, there is a definitive "undef" which Perl
453 uses in many situations. This definitive "undef" is a globally shared,
454 immutable value. Its use is vaguely equivalent to aliasing the same,
455 read-only copy of a Perl value that happens to be undefined.
456
457 The difference can be illustrated with the following code:
458
459 my $x;
460 print +(\$x == \$x) ? "same" : "different", "\n";
461 print +(\undef == \undef) ? "same" : "different", "\n";
462 print +(\$x == \undef) ? "same" : "different", "\n";
463
464 which should print out
465
466 same
467 same
468 different
469
470 In protocol versions 1 and 2 it was not possible to represent both forms
471 of undef correctly, and Sereal defaulted to the "undefined variable"
472 interpretation represented by the UNDEF tag in most situations.
473
474 As of protocol version 3 the CANONICAL_UNDEF tag is used to handle this special
475 case of undef so that Perl data structures can round trip properly.
476 Other languages are free to treat CANONICAL_UNDEF and UNDEF as is appropriate
477 to their language semantics:
478 If there is an equivalent to this globally shared undefined value (PL_sv_undef in Perl's
479 implementation) then they should map CANONICAL_UNDEF
480 accordingly, otherwise they are free to treat CANONICAL_UNDEF the same as UNDEF.
481
482 =head1 PROTOCOL CHANGES
483
484 =head2 Protocol Version 3
485
486 In Sereal protocol version 3, the magic string has been changed to make it
487 easier to detect UTF-8 encoded data by setting the high bit on the 's'
488 character, thus changing the older "=srl" to "=\xF3rl". Encoders generating
489 version 3 of the protocol or later must use the new header, and encoders
490 generating version 1 or 2 of the protocol must use the old header.
cdd2f93 @tsee Spec: Describe v3 spec change wrt. magic string
tsee authored
491
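The magic string rule can be illustrated with a minimal Python sketch. It assumes the header layout described earlier in this document (four magic bytes followed by a byte whose low four bits carry the protocol version); the names C<check_magic>, C<MAGIC_V1_V2>, C<MAGIC_V3> and C<MAGIC_V3_UTF8> are hypothetical and not part of any Sereal implementation.

```python
MAGIC_V1_V2 = b"=srl"      # magic string for protocol versions 1 and 2
MAGIC_V3 = b"=\xf3rl"      # magic string for protocol version 3 and later
# What the v3 magic turns into if the binary document is mistakenly
# UTF-8 encoded: the byte 0xF3 becomes the two bytes 0xC3 0xB3.
MAGIC_V3_UTF8 = "=\xf3rl".encode("utf-8")


def check_magic(doc: bytes) -> int:
    """Return the protocol version if magic and version agree, else raise."""
    if doc.startswith(MAGIC_V3_UTF8):
        raise ValueError("document appears to have been UTF-8 encoded")
    if len(doc) < 5:
        raise ValueError("document too short to hold a header")
    magic = doc[:4]
    version = doc[4] & 0x0F  # assumption: version lives in the low nibble
    if magic == MAGIC_V3 and version >= 3:
        return version
    if magic == MAGIC_V1_V2 and version in (1, 2):
        return version
    raise ValueError("magic string and protocol version do not match")
```

Because the high bit is set on one magic byte, an accidental UTF-8 encoding changes the first bytes of the document in a recognizable way, which the old all-ASCII "=srl" could not reveal.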
492 Also new is the "zlib" compression (document type 3). As detailed above, its
493 structure is
494
495 <Varint><Varint><Zlib Blob>
496
497 where the first varint indicates the length of the uncompressed document
498 and the second varint indicates the length of the compressed document.
499
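As a rough sketch of how a decoder might consume this layout, the following Python uses the standard C<zlib> module. It assumes the blob is in the regular zlib container format (not raw deflate); the function names C<read_varint>, C<write_varint>, C<pack_zlib_body> and C<unpack_zlib_body> are hypothetical.

```python
import zlib


def write_varint(value: int) -> bytes:
    """Encode a non-negative integer as a canonical varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)
    return bytes(out)


def read_varint(buf: bytes, pos: int = 0):
    """Decode a varint at pos; return (value, position after the varint)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def pack_zlib_body(body: bytes) -> bytes:
    """Build <Varint><Varint><Zlib Blob> from an uncompressed body."""
    blob = zlib.compress(body)
    return write_varint(len(body)) + write_varint(len(blob)) + blob


def unpack_zlib_body(payload: bytes) -> bytes:
    """Parse <Varint><Varint><Zlib Blob> and return the uncompressed body."""
    uncompressed_len, pos = read_varint(payload, 0)
    compressed_len, pos = read_varint(payload, pos)
    blob = payload[pos:pos + compressed_len]
    if len(blob) != compressed_len:
        raise ValueError("truncated zlib blob")
    body = zlib.decompress(blob)
    if len(body) != uncompressed_len:
        raise ValueError("uncompressed length does not match header")
    return body
```

Knowing the uncompressed length up front lets a decoder allocate the output buffer once, and checking both lengths gives it a cheap way to reject damaged documents.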
500 Additionally there is the new CANONICAL_UNDEF tag, used to represent Perl's
501 canonical, shared undefined value (PL_sv_undef) in certain edge cases.
502 See L</"Dealing with undefined values"> for details.
503
504 =head2 Protocol Version 2
505
506 In Sereal protocol version 2, offsets were changed from being relative to
507 the start of the document (including header) to being relative to the start
508 of the document body (i.e. excluding the document header). This means that
509 Sereal document bodies are now self-contained and relocatable within the document.
510 Note that the offset is 1-based, which means that to point to the first byte
511 of the body its value must be 1.
512
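The 1-based, body-relative offset rule can be made concrete with a small Python sketch. The helper name C<body_offset_to_index> and the placeholder header bytes are hypothetical; only the arithmetic reflects the text above.

```python
def body_offset_to_index(offset: int, body_start: int) -> int:
    """Map a 1-based, body-relative offset to a 0-based index into the document.

    body_start is the 0-based index of the first body byte, i.e. the
    length of the document header.
    """
    if offset < 1:
        raise ValueError("body-relative offsets are 1-based")
    return body_start + offset - 1


# The same body behind two headers of different length (placeholder bytes,
# not real Sereal headers): offset 1 always lands on the first body byte.
body = b"BODY-BYTES"
for header in (b"HDR", b"A-MUCH-LONGER-HEADER"):
    doc = header + body
    assert doc[body_offset_to_index(1, len(header))] == body[0]
```

Since the offset never depends on the header length, a body can be moved behind a different header without rewriting any offsets inside it.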
513 Additionally, protocol version 2 introduced the 8bit bit-field (8bit-BITFIELD)
514 in the variable-length/optional header part (OPT-SUFFIX) of the document
515 and the user-meta-data section (OPT-USER-META-DATA) of the variable-length header.
516
517 Protocol version 2 introduces the OBJECT_FREEZE and OBJECTV_FREEZE tags in
518 place of two previously reserved tags. The meaning and implementation of these
519 two tags is described in the L</"Handling Objects"> section of this document.
520 In a nutshell, it allows application developers to have custom hooks for
521 serializing and deserializing the instances of their classes.
522
523 =head1 NOTES ON IMPLEMENTATION
524
525 =head2 Encoding the Length of Compressed Documents
526
527 With Sereal body format type 2 (see above), you need to encode (as a varint)
528 the length of the Snappy-compressed document as a prefix to the document body.
529 This is somewhat tricky to do efficiently since at first sight,
530 the amount of space required to encode a varint depends on the size of the
531 output. This means that you need to do the normal Sereal-encoding of the
532 document body, then compress the output of that, then append the varint
533 encoded length of the compressed data to a Sereal header, then append the
534 compressed data. In this naive way of implementing this Snappy compression
535 support, you may end up having to copy around the entire document up to three
536 times (and may end up having to allocate 3x the space, too). That is very
537 inefficient.
538
539 There is a better way, though, that is just a tiny bit subtle.
540 Thankfully, you have an upper bound on the
541 size of the compressed blob: the size of the uncompressed blob plus a bounded
542 overhead (a Snappy library call can tell you what that bound is in
543 practice). Before compressing, you reserve space for a varint
544 that is long enough to encode an integer that is big enough to represent
545 the upper limit on the compressed output size. Then you
546 point the compressor into the buffer right after the reserved
547 varint. After compression, you'll know the real size of the compressed
548 blob. Now, you go back to the varint and fill it in. If the reserved
549 space for the varint is B<larger> than what you actually need, then
550 thanks to the way varints work, you can simply set the high bit on the
551 last byte of the varint, and fill all following padding bytes with 0x80
552 (high bit only) B<except the last>, which you set to 0 (NUL). For details
553 on why that works, please refer to the Google ProtoBuf documentation
554 referenced earlier. With this specially crafted varint, any normal
555 varint parsing function will treat it as a single varint and skip right
556 to the start of the Snappy-compressed blob. The varint is a correct
557 varint, just not in the canonical form. With this modified plan, you
558 should need only one extra malloc and, beyond what the Snappy
559 implementation itself does, no extra large memcpy operations.
560
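The padded-varint trick can be sketched in Python as follows. This is an illustration under stated assumptions, not the reference implementation: the function names are hypothetical, C<compress> stands in for whatever compressor is used, and a real C implementation would have the compressor write directly into the buffer after the reserved bytes rather than concatenating.

```python
def encode_varint(value: int) -> bytes:
    """Canonical (shortest) varint encoding of a non-negative integer."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_padded_varint(value: int, width: int) -> bytes:
    """Encode value in exactly width bytes using non-canonical padding."""
    if len(encode_varint(value)) > width:
        raise ValueError("value does not fit in the reserved width")
    out = bytearray()
    for _ in range(width - 1):
        # Every byte but the last carries the continuation bit; once the
        # value is exhausted these become 0x80 padding bytes.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value & 0x7F)  # final byte has the high bit clear
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0):
    """Ordinary varint decoder; returns (value, position after the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def frame_with_backfilled_length(compress, body: bytes, max_compressed_len: int) -> bytes:
    """Reserve a worst-case varint, append the compressed data, backfill the length."""
    width = len(encode_varint(max_compressed_len))
    out = bytearray(width)   # reserved space for the length varint
    out += compress(body)    # stand-in for compressing in place
    out[:width] = encode_padded_varint(len(out) - width, width)
    return bytes(out)
```

For example, encoding 300 into five reserved bytes yields C<AC 82 80 80 00>, which an ordinary decoder reads as 300 while consuming all five bytes, so the compressed data that follows never has to be moved.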
561 =head1 AUTHOR
562
563 Yves Orton E<lt>demerphq@gmail.comE<gt>
564
565 Damian Gryski
566
567 Steffen Mueller E<lt>smueller@cpan.orgE<gt>
568
569 Rafaël Garcia-Suarez
570
571 Ævar Arnfjörð Bjarmason E<lt>avar@cpan.orgE<gt>
572
573 =head1 ACKNOWLEDGMENT
574
575 This protocol was originally developed for Booking.com. With approval
576 from Booking.com, this document was generalized and published on GitHub
577 and CPAN, for which the authors would like to express their gratitude.
578
579 =head1 COPYRIGHT AND LICENSE
580
581 Copyright (C) 2012, 2013, 2014 by Steffen Mueller
582
583 Copyright (C) 2012, 2013, 2014 by Yves Orton
584
585 This library is free software; you can redistribute it and/or modify
586 it under the same terms as Perl itself.
587
588 =cut
589