=pod

=encoding utf8

=head1 NAME

Sereal - Protocol definition

=head1 SYNOPSIS

This document describes the format and encoding of a Sereal data packet.

=head1 VERSION

This is the Sereal specification version 2.00.

The integer part of the document version corresponds to
the Sereal protocol version. For details on incompatible changes between
major protocol versions, see L</"PROTOCOL CHANGES"> below.

=head1 DESCRIPTION

A serialized structure is converted into a "document". A document is made
up of two parts, the document header and the document body.

=head2 General Points

=head3 Little Endian

All numeric data is in little endian format.

=head3 IEEE Floats

Floating point types are in IEEE 754 format.

=head3 Varints

Heavy use is made of a variable length integer encoding commonly called
a "varint" (Google calls it a Varint128). This encoding uses the high bit
of each byte to signal that another byte of data follows; the last byte
always has the high bit off. The data is in little endian order, with the
low seven bits in the first byte, the next seven in the following byte,
and so on.

See L<Google's description|https://developers.google.com/protocol-buffers/docs/encoding#varints>.
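
The varint encoding described above can be sketched in Python as follows.
This is a minimal illustration, not code from any Sereal implementation;
the names C<encode_varint> and C<decode_varint> are chosen for this example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a varint (low seven bits first)."""
    if value < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: another byte follows
        value >>= 7
    out.append(value)                      # last byte has the high bit off
    return bytes(out)


def decode_varint(buf, pos=0):
    """Decode a varint starting at buf[pos]; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:                # high bit off: this was the last byte
            return value, pos
        shift += 7


print(encode_varint(300).hex())            # prints "ac02"
print(decode_varint(b"\xac\x02"))          # prints "(300, 2)"
```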

=head2 Header Format

A header consists of multiple components:

    <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>

=head3 MAGIC

A "magic string" that identifies a document as being in the Sereal format.
The value of this string is "=srl", and when decoded as an unsigned 32 bit
integer on a little endian machine has a value of 0x6c72733d.
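
The relationship between the magic string and its integer value can be
checked with a few lines of Python. This is only a worked example; the
variable names are made up for the illustration.

```python
import struct

MAGIC = b"=srl"

# Interpret the four magic bytes as a little-endian unsigned 32-bit integer.
(magic_as_u32,) = struct.unpack("<I", MAGIC)

print(hex(magic_as_u32))  # prints "0x6c72733d"
```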

=head3 VERSION-TYPE

A single byte, of which the high 4 bits are used to represent the "type"
of the document, and the low 4 bits are used to represent the version of
the Sereal protocol the document complies with.

Up until now there have been versions 1 and 2 of the Sereal protocol.
So the low four bits will be 1 or 2.

Currently only three types are defined:

=over 4

=item 0

Raw Sereal format. The data can be processed verbatim.

=item 1

B<This is not a valid document type for Sereal protocol version 2!>

In Sereal protocol version 1, this used to be
"Compressed Sereal format, using Google's Snappy compression internally."
It has long been advised to prefer I<2>, "incremental-decoding-enabled
compressed Sereal," wherever possible.

=item 2

Compressed Sereal format, using Google's Snappy compression internally as
in format I<1>, but supporting incremental parsing. It has long been
preferred over I<1>, as this is considered a bug fix in the Snappy
compression support.

The format is:

    <Varint><Snappy Blob>

where the varint signifies the length of the Snappy-compressed blob
following it. See L</"NOTES ON IMPLEMENTATION"> below for a discussion on
how to implement this efficiently.

=back

Additional compression types are envisaged and will be assigned type
numbers by the maintainers of the protocol.
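
Splitting the VERSION-TYPE byte into its two halves might look like this in
Python. The function name and the C<DOC_TYPE_NAMES> table are hypothetical
helpers for this sketch; only the bit layout comes from the text above.

```python
# Human-readable labels for the three document types listed above.
DOC_TYPE_NAMES = {
    0: "raw",
    1: "Snappy (protocol version 1 only)",
    2: "incremental Snappy",
}


def split_version_type(byte):
    """Return (document type, protocol version) from a VERSION-TYPE byte."""
    doc_type = (byte >> 4) & 0x0F  # high 4 bits
    version = byte & 0x0F          # low 4 bits
    return doc_type, version


doc_type, version = split_version_type(0x22)
print(DOC_TYPE_NAMES[doc_type], version)  # prints "incremental Snappy 2"
```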

=head3 HEADER-SUFFIX-SIZE

The structure of the header includes support for embedding additional data.
This is accomplished by specifying the length of the suffix
in the header with a varint. Headers with no suffix will set this to a
binary 0. This is intended for future format extensions that retain some
level of compatibility for old decoders (which know how to skip the
extended header due to the embedded length).

=head3 OPT-SUFFIX

The suffix may contain whatever data the encoder wishes to embed in the
header. In version 1 of the protocol the decoder will never look inside
this data. Later versions may introduce new rules for this field.
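
Putting the four header components together, a decoder could locate the
start of the body roughly as sketched below. This is an illustrative Python
sketch rather than a reference implementation; C<parse_header>,
C<read_varint> and the returned dictionary keys are names invented here.

```python
MAGIC = b"=srl"


def read_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_header(doc):
    """Parse <MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>."""
    if doc[:4] != MAGIC:
        raise ValueError("not a Sereal document: bad magic string")
    version_type = doc[4]
    suffix_size, pos = read_varint(doc, 5)
    suffix = doc[pos:pos + suffix_size]
    if len(suffix) != suffix_size:
        raise ValueError("truncated header suffix")
    return {
        "version": version_type & 0x0F,   # low 4 bits
        "type": version_type >> 4,        # high 4 bits
        "suffix": suffix,                 # opaque to the decoder
        "body_start": pos + suffix_size,  # index of the first body byte
    }


# Version 2, raw document, 3-byte suffix "abc", followed by a 2-byte body.
doc = b"=srl" + bytes([0x02, 0x03]) + b"abc" + bytes([0x2B, 0x00])
header = parse_header(doc)
body = doc[header["body_start"]:]
```

Because the suffix length is explicit, a decoder that does not understand
the suffix can still skip it and reach the body.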

=head2 Body Format

The body is made up of tagged data items:

    <TAG> <OPT-DATA>

Tagged items can be containers that hold other tagged items.
At the top level, the body holds only ONE tagged item (often
an array or hash) that holds others.

=head3 TAG

A tag is a single byte which specifies the type of the data being decoded.

The high bit of each tag is used to signal to the decoder that the
deserialized data needs to be stored and tracked because it will be reused
elsewhere in the serialization. This is sometimes called the "track flag"
or the "F-bit" in code and documentation. Its status should be ignored
when determining the type of a tag, meaning code should mask off the high
bit and only use the low 7 bits.

Some tags, such as POS, NEG and SHORT_BINARY, embed either the data itself
(in the case of POS and NEG) or the length of the OPT-DATA section (in the
case of SHORT_BINARY).
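
Masking off the track flag is a one-liner in most languages. The Python
sketch below uses constant and function names that are assumptions of this
example, not identifiers from an existing implementation.

```python
TRACK_FLAG = 0x80  # the high bit, also called the "F-bit"
TAG_MASK = 0x7F    # the low 7 bits identify the tag


def split_tag(byte):
    """Return (tag, tracked) for a raw tag byte."""
    return byte & TAG_MASK, bool(byte & TRACK_FLAG)


print(split_tag(0xAB))  # prints "(43, True)": a tracked ARRAY tag
print(split_tag(0x2B))  # prints "(43, False)": the same tag, untracked
```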

=head3 OPT-DATA

This field may contain an arbitrary set of bytes, whose length is either
determined implicitly by the tag (such as for FLOAT), explicitly in the tag
(as in SHORT_BINARY) or in a varint following the tag (such as for BINARY
and STR_UTF8).

When referring to an offset below, what's meant is a varint encoded
absolute integer byte position in the document body.
That is, an offset of 10 refers to the
tenth byte in the Sereal document body (i.e. excluding its header).
Sereal version 1 used to mandate offsets from the start of the document
header.

=head3 Tags

=for autoupdater start

    Tag               | Char | Dec | Hex  | Binary     | Follow
    ------------------+------+-----+------+------------+-----------------------------------------
    POS_0             |      |   0 | 0x00 | 0b00000000 | small positive integer - value in low 4 bits (identity)
    POS_1             |      |   1 | 0x01 | 0b00000001 |
    POS_2             |      |   2 | 0x02 | 0b00000010 |
    POS_3             |      |   3 | 0x03 | 0b00000011 |
    POS_4             |      |   4 | 0x04 | 0b00000100 |
    POS_5             |      |   5 | 0x05 | 0b00000101 |
    POS_6             |      |   6 | 0x06 | 0b00000110 |
    POS_7             | "\a" |   7 | 0x07 | 0b00000111 |
    POS_8             | "\b" |   8 | 0x08 | 0b00001000 |
    POS_9             | "\t" |   9 | 0x09 | 0b00001001 |
    POS_10            | "\n" |  10 | 0x0a | 0b00001010 |
    POS_11            |      |  11 | 0x0b | 0b00001011 |
    POS_12            | "\f" |  12 | 0x0c | 0b00001100 |
    POS_13            | "\r" |  13 | 0x0d | 0b00001101 |
    POS_14            |      |  14 | 0x0e | 0b00001110 |
    POS_15            |      |  15 | 0x0f | 0b00001111 | small positive integer - value in low 4 bits (identity)
    NEG_16            |      |  16 | 0x10 | 0b00010000 | small negative integer - value in low 4 bits (k+32)
    NEG_15            |      |  17 | 0x11 | 0b00010001 |
    NEG_14            |      |  18 | 0x12 | 0b00010010 |
    NEG_13            |      |  19 | 0x13 | 0b00010011 |
    NEG_12            |      |  20 | 0x14 | 0b00010100 |
    NEG_11            |      |  21 | 0x15 | 0b00010101 |
    NEG_10            |      |  22 | 0x16 | 0b00010110 |
    NEG_9             |      |  23 | 0x17 | 0b00010111 |
    NEG_8             |      |  24 | 0x18 | 0b00011000 |
    NEG_7             |      |  25 | 0x19 | 0b00011001 |
    NEG_6             |      |  26 | 0x1a | 0b00011010 |
    NEG_5             | "\e" |  27 | 0x1b | 0b00011011 |
    NEG_4             |      |  28 | 0x1c | 0b00011100 |
    NEG_3             |      |  29 | 0x1d | 0b00011101 |
    NEG_2             |      |  30 | 0x1e | 0b00011110 |
    NEG_1             |      |  31 | 0x1f | 0b00011111 | small negative integer - value in low 4 bits (k+32)
    VARINT            | " "  |  32 | 0x20 | 0b00100000 | <VARINT> - Varint variable length integer
    ZIGZAG            | "!"  |  33 | 0x21 | 0b00100001 | <ZIGZAG-VARINT> - Zigzag variable length integer
    FLOAT             | "\"" |  34 | 0x22 | 0b00100010 | <IEEE-FLOAT>
    DOUBLE            | "#"  |  35 | 0x23 | 0b00100011 | <IEEE-DOUBLE>
    LONG_DOUBLE       | "\$" |  36 | 0x24 | 0b00100100 | <IEEE-LONG-DOUBLE>
    UNDEF             | "%"  |  37 | 0x25 | 0b00100101 | None - Perl undef
    BINARY            | "&"  |  38 | 0x26 | 0b00100110 | <LEN-VARINT> <BYTES> - binary/(latin1) string
    STR_UTF8          | "'"  |  39 | 0x27 | 0b00100111 | <LEN-VARINT> <UTF8> - utf8 string
    REFN              | "("  |  40 | 0x28 | 0b00101000 | <ITEM-TAG> - ref to next item
    REFP              | ")"  |  41 | 0x29 | 0b00101001 | <OFFSET-VARINT> - ref to previous item stored at offset
    HASH              | "*"  |  42 | 0x2a | 0b00101010 | <COUNT-VARINT> [<KEY-TAG> <ITEM-TAG> ...] - count followed by key/value pairs
    ARRAY             | "+"  |  43 | 0x2b | 0b00101011 | <COUNT-VARINT> [<ITEM-TAG> ...] - count followed by items
    OBJECT            | ","  |  44 | 0x2c | 0b00101100 | <STR-TAG> <ITEM-TAG> - class, object-item
    OBJECTV           | "-"  |  45 | 0x2d | 0b00101101 | <OFFSET-VARINT> <ITEM-TAG> - offset of previously used classname tag - object-item
    ALIAS             | "."  |  46 | 0x2e | 0b00101110 | <OFFSET-VARINT> - alias to item defined at offset
    COPY              | "/"  |  47 | 0x2f | 0b00101111 | <OFFSET-VARINT> - copy of item defined at offset
    WEAKEN            | "0"  |  48 | 0x30 | 0b00110000 | <REF-TAG> - Weaken the following reference
    REGEXP            | "1"  |  49 | 0x31 | 0b00110001 | <PATTERN-STR-TAG> <MODIFIERS-STR-TAG>
    RESERVED_0        | "2"  |  50 | 0x32 | 0b00110010 | reserved
    RESERVED_1        | "3"  |  51 | 0x33 | 0b00110011 |
    RESERVED_2        | "4"  |  52 | 0x34 | 0b00110100 |
    RESERVED_3        | "5"  |  53 | 0x35 | 0b00110101 |
    RESERVED_4        | "6"  |  54 | 0x36 | 0b00110110 |
    RESERVED_5        | "7"  |  55 | 0x37 | 0b00110111 |
    RESERVED_6        | "8"  |  56 | 0x38 | 0b00111000 |
    RESERVED_7        | "9"  |  57 | 0x39 | 0b00111001 | reserved
    FALSE             | ":"  |  58 | 0x3a | 0b00111010 | false (PL_sv_no)
    TRUE              | ";"  |  59 | 0x3b | 0b00111011 | true (PL_sv_yes)
    MANY              | "<"  |  60 | 0x3c | 0b00111100 | <LEN-VARINT> <TYPE-BYTE> <TAG-DATA> - repeated tag (not done yet, will be implemented in version 2)
    PACKET_START      | "="  |  61 | 0x3d | 0b00111101 | (first byte of magic string in header)
    EXTEND            | ">"  |  62 | 0x3e | 0b00111110 | <BYTE> - for additional tags
    PAD               | "?"  |  63 | 0x3f | 0b00111111 | (ignored tag, skip to next byte)
    ARRAYREF_0        | "\@" |  64 | 0x40 | 0b01000000 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    ARRAYREF_1        | "A"  |  65 | 0x41 | 0b01000001 |
    ARRAYREF_2        | "B"  |  66 | 0x42 | 0b01000010 |
    ARRAYREF_3        | "C"  |  67 | 0x43 | 0b01000011 |
    ARRAYREF_4        | "D"  |  68 | 0x44 | 0b01000100 |
    ARRAYREF_5        | "E"  |  69 | 0x45 | 0b01000101 |
    ARRAYREF_6        | "F"  |  70 | 0x46 | 0b01000110 |
    ARRAYREF_7        | "G"  |  71 | 0x47 | 0b01000111 |
    ARRAYREF_8        | "H"  |  72 | 0x48 | 0b01001000 |
    ARRAYREF_9        | "I"  |  73 | 0x49 | 0b01001001 |
    ARRAYREF_10       | "J"  |  74 | 0x4a | 0b01001010 |
    ARRAYREF_11       | "K"  |  75 | 0x4b | 0b01001011 |
    ARRAYREF_12       | "L"  |  76 | 0x4c | 0b01001100 |
    ARRAYREF_13       | "M"  |  77 | 0x4d | 0b01001101 |
    ARRAYREF_14       | "N"  |  78 | 0x4e | 0b01001110 |
    ARRAYREF_15       | "O"  |  79 | 0x4f | 0b01001111 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
    HASHREF_0         | "P"  |  80 | 0x50 | 0b01010000 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    HASHREF_1         | "Q"  |  81 | 0x51 | 0b01010001 |
    HASHREF_2         | "R"  |  82 | 0x52 | 0b01010010 |
    HASHREF_3         | "S"  |  83 | 0x53 | 0b01010011 |
    HASHREF_4         | "T"  |  84 | 0x54 | 0b01010100 |
    HASHREF_5         | "U"  |  85 | 0x55 | 0b01010101 |
    HASHREF_6         | "V"  |  86 | 0x56 | 0b01010110 |
    HASHREF_7         | "W"  |  87 | 0x57 | 0b01010111 |
    HASHREF_8         | "X"  |  88 | 0x58 | 0b01011000 |
    HASHREF_9         | "Y"  |  89 | 0x59 | 0b01011001 |
    HASHREF_10        | "Z"  |  90 | 0x5a | 0b01011010 |
    HASHREF_11        | "["  |  91 | 0x5b | 0b01011011 |
    HASHREF_12        | "\\" |  92 | 0x5c | 0b01011100 |
    HASHREF_13        | "]"  |  93 | 0x5d | 0b01011101 |
    HASHREF_14        | "^"  |  94 | 0x5e | 0b01011110 |
    HASHREF_15        | "_"  |  95 | 0x5f | 0b01011111 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
    SHORT_BINARY_0    | "`"  |  96 | 0x60 | 0b01100000 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
    SHORT_BINARY_1    | "a"  |  97 | 0x61 | 0b01100001 |
    SHORT_BINARY_2    | "b"  |  98 | 0x62 | 0b01100010 |
    SHORT_BINARY_3    | "c"  |  99 | 0x63 | 0b01100011 |
    SHORT_BINARY_4    | "d"  | 100 | 0x64 | 0b01100100 |
    SHORT_BINARY_5    | "e"  | 101 | 0x65 | 0b01100101 |
    SHORT_BINARY_6    | "f"  | 102 | 0x66 | 0b01100110 |
    SHORT_BINARY_7    | "g"  | 103 | 0x67 | 0b01100111 |
    SHORT_BINARY_8    | "h"  | 104 | 0x68 | 0b01101000 |
    SHORT_BINARY_9    | "i"  | 105 | 0x69 | 0b01101001 |
    SHORT_BINARY_10   | "j"  | 106 | 0x6a | 0b01101010 |
    SHORT_BINARY_11   | "k"  | 107 | 0x6b | 0b01101011 |
    SHORT_BINARY_12   | "l"  | 108 | 0x6c | 0b01101100 |
    SHORT_BINARY_13   | "m"  | 109 | 0x6d | 0b01101101 |
    SHORT_BINARY_14   | "n"  | 110 | 0x6e | 0b01101110 |
    SHORT_BINARY_15   | "o"  | 111 | 0x6f | 0b01101111 |
    SHORT_BINARY_16   | "p"  | 112 | 0x70 | 0b01110000 |
    SHORT_BINARY_17   | "q"  | 113 | 0x71 | 0b01110001 |
    SHORT_BINARY_18   | "r"  | 114 | 0x72 | 0b01110010 |
    SHORT_BINARY_19   | "s"  | 115 | 0x73 | 0b01110011 |
    SHORT_BINARY_20   | "t"  | 116 | 0x74 | 0b01110100 |
    SHORT_BINARY_21   | "u"  | 117 | 0x75 | 0b01110101 |
    SHORT_BINARY_22   | "v"  | 118 | 0x76 | 0b01110110 |
    SHORT_BINARY_23   | "w"  | 119 | 0x77 | 0b01110111 |
    SHORT_BINARY_24   | "x"  | 120 | 0x78 | 0b01111000 |
    SHORT_BINARY_25   | "y"  | 121 | 0x79 | 0b01111001 |
    SHORT_BINARY_26   | "z"  | 122 | 0x7a | 0b01111010 |
    SHORT_BINARY_27   | "{"  | 123 | 0x7b | 0b01111011 |
    SHORT_BINARY_28   | "|"  | 124 | 0x7c | 0b01111100 |
    SHORT_BINARY_29   | "}"  | 125 | 0x7d | 0b01111101 |
    SHORT_BINARY_30   | "~"  | 126 | 0x7e | 0b01111110 |
    SHORT_BINARY_31   |      | 127 | 0x7f | 0b01111111 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag

=for autoupdater stop
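
For the tag ranges in the table that carry data or a count inside the tag
byte itself, the embedded value can be recovered with simple arithmetic.
The Python sketch below is illustrative only; C<describe_tag> and the
string labels it returns are hypothetical names, and all other tags are
lumped together as "OTHER".

```python
def describe_tag(byte):
    """Classify a tag byte and extract any value embedded in it."""
    tag = byte & 0x7F                      # ignore the track flag
    if tag <= 0x0F:
        return "POS", tag                  # value is the tag itself
    if tag <= 0x1F:
        return "NEG", tag - 32             # NEG_16 (16) -> -16, NEG_1 (31) -> -1
    if 0x40 <= tag <= 0x4F:
        return "ARRAYREF", tag & 0x0F      # item count in low 4 bits
    if 0x50 <= tag <= 0x5F:
        return "HASHREF", tag & 0x0F       # key/value pair count in low 4 bits
    if tag >= 0x60:
        return "SHORT_BINARY", tag & 0x1F  # byte length in low 5 bits
    return "OTHER", tag                    # tags 0x20..0x3F: see the table


print(describe_tag(0x1F))  # prints "('NEG', -1)"
print(describe_tag(0x63))  # prints "('SHORT_BINARY', 3)"
```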

=head3 The Track Bit And Cyclic Data Structures

The protocol uses a combination of the offset of a tracked tag and the
flag bit to be able to encode and reconstruct cyclic structures in a single
pass.

An encoder must track duplicated items and generate the appropriate ALIAS or
REFP tags to reconstruct them, and when it does so ensure that the high
bit of the original tag has been set.

When a decoder encounters a tag with its flag set, it will remember the
offset of the tag in the packet and the item that was decoded from
that tag. At a later point in the packet there will be an ALIAS or REFP
instruction which will refer to the item by its offset, and the decoder
will reuse it as needed.
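
The decoder side of this scheme can be sketched with a toy Python decoder
that understands only POS, REFN, REFP and ARRAY. Several things here are
assumptions of the sketch and not statements of the specification: Perl
references are modelled as transparent (REFN simply yields the item it
points to), offsets are taken to count body bytes starting at 1, and the
class name C<ToyDecoder> is invented.

```python
TAG_REFN, TAG_REFP, TAG_ARRAY = 0x28, 0x29, 0x2B


def read_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


class ToyDecoder:
    def __init__(self, body):
        self.body = body
        self.pos = 0
        self.tracked = {}  # offset of a tracked tag -> item decoded from it

    def decode_item(self):
        offset = self.pos + 1  # assumption: offsets are 1-based body positions
        byte = self.body[self.pos]
        self.pos += 1
        tag, track = byte & 0x7F, bool(byte & 0x80)
        if tag <= 0x0F:                      # POS_0 .. POS_15
            item = tag
        elif tag == TAG_REFN:                # reference to the next item
            item = self.decode_item()
        elif tag == TAG_REFP:                # reference to a tracked item
            target, self.pos = read_varint(self.body, self.pos)
            item = self.tracked[target]
        elif tag == TAG_ARRAY:
            count, self.pos = read_varint(self.body, self.pos)
            item = []
            if track:
                # Register before decoding children so cycles can resolve.
                self.tracked[offset] = item
            for _ in range(count):
                item.append(self.decode_item())
        else:
            raise ValueError("tag 0x%02x not supported by this sketch" % tag)
        if track:
            self.tracked[offset] = item
        return item


# REFN, ARRAY with track flag (at offset 2), count 1, REFP to offset 2.
cyclic = ToyDecoder(bytes([0x28, 0xAB, 0x01, 0x29, 0x02])).decode_item()
print(cyclic[0] is cyclic)  # prints "True"
```

The array is registered under its offset before its elements are decoded,
which is what lets the REFP inside it resolve in a single pass.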

=head3 The COPY Tag

Sometimes it is convenient to be able to reuse a previously emitted
sequence in the packet to reduce duplication, for instance in a data
structure with many hashes with the same keys. The COPY tag is used for
this. Its argument is a varint which is the offset of a previously
emitted tag, and decoders are to behave as though the tag it references
was inserted into the packet stream as a replacement for the COPY tag.

Note that in this case the track flag is B<not> set. It is assumed the
decoder can jump back to reread the tag from its location alone.

COPY tags are forbidden from referring to another COPY tag, and are also
forbidden from referring to anything containing a COPY tag, with the
exception that a COPY tag used as a value may refer to a tag that uses
a COPY tag for a classname or hash key.
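
The "jump back and reread" behaviour can be sketched with a toy Python
decoder limited to POS, SHORT_BINARY, ARRAYREF, HASHREF and COPY. As
assumptions of this sketch, offsets are treated as 1-based positions in the
body, and the restrictions on what a COPY tag may refer to are not
enforced. The function names are invented for the example.

```python
TAG_COPY = 0x2F


def read_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def decode_item(body, pos):
    """Decode one item starting at body[pos]; return (item, next position)."""
    tag = body[pos] & 0x7F
    pos += 1
    if tag <= 0x0F:                              # POS_0 .. POS_15
        return tag, pos
    if tag == TAG_COPY:
        offset, pos = read_varint(body, pos)
        # Reread the referenced tag, then resume right after the COPY.
        item, _ = decode_item(body, offset - 1)  # assumption: 1-based offsets
        return item, pos
    if 0x40 <= tag <= 0x4F:                      # ARRAYREF_n
        items = []
        for _ in range(tag & 0x0F):
            item, pos = decode_item(body, pos)
            items.append(item)
        return items, pos
    if 0x50 <= tag <= 0x5F:                      # HASHREF_n
        result = {}
        for _ in range(tag & 0x0F):
            key, pos = decode_item(body, pos)
            value, pos = decode_item(body, pos)
            result[key] = value
        return result, pos
    if tag >= 0x60:                              # SHORT_BINARY_n
        length = tag & 0x1F
        return body[pos:pos + length], pos + length
    raise ValueError("tag 0x%02x not supported by this sketch" % tag)


# Two hashes sharing the key "name"; the second key is a COPY of offset 3.
body = bytes([0x42, 0x51, 0x64]) + b"name" + bytes([0x01, 0x51, 0x2F, 0x03, 0x02])
decoded, end = decode_item(body, 0)
print(decoded)  # prints "[{b'name': 1}, {b'name': 2}]"
```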

=head3 String Types

Sereal supports three string representations. Two are "encodingless":
SHORT_BINARY and BINARY, where binary means "raw bytes". The other
is STR_UTF8, which is expected to contain valid canonical UTF8 encoded
Unicode text data. Under normal circumstances a decoder is not expected
to validate that this is actually the case, and is allowed to simply
extract the data verbatim.

SHORT_BINARY stores the length of the string in the tag itself and is used
for strings shorter than 32 bytes. Both BINARY and STR_UTF8
use a varint to indicate the number of B<bytes> (octets) in the string.
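
An encoder might emit the three string representations along the lines of
the Python sketch below. The helper names are made up for this example, and
the policy of always using STR_UTF8 for text is a simplification chosen
here, not a rule stated by this document.

```python
TAG_BINARY = 0x26
TAG_STR_UTF8 = 0x27
TAG_SHORT_BINARY_0 = 0x60


def encode_varint(value):
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_bytes(data):
    """Encode raw bytes as SHORT_BINARY (under 32 bytes) or BINARY."""
    if len(data) < 32:
        # The length lives in the low 5 bits of the tag itself.
        return bytes([TAG_SHORT_BINARY_0 | len(data)]) + data
    return bytes([TAG_BINARY]) + encode_varint(len(data)) + data


def encode_text(text):
    """Encode text as STR_UTF8; the varint counts bytes, not characters."""
    data = text.encode("utf-8")
    return bytes([TAG_STR_UTF8]) + encode_varint(len(data)) + data


print(encode_bytes(b"abc"))  # prints "b'cabc'": tag 0x63 is the letter "c"
print(encode_text("\u00e9"))  # prints "b"'\\xc3\\xa9""
```

The second call shows why the length is in bytes: a single character that
needs two UTF8 octets is written with a length of 2.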

=head3 Hash Keys

Hash keys are always one of the string types, or a COPY tag referencing a
string.

=head3 Handling objects

Objects are serialized as a class name and a tag which represents the
object's data. In Perl land this will always be a reference. Mapping Perl
objects to other languages is left to the future.

Note that classnames MUST be a string, or a COPY tag referencing a string.

OBJECTV varints MUST reference a previously used classname, and not an
arbitrary string.

=head1 PROTOCOL CHANGES

=head2 Protocol Version 2

In Sereal protocol version 2, offsets were changed from being relative to
the start of the document (including header) to being relative to the start
of the document body (i.e. excluding the document header). This means that
Sereal document bodies are now self-contained and relocatable within the
document.

=head1 NOTES ON IMPLEMENTATION

=head2 Encoding the Length of Compressed Documents

With Sereal document type 2 (see above), you need to encode (as a varint)
the length of the Snappy-compressed document as a prefix to the document body.
This is somewhat tricky to do efficiently since, at first sight,
the amount of space required to encode a varint depends on the size of the
output. This means that you need to do the normal Sereal-encoding of the
document body, then compress the output of that, then append the varint
encoded length of the compressed data to a Sereal header, then append the
compressed data. In this naive way of implementing Snappy compression
support, you may end up having to copy around the entire document up to three
times (and may end up having to allocate 3x the space, too). That is very
inefficient.

There is a better way, though, that's just a tiny bit subtle.
Thankfully, you have an upper bound on the
size of the compressed blob. It's the uncompressed blob plus the size of
the Snappy header (a Snappy library call can tell you what that is in
practice). Before compressing, you reserve space for a varint
that is long enough to encode the upper limit on the compressed output
size. Then you point the compressor into the buffer right after the
preallocated varint. After compression, you'll know the real size of the
compressed blob. Now, you go back to the varint and fill it in. If the
reserved space for the varint is B<larger> than what you actually need, then
thanks to the way varints work, you can simply set the high bit on the
last byte of the varint, and continue to set the high bits of all following
padding bytes B<except the last>, which you set to 0 (NUL). For details
on why that works, please refer to the Google ProtoBuf documentation
referenced earlier. With this specially crafted varint, any normal
varint parsing function will treat it as a single varint and skip right
to the start of the Snappy-compressed blob. The varint is a correct
varint, just not in the canonical form. With this modified plan, you
should only need one extra malloc and (beyond what the Snappy
implementation does) no extra, large memcpy operations.
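
The padded, non-canonical varint described above can be sketched in Python
as follows. Only the varint handling is shown; the compression step is left
out, and C<write_padded_varint> as well as the reserved width of three
bytes are assumptions made for the example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a canonical varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(buf, pos=0):
    """Decode a varint starting at buf[pos]; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def write_padded_varint(buf, pos, value, width):
    """Write value into exactly `width` reserved bytes at buf[pos]."""
    canonical = encode_varint(value)
    used = len(canonical)
    if used > width:
        raise ValueError("reserved space is too small for this value")
    buf[pos:pos + used] = canonical
    if used < width:
        buf[pos + used - 1] |= 0x80      # high bit on the last real byte
        for i in range(pos + used, pos + width - 1):
            buf[i] = 0x80                # padding bytes: high bit only
        buf[pos + width - 1] = 0x00      # final padding byte is NUL


# Reserve 3 bytes for the length, then learn the blob is only 5 bytes long.
buf = bytearray(3) + b"BLOB!"
write_padded_varint(buf, 0, 5, 3)
length, blob_start = decode_varint(buf, 0)
print(bytes(buf[:3]).hex(), length, blob_start)  # prints "858000 5 3"
```

An ordinary varint reader consumes all three bytes and still yields 5,
because the padding bytes contribute only zero bits.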

=head1 AUTHOR

Yves Orton E<lt>demerphq@gmail.comE<gt>

Damian Gryski

Steffen Mueller E<lt>smueller@cpan.orgE<gt>

Rafaël Garcia-Suarez

Ævar Arnfjörð Bjarmason E<lt>avar@cpan.orgE<gt>

=head1 ACKNOWLEDGMENT

This protocol was originally developed for Booking.com. With approval
from Booking.com, this document was generalized and published on GitHub
and CPAN, for which the authors would like to express their gratitude.

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2012, 2013 by Steffen Mueller

Copyright (C) 2012, 2013 by Yves Orton

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut