HMMER4 introduced compressed representations of sequence alignments called "magic capsules". One of the compression techniques in a magic capsule is run-length encoding, with integer run lengths encoded using variable-length binary codewords (varint codes) that can be as short as a single bit per integer.
The general idea of a varint code is that small integer values are
often more frequent than large ones, and more frequent values can be
encoded by shorter codewords.
Golomb codes [Golomb, 1966] are one of the best known varint codes,
but many other varint codes exist. Each different varint code implies
a probability distribution over the possible values.
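
To make the link between code lengths and probabilities concrete: a codeword of $\ell(v)$ bits corresponds to an implied probability $2^{-\ell(v)}$. The following is a minimal Python sketch, not Easel code; the function names are hypothetical, and it uses the exponential Golomb length formula $2 \lfloor \log_2 (v+1) \rfloor + 1$ given later in this document.

```python
from fractions import Fraction

def expgol_length(v):
    """Codeword length in bits of the exponential Golomb code for v >= 0."""
    # (v + 1).bit_length() - 1 is floor(log2(v + 1))
    return 2 * ((v + 1).bit_length() - 1) + 1

def implied_probability(length):
    """Probability implied by a codeword of the given length: 2^-length."""
    return Fraction(1, 2 ** length)

# Sum the implied probabilities over the first 2^10 - 1 values.
total = sum(implied_probability(expgol_length(v)) for v in range(2 ** 10 - 1))
print(total)  # 1023/1024
```

The sum approaches 1 as more values are included, so the code lengths behave like a proper probability distribution that favors small values.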
The Easel `esl_varint` module provides routines for several different
varint codes, including Golomb codes, Golomb-Rice codes, exponential
Golomb codes, Google varint codes, and Elias delta codes. HMMER4 magic
capsules use exponential Golomb codes.
The varint codes are explained below, starting with truncated binary
codes, which are not varint codes per se but are a technique
used by Golomb codes. Golomb codes take a parameter
Finally, two other codes that we implement are Elias delta codes [Elias, 1975] and Google varint codes.
Golomb codes are only here by way of explanation; Easel does not
implement them. Easel implements Golomb-Rice codes in
`esl_varint_rice*`, exponential Golomb codes in `esl_varint_expgol*`,
Elias delta codes in `esl_varint_delta*`, and Google varint codes in
`esl_varint_google*`.
A truncated binary code is a canonical Huffman encoding for
The idea is that when
Let
To decode: read
For example, for
value | length | code |
---|---|---|
0 | 3 | 000 |
1 | 3 | 001 |
2 | 3 | 010 |
3 | 3 | 011 |
4 | 3 | 100 |
5 | 3 | 101 |
6 | 4 | 1100 |
7 | 4 | 1101 |
8 | 4 | 1110 |
9 | 4 | 1111 |
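
The table above can be reproduced with the standard truncated binary construction. This is a minimal Python sketch, not Easel code: the function names and the symbol `n` (the number of possible values) are assumptions for illustration, and codewords are represented as strings of `'0'`/`'1'` for readability.

```python
def truncated_binary_encode(v, n):
    """Encode v (0 <= v < n) as a truncated binary codeword ('0'/'1' string)."""
    if n <= 0 or not 0 <= v < n:
        raise ValueError("need n > 0 and 0 <= v < n")
    b = n.bit_length() - 1        # floor(log2(n))
    u = (1 << (b + 1)) - n        # number of short, b-bit codewords
    if v < u:
        return format(v, "b").zfill(b) if b > 0 else ""
    return format(v + u, "b").zfill(b + 1)

def truncated_binary_decode(bits, n):
    """Decode one codeword from the front of bits; return (value, bits used).

    Assumes bits holds a complete, well-formed codeword.
    """
    b = n.bit_length() - 1
    u = (1 << (b + 1)) - n
    x = int(bits[:b], 2) if b > 0 else 0
    if x < u:
        return x, b               # short codeword
    x = (x << 1) | int(bits[b])   # long codeword: read one more bit
    return x - u, b + 1

for v in range(10):
    print(v, truncated_binary_encode(v, 10))
```

With `n = 10`, the first six values get 3-bit codewords and the last four get 4-bit codewords, as in the table.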
- Encodes: integer $v$; $v \geq 0$.
- Argument: $m$; $m > 0$.
Divide
For example, 0..9 in a Golomb-3 code:
value | length | code |
---|---|---|
0 | 2 | 0 0 |
1 | 3 | 0 10 |
2 | 3 | 0 11 |
3 | 3 | 10 0 |
4 | 4 | 10 10 |
5 | 4 | 10 11 |
6 | 4 | 110 0 |
7 | 5 | 110 10 |
8 | 5 | 110 11 |
9 | 5 | 1110 0 |
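
The Golomb-3 table above is consistent with the following minimal Python sketch: the quotient $\lfloor v/m \rfloor$ is written in unary (that many 1's followed by a 0), and the remainder is written in truncated binary. The function names are hypothetical, and since Easel does not implement Golomb codes, this is for explanation only.

```python
def golomb_encode(v, m):
    """Encode v >= 0 with a Golomb-m code; returns a '0'/'1' string."""
    if v < 0 or m <= 0:
        raise ValueError("need v >= 0 and m > 0")
    q, r = divmod(v, m)
    b = m.bit_length() - 1        # floor(log2(m))
    u = (1 << (b + 1)) - m        # number of short remainder codewords
    if r < u:
        tail = format(r, "b").zfill(b) if b > 0 else ""
    else:
        tail = format(r + u, "b").zfill(b + 1)
    return "1" * q + "0" + tail   # unary quotient, then truncated binary remainder

def golomb_decode(bits, m):
    """Decode one well-formed codeword; return (value, bits used)."""
    q = 0
    while bits[q] == "1":         # count the unary quotient
        q += 1
    pos = q + 1                   # skip the terminating 0
    b = m.bit_length() - 1
    u = (1 << (b + 1)) - m
    r = int(bits[pos:pos + b], 2) if b > 0 else 0
    pos += b
    if r >= u:                    # long remainder codeword: one more bit
        r = ((r << 1) | int(bits[pos])) - u
        pos += 1
    return q * m + r, pos

for v in range(10):
    print(v, golomb_encode(v, 3))
```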
Golomb codes are optimal for geometric distributions. For a geometric
run length distribution with extension probability
- Encodes: integer $v$; $v \geq 0$
- Argument: $k$; $k \geq 0$
- Length: $1 + k + \lfloor v / 2^k \rfloor$
- Max $v$ in $b$ bits: $v < 2^k (b-k)$
Divide
For example, 0..9 in a Golomb-Rice-2 code:
value | length | code |
---|---|---|
0 | 3 | 0 00 |
1 | 3 | 0 01 |
2 | 3 | 0 10 |
3 | 3 | 0 11 |
4 | 4 | 10 00 |
5 | 4 | 10 01 |
6 | 4 | 10 10 |
7 | 4 | 10 11 |
8 | 5 | 110 00 |
9 | 5 | 110 01 |
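
A minimal Python sketch that reproduces the Golomb-Rice-2 table above and checks the length and maximum-value formulas. The function names are hypothetical; this is not the `esl_varint_rice*` interface, and codewords are `'0'`/`'1'` strings rather than packed bitfields.

```python
def rice_encode(v, k):
    """Encode v >= 0 with a Golomb-Rice-k code; returns a '0'/'1' string."""
    if v < 0 or k < 0:
        raise ValueError("need v >= 0 and k >= 0")
    q = v >> k                    # quotient floor(v / 2^k), sent in unary
    r = v & ((1 << k) - 1)        # remainder, sent in exactly k bits
    tail = format(r, "b").zfill(k) if k > 0 else ""
    return "1" * q + "0" + tail

def rice_decode(bits, k):
    """Decode one well-formed codeword; return (value, bits used)."""
    q = 0
    while bits[q] == "1":
        q += 1
    pos = q + 1
    r = int(bits[pos:pos + k], 2) if k > 0 else 0
    return (q << k) | r, pos + k

def rice_limit(b, k):
    """Exclusive upper bound on v for codewords of at most b bits."""
    return (1 << k) * (b - k)

for v in range(10):
    print(v, rice_encode(v, 2))
```

For example, `rice_limit(5, 2)` is 12: every value below 12 fits in 5 bits with `k = 2`, and 12 needs 6 bits.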
Golomb-Rice codes are the simpler subset of Golomb codes where the parameter $m$ is a power of two, $m = 2^k$.
- Encodes: integer $v$; $v \geq 0$
- Length: $2 \lfloor \log_2 (v+1) \rfloor + 1$
- Max $v$ in $b$ bits: $v < 2^{\lfloor \frac{b-1}{2} \rfloor + 1} - 1$
The number of leading 0's is the number of bits in the encoded
integer, and we encode
To encode: let
For example:
value | length | code |
---|---|---|
0 | 1 | 1 |
1 | 3 | 0 10 |
2 | 3 | 0 11 |
3 | 5 | 00 100 |
4 | 5 | 00 101 |
5 | 5 | 00 110 |
6 | 5 | 00 111 |
7 | 7 | 000 1000 |
8 | 7 | 000 1001 |
9 | 7 | 000 1010 |
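
The table above is consistent with writing $v+1$ in binary and prefixing it with one 0 for each bit after the first. A minimal Python sketch under that reading follows; the function names are hypothetical, it is not the `esl_varint_expgol*` interface, and the decoder assumes a complete, well-formed codeword.

```python
def expgol_encode(v):
    """Encode v >= 0 with an exponential Golomb code; returns a '0'/'1' string."""
    if v < 0:
        raise ValueError("need v >= 0")
    x = v + 1
    n = x.bit_length()            # number of bits in the binary form of v + 1
    return "0" * (n - 1) + format(x, "b")

def expgol_decode(bits):
    """Decode one well-formed codeword; return (value, bits used)."""
    z = 0
    while bits[z] == "0":         # count leading zeros
        z += 1
    x = int(bits[z:2 * z + 1], 2) # the next z + 1 bits are v + 1
    return x - 1, 2 * z + 1

for v in range(10):
    print(v, expgol_encode(v))
```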
An Elias gamma code for positive nonzero integers
An exp-Golomb code is exp-Golomb-0 in the generalized exp-Golomb codes below. Easel's implementation does not distinguish them; to get an exponential Golomb code, use exp-Golomb-0.
The largest integer uint64_t
bitfield
is uint32_t
, uint64_t
.
- Encodes: integer $v$; $v \geq 0$
- Argument: $k$, the width of the binary representation of the remainder $r$; $k \geq 0$
- Length: $k + 2 \lfloor \log_2( \lfloor v / 2^k \rfloor + 1) \rfloor + 1$
- Max $v$ in $b$ bits: $v < 2^k \left( 2^{\lfloor \frac{b - k - 1}{2} \rfloor + 1} - 1 \right)$
Divide integer
For example, the Exp-Golomb-2 code:
value | length | code |
---|---|---|
0 | 3 | 1 00 |
1 | 3 | 1 01 |
2 | 3 | 1 10 |
3 | 3 | 1 11 |
4 | 5 | 010 00 |
5 | 5 | 010 01 |
6 | 5 | 010 10 |
7 | 5 | 010 11 |
8 | 5 | 011 00 |
9 | 5 | 011 01 |
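
A minimal Python sketch of the generalized code, consistent with the Exp-Golomb-2 table above: the quotient $\lfloor v / 2^k \rfloor$ is sent as an exp-Golomb-0 codeword and the remainder in $k$ plain binary bits. The function names are hypothetical, and this is not the `esl_varint_expgol*` interface.

```python
def expgol_k_encode(v, k):
    """Encode v >= 0 with an exp-Golomb-k code; returns a '0'/'1' string."""
    if v < 0 or k < 0:
        raise ValueError("need v >= 0 and k >= 0")
    q = v >> k                    # quotient floor(v / 2^k)
    r = v & ((1 << k) - 1)        # remainder, k bits wide
    x = q + 1
    head = "0" * (x.bit_length() - 1) + format(x, "b")   # exp-Golomb-0 of q
    tail = format(r, "b").zfill(k) if k > 0 else ""
    return head + tail

def expgol_k_length(v, k):
    """Length formula: k + 2*floor(log2(floor(v/2^k) + 1)) + 1."""
    return k + 2 * (((v >> k) + 1).bit_length() - 1) + 1

def expgol_k_limit(b, k):
    """Exclusive upper bound on v for codewords of at most b bits (b > k)."""
    return (1 << k) * ((1 << ((b - k - 1) // 2 + 1)) - 1)

for v in range(10):
    print(v, expgol_k_encode(v, 2))
```

For example, `expgol_k_limit(5, 2)` is 12: 11 encodes in 5 bits, while 12 needs 7.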
Exponential Golomb codes were introduced by [Teuhola, 1978], although the details of his encoding differ.
- Encodes: integer $v$; $v \geq 1$
- Code length: $\lfloor \log_2 v \rfloor + 2 \lfloor \log_2 (\lfloor \log_2 v \rfloor + 1) \rfloor + 1$
Let
In other words, encode
For example, 1..10 in Elias delta code:
value | length | code |
---|---|---|
1 | 1 | 1 |
2 | 4 | 0 10 0 |
3 | 4 | 0 10 1 |
4 | 5 | 0 11 00 |
5 | 5 | 0 11 01 |
6 | 5 | 0 11 10 |
7 | 5 | 0 11 11 |
8 | 8 | 00 100 000 |
9 | 8 | 00 100 001 |
10 | 8 | 00 100 010 |
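
The table above matches the usual Elias delta construction: with $N = \lfloor \log_2 v \rfloor$, send $N+1$ as an Elias gamma codeword, then the low $N$ bits of $v$. A minimal Python sketch under that reading follows; the function names are hypothetical, it is not the `esl_varint_delta*` interface, and the decoder assumes well-formed input.

```python
def delta_encode(v):
    """Encode v >= 1 with an Elias delta code; returns a '0'/'1' string."""
    if v < 1:
        raise ValueError("need v >= 1")
    n = v.bit_length() - 1        # N = floor(log2(v))
    x = n + 1
    head = "0" * (x.bit_length() - 1) + format(x, "b")   # Elias gamma code of N + 1
    tail = format(v - (1 << n), "b").zfill(n) if n > 0 else ""
    return head + tail            # tail = v without its leading 1 bit

def delta_decode(bits):
    """Decode one well-formed codeword; return (value, bits used)."""
    z = 0
    while bits[z] == "0":         # leading zeros of the gamma code
        z += 1
    n = int(bits[z:2 * z + 1], 2) - 1
    pos = 2 * z + 1
    tail = int(bits[pos:pos + n], 2) if n > 0 else 0
    return (1 << n) | tail, pos + n

for v in range(1, 11):
    print(v, delta_encode(v))
```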
Elias codes including the Elias gamma and delta codes were introduced in [Elias, 1975].
- Encodes: integer $v$; $v \geq 0$
- Argument: $k$, the width of each code group in bits; $k \geq 2$
- Code length: $k (1 + \lfloor \log_{2^{k-1}} v \rfloor)$
- Max $v$ in $b$ bits: $v < 2^{(k-1) \lfloor \frac{b}{k} \rfloor}$
Integer
For example, 0..9 encoded in a google-2 code:
value | length | code |
---|---|---|
0 | 2 | 00 |
1 | 2 | 01 |
2 | 4 | 10 01 |
3 | 4 | 11 01 |
4 | 6 | 10 10 01 |
5 | 6 | 11 10 01 |
6 | 6 | 10 11 01 |
7 | 6 | 11 11 01 |
8 | 8 | 10 10 10 01 |
9 | 8 | 11 10 10 01 |
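
In the google-2 table above, each group holds one continuation bit (1 if another group follows, 0 for the last group) followed by $k-1$ data bits, with the least significant group first. A minimal Python sketch under that reading follows. The function names are hypothetical and this is not the `esl_varint_google*` interface; as an illustrative design choice, the decoder returns `None` for truncated or malformed input instead of raising.

```python
def google_encode(v, k):
    """Encode v >= 0 with a google-k code (k >= 2); returns a '0'/'1' string."""
    if v < 0 or k < 2:
        raise ValueError("need v >= 0 and k >= 2")
    width = k - 1                 # data bits per group
    mask = (1 << width) - 1
    groups = []
    while True:
        digit = v & mask          # next base-2^(k-1) digit, least significant first
        v >>= width
        groups.append(("1" if v else "0") + format(digit, "b").zfill(width))
        if not v:
            return "".join(groups)

def google_decode(bits, k):
    """Decode one codeword; return (value, bits used), or None on bad input."""
    v, shift, pos = 0, 0, 0
    while True:
        group = bits[pos:pos + k]
        if len(group) < k or any(c not in "01" for c in group):
            return None           # truncated or malformed code
        v |= int(group[1:], 2) << shift
        shift += k - 1
        pos += k
        if group[0] == "0":       # continuation bit clear: last group
            return v, pos

for v in range(10):
    print(v, google_encode(v, 2))
```

With `k = 8`, each group is one byte carrying 7 data bits, the byte-aligned base-128 layout.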
Google Protobuf
uses a google-8 code (i.e. one byte per code group, base 128). This is
actually the only Google varint code; they want it to be byte-aligned
for efficiency reasons. Allowing the generalization to any
- Codes can be user input (for example, in HMMER zigars) so varint decoders need to treat bad codes as normal errors, not exceptions.
Elias, P. Universal codeword sets and representations of the integers. IEEE Trans Inform Theory 21:194-203, 1975.
Golomb, SW. Run-length encodings. IEEE Trans Inform Theory 12:399-401, 1966.
Teuhola, J. A compression method for clustered bit-vectors. Information Processing Letters 7:308-311, 1978.