# esl_varint : variable-length binary encodings for integers

HMMER4 introduced compressed representations of sequence alignments called "magic capsules". One of the compression techniques in a magic capsule is run-length encoding, with integer run lengths encoded using variable-length binary codewords (varint codes) that can be as short as a single bit per integer.

The general idea of a varint code is that you're often in a situation where small integer values $v$ are more frequent than large ones, and more frequent values can be encoded in shorter codewords. Because $v$ can be arbitrarily large, it isn't immediately apparent that Huffman codes apply, because Huffman codes assume a finite set of symbols. However, if we assume that the values $v$ follow some particular decreasing distribution (geometric, for example), Solomon Golomb showed that one can construct a code that is the optimal Huffman code for values drawn from that distribution.

Golomb codes [Golomb, 1966] are one of the best known varint codes, but many other varint codes exist. Each different varint code implies a probability distribution over the possible values $v$. The best code for a given problem is the one that best matches the actual distribution of your $v$'s. Golomb codes are optimal for geometric distributions, for example.

The Easel `esl_varint` module provides routines for several different varint codes: Golomb-Rice codes, exponential Golomb codes, Elias delta codes, and Google varint codes. (Golomb codes themselves are explained below but not implemented.) HMMER4 magic capsules use exponential Golomb codes.

## Explanation of different varint codes

The varint codes are explained below, starting with truncated binary codes, which are not varint codes per se but a technique used within Golomb codes. Golomb codes take a parameter $m$ that controls the shape of the assumed geometric distribution (so we sometimes refer to them as Golomb-$m$ codes), and they get fiddly to encode and decode when $m$ isn't a power of 2. If we restrict Golomb codes to powers of two, $m = 2^k$, we get a simpler set of schemes called the Golomb-Rice-$k$ codes. A Golomb-Rice-$k$ code consists of groups of $2^k$ codewords at each codeword length; in a Golomb-Rice-2 code, for example, the 4 values $v=0..3$ are encoded in 3-bit codewords, the 4 values $v=4..7$ in 4-bit codewords, the 4 values $v=8..11$ in 5-bit codewords, and so on. The exponential Golomb code has group sizes that grow as powers of two instead of staying constant: $v=0$ is encoded in 1 bit, the 2 values $v=1,2$ in 3 bits, the 4 values $v=3..6$ in 5 bits, and the 8 values $v=7..14$ in 7-bit codewords. The generalized exponential Golomb-$k$ codes have group sizes that start at $2^k$ instead of 1.

Finally, two other codes that we implement are Elias delta codes [Elias, 1975] and Google varint codes.

Golomb codes are only here by way of explanation; Easel does not implement them. Easel implements Golomb-Rice codes in `esl_varint_rice*`, exponential Golomb codes in `esl_varint_expgol*`, Elias delta codes in `esl_varint_delta*`, and Google varint codes in `esl_varint_google*`.

### Truncated binary codes

A truncated binary code is a canonical Huffman encoding for $n$ uniformly distributed values $v = 0..n-1$. A truncated binary code is used to encode the remainder $r$ piece in a Golomb code. By itself, it does not encode an arbitrarily large integer $v$.

The idea is that when $n$ is a power of two, the optimal code is simply to encode $v$ in its $k = \log_2 n$ bit binary representation; but when $n$ is not a power of two, in an optimal Huffman encoding, we use $k+1$ bits for some values but $k$ for others.

Let $k = \lfloor \log_2 n \rfloor$; $v$ will be encoded by either $k$ or $k+1$ bits. Let $u = 2^{k+1} - n$; $u$ is the number of codewords that are $k$ bits long, and $n-u$ are $k+1$ bits long. For $v < u$, encode the $k$-bit representation of $v$; for $v \geq u$, encode the $k+1$ bit representation of $v+u$.

To decode: read $k$ bits to obtain $v'$. If $v' < u$, $v = v'$; else, read one more bit onto $v'$ and $v = v'-u$.
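
As an illustration, here is a minimal C sketch of the encoding rule (a hypothetical helper, not part of Easel's `esl_varint` API); the codeword is returned right-justified in the low-order bits of `*code`:

```c
#include <stdint.h>

/* Truncated binary encoding of v, 0 <= v < n, per the rule above.
 * Hypothetical helper, not Easel's API. The codeword is returned in
 * the low-order *nbits bits of *code.
 */
static void
truncated_binary(uint64_t v, uint64_t n, uint64_t *code, int *nbits)
{
  int      k = 0;
  uint64_t u;

  while ((1ULL << (k+1)) <= n) k++;   /* k = floor(log2 n)           */
  u = (1ULL << (k+1)) - n;            /* u codewords are k bits long */

  if (v < u) { *code = v;     *nbits = k;   }  /* k-bit rep of v     */
  else       { *code = v + u; *nbits = k+1; }  /* (k+1)-bit rep of v+u */
}
```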

For example, for $n=10$, $k=3$ and $u=6$:

| $v$ | length | code   |
|-----|--------|--------|
| 0   | 3      | `000`  |
| 1   | 3      | `001`  |
| 2   | 3      | `010`  |
| 3   | 3      | `011`  |
| 4   | 3      | `100`  |
| 5   | 3      | `101`  |
| 6   | 4      | `1100` |
| 7   | 4      | `1101` |
| 8   | 4      | `1110` |
| 9   | 4      | `1111` |

### Golomb-m codes

* Encodes: integer $v$; $v \geq 0$.
* Argument: $m$; $m > 0$.

Divide $v$ by $m$, obtaining quotient $q$ and remainder $r$. Encode $q$ in unary (i.e. $q$ 1's followed by 0), then append the truncated binary encoding of $r$.
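
Putting the unary quotient and truncated binary remainder together, a sketch of a Golomb-$m$ encoder might look like this. It reuses the hypothetical `truncated_binary()` helper sketched above, and assumes the whole codeword fits in 64 bits:

```c
/* Golomb-m encoding of v >= 0 for m > 0: q 1's and a 0 (unary), then
 * the truncated binary code for the remainder. Illustrative only;
 * Easel does not implement Golomb-m codes.
 */
static void
golomb_encode(uint64_t v, uint64_t m, uint64_t *code, int *nbits)
{
  uint64_t q = v / m;
  uint64_t r = v % m;
  uint64_t rcode;
  int      rn;

  truncated_binary(r, m, &rcode, &rn);               /* remainder piece */
  *code  = ((((1ULL << q) - 1) << 1) << rn) | rcode; /* unary q, then r */
  *nbits = (int) q + 1 + rn;
}
```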

For example, 0..9 in a Golomb-3 code:

| $v$ | length | code     |
|-----|--------|----------|
| 0   | 2      | `0 0`    |
| 1   | 3      | `0 10`   |
| 2   | 3      | `0 11`   |
| 3   | 3      | `10 0`   |
| 4   | 4      | `10 10`  |
| 5   | 4      | `10 11`  |
| 6   | 4      | `110 0`  |
| 7   | 5      | `110 10` |
| 8   | 5      | `110 11` |
| 9   | 5      | `1110 0` |

Golomb codes are optimal for geometric distributions. For a geometric run length distribution with extension probability $p \geq 0.5$, choose $m = \mathrm{Round}(\frac{-1}{\log_2 p})$; for example, $p = 0.9$ gives $m = \mathrm{Round}(6.6) = 7$.

### Golomb-Rice-k codes

* Encodes: integer $v$; $v \geq 0$
* Argument: $k$; $k \geq 0$
* Length: $1 + k + \lfloor v / 2^k \rfloor$
* Max $v$ in $b$ bits: $v < 2^k (b-k)$

Divide $v$ by $2^k$, obtaining quotient $q$ and remainder $r$. Encode $q$ in unary ($q$ 1's followed by 0) and append the $k$-bit binary representation of $r$. (Golomb-Rice-0 is unary coding: $v+1$ bits to store value $v$.)
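
In code, the power-of-two restriction turns both pieces into simple bit operations. A standalone sketch (not the actual `esl_varint_rice*` API), again assuming the codeword fits in 64 bits:

```c
/* Golomb-Rice-k encoding of v >= 0: unary quotient v >> k, then the
 * remainder in exactly k bits. Sketch only.
 */
static void
rice_encode(uint64_t v, int k, uint64_t *code, int *nbits)
{
  uint64_t q = v >> k;                 /* quotient  floor(v / 2^k) */
  uint64_t r = v & ((1ULL << k) - 1);  /* remainder v mod 2^k      */

  *code  = ((((1ULL << q) - 1) << 1) << k) | r;  /* q 1's, a 0, then r */
  *nbits = (int) q + 1 + k;
}
```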

For example, 0..9 in a Golomb-Rice-2 code:

| $v$ | length | code     |
|-----|--------|----------|
| 0   | 3      | `0 00`   |
| 1   | 3      | `0 01`   |
| 2   | 3      | `0 10`   |
| 3   | 3      | `0 11`   |
| 4   | 4      | `10 00`  |
| 5   | 4      | `10 01`  |
| 6   | 4      | `10 10`  |
| 7   | 4      | `10 11`  |
| 8   | 5      | `110 00` |
| 9   | 5      | `110 01` |

Golomb-Rice codes are the simpler subset of Golomb codes where the $m$ parameter is a power of two ($m = 2^k$), so the remainder $r$ is always encoded by $k$ bits and does not need the fiddly truncated binary encoding.

### Exponential-Golomb code

* Encodes: integer $v$; $v \geq 0$
* Length: $2 \lfloor \log_2 (v+1) \rfloor + 1$
* Max $v$ in $b$ bits: $v < 2^{\lfloor \frac{b-1}{2} \rfloor + 1} - 1$

The number of leading 0's tells the decoder how many bits are in the encoded integer, and we encode $v+1$ instead of $v$ so that all integers, including 0, have a leading 1 bit.

To encode: let $w$ be the minimum width of the binary representation of $v+1$, i.e. $w = \lfloor \log_2 (v+1) \rfloor + 1$. Write $w-1$ leading 0 bits, followed by the $w$ bits of the binary representation of $v+1$.
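
A sketch of this encoder in C (hypothetical, not the `esl_varint_expgol*` API): since the leading bits are all 0's, the codeword's numeric value is just $v+1$, and only the length needs computing.

```c
/* Plain exponential Golomb (exp-Golomb-0) encoding of v >= 0.
 * Assumes v <= 2^32 - 2 so the codeword fits in 64 bits (see below).
 */
static void
expgol_encode(uint64_t v, uint64_t *code, int *nbits)
{
  int w = 0;
  while ((v+1) >> w) w++;  /* w = floor(log2(v+1)) + 1, width of v+1 */

  *code  = v + 1;          /* the w-1 leading 0's are implicit */
  *nbits = 2*w - 1;
}
```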

For example:

| $v$ | length | code       |
|-----|--------|------------|
| 0   | 1      | `1`        |
| 1   | 3      | `0 10`     |
| 2   | 3      | `0 11`     |
| 3   | 5      | `00 100`   |
| 4   | 5      | `00 101`   |
| 5   | 5      | `00 110`   |
| 6   | 5      | `00 111`   |
| 7   | 7      | `000 1000` |
| 8   | 7      | `000 1001` |
| 9   | 7      | `000 1010` |

An Elias gamma code for positive nonzero integers $v > 0$ is identical to the Exp-Golomb code for $v-1$: i.e. an Elias gamma code just directly encodes $v$ instead of $v+1$. (Yes, this can be a little confusing.)

An exp-Golomb code is exp-Golomb-0 in the generalized exp-Golomb codes below. Easel's implementation does not distinguish them; to get an exponential Golomb code, use exp-Golomb-0.

The largest integer $v$ that can be encoded in a `uint64_t` bitfield is $2^{32}-2$; in a `uint32_t`, $2^{16}-2$. The limited range of a 32-bit exp-Golomb encoding is why the Easel implementation is all in `uint64_t`.

### Generalized exponential-Golomb-k codes

* Encodes: integer $v$; $v \geq 0$
* Argument: $k$, the width of the binary representation of the remainder $r$; $k \geq 0$
* Length: $k + 2 \lfloor \log_2( \lfloor v / 2^k \rfloor + 1) \rfloor + 1$
* Max $v$ in $b$ bits: $v < 2^k \left( 2^{\lfloor \frac{b - k - 1}{2} \rfloor + 1} - 1 \right)$

Divide integer $v$ by $2^k$ to obtain quotient $q$ and remainder $r$. Encode $q$ using the exp-Golomb code (above), followed by the $k$-bit binary representation of $r$.
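
A sketch building on the `expgol_encode()` sketch above (hypothetical, not Easel's API); with $k=0$ it reduces to the plain exp-Golomb code:

```c
/* Generalized exp-Golomb-k encoding of v >= 0: quotient in plain
 * exp-Golomb, then the remainder in exactly k bits. Sketch only.
 */
static void
expgolk_encode(uint64_t v, int k, uint64_t *code, int *nbits)
{
  uint64_t q = v >> k;                 /* quotient  floor(v / 2^k)  */
  uint64_t r = v & ((1ULL << k) - 1);  /* remainder v mod 2^k       */
  uint64_t qcode;
  int      qn;

  expgol_encode(q, &qcode, &qn);       /* exp-Golomb quotient piece */
  *code  = (qcode << k) | r;           /* append k remainder bits   */
  *nbits = qn + k;
}
```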

For example, the Exp-Golomb-2 code:

| $v$ | length | code     |
|-----|--------|----------|
| 0   | 3      | `1 00`   |
| 1   | 3      | `1 01`   |
| 2   | 3      | `1 10`   |
| 3   | 3      | `1 11`   |
| 4   | 5      | `010 00` |
| 5   | 5      | `010 01` |
| 6   | 5      | `010 10` |
| 7   | 5      | `010 11` |
| 8   | 5      | `011 00` |
| 9   | 5      | `011 01` |

Exponential Golomb codes were introduced by [Teuhola, 1978], although his encoding differs in its details.

### Elias delta code

* Encodes: integer $v$; $v \geq 1$
* Length: $\lfloor \log_2 v \rfloor + 2 \lfloor \log_2 (\lfloor \log_2 v \rfloor + 1) \rfloor + 1$

Let $a = \lfloor \log_2 v \rfloor$, so the binary representation of $v$ is $a+1$ bits wide ($2^a$ is the highest power of 2 in $v$). Let $b = \lfloor \log_2 (a+1) \rfloor$, so the binary representation of $a+1$ is $b+1$ bits wide. The code consists of three parts: $b$ leading 0's, followed by the binary representation of $a+1$ in $b+1$ bits, followed by the representation of $v \bmod 2^a$ in $a$ bits.

In other words, encode $a+1$ in Elias gamma code, followed by the binary representation of $v$ without its leading (highest-order) 1 bit (i.e. the rightmost $a$ bits of $v$).
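
In C, the three parts might be assembled like this (a sketch, not the `esl_varint_delta*` API; it assumes the codeword fits in 64 bits):

```c
/* Elias delta encoding of v >= 1: b leading 0's, then a+1 in b+1 bits,
 * then the low-order a bits of v. Sketch only.
 */
static void
delta_encode(uint64_t v, uint64_t *code, int *nbits)
{
  int a = 0, b = 0;
  while (v >> (a+1))     a++;   /* a = floor(log2 v)     */
  while ((a+1) >> (b+1)) b++;   /* b = floor(log2 (a+1)) */

  /* the b leading 0's are implicit in the integer value */
  *code  = ((uint64_t)(a+1) << a) | (v & ((1ULL << a) - 1));
  *nbits = b + (b+1) + a;
}
```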

For example, 1..10 in Elias delta code:

| $v$ | length | code         |
|-----|--------|--------------|
| 1   | 1      | `1`          |
| 2   | 4      | `0 10 0`     |
| 3   | 4      | `0 10 1`     |
| 4   | 5      | `0 11 00`    |
| 5   | 5      | `0 11 01`    |
| 6   | 5      | `0 11 10`    |
| 7   | 5      | `0 11 11`    |
| 8   | 8      | `00 100 000` |
| 9   | 8      | `00 100 001` |
| 10  | 8      | `00 100 010` |

The Elias codes, including the gamma and delta codes, were introduced in [Elias, 1975].

### Google varint-k codes

* Encodes: integer $v$; $v \geq 0$
* Argument: $k$, the width of each code group in bits; $k \geq 2$
* Length: $k (1 + \lfloor \log_{2^{k-1}} v \rfloor)$ for $v > 0$; $v = 0$ takes one $k$-bit group
* Max $v$ in $b$ bits: $v < 2^{(k-1) \lfloor \frac{b}{k} \rfloor}$

Integer $v$ is encoded in base $2^{k-1}$ digits. Each digit is represented by a $k$-bit code group consisting of a one-bit continuation flag followed by the $k-1$ bits of the binary representation of the digit. Code groups are ordered with least significant digit first. The continuation flag is 1 if another code group follows.
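
A sketch of the encoder (hypothetical, not the `esl_varint_google*` API), emitting one $k$-bit group per byte of `buf` for clarity:

```c
#include <stdint.h>

/* Google varint-k encoding of v >= 0, for 2 <= k <= 8. Each output
 * byte holds one k-bit group: continuation flag in bit k-1, digit in
 * the low k-1 bits; least significant digit first. Returns the number
 * of groups written. Sketch only.
 */
static int
google_encode(uint64_t v, int k, uint8_t *buf)
{
  uint64_t mask = (1ULL << (k-1)) - 1;       /* one base-2^(k-1) digit   */
  int      n    = 0;

  do {
    uint8_t group = (uint8_t) (v & mask);    /* digit                    */
    v >>= (k-1);
    if (v) group |= (uint8_t)(1 << (k-1));   /* flag: more groups follow */
    buf[n++] = group;
  } while (v);
  return n;
}
```

With $k=8$, each group is one byte with the continuation flag in the high bit and seven payload bits below, matching the protobuf wire layout mentioned below.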

For example, 0..9 encoded in a google-2 code:

| $v$ | length | code          |
|-----|--------|---------------|
| 0   | 2      | `00`          |
| 1   | 2      | `01`          |
| 2   | 4      | `10 01`       |
| 3   | 4      | `11 01`       |
| 4   | 6      | `10 10 01`    |
| 5   | 6      | `11 10 01`    |
| 6   | 6      | `10 11 01`    |
| 7   | 6      | `11 11 01`    |
| 8   | 8      | `10 10 10 01` |
| 9   | 8      | `11 10 10 01` |

Google Protobuf uses a google-8 code (i.e. one byte per code group, base 128). This is actually the only Google varint code; they want it to be byte-aligned for efficiency reasons. Allowing the generalization to any $k$ bits is my own thing.

## Additional design notes

* Codes can be user input (for example, in HMMER zigars), so varint decoders need to treat bad codes as normal errors, not exceptions; see the sketch below.
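
For instance, a Golomb-Rice-$k$ decoder written in that style might return a nonzero status on truncated input rather than asserting. Here `next_bit()` is a hypothetical callback supplying the next input bit, or -1 at end of data:

```c
/* Golomb-Rice-k decoding with graceful failure on truncated input.
 * Sketch only, not Easel's API.
 */
static int
rice_decode(int (*next_bit)(void *ctx), void *ctx, int k, uint64_t *ret_v)
{
  uint64_t q = 0, r = 0;
  int      bit, i;

  while ((bit = next_bit(ctx)) == 1) q++;  /* unary quotient: count 1's      */
  if (bit < 0) return -1;                  /* truncated code: error, no crash */

  for (i = 0; i < k; i++) {                /* then k remainder bits          */
    if ((bit = next_bit(ctx)) < 0) return -1;
    r = (r << 1) | (uint64_t) bit;
  }
  *ret_v = (q << k) | r;                   /* v = q * 2^k + r                */
  return 0;
}
```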

## References

Elias, P. Universal codeword sets and representations of the integers. IEEE Trans Inform Theory 21:194-203, 1975.

Golomb, SW. Run-length encodings. IEEE Trans Inform Theory 12:399-401, 1966.

Teuhola, J. A compression method for clustered bit-vectors. Information Processing Letters 7:308-311, 1978.