Formalize dictionary encoding #37

Closed
julienledem opened this Issue Apr 5, 2013 · 3 comments

Comments

Projects
None yet
2 participants
Owner

julienledem commented Apr 5, 2013

When writing, the values writer will try to use dictionary encoding (PLAIN_DICTIONARY) unless the dictionary becomes too big, in which case it will fall back to plain encoding (PLAIN).
If used, the dictionary is stored in a dictionary page stored at the beginning of the column chunk. The values in the dictionary are stored in PLAIN encoding in order.

The dictionary is too big if one of those conditions apply:

The dictionary reaches the maximum size in bytes for the encoded dictionary. The maximum is smaller than a page and defined in a setting or proportionally to the page size.
The dictionary reaches the maximum number of entries. The maximum is 2^16 so that the code is 2 bytes max.
The integer ids returned by the dictionary can be stored in multiple ways:

Here is a proposed mechanism to let the writer pick one scheme depending on the data and size of dictionary.
2 bytes to store the id of the scheme used followed by the data encoded in one of those schemes:

  1. 2 bytes little endian per value (wasteful if we have few entries in the dictionary)
  2. bit packed: 1 byte to store the width in bits used followed by the values aligned to the next byte. (used if the size of the dictionary is small)
  3. bit packed + RLE: 1 byte to store the width followed by the encoded values. (if dictionary is small and values repeat often)
  4. group varint encoded (if dictionary is big)
Owner

julienledem commented Apr 5, 2013

For java implementation, see: See: Parquet/parquet-mr@master...dictionary_encoding

Contributor

nongli commented Apr 8, 2013

Let's discuss this tomorrow. I'll think over your proposals.

Owner

julienledem commented Jun 4, 2013

documentation updated in #43

julienledem closed this Jun 4, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment