Study state of the art floating point compression algorithms #306

Open
julienledem opened this Issue Feb 14, 2014 · 7 comments

@julienledem
Member

Study existing lossless floating point compression papers.
Provide reference implementation and benchmark comparison.

@cryptogrammer

Is there any suggested reading to understand the project in more detail?

@julienledem
Member

Here are some papers:
http://www.cs.unc.edu/~isenburg/lcpfpv/
http://blosc.pytables.org/trac
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.7936&rep=rep1&type=pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.124.8968&rep=rep1&type=pdf
http://users.ices.utexas.edu/~burtscher/papers/dcc07a.pdf
http://users.ices.utexas.edu/~burtscher/papers/tr08.pdf

There are also simple ideas, like splitting the mantissa and exponent into two columns and delta-encoding them separately as ints.
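
As a rough illustration of that split-and-delta idea, here is a minimal sketch for IEEE 754 doubles. The function names are hypothetical, and keeping the sign bit together with the exponent is an assumption made for simplicity, not something specified in this thread.

```python
import struct

MANTISSA_BITS = 52
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1

def split_doubles(values):
    """Split each float64 into two integer columns:
    (sign + exponent, 12 bits) and (mantissa, 52 bits)."""
    exps, mants = [], []
    for v in values:
        bits = struct.unpack(">Q", struct.pack(">d", v))[0]
        exps.append(bits >> MANTISSA_BITS)
        mants.append(bits & MANTISSA_MASK)
    return exps, mants

def join_doubles(exps, mants):
    """Inverse of split_doubles: rebuild the original bit patterns."""
    return [
        struct.unpack(">d", struct.pack(">Q", (e << MANTISSA_BITS) | m))[0]
        for e, m in zip(exps, mants)
    ]

def delta_encode(ints):
    """Store each integer as the difference from its predecessor."""
    out, prev = [], 0
    for x in ints:
        out.append(x - prev)
        prev = x
    return out

def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

Values of similar magnitude share an exponent, so the exponent column tends to produce many zero deltas. Whether the mantissa column benefits from delta encoding depends heavily on the data, which is part of what a benchmark would need to show.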

For Parquet, here are the requirements:

  • lossless
  • high throughput: compression ratio is not the only goal; it also needs to be fast
  • memory constraints (how much memory does this need?) / streaming compression
  • simple: we want to avoid unnecessary complexity. A 10% improvement is not worth the extra complexity
  • not dependent on specialized hardware (i.e., no GPU)

You can take a look at the delta encoding for integers in Parquet. It was inspired by the great paper by D. Lemire and L. Boytsov:
http://arxiv.org/abs/1209.2137
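
To give a feel for how delta encoding of integers combines with bit packing, here is a simplified sketch for a single block of values. It only computes the minimum delta and the bit width needed for the adjusted deltas; it is not Parquet's actual on-disk layout, and the function names are hypothetical.

```python
def encode_block(block):
    """Return (first_value, min_delta, bit_width, adjusted_deltas).

    Assumes the block holds at least two integers. Subtracting the
    minimum delta makes every adjusted delta non-negative, so each one
    fits in bit_width bits.
    """
    deltas = [b - a for a, b in zip(block, block[1:])]
    min_delta = min(deltas)
    adjusted = [d - min_delta for d in deltas]
    bit_width = max(adjusted).bit_length()
    return block[0], min_delta, bit_width, adjusted

def decode_block(first_value, min_delta, adjusted):
    """Inverse of encode_block: rebuild the original integers."""
    out = [first_value]
    for d in adjusted:
        out.append(out[-1] + d + min_delta)
    return out
```

For slowly varying sequences the bit width per value stays small, which is where the space saving comes from; the remaining work in a real encoder is packing the adjusted deltas into that many bits each.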

@julienledem
Member

The goal would be to try out several of those techniques, provide implementations and benchmark them.
In the end the best implementation based on the criteria I described above would become the floating point encoding for Parquet.

@dvryaboy
Contributor

Note that we also cannot use any patented techniques due to licensing.

@julienledem
Member

Now is the time to submit proposals on the GSoC website.

@julienledem added the "pick me up!" label and removed the GSoC-2014 label on Apr 22, 2014

@santoshv98

I am planning to pick this up. How long is this expected to take? Where do I start? Any pointers? How would I know which of those techniques are patented?

@mabrek
mabrek commented Oct 17, 2015

There is a method for compressing floating point values in the paper on Facebook's Gorilla time-series database:
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
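
The core of that method is XORing each value's bit pattern with the previous one, so that repeated or similar values yield results that are zero or have long runs of leading and trailing zero bits. The sketch below shows only that XOR step and the zero-run measurement; it leaves out the bit-level control codes the paper uses to write the output, and all function names are hypothetical.

```python
import struct

def to_bits(v):
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", v))[0]

def from_bits(b):
    """Reinterpret an unsigned 64-bit integer as a float64."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]

def xor_encode(values):
    """XOR each value's bits with its predecessor's (first one with 0)."""
    out, prev = [], 0
    for v in values:
        bits = to_bits(v)
        out.append(bits ^ prev)
        prev = bits
    return out

def xor_decode(xors):
    """Inverse of xor_encode."""
    out, prev = [], 0
    for x in xors:
        prev ^= x
        out.append(from_bits(prev))
    return out

def zero_runs(x):
    """Return (leading_zeros, trailing_zeros) of a 64-bit XOR result,
    or None when x is 0 (the value repeated exactly)."""
    if x == 0:
        return None
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1
    return leading, trailing
```

Only the bits between the leading and trailing zero runs need to be stored, and an exact repeat can be stored as a single bit. How well this works on non-time-series columns would need to be measured.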
