
blosc use case #73

Closed
OneArb opened this issue Dec 1, 2014 · 6 comments
OneArb commented Dec 1, 2014

I am checking whether I could use Blosc to compress strings roughly 1000 characters long.

As a test I am using the string "Methionylthreonylthreonylglutaminyla...", which is highly repetitive.

http://blog.jmay.us/2009/11/longest-english-word.html

I modified simple.c, and the best I can get at clevel 9 is 1.5x compression with shuffle and 2.8x without shuffle.

without shuffle:

chars  ratio
1000   1.4x
2000   1.8x
3000   2x
4000   2.1x
5000   2.3x

ZIP compresses the full string to 5.5x

Here are my settings:

```c
#define LINESIZE 98310
#define SIZE 100000
#define SHAPE {10,10,10}
#define CHUNKSHAPE {1,10,10}

static unsigned char data[LINESIZE];
static unsigned char data_out[SIZE];
static unsigned char data_dest[LINESIZE];
```

Questions:

1. Am I within expected compression ratios without switching to Zlib?
2. Is the block/string I intend to compress too small for Blosc's use case?
3. Is there any prospect of Blosc supporting indexed, random access to compressed blocks?
4. Any suggestions for performant "small" string compression?


OneArb commented Dec 3, 2014

Closing; further research answered most of my questions.

@OneArb OneArb closed this as completed Dec 3, 2014
FrancescAlted commented Dec 3, 2014
Yes, the default compressor in Blosc (BloscLZ) is geared towards speed, not compression ratio, but the included LZ4HC or Zlib can get better ratios, especially when using large block sizes. Does this match your research, or did you find something different?



OneArb commented Dec 8, 2014

  1. I found a few compression overviews:

     http://compressionratings.com/sort.cgi?rating_sum.brief+6n

     https://docs.google.com/spreadsheet/ccc?key=0AiLIAFlgldSodENkNEhIM3lDZEtBTlFUQ29FdWhvTEE&usp=sharing#gid=2

     http://heartofcomp.altervista.org/MOC/MOCACE.htm

     Would it be worth submitting Blosc and getting it into the fray?

Looking over the benchmark section, I notice that BloscLZ is the only decompressor able to outperform memcpy, at least on your machine.

The [Blosc Zlib benchmark](http://www.blosc.org/benchmarks-zlib.html) uses a different compression-ratio scale than the other compressors. It also starts at 0 (vs. 1), which interferes with the graph's readability.

A chart across compressors would ease comparison.

I sure would like to see BloscLZ take its due place within the compressor benchmark community.

  2. simple.c uses almost all CPU bandwidth on my 2-core machine. Is that expected?

@OneArb OneArb reopened this Dec 8, 2014
@OneArb OneArb closed this as completed Dec 8, 2014
esc commented Dec 8, 2014

Regarding the Zlib benchmarks: the first measurement is also at one, but because Zlib achieves such high compression ratios, especially on that dataset, it looks as if the measurement is at zero. Ideally we should start all graphs at one, since that means "no compression".

Regarding the speed of BloscLZ, I believe what you are seeing is a distortion due to measurement. The only benchmarks we have listed for LZ4 right now are from a BlueGene. That is an HPC architecture, and let's just say things behave differently there than on commodity hardware. I believe that both LZ4 and BloscLZ (maybe Snappy too) can outperform memcpy when driven by Blosc. The reason we don't have any LZ4 benchmarks listed yet is that driving LZ4 from Blosc has only been officially supported for about a year; support for BloscLZ is much older, so many more benchmarks have accumulated for it.

esc commented Dec 8, 2014

FYI: the reason we get these "off-the-charts" ratios for Zlib is the shuffle filter in Blosc, which can pre-condition certain datasets favorably for Zlib, effectively boosting the compression ratio.

See also: http://slides.zetatech.org/haenel-ep14-compress-me-stupid.pdf page 23 onwards

OneArb commented Dec 8, 2014

https://www.youtube.com/watch?v=IzqlWUTndTo at 9:39 provides the comparative chart I was looking for. LZ4 does indeed seem a bit faster overall across the range, linear, and random distributions.

At 11:19 there are per-compressor charts vs. memcpy for each distribution type.

I see Intel Core i5 results for each supported compressor on http://blosc.org/synthetic-benchmarks.html, so perhaps the benchmark distortion has some other source?
