
blosc use case #73

Closed
OneArb opened this issue Dec 1, 2014 · 6 comments
OneArb commented Dec 1, 2014

I am checking whether I could use Blosc to compress strings roughly 1000 characters long.

As a test I am using the string "Methionylthreonylthreonylglutaminyla...", which is highly repetitive.

http://blog.jmay.us/2009/11/longest-english-word.html

I modified simple.c, and the best I can get at clevel 9 is 1.5x compression with shuffle and 2.8x without shuffle.

without shuffle:

chars  ratio
1000   1.4x
2000   1.8x
3000   2x
4000   2.1x
5000   2.3x

ZIP compresses the full string to 5.5x

Here are my settings:

```c
#define LINESIZE 98310
#define SIZE 100000
#define SHAPE {10,10,10}
#define CHUNKSHAPE {1,10,10}

static unsigned char data[LINESIZE];
static unsigned char data_out[SIZE];
static unsigned char data_dest[LINESIZE];
```

Questions:

1. Am I within expected compression ratios without switching to Zlib?
2. Is the block/string I intend to compress too small for Blosc's use case?
3. Is there any prospect of Blosc supporting indexed, random access to compressed blocks?
4. Any suggestions for performant "small" string compression?


OneArb commented Dec 3, 2014

Closing; further research answered most of my questions.

@OneArb OneArb closed this as completed Dec 3, 2014
FrancescAlted commented Dec 3, 2014
Yes, the default compressor in Blosc (BloscLZ) is geared towards speed, not compression ratio, but the included LZ4HC or Zlib can get better ratios, especially when using large block sizes. Does this match your research, or did you find something different?



OneArb commented Dec 8, 2014

  1. I found a few compression overviews:

     http://compressionratings.com/sort.cgi?rating_sum.brief+6n

     https://docs.google.com/spreadsheet/ccc?key=0AiLIAFlgldSodENkNEhIM3lDZEtBTlFUQ29FdWhvTEE&usp=sharing#gid=2

     http://heartofcomp.altervista.org/MOC/MOCACE.htm

     Would it be worth submitting Blosc and getting it into the fray?

Looking over the benchmark section, I notice that BloscLZ is the only decompressor able to outperform memcpy, at least on your machine.

The [Blosc Zlib benchmark](http://www.blosc.org/benchmarks-zlib.html) uses a different compression-ratio scale than the other compressors. It also starts at 0 (vs. 1), which interferes with the graph's readability.

A chart across compressors would ease comparison.

I sure would like to see BloscLZ take its due place within the compressor benchmark community.

  2. simple.c uses almost all CPU bandwidth on my 2-core machine. Is that expected?

@OneArb OneArb reopened this Dec 8, 2014
@OneArb OneArb closed this as completed Dec 8, 2014
esc commented Dec 8, 2014

Regarding the Zlib benchmarks: the first measurement is also at one, but because Zlib achieves such high compression ratios, especially on that dataset, it looks as if the measurement is at zero. Ideally we should start all graphs at one, since that means "no compression".

Regarding the speed of BloscLZ, I believe what you are seeing is a distortion due to measurement. The only benchmarks we have listed for LZ4 right now are from a BlueGene. That is an HPC architecture, and let's just say things behave differently there than on commodity hardware. I believe that both LZ4 and BloscLZ (maybe Snappy too) can outperform memcpy when driven by Blosc. The reason we don't have any LZ4 benchmarks listed yet is that driving LZ4 from Blosc has only been officially supported for about a year; support for BloscLZ is much older, so many more benchmarks have accumulated for it.

esc commented Dec 8, 2014

FYI: the reason we get these "off-the-charts" ratios for Zlib is the shuffle filter in Blosc, which can pre-condition certain datasets favorably for Zlib, effectively boosting the compression ratio.

See also: http://slides.zetatech.org/haenel-ep14-compress-me-stupid.pdf page 23 onwards

OneArb commented Dec 8, 2014

https://www.youtube.com/watch?v=IzqlWUTndTo at 9:39 provides the comparative chart I was looking for. LZ4 does indeed seem a bit faster overall across the range, linear, and random distributions.

At 11:19 there are per-compressor charts vs. memcpy for each distribution type.

I see Intel Core i5 results for each supported compressor on http://blosc.org/synthetic-benchmarks.html, so perhaps the benchmark distortion has some other source?
