
What is Blosc

Blosc (http://www.blosc.org) is a high-performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() call. Blosc works well for compressing numerical arrays that contain data with relatively low entropy, such as sparse data, time series, or grids with regularly spaced values.

python-blosc is a Python package that wraps the Blosc library.
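
As a quick taste of the API, here is a minimal sketch that compresses a bytes buffer and recovers it with blosc.compress() and blosc.decompress(); only the typesize argument (the width in bytes of the underlying data type) is given here, and the other compression parameters are left at their defaults:

>>> import blosc
>>> data = b"\x00" * 10000000   # 10 MB of highly compressible bytes
>>> packed = blosc.compress(data, typesize=8)   # typesize helps the shuffle filter
>>> blosc.decompress(packed) == data
True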

Easy install

You can use the PyPI repository with the pip command line utility:

$ pip install blosc

or, if you prefer compiling the sources yourself, read on.
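
A quick way to check that the installation worked is to import the package and print its version (this assumes the package exposes a __version__ attribute, as recent releases do; your version string will differ):

$ python -c "import blosc; print(blosc.__version__)"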

Building from sources

Assuming that you have a C compiler installed, do:

$ python setup.py build_ext --inplace

This package supports Python 2.6, 2.7 and 3.3 or higher.

Testing from sources

After compiling, you can quickly check that the package is sane by running:

$ export PYTHONPATH=.   (or "set PYTHONPATH=." on Win)
$ python blosc/toplevel.py  (add -v for verbose mode)
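
Alternatively, assuming your python-blosc version ships the blosc.test() helper, the test suite can be run with a one-liner:

$ python -c "import blosc; blosc.test()"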

Installing from sources

Install it as a typical Python package:

$ python setup.py install

Some basic benchmarks

[Figures below were obtained using a VM with only 2 cores on top of an Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz]

# Let's create a NumPy array with 80 MB full of data
>>> import numpy as np
>>> a = np.linspace(0, 100, int(1e7))
>>> bytes_array = a.tobytes()  # get a bytes stream

# Blosc is a very fast compressor ...
>>> import zlib
>>> %time zpacked = zlib.compress(bytes_array)
CPU times: user 4.03 s, sys: 0.03 s, total: 4.06 s
Wall time: 4.08 s   # ~ 20 MB/s
>>> import blosc
>>> %time bpacked = blosc.compress(bytes_array, typesize=8)
CPU times: user 0.10 s, sys: 0.00 s, total: 0.11 s
Wall time: 0.05 s   # ~ 1.6 GB/s and 80x faster than zlib
>>> %time acp = a.copy()   # a copy of the actual data (using memcpy() behind the scenes)
CPU times: user 0.03 s, sys: 0.01 s, total: 0.04 s
Wall time: 0.04 s   # ~ 2 GB/s, just 25% faster than Blosc

# ... that is optimized for compressing binary data ...
>>> len(zpacked)
52994692
>>> len(bytes_array) / float(len(zpacked))
1.5095851486409242   # zlib achieves a 1.5x compression ratio
>>> len(bpacked)
7641156
>>> len(bytes_array) / float(len(bpacked))
10.469620041784253   # blosc reaches more than 10x compression ratio

# Blosc is also extremely fast when decompressing
>>> %time bytes_array2 = zlib.decompress(zpacked)
CPU times: user 0.28 s, sys: 0.02 s, total: 0.30 s
Wall time: 0.31 s   # ~ 260 MB/s
>>> %time bytes_array2 = blosc.decompress(bpacked)
CPU times: user 0.07 s, sys: 0.02 s, total: 0.09 s
Wall time: 0.05 s   # ~ 1.6 GB/s and 6x faster than zlib

# You can pack and unpack NumPy arrays very easily too:
>>> packed = blosc.pack_array(a)
>>> a2 = blosc.unpack_array(packed)
>>> np.array_equal(a, a2)
True
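
Compression can be tuned as well. The sketch below assumes the clevel and cname keyword arguments of blosc.compress() and the blosc.compressor_list() helper; which codecs are actually available depends on how the underlying C-Blosc library was built:

# Tuning the compressor: compression level and codec are selectable
>>> blosc.compressor_list()   # codecs compiled into your C-Blosc build, e.g. 'blosclz', 'lz4', ...
>>> packed_lz4 = blosc.compress(bytes_array, typesize=8, clevel=5, cname='lz4')
>>> a3 = np.frombuffer(blosc.decompress(packed_lz4), dtype=a.dtype)
>>> np.array_equal(a, a3)
True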

Documentation

Please refer to http://python-blosc.blosc.org/. You can also have a look at the docstrings. Start with the main package:

>>> import blosc
>>> help(blosc)

and then ask for help on the functions referenced there.

Mailing list

There is an official mailing list for Blosc at:

http://groups.google.es/group/blosc

That's it! Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.