
buffer type support #65

Closed
esc wants to merge 4 commits from the feature/buffer_type_support branch

Conversation

@esc (Member) commented May 30, 2014

Under certain circumstances it is advantageous to supply a buffer type to the compression and decompression routines. My use case is decompressing NumPy arrays from strings in Bloscpack. The input data is a string, and during decompression the individual chunks need to be taken from that string. One way to do this is to slice the string; however, AFAIU that necessitates a memory copy. The current implementation uses a cStringIO object, since much of Bloscpack is implemented around file pointers, but from what I understand, reading from a cStringIO object also requires a memory copy. The alternative is to use the buffer builtin to emulate a file:

esc/bloscpack@dd77456

And then use that as a source for the compressed data.
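For illustration, here is a minimal sketch of the idea in Python 2 (the class name and details are hypothetical, not the actual code in the commit above):

class BufferFile(object):
    """A read-only, file-like view over a string that avoids copying it."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def read(self, nbytes):
        # buffer(obj, offset, size) returns a zero-copy view, whereas
        # self.data[self.pos:self.pos + nbytes] would copy the bytes
        chunk = buffer(self.data, self.pos, nbytes)
        self.pos += nbytes
        return chunk

    def seek(self, pos):
        self.pos = pos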

Initial benchmarks look promising:

In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import numpy as np
:import bloscpack as bp
:a = np.linspace(0, 100, 2e8)
:shuffle = True
:clevel = 9
:cname = 'lz4'
:bargs = bp.args.BloscArgs(clevel=clevel, shuffle=shuffle, cname=cname)
:bpargs = bp.BloscpackArgs(checksum='None', offsets=False, max_app_chunks=0)
:bpc = bp.pack_ndarray_str(a, blosc_args=bargs, bloscpack_args=bpargs,
:        chunk_size='0.5G')
:--

In [4]: %timeit a3 = bp.unpack_ndarray_str(bpc)
1 loops, best of 3: 390 ms per loop

In [5]: %timeit a3 = bp.unpack_ndarray_str(bpc)
1 loops, best of 3: 389 ms per loop

In [6]: %timeit a3 = bp.fast_unpack_ndarray_str(bpc)
1 loops, best of 3: 336 ms per loop

In [7]: %timeit a3 = bp.fast_unpack_ndarray_str(bpc)
1 loops, best of 3: 337 ms per loop

Here fast_unpack_ndarray_str uses the buffer under the hood. In fact, this solution is of the same order as the plain decompress_ptr method:

In [9]: import blosc

In [10]: c = blosc.compress_ptr(a.__array_interface__['data'][0], a.size, a.dtype.itemsize, clevel=clevel, shuffle=shuffle, cname=cname)

In [11]: %timeit a2 = np.empty_like(a) ; bytes_written = blosc.decompress_ptr(c, a2.__array_interface__['data'][0])
1 loops, best of 3: 334 ms per loop

In [12]: %timeit a2 = np.empty_like(a) ; bytes_written = blosc.decompress_ptr(c, a2.__array_interface__['data'][0])
1 loops, best of 3: 334 ms per loop

Support was easy to implement since we use s# in PyArg_ParseTuple, which can accept a buffer as input and will expose it as a C string:

s# (string, Unicode or any read buffer compatible object) [const char *, int (or Py_ssize_t, see below)]

    This variant on s stores into two C variables, the first one a pointer to a character string, the second one its length. In this case the Python string may contain embedded null bytes. Unicode objects pass back a pointer to the default encoded string version of the object if such a conversion is possible. All other read-buffer compatible objects pass back a reference to the raw internal data representation.

So the only thing left to do was to allow this from toplevel.py.
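For example, a hypothetical usage sketch on this branch (Python 2), decompressing a chunk straight out of a larger string without slicing (and thereby copying) it first:

import blosc

payload = blosc.compress('x' * 1000, typesize=1)
blob = 'HEADER--' + payload           # compressed chunk embedded in a bigger string
view = buffer(blob, 8, len(payload))  # zero-copy view of the chunk
assert blosc.decompress(view) == 'x' * 1000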

A remaining issue is that buffer was removed in Python 3, and the documentation seems to suggest using a memoryview instead, but I can't get it to work: the error is something about requiring a 'pinned' read-only buffer. I hope to look into a Py3 solution soon, but wanted to float this idea already anyway.
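For reference, a sketch of what the Python 3 variant would look like; as described above, this currently fails on this branch:

import blosc

data = b'x' * 1000
c = blosc.compress(data, typesize=1)
view = memoryview(c)    # Python 3 replacement for the removed buffer builtin
blosc.decompress(view)  # currently raises the 'pinned' read-only buffer error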

@@ -297,6 +299,9 @@ def compress(bytesobj, typesize, clevel=9, shuffle=True, cname='blosclz'):
>>> c_bytesobj = blosc.compress(a_bytesobj, typesize=4)
>>> len(c_bytesobj) < len(a_bytesobj)
True
>>> c_bytesobj = blosc.compress(buffer(a_bytesobj), typesize=4)


This makes the doctest (on Travis) fail on Python 3.x.

@esc (Member, Author) commented Mar 28, 2015

Yes, the Python 3 issue is a problem.

@esc (Member, Author) commented Mar 29, 2015

This will probably be superseded by #80

@esc (Member, Author) commented May 26, 2015

Probably superseded by #80

@esc (Member, Author) commented May 26, 2015

Definitely superseded by #80, closing.

@esc closed this May 26, 2015
@esc deleted the feature/buffer_type_support branch May 26, 2015 19:13