
buffer type support #65

Closed
esc wants to merge 4 commits from the feature/buffer_type_support branch

Conversation

@esc (Member) commented May 30, 2014

Under certain circumstances it is advantageous to supply a buffer type to the compression and decompression routines. My use case is decompressing NumPy arrays from strings in Bloscpack. The input data is a string, and during decompression the individual chunks need to be taken from that string. One way to do this is to slice the string; however, AFAIU that necessitates a memory copy. The current implementation uses a cStringIO object, since much of Bloscpack is implemented around file pointers, but from what I understand, reading from a cStringIO object also requires a memory copy. The alternative is to use the buffer builtin to emulate a file:

esc/bloscpack@dd77456

And then use that as a source for the compressed data.
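For illustration, here is a minimal sketch of the idea in Python 2 (the class name and details are hypothetical, not the actual code in the commit above):

class BufferFile(object):
    """A read-only, file-like view over a string that avoids copying it."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def read(self, nbytes):
        # buffer(obj, offset, size) returns a zero-copy view, whereas
        # self.data[self.pos:self.pos + nbytes] would copy the bytes
        chunk = buffer(self.data, self.pos, nbytes)
        self.pos += nbytes
        return chunk

    def seek(self, pos):
        self.pos = pos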

Initial benchmarks look promising:

In [2]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import numpy as np
:import bloscpack as bp
:a = np.linspace(0, 100, 2e8)
:shuffle = True
:clevel = 9
:cname = 'lz4'
:bargs = bp.args.BloscArgs(clevel=clevel, shuffle=shuffle, cname=cname)
:bpargs = bp.BloscpackArgs(checksum='None', offsets=False, max_app_chunks=0)
:bpc = bp.pack_ndarray_str(a, blosc_args=bargs, bloscpack_args=bpargs,
:        chunk_size='0.5G')
:--

In [4]: %timeit a3 = bp.unpack_ndarray_str(bpc)
1 loops, best of 3: 390 ms per loop

In [5]: %timeit a3 = bp.unpack_ndarray_str(bpc)
1 loops, best of 3: 389 ms per loop

In [6]: %timeit a3 = bp.fast_unpack_ndarray_str(bpc)
1 loops, best of 3: 336 ms per loop

In [7]: %timeit a3 = bp.fast_unpack_ndarray_str(bpc)
1 loops, best of 3: 337 ms per loop

Here fast_unpack_ndarray_str uses the buffer under the hood. In fact, this solution is of the same order as the plain decompress_ptr method:

In [9]: import blosc

In [10]: c = blosc.compress_ptr(a.__array_interface__['data'][0], a.size, a.dtype.itemsize, clevel=clevel, shuffle=shuffle, cname=cname)

In [11]: %timeit a2 = np.empty_like(a) ; bytes_written = blosc.decompress_ptr(c, a2.__array_interface__['data'][0])
1 loops, best of 3: 334 ms per loop

In [12]: %timeit a2 = np.empty_like(a) ; bytes_written = blosc.decompress_ptr(c, a2.__array_interface__['data'][0])
1 loops, best of 3: 334 ms per loop

Support was easy to implement since we use s# in PyArg_ParseTuple, which can accept a buffer as input and will expose it as a C string:

s# (string, Unicode or any read buffer compatible object) [const char *, int (or Py_ssize_t, see below)]

    This variant on s stores into two C variables, the first one a pointer to a character string, the second one its length. In this case the Python string may contain embedded null bytes. Unicode objects pass back a pointer to the default encoded string version of the object if such a conversion is possible. All other read-buffer compatible objects pass back a reference to the raw internal data representation.

So the only thing left to do was to allow this from toplevel.py.
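For example, a hypothetical usage sketch on this branch (Python 2), decompressing a chunk straight out of a larger string without slicing (and thereby copying) it first:

import blosc

payload = blosc.compress('x' * 1000, typesize=1)
blob = 'HEADER--' + payload           # compressed chunk embedded in a bigger string
view = buffer(blob, 8, len(payload))  # zero-copy view of the chunk
assert blosc.decompress(view) == 'x' * 1000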

A remaining issue is that buffer was removed in Python 3, and the documentation seems to suggest using a memoryview instead, but I can't get it to work: the error is something about requiring a 'pinned' read-only buffer. I hope to look into a Py3 solution soon, but wanted to float this idea already anyway.
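For reference, a sketch of what the Python 3 variant would look like; as described above, this currently fails on this branch:

import blosc

data = b'x' * 1000
c = blosc.compress(data, typesize=1)
view = memoryview(c)    # Python 3 replacement for the removed buffer builtin
blosc.decompress(view)  # currently raises the 'pinned' read-only buffer error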

@@ -297,6 +299,9 @@ def compress(bytesobj, typesize, clevel=9, shuffle=True, cname='blosclz'):
>>> c_bytesobj = blosc.compress(a_bytesobj, typesize=4)
>>> len(c_bytesobj) < len(a_bytesobj)
True
>>> c_bytesobj = blosc.compress(buffer(a_bytesobj), typesize=4)


This makes the doctest (on Travis) fail on Python 3.x.

@esc (Member, Author) commented Mar 28, 2015

Yes, the Python 3 issue is a problem.

@esc (Member, Author) commented Mar 29, 2015

This will probably be superseded by #80

@esc (Member, Author) commented May 26, 2015

Probably superseded by #80

@esc (Member, Author) commented May 26, 2015

Definitely superseded by #80, closing.

@esc closed this May 26, 2015
@esc deleted the feature/buffer_type_support branch May 26, 2015 19:13