Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

165 hdf5 in memory using core driver #173

Conversation

michalslonina
Copy link

This patch adds the ability to use different libhdf5 file drivers (H5FD_SEC2,H5FD_CORE,H5FD_STDIO) and one special fake driver H5FD_CORE_INMEMORY that enables the user to read and write hdf5 documents in memory.

Setting the driver is done by passing the DRIVER **kwarg to openFile function.
To pass an in-memory document to libhdf5, please pass the python string as the H5FD_CORE_INMEMORY_IMAGE argument to openFile function and set the DRIVER to H5FD_CORE_INMEMORY.

To write an in-memory just set the DRIVER to H5FD_CORE_INMEMORY and use pytables as usual.
After closing the file object, your data will be made available thru the file.getInMemoryFileContents().

@scopatz
Copy link
Member

scopatz commented Jul 26, 2012

Quick comment @michalslonina, the DRIVER kwarg should be lowercase (driver) to match the rest of the code base.

@michalslonina
Copy link
Author

Hi @scopatz. I don't exactly understand you. Looking at the code, i see a few examples like:

# Set the cache size
set_cache_size(self.file_id, params['METADATA_CACHE_SIZE'])

# Set the maximum number of threads for Blosc
setBloscMaxThreads(params['MAX_BLOSC_THREADS'])

Also in the parameters.py every parameter is upper case, like:
BOUNDS_MAX_SIZE = 1*_MB
"""The maximum size for bounds values cached during index lookups."""

BOUNDS_MAX_SLOTS = 4*1024
"""The maximum number of slots for the BOUNDS cache."""

Should I convert all of them to lower case ?

@scopatz
Copy link
Member

scopatz commented Jul 27, 2012

In parameters.py, these are all global 'constants', so these should be uppercase. However, keyword arguments to functions should always lowercase with underscores.

# Good
def f(driver=""):
    pass

# Bad
def f(DRIVER=""):
    pass

I hope this clarifies things.

@avalentino
Copy link
Member

@michalslonina, sorry for the late replay.
I have a build failure on GNU/Linux (Ubuntu 12.04 amd64) with GCC 4.6.3 and HDF5 1.8.4-patch1:

$ python setup.py build_ext --inplace
* Found numpy 1.6.1 package installed.
* Found numexpr 1.4.2 package installed.
* Found Cython 0.15.1 package installed.
* Found HDF5 headers at ``/usr/include``, library at ``/usr/lib``.
* Found HDF5HL headers at ``/usr/include``, library at ``/usr/lib``.
.. WARNING:: Could not find the HDF5HL runtime.
   The HDF5HL shared library was *not* found in the default library
   paths. In case of runtime problems, please remember to install it.
* Found LZO 2 headers at ``/usr/include``, the library is located in the standard system search dirs.
* Skipping detection of LZO 1 since LZO 2 has already been found.
* Found bzip2 headers at ``/usr/include``, the library is located in the standard system search dirs.

[CUT]

src/H5PCORE-mem.c:8:1: warning: initialization makes integer from pointer without a cast [enabled by default]
src/H5PCORE-mem.c:8:1: warning: (near initialization for ‘udata.len’) [enabled by default]
src/H5PCORE-mem.c:10:33: error: unknown type name ‘H5FD_file_image_op_t’
src/H5PCORE-mem.c:16:9: error: unknown type name ‘H5FD_file_image_op_t’
src/H5PCORE-mem.c:21:45: error: unknown type name ‘H5FD_file_image_op_t’
src/H5PCORE-mem.c:27:30: error: unknown type name ‘H5FD_file_image_op_t’
src/H5PCORE-mem.c:40:1: error: unknown type name ‘H5FD_file_image_callbacks_t’
src/H5PCORE-mem.c:40:42: error: ‘image_malloc’ undeclared here (not in a function)
src/H5PCORE-mem.c:40:56: error: ‘image_memcpy’ undeclared here (not in a function)
src/H5PCORE-mem.c:40:1: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:40:1: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c:41:5: error: ‘image_realloc’ undeclared here (not in a function)
src/H5PCORE-mem.c:41:5: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:41:5: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c:41:20: error: ‘image_free’ undeclared here (not in a function)
src/H5PCORE-mem.c:41:5: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:41:5: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c:42:5: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:42:5: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c:42:5: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:42:5: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c:43:5: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:43:5: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c: In function ‘H5Fcreate_inmemory’:
src/H5PCORE-mem.c:49:14: error: request for member ‘udata’ in something not a structure or union
src/H5PCORE-mem.c:50:5: warning: implicit declaration of function ‘H5Pset_file_image_callbacks’ [-Wimplicit-function-declaration]
src/H5PCORE-mem.c:8:1: warning: initialization makes integer from pointer without a cast [enabled by default]
src/H5PCORE-mem.c:8:1: warning: (near initialization for ‘udata.len’) [enabled by default]
src/H5PCORE-mem.c:10:33: error: unknown type name ‘H5FD_file_image_op_t’
src/H5PCORE-mem.c:16:9: error: unknown type name ‘H5FD_file_image_op_t’
src/H5PCORE-mem.c:21:45: error: unknown type name ‘H5FD_file_image_op_t’
src/H5PCORE-mem.c:27:30: error: unknown type name ‘H5FD_file_image_op_t’
src/H5PCORE-mem.c:40:1: error: unknown type name ‘H5FD_file_image_callbacks_t’
src/H5PCORE-mem.c:40:42: error: ‘image_malloc’ undeclared here (not in a function)
src/H5PCORE-mem.c:40:56: error: ‘image_memcpy’ undeclared here (not in a function)
src/H5PCORE-mem.c:40:1: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:40:1: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c:41:5: error: ‘image_realloc’ undeclared here (not in a function)
src/H5PCORE-mem.c:41:5: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:41:5: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c:41:20: error: ‘image_free’ undeclared here (not in a function)
src/H5PCORE-mem.c:41:5: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:41:5: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c:42:5: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:42:5: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c:42:5: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:42:5: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c:43:5: warning: excess elements in scalar initializer [enabled by default]
src/H5PCORE-mem.c:43:5: warning: (near initialization for ‘callbacks’) [enabled by default]
src/H5PCORE-mem.c: In function ‘H5Fcreate_inmemory’:
src/H5PCORE-mem.c:49:14: error: request for member ‘udata’ in something not a structure or union
src/H5PCORE-mem.c:50:5: warning: implicit declaration of function ‘H5Pset_file_image_callbacks’ [-Wimplicit-function-declaration]
error: Command "gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -DNDEBUG=1 -DHAVE_HDF5HL_LIB=1 -DHAVE_LZO2_LIB=1 -DHAVE_BZ2_LIB=1 -Iblosc -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c src/H5PCORE-mem.c -o build/temp.linux-x86_64-2.7/src/H5PCORE-mem.o -Isrc -DH5Acreate_vers=2 -DH5Aiterate_vers=2 -DH5Dcreate_vers=2 -DH5Dopen_vers=2 -DH5Eclear_vers=2 -DH5Eprint_vers=2 -DH5Epush_vers=2 -DH5Eset_auto_vers=2 -DH5Eget_auto_vers=2 -DH5Ewalk_vers=2 -DH5E_auto_t_vers=2 -DH5Gcreate_vers=2 -DH5Gopen_vers=2 -DH5Pget_filter_vers=2 -DH5Pget_filter_by_id_vers=2 -DH5Tarray_create_vers=2 -DH5Tget_array_dims_vers=2 -DH5Z_class_t_vers=2 -DH5_NO_DEPRECATED_SYMBOLS" failed with exit status 1

@avalentino
Copy link
Member

@michalslonina, it seems that some of the features you used are only available in HDF5 >= 1.8.9.
it is a constraint that is a little bit too strong IMO (also considering the dependency from hdf5_hl).

Do you this it is possible to enable this feature conditionally (only if all dependencies are available)?

@scopatz
Copy link
Member

scopatz commented Jul 28, 2012

Or in response to the version issue, to make what you have work with HDF5 v1.8.4+ would be even better!

@michalslonina
Copy link
Author

@avalentino, @scopatz thanks for the incredible feedback and the code review. I agree with you 100% on the mentioned issues. I'll update my branch this week.

I'll try to write a better setup script that does a better job at library version detection and write some fallback too.

@avalentino Regarding the assert, you are right, this was a bad idea.

Regarding the chunk cache, i found a small explanation how it works here:
http://www.hdfgroup.org/HDF5/doc/H5.user/Chunking.html#S2
Looks like the chunk cache has the effect on how the data is passed to the filters, so ignoring it was not very wise move on my side.

((hvl_t ) udata)->p = size; / the author of this should be hanged drawn and quartered */

@scopatz
Copy link
Member

scopatz commented Jul 30, 2012

Thanks @michalslonina! I look forward to getting this merged in... Just let us know when you want us to look at it again.

  handling
- fixed a dangerous memory initialization bug
@michalslonina
Copy link
Author

Hi Guys, have some spare time, so I can get back and fix all the stuff. To be clear on the params issue, I would like to clarify a few things. I've added DRIVER, as well as H5FD_CORE_INMEMORY_IMAGE parameters to the processing of file **kwargs. I will delete the DRIVER from parameters.py and move it to file.py:File.init(..., driver=None). H5FD_CORE_INMEMORY_IMAGE parameter should be lowercase as well, since its not a constant, right ?

All the other parameters referenced, like CHUNK_CACHE__,MAX_BLOSC__, MAX_NUMEXPR_THREADS_* stay uppercase, even though you can pass them as *_kwarg to the File.init() function (as i understand, this code copies the args from parameters.py to params variable that represents union of *kwargs and parameters.py variables.
from file.py:470 File.init()
# Get all the parameters in parameter file(s)
params = dict([(k, v) for k, v in parameters.dict.iteritems()
if k.isupper() and not k.startswith('
')])
# Update them with possible keyword arguments
params.update(kwargs)

Considering the above, the DRIVER and H5FD_CORE_INMEMORY_IMAGE will still be lowercase, right ?

@scopatz
Copy link
Member

scopatz commented Aug 14, 2012

Yes, I think that is correct. driver and memory_image are supposed to lowercase.

@avalentino
Copy link
Member

While I agree with @scopatz that keywords should be lowercase I would like to preserve consistency with the current convention used for tables.File keyword parameters (that currently are uppercase).

Also, IMO at least the DRIVER, but probably also H5FD_CORE_INMEMORY_IMAGE, should be defined in parameters.py in order to allow global configuration in the same fashion of other parameters regarding file creation.

My suggestion is to keep using the old convention (uppercase parameters) and then then address the coding style question in a second time for all keyword parameters altogether.

After all we are planning a big API change to improve the coding style.

The patch in this case would e trivial:

params = dict([(k.lower(), v) for k, v in parameters.dict.iteritems()

instead of

params = dict([(k, v) for k, v in parameters.dict.iteritems()

What do you think about it?

@michalslonina
Copy link
Author

@avalentino The convention you are asking was already implemented in 58e92fd. I've changed it on your request guys. I can revert the patch if you like. It will be wise to move DRIVER back to the parameters.py. As for the H5FD_CORE_INMEMORY_IMAGE, it doesn't matter. The parameter is used only to provide the string that holds the image of the file.

IMHO keeping the DRIVER in parameters.py is more consistent with the current implementation. It also allows provides one point in code where you can change the default behavior and the code looks cleaner as the default values will not need to be passed explicitly between function calls.

@avalentino @scopatz I will gladly accept any decision you make.

@scopatz
Copy link
Member

scopatz commented Aug 14, 2012

I leave it up to @avalentino

@avalentino
Copy link
Member

Oh it was just an idea :)
Please @michalslonina go ahead as you feel it is better.

@michalslonina
Copy link
Author

@avalentino I'm a new contributor to the project, i don't think I should make such decisions. I'll take your last proposal as the official direction ;)

michalslonina added 4 commits August 15, 2012 13:30
…er and"

This reverts commit fcc2db8.

Conflicts:

	tables/hdf5Extension.pyx
- fixed the case with no DRIVER argument provided
- fix compilation of H5PCORE-mem
@michalslonina
Copy link
Author

Summary of the fixed problems in the recent commits:

  • autodetection of libhdf5_hl works, if library is not present reading in-memory hdf5 file doesn't work
  • parameters are upper case
  • fixed the exception when no DRIVER keyword was provided
  • CHUNK_CACHE_* is not ignored for in memory file writes

@scopatz scopatz closed this Aug 15, 2012
@scopatz
Copy link
Member

scopatz commented Aug 15, 2012

@michalslonina Thanks a ton! Sorry for accidentally closing....

@scopatz scopatz reopened this Aug 15, 2012
@avalentino
Copy link
Member

Thanks @michalslonina, I'll work on it ASAP.

@michalslonina
Copy link
Author

@avalentino @scopatz It's a pleasure working with you guys. PyTables is an amazing piece of software. Please let me know if there are any other issues I should fix.

@scopatz
Copy link
Member

scopatz commented Aug 17, 2012

Thanks for pushing this feature through, @michalslonina. I'll let @avalentino merge this in.

avalentino added a commit that referenced this pull request Aug 25, 2012
@ghost ghost assigned avalentino Aug 25, 2012
@avalentino
Copy link
Member

The pull request is now merged into develop with some modifications:

  • dropped dependency from hdg5_hl
  • the special driver "H5FD_CORE_INMEMORY" has been dropped.
    Now in-memory files can be controlled using the H5FD_CORE and its parameters.

Probably writing a short tutorial about this wold be a good idea.

Thanks again @michalslonina.

@avalentino avalentino closed this Aug 25, 2012
@nehalecky
Copy link

Hey all. This feature looks awesome, and is just what I was looking for to bypass writing to disk for persisting some raw data stored in compressed HDF5 to remote MongoDB. When can we expect to see it released with 3.0? :)

@scopatz
Copy link
Member

scopatz commented Mar 23, 2013

PyTables 3.0 will be released with Python 3.x support, which @avalentino is leading. Any of the remaining issues that for this milestone that you want to help out with would help speed this up! ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants