
Crashes on parallel reads #790

Closed · Dapid opened this issue Feb 4, 2020 · 2 comments

Dapid (Contributor) commented Feb 4, 2020

I have a file with one group and one array, and two parallel threads reading random chunks. Each thread works perfectly well on its own, but when they run at the same time I get HDF5 tracebacks. According to the documentation, this should work, because I am only reading.

Problems reading the array data.
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5Dio.c", line 199, in H5Dread
    can't read data
  File "H5Dio.c", line 601, in H5D__read
    can't read data
  File "H5Dchunk.c", line 2229, in H5D__chunk_read
    unable to read raw data chunk
  File "H5Dchunk.c", line 3609, in H5D__chunk_lock
    data pipeline read failed
  File "H5Z.c", line 1326, in H5Z_pipeline
    filter returned failure during read

End of HDF5 error back trace

Here is a fully reproducing minimal example: https://gist.github.com/Dapid/7a5cdb04c2ababbd86d6513002a37d69
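
For readers who don't follow the link, this is roughly the shape of the reproducer (a sketch, not the gist's exact contents; the file name and node path are assumed to match the reader script further down):

# Hypothetical sketch: one compressed array, two threads reading
# random chunks of it concurrently through a shared file handle.
import random
import threading

import tables

h5 = tables.open_file('onefile.h5')  # assumed file name

def read():
    for _ in range(100):
        h5.root.data.data_array[random.randint(0, int(1e6) - 1)]

threads = [threading.Thread(target=read) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

h5.close()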

It seems that either turning compression off or using the in-memory core driver makes it work.
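
In case it helps, the core-driver workaround is just a different open_file call (a sketch, assuming the same file; PyTables exposes HDF5's in-memory driver through the driver argument):

import tables

# Load the whole file into memory via HDF5's core driver, so reads
# no longer hit the on-disk raw chunks concurrently.
h5 = tables.open_file('onefile.h5', mode='r', driver='H5FD_CORE')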

The system is Fedora Linux 31, with Python 3.6 and 3.7 from the repositories, and PyTables installed with pip in a virtual environment. The HDF5 version is 1.10.4, and PyTables is 3.6.1.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:    3.6.1
HDF5 version:        1.10.4
NumPy version:       1.18.1
Numexpr version:     2.7.1 (not using Intel's VML/MKL)
Zlib version:        1.2.11 (in Python interpreter)
LZO version:         2.09 (Feb 04 2015)
BZIP2 version:       1.0.6 (6-Sept-2010)
Blosc version:       1.16.3 (2019-03-08)
Blosc compressors:   blosclz (1.1.0), lz4 (1.8.3), lz4hc (1.8.3), snappy (1.1.1), zlib (1.2.8), zstd (1.3.8)
Blosc filters:       shuffle, bitshuffle
Cython version:      0.29.14
Python version:      3.7.6 (default, Jan 30 2020, 09:44:41) 
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)]
Platform:            Linux-5.4.15-200.fc31.x86_64-x86_64-with-fedora-31-Thirty_One
Byte-ordering:       little
Detected cores:      8
Default encoding:    utf-8
Default FS encoding: utf-8
Default locale:      (en_GB, UTF-8)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
cjrh commented Jun 16, 2020

Thanks for linking your reproducer examples; they really help a lot. I am not a maintainer of PyTables; I'm just beginning to use this library and trying to figure out how to use it safely.

Looking at reader2.py:

The default "start method" in multiprocessing on Linux is fork. The problem with fork is that it copies the parent's open file descriptors into the subprocesses. The h5 file handle in your parent process is therefore duplicated (not recreated) in the subprocesses, and for reasons I don't understand, this confuses PyTables.

If you change the multiprocessing start_method to "spawn", then reader2.py no longer crashes:

# Try re-opening the file in each worker process
import random
import multiprocessing

import tables

def read():
    _h5 = tables.open_file('onefile.h5')
    for _ in range(100):
        _h5.root.data.data_array[random.randint(0, int(1e6)-1)]
    _h5.close()

# Stress-test
if __name__ == '__main__':
    h5 = tables.open_file('onefile.h5')
    read()  # It works!

    # THIS LINE: "spawn" starts a fresh interpreter in each child
    # instead of forking, so the parent's open handle is not inherited.
    multiprocessing.set_start_method('spawn')

    processes = [multiprocessing.Process(target=read) for _ in range(2)]
    for p in processes:
        p.start()
    print('Joining')
    for p in processes:
        p.join()

    h5.close()

With "spawn", it does not crash. If you comment out the set_start_method call, it'll crash. (If you set start method to forkserver it also appears to be ok).

@avalentino (Member)

Thanks @Dapid and @cjrh, this is a very interesting topic and a very interesting solution.

Currently we have an item in our FAQ about concurrent access to an HDF5 file.

I plan to improve our documentation by pointing to your examples.

Please let me know if you think there is anything else that should be done.
