Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

guess_format doesn't work with h5py.File objects #2884

Open
edisj opened this issue Aug 4, 2020 · 9 comments
Open

guess_format doesn't work with h5py.File objects #2884

edisj opened this issue Aug 4, 2020 · 9 comments
Labels
Component-Converters Component-Readers Format-H5MD hdf5-based H5MD trajectory format NSF REU NSF Research Experience for Undergraduates project

Comments

@edisj
Copy link
Contributor

edisj commented Aug 4, 2020

Is your feature request related to a problem?

I'm trying to pass an h5py.File object into mda.Universe using H5MDReader (#2787), but have been running into funky errors and I think the issue is that h5py.File attributes aren't suited for MDAnalysis.lib.util.isstream and MDAnalysis.lib.util.iterable, due to h5py files not passing the isstream method and being classified as "iterable" due to having a len(h5py.File)


If I don't pass format="H5MD:

u = mda.Universe(TPR_xvf, h5py.File(H5MD_xvf, 'r'))

MDAnalysis.lib.util.guess_format() looks at isstream(h5py.File) and iterable(h5py.File, however, these return unexpected values for an h5py.File:

image
which makes guess_format think there are multiple trajectories that get passed to MDAnalysis.coordinates.chain.py which THEN gets passed to MDAnalysis.coordinates.core.py which fails and gives ValueError: Unknown coordinate trajectory format '' for 'h5md'., so it looks like it did try to iterate through the groups in the h5py.File:
image


If I do pass `format="H5MD":

u = mda.Universe(TPR_xvf, h5py.File(H5MD_xvf, 'r'), format="H5MD")

Again, it ends up calling MDAnalysis.coordinates.chain.py, but then passes it to H5MDReader where it fails with OSError: Unable to open file (unable to open file: name = 'h5md', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0) . So, again, it looks like it tried to load the first group 'h5md' from the file as an individual trajectory

Describe the solution you'd like

I'd like h5py files to also pass the isstream method, even though it doesn't fit the criteria.

Describe alternatives you've considered

I've tried doing something like adding an if clause at the start of MDAnalysis.lib.util.guess_format with something like

import h5py
if isinstance(filename, h5py.File):
    format = format_from_filename_extension(filename.filename)

but the same issues occur, so I'm not sure what to do to fix it.

Additional context

@richardjgowers
Copy link
Member

@lilyminium
Copy link
Member

lilyminium commented Aug 4, 2020

It looks like the problem is that it's a false positive for ChainReader, because the format hint for that one checks if the file is iterable. Not sure the best way around this; maybe a special case for h5py in the ChainReader format hint? Alternatively, the iterable function itself

@richardjgowers
Copy link
Member

richardjgowers commented Aug 4, 2020 via email

@lilyminium
Copy link
Member

The format hint thing takes priority over chainreader I think. It’s also
how numpy arrays don’t fall into it

Not currently, and you wouldn't want it to anyway (although this would have been an easy solution), because what if you have an iterable of chemfiles objects? ChainReader ignores ndarrays specifically:

    @staticmethod
    def _format_hint(thing):
        """Can ChainReader read the object *thing*

        .. versionadded:: 1.0.0
        """
        return (not isinstance(thing, np.ndarray) and
                util.iterable(thing) and
                not util.isstream(thing))

@richardjgowers
Copy link
Member

Ah my mistake then, I thought chainreader was in the guess_format thing that happens once all the hints are done. Hmm. I think ChainReader should only happen once all other formats have had a chance to have a go. So maybe ChainReader shouldn't use a _format_hint to avoid this.

@edisj
Copy link
Contributor Author

edisj commented Aug 4, 2020

@richardjgowers @lilyminium thanks for the replies.
I do have a _format_hint defined here in H5MDReader. So is the issue that the h5py.File is being picked up by the ChainReader format hint, which selects ChainReader before it has a chance to be selected by H5MDReader?

@richardjgowers
Copy link
Member

richardjgowers commented Aug 4, 2020 via email

@orbeckst orbeckst added the NSF REU NSF Research Experience for Undergraduates project label Aug 7, 2020
@orbeckst
Copy link
Member

Should we make ChainReader a special case in the format analysis?

Or should we add a "priority" attribute to each format hint along the lines that everything with the same priority can be tested in any order but low priority are guaranteed to be tested after higher priorities? We then just have to agree that ChainReader is, say, 0, and others are higher, say 10. Not sure if other special cases such as numpy arrays need to be higher/lower than the others.

@orbeckst
Copy link
Member

orbeckst commented Apr 3, 2024

@ljwoods2 is this issue related to the guessing issue you're having for zarrtraj?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component-Converters Component-Readers Format-H5MD hdf5-based H5MD trajectory format NSF REU NSF Research Experience for Undergraduates project
Projects
None yet
Development

No branches or pull requests

4 participants