Downloading / Streaming Trajectories from MDDB #4603

Open
BradyAJohnston opened this issue May 21, 2024 · 6 comments
Labels
interoperability (important for making MDAnalysis work with other packages and tools), new-feature, streaming

Comments

@BradyAJohnston

It's very early days, but the Molecular Dynamics Database (MDDB) is starting to do some very initial hosting of datasets. (https://mmb.mddbr.eu/#/browse)

I don't suppose integrating with it would be a priority anytime soon, but in the future the ability to download / stream topologies & trajectories would be interesting functionality.

Would this be something within the scope of MDAnalysis? There is an initial REST API, but I believe it's all pretty subject to change over the coming months / years as the project matures a bit more.

@IAlibay
Member

IAlibay commented May 21, 2024

cc @philbiggin - probably worth checking if this is something already planned within the consortia / something someone is looking to do.

@philbiggin

I don't believe it's on the near-horizon window if I can put it like that, but certainly worth reminding folk about. Yes - the API is probably susceptible to change as we already know some things that need addressing (although @adamhospital and @d-beltran can comment further for sure!)

@d-beltran

Hi everyone and thank you for your interest in the MDDB.

You are right, things may change in the long term since this is still a prototype.
We could try to be as backward-compatible as possible to support some early integration in MDAnalysis.
If you need any assistance to do this please reach out!

@orbeckst
Member

ping @hmacdope @ljwoods2

@orbeckst added the interoperability and new-feature labels May 30, 2024
@hmacdope
Member

hmacdope commented Jul 3, 2024

@ljwoods2 would you be able to detail our approach here? We have a prototype of H5MD streaming working, IIRC. (H5MD is slated as the future format for GMX, https://gitlab.com/gromacs/gromacs/-/issues/5016 / MDDB: https://gitlab.com/groups/gromacs/-/epics/5 )

@ljwoods2
Contributor

ljwoods2 commented Jul 4, 2024

Yes, I'm working on this for my GSOC project!

The approach right now is to make H5MD files streamable from cloud services. We first use kerchunk to reformat the metadata of the h5 file into a form Zarr can parse, then pass this metadata (which contains the byte ranges of the datasets in the h5 file) to fsspec to create a "reference" filesystem that Zarr can open. I've found this site to have the best description of how kerchunk works and how to use it.
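The steps above could be sketched roughly as follows. This is a minimal illustration, not the prototype's actual code: the function names `remote_protocol` and `open_remote_h5md` are hypothetical, and it assumes kerchunk, fsspec (plus a backend such as s3fs), and a zarr-python 2.x release where `zarr.open_group` accepts an fsspec mapper.

```python
def remote_protocol(url):
    """Return the fsspec protocol of a URL, e.g. 's3' for 's3://bucket/key'."""
    return url.split("://", 1)[0] if "://" in url else "file"


def open_remote_h5md(url, storage_options=None):
    """Open a remote HDF5/H5MD file as a read-only Zarr group (hypothetical helper).

    Third-party imports are kept inside the function so the module can be
    imported even when kerchunk/zarr are not installed.
    """
    import fsspec
    import zarr
    from kerchunk.hdf import SingleHdf5ToZarr

    storage_options = storage_options or {}

    # 1. Scan the HDF5 metadata and translate it into Zarr-style references
    #    (chunk key -> [url, byte offset, length]).
    with fsspec.open(url, "rb", **storage_options) as f:
        refs = SingleHdf5ToZarr(f, url).translate()

    # 2. Build a "reference" filesystem that resolves each chunk key to a
    #    byte range in the original remote file.
    fs = fsspec.filesystem(
        "reference",
        fo=refs,
        remote_protocol=remote_protocol(url),
        remote_options=storage_options,
    )

    # 3. Open the reference filesystem with Zarr (zarr-python 2.x style).
    return zarr.open_group(fs.get_mapper(""), mode="r")
```

For a public bucket, a call might look like `open_remote_h5md("s3://some-bucket/traj.h5md", {"anon": True})`, where the bucket and key are placeholders.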

So far, this is working for reading h5md files from s3, but we haven't tested other cloud services yet. This approach also doesn't allow writing h5md files to cloud services; that would require doing something different, like passing an s3fs object to h5py.

You can save the kerchunk-translated h5 metadata to JSON and reuse it later, so that you can access the same remote h5 file again via zarr without the overhead of translating the byte ranges, compression, etc. a second time. We aren't currently using this in the initial prototype, and I'm not sure if it would be helpful for something like MDDB.
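As a rough sketch of that caching idea: the dictionary returned by kerchunk's `translate()` can be written out with the standard `json` module and handed back to fsspec later. The helper names below are hypothetical, and the reopening step again assumes fsspec and zarr-python 2.x.

```python
import json


def save_references(refs, path):
    """Write a kerchunk reference dict to a JSON file."""
    with open(path, "w") as fh:
        json.dump(refs, fh)


def load_references(path):
    """Read a previously saved kerchunk reference dict."""
    with open(path) as fh:
        return json.load(fh)


def open_from_saved_references(path, remote_protocol="s3", remote_options=None):
    """Reopen the remote file through saved references, skipping the HDF5 scan."""
    import fsspec
    import zarr

    fs = fsspec.filesystem(
        "reference",
        fo=load_references(path),
        remote_protocol=remote_protocol,
        remote_options=remote_options or {},
    )
    return zarr.open_group(fs.get_mapper(""), mode="r")
```

Only the small JSON file needs to be kept or shared; the trajectory data itself stays in the original remote h5 file.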

Finally, one interesting thing: since zarr-python has intentionally made its API (mostly) identical to h5py's, and since zarr includes a directory-like layout with groups, datasets, and attributes just like h5, we've been able to easily convert h5md files into zarr files that use the h5md format/directory layout and treat them in the file reader the same as an actual h5-backed h5md file. The only caveat is that Zarr doesn't yet support linking datasets like h5 does, so the format does not translate perfectly, but in every other way, including API interactions with the file, it is the same. AFAIK, it's not yet clear if zarr will ever support links.
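To illustrate why the shared API helps, here is a small duck-typed sketch that reads one frame using only `[]` indexing, so the same function should work on an `h5py.File` or a `zarr` group. The function name `read_frame` and the default group name `"trajectory"` are assumptions for illustration; the `particles/<group>/position/{step,time,value}` paths follow the H5MD layout.

```python
def read_frame(root, group="trajectory", frame=0):
    """Read one frame of positions from an H5MD-style hierarchy.

    `root` can be anything that supports nested ``[]`` lookup with the
    H5MD layout, e.g. an h5py.File or a zarr group.
    """
    position = root["particles"][group]["position"]
    return {
        "step": position["step"][frame],       # integration step of this frame
        "time": position["time"][frame],       # simulation time of this frame
        "positions": position["value"][frame], # (n_atoms, 3) coordinates
    }
```

With a real file this would be called as `read_frame(h5py.File(path, "r"))` or `read_frame(zarr.open_group(store, mode="r"))`. Where the h5 file stores `step` and `time` as links to shared datasets, a zarr copy would need them duplicated, which is the caveat about links mentioned above.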

7 participants