Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenPMD HDF5Reader Plasma subclass #500

Merged
merged 14 commits into from
Jun 28, 2018

Conversation

ritiek
Copy link
Contributor

@ritiek ritiek commented Jun 17, 2018

Also see #499. This PR adds the ability to read HDF5 files based on OpenPMD standard when passing a keyword argument hdf5.

>>> plasmapy.classes.Plasma(hdf5='plasmapy/classes/tests/data/data00000255.h5')

The initial idea is to set appropriate properties for dealing with 2D datasets on the subclass. Once it works out nicely, we can go ahead having a similar implementation for working with 3D datasets.

EDIT: We ended up adding support for all types available in example datasets. 2D, 3D and even thetaMode. :D

@pep8speaks
Copy link

pep8speaks commented Jun 17, 2018

Hello @ritiek! Thanks for updating your pull request.

Congratulations! There are no PEP8 issues in this pull request. 😸

Comment last updated on June 28, 2018 at 07:32 Hours UTC

@ritiek
Copy link
Contributor Author

ritiek commented Jun 17, 2018

And happy #500!

@ritiek
Copy link
Contributor Author

ritiek commented Jun 17, 2018

I am going to compare h5py and openPMD-api soon, so we can choose whichever one seems better for our purpose.

@ritiek
Copy link
Contributor Author

ritiek commented Jun 17, 2018

OpenPMD API looks something like this:

>>> import openPMD
>>> series = openPMD.Series('plasmapy/classes/tests/data/data00000255.h5',
                            openPMD.Access_Type.read_only)

>>> items = list(series.iterations.items())
>>> items
[(255, <openPMD.Iteration of at t = '0.000000 s'>)]

>>> iteration, data = items[0]
>>> meshes = data.meshes.items()
>>> particles = data.particles.items()

>>> list(meshes)
[('E', <openPMD.Mesh>), ('rho', <openPMD.Mesh>)]
>>> list(particles)
[('Hydrogen1+', <openPMD.ParticleSpecies>),
 ('electrons', <openPMD.ParticleSpecies>)]

I personally think we should just stick with h5py. It is packaged on PyPi which makes it easy to distribute unlike OpenPMD-api which is available only through conda (and spack). Also, it seems a bit weird to me working with iter objects with OpenPMD API.

I found we could also do this with h5py to navigate through multiple levels easily:

>>> h5 = h5py.File('plasmapy/classes/tests/data/data00000255.h5')
>>> h5.get('data/255/fields/E/x')
<HDF5 dataset "x": shape (51, 201), type "<f8">

What do you guys think?

@StanczakDominik
Copy link
Member

Well, I guess we could start out with h5py for now. As long as we're using just hdf5 files, we don't really get advantages from the API right now besides potentially being nice to the guys at OpenPMD and testing their API out for them. But that's obviously a lower priority than getting this working! I'm fine with skipping that for now.

@StanczakDominik
Copy link
Member

By the way, I'm absolutely not sure we should default to OpenPMD on a "hdf5" keyword argument. I think there are two solutions:

  • openPMD = True forces initialization via OpenPMD
  • I think there's an attr saying this is an OpenPMD file in h5 = h5py.File('plasmapy/classes/tests/data/data00000255.h5'); h5.attrs. We could read that and if it's found, awesome, we know how to handle that. If it isn't, ValueError or something.

@ritiek
Copy link
Contributor Author

ritiek commented Jun 17, 2018

Ok, let's put our stakes on h5py then.

I'm absolutely not sure we should default to OpenPMD on a "hdf5" keyword argument. I think there are two solutions

Maybe we could have both of them. I checked it out and OpenPMD does seem to set standards to check if a dataset is based on OpenPMD format h5.attrs['openPMDextension'].

User could force HDF5 to read as OpenPMD by passing openPMD=True. If this keyword argument is not supplied, we could automatically decide checking openPMDextension attribute.

@StanczakDominik
Copy link
Member

Sounds reasonable 👍

@ritiek ritiek changed the title OpenPMD HDF5Reader Plasma subclass [WIP] OpenPMD HDF5Reader Plasma subclass Jun 17, 2018
# convinced otherwise if another way makes more sense

def test_x(openPMD2DPlasma):
assert openPMD2DPlasma.x.shape == (51, 201)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is whose position coordinates we are talking about? If this is for particles like electrons, etc., it would make more sense to have something like openPMD2DPlasma.electrons.x.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely would! I thought I mentioned it, but guess not. OpenPMD lets you attach discrete particle data to your mesh data, but I think it would be fine to skip the particles for now and just handle the mesh data well. Adding particles would mean we'd have to rethink our species class... which we could, of course, do.

Also, I now have doubts about the fact that OpenPMD lets you define separate coordinates for each field (e.g. you can have electric and magnetic fields defined at a spatial offset to each other, like for Yee grids )... we'll have to think more about how to handle this...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be fine to skip the particles for now

yes, fully optional. mesh-only works as well.

[o]penPMD lets you define separate coordinates for each field

Just in case this is relevant for you. Otherwise just assign them to zero for now if it's irrelevant for your use case (e.g. node-centered).


def test_has_charge_density_with_units(openPMD2DPlasma):
assert openPMD2DPlasma.charge_density.to(u.C/u.m**3) # unless it's some
# 2D charge density
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to check units this way:

>>> import astropy.units as u
>>> import h5py
>>> h5 = h5py.File('data00000255.h5')
>>> path = 'data/255/fields/rho'
>>> grid_size = len(h5[path].attrs['axisLabels'])
>>> units = u.C / u.m**grid_size

Would this work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there's a way that synergizes with what OpenPMD provides for us way more nicely:

In [4]: dict(h5[path].attrs)['unitDimension']
Out[4]: array([-3.,  0.,  1.,  1.,  0.,  0.,  0.])

And then we can interpret that based on documentation for the standard as

powers of the 7 base measures characterizing the record's unit in SI (length L, mass M, time T, electric current I, thermodynamic temperature theta, amount of substance N, luminous intensity J)

so we'll have length^-3 (1/volume, reasonable), no mass, time ^1 and current ^1 gets you electric charge. We can then multiply each of these by its appropriate astropy.unit and we'll get the complete Quantity for that field! I really liked that system while reading the docs. I guess I was too hyped up to note that down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is really neat!

def test_correct_shape_electric_field(openPMD2DPlasma):
assert openPMD2DPlasma.electric_field.shape == (3, 51, 201)
# IIRC this is how we defined it in the old Plasma class but I can be
# convinced otherwise if another way makes more sense
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this?

assert openPMD2DPlasma.electric_field.x.shape == (51, 201)
assert openPMD2DPlasma.electric_field.y.shape == (51, 201)
assert openPMD2DPlasma.electric_field.z.shape == (51, 201)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with either of them though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might make sense, but it also might make using the electric field harder in simulations if you can't just access it as one big 3D array... but there may be ways around that. I'm not sure what the optimal choice is here. I think this is a good point to talk about at tomorrow's telecon, if we're all up for that.

@ax3l
Copy link

ax3l commented Jun 20, 2018

User could force HDF5 to read as [o]penPMD by passing openPMD=True. If this keyword argument is not supplied, we could automatically decide checking openPMDextension attribute.

For reading a .h5 file I would do the following detection (as we implemented in yt):

  • is a hdf5 file?
    • [yes] has the attribute openPMD in /?
      • [yes] attribute openPMD contains a supported version (e.g. 1.1.0 <= v < 2.0.0)
        • [yes] is use openPMD reader
        • [no] warn user about unsupported openPMD version, if version in file is too old, offer an update of it
      • [no] test for a different HDF5 reader
    • [no] test for different file format

openPMDextension is not the main identifier, but this attribute will also be present. It allows to add further, domain-specific meaning on top of the base standard data description, e.g. for electro-magnetic PIC codes (1.0+), physical particle species (2.0+) or particle accelerator codes (2.0+). Maybe you even want to propose your own as well later on?

For writing I would just auto-annotate all .h5 output with openPMD markup. That might be a biased opinion, but it does not harm to add those attributes and just makes the data by default more portable, self-documented, readable by more tools, ... ;-)

@ritiek
Copy link
Contributor Author

ritiek commented Jun 21, 2018

Ok, as of now we can handle both 2D and 3D HDF5 datasets for mesh (electric field and electric density) data but not particle data. What do we want to do with particle data? Should we just implement them the way we did with mesh data?

@ritiek
Copy link
Contributor Author

ritiek commented Jun 21, 2018

Also, should we ship the example OpenPMD HDF5 datasets that we are using for our tests with our main codebase? Maybe we could have them downloaded the moment a user runs those tests but I don't know how hard that would be practically.

@StanczakDominik
Copy link
Member

Oh yeah, downloading the test data is a sick idea that I completely forgot about! Maybe we could use that --online flag to setup.py test and emulate Astro/sunpy functionality.

By the way, we'll have to keep in mind to rebase and remove the files from these commits before we merge this PR into master. I can do that without a problem.

@ritiek
Copy link
Contributor Author

ritiek commented Jun 23, 2018

Good idea. --remote-data flag seems good to me.

We need to upload data00000100.h5 and data00000255.h5 somewhere. Do you have any preference? I'd put them on AWS but they ask for scary details during registration. :/

By the way, we'll have to keep in mind to rebase and remove the files from these commits before we merge this PR into master. I can do that without a problem.

Yep, once we are ready to merge this, we're going to have to take care of this.

@ritiek
Copy link
Contributor Author

ritiek commented Jun 23, 2018

We need to replace the URLs in test_openpmd_hdf5.py once we have those datasets uploaded somewhere and then we should be ready to merge this as a prototype. :D

@ritiek
Copy link
Contributor Author

ritiek commented Jun 23, 2018

There is one problem at the moment that test data would need be to downloaded every single time the tests are run. We could workaround this by creating a module similar to https://github.com/sunpy/sunpy/blob/master/sunpy/data/sample.py and then data would need downloading only once?

For example, SunPy downloads all the data at once and stores it somewhere for future use when a code involving external datasets is first run.

>>> import sunpy.map
>>> import sunpy.data.sample  
>>> mymap = sunpy.map.Map(sunpy.data.sample.AIA_171_IMAGE) 

@StanczakDominik
Copy link
Member

Having thought about this for a bit, I think we're fine putting this sample data directly in the repository. These datasets are not really large, and SunPy appears to include a bit of its sample data right in the repo. I thus think we're fine not having to set up AWS or other storage for now!

We definitely need to reference the source for it (which I probably forgot to do...), and I think it may be good to store it in plasmapy/data/ - this would be a better place for Langmuir data as well, most likely.

@ax3l
Copy link

ax3l commented Jun 26, 2018

Just a note: I am not 100% sure we already compressed the checked-in files since we use them under git lfs. Please try to use h5repack, e.g. with gzip, on them and see if their size changes before checking them in to save space. Here is an example ("h5compress"): https://github.com/ax3l/cluster-scripts

@codecov
Copy link

codecov bot commented Jun 27, 2018

Codecov Report

Merging #500 into master will decrease coverage by 0.13%.
The diff coverage is 92.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #500      +/-   ##
==========================================
- Coverage   97.71%   97.58%   -0.14%     
==========================================
  Files          41       44       +3     
  Lines        3590     3679      +89     
==========================================
+ Hits         3508     3590      +82     
- Misses         82       89       +7
Impacted Files Coverage Δ
plasmapy/utils/__init__.py 100% <ø> (ø) ⬆️
plasmapy/classes/__init__.py 100% <100%> (ø) ⬆️
plasmapy/utils/exceptions.py 100% <100%> (ø) ⬆️
plasmapy/data/test/__init__.py 100% <100%> (ø)
plasmapy/classes/sources/__init__.py 100% <100%> (ø) ⬆️
plasmapy/data/setup_package.py 50% <50%> (ø)
plasmapy/classes/sources/openpmd_hdf5.py 92.4% <92.4%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f7310aa...e3d0664. Read the comment docs.

@ritiek
Copy link
Contributor Author

ritiek commented Jun 27, 2018

@ax3l Looks like they were indeed not compressed. Total size for (those two) HDF5 files reduced from about 9 MB to 6 MB. Thanks!

@StanczakDominik @ax3l We already mention from where we downloaded the datasets (here), is there anything else we need to do?

@ax3l
Copy link

ax3l commented Jun 27, 2018

That looks perfect, you also mention the license CC-0 so I think that's it :-)

@ax3l
Copy link

ax3l commented Jun 27, 2018

Sorry for the hijack of the thread: I would like to mention PlasmaPy with the PlasmaPy logo in upcoming conference talks/posters as part of the "openPMD community". Is that fine with you? I won't go into details in those but I think it's cool to mention that you are in the process of adding openPMD support.

We also should not forget to add you here so people find you: https://github.com/openPMD/openPMD-projects

@StanczakDominik
Copy link
Member

@ritiek awesome, looking into the changes and I hope I can merge this in today!

@ax3l by all means, go right ahead. Pick a version you like from the logo source repo! I'd love to see the poster once you're done there.

Also, if you want to not hijack issues in the future, check out our Matrix chatroom or our mailing list! :D

Copy link
Member

@StanczakDominik StanczakDominik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, this looks nice! I think there's a few issues that could be addressed with a follow up PR, but I'm happy to merge it as a prototype. :)

Parameters
----------
hdf5 : `str`
Path to HDF5 file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this could use a References link to the site: http://openpmd.org/

electric_field : `astropy.units.Quantity`
An (x, y, z) array containing electric field data.
charge_density : `astropy.units.Quantity`
An array containing charge density data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... my bad here, but I guess I kind of forgot that the 2D and 3D examples were showing off electrostatic + particle simulations. If you look at the thetaMode example, you will find two more mesh quantities:

(base) 20:04:58 dominik@dell: ~/Code/openPMD/openPMD-example-datasets/example-thetaMode/hdf5 $ h5ls -r data00000100.h5 
/                        Group
/data                    Group
/data/100                Group
/data/100/fields         Group
/data/100/fields/B       Group
/data/100/fields/B/r     Dataset {3, 51, 201}
/data/100/fields/B/t     Dataset {3, 51, 201}
/data/100/fields/B/z     Dataset {3, 51, 201}
/data/100/fields/E       Group
/data/100/fields/E/r     Dataset {3, 51, 201}
/data/100/fields/E/t     Dataset {3, 51, 201}
/data/100/fields/E/z     Dataset {3, 51, 201}
/data/100/fields/J       Group
/data/100/fields/J/r     Dataset {3, 51, 201}
/data/100/fields/J/t     Dataset {3, 51, 201}
/data/100/fields/J/z     Dataset {3, 51, 201}
/data/100/fields/rho     Dataset {3, 51, 201}

B being the magnetic field and J the electric current, which can be neglected in electrostatic simulations. I think we should probably handle those as well - electrostatics are a fraction of all possible simulations. Do you think it would be much of a hassle to add them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few tests for any of those datasets would also be very nice, of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I'll work on them today.

units = _fetch_units(self.h5[path].attrs["unitDimension"])
return np.array((self.h5[path]['x'],
self.h5[path]['y'],
self.h5[path]['z'])) * units
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll definitely have to look at how np.array here is going to work... but it's nice enough for the prototype!

assert self.h5.electric_field.to(u.V / u.m)

def test_correct_shape_electric_field(self):
assert self.h5.electric_field.shape == (3, 26, 26, 201)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might mess around with this one so that the interface handles better in a soon-to-be-made PR.

@@ -78,6 +78,11 @@ class InvalidParticleError(AtomicError):
pass


class OpenPMDError(PlasmaPyError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe DataStandardError? OpenPMDError feels quite specific.

@ritiek
Copy link
Contributor Author

ritiek commented Jun 28, 2018

I've pushed the suggested changes. For the failing coverage, its because we don't test for exceptions. We would have to include more data which in my opinion isn't worth it. Otherwise, this should be ready to merge.

Copy link
Member

@StanczakDominik StanczakDominik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Gonna merge it now. I'd say the same thing about testing for exceptions. Thanks, @ritiek!

@StanczakDominik StanczakDominik merged commit 334dc37 into PlasmaPy:master Jun 28, 2018
@ritiek ritiek deleted the openpmd-hdf5reader branch July 5, 2018 14:59
@namurphy namurphy changed the title [WIP] OpenPMD HDF5Reader Plasma subclass OpenPMD HDF5Reader Plasma subclass Jul 23, 2018
@namurphy namurphy added the plasmapy.plasma Related to the plasmapy.plasma subpackage label Jul 23, 2018
@namurphy namurphy added the needs changelog entry See: https://docs.plasmapy.org/en/latest/contributing/changelog_guide.html label May 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs changelog entry See: https://docs.plasmapy.org/en/latest/contributing/changelog_guide.html plasmapy.plasma Related to the plasmapy.plasma subpackage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants