OpenPMD HDF5Reader Plasma subclass #500

Merged: 14 commits merged into PlasmaPy:master on Jun 28, 2018

Conversation

5 participants
@ritiek
Contributor

ritiek commented Jun 17, 2018

Also see #499. This PR adds the ability to read HDF5 files based on the openPMD standard when passing the keyword argument hdf5.

>>> plasmapy.classes.Plasma(hdf5='plasmapy/classes/tests/data/data00000255.h5')

The initial idea is to set appropriate properties for dealing with 2D datasets on the subclass. Once that works out nicely, we can go ahead with a similar implementation for 3D datasets.

EDIT: We ended up adding support for all types available in example datasets. 2D, 3D and even thetaMode. :D

StanczakDominik and others added some commits Jun 16, 2018

Merge pull request #6 from PlasmaPy/openpmd-subclass
Initial tests for 2D OpenPMD sample
@pep8speaks


pep8speaks commented Jun 17, 2018

Hello @ritiek! Thanks for updating your pull request.

Congratulations! There are no PEP8 issues in this pull request. 😸

Comment last updated on June 28, 2018 at 07:32 UTC
@ritiek


Contributor

ritiek commented Jun 17, 2018

And happy #500!

@ritiek


Contributor

ritiek commented Jun 17, 2018

I am going to compare h5py and openPMD-api soon, so we can choose whichever one seems better for our purpose.

@ritiek


Contributor

ritiek commented Jun 17, 2018

OpenPMD API looks something like this:

>>> import openPMD
>>> series = openPMD.Series('plasmapy/classes/tests/data/data00000255.h5',
                            openPMD.Access_Type.read_only)

>>> items = list(series.iterations.items())
>>> items
[(255, <openPMD.Iteration of at t = '0.000000 s'>)]

>>> iteration, data = items[0]
>>> meshes = data.meshes.items()
>>> particles = data.particles.items()

>>> list(meshes)
[('E', <openPMD.Mesh>), ('rho', <openPMD.Mesh>)]
>>> list(particles)
[('Hydrogen1+', <openPMD.ParticleSpecies>),
 ('electrons', <openPMD.ParticleSpecies>)]

I personally think we should just stick with h5py. It is packaged on PyPI, which makes it easy to distribute, unlike openPMD-api, which is available only through conda (and Spack). Also, working with iterator objects in the openPMD API seems a bit awkward to me.

I found we could also do this with h5py to navigate through multiple levels easily:

>>> h5 = h5py.File('plasmapy/classes/tests/data/data00000255.h5')
>>> h5.get('data/255/fields/E/x')
<HDF5 dataset "x": shape (51, 201), type "<f8">

What do you guys think?

@StanczakDominik


Member

StanczakDominik commented Jun 17, 2018

Well, I guess we could start out with h5py for now. As long as we're using just hdf5 files, we don't really get advantages from the API right now besides potentially being nice to the guys at OpenPMD and testing their API out for them. But that's obviously a lower priority than getting this working! I'm fine with skipping that for now.

@StanczakDominik


Member

StanczakDominik commented Jun 17, 2018

By the way, I'm absolutely not sure we should default to OpenPMD on a "hdf5" keyword argument. I think there are two solutions:

  • openPMD = True forces initialization via OpenPMD
  • I think there's an attribute marking the file as OpenPMD in h5.attrs (after h5 = h5py.File('plasmapy/classes/tests/data/data00000255.h5')). We could read that: if it's found, awesome, we know how to handle the file. If it isn't, raise a ValueError or something.
@ritiek


Contributor

ritiek commented Jun 17, 2018

OK, let's place our bets on h5py then.

I'm absolutely not sure we should default to OpenPMD on a "hdf5" keyword argument. I think there are two solutions

Maybe we could have both of them. I checked it out, and openPMD does seem to set a standard for checking whether a dataset follows the openPMD format: h5.attrs['openPMDextension'].

The user could force the HDF5 file to be read as OpenPMD by passing openPMD=True. If this keyword argument is not supplied, we could decide automatically by checking the openPMDextension attribute.

@StanczakDominik


Member

StanczakDominik commented Jun 17, 2018

Sounds reasonable 👍

@ritiek ritiek changed the title from OpenPMD HDF5Reader Plasma subclass to [WIP] OpenPMD HDF5Reader Plasma subclass Jun 17, 2018

@ritiek ritiek force-pushed the ritiek:openpmd-hdf5reader branch from 924c965 to 2c61119 Jun 17, 2018

# convinced otherwise if another way makes more sense
def test_x(openPMD2DPlasma):
    assert openPMD2DPlasma.x.shape == (51, 201)


@ritiek

ritiek Jun 18, 2018

Contributor

Whose position coordinates are we talking about here? If this is for particles such as electrons, it would make more sense to have something like openPMD2DPlasma.electrons.x.


@StanczakDominik

StanczakDominik Jun 18, 2018

Member

It definitely would! I thought I mentioned it, but guess not. OpenPMD lets you attach discrete particle data to your mesh data, but I think it would be fine to skip the particles for now and just handle the mesh data well. Adding particles would mean we'd have to rethink our species class... which we could, of course, do.

Also, I now have doubts about the fact that OpenPMD lets you define separate coordinates for each field (e.g. you can have electric and magnetic fields defined at a spatial offset to each other, like for Yee grids)... we'll have to think more about how to handle this...


@ax3l

ax3l Jun 20, 2018

I think it would be fine to skip the particles for now

yes, fully optional. mesh-only works as well.

[o]penPMD lets you define separate coordinates for each field

Just in case this is relevant for you; otherwise, just assign them to zero for now if it's irrelevant for your use case (e.g. node-centered).

def test_has_charge_density_with_units(openPMD2DPlasma):
    assert openPMD2DPlasma.charge_density.to(u.C/u.m**3)  # unless it's some
    # 2D charge density


@ritiek

ritiek Jun 18, 2018

Contributor

We might be able to check units this way:

>>> import astropy.units as u
>>> import h5py
>>> h5 = h5py.File('data00000255.h5')
>>> path = 'data/255/fields/rho'
>>> grid_size = len(h5[path].attrs['axisLabels'])
>>> units = u.C / u.m**grid_size

Would this work?


@StanczakDominik

StanczakDominik Jun 18, 2018

Member

Actually, there's a way that synergizes with what OpenPMD provides for us way more nicely:

In [4]: dict(h5[path].attrs)['unitDimension']
Out[4]: array([-3.,  0.,  1.,  1.,  0.,  0.,  0.])

And then we can interpret that based on documentation for the standard as

powers of the 7 base measures characterizing the record's unit in SI (length L, mass M, time T, electric current I, thermodynamic temperature theta, amount of substance N, luminous intensity J)

so we'll have length^-3 (1/volume, reasonable), no mass, and time^1 and current^1 get you electric charge. We can then multiply each of these by its appropriate astropy unit and we'll get the complete Quantity for that field! I really liked that system while reading the docs; I guess I was too hyped up to note it down.
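That unitDimension mapping can be sketched in a few lines. The helper name and the string rendering below are illustrative only; in PlasmaPy each nonzero power would instead multiply the matching astropy base unit (u.m, u.kg, ...) to build the actual Quantity:

```python
# The 7 SI base dimensions, in the order the openPMD standard lists them:
# length, mass, time, current, temperature, amount of substance, luminous intensity
SI_BASES = ["m", "kg", "s", "A", "K", "mol", "cd"]

def unit_dimension_to_si(unit_dimension):
    """Render an openPMD unitDimension array as a readable SI unit string.

    In PlasmaPy, each nonzero power would instead exponentiate the matching
    astropy base unit and the results would be multiplied together.
    """
    parts = []
    for base, power in zip(SI_BASES, unit_dimension):
        if power == 1:
            parts.append(base)
        elif power != 0:
            parts.append(f"{base}**{power:g}")
    return " * ".join(parts) or "dimensionless"

# rho's unitDimension from the example file: length^-3 * time * current = C / m^3
print(unit_dimension_to_si([-3., 0., 1., 1., 0., 0., 0.]))  # m**-3 * s * A
```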


@ritiek

ritiek Jun 18, 2018

Contributor

Wow, this is really neat!

def test_correct_shape_electric_field(openPMD2DPlasma):
    assert openPMD2DPlasma.electric_field.shape == (3, 51, 201)
    # IIRC this is how we defined it in the old Plasma class but I can be
    # convinced otherwise if another way makes more sense


@ritiek

ritiek Jun 18, 2018

Contributor

How about something like this?

assert openPMD2DPlasma.electric_field.x.shape == (51, 201)
assert openPMD2DPlasma.electric_field.y.shape == (51, 201)
assert openPMD2DPlasma.electric_field.z.shape == (51, 201)


@ritiek

ritiek Jun 18, 2018

Contributor

I'm fine with either of them though.


@StanczakDominik

StanczakDominik Jun 18, 2018

Member

That might make sense, but it also might make using the electric field harder in simulations if you can't just access it as one big 3D array... but there may be ways around that. I'm not sure what the optimal choice is here. I think this is a good point to talk about at tomorrow's telecon, if we're all up for that.

ritiek added some commits Jun 19, 2018

@ax3l


ax3l commented Jun 20, 2018

User could force HDF5 to read as [o]penPMD by passing openPMD=True. If this keyword argument is not supplied, we could automatically decide checking openPMDextension attribute.

For reading a .h5 file I would do the following detection (as we implemented in yt):

  • is it an HDF5 file?
    • [yes] does it have the attribute openPMD in /?
      • [yes] does the attribute openPMD contain a supported version (e.g. 1.1.0 <= v < 2.0.0)?
        • [yes] use the openPMD reader
        • [no] warn the user about the unsupported openPMD version; if the version in the file is too old, offer to update it
      • [no] test for a different HDF5 reader
    • [no] test for a different file format
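A minimal sketch of the version gate in this flow, assuming the supported range from the bullet above. The function name is hypothetical; the h5py side of the check would read the string from the root group's attributes (f.attrs.get('openPMD')):

```python
def openpmd_version_supported(version, low=(1, 1, 0), high=(2, 0, 0)):
    """Check 1.1.0 <= v < 2.0.0, per the detection flow above.

    `version` is the string stored in the root-level `openPMD` attribute,
    e.g. "1.1.0"; tuple comparison handles the range check. The function
    name and the default bounds are illustrative, not an existing API.
    """
    parts = tuple(int(p) for p in version.split("."))
    return low <= parts < high

print(openpmd_version_supported("1.1.0"))   # True: supported
print(openpmd_version_supported("1.0.1"))   # False: too old, offer an update
print(openpmd_version_supported("2.0.0"))   # False: too new, warn the user
```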

openPMDextension is not the main identifier, but this attribute will also be present. It allows adding further, domain-specific meaning on top of the base standard's data description, e.g. for electro-magnetic PIC codes (1.0+), physical particle species (2.0+), or particle accelerator codes (2.0+). Maybe you even want to propose your own later on?

For writing, I would just auto-annotate all .h5 output with openPMD markup. That might be a biased opinion, but it does no harm to add those attributes, and it just makes the data more portable by default, self-documented, readable by more tools, ... ;-)

@StanczakDominik


Member

StanczakDominik commented on plasmapy/classes/sources/openpmd_hdf5.py in d3110fc Jun 21, 2018

For the record, I think this may end up being something to change later on. Basically, when you take a hdf5 object and use it to create a numpy array, it downloads all the data into RAM and creates the numpy array in memory. That's completely fine for small datasets.

Suppose, however, we have large, high-resolution supercomputer simulation output? Loading every single array into memory right at Plasma instantiation will kill your script instantly with a MemoryError. It may make sense to keep these as h5py objects, pointers to arrays on disk.

Then, whatever analysis or evaluation you wanna do, you can just do lazily, accessing data as you need them.

Still, this approach is fine for a prototype :)
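The lazy approach sketched above could look roughly like this in pure Python; LazyPlasma and _read_electric_field are made-up names standing in for the real reader and an h5py dataset read (e.g. self.h5[path][:]):

```python
from functools import cached_property

import numpy as np

class LazyPlasma:
    """Sketch of the lazy-loading idea.

    The (hypothetical) _read_electric_field stands in for an h5py dataset
    read; with @cached_property it runs only on first access, so nothing
    is loaded into RAM at instantiation.
    """

    def __init__(self):
        self.reads = 0  # counts how many times we hit "disk"

    def _read_electric_field(self):
        self.reads += 1  # real code would pull the dataset from disk here
        return np.zeros((3, 51, 201))

    @cached_property
    def electric_field(self):
        return self._read_electric_field()

p = LazyPlasma()
print(p.reads)        # 0: nothing loaded at instantiation
_ = p.electric_field  # first access triggers the read
_ = p.electric_field  # second access is served from the cache
print(p.reads)        # 1
```

With real h5py datasets, slicing (`dataset[:]`) is what actually pulls data from disk, so analysis code can stay lazy until it needs concrete values.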


Contributor

ritiek replied Jun 21, 2018

Basically, when you take a hdf5 object and use it to create a numpy array, it downloads all the data into RAM and creates the numpy array in memory. That's completely fine for small datasets.

Oh, yeah. I didn't think about that.

It may make sense to keep these as h5py objects, pointers to arrays on disk.

I think that would be a good idea. We would return a tuple (x, y, z) where x, y, z are HDF5 dataset objects. End users could play with them however they want?


Contributor

ritiek replied Jun 21, 2018

Though, this might mean we wouldn't be able to multiply the values with suitable astropy.units.


Contributor

ritiek replied Jun 21, 2018

Or would we be able to? We might need to test this with some large dataset.


Member

StanczakDominik replied Jun 22, 2018

Perhaps if we wrote this as a @property... I dunno. I think this could be tested just as well on small datasets.

@ritiek ritiek force-pushed the ritiek:openpmd-hdf5reader branch from 94a8a2d to dee8828 Jun 21, 2018

@ritiek


Contributor

ritiek commented Jun 21, 2018

OK, as of now we can handle both 2D and 3D HDF5 datasets for mesh data (electric field and charge density) but not particle data. What do we want to do with particle data? Should we just implement it the way we did mesh data?

@ritiek


Contributor

ritiek commented Jun 21, 2018

Also, should we ship the example openPMD HDF5 datasets that we are using for our tests with our main codebase? Maybe we could have them downloaded the moment a user runs those tests, but I don't know how hard that would be in practice.

@StanczakDominik


Member

StanczakDominik commented Jun 22, 2018

Oh yeah, downloading the test data is a sick idea that I completely forgot about! Maybe we could use that --online flag to setup.py test and emulate Astropy/SunPy functionality.

By the way, we'll have to keep in mind to rebase and remove the files from these commits before we merge this PR into master. I can do that without a problem.

@ritiek


Contributor

ritiek commented Jun 23, 2018

Good idea. --remote-data flag seems good to me.

We need to upload data00000100.h5 and data00000255.h5 somewhere. Do you have any preference? I'd put them on AWS but they ask for scary details during registration. :/

By the way, we'll have to keep in mind to rebase and remove the files from these commits before we merge this PR into master. I can do that without a problem.

Yep, once we are ready to merge this, we're going to have to take care of this.

@ritiek


Contributor

ritiek commented Jun 23, 2018

We need to replace the URLs in test_openpmd_hdf5.py once we have those datasets uploaded somewhere and then we should be ready to merge this as a prototype. :D

@ritiek


Contributor

ritiek commented Jun 23, 2018

There is one problem at the moment: the test data would need to be downloaded every single time the tests are run. We could work around this by creating a module similar to https://github.com/sunpy/sunpy/blob/master/sunpy/data/sample.py, and then the data would need downloading only once?

For example, SunPy downloads all the data at once and stores it somewhere for future use when a code involving external datasets is first run.

>>> import sunpy.map
>>> import sunpy.data.sample  
>>> mymap = sunpy.map.Map(sunpy.data.sample.AIA_171_IMAGE) 
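A download-once cache in that spirit could be as small as the following sketch; get_sample_file and the cache location are made up for illustration, not an existing PlasmaPy or SunPy API:

```python
import urllib.request
from pathlib import Path

# Hypothetical cache location; SunPy uses a similar per-user data directory.
DEFAULT_CACHE = Path.home() / ".plasmapy" / "sample_data"

def get_sample_file(url, cache_dir=DEFAULT_CACHE):
    """Download `url` into `cache_dir` the first time; reuse the copy after."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / Path(url).name
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target
```

Test modules would then call get_sample_file('https://.../data00000255.h5') and always receive a local path, hitting the network only on the first run.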
@StanczakDominik


Member

StanczakDominik commented Jun 26, 2018

Having thought about this for a bit, I think we're fine putting this sample data directly in the repository. These datasets are not really large, and SunPy appears to include a bit of its sample data right in the repo. I thus think we're fine not having to set up AWS or other storage for now!

We definitely need to reference the source for it (which I probably forgot to do...), and I think it may be good to store it in plasmapy/data/ - this would be a better place for Langmuir data as well, most likely.

@ax3l


ax3l commented Jun 26, 2018

Just a note: I am not 100% sure we already compressed the checked-in files, since we keep them under git lfs. Please try running h5repack, e.g. with gzip, on them and see whether their size changes before checking them in, to save space. Here is an example ("h5compress"): https://github.com/ax3l/cluster-scripts

@ritiek ritiek force-pushed the ritiek:openpmd-hdf5reader branch from f0c2750 to 1361cec Jun 27, 2018

@codecov


codecov bot commented Jun 27, 2018

Codecov Report

Merging #500 into master will decrease coverage by 0.13%.
The diff coverage is 92.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #500      +/-   ##
==========================================
- Coverage   97.71%   97.58%   -0.14%     
==========================================
  Files          41       44       +3     
  Lines        3590     3679      +89     
==========================================
+ Hits         3508     3590      +82     
- Misses         82       89       +7
Impacted Files Coverage Δ
plasmapy/utils/__init__.py 100% <ø> (ø) ⬆️
plasmapy/classes/__init__.py 100% <100%> (ø) ⬆️
plasmapy/utils/exceptions.py 100% <100%> (ø) ⬆️
plasmapy/data/test/__init__.py 100% <100%> (ø)
plasmapy/classes/sources/__init__.py 100% <100%> (ø) ⬆️
plasmapy/data/setup_package.py 50% <50%> (ø)
plasmapy/classes/sources/openpmd_hdf5.py 92.4% <92.4%> (ø)

Last update f7310aa...e3d0664.

@ritiek


Contributor

ritiek commented Jun 27, 2018

@ax3l Looks like they were indeed not compressed. Total size for (those two) HDF5 files reduced from about 9 MB to 6 MB. Thanks!

@StanczakDominik @ax3l We already mention from where we downloaded the datasets (here), is there anything else we need to do?

@ritiek ritiek force-pushed the ritiek:openpmd-hdf5reader branch from aa8393f to a404ddb Jun 27, 2018

@ax3l


ax3l commented Jun 27, 2018

That looks perfect; you also mention the CC0 license, so I think that's it :-)

@ax3l


ax3l commented Jun 27, 2018

Sorry for hijacking the thread: I would like to mention PlasmaPy, with the PlasmaPy logo, in upcoming conference talks/posters as part of the "openPMD community". Is that fine with you? I won't go into detail there, but I think it's cool to mention that you are in the process of adding openPMD support.

We also should not forget to add you here so people find you: https://github.com/openPMD/openPMD-projects

@ritiek ritiek force-pushed the ritiek:openpmd-hdf5reader branch from a404ddb to 50cb17c Jun 27, 2018

@StanczakDominik


Member

StanczakDominik commented Jun 27, 2018

@ritiek awesome, looking into the changes and I hope I can merge this in today!

@ax3l by all means, go right ahead. Pick a version you like from the logo source repo! I'd love to see the poster once you're done there.

Also, if you want to not hijack issues in the future, check out our Matrix chatroom or our mailing list! :D

@StanczakDominik

All right, this looks nice! I think there's a few issues that could be addressed with a follow up PR, but I'm happy to merge it as a prototype. :)

Parameters
----------
hdf5 : `str`
    Path to HDF5 file.


@StanczakDominik

StanczakDominik Jun 27, 2018

Member

Looks like this could use a References link to the site: http://openpmd.org/

electric_field : `astropy.units.Quantity`
    An (x, y, z) array containing electric field data.
charge_density : `astropy.units.Quantity`
    An array containing charge density data.


@StanczakDominik

StanczakDominik Jun 27, 2018

Member

So... my bad here, but I guess I kind of forgot that the 2D and 3D examples were showing off electrostatic + particle simulations. If you look at the thetaMode example, you will find two more mesh quantities:

(base) 20:04:58 dominik@dell: ~/Code/openPMD/openPMD-example-datasets/example-thetaMode/hdf5 $ h5ls -r data00000100.h5 
/                        Group
/data                    Group
/data/100                Group
/data/100/fields         Group
/data/100/fields/B       Group
/data/100/fields/B/r     Dataset {3, 51, 201}
/data/100/fields/B/t     Dataset {3, 51, 201}
/data/100/fields/B/z     Dataset {3, 51, 201}
/data/100/fields/E       Group
/data/100/fields/E/r     Dataset {3, 51, 201}
/data/100/fields/E/t     Dataset {3, 51, 201}
/data/100/fields/E/z     Dataset {3, 51, 201}
/data/100/fields/J       Group
/data/100/fields/J/r     Dataset {3, 51, 201}
/data/100/fields/J/t     Dataset {3, 51, 201}
/data/100/fields/J/z     Dataset {3, 51, 201}
/data/100/fields/rho     Dataset {3, 51, 201}

B being the magnetic field and J the current density, which can be neglected in electrostatic simulations. I think we should probably handle those as well - electrostatics are a fraction of all possible simulations. Do you think it would be much of a hassle to add them?


@StanczakDominik

StanczakDominik Jun 27, 2018

Member

A few tests for any of those datasets would also be very nice, of course.


@ritiek

ritiek Jun 28, 2018

Contributor

Sure! I'll work on them today.

units = _fetch_units(self.h5[path].attrs["unitDimension"])
return np.array((self.h5[path]['x'],
                 self.h5[path]['y'],
                 self.h5[path]['z'])) * units


@StanczakDominik

StanczakDominik Jun 27, 2018

Member

I'll definitely have to look at how np.array here is going to work... but it's nice enough for the prototype!

    assert self.h5.electric_field.to(u.V / u.m)

def test_correct_shape_electric_field(self):
    assert self.h5.electric_field.shape == (3, 26, 26, 201)


@StanczakDominik

StanczakDominik Jun 27, 2018

Member

I might mess around with this one so that the interface handles better in a soon-to-be-made PR.

@@ -78,6 +78,11 @@ class InvalidParticleError(AtomicError):
    pass


class OpenPMDError(PlasmaPyError):


@StanczakDominik

StanczakDominik Jun 27, 2018

Member

Maybe DataStandardError? OpenPMDError feels quite specific.

@ritiek ritiek force-pushed the ritiek:openpmd-hdf5reader branch from cf52393 to 6dd3ce4 Jun 28, 2018

@ritiek ritiek force-pushed the ritiek:openpmd-hdf5reader branch from 6dd3ce4 to e3d0664 Jun 28, 2018

@ritiek


Contributor

ritiek commented Jun 28, 2018

I've pushed the suggested changes. The failing coverage is because we don't test for exceptions; we would have to include more data, which in my opinion isn't worth it. Otherwise, this should be ready to merge.

@StanczakDominik

Awesome! Gonna merge it now. I'd say the same thing about testing for exceptions. Thanks, @ritiek!

@StanczakDominik StanczakDominik merged commit 334dc37 into PlasmaPy:master Jun 28, 2018

3 of 5 checks passed

  • codecov/patch: 92.22% of diff hit (target 97.71%)
  • codecov/project: 97.58% (-0.14%) compared to f7310aa
  • ci/circleci: test-html: Your tests passed on CircleCI!
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed

@ritiek ritiek deleted the ritiek:openpmd-hdf5reader branch Jul 5, 2018

@namurphy namurphy changed the title from [WIP] OpenPMD HDF5Reader Plasma subclass to OpenPMD HDF5Reader Plasma subclass Jul 23, 2018
