Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent dtype for universe.dimensions #2190

Closed
PicoCentauri opened this issue Jan 30, 2019 · 15 comments
Closed

Inconsistent dtype for universe.dimensions #2190

PicoCentauri opened this issue Jan 30, 2019 · 15 comments

Comments

@PicoCentauri
Copy link
Contributor

Expected behavior

The dtype of the dimensions for a standard (not in-memory) universe should be consistent with an in-memory universe. This is in particular important for the low-lewel C functions like the make_whole function which will fail for the in-memory trajectory, since it requires a float.

Actual behavior

The dimension of the standard universe representation is a float and the in-memory representation is a double.

Code to reproduce the behavior

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)
u2 = mda.Universe(PSF, DCD, in_memory=True)

print(u.dimensions.dtype)
print(u2.dimensions.dtype)

Currently version of MDAnalysis

I'm using Python 3.6 and the dev version of MDAnalysis on MacOS

@PicoCentauri
Copy link
Contributor Author

@richardjgowers
Copy link
Member

I'm pretty sure that we needed doubles at some point for triclinic pbc, so I think make_whole is wrong here

@zemanj
Copy link
Member

zemanj commented Jan 30, 2019

@richardjgowers AFAIK triclinic PBC are handled exclusively in lib.distances, and that uses float32 boxes.
IIRC all routines in lib.distances use lib.util.check_box(), which does the type conversion to float32 regardless of input dtype.

You're right regarding precision problems with PBC transformations. Such problems can occur even with orthorombic boxes leading to atom positions ending up exactly on the upper box boundary instead of on the lower. The reason is that the box inverse is not precise enough in single precision. I encountered that when testing a wrap-unwrap-wrap cycle.

@richardjgowers
Copy link
Member

richardjgowers commented Jan 30, 2019 via email

@zemanj
Copy link
Member

zemanj commented Feb 6, 2019

make_whole() will be fixed for double precision dimensions in PR #2189. However, that is just a workaround for the dtype consistency problem. I found that a universe created with mda.Universe.empty() also has a double precision box. Maybe that's related to the construction of an in-memory Universe?

@richardjgowers
Copy link
Member

@zemanj yeah it's probably just a non specific call to np.zeros if I had to guess.. Universe.empty is using MemoryReader but it might be making the arrays manually

@zemanj
Copy link
Member

zemanj commented Feb 11, 2019

I think we need a decision here... Should we

  1. enforce Universe.dimensions to always be a float32 ndarray or
  2. just make sure we always cast it as float32 whenever it's used?

If we opt for 2., I think we can close this.

@jbarnoud
Copy link
Contributor

Opt 1 is easier to enforce: we can add a test in the base reader test class that asserts that universe.dimensions is float32. Any reader not following that rule would cause a test failure. Is there a reason to have float64 for the box?

@zemanj
Copy link
Member

zemanj commented Feb 11, 2019

Universe.dimensions boils down to TimeStep._unitcell, which is default-initialized with np.zeros((6), np.float32). It might be easiest to add automatic dtype conversion to np.float32 in the TimeStep.dimensions setter, which would then read

@dimensions.setter
def dimensions(self, box):
    self._unitcell[:] = box.astype(np.float32, copy=False)

Ideally, we add a validation check here so that one cannot supply an invalid box.
That check could go into lib.util, which is imported in coordinates.base anyway.

EDIT:
We still have to check if TimeStep._init_unitcell() is overwritten with something undesirable in any of the trajectory readers.

@zemanj
Copy link
Member

zemanj commented Feb 11, 2019

A check for a valid box could look as follows:

  1. If all values are zero, return np.zeros(6, dtype=np.float32).
  2. If any of the values is not finite (+/- np.nan or +/- np.inf), raise a ValueError.
  3. If any angle is negative or greater than or equal to 360 degrees, apply angle = angle % 360.0.
  4. If all lengths and angles are greater than zero and all angles are strictly smaller than 180 degrees:
    • If the sum of any two angles is greater than the remaining one, return the box.
    • Otherwise, raise a ValueError.
  5. If any of the lengths or angles is zero:
    • Check that for every zero length, the corresponding pair of angles is also zero (indicates a 2d- or 1d system). If any of the following conditions is not met, raise a ValueError:
      • If lx == 0, beta and gamma must be zero.
      • If ly == 0, alpha and gamma must be zero.
      • If lz == 0, alpha and beta must be zero.
      • If alpha == 0, at least one of ly or lz must be zero.
      • If beta == 0, at least one of lx or lz must be zero.
      • If gamma == 0, at least one of lx or ly must be zero.
  6. If any length is negative or any angle is greater than 180 degrees, transform the box to matrix representation and back using
    box = lib.mdamath.triclinic_box(*lib.mdamath.triclinic_vectors(box)) and run the above checks again. Otherwise, return the box.

Maybe one could also change the order of the tests and first check if a valid matrix representation exists, that might simplify things a bit.

Moreover, if any of the values is changed, a corresponding warning should be raised.

@richardjgowers
Copy link
Member

@jbarnoud I think maybe you could argue that round tripping via MDAnalysis shouldn't mangle precision, so if we get data as float64 we should keep it that way, but we don't do that for positions, so why for box?

@zemanj enforcing option 1 seems possible like you've outlined. There's not many (any?) readers that redefine _init_unitcell iirc. I think the idea behind that was that each reader can store the ._unitcell in the native format, then .dimensions is the format MDA expects.

@richardjgowers
Copy link
Member

@zemanj using np.nan is an interesting idea for expressing semi periodic systems, it is missing data essentially? Or maybe 0.0 is better, ie this dimension is flat...

@zemanj
Copy link
Member

zemanj commented Feb 11, 2019

@richardjgowers

I think the idea behind that was that each reader can store the ._unitcell in the native format, then .dimensions is the format MDA expects.

if that's the case, then we should not convert the dtype in the dimensions setter but in the getter.

using np.nan is an interesting idea for expressing semi periodic systems, it is missing data essentially? Or maybe 0.0 is better, ie this dimension is flat...

We might run into problems with matrix representations if we choose np.nan for absent periodicity.
If we want periodic poundary conditions in less than three dimensions, we probably won't get around having an extra kind of "periodicity mask" that is applied to the box before applying PBC.
Think of 3d systems with walls, for example. There, you want distances to be non-periodic only in the direction normal to the walls, so setting the dimensions to zero in that direction is correct for distance calculations. If you want to compute the volume, however, setting that dimension to zero (or np.nan) will be wrong.

@richardjgowers
Copy link
Member

@zemanj sure but I also don't like having it in the setter because we have too many astype calls :) Really we should act like numpy and handle whatever dtypes and return the appropriate dtypes, eg C++ templates/Cython fused types etc. But that's not a small PR to suggest.

Maybe just hacking the setter/init to force a float32 dtype is a nice fix for today's needs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants