New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDAnalysis.lib.distances needs rework #2046

Open
zemanj opened this Issue Aug 13, 2018 · 0 comments

Comments

Projects
None yet
2 participants
@zemanj
Member

zemanj commented Aug 13, 2018

The documentation of the module MDAnalysis.lib.distances has several issues, whereof one has already been addressed by @xiki-tempula in issue #2004. I also see some points where code duplication could be reduced to make the code more DRY.

Documentation issues:

  • Probably deprecated title: Fast distance array computation
    IMHO that seems a bit misleading. Even though the module contains methods to compute distance arrays, it also includes methods for coordinate transformation, PBC handling, and for the calculation of angles and dihedrals.
  • The same issue applies to the doc header, which doesn't describe the module's contents sufficiently since it only refers to distance arrays. (Fixed via PR #2070)
  • The methods listed as .. autofunction: ... in the header appear as duplicates in the docs. Furthermore, the way the autofunction strings are formatted may not be very useful (nested square brackets without default values for kwargs). (Fixed via PR #2070)
  • Many docstrings violate the 80 character line length limit. (Fixed via PR #2070)
  • The methods augment_coordinates and undo_augment, which are imported from MDAnalysis.lib._augment, are missing from the docs. (Fixed via PR #2062)
  • Many docstrings have issues with spelling mistakes, missing or improperly referenced parameters and methods, wrong or missing data types, and potentially misleading explanations. That includes but is not limited to issue #2004. (Fixed via PR #2070)
  • In general, the docstrings could be much more standardized, there's no need to have different descriptions for parameters such as box which always have the same purpose or meaning. (Fixed via PR #2070)

Code issues:

  • Many functions use the _box_check() helper function to determine the type of simulation box supplied. Thereafter, the box coordinates are transformed to the memory layout expected by the subsequently called low-level C functions. This transformation should be incorporated into _check_box() to avoid code duplication. (Fixed via PR #2048)
  • The docstrings of most functions state that the box must be supplied in the format [lx, ly, lz, alpha, beta, gamma] as returned by `Timestep.dimensions```. The _box_check function doesn't reflect that. So either the requirements for the box should be less strict or ``_box_check`` should be stricter in that respect. (Fixed via PR #2048)
  • Not only checking but also creating the results array (if required) should take place in the _check_result_array() helper function. (Fixed via PR #2048)
  • Many functions now incorporate automatic dtype conversion, so the dtype check in the _check_array() helper function is now redundant and can be removed in these cases. (Fixed via PR #2048 by means of a new @check_coords() decorator)
  • PR #2048 introduced a subtle bug so that in certain situations, some functions change their input coordinate arrays in-place. (Fixed via PR #2083)
  • Many functions choke on empty input coordinates arrays (i.e., with shape=(0, 3)). (Fixed via PR #2083)
  • Depending on the employed search method, the results of capped_distance() and self_capped_distance() are not always numpy arrays. (Fixed via PR #2083)
  • If no pairs are found, _bruteforce_capped_self() correctly returns empty pairs but unfortunately also non-empty mumbo-jumbo distances. (Fixed via PR #2083)
  • _bruteforce_capped() crashes if all input coordinates are the same and box is None. (Fixed via PR #2083)
  • lib.distances._check_box() should be moved to lib.util (with underscore removed). (Fixed via PR #2114)
  • Not all methods for *capped_distance() have the same cut-off criteria (sometimes distances < max_cutoff and sometimes distances <= max_cutoff). (Fixed via PR #2114)
  • The different methods for *capped_distance() do not always return the same number of pairs. (EDIT: only pathological cases, won't fix)
  • lib.nsgrid calculates distances in single precision, whereas all functions in lib.distances use double precision. (Fixed via PR #2114)
  • lib.nsgrid.PBCBox uses an arbitrary constant EPSILON=1e-5 to determine whether a box is triclinic. This a) does not correspond to the behavior of other functions, and b) fails if a box angle is, e.g., 90.00001 degrees (or higher or negative). (Fixed via PR #2114)
  • lib/include/calc_distances.h contains a lot of duplicated code (functions differing in PBC type only). (EDIT: unifying functions for different PBC types impacts performance, won't fix)
  • OpenMP parallelization in lib/include/calc_distances.h often suffers from false sharing.
  • In lib.distances.distance_array(), the inner loop (the one over the configuration coordinate array) should be parallelized instead of the outer one (going over the reference coordinate array), since often, one seeks to know the number of "neighbors" with respect to some reference atom.

TODO suggestions:

  • Documentation:
    • Ensure that parameters referenced in docstrings correspond to function signatures. (Fixed via PRs #2048 and #2070)
    • Properly format references to other methods so that links are generated. Add .. seealso : See Also sections where applicable. (Fixed via PR #2070)
    • Use italics only to reference function parameters, use teletype font for variables and values in docstrings. (Fixed via PR #2070)
    • Use logical instead of visual markup (e.g., backticks instead of asterisks for function parameters). (Fixed via PR #2070)
    • Enforce 80 character lines (Fixed via PR #2070)
    • Standardize phrasing of repeatedly occurring parameter descriptions (Fixed via PR #2070)
    • Always state correct type of parameters (e.g. numpy.ndarray if a numpy array is expected) (Fixed via PRs #2048, #2062, and #2070)
    • Mark optional parameters by adding , optional after the type (e.g. box : array_like, optional) (Fixed via PR #2070)
    • Do not give default values for kwargs in brackets since these appear in the auto-generated function signature. (Fixed via PR #2070)
    • Module Header:
      • Change title to something more descriptive (e.g., "fast geometry computations" or similar)
      • Improve module description to better correspond to its contents (Fixed via PR #2070)
      • Remove .. autofunction lines :members: from distances.rst causing duplicates
  • Code:
    • Move box memory layout transformations to _check_box and either loosen requirements for its expected format or make checks stricter. (Fixed via PR #2048)
    • Fix bugs introduced in PR #2048 and add tests assuring input arrays always remain unmodified. (Fixed via PR #2083)
    • For completeness, make sure that all functions also work with empty coordinate arrays (shape=(0, 3)) and add corresponding test cases. (Fixed via PR #2083)
    • Ensure capped_distance() and self_capped_distance() always return numpy arrays and add corresponding tests. (Fixed via PR #2083)
    • Ensure _bruteforce_capped_self() returns empty distances if no pairs are found. Add corresponding tests for all *capped_distance() functions. (Fixed via PR #2083)
    • Fix _bruteforce_capped() for the case when all input coordinates are the same and box is None. (Fixed via PR #2083)
    • Move lib.distances._check_box() to lib.util (with underscore removed). (Fixed via PR #2114)
    • Ensure that all methods used in capped_distance() have the same cut-off criterion (distances <= max_cutoff). (Fixed via PR #2114)
    • Use double precision for distance computation in lib.nsgrid. (Fixed via PR #2114)
    • Fix box type detection in lib.nsgrid and some functions in lib.distances. (Fixed via PR #2114)
    • Optimize parallelism in lib/include/calc_distances.h for 64 byte cachelines (avoid false sharing).
    • Parallelize inner instead of outer loop in lib.distances.distance_array().

That's quite a lot of things to do, but I've already started working on most of the points.
There are still issues to be discussed, especially the module's title and description and how to proceed with the requirements for boxes.

Current version of MDAnalysis:

0.18.1-dev 0.19.1-dev

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment