Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FastNS yields wrong results on simple test case #2345

Closed
Linux-cpp-lisp opened this issue Sep 7, 2019 · 16 comments
Closed

FastNS yields wrong results on simple test case #2345

Linux-cpp-lisp opened this issue Sep 7, 2019 · 16 comments

Comments

@Linux-cpp-lisp
Copy link

Expected behavior

FastNS misses neighbors in a perfect 111 FCC slab (an ideal, oversimplified case).

Actual behavior

Instead of identifying 9 or 12 neighbors for every atom, FastNS identifies 9 or 12 for most atoms, with some strange exceptions:

MDAnalysisBug1

Or see the structure, with atoms colored by coordination number:

image

Code to reproduce the behavior

(Requires ase.)

The Ag-Ag distance in this system is 2.89Å, so a 3.2Å cutoff is quite sufficient.

from ase.build import fcc111
from MDAnalysis.lib.mdamath import triclinic_box 
from MDAnalysis.lib.nsgrid import FastNS 
slab = fcc111('Ag', size = (5, 5, 4), vacuum = 10.0)
tcbox = triclinic_box(*slab.cell) 
fns = FastNS(cutoff = 3.2, coords = slab.positions, box = tcbox, pbc = True)
ns = fns.self_search()                                                                                              
cns = [len(n) for n in ns.get_indices()]                                                                            
print(np.unique(cns, return_counts = True))                                                                         
# => (array([ 7,  8,  9, 11, 12, 13]), array([ 1,  3, 46,  3, 45,  2]))

The results are identical for cutoffs of 3.0Å and 2.9Å. At a cutoff of 2.5Å, as expected, all atoms have a CN of 0.

The results are correct, for example, using pymatgen's (very, very, very slow) CutOffDictNN:

from pymatgen.io.ase import AseAtomsAdaptor 
from pymatgen.analysis.local_env import CutOffDictNN 
cdnn  = CutOffDictNN(cut_off_dict={('Ag', 'Ag') : 3.0})
slabstruct = AseAtomsAdaptor.get_structure(slab)
cs_matcutoff = [len(e) for e in cdnn.get_all_nn_info(slabstruct)]
print(np.unique(cs_matcutoff, return_counts = True))                                                                
# => (array([ 9, 12]), array([50, 50]))

Currently version of MDAnalysis

  • Which version are you using? 0.19.2
  • Which version of Python (python -V)? Python 3.6.8 :: Anaconda, Inc.
  • Which operating system? Linux (Ubuntu)
@ezavod
Copy link

ezavod commented Sep 8, 2019

Seems to be similar to #2229. Unfortunately there seems to be a tedious bug when putting the atoms in the sub cells.

@Linux-cpp-lisp
Copy link
Author

Yeah, I'd seen your issue there before and it's not a stretch to think they're related. Not promising that your issue remains open with no progress so much later though... this seems like a pretty important and fundamental bug.

@orbeckst
Copy link
Member

orbeckst commented Sep 9, 2019

@Linux-cpp-lisp fair point – but with no-one getting paid for full-time MDA support, even important issues often wait for someone desperate enough to dig into the code.

I can ping @ayushsuhane @richardjgowers @zemanj and see if anyone has any ideas.

In the worst case we might have to disable FastNS for now and fall back to PKD.

@orbeckst
Copy link
Member

orbeckst commented Sep 9, 2019

@Linux-cpp-lisp can you please make input files (e.g. a GRO or PDB file with the atoms and box information) and code available that we can use for testing and potentially for including with the tests (i.e., must be under GPL v2)?

@Linux-cpp-lisp
Copy link
Author

@orbeckst of course, and their volunteer work is very (very) appreciated. I realize now that my comment could be read as a lament about support; I only meant that the time the issue had spent open was likely a sign that the underlying bug is non-trivial.

The code above in the original post is a self contained test-case; consider this post from me as the author making it available to you under GPL v2.

It's interesting to note that if I save the structure (ase.io.write('structure.pdb', slab)) as a PDB (attached), and then load it (slab = ase.io.read('structure.pdb')), the exact code from the original post then yields:

(array([ 9, 10, 11, 12]), array([50,  1,  2, 47]))

Which is still wrong, but different. If I load from a CIF file, instead, I get the right results:

(array([ 9, 12]), array([50, 50]))

I think this is because the cell vectors are rounded after loading in the PDB (someone in the other issue had mentioned rounding as a possible cause of the overall issue):

# Original cell
array([[14.46033368,  0.        ,  0.        ],
       [ 7.23016684, 12.52301631,  0.        ],
       [ 0.        ,  0.        , 27.0840878 ]])
# Cell after loading from PDB
array([[14.46      ,  0.        ,  0.        ],
       [ 7.23      , 12.52272734,  0.        ],
       [ 0.        ,  0.        , 27.084     ]])
# Cell from CIF file
array([[14.4603    ,  0.        ,  0.        ],
       [ 7.23015   , 12.52298715,  0.        ],
       [ 0.        ,  0.        , 27.0841    ]])

surface_cif_and_pdb.zip

@orbeckst
Copy link
Member

orbeckst commented Sep 9, 2019

of course, and their volunteer work is very (very) appreciated. I realize now that my comment could be read as a lament about support; I only meant that the time the issue had spent open was likely a sign that the underlying bug is non-trivial.

Thank you!

My comment was a bit over-sensitive and defensive – I am not always sure people understand how much work it is to even keep a medium-sized open source project such as MDA in a useable state, let alone improving it and fixing bugs. I appreciate your clarifying comment!

The more we get feedback the better. The insight that the precision of the box might be important and that the FastNS algorithm might be very sensitive to something that it shouldn't be overly sensitive to is interesting.

Thanks for taking the time to give detailed feedback. I would guess for every user like you there are between 10 and 100 who decide that it's too much effort and move on. This is, of course a valid choice because everybody is busy. But that's why we value it even more when someone takes the time to submit well-researched bug reports. I just wish we could always and immediately make best use of this information. However, hopefully by triangulating bugs it will become easier to fix them eventually.

By any chance, did you try the periodic KD-Tree https://www.mdanalysis.org/docs/documentation_pages/lib/pkdtree.html to see if it has similar precision issues?

@Linux-cpp-lisp
Copy link
Author

@orbeckst: Not at all, and I'm fairly familiar, at least at a small scale, with just how much effort open-source upkeep is.

I haven't but I can give that a try when I have a chance, since I need to (at least temporarily) replace FastNS so that my downstream results are (more) correct. I'll let you know, though I'm not sure exactly when I'll have time to work on that.

@Linux-cpp-lisp
Copy link
Author

@orbeckst : With some modifications (pull request #2347) for unrelated reasons (I need all the neighbors, not the nearest, to compute coordination number), yes, PeriodicKDTree does seem to work.

@zemanj
Copy link
Member

zemanj commented Sep 10, 2019

@Linux-cpp-lisp
First of all, thank you for your very detailed bug report! 👍

I just ran the current development branch against your example code with the ASE-generated fcc lattice and got even worse results. I put a couple of prints in the nsgrid module to get some more information on what's happening internally. Note that this produces more output than can be inferred from the test script below. Apart from that, my test code should be in principle identical to yours.

Test code

import numpy as np
from MDAnalysis.lib.distances import distance_array, apply_PBC
from MDAnalysis.lib.mdamath import triclinic_box, triclinic_vectors
from MDAnalysis.lib.nsgrid import FastNS
from ase.build import fcc111

#search radius:
cutoff = 3.2

print("selected cutoff: {}".format(cutoff))

#generate FCC system with ASE:
slab = fcc111('Ag', size=(5, 5, 4), vacuum=10.0)
tcbox = triclinic_box(*slab.cell)

#ASE-generated box vectors:
print("\nASE box vectors:\n"
      "[{:15.10f} {:15.10f} {:15.10f}]\n"
      "[{:15.10f} {:15.10f} {:15.10f}]\n"
      "[{:15.10f} {:15.10f} {:15.10f}]"
      "".format(*(slab.cell[0]), *(slab.cell[1]), *(slab.cell[2])))

#MDA box from generated ASE box:
print("\nMDAnalysis box:\n"
      "[{:15.10f} {:15.10f} {:15.10f} {:15.10f} {:15.10f} {:15.10f}]"
      "".format(*tcbox))

#MDA box vectors:
tcbox_vecs = triclinic_vectors(tcbox)
print("\nMDAnalysis box vectors:\n"
      "[{:15.10f} {:15.10f} {:15.10f}]\n"
      "[{:15.10f} {:15.10f} {:15.10f}]\n"
      "[{:15.10f} {:15.10f} {:15.10f}]"
      "".format(*(tcbox_vecs[0]), *(tcbox_vecs[1]), *(tcbox_vecs[2])))

#bruteforce distance search using lib.distances.distance_array:
da = distance_array(slab.positions, slab.positions, box=tcbox)
bf_indices = []
bf_cn = []
for i in range(da.shape[0]):
    bf_cn.append(len([ix for ix in range(da.shape[1]) \
                 if 0.0 < da[i, ix] < cutoff]))

#grid-based distance search using lib.nsgrid.FastNS:
fns = FastNS(cutoff=cutoff, coords=slab.positions, box=tcbox, pbc=True)
ns = fns.self_search()
ns_cn = [len(nsi) for nsi in ns.get_indices()]

print("\nBrute force coordination numbers:\n"
      "CNs:    {}\n"
      "counts: {}".format(*(np.unique(bf_cn, return_counts=True))))
print("\nFastNS coordination numbers:\n"
      "CNs:    {}\n"
      "counts: {}".format(*(np.unique(ns_cn, return_counts=True))))

Test output

selected cutoff: 3.2

ASE box vectors:
[  14.4603336753    0.0000000000    0.0000000000]
[   7.2301668376   12.5230163100    0.0000000000]
[   0.0000000000    0.0000000000   27.0840878030]

MDAnalysis box:
[  14.4603338242   14.4603338242   27.0840873718   90.0000000000   90.0000000000   60.0000000000]

MDAnalysis box vectors:
[  14.4603338242    0.0000000000    0.0000000000]
[   7.2301669121   12.5230159760    0.0000000000]
[   0.0000000000    0.0000000000   27.0840873718]

FastNS.box.c_pbcbox.box:
[  14.4603338242    0.0000000000    0.0000000000]
[   7.2301669121   12.5230159760    0.0000000000]
[   0.0000000000    0.0000000000   27.0840873718]

_NSGrid cell grid dimensions:
[4 3 8]

_NSGrid cell size:
[   3.6150834560    4.1743388176    3.3855109215]

Brute force coordination numbers:
CNs:    [ 9 12]
counts: [50 50]

FastNS coordination numbers:
CNs:    [ 3  5  6  7  8  9 10 11 12 13 14 15 17]
counts: [ 1  2  5 12  6 15 12  4 26  7  5  1  4]

Obviously, this is even worse than what you got.
The Fact that the box vectors used in MDAnalysis (and thus, also FastNS) slightly differ from the ones generated by ASE is not anything we should worry about. If the grid search algorithm is designed properly, this will not yield entirely wrong results. The only thing that could happen is that atoms residing very close to any of the box boundaries and, at the same time, are separated from other atoms by distances extremely close to the cutoff, might or might not be counted. The exact results in such "corner" cases, however, will always slightly differ between different implementations.
Anyway, in your example, cutoff=3.2 is a choice that is well in between the nearest and second nearest neighbor distance so that the results should be insensitive to such tiny deviations in the box vectors. As far as I can tell, the source of error lies entirely in the broken cell assignment for triclinic boxes.

Furthermore, while playing around with the test code, I discovered an additional bug:
If we select the x/y-dimensions of the fcc lattice so small that the cell grid in FastNS has only 2 cells in the x- and y-direction, the algorithm double-counts neighboring atoms in adjacent cells.
For a smaller fcc lattice size of (3, 3, 4) the output then looks like this:

Test output for a (3,3,4) fcc lattice

selected cutoff: 3.2

ASE box vectors:
[   8.6762002052    0.0000000000    0.0000000000]
[   4.3381001026    7.5138097860    0.0000000000]
[   0.0000000000    0.0000000000   27.0840878030]

MDAnalysis box:
[   8.6761999130    8.6761999130   27.0840873718   90.0000000000   90.0000000000   60.0000000000]

MDAnalysis box vectors:
[   8.6761999130    0.0000000000    0.0000000000]
[   4.3380999565    7.5138096809    0.0000000000]
[   0.0000000000    0.0000000000   27.0840873718]

FastNS.box.c_pbcbox.box:
[   8.6761999130    0.0000000000    0.0000000000]
[   4.3380999565    7.5138096809    0.0000000000]
[   0.0000000000    0.0000000000   27.0840873718]

_NSGrid cell grid dimensions:
[2 2 8]

_NSGrid cell size:
[   4.3380999565    3.7569048405    3.3855109215]

Brute force coordination numbers:
CNs:    [ 9 12]
counts: [18 18]

FastNS coordination numbers:
CNs:    [15 16 17 18 19 20 21 22 23 24 25 26]
counts: [1 2 2 5 4 3 1 3 3 4 7 1]

Now that there are only two 2 cells in the x- and y-direction, FastNS overestimates the number of neighbors by far.

Sources of Error

  1. According to the docstrings in lib/nsgrid.pyx, the coordinates are supposed to be folded into a brick-shaped box for cell assignment. As far as I can see, this is not the case, and IMHO that is correct since we need the coordinates wrapped into the original box for distance calculations. Instead, the cell assignment for triclinic boxes should be fixed.
    This then leads to wrong cell ID assignments and causes the observed wrong number of neighbor atoms. This bug probably only exists for triclinic boxes.
  2. If there are less than 3 cells in one dimension, FastNS doesn't detect that the neighboring cells in positive and negative x- or y-direction are actually the same cells and, consequently, double-counts these cells. This is a bug that is not limited to triclinic boxes!

Quick(?)-and-dirty solution

  1. Fix cell assignment for triclinic boxes.
  2. Make sure cells are not double-counted

Proper solution: complete re-implementation

Currently, there are multiple sources for precision issues in the nsgrid module which won't go away by simply applying the "quick" fix:

  1. FastNS uses its own (lazy) implementation of PBC wrapping which lacks several checks compared to lib.distances.apply_PBC() (implemented in lib/include/calc_distances.h). This can lead to coordinates lying slightly outside of the box, especially for triclinic cells. (Generally, new modules should reuse existing code!)
  2. The fact that the cell grid is orthorhombic even for triclinic boxes necessitates the awkward probe shifting procedure, which is not only slow, but is also very error prone: Suppose a coordinate lies on a cell boundary in (at least) one direction. Due to round-off errors, probe coordinates shifted by one cell size can lead to the same or the second next cell being selected as a neighboring cell, depending on whether the particle lies on an upper or lower cell boundary. We have worked on reducing the probability of such cases by doing some computations in double precision. However, this just makes the problem less likely and is not a real fix. (EDIT: turns out this is the root cause for bug Neighbor search with nsgrid.FastNS yields different pairs for swapped coords and search_coords #2229.)

A new implementation should therefore incorporate the following changes:

  1. The grid in a triclinically distorted box should also be distorted, i.e., the grid should always fit the simulation box.
  2. With such grids, we can get rid of the costly and error-prone probe shifting cell search and instead determine neighboring cells via their indices, i.e., by exact integer (modulo) operations.
  3. PBC wrapping as well as periodic distance calculations should use the robust (and parallelized!) machinery from lib/include/calc_distances.h. This will avoid code duplication, is less error prone, and probably faster.
  4. For performance reasons, separate cell assignment methods should exist for orthorhombic and triclinic boxes.
  5. The duplicate code in the search() and self_search() methods should be unified.
  6. The code should never have to instantiate a second search grid object but reuse the existing one.
    This is possible since we only need the grid for cell index assignment of the search coordinates, i.e., we don't actually need to fill the search coordinates in a second grid object.
  7. The search methods should be parallelized, i.e., their low-level routines should probably be implemented in C (we should consider C++ only if that leads to major simplifications, which is unlikely).

I'm looking forward to everyone's comments/suggestions!

@orbeckst
Copy link
Member

@zemanj thanks for the detailed analysis. Give the issues that are coming to light with FastNS, should we disable it (or make it explicitly opt-in with a warning?) for the time being and fall back to the slower approaches?

@zemanj
Copy link
Member

zemanj commented Sep 11, 2019

If nobody finds the time to properly fix this thing before the next release we have to disable it until we have a replacement. I wouldn't expect making broken code opt-in could be valuable for anyone but maybe it's technically easier than throwing it out altogether.

@orbeckst orbeckst added the PBC label Oct 22, 2019
@orbeckst
Copy link
Member

orbeckst commented Nov 5, 2019

@bdice have a look at the examples in this issue to check if the PBC-aware neighbor list/RDF calculation in freud correctly deals with these cases.

@bdice
Copy link
Contributor

bdice commented Nov 5, 2019

@orbeckst I'm taking a look - from initial investigations, freud may be mishandling this as well, but I need to check more closely before I make a definitive conclusion.

@richardjgowers richardjgowers mentioned this issue Feb 4, 2021
4 tasks
richardjgowers added a commit that referenced this issue Feb 28, 2021
Rewrote lib.nsgrid module

Issues #2919 #2345 #2229
@richardjgowers richardjgowers mentioned this issue Feb 28, 2021
4 tasks
richardjgowers added a commit that referenced this issue Mar 7, 2021
* Fixed NSGrid

Rewrote lib.nsgrid module

Issues #2919 #2345 #2229

* enabled nsgrid again

* bumped version to 1.0.2-dev

* remove nsgrid warnings

* regenerated nsgrid C files

* Revert removal of nsgrid from test_distances

Co-authored-by: IAlibay <irfan.alibay@gmail.com>
@Linux-cpp-lisp
Copy link
Author

@richardjgowers thank you so much for fixing this issue!!

Is there a way in code to check if the installed MDAnalysis contains your fix? As in, if I want to write:

if ns_grid_ok:
   # use FastNS
else:
  # use ASE neighbor_list

what would ns_grid_ok be?

Also, just out of curiosity, is there any sense of when there will be a PyPI release that includes this fix? Maybe @orbeckst can speak to that?

Once again, thanks all for your work on this issue and code.

@orbeckst
Copy link
Member

orbeckst commented Mar 19, 2021 via email

@IAlibay
Copy link
Member

IAlibay commented Mar 19, 2021

@IAlibay the fix will be on 1.1.0, right?

Yes that's correct.

Any ETA?

We have a tentative deadline of this Sunday for new code contributions for 1.1.0. We'll put a code freeze on then and aim to release shortly after (we will need a bit of time to a) check that Windows works again, b) create a final list of max tested dependency versions).

zemanj added a commit to zemanj/mdanalysis that referenced this issue Mar 21, 2021
This was referenced Mar 21, 2021
richardjgowers pushed a commit that referenced this issue Mar 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants