Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-linear (quadratic?) scaling of SOAP calculation #31

Closed
anjohan opened this issue Nov 1, 2019 · 10 comments
Closed

Non-linear (quadratic?) scaling of SOAP calculation #31

anjohan opened this issue Nov 1, 2019 · 10 comments

Comments

@anjohan
Copy link
Contributor

anjohan commented Nov 1, 2019

Hi,

With the SOAP vectors working perfectly, I turned to a "real" system with more atoms. With 250 000 atoms, each frame takes 1.5 hours.

From my experimenting (see below), it seems that the SOAP vector calculation scales approximately linearly as a function of the square of the number of atoms, i.e. quadratically with the number of atoms. This goes for both 0.2.8 and the newest GitHub version.

Since there is a cut-off for interatomic distances, I would have expected the SOAP calculation to scale linearly with the number of atoms like classical MD force fields. Is this not possible?

from tqdm import tqdm
from time import time
import numpy as np
from ase import Atoms
from dscribe.descriptors import SOAP
import matplotlib.pyplot as plt

N = []
t = []

# Loop over different system sizes
for ncells in tqdm(range(5, 21)):
    natoms = 2 * ncells ** 3

    soap_generator = SOAP(rcut=3.0, nmax=4, lmax=4, species=["Ni", "Ti"], periodic=True)

    a = 2.993
    niti = Atoms(
        "NiTi",
        positions=[[0.0, 0.0, 0.0], [a / 2, a / 2, a / 2]],
        cell=[a, a, a],
        pbc=[1, 1, 1],
    )

    # Replicate system
    niti = niti * ncells
    a *= ncells

    t0 = time()
    soaps = soap_generator.create(niti)
    t1 = time()

    N.append(natoms)
    t.append(t1 - t0)

N = np.array(N, dtype=float)
t = np.array(t, dtype=float)

plt.xlabel("(# atoms)^2")
plt.ylabel("t [s]")
plt.grid()

plt.plot(N ** 2, t, "bo")
plt.savefig("time.png")

time

@lauri-codes
Copy link
Contributor

Hi @anjohan!

Wow, 250 k atoms :) Are you actually working with systems of this magnitude?

The linear scaling in classical MD is largely possible if there is no need to calculate neighbour lists. This is typically the case with fixed bonds. As soon as you have to start figuring out which atoms are within the radial cutoff, there is a need to calculate the pairwise distances between atoms.

SOAP is quite interesting in this case. In theory, there is no need to ever calculate pairwise distances. One could just form a gaussian atomic density based on all atoms. This would make the algorithm scale linearly. In practise however, this linear calculation will become quite expensive in terms of time and memory if the atomic density consists of thousands of atoms.

So instead, in our current implementation, only atoms within the cutoff are included in the atomic density when calculating the expansion on individual sites. This requires that we calculate the pairwise distances once. With the current implementation this results in the O(n^2) scaling you here demonstrate. The current distance calculation is quite naive and I can improve it to scale as O(n log n). This faster distance calculation is actually already implemented for many other descriptors in the package. It was not implemented for SOAP as with typical system sizes this distance calculation is negligible, and only starts to dominate the SOAP calculation at very large system sizes.

So in summary, I will implement the faster pairwise distance calculation, which will make it scale O(n log n). I will also see if the fully linear calculation is feasible, but I fear it has such a big constant prefactor that it will not be usable in practice.

@lauri-codes
Copy link
Contributor

With 250 000 atoms, each frame takes 1.5 hours.

With what SOAP settings do you get 1.5 h/configuration?

@anjohan
Copy link
Contributor Author

anjohan commented Nov 2, 2019

Thanks for the quick response!

Wow, 250 k atoms :) Are you actually working with systems of this magnitude?

Yes (unfortunately). I'm using the SOAP vectors to visualise phase transitions for fairly large systems. This may not have been your intended purpose, but it's working superbly!

The linear scaling in classical MD is largely possible if there is no need to calculate neighbour lists. This is typically the case with fixed bonds. As soon as you have to start figuring out which atoms are within the radial cutoff, there is a need to calculate the pairwise distances between atoms.

SOAP is quite interesting in this case. In theory, there is no need to ever calculate pairwise distances. One could just form a gaussian atomic density based on all atoms. This would make the algorithm scale linearly. In practise however, this linear calculation will become quite expensive in terms of time and memory if the atomic density consists of thousands of atoms.

So instead, in our current implementation, only atoms within the cutoff are included in the atomic density when calculating the expansion on individual sites. This requires that we calculate the pairwise distances once. With the current implementation this results in the O(n^2) scaling you here demonstrate. The current distance calculation is quite naive and I can improve it to scale as O(n log n). This faster distance calculation is actually already implemented for many other descriptors in the package. It was not implemented for SOAP as with typical system sizes this distance calculation is negligible, and only starts to dominate the SOAP calculation at very large system sizes.

So in summary, I will implement the faster pairwise distance calculation, which will make it scale O(n log n). I will also see if the fully linear calculation is feasible, but I fear it has such a big constant prefactor that it will not be usable in practice.

Hm. Thanks to dscribe, I haven't had to dive too deeply into the descriptor calculations, but I was hoping that it should be linearly scaling due to the cutoff. From what I understand, SOAP vectors rely on summing Gaussians centred at each neighbour and then projecting this onto some basis. This is conceptually similar to computing short-range forces in MD, so I was hoping the spatial binning trick would work, i.e.

  1. Divide space into regions of size rcut x rcut x rcut.
  2. For each atom, its neighbours are found either inside the same region or inside the nearest 26 regions.

The time thus ends up being proportional to (number of atoms) x (number of neighbours), where the number of neighbours is approximately rho x (27rcut^3), rho being the density. While the number of neighbours is a pretty large prefactor, it's still overall linear, and with N=250k, the number of neighbours per atom is very small compared to the total number of atoms.

Would this not work for SOAP vectors?

I did look around inside the libsoap directory, but the code is (understandably) quite involved.

From what I gather, the getFilteredPos function is called inside to soap function to collect the neighbours of a given atom (or position) by looping over all the atoms and determining the ones which are inside the cutoff sphere.

If you create a spatial binning of the atoms, the getFilteredPos function would only need to loop over a small subset of the atoms to find the neighbours.

Sorry if my explanation is rambly, here's some pseudo-code:

def soap(...):
    atomlist = way of storing atoms
    L = system size (assuming cubic)
    rcut = cutoff
    Nbins = ceil(L/rcut)
    binsize = L/Nbins
    all_atoms = atomlist with all the atoms
    soap_positions = where SOAPs should be calculated

    atomlists = atomlist[Nbins][Nbins][Nbins] # array of empty atom lists

    # do spatial binning, fill atomlists
    for atom in all_atoms:
        i = atom.x/binsize
        j = atom.y/binsize
        k = atom.z/binsize

        atomlists[i][j][k].append(atom)

    for position in soap_positions:
        neighbours = getFilteredPos(position, atomlists)
        calculate SOAPs based on neighbours

def getFilteredPos(position, atomlists):
    # Determine which bin
    this_i = position.x/binsize
    this_j = position.y/binsize
    this_k = position.z/binsize

    neighbours = []

    # Loop over neighbouring bins
    for i in this_i+[-1,0,1]:
        for j in this_j+[-1,0,1]:
            for k in this_j+[-1,0,1]:
                for atom in atomlists[i][j][k]:
                    if norm(atom.position - position) < rcut:
                        neighbours.append(atom)

    return neighbours    

With what SOAP settings do you get 1.5 h/configuration?

rcut = 6.0, nmax=lmax=3. I'm essentially running the system in my example above with 50x50x50 unit cells. With 18x18x18 unit cells (11664 atoms), each frame took 10-20 seconds. These runs are on fairly standard intel nodes on the cluster. Since I have to calculate SOAP vectors for 100-1000 of these frames, I've parallelised it with mpi4py.

@lauri-codes
Copy link
Contributor

Thanks for the detailed answer! You're application sounds very interesting and I would be very interested to see what comes out of such visualization.

I think you are right and the binning might be the best solution since there is only one global cutoff. Should have thought about that earlier, thanks for pointing it out!

I will try my hand at implementing and testing it. Unfortunately I cannot promise any schedule, as I don't right now have too much time to spend on this.

@anjohan
Copy link
Contributor Author

anjohan commented Nov 2, 2019

No worries, this isn't urgent. The SOAP vector calculation still takes much less time than the MD simulation, since I'm only calculating descriptors for a small subset of the frames and using mpi4py.

I should also be able to add a naive (but linearly scaling) binning myself, now that I understand more of libsoap. I may have some time later this week.

@anjohan
Copy link
Contributor Author

anjohan commented Nov 5, 2019

I've started working on an implementation.

@lauri-codes
Copy link
Contributor

That's great! External contributors are very welcome.

Will you be doing this on the python side or in C/C++? For now this doesn't really matter, but eventually, I would want this functionality in the C/C++ side and available to all the other descriptors as well.

@anjohan
Copy link
Contributor Author

anjohan commented Nov 5, 2019

In C. As a result, I'm currently enjoying some segfaults.

@lauri-codes
Copy link
Contributor

Have fun :>

@lauri-codes
Copy link
Contributor

lauri-codes commented Dec 23, 2019

Thanks @anjohan for contributing the spatial binning algorithm! It is now fully tested and integrated into SOAP in version 0.3.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants