
Modified the distance function in RDF #2013

Merged: 7 commits merged into MDAnalysis:develop from ayushsuhane:rdf on Aug 15, 2018

Conversation

@ayushsuhane (Contributor) commented Jul 28, 2018

RDF in analysis.py currently uses distance_array, which becomes slow for a large number of particles. It still uses the same distance array for a small number of particles, but it can also use a KDTree for a large number of particles to boost performance.
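For context, a minimal sketch of the two code paths under discussion, using functions from MDAnalysis.lib.distances (the Universe, group selections, and cutoff below are placeholders, not the actual benchmark setup):

import MDAnalysis as mda
from MDAnalysis.lib import distances

u = mda.Universe("topology.gro")             # placeholder system
g1, g2 = u.atoms[:1000], u.atoms[1000:2000]  # placeholder groups
rdf_range = 15.0                             # RDF cutoff in Angstrom

# current approach: full len(g1) x len(g2) distance matrix
d = distances.distance_array(g1.positions, g2.positions, box=u.dimensions)

# capped approach: only pairs within the cutoff are returned, with the
# search method (brute force / KDTree / grid) chosen automatically
pairs, dist = distances.capped_distance(g1.positions, g2.positions,
                                        rdf_range, box=u.dimensions)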

PR Checklist

  • [x] Tests?
  • [ ] Docs?
  • [x] CHANGELOG updated?
  • [ ] Issue raised/referenced?
@ayushsuhane (Contributor, Author) commented Jul 28, 2018

I have tried to keep the flow of the code intact, partly because I am not sure how to handle the exclusion block after the evaluation of pairs and their corresponding distances. If the exclusion block is None, the RDF can be calculated directly by binning the distances, but with blocks we have to create the n x m arrays first. I am not sure whether this will have any advantage; I will try to check the performance first with the available data files. Also, the tests are missing for InterRDF_s. If it turns out that the performance is improved, I will also try to add tests for InterRDF_s.

pairs, dist = distances.capped_distance(self.g1.positions,  # first line reconstructed; the diff snippet starts mid-call
                                         self.g2.positions,
                                         self._maxrange,
                                         box=self.u.dimensions)
# scatter the capped distances back into the full n x m result matrix
self._result[tuple(zip(*pairs))] = dist

@richardjgowers (Member) commented Jul 28, 2018

This looks ugly and slow, but I see that for the exclusion mask feature you need the full matrix of distances. I think it might be better to implement the exclusion mask in a different way, to avoid having to build this full matrix representation.

Because the full matrix scales as O(N^2), we should really be avoiding it anyway. I'll have a think to see if I can come up with a quick way to do the exclusion matrix. You should be able to go through the indices, check whether they fall inside an excluded block, and if so ignore that distance.

@richardjgowers richardjgowers self-assigned this Aug 5, 2018
@orbeckst orbeckst requested a review from VOD555 Aug 6, 2018
@richardjgowers (Member) commented Aug 7, 2018

So with the exclusion block, you can check the raw indices quite easily. If exclusion_block = (2, 3), then x in {0, 1} with y in {0, 1, 2} is excluded, x in {2, 3} with y in {3, 4, 5}, and so on.

Or more generally, block i covers x in [i * lx, (i+1) * lx - 1] and y in [i * ly, (i+1) * ly - 1].

So you can look at the x index, determine that it is in the i-th block, and check whether the corresponding y index is in the same block; if it is, exclude that pair.
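A tiny worked example of that block test (numbers taken from the comment above; the names are illustrative, not from the PR):

# exclusion_block = (2, 3): x atoms come in blocks of 2, y atoms in blocks of 3
lx, ly = 2, 3

def same_block(x, y):
    # a pair is excluded when both indices fall in the same block
    return x // lx == y // ly

assert same_block(0, 2)        # x=0 -> block 0, y=2 -> block 0: excluded
assert same_block(3, 5)        # x=3 -> block 1, y=5 -> block 1: excluded
assert not same_block(2, 2)    # x=2 -> block 1, y=2 -> block 0: kept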

@ayushsuhane (Contributor, Author) commented Aug 9, 2018

Yes, I think this should fix the problem. However, I am not seeing any improvement in the timings, except that the use of capped_distance/nsgrid makes it possible to handle large datasets.

# block index of each atom in the pair; pairs falling in the same block are excluded
idxA = pairs[:, 0] // self._exclusion_block[0]
idxB = pairs[:, 1] // self._exclusion_block[1]
mask = np.where(idxA != idxB)[0]
dist = dist[mask]
@ayushsuhane ayushsuhane force-pushed the ayushsuhane:rdf branch from daa1dcd to b1e9435 Aug 10, 2018
@codecov commented Aug 10, 2018

Codecov Report

Merging #2013 into develop will decrease coverage by <.01%.
The diff coverage is 100%.


@@             Coverage Diff             @@
##           develop    #2013      +/-   ##
===========================================
- Coverage    89.12%   89.12%   -0.01%     
===========================================
  Files          159      159              
  Lines        18761    18759       -2     
  Branches      2696     2695       -1     
===========================================
- Hits         16721    16719       -2     
  Misses        1425     1425              
  Partials       615      615
Impacted Files Coverage Δ
package/MDAnalysis/analysis/rdf.py 100% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@zemanj (Member) commented Aug 10, 2018

@ayushsuhane I worked on a faster RDF computation a while ago, but it's not production quality yet since I haven't incorporated exclusions, and it also doesn't make use of any more sophisticated approaches such as grid search for capped distances. I'd very much appreciate it if you could have a look at what I did in the lib_distances branch of my forked repo and tell me what you think.

@ayushsuhane (Contributor, Author) commented Aug 10, 2018

@zemanj Thanks!! It is amazing. Pardon the noob question, but I am trying to understand what makes the function _calc_distance_histogram faster than np.histogram(distance_array(...), **kwargs), apart from the (hopefully small) overhead of numpy.histogram.

Correct me if I am wrong, but I thought numpy methods also run C-based computations, and the function you have defined does the same thing but (most probably) without the checks that numpy.histogram most certainly performs. Apart from that, there is another difference, i.e. the use of minimum_image_ortho_lazy for the distance computation. I am not sure whether the performance boost comes from that assumption or from generating the histogram directly in C; maybe you can shed some light on it. How much of a performance improvement do you achieve with this approach? If it is significant, I guess it can be incorporated into the library.

@ayushsuhane (Contributor, Author) commented Aug 10, 2018

@richardjgowers So the problem here is the use of a large search range. If the box size is small, i.e. 30-50, searching a range of 15 (do we really need such a large range even for smaller boxes?) really becomes an N^2 problem, and I don't think there is anything we can do about it. I was testing the performance with [small.gro](https://github.com/ayushsuhane/Benchmarks_Distance/blob/master/Notebooks/small.gro), which has an orthogonal box of 50. This is why the plain distance_array was quite fast compared to the other methods. Of course, there is some overhead in calculating the pairs in capped_distance as well, so choosing capped_distance over the sparse array is always slower here, but it has the advantage of handling arbitrarily large data (up to a limit, though). What do you suggest? It's a question of performance versus flexibility.


@ayushsuhane ayushsuhane force-pushed the ayushsuhane:rdf branch from b1e9435 to 6d51133 Aug 11, 2018
@ayushsuhane (Contributor, Author) commented Aug 11, 2018

[benchmark plot: RDF timing vs. number of particles]

The performance of the RDF calculation decreased a little for the small data set (g1, g2), but there is an improvement for larger data sets (the last point in the graph corresponds to 10,000 particles in each of the two groups used for the RDF calculation).

I also extended the benchmarks to include the number of atoms as a parameter, so this improvement should be reflected there as well.

@ayushsuhane ayushsuhane force-pushed the ayushsuhane:rdf branch from 6d51133 to e8d9b5b Aug 12, 2018
@ayushsuhane (Contributor, Author) commented Aug 12, 2018

@richardjgowers I think it varies from case to case, which is difficult to quantify. I tried to come up with rules based on the one or two cases at hand, but a more exhaustive study needs to be performed. However, I think this should suffice for now. It already shows an improvement in time by a factor of 2-2.5 for large datasets, which is quite substantial (from 70 s to 25 s for 10,000 atoms in both groups).

@zemanj (Member) commented Aug 12, 2018

@ayushsuhane Ok here's some explanation of what I did:

I started working on RDF computations for a project investigating long-range correlations in ionic liquids, which involved systems comprising up to 350,000 ion pairs. The amount of RAM required to store a distance array then amounts to about 912.7 GB, which is slightly more than what is available on my laptop... 😉
The fact that memory requirements for RDF computations scale like O(N^2) quickly becomes a total showstopper, which IMHO is simply unacceptable. This is where the distance_histogram methods come into play:
In the methods I implemented, the calculated distances are never written to main memory and actually never leave the CPU cache. While this already makes the routine less memory bound, there is a much bigger advantage: Since most MD simulations employ PBC, the maximum distance considered for RDFs is half the box length, and therefore, only distances within that cut-off are of interest. In the distance_histogram method, the square of the distances is computed and checked to lie within the given range before taking square roots. This means that in a cubic, homogeneous system, one can save almost half of the square root evaluations.
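A rough pure-NumPy rendering of those two ideas (processing one reference atom at a time instead of storing the full matrix, and filtering on squared distances before taking square roots); the actual implementation is in C, so this sketch only illustrates the logic:

import numpy as np

def distance_histogram(xyz_a, xyz_b, box, r_max, n_bins):
    edges = np.linspace(0.0, r_max, n_bins + 1)
    hist = np.zeros(n_bins, dtype=np.int64)
    r_max2 = r_max * r_max
    for a in xyz_a:                       # O(N) memory instead of O(N^2)
        d = xyz_b - a
        d -= box * np.round(d / box)      # orthorhombic minimum image
        r2 = np.einsum('ij,ij->i', d, d)
        r = np.sqrt(r2[r2 <= r_max2])     # sqrt only for distances in range
        hist += np.histogram(r, bins=edges)[0]
    return hist, edges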
Further improvements are:

  • Coordinate memory layout is changed (that's only O(N)) so that atom coordinates can be processed in blocks of 32 particles, allowing exploitation of SIMD acceleration (SSE/AVX units up to AVX1024)
  • Cache line optimization (so far only for 64 byte cache line size)
  • Memory locality (make the most out of what's already in the CPU cache)
  • Prevention of false sharing (when using OpenMP)
  • Faster vectorized minimum image computations

The last point is exactly what is implemented in minimum_image_ortho_lazy. It's called lazy because the routine assumes that distances can never exceed 1.5 times the box length. This is ensured by packing all coordinates in the box prior to RDF computation (again, that's only O(N)).
Within the minimum_image_ortho_lazy method, vectorized CMOVs are used instead of branching to account for PBC.
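For illustration, the lazy minimum image idea expressed in NumPy (the real routine is vectorized C; the branch-free np.where calls stand in for the CMOVs, and the function name here is illustrative, not the actual API):

import numpy as np

def minimum_image_lazy_sketch(d, box):
    # valid only when |d| < 1.5 * box per component, i.e. after all
    # coordinates have been packed into the primary box; a single shift
    # by +/- box then brings every component into [-box/2, box/2]
    d = np.where(d > 0.5 * box, d - box, d)
    d = np.where(d < -0.5 * box, d + box, d)
    return d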

My latest benchmarks for groups containing 5000 atoms suggest a speedup factor of about 2.6 for single core, about 7.6 for OpenMP on 4 cores with hyperthreading, and about 18 when used in a trivially parallelized MPI implementation (one process per trajectory frame) on the same machine (overcommitted with 12 MPI processes). The latter is probably something for pmda, though.

Please be aware that AVX acceleration will only be used when compiled with -march=native; that might affect the timings a bit.

Concerning the checks you mentioned that numpy.histogram performs, can you elaborate on that? I simply check the range and keep numpy compliance by having a closed interval in the last RDF bin (even though I think the latter is complete BS). I use int64_t for the histograms and therefore don't check for overflow, since that's extremely unlikely.

There's certainly still room for improvement since my code doesn't scale well beyond 4 cores with OpenMP (but with MPI it does), and it only ever uses brute force neighbor search, which is probably its weakest point.

I'll try to get this thing production-ready as soon as possible; maybe we can use it to replace the current brute-force computations. Comments/criticism/opinions, and especially ideas for improvements and for how this could be combined with e.g. grid search methods, are very welcome!

@richardjgowers (Member) commented Aug 13, 2018

@zemanj all those changes sound like they're compatible (i.e. the speedups will be multiplicative) with what @ayushsuhane and @seb-buch have done so far. We've implemented a domain decomposition method: rather than comparing all pairs, we spatially bin the particles and only compare nearby particles, as sketched below. What you've done can be applied to the bin<->bin distance calculation (with some hacking...).
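A toy sketch of that spatial binning idea (this is not the actual nsgrid implementation; it assumes an orthorhombic box with coordinates already wrapped inside it):

import numpy as np

def neighbour_pairs(xyz, box, r_max):
    # bin particles into cells at least r_max wide
    ncell = np.maximum((box // r_max).astype(int), 1)
    cell_of = (xyz // (box / ncell)).astype(int) % ncell
    cells = {}
    for i, c in enumerate(map(tuple, cell_of)):
        cells.setdefault(c, []).append(i)

    pairs = set()
    for c, members in cells.items():
        for off in np.ndindex(3, 3, 3):   # same cell plus the 26 neighbours
            nb = tuple((np.asarray(c) + np.asarray(off) - 1) % ncell)
            for i in members:
                for j in cells.get(nb, []):
                    if i >= j:
                        continue
                    d = xyz[i] - xyz[j]
                    d -= box * np.round(d / box)   # minimum image
                    if d @ d <= r_max * r_max:
                        pairs.add((i, j))
    return pairs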

@zemanj (Member) commented Aug 13, 2018

@richardjgowers Let's hope it doesn't end up being too hacky in the end... I'll look into it once this goes in.


@ayushsuhane ayushsuhane force-pushed the ayushsuhane:rdf branch from e8d9b5b to 8146415 Aug 14, 2018
@ayushsuhane ayushsuhane force-pushed the ayushsuhane:rdf branch from 8146415 to 538187f Aug 14, 2018
A Member left a review comment:

Thanks @ayushsuhane !

@ayushsuhane (Contributor, Author) commented Aug 15, 2018

@richardjgowers Thank you for the constant feedback and support.

@richardjgowers richardjgowers merged commit e9ee6be into MDAnalysis:develop Aug 15, 2018
5 checks passed
  • Travis CI - Pull Request: Build Passed
  • codecov/patch: 100% of diff hit (target 89.12%)
  • codecov/project: Absolute coverage decreased by -<.01% but relative coverage increased by +10.87% compared to 6f34d3c
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
@ayushsuhane ayushsuhane deleted the ayushsuhane:rdf branch Aug 15, 2018