
Parallel distance calculations #261

Closed
wants to merge 1 commit

Conversation

richardjgowers
Member

So currently we have a parallel version of distance_array, but it is written as separate Cython code, and a parallel version of a function only exists if someone writes that separate piece of code for it.

By putting the OpenMP statements behind conditional compilation options, we can use the same C code twice, which generates a serial and a parallel version of each function. This is nice as it keeps the code DRY, and it also means all functions exist in both the serial and parallel namespaces, regardless of whether the OpenMP directives have been written yet.

This needs to have a separate .pyx file, and will generate a different submodule. This is a little confusing, and arguably there should be a single point of reference for people accessing these functions. It would be nice to change the call signature to something like

MDAnalysis.analysis.distances.distance_array(a, b, parallel=True)

analysis.distances would then choose the correct Cython backend. Using Flags to default to parallel or serial is also possible.
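
A rough sketch of such a front end, assuming the serial and parallel Cython backends keep the module paths used in this PR (the wrapper function itself is only illustrative, not the implementation):

def distance_array(reference, configuration, parallel=False):
    # Hypothetical front end: pick the serial or the OpenMP backend based
    # on the `parallel` keyword and delegate the actual work to it.
    if parallel:
        from MDAnalysis.core.parallel.distances import distance_array as _backend
    else:
        from MDAnalysis.core.distances import distance_array as _backend
    return _backend(reference, configuration)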

So questions that need answering are:

  • Will this approach work on all platforms?
  • Will it fail properly on platforms without openmp support?
    • I've tried to make it just compile serially if it can't do openmp
  • How should we manage serial/parallel versions of identical functions in our namespace?

@richardjgowers
Member Author

Quick benchmarks:

import numpy as np
from MDAnalysis.core.parallel.distances import distance_array as da_p
from MDAnalysis.core.distances import distance_array as da

N = 1000

a = np.random.random(N * 3).reshape(N, 3).astype(np.float32)
b = np.random.random(N * 3).reshape(N, 3).astype(np.float32)

In [1]: r1 = da(a, b)

In [2]: r2 = da_p(a, b)

In [3]: r1 == r2
Out[3]:
array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]], dtype=bool)

In [4]: %timeit da(a, b)
100 loops, best of 3: 19.4 ms per loop

In [5]: %timeit da_p(a, b)
100 loops, best of 3: 4.05 ms per loop

This is using 8 cores... I think the forced copy (done in serial) is slowing us down a lot.

@hainm
Contributor

hainm commented Apr 30, 2015

If there is a bunch of frames, I think it's better to split them across different cores; the parallel scaling is better.

I did this for pytraj (vs. mdtraj). mdtraj's rmsd calculation is much faster than pytraj's, but when using the same n_cores pytraj is much faster, since it uses an embarrassingly parallel calculation.

http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/parallel/rmsd_mpi.ipynb

@richardjgowers
Member Author

I'm not saying this is the fastest existing option, but it definitely has a place.

That's a very cool notebook though.

Is this line:

traj = io.load(fname, top_name)

Loading the entire trajectory into memory?

@hainm
Contributor

hainm commented Apr 30, 2015

pytraj has two Trajectory objects: TrajReadOnly (behaves like Universe) and FrameArray (loads a chunk of frames into memory).

For the example above, traj = io.load(fname, top_name) only does general checking (whether the file exists, loading the Topology, checking n_frames, ...) but does not load any frames from disk.

There is a reason I designed it this way (I am thinking about a version using numpy arrays like mdtraj, but I am still benchmarking the 3 packages to find the best combination). You can check the difference in speed between the two types of objects in this notebook.

http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/speed_test_2_trajs.ipynb

Hai

@orbeckst
Member

orbeckst commented May 2, 2015

The parallel version has a place, in particular for distance selections. If the parallel version will do PBC then we can use it as a drop-in replacement.

We should have a way to set the number of threads, perhaps using core.flags. There should also be a way to tell the distance-based selections to default to parallel or single versions.

I'm definitely with @hainm that the best case is to do the work pleasingly parallel on chunks, but sometimes it is just easier to do it serially, and then distance_array is often the bottleneck (e.g. for LeafletFinder), so a faster option will always be helpful (just maybe not optimal).

@richardjgowers : I haven't had time to look at the code and don't have final answers for your questions but would raise the following points for discussion:

  1. Will this approach work on all platforms?
     • Don't know. Maybe try it in virtual machines? Have some other people try it; let people know how to run a test or gist a test script.
  2. Will it fail properly on platforms without OpenMP support?
     • "I've tried to make it just compile serially if it can't do openmp"
     • That's probably ok, and we've been using this approach previously (I think @rmcgibbo added that).
  3. How should we manage serial/parallel versions of identical functions in our namespace?
     • I like the parallel keyword. Perhaps let it take arguments such as False (= serial), True (= parallel with as many threads as possible), 'auto' (== True), <integer> (= number of threads), None (default to the global default). We could make this a common keyword in the API with a bunch of 'NotImplementedError's, at least for the cases where we'd eventually want to have threaded versions; a sketch of how these values could be resolved follows below.
     • There should be some global defaults in core.flags for parallel, such as core.flags['parallel'] = False.
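
As a rough sketch of that keyword scheme only (the flags dict and helper function here are illustrative stand-ins, not existing MDAnalysis code):

import multiprocessing

flags = {'parallel': False}   # illustrative stand-in for a core.flags global default

def resolve_threads(parallel=None):
    # Hypothetical helper: turn the proposed `parallel` keyword values
    # into a thread count for an OpenMP backend.
    if parallel is None:                       # defer to the global default
        parallel = flags['parallel']
    if parallel is False:
        return 1                               # serial
    if parallel is True or parallel == 'auto':
        return multiprocessing.cpu_count()     # as many threads as possible
    if isinstance(parallel, int):
        return parallel                        # explicit number of threads
    raise NotImplementedError("parallel=%r is not supported" % (parallel,))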

@rmcgibbo
Contributor

rmcgibbo commented May 2, 2015

To the best of my recollection, the prange construct in cython compiles down to openmp pragmas in C/C++ that are guarded by ifdef _OPENMP, so the code compiles fine on compilers that don't support OpenMP (e.g. clang), without doing anything special. If you want to explicitly use functions from openmp, then you do need to do some conditional compilation (e.g. http://docs.cython.org/src/userguide/parallelism.html#using-openmp-functions).

Obviously this isn't available as an option when you're using the pragmas yourself in C (as here), but ifdef guarding with _OPENMP is the standard thing, since that macro will be automatically set by the compiler if OpenMP is activated and supported.

@hainm
Contributor

hainm commented May 2, 2015

Just want to confirm what @rmcgibbo is saying. I got a cythonized file without OpenMP:

https://github.com/pytraj/pytraj/blob/master/pytraj/ActionList.cpp#L284

@hainm
Contributor

hainm commented May 2, 2015

@orbeckst I agree about parallelism for a single Frame. I never thought about the fact that we also need to perform analysis with >100K atoms.

Hai

@richardjgowers
Member Author

So to save everyone having to read C code... the tl;dr of the commit is that our C distances library (src/numtools/calc_distances.h) can now look like

#ifdef PARALLEL
#include <omp.h>
#endif

static void some_function(args)
{
#ifdef PARALLEL
  // This only gets added if PARALLEL was defined when calling the C compiler
  #pragma omp parallel for
#endif
  for (i = 0; i < n; i++) {
    // do math
  }
}

Our setup.py then builds two extensions which both refer to calc_distances.h, but one pass includes the parallel stuff (i.e. -DPARALLEL).
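
For reference, the two-extension build could look roughly like the sketch below (extension names, .pyx file names and compiler flags are placeholders for whatever the real setup.py uses, and the Cython compilation step is omitted):

from setuptools import Extension

# Both extensions are built from the same C code in src/numtools; only the
# second pass defines PARALLEL and enables OpenMP.
serial_ext = Extension('MDAnalysis.core.distances',
                       sources=['src/numtools/distances.pyx'],
                       include_dirs=['src/numtools'])

parallel_ext = Extension('MDAnalysis.core.parallel.distances',
                         sources=['src/numtools/distances_parallel.pyx'],
                         include_dirs=['src/numtools'],
                         define_macros=[('PARALLEL', None)],
                         extra_compile_args=['-fopenmp'],
                         extra_link_args=['-fopenmp'])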

I'll add in selecting the number of cores and flags (thanks @orbeckst), then I'll just need someone to test on MacOS.

@richardjgowers
Member Author

So for this what I think I'll do is add a keyword to these Cython functions which will do as Oliver suggested. The other cool performance stuff should and will become a different PR.

It might be a good juncture to tidy the namespace, I'd like to:

  • move core.distances front ends into analysis.distances
  • make core._distances and core._distances_parallel extremely thin Cython wrappers (no checking)
  • analysis.distances then handles kwargs, does checks and calls appropriate core._* backend
  • deprecate core.distances in favour of analysis.distances

So analysis.distances is more the "user" module and core._distances the "dev" one. This will also centralise the distance related code (currently split in two between core and analysis).

So I'll do this in the next few days unless someone objects wildly

@richardjgowers richardjgowers self-assigned this May 11, 2015
@richardjgowers richardjgowers added this to the 0.11 milestone May 11, 2015
@hainm
Contributor

hainm commented May 11, 2015

Something like this (which is similar to mdtraj)?

https://github.com/pytraj/pytraj/blob/master/pytraj/Frame.pyx#L978

After your implementation, can you post your benchmark? The OpenMP stuff does not really make the distance calculation a lot faster (and the scaling is poor too). I used 8 cores but only got a speed-up of 3-4 times for a 10**8 atom-pair calculation.

Hai

@orbeckst
Member

@richardjgowers: Keep in mind that selections need distance_array() and all of this needs to run without MDAnalysis.analysis. We do not want any dependencies on MDAnalysis.analysis in the core (in the future we might split that part off to make the core library light-weight).

As long as the above is fulfilled I'm happy with restructuring. Also note that we have core.transformations, which does somewhat similar calculations, your fast dihedral and angle calculations, and then there's the KD-tree. Perhaps we should think about bundling all of them under MDAnalysis.lib or numerical or similar?

Oliver


@richardjgowers
Member Author

I get pretty close to n times faster in the thin wrapper; it's just the copy statement (done in serial) that kills it.

@richardjgowers
Member Author

@orbeckst ahhh, maybe it is more complicated than I hoped. Our numerics should all be dependency-free... so bundling them into lib or numerical sounds smart. I'll see what makes most sense.

@hainm
Contributor

hainm commented May 11, 2015

@richardjgowers:

It would be interesting if you could include iteration time too. For example, mdtraj's distance calculation is about 3-4 times faster than pytraj's for a single Frame, or if the whole trajectory is loaded into memory (thanks to their SSE stuff), but that's not true when iterating over each Frame or loading a chunk of data into memory at a time. (Is MDAnalysis using frame-based iteration?)

You can check my notebook here (my testing system is not that huge though, only ~17K atoms, 1000 frames). Anyway, I just want to make the point that sometimes the bottleneck is in I/O.
http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/dev/speed_test_distance.ipynb
Hai

@dotsdl
Member

dotsdl commented May 11, 2015

(Is MDAnalysis using frame-based iteration?)

Not to add noise to the conversation, but perhaps related to #238?

@richardjgowers
Member Author

Ok, so with commit bb7f382 I've moved the Cython interface to MDAnalysis.lib._distances. This is now an extremely lightweight Cython interface to the C header.

MDAnalysis.core.distances is now a pure python module, which just imports this new distance module in lib.

The parallel distances in this PR will become MDAnalysis.lib._parallel_distances

MDAnalysis.core.distances will then choose which Cython backend to use.

@orbeckst
Member

Sounds good, although how about adding more order right away, at least for serial/parallel versions:

lib.parallel._distances
lib.serial._distances

especially if we do more of this stuff in the future; we might end up with lib.cuda, lib.mpi, or lib.parallel.cuda, lib.parallel.openmp ... Not sure if this is already overspecifying things, but something along those lines would make it a bit easier to slot new stuff in.

Alternatively, focus at the top level on the functionality and e.g.

lib.geometry.distances.openmp
lib.geometry.distances.serial

etc., but then almost every submodule is named serial, which seems a bit dumb. (Or do 'from . import distances_serial as serial'.)

Opinions???


@orbeckst
Member

@richardjgowers, should we just close the pull request, or does it still contain stuff that is not in bb7f382?

For the organization of lib I'd prefer (along your original lines):

lib.distances.openmp
lib.distances.serial

because that leaves us open for lib.distances.cuda etc. while also being pretty explicit. In any case, the user frontend in core.distances will then pick the right lower-level routines, but if one really wants to, one can also use the ones in lib.
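
To make the "fail properly without OpenMP" behaviour explicit under that layout, the package could simply fall back to the serial build when the OpenMP extension is missing; a minimal sketch, with hypothetical module and function names:

# Hypothetical lib/distances/__init__.py
from . import serial

try:
    from . import openmp
    HAS_OPENMP = True
except ImportError:          # the OpenMP extension was not built
    openmp = None
    HAS_OPENMP = False

def get_backend(parallel=False):
    # Return the requested backend module, falling back to serial when the
    # OpenMP build is unavailable.
    if parallel and HAS_OPENMP:
        return openmp
    return serial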

@richardjgowers
Member Author

Yeah the branch in the original PR is outdated now, so I'll close this.

@orbeckst Ok I'll organise like that. If anyone wants to buy me a CUDA rig to play on I'll write distances.cuda too!

@orbeckst
Member

Hypothetically speaking, what would you like :-) ?

@richardjgowers
Member Author

Haha, I think I was using M2050s before. Of course I'd need 2 to test the multi-GPU performance.

@orbeckst
Member

GTX 690s are still pretty decent in our hands.

@richardjgowers richardjgowers deleted the feature-newparallel branch May 27, 2015 07:30