
Parallel distance calculations #261

Closed
wants to merge 1 commit

Conversation

richardjgowers
Member

So currently we have a parallel version of distance_array, but it is written as separate Cython code, and a parallel version of a function only exists if someone writes that separate piece of code for it.

By putting the OpenMP statements behind conditional compilation options, we can use the same C code twice, which generates a serial and a parallel version of each function. This is nice as it keeps the code DRY, and it also means all functions exist in both the serial and parallel namespaces, regardless of whether the OpenMP directives have been written yet.

This needs to have a separate .pyx file, and will generate a different submodule. This is a little confusing, and arguably there should be a single point of reference for people accessing these functions. It would be nice to change the call signature to something like

MDAnalysis.analysis.distances.distance_array(a, b, parallel=True)

analysis.distances would then choose the correct Cython backend. Using Flags to default to parallel or serial is also possible.
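
A rough sketch of such a front end, assuming the serial and parallel Cython backends keep the module paths used in this PR (the wrapper function itself is only illustrative, not the implementation):

def distance_array(reference, configuration, parallel=False):
    # Hypothetical front end: pick the serial or the OpenMP backend based
    # on the `parallel` keyword and delegate the actual work to it.
    if parallel:
        from MDAnalysis.core.parallel.distances import distance_array as _backend
    else:
        from MDAnalysis.core.distances import distance_array as _backend
    return _backend(reference, configuration)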

So questions that need answering are:

  • Will this approach work on all platforms?
  • Will it fail properly on platforms without openmp support?
    • I've tried to make it just compile serially if it can't do openmp
  • How should we manage serial/parallel versions of identical functions in our namespace?

@richardjgowers
Member Author

Quick benchmarks:

import numpy as np
from MDAnalysis.core.parallel.distances import distance_array as da_p
from MDAnalysis.core.distances import distance_array as da

N = 1000

a = np.random.random(N * 3).reshape(N, 3).astype(np.float32)
b = np.random.random(N * 3).reshape(N, 3).astype(np.float32)

In [1]: r1 = da(a, b)

In [2]: r2 = da_p(a, b)

In [3]: r1 == r2
Out[3]:
array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]], dtype=bool)

In [4]: %timeit da(a, b)
100 loops, best of 3: 19.4 ms per loop

In [5]: %timeit da_p(a, b)
100 loops, best of 3: 4.05 ms per loop

This is using 8 cores... I think the forced copy (done in serial) is slowing us down a lot.

@hainm
Contributor

hainm commented Apr 30, 2015

If there is a bunch of frames, I think it's better to split them across different cores; the parallel scaling is better.

I did this for pytraj (vs. mdtraj). mdtraj's rmsd calculation is much faster than pytraj's, but when using the same n_cores pytraj is much faster, since it uses an embarrassingly parallel calculation.

http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/parallel/rmsd_mpi.ipynb

@richardjgowers
Member Author

I'm not saying this is the fastest existing option, but it definitely has a place.

That's a very cool notebook though.

Is this line:

traj = io.load(fname, top_name)

Loading the entire trajectory into memory?

@hainm
Contributor

hainm commented Apr 30, 2015

pytraj has two Trajectory objects: TrajReadOnly (behaves like Universe) and FrameArray (loads a chunk of frames into memory).

For the example above, traj = io.load(fname, top_name) only does general checking (whether the file exists, loading the Topology, checking n_frames, ...) but does not load any frames from disk.

There is a reason I designed it this way (I am thinking about a version using numpy arrays like mdtraj, but I am still benchmarking the 3 packages to find the best combination). You can check the difference in speed between the two types of objects in this notebook.

http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/speed_test_2_trajs.ipynb

Hai

@orbeckst
Member

orbeckst commented May 2, 2015

The parallel version has a place, in particular for distance selections. If the parallel version will do PBC then we can use it as a drop-in replacement.

We should have a way to set the number of threads, perhaps using core.flags. There should also be a way to tell the distance-based selections to default to parallel or single versions.

I'm definitely with @hainm that the best case is to do the work pleasingly parallel on chunks, but sometimes it is just easier to do it serially, and then distance_array is often the bottleneck (e.g. for LeafletFinder), so a faster option will always be helpful (just maybe not optimal).

@richardjgowers : I haven't had time to look at the code and don't have final answers for your questions but would raise the following points for discussion:

  1. Will this approach work on all platforms?
     • Don't know. Maybe try it in virtual machines? Have some other people try it; let people know how to run a test or gist a test script.
  2. Will it fail properly on platforms without OpenMP support?
     • "I've tried to make it just compile serially if it can't do openmp"
     • That's probably ok, and we've been using this approach previously (I think @rmcgibbo added that).
  3. How should we manage serial/parallel versions of identical functions in our namespace?
     • I like the parallel keyword. Perhaps let it take arguments such as False (= serial), True (= parallel with as many threads as possible), 'auto' (== True), <integer> (= number of threads), None (default to the global default). We could make this a common keyword in the API with a bunch of 'NotImplementedError's, at least for the cases where we'd eventually want to have threaded versions; a sketch of how these values could be resolved follows below.
     • There should be some global defaults in core.flags for parallel, such as core.flags['parallel'] = False.
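
As a rough sketch of that keyword scheme only (the flags dict and helper function here are illustrative stand-ins, not existing MDAnalysis code):

import multiprocessing

flags = {'parallel': False}   # illustrative stand-in for a core.flags global default

def resolve_threads(parallel=None):
    # Hypothetical helper: turn the proposed `parallel` keyword values
    # into a thread count for an OpenMP backend.
    if parallel is None:                       # defer to the global default
        parallel = flags['parallel']
    if parallel is False:
        return 1                               # serial
    if parallel is True or parallel == 'auto':
        return multiprocessing.cpu_count()     # as many threads as possible
    if isinstance(parallel, int):
        return parallel                        # explicit number of threads
    raise NotImplementedError("parallel=%r is not supported" % (parallel,))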

@rmcgibbo
Contributor

rmcgibbo commented May 2, 2015

To the best of my recollection, the prange construct in cython compiles down to openmp pragmas in C/C++ that are guarded by ifdef _OPENMP, so the code compiles fine on compilers that don't support OpenMP (e.g. clang), without doing anything special. If you want to explicitly use functions from openmp, then you do need to do some conditional compilation (e.g. http://docs.cython.org/src/userguide/parallelism.html#using-openmp-functions).

Obviously this isn't available as an option when you're using the pragmas yourself in C (as here), but ifdef guarding with _OPENMP is the standard thing, since that macro will be automatically set by the compiler if OpenMP is activated and supported.

@hainm
Contributor

hainm commented May 2, 2015

Just want to confirm what @rmcgibbo is saying. I got a cythonized file without OpenMP:

https://github.com/pytraj/pytraj/blob/master/pytraj/ActionList.cpp#L284

@hainm
Contributor

hainm commented May 2, 2015

@orbeckst I agree about parallelism for a single Frame. I never thought about the fact that we also need to perform analysis with >100K atoms.

Hai

@richardjgowers
Member Author

So to save everyone having to read C code... the tl;dr of the commit is that our C distances library (src/numtools/calc_distances.h) can now look like

#ifdef PARALLEL
#include <omp.h>
#endif

static void some_function(args)
{
#ifdef PARALLEL
  // This only gets added if PARALLEL was defined when calling the C compiler
  #pragma omp parallel for
#endif
  for (i = 0; i < n; i++) {
    // do math
  }
}

Our setup.py then builds two extensions which both refer to calc_distances.h, but one pass includes the parallel stuff (i.e. -DPARALLEL).
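
For reference, the two-extension build could look roughly like the sketch below (extension names, .pyx file names and compiler flags are placeholders for whatever the real setup.py uses, and the Cython compilation step is omitted):

from setuptools import Extension

# Both extensions are built from the same C code in src/numtools; only the
# second pass defines PARALLEL and enables OpenMP.
serial_ext = Extension('MDAnalysis.core.distances',
                       sources=['src/numtools/distances.pyx'],
                       include_dirs=['src/numtools'])

parallel_ext = Extension('MDAnalysis.core.parallel.distances',
                         sources=['src/numtools/distances_parallel.pyx'],
                         include_dirs=['src/numtools'],
                         define_macros=[('PARALLEL', None)],
                         extra_compile_args=['-fopenmp'],
                         extra_link_args=['-fopenmp'])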

I'll add in selecting the number of cores and flags (thanks @orbeckst), then I'll just need someone to test on MacOS.

@richardjgowers
Member Author

So for this what I think I'll do is add a keyword to these Cython functions which will do as Oliver suggested. The other cool performance stuff should and will become a different PR.

It might be a good juncture to tidy the namespace, I'd like to:

  • move core.distances front ends into analysis.distances
  • make core._distances and core._distances_parallel extremely thin Cython wrappers (no checking)
  • analysis.distances then handles kwargs, does checks and calls appropriate core._* backend
  • deprecate core.distances in favour of analysis.distances

So analysis.distances is more the "user" module and core._distances the "dev" one. This will also centralise the distance related code (currently split in two between core and analysis).

So I'll do this in the next few days unless someone objects wildly

@richardjgowers richardjgowers self-assigned this May 11, 2015
@richardjgowers richardjgowers added this to the 0.11 milestone May 11, 2015
@hainm
Contributor

hainm commented May 11, 2015

Something like this (which is similar to mdtraj)?

https://github.com/pytraj/pytraj/blob/master/pytraj/Frame.pyx#L978

After your implementation, can you post your benchmark? The OpenMP stuff does not really make the distance calculation a lot faster (and the scaling is poor too). I used 8 cores but only got a speed-up of 3-4 times for a 10**8 atom-pair calculation.

Hai

@orbeckst
Member

@richardjgowers: Keep in mind that selections need distance_array() and all of this needs to run without MDAnalysis.analysis. We do not want any dependencies on MDAnalysis.analysis in the core (in the future we might split that part off to make the core library light-weight).

As long as the above is fulfilled I'm happy with restructuring. Also note that we have core.transformations, which does somewhat similar calculations, your fast dihedral and angle calculations, and then there's the KD-tree. Perhaps we should think about bundling all of them under MDAnalysis.lib or numerical or similar?

Oliver


@richardjgowers
Member Author

I get pretty close to n times faster in the thin wrapper; it's just the copy statement (done in serial) that kills it.

@richardjgowers
Member Author

@orbeckst ahhh, maybe it is more complicated than I hoped. Our numerics should all be dependency-free... so bundling them into lib or numerical sounds smart. I'll see what makes most sense.

@hainm
Contributor

hainm commented May 11, 2015

@richardjgowers:

It would be interesting if you could include iteration time too. For example, mdtraj's distance calculation is about 3-4 times faster than pytraj's for a single Frame, or if the whole trajectory is loaded into memory (thanks to their SSE stuff), but that's not true when iterating over each Frame or loading a chunk of data into memory at a time. (Is MDAnalysis using frame-based iteration?)

You can check my notebook here (my testing system is not that huge though, only ~17K atoms, 1000 frames). Anyway, I just want to make the point that sometimes the bottleneck is in I/O.
http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/dev/speed_test_distance.ipynb
Hai

@dotsdl
Member

dotsdl commented May 11, 2015

(Is MDAnalysis using frame-based iteration?)

Not to add noise to the conversation, but perhaps related to #238?

@richardjgowers
Member Author

Ok, so with commit bb7f382 I've moved the Cython interface to MDAnalysis.lib._distances. This is now an extremely lightweight Cython interface to the C header.

MDAnalysis.core.distances is now a pure python module, which just imports this new distance module in lib.

The parallel distances in this PR will become MDAnalysis.lib._parallel_distances

MDAnalysis.core.distances will then choose which Cython backend to use.

@orbeckst
Member

Sounds good, although how about adding more order right away, at least for serial/parallel versions:

lib.parallel._distances
lib.serial._distances

especially if we do more of this stuff in the future; we might end up with lib.cuda, lib.mpi, or lib.parallel.cuda, lib.parallel.openmp ... Not sure if this is already overspecifying things, but something along those lines would make it a bit easier to slot new stuff in.

Alternatively, focus at the top level on the functionality and e.g.

lib.geometry.distances.openmp
lib.geometry.distances.serial

etc., but then almost every submodule is named serial, which seems a bit dumb. (Or do 'from . import distances_serial as serial'.)

Opinions???


@orbeckst
Member

@richardjgowers, should we just close the pull request, or does it still contain stuff that is not in bb7f382?

For the organization of lib I'd prefer (along your original lines):

lib.distances.openmp
lib.distances.serial

because that leaves us open for lib.distances.cuda etc. while also being pretty explicit. In any case, the user frontend in core.distances will then pick the right lower-level routines, but if one really wants to, one can also use the ones in lib.
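
To make the "fail properly without OpenMP" behaviour explicit under that layout, the package could simply fall back to the serial build when the OpenMP extension is missing; a minimal sketch, with hypothetical module and function names:

# Hypothetical lib/distances/__init__.py
from . import serial

try:
    from . import openmp
    HAS_OPENMP = True
except ImportError:          # the OpenMP extension was not built
    openmp = None
    HAS_OPENMP = False

def get_backend(parallel=False):
    # Return the requested backend module, falling back to serial when the
    # OpenMP build is unavailable.
    if parallel and HAS_OPENMP:
        return openmp
    return serial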

@richardjgowers
Member Author

Yeah the branch in the original PR is outdated now, so I'll close this.

@orbeckst Ok I'll organise like that. If anyone wants to buy me a CUDA rig to play on I'll write distances.cuda too!

@orbeckst
Member

Hypothetically speaking, what would you like :-) ?

@richardjgowers
Member Author

Haha, I think I was using M2050s before. Of course I'd need 2 to test the multi-GPU performance.

@orbeckst
Member

GTX 690s are still pretty decent in our hands.

@richardjgowers richardjgowers deleted the feature-newparallel branch May 27, 2015 07:30