Improving Parcels efficiency #553
The next main goal in Parcels development is to obtain an efficient parallel version. To get there, this issue presents the code structure, its strengths and weaknesses, and how we aim to tackle the different challenges. The point of this issue is to open the discussion so that we use the latest available libraries and best Python practice in Parcels development.
To better understand Parcels' general profile, different applications will be profiled. This can be done using
The first profile is a 2-month simulation of floating MP in the North Sea:
The results are available here:
But this still implies loading contiguous data. So if one particle is located at 170° and another one at -170°, should the full domain -170:170 be loaded? This would be annoying.
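To make the 170° / -170° example concrete, here is a minimal sketch of how the narrowest longitude window could be computed if the window were allowed to wrap across the dateline. The function name `minimal_lon_window` is hypothetical, and whether Parcels fields can actually load such a wrapped window is an assumption, not something established in this thread.

```python
def minimal_lon_window(lons):
    """Return (start, width) in degrees of the narrowest window on the
    circle that covers all given longitudes, allowing a dateline wrap."""
    # Map everything to [0, 360) and drop duplicates.
    pts = sorted({lon % 360.0 for lon in lons})
    if len(pts) == 1:
        return pts[0], 0.0
    # Gaps between consecutive points, including the wrap-around gap.
    n = len(pts)
    gaps = [(pts[(i + 1) % n] - pts[i]) % 360.0 for i in range(n)]
    # The narrowest window is the circle minus the largest empty gap.
    i_max = max(range(n), key=gaps.__getitem__)
    start = pts[(i_max + 1) % n]
    width = 360.0 - gaps[i_max]
    return start, width


if __name__ == "__main__":
    print(minimal_lon_window([170, -170]))
```

With a wrapped window the two particles need only a 20° wide band starting at 170°, instead of the 340° wide contiguous range.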
This option was presented in PR #265. In the particle loop, the particles are distributed on different threads, sharing the same memory. Fieldset is shared by all threads.
But how to do it efficiently for large input data applications?
This option requires the most development, but it gives a lot of control over the load balance and over how the particles are distributed across the different processors (see simple proof of concept here https://github.com/OceanParcels/bare-bones-parcels)
A note on repeatability and pseudo-random numbers
(I've brought this up in #265 as well.) Usually, it is recommended to conduct all experiments that include pseudo-randomness in a way that ensures repeatability, by specifying the initial state of the random number generator that is used. (See, e.g., https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285#s7.) To be fully deterministic, both the initial state of the pseudo-random-number generator and the sequence of computation need to be fully specified. The latter is often sacrificed to optimize occupancy of the allocated resources. OpenMP's
I'm (slowly) giving up my resistance against this trend. But I do think that as soon as one enters the terrain of non-determinism, this has to be communicated very clearly.
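As a minimal illustration of the repeatability point, the sketch below fixes both ingredients: the initial state of the generator (an explicit seed) and the sequence of computation (a plain serial loop). It uses only Python's standard `random` module; `random_walk` is a hypothetical helper and not Parcels code.

```python
import random


def random_walk(seed, n_steps):
    """1D random walk driven by a private, explicitly seeded generator."""
    rng = random.Random(seed)  # explicit initial state, no global RNG
    x = 0.0
    path = []
    for _ in range(n_steps):   # fixed, serial order of computation
        x += rng.uniform(-1.0, 1.0)
        path.append(x)
    return path


if __name__ == "__main__":
    # Same seed and same order of operations: identical trajectories.
    print(random_walk(42, 5) == random_walk(42, 5))
```

Once the order in which random numbers are drawn depends on thread scheduling, the second ingredient is lost even if the seed is fixed.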
Load balancing without tiling
There are constraints on the Physics and Math side of Parcels that we can leverage to balance load without falling back to explicit domain decomposition: For any given distribution of particles, we can estimate the largest possible deformation after a given number of time steps. When parallelizing Parcels by distributing
This is a low-hanging fruit. #265 looks very promising. And it also is largely orthogonal to all other efforts towards distribution of Parcels experiments. When distributed with Dask, we can still use internal parallelization of Parcels experiments with OpenMP. The same is true for MPI, where it would be possible to have hybrid jobs with MPI across nodes and OpenMP on shared memory portions of the cluster.
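The bound on the largest possible deformation can be sketched as follows, assuming pure advection with a speed that never exceeds a known `u_max`, planar coordinates in metres, and a fixed time step. Under those assumptions no particle can travel further than `u_max * dt * n_steps`, so the region of the fields a group of particles can reach is its bounding box padded by that distance. The helper name and the planar simplification are illustrative assumptions; stochastic terms such as diffusion are not covered by this bound.

```python
def padded_bounding_box(positions, u_max, dt, n_steps):
    """Region that particles at `positions` (x, y in metres) can reach
    within n_steps steps of length dt, given a speed bound u_max (m/s)."""
    reach = u_max * dt * n_steps  # upper bound on any displacement
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    return (min(xs) - reach, max(xs) + reach,
            min(ys) - reach, max(ys) + reach)


if __name__ == "__main__":
    box = padded_bounding_box([(0.0, 0.0), (1000.0, 500.0)],
                              u_max=2.0, dt=300.0, n_steps=10)
    print(box)
```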
Handling Scipy mode as first-class citizen
I think it is very important to make it as easy as possible to not use the built-in JIT functionality of Parcels. There are two reasons for this:
Fields should be cut into chunks, so that we don't have to load the full data.
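A minimal sketch of the chunking idea, assuming a regular 2D grid split into rectangular chunks and particles whose grid cell indices `(j, i)` are already known. Only chunks that currently contain a particle would need to be loaded. The function name and the chunk layout are assumptions for illustration, not the Parcels design.

```python
def chunks_to_load(particle_cells, chunk_size):
    """Return the set of chunk indices (cj, ci) that contain at least
    one particle, given particle cell indices (j, i) and chunk shape."""
    size_j, size_i = chunk_size
    return {(j // size_j, i // size_i) for j, i in particle_cells}


if __name__ == "__main__":
    cells = [(0, 0), (5, 130), (6, 131)]
    print(sorted(chunks_to_load(cells, chunk_size=(64, 64))))
```

A real implementation would also need the neighbouring cells used for interpolation, so particles close to a chunk edge may require the adjacent chunk as well.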
How to do that? Here comes one possibility (to be discussed)
We get rid of
Is there some better solution?
About OpenMP and randomness:
If I understood correctly: What we want is (1) a code that is reproducible, (2) not the same random sequences for each thread?
I was thinking that we can simply seed the random generator for each particle in the loop, as
In this simple code, I would like to see that every thread generates the same sequence, but they don't.
Also, this link https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285#s7
Ideally, we'd have deterministic results irrespective of the details of the parallelization. All that would matter in this ideal case would be velocity fields, initial positions, length of time step, and the initial state of the random number generator.
There are ways of doing this, but they might be expensive and / or complicated and / or inefficient:
Note that only 1. and 2. above would ensure deterministic results irrespective of the details of the parallelization.
Realistically, I'd go for 3. above. My expectation as a user would be that an immediate re-run of an experiment should yield the same results.
Further down the road
Note again that thinking hard about 2 and its implications for the design of Parcels (how to achieve real concurrency?) would be a very interesting exercise. :)
No, I think this code does what it should (at least with the commented
If I seed
So the random generator does not seem to be thread safe
Yes, it would work, but this would mean that we need to pass the seed in the kernels then. But it can as well be done.
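A minimal sketch of what passing the seed to the kernel could look like, written in plain Python rather than generated C. Each particle gets its own generator derived from a base seed and the particle id, so the result no longer depends on which thread handles which particle, or in which order. `particle_rng` and `diffuse` are hypothetical names, and deriving the stream from a string key is just one possible choice.

```python
import random


def particle_rng(base_seed, particle_id):
    """Private generator for one particle, derived from seed and id."""
    return random.Random(f"{base_seed}:{particle_id}")


def diffuse(particle_id, x, base_seed, n_steps=3):
    """Toy kernel: add n_steps Gaussian increments to position x."""
    rng = particle_rng(base_seed, particle_id)  # seeded once per particle
    for _ in range(n_steps):
        x += rng.gauss(0.0, 1.0)
    return x


if __name__ == "__main__":
    ids = [0, 1, 2, 3]
    forward = {pid: diffuse(pid, 0.0, 1234) for pid in ids}
    backward = {pid: diffuse(pid, 0.0, 1234) for pid in reversed(ids)}
    # Processing order does not change any particle's result.
    print(forward == backward)
```

Note that re-seeding with the same key at every time step would replay the same numbers, so a kernel that is called once per step would have to either keep the generator state with the particle or include the step index in the key.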
This simple C code works.
Now trying to do better with GSL, but installation is not straightforward on OSX (conda's gcc does not work, so I was using a different one, but we want to use conda for dependencies and C libraries; checking how to do it).
Yes, basically if I include this at compilation
using this gcc:
before executing, I export also the dynamic library path:
And all works fine. That's the manual way. Now checking how to do it better (still using conda gcc)