Feature optimise across nodes #542
Conversation
That would be a major limitation. But simply telling …
Looks like a good start to me. There is obviously room for improving efficiency by going for dynamic scheduling, but that can be extended later.
Please extend the documentation on how to use it (as posted in the PR description).
Do you already have any benchmarking / scaling results? Would be interesting to see.
pypesto/engine/mpi_pool.py (outdated)
logger.info(f"Performing parallel task execution on {n_procs} " | ||
f"nodes with chunksize of {self.chunksize}.") |
Communicator size is not necessarily the number of nodes.
So I do think it is, at least when I tested it: if I allocated e.g. 5 nodes in my slurm file, the communicator size was also 5. Not sure whether this is exactly what you meant, though.
It depends on how you invoke mpiexec. As mentioned, I don't see any obvious problem with having one MPI rank per core instead of per node. The slightly more expensive communication should be overcompensated by not wasting one full node.
Might be because I invoked mpiexec with -n [Number] that it always only ran one process per node.
Had the idea as well, tested some things that did not work, but will try a bit more.
Codecov Report
@@            Coverage Diff             @@
##           develop     #542      +/-   ##
===========================================
+ Coverage    88.16%   88.56%    +0.40%
===========================================
  Files           79       87        +8
  Lines         5257     5380      +123
===========================================
+ Hits          4635     4765      +130
+ Misses         622      615        -7
Continue to review full report at Codecov.
I updated the … I think it would be good to have a description somewhere on how to use it exactly; where can I put something like that?
Looks good, but would be great to include a test case (which would also serve as usage example).
Can I write a test which requires a slurm file and a cluster and still have it pass as a regular pytest test? That was the main issue I had.
It shouldn't require slurm, just mpiexec should do the job, no? I see that it's a bit inconvenient. Easiest might be creating a test script that runs a (cheap, e.g. Rosenbrock) optimization using the MPIPoolEngine and writing the result to a file. From a pytest test case you could then launch that script via mpiexec and check the written result.
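A minimal sketch of what such a test could look like, assuming a helper script mpi_optimize_script.py that runs the optimization and writes its result to temp_result.h5 (both names are illustrative, not the actual PR code):

import subprocess
import sys


def test_optimize_with_mpipoolengine():
    # launch the helper script under mpiexec with two ranks;
    # the script is expected to write its result to temp_result.h5
    subprocess.check_call([
        'mpiexec', '-n', '2', sys.executable, '-m', 'mpi4py.futures',
        'mpi_optimize_script.py',
    ])
    # ...load temp_result.h5 here and compare it against a reference
    # single-process optimization...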
OK, so the test seems to work. @dweindl I noticed that if I run the slurm file on Bonna, it does not stop although the program has finished.
test/optimize/test_optimize.py (outdated)
os.system('rm temp_result1.h5')
os.system('rm temp_result2.h5')
Use os.remove() / os.unlink(), or the corresponding functions from pathlib.
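For illustration, the cleanup could then look like this (a sketch of the suggested change, keeping the file names from the snippet above):

import os

# delete the temporary result files directly instead of shelling out to `rm`
os.remove('temp_result1.h5')
os.remove('temp_result2.h5')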
Looks good.
Hey,
I implemented an engine that can optimise across nodes. I also tested it for a small model (benchmark collection: Zhao) and got a definite speedup; the exact speedup most likely depends on the complexity of the model.
I also wanted to add the engine to the test_engine.py file, but one cannot test it without using multiple nodes, so that probably would not make too much sense? Any suggestions on that?
Lastly, some things to note:
In Bonna, srun and MPI do not work together and raise an error. This already happens as soon as mpi4py is imported, so for convenience I did not add the MPIPoolEngine to the __init__.py file. When using it, one therefore has to import it separately via from pypesto.engine.mpi_pool import MPIPoolEngine.
Since srun does not work, one can use in the slurm file: mpiexec -n [number of nodes] python -m mpi4py.futures [filename]. Additionally, all the work before the optimisation should be done on only one node; for this, one can put the whole code as it is inside an if __name__ == '__main__': block.
Currently the optimisation works in the following way: one node is considered the master while the others are considered the workers. The master sends the workers batches of tasks (of size #CPUs per node), which they in turn execute with a multiprocessing Pool. Thus, currently a whole node is only coordinating and not working; that would be a next step, and I would be happy for any suggestions.
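For illustration, a minimal usage sketch following the description above; the Rosenbrock setup, the file name and the exact pypesto calls are assumptions (they may differ between pypesto versions and from the actual test script in this PR):

# run_mpi_optimization.py (hypothetical file name)
import numpy as np
from scipy.optimize import rosen

import pypesto
import pypesto.optimize as optimize
from pypesto.engine.mpi_pool import MPIPoolEngine  # not exported via __init__.py

if __name__ == '__main__':
    # everything before the optimisation runs only once, on the master rank
    objective = pypesto.Objective(fun=rosen)
    problem = pypesto.Problem(objective=objective,
                              lb=-5 * np.ones(2), ub=5 * np.ones(2))
    result = optimize.minimize(problem=problem, n_starts=10,
                               engine=MPIPoolEngine())

This would then be submitted as described above, e.g. with mpiexec -n [number of nodes] python -m mpi4py.futures run_mpi_optimization.py.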