
Feature optimise across nodes #542

Merged: 61 commits from feature_optimise_across_nodes into develop on Feb 25, 2021

Conversation

PaulJonasJost
Collaborator

Hey,
I implemented an engine that can optimise across nodes. I also tested it for a small model (benchmark collection: Zhao) and got a definite speedup; the exact speedup most likely depends on the complexity of the model.
I also wanted to add the engine to the test_engine.py file, but one cannot test it without using multiple nodes, so that probably would not make too much sense. Any suggestions on that?
Lastly, some things to note:

  • On Bonna, srun and MPI do not work together and raise an error. This already happens as soon as one imports mpi4py, so for convenience I did not add the MPIPoolEngine to the __init__.py file. Thus, when using it, one has to import it separately via from pypesto.engine.mpi_pool import MPIPoolEngine.

  • Since srun does not work, one can use mpiexec -n [number of nodes] python -m mpi4py.futures [filename] in the slurm file. Additionally, one has to make sure that all the work before the optimisation is done on only one node; for this, one can put the whole code inside an if __name__ == '__main__': block (see the usage sketch after this list).

  • Currently, the optimisation works in the following way: one node is considered the master, while the others are considered the workers. The master sends the workers batches of tasks (of size #cpus/node), which they in turn execute with a multiprocessing Pool. Thus, a whole node is currently only coordinating and not working. Changing that would be a next step, and I would be happy about any suggestions.
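
For reference, a minimal usage sketch of such a script (not part of this PR; it assumes scipy's Rosenbrock function as a cheap objective, the pypesto optimize API as of this PR, and a hypothetical script name):

```python
# Minimal usage sketch, assuming scipy's Rosenbrock function as a cheap test
# objective and the pypesto API as of this PR (script name is hypothetical:
# optimize_across_nodes.py).
import numpy as np
import scipy.optimize as so

import pypesto
import pypesto.optimize as optimize
# Imported directly, since the engine is deliberately not exported via
# pypesto.engine.__init__ (see the first bullet above).
from pypesto.engine.mpi_pool import MPIPoolEngine


def run():
    objective = pypesto.Objective(fun=so.rosen, grad=so.rosen_der)
    dim = 10
    problem = pypesto.Problem(
        objective=objective, lb=-5 * np.ones(dim), ub=5 * np.ones(dim)
    )
    result = optimize.minimize(
        problem=problem,
        optimizer=optimize.ScipyOptimizer(),
        n_starts=100,
        engine=MPIPoolEngine(),  # distributes the starts across MPI workers
    )
    # Best objective values of the multistart (API naming as of early 2021).
    print(result.optimize_result.get_for_key('fval')[:5])


if __name__ == '__main__':
    # Only the master rank runs this block; mpi4py.futures spawns the workers.
    # Launch e.g. with:
    #   mpiexec -n <number of nodes> python -m mpi4py.futures optimize_across_nodes.py
    run()
```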

@dweindl
Member

dweindl commented Dec 16, 2020

* Thus currently a whole node is only coordinating and not working.

That would be a major limitation. But simply telling mpiexec about the number of cores (per node) should take care of that. Not?

@dweindl (Member) left a comment

Looks like a good start to me. There is obviously room for improving efficiency by going for dynamic scheduling, but that can be extended later.

Please extend the documentation on how to use it (as posted in the PR description).

Do you already have any benchmarking / scaling results? Would be interesting to see.

pypesto/engine/mpi_pool.py, comment on lines 52 to 53:
logger.info(f"Performing parallel task execution on {n_procs} "
f"nodes with chunksize of {self.chunksize}.")
Member

Communicator size is not necessarily the number of nodes.

Collaborator Author

So I do think it is, at least when I tested: if I allocated e.g. 5 nodes in my slurm file, the communicator size was also 5. Not sure whether this is exactly what you meant, though.

dweindl (Member) commented Dec 17, 2020

It depends on how you invoke mpiexec. As mentioned, I don't see any obvious problem with having one MPI rank per core instead of per node. The slightly more expensive communication should be more than compensated for by not wasting a full node.

Collaborator Author

Might be because I invoked mpiexec with -n [number] that it always ran only one process per node.

Co-authored-by: Daniel Weindl <dweindl@users.noreply.github.com>
@PaulJonasJost
Collaborator Author

PaulJonasJost commented Dec 17, 2020

* Thus currently a whole node is only coordinating and not working.

That would be a major limitation. But simply telling mpiexec about the number of cores (per node) should take care of that. Not?

Had the idea as well, tested some things that did not work, but will try a bit more.

@codecov-io

codecov-io commented Jan 28, 2021

Codecov Report

Merging #542 (1199b4c) into develop (160c2a8) will increase coverage by 0.40%.
The diff coverage is 100.00%.


@@             Coverage Diff             @@
##           develop     #542      +/-   ##
===========================================
+ Coverage    88.16%   88.56%   +0.40%     
===========================================
  Files           79       87       +8     
  Lines         5257     5380     +123     
===========================================
+ Hits          4635     4765     +130     
+ Misses         622      615       -7     
| Impacted Files | Coverage Δ |
|---|---|
| pypesto/engine/mpi_pool.py | 100.00% <100.00%> (ø) |
| pypesto/optimize/__init__.py | 100.00% <0.00%> (ø) |
| pypesto/__init__.py | 100.00% <0.00%> (ø) |
| pypesto/profile/__init__.py | 100.00% <0.00%> (ø) |
| pypesto/result.py | 88.46% <0.00%> (ø) |
| pypesto/visualize/__init__.py | 100.00% <0.00%> (ø) |
| pypesto/sample/__init__.py | 100.00% <0.00%> (ø) |
| pypesto/petab/__init__.py | 55.55% <0.00%> (ø) |
| pypesto/optimize/optimizer.py | 90.74% <0.00%> (+0.25%) ⬆️ |
| ... and 4 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@PaulJonasJost
Collaborator Author

PaulJonasJost commented Jan 28, 2021

I updated the `MPIPoolEngine()`. The main problem was actually the `.slurm` script I used before. Also, the engine is not in the `__init__.py`, since otherwise importing `pypesto.engine` would also import `mpi4py`, which produces problems when used in combination with a normal `srun` in a slurm file.

I think it would be good to have a description somewhere on how to use it exactly. Where could I put something like that?

@dweindl (Member) left a comment

Looks good, but would be great to include a test case (which would also serve as usage example).

Co-authored-by: Daniel Weindl <dweindl@users.noreply.github.com>
@PaulJonasJost
Collaborator Author

Looks good, but would be great to include a test case (which would also serve as usage example).

Can I write a test which requires a slurm file and a cluster, and have it still pass in a normal pytest run? That was the main issue I had.

@dweindl
Member

dweindl commented Jan 28, 2021

Can i write a test which requires a slurm file and a cluster and would still pass any pytest test? That was the main issue i had.

It shouldn't require slurm, just mpiexec should do the job, not? I see that it's a bit inconvenient. Easiest might be creating a test script that runs a (cheap - Rosenbrock?) optimization using the MPIPoolEngine and writing the result to a file. From a pytest test case you could then launch mpiexec ... $thatOtherScript via subprocess.run and verify the number of optimizations and final objective.
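
A rough sketch of the pytest side of that suggestion (the example script path matches the file added later in this PR; the result check is left schematic):

```python
# Sketch of the pytest wrapper, assuming the MPI work lives in a separate
# script (here: doc/example/example_MPIPool.py) that is launched via mpiexec.
import subprocess
import sys


def test_optimization_with_mpipool_engine():
    # Two MPI processes are enough for a smoke test: one master, one worker.
    proc = subprocess.run(
        [
            'mpiexec', '-n', '2',
            sys.executable, '-m', 'mpi4py.futures',
            'doc/example/example_MPIPool.py',
        ],
        capture_output=True,
        text=True,
    )
    assert proc.returncode == 0, proc.stderr
    # The helper script is expected to write its result to disk (e.g. HDF5);
    # load it here and verify the number of optimizations and the final
    # objective values against a reference run.
```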

@PaulJonasJost
Collaborator Author

OK, so the test seems to work. @dweindl, I noticed that if I run the slurm file on Bonna, it does not stop although the program has finished.

(Resolved review comments on doc/example/exampleBatchFile.slurm, doc/example/example_MPIPool.py, and test/optimize/test_optimize.py)
Comment on lines 246 to 247
os.system('rm temp_result1.h5')
os.system('rm temp_result2.h5')
Member

Use os.remove() or os.unlink(), or the corresponding functions from pathlib.
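
For example (file names taken from the quoted lines; a sketch, not the final test code):

```python
import os
from pathlib import Path

# Instead of os.system('rm ...'):
os.remove('temp_result1.h5')      # raises FileNotFoundError if the file is missing
Path('temp_result2.h5').unlink()  # pathlib equivalent
```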

@dweindl (Member) left a comment

Looks good.

PaulJonasJost and others added 3 commits February 25, 2021 14:42
Co-authored-by: Daniel Weindl <dweindl@users.noreply.github.com>
@PaulJonasJost PaulJonasJost merged commit bee7f8a into develop Feb 25, 2021
@FFroehlich FFroehlich deleted the feature_optimise_across_nodes branch February 25, 2021 20:29
@yannikschaelte yannikschaelte mentioned this pull request Mar 17, 2021