Example/tutorial on automating parameter exploration with Slurm? #1137

Closed
ali-ramadhan opened this issue Nov 2, 2020 · 5 comments
Labels
documentation 📜 The sacred scrolls

Comments

@ali-ramadhan
Member

ali-ramadhan commented Nov 2, 2020

PJ Tuckman (@qwert2266) recently wrote a pretty sweet Julia script for doing automated parameter exploration with Slurm on Satori (see https://github.com/ali-ramadhan/JuicyMoons.jl/pull/14).

The script itself can be found at: https://github.com/ali-ramadhan/JuicyMoons.jl/blob/pjt/enceladus-slurm/slurm/20201031ScriptCreator.jl (needs some refactoring and might have bugs)

The automation was complicated by the fact that Satori only allows you 1 Slurm job (through which you can request 4 GPUs and cram 4 GPU simulations onto one node), and there's a 12-hour time limit on all jobs. So the idea/hack we came up with was for each simulation script to touch one file to indicate that it has checkpointed itself, and another file to indicate that it has reached steady state. The Julia script keeps creating and submitting Slurm scripts until all simulations have reached steady state ("checkpointed" simulations are queued/scheduled again, while simulations that have reached "steady state" are not queued/scheduled any more).
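To make the marker-file idea concrete, here is a minimal sketch of how it could look. The file names `CHECKPOINTED` and `STEADY_STATE`, the one-directory-per-simulation layout, and all function names are assumptions made for illustration; they are not taken from the linked script.

```julia
# Hypothetical marker-file names; the actual script may use different ones.
checkpoint_marker(dir) = joinpath(dir, "CHECKPOINTED")
steady_state_marker(dir) = joinpath(dir, "STEADY_STATE")

# Simulation side: call after writing a checkpoint, or once steady state is detected.
mark_checkpointed(dir) = touch(checkpoint_marker(dir))
mark_steady_state(dir) = touch(steady_state_marker(dir))

# Driver side: classify a simulation directory by which markers exist.
function simulation_status(dir)
    isfile(steady_state_marker(dir)) && return :steady_state  # done, never requeue
    isfile(checkpoint_marker(dir))   && return :checkpointed  # requeue from checkpoint
    return :not_started
end

# Simulations that still need to be scheduled in a future Slurm job.
needs_requeue(dirs) = filter(dir -> simulation_status(dir) != :steady_state, dirs)
```

The steady-state marker is checked first, so a simulation that has both checkpointed and reached steady state is treated as finished. The driver would keep submitting jobs until `needs_requeue` returns an empty list.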

The point of this issue is to discuss whether it makes sense to add an example/tutorial on automating parameter exploration with Slurm. The workflow discussed above is specific to Satori, so it might not make sense to include it in the docs (it might be more of an internal resource).

We looked at ClusterManagers.jl but don't think it's super useful here, since we're working around only having 1 job, and we don't know what the next job will be until the 4 simulations crammed into the first job are done running.
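As a rough sketch of the "one job, four simulations" constraint, the driver could pick the next batch and generate the batch script text as below. The `#SBATCH` directives, the use of `CUDA_VISIBLE_DEVICES` to pin one simulation per GPU, and the names `slurm_script` and `next_batch` are assumptions for illustration; the directives Satori actually requires may differ.

```julia
# Assumption: 4 GPUs per job and a 12-hour limit, as described for Satori above.
const GPUS_PER_JOB = 4

# Build the text of one Slurm batch script that runs up to 4 simulations, one per GPU.
function slurm_script(scripts; time_limit="12:00:00")
    length(scripts) <= GPUS_PER_JOB || error("at most $GPUS_PER_JOB simulations per job")
    lines = ["#!/bin/bash",
             "#SBATCH --nodes=1",
             "#SBATCH --gres=gpu:$GPUS_PER_JOB",
             "#SBATCH --time=$time_limit",
             ""]
    for (i, script) in enumerate(scripts)
        # Pin each simulation to its own GPU and run it in the background.
        push!(lines, "CUDA_VISIBLE_DEVICES=$(i - 1) julia --project $script &")
    end
    push!(lines, "wait")  # keep the job alive until every simulation exits
    return join(lines, "\n") * "\n"
end

# Pick the next batch from the simulations that have not reached steady state.
next_batch(pending) = pending[1:min(end, GPUS_PER_JOB)]
```

The driver would write the returned string to a file, submit it with `sbatch`, and only compute the following batch once that job has ended, since which simulations remain pending is not known until then.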

X-Ref: #1045 proposes adding example Slurm scripts, which is definitely a good idea.

cc @sandreza @suyashbire1, who might be interested.

@ali-ramadhan ali-ramadhan added the documentation 📜 The sacred scrolls label Nov 2, 2020
@ali-ramadhan
Member Author

@christophernhill Do you know if it's possible/reasonable to change the Slurm limit on Satori to 4 GPUs per user (instead of the current 1 job/1 node/4 GPU limit)? If it's a helpful change to other Satori users and does not impact cluster performance/scheduling, it might help enable all the automation we develop for Satori to seamlessly work on other clusters?

@ali-ramadhan
Member Author

@vchuravy @jpsamaroo I have a feeling that Dagger.jl can do this much better than we can haha. Is this kind of automation with Slurm within scope for Dagger.jl?

@vchuravy
Collaborator

vchuravy commented Nov 2, 2020

> I have a feeling that Dagger.jl can do this much better than we can haha. Is this kind of automation with Slurm within scope for Dagger.jl?

This is currently out of scope for Dagger: we assume that resources have already been allocated by the user, and Dagger then manages them.

@glwagner
Member

glwagner commented Nov 2, 2020

I think basic info about Slurm is a good idea! Possibly hyper-sophisticated and cluster-specific tricks might be better documented in other packages like JuicyMoons.jl?

We can still provide links to packages that use Oceananigans.jl, like LESbrary.jl, JuicyMoons.jl, and hopefully soon EadyTurbulence.jl, in the Oceananigans docs, so that users can benefit from their more complicated examples, scripts, and other tricks. The benefit of this approach is that these subsidiary packages don't have to be synchronous with Oceananigans master, unlike the examples and tutorials that we ship with the Oceananigans source and docs.

@glwagner
Member

I'm closing this issue because I'm judging that it's not of current, timely relevance to Oceananigans development. If you would like to make it a higher priority or if you think the issue was closed in error please feel free to re-open.
