example/stop_ensemble #458

Open. @dylancliche wants to merge 6 commits into base branch develop.

Conversation

@dylancliche (Author)

A new example for having the workflow itself stop the workers assigned to it.

…helps to alleviate the need to constantly check an ensemble to stop the workers when the workflow finishes. See the README file for further details.
…show job <jobid> doesn't show up anymore. This was remedied by checking the specification filename printed at the top of the output file. This also allowed for improved robustness by doing a whole-string match against the WorkDir set in `scontrol show job <jobid>`.
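Roughly, the stop step works like the sketch below (a Python rendering of the idea for illustration only; the actual example does this in the workflow spec, and the names and exact commands may differ):

```python
import os
import re
import subprocess


def jobs_matching_workdir(workdir: str) -> list[str]:
    """Return the user's Slurm job ids whose WorkDir exactly matches this workflow's directory."""
    job_ids = subprocess.run(
        ["squeue", "-h", "-u", os.environ["USER"], "-o", "%i"],
        capture_output=True, text=True, check=False,
    ).stdout.split()
    matches = []
    for job_id in job_ids:
        info = subprocess.run(
            ["scontrol", "show", "job", job_id],
            capture_output=True, text=True, check=False,
        ).stdout
        found = re.search(r"WorkDir=(\S+)", info)
        if found and found.group(1) == workdir:  # whole match, not a substring match
            matches.append(job_id)
    return matches


def stop_workers_for_workflow(workdir: str) -> None:
    """Cancel every allocation tied to this workflow once it has finished."""
    for job_id in jobs_matching_workdir(workdir):
        subprocess.run(["scancel", job_id], check=False)
```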
@lucpeterson (Member) left a comment


Some questions about this:

- How can/should it be modified now that `merlin monitor` works?
- Can/should it use `exit $(MERLIN_STOP_WORKERS)` to avoid the worker-killed-but-task-not-completed issue?
- We should note this is only for Slurm-based systems.
- Is there an alternative way to do this, e.g. via modifying the `$(MERLIN_STOP_WORKERS)` logic to cancel a worker's batch job and any dependencies instead of just the worker? It might be more elegant than an example.

@lucpeterson (Member)

@bgunnar5 what do you think about creating a custom stop-worker command that also shuts down a worker’s batch job (and any dependencies)? We could register a custom command that workers can listen for:

https://docs.celeryq.dev/en/latest/userguide/workers.html#id18

This would mean that `$(MERLIN_STOP_WORKERS)` could shut down batch jobs too, making it easy for any workflow to shut down its workers and allocation at the end.
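For reference, the linked Celery docs cover registering remote control commands. A minimal sketch of what such a command might look like (the command name, the `scancel` call, and the reliance on `SLURM_JOB_ID` are illustrative assumptions, not existing Merlin behavior):

```python
import os
import subprocess

from celery.exceptions import WorkerShutdown
from celery.worker.control import control_command


@control_command()
def stop_worker_and_batch_job(state, **kwargs):
    """Cancel this worker's Slurm allocation, then shut the worker down.

    The shutdown mirrors Celery's built-in `shutdown` control handler, which
    raises WorkerShutdown.
    """
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id:
        # Cancelling the allocation takes the batch job down with the worker.
        subprocess.run(["scancel", job_id], check=False)
    raise WorkerShutdown("stop requested by workflow")


# A client (e.g. merlin stop-workers, or a workflow step) could then broadcast it:
#   app.control.broadcast("stop_worker_and_batch_job")
```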

@bgunnar5 (Member) commented Apr 9, 2024

@lucpeterson yeah I think ultimately baking this into Merlin rather than having it as a standalone example is probably the way to go. This also means that I could expand it to work for more than just Slurm too.

I'm not sure if this would need to be an entirely new custom command or if we could just add it onto the existing `merlin stop-workers` command.

@lucpeterson (Member)

Yeah, the `stop-workers` command issues a broadcast call to shut down workers remotely. If we made a new command that we could broadcast, this would slip in nicely. You could kill workers and batch jobs from either the command line or a workflow.

@dylancliche (Author)

@bgunnar5 @lucpeterson, I personally think that having a new command along the lines of `$(MERLIN_STOP_WORKFLOW)`, which accomplishes what this example does, would be the best way to move forward. From a user's perspective, adding a single-line command to execute the scripts within this example would be a lot better than copying and pasting a whole step into a workflow.

@dylancliche (Author)

@bgunnar5 Let me know if you would like some help in accomplishing this. I can work on this a little at a time if needed.

@bgunnar5 (Member) commented May 9, 2024

Ideally we would be able to do something like `merlin stop-workers --all-dependencies` if we create the custom Celery command that Luc linked. That would then let us create a new step return variable like you mentioned, @dylancliche (see the big if/elif chain of how results are processed in `merlin/common/tasks.py::merlin_step` for how this could be implemented; specifically, see how workers are currently stopped when `STOP_WORKERS` is provided). With this implementation, users could invoke the stop command either from the command line or from within their workflow.
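A rough sketch of where that new return value could slot in (`ReturnCode.STOP_WORKFLOW`, `stop_workers`, and `kill_dependent_jobs` are placeholder names for this sketch, not Merlin's actual API):

```python
from enum import Enum, auto


class ReturnCode(Enum):
    # Only the values needed for this sketch; Merlin's real enum has more members.
    OK = auto()
    STOP_WORKERS = auto()
    STOP_WORKFLOW = auto()  # hypothetical new value


def handle_step_result(result: ReturnCode, adapter, stop_workers) -> None:
    """Sketch of the result-processing branch that merlin_step would grow."""
    if result == ReturnCode.OK:
        return
    if result == ReturnCode.STOP_WORKERS:
        # Existing behavior: broadcast a shutdown to the Celery workers only.
        stop_workers()
    elif result == ReturnCode.STOP_WORKFLOW:
        # Proposed behavior: tear down the batch allocation (and any dependent
        # jobs) via a scheduler-specific adapter hook, then stop the workers.
        adapter.kill_dependent_jobs()
        stop_workers()
```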

We'll need to figure out how to register this custom command based on the scheduler, so there will likely be some work to do within the script adapters themselves. Maybe each script adapter can have its own `kill_dependent_jobs` method (naming is up for debate) and that can be linked to the custom command? I'm not sure of the best way to go about this as of now.
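Very roughly, something like this (a sketch assuming a Slurm environment and that `squeue`'s `%E` dependency field can be used to spot dependent jobs):

```python
import os
import subprocess


class SlurmScriptAdapter:
    """Sketch of the scheduler-specific hook discussed above; the method name
    and exact behavior are up for debate."""

    def kill_dependent_jobs(self) -> None:
        job_id = os.environ.get("SLURM_JOB_ID")
        if not job_id:
            return
        # Cancel any queued jobs whose dependency field references this
        # allocation, then cancel the allocation itself.
        queue = subprocess.run(
            ["squeue", "-h", "-o", "%i %E"],
            capture_output=True, text=True, check=False,
        ).stdout
        for line in queue.splitlines():
            dependent_id, _, dependency = line.partition(" ")
            if job_id in dependency:
                subprocess.run(["scancel", dependent_id], check=False)
        subprocess.run(["scancel", job_id], check=False)
```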

If you have time to work on it then I definitely won't stop you!

@dylancliche (Author)

@bgunnar5 Thanks for pointing me to the right locations for where `STOP_WORKERS` is handled, as well as the adapters. I will first take a look at using the adapters with this example to help make it scheduler-independent, and we'll go from there.
