
Conversation

@dkuegler (Member) commented Jan 22, 2024

A current limitation of both srun_fastsurfer and brun_fastsurfer is that it is not possible to pass additional, subject-specific parameters to run_fastsurfer.sh.
This PR adds the ability to specify such options to both srun_fastsurfer and brun_fastsurfer.

In particular, brun_fastsurfer.sh can now carry additional per-subject parameters (format =,). These are also passed through to srun_fastsurfer, for example:
Example 1:
`--subject_list_delim "," --subject_list_awk_code_args '\$2 ",--vox_size," \$4'`
translates the subject_list line
`subject-101,raw/T1w-101A.nii.gz,study-1,0.9`
into
`--sid subject-101 --t1 <data-path>/raw/T1w-101A.nii.gz --vox_size 0.9`
Example 2:
`--subject_list_delim "," --subject_list_awk_code '\$2' --subject_list_awk_code_args '\$4,--asegdkt_segfile,\$3'`
translates the subject_list line
`4,sub_A1,mri/asegdkt.mgz,412463-t1.nii.gz`
into
`--sid sub_A1 --t1 <data-path>/412463-t1.nii.gz --asegdkt_segfile mri/asegdkt.mgz`
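
For reference, a minimal sketch of what this translation amounts to for Example 1 (illustration only, not the actual code in stools.sh; the default sid field and all variable names here are assumptions, and on the real command line the `$` signs are backslash-escaped rather than single-quoted):

```bash
# Illustration of the Example 1 translation; not the actual stools.sh code.
subject_list_delim=","
awk_code_sid='$1'                     # assumed default: first field is the subject id
awk_code_args='$2 ",--vox_size," $4'  # second field: T1 image, fourth field: voxel size

line="subject-101,raw/T1w-101A.nii.gz,study-1,0.9"
sid=$(echo "$line" | awk -F "$subject_list_delim" "{print $awk_code_sid}")
args=$(echo "$line" | awk -F "$subject_list_delim" "{print $awk_code_args}")
IFS="," read -r -a extra <<< "$args"  # split the generated argument string on ","

# prints: --sid subject-101 --t1 <data-path>/raw/T1w-101A.nii.gz --vox_size 0.9
echo --sid "$sid" --t1 "<data-path>/${extra[0]}" "${extra[@]:1}"
```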

@dkuegler dkuegler force-pushed the slurm/more-commands branch 3 times, most recently from 35dacc0 to 06a1c76 on January 22, 2024 15:47
@dkuegler dkuegler requested a review from ClePol January 22, 2024 15:54
@dkuegler dkuegler marked this pull request as ready for review January 23, 2024 11:03
@dkuegler (Member Author)

The PR is tested and can be reviewed now.

@ClePol (Member) commented Jan 25, 2024

Thanks for the PR, it's working well out of the box now.
I ran into two issues while testing:

  1. I am not sure what is supposed to happen when a job crashes. Right now it seems that if the FastSurferCNN module crashes (e.g. QC doesn't pass), the slurm jobs become zombies, because the dependency is never met. Is this expected behaviour?
    It would be great if the cleanup job were executed in any case, so we don't have to hunt down the jobs and files on the slurm cluster and cancel them manually (see the sketch after this list).
  2. The HPC work directory cannot be modified while the processing is running, because any files placed there would be deleted. We could instead check for and remove only the folders that are created ("cases", "scripts", "logs", "images"). I think this only needs modifications in the check_hpcwork and make_cleanup_job functions.
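
One way to make sure the cleanup job runs in any case (regardless of failed processing jobs) would be to submit it with an `afterany` instead of an `afterok` dependency; a minimal sketch with placeholder script names:

```bash
# Sketch only: script names are placeholders, not the jobs generated by srun_fastsurfer.
# "afterany" fires when the surface job array finishes in any state,
# "afterok" only when every task completed successfully.
surf_jobid=$(sbatch --parsable surf_jobarray.sbatch)
sbatch --dependency=afterany:"$surf_jobid" cleanup.sbatch
```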

@ClePol (Member) commented Jan 25, 2024

I believe I found a bug - the surface pipeline cannot run, because --parallel_subjects is passed to brun_fastsurfer.sh without an argument.
Putting `--parallel_subjects -1` (for maximum parallelism) instead of plain `--parallel_subjects` allows the surface pipeline to start.

@dkuegler (Member Author)

`--parallel_subjects` (without a value) is supposed to be a valid setting, equal to max.
I'll look into it.

With regard to your earlier comments:

  1. Initially, if a job crashed, all following jobs were supposed to fail via slurm (the dependency is never fulfilled). But this creates unwanted side effects for the other cases processed in the same batch. So I implemented a monitoring file for the successful completion of the different stages, the subject_success file. Since then, the expected behavior is that cases are skipped if a prior step failed (a sketch follows after this list). But I have not checked this recently, only when I implemented it.
  2. Users are not supposed to modify/edit the work directory, so this is not really a concern. All data should be copied back by the cleanup script to the folder specified via --sd.
    Also, you do not need to specify --work if $HPCWORK is set, as it would be on most clusters.
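
The skip-on-failure logic mentioned in point 1 could look roughly like the following; the name, location and format of the subject_success file as well as the run_fastsurfer.sh call are assumptions for illustration, not the exact implementation in this PR:

```bash
# Hypothetical sketch of skipping a subject whose segmentation stage did not succeed.
hpc_work="/path/to/hpc_work"                  # placeholder
subject_id="subject-101"                      # placeholder
subject_success="$hpc_work/subject_success"   # assumed name/location of the status file

if grep -q "^${subject_id}" "$subject_success" 2>/dev/null; then
  # assumed layout: subject data lives under the "cases" folder of the work directory
  ./run_fastsurfer.sh --surf_only --sid "$subject_id" --sd "$hpc_work/cases"
else
  echo "$subject_id: segmentation did not finish successfully, skipping surfaces"
fi
```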

@dkuegler (Member Author)

There seems to be a `not` missing.

@dkuegler (Member Author) commented Jan 25, 2024

I am a bit surprised about the zombification of jobs, though. brun_fastsurfer is supposed to always return 0 (success):

`# always exit successful`

I'll have to read up in the slurm docs on whether a non-zero return code of a child process supersedes a zero return code of the main process.

@ClePol (Member) commented Jan 25, 2024

> 1. Initially, if a job crashed, all following jobs were supposed to fail via slurm (the dependency is never fulfilled). But this creates unwanted side effects for the other cases processed in the same batch. So I implemented a monitoring file for the successful completion of the different stages, the subject_success file. Since then, the expected behavior is that cases are skipped if a prior step failed. But I have not checked this recently, only when I implemented it.

I think this is broken atm; I can keep an eye on it when I run some batches with failing subjects and let you know what happens currently. IIRC the jobs stayed IDLE indefinitely with "dependency not met" (zombie processes), but I didn't look into it.

> 2. Users are not supposed to modify/edit the work directory, so this is not really a concern. All data should be copied back by the cleanup script to the folder specified via --sd.
>    Also, you do not need to specify --work if $HPCWORK is set, as it would be on most clusters.

The issue is not that files used by the script can't be modified, but that the script essentially blocks the whole HPC work directory, so no other scripts can use it (e.g. what if I want to also run FreeSurfer on slurm at the same time as FastSurfer?). Why not only check and delete the folders that are actually used?
A workaround would be to set --work manually to a subfolder of $HPCWORK (which I didn't do yet, but might need to do in the future).
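
For illustration, such a workaround might look like this (the directory name is arbitrary; only --work and --sd are taken from the discussion above, any further options would be appended as usual):

```bash
# Workaround sketch: give srun_fastsurfer its own empty subdirectory of $HPCWORK,
# so the rest of $HPCWORK stays usable for other pipelines.
work_dir="$HPCWORK/fastsurfer_work_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$work_dir"
./srun_fastsurfer.sh --work "$work_dir" --sd /path/to/subjects_dir
```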

@dkuegler (Member Author)

For proper scheduling between job arrays, there needs to be a dependency between the corresponding seg and surf tasks.
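
Slurm's `aftercorr` dependency type provides exactly this pairing between job-array tasks; a minimal sketch with placeholder script names and array sizes:

```bash
# Sketch: element-wise dependency between two job arrays (scripts/sizes are placeholders).
# With "aftercorr", surf task i can only start once seg task i has completed successfully.
seg_jobid=$(sbatch --parsable --array=1-20 seg_jobarray.sbatch)
sbatch --dependency=aftercorr:"$seg_jobid" --array=1-20 surf_jobarray.sbatch
```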

@ClePol (Member) commented Jan 26, 2024

In case this pops up during testing, just FYI - I ran into an issue on the line

$run_fastsurfer "$seg_surf_only" "${args[@]}" "${POSITIONAL_FASTSURFER[@]}" | prepend "$subject_id: " &

where there seemed to be some issue with the formatting of the srun command. I did a quick and dirty workaround by building the command into a string and calling eval on it.

@dkuegler (Member Author)

> In case this pops up during testing, just FYI - I ran into an issue on the line
>
> `$run_fastsurfer "$seg_surf_only" "${args[@]}" "${POSITIONAL_FASTSURFER[@]}" | prepend "$subject_id: " &`
>
> where there seemed to be some issue with the formatting of the srun command. I did a quick and dirty workaround by building the command into a string and calling eval on it.

I don't know exactly what you are running into, but #438 identifies an issue in exactly that line: $seg_surf_only is empty. The question is what error you were getting; in #438 it was "unrecognized option" -- the empty string.
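
The empty-argument effect can be reproduced in isolation; this is a generic bash illustration of the problem and of the usual array-based alternative, not the FastSurfer code itself:

```bash
#!/bin/bash
# Generic illustration of how a quoted-but-empty variable becomes an empty argument.
print_args() { printf 'got %d args:' "$#"; printf ' [%s]' "$@"; echo; }

seg_surf_only=""
print_args "$seg_surf_only" --sid subject-101   # got 3 args: [] [--sid] [subject-101]
print_args $seg_surf_only --sid subject-101     # got 2 args: [--sid] [subject-101]

# an array that may be empty avoids both the empty argument and word-splitting issues
opts=()
print_args "${opts[@]}" --sid subject-101       # got 2 args: [--sid] [subject-101]
```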

@dkuegler (Member Author)

Currently, the default option for check_hpc_work is none, i.e. there is no check whether the directory is in fact empty, which can lead to unexpected deletion of files by the cleanup job.

Also, the cleanup job should only delete cases, logs, scripts, and images, not *.
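
A sketch of that narrower cleanup (the hpc_work variable and the folder list are taken from this thread, but the real make_cleanup_job code may differ):

```bash
# Remove only the directories created by the script instead of wiping "$hpc_work"/*.
hpc_work="${hpc_work:-/path/to/hpc_work}"   # assumed to be set by the surrounding script
for d in cases scripts logs images; do
  rm -rf "${hpc_work:?}/$d"                 # ":?" guards against expanding to an empty path
done
```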

@dkuegler (Member Author)

I can confirm that slurm inherits the return code from job steps, i.e. if an srun call fails (for example a surface or a segmentation command), this interrupts the whole processing pipeline.

This can be avoided by manually setting the derived return code with sjobexitmod, see https://slurm.schedmd.com/job_exit_code.html.

It would probably be best to have a verbose error message at the end of the script, but always return 0, so subsequent tasks are always executed.
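
A sketch of that pattern for the generated batch script (structure and flags are illustrative; the real script built by srun_fastsurfer.sh differs, and per the slurm docs linked above the derived exit code may additionally need to be reset with sjobexitmod):

```bash
#!/bin/bash
# Sketch of "verbose error message, but always exit 0" in a slurm batch script.
sd="/path/to/subjects_dir"                      # placeholder
failed_subjects=()

for subject_id in subject-101 subject-102; do   # placeholder subject ids
  if ! srun --ntasks=1 ./run_fastsurfer.sh --seg_only --sid "$subject_id" --sd "$sd"; then
    failed_subjects+=("$subject_id")
  fi
done

if [[ ${#failed_subjects[@]} -gt 0 ]]; then
  echo "ERROR: processing failed for: ${failed_subjects[*]}" >&2
fi
# always exit 0 so dependent (surface/cleanup) jobs are not held back by this job's state
exit 0
```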

- switch subject_list_delim separator to = from : (more consistency with brun_fastsurfer and its subject_list format)

brun_fastsurfer.sh
- add capability to add more fastsurfer parameters in the subject_list file

stools.sh
- adapt the awk script so that paths in additional parameters are also mapped into the singularity container
- fix debug printing of parallel_subjects
- fix --parallel_subjects surf= parsing, max
- add subject_id to all parallel processing
- fix debug output of FastSurfer parameters
- copy stools.sh script
- change the (very expert) call interface of --subject_list_awk_code to --subject_list_awk_code_sid and --subject_list_awk_code_args.

stools.sh
- fix translate_cases according to the split of the awk_code snippets
- fix variable escaping in translate_cases awk code
- add debug output for the translated subjects

stools.sh
- fix awk code snippets
- Add redundant parameter names --subjects_list* for all --subject_list* parameters
- fix the run_fastsurfer variable
- add a slurm_task_id option in --batch to fix the --jobarray functionality
- simplify the --parallel_subjects argument parsing
- comment parallel_surf == true ==> seg_surf_only == ""
- fix the argument setting of --statusfile
- fix double usage of variable i, and make sure it gets incremented each loop (do this first)
- remove quotes around $seg_surf_only, so no empty arguments get passed to run_fastsurfer.sh
srun_fastsurfer.sh
- Add doc to --work (should be empty and 'owned by this script')
- add comments for $delete_hpc_work
- make sure the slurm segmentation script returns exit code 0
- reduce default values for memory allocation
- add log and debug functions
- add a log of the slurm submit script (run_fastsurfer) to the logs
- add the --batch argument to the brun_fastsurfer call where necessary
stools.sh
- remove explicit folders in hpc_work, not *
- exit if no --sd is defined in check_out_dir
- add logfile to both make_cleanup_job and make_copy_job
run_fastsurfer.sh
- do not crash on empty arguments (just ignore)
- fix some formatting
- change the delimiter between t1 and additional args to <space>, to make it more consistent
srun_fastsurfer.sh
- fix cleanup: if redoing cases, mv fails because it does not replace existing data/folders, so we ask whether folders should be replaced and then copy (cp + rm -R)
stools.sh
- check_cases_in_out_dir : checks if subject data already exists ... to ask if it should be overwritten
- make_cleanup_job : added cleanup mode: mv or cp
- export cleanup_jobid and copy_jobid
- cleanup the initial log text
brun_fastsurfer.sh
- streamline IFS
- make subject_list work with different newline characters
- use array to store subjects rather than a string
- fix default values of task_id and task_count (now not specified/not specified)
- clean up the for loop (now only iterates through the batch, not all subjects)
srun_fastsurfer.sh
- add parameters for mem
- update time limit for surface pipeline (1mm can take more than 1h on 1 core)
@dkuegler dkuegler force-pushed the slurm/more-commands branch from 4824f38 to cafe4fd on February 23, 2024 21:49
@dkuegler dkuegler merged commit 7aed5f8 into Deep-MI:dev Feb 23, 2024
@dkuegler dkuegler deleted the slurm/more-commands branch February 23, 2024 21:50