
Conversation

@dkuegler (Member) commented Jan 22, 2024

A current limitation of both srun_fastsurfer and brun_fastsurfer is that it is not possible to pass additional, subject-specific parameters to run_fastsurfer.sh.
This PR adds the ability to specify such options to both srun_fastsurfer and brun_fastsurfer.

In particular, brun_fastsurfer.sh can now carry additional per-subject parameters (format =,). These are also passed through to srun_fastsurfer, for example:
Example 1:
`--subject_list_delim "," --subject_list_awk_code_args '\$2 ",--vox_size," \$4'`
translates the subject_list line
`subject-101,raw/T1w-101A.nii.gz,study-1,0.9`
into
`--sid subject-101 --t1 <data-path>/raw/T1w-101A.nii.gz --vox_size 0.9`
Example 2:
`--subject_list_delim "," --subject_list_awk_code '\$2' --subject_list_awk_code_args '\$4,--asegdkt_segfile,\$3'`
translates the subject_list line
`4,sub_A1,mri/asegdkt.mgz,412463-t1.nii.gz`
into
`--sid sub_A1 --t1 <data-path>/412463-t1.nii.gz --asegdkt_segfile mri/asegdkt.mgz`
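
For reference, a minimal sketch of what this translation amounts to for Example 1 (illustration only, not the actual code in stools.sh; the default sid field and all variable names here are assumptions, and on the real command line the `$` signs are backslash-escaped rather than single-quoted):

```bash
# Illustration of the Example 1 translation; not the actual stools.sh code.
subject_list_delim=","
awk_code_sid='$1'                     # assumed default: first field is the subject id
awk_code_args='$2 ",--vox_size," $4'  # second field: T1 image, fourth field: voxel size

line="subject-101,raw/T1w-101A.nii.gz,study-1,0.9"
sid=$(echo "$line" | awk -F "$subject_list_delim" "{print $awk_code_sid}")
args=$(echo "$line" | awk -F "$subject_list_delim" "{print $awk_code_args}")
IFS="," read -r -a extra <<< "$args"  # split the generated argument string on ","

# prints: --sid subject-101 --t1 <data-path>/raw/T1w-101A.nii.gz --vox_size 0.9
echo --sid "$sid" --t1 "<data-path>/${extra[0]}" "${extra[@]:1}"
```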

@dkuegler dkuegler force-pushed the slurm/more-commands branch 3 times, most recently from 35dacc0 to 06a1c76 on January 22, 2024 15:47
@dkuegler dkuegler requested a review from ClePol January 22, 2024 15:54
@dkuegler dkuegler marked this pull request as ready for review January 23, 2024 11:03
@dkuegler (Member Author)

The PR is tested and can be reviewed now.

@ClePol (Member) commented Jan 25, 2024

Thanks for the PR, it's working well out of the box now.
I ran into two issues while testing:

  1. I am not sure what is supposed to happen when a job crashes. Right now it seems that if the FastSurferCNN module crashes (e.g. QC doesn't pass), the slurm jobs become zombies, because the dependency is never met. Is this expected behaviour?
    It would be great if the cleanup job were executed in any case, so we don't have to hunt down the jobs and files on the slurm cluster and cancel them manually (see the sketch after this list).
  2. The HPC work directory cannot be modified while the processing is running, because any files placed there would be deleted. We could instead check for and remove only the folders that are created ("cases", "scripts", "logs", "images"). I think this only needs modifications in the check_hpcwork and make_cleanup_job functions.
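
One way to make sure the cleanup job runs in any case (regardless of failed processing jobs) would be to submit it with an `afterany` instead of an `afterok` dependency; a minimal sketch with placeholder script names:

```bash
# Sketch only: script names are placeholders, not the jobs generated by srun_fastsurfer.
# "afterany" fires when the surface job array finishes in any state,
# "afterok" only when every task completed successfully.
surf_jobid=$(sbatch --parsable surf_jobarray.sbatch)
sbatch --dependency=afterany:"$surf_jobid" cleanup.sbatch
```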

@ClePol (Member) commented Jan 25, 2024

I believe I found a bug - the surface pipeline cannot run, because --parallel_subjects is passed to brun_fastsurfer.sh without an argument.
Putting `--parallel_subjects -1` (for maximum parallelism) instead of plain `--parallel_subjects` allows the surface pipeline to start.

@dkuegler (Member Author)

`--parallel_subjects` (without a value) is supposed to be a valid setting, equal to max.
I'll look into it.

With regard to your earlier comments:

  1. Initially, if a job crashed, all following jobs were supposed to fail via slurm (the dependency is never fulfilled). But this creates unwanted side effects for the other cases processed in the same batch. So I implemented a monitoring file for the successful completion of the different stages, the subject_success file. Since then, the expected behavior is that cases are skipped if a prior step failed (a sketch follows after this list). But I have not checked this recently, only when I implemented it.
  2. Users are not supposed to modify/edit the work directory, so this is not really a concern. All data should be copied back by the cleanup script to the folder specified via --sd.
    Also, you do not need to specify --work if $HPCWORK is set, as it would be on most clusters.
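
The skip-on-failure logic mentioned in point 1 could look roughly like the following; the name, location and format of the subject_success file as well as the run_fastsurfer.sh call are assumptions for illustration, not the exact implementation in this PR:

```bash
# Hypothetical sketch of skipping a subject whose segmentation stage did not succeed.
hpc_work="/path/to/hpc_work"                  # placeholder
subject_id="subject-101"                      # placeholder
subject_success="$hpc_work/subject_success"   # assumed name/location of the status file

if grep -q "^${subject_id}" "$subject_success" 2>/dev/null; then
  # assumed layout: subject data lives under the "cases" folder of the work directory
  ./run_fastsurfer.sh --surf_only --sid "$subject_id" --sd "$hpc_work/cases"
else
  echo "$subject_id: segmentation did not finish successfully, skipping surfaces"
fi
```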

@dkuegler (Member Author)

There seems to be a `not` missing.

@dkuegler (Member Author) commented Jan 25, 2024

I am a bit surprised about the zombification of jobs, though. brun_fastsurfer is supposed to always return 0 (success):

`# always exit successful`

I'll have to read up in the slurm docs on whether a non-zero return code of a child process supersedes a zero return code of the main process.

@ClePol (Member) commented Jan 25, 2024

> 1. Initially, if a job crashed, all following jobs were supposed to fail via slurm (the dependency is never fulfilled). But this creates unwanted side effects for the other cases processed in the same batch. So I implemented a monitoring file for the successful completion of the different stages, the subject_success file. Since then, the expected behavior is that cases are skipped if a prior step failed. But I have not checked this recently, only when I implemented it.

I think this is broken atm; I can keep an eye on it when I run some batches with failing subjects and let you know what happens currently. IIRC the jobs stayed IDLE indefinitely with "dependency not met" (zombie processes), but I didn't look into it.

> 2. Users are not supposed to modify/edit the work directory, so this is not really a concern. All data should be copied back by the cleanup script to the folder specified via --sd.
>    Also, you do not need to specify --work if $HPCWORK is set, as it would be on most clusters.

The issue is not that files used by the script can't be modified, but that the script essentially blocks the whole HPC work directory, so no other scripts can use it (e.g. what if I want to also run FreeSurfer on slurm at the same time as FastSurfer?). Why not only check and delete the folders that are actually used?
A workaround would be to set --work manually to a subfolder of $HPCWORK (which I didn't do yet, but might need to do in the future).
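
For illustration, such a workaround might look like this (the directory name is arbitrary; only --work and --sd are taken from the discussion above, any further options would be appended as usual):

```bash
# Workaround sketch: give srun_fastsurfer its own empty subdirectory of $HPCWORK,
# so the rest of $HPCWORK stays usable for other pipelines.
work_dir="$HPCWORK/fastsurfer_work_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$work_dir"
./srun_fastsurfer.sh --work "$work_dir" --sd /path/to/subjects_dir
```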

@dkuegler (Member Author)

For proper scheduling between job arrays, there needs to be a dependency between the corresponding seg and surf tasks.
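
Slurm's `aftercorr` dependency type provides exactly this pairing between job-array tasks; a minimal sketch with placeholder script names and array sizes:

```bash
# Sketch: element-wise dependency between two job arrays (scripts/sizes are placeholders).
# With "aftercorr", surf task i can only start once seg task i has completed successfully.
seg_jobid=$(sbatch --parsable --array=1-20 seg_jobarray.sbatch)
sbatch --dependency=aftercorr:"$seg_jobid" --array=1-20 surf_jobarray.sbatch
```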

@ClePol (Member) commented Jan 26, 2024

In case this pops up during testing, just FYI - I ran into an issue on the line

$run_fastsurfer "$seg_surf_only" "${args[@]}" "${POSITIONAL_FASTSURFER[@]}" | prepend "$subject_id: " &

where there seemed to be some issue with the formatting of the srun command. I did a quick and dirty workaround by building the command into a string and calling eval on it.

@dkuegler (Member Author)

> In case this pops up during testing, just FYI - I ran into an issue on the line
>
> `$run_fastsurfer "$seg_surf_only" "${args[@]}" "${POSITIONAL_FASTSURFER[@]}" | prepend "$subject_id: " &`
>
> where there seemed to be some issue with the formatting of the srun command. I did a quick and dirty workaround by building the command into a string and calling eval on it.

I don't know exactly what you are running into, but #438 identifies an issue in exactly that line: $seg_surf_only is empty. The question is what error you were getting; in #438 it was "unrecognized option" -- the empty string.
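
The empty-argument effect can be reproduced in isolation; this is a generic bash illustration of the problem and of the usual array-based alternative, not the FastSurfer code itself:

```bash
#!/bin/bash
# Generic illustration of how a quoted-but-empty variable becomes an empty argument.
print_args() { printf 'got %d args:' "$#"; printf ' [%s]' "$@"; echo; }

seg_surf_only=""
print_args "$seg_surf_only" --sid subject-101   # got 3 args: [] [--sid] [subject-101]
print_args $seg_surf_only --sid subject-101     # got 2 args: [--sid] [subject-101]

# an array that may be empty avoids both the empty argument and word-splitting issues
opts=()
print_args "${opts[@]}" --sid subject-101       # got 2 args: [--sid] [subject-101]
```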

@dkuegler (Member Author)

Currently, the default option for check_hpc_work is none, i.e. there is no check whether the directory is in fact empty, which can lead to unexpected deletion of files by the cleanup job.

Also, the cleanup job should only delete cases, logs, scripts, and images, not *.
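
A sketch of that narrower cleanup (the hpc_work variable and the folder list are taken from this thread, but the real make_cleanup_job code may differ):

```bash
# Remove only the directories created by the script instead of wiping "$hpc_work"/*.
hpc_work="${hpc_work:-/path/to/hpc_work}"   # assumed to be set by the surrounding script
for d in cases scripts logs images; do
  rm -rf "${hpc_work:?}/$d"                 # ":?" guards against expanding to an empty path
done
```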

@dkuegler (Member Author)

I can confirm that slurm inherits the return code from job steps, i.e. if an srun call fails (for example a surface or a segmentation command), this interrupts the whole processing pipeline.

This can be avoided by manually setting the derived return code with sjobexitmod, see https://slurm.schedmd.com/job_exit_code.html.

It would probably be best to have a verbose error message at the end of the script, but always return 0, so subsequent tasks are always executed.
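
A sketch of that pattern for the generated batch script (structure and flags are illustrative; the real script built by srun_fastsurfer.sh differs, and per the slurm docs linked above the derived exit code may additionally need to be reset with sjobexitmod):

```bash
#!/bin/bash
# Sketch of "verbose error message, but always exit 0" in a slurm batch script.
sd="/path/to/subjects_dir"                      # placeholder
failed_subjects=()

for subject_id in subject-101 subject-102; do   # placeholder subject ids
  if ! srun --ntasks=1 ./run_fastsurfer.sh --seg_only --sid "$subject_id" --sd "$sd"; then
    failed_subjects+=("$subject_id")
  fi
done

if [[ ${#failed_subjects[@]} -gt 0 ]]; then
  echo "ERROR: processing failed for: ${failed_subjects[*]}" >&2
fi
# always exit 0 so dependent (surface/cleanup) jobs are not held back by this job's state
exit 0
```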

- switch subject_list_delim separator to = from : (more consistency with brun_fastsurfer and its subject_list format)

brun_fastsurfer.sh
- add capability to add more fastsurfer parameters in the subject_list file

stools.sh
- adapt the awk script so that paths in additional parameters are also mapped into the singularity container
- fix debug printing of parallel_subjects
- fix --parallel_subjects surf= parsing, max
- add subject_id to all parallel processing
- fix debug output of FastSurfer parameters
- copy stools.sh script
- change the (very expert) call interface of --subject_list_awk_code to --subject_list_awk_code_sid and --subject_list_awk_code_args.

stools.sh
- fix translate_cases according to the split of the awk_code snippets
- fix variable escaping in translate_cases awk code
- add debug output for the translated subjects

stools.sh
- fix awk code snippets
- Add redundant parameter names --subjects_list* for all --subject_list* parameters
- fix the run_fastsurfer variable
- add a slurm_task_id option in --batch to fix the --jobarray functionality
- simplify the --parallel_subjects argument parsing
- comment parallel_surf == true ==> seg_surf_only == ""
- fix the argument setting of --statusfile
- fix double usage of variable i, and make sure it gets incremented each loop (do this first)
- remove quotes around $seg_surf_only, so no empty arguments get passed to run_fastsurfer.sh
srun_fastsurfer.sh
- Add doc to --work (should be empty and 'owned by this script')
- add comments for $delete_hpc_work
- make sure the slurm segmentation script returns exit code 0
- reduce default values for memory allocation
- add log and debug functions
- add a log of the slurm submit script (run_fastsurfer) to the logs
- add the --batch argument to the brun_fastsurfer call where necessary
stools.sh
- remove explicit folders in hpc_work, not *
- exit if no --sd is defined in check_out_dir
- add logfile to both make_cleanup_job and make_copy_job
run_fastsurfer.sh
- do not crash on empty arguments (just ignore)
- fix some formatting
- change the delimiter between t1 and additional args to <space>, to make it more consistent
srun_fastsurfer.sh
- fix cleanup: if redoing cases, mv fails because it does not replace existing data/folders, so we ask whether folders should be replaced and then copy (cp + rm -R)
stools.sh
- check_cases_in_out_dir : checks if subject data already exists ... to ask if it should be overwritten
- make_cleanup_job : added cleanup mode: mv or cp
- export cleanup_jobid and copy_jobid
- cleanup the initial log text
brun_fastsurfer.sh
- streamline IFS
- make subject_list work with different newline characters
- use array to store subjects rather than a string
- fix default values of task_id and task_count (now not specified/not specified)
- clean up the for loop (now only iterates through the batch, not all subjects)
srun_fastsurfer.sh
- add parameters for mem
- update time limit for surface pipeline (1mm can take more than 1h on 1 core)
@dkuegler dkuegler force-pushed the slurm/more-commands branch from 4824f38 to cafe4fd on February 23, 2024 21:49
@dkuegler dkuegler merged commit 7aed5f8 into Deep-MI:dev Feb 23, 2024
@dkuegler dkuegler deleted the slurm/more-commands branch February 23, 2024 21:50