-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial sketch for the mriqc/fmriprep singularity based workflow #438
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #438 +/- ##
=======================================
Coverage 89.66% 89.66%
=======================================
Files 148 148
Lines 12388 12389 +1
=======================================
+ Hits 11108 11109 +1
Misses 1280 1280
Continue to review full report at Codecov.
|
|
||
# Sample run without any parallelization, and doing both levels (participant and group) | ||
reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" \ | ||
--jp container=containers/bids-mriqc data/bids data/mriqc participant,group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the original, datalad command would be
datalad containers-run -n containers/bids-mriqc data/bids data/mriqc participant,group
TODO : add inputs/outputs specification
# - datalad-container | ||
|
||
RM_RESOURCE=smaug | ||
RM_SUB=condor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so for local execution it could be
RM_RESOURCE=localshell
RM_SUB=local
reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" \ | ||
--bp 'thing=thing-*' \ | ||
--input '{p[thing]}' \ | ||
sh -c 'cat {p[thing]} {p[thing]} >doubled-{p[thing]}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the latest push to run-subjobs (ac14277) checked out, try
reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" \
--jp container=containers/bids-mriqc \
--bp 'pl=02,13' \
--input data/bids \
data/bids data/mriqc participant --participant_label '{p[pl]}'
I was able to get that [*] to successfully run via condor on smaug. As you've already experienced, the management of existing datasets is a bit rough, so you may want to use a fresh dataset.
[*] Or more specifically, this script:
script
#!/bin/sh
set -eu
cd $(mktemp -d --tmpdir=. ds-XXXX)
datalad create -c text2git .
datalad install -d . ///repronim/containers
datalad install -d . -s https://github.com/ReproNim/ds000003-demo data/bids
mkdir licenses/
echo freesurfer.txt > licenses/.gitignore
cat > licenses/README.md <<EOF
Freesurfer
----------
Place your FreeSurfer license into freesurfer.txt file in this directory.
Visit https://surfer.nmr.mgh.harvard.edu/registration.html to obtain one if
you don't have it yet - it is free.
EOF
datalad save -m "DOC: licenses/ directory stub" licenses/
datalad create -d . data/mriqc
reproman run --resource sm --follow \
--sub condor --orc datalad-pair-run \
--jp container=containers/bids-mriqc --bp 'pl=02,13' \
-i data/bids \
data/bids data/mriqc participant --participant_label '{p[pl]}'
Unfortunately initial run has failed with 2019-08-15 14:32:13,311 [INFO ] Waiting on job 1848: running 2019-08-15 14:32:23,478 [INFO ] Fetching results for 20190815-142046-33ea 2019-08-15 14:35:51,720 [INFO ] Creating run commit in /home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7 2019-08-15 14:36:06,509 [INFO ] Unregistered job 20190815-142046-33ea + reproman_run --jp container=containers/bids-mriqc --input data/bids --output data/mriqc "{inputs}" "{outputs}" group + reproman run --follow -r smaug --sub condor --orc datalad-pair-run --jp container=containers/bids-mriqc --input data/bids --output data/mriqc "{inputs}" "{outputs}" group 2019-08-15 14:36:10,588 [INFO ] No root directory supplied for smaug; using "/home/yoh/.reproman/run-root" [INFO ] Publishing <Dataset path=/home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7/data/mriqc> to smaug ECDSA host key for IP address "129.170.233.9" not in list of known hosts. [INFO ] Publishing <Dataset path=/home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7> to smaug [ERROR ] failed to push to smaug: master -> smaug/master [rejected] (non-fast-forward); pushed: ["d145d97..97de059"] [publish(/home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7)] 2019-08-15 14:36:59,238 [ERROR ] "datalad publish" failed. Try running "datalad update -s smaug --merge --recursive" first [orchestrators.py:prepare_remote:792] (OrchestratorError) CONTAINERS_REPO=~/proj/repronim/containers INPUT_DATASET_REPO= 70.57s user 22.71s system 9% cpu 16:50.41 total and stderr.1 on remote end showed that tar failed to find some output file: $> tail -n 3 stderr.1 tar: ./work/workflow_enumerator/anatMRIQCT1w/ComputeIQMs/_in_file_..home..yoh...reproman..run-root..44671e06-bf85-11e9-95c1-8019340ce7f2..data..bids..sub-02..anat..sub-02_T1w.nii.gz/ComputeQI2/_0x9713a172faade86794f9c56a3080a44e_unfinished.json: Cannot stat: No such file or directory tar: ./work/workflow_enumerator/anatMRIQCT1w/ComputeIQMs/_in_file_..home..yoh...reproman..run-root..44671e06-bf85-11e9-95c1-8019340ce7f2..data..bids..sub-02..anat..sub-02_T1w.nii.gz/ComputeQI2/error.svg: Cannot stat: No such file or directory tar: Exiting with failure status due to previous errors
@kyleam I have reran the script and got the same failure due to tar -- could you please confirm that you get the same? |
@kyleam I have reran the script and got the same failure due to tar -- could you please confirm that you get the same?
Sure, I'll give it a try this afternoon.
|
I've triggered it, though I don't yet have an explanation of what's going on. |
After a command completes, it writes to "status.$subjob". If, after completing its command, a subjob sees that the status files for all the other subjobs are in, it claims responsibility for the post-processing step. For the datalad-run orchestrators, post-processing includes calling `find` to get a list of newly added files and then calling `tar` with these files as input. Given that the above procedure waits until each command exits, the hope is that all the output files are created and any temporary files will have been cleaned up. But we're hitting into cases [*] where apparently intermediate files are present for the `find` call but gone by the time `tar` is called. This leads to `tar` exiting with a non-zero status and the post-processing being aborted. Until someone has a better idea of how to deal with this, instruct `tar` to exit with zero even if an expected file isn't present. This allows post-processing to succeed and the incident will still show up in the captured stderr. [*] ReproNim#438 (comment)
Hmm, with several attempts, I was able to trigger the failure only once. Looking at the successful runs, the togethome file does not include the files that tar is complaining about in the failed runs. The only explanation I have for that is that these are temporary files that, based on the timing of things, might end getting removed between the scriptFor completeness: In all the above tries, I was using this script, which is a stripped-down version of 5b95ded:docs/usecases/bids-fmriprep-workflow-NP.sh. set -eu
cd $(mktemp -d --tmpdir=. ds-XXXX)
datalad create -c text2git .
datalad install -d . ///repronim/containers
datalad install -d . -s https://github.com/ReproNim/ds000003-demo data/bids
mkdir licenses/
echo freesurfer.txt > licenses/.gitignore
cat > licenses/README.md <<EOF
Freesurfer
----------
Place your FreeSurfer license into freesurfer.txt file in this directory.
Visit https://surfer.nmr.mgh.harvard.edu/registration.html to obtain one if
you don't have it yet - it is free.
EOF
datalad save -m "DOC: licenses/ directory stub" licenses/
datalad create -d . -c text2git data/mriqc
reproman run --resource sm --follow \
--sub condor --orc datalad-pair-run \
--jp container=containers/bids-mriqc --bp 'pl=02,13' \
-i data/bids -o data/mriqc \
'{inputs}' '{outputs}' participant --participant_label '{p[pl]}'
|
indeed probably with NFS etc we could see even more of such usecases. But I am still wondering what is exactly happening here besides may be condor returning "complete" job status before it (and all kids) actually finished, and either we wouldn't miss some results if we rush into collecting/tarring them up. May be adding some |
I am still wondering too :)
This status isn't coming from condor. Its creation is chained after the run of the command:
|
…tend how to deal with freesurfer license
fixed typo in ".../licenses/..."
also uses new freeze_version --save-dataset helper from containers etc
Ran current state (v0.2.1-43-ga81e457) on smaug as
and it finished nicely (although there was a puke from fmriprep's atexit about "OSError: handle is closed" which I guess was just ignored), and everything looked kosher BUT "data/fmriprep//fmriprep/dataset_description.json" was added to git-annex not git! It is strange. The whole "data/fmriprep" was initiated with $> file --mime-type -Lk 1.json
1.json: application/json\012- text/plain |
oh hoh -- that is a good old datalad/datalad#3361 which we didn't really address although git annex provided a way already :-/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment that it was a workaround
@@ -222,6 +222,7 @@ containers/scripts/freeze_versions --save-dataset=^ \ | |||
poldracklab-ds003-example=0.0.3 \ | |||
bids-mriqc=0.15.0 \ | |||
bids-fmriprep=1.4.1 | |||
( cd containers/ ; git clean -dfx ; ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this is a temporary workaround to remove .datalad/config-e created by above freeze_versions script. Proper solution sounds fix that script to stay compatible with osx.
When running the usecase example ReproNimgh-438, the working tree is coming up as unexpectedly dirty, including the change of status files from empty to "succeed", apparently indicating that the content of the file is not flushed at the time of the 'datalad add/save' call. To prevent saving before the content is flushed, don't count a status file unless it has the content. Modified-by: Kyle Meyer <kyle@kyleam.com> Add commit message body, and dropped `find` call, as suggested by @yarikoptic.
* origin/master: BF: runscript - only consider statuses for success/failure Fix runscript regexp to work on Mac OS Conflicts: reproman/support/jobs/job_templates/runscript/base.template.sh - took master version since it incorporated Chris' fixup for OSX
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Mac sed doesn't handle tabs as linux does, so we start by converting the TSV file to CSV, then process from there.
For some reason otherwise it immediately fails on smaug if I do not do that ... Installing setuptools, pkg_resources, pip, wheel...done. > source venv3/bin/activate >> deactivate nondestructive >> unset -f pydoc >> '[' -z '' ']' >> '[' -z '' ']' >> '[' -n /bin/bash ']' >> hash -r >> '[' -z '' ']' >> unset VIRTUAL_ENV >> '[' '!' nondestructive = nondestructive ']' >> VIRTUAL_ENV=/home/yoh/.tmp/rm-TIAZarc/venv3 >> export VIRTUAL_ENV >> _OLD_VIRTUAL_PATH=/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/sbin:/usr/sbin:/ usr/local/sbin >> PATH=/home/yoh/.tmp/rm-TIAZarc/venv3/bin:/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/sbin:/usr/sbin:/usr/local/sbin >> export PATH >> '[' -z '' ']' >> '[' -z '' ']' venv3/bin/activate: line 57: PS1: unbound variable >>> echo Finished for setup=pip under PWD=/home/yoh/.tmp/rm-TIAZarc Finished for setup=pip under PWD=/home/yoh/.tmp/rm-TIAZarc
…gure local resource
locally had 0.12.6 and it might be interfering
…) into doc-usecases * origin/master: (21 commits) MNT: setup.py: Add optional DataLad dependency ENH: Raise a dedicated JobError if any subjob fails ENH: run: Specify constraints for --sub and --orc ENH: Submitter.follow - a smart --follow sleeping/reporting ENH: runcsript - print "Waiting" once and pool each second ENH: runscript - remove no longer necessary and anyways ineffective sleep 1 ENH: skip with a warning a job which fails to render due to a missing key BF: orchestrators: Call create-sibling unconditionally ENH: orchestrators: Provide access to datalad-publish results ENH: orchestrators(datalad-pair): Do a more targeted update ENH: orchestrators: Make head_at() helper abort on submodule changes BF: orchestrators: Fetch with --recurse-submodules=no BF: orchestrators: Initialize submodules ENH: orchestrators: Meld handling of local and SSH sessions RF: test_orchestrators: Use DataLad's repo.call_git() MNT: orchestrators: Drop warning about an unsupported DataLad version RF: orchestrators: Adjust datalad run imports RF: orchestrators: Prune now unneeded SSH URL compatibility kludge RF: run: Use 'datalad save' instead of 'datalad add' RF: orchestrators: Check whether datalad >=0.13 is installed locally ...
…ly (not as {inputs}) That should allow to upload all needed licenses to the submitting host
An initial local attempt was slowed down by WTF of ReproNim/containers#23
Preprint which inspired the fmriprep workflow here is http://dx.doi.org/10.1101/694364
Edit: an example of prototypical use of reproman for mriqc is now formalized in README: https://github.com/ReproNim/reproman#step-2-create-analysis-datalad-dataset-and-run-computation-on-aws-hpc2 . I will keep this PR open until we furnish a full workflow which would also include fmriprep (thus handling secret license for matlab) and then some linear modeling