Initial sketch for the mriqc/fmriprep singularity based workflow #438

Open. Wants to merge 8 commits into master.
Conversation

yarikoptic (Member) commented Jul 18, 2019

An initial local attempt was slowed down by the issue described in ReproNim/containers#23.

codecov bot commented Jul 18, 2019

Codecov Report

Merging #438 into master will increase coverage by 0.45%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #438      +/-   ##
==========================================
+ Coverage   89.03%   89.49%   +0.45%     
==========================================
  Files         148      148              
  Lines       11863    12114     +251     
==========================================
+ Hits        10562    10841     +279     
+ Misses       1301     1273      -28
Impacted Files Coverage Δ
reproman/interface/tests/test_run.py 99.56% <0%> (-0.44%) ⬇️
reproman/interface/retrace.py 95.23% <0%> (-0.05%) ⬇️
reproman/distributions/vcs.py 95.83% <0%> (-0.02%) ⬇️
reproman/interface/run.py 100% <0%> (ø) ⬆️
reproman/distributions/conda.py 94.16% <0%> (ø) ⬆️
reproman/tests/test_utils.py 93.36% <0%> (+0.06%) ⬆️
reproman/utils.py 86.84% <0%> (+0.08%) ⬆️
reproman/interface/jobs.py 98.95% <0%> (+1.04%) ⬆️
reproman/resource/ssh.py 89.16% <0%> (+1.87%) ⬆️
reproman/support/jobs/tests/test_orchestrators.py 93.06% <0%> (+2.18%) ⬆️
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c52c5e3...57b6b6c.


# Sample run without any parallelization, and doing both levels (participant and group)
reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" \
--jp container=containers/bids-mriqc data/bids data/mriqc participant,group

yarikoptic (Author, Member) commented Aug 1, 2019

so the original datalad command would be

datalad containers-run -n containers/bids-mriqc data/bids data/mriqc participant,group

TODO: add inputs/outputs specification
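
A hedged sketch of what that TODO might look like (the --input/--output paths here are assumptions, mirroring the datasets used elsewhere in this thread):

datalad containers-run -n containers/bids-mriqc \
    --input data/bids --output data/mriqc \
    data/bids data/mriqc participant,group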

# - datalad-container

RM_RESOURCE=smaug
RM_SUB=condor

yarikoptic (Author, Member) commented Aug 1, 2019

so for local execution it could be

RM_RESOURCE=localshell
RM_SUB=local
reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" \
--bp 'thing=thing-*' \
--input '{p[thing]}' \
sh -c 'cat {p[thing]} {p[thing]} >doubled-{p[thing]}'
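
Note that RM_ORC is left unset in the snippet above; one assumption (not stated in the comment) would be to reuse the orchestrator that appears later in this thread:

RM_ORC=datalad-pair-run   # assumed value; the orchestrator used in the condor runs below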

kyleam (Contributor) commented Aug 6, 2019

With the latest push to run-subjobs (ac14277) checked out, try

reproman run --follow -r "${RM_RESOURCE}" --sub "${RM_SUB}" --orc "${RM_ORC}" \
  --jp container=containers/bids-mriqc \
  --bp 'pl=02,13' \
  --input data/bids \
  data/bids data/mriqc participant --participant_label '{p[pl]}'

I was able to get that [*] to successfully run via condor on smaug. As you've already experienced, the management of existing datasets is a bit rough, so you may want to use a fresh dataset.

[*] Or more specifically, this script:

#!/bin/sh
set -eu

cd $(mktemp -d --tmpdir=. ds-XXXX)
datalad create -c text2git .
datalad install -d . ///repronim/containers
datalad install -d . -s https://github.com/ReproNim/ds000003-demo data/bids

mkdir licenses/
echo freesurfer.txt > licenses/.gitignore
cat > licenses/README.md <<EOF

Freesurfer
----------

Place your FreeSurfer license into freesurfer.txt file in this directory.
Visit https://surfer.nmr.mgh.harvard.edu/registration.html to obtain one if
you don't have it yet - it is free.

EOF
datalad save -m "DOC: licenses/ directory stub" licenses/

datalad create -d . data/mriqc

reproman run --resource sm --follow \
         --sub condor --orc datalad-pair-run \
         --jp container=containers/bids-mriqc --bp 'pl=02,13' \
         -i data/bids \
         data/bids data/mriqc participant --participant_label '{p[pl]}'
Unfortunately, the initial run has failed with

	2019-08-15 14:32:13,311 [INFO   ] Waiting on job 1848: running
	2019-08-15 14:32:23,478 [INFO   ] Fetching results for 20190815-142046-33ea
	2019-08-15 14:35:51,720 [INFO   ] Creating run commit in /home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7
	2019-08-15 14:36:06,509 [INFO   ] Unregistered job 20190815-142046-33ea
	+ reproman_run --jp container=containers/bids-mriqc --input data/bids --output data/mriqc "{inputs}" "{outputs}" group
	+ reproman run --follow -r smaug --sub condor --orc datalad-pair-run --jp container=containers/bids-mriqc --input data/bids --output data/mriqc "{inputs}" "{outputs}" group
	2019-08-15 14:36:10,588 [INFO   ] No root directory supplied for smaug; using "/home/yoh/.reproman/run-root"
	[INFO   ] Publishing <Dataset path=/home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7/data/mriqc> to smaug
	ECDSA host key for IP address "129.170.233.9" not in list of known hosts.
	[INFO   ] Publishing <Dataset path=/home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7> to smaug
	[ERROR  ] failed to push to smaug: master -> smaug/master [rejected] (non-fast-forward); pushed: ["d145d97..97de059"] [publish(/home/yoh/proj/repronim/reproman-master/docs/usecases/bids-fmriprep-workflow-NP/out7)]
	2019-08-15 14:36:59,238 [ERROR  ] "datalad publish" failed. Try running "datalad update -s smaug --merge --recursive" first [orchestrators.py:prepare_remote:792] (OrchestratorError)
	CONTAINERS_REPO=~/proj/repronim/containers INPUT_DATASET_REPO=    70.57s user 22.71s system 9% cpu 16:50.41 total

and stderr.1 on the remote end showed that tar failed to find some output files:

    $> tail -n 3 stderr.1
    tar: ./work/workflow_enumerator/anatMRIQCT1w/ComputeIQMs/_in_file_..home..yoh...reproman..run-root..44671e06-bf85-11e9-95c1-8019340ce7f2..data..bids..sub-02..anat..sub-02_T1w.nii.gz/ComputeQI2/_0x9713a172faade86794f9c56a3080a44e_unfinished.json: Cannot stat: No such file or directory
    tar: ./work/workflow_enumerator/anatMRIQCT1w/ComputeIQMs/_in_file_..home..yoh...reproman..run-root..44671e06-bf85-11e9-95c1-8019340ce7f2..data..bids..sub-02..anat..sub-02_T1w.nii.gz/ComputeQI2/error.svg: Cannot stat: No such file or directory
    tar: Exiting with failure status due to previous errors
yarikoptic (Member, Author) commented Aug 16, 2019

@kyleam I have rerun the script and got the same failure due to tar -- could you please confirm that you get the same?

kyleam (Contributor) commented Aug 16, 2019

could you please confirm that you get the same?

Sure, I'll give it a try this afternoon.

kyleam (Contributor) commented Aug 16, 2019

I've triggered it, though I don't yet have an explanation of what's going on.

kyleam added a commit to kyleam/niceman that referenced this pull request Aug 16, 2019
After a command completes, it writes to "status.$subjob".  If, after
completing its command, a subjob sees that the status files for all
the other subjobs are in, it claims responsibility for the
post-processing step.  For the datalad-run orchestrators,
post-processing includes calling `find` to get a list of newly added
files and then calling `tar` with these files as input.

Given that the above procedure waits until each command exits, the
hope is that all the output files are created and any temporary files
will have been cleaned up.  But we're hitting cases [*] where intermediate
files are apparently present for the `find` call but gone by the time `tar`
is called.  This leads to `tar` exiting with a
non-zero status and the post-processing being aborted.

Until someone has a better idea of how to deal with this, instruct
`tar` to exit with zero even if an expected file isn't present.  This
allows post-processing to succeed and the incident will still show up
in the captured stderr.

[*] ReproNim#438 (comment)
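
A minimal sketch of the kind of change the commit describes (the actual patch lives in gh-451; the togethome file list name comes from the comments below, while the exact tar options and archive name here are assumptions):

# stand-in for the orchestrator's own call that collects newly added files
find . -type f > togethome
# --ignore-failed-read asks GNU tar to warn rather than exit non-zero when a
# listed file has vanished in the meantime; the warnings still land on stderr
tar --ignore-failed-read -cf outputs.tar --files-from=togethome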
kyleam (Contributor) commented Aug 16, 2019

I've triggered it, though I don't yet have an explanation of what's going on.

Hmm, with several attempts, I was able to trigger the failure only once. Looking at the successful runs, the togethome file does not include the files that tar is complaining about in the failed runs. The only explanation I have for that is that these are temporary files that, based on the timing of things, might end up getting removed between the find ... >togethome call and the tar call. I've submitted gh-451 as a workaround.

For completeness: In all the above tries, I was using this script, which is a stripped-down version of 5b95ded:docs/usecases/bids-fmriprep-workflow-NP.sh.

set -eu

cd $(mktemp -d --tmpdir=. ds-XXXX)
datalad create -c text2git .
datalad install -d . ///repronim/containers
datalad install -d . -s https://github.com/ReproNim/ds000003-demo data/bids

mkdir licenses/
echo freesurfer.txt > licenses/.gitignore
cat > licenses/README.md <<EOF

Freesurfer
----------

Place your FreeSurfer license into freesurfer.txt file in this directory.
Visit https://surfer.nmr.mgh.harvard.edu/registration.html to obtain one if
you don't have it yet - it is free.

EOF
datalad save -m "DOC: licenses/ directory stub" licenses/

datalad create -d . -c text2git data/mriqc

reproman run --resource sm --follow \
         --sub condor --orc datalad-pair-run \
         --jp container=containers/bids-mriqc --bp 'pl=02,13' \
         -i data/bids -o data/mriqc \
         '{inputs}' '{outputs}' participant --participant_label '{p[pl]}'
yarikoptic (Member, Author) commented Aug 20, 2019

The only explanation I have for that is that these are temporary files that, based on the timing of things, might end up getting removed between the find ... >togethome call and the tar call.

Indeed, with NFS etc. we could probably see even more such cases. But I am still wondering what exactly is happening here, beyond condor possibly returning a "complete" job status before it (and all of its child processes) actually finished, and whether we would miss some results if we rush into collecting/tarring them up. Maybe adding some fuser call to check whether any process is still holding onto that output path, or something similar, would help. I will try to look into it when I get a moment.
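
A rough sketch of that idea (not part of the PR; the output path and retry count are assumptions, and lsof is used here instead of fuser so that the whole directory tree is checked):

outdir=data/mriqc      # assumed output path
tries=30
# wait until no process has files under $outdir open before collecting results
while [ "$tries" -gt 0 ] && lsof +D "$outdir" >/dev/null 2>&1; do
    sleep 2
    tries=$((tries - 1))
done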

kyleam (Contributor) commented Aug 20, 2019

Indeed, with NFS etc. we could probably see even more such cases. But I am still wondering what exactly is happening here

I am still wondering too :)

beyond condor possibly returning a "complete" job status before it (and all of its child processes) actually finished

This status isn't coming from condor. Its creation is chained after the run of the command:

/bin/sh -c "$cmd" && \
echo "succeeded" >"$metadir/status.$subjob" || \
(echo "failed: $?" >"$metadir/status.$subjob";
mkdir -p "$metadir/failed" && touch "$metadir/failed/$subjob")
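
A hedged illustration (not reproman's actual code) of the "last subjob claims post-processing" logic described in the referenced commit message above, assuming the same $metadir variable and a $num_subjobs count:

# count the subjobs that have written their status file so far
n_done=$(ls "$metadir"/status.* 2>/dev/null | wc -l)
if [ "$n_done" -eq "$num_subjobs" ]; then
    # every subjob is done; this one takes over the find/tar post-processing
    echo "all $num_subjobs subjobs finished; starting post-processing"
fi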
