"Idealized" versus "good-enough" processing stream #26

Open · jdkent opened this issue May 1, 2019 · 2 comments

jdkent commented May 1, 2019

I'm curious what would be considered an "idealized" reproducible processing stream versus a "good enough" one, and I'd like to identify the tools/skills needed to complete a "good enough" reproducible analysis. Below I list some hypothesized steps and the tools to complete them.

Sparse Learner's Profile

Start from the top: a PI (or someone) hands you a bunch of dicoms and asks you to get subcortical volumes from the structural scans (but there are other, currently irrelevant dicoms as well). The PI also wants to be able to run your analysis and wants the data to be publicly available (assuming all IRB/data sharing agreements are satisfied).

An Idealized Processing Pipeline

I imagine we would be using datalad to record all our data/code/processing steps, and always be using/developing containers from the beginning. I'm not exactly sure where/how to place NIDM annotations of data/results or what tool I should use (PyNIDM?).
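
For concreteness, a minimal datalad-first start might look something like the sketch below (the dataset name and paths are placeholders I made up, not anything settled):

```sh
# Minimal sketch of a datalad-first start; names and paths are hypothetical.
datalad create subcortical-analysis
cd subcortical-analysis

# Bring the dicoms in and record them, so the starting state is under version control.
cp -r /path/to/incoming/dicoms sourcedata
datalad save -m "Add raw dicoms as received"
```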

  • search through and find the relevant dicoms
    • nibabel
    • afni
  • version control the relevant dicoms
    • datalad
    • git-annex
  • convert the dicoms to NIfTI files named according to the BIDS standard (a sketch of this and the defacing step follows this list)
    • heudiconv (via docker/singularity)
    • datalad
  • deface and rename the files
    • pydeface (via docker/singularity)
    • shell
    • datalad
  • write a script that calculates subcortical volumes
    • niflows (via pip/conda env)
    • fsl
    • datalad
  • place the script in a container with all the requisite software installed
    • neurodocker
  • upload the container to a hub (docker and/or singularity)
    • docker
    • singularity
  • run the script on the data and output data in a derivatives directory
    • docker
    • singularity
  • upload the BIDS organized nifti files to some online database
    • openneuro
  • upload the code/outputs to an online repository
    • git
    • github
  • test your code against that uploaded data
    • testkraken
    • circleci
    • travisci
    • shell
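
To make the conversion and defacing steps concrete, here is a hedged sketch; the heudiconv Docker image, heuristic file, subject label, and paths are assumptions for illustration, and the flags should be checked against the versions actually used:

```sh
# Hypothetical sketch: dicom -> BIDS conversion with heudiconv (run via Docker),
# followed by defacing the T1w image with pydeface.
# The image tag, heuristic, subject label, and paths are placeholders.
docker run --rm -v "$PWD":/base nipy/heudiconv:latest \
    -d '/base/sourcedata/{subject}/*/*.dcm' \
    -s 01 \
    -f /base/code/heuristic.py \
    -c dcm2niix -b \
    -o /base/bids

# pydeface writes <input>_defaced.nii.gz by default; rename it back so the
# BIDS file name is preserved.
pydeface bids/sub-01/anat/sub-01_T1w.nii.gz
mv bids/sub-01/anat/sub-01_T1w_defaced.nii.gz \
   bids/sub-01/anat/sub-01_T1w.nii.gz
```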

Good Enough Processing Pipeline

This version removes datalad from the processing stream, removes testing, and removes niflows, but still runs the desired software from within containers.

  • search through and find the relevant dicoms
    • nibabel
    • afni
  • convert the dicoms to NIfTI files named according to the BIDS standard
    • heudiconv (via docker/singularity)
  • deface and rename the files
    • pydeface (via docker/singularity)
    • shell
  • write a script that calculates subcortical volumes (sketched after this list)
    • shell
    • fsl
    • datalad
  • place the script in a container with all the requisite software installed
    • neurodocker
  • upload the container to a hub (docker and/or singularity)
    • docker
    • singularity
  • run the script on the data and output data in a derivatives directory
    • docker
    • singularity
  • upload the BIDS organized nifti files to some online database
    • openneuro
  • upload the code/outputs to an online repository and link to the containers you used
    • git
    • github
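
As a sketch of the "script that calculates subcortical volumes" step, something along these lines could work; it assumes FSL's run_first_all and fslstats are available (e.g. inside the container), and the FIRST label values shown (17/53 for left/right hippocampus) are an assumption to double-check:

```sh
#!/bin/bash
# Hypothetical sketch of a subcortical-volume script built on FSL FIRST.
# Arguments, paths, and the structure labels are placeholders to adapt.
set -euo pipefail

t1=$1        # e.g. bids/sub-01/anat/sub-01_T1w.nii.gz
outdir=$2    # e.g. derivatives/first/sub-01
mkdir -p "$outdir"

# Segment the subcortical structures.
run_first_all -i "$t1" -o "$outdir/first"

# Report voxel count and volume (mm^3) for a couple of structures by
# thresholding the label image around their integer codes.
seg="$outdir/first_all_fast_firstseg"
echo "left hippocampus:"  && fslstats "$seg" -l 16.5 -u 17.5 -V
echo "right hippocampus:" && fslstats "$seg" -l 52.5 -u 53.5 -V
```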

I would like feedback on both the "Idealized" and "Good Enough" analyses, since I am not as knowledgeable as I would like to be about designing processing pipelines. I may not be up to date on which tools are hot/new versus which will simply get the job done.

Once we pin down what we would like workshop attendees to be able to do (and hopefully this matches what they wish to do as well), I think we will have an easier time elucidating the necessary skills and modifying episodes to make sure they help build those skills.

yarikoptic (Member) commented

A fun exercise, thanks! Very much in line with our now-elderly https://github.com/ReproNim/simple_workflow container, which I have recently reused locally for "a script that calculates subcortical volumes" ;) It also aligns well with http://www.repronim.org/5steps .
Instead of a hard split between the two (Idealized/Good enough), it might be better to annotate the steps in the full list with some kind of "importance for reproducibility" score. We could also imagine a pre-crafted workflow (e.g. that simple_workflow, just generalized) which takes care of consuming a bunch of dicoms and performs all actions as a single "unitary step", so particular inner steps might no longer be individually relevant but would still be reflected in the result.

As for particulars, I think a custom heudiconv heuristic could perform the "search" and conversion. So overall a simplified, datalad-centric workflow could be something like

  • datalad create analysis-for-the-pi; cd analysis-for-the-pi
  • datalad create -d . sourcedata && cp ALL_DICOMS sourcedata/
  • datalad install -d . https://github.com/ReproNim/containers/
  • work out a heuristic for heudiconv under code/heudiconv-heuristic.py
  • datalad create -d . -c bids bids # -c bids is coming with 0.12 release of datalad and datalad-neuroimaging some time soonish
  • datalad create -d . -c text2git results
  • datalad containers-run -n containers/heudiconv -f code/heudiconv-heuristic -o bids --files sourcedata (TODO - container: add repronim/ and other additional commonly used images containers#2)
  • Deface! Apparently there is no "official" bids-app yet, but there are a number of defacers available, thus TODO: streamline (bids-app, container, etc.)
  • datalad containers-run -n containers/simple_workflow -i bids -o results + whatever params it consumes (TODO - container: add repronim/ and other additional commonly used images containers#2)
  • when all is good, look into uploading to wherever (datalad create-sibling*, datalad publish) ;)
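
One payoff of recording each step with containers-run is that the command, container, inputs, and outputs end up in the git history, so the PI can re-execute a step verbatim. A rough sketch (the container name follows the list above; the wrapped command is a placeholder, and flag spellings should be checked against the installed datalad-container version):

```sh
# Hypothetical sketch: record the analysis step so it can be replayed later.
# The container name and wrapped command are placeholders.
datalad containers-run \
    -n containers/simple_workflow \
    -m "Compute subcortical volumes" \
    --input bids --output results \
    "bash code/run_subcortical_volumes.sh {inputs} {outputs}"

# Anyone with the dataset (e.g. the PI) can later replay the recorded step:
datalad rerun
```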

satra (Contributor) commented May 1, 2019

@jdkent - continuing on the datalad theme, one place where the nidm model could be integrated is how datalad stores the input, process, output relationships. or as an exporter from the git log.
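
for instance (a rough sketch, assuming datalad run / containers-run keep embedding a machine-readable record in the commit message), an exporter could start by harvesting those records from the git log:

```sh
# rough sketch: list the commits that carry a datalad run record; each
# [DATALAD RUNCMD] commit message embeds a JSON description of the command,
# inputs, and outputs that a NIDM exporter could parse.
git log --grep='DATALAD RUNCMD' --format='%h %s'

# dump the full messages (including the embedded JSON records) for parsing:
git log --grep='DATALAD RUNCMD' --format='%B'
```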

regarding the workflows themselves, reproducibility would come from making them niflows, as you started with the simple1 example.

more broadly, the same data typically gets used for many experiments. different pieces are used to test different hypotheses. thus the graph model of data does make a lot of sense.

perhaps the idealized to good-enough spectrum can be refactored a bit through the lens of the goal of the workflow, highlighting points where things can make a difference. as an example, there is a piece of software that kevin (in my group) is using that only works if the dicoms are converted via spm rather than dcm2niix.
