# Versioning data

In this notebook we will look at tools for versioning data, specifically [git-annex](https://git-annex.branchable.com/) and [datalad](http://datalad.org/). We will cover:

- Initializing, searching, downloading, and removing (dropping) a datalad dataset
- Creating and modifying a datalad dataset

### Let's start by using the shell version

The `!` character at the beginning of the next line indicates that the command following it will be executed in a shell. Here we display datalad's help text. The `--help-np` option prevents the help output from being sent to a pager. Since the notebook does not support paging, we will use this option throughout.

In [1]:
!datalad --help-np

Usage: datalad [global-opts] command [command-opts]

DataLad provides a unified data distribution with the convenience of git-annex
repositories as a backend.  DataLad command line tools allow to manipulate
(obtain, create, update, publish, etc.) datasets and their collections.

*Commands for dataset operations*

  create
      Create a new dataset from scratch
  install
      Install a dataset from a (remote) source
  get
      Get any dataset content (files/directories/subdatasets)
  add
      Add files/directories to an existing dataset
  publish
      Publish a dataset to a known sibling
  uninstall
      Uninstall subdatasets
  drop
      Drop file content from datasets
  remove
      Remove components from datasets
  update
      Update a dataset from a sibling
  create-sibling
      Create a dataset sibling on a UNIX-like SSH-accessible machine
  create-sibling-github
      Create dataset sibling on Github
  unlock
      Unlock file(s) of a dataset
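
The `!` prefix only works inside a notebook. As a rough sketch of what it does, the same kind of call can be made from plain Python with the standard `subprocess` module. The helper name `run` is hypothetical, and the command used here is a stand-in so that the example runs even where datalad is not installed.

```python
import subprocess
import sys


def run(cmd):
    """Run a command given as a list of arguments and return its output as text."""
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, check=True
    )
    return result.stdout.decode()


# Stand-in command (an assumption for portability). With datalad installed,
# run(["datalad", "--help-np"]) would return the help text instead.
print(run([sys.executable, "-c", "print('hello from a subprocess')"]))
```

Passing the command as a list avoids shell quoting issues, which matters once paths contain spaces.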

## Exercise 1: Generate the help for the install command of datalad.

In [2]:
!datalad install --help-np

Usage: datalad install [--version] [-h] [-l LEVEL] [--pbs-runner {condor}]
                       [-s SOURCE] [-d DATASET] [-g] [-D DESCRIPTION] [-r]
                       [--recursion-limit LEVELS] [--nosave] [--reckless]
                       [-J NJOBS]
                       [PATH [PATH ...]]

Install a dataset from a (remote) source.

This command creates a local sibling of an existing dataset from a
(remote) location identified via a URL or path. Optional recursion into
potential subdatasets, and download of all referenced data is supported.
The new dataset can be optionally registered in an existing
superdataset by identifying it via the DATASET argument (the new
dataset's path needs to be located within the superdataset for that).

It is recommended to provide a brief description to label the dataset's
nature *and* location, e.g. "Michael's music on black laptop". This helps
humans to identify data locations in distributed scenarios.  By default an
identifier

We will install the datalad meta-dataset, using `///` as shorthand for http://datasets.datalad.org/. Note that this only installs the top-level directory structure.

In [3]:
%%bash

cd /data
datalad install /// 

install(ok): /data/datasets.datalad.org (dataset)


[INFO] Cloning dataset from 'http://datasets.datalad.org/' (trying 2 location candidate(s)) to '/data/datasets.datalad.org' 


## Exercise 2

List the contents of the installed dataset up to three levels deep using the `tree` command.

In [6]:
%%bash

tree -L 3 /data/datasets.datalad.org/

/data/datasets.datalad.org/
├── adhd200
├── corr
├── crcns
├── datapackage.json
├── dbic
├── devel
├── dicoms
├── hbnssi
├── indi
├── kaggle
├── labs
├── neurovault
├── nidm
├── openfmri
└── workshops

14 directories, 1 file


Note that only the top-level datasets are created; their contents have not been installed yet.

Another useful command is `search`, which looks for information across datalad datasets.

## Exercise 3:

Display the help for the `search` command.

In [7]:
!datalad search --help-np

Usage: datalad search [--version] [-h] [-l LEVEL] [--pbs-runner {condor}]
                      [-d DATASET] [-s PROPERTY] [-r PROPERTY] [-R]
                      [-f FORMAT] [--regex]
                      STRING [STRING ...]

Search within available in datasets' meta data

*Arguments*
  STRING                a string (or a regular expression if --regex) to
                        search for in all meta data values. If multiple
                        provided, all must have a match among some fields of a
                        dataset.

*Options*
  --version             show the program's version and license information
  -h, --help, --help-np
                        show this help message. --help-np forcefully disables
                        the use of a pager for displaying the help message
  -l LEVEL, --log-level LEVEL
                        set logging verbosity level. Choose among critical,
                        integer <10 to provide even more debuggin

Now let's search for datasets containing information about Jim Haxby.

In [8]:
!datalad search -d /data/datasets.datalad.org haxby

[INFO   ] Loading and caching local meta-data... might take a few seconds
/data/datasets.datalad.org/labs/haxby/attention
/data/datasets.datalad.org/openfmri/ds000233
/data/datasets.datalad.org/hbnssi
/data/datasets.datalad.org/labs/haxby
/data/datasets.datalad.org/labs/haxby/raiders
/data/datasets.datalad.org/openfmri/ds000105


## Exercise 4:

Search for information about datasets related to Chris Gorgolewski, using `gorgolewski` as the keyword.

In [9]:
!datalad search -d /data/datasets.datalad.org gorgolewski

/data/datasets.datalad.org/openfmri/ds000114
/data/datasets.datalad.org/workshops/nih-2017/ds000114
/data/datasets.datalad.org/openfmri/ds000158
/data/datasets.datalad.org/openfmri/ds000221
/data/datasets.datalad.org/workshops/nipype-2017/ds000114


We will now install dataset ds000114 from the recent NIH workshop. This will help us run some of the other notebooks.

In [10]:
%%bash 

datalad install /data/datasets.datalad.org/workshops/nih-2017/ds000114

install(ok): /data/datasets.datalad.org/workshops (dataset) [Installed subdataset in order to get /data/datasets.datalad.org/workshops/nih-2017/ds000114]
install(ok): /data/datasets.datalad.org/workshops/nih-2017 (dataset) [Installed subdataset in order to get /data/datasets.datalad.org/workshops/nih-2017/ds000114]
install(ok): /data/datasets.datalad.org/workshops/nih-2017/ds000114 (dataset) [Installed subdataset in order to get /data/datasets.datalad.org/workshops/nih-2017/ds000114]
action summary:
  install (ok: 3)


[INFO] Cloning dataset from 'http://datasets.datalad.org/workshops/.git' (trying 1 location candidate(s)) to '/data/datasets.datalad.org/workshops' 
[INFO] Submodule HEAD got detached. Resetting branch master to point to cf3d00f5. Original location was 5f6b24a0 
[INFO] Cloning dataset from 'http://datasets.datalad.org/workshops/nih-2017/.git' (trying 1 location candidate(s)) to '/data/datasets.datalad.org/workshops/nih-2017' 
[INFO] Cloning dataset from 'http://datasets.datalad.org/workshops/nih-2017/ds000114/.git' (trying 2 location candidate(s)) to '/data/datasets.datalad.org/workshops/nih-2017/ds000114' 


## Exercise 5:

Look at the contents of the dataset up to three levels deep using the `tree` command. Note that some files may not be present locally. In a color terminal these are shown in red; in the listing they are symbolic links into `.git/annex/objects` whose targets have not been downloaded yet.

In [11]:
!tree -L 3 /data/datasets.datalad.org/workshops/nih-2017/ds000114

/data/datasets.datalad.org/workshops/nih-2017/ds000114
├── CHANGES
├── dataset_description.json
├── derivatives
│   ├── fmriprep
│   └── freesurfer
├── dwi.bval -> .git/annex/objects/JX/4K/MD5E-s335--5bd6fa32ccd0c79e79f9ac63a2c09c1a.bval/MD5E-s335--5bd6fa32ccd0c79e79f9ac63a2c09c1a.bval
├── dwi.bvec -> .git/annex/objects/Pg/wk/MD5E-s1248--0641c68ff6ee6164928c984541653430.bvec/MD5E-s1248--0641c68ff6ee6164928c984541653430.bvec
├── sub-01
│   ├── ses-retest
│   │   ├── anat
│   │   ├── dwi
│   │   └── func
│   └── ses-test
│       ├── anat
│       ├── dwi
│       └── func
├── sub-02
│   ├── ses-retest
│   │   ├── anat
│   │   ├── dwi
│   │   └── func
│   └── ses-test
│       ├── anat

When we try to display the contents of one of these files, we get `No such file or directory`, because the symbolic link points to content that has not been downloaded yet.

In [12]:
!cat /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

cat: /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval: No such file or directory
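
The error above is what a dangling symbolic link looks like from the shell. A minimal sketch of how one might test for this condition in Python is given below. The function name `content_status` and the scratch file names are hypothetical, and the check only looks at the link itself; it does not ask git-annex anything.

```python
import os
import tempfile


def content_status(path):
    """Classify a path as 'missing', 'present', 'regular', or 'absent'."""
    if os.path.islink(path):
        # os.path.exists follows the link, so it is False for a dangling link.
        return "present" if os.path.exists(path) else "missing"
    return "regular" if os.path.exists(path) else "absent"


# Demonstration in a scratch directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "object")
    link = os.path.join(tmp, "dwi.bval")
    os.symlink(target, link)        # target does not exist yet: dangling link
    before = content_status(link)
    with open(target, "w") as fh:   # simulate the content arriving
        fh.write("0 0 1000\n")
    after = content_status(link)
    print(before, after)
```

The same link goes from `missing` to `present` once its target exists, without the link itself changing.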


Let's use datalad to fetch this file and then display it again.

In [13]:
!datalad get /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

get(ok): /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval (file)


In [14]:
!cat /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

0 0 0 0 0 0 0 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 


In [15]:
!datalad ls /data/datasets.datalad.org/workshops/nih-2017/ds000114

/data/datasets.datalad.org/workshops/nih-2017/ds000114   [annex]  master  2.0.1-11-gf763908b6 2017-08-12/15:55:27  ✓


Since datalad uses git-annex under the hood, let's try to list things with git-annex.

In [16]:
!git-annex list /data/datasets.datalad.org/workshops/nih-2017/ds000114/

git-annex: First run: git-annex init


Oops! git-annex needs to be run from inside the repository; pointing it at a dataset path from somewhere else is not enough. Let's change into the dataset directory first.

In [28]:
%%bash 
cd /data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list sub-01

here
|origin
||web
|||bittorrent
||||datalad-archives
|||||
__X_X sub-01/ses-retest/anat/sub-01_ses-retest_T1w.nii.gz
__X_X sub-01/ses-retest/dwi/sub-01_ses-retest_dwi.nii.gz
__X_X sub-01/ses-retest/func/sub-01_ses-retest_task-covertverbgeneration_bold.nii.gz
__X_X sub-01/ses-retest/func/sub-01_ses-retest_task-fingerfootlips_bold.nii.gz
__X_X sub-01/ses-retest/func/sub-01_ses-retest_task-linebisection_bold.nii.gz
__X_X sub-01/ses-retest/func/sub-01_ses-retest_task-linebisection_events.tsv
__X_X sub-01/ses-retest/func/sub-01_ses-retest_task-overtverbgeneration_bold.nii.gz
__X_X sub-01/ses-retest/func/sub-01_ses-retest_task-overtwordrepetition_bold.nii.gz
__X_X sub-01/ses-test/anat/sub-01_ses-test_T1w.nii.gz
__X_X sub-01/ses-test/dwi/sub-01_ses-test_dwi.nii.gz
__X_X sub-01/ses-test/func/sub-01_ses-test_task-covertverbgeneration_bold.nii.gz
__X_X sub-01/ses-test/func/sub-01_ses-test_task-fingerfootlips_bold.nii.gz
__X_X sub-01/ses-test/func/sub-01_ses-test_task-linebisection_bold.nii.gz
_

## Exercise 6:

Show the help for `git annex list` and then use it to list the `dwi*` files in `ds000114`.

In [26]:
!git-annex list --help

git-annex list - show which remotes contain files

Usage: git-annex list [PATH ...] [--allrepos]

Available options:
  --allrepos               show all repositories, not only remotes
  --force                  allow actions that may lose annexed data
  -F,--fast                avoid slow operations
  -q,--quiet               avoid verbose output
  -v,--verbose             allow verbose output (default)
  -d,--debug               show debug messages
  --no-debug               don't show debug messages
  -b,--backend NAME        specify key-value backend to use
  -N,--numcopies NUMBER    override default number of copies
  --trust REMOTE           override trust setting
  --semitrust REMOTE       override trust setting back to default
  --untrust REMOTE         override trust setting to untrusted
  -c,--config NAME=VALUE   override git configuration setting
  --user-agent NAME        override default User-Agent
  --trust-glacier          Trust Amazon Glacier inventory
  --notify-finish 

In [29]:
%%bash

cd /data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list dwi.*

here
|origin
||web
|||bittorrent
||||datalad-archives
|||||
X_X_X dwi.bval
__X_X dwi.bvec
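
The column layout of this listing can be read mechanically: each header row names one remote, and each `X` in a file row marks a remote that holds the content. The sketch below parses text in that shape. The function name `parse_annex_list` is hypothetical, and the format is inferred only from the output shown above, so other git-annex versions or markers other than `X` and `_` are not handled.

```python
def parse_annex_list(text):
    """Map each file in `git-annex list`-style output to the remotes holding it."""
    remotes, locations = [], {}
    in_header = True
    for line in text.splitlines():
        if not line.strip():
            continue
        if in_header:
            name = line.lstrip("|")
            if name:
                remotes.append(name)
            else:
                in_header = False  # a row made only of "|" ends the header
            continue
        flags, _, filename = line.partition(" ")
        locations[filename] = [r for r, f in zip(remotes, flags) if f == "X"]
    return locations


# Sample copied from the listing above.
sample = "\n".join([
    "here",
    "|origin",
    "||web",
    "|||bittorrent",
    "||||datalad-archives",
    "|||||",
    "X_X_X dwi.bval",
    "__X_X dwi.bvec",
])
for name, where in parse_annex_list(sample).items():
    print(name, where)
```

For the sample, `dwi.bval` is reported in `here`, `web`, and `datalad-archives`, while `dwi.bvec` is only available from the two remotes.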


In [None]:
# help command here

In [5]:
# list dwi* files here

We can also remove content from our local storage using the `drop` command.

In [32]:
!datalad drop /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

drop(ok): /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval (file) [checking http://openneuro.s3.amazonaws.com/ds000114/ds000114_R2.0.0/uncompressed/dwi.bval?versionId=null...]


## Exercise 7:

Check where the dwi files are with the `git-annex list` command and get the missing files.

In [34]:
%%bash

cd /data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list dwi.*

here
|origin
||web
|||bittorrent
||||datalad-archives
|||||
__X_X dwi.bval
__X_X dwi.bvec


In [35]:
%%bash

cd /data/datasets.datalad.org/workshops/nih-2017/ds000114/
datalad get dwi.*

get(ok): /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval (file)
get(ok): /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bvec (file)
action summary:
  get (ok: 2)


## Now we will create and version our own toy dataset

In [56]:
!datalad create /data/mydataset

[INFO   ] Creating a new annex repo at /data/mydataset
create(ok): /data/mydataset (dataset)


We will create a dummy file and add it to the dataset.

In [57]:
%%bash 

echo "123" > /data/mydataset/123
datalad add -m "initial file" /data/mydataset/123

add(ok): /data/mydataset/123 (file)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)


List where the copy is available.

In [58]:
%%bash

cd /data/mydataset
git-annex list

here
|web
||bittorrent
|||
X__ 123


In [59]:
!tree /data/mydataset

/data/mydataset
└── 123 -> .git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f

0 directories, 1 file


In [60]:
!cat /data/mydataset/.git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f

123
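
The object name in the symlink target encodes the backend (`MD5E`), the size in bytes (`s4`), and an MD5 digest of the content. The sketch below rebuilds a name of that shape from the file content. The function name `md5e_key` is hypothetical, and only the size field seen in this notebook is handled; git-annex keys can carry further fields that are ignored here.

```python
import hashlib


def md5e_key(content, extension=""):
    """Build an MD5E-style key name: backend, size in bytes, MD5 digest, extension."""
    digest = hashlib.md5(content).hexdigest()
    return "MD5E-s{}--{}{}".format(len(content), digest, extension)


# `echo "123"` writes the three digits plus a newline, hence four bytes.
print(md5e_key(b"123\n"))
```

Because the name is derived from the content, two files with identical content map to the same object, and any change to the content yields a different object.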


Let us try removing the data. This is the only copy of the file, so git-annex refuses to drop it:

In [61]:
!datalad drop /data/mydataset/123

[ERROR  ] Failed to run ['git', '-c', 'receive.autogc=0', '-c', 'gc.auto=0', 'annex', 'drop', '--json', '123'] under '/data/mydataset'. Exit code=1. out={"command":"drop","wanted":[],"note":"(Use --force to override this check, or adjust numcopies.)","skipped":[],"success":false,"key":"MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f","file":"123"}
|  err=git-annex: drop: 1 failed
|  
[ERROR  ] configured minimum number of copies not found [drop(/data/mydataset/123)]
drop(error): /data/mydataset/123 (file) [configured minimum number of copies not found]
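
The error message refers to a minimum number of copies. As a simplified model of that rule (not git-annex's actual implementation, which also verifies remote copies and takes trust settings into account), a drop is allowed only if enough copies remain elsewhere. The function name `can_drop` and its default of one required copy are assumptions for illustration.

```python
def can_drop(locations, here="here", numcopies=1):
    """Return True if dropping the local copy still leaves `numcopies` copies elsewhere."""
    other_copies = [name for name in locations if name != here]
    return len(other_copies) >= numcopies


# The toy file exists only locally, so dropping it would lose the data.
print(can_drop(["here"]))
# dwi.bval was also listed under web and datalad-archives, so its drop succeeded.
print(can_drop(["here", "web", "datalad-archives"]))
```

This matches what we observed: the file from `ds000114` could be dropped because it can be fetched again, while the only copy of `123` is protected.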


Let us try modifying the file.

In [62]:
!echo "321" > /data/mydataset/123

/usr/bin/sh: 1: cannot create /data/mydataset/123: Permission denied


Annexed files are write-protected, which is why the direct write fails. The proper way to modify such a file is to unlock it, change it, and then add it back to the dataset.

In [63]:
%%bash

datalad unlock /data/mydataset/123
echo "321" > /data/mydataset/123
datalad add -m "add modified file" /data/mydataset/123

unlock(ok): 123 (file)
add(ok): /data/mydataset/123 (file)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)


If we try modifying it again, we once more get `Permission denied`, because adding the file locked it again.

In [64]:
!echo "123" > /data/mydataset/123

/usr/bin/sh: 1: cannot create /data/mydataset/123: Permission denied


In [65]:
!tree /data/mydataset

/data/mydataset
└── 123 -> .git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e

0 directories, 1 file


In [66]:
!cat /data/mydataset/.git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e

321


But the old object is still there. 

In [67]:
!cat /data/mydataset/.git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f

123


The entire history of the repo is available.

In [68]:
%%bash 

cd /data/mydataset/
git log

commit c1fef9ff6e8494847a9a36d27fad5877a181c391
Author: neuro <neuro>
Date:   Wed Aug 16 16:19:23 2017 +0000

    add modified file

commit 8cdcf1758e34c8b559e8cef1b428803e24ee58a1
Author: neuro <neuro>
Date:   Wed Aug 16 16:19:09 2017 +0000

    initial file

commit 7e916743ded0cb337a9064f2c355ca5bc9fc3652
Author: neuro <neuro>
Date:   Wed Aug 16 16:19:07 2017 +0000

    [DATALAD] new dataset

commit be710698193b13aea758791b4b9d8bab0cd89183
Author: neuro <neuro>
Date:   Wed Aug 16 16:19:06 2017 +0000

    [DATALAD] Set default backend for all files to be MD5E
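
When a history grows longer, it can help to pull the commit hashes and subject lines out of the log text, for example to pick a state to inspect. The sketch below does this for log text in the default layout shown above. The function name `parse_git_log` is hypothetical, and the sample is an abridged copy of the log output.

```python
def parse_git_log(text):
    """Extract the hash and the first message line of each commit from `git log` text."""
    commits = []
    for line in text.splitlines():
        if line.startswith("commit "):
            commits.append({"hash": line.split()[1], "subject": None})
        elif line.startswith("    ") and commits and commits[-1]["subject"] is None:
            # Message lines are indented by four spaces in the default layout.
            commits[-1]["subject"] = line.strip()
    return commits


# Abridged sample from the log above (Author lines left out).
sample_log = "\n".join([
    "commit c1fef9ff6e8494847a9a36d27fad5877a181c391",
    "Date:   Wed Aug 16 16:19:23 2017 +0000",
    "",
    "    add modified file",
    "",
    "commit 8cdcf1758e34c8b559e8cef1b428803e24ee58a1",
    "Date:   Wed Aug 16 16:19:09 2017 +0000",
    "",
    "    initial file",
])
for commit in parse_git_log(sample_log):
    print(commit["hash"][:9], commit["subject"])
```

An abbreviated hash, such as the first nine characters printed here, is normally enough to identify a commit in a small repository.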


Let us create a simple script that counts the number of bytes in a file (for plain ASCII text, this equals the number of characters).

In [69]:
%%bash

cd /data/mydataset
mkdir -p scripts
# Write the script with a quoted here-document so that $1 is stored literally
cat > scripts/run.sh << 'EOM'
#!/bin/bash
cat $1 | wc -c
EOM
chmod +x scripts/run.sh
cat scripts/run.sh

#!/bin/bash
cat $1 | wc -c
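
Note that `wc -c` counts bytes, including the trailing newline that `echo` writes. A rough Python equivalent of the script is sketched below; the function name `count_bytes` and the scratch file are hypothetical.

```python
import os
import tempfile


def count_bytes(path):
    """Return the file size in bytes, like `cat FILE | wc -c`."""
    with open(path, "rb") as fh:
        return len(fh.read())


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "123")
    with open(sample, "wb") as fh:
        fh.write(b"321\n")  # what `echo "321"` writes: three digits and a newline
    n_bytes = count_bytes(sample)
    print(n_bytes)
```

For text with multi-byte characters, the byte count is larger than the character count, so the two should not be used interchangeably.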


Now we will run the script and add the script and the output to annex.

In [70]:
%%bash
cd /data/mydataset
scripts/run.sh 123 > out
datalad add -m "Added scripts and output" out scripts

add(ok): /data/mydataset/out (file)
add(ok): /data/mydataset/scripts/run.sh (file)
add(ok): /data/mydataset/scripts (directory)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 3)
  save (ok: 1)


We can look at the log again.

In [71]:
%%bash 

cd /data/mydataset/
git log

commit 8d5e9248cf3347457b9a24891cc1989889391dd8
Author: neuro <neuro>
Date:   Wed Aug 16 16:19:44 2017 +0000

    Added scripts and output

commit c1fef9ff6e8494847a9a36d27fad5877a181c391
Author: neuro <neuro>
Date:   Wed Aug 16 16:19:23 2017 +0000

    add modified file

commit 8cdcf1758e34c8b559e8cef1b428803e24ee58a1
Author: neuro <neuro>
Date:   Wed Aug 16 16:19:09 2017 +0000

    initial file

commit 7e916743ded0cb337a9064f2c355ca5bc9fc3652
Author: neuro <neuro>
Date:   Wed Aug 16 16:19:07 2017 +0000

    [DATALAD] new dataset

commit be710698193b13aea758791b4b9d8bab0cd89183
Author: neuro <neuro>
Date:   Wed Aug 16 16:19:06 2017 +0000

    [DATALAD] Set default backend for all files to be MD5E


We can go back to a previous state and check the `123` file, which then points to the original object again. Note that we return to the current state after this excursion.

In [72]:
%%bash

tree /data/mydataset/

/data/mydataset/
├── 123 -> .git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e
├── out -> .git/annex/objects/6w/1x/MD5E-s2--48a24b70a0b376535542b996af517398/MD5E-s2--48a24b70a0b376535542b996af517398
└── scripts
    └── run.sh -> ../.git/annex/objects/G7/VV/MD5E-s27--ddaa7c667769596874750a6eff28a467.sh/MD5E-s27--ddaa7c667769596874750a6eff28a467.sh

1 directory, 3 files


In [75]:
%%bash

cd /data/mydataset/
git checkout 8cdcf1758
tree /data/mydataset/
git checkout master
tree /data/mydataset/

/data/mydataset/
└── 123 -> .git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f

0 directories, 1 file
/data/mydataset/
├── 123 -> .git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e
├── out -> .git/annex/objects/6w/1x/MD5E-s2--48a24b70a0b376535542b996af517398/MD5E-s2--48a24b70a0b376535542b996af517398
└── scripts
    └── run.sh -> ../.git/annex/objects/G7/VV/MD5E-s27--ddaa7c667769596874750a6eff28a467.sh/MD5E-s27--ddaa7c667769596874750a6eff28a467.sh

1 directory, 3 files


Note: checking out '8cdcf1758'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 8cdcf17... initial file
Previous HEAD position was 8cdcf17... initial file
Switched to branch 'master'


## Exercise 8:

- Copy one binary brainmask image file from `ds000114/derivatives/fmriprep` into `mydataset`.
  - To do so, first install the dataset recursively.
  - Then get the file.
- Add it to version control.
- Use git-annex to list where that file can be found.
- Add a simple Python script to count and print the number of non-zero voxels.
- Store the output in a new output file.
- Use datalad to add everything to the repository.

In [84]:
%%bash

datalad install -r /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/
tree -L 1 /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/

get(notneeded): /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep (dataset) [already installed]
/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/
├── sub-01
├── sub-01.html -> .git/annex/objects/MF/jw/MD5E-s20077561--03ecea8730492d537e050941bdf654bf.html/MD5E-s20077561--03ecea8730492d537e050941bdf654bf.html
├── sub-02
├── sub-02.html -> .git/annex/objects/99/j3/MD5E-s19975906--5ede67fcdad59b65a02f572360db2863.html/MD5E-s19975906--5ede67fcdad59b65a02f572360db2863.html
├── sub-03
├── sub-03.html -> .git/annex/objects/z4/8w/MD5E-s20227534--64e1a981338e8fb9c87f026a79a34785.html/MD5E-s20227534--64e1a981338e8fb9c87f026a79a34785.html
├── sub-04
├── sub-04.html -> .git/annex/objects/qF/J1/MD5E-s22389786--2954e6ece2a825c0008e9b1dcfcaf0a6.html/MD5E-s22389786--2954e6ece2a825c0008e9b1dcfcaf0a6.html
├── sub-05
├── sub-05.html -> .git/annex/objects/6G/Z6/MD5E-s22109848--70a1908c811102744f39b87ae03216a2.html/MD5E-s22109848--70a1908c811102744f39b87a

[INFO] Installing <Dataset path=/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep> recursively 


In [85]:
%%bash

tree -L 1 /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/

/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/
├── sub-01_t1w_brainmask.nii.gz -> ../../.git/annex/objects/jJ/Wz/MD5E-s93130--2572248880aa5978f8d4049feff2282a.nii.gz/MD5E-s93130--2572248880aa5978f8d4049feff2282a.nii.gz
├── sub-01_t1w_class-csf_probtissue.nii.gz -> ../../.git/annex/objects/jg/4g/MD5E-s3066660--67fbe37bd5440773a9951eb116d9192c.nii.gz/MD5E-s3066660--67fbe37bd5440773a9951eb116d9192c.nii.gz
├── sub-01_t1w_class-gm_probtissue.nii.gz -> ../../.git/annex/objects/80/65/MD5E-s3395247--d4e1b01832f3514a796788ff9d814134.nii.gz/MD5E-s3395247--d4e1b01832f3514a796788ff9d814134.nii.gz
├── sub-01_t1w_class-wm_probtissue.nii.gz -> ../../.git/annex/objects/W9/v4/MD5E-s3107921--50711fd1ad6729e9d8f8b0c0611578f8.nii.gz/MD5E-s3107921--50711fd1ad6729e9d8f8b0c0611578f8.nii.gz
├── sub-01_t1w_dtissue.nii.gz -> ../../.git/annex/objects/gZ/qJ/MD5E-s241087--cedef0e31b37f728f09c92c4bce16f61.nii.gz/MD5E-s241087--cedef0e31b37f728f09c92c4bce16f61.nii.gz
├── sub-

In [88]:
%%bash

datalad get /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/sub-01_t1w_brainmask.nii.gz
cp /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/sub-01_t1w_brainmask.nii.gz \
   /data/mydataset/brainmask.nii.gz

get(notneeded): /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/sub-01_t1w_brainmask.nii.gz (file) [already present]


In [90]:
%%writefile /data/mydataset/scripts/count_voxels.py

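import sys
import nibabel as nb
import numpy as np
# Read the array via dataobj; get_data() is deprecated in newer nibabel releases
print(np.count_nonzero(np.asanyarray(nb.load(sys.argv[1]).dataobj)))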

Overwriting /data/mydataset/scripts/count_voxels.py


In [92]:
%%bash

cd /data/mydataset/
python scripts/count_voxels.py brainmask.nii.gz > mask_count

In [93]:
%%bash

tree /data/mydataset/

/data/mydataset/
├── 123 -> .git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e
├── brainmask.nii.gz
├── mask_count
├── out -> .git/annex/objects/6w/1x/MD5E-s2--48a24b70a0b376535542b996af517398/MD5E-s2--48a24b70a0b376535542b996af517398
└── scripts
    ├── count_voxels.py
    └── run.sh -> ../.git/annex/objects/G7/VV/MD5E-s27--ddaa7c667769596874750a6eff28a467.sh/MD5E-s27--ddaa7c667769596874750a6eff28a467.sh

1 directory, 6 files


In [94]:
%%bash

cd /data/mydataset/
datalad add -m "added brainmask, script, and output" brainmask.nii.gz scripts/count_voxels.py mask_count

add(ok): /data/mydataset/brainmask.nii.gz (file)
add(ok): /data/mydataset/scripts/count_voxels.py (file)
add(ok): /data/mydataset/mask_count (file)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 3)
  save (ok: 1)


In [95]:
%%bash
cd /data/mydataset/
git-annex whereis brainmask.nii.gz

whereis brainmask.nii.gz (1 copy) 
  	b0f579af-62f8-4f5d-9187-690bd6e71d34 -- neuro@9600bfbe8fce:/data/mydataset [here]
ok
