# Versioning data

In this notebook we will look at tools for versioning data, specifically [git-annex](https://git-annex.branchable.com/) and [datalad](http://datalad.org/). We will cover:

- Initializing, searching, downloading, and removing (dropping) a datalad dataset
- Creating and modifying a datalad dataset

### Let's start by using the shell version

The `!` character at the beginning of a line tells the notebook to execute the rest of that line in a shell. In the next cell we display datalad's top-level help. The `--help-np` option prints the help without a pager; since the notebook does not support paging, we will use this option throughout.

In [1]:
!datalad --help-np

Usage: datalad [global-opts] command [command-opts]

Comprehensive data management solution

DataLad provides a unified data distribution system built on the Git
and Git-annex. DataLad command line tools allow to manipulate (obtain,
create, update, publish, etc.) datasets and provide a comprehensive
toolbox for joint management of data and code. Compared to Git/annex
it primarly extends their functionality to transparently and
simultaneously work with multiple inter-related repositories.

*Commands for dataset operations*

  create
      Create a new dataset from scratch
  install
      Install a dataset from a (remote) source
  get
      Get any dataset content (files/directories/subdatasets)
  add
      Add files/directories to an existing dataset
  publish
      Publish a dataset to a known sibling
  uninstall
      Uninstall subdatasets
  drop
      Drop file content from datasets
  remove
      Remove components from datasets
  update
      Update a dataset from a sibling
  create

## Exercise 1: Generate the help for the install command of datalad.

In [2]:
!datalad install --help-np

Usage: datalad install [-h] [-s SOURCE] [-d DATASET] [-g] [-D DESCRIPTION]
                       [-r] [--recursion-limit LEVELS] [--nosave] [--reckless]
                       [-J NJOBS]
                       [PATH [PATH ...]]

Install a dataset from a (remote) source.

This command creates a local sibling of an existing dataset from a
(remote) location identified via a URL or path. Optional recursion into
potential subdatasets, and download of all referenced data is supported.
The new dataset can be optionally registered in an existing
superdataset by identifying it via the DATASET argument (the new
dataset's path needs to be located within the superdataset for that).

It is recommended to provide a brief description to label the dataset's
nature *and* location, e.g. "Michael's music on black laptop". This helps
humans to identify data locations in distributed scenarios.  By default an
identifier comprised of user and machine name, plus path will be generated.

When only partial dat

We will install the datalad superdataset; `///` is shorthand for http://datasets.datalad.org/. Note that this installs only the top-level directory structure.

In [3]:
%%bash

cd /data
datalad install /// 

install(ok): /data/datasets.datalad.org (dataset)


[INFO] Cloning http://datasets.datalad.org/ to '/data/datasets.datalad.org' 


## Exercise 2

List the contents of the installed dataset up to three levels deep using the `tree` command.

In [4]:
%%bash

tree -L 3 /data/datasets.datalad.org/

/data/datasets.datalad.org/
├── abide
├── abide2
├── adhd200
├── corr
├── crcns
├── datapackage.json
├── dbic
├── devel
├── dicoms
├── hbnssi
├── indi
├── kaggle
├── labs
├── neurovault
├── nidm
├── openfmri
└── workshops

16 directories, 1 file


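We note that only the top-level entries are created: even though we asked `tree` for three levels, the subdatasets themselves have not been installed yet.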

Another useful command is to `search` for information across datalad datasets.

## Exercise 3:

Display the help for the search command

In [5]:
!datalad search --help-np

Usage: datalad search [-h] [-d DATASET] [--reindex]
                      [--max-nresults MAX_NRESULTS]
                      [--mode {egrep,textblob,autofield}] [--full-record]
                      [--show-keys {name,short,full}] [--show-query]
                      [QUERY [QUERY ...]]

Search dataset metadata

DataLad can search metadata extracted from a dataset and/or aggregated into
a superdataset (see the AGGREGATE-METADATA command). This makes it
possible to discover datasets, or individual files in a dataset even when
they are not available locally.

Ultimately DataLad metadata are a graph of linked data structures. However,
this command does not (yet) support queries that can exploit all information
stored in the metadata. At the moment three search modes are implemented that
represent different trade-offs between the expressiveness of a query and
the computational and storage resources required to execute a query.

- egrep (default)

- textblob

- autofield

An alternative de

Now let's search for datasets containing information about Jim Haxby.

In [6]:
!datalad search -d /data/datasets.datalad.org haxby

[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/hbnssi ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/labs ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/labs/gobbini/famface/data ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/labs/haxby ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/labs/haxby/attention ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/labs/haxby/life ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/labs/haxby/raiders ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/openfmri/ds000105 ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/openfmri/ds000233 ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/openfmri/ds000241 ([1;35mdataset[0m)
action summary:
  search (ok: 1

## Exercise 4:

Search for information about datasets related to Chris Gorgolewski, using `gorgolewski` as the keyword.

In [7]:
!datalad search -d /data/datasets.datalad.org gorgolewski

[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/openfmri/ds000114 ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/openfmri/ds000158 ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/openfmri/ds000221 ([1;35mdataset[0m)
[1;1msearch[0m([1;32mok[0m): /data/datasets.datalad.org/workshops/nipype-2017/ds000114 ([1;35mdataset[0m)
action summary:
  search (ok: 4)
[0m

We will now install dataset ds000114 from the recent NIH workshop. This will help us run some of the other notebooks.

In [8]:
%%bash 

datalad install /data/datasets.datalad.org/workshops/nih-2017/ds000114

install(ok): /data/datasets.datalad.org/workshops/nih-2017/ds000114 (dataset) [Installed subdataset in order to get /data/datasets.datalad.org/workshops/nih-2017/ds000114]
action summary:
  install (ok: 3)


[INFO] Cloning http://datasets.datalad.org/workshops/.git to '/data/datasets.datalad.org/workshops' 
[INFO] Cloning http://datasets.datalad.org/workshops/nih-2017/.git to '/data/datasets.datalad.org/workshops/nih-2017' 
[INFO] Cloning http://datasets.datalad.org/workshops/nih-2017/ds000114/.git to '/data/datasets.datalad.org/workshops/nih-2017/ds000114' 
[INFO] access to dataset sibling "datalad" not auto-enabled, enable with:
| 		datalad siblings -d "/data/datasets.datalad.org/workshops/nih-2017/ds000114" enable -s datalad 


## Exercise 5:

Look at the contents of the dataset up to three levels deep using the `tree` command. Note that there may be files that are not present locally. These will be indicated in red.

In [9]:
!tree -L 3 /data/datasets.datalad.org/workshops/nih-2017/ds000114

/data/datasets.datalad.org/workshops/nih-2017/ds000114
├── CHANGES
├── dataset_description.json
├── derivatives
│   ├── fmriprep
│   └── freesurfer
├── dwi.bval -> .git/annex/objects/JX/4K/MD5E-s335--5bd6fa32ccd0c79e79f9ac63a2c09c1a.bval/MD5E-s335--5bd6fa32ccd0c79e79f9ac63a2c09c1a.bval
├── dwi.bvec -> .git/annex/objects/Pg/wk/MD5E-s1248--0641c68ff6ee6164928c984541653430.bvec/MD5E-s1248--0641c68ff6ee6164928c984541653430.bvec
├── sub-01
│   ├── ses-retest
│   │   ├── anat
│   │   ├── dwi
│   │   └── func
│   └── ses-test
│       ├── anat
│       ├── dwi
│       └── func
├── sub-02
│   ├── ses-retest
│   │   ├── anat
│   │   ├── dwi
│   │   └── func
│   └── ses-test
│       ├── anat
│       ├── dwi
│       └── func
├── sub-03
│   ├── ses-retest
│   │   ├── anat
│   │   ├── dwi
│   │   └── func
│   └── ses-test
│       ├── anat
│       ├── dwi
│       └── func
├── sub-04
│   ├── ses-retest
│   │   ├── anat
│   │   ├── dwi
│   │   └── func
│   └── ses-test
│       ├── anat
│       ├── dwi
│

When we try to print the contents of one of these files we get `No such file or directory`: the file is a symlink into the annex, and its content has not been downloaded yet.

In [10]:
!cat /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

cat: /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval: No such file or directory
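
The error above is what a dangling symlink looks like to any program. As a minimal sketch (using a temporary directory and made-up paths, not the real dataset; `describe_path` is a hypothetical helper), the following Python snippet shows how a link can exist while its target content does not, and how the same link becomes readable once the target appears:

```python
import os
import tempfile


def describe_path(path):
    """Classify a path the way an annexed file can appear on disk."""
    if os.path.islink(path) and not os.path.exists(path):
        return "symlink present, content missing"
    if os.path.islink(path):
        return "symlink present, content available"
    if os.path.exists(path):
        return "regular file or directory"
    return "no such path"


with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "dwi.bval")
    target = os.path.join(tmp, "objects", "content")
    os.symlink(target, link)  # the target does not exist yet
    before = describe_path(link)

    # Simulate fetching the content: create the file the link points to.
    os.makedirs(os.path.dirname(target))
    with open(target, "w") as fh:
        fh.write("0 0 0\n")
    after = describe_path(link)

print(before)
print(after)
```

`os.path.exists` follows symlinks, which is why it reports a dangling link as missing even though `os.path.islink` sees it.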


Let's use datalad to fetch this file and then print it again.

In [11]:
!datalad get /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

[1;1mget[0m([1;32mok[0m): /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval ([1;35mfile[0m) [from origin...]
[0m

In [12]:
!cat /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

0 0 0 0 0 0 0 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 


In [13]:
!datalad ls /data/datasets.datalad.org/workshops/nih-2017/ds000114

/data/datasets.datalad.org/workshops/nih-2017/ds000114   [annex]  master  2.0.1-13-g26d1fc2e0 2018-06-08/11:33:43  ✓

Since datalad uses git-annex under the hood, let's try to list things with git-annex.

In [14]:
!git-annex list /data/datasets.datalad.org/workshops/nih-2017/ds000114/

git-annex: First run: git-annex init


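Oops! git-annex could not find a repository: like git, it is normally run from inside the repository (here, the dataset directory) rather than being pointed at it from elsewhere. Let's change into the dataset directory first.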

In [15]:
%%bash 
cd /data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list sub-01

here
|origin
||web
|||bittorrent
||||datalad-archives
|||||
_XX_X sub-01/ses-retest/anat/sub-01_ses-retest_T1w.nii.gz
_XX_X sub-01/ses-retest/dwi/sub-01_ses-retest_dwi.nii.gz
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-covertverbgeneration_bold.nii.gz
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-fingerfootlips_bold.nii.gz
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-linebisection_bold.nii.gz
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-linebisection_events.tsv
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-overtverbgeneration_bold.nii.gz
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-overtwordrepetition_bold.nii.gz
_XX_X sub-01/ses-test/anat/sub-01_ses-test_T1w.nii.gz
_XX_X sub-01/ses-test/dwi/sub-01_ses-test_dwi.nii.gz
_XX_X sub-01/ses-test/func/sub-01_ses-test_task-covertverbgeneration_bold.nii.gz
_XX_X sub-01/ses-test/func/sub-01_ses-test_task-fingerfootlips_bold.nii.gz
_XX_X sub-01/ses-test/func/sub-01_ses-test_task-linebisection_bold.nii.gz
_

## Exercise 6:

Show the help for `git annex list` and then use it to list the `dwi*` files in `ds000114`

In [16]:
!git-annex list --help

git-annex list - show which remotes contain files

Usage: git-annex list [PATH ...] [--allrepos]

Available options:
  --allrepos               show all repositories, not only remotes
  --force                  allow actions that may lose annexed data
  -F,--fast                avoid slow operations
  -q,--quiet               avoid verbose output
  -v,--verbose             allow verbose output (default)
  -d,--debug               show debug messages
  --no-debug               don't show debug messages
  -b,--backend NAME        specify key-value backend to use
  -N,--numcopies NUMBER    override default number of copies
  --trust REMOTE           override trust setting
  --semitrust REMOTE       override trust setting back to default
  --untrust REMOTE         override trust setting to untrusted
  -c,--config NAME=VALUE   override git configuration setting
  --user-agent NAME        override default User-Agent
  --trust-glacier          Trust Amazon Glacier inventory
  --notify-finish 

In [17]:
%%bash

cd /data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list dwi.*

here
|origin
||web
|||bittorrent
||||datalad-archives
|||||
XXX_X dwi.bval
_XX_X dwi.bvec
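
Reading the `git-annex list` table takes a moment: each header line names one repository, and each `X` or `_` in a row says whether that repository holds the file's content. The sketch below parses the output shown above into a dictionary. It assumes exactly the layout seen in this notebook, and `parse_annex_list` is a hypothetical helper, not part of git-annex or datalad:

```python
def parse_annex_list(text):
    """Map each file to the repositories that hold its content.

    Assumes one header line per repository (prefixed by '|' characters),
    then a line made only of '|', then one row per file.
    """
    remotes, files = [], {}
    in_header = True
    for line in text.strip().splitlines():
        if in_header:
            if set(line) == {"|"}:
                in_header = False  # separator line ends the header
            else:
                remotes.append(line.lstrip("|"))
            continue
        flags, name = line.split(" ", 1)
        files[name] = [r for r, f in zip(remotes, flags) if f == "X"]
    return files


sample = """here
|origin
||web
|||bittorrent
||||datalad-archives
|||||
XXX_X dwi.bval
_XX_X dwi.bvec
"""

locations = parse_annex_list(sample)
for name, where in locations.items():
    print(name, "->", ", ".join(where))
```

In this sample, `dwi.bval` is available `here` because we fetched it earlier, while `dwi.bvec` is only available from remotes.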


In [18]:
# help command here

In [19]:
# list dwi* files here

We can also remove content from our local storage using the `drop` command.

In [20]:
!datalad drop /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

[1;1mdrop[0m([1;32mok[0m): /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval ([1;35mfile[0m) [checking origin...]
[0m

## Exercise 7:

Check where the dwi files are with the `git-annex list` command and get the missing files.

In [21]:
%%bash

cd /data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list dwi.*

here
|origin
||web
|||bittorrent
||||datalad-archives
|||||
_XX_X dwi.bval
_XX_X dwi.bvec


In [22]:
%%bash

cd /data/datasets.datalad.org/workshops/nih-2017/ds000114/
datalad get dwi.*

Total (1 ok out of 2):  79%|███████▉  | 1.25k/1.58k [00:00<00:00, 4.18kB/s]
Total (2 ok out of 2): 100%|██████████| 1.58k/1.58k [00:00<00:00, 3.72kB/s]
get(ok): /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bvec (file) [from origin...]
get(ok): /data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval (file) [from web...]
action summary:
  get (ok: 2)


## Now we will create and version our own toy dataset

In [23]:
!datalad create /data/mydataset

[INFO   ] Creating a new annex repo at /data/mydataset
create(ok): /data/mydataset (dataset)

We will create a dummy file and add it to the dataset

In [24]:
%%bash 

echo "123" > /data/mydataset/123
datalad add -m "initial file" /data/mydataset/123

add(ok): /data/mydataset/123 (file)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)


List where the copy is available:

In [25]:
%%bash

cd /data/mydataset
git-annex list

here
|web
||bittorrent
|||
X__ 123


In [26]:
!tree /data/mydataset

/data/mydataset
└── 123 -> .git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f

0 directories, 1 file


In [27]:
!cat /data/mydataset/.git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f

123
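
The symlink target name encodes how the content is identified: a backend name (`MD5E`), the size in bytes (`s4`), and a checksum, optionally followed by a file extension. As a rough sketch of that naming pattern (`md5e_key` is a hypothetical helper, and keeping only the last extension is a simplification; git-annex has its own rules, as the `.nii.gz` keys elsewhere in this notebook show), one can rebuild such a name in Python and compare it with the `tree` output:

```python
import hashlib
import os


def md5e_key(content: bytes, filename: str) -> str:
    """Build a name in the MD5E-s<size>--<md5><ext> pattern."""
    digest = hashlib.md5(content).hexdigest()
    ext = os.path.splitext(filename)[1]  # simplification: last extension only
    return f"MD5E-s{len(content)}--{digest}{ext}"


# `echo "123"` writes the three digits plus a newline, hence 4 bytes.
key = md5e_key(b"123\n", "123")
print(key)
```

Because the name depends only on the content (and extension), two files with identical content share one object in the annex.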


Let us try dropping the data. This fails because git-annex cannot verify that a copy exists anywhere else.

In [28]:
!datalad drop /data/mydataset/123

 
[ERROR  ] unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.) [drop(/data/mydataset/123)]
drop(error): /data/mydataset/123 (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.)]
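
The refusal follows a copy-count rule: content may be dropped locally only if enough copies exist elsewhere. A toy sketch of that rule, using location names as they appear in `git-annex list` output (`can_drop` is a hypothetical helper; real git-annex actively verifies remote copies instead of trusting a list):

```python
def can_drop(locations, numcopies=1, here="here"):
    """Allow dropping only if enough copies exist outside this repository."""
    others = [loc for loc in locations if loc != here]
    return len(others) >= numcopies


# dwi.bval in ds000114 had copies on several remotes, so dropping it worked.
print(can_drop(["here", "origin", "web", "datalad-archives"]))
# The file 123 in mydataset exists only here, so dropping it is refused.
print(can_drop(["here"]))
```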

Let us try modifying the file. This also fails, because annexed content is write-protected.

In [29]:
!echo "321" > /data/mydataset/123

/bin/sh: 1: cannot create /data/mydataset/123: Permission denied


The proper way to modify the file is to unlock it, change it, and then commit it back.

In [30]:
%%bash

datalad unlock /data/mydataset/123
echo "321" > /data/mydataset/123
datalad add -m "add modified file" /data/mydataset/123

unlock(ok): /data/mydataset/123 (file)
add(ok): /data/mydataset/123 (file)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)


If we try modifying it, we now again get permission denied, because the file is locked.

In [31]:
!echo "123" > /data/mydataset/123

/bin/sh: 1: cannot create /data/mydataset/123: Permission denied


In [32]:
!tree /data/mydataset

/data/mydataset
└── 123 -> .git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e

0 directories, 1 file


In [33]:
!cat /data/mydataset/.git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e

321


But the old object is still there. 

In [34]:
!cat /data/mydataset/.git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f

123
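
Both versions coexist because the annex stores objects by their content checksum, so new content never overwrites old content. A toy model of that idea (`ObjectStore` is a hypothetical class for illustration, not how git-annex is implemented):

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: objects are keyed by their MD5 digest."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        key = hashlib.md5(content).hexdigest()
        self.objects[key] = content
        return key


store = ObjectStore()
old_key = store.add(b"123\n")  # first version of the file
new_key = store.add(b"321\n")  # modified version gets a different key
pointer = new_key              # the "symlink" now refers to the new object

print(len(store.objects), "objects stored")
print("old version still readable:", store.objects[old_key] == b"123\n")
```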


The entire history of the repo is available.

In [35]:
%%bash 

cd /data/mydataset/
git log

commit 7161896bf141743a64d64c6cc217d36f04382623
Author: yarikoptic <neuro>
Date:   Thu Aug 2 02:46:26 2018 +0000

    add modified file

commit 8b5428983abe4f46c41952d1f72ab29a721c0033
Author: yarikoptic <neuro>
Date:   Thu Aug 2 02:46:22 2018 +0000

    initial file

commit ef2e870514e6f59779c7ca1c0cc0c705b0cc0bc4
Author: yarikoptic <neuro>
Date:   Thu Aug 2 02:46:21 2018 +0000

    [DATALAD] new dataset

commit ee113840402702a31150f908272dd4f1acad027b
Author: yarikoptic <neuro>
Date:   Thu Aug 2 02:46:20 2018 +0000

    [DATALAD] Set default backend for all files to be MD5E


Let us create a simple script that counts the number of characters in a file and commit it directly into git

In [36]:
%%bash

cd /data/mydataset
mkdir -p scripts
# Write the script with a quoted heredoc so that $1 is stored literally
cat > scripts/run.sh << 'EOM'
#!/bin/bash
cat $1 | wc -c
EOM
chmod +x scripts/run.sh
cat scripts/run.sh
datalad add -m "Added the mighty script" --to-git scripts

#!/bin/bash
cat $1 | wc -c
add(ok): /data/mydataset/scripts/run.sh (file) [non-large file; adding content to git repository]
add(ok): /data/mydataset/scripts (directory)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)


We could have just run the script to generate the output file and then called `datalad add out`, but that would leave no record of **how** that file was generated.
The [datalad run](http://docs.datalad.org/en/latest/generated/man/datalad-run.html) command assists with running a command and saving all produced results.
So now we will `datalad run` the script and have the output added to the annex:

In [37]:
%%bash
cd /data/mydataset
datalad run -m "Running the mighty script using datalad run" bash -c 'scripts/run.sh 123 > out'

add(ok): out (file)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)


[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 


We can look at the log again (with `--stat` to see statistics on changed files). The message body of the latest commit contains a special section with information about the executed command, and the log shows that the produced `out` file was added to the repository.

In [38]:
%%bash 

cd /data/mydataset/
git log --stat

commit 742ca8cf3224413665456a6d9eb39c49e85e5b02
Author: yarikoptic <neuro>
Date:   Thu Aug 2 02:46:30 2018 +0000

    [DATALAD RUNCMD] Running the mighty script using datalad run
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "bash -c 'scripts/run.sh 123 > out'",
     "dsid": "3c991c3a-95fe-11e8-8616-eb661f417fb8",
     "exit": 0,
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

 out | 1 +
 1 file changed, 1 insertion(+)

commit fd195ced83fd2adf380560d72c13eb382b8f2001
Author: yarikoptic <neuro>
Date:   Thu Aug 2 02:46:29 2018 +0000

    Added the mighty script

 scripts/run.sh | 2 ++
 1 file changed, 2 insertions(+)

commit 7161896bf141743a64d64c6cc217d36f04382623
Author: yarikoptic <neuro>
Date:   Thu Aug 2 02:46:26 2018 +0000

    add modified file

 123 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 8b5428983abe4f46c41952d1f72ab29a721c0033
Author: yarikoptic <neuro>
Date:   Thu Aug 2 0
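
The section between the two marker lines is plain JSON, so it can be read back by any tool. A minimal sketch that extracts the record from a commit message (`extract_run_record` is a hypothetical helper; the message text is copied from the log output above):

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(message):
    """Return the JSON run record embedded in a commit message, or None."""
    if START not in message or END not in message:
        return None
    body = message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)


message = """[DATALAD RUNCMD] Running the mighty script using datalad run

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "bash -c 'scripts/run.sh 123 > out'",
 "dsid": "3c991c3a-95fe-11e8-8616-eb661f417fb8",
 "exit": 0,
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = extract_run_record(message)
print(record["cmd"], "exited with", record["exit"])
```

Commits made without `datalad run`, such as `initial file`, carry no such section, so the helper returns `None` for them.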

We can go back to a previous state and inspect the dataset as it was then. Here we check out the commit two steps back (`HEAD^^`), before the script and its output were added, and then return to the current state on `master`.

In [39]:
%%bash

tree /data/mydataset/

/data/mydataset/
├── 123 -> .git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e
├── out -> .git/annex/objects/6w/1x/MD5E-s2--48a24b70a0b376535542b996af517398/MD5E-s2--48a24b70a0b376535542b996af517398
└── scripts
    └── run.sh

1 directory, 3 files


In [40]:
%%bash

cd /data/mydataset/
git checkout HEAD^^
tree /data/mydataset/
git checkout master
tree /data/mydataset/

/data/mydataset/
└── 123 -> .git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e

0 directories, 1 file
/data/mydataset/
├── 123 -> .git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e
├── out -> .git/annex/objects/6w/1x/MD5E-s2--48a24b70a0b376535542b996af517398/MD5E-s2--48a24b70a0b376535542b996af517398
└── scripts
    └── run.sh

1 directory, 3 files


Note: checking out 'HEAD^^'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 7161896 add modified file
Previous HEAD position was 7161896 add modified file
Switched to branch 'master'


The records that `datalad run` stores in commit messages can later be used to rerun the entire history of changes.
This is useful when the environment or the input data changes. [datalad rerun](http://docs.datalad.org/en/latest/generated/man/datalad-rerun.html) provides a number of options to cover a variety of such cases. For our exercise we will rerun the entire history recorded using `datalad run` (just one commit for now), reproducing all the results in a separate branch that we will call `verify`:

In [41]:
%%bash
cd /data/mydataset
git checkout master
datalad rerun --since= --onto= --branch=verify

add(ok): out (file)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)


Already on 'master'
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 


We can look at the git log as a graph to visualize the branch we have created:

In [42]:
%%bash
cd /data/mydataset
git log --oneline --graph --name-only --decorate master verify

* 30edba4 (HEAD -> verify) [DATALAD RUNCMD] Running the mighty script using datalad run
| out
| * 742ca8c (master) [DATALAD RUNCMD] Running the mighty script using datalad run
|/  
|   out
* fd195ce Added the mighty script
| scripts/run.sh
* 7161896 add modified file
| 123
* 8b54289 initial file
| 123
* ef2e870 [DATALAD] new dataset
| .datalad/.gitattributes
| .datalad/config
| .gitattributes
* ee11384 [DATALAD] Set default backend for all files to be MD5E
  .gitattributes


We can also `git diff` the `verify` and `master` branches to see that they are identical:

In [43]:
%%bash
cd /data/mydataset

git checkout master
git diff verify master

Switched to branch 'master'


Woohoo, we have reproduced our results!

## Exercise 8:

- Copy one binary brainmask image file from `ds000114/derivatives/fmriprep` into `mydataset`
  - To do so, first install the dataset recursively
  - And then get the file
- Add it to version control
- Use git-annex to list where that file can be found
- Add (to git) a simple Python script to count and print the number of non-zero voxels
- Run the script using `datalad run` (learn about the `--input` and `--output` options) to store the output in a new file and make a record of running the script
- Rerun the entire history of changes using `datalad rerun`

In [44]:
%%bash

datalad install -r /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/
tree -L 1 /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/

install(ok): /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep (dataset) [Installed subdataset in order to get /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep]
/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/
├── sub-01
├── sub-01.html -> .git/annex/objects/MF/jw/MD5E-s20077561--03ecea8730492d537e050941bdf654bf.html/MD5E-s20077561--03ecea8730492d537e050941bdf654bf.html
├── sub-02
├── sub-02.html -> .git/annex/objects/99/j3/MD5E-s19975906--5ede67fcdad59b65a02f572360db2863.html/MD5E-s19975906--5ede67fcdad59b65a02f572360db2863.html
├── sub-03
├── sub-03.html -> .git/annex/objects/z4/8w/MD5E-s20227534--64e1a981338e8fb9c87f026a79a34785.html/MD5E-s20227534--64e1a981338e8fb9c87f026a79a34785.html
├── sub-04
├── sub-04.html -> .git/annex/objects/qF/J1/MD5E-s22389786--2954e6ece2a825c0008e9b1dcfcaf0a6.html/MD5E-s22389786--2954e6ece2a825c0008e9b1dcfcaf0a6.html
├── sub-05
├── sub-05.html -> .git/annex/objects/6G/Z6/MD

[INFO] Cloning http://datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/.git to '/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep' 
[INFO] Installing <Dataset path=/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep> recursively 


In [45]:
%%bash

tree -L 1 /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/

/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/
├── sub-01_t1w_brainmask.nii.gz -> ../../.git/annex/objects/jJ/Wz/MD5E-s93130--2572248880aa5978f8d4049feff2282a.nii.gz/MD5E-s93130--2572248880aa5978f8d4049feff2282a.nii.gz
├── sub-01_t1w_class-csf_probtissue.nii.gz -> ../../.git/annex/objects/jg/4g/MD5E-s3066660--67fbe37bd5440773a9951eb116d9192c.nii.gz/MD5E-s3066660--67fbe37bd5440773a9951eb116d9192c.nii.gz
├── sub-01_t1w_class-gm_probtissue.nii.gz -> ../../.git/annex/objects/80/65/MD5E-s3395247--d4e1b01832f3514a796788ff9d814134.nii.gz/MD5E-s3395247--d4e1b01832f3514a796788ff9d814134.nii.gz
├── sub-01_t1w_class-wm_probtissue.nii.gz -> ../../.git/annex/objects/W9/v4/MD5E-s3107921--50711fd1ad6729e9d8f8b0c0611578f8.nii.gz/MD5E-s3107921--50711fd1ad6729e9d8f8b0c0611578f8.nii.gz
├── sub-01_t1w_dtissue.nii.gz -> ../../.git/annex/objects/gZ/qJ/MD5E-s241087--cedef0e31b37f728f09c92c4bce16f61.nii.gz/MD5E-s241087--cedef0e31b37f728f09c92c4bce16f61.nii.gz
├── sub-

In [46]:
%%bash

datalad get /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/sub-01_t1w_brainmask.nii.gz
cp /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/sub-01_t1w_brainmask.nii.gz \
   /data/mydataset/brainmask.nii.gz
datalad add -d /data/mydataset/ -m "adding the best brainmask out there" brainmask.nii.gz

get(ok): /data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/sub-01_t1w_brainmask.nii.gz (file) [from origin...]
add(ok): brainmask.nii.gz (file)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)


In [47]:
%%writefile /data/mydataset/scripts/count_voxels.py

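import sys
import nibabel as nb

# Sum all voxel values; for a binary (0/1) mask this equals the number of non-zero voxels.
# get_fdata() replaces get_data(), which newer nibabel releases have deprecated.
print(nb.load(sys.argv[1]).get_fdata().sum())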

Writing /data/mydataset/scripts/count_voxels.py
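
The exercise asks for the number of non-zero voxels, while the script sums voxel values. The two agree only for a 0/1 mask. A small pure-Python illustration with made-up values (`binary_mask` and `prob_map` are hypothetical stand-ins for image data, not taken from the dataset):

```python
def count_nonzero(values):
    """Count entries that are not zero."""
    return sum(1 for v in values if v != 0)


binary_mask = [0, 1, 1, 0, 1]          # hypothetical 0/1 mask
prob_map = [0.0, 0.5, 0.25, 0.0, 1.0]  # hypothetical probability map

# For the binary mask, the sum and the non-zero count coincide.
print(sum(binary_mask), count_nonzero(binary_mask))
# For the probability map, they differ.
print(sum(prob_map), count_nonzero(prob_map))
```

For a brainmask the shortcut is fine, but a script meant for arbitrary images would need an explicit non-zero count.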


In [48]:
%%bash

datalad add -d /data/mydataset --to-git -m "New feature: magnificent script to count voxels" scripts/count_voxels.py

add(ok): scripts/count_voxels.py (file) [non-large file; adding content to git repository]
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)


In [49]:
%%bash

cd /data/mydataset/
python scripts/count_voxels.py brainmask.nii.gz

883352.0


Let's now run the script with `datalad run`, recording the input and the produced output:

In [50]:
%%bash

cd /data/mydataset/
datalad run --input brainmask.nii.gz --output mask_count bash -c 'python scripts/count_voxels.py {inputs} > {outputs}'

get(notneeded): brainmask.nii.gz (file) [already present]
add(ok): mask_count (file)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)


[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
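
The `{inputs}` and `{outputs}` placeholders in the command are filled in from the `--input` and `--output` options before the command is executed. The idea can be sketched with ordinary string formatting (`expand_command` is a hypothetical helper; datalad's own expansion may differ in details such as quoting and glob handling):

```python
import shlex


def expand_command(template, inputs, outputs):
    """Fill {inputs}/{outputs} with shell-quoted, space-separated paths."""
    return template.format(
        inputs=" ".join(shlex.quote(p) for p in inputs),
        outputs=" ".join(shlex.quote(p) for p in outputs),
    )


cmd = expand_command(
    "python scripts/count_voxels.py {inputs} > {outputs}",
    inputs=["brainmask.nii.gz"],
    outputs=["mask_count"],
)
print(cmd)
```

Declaring inputs this way is also what lets datalad fetch them first, as the `get(notneeded)` line in the output above indicates.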


In [51]:
%%bash

tree /data/mydataset/

/data/mydataset/
├── 123 -> .git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e
├── brainmask.nii.gz -> .git/annex/objects/jJ/Wz/MD5E-s93130--2572248880aa5978f8d4049feff2282a.nii.gz/MD5E-s93130--2572248880aa5978f8d4049feff2282a.nii.gz
├── mask_count -> .git/annex/objects/wk/5f/MD5E-s9--b1f87c22419b3d0723026b361b26866c/MD5E-s9--b1f87c22419b3d0723026b361b26866c
├── out -> .git/annex/objects/6w/1x/MD5E-s2--48a24b70a0b376535542b996af517398/MD5E-s2--48a24b70a0b376535542b996af517398
└── scripts
    ├── count_voxels.py
    └── run.sh

1 directory, 6 files


In [52]:
%%bash

cd /data/mydataset/
git show

commit dc02409cab83d4093724d664fda31ab4a5ffb0c8
Author: yarikoptic <neuro>
Date:   Thu Aug 2 02:46:55 2018 +0000

    [DATALAD RUNCMD] bash -c 'python scripts/count_voxels.py ...
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "bash -c 'python scripts/count_voxels.py {inputs} > {outputs}'",
     "dsid": "3c991c3a-95fe-11e8-8616-eb661f417fb8",
     "exit": 0,
     "inputs": [
      "brainmask.nii.gz"
     ],
     "outputs": [
      "mask_count"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

diff --git a/mask_count b/mask_count
new file mode 120000
index 0000000..4dc1458
--- /dev/null
+++ b/mask_count
@@ -0,0 +1 @@
+.git/annex/objects/wk/5f/MD5E-s9--b1f87c22419b3d0723026b361b26866c/MD5E-s9--b1f87c22419b3d0723026b361b26866c
\ No newline at end of file


In [53]:
%%bash
cd /data/mydataset/
git-annex whereis brainmask.nii.gz

whereis brainmask.nii.gz (1 copy) 
  	ede265e2-f3c3-4f61-96ca-3d5d4aa0a8ad -- jovyan@jupyter-yarikoptic:/data/mydataset [here]
ok


In [54]:
%%bash

datalad rerun -d /data/mydataset --since= --onto= --branch=verify2

add(ok): out (file)
save(ok): /data/mydataset (dataset)
run(ok): /data/mydataset (dataset) [517fe3e does not have a command; cherry picking]
run(ok): /data/mydataset (dataset) [230f628 does not have a command; cherry picking]
get(notneeded): brainmask.nii.gz (file) [already present]
add(ok): mask_count (file)
save(ok): /data/mydataset (dataset)
action summary:
  add (ok: 2)
  get (notneeded: 1)
  run (ok: 2)
  save (ok: 2)


[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
[INFO] == Command start (output follows) ===== 
[INFO] == Command exit (modification check follows) ===== 
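
The `does not have a command; cherry picking` messages hint at how the rerun proceeds: commits that carry a run record are re-executed, and all other commits are carried over unchanged. A conceptual sketch of that decision, using the commit hashes from this notebook's history (`plan_rerun` is a hypothetical helper, not datalad's implementation):

```python
def plan_rerun(commits):
    """Decide per commit whether to re-execute or cherry-pick.

    commits: list of (short_hash, cmd_or_None), oldest first.
    """
    plan = []
    for short_hash, cmd in commits:
        if cmd is None:
            plan.append(("cherry-pick", short_hash))
        else:
            plan.append(("run", cmd))
    return plan


history = [
    ("742ca8c", "bash -c 'scripts/run.sh 123 > out'"),
    ("517fe3e", None),  # adding the best brainmask out there
    ("230f628", None),  # New feature: magnificent script to count voxels
    ("dc02409", "bash -c 'python scripts/count_voxels.py {inputs} > {outputs}'"),
]

plan = plan_rerun(history)
for action, detail in plan:
    print(action, detail)
```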


In [55]:
%%bash

cd /data/mydataset
git diff master verify2

In [56]:
%%bash

cd /data/mydataset
git log --oneline --graph --name-only --decorate master verify2 verify

* bdfe52d (HEAD -> verify2) [DATALAD RUNCMD] bash -c 'python scripts/count_voxels.py ...
| mask_count
* 4a7276a New feature: magnificent script to count voxels
| scripts/count_voxels.py
* f4ca086 adding the best brainmask out there
| brainmask.nii.gz
* 2b27e00 [DATALAD RUNCMD] Running the mighty script using datalad run
| out
| * dc02409 (master) [DATALAD RUNCMD] bash -c 'python scripts/count_voxels.py ...
| | mask_count
| * 230f628 New feature: magnificent script to count voxels
| | scripts/count_voxels.py
| * 517fe3e adding the best brainmask out there
| | brainmask.nii.gz
| * 742ca8c [DATALAD RUNCMD] Running the mighty script using datalad run
|/  
|   out
| * 30edba4 (verify) [DATALAD RUNCMD] Running the mighty script using datalad run
|/  
|   out
* fd195ce Added the mighty script
| scripts/run.sh
* 7161896 add modified file
| 123
* 8b54289 initial file
| 123
* ef2e870 [DATALAD] new dataset
| .datalad/.gitattributes
| .datalad/config
| .gitattributes
* ee11384 [DATALAD] Set defaul