# Versioning data

In this notebook we will look at tools for versioning data, specifically [git-annex](https://git-annex.branchable.com/) and [datalad](http://datalad.org/). We will cover:

- Initializing, searching, downloading, and removing (dropping) a datalad dataset
- Creating and modifying a datalad dataset

### Let's start by using the shell version

**The `!` character at the beginning of the next line indicates that the rest of the line will be executed as a shell command. We will also use `%%bash` to run multi-line shell scripts.**
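
For readers more used to plain Python, a rough equivalent of `!command` is to hand the command line to a shell through `subprocess`. This is only a sketch of the idea, not how the notebook implements it, and the `echo` command is an arbitrary stand-in.

```python
import subprocess

# Run a command line through the system shell and capture what it prints,
# similar in spirit to writing `!echo hello from the shell` in a notebook cell.
result = subprocess.run(
    "echo hello from the shell",
    shell=True,
    capture_output=True,
    text=True,
    check=True,
)
shell_output = result.stdout.strip()
print(shell_output)
```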

In the following cell we display datalad's help message. The `--help-np` option prints the help without a pager; since the notebook does not support paging, we will use this option throughout.

In [None]:
# we will be typing here

In [1]:
!datalad --help-np

Usage: datalad [global-opts] command [command-opts]

Comprehensive data management solution

DataLad provides a unified data distribution system built on the Git
and Git-annex. DataLad command line tools allow to manipulate (obtain,
create, update, publish, etc.) datasets and provide a comprehensive
toolbox for joint management of data and code. Compared to Git/annex
it primarly extends their functionality to transparently and
simultaneously work with multiple inter-related repositories.

*Commands for dataset operations*

  create
      Create a new dataset from scratch
  install
      Install a dataset from a (remote) source
  get
      Get any dataset content (files/directories/subdatasets)
  add
      Add files/directories to an existing dataset
  publish
      Publish a dataset to a known sibling
  uninstall
      Uninstall subdatasets
  drop
      Drop file content from datasets
  remove
      Remove components from datasets
  update
      Update a d

#### Exercise 1: Generate the help for the install command of datalad.

In [None]:
# write your solution here:

In [None]:
!datalad install --help-np

### datalad install

We will install the datalad metadataset. Note that this only installs the top-level directory structure.

In [2]:
%%bash

mkdir -p ~/data
cd ~/data
datalad install /// 

install(ok): /home/neuro/data/datasets.datalad.org (dataset)


[INFO] Cloning http://datasets.datalad.org/ to '/home/neuro/data/datasets.datalad.org' 
| Failed to run ['git', 'config', 'user.name'] under None. Exit code=1. out= err= [cmd.py:run:530]CommandError: command '['git', 'config', 'user.email']' failed with exitcode 1
| Failed to run ['git', 'config', 'user.email'] under None. Exit code=1. out= err= [cmd.py:run:530].  Some operations might fail or not perform correctly. 


#### Exercise 2

List the contents of the installed dataset up to three levels deep using the `tree` command.

In [None]:
# write your solution here:

In [3]:
%%bash

tree -L 3 ~/data/datasets.datalad.org/

/home/neuro/data/datasets.datalad.org/
├── abide
├── abide2
├── adhd200
├── corr
├── crcns
├── datapackage.json
├── dbic
├── devel
├── dicoms
├── hbnssi
├── indi
├── kaggle
├── labs
├── neurovault
├── nidm
├── openfmri
└── workshops

16 directories, 1 file


Note that only the top-level directories are created; none of the datasets below them has been installed yet.

### datalad search
Another useful command is `search`, which looks for information across datalad datasets.

Let's start by displaying the help for the search command:

In [None]:
!datalad search --help-np

Now let's search for datasets containing information about Jim Haxby:

In [4]:
!datalad search -d ~/data/datasets.datalad.org haxby

| Failed to run ['git', 'config', 'user.name'] under None. Exit code=1. out= err= [cmd.py:run:530]CommandError: command '['git', 'config', 'user.email']' failed with exitcode 1
| Failed to run ['git', 'config', 'user.email'] under None. Exit code=1. out= err= [cmd.py:run:530].  Some operations might fail or not perform correctly. 
Total (20 ok out of 61):  19%|██▋           | 1.44M/7.49M [00:01<00:06, 951kB/s]
Total (21 ok out of 61):  20%|██▊           | 1.49M/7.49M [00:01<00:06, 951kB/s]
Total (36 ok out of 61):  34%|████▍        | 2.58M/7.49M [00:01<00:03, 1.45MB/s]
Total (37 ok out of 61):  37%|████▊        | 2.78M/7.49M [00:01<00:03, 1.45MB/s]
Total (47 ok out of 61):  78%|██████████   | 5.82M/7.49M [00:02<00:00, 2.24MB/s]
Total (60 ok out of 61):  93%|████████████ | 6.98M/7.49M [00:02<00:00, 1.96MB/s]
.datalad/me .. 7e957f0ae0f:  43%|█████▌       | 383k/888k [00:00<00:00, 763kB/s]

#### Exercise 3:

Search for information about datasets related to Chris Gorgolewski, using `gorgolewski` as the keyword.

In [None]:
#type your solution here:

In [5]:
!datalad search -d ~/data/datasets.datalad.org gorgolewski

| Failed to run ['git', 'config', 'user.name'] under None. Exit code=1. out= err= [cmd.py:run:530]CommandError: command '['git', 'config', 'user.email']' failed with exitcode 1
| Failed to run ['git', 'config', 'user.email'] under None. Exit code=1. out= err= [cmd.py:run:530].  Some operations might fail or not perform correctly. 
search(ok): /home/neuro/data/datasets.datalad.org/openfmri/ds000114 (dataset)
search(ok): /home/neuro/data/datasets.datalad.org/openfmri/ds000158 (dataset)
search(ok): /home/neuro/data/datasets.datalad.org/openfmri/ds000221 (dataset)
search(ok): /home/neuro/data/datasets.datalad.org/workshops/nipype-2017/ds000114 (dataset)
action summary:
  search (ok: 4)


#### Exercise 4:

Install one of these datasets, `ds000114`, and look at its contents up to three levels deep using the `tree` command.

In [None]:
# type your code here:

In [6]:
%%bash 

datalad install ~/data/datasets.datalad.org/workshops/nih-2017/ds000114

install(ok): /home/neuro/data/datasets.datalad.org/workshops/nih-2017/ds000114 (dataset) [Installed subdataset in order to get /home/neuro/data/datasets.datalad.org/workshops/nih-2017/ds000114]
action summary:
  install (ok: 3)


| Failed to run ['git', 'config', 'user.name'] under None. Exit code=1. out= err= [cmd.py:run:530]CommandError: command '['git', 'config', 'user.email']' failed with exitcode 1
| Failed to run ['git', 'config', 'user.email'] under None. Exit code=1. out= err= [cmd.py:run:530].  Some operations might fail or not perform correctly. 
[INFO] Cloning http://datasets.datalad.org/workshops/.git to '/home/neuro/data/datasets.datalad.org/workshops' 
[INFO] Cloning http://datasets.datalad.org/workshops/nih-2017/.git to '/home/neuro/data/datasets.datalad.org/workshops/nih-2017' 
[INFO] Cloning http://datasets.datalad.org/workshops/nih-2017/ds000114/.git to '/home/neuro/data/datasets.datalad.org/workshops/nih-2017/ds000114' 
[INFO] access to dataset sibling "datalad" not auto-enabled, enable with:
| 		datalad siblings -d "/home/neuro/data/datasets.datalad.org/workshops/nih-2017/ds000114" enable -s datalad 


In [7]:
!tree -L 3 ~/data/datasets.datalad.org/workshops/nih-2017/ds000114

/home/neuro/data/datasets.datalad.org/workshops/nih-2017/ds000114
├── CHANGES
├── dataset_description.json
├── derivatives
│   ├── fmriprep
│   └── freesurfer
├── dwi.bval -> .git/annex/objects/JX/4K/MD5E-s335--5bd6fa32ccd0c79e79f9ac63a2c09c1a.bval/MD5E-s335--5bd6fa32ccd0c79e79f9ac63a2c09c1a.bval
├── dwi.bvec -> .git/annex/objects/Pg/wk/MD5E-s1248--0641c68ff6ee6164928c984541653430.bvec/MD5E-s1248--0641c68ff6ee6164928c984541653430.bvec
├── sub-01
│   ├── ses-retest
│   │   ├── anat
│   │   ├── dwi
│   │   └── func
│   └── ses-test
│       ├── anat
│       ├── dwi
│       └── func
├── sub-02
│   ├── ses-retest
│   │   ├── anat
│   │   ├── dwi
│   │   └── func
│   └── ses-test

Let's see what happens when we try to read the content of the `dwi.bval` file:

In [8]:
!cat ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

cat: /home/neuro/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval: No such file or directory


We got `No such file or directory`, because we only installed the dataset, but we didn't download anything!
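
The `tree` output above shows `dwi.bval` as a symbolic link into `.git/annex/objects`. Until the content is downloaded, that link points at a path that does not exist, which is why `cat` fails. The sketch below reproduces the situation with a plain dangling symlink in a temporary directory; the target name is made up for illustration and no git-annex repository is involved.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical names: the link plays the role of dwi.bval, the target
    # plays the role of an annex object that has not been downloaded yet.
    target = os.path.join(tmp, "annex-object-not-downloaded")
    link = os.path.join(tmp, "dwi.bval")
    os.symlink(target, link)

    link_is_listed = os.path.lexists(link)  # the link itself exists
    content_present = os.path.exists(link)  # follows the link to the target

    try:
        with open(link) as handle:
            handle.read()
        read_error = None
    except FileNotFoundError as err:
        read_error = type(err).__name__

print(link_is_listed, content_present, read_error)
```

The link is listed, but following it fails, just as `cat` did.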

### datalad get
Let's use datalad to fetch this file and then read it again:

In [9]:
!datalad get ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

| Failed to run ['git', 'config', 'user.name'] under None. Exit code=1. out= err= [cmd.py:run:530]CommandError: command '['git', 'config', 'user.email']' failed with exitcode 1
| Failed to run ['git', 'config', 'user.email'] under None. Exit code=1. out= err= [cmd.py:run:530].  Some operations might fail or not perform correctly. 
get(ok): /home/neuro/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval (file) [from origin...]


Now let's try to read the content again:

In [10]:
!cat ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

0 0 0 0 0 0 0 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 


In [11]:
!datalad ls ~/data/datasets.datalad.org/workshops/nih-2017/ds000114

| Failed to run ['git', 'config', 'user.name'] under None. Exit code=1. out= err= [cmd.py:run:530]CommandError: command '['git', 'config', 'user.email']' failed with exitcode 1
| Failed to run ['git', 'config', 'user.email'] under None. Exit code=1. out= err= [cmd.py:run:530].  Some operations might fail or not perform correctly. 
/home/neuro/data/datasets.datalad.org/workshops/nih-2017/ds000114   [annex]  master  2.0.1-13-g26d1fc2e0 2018-06-08/11:33:43  ✓


### git-annex list

Since datalad uses git-annex under the hood, let's try to list things with git-annex. Let's first check the help:

In [12]:
!git-annex list --help

git-annex list - show which remotes contain files

Usage: git-annex list [PATH ...] [--allrepos]

Available options:
  --allrepos               show all repositories, not only remotes
  --force                  allow actions that may lose annexed data
  -F,--fast                avoid slow operations
  -q,--quiet               avoid verbose output
  -v,--verbose             allow verbose output (default)
  -d,--debug               show debug messages
  --no-debug               don't show debug messages
  -b,--backend NAME        specify key-value backend to use
  -N,--numcopies NUMBER    override default number of copies
  --trust REMOTE           override trust setting
  --semitrust REMOTE       override trust setting back to default
  --untrust REMOTE         override trust setting to untrusted
  -c,--config NAME=VALUE   override git configuration setting
  --user-agent NAME        override default User-Agent
  --trust-glacier          Trust Amazon Glacier inventory

Now let's try to list all files from `sub-01` in `ds000114`.

In [13]:
!git-annex list ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/sub-01/

git-annex: Not in a git repository.


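Oops! `git-annex` has to be run from inside a repository: it looks for the repository starting from the current working directory, not from the path given as an argument. Let's change into the dataset directory first: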

In [14]:
%%bash 
cd ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list sub-01

here
|origin
||web
|||bittorrent
||||datalad-archives
|||||
_XX_X sub-01/ses-retest/anat/sub-01_ses-retest_T1w.nii.gz
_XX_X sub-01/ses-retest/dwi/sub-01_ses-retest_dwi.nii.gz
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-covertverbgeneration_bold.nii.gz
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-fingerfootlips_bold.nii.gz
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-linebisection_bold.nii.gz
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-linebisection_events.tsv
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-overtverbgeneration_bold.nii.gz
_XX_X sub-01/ses-retest/func/sub-01_ses-retest_task-overtwordrepetition_bold.nii.gz
_XX_X sub-01/ses-test/anat/sub-01_ses-test_T1w.nii.gz
_XX_X sub-01/ses-test/dwi/sub-01_ses-test_dwi.nii.gz
_XX_X sub-01/ses-test/func/sub-01_ses-test_task-covertverbgeneration_bold.nii.gz
_XX_X sub-01/ses-test/func/sub-01_ses-test_task-fingerfootlips_bold.nii.gz
_XX_X sub-01/ses-test/func/sub-01_ses-test_task-linebisection_bold.nii.gz
_

#### Exercise 6:

Show the help for `git annex list` and then use it to list the `dwi*` files in `ds000114`.

In [None]:
# type your solution here:

In [15]:
%%bash

cd ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list dwi.*

here
|origin
||web
|||bittorrent
||||datalad-archives
|||||
XXX_X dwi.bval
_XX_X dwi.bvec
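
The listing is compact but a little cryptic: each header line names one repository, and each file row carries one flag per repository, with `X` meaning the content is there and `_` meaning it is not. As a rough illustration, the output can be turned into a dictionary with a few lines of Python. The function name `parse_annex_list` is made up for this sketch, and the parser assumes the layout shown above rather than any documented output format.

```python
def parse_annex_list(text):
    """Map each file to the repositories marked with X in `git-annex list` output."""
    lines = iter(text.strip().splitlines())

    # Header: one repository name per line, prefixed by one `|` per earlier column.
    # A line made only of `|` characters closes the header.
    remotes = []
    for line in lines:
        name = line.lstrip("|")
        if not name:
            break
        remotes.append(name)

    # Body: one flag per repository (X = present, _ = absent), a space, the path.
    locations = {}
    for line in lines:
        flags, _, path = line.partition(" ")
        locations[path] = [repo for repo, flag in zip(remotes, flags) if flag == "X"]
    return locations


sample = """\
here
|origin
||web
|||bittorrent
||||datalad-archives
|||||
XXX_X dwi.bval
_XX_X dwi.bvec
"""

locations = parse_annex_list(sample)
for path, repos in locations.items():
    print(path, "->", ", ".join(repos))
```

Only `dwi.bval` is reported as present in `here`, because it is the file we fetched with `datalad get`.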


### datalad drop
We can also remove content from our local storage using the `drop` command.

In [16]:
!datalad drop ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

| Failed to run ['git', 'config', 'user.name'] under None. Exit code=1. out= err= [cmd.py:run:530]CommandError: command '['git', 'config', 'user.email']' failed with exitcode 1
| Failed to run ['git', 'config', 'user.email'] under None. Exit code=1. out= err= [cmd.py:run:530].  Some operations might fail or not perform correctly. 
drop(ok): /home/neuro/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval (file) [checking origin...]


Note that the file is still listed in the repository:

In [17]:
! ls -l ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/

total 96
-rw-r--r-- 1 neuro users  127 Jun 14 06:48 CHANGES
-rw-r--r-- 1 neuro users  319 Jun 14 06:48 dataset_description.json
drwxr-xr-x 4 neuro users 4096 Jun 14 06:48 derivatives
lrwxrwxrwx 1 neuro users  122 Jun 14 06:48 dwi.bval -> .git/annex/objects/JX/4K/MD5E-s335--5bd6fa32ccd0c79e79f9ac63a2c09c1a.bval/MD5E-s335--5bd6fa32ccd0c79e79f9ac63a2c09c1a.bval
lrwxrwxrwx 1 neuro users  124 Jun 14 06:48 dwi.bvec -> .git/annex/objects/Pg/wk/MD5E-s1248--0641c68ff6ee6164928c984541653430.bvec/MD5E-s1248--0641c68ff6ee6164928c984541653430.bvec
drwxr-xr-x 4 neuro users 4096 Jun 14 06:48 sub-01
drwxr-xr-x 4 neuro users 4096 Jun 14 06:48 sub-02
drwxr-xr-x 4 neuro users 4096 Jun 14 06:48 sub-03
drwxr-xr-x 4 neuro users 4096 Jun 14 06:48 sub-04
drwxr-xr-x 4 neuro users 4096 Jun 14 06:48 sub-05
drwxr-xr-x 4 neuro users 4096 Jun 14 06:48 sub-06
drwxr-xr-x 4 neuro users 4096 Jun 14 06:48 sub-07
drwxr-xr-x 4 neuro users 4096 Jun 14 06:48 sub-08
drwxr-xr-x 4 neuro users 4096 Jun 14 06:48 su

but we can no longer access its content:

In [18]:
! cat ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

cat: /home/neuro/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval: No such file or directory
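
Even with the content dropped, the symlink target shown by `ls -l` still describes the file: its last path component is the git-annex key. The sketch below splits that key into its parts, assuming the layout `BACKEND-sSIZE--HASH.extension` that the keys in this notebook follow; the regular expression is written for this example and is not taken from git-annex.

```python
import posixpath
import re

# Symlink target copied from the `ls -l` output for dwi.bval.
target = (
    ".git/annex/objects/JX/4K/"
    "MD5E-s335--5bd6fa32ccd0c79e79f9ac63a2c09c1a.bval/"
    "MD5E-s335--5bd6fa32ccd0c79e79f9ac63a2c09c1a.bval"
)

key = posixpath.basename(target)

# Assumed layout: BACKEND-sSIZE--HASH[.extension]
pattern = re.compile(
    r"(?P<backend>[A-Z0-9]+)-s(?P<size>\d+)--(?P<digest>[0-9a-f]+)(?P<ext>\.\w+)?"
)
match = pattern.fullmatch(key)
info = match.groupdict()
info["size"] = int(info["size"])  # content size in bytes

print(info)
```

Here the key says that the missing content is 335 bytes long and is identified by its MD5 digest.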


#### Exercise 7:

Check where the dwi files are with the `git-annex list` command and get the missing files.

In [None]:
#type your code here:

In [None]:
%%bash

cd ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list dwi.*

In [None]:
%%bash
datalad get ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.*

## Creating and versioning our own dataset

### datalad create
Let's create a dataset called `mydataset`:

In [None]:
!datalad create ~/data/mydataset

We will create a dummy file and add it to the dataset:

In [None]:
%%bash 

echo "123" > ~/data/mydataset/123
datalad add -m "initial file" ~/data/mydataset/123

List where the copy is available:

In [None]:
%%bash

cd ~/data/mydataset
git-annex list

In [None]:
!tree ~/data/mydataset

In [None]:
!cat ~/data/mydataset/.git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f

Let us try removing the data:

In [None]:
!datalad drop ~/data/mydataset/123

Datalad will prevent us from doing this, because this is the only copy of the file. It also will not let us modify the file so easily:

In [None]:
!echo "321" > ~/data/mydataset/123
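
Going back to the refused `drop`: the rule behind it can be pictured as a simple copy count. Content may only be dropped locally if enough copies are known to exist elsewhere (git-annex calls this number `numcopies`, and its default is 1). The function below is a deliberately simplified model with hypothetical names, not git-annex's actual check.

```python
def can_drop(locations, numcopies=1, here="here"):
    """Return True if enough copies would remain elsewhere after a local drop.

    `locations` is the list of repositories known to hold the file's content.
    This is a simplified model, not git-annex's actual implementation.
    """
    other_copies = [repo for repo in locations if repo != here]
    return len(other_copies) >= numcopies


# dwi.bval in ds000114 was also available from other repositories.
drop_from_ds000114 = can_drop(["here", "origin", "web", "datalad-archives"])
# The file in mydataset exists only in the local repository.
drop_from_mydataset = can_drop(["here"])

print(drop_from_ds000114, drop_from_mydataset)
```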

If we really want to change the content of the file, we have to unlock it first. After making the changes, we should add the file back to the dataset:

In [None]:
%%bash

datalad unlock ~/data/mydataset/123
echo "321" > ~/data/mydataset/123
datalad add -m "add modified file" ~/data/mydataset/123

If we try to modify it now, we again get `Permission denied`, because the file has been locked again.

In [None]:
!echo "123" > ~/data/mydataset/123

In [None]:
!tree ~/data/mydataset

In [None]:
!cat ~/data/mydataset/.git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e

But the old object is still there. 

In [None]:
!cat ~/data/mydataset/.git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f
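
Both versions can coexist because the object name is derived from the content itself, so changed content ends up under a different name. A minimal sketch of that idea follows, assuming a key of the form `MD5E-s<size>--<md5 digest><extension>` as seen in the paths above. The helper `md5e_key` is hypothetical and only mimics the naming; it does not call git-annex.

```python
import hashlib


def md5e_key(content: bytes, extension: str = "") -> str:
    """Build a key of the form MD5E-s<size>--<md5 hex digest><extension>."""
    digest = hashlib.md5(content).hexdigest()
    return f"MD5E-s{len(content)}--{digest}{extension}"


# `echo "123"` writes the three digits plus a newline, hence four bytes.
old_key = md5e_key(b"123\n")
new_key = md5e_key(b"321\n")

print(old_key)
print(new_key)
```

If the assumed scheme holds, the two printed keys should correspond to the two object directories used in the `cat` commands above.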

The entire history of the repo is available.

In [None]:
%%bash 

cd ~/data/mydataset/
git log

Let us create a simple script that counts the number of characters in a file:

In [None]:
%%bash

cd ~/data/mydataset
mkdir -p scripts
cat << 'EOM' > scripts/run.sh
#!/bin/bash
# Print the number of bytes (characters) in the file given as the first argument
cat "$1" | wc -c
EOM
chmod +x scripts/run.sh
cat scripts/run.sh

Now we will run the script and add both the script and its output to the dataset.

In [None]:
%%bash
cd ~/data/mydataset
scripts/run.sh 123 > out
datalad add -m "Added scripts and output" out scripts

We can look at the log again:

In [None]:
%%bash 

cd ~/data/mydataset/
git log

We can go back to a previous state and check the contents of the `123` file. Note that we return to the current state after this excursion.

In [None]:
%%bash

tree ~/data/mydataset/

#### Exercise 8:

- Copy one binary brainmask image file from `ds000114/derivatives/fmriprep` into `mydataset`
  - To do so, first install the dataset recursively
  - Then get the file
- Add it to version control
- Use git-annex to list where that file can be found
- Add a simple Python script that counts and prints the number of non-zero voxels
- Store the output in a new output file
- Use datalad to add everything to the repository

In [None]:
%%bash

datalad install -r ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/
tree -L 1 ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/

In [None]:
%%bash

tree -L 1 ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/

In [None]:
%%bash

datalad get ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/sub-01_t1w_brainmask.nii.gz
cp ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/sub-01_t1w_brainmask.nii.gz \
   ~/data/mydataset/brainmask.nii.gz

In [None]:
%%writefile ~/data/mydataset/scripts/count_voxels.py

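import sys

import nibabel as nb
import numpy as np

# Count the non-zero voxels of the image given as the first argument
print(np.count_nonzero(nb.load(sys.argv[1]).get_fdata()))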

In [None]:
%%bash

cd ~/data/mydataset/
python scripts/count_voxels.py brainmask.nii.gz > mask_count

In [None]:
%%bash

tree ~/data/mydataset/

In [None]:
%%bash

cd ~/data/mydataset/
datalad add -m "added brainmask, script, and output" brainmask.nii.gz scripts/count_voxels.py mask_count

In [None]:
%%bash
cd ~/data/mydataset/
git-annex whereis brainmask.nii.gz