# Versioning data

In this notebook we will look at tools for versioning data, specifically [git-annex](https://git-annex.branchable.com/) and [datalad](http://datalad.org/). We will cover:

- Initializing, searching, downloading, and removing (dropping) a datalad dataset
- Creating and modifying a datalad dataset

### Let's start by using the shell version

**The `!` character at the beginning of a line indicates that the command that follows will be executed in a shell. We will also use `%%bash` to run multi-line shell cells.**

In the following cell we display datalad's general help. The `--help-np` option prints the help without a pager; since the notebook does not support paging, we will use this option throughout.

In [None]:
# we will be typing here

In [None]:
!datalad --help-np

#### Exercise 1: Generate the help for the install command of datalad.

In [None]:
# write your solution here:

In [None]:
!datalad install --help-np

### datalad install

We will install the datalad metadataset (`///` is shorthand for `datasets.datalad.org`). Note that this only installs the top-level directory structure.

In [None]:
%%bash

mkdir -p ~/data
cd ~/data
datalad install ///

#### Exercise 2

List the contents of the installed dataset up to three levels deep using the `tree` command.

In [None]:
# write your solution here:

In [None]:
%%bash

tree -L 3 ~/data/datasets.datalad.org/

Note that only the top-level datasets are created.

### datalad search
Another useful command is `search`, which looks for information across datalad datasets.

Let's start by displaying the help for the search command:

In [None]:
!datalad search --help-np

Now let's search for datasets containing information about Jim Haxby:

In [None]:
!datalad search -d ~/data/datasets.datalad.org haxby

#### Exercise 3:

Search for information about datasets related to Chris Gorgolewski, using `gorgolewski` as the keyword.

In [None]:
# type your solution here:

In [None]:
!datalad search -d ~/data/datasets.datalad.org gorgolewski

#### Exercise 4:

Install one of the datasets, `ds000114`, and look at its contents up to three levels deep using the `tree` command.

In [None]:
# type your code here:

In [None]:
%%bash 

datalad install ~/data/datasets.datalad.org/workshops/nih-2017/ds000114

In [None]:
!tree -L 3 ~/data/datasets.datalad.org/workshops/nih-2017/ds000114

Let's see what happens when we try to check the content of the file:

In [None]:
!cat ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

We got `No such file or directory`, because we only installed the dataset, but we didn't download anything!
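
The error above can be reproduced in miniature. The sketch below assumes the dataset stores annexed files in git-annex's default symlink mode, where a file whose content has not been fetched is a symlink pointing at a missing object. The temporary directory and the `missing-key` target are hypothetical stand-ins, not real dataset paths.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "dwi.bval")
    # Mimic an annexed file without content: a symlink to a target that does not exist.
    # The target path is a made-up placeholder for an annex object.
    os.symlink(os.path.join(".git", "annex", "objects", "missing-key"), link)

    is_link = os.path.islink(link)      # the directory entry itself is there
    has_content = os.path.exists(link)  # exists() follows the symlink to its target

print("listed in directory:", is_link)
print("content available:", has_content)
```

This is why the file shows up in `tree` and `ls` output while `cat` cannot open it.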

### datalad get
Let's use datalad to fetch this file and then display it again.

In [None]:
!datalad get ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

Now let's try to display the content again:

In [None]:
!cat ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

In [None]:
!datalad ls ~/data/datasets.datalad.org/workshops/nih-2017/ds000114

### git-annex list

Since datalad uses git-annex under the hood, let's try to list things with git-annex. Let's first check the help:

In [None]:
!git-annex list --help

Now let's try to list all files from `sub-01` in `ds000114`.

In [None]:
!git-annex list ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/sub-01/

Oops! git-annex only works when it is run from inside a repository, so we first have to change into the dataset directory:

In [None]:
%%bash 
cd ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list sub-01
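
Changing into the dataset first helps because git-style tools discover their repository by searching upward from the current working directory. The sketch below is a simplified illustration of that idea, not git's actual implementation (it ignores `GIT_DIR`, worktrees, and similar details). `find_repo_root` and the temporary directory layout are hypothetical.

```python
import tempfile
from pathlib import Path


def find_repo_root(start):
    """Return the nearest directory at or above `start` that contains `.git`, else None."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / ".git").exists():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: <tmp>/ds000114/.git and <tmp>/ds000114/sub-01
    dataset = Path(tmp).resolve() / "ds000114"
    (dataset / ".git").mkdir(parents=True)
    (dataset / "sub-01").mkdir()

    # Starting inside the dataset, the search walks up and finds the dataset root.
    root_from_inside = find_repo_root(dataset / "sub-01")
    found_matches = root_from_inside == dataset

print("repository found from sub-01:", found_matches)
```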

#### Exercise 6:

Show the help for `git annex list` and then use it to list the `dwi*` files in `ds000114`.

In [None]:
# type your solution here:

In [None]:
%%bash

cd ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list dwi.*

### datalad drop
We can also remove content from our local storage using the `drop` command.

In [None]:
!datalad drop ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

Note that the file is still listed in the repository:

In [None]:
! ls -l ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/

but we can no longer access its content:

In [None]:
! cat ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.bval

#### Exercise 7:

Check where the dwi files are with the `git-annex list` command and get the missing files.

In [None]:
# type your code here:

In [None]:
%%bash

cd ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/
git-annex list dwi.*

In [None]:
%%bash
datalad get ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/dwi.*

## Creating and versioning our own dataset

### datalad create
Let's create a dataset called `mydataset`:

In [None]:
!datalad create ~/data/mydataset

We will create a dummy file and add it to the dataset:

In [None]:
%%bash 

echo "123" > ~/data/mydataset/123
datalad add -m "initial file" ~/data/mydataset/123

List where the copy is available:

In [None]:
%%bash

cd ~/data/mydataset
git-annex list

In [None]:
!tree ~/data/mydataset

In [None]:
!cat ~/data/mydataset/.git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f
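
The object path printed above encodes a git-annex key. Assuming the MD5E backend, a key has the form `MD5E-s<size>--<md5 hex digest>`, followed by the file extension if there is one. The sketch below rebuilds such a key from file content. `md5e_key` is a hypothetical helper, not part of git-annex, and it ignores the extra rules git-annex applies to extensions.

```python
import hashlib


def md5e_key(content: bytes, extension: str = "") -> str:
    """Build an MD5E-style key: size in bytes, MD5 hex digest, optional extension."""
    digest = hashlib.md5(content).hexdigest()
    return f"MD5E-s{len(content)}--{digest}{extension}"


# `echo "123"` writes the three digits plus a trailing newline, i.e. 4 bytes.
key = md5e_key(b"123\n")
print(key)
```

The printed key can be compared with the directory name in the `cat` command above. Because the key depends only on the content, changing the file produces a different key and therefore a different object.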

Let us try removing the data:

In [None]:
!datalad drop ~/data/mydataset/123

Datalad prevents us from doing this because this is the only copy of the file. It also does not let us modify the file directly:

In [None]:
!echo "321" > ~/data/mydataset/123

If we really want to change the content of the file, we have to unlock it first. After making the changes, we should commit them:

In [None]:
%%bash

datalad unlock ~/data/mydataset/123
echo "321" > ~/data/mydataset/123
datalad add -m "add modified file" ~/data/mydataset/123

If we try to modify it now, we again get `Permission denied`, because adding the file locked it again.

In [None]:
!echo "123" > ~/data/mydataset/123

In [None]:
!tree ~/data/mydataset

In [None]:
!cat ~/data/mydataset/.git/annex/objects/6v/gZ/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e/MD5E-s4--9492fe88f263d58e0b686885e8c98c0e

But the old object is still there:

In [None]:
!cat ~/data/mydataset/.git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f

The entire history of the repo is available.

In [None]:
%%bash 

cd ~/data/mydataset/
git log

Let us create a simple script that counts the number of bytes in a file:

In [None]:
%%bash

cd ~/data/mydataset
mkdir -p scripts
# Write the script with a quoted heredoc so that $1 is not expanded here
cat > scripts/run.sh << 'EOM'
#!/bin/bash
cat "$1" | wc -c
EOM
chmod +x scripts/run.sh
cat scripts/run.sh

Now we will run the script and add both the script and its output to the dataset:

In [None]:
%%bash
cd ~/data/mydataset
scripts/run.sh 123 > out
datalad add -m "Added scripts and output" out scripts

We can look at the log again:

In [None]:
%%bash 

cd ~/data/mydataset/
git log

We can go back to a previous state and check the contents of the `123` file. Note that we return to the current state after this excursion.

In [None]:
%%bash

tree ~/data/mydataset/
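
The excursion into the history mentioned above can be sketched as follows. This is a minimal illustration on a throwaway plain-git repository, not on `~/data/mydataset` itself; the `git` helper, the demo identity, and the two commits are assumptions made for the example. In an annexed dataset the same checkout would switch the symlink to the older object, which is readable only while that object is still present locally.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run a git command inside `repo` with a throwaway identity; return its stdout."""
    cmd = ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.com",
           "-c", "commit.gpgsign=false", *args]
    return subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True).stdout


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    git(repo, "init", "-q")
    # Two commits that mirror the "123" -> "321" edit made earlier in the notebook.
    for content, message in [("123\n", "initial file"), ("321\n", "add modified file")]:
        (repo / "123").write_text(content)
        git(repo, "add", "123")
        git(repo, "commit", "-q", "-m", message)

    git(repo, "checkout", "-q", "HEAD~1")  # step back one commit (detached HEAD)
    old_content = (repo / "123").read_text()
    git(repo, "checkout", "-q", "-")       # return to the branch we came from
    current_content = (repo / "123").read_text()

print("previous state:", repr(old_content))
print("current state:", repr(current_content))
```

`git checkout -` returns to the previously checked-out branch whatever its name is, so the sketch does not depend on the default branch being called `master` or `main`.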

#### Exercise 8:

- Copy one binary brainmask image file from `ds000114/derivatives/fmriprep` into `mydataset`
  - To do so, first install the dataset recursively
  - Then get the file
- Add it to version control
- Use git-annex to list where that file can be found
- Add a simple Python script to count and print the number of non-zero voxels
- Store the output in a new output file
- Use datalad to add everything to the repository

In [None]:
%%bash

datalad install -r ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/
tree -L 1 ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/

In [None]:
%%bash

tree -L 1 ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/

In [None]:
%%bash

datalad get ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/sub-01_t1w_brainmask.nii.gz
cp ~/data/datasets.datalad.org/workshops/nih-2017/ds000114/derivatives/fmriprep/sub-01/anat/sub-01_t1w_brainmask.nii.gz \
   ~/data/mydataset/brainmask.nii.gz

In [None]:
%%writefile ~/data/mydataset/scripts/count_voxels.py

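import sys
import nibabel as nb
import numpy as np
# get_fdata() replaces get_data(), which was removed in nibabel 5.0
print(np.count_nonzero(nb.load(sys.argv[1]).get_fdata()))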

In [None]:
%%bash

cd ~/data/mydataset/
python scripts/count_voxels.py brainmask.nii.gz > mask_count

In [None]:
%%bash

tree ~/data/mydataset/

In [None]:
%%bash

cd ~/data/mydataset/
datalad add -m "added brainmask, script, and output" brainmask.nii.gz scripts/count_voxels.py mask_count

In [None]:
%%bash
cd ~/data/mydataset/
git-annex whereis brainmask.nii.gz