DataLad makes it easy to use existing datasets that have been published on the web. In this section, we are going to clone a dataset from [OpenNeuro](https://openneuro.org) and explore its content to understand the structure of DataLad datasets and how to work with them. Simply execute the cell below to clone the dataset into the current directory. It will be stored in a folder called `ds004408/`.

In [2]:
!datalad clone https://github.com/OpenNeuroDatasets/ds004408.git

Cloning:   0%|                             | 0.00/2.00 [00:00<?, ? candidates/s]
Enumerating: 0.00 Objects [00:00, ? Objects/s]
Counting:   0%|                              | 0.00/23.7k [00:00<?, ? Objects/s]
Compressing:   0%|                           | 0.00/15.5k [00:00<?, ? Objects/s]
Receiving:   0%|                             | 0.00/23.7k [00:00<?, ? Objects/s]
Receiving:  31%|██████▏             | 7.35k/23.7k [00:00<00:00, 71.9k Objects/s]
Receiving:  76%|███████████████▏    | 18.0k/23.7k [00:00<00:00, 91.4k Objects/s]
Resolving:   0%|                              | 0.00/4.93k [00:00<?, ? Deltas/s]
[INFO   ] Remote origin not usable by git-annex; setting annex-
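
The folder name `ds004408/` is the last part of the clone URL with the `.git` suffix removed. The Python sketch below illustrates that rule. `default_clone_dir` is a hypothetical helper written for this notebook, not part of DataLad, and it only approximates how the destination is derived when no target path is given.

```python
import posixpath
from urllib.parse import urlparse


def default_clone_dir(url):
    """Guess the folder a clone lands in when no target path is given (hypothetical helper)."""
    name = posixpath.basename(urlparse(url).path.rstrip("/"))
    # Drop a trailing ".git", as found in the OpenNeuro GitHub URLs
    return name[: -len(".git")] if name.endswith(".git") else name


print(default_clone_dir("https://github.com/OpenNeuroDatasets/ds004408.git"))
```

If you would rather use a different folder, `datalad clone` accepts a target path as a second argument after the URL.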

## Understanding the Structure of a Dataset

DataLad is a tool that is primarily used through the terminal. Thus, when exploring the content of a DataLad dataset, it makes sense to use terminal commands like `ls` (Linux/macOS) or `dir` (Windows). In VS Code, you can open the terminal via the menu bar by clicking **View > Terminal** or by pressing the **Ctrl+`** keyboard shortcut, and execute these commands there.

Alternatively, you can execute terminal commands in the code cells of this Jupyter notebook by prefacing them with `!`. With `!`, we can execute any shell command as an independent subprocess. Because a subprocess cannot modify the state of the notebook, a `!cd` would have no lasting effect, so there is a special prefix for the `cd` (change directory) command: `%cd`. This allows the `cd` command to persistently change the working directory within the notebook.
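
The difference between the two prefixes can be reproduced in plain Python. The sketch below is only an analogy, assuming that `!` behaves like launching a child process and `%cd` like calling `os.chdir` in the notebook's own process. The temporary directory is created purely for the demonstration.

```python
import os
import subprocess
import sys
import tempfile

start = os.getcwd()

# Like `!cd ..`: the change happens inside a child process and is lost when it exits
subprocess.run([sys.executable, "-c", "import os; os.chdir('..')"], check=True)
after_subprocess = os.getcwd()

# Like `%cd`: the change happens in this process and therefore persists
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)
    moved = os.path.samefile(os.getcwd(), tmp)
    os.chdir(start)  # go back so the rest of the notebook is unaffected

print("unchanged after subprocess:", after_subprocess == start)
print("changed after os.chdir:", moved)
```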

In the following exercises, we are going to explore the dataset we cloned at the beginning of the notebook. You can do this in the terminal or in the notebook using the `!` and `%` prefixes, or try both, whichever you prefer! Here are the commands you need to know:


| Linux/macOS | Windows | Description |
| --- | --- | --- |
| `ls` | `dir` | List the content of the current directory (excluding hidden files) |
| `ls -a` | `dir /a` | List the content of the current directory (including hidden files) |
| `ls -a data` | `dir /a data` | List the content of the `data` directory (including hidden files) |
| `cd code/` | `cd code/` | Move to the `code/` directory |
| `cd ..` | `cd ..` | Move to the parent of the current directory |
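
If you want a listing that works the same way on every operating system, Python's standard library can stand in for `ls` and `dir`. The sketch below uses a hypothetical helper `list_dir` and follows the Linux/macOS convention that names starting with a dot are hidden. Windows marks hidden files with a file attribute instead, which this sketch does not check. Unlike `ls -a`, it also omits the `.` and `..` entries.

```python
import os
import tempfile


def list_dir(path=".", show_hidden=False):
    """Return sorted entry names; dot-names are skipped unless show_hidden is True."""
    names = sorted(os.listdir(path))
    if show_hidden:
        return names
    return [name for name in names if not name.startswith(".")]


# Build a tiny throwaway directory that mimics the layout of a dataset
with tempfile.TemporaryDirectory() as demo:
    os.mkdir(os.path.join(demo, ".git"))
    os.mkdir(os.path.join(demo, "stimuli"))
    open(os.path.join(demo, "README"), "w").close()
    visible = list_dir(demo)
    everything = list_dir(demo, show_hidden=True)

print(visible)
print(everything)
```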

**Example**: Change the current directory to `ds004408/` (i.e. the root directory of the cloned dataset).

In [3]:
%cd ds004408/

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/01_working_with_a_datalad_dataset/ds004408


**Example**: List the content of the current directory (i.e. `ds004408/`)

In [None]:
# Linux/MacOS
!ls 

CHANGES			  stimuli  sub-005  sub-010  sub-015
README			  sub-001  sub-006  sub-011  sub-016
dataset_description.json  sub-002  sub-007  sub-012  sub-017
participants.json	  sub-003  sub-008  sub-013  sub-018
participants.tsv	  sub-004  sub-009  sub-014  sub-019


**Example**: Display the content of `README`.

In [None]:
# Linux/MacOS
!cat README

The data in one study [^1] and then added to by another [^2] and contains EEG responses of healthy, neurotypical adults who listened to naturalistic speech. The subjects listened to segments from an audio book version of "The Old Man and the Sea" and their brain activity was recorded using a 128-channel ActiveTwo EEG system (BioSemi). 

The stimuli folder contains .wav files of the presented audiobook segments as well as a .TextGrid file for each segment, containng the timing of  words and phonemes in that segment. The text grids were generated using the forced-alignment software Prosodylab-Aligner [^3] and inspected by eye. Each subject's folder contains one EEG-recording per audio segment and their starts are aligned (the EEG recordings are longer than the audio to a varying extent).  The recordings are unfiltered, unreferenced and sampled at 512 Hz.

[^1]: Di Liberto, G. M., O’sullivan, J. A., & Lalor, E. C. (2015). Low-frequency cortical entrainment to speech reflects phoneme-level

In [None]:
# Windows
!type README

**Exercise**: Change the current working directory to the `stimuli/` folder and list the contents.

In [22]:
# Linux/MacOS
%cd stimuli
!ls

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/01_working_with_a_datalad_dataset/ds004408/stimuli
audio01.TextGrid  audio06.wav	    audio12.TextGrid  audio17.wav
audio01.wav	  audio07.TextGrid  audio12.wav       audio18.TextGrid
audio02.TextGrid  audio07.wav	    audio13.TextGrid  audio18.wav
audio02.wav	  audio08.TextGrid  audio13.wav       audio19.TextGrid
audio03.TextGrid  audio08.wav	    audio14.TextGrid  audio19.wav
audio03.wav	  audio09.TextGrid  audio14.wav       audio20.TextGrid
audio04.TextGrid  audio09.wav	    audio15.TextGrid  audio20.wav
audio04.wav	  audio10.TextGrid  audio15.wav       results.txt
audio05.TextGrid  audio10.wav	    audio16.TextGrid
audio05.wav	  audio11.TextGrid  audio16.wav
audio06.TextGrid  audio11.wav	    audio17.TextGrid


In [None]:
# Windows
%cd stimuli
!dir

**Exercise**: Change the directory back to `ds004408/` (i.e. the parent directory of `stimuli/`).

In [23]:
%cd ..

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/01_working_with_a_datalad_dataset/ds004408


**Exercise**: List the contents of `ds004408/` including all hidden files and folders.

In [None]:
# Linux/MacOs
!ls -a

.		CHANGES			  stimuli  sub-005  sub-010  sub-015
..		README			  sub-001  sub-006  sub-011  sub-016
.datalad	dataset_description.json  sub-002  sub-007  sub-012  sub-017
.git		participants.json	  sub-003  sub-008  sub-013  sub-018
.gitattributes	participants.tsv	  sub-004  sub-009  sub-014  sub-019


In [None]:
# Windows
!dir /a



**Exercise**: List the contents of the `.git/` folder.

In [41]:
# Linux/MacOs
!ls .git

HEAD   branches  description  index  logs     packed-refs
annex  config	 hooks	      info   objects  refs


In [None]:
# Windows
!dir /a .git

**Exercise**: List the content of the `.datalad/` folder.

In [36]:
# Linux/MacOS
!ls -a .datalad


In [9]:
# Windows
!dir /a .datalad


**Exercise**: Display the file content of `participants.tsv`.

In [None]:
# Linux/MacOs
!cat participants.tsv

participant_id	age	sex	hand	weight	height
sub-001	n/a	n/a	n/a	n/a	n/a
sub-002	n/a	n/a	n/a	n/a	n/a
sub-003	n/a	n/a	n/a	n/a	n/a
sub-004	n/a	n/a	n/a	n/a	n/a
sub-005	n/a	n/a	n/a	n/a	n/a
sub-006	n/a	n/a	n/a	n/a	n/a
sub-007	n/a	n/a	n/a	n/a	n/a
sub-008	n/a	n/a	n/a	n/a	n/a
sub-009	n/a	n/a	n/a	n/a	n/a
sub-010	n/a	n/a	n/a	n/a	n/a
sub-011	n/a	n/a	n/a	n/a	n/a
sub-012	n/a	n/a	n/a	n/a	n/a
sub-013	n/a	n/a	n/a	n/a	n/a
sub-014	n/a	n/a	n/a	n/a	n/a
sub-015	n/a	n/a	n/a	n/a	n/a
sub-016	n/a	n/a	n/a	n/a	n/a
sub-017	n/a	n/a	n/a	n/a	n/a
sub-018	n/a	n/a	n/a	n/a	n/a
sub-019	n/a	n/a	n/a	n/a	n/a


In [None]:
# Windows
!type participants.tsv

**Exercise**: Display the file content of `.datalad/config`. This file contains a DataLad ID that uniquely identifies this dataset.

In [38]:
# Linux/MacOS
!cat .datalad/config

[datalad "dataset"]
	id = 37b1ac65-b33e-4e44-9188-d9d57cb1e50d
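
The same ID can be read programmatically. The sketch below parses the text shown above with Python's `configparser`. This works for this simple file, but Git-style config files are not guaranteed to be valid INI in general, so treat it as an illustration rather than a robust parser. Inside the dataset you could call `parser.read(".datalad/config")` instead of using the embedded string.

```python
import configparser
import uuid

# Content of .datalad/config as printed above
config_text = '[datalad "dataset"]\n\tid = 37b1ac65-b33e-4e44-9188-d9d57cb1e50d\n'

parser = configparser.ConfigParser()
parser.read_string(config_text)

# The section name includes the quotes, exactly as written in the file
dataset_id = parser['datalad "dataset"']["id"]

# uuid.UUID raises ValueError if the string is not a well-formed UUID
parsed_id = uuid.UUID(dataset_id)
print(dataset_id)
```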


In [None]:
# Windows
!type .datalad\config

## Managing File Content

You may have noticed that, even though the dataset contains lots of different folders, cloning it was really fast. This is because DataLad manages dataset structure and file content separately. When you cloned the dataset, you didn't actually download the file content. You merely downloaded tiny symbolic links that represent the files. In this section, we will learn how to load the actual file content and also how to remove it again.
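
To make the idea of a placeholder link concrete, the sketch below creates a symbolic link whose target does not exist yet and checks it before and after the target appears. The file names are made up for the demonstration, and the sketch assumes a system where symbolic links can be created (typically Linux or macOS). It mimics the situation before and after `datalad get`, but it does not use DataLad or git-annex.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "audio01.wav")         # what you see in the dataset
    target = os.path.join(tmp, "content-of-audio")  # hypothetical location of the content

    os.symlink(target, link)                 # the link exists, its target does not
    is_link = os.path.islink(link)
    present_before = os.path.exists(link)    # exists() follows the link

    with open(target, "wb") as f:            # simulate the content arriving
        f.write(b"\0" * 1024)
    present_after = os.path.exists(link)

print(is_link, present_before, present_after)
```

The link itself is only a few bytes long, which is why a freshly cloned dataset takes up so little space.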

**DataLad Commands**

| Command | Description |
| --- | --- |
| `datalad get dir/` | Download the content of the directory `dir/` |
| `datalad drop dir/` | Delete the local content of the directory `dir/` |
| `datalad get dir/example.txt` | Download the content of the file `dir/example.txt` |
| `datalad drop dir/example.txt` | Delete the local content of the file `dir/example.txt` |

**OS-specific commands**

| Linux/macOS | Windows | Description |
| --- | --- | --- |
| `du -sh .` | `dir /s /a` | Print the disk usage of the current directory |
| `du -sh data` | `dir /s /a data` | Print the disk usage of the `data/` directory |
| `ls -a data` | `dir /a data` | List the content of the `data` directory |
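
As a platform-independent alternative, the total size of a directory can be estimated in Python. The helper `directory_size` below is a hypothetical sketch: it adds up apparent file sizes and counts a symbolic link as the size of the link itself. `du` reports the disk blocks actually used, so the numbers will not match exactly.

```python
import os
import tempfile


def directory_size(path="."):
    """Sum the sizes (in bytes) of all files below path without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.lstat(os.path.join(root, name)).st_size
    return total


# Small self-contained demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as demo:
    os.mkdir(os.path.join(demo, "stimuli"))
    with open(os.path.join(demo, "README"), "wb") as f:
        f.write(b"x" * 100)
    with open(os.path.join(demo, "stimuli", "audio01.wav"), "wb") as f:
        f.write(b"x" * 50)
    demo_size = directory_size(demo)

print(demo_size, "bytes")
```

Inside the dataset, `directory_size(".")` or `directory_size("stimuli")` can be called before and after `datalad get` to see the change.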

**Example**: Print the size of the current directory

In [None]:
# Linux/MacOS
!du -sh .

15M	.


In [None]:
# Windows
!dir /s /a

**Example**: Get the data for the file `stimuli/audio01.wav`.

In [52]:
!datalad get stimuli/audio01.wav

Total:   0%|                                   | 0.00/31.3M [00:00<?, ? Bytes/s]
Get stimuli/audio01.wav:   0%|                 | 0.00/31.3M [00:00<?, ? Bytes/s]
Get stimuli/audio01.wav:   0%|         | 33.3k/31.3M [00:00<01:40, 313k Bytes/s]
Get stimuli/audio01.wav:   0%|         | 85.6k/31.3M [00:00<01:15, 414k Bytes/s]
Get stimuli/audio01.wav:   0%|          | 138k/31.3M [00:00<01:09, 446k Bytes/s]
Get stimuli/audio01.wav:   1%|          | 294k/31.3M [00:00<00:36, 855k Bytes/s]
Get stimuli/audio01.wav:   2%|▏        | 608k/31.3M [00:00<00:19, 1.58M Bytes/s]
Get stimuli/audio01.wav:   4%|▎       | 1.22M/31.3M [00:00<00:10, 2.96M Bytes/s]
Get stimuli/audio01.wav:   7%|▌       | 2.28M/31.3M [00:00<00:05, 5.22M Bytes/s]
Get stimuli/audio01.wav:  15%|█▏      | 4.59M/31.3M [00:00<00:02, 10.4M Bytes/s]
Get stimuli/audio01.wav:  32%|██▌     | 10.2M/31.3M [00:01<00:01, 18.4M Bytes/s]
Get stimuli/audio01.wav:  41%|███▎    | 12.8M/31.3M [00:01<00:00, 20.1M Bytes/s

**Exercise**: Check the disk usage of the current directory, again.

In [53]:
# Linux/MacOs
!du -sh

78M	.


In [None]:
# Windows
!dir /s /a

**Exercise**: Get the data for `stimuli/audio02.wav`, then print the disk usage for the current directory.

In [56]:
# Linux/MacOS
!datalad get stimuli/audio02.wav
!du -sh

Total:   0%|                                   | 0.00/31.9M [00:00<?, ? Bytes/s]
Get stimuli/audio02.wav:   0%|                 | 0.00/31.9M [00:00<?, ? Bytes/s]
Get stimuli/audio02.wav:  59%|████▋   | 18.9M/31.9M [00:00<00:00, 42.1M Bytes/s]
Get stimuli/audio02.wav:  78%|██████▎ | 25.0M/31.9M [00:01<00:00, 17.8M Bytes/s]
Get stimuli/audio02.wav:  86%|██████▉ | 27.5M/31.9M [00:01<00:00, 10.7M Bytes/s]
Get stimuli/audio02.wav:  91%|███████▎| 28.9M/31.9M [00:02<00:00, 7.68M Bytes/s]
Get stimuli/audio02.wav:  94%|███████▌| 30.0M/31.9M [00:03<00:00, 5.68M Bytes/s]
Get stimuli/audio02.wav:  97%|███████▋| 30.9M/31.9M [00:03<00:00, 4.41M Bytes/s]
Get stimuli/audio02.wav:  98%|███████▊| 31.4M/31.9M [00:04<00:00, 3.52M Bytes/s]
Get stimuli/audio02.wav: 100%|███████▉| 31.8M/31.9M [00:04<00:00, 3.05M Bytes/s]
Get stimuli/audio02.wav:   0%|                 | 0.00/31.9M [00:00<?, ? Bytes/s

In [None]:
# Windows
!datalad get stimuli/audio02.wav
!dir /s /a

**Exercise**: Drop the data of the whole stimulus folder, then print the disk usage of the current directory.

In [None]:
# Linux/MacOS
!datalad drop stimuli/
!du -sh

drop(ok): stimuli/audio01.wav (file)
drop(ok): stimuli/audio02.wav (file)
drop(ok): stimuli (directory)
action summary:
  drop (ok: 3)
49M	.


In [None]:
# Windows
!datalad drop stimuli/
!dir /s /a

**Exercise**: Get the disk usage of the `stimuli/` folder.

In [58]:
#Linux/MacOS
!du -h stimuli

168K	stimuli


In [None]:
# Windows
!dir /s /a stimuli

**Exercise**: Get all `.TextGrid` files in the `stimuli/` folder, then get the folder's disk usage again.

In [60]:
!datalad get stimuli/*.TextGrid
!du -sh stimuli/

Total:   0%|                                   | 0.00/3.53M [00:00<?, ? Bytes/s]
Get stimuli/ .. o11.TextGrid:   0%|             | 0.00/185k [00:00<?, ? Bytes/s]
Get stimuli/ .. o11.TextGrid:  18%|▉    | 33.3k/185k [00:00<00:00, 321k Bytes/s]
Get stimuli/ .. o11.TextGrid:  47%|██▎  | 87.3k/185k [00:00<00:00, 428k Bytes/s]
Get stimuli/ .. o11.TextGrid:   0%|             | 0.00/185k [00:00<?, ? Bytes/s]
Total:   5%|█▍                          | 185k/3.53M [00:01<00:25, 134k Bytes/s]
Get stimuli/ .. o13.TextGrid:   0%|             | 0.00/164k [00:00<?, ? Bytes/s]
Get stimuli/ .. o13.TextGrid:  81%|████ | 133k/164k [00:00<00:00, 1.26M Bytes/s]
Get stimuli/ .. o13.TextGrid:   0%|             | 0.00/164k [00:00<?, ? Bytes/s]
                                                                               

**Exercise**: Get the size of the `.git/` folder.

In [None]:
# Linux/MacOs
!du -sh .git

43M	.git


In [None]:
# Windows
!dir /s /a .git

**Exercise**: Drop the content for all `.TextGrid` files in the `stimuli/` folder, then get the disk usage of the `.git/` folder again.

In [None]:
# Linux/MacOS
!datalad drop stimuli/*.TextGrid
!du -sh .git

action summary:
  drop (notneeded: 20)
40M	.git


In [None]:
# Windows
!datalad drop stimuli/*.TextGrid
!dir /s /a .git

## Inspecting File Identifiers

DataLad uses git-annex to store and track the content of large files. Git-annex stores every version of every file and assigns each one a checksum (a long alphanumeric string) that uniquely identifies the file content. Most of the time when using DataLad, you don't have to think about git-annex because DataLad handles all the operations for you. However, it can be useful to peek under the hood and use some git-annex commands directly to get more detailed information or configure the dataset's behavior.

| Code | Description |
| --- | --- |
| `git annex info` | Show the git-annex information for the whole dataset |
| `git annex info folder/image.png` | Show the git-annex information for the file `image.png`|
| `git annex whereis folder/image.png` | List the repositories that have the file content for `image.png` |
| `git annex numcopies 2` | Configure the dataset so that the required number of copies for a file is 2 |



**Example**: Get the git-annex `info` for the file `stimuli/audio01.wav`.

In [65]:
!git annex info stimuli/audio01.wav

file: stimuli/audio01.wav
size: 31.32 megabytes
key: SHA256E-s31322156--61207e6f7fe2f2d85a857800af6066048c5d18baa424d47d0f0ab596fafdbb12.wav
present: false
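
The `key` in the output above packs several pieces of information into one string: the hashing backend, the file size in bytes, the checksum, and the file extension. The sketch below takes the key apart with a regular expression. The pattern is read off this one example, and the helper `sha256e_key` is a simplified, hypothetical illustration of how such a key could be built from file content, not git-annex's actual implementation.

```python
import hashlib
import re

KEY_PATTERN = re.compile(
    r"^(?P<backend>[A-Z0-9]+)-s(?P<size>\d+)--(?P<digest>[0-9a-f]+)(?P<ext>\.\w+)?$"
)


def parse_annex_key(key):
    """Split a key like the one printed above into its parts."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unrecognised key: {key!r}")
    return {
        "backend": match["backend"],
        "size": int(match["size"]),
        "digest": match["digest"],
        "extension": match["ext"] or "",
    }


def sha256e_key(data, extension):
    """Build a key of the same shape from raw bytes (simplified sketch)."""
    return f"SHA256E-s{len(data)}--{hashlib.sha256(data).hexdigest()}{extension}"


key = "SHA256E-s31322156--61207e6f7fe2f2d85a857800af6066048c5d18baa424d47d0f0ab596fafdbb12.wav"
parts = parse_annex_key(key)
print(parts["backend"], parts["size"], parts["extension"])
```

The size of 31322156 bytes matches the 31.32 megabytes reported by `git annex info`. Because the key depends only on the content and extension, two files with identical content share the same key.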


**Exercise**: Get the file content for `stimuli/audio01.wav`, then print the git-annex `info` for that file, again.

In [66]:
!datalad get stimuli/audio01.wav
!git annex info stimuli/audio01.wav

Total:   0%|                                   | 0.00/31.3M [00:00<?, ? Bytes/s]
Get stimuli/audio01.wav:   0%|                 | 0.00/31.3M [00:00<?, ? Bytes/s]
Get stimuli/audio01.wav:   0%|         | 33.3k/31.3M [00:00<01:38, 319k Bytes/s]
Get stimuli/audio01.wav:   0%|          | 120k/31.3M [00:00<01:17, 401k Bytes/s]
Get stimuli/audio01.wav:   1%|          | 277k/31.3M [00:00<00:39, 781k Bytes/s]
Get stimuli/audio01.wav:   2%|▏        | 556k/31.3M [00:00<00:22, 1.40M Bytes/s]
Get stimuli/audio01.wav:   4%|▎       | 1.13M/31.3M [00:00<00:11, 2.69M Bytes/s]
Get stimuli/audio01.wav:   7%|▌       | 2.29M/31.3M [00:00<00:05, 5.22M Bytes/s]
Get stimuli/audio01.wav:  15%|█▏      | 4.58M/31.3M [00:00<00:02, 10.3M Bytes/s]
Get stimuli/audio01.wav:  24%|█▉      | 7.38M/31.3M [00:01<00:02, 10.7M Bytes/s]
Get stimuli/audio01.wav:  37%|██▉     | 11.5M/31.3M [00:01<00:01, 17.0M Bytes/s]
Get stimuli/audio01.wav:  46%|███▋    | 14.6M/31.3M [00:01<00:00, 20.4M Bytes/s

**Exercise**: List the repositories that contain the file content for `stimuli/audio01.wav`.

In [71]:
!git annex whereis stimuli/audio01.wav

whereis stimuli/audio01.wav (3 copies) 
  	35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
  	7515908e-47e2-40dc-86f9-2f5bfcca5ed6 -- olebi@iBots-7:~/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/01_working_with_a_datalad_dataset/ds004408 [here]
  	b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds004408/stimuli/single_speaker/audio01.wav?versionId=DgF1hKqcMi0Mbi_Cjrcwxrhe.9fr2GRU
ok


**Exercise**: List the repositories that contain the file content for `stimuli/audio02.wav` - how is this different from the list of repositories in the previous exercise?

In [72]:
!git annex whereis stimuli/audio02.wav

whereis stimuli/audio02.wav (2 copies) 
  	35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
  	b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds004408/stimuli/single_speaker/audio02.wav?versionId=0k0s_818LeMgL_NnL..N5YZuAuwBUVrC
ok
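
Comparing two `whereis` listings by eye gets tedious for larger datasets. The sketch below does the comparison in Python, using the two outputs above pasted in as strings. The regular expression and the helper `repositories` are assumptions based on the layout of this particular output.

```python
import re

# Repository lines copied from the two `git annex whereis` outputs above
WHEREIS_AUDIO01 = """\
whereis stimuli/audio01.wav (3 copies)
  35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
  7515908e-47e2-40dc-86f9-2f5bfcca5ed6 -- olebi@iBots-7:~/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/01_working_with_a_datalad_dataset/ds004408 [here]
  b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
"""
WHEREIS_AUDIO02 = """\
whereis stimuli/audio02.wav (2 copies)
  35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
  b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
"""

# A repository line is "<UUID> -- <description>"
REPO_LINE = re.compile(
    r"^\s*([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}) -- (.+)$", re.MULTILINE
)


def repositories(whereis_output):
    """Map repository UUID to its description for every repository line."""
    return dict(REPO_LINE.findall(whereis_output))


repos_01 = repositories(WHEREIS_AUDIO01)
repos_02 = repositories(WHEREIS_AUDIO02)
only_audio01 = set(repos_01) - set(repos_02)

for repo_id in sorted(only_audio01):
    print(repo_id, "--", repos_01[repo_id])
```

The only repository missing from the second list is the one marked `[here]`, i.e. the local clone, because the content of `audio02.wav` is not present locally at this point.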


**Exercise**: Set the number of required copies of a file to `3`.

In [73]:
!git annex numcopies 3

numcopies 3 ok
(recording state in git...)
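
The `numcopies` setting comes into play whenever content is dropped: before deleting the local copy, git-annex tries to verify that enough copies exist in other repositories. The sketch below is a deliberately simplified model of that check. The function `can_drop` is hypothetical and ignores details such as repository trust levels, so treat it as an illustration rather than git-annex's actual logic.

```python
def can_drop(verified_other_copies, numcopies):
    """Simplified rule: dropping is safe if enough other copies could be verified."""
    return verified_other_copies >= numcopies


# With a single verifiable copy elsewhere, the outcome depends on numcopies
for required in (3, 1):
    print(f"numcopies={required}: drop allowed = {can_drop(1, required)}")
```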


**Exercise**: Try to drop `stimuli/audio01.wav`. What does the error message say?

In [74]:
!datalad drop stimuli/audio01.wav

drop(error): stimuli/audio01.wav (file) [unsafe; Could only verify the existence of 1 out of 3 necessary copies.; (Note that these git remotes have annex-ignore set: origin); (Use --reckless availability to override this check, or adjust numcopies.)]


**Exercise**: Set the number of required copies of a file to `1` and drop `stimuli/audio01.wav`.

In [76]:
!git annex numcopies 1
!datalad drop stimuli/audio01.wav

numcopies 1 ok
(recording state in git...)
drop(ok): stimuli/audio01.wav (file)


**Exercise**: Print the git-annex info for the whole dataset.

In [77]:
!git annex info

trusted repositories: 0
semitrusted repositories: 5
	00000000-0000-0000-0000-000000000001 -- web
	00000000-0000-0000-0000-000000000002 -- bittorrent
	35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
	7515908e-47e2-40dc-86f9-2f5bfcca5ed6 -- olebi@iBots-7:~/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/01_working_with_a_datalad_dataset/ds004408 [here]
	b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
untrusted repositories: 0
transfers in progress: none
available local disk space: 803.84 gigabytes (+100 megabytes reserved)
temporary object directory size: 32.87 megabytes (clean up with git-annex unused)
local annex keys: 0
local annex size: 0 bytes
annexed files in working tree: 1181
size of annexed files in working tree: 20.08 gigabytes
combined annex size of all repositories: 60.66 gigabytes
annex sizes of repositories: 
	30.41 GB: b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
	30.25 GB: 35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
backend us

## Examining a New Dataset

Now you are equipped to consume any DataLad dataset that has been published online - let's try it out!
Search the [OpenNeuro database](https://openneuro.org/search?query={%22keywords%22:[]}) for a dataset that interests you and clone it. Then:
- print the git annex info of that dataset
- get some of the file contents and check the disk usage before and after
- drop the file contents and check the disk usage again
