In this lesson, we are going to create a new DataLad dataset. We will see how to
- Add data and track provenance
- Modify files and track changes
- Use the git history to restore older file versions
- Add sub-datasets

## Creating a new Dataset

| Code | Description |
| --- | --- |
| `mkdir data/` | Create a new directory called `data/` |
| `cd data/` | Change the working directory to `data/` |
| `datalad create my-dataset` | Create a DataLad dataset in the new directory `my-dataset` |
| `datalad status` | Show any untracked changes in the current dataset |
| `datalad save` | Save all untracked changes in the current dataset |
| `echo "hello" > file.txt` | Save the text `"hello"` to `file.txt` | 



**Example**: Create a new DataLad dataset called `my-dataset` in the current directory.

In [1]:
!datalad create my-dataset

create(ok): /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/my-dataset (dataset)


**Example**: Change the current directory to `my-dataset` and print the dataset's `status`.

In [2]:
%cd my-dataset
!datalad status

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/my-dataset
nothing to save, working tree clean


**Example**: Change the current directory to the parent (i.e. the directory that contains this notebook).

In [3]:
%cd ..

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch


**Exercise**: Create a new dataset called `learn-datalad` in the current directory

In [4]:
!datalad create learn-datalad

create(ok): /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad (dataset)


**Exercise**: Change the current directory to `learn-datalad/` and print the dataset's status

In [5]:
%cd learn-datalad
!datalad status

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad
nothing to save, working tree clean


**Exercise**: Create a new directory `books/` in `learn-datalad/` and change the current directory to `books/`. 

In [6]:
!mkdir books
%cd books

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad/books


**Example**: Download [https://homepages.uc.edu/~becktl/byte_of_python.pdf](https://homepages.uc.edu/~becktl/byte_of_python.pdf) and write it to the output file `byte-of-python.pdf`.

In [7]:
!curl -o byte-of-python.pdf https://homepages.uc.edu/~becktl/byte_of_python.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2630k  100 2630k    0     0   517k      0  0:00:05  0:00:05 --:--:--  587k


**Exercise**: Check the `status` of the dataset

In [8]:
!datalad status

untracked: /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad/books (directory)


**Exercise**: `save` the untracked file and add a message `"add a book on Python"`. Then, check the `status` of the dataset again.

In [9]:
!datalad save -m "add a book on Python"
!datalad status

add(ok): books/byte-of-python.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean


**Exercise**: Add another book to the dataset:
- Download [https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf](https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf) and write it to the file `progit.pdf`
- Save the untracked file with a message `"add a book on Git"`
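- Check the dataset's `status` to make sure there are no untracked changes

**NOTE**: This URL redirects, and `curl` only follows redirects when called with `-L`. The recorded run below omits `-L`, so it transferred 0 bytes and saved an empty `progit.pdf`. Add `-L` to download the actual book.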

In [10]:
!curl -o progit.pdf https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf
!datalad save -m "add a book on Git"
!datalad status

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
add(ok): books/progit.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean


**Exercise**: Change the current directory to the parent (i.e. the `learn-datalad/` directory).

In [11]:
%cd ..

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad


**Exercise**: Create a new file `README.md` with the text `"This is a DataLad dataset"`, either using your editor or the `echo` command. Then, save the untracked file and check the dataset's status.

In [12]:
!echo "This is a DataLad dataset" > README.md
!datalad save -m "add README"
!datalad status

add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean
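The download, save, status cycle repeated in the exercises above can also be scripted from Python. The following is only a sketch of one way to do it with `subprocess`: the helper name `add_file_from_url` and its `run` parameter are made up for illustration, the URL is a placeholder, and `-L` is passed to `curl` so that redirects are followed.

```python
import subprocess
from typing import Callable, List


def add_file_from_url(
    url: str,
    filename: str,
    message: str,
    run: Callable[..., object] = subprocess.run,
) -> List[List[str]]:
    """Download `url` to `filename`, save it with DataLad and show the status.

    `run` is injectable so that the command sequence can be inspected
    without calling curl or datalad (hypothetical helper, not a DataLad API).
    """
    commands = [
        ["curl", "-L", "-o", filename, url],  # -L follows HTTP redirects
        ["datalad", "save", "-m", message],   # record the new file
        ["datalad", "status"],                # should report a clean tree
    ]
    for command in commands:
        run(command, check=True)  # check=True raises if a command fails
    return commands


# Dry run: collect the commands instead of executing them.
recorded = []
add_file_from_url(
    "https://example.org/book.pdf",  # placeholder URL, not a real book
    "book.pdf",
    "add a book",
    run=lambda command, check: recorded.append(command),
)
print(recorded)
```

Leaving out the `run` argument executes the three commands for real, so the call then has to happen inside a dataset.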


## Modifying Content and Tracking Changes

TODO: account for differences between Windows and Linux

| Code | Description |
| --- | --- |
| `git log` | Display the commit history of the repository |
| `git log -2` | Display the last two entries in commit history |
| `git log --oneline` | Display a compact one-line view of the commit history |
| `datalad unlock data/` | Unlock the file content of the `data/` folder |
| `datalad unlock file.txt` | Unlock the content of `file.txt` |
| `datalad save` | Save untracked changes and lock unlocked file contents |
| `echo "content" >> file.txt` | Append the text `"content"` to `file.txt` | 

**Exercise**: Display the `git log` to view all commits you made to the `learn-datalad` dataset.

In [13]:
!git log

commit dbe3ff9a3609d254a06ce9f56532ec46f6959997 (HEAD -> master)
Author: obi <ole.bialas@posteo.de>
Date:   Tue Dec 2 13:07:07 2025 +0100

    add README

commit fa67eb5745734380e5d347320202922b7b06e295
Author: obi <ole.bialas@posteo.de>
Date:   Tue Dec 2 13:07:06 2025 +0100

    add a book on Git

commit bdd7b31bf7120fc7497619e3a6338402a54480c0
Author: obi <ole.bialas@posteo.de>
Date:   Tue Dec 2 13:07:05 2025 +0100

    add a book on Python

commit 417f3a3215e36b2e8950fdbe613ffd4ae0d2b90e
Author: obi <ole.bialas@posteo.de>
Date:   Tue Dec 2 13:06:58 2025 +0100

    [DATALAD] new dataset


**Exercise**: Display the `git log` in a compact one-line view.

In [14]:
!git log --oneline

dbe3ff9 (HEAD -> master) add README
fa67eb5 add a book on Git
bdd7b31 add a book on Python
417f3a3 [DATALAD] new dataset


**Exercise**: Unlock the content of `README.md`. Then, check the dataset's status.

In [15]:
!datalad unlock README.md
!datalad status

unlock(ok): README.md (file)
 modified: README.md (file)


**Example**: Append the line `"It uses git and git-annex"` to `README.md`, either using your editor or the `echo` command. Then, `save` with a message and check the dataset's `status`.

In [16]:
!echo "It uses git and git-annex" >> README.md
!datalad save -m "add line"
!datalad status

add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean
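The difference between `>` (overwrite) and `>>` (append) maps directly onto Python's file modes, which can be handy inside a notebook. The sketch below works on a temporary file outside of any dataset, so no unlocking is involved; the file name is reused from the exercise only for readability.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"

    # Mode "w" truncates the file first, like `echo "..." > README.md`.
    with open(readme, "w") as f:
        f.write("This is a DataLad dataset\n")

    # Mode "a" keeps the existing content, like `echo "..." >> README.md`.
    with open(readme, "a") as f:
        f.write("It uses git and git-annex\n")

    lines = readme.read_text().splitlines()

print(lines)
```

Inside a dataset, an annexed file still has to be unlocked before it can be opened for writing, exactly as in the exercise above.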


**Exercise**: Unlock `README.md` and append another line `"For decentralized version control"`. Then, `save` the changes and check the `status`.

In [17]:
!datalad unlock README.md
!echo "For decentralized version control" >> README.md
!datalad save -m "add another line"
!datalad status

unlock(ok): README.md (file)
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean


**Exercise**: Display the last two entries in the git history.

In [18]:
!git log -2

commit cc4ee6732460485fd489daf76682e197e06b86a1 (HEAD -> master)
Author: obi <ole.bialas@posteo.de>
Date:   Tue Dec 2 13:07:11 2025 +0100

    add another line

commit 70ed6028c4f309c4f059fcf40ac62c48580f28d5
Author: obi <ole.bialas@posteo.de>
Date:   Tue Dec 2 13:07:10 2025 +0100

    add line


**Exercise**: Unlock `README.md` and then, without making any changes, `save` with a message. Check the last two entries in the git history. Did your `save` command create an entry?

In [19]:
!datalad unlock README.md
!datalad save -m "did nothing"
!git log -2

unlock(ok): README.md (file)
add(ok): README.md (file)
action summary:
  add (ok: 1)
  save (notneeded: 1)
commit cc4ee6732460485fd489daf76682e197e06b86a1 (HEAD -> master)
Author: obi <ole.bialas@posteo.de>
Date:   Tue Dec 2 13:07:11 2025 +0100

    add another line

commit 70ed6028c4f309c4f059fcf40ac62c48580f28d5
Author: obi <ole.bialas@posteo.de>
Date:   Tue Dec 2 13:07:10 2025 +0100

    add line


## Installing Subdatasets

You can add any data to your DataLad dataset, including other datasets!
DataLad allows you to install datasets as submodules, which means that they are added to your repository while maintaining their own, independent git history. Modularizing your research project with subdatasets (for different modalities, conditions etc.) makes the data more reusable!
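Git records each submodule as an entry in a `.gitmodules` file at the root of the parent repository (this file shows up in the `status` output later in this lesson). As a rough illustration, the Python sketch below reads such entries with the standard library's `configparser`. The file content is an assumed example rather than a copy of a real dataset's file, and `list_submodules` is a hypothetical helper; `datalad subdatasets` remains the intended way to list subdatasets.

```python
import configparser
from typing import Dict

# Assumed example of a .gitmodules file with one installed subdataset.
GITMODULES = """
[submodule "ds005131"]
    path = ds005131
    url = https://github.com/OpenNeuroDatasets/ds005131.git
"""


def list_submodules(text: str) -> Dict[str, Dict[str, str]]:
    """Return {name: {"path": ..., "url": ...}} for every submodule section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    submodules = {}
    for section in parser.sections():
        # Section headers look like: submodule "ds005131"
        name = section.split('"')[1]
        submodules[name] = {
            "path": parser[section]["path"],
            "url": parser[section]["url"],
        }
    return submodules


print(list_submodules(GITMODULES))
```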

| Code | Description |
| --- | --- |
| `datalad install -d my-dataset <URL>` | Install the dataset from the given URL as a subdataset into the `my-dataset/` directory |
| `datalad install -d . <URL>` | Install the dataset from the given URL as a subdataset into the current directory |
| `datalad subdatasets` | List all subdatasets of the current dataset |

**Example**: Install the dataset from the OpenNeuro URL [https://github.com/OpenNeuroDatasets/ds005131.git](https://github.com/OpenNeuroDatasets/ds005131.git) as a subdataset into the current dataset.

In [20]:
!datalad install -d . https://github.com/OpenNeuroDatasets/ds005131.git

[INFO   ] Remote origin not usable by git-annex; setting annex-ignore
[INFO   ] https://github.com/OpenNeuroDatasets/ds005131.git/config download failed: Not Found
install(ok): ds005131 (d

**Example**: Change the directory to the root of the newly installed sub-dataset and check its `git log`.

In [21]:
%cd ds005131
!git log --oneline

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad/ds005131
51c1338 (HEAD -> main, tag: 1.0.1, origin/master, origin/main, origin/HEAD) [OpenNeuro] Recorded changes
7bb0e92 [OpenNeuro] Recorded changes
579b3aa [OpenNeuro] Recorded changes
95b3ce9 [OpenNeuro] Recorded changes
82286ff [OpenNeuro] Recorded changes
577f003 (tag: 1.0.0) [OpenNeuro] Recorded changes
2779065 [OpenNeuro] Recorded changes
de27cca [OpenNeuro] Recorded changes
087aafd [OpenNeuro] Recorded changes
86fb2d1 [OpenNeuro] Recorded changes
3e04d03 [OpenNeuro] Recorded changes
3a6eca0 [OpenNeuro] Recorded changes
f9191e6 [OpenNeuro] Recorded changes
bc72cc4 [OpenNeuro]

**Exercise**: Change the directory back to the parent `learn-datalad/`. Then, browse the [OpenNeuro database](https://openneuro.org/search?query={%22keywords%22:[]}), choose a dataset and install it as another subdataset. 

In [22]:
%cd ..
!datalad install -d . https://github.com/OpenNeuroDatasets/ds003507.git

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad
[INFO   ] R

**Exercise**: Change the directory to the newly installed subdataset and inspect its `git log`.

In [23]:
%cd ds003507
!git log --oneline

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad/ds003507
8b8fad4 (HEAD -> master, tag: 1.0.1, origin/master, origin/HEAD) [DATALAD] Recorded changes
29ce3cc [DATALAD] Recorded changes
b82c86f [DATALAD] Recorded changes
e548886 [DATALAD] Recorded changes
f212034 [DATALAD] Recorded changes
540710b [DATALAD] Recorded changes
ea7a5e4 [DATALAD] Recorded changes
80eeffd [DATALAD] Recorded changes
a821149 [DATALAD] Recorded changes
b461339 [DATALAD] Recorded changes
98e47d8 [DATALAD] Recorded changes
a9d5d59 [DATALAD] Recorded changes
5cb3c0b (tag: 1.0.0) [DATALAD] Recorded changes
2f692c4 [DATALAD] Recorded changes
7527e33 [DATALAD] exclude paths

**Exercise**: Change the directory back to the parent `learn-datalad/` and list all `subdatasets`.

In [24]:
%cd ..
!datalad subdatasets

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad
subdataset(ok): ds003507 (dataset)
subdataset(ok): ds005131 (dataset)


## Going Back and Forth in Time

Because DataLad keeps track of all changes to our dataset, we can restore any previous version of a given file. This is useful if we made a mistake and want to restore an older version of our project, or if we simply want to check how the data looked previously. In this section, we are going to learn two ways of doing this: checking out a specific commit and resetting the repository. `checkout` is mostly useful if we want to look at an older state of our project without changing the current state of the repository, while `reset` is used to modify the repository's state.

| Code | Description |
| --- | --- |
| `git checkout HEAD~3` | `checkout` the state of the repository 3 commits ago |
| `git checkout d0e83f29` | `checkout` the state of the repository at the commit with the hash `d0e83f29` |
| `git reset --mixed d0e83f29` | `reset` the state of the repository to the commit with the hash `d0e83f29` but keep the working directory as-is |
| `git reset --hard d0e83f29` | `reset` the state of the repository to the commit with the hash `d0e83f29` and overwrite the tracked files in the working directory to match |


**Example**: Check the git history, identify the last commit before we made any changes to `README.md` and note its commit hash.

In [25]:
!git log --oneline

6ae74ad (HEAD -> master) [DATALAD] Added subdataset
34db306 [DATALAD] Added subdataset
cc4ee67 add another line
70ed602 add line
dbe3ff9 add README
fa67eb5 add a book on Git
bdd7b31 add a book on Python
417f3a3 [DATALAD] new dataset


**Example**: Use `checkout` to switch to the repository's state at the commit where `README.md` was created.

In [26]:
!git checkout HEAD~4

Note: switching to 'HEAD~4'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at dbe3ff9 add README
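`HEAD~4` names the commit four steps behind the current one. The sketch below mimics that lookup over the linear history printed by `git log --oneline` above; it is a simplified model that ignores branches and merges, and `resolve` is a hypothetical helper rather than anything git provides.

```python
from typing import List

# Short hashes from `git log --oneline`, newest first (taken from the log above).
HISTORY = [
    "6ae74ad",  # [DATALAD] Added subdataset
    "34db306",  # [DATALAD] Added subdataset
    "cc4ee67",  # add another line
    "70ed602",  # add line
    "dbe3ff9",  # add README
    "fa67eb5",  # add a book on Git
    "bdd7b31",  # add a book on Python
    "417f3a3",  # [DATALAD] new dataset
]


def resolve(ref: str, history: List[str] = HISTORY) -> str:
    """Resolve 'HEAD' or 'HEAD~<n>' against a linear, newest-first history."""
    if ref == "HEAD":
        return history[0]
    prefix = "HEAD~"
    if not ref.startswith(prefix):
        raise ValueError(f"unsupported reference: {ref}")
    suffix = ref[len(prefix):]
    steps = int(suffix) if suffix else 1  # a bare 'HEAD~' means one step back
    if steps >= len(history):
        raise ValueError(f"history has only {len(history)} commits")
    return history[steps]


print(resolve("HEAD~4"))  # dbe3ff9, the "add README" commit
```

Because the count is relative to the current `HEAD`, the same expression points at a different commit as soon as new commits are added, which is why noting the hash is the more robust option.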


**Exercise**: Check the git commit history and inspect the content of `README.md`.

In [27]:
!git log --oneline
!cat README.md

dbe3ff9 (HEAD) add README
fa67eb5 add a book on Git
bdd7b31 add a book on Python
417f3a3 [DATALAD] new dataset
This is a DataLad dataset


**Exercise**: Switch back to the previous (i.e. the master) branch.

In [28]:
!git switch -

Previous HEAD position was dbe3ff9 add README
Switched to branch 'master'


**Exercise**: Identify the hash of the commit where we appended the first line to `README.md`. Then, `checkout` to that commit and inspect the content of `README.md`.

In [29]:
!git checkout HEAD~3
!cat README.md

Note: switching to 'HEAD~3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 70ed602 add line
This is a DataLad dataset
It uses git and git-annex


**Exercise**: Switch back to the master branch and inspect the content of `README.md` to make sure it was restored.

In [30]:
!git switch -
!cat README.md

Previous HEAD position was 70ed602 add line
Switched to branch 'master'
This is a DataLad dataset
It uses git and git-annex
For decentralized version control


**Exercise**: Use `git reset --mixed` to reset the repository's state to the point before `README.md` was modified. Then, check the `git log` and the dataset's `status`.

**NOTE**: Using `--mixed` resets the repository's state but does not affect your working directory: changes from commits made after the reset point will appear as unstaged or untracked changes.

In [31]:
!git reset --mixed HEAD~4
!git log --oneline
!datalad status

Unstaged changes after reset:
M	README.md
dbe3ff9 (HEAD -> master) add README
fa67eb5 add a book on Git
bdd7b31 add a book on Python
417f3a3 [DATALAD] new dataset
untracked: .gitmodules (file)
untracked: ds003507 (directory)
untracked: ds005131 (directory)
 modified: README.md (symlink)


**Exercise**: Save the unstaged changes to the `README.md`. Then, check the content of `README.md` to make sure nothing got lost.

**NOTE**: Since you are adding what used to be multiple commits in a single operation, you may choose a different commit message.

In [32]:
!datalad save -m "adding info to README"
!cat README.md

add(ok): ds003507 (dataset)
add(ok): ds005131 (dataset)
add(ok): .gitmodules (file)
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 4)
  save (ok: 1)
This is a DataLad dataset
It uses git and git-annex
For decentralized version control


**Exercise**: Use `git reset --hard` to reset the repository's state to the point before `README.md` was modified. Then, check the `git log` and the dataset's `status`.

**NOTE**: Using `--hard` also resets your working directory, and all commits made after the reset point disappear from the history (they can still be recovered until git's garbage collector deletes them, which by default happens after 30 days). Also, this won't remove the installed subdatasets; they remain as untracked directories that you can simply remove manually.
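To make the difference between `--mixed` and `--hard` concrete, the sketch below models a repository as three pieces of state: the commit that `HEAD` points to, the index (staging area) and the working tree. This is a deliberately small toy model, not how git is implemented: `ToyRepo` is a made-up class, it only covers tracked files, and untracked content such as the subdataset directories is not represented.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Snapshot = Dict[str, str]  # file name -> file content


@dataclass
class ToyRepo:
    commits: List[Snapshot]                 # committed snapshots, oldest first
    head: int = field(init=False)           # position of HEAD in `commits`
    index: Snapshot = field(init=False)     # staged state
    worktree: Snapshot = field(init=False)  # files on disk

    def __post_init__(self) -> None:
        self.head = len(self.commits) - 1
        self.index = dict(self.commits[self.head])
        self.worktree = dict(self.commits[self.head])

    def reset(self, steps_back: int, hard: bool = False) -> None:
        """Move HEAD and the index back; with hard=True also the working tree."""
        if not 0 <= steps_back <= self.head:
            raise ValueError("cannot go back that far")
        self.head -= steps_back
        self.index = dict(self.commits[self.head])
        if hard:
            self.worktree = dict(self.commits[self.head])

    def changed(self) -> List[str]:
        """Files whose working-tree content differs from the index."""
        return sorted(
            name for name, content in self.worktree.items()
            if self.index.get(name) != content
        )


commits = [
    {"README.md": "This is a DataLad dataset\n"},
    {"README.md": "This is a DataLad dataset\nIt uses git and git-annex\n"},
]

mixed = ToyRepo(commits)
mixed.reset(1)             # like `git reset --mixed HEAD~1`
hard = ToyRepo(commits)
hard.reset(1, hard=True)   # like `git reset --hard HEAD~1`

print(mixed.changed())  # README.md still has the newer content on disk
print(hard.changed())   # nothing differs, the newer content is gone
```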

In [34]:
!git reset --hard HEAD~1
!git log --oneline
!datalad status

HEAD is now at dbe3ff9 add README
dbe3ff9 (HEAD -> master) add README
fa67eb5 add a book on Git
bdd7b31 add a book on Python
417f3a3 [DATALAD] new dataset
untracked: ds003507 (directory)
untracked: ds005131 (directory)


## Dataset Configurations: To Annex or not to Annex?

By default, DataLad will use `git-annex` to handle the content of every single file in your dataset. However, this is not always desirable. For example, you may not want to annex small text files like code, to avoid having to unlock them for every edit. We can tell DataLad which files should be annexed by editing the `.gitattributes` file. Let's look at the default `.gitattributes` that was created when we initialized the dataset:

In [35]:
!cat .gitattributes

* annex.backend=MD5E
**/.git* annex.largefiles=nothing


There are two lines in this file:
- `* annex.backend=MD5E`: tells git-annex to use the `MD5E` backend for generating file hashes
- `**/.git* annex.largefiles=nothing`: tells git-annex not to annex files whose names start with `.git` (such as `.gitattributes` and `.gitmodules`), so that these configuration files are stored directly in git
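The effect of the `MD5E` backend is visible in the names of annexed files, which look like `MD5E-s<size>--<md5 checksum><extension>` in the `ls -l` output later in this lesson. The sketch below rebuilds such a name in Python. The pattern is inferred from those listings only, `md5e_key` is a hypothetical helper, and git-annex's own handling of file extensions has more rules than shown here.

```python
import hashlib
from pathlib import Path


def md5e_key(filename: str, content: bytes) -> str:
    """Build a name of the form MD5E-s<size>--<md5><extension>."""
    digest = hashlib.md5(content).hexdigest()
    extension = Path(filename).suffix  # e.g. ".pdf"; empty if there is none
    return f"MD5E-s{len(content)}--{digest}{extension}"


# An empty file: size 0 and the MD5 checksum of zero bytes.
print(md5e_key("progit.pdf", b""))
```

For an empty `progit.pdf` this prints `MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.pdf`, the same name that the symlink for `progit.pdf` points to in the recorded run.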

We usually don't want to edit these default values. Instead, we want to add lines to `.gitattributes` to specify which contents should and shouldn't be annexed. Note that changes in the configuration will not automatically be applied to files that are already tracked. Thus, it is best to configure `.gitattributes` right after initializing the dataset, before data is added.

| Code | Description |
| --- | --- |
| `* annex.largefiles=(mimeencoding=binary)` | Only annex files with a `binary` encoding |
| `myfile.pdf annex.largefiles=nothing` | Don't annex `myfile.pdf` |
| `* annex.largefiles=(largerthan=5kb)` | Only annex files whose size exceeds 5 kB |
| `* annex.largefiles=((largerthan=5kb)or(mimeencoding=binary))` | Only annex binary files and files larger than 5 kB |
| `git annex unannex <files>` | Unannex the content of the given files |

**Example**: Check if the files in `books` are symlinks.


In [37]:
!ls -l books/

total 8
lrwxrwxrwx 1 olebi olebi 131 Dec  2 13:07 byte-of-python.pdf -> ../.git/annex/objects/P5/qK/MD5E-s2693891--e61afe4b3c5d76c849c4e61f6547ed03.pdf/MD5E-s2693891--e61afe4b3c5d76c849c4e61f6547ed03.pdf
lrwxrwxrwx 1 olebi olebi 119 Dec  2 13:07 progit.pdf -> ../.git/annex/objects/W1/7x/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.pdf/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.pdf
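The same check can be done from Python instead of reading the `ls -l` listing. The sketch below assumes a POSIX system where symlinks can be created freely; it builds a throwaway directory with one symlink and one regular file, and `describe` is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path


def describe(path: Path) -> str:
    """Report whether `path` is a symlink (as annexed files are) or a regular file."""
    if path.is_symlink():
        return f"{path.name}: symlink -> {os.readlink(path)}"
    return f"{path.name}: regular file"


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Stand-in for an object in .git/annex/objects/
    target = folder / "annexed-content"
    target.write_bytes(b"%PDF")

    link = folder / "book.pdf"
    link.symlink_to(target.name)  # relative link, like the ones git-annex creates

    plain = folder / "README.md"
    plain.write_text("This is a DataLad dataset\n")

    report = [describe(path) for path in (link, plain)]

print(report)
```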


**Example**: Add a line to `.gitattributes` to avoid annexing PDFs.

In [None]:
# new content of .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
**/*.pdf annex.largefiles=nothing

In [38]:
!echo "**/*.pdf annex.largefiles=nothing" >> .gitattributes

**Example**: Unannex the `books/` folder and save to apply the changed configuration. Then, check again if the files in `books/` are symlinks.

In [39]:
!git annex unannex books/*
!datalad save -m "unannex books"

unannex books/byte-of-python.pdf ok
unannex books/progit.pdf ok
(recording state in git...)
add(ok): ds003507 (dataset)
add(ok): ds005131 (dataset)
add(ok): .gitmodules (file)
add(ok): books/byte-of-python.pdf (file)
add(ok): books/progit.pdf (file)
add(ok): .gitattributes (file)
save(ok): . (dataset)
action summary:
  add (ok: 6)
  save (ok: 1)


**Exercise**: Check again if the files in `books/` are symlinks.

In [40]:
!ls -l books/

total 2632
-rw-r--r-- 1 olebi olebi 2693891 Dec  2 13:07 byte-of-python.pdf
-rw-r--r-- 1 olebi olebi       0 Dec  2 13:07 progit.pdf


**Exercise**: Change the last line in `.gitattributes` so that only binary files will be annexed.

In [None]:
# new content of .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
* annex.largefiles=(mimeencoding=binary)

add(ok): .gitattributes (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)


In [41]:
!echo "* annex.largefiles=(mimeencoding=binary)" >> .gitattributes

**Exercise**: Unannex `README.md` and save to apply the changed configuration. Now you should be able to edit `README.md` without having to unlock it.

In [None]:
!git annex unannex README.md
!datalad save -m "annex only binary"

                                                                                

**Exercise**: Change the last line in `.gitattributes` so that (non-binary) files greater than 5kb will also be annexed.

In [None]:
# new content of .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
* annex.largefiles=((mimeencoding=binary)or(largerthan=5kb))

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch


In [None]:
!echo "* annex.largefiles=((mimeencoding=binary)or(largerthan=5kb))" >> .gitattributes

**Exercise**: Execute the cell below to save a large text file. Then inspect `README.md` and the new file `test.txt`. If you configured `.gitattributes` correctly in the exercise above, `test.txt` should be a symlink but `README.md` shouldn't.

In [47]:
# Write 10,000 characters; the `with` block closes the file before saving
with open('test.txt', 'w') as f:
    f.write('he' * 5000)
!datalad save -m "add large text file"

add(ok): test.txt (file)
add(ok): .gitattributes (file)
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  save (ok: 1)


In [49]:
!ls -l test.txt
!ls -l README.md

lrwxrwxrwx 1 olebi olebi 124 Dec  2 13:22 test.txt -> .git/annex/objects/FG/zj/MD5E-s10000--611726922b50da655c1f49d3af6874c5.txt/MD5E-s10000--611726922b50da655c1f49d3af6874c5.txt
-rw-r--r-- 1 olebi olebi 30 Dec  2 13:16 README.md
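The outcome above follows from the rule `((mimeencoding=binary)or(largerthan=5kb))`: `test.txt` is plain text but 10,000 bytes long, while `README.md` is plain text and only 30 bytes. The sketch below is a rough Python model of that decision. It rests on two assumptions that are not taken from this lesson: binary content is approximated by the presence of NUL bytes (git-annex determines the MIME encoding differently), and `5kb` is taken to mean 5000 bytes.

```python
def is_binary(content: bytes) -> bool:
    """Crude stand-in for `mimeencoding=binary`: treat NUL bytes as binary."""
    return b"\x00" in content


def should_annex(content: bytes, size_limit: int = 5000) -> bool:
    """Model of `((mimeencoding=binary)or(largerthan=5kb))`.

    `size_limit` is an assumed byte value for `5kb`.
    """
    return is_binary(content) or len(content) > size_limit


small_text = b"This is a DataLad dataset\n"  # like README.md
large_text = b"he" * 5000                    # like test.txt, 10,000 bytes
small_binary = b"\x89PNG\x00\x00"            # short, but contains NUL bytes

for label, content in [
    ("small text", small_text),
    ("large text", large_text),
    ("small binary", small_binary),
]:
    print(label, "->", "annex" if should_annex(content) else "git")
```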


In [38]:
!chmod -R +w learn-datalad/.git/annex/objects/
!rm -rf learn-datalad
!rm -rf my-dataset
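The `chmod` call in the cleanup cell is needed because annexed content is stored write-protected, so a plain recursive delete can fail. A possible Python equivalent is sketched below: `remove_tree` is a hypothetical helper that restores write permission for the owner on every entry and then deletes the tree. The demo runs on a throwaway directory with a read-only folder and file, not on a real dataset, and assumes a POSIX system.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def remove_tree(root) -> None:
    """Make everything under `root` writable for the owner, then delete it."""
    root = Path(root)
    root.chmod(root.stat().st_mode | stat.S_IWUSR)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():  # symlinks carry no permissions of their own
                path.chmod(path.stat().st_mode | stat.S_IWUSR)
    shutil.rmtree(root)


# Demo: a read-only file inside a read-only directory.
base = Path(tempfile.mkdtemp())
objects = base / "dataset" / "objects"
objects.mkdir(parents=True)
(objects / "content").write_text("data")
(objects / "content").chmod(0o444)
objects.chmod(0o555)

remove_tree(base / "dataset")
removed = not (base / "dataset").exists()
base.rmdir()
print(removed)
```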