One of the most valuable features of DataLad is the ability to create and manage multiple instances of a dataset.
These so-called siblings are linked copies that can exchange changes just like Git repositories.
Whether you want to back up your dataset locally, transfer it to an HPC cluster for analysis or publish it on an open science platform - DataLad's siblings provide a convenient way of doing it without having to worry about the underlying file system operations.

In this lesson, we are first going to create a sibling locally, in a separate folder - this can be useful, for example, to create backups on an external drive.
We are then going to use an open science platform (you can choose between GIN and OSF) to publish our dataset.
Finally, we are going to publish our dataset on GitHub.
While GitHub itself can't host the annexed file contents, it can help to make our dataset more visible.
If someone clones the dataset from GitHub (as you did with the OpenNeuro datasets earlier), DataLad will automatically fetch the file contents from other repositories that have them (like the GIN or OSF ones).

To create siblings, we first need a dataset. The cell below creates a new dataset with the `-c yoda` option which configures the dataset according to the [YODA principles](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html), a set of practices for data analysis in DataLad datasets.
If you are interested in these principles, you can follow the link to the DataLad handbook.
For our purposes, it is enough that this configuration option automatically creates some folders and files (e.g. `README.md` and `code/README.md`) so we can create siblings and exchange data without having to add content ourselves.

In [1]:
!datalad create -c yoda my-data
!ls -a my-data

[INFO   ] Running procedure cfg_yoda
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
run(ok): /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data (dataset) [/home/olebi/projects/Introduction-to-Sci...]
create(ok): /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data (dataset)
action summary:
  create (ok: 1)
  run (ok: 1)
.  ..  .datalad  .git  .gitattributes  CHANGELOG.md  README.md  code


## Creating Local Backups

To create a backup at any location, we can simply initialize a bare Git repository and add it as a sibling to our DataLad dataset. Bare means that the Git repository has no working tree - the contents that are normally hidden in the `.git` folder are in the main directory. The absence of a working tree prevents synchronization issues and accidental overwriting when pushing to and pulling from the repository.
In this section you are going to create a sibling for your dataset and then clone from that sibling. This creates a linked chain of datasets so that when you change the original repository, the changes can propagate to the clone (and vice versa).
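To make the difference between a regular and a bare repository concrete, here is a minimal sketch using plain Git in a throwaway directory. The directory names `regular` and `bare` and the variable names are placeholders chosen for this illustration; they are not part of the lesson's dataset.

```shell
# Work in a temporary directory so the lesson folder is not touched
tmp=$(mktemp -d)

# A regular repository keeps Git's internals in a hidden .git folder
git init --quiet "$tmp/regular"
# A bare repository has no working tree; the internals sit at the top level
git init --quiet --bare "$tmp/bare"

ls -a "$tmp/regular"   # lists the hidden .git folder
ls "$tmp/bare"         # lists HEAD, config, objects, refs, ...

# Git itself can report whether a repository is bare
is_bare=$(git -C "$tmp/bare" rev-parse --is-bare-repository)
is_regular_bare=$(git -C "$tmp/regular" rev-parse --is-bare-repository)
echo "bare: $is_bare, regular: $is_regular_bare"
```

Because the bare repository has no checked-out files, a push can never collide with uncommitted edits on the receiving side, which is why it works well as a backup target.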

| Command | Description |
| --- | --- |
| `git init --bare ./mydir`| Create a `--bare` repository called `mydir` in the current directory |
| `git branch -a` | List all branches in the current repository |
| `datalad siblings` | List all siblings of the current dataset |
| `datalad siblings add --name new --url <path>` | Add the repository at the URL as a new sibling with the name `new` |
| `datalad siblings remove --name new` | Remove the sibling with the name `new` |
| `datalad push --to new` | Push the dataset content to the sibling named `new` |
| `datalad update -s new` | Update the dataset's content from the sibling `new` |
| `datalad update -s new --merge` | Merge updates from sibling `new` |

**Example**: Initialize a `--bare` Git repository in the directory `./my-data-backup`.

In [2]:
!git init --bare ./my-data-backup

Initialized empty Git repository in /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data-backup/


**Example**: Add `../my-data-backup` as a sibling to `my-data/` with the name `backup`.

In [3]:
%cd my-data
!datalad siblings add --name backup --url ../my-data-backup

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data
.: backup(-) [../my-data-backup (git)]


**Exercise**: Push `--to` the sibling `backup`.

In [4]:
!datalad push --to backup

publish(ok): . (dataset) [refs/heads/master->backup:refs/heads/master [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->backup:refs/heads/git-annex [new branch]]
action summary:
  copy (notneeded: 1)
  publish (ok: 2)


**Exercise**: Create a `--bare` git repository in another folder, add it as a sibling to `my-data` and push to that sibling.

**BONUS**: Create this new folder on a separate drive.

In [None]:
!git init --bare ../my-data-backup2
!datalad siblings add --name backup2 --url ../my-data-backup2
!datalad push --to backup2

Reinitialized existing Git repository in /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data-backup2/
.: backup2(-) [../my-data-backup2 (git)]


**Exercise**: Clone `my-data-backup` to a new folder called `recovery`.

In [9]:
%cd ..
!datalad clone ./my-data-backup ./recovery

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings
install(ok): /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/recovery (dataset)


**Exercise**: Go to the `my-data/` directory, add a line to `README.md` and save the changes. Then, push `--to` the sibling `backup`.

In [None]:
%cd my-data
!echo "Hello Sibling!" >> README.md
!datalad save
!datalad push --to backup

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
error: too few arguments, run with --help or visit https://handbook.datalad.org
usage: datalad [-c (:name|name=value)] [-C PATH] [--cmd] [-l LEVEL]
               [--on-failure {ignore,continue,stop}]
               [--report-status {success,failure,ok,notneeded,impossible,error}]
               [--report-type {dataset,file}]
               [-f {generic,json,json_pp,tailored,disabled,'<template>'}]
               [--dbg] [--idbg] [--version] [-h]
               command [command

**Exercise**: Now, go to the `recovery/` directory and list all siblings.

In [None]:
%cd ../recovery
!datalad siblings

.: here(+) [git]
.: origin(+) [../my-data-backup (git)]


In [None]:
!datalad update -s origin

[INFO   ] Fetching updates for Dataset(/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/recovery)
update(ok): . (dataset)


In [15]:
!git branch -a

  git-annex
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/git-annex
  remotes/origin/master


**Exercise**: You fetched the updates but didn't merge them into the working tree (i.e. `recovery/README.md` in the working directory does not contain the updates).
Update again but use the `--merge` flag. Then, inspect the content of `recovery/README.md` - it should contain the added line.

In [18]:
!datalad update -s origin --merge

[INFO   ] Fetching updates for Dataset(/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/recovery)
merge(ok): . (dataset) [Merged origin/master]
update.annex_merge(ok): . (dataset) [Merged annex branch]
update(ok): . (dataset)
action summary:
  merge (ok: 1)
  update (ok: 1)
  update.annex_merge (ok: 1)
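
The difference between fetching and merging can also be reproduced with plain Git. The sketch below is an illustration under assumptions, not the lesson's dataset: it uses throwaway repositories named `original`, `backup` and `recovery`, a branch called `main` and a placeholder commit identity. It shows that after a fetch the file in the working tree is unchanged, and that only the merge brings in the new line.

```shell
tmp=$(mktemp -d)
# Placeholder identity so the demo commits work on any machine
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# An original repository and a bare "backup" it pushes to
git init --quiet --bare "$tmp/backup"
git init --quiet "$tmp/original"
cd "$tmp/original"
git symbolic-ref HEAD refs/heads/main   # use the branch name "main" regardless of local defaults
echo "first line" > README.md
git add README.md
git commit --quiet -m "Add README"
git remote add backup "$tmp/backup"
git push --quiet backup main

# Clone the backup, then change the original and push again
git clone --quiet --branch main "$tmp/backup" "$tmp/recovery"
echo "Hello Sibling!" >> README.md
git commit --quiet -am "Greet sibling"
git push --quiet backup main

# In the clone: fetching downloads the commit but leaves the working tree alone
cd "$tmp/recovery"
git fetch --quiet origin
lines_after_fetch=$(wc -l < README.md | tr -d ' ')
# Merging updates the checked-out files
git merge --quiet origin/main
lines_after_merge=$(wc -l < README.md | tr -d ' ')
echo "lines after fetch: $lines_after_fetch, after merge: $lines_after_merge"
```

`datalad update -s origin` corresponds to the fetch step and `datalad update -s origin --merge` adds the merge step, which is why `README.md` only changed after the second command.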


**BONUS**: Change the directory to `recovery/`, make a change to `README.md`, save it and push it `--to origin`. Then, change the directory to `my-data` and update from the `backup` sibling. You should see the change made to `recovery/README.md` in `my-data/README.md`.

In [None]:
!echo "Hello to you, too!" >> README.md
!datalad save
!datalad push --to origin
%cd ../my-data
!datalad update -s backup --merge

add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

## Using Open Science Repositories

Now that we understand how siblings work, we can use online repositories to publish our data.
DataLad offers ready-made `create-sibling` commands to create sibling repositories on different services like GIN, GitHub and OSF.

However, the `create-sibling` commands require that you are correctly authenticated so that DataLad can create new repositories in your name.
There are two main ways of authentication: SSH keys and access tokens.
SSH keys are cryptographic key pairs (public and private) used for secure authentication to servers without passwords. Access tokens are credentials that work like passwords but can be restricted to specific permissions and revoked at any time. In this lesson, we authenticate to GIN with an SSH key and to OSF and GitHub with access tokens (GitHub also supports SSH keys).


| Command | Description |
| --- | --- |
| `ssh-keygen` | Generate a public and private authentication key pair |
| `datalad siblings` | List all siblings of the current dataset |
| `datalad create-sibling-gin my-repo -s gin` | Create a new GIN repository called `my-repo` and add it as a sibling named `gin` |
| `datalad create-sibling-osf --title my-repo -s osf` | Create a new OSF repository with the title `my-repo` and add it as a sibling named `osf` |
| `datalad create-sibling-github my-repo -s github` | Create a new GitHub repository called `my-repo` and add it as a sibling named `github` |
| `datalad push --to gin` | Push the dataset content to the sibling named `gin` |

### Creating a Sibling on GIN

[GIN](https://gin.g-node.org/) is run by the German Neuroinformatics Node (G-Node), a research group based at the Ludwig-Maximilians-Universität München (LMU Munich) in Germany. It is a free data management platform designed for neuroscience research that provides Git-based version control for scientific data, supporting both web interface and command-line access with Git/Git-annex integration for managing large datasets.

This section explains how to register your SSH key with GIN to gain access, create a sibling repository and publish your data.

**Exercise**: Use `ssh-keygen` to generate a public/private key pair without a passphrase.
Note the location where the public key is stored, then open it and **copy** its content, which should look like this:

`ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBOYcoRKZZLWA4FWECpW2K/fTOvuRYXBnBA6gcea2bFq <user>@<computer>`

**NOTE**: You can either run `ssh-keygen` without any arguments and use the dialog menu or specify the arguments in the CLI, as done below.

In [35]:
!ssh-keygen -N "" -f ~/.ssh/test_key
!cat ~/.ssh/test_key.pub

Generating public/private ed25519 key pair.
Your identification has been saved in /home/olebi/.ssh/test_key
Your public key has been saved in /home/olebi/.ssh/test_key.pub
The key fingerprint is:
SHA256:ANSAqf7Nwfvw/xPOALlcMu4XaqkWn4ZWDdEAYHKexhA olebi@iBots-7
The key's randomart image is:
+--[ED25519 256]--+
| E.B=+..o        |
|  X .... .       |
| . =  . o        |
|. .    B .       |
|.   . o S        |
| .   + = + .     |
|  . o.O + = .    |
|   . BoO . +     |
|    o.=oo....    |
+----[SHA256]-----+
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINNyGVDR5MKBuoCEQaApKkt2r+PLhH4m6z8xnjsUlqCt olebi@iBots-7
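
As a small sketch of what the `ssh-keygen` arguments do, the example below creates a key pair in a temporary directory instead of `~/.ssh`, so it does not touch any existing keys. The file name `demo_key` and the comment `demo-key` are placeholders chosen for this illustration.

```shell
tmp=$(mktemp -d)

# -t selects the key type, -N "" sets an empty passphrase,
# -C adds a comment and -f sets the output file
ssh-keygen -q -t ed25519 -N "" -C "demo-key" -f "$tmp/demo_key"

# Two files are created: the private key and the public key (.pub)
ls "$tmp"

# The public key is a single line: <key type> <key data> <comment>
pub_type=$(cut -d ' ' -f 1 "$tmp/demo_key.pub")
pub_comment=$(cut -d ' ' -f 3 "$tmp/demo_key.pub")
echo "type: $pub_type, comment: $pub_comment"

# Print the fingerprint, which lets you compare keys without reading the full key data
ssh-keygen -l -f "$tmp/demo_key.pub"
```

Only the content of the `.pub` file is meant to be shared with a service like GIN; the private key stays on your machine.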


**Exercise**: Log in to your [GIN account](https://gin.g-node.org/) and go to your user settings.

![](img/gin1.png)

Then, select "SSH Keys" (red box) and "Add Key" (green box). Paste the public SSH key into the "Content" field (blue box) - the "Key Name" can be anything you want.

![](img/gin2.png)

**Exercise**: Run the cell below to test the SSH connection and see if your key is working. If it is, you should see the message:

`Hi there, You've successfully authenticated, but GIN does not provide shell access.`

In [None]:
!ssh -T git@gin.g-node.org

**Exercise** (if your SSH key is working, you can skip this one): Change the path in the cell below to the location of the SSH key you just generated and run the `git config` command to tell Git and DataLad to use this SSH key for the current dataset. Then, check your SSH connection again, passing the same key with `-i`, to confirm it is working.

In [4]:
!git config core.sshCommand "ssh -i ~/.ssh/test_key" # make sure the path is correct
!ssh -i ~/.ssh/test_key -T git@gin.g-node.org # plain ssh does not read Git's config, so pass the key explicitly

Hi there, You've successfully authenticated, but GIN does not provide shell access.


**Exercise**: Use `create-sibling-gin` to create a new GIN repository called `my-data` and name the sibling `gin`. 

In [9]:
!datalad create-sibling-gin my-data -s gin

create_sibling_gin(ok): [sibling repository 'gin' created at https://gin.g-node.org/obi/my-data]
configure-sibling(ok): . (sibling)
action summary:
  configure-sibling (ok: 1)
  create_sibling_gin (ok: 1)


**Exercise**: Push `--to gin` and check the repository in the browser to verify the data was transferred.

In [10]:
!datalad push --to gin

publish(ok): . (dataset) [refs/heads/git-annex->gin:refs/heads/git-annex 59e806f..e9dcef0]
publish(ok): . (dataset) [refs/heads/master->gin:refs/heads/master [new branch]]
action summary:
  publish (ok: 2)


### Creating a Sibling on OSF

The Open Science Framework (OSF) is run by the Center for Open Science (COS), a non-profit technology organization based in Charlottesville, Virginia, USA, dedicated to increasing openness, integrity, and reproducibility of research.

DataLad can create siblings on OSF using the `datalad create-sibling-osf` command. It sets up a git-annex special remote that stores the annexed files on OSF while Git tracks the metadata. This enables version-controlled data sharing and collaboration through OSF's infrastructure.

Creating a sibling on OSF requires the `datalad-osf` extension which you have if you installed the course environment - if you don't have it, just run `pip install datalad-osf`.
It is also recommended to configure DataLad to load the `datalad-next` extension, which can be done by running the following cell.

In [2]:
!git config --global --add datalad.extensions.load next

**Exercise**: Log in to [OSF](https://osf.io), go to "Settings" > "Personal Access Token" (red box) and click on "Create Token" (blue box).

![](img/osf1.png)

Give the token a name of your choice, grant it full read and write permissions and click on "Create Token".

![](img/token1.png)

Copy the token. Careful: you won't be able to see the token again once you close the window!

![](img/token2.png)

**Exercise**: Run the `datalad osf-credentials` command and paste the access token when prompted. You should see `osf_credentials(ok): [authenticated as <your name>]`

In [None]:
# This has to be done in the terminal
!datalad osf-credentials

**Exercise**: Use `create-sibling-osf` to create a new OSF repository and register it as a sibling to `my-data/` with the name `osf`.

In [8]:
%cd my-data
!datalad create-sibling-osf --title my-data -s osf

[Errno 2] No such file or directory: 'my-data'
/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data
create-sibling-osf(ok): https://osf.io/5svxm/
[INFO   ] Configure additional publication dependency on "osf-storage"
configure-sibling(ok): . (sibling)


**Exercise**: Push to the `osf` sibling and inspect your OSF repository in the browser.

**NOTE**: The OSF repository will not contain the data in a human-readable form. You can push to and pull from this repository but you can't explore the files in the browser. Alternatively, you can configure OSF as a human-readable special remote which contains the file data but not the version history. See [this tutorial](https://docs.datalad.org/projects/osf/en/latest/tutorial/exporthumandata.html) for a description of how to do that.

In [9]:
!datalad push --to osf


### Creating a Sibling on GitHub

GitHub is a web-based platform for hosting Git repositories and collaborative software development, owned and operated by Microsoft Corporation since 2018.

Even though a dataset sibling on GitHub does not serve the data, it constitutes a simple, findable access point to retrieve the dataset, and can be used to provide updates and fixes via pull requests, issues, etc.


**Exercise**: Log in to [GitHub](https://github.com) to create an access token.
First, click on your user icon in the top right and select "Settings".

![](img/gh1.png)

Then, select "Developer settings" at the bottom of the menu on the left.

![](img/gh2.png)

Select "Generate New Token (classic)"

![](img/gh3.png)

Grant full access to repositories, create the token and copy it. Careful - you won't be able to see the token again after closing this window.

![](img/gh4.png)

**Exercise**: Use `create-sibling-github` to create a new GitHub repo called `my-data` and name the sibling `github`. Paste the token you generated when prompted. If you are not prompted for a token and you receive an error message, check the next exercise.

In [3]:
!datalad create-sibling-github my-data -s github

create_sibling_github(ok): [sibling repository 'github' created at https://github.com/OleBialas/my-data]
configure-sibling(ok): . (sibling)
action summary:
  configure-sibling (ok: 1)
  create_sibling_github (ok: 1)


**Exercise** (you can skip this one if creating the GitHub sibling worked): replace the token with the one you generated and run the cell below to explicitly tell Git and DataLad to use this token. Then create the GitHub sibling again.

In [None]:
!git config --global datalad.credential.api.github.com.token <your-token> # replace <your-token> with your own token and never share it

**Exercise**: Push `--to github` and inspect your repository in the browser

In [None]:
!datalad push --to github

In [None]:
#Cleanup
%cd ..
!rm -rf my-data
!rm -rf my-data-backup
!rm -rf my-data-backup2
!rm -rf recovery

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings
