One of the most valuable features of DataLad is the ability to create and manage multiple instances of a dataset.
These so-called siblings are linked copies that can exchange changes, just like Git repositories.
Whether you want to back up your dataset locally, transfer it to an HPC cluster for analysis, or publish it on an open science platform, DataLad's siblings provide a convenient way to do it without having to worry about the underlying file system operations.

In this lesson, we are first going to create a sibling locally, in a separate folder. This can be useful, for example, to create backups on an external drive.
We are then going to use an open science platform (you can choose between GIN and OSF) to publish our dataset.
Finally, we are going to publish our dataset on GitHub.
While GitHub itself can't host the annexed file contents, it can help to make our dataset more visible.
If someone clones the dataset from GitHub (as you did with the OpenNeuro datasets earlier), DataLad will automatically fetch the file contents from other repositories that have them (like the GIN or OSF ones).

To create siblings, we first need a dataset. The cell below creates a new dataset with the `-c yoda` option, which configures the dataset according to the [YODA principles](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html), a set of practices for data analysis in DataLad datasets.
If you are interested in these principles, you can follow the link to the DataLad handbook.
For our purposes, it is enough that this configuration option automatically creates some folders and files (e.g. `README.md` and `code/README.md`) so we can create siblings and exchange data without having to add content ourselves.

In [1]:
!datalad create -c yoda my-data
!ls -a my-data

[INFO   ] Running procedure cfg_yoda 
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) =====
run(ok): /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data (dataset) [/home/olebi/projects/Introduction-to-Sci...]
create(ok): /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data (dataset)
action summary:
  create (ok: 1)
  run (ok: 1)
.  ..  .datalad  .git  .gitattributes  CHANGELOG.md  README.md	code


## Creating Local Backups

To create a backup at any location, we can simply initialize a bare Git repository and add it as a sibling to our DataLad dataset. Bare means that the Git repository has no working tree: the contents that are normally hidden in the `.git` folder are in the main directory. The absence of a working tree prevents synchronization issues and accidental overwriting when pushing to and pulling from the repository.
In this section you are going to create a sibling for your dataset and then clone from that sibling. This creates a linked chain of datasets so that when you change the original repository, the changes can propagate to the clone (and vice versa).
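
To make the difference between a regular and a bare repository concrete, here is a minimal sketch using plain Git in a throwaway directory. The directory names `regular` and `bare` are placeholders chosen for illustration and are not part of the lesson's dataset.

```shell
# Work in a throwaway directory so nothing in the lesson folder is touched.
tmp=$(mktemp -d)

# A regular repository keeps Git's internals in a hidden .git folder.
git init --quiet "$tmp/regular"
# A bare repository has no working tree: the internals sit at the top level.
git init --quiet --bare "$tmp/bare"

# Ask Git directly whether each repository is bare.
regular_is_bare=$(git -C "$tmp/regular" rev-parse --is-bare-repository)
bare_is_bare=$(git -C "$tmp/bare" rev-parse --is-bare-repository)
echo "regular: $regular_is_bare"
echo "bare:    $bare_is_bare"

# Compare the layouts: HEAD, objects/ and refs/ appear directly in the bare repository.
ls -a "$tmp/regular"
ls "$tmp/bare"
```

In a notebook cell, each command would need the `!` prefix used elsewhere in this lesson, or the whole block can be run in a terminal.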

| Command | Description |
| --- | --- |
| `git init --bare ./mydir`| Create a `--bare` repository called `mydir` in the current directory |
| `git branch -a` | List all branches in the current repository |
| `datalad siblings` | List all siblings of the current dataset |
| `datalad siblings add --name new --url <path>` | Add the repository at the URL as a new sibling with the name `new` |
| `datalad siblings remove --name new` | Remove the sibling with the name `new` |
| `datalad push --to new` | Push the dataset content to the sibling named `new` |
| `datalad update -s new` | Fetch updates from the sibling `new` without merging them into the working tree |
| `datalad update -s new --merge` | Fetch updates from the sibling `new` and merge them into the working tree |

**Example**: Initialize a `--bare` Git repository in the directory `./my-data-backup`.

In [2]:
!git init --bare ./my-data-backup

Initialized empty Git repository in /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data-backup/


**Example**: Add `../my-data-backup` as a sibling to `my-data/` with the name `backup`.

In [3]:
%cd my-data
!datalad siblings add --name backup --url ../my-data-backup

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data
.: backup(-) [../my-data-backup (git)]


**Exercise**: Push `--to` the sibling `backup`.

In [4]:
!datalad push --to backup

publish(ok): . (dataset) [refs/heads/master->backup:refs/heads/master [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->backup:refs/heads/git-annex [new branch]]
action summary:
  copy (notneeded: 1)
  publish (ok: 2)


**Exercise**: Create a `--bare` git repository in another folder, add it as a sibling to `my-data` and push to that sibling.

**BONUS**: Create this new folder on a separate drive.

In [None]:
!git init --bare ../my-data-backup2
!datalad siblings add --name backup2 --url ../my-data-backup2
!datalad push --to backup2

Reinitialized existing Git repository in /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data-backup2/
.: backup2(-) [../my-data-backup2 (git)]


**Exercise**: Clone `my-data-backup` to a new folder called `recovery`.

In [9]:
%cd ..
!datalad clone ./my-data-backup ./recovery

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings
install(ok): /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/recovery (dataset)


**Exercise**: Go to the `my-data/` directory, add a line to `README.md` and save the changes. Then, push `--to` the sibling `backup`.

In [None]:
%cd my-data
!echo "Hello Sibling!" >> README.md
!datalad save
!datalad push --to backup

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/my-data
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
error: too few arguments, run with --help or visit https://handbook.datalad.org
usage: datalad [-c (:name|name=value)] [-C PATH] [--cmd] [-l LEVEL]
               [--on-failure {ignore,continue,stop}]
               [--report-status {success,failure,ok,notneeded,impossible,error}]
               [--report-type {dataset,file}]
               [-f {generic,json,json_pp,tailored,disabled,'<template>'}]
               [--dbg] [--idbg] [--version] [-h]
               command [command

**Exercise**: Now, go to the `recovery/` directory and list all siblings.

In [None]:
%cd ../recovery
!datalad siblings

.: here(+) [git]
.: origin(+) [../my-data-backup (git)]


In [None]:
!datalad update -s origin

[INFO   ] Fetching updates for Dataset(/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/recovery)
update(ok): . (dataset)


In [15]:
!git branch -a

  git-annex
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/git-annex
  remotes/origin/master


**Exercise**: You fetched the updates but didn't merge them into the working tree (i.e. `recovery/README.md` in the working directory does not contain the updates).
Update again but use the `--merge` flag. Then, inspect the content of `recovery/README.md` - it should contain the added line.

In [18]:
!datalad update -s origin --merge

[INFO   ] Fetching updates for Dataset(/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/03_creating_siblings/recovery)
merge(ok): . (dataset) [Merged origin/master]
update.annex_merge(ok): . (dataset) [Merged annex branch]
update(ok): . (dataset)
action summary:
  merge (ok: 1)
  update (ok: 1)
  update.annex_merge (ok: 1)
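
The difference between fetching and merging seen in the two `update` calls above can also be reproduced with plain Git. The following is a minimal sketch under stated assumptions: it uses a throwaway directory, the placeholder names `source`, `backup.git` and `recovery`, and a placeholder commit identity. It mirrors the chain from this section but does not involve DataLad or annexed files.

```shell
# Throwaway directory; all names below are placeholders, not the lesson's folders.
tmp=$(mktemp -d)
# Placeholder identity so the example commits work without any Git configuration.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# Chain: source --push--> backup.git (bare) --clone/fetch--> recovery
git init --quiet --bare "$tmp/backup.git"
git init --quiet "$tmp/source"
echo "first line" > "$tmp/source/README.md"
git -C "$tmp/source" add README.md
git -C "$tmp/source" commit --quiet -m "Add README"
git -C "$tmp/source" remote add backup "$tmp/backup.git"
git -C "$tmp/source" push --quiet backup HEAD
git clone --quiet "$tmp/backup.git" "$tmp/recovery"

# Make a new change in the source and push it to the bare repository.
branch=$(git -C "$tmp/source" symbolic-ref --short HEAD)
echo "second line" >> "$tmp/source/README.md"
git -C "$tmp/source" commit --quiet -am "Add second line"
git -C "$tmp/source" push --quiet backup HEAD

# Fetching downloads the new commit but leaves the working tree untouched.
git -C "$tmp/recovery" fetch --quiet origin
after_fetch=$(grep -c "second line" "$tmp/recovery/README.md" || true)

# Merging brings the fetched commit into the checked-out files.
git -C "$tmp/recovery" merge --quiet "origin/$branch"
after_merge=$(grep -c "second line" "$tmp/recovery/README.md" || true)

echo "matches after fetch: $after_fetch"
echo "matches after merge: $after_merge"
```

The count of matching lines is 0 after the fetch and 1 after the merge, which corresponds to `README.md` only changing once the update is merged.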


**BONUS**: Change the directory to `recovery/`, make a change to `README.md`, save it and push it `--to origin`. Then, change the directory to `my-data` and update from the `backup` sibling. You should see the change made to `recovery/README.md` in `my-data/README.md`.

In [None]:
!echo "Hello to you, too!" >> README.md
!datalad save
!datalad push --to origin
%cd ../my-data
!datalad update -s backup --merge

add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

## Using Open Science Repositories

| Command | Description |
| --- | --- |
| `ssh-keygen` | Generate a public and private authentication key pair |
| `datalad siblings` | List all siblings of the current dataset |
| `datalad siblings add --name gin --url git@gin.g-node.org:/user/repo.git` | Add the GIN repository at `https://gin.g-node.org/user/repo` as a new sibling with the name `gin` |
| `datalad push --to gin` | Push the dataset content to the sibling named `gin` |

**Example**

Use `ssh-keygen` to generate a public and private key pair (a passphrase is optional).
Note the location where the public key is stored, e.g. `.ssh/id_ed25519.pub`.
Open the `.pub` file and copy its whole content. It should look something like this: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBOYcoRKZZLWA4FWECpW2K/fTOvuRYXBnBA6gcea2bFq <user>@<computer>`

In [None]:
!ssh-keygen

Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/olebi/.ssh/id_ed25519): 
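
The interactive prompt above waits for keyboard input, which can be awkward in a notebook cell. As a hedged alternative, the sketch below passes all answers on the command line. It writes the key pair to a throwaway directory so that an existing key in `~/.ssh` is never overwritten; the directory and the comment `datalad-lesson` are placeholders chosen for illustration.

```shell
# Throwaway directory for the example key pair (placeholder location).
keydir=$(mktemp -d)

# -t: key type, -f: output file, -N "": empty passphrase,
# -C: comment stored in the public key, -q: suppress progress output
ssh-keygen -q -t ed25519 -f "$keydir/id_ed25519" -N "" -C "datalad-lesson"

# The public key is a single line: <type> <base64 key> <comment>
cat "$keydir/id_ed25519.pub"
key_type=$(cut -d ' ' -f 1 "$keydir/id_ed25519.pub")
echo "key type: $key_type"
```

For real use, the `-f` path would point to a file in `~/.ssh`, and the content of the resulting `.pub` file is what gets copied.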

## Using GitHub for Visibility
- create an additional sibling on GitHub

In [7]:
!chmod -R +w learn-datalad/.git/annex/objects/
!rm -rf learn-datalad
!rm -rf my-dataset

chmod: cannot access 'learn-datalad/.git/annex/objects/': No such file or directory
