---
name: "Using and Modifying DataLad Datasets"
format: html
jupyter: bash
---

## Consuming Existing Datasets

| Linux/macOS | Windows | Description |
| --- | --- | --- |
| `ls -a` | `dir /a` | List the content of the current directory (including hidden files) |
| `ls -a data` | `dir /a data` | List the content of the `data` directory |
| `du -sh` | `dir /s` | Get the disk usage of the current directory |
| `du -sh data` | `dir /s data` | Get the disk usage of the `data` directory |
| `cd data` | `cd data` | Change the directory to `data`|
: Terminal commands

| Command | Description |
| --- | --- | 
| `datalad clone https://example.com` | Clone the dataset from `example.com` |
| `datalad get folder/` | Get the file content of everything in `folder/` |
| `datalad get folder/image.png` | Get the file content of the file `image.png` |
| `datalad drop folder/` | Drop the file content of everything in `folder/` |
: DataLad and git-annex commands

:::{#exr-}

Clone the dataset from https://gin.g-node.org/obi/penguins
:::
::: {.callout-tip}
## Solution

In [1]:
datalad clone https://gin.g-node.org/obi/penguins

:::

:::{#exr-}
Change the directory to `penguins` and list the directory's content
:::
::: {.callout-tip}
## Solution
On Linux/macOS:

In [2]:
cd penguins
ls -a

[0m[01;34m.[0m   [01;34m.datalad[0m  .gitattributes  README.md  [01;34mdata[0m


[01;34m..[0m  [01;34m.git[0m      LICENSE.txt     [01;34mcode[0m       [01;34mexamples[0m


On Windows:
```powershell
cd penguins
dir /a
```
:::

:::{#exr-}
Check the disk usage of the `penguins` directory
:::
::: {.callout-tip}
## Solution
On Linux/macOS:

In [3]:
du -sh

11M	.


On Windows:
```powershell
dir /s
```
:::

:::{#exr-}
Get the content of the `examples` subdirectory
:::
::: {.callout-tip}
## Solution

In [4]:
datalad get examples

:::

:::{#exr-}
Check the disk usage of the `penguins` directory again
:::
::: {.callout-tip}
## Solution
On Linux/macOS:

In [5]:
du -sh

11M	.


On Windows:
```powershell
dir /s
```
:::

:::{#exr-}
Drop the content of `examples/chinstrap.jpg` and check the disk usage again
:::
::: {.callout-tip}
## Solution

In [6]:
datalad drop examples/chinstrap.jpg

[1;1mdrop[0m([1;32mok[0m): examples/chinstrap.jpg ([1;35mfile[0m)


On Linux/macOS:

In [7]:
du -sh

6.7M	.


On Windows:
```powershell
dir /s
```
:::
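
The before/after `du` comparison from the exercises above can also be scripted. The following is a minimal sketch for Linux/macOS; `usage_kb` and `usage_delta_kb` are hypothetical helper names, not part of DataLad.

```bash
# Hypothetical helper: print the disk usage of a directory in kilobytes.
# Defaults to the current directory when no argument is given.
usage_kb() {
  du -sk "${1:-.}" | cut -f1
}

# Hypothetical helper: print the difference (second minus first) between
# two measurements taken with usage_kb.
usage_delta_kb() {
  echo $(( $2 - $1 ))
}

# Usage inside the penguins dataset:
#   before=$(usage_kb)
#   datalad get examples
#   after=$(usage_kb)
#   echo "change: $(usage_delta_kb "$before" "$after") KB"
```

Working in kilobytes rather than with `du -sh` keeps the numbers easy to subtract, since the human-readable output mixes units.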

## Checking File Identity and Location with git-annex

| Command | Description |
| --- | --- |
| `git annex info` | Show the git-annex information for the whole dataset |
| `git annex info folder/image.png` | Show the git-annex information for the file `image.png` |
| `git annex whereis folder/image.png` | List the repositories that have the file content for `image.png` |


:::{#exr-}
Display the `git annex info` for the file `examples/gentoo.jpg`. What is the *size* of that file? Is it *present* on your machine?
:::
::: {.callout-tip}
## Solution

In [8]:
git annex info examples/gentoo.jpg

file: examples/gentoo.jpg


size: 4.81 megabytes


key: MD5E-s4812332--3ee0c65f57a008ffa2a55b3d59d8c203.jpg


present: true


The file is 4.81 megabytes, and it is present because we previously retrieved the content of the `examples` folder.
:::
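
The `key` line in the output above also carries the file size. Judging from the key shown there, the layout is `BACKEND-sSIZE--HASH.ext`, so the size in bytes can be read off with plain shell parameter expansion. This is a sketch under that assumption (keys with additional fields are not handled); `key_backend` and `key_size_bytes` are hypothetical helper names.

```bash
# Hypothetical helpers for keys shaped like BACKEND-sSIZE--HASH.ext

key_backend() {
  # Everything before the first "-" is the backend name.
  echo "${1%%-*}"
}

key_size_bytes() {
  local rest="${1#*-s}"   # drop the leading "BACKEND-s"
  echo "${rest%%-*}"      # keep the digits before the next "-"
}

# Key copied from the `git annex info` output above
key="MD5E-s4812332--3ee0c65f57a008ffa2a55b3d59d8c203.jpg"
key_backend "$key"      # prints MD5E
key_size_bytes "$key"   # prints 4812332
```

The 4812332 bytes match the reported size of 4.81 megabytes.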

:::{#exr-}
Display the `git annex info` of the whole dataset. How many annexed files are there in the working tree?
:::
::: {.callout-tip}
## Solution

In [9]:
git annex info

trusted repositories: 0


semitrusted repositories: 5


	00000000-0000-0000-0000-000000000001 -- web


	00000000-0000-0000-0000-000000000002 -- bittorrent


	323c286f-ca09-419c-a5e8-529301abbde9 -- olebi@iBots-7:~/projects/new_penguins


	49708dd9-2023-4962-b38d-ed2ba412e760 -- olebi@iBots-7:~/projects/DataLad-EuroScipy25/notebooks/penguins [here]


	afa909d9-b37e-4dcd-a35f-b984fb3a7af5 -- git@f4e88adb09c0:/data/repos/obi/penguins.git [origin]


untrusted repositories: 0


transfers in progress: none


available local disk space: 880.59 gigabytes (+100 megabytes reserved)


local annex keys: 2


local annex size: 6.32 megabytes


annexed files in working tree: 6


size of annexed files in working tree: 10.84 megabytes


combined annex size of all repositories: 27.99 megabytes


annex sizes of repositories: 


	10.84 MB: 323c286f-ca09-419c-a5e8-529301abbde9 -- olebi@iBots-7:~/projects/new_penguins


	10.84 MB: afa909d9-b37e-4dcd-a35f-b984fb3a7af5 -- git@f4e88adb09c0:/data/repos/obi/penguins.git [origin]


	 6.32 MB: 49708dd9-2023-4962-b38d-ed2ba412e760 -- olebi@iBots-7:~/projects/DataLad-EuroScipy25/notebooks/penguins [here]


backend usage: 


	MD5E: 6


bloom filter size: 32 mebibytes (0% full)


The number of annexed files is displayed in this line:
`annexed files in working tree: 6`
:::
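
Picking a single value out of the long `git annex info` listing can be automated. The sketch below assumes the plain-text `field: value` layout shown above; `annex_field` is a hypothetical helper name, not a git-annex command.

```bash
# Hypothetical helper: read `git annex info` text output on stdin and
# print the value of the first line that starts with "<field>: ".
annex_field() {
  awk -v key="$1" '
    index($0, key ": ") == 1 { print substr($0, length(key) + 3); exit }
  '
}

# Usage inside a dataset:
#   git annex info | annex_field "annexed files in working tree"

# Demonstration on two lines copied from the output above:
printf '%s\n' \
  "local annex keys: 2" \
  "annexed files in working tree: 6" |
  annex_field "annexed files in working tree"   # prints 6
```

Matching on the start of the line keeps `annexed files in working tree` from being confused with `size of annexed files in working tree`.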

:::{#exr-}
Use `git annex whereis` to list the repositories that have the file content for the image `examples/gentoo.jpg`.
:::
::: {.callout-tip}
## Solution

In [10]:
git annex whereis examples/gentoo.jpg

whereis examples/gentoo.jpg (3 copies) 


  	323c286f-ca09-419c-a5e8-529301abbde9 -- olebi@iBots-7:~/projects/new_penguins


  	49708dd9-2023-4962-b38d-ed2ba412e760 -- olebi@iBots-7:~/projects/DataLad-EuroScipy25/notebooks/penguins [here]


  	afa909d9-b37e-4dcd-a35f-b984fb3a7af5 -- git@f4e88adb09c0:/data/repos/obi/penguins.git [origin]


ok


:::

:::{#exr-}
Use `git annex whereis` to list the repositories that have the file content for the table `data/table_220.csv`. How does this differ from the list of repositories that contain the content for `gentoo.jpg`?
:::
::: {.callout-tip}
## Solution

In [11]:
git annex whereis data/table_220.csv

whereis data/table_220.csv (2 copies) 


  	323c286f-ca09-419c-a5e8-529301abbde9 -- olebi@iBots-7:~/projects/new_penguins


  	afa909d9-b37e-4dcd-a35f-b984fb3a7af5 -- git@f4e88adb09c0:/data/repos/obi/penguins.git [origin]


ok


The content of the table is not stored in the local repository: unlike the list for `gentoo.jpg`, this list has no entry marked `[here]`.
:::
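
The check for a `[here]` entry can be scripted as well. This is a minimal sketch that relies only on the marker shown in the `git annex whereis` output above; `has_local_copy` is a hypothetical helper name.

```bash
# Hypothetical helper: read `git annex whereis` output on stdin and succeed
# if one of the listed repositories is marked [here].
has_local_copy() {
  grep -q -F '[here]'
}

# Usage inside a dataset:
#   if git annex whereis data/table_220.csv | has_local_copy; then
#     echo "content is present locally"
#   else
#     echo "content is not present locally; retrieve it with datalad get"
#   fi
```

The `-F` option makes `grep` treat `[here]` as a literal string rather than a bracket expression.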
