In [1]:
from dsc.notebook import embed_website

# What is DVC?
<br>
<div align="left">
<img src="./figures/3_what_is_dvc.png" alt="What is DVC" width=800/>
</div>

- DVC extends Git so that it can be used for data science.
- A [popular and established](https://hn.algolia.com/?q=dvc) tool for data versioning, although there are alternatives.
- Its commits are pointers to data sets.
    - The actual data is added to .gitignore and thus not committed to Git but to a storage location.
    - For each file that is put into a storage location, DVC creates a .dvc file, which is a pointer to the data and is tracked by Git.
- Storage is not GitHub or GitLab but can be local or remote storage such as S3 or Google Drive.


<br>
<div align="left">
<figcaption>DVC Overview (Source: DVC) </figcaption>
<img src="./figures/4_dvc_overview.png" alt="DVC Overview" width=800/>
</div>

**Differences to Git**
- DVC is technically **not a version control system** 
    - DVC creates .dvc files
    - .dvc file contents determine data versions
    - .dvc files are versioned by Git
    - DVC checks out working copies of data
- **Merging or comparing changes of data is not possible**, but merging/comparing changes of .dvc files is possible
- The command **dvc add also commits** (in the sense of DVC), so dvc commit is not required
- The command **dvc pull does a fetch and a checkout** but not a merge

## Installation & resources
[Installation](https://dvc.org/doc/install) -> **Please install DVC now!**

**Resources**
- [Official docs](https://dvc.org/doc/start) (pretty amazing!)
- [Short video that motivates DVC](https://www.youtube.com/watch?v=UbL7VUpv1Bs)
- [Official video introduction to DVC for data versioning](https://www.youtube.com/watch?v=kLKBcPonMYw) 
- [Plugin for VSCode](https://marketplace.visualstudio.com/items?itemName=Iterative.dvc)
- Also interesting...
    - https://dvcfan.com/2021/04/14/some-hard-truths-about-dvc/
    - https://www.youtube.com/watch?v=VttqJE-Vcjg

## Important commands
- [init](https://dvc.org/doc/command-reference/init): initialize DVC; should be run in the Git repository root
- [add](https://dvc.org/doc/command-reference/add): start tracking a file; in contrast to git add, this also commits (in the sense of DVC)
- [status](https://dvc.org/doc/command-reference/status): show file differences either between the cache and workspace, or between the cache and remote storage.
- [move](https://dvc.org/doc/command-reference/move): files that are tracked by DVC should be renamed using this command
- [checkout](https://dvc.org/doc/command-reference/checkout): update DVC-tracked files in the workspace based on current .dvc files.
- [fetch](https://dvc.org/doc/command-reference/fetch): download files from remote storage to the cache based on .dvc files
- [pull](https://dvc.org/doc/command-reference/pull): fetches files and makes them visible in the workspace
- [push](https://dvc.org/doc/command-reference/push): upload tracked files to remote storage based on .dvc files.
- [destroy](https://dvc.org/doc/command-reference/destroy): remove all DVC files and internals from a Git repo.
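
To make the data flow behind these commands concrete, here is a minimal toy model in Python. It is not DVC's implementation: the class `ToyDvc` and its dictionaries (`workspace`, `pointers`, `cache`, `remote`) are hypothetical names introduced only for illustration, and the checkout step is deliberately simplified. The sketch only mimics how content moves between workspace, cache, and remote storage.

```python
import hashlib


class ToyDvc:
    """Toy model of DVC's data flow (hypothetical, not DVC's real code)."""

    def __init__(self):
        self.workspace = {}  # path -> file content (bytes)
        self.pointers = {}   # path -> md5, plays the role of the .dvc files
        self.cache = {}      # md5 -> content, plays the role of .dvc/cache
        self.remote = {}     # md5 -> content, plays the role of remote storage

    def add(self, path):
        # Hash the content, store it in the cache, and record the pointer.
        content = self.workspace[path]
        md5 = hashlib.md5(content).hexdigest()
        self.cache[md5] = content
        self.pointers[path] = md5

    def push(self):
        # Upload the cached content that the pointers refer to.
        for md5 in self.pointers.values():
            self.remote[md5] = self.cache[md5]

    def fetch(self):
        # Download referenced content from the remote into the cache.
        for md5 in self.pointers.values():
            if md5 not in self.cache:
                self.cache[md5] = self.remote[md5]

    def checkout(self):
        # Simplification: make the workspace match the pointers, using only the cache.
        self.workspace = {p: self.cache[m] for p, m in self.pointers.items()}

    def pull(self):
        # pull = fetch + checkout (there is no merge step).
        self.fetch()
        self.checkout()


repo = ToyDvc()
repo.workspace["big_data.txt"] = b"This is a really big data set\n"
repo.add("big_data.txt")
repo.push()

# Simulate a fresh clone: the pointers survive (via Git), data and cache do not.
repo.workspace.clear()
repo.cache.clear()
repo.pull()
print(repo.workspace["big_data.txt"])
```

The design point the model tries to capture is that only the small pointers need to travel through Git, while the content itself is addressed by its hash in the cache and the remote.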

# Preparation in order to run this notebook: TODO

TODO
- Since the data that is tracked by DVC is added to .gitignore, we do not check out a new branch for the illustration. Instead, it is best to illustrate DVC in an isolated directory so that we can easily remove the created data.
- Let's move to the level of the dvc directory, create a new directory named dvc_lecture_tmp, cd into it, and initialize Git.

In [3]:
# dvc_dir = "../../.."

In [4]:
if False:
    import os
    os.chdir(dvc_dir)
    os.getcwd()

In [5]:
# !mkdir dvc_lecture_tmp

In [6]:
if False:
    os.chdir("./dvc_lecture_tmp")
    os.getcwd()

In [7]:
!git checkout -b _dvc_illustration

M	lecture_notes/1_version_control/2_dvc.ipynb
M	lecture_notes/1_version_control/notebook_as_py/2_dvc.py
Switched to a new branch '_dvc_illustration'


Let's also create some dummy data that should be tracked by DVC later on.

In [8]:
!echo "This is a really big data set" > big_data.txt

# Initialization: dvc init

Let's initialize the DVC repository from the Git repository root.

In [11]:
import os
os.getcwd()
os.chdir("../..")

In [12]:
!dvc init

ERROR: failed to initiate DVC - '.dvc' exists. Use `-f` to force.

The previous command creates the file .dvcignore and the directory .dvc, which are also put in the staging area of Git. (In the run shown above, .dvc already existed, hence the error message.)

You have to commit them to Git so that DVC is ready to use.

In [27]:
!git status

On branch dvc
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	new file:   .dvc/.gitignore
	new file:   .dvc/config
	new file:   .dvcignore

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	big_data.txt



In [28]:
!git commit -m "Init dvc"

[dvc a4809f8] Init dvc
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore


# Local usage

## Start tracking: dvc add
- Files that are already tracked by Git cannot be tracked by DVC (trying to do so raises an error).
- To track files with DVC, use dvc add.

In [29]:
!dvc add big_data.txt

100% Adding...|████████████████████████████████████████|1/1 [00:00,  8.90file/s]

To track the changes with git, run:

	git add big_data.txt.dvc .gitignore

To enable auto staging, run:

	dvc config core.autostage true

- DVC stores information about the added file in a .dvc file named big_data.txt.dvc.
- This metadata file is human-readable and a placeholder for the original data.
- Like source code, it can easily be versioned with Git.
- **Do not modify .dvc files by hand**, otherwise DVC can get confused or break!
- **If you want to move data** that is tracked by DVC, use dvc move so that the corresponding .dvc file is also updated!

In [30]:
!cat big_data.txt.dvc

outs:
- md5: b404fddcd2be2fb72fa1900327927423
  size: 30
  path: big_data.txt
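
Because the .dvc file is plain, human-readable text, its fields can also be read programmatically. The following is a minimal sketch that handles only the flat layout shown above with plain string operations; the helper name `read_dvc_pointer` is hypothetical, and a real project would more likely use a YAML library.

```python
def read_dvc_pointer(text):
    """Parse the single 'outs' entry of a .dvc file into a dict.

    Assumption: exactly one output with flat 'key: value' lines,
    as in the example above. This is not a general YAML parser.
    """
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            line = line[2:]  # drop the list marker
        if ":" not in line or line == "outs:":
            continue
        key, value = line.split(":", 1)
        entry[key.strip()] = value.strip()
    return entry


dvc_text = """outs:
- md5: b404fddcd2be2fb72fa1900327927423
  size: 30
  path: big_data.txt
"""
pointer = read_dvc_pointer(dvc_text)
print(pointer["md5"], pointer["size"], pointer["path"])
```

Note that all values come back as strings in this sketch, so `size` would need an explicit `int(...)` conversion before doing arithmetic with it.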


The data, meanwhile, is listed in .gitignore.

In [31]:
!cat .gitignore

/big_data.txt


### The cache
- Moreover, the data is committed (!) in the sense that the current state of the files and directories tracked by DVC is stored in the cache.

- Use the --no-commit option to avoid this, and dvc commit to store the data in the cache later.

- The cache consists of directories
    - The first two characters of the md5 hash stored in the corresponding .dvc file form the name of a directory
    - The remaining characters of the md5 hash form the name of the file in that directory
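
The naming scheme of the cache can be sketched in a few lines of Python. The function name `cache_path` and the default `cache_dir` are assumptions made for this illustration; the layout follows what this notebook shows, and other DVC versions may organize the cache differently.

```python
import hashlib
from pathlib import PurePosixPath


def cache_path(md5, cache_dir=".dvc/cache"):
    """Return the cache location implied by an md5 hash (layout as described above)."""
    return PurePosixPath(cache_dir) / md5[:2] / md5[2:]


# Hash taken from big_data.txt.dvc in this notebook.
print(cache_path("b404fddcd2be2fb72fa1900327927423"))
# → .dvc/cache/b4/04fddcd2be2fb72fa1900327927423

# Hashing arbitrary content works the same way.
digest = hashlib.md5(b"example content").hexdigest()
print(cache_path(digest))
```

Splitting on the first two characters keeps any single cache directory from accumulating too many files.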

In [34]:
!tree ./.dvc/cache/

./.dvc/cache/
└── b4
    └── 04fddcd2be2fb72fa1900327927423

1 directory, 1 file


- Each file in the cache is a copy of the original data and DVC tries to replace the original data with a link to this copy.

In [35]:
!cat ./.dvc/cache/*/*

this is a really big data set


To version big_data.txt with Git, we add and commit the corresponding .dvc file together with .gitignore, which keeps the underlying data out of Git.

In [36]:
!git status

On branch dvc
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	.gitignore
	big_data.txt.dvc

nothing added to commit but untracked files present (use "git add" to track)


In [37]:
!git add big_data.txt.dvc .gitignore && git commit -m "Add big_data.txt.dvc"  # I prefer to add big_data.txt.dvc and not big_data.txt so that it is clear that DVC is involved

[dvc e1d90a8] Add big_data.txt.dvc
 2 files changed, 5 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 big_data.txt.dvc


## View status: dvc (data) status

We will add a new text file named git_change.txt and add data to big_data.txt.

In [38]:
!echo "This file should be versioned by Git" >> git_change.txt

In [39]:
!echo "It gets even bigger!" >> big_data.txt

Using git status, we see that the change in big_data.txt is not visible because DVC added the file to .gitignore when we ran dvc add big_data.txt.

Only the creation of git_change.txt is visible.

In [40]:
!git status

On branch dvc
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	git_change.txt

nothing added to commit but untracked files present (use "git add" to track)


- There is no direct equivalent of git status in DVC.
- But we can use **dvc data status** to show changes in the data tracked by DVC in the workspace.
- Moreover, **dvc status** shows file mismatches either between the cache and workspace, or between the cache and remote storage.
- As with Git, it is good practice to check the state of your DVC repository before doing something like dvc commit.
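
Conceptually, detecting a modified file boils down to comparing the hash of the workspace file with the hash recorded earlier. The sketch below illustrates only this idea; the helpers `file_md5` and `is_modified` are hypothetical names, and DVC's actual status logic is more involved.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the md5 hex digest of a file's content."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def is_modified(path, recorded_md5):
    """True if the workspace file no longer matches the recorded hash."""
    return file_md5(path) != recorded_md5


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "big_data.txt"
    data.write_text("This is a really big data set\n")
    recorded = file_md5(data)  # what a .dvc file would store

    unchanged = is_modified(data, recorded)
    with data.open("a") as f:  # append, like `echo ... >> big_data.txt`
        f.write("It gets even bigger!\n")
    changed = is_modified(data, recorded)

print(unchanged, changed)
```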


In [41]:
!dvc data status  # files untracked by both DVC and Git can be shown with the option --untracked-files

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        modified: big_data.txt

In [42]:
!dvc status  # compares files between the cache (local copy of the remote, files that are tracked) and workspace

big_data.txt.dvc:
	changed outs:
		modified:           big_data.txt

You can also use dvc diff.

In [43]:
!dvc diff

Modified:
    big_data.txt

files summary: 1 modified

## Committing: dvc add

- We can now commit the changes to DVC using dvc add (or dvc commit).
- Note that dvc add also commits the data.

In [44]:
!dvc add big_data.txt

100% Adding...|████████████████████████████████████████|1/1 [00:00, 10.69file/s]

To track the changes with git, run:

	git add big_data.txt.dvc

To enable auto staging, run:

	dvc config core.autostage true

In [45]:
!dvc data status

DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        modified: big_data.txt
(there are other changes not tracked by dvc, use "git status" to see)

Executing dvc add also changes the corresponding .dvc file, which should be committed to Git.

In [46]:
!git status

On branch dvc
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   big_data.txt.dvc

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	git_change.txt

no changes added to commit (use "git add" and/or "git commit -a")


In [47]:
!git add "big_data.txt.dvc" && git commit -m "Update big_data.txt.dvc"

[dvc 7ddc191] Update big_data.txt.dvc
 1 file changed, 2 insertions(+), 2 deletions(-)


## Renaming files: dvc move

If you want to rename data use dvc move.

This deletes big_data.txt and the corresponding .dvc file and replaces them by massive_data.txt and a corresponding .dvc file.

In [48]:
!dvc move big_data.txt massive_data.txt

To track the changes with git, run:

	git add massive_data.txt.dvc .gitignore

To enable auto staging, run:

	dvc config core.autostage true

In [49]:
!cat massive_data.txt.dvc

outs:
- md5: e737ff23a79b8ba94a069445cd1de711
  size: 51
  path: massive_data.txt


Having a look at the dvc data status shows that dvc move has also added and committed the changes to DVC (but not to Git).

In [50]:
!dvc data status

DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        added: massive_data.txt
        deleted: big_data.txt
(there are other changes not tracked by dvc, use "git status" to see)

In [51]:
!dvc status

Data and pipelines are up to date.

Committing to Git.

In [52]:
!git add massive_data.txt.dvc big_data.txt.dvc .gitignore && git commit -m "Renamed big_data.txt.dvc into massive_data.txt.dvc"

[dvc 92a5899] Renamed big_data.txt.dvc into massive_data.txt.dvc
 2 files changed, 4 insertions(+), 2 deletions(-)
 rename big_data.txt.dvc => massive_data.txt.dvc (69%)


Calling dvc data status does not show DVC committed changes anymore if the corresponding .dvc files have been committed to Git.

In [53]:
!dvc data status

No changes.

## Returning to previous data: dvc checkout
- Versions of files are determined by the appropriate .dvc files that store their md5 checksums. 
- Thus, data files are fully determined by the version of the corresponding .dvc files which are tracked by Git.

- ```dvc checkout``` is often needed after ```git checkout```, ```git clone``` or other operations that change the current state of the .dvc files.
- It restores the corresponding versions of all DVC-tracked data files and directories from the cache to the workspace.

- Let's checkout the previous commit.

In [54]:
!git log

commit 92a58994b3ca07406cfb3632be451533ae62fdc4 (HEAD -> dvc)
Author: spa0001f <fabian.spanhel@seven.one>
Date:   Tue Oct 18 20:20:11 2022 +0200

    Renamed big_data.txt.dvc into massive_data.txt.dvc

commit 7ddc19188d36aa3d1318de20cca1d0cdccd1b291
Author: spa0001f <fabian.spanhel@seven.one>
Date:   Tue Oct 18 20:19:25 2022 +0200

    Update big_data.txt.dvc

commit e1d90a8418b6e74eb9d0d39cb2032975e7f25913
Author: spa0001f <fabian.spanhel@seven.one>
Date:   Tue Oct 18 20:13:50 2022 +0200

    Add big_data.txt.dvc

commit a4809f84c5a5630efba3eb24992b435053762ad0
Author: spa0001f <fabian.spanhel@seven.one>
Date:   Tue Oct 18 20:12:54 2022 +0200

    Init dvc

commit 13c99bb7e40f824680c7e8510e36392c1c7af2ef (main)
Author: spa0001f <fabian.spanhel@seven.one>
Date:   Tue Oct 18 20:09:31 2022 +0200

    Update

commit 309044cb0baba1f40733977dfd821d16af5dbef7
Author: spa0001f <fabian

In [65]:
!git checkout HEAD~1

Note: checking out 'HEAD~1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 7ddc191 Update big_data.txt.dvc


Let's investigate the status of Git.

In [66]:
!git status

HEAD detached at 7ddc191
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	git_change.txt
	massive_data.txt

nothing added to commit but untracked files present (use "git add" to track)


- Note that massive_data.txt was never added or committed to Git; DVC had put it on .gitignore.
- By checking out the previous commit, we have also reverted the .gitignore entry for massive_data.txt.
- Thus, massive_data.txt is now untracked by Git.

Let's have a look at the data status of DVC.

In [67]:
!dvc data status

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        deleted: big_data.txt

- ```dvc data status``` now shows that big_data.txt is deleted (which was the effect of dvc move).
- This makes sense because in the next Git commit DVC replaces big_data.txt by massive_data.txt, but Git isn't aware of this.
- Thus, the current HEAD in Git just recognizes that massive_data.txt is untracked.

The file big_data.txt.dvc (which has been modified by git checkout) contains the information to restore the data.

In [68]:
!cat big_data.txt.dvc

outs:
- md5: e737ff23a79b8ba94a069445cd1de711
  size: 51
  path: big_data.txt


- To restore big_data.txt we have to use dvc checkout.
- This also deletes massive_data.txt (although massive_data.txt.dvc does not exist).

In [69]:
!dvc checkout

A       big_data.txt
D       massive_data.txt

# Working with remotes
- Working with remotes is similar to Git.
- You first have to specify a remote storage location for the data.
- The remote is typically not a Git-based service like GitHub or GitLab.
- The following storage types are supported as remote storage locations.

In [73]:
embed_website("https://dvc.org/doc/command-reference/remote/add#supported-storage-types")

## Setting up a remote storage location: dvc remote add

- You can add a local remote store using ```dvc remote add -d myremote path2store```, where myremote is the name of the remote and path2store its location.
- In this course, we will be using Google Drive as the DVC remote because it is the simplest way to do it.
- To do so, we have to extract the **folder id** of the Google Drive folder, which is given by the characters after folders/.
- For instance, the folder id of https://drive.google.com/drive/u/2/folders/1YIKU5fNFeBkDOUo4OOlhOIhftd-sj24k is 1YIKU5fNFeBkDOUo4OOlhOIhftd-sj24k
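
Extracting the folder id by hand is error-prone, so a small helper can do it. This is a minimal sketch under the assumption that the link has the `.../folders/<id>` shape shown above; the function name `gdrive_remote_url` is hypothetical and not part of DVC.

```python
from urllib.parse import urlparse


def gdrive_remote_url(folder_url):
    """Turn a Google Drive folder link into a gdrive:// URL for `dvc remote add`."""
    parts = urlparse(folder_url).path.strip("/").split("/")
    if "folders" not in parts or parts[-1] == "folders":
        raise ValueError("not a Google Drive folder link: " + folder_url)
    folder_id = parts[parts.index("folders") + 1]
    return "gdrive://" + folder_id


url = "https://drive.google.com/drive/u/2/folders/1YIKU5fNFeBkDOUo4OOlhOIhftd-sj24k"
print(gdrive_remote_url(url))
# → gdrive://1YIKU5fNFeBkDOUo4OOlhOIhftd-sj24k
```

The resulting string is the value that is passed as the last argument to dvc remote add.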

Note the Google [drive limits on storage and uploads](https://support.google.com/a/users/answer/7338880?visit_id=637995289613302718-2725308169&rd=1)

To add the Google Drive remote, we use the following command.

- With the flag -d, this becomes the default remote.
- With the flag -f, the remote is added even if it has already been added.

In [79]:
!dvc remote add -d -f dvc_gdrive gdrive://1YIKU5fNFeBkDOUo4OOlhOIhftd-sj24k

Setting 'dvc_gdrive' as a default remote.

Before we push, we **need to share the corresponding Google Drive folder to at least one other person or group** (!)

We can also investigate file differences between the cache and the remote storage.

In [80]:
!dvc status -c

	new:                big_data.txt

## Push data to remotes: dvc push

In [81]:
!dvc push

1 file pushed

Typically, we should now push the .dvc files committed to Git to a Git remote, but we don't do it in this illustration.

```bash
# register the Git remote under the name "origin" (git_remote_url is a placeholder)
git remote add origin git_remote_url
git push -u origin dvc_lecture_temp
```

## Retrieve data from remote: dvc pull

Let's delete big_data.txt and pull it from the remote storage.

In [84]:
!rm big_data.txt && dvc data status

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        deleted: big_data.txt
(there are other changes not tracked by dvc, use "git status" to see)

Comparing files between the cache and the remote storage shows no difference.
This is because the same version of big_data.txt exists both in the cache and in the remote storage.

In [85]:
!dvc status -c

Cache and remote 'dvc_gdrive' are in sync.

By pulling the data, we fetch the data that corresponds to the current commit from the remote storage
and check out its version.

In [87]:
!dvc pull

A       big_data.txt
1 file added

# Remove dvc from git repo: dvc destroy
- Let us remove all DVC files and internals.
- This does not remove the actual data.
- To recover DVC, you can check out the corresponding Git commit, pull from the remote, and check out the data.

In [88]:
!yes | dvc destroy

This will destroy all information about your pipelines, all data files, as well 
as cache in .dvc/cache.
yes: standard output: Broken pipe


- Note that big_data.txt will remain.

In [89]:
!ls

big_data.txt  git_change.txt


# Clean up
Let's delete the directory dvc_lecture_tmp

In [59]:
if False:
    os.getcwd()
    os.chdir("..")

In [61]:
!rm big_data.txt && git checkout main

rm: remove write-protected regular file 'dvc_lecture_tmp/.dvc/cache/5e/95c4f3e26fc2e940b2478c840e9bb4'? ^C
