# Tutorial: Data Version Control with DVC

- [DVC - Installation](https://dvc.org/)
- [DVC - Tutorial: Data and Model Versioning](https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial)
***

### The 3 axes of change in a Machine Learning application
<center><img src="images/ml_modeling.png"/></center>

#### <font color='green'>Version control system vs data version control system</font>
Version control systems help developers manage changes to source code. But data version control, managing changes to **models** and **datasets**, isn't so well established. In a version control system, there's a central repository of code that represents the current, official state of the project. Developers can make a copy of a project, make some changes, and request that their new version become the official one. Their code is then reviewed and tested before it's deployed to production.

#### <font color='green'>dvc & git</font>
Data version control is a set of tools and processes that tries to adapt the version control process to the data world. **DVC is a command-line tool written in Python. It mimics Git commands and workflows to ensure that users can quickly incorporate it into their regular Git practice. DVC is meant to be run alongside Git. In fact, the ```git``` and ```dvc``` commands will often be used in tandem, one after the other. While Git is used to store and version code, DVC does the same for data and model files.**

Git can store code locally and also on a hosting service like GitHub, Bitbucket, or GitLab. Likewise, DVC uses a remote repository to store all your data and models. This is the single source of truth, and it can be shared amongst the whole team. You can get a local copy of the remote repository, modify the files, then upload your changes to share with team members. The remote repository can be on the same computer you're working on, or it can be in the cloud. DVC supports most major cloud providers, including **AWS**, **GCP**, and **Azure**. But you can set up a DVC remote repository on any server and connect it to your laptop.

#### <font color='green'>.dvc & .git</font>
Running `dvc init` (similar to `git init`) will create a `.dvc` folder that holds configuration information, just like the `.git` folder for Git. In principle, you don't ever need to open that folder, but you'll take a peek in this tutorial so you can understand what's happening under the hood.
> Executing `git init` creates a ```.git``` subdirectory in the current working directory, which contains all of the necessary Git metadata for the new repository. This metadata includes subdirectories for objects, refs, and template files. A HEAD file is also created which points to the currently checked out commit.

When you start tracking a data or model file with DVC, a ```<filename>.dvc``` file is created. A ```<filename>.dvc``` file is a small text file that points to your actual data files in remote storage. It is lightweight and meant to be stored with your code in GitHub. When you download a Git repository, you also get the ```<filename>.dvc``` files. You can then use those files to get the data associated with that repository.
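
To make the idea of a small pointer file concrete, here is a minimal sketch in plain Python. It assumes that a pointer only needs to record a content hash, a size, and a path; the helper names `file_md5` and `make_pointer` are hypothetical, and the real `.dvc` file format may contain other fields.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(path: Path) -> dict:
    """Build a small, hypothetical pointer record for a data file."""
    return {"path": path.name, "md5": file_md5(path), "size": path.stat().st_size}


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a large model file; the content is made up.
    data_file = Path(tmp) / "model.joblib"
    data_file.write_bytes(b"pretend these are model weights")
    pointer = make_pointer(data_file)
    print(pointer)
```

The pointer stays a few dozen bytes long no matter how large the data file is, which is why it can live in Git next to the code.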

## Let's get started!

#### Installing the requirements.

In [None]:
# !pip install --upgrade pip > /dev/null
# !pip install dvc scikit-learn scikit-image pandas numpy > /dev/null

#### Code structure.

Fork [this repository](https://github.com/realpython/data-version-control) and clone it for experiments.

<div class="alert alert-block alert-danger">
<b>Action</b>:
    Change <b>YOUR-GITHUB-USERNAME</b> in the below cell to your username.
</div> 

In [None]:
!git clone https://github.com/YOUR-GITHUB-USERNAME/data-version-control.git

!cd data-version-control && git remote rm origin
!cd data-version-control && git remote add origin git@github.com:YOUR-GITHUB-USERNAME/data-version-control.git

In [None]:
!ls data-version-control

```
├── data/
│   ├── prepared/
│   └── raw/
│
├── metrics/
├── model/
└── src/
    ├── evaluate.py
    ├── prepare.py
    └── train.py
```

**src/** is for source code.
> **prepare.py** contains code for preparing data for training.  
> **train.py** contains code for training a machine learning model.  
> **evaluate.py** contains code for evaluating the results of a machine learning model.  

**data/** is for all versions of the dataset.
> **data/raw/** is for data obtained from an external source.  
> **data/prepared/** is for data modified internally.  

**model/** is for machine learning models.  
**metrics/** is for tracking the performance metrics of your models.  

#### Downloading [Imagenette](https://github.com/fastai/imagenette) data.

In [None]:
!cd data-version-control && wget -P ./data/raw/ -nc https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz

In [None]:
!cd data-version-control && ls ./data/raw/

In [None]:
!cd data-version-control && tar xf ./data/raw/imagenette2-160.tgz -C ./data/raw/ 
!cd data-version-control && ls ./data/raw/ 

In [None]:
!cd data-version-control && ls ./data/raw/imagenette2-160  

In [None]:
!cd data-version-control && mv ./data/raw/imagenette2-160/train/ ./data/raw
!cd data-version-control && mv ./data/raw/imagenette2-160/val/ ./data/raw
!cd data-version-control && ls ./data/raw/

In [None]:
!cd data-version-control && rm -rf ./data/raw/imagenette2-160.tgz ./data/raw/imagenette2-160
!cd data-version-control && ls ./data/raw/

In [None]:
!cd data-version-control && ls ./data/raw/train/

In [None]:
!cd data-version-control && ls ./data/raw/val/

Each of the train and val subsets of the Imagenette dataset contains 10 classes, stored in 10 separate folders, one per class.

#### Start experimenting.

In [None]:
!cd data-version-control && git status

In [None]:
# Create a new branch.
!cd data-version-control && git checkout -b "first_experiment"

In [None]:
# Initialize DVC.
!cd data-version-control && dvc init

This will create a ```.dvc``` folder that holds configuration information, just like the ```.git``` folder for Git.

In [None]:
!ls -la data-version-control

<div class="alert alert-block alert-success">
Git gives you the ability to push your local code to a remote repository so that you have a single source of truth shared with other developers. Other people can check out your code and work on it locally without fear of corrupting the code for everyone else. The same is true for DVC. You need some kind of remote storage for the data and model files controlled by DVC. This can be as simple as another folder on your system.
</div>

Create a folder somewhere on your system outside the `data-version-control/` repository and call it `dvc_remote`. Now come back to your `data-version-control/` repository and tell DVC where the remote storage is on your system:

In [None]:
!mkdir dvc_remote

In [None]:
!cd data-version-control && dvc remote add -d remote_storage ../dvc_remote/

DVC now knows where to back up your data and models. `dvc remote add` stores the location to your remote storage and names it `remote_storage`. You can choose another name if you want. The `-d` switch tells DVC that this is your default remote storage. You can add more than one storage location and switch between them.

In [None]:
!cd data-version-control && cat .dvc/config

DVC supports many cloud-based storage systems, such as **AWS S3 buckets**, **Google Cloud Storage**, and **Microsoft Azure Blob Storage**. You can find out more in the official DVC documentation for the [dvc remote add](https://dvc.org/doc/command-reference/remote/add) command.

## DVC in action.

<div class="alert alert-block alert-success">
Rule of thumb: <b>small files go to GitHub, and large files go to DVC remote storage</b>.
</div>

Like Git, DVC uses the `add` command to start tracking files. `git add` puts files under Git control, and `dvc add` puts them under DVC control.

In [None]:
!cd data-version-control && dvc add -q ./data/raw/train
!cd data-version-control && dvc add -q ./data/raw/val 

Images are considered large files, especially if they're collected into datasets with hundreds or thousands of files. **The `add` command adds these two folders under DVC control**. Here's what DVC does under the hood:

- 1) Adds your `train/` and `val/` folders to `.gitignore`
- 2) Creates two files with the `.dvc` extension, `train.dvc` and `val.dvc`
- 3) Copies the `train/` and `val/` folders to a staging area

> `.gitignore` is a text file that has a list of files that Git should ignore, or not track. When a file is listed in `.gitignore`, it's invisible to Git commands. By adding the `train/` and `val/` folders to `.gitignore`, DVC makes sure you won't accidentally upload large data files to GitHub.

> `.dvc` files are small text files that point DVC to your data in remote storage. Remember the rule of thumb: large data files and folders go into DVC remote storage, but the small `.dvc` files go into GitHub. When you come back to your work and check out all the code from GitHub, you'll also get the `.dvc` files, which you can use to get your large data files.

> Finally, DVC copies the data files to a staging area. The staging area is called a **cache**. When you initialized DVC with `dvc init`, it created a `.dvc` folder in your repository. In that folder, it created the cache folder, `.dvc/cache`. When you run `dvc add`, all the files are copied to `.dvc/cache`.
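
The copy-to-cache step can be pictured as a content-addressed store: each file is saved under a name derived from a hash of its content. The sketch below is a simplified illustration, not DVC's implementation. The function `add_to_cache` and the two-character subfolder layout are assumptions, and the actual layout of `.dvc/cache` may differ between DVC versions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into cache_dir under a name derived from its content hash."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    # Assumed layout: the first two hex characters form a subfolder.
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copy2(data_file, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    first = root / "a.jpg"
    second = root / "b.jpg"
    # Two different file names with identical (made-up) content.
    first.write_bytes(b"same pixels")
    second.write_bytes(b"same pixels")
    cached_first = add_to_cache(first, cache)
    cached_second = add_to_cache(second, cache)
    same_entry = cached_first == cached_second
    cached_count = sum(1 for p in cache.rglob("*") if p.is_file())
    print(same_entry, cached_count)
```

Because the name depends only on the content, two identical files share one cache entry, and a changed file gets a new entry without overwriting the old one.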

In [None]:
!cd data-version-control && ls ./data/raw

In [None]:
!cd data-version-control && cat ./data/raw/train.dvc

In [None]:
!cd data-version-control && ls .dvc/cache

In [None]:
# Once the large image files have been put under DVC control, we can add all 
# the code and small files to Git control with git add:
!cd data-version-control && git add --all

<center><img src="images/dvc_state_1.png" width="400" height="20"/></center>

If someone wants to work on your project and use the `train/` and `val/` data, then they would first need to download your Git repository. They could then use the `.dvc` files to get the data.

**But before people can get your repository and data, you need to upload your files to remote storage.**

## Uploading files to remote repository

To upload files to GitHub, you first need to create a snapshot of the current state of your repository. Once you have added all the modified files to the staging area with `git add`, create a snapshot with the `commit` command:

In [None]:
!cd data-version-control && git commit -m "first commit"

<div class="alert alert-block alert-success">
DVC also has a <b>commit</b> command, but it doesn't do the same thing as <b>git commit</b>. DVC doesn't need a snapshot of the whole repository. It can just upload individual files as soon as they're tracked with <b>dvc add</b>.

You use <b>dvc commit</b> when an already tracked file changes. If you make a local change to the data, then you would commit the change to the cache before uploading it to remote. You haven't changed your data since it was added, so you can skip the commit step.
</div>

To upload files from **cache** to **remote** use the `push` command:

In [None]:
!cd data-version-control && dvc push

Your data is now safely stored in a location away from your repository. Finally, push the files under Git control to GitHub:

In [None]:
!cd data-version-control && git push --set-upstream origin first_experiment

<center><img src="images/dvc_state_2.png" width="400" height="20"/></center>

## Downloading files

In [None]:
# Remove a data file.
!cd data-version-control && rm -rf data/raw/val

In [None]:
!cd data-version-control && ls data/raw/

The above command deletes the `data/raw/val/` folder from your repository, but the folder is still safely stored in your cache and the remote storage. You can get it back at any time.

**To get your data back from the cache, use the `dvc checkout` command:**

In [None]:
!cd data-version-control && dvc checkout data/raw/val.dvc

In [None]:
!cd data-version-control && ls data/raw/

> If you want DVC to search through your whole repository and check out everything that's missing, then use `dvc checkout` with no additional arguments.

> When you clone your GitHub repository on a new machine, the cache will be empty. The `fetch` command copies the contents of the remote storage into the cache, for example `dvc fetch data/raw/val.dvc`.

> Or you can use just `dvc fetch` to get the data for all DVC files in the repository. Once the data is in your cache, check it out to the repository with `dvc checkout`. You can perform both fetch and checkout with a single command, `dvc pull`. It executes `dvc fetch` followed by `dvc checkout`. It copies your data from the remote to the cache and into your repository in a single sweep. These commands roughly mimic what Git does, since Git also has `fetch`, `checkout`, and `pull` commands.
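
The relationship between the three commands can be sketched with a toy model in which the remote, the cache, and the workspace are plain dictionaries. This is only an illustration of the data flow described above; the function names mirror the DVC commands, but the hashes and file contents are made up.

```python
def fetch(remote: dict, cache: dict, wanted: set) -> None:
    """Copy the wanted objects from remote storage into the local cache."""
    for digest in wanted:
        cache.setdefault(digest, remote[digest])


def checkout(cache: dict, workspace: dict, pointers: dict) -> None:
    """Materialize every pointed-to file from the cache into the workspace."""
    for path, digest in pointers.items():
        workspace[path] = cache[digest]


def pull(remote: dict, cache: dict, workspace: dict, pointers: dict) -> None:
    """Run fetch followed by checkout."""
    fetch(remote, cache, set(pointers.values()))
    checkout(cache, workspace, pointers)


# Hypothetical state right after cloning the repository on a new machine:
# the pointers come from Git, while the cache and workspace are still empty.
remote = {"abc123": b"train images", "def456": b"val images"}
pointers = {"data/raw/train": "abc123", "data/raw/val": "def456"}
cache, workspace = {}, {}

pull(remote, cache, workspace, pointers)
print(sorted(workspace))
```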

## ML comes into play

In [None]:
import pandas as pd

In [None]:
# Create CSV files for storing filenames and corresponding labels for binary classification.
!cd data-version-control && python src/prepare.py

In [None]:
!cd data-version-control && ls data/prepared

In [None]:
train_csv = pd.read_csv("data-version-control/data/prepared/train.csv", index_col=0)
train_csv.head()

In [None]:
# Classes.
train_csv.label.unique()
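
For readers who want a feel for what such a preparation step involves, here is a rough sketch using only the standard library. It is not the repository's `prepare.py`: the function `build_label_csv`, the `filename` and `label` column names, the `.JPEG` extension, and the placeholder class folders are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def build_label_csv(split_dir: Path, csv_path: Path) -> int:
    """Write one (filename, label) row per image under split_dir/<class>/."""
    rows = []
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        for image in sorted(class_dir.glob("*.JPEG")):
            # The folder name is used as the label.
            rows.append((str(image), class_dir.name))
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "label"])
        writer.writerows(rows)
    return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    split = Path(tmp) / "raw" / "train"
    for class_name in ("class_a", "class_b"):  # placeholder class folders
        (split / class_name).mkdir(parents=True)
        (split / class_name / "img_0.JPEG").write_bytes(b"")
    n_rows = build_label_csv(split, Path(tmp) / "prepared" / "train.csv")
    print(n_rows)
```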

Now we need to add these `train.csv` and `test.csv` files to DVC and the corresponding `.dvc` files to GitHub:

In [None]:
!cd data-version-control && dvc add -q data/prepared/train.csv data/prepared/test.csv
!cd data-version-control && git add --all
!cd data-version-control && git commit -m "Created train and test CSV files"

Training SVM classifier:

In [None]:
!cd data-version-control && python src/train.py

When the script finishes, you'll have a trained machine learning model saved in the `model/` folder with the name `model.joblib`. This is the most important file of the experiment. It needs to be added to DVC, with the corresponding `.dvc` file committed to GitHub:

In [None]:
!cd data-version-control && ls model

In [None]:
!cd data-version-control && dvc add -q model/model.joblib
!cd data-version-control && git add --all
!cd data-version-control && git commit -m "svm classifier"

Let's evaluate our model.

In [None]:
!cd data-version-control && python src/evaluate.py
!cd data-version-control && cat metrics/accuracy.json

The accuracy JSON file is really small, and it's useful to keep it in GitHub so you can quickly check how well each experiment performed:

In [None]:
!cd data-version-control && git add --all
!cd data-version-control && git commit -m "evaluated SVM accuracy"
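
Keeping metrics as small JSON files makes them easy to compare programmatically. The sketch below assumes each metrics file holds a single `accuracy` key; the experiment names and numbers are invented for illustration and are not results from this tutorial.

```python
import json

# Hypothetical contents of metrics/accuracy.json from two experiments.
runs = {
    "experiment_a": json.loads('{"accuracy": 0.61}'),
    "experiment_b": json.loads('{"accuracy": 0.66}'),
}

# Pick the experiment with the highest accuracy.
best_name = max(runs, key=lambda name: runs[name]["accuracy"])
print(best_name, runs[best_name]["accuracy"])
```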

In [None]:
!cd data-version-control && git push
!cd data-version-control && dvc push

## Version Datasets and Models (advanced)

A common practice is to use tagging to mark a specific point in your Git history as being important. Since you've completed an experiment and produced a new model, create a tag to signal to yourself and others that you have a ready-to-go model:

In [None]:
!cd data-version-control && git tag -a svm-classifier -m "SVM classifier with accuracy 64%"

In [None]:
# Git tags aren't pushed with regular commits, so they have to be pushed separately to 
# your repository's origin on GitHub or whatever platform you use. Use the --tags switch 
# to push all tags from your local repository to the remote:

!cd data-version-control && git push origin --tags

A common practice is to create a new branch for every single experiment. Let's increase the number of iterations and see whether the classification accuracy improves.

<div class="alert alert-block alert-danger">
<b>Action</b>:
    Modify your train.py to increase the number of iterations to 100.
</div> 

In [None]:
!cd data-version-control && git checkout -b "svm-100-iterations"

In [None]:
!cd data-version-control && python src/train.py
!cd data-version-control && python src/evaluate.py
!cd data-version-control && cat metrics/accuracy.json

Now since the training process has changed the *model.joblib* file, you need to commit it to the DVC cache.

Remember, `dvc commit` works differently from `git commit` and is used to update an already tracked file. This won't delete the previous model, but it will create a new one.

In [None]:
!cd data-version-control && dvc commit -f  # --force
# Normally, DVC asks you to confirm the change; type 'y' for yes. To skip
# the prompt in a Jupyter notebook, use -f or --force.

In [None]:
# Add and commit the changes you've made to Git:
!cd data-version-control && git add --all
!cd data-version-control && git commit -m "change svm max_iter to 100"

# Tag the new experiment:
!cd data-version-control && git tag -a svm-100-iter -m "trained an svm classifier for 100 iterations"
!cd data-version-control && git push origin --tags

# Push the code changes to GitHub and the DVC changes to the remote storage:
!cd data-version-control && git push --set-upstream origin svm-100-iterations
!cd data-version-control && dvc push

<div class="alert alert-block alert-success">
We can now jump between branches <b>first_experiment</b> and <b>svm-100-iterations</b> by checking out the code from GitHub and then checking out the data and model from DVC:
</div>

In [None]:
!cd data-version-control && git checkout first_experiment
!cd data-version-control && dvc checkout

In [None]:
# Let's evaluate.
!cd data-version-control && python src/evaluate.py
!cd data-version-control && cat metrics/accuracy.json

We have already seen this number 😃

<div class="alert alert-block alert-info">
So far, we have learned how to conduct an experiment, store the model weights and data associated with it, and then conduct another experiment while keeping the ability to go back and reproduce the results of the previous one.
</div>

## Creating Pipelines

So far we have fetched the data manually and added it to remote storage. You can now get it with `dvc checkout` or `dvc pull`. The other steps were executed by running various Python files. These can be chained together into a single execution called a **DVC pipeline** that requires only one command.

In [None]:
# Create a new branch:
!cd data-version-control && git checkout -b svm-pipeline

A pipeline consists of multiple stages and is executed using a `dvc run` command. (Newer DVC releases replace `dvc run` with `dvc stage add` followed by `dvc repro`.) Each stage has three components:
- Inputs: pipeline inputs, called **dependencies** in DVC
- Outputs: pipeline outputs, called **outs** in DVC
- Command: anything you usually run in the command line, including Python files

In [None]:
# This will remove the .dvc files and the associated data targeted by the .dvc files. 
# We should now have a blank slate to re-create these files using DVC pipelines.
!cd data-version-control && dvc remove data/prepared/train.csv.dvc \
                                        data/prepared/test.csv.dvc \
                                        model/model.joblib.dvc --outs

First, you're going to run `prepare.py` as a DVC pipeline stage. The command for this is `dvc run`, which needs to know the dependencies, outputs, and command:  

- Dependencies: `src/prepare.py` and the data in `data/raw`  
- Outputs: `data/prepared/train.csv` and `data/prepared/test.csv`  
- Command: `python src/prepare.py`  


In [None]:
!cd data-version-control && dvc run -n prepare \
                                -d src/prepare.py -d data/raw \
                                -o data/prepared/train.csv -o data/prepared/test.csv \
                                python src/prepare.py

# -n switch gives the stage a name.
# -d switch passes the dependencies to the command.
# -o switch defines the outputs of the command.

Once you create the stage, DVC will create two files, `dvc.yaml` and `dvc.lock`. Let's check them:

In [None]:
!cd data-version-control && cat dvc.yaml

The top-level element, `stages`, has elements nested under it, one for each stage. Currently, we have only one stage, `prepare`. As we chain more, they'll show up in this file. Technically, we don't have to type `dvc run` commands in the command line; we can create all our stages directly in this file.

<center><img src="images/new-pipeline_1.png" width="400" height="20"/></center>

The next stage in the pipeline is training. The dependencies are the `train.py` file itself and the `train.csv` file in `data/prepared`. The only output is the `model.joblib` file. To create a pipeline stage out of `train.py`, execute it with `dvc run`, specifying the correct dependencies and outputs:

In [None]:
!cd data-version-control && dvc run -n train \
                                    -d src/train.py -d data/prepared/train.csv \
                                    -o model/model.joblib \
                                    python src/train.py

The final stage will be the evaluation. The dependencies are the `evaluate.py` file and the model file generated in the previous stage. The output is the metrics file, `accuracy.json`. Execute `evaluate.py` with `dvc run`:

In [None]:
!cd data-version-control && dvc run -n evaluate \
                                    -d src/evaluate.py -d model/model.joblib \
                                    -M metrics/accuracy.json \
                                    python src/evaluate.py

# Notice that we used the -M switch instead of -o. DVC treats metrics differently from other outputs. 
# When you run this command, it will generate the accuracy.json file, but DVC will know that it's a 
# metric used to measure the performance of the model.

In [None]:
# You can get DVC to show you all the metrics it knows about with the dvc metrics show command:
!cd data-version-control && dvc metrics show

In [None]:
!cd data-version-control && cat dvc.yaml
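
The three stages form a chain because each stage's output is the next stage's dependency. As a rough illustration of how an execution order can be derived from that information, the sketch below writes the stages from this tutorial as plain dictionaries and sorts them topologically with the standard library. It is not how DVC is implemented, and the dictionary layout is an assumption that only loosely mirrors `dvc.yaml`.

```python
from graphlib import TopologicalSorter  # available in Python 3.9+

# The three stages from this tutorial, written as plain dictionaries.
stages = {
    "prepare": {
        "deps": ["src/prepare.py", "data/raw"],
        "outs": ["data/prepared/train.csv", "data/prepared/test.csv"],
    },
    "train": {
        "deps": ["src/train.py", "data/prepared/train.csv"],
        "outs": ["model/model.joblib"],
    },
    "evaluate": {
        "deps": ["src/evaluate.py", "model/model.joblib"],
        "outs": ["metrics/accuracy.json"],
    },
}

# Map every output path to the stage that produces it.
producer = {out: name for name, stage in stages.items() for out in stage["outs"]}

# A stage depends on another stage when one of its deps is that stage's output.
graph = {
    name: {producer[dep] for dep in stage["deps"] if dep in producer}
    for name, stage in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['prepare', 'train', 'evaluate']
```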

<center><img src="images/new-pipeline_final.png" width="1100" height="100"/></center>

In [None]:
# Version and store your code, models, and data for the new DVC pipeline:
!cd data-version-control && git add --all
!cd data-version-control && git commit -m "rerun svm as pipeline"
!cd data-version-control && dvc commit
!cd data-version-control && git push --set-upstream origin svm-pipeline
!cd data-version-control && git tag -a svm-pipeline -m "trained svm as DVC pipeline."
!cd data-version-control && git push origin --tags
!cd data-version-control && dvc push

### Now let's change the model while reusing the pipeline

In [None]:
!cd data-version-control && git checkout -b "random_forest"

<div class="alert alert-block alert-danger">
<b>Action</b>:
    Modify your train.py to use a RandomForestClassifier.
</div> 

In [None]:
# Display all the changed dependencies for every stage of the pipeline:
!cd data-version-control && dvc status

In [None]:
# Since the change in the model will affect the metric as well, you want to reproduce the whole chain. 
# You can reproduce any DVC pipeline file with the `dvc repro` command:
!cd data-version-control && dvc repro evaluate

<div class="alert alert-block alert-success">
And that's it! When you run the `repro` command, DVC checks all the dependencies of the entire pipeline to determine what's changed and which commands need to be executed again. Think about what this means. You can jump from branch to branch and reproduce any experiment with a single command!
</div>

**DVC reruns only the parts of the pipeline that changed. In the output above, *skipping* marks the stages DVC did not rerun because their dependencies (here, the data) had not changed.**
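
The skipping behaviour can be sketched as follows: fingerprint each stage's dependencies, compare the fingerprint with the one recorded on the last run, and rerun the stage only when they differ. This is a simplified model that uses an in-memory `files` dictionary and a `lock` dictionary as stand-ins for the workspace and `dvc.lock`. The functions `fingerprint` and `repro` are hypothetical, and real DVC tracks more than this.

```python
import hashlib


def fingerprint(deps: dict) -> str:
    """Hash the paths and contents of all dependencies into one digest."""
    digest = hashlib.md5()
    for path in sorted(deps):
        digest.update(path.encode())
        digest.update(deps[path])
    return digest.hexdigest()


def repro(stages: list, files: dict, lock: dict) -> list:
    """Run stages in order, skipping those whose dependencies are unchanged."""
    executed = []
    for stage in stages:
        current = fingerprint({p: files[p] for p in stage["deps"]})
        if lock.get(stage["name"]) == current:
            continue  # nothing changed: skip this stage
        stage["run"](files)
        lock[stage["name"]] = current
        executed.append(stage["name"])
    return executed


# Toy stages: each "run" writes a made-up output into the files dictionary.
stages = [
    {"name": "prepare", "deps": ["src/prepare.py", "data/raw"],
     "run": lambda f: f.update({"data/prepared/train.csv": b"csv from " + f["data/raw"]})},
    {"name": "train", "deps": ["src/train.py", "data/prepared/train.csv"],
     "run": lambda f: f.update({"model/model.joblib": b"model by " + f["src/train.py"]})},
    {"name": "evaluate", "deps": ["src/evaluate.py", "model/model.joblib"],
     "run": lambda f: f.update({"metrics/accuracy.json": b"score of " + f["model/model.joblib"]})},
]
files = {
    "src/prepare.py": b"v1", "src/train.py": b"svm",
    "src/evaluate.py": b"v1", "data/raw": b"images",
}
lock = {}

first_run = repro(stages, files, lock)      # every stage runs
files["src/train.py"] = b"random forest"   # change only the training script
second_run = repro(stages, files, lock)     # prepare is skipped
print(first_run, second_run)
```

Changing only the training script leaves the `prepare` fingerprint untouched, so that stage is skipped, while `train` and then `evaluate` rerun because the model file they produce or consume has changed.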

In [None]:
!cd data-version-control && cat metrics/accuracy.json

In [None]:
!cd data-version-control && git add --all
!cd data-version-control && git commit -m "train Random Forest classifier"
!cd data-version-control && dvc commit
!cd data-version-control && git push --set-upstream origin random_forest
!cd data-version-control && git tag -a random-forest -m "Random Forest classifier"
!cd data-version-control && git push origin --tags
!cd data-version-control && dvc push

In [None]:
# Compare metrics across multiple tags:
!cd data-version-control && dvc metrics show -T

When you come back to this project after several months and don't remember the details, you can check which setup was the most successful with `dvc metrics show -T` and reproduce it with `dvc repro`! Anyone else who wants to reproduce your work can do the same. They'll just need to take three steps:
- Run `git clone` or `git checkout` to get the code and `.dvc` files.
- Get the training data with `dvc pull` (or `dvc checkout` if the data is already in the local cache).
- Reproduce the entire workflow with `dvc repro evaluate`.

<center><img src="images/thank_you.jpeg" width="300" height="500"/></center>

## References
- [Data Version Control With Python and DVC](https://realpython.com/python-data-version-control/)
- [Tutorial: Data and Model Versioning](https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial)