# Data Version Control

## Introduction
>[Data Version Control (DVC)](https://dvc.org) is a data-versioning tool that employs lightweight pointers to indicate the storage location of data (e.g. S3 buckets, Google Drive, GCS, etc). 

Additionally, it stores information about the experiments run using the data, and it lets you push changes to, or pull them from, remote storage. As an added advantage, its commands closely resemble those of Git.

## Installation

DVC runs on Windows, macOS, and Linux, and there are several ways to install it. One option that works on all three is to install it with conda (via mamba):
```
conda install -c conda-forge mamba
mamba install -c conda-forge dvc
```

Next, install the necessary package corresponding to your remote storage service of choice. For example, if you plan to use S3 storage, run the following:

```
mamba install -c conda-forge dvc-s3
```
If you intend to use Google Drive, run
```
mamba install -c conda-forge dvc-gdrive
```

## Initialising DVC

DVC works in harmony with Git projects. As such, we will initialise DVC in a new Git repository. Create a repository on your local machine:

```
mkdir test_dvc && cd test_dvc
git init
```
Once inside, initialise dvc as follows:
```
dvc init
```

At this point, your initially empty repo should contain several new files, including a `.dvc/` directory and a `.dvcignore` file.
<div style="text-align:center"><img src="images/test_dvc.png" width=175/></div>

If you run `git status`, you will find that all the created files have already been staged. As the next step, commit them by running the following:
```
git commit -m "DVC Init"
```

## DVC for Data Versioning

DVC aims to harness the power of Git and apply it to large datasets. Therefore, by cloning a repo and pulling its data, you can obtain an extensive dataset together with the model that was trained using that dataset.

### Downloading the example file
In this section, we will learn how to store local data remotely using commands similar to those in Git. For this example, download `data.xml` (a dataset provided by the DVC team for users to conduct experiments).
```
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
```

As can be observed, `get` is used with `dvc` to download data tracked by DVC or Git (think of it as `wget`). For more information on the commands, visit this [link](https://dvc.org/doc/start/data-and-model-versioning).



### Staging the file
Once downloaded, stage the file using `dvc add`, which mirrors the familiar Git subcommand:
```
dvc add data/data.xml
```

Thereafter, you will observe that a `.gitignore` file and a `data.xml.dvc` file have been created in the `data` folder. The `.gitignore` file prevents you from committing the whole dataset to Git, while the `data.xml.dvc` file is the small 'pointer' that records which version of the data to fetch once it has been pushed. Remember to stage and commit both files in Git:

```
git add 'data/data.xml.dvc' 'data/.gitignore'
git commit -m 'Add DVC tracker'
```
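
To make the idea of a 'pointer' more concrete, the sketch below imitates it by hand, without DVC. It hashes a small stand-in file and writes the hash to a tiny text file, which is the kind of information Git ends up tracking instead of the data itself. The directory `pointer_demo`, the file names, and the two-line pointer layout are all hypothetical; a real `.dvc` file is YAML written by DVC and contains more fields. The sketch also assumes that `md5sum` (GNU coreutils) is available.

```shell
# Scratch directory with a stand-in for a large data file (hypothetical names).
mkdir -p pointer_demo/data
printf '<row Id="1" />\n<row Id="2" />\n' > pointer_demo/data/sample.xml

# Hash the file contents; DVC likewise identifies a file by a hash of its contents.
hash=$(md5sum pointer_demo/data/sample.xml | cut -d ' ' -f 1)

# Write a tiny pointer file: this small text file is what Git would track.
printf 'md5: %s\npath: sample.xml\n' "$hash" > pointer_demo/data/sample.xml.pointer

# Keep the data file itself out of Git, as DVC does through data/.gitignore.
echo '/sample.xml' > pointer_demo/data/.gitignore

cat pointer_demo/data/sample.xml.pointer
```

Because the pointer only changes when the file contents change, committing it gives Git a cheap record of each version of the data.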

### Pushing the files
Next, we push the files. For this example, we will use an Amazon S3 bucket, which is highly recommended because of its simplicity. 

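Before that, however, make sure that DVC's S3 support is installed (skip this step if you already ran the command during installation). This package lets DVC communicate with S3 using your AWS credentials when you attempt to push data: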

```
mamba install -c conda-forge dvc-s3
```

>On a related note, it is imperative that you provide your credentials. If you have an S3 bucket, you have probably already set up your AWS configuration data. Otherwise, you can create an IAM user as well as a user with administration permissions. For more information, check out the Cloud Basics module in the Data-Engineering unit.

Go to the bucket to which you intend to upload your data, and create a new folder. For the demonstration, we create a folder named AiCore_DVC.

<div style="text-align:center"><img src="images/DVC_S3.png" width=600/></div>

Once created, copy the URL of the S3 folder so that you can add a remote bucket to DVC (similar to the case when adding a remote URL to git). The next time you push data, the data tracked by DVC will be uploaded to this folder.

<div style="text-align:center"><img src="images/DVC_S3_2.png" width=600/></div>

To set the remote DVC storage, run the following command:
```
dvc remote add -d <url_name> s3://<bucket_name>/<folder_name>/
```
Here, `<url_name>` is the name under which you wish to save the URL (similar to `origin` in Git). In this case, we simply call it `storage`. Therefore, we run
```
dvc remote add -d storage s3://dvc-aicore/AiCore_DVC/
```

_To overwrite an existing remote URL, add `-f` (for force) after the `-d` flag._

Before proceeding, ensure that you commit the new changes. The configurations for the remote storage will be stored in the created `.dvc/config` file. Thus, run
```
git add .dvc/config 
git commit -m "Configures remote data storage"
```
Now, everything is set up. Simply push the data by running
```
dvc push
```
The data should now appear in your bucket.

### Pulling changes
As expected, pulling changes is achieved by running
```
dvc pull
```
It would be pointless to run this command at the moment; however, be aware that you can retrieve the data easily after cloning the repo.

### Making changes
If changes are made to the file, you can also update the remote data. To demonstrate, we will apply a slight change. We double the dataset we had previously by simply copying and appending it, as follows:
```
cp data/data.xml /tmp/data.xml
cat /tmp/data.xml >> data/data.xml
```
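
A quick way to confirm that the append really doubled the file is to compare line counts before and after. The sketch below does this on a small stand-in file in a scratch directory (`double_demo` and its contents are hypothetical). On the real dataset, you would run `wc -l data/data.xml` before and after the two commands above; note that this counts lines, which is only a proxy for the number of samples.

```shell
# Stand-in for the dataset: a three-line file in a scratch directory.
mkdir -p double_demo
printf 'a\nb\nc\n' > double_demo/data.xml
before=$(wc -l < double_demo/data.xml)

# Same copy-and-append pattern as above, applied to the stand-in file.
cp double_demo/data.xml double_demo/data_copy.xml
cat double_demo/data_copy.xml >> double_demo/data.xml
after=$(wc -l < double_demo/data.xml)

# The second count should be twice the first.
echo "lines before: $before, lines after: $after"
```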

Next, as is the case in git, we stage the file using the `add` subcommand:
```
dvc add data/data.xml
```
Now, we stage and commit the updated pointer file in Git so that this version of the data is recorded:

```
git add data/data.xml.dvc
git commit data/data.xml.dvc -m "Dataset updates"
```

Next, we simply push the changes to the DVC remote storage:
```
dvc push
```

### Reverting changes
To revert to a previous version, or to the data stored in a different branch, use `git checkout` to move to the commit or branch with the corresponding version, and run `dvc checkout` thereafter.

For example, to revert to the original version of your dataset (before that, confirm that your current data.xml file has 50,000 samples), run the following:
```
git checkout HEAD~1 data/data.xml.dvc
dvc checkout
```
If all goes as expected, you should receive the following message: 
```
$ dvc checkout
M       data\data.xml
```

> __Note:__ use `git checkout`, followed by `dvc checkout` to revert to a previous version of your data.

At this point, you should have 25,000 samples, which is the original length of the dataset.

## DVC for Model Tracking

DVC does not only track data. Consider DVC as the Git for both data and models. Models can be trained and, subsequently, tracked using the above-mentioned commands. Thus, when you run a `pull` or a `checkout`, you will retrieve the data and the corresponding model.

Here is a step-by-step breakdown of the process flow. First, we track files using Git. A repository is created, in which large files (e.g. data or model files) are added. Thereafter, the large files are added to the DVC tracker, at which point DVC will include a `.gitignore`. The `.gitignore` prevents the tracking of the original large file by Git, allowing Git to, instead, track the pointer to the original data, i.e. the `data.xml.dvc` file. 

The next step involves pushing the changes to a remote repository.

### Example

1. Create an __empty__ GitHub repository. 
2. Add it as a Git remote of your local DVC project (the `test_dvc` folder).
3. Push the changes.
4. Go to your repo, and observe what has been uploaded.

If you followed all instructions, you will find that no large file was uploaded.

## Retrieving Data

Here, we learn how to retrieve files that have been tracked by DVC. Running `dvc pull` does this, but it needs the tracking files, which means that you have to clone the repo first. DVC therefore also provides commands for downloading data to your local machine without cloning anything.

### Confirming tracked files
First, it would be useful to know what files in your project are being tracked. If you followed all the steps above, you should have a GitHub repo with the file pointing towards the data stored in the S3 bucket.

<div style="text-align:center"><img src="images/DVC_github.png" width=600/></div>

To check what files are tracked by DVC in our repo, we run
```
dvc list https://github.com/IvanYingX/dvc_test.git
```
Alternatively, to view all the contents of the subdirectories, we run:
```
dvc list -R https://github.com/IvanYingX/dvc_test.git
```
The obtained output is shown below. Your output should be similar to it (ensure that you are pointing to the right repo, and not ours).

<div style="text-align:center"><img src="images/DVC_List.png" width=350/></div>

Anyone can view the contents of the repo provided that they have the repo URL. Observe that `data.xml` is shown, even though it does not appear in the repository.

### Downloading the data
Now, to download `data.xml` to your local machine anew, simply run `dvc get`, which is the equivalent of `wget`, but for files tracked by DVC:
```
dvc get https://github.com/IvanYingX/dvc_test.git data/data.xml
```
_To experiment with this, go to a new directory (Desktop for example), and run the command above._

In our case, we download the entire folder by running
```
cd ~/Desktop && dvc get https://github.com/IvanYingX/dvc_test.git data
```
or alternatively,
```
dvc get -o ~/Desktop https://github.com/IvanYingX/dvc_test.git data
```
This creates a new folder on the desktop called `data`. If you check the folder, you will find that there is no `data.xml.dvc` file. That is because `dvc get` does not create a tracking system, but simply downloads the tracked files.

> __Note:__ `dvc get` will download the data; however, it will not generate a DVC tracking file.

### Tracking the files
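If you intend to keep track of these files, you would have to run `dvc add` on them. DVC offers a shortcut, `dvc import`, which downloads the data just as `dvc get` does and also creates the corresponding `.dvc` tracking file in one step. It must be run inside a DVC project, for example `dvc import https://github.com/IvanYingX/dvc_test.git data`.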

## DVC Pipelines

In addition to keeping track of and retrieving large files and models, DVC allows you to create pipelines to reproduce experiments. This way, we can train models without worrying about data pulling, training, pushing and tracking.

To demonstrate, we will use an example provided by the DVC team. 
- Create a new folder, and run the following commands in your CLI:
```
wget https://code.dvc.org/get-started/code.zip
unzip code.zip
rm -f code.zip
```
If you do not have `wget` installed, go to the following page: [https://www.jcchouinard.com/wget/](https://www.jcchouinard.com/wget/).

Your new folder should appear similar to that shown below:
<div style="text-align:center"><img src="images/DVC_Pipe.png" width=300/></div>

- Observe the contents of each file. You should find a `requirements.txt` file (consider running the subsequent commands in a virtual environment).
- Initialise the source-folder repo using `git init`, and add/commit everything. Ignore the contents of each file, but be aware that the `featurization.py` will transform some raw data into features. We will also need the data we were using previously with the `data.xml` file. Thus, in the same directory, we run `dvc get`:
```
dvc get https://github.com/IvanYingX/dvc_test.git data
```
Next, initialise DVC in your repo, add the large files to be tracked, and make the corresponding commits:
```
dvc init
dvc add data/data.xml
dvc remote add -d storage s3://dvc-aicore/AiCore_DVC/
git add 'data/data.xml.dvc' 'data/.gitignore' '.dvc/config'
git commit -m 'Add DVC tracker and storage'
dvc push
```

Here, we begin creating the pipeline. Remember that pipelines can be treated as workflows with many tasks in line (essentially a linear DAG). You can create a pipeline by creating a YAML file named `dvc.yaml`, in which each step of the pipeline is defined.

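Fortunately, DVC offers the command `dvc run`, which creates these stages and stores them in `dvc.yaml` for you. Note that `dvc run` was removed in DVC 3.0; on recent versions, use `dvc stage add` with the same flags, and then execute the stage with `dvc repro`.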

Each step is called a stage, and each time `dvc run` is run, a new stage is created. To create the first stage, we run the following:
```
dvc run -n prepare \
        -p prepare.seed,prepare.split \
        -d src/prepare.py -d data/data.xml \
        -o data/prepared \
        python src/prepare.py data/data.xml
```

Click on each line to view the details.

<details>
  <summary> -n prepare </summary>
  This will assign a name to the stage ('prepare' in this case).
</details>
<details>
  <summary> -p prepare.seed,prepare.split </summary>
  The p stands for parameters. By default, DVC reads them from a file named 'params.yaml'; here, the stage uses the prepare.seed and prepare.split values contained therein.
</details>
<details>
  <summary> -d src/prepare.py -d data/data.xml </summary>
  The d stands for dependencies. It informs the pipeline of the files that are necessary in this stage. If they are not in your repo, your DAG will request you to include them.
</details>
<details>
  <summary> -o data/prepared </summary>
  The output that the stage produces. DVC tracks it, which is why 'data/.gitignore' is updated.
</details>
<details>
  <summary> python src/prepare.py data/data.xml </summary>
  This is the actual command that will be run. 
</details>

Once you run the command, you will find two new files, `dvc.yaml` and `dvc.lock`, and an updated `data/.gitignore`. The latter is updated so that Git does not track the prepared data (the train and test splits). The two new files describe the DAG, which, in this case, contains only a single stage: `dvc.yaml` lists the stages, and `dvc.lock` records the hashes of their dependencies and outputs.
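
For reference, a stage definition can also be written by hand. The sketch below writes a `dvc.yaml` describing the same `prepare` stage into a scratch directory (`yaml_demo` is a hypothetical name, used so that the file in your project is not overwritten). The keys `cmd`, `deps`, `params`, and `outs` follow the DVC stage format, but the ordering and indentation that DVC itself generates may differ slightly.

```shell
# Write a hand-made dvc.yaml for the prepare stage into a scratch directory.
mkdir -p yaml_demo
cat > yaml_demo/dvc.yaml <<'EOF'
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
EOF

# Each flag of the command above maps to one key: -n to the stage name,
# -d to deps, -p to params, -o to outs, and the trailing command to cmd.
cat yaml_demo/dvc.yaml
```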

<div style="text-align:center"><img src="images/DVC_Stage.png" width=400/></div>

Once you have a pipeline, you can start creating experiments by running `dvc repro`. However, before that, we add a stage to the pipeline; otherwise, we would only have one step. To visualise the stages in your DAG, run `dvc dag`, which will enable you to view the pipeline in a clear format:

<div style="text-align:center"><img src="images/DVC_Dag.png" width=200/></div>

Now, we add a couple of steps to this DAG. The next step will create features using the prepared data. These features will be written to `data/features`, and the stage will depend on `featurization.py` and the prepared data.

```
dvc run -n featurize \
        -p featurize.max_features,featurize.ngrams \
        -d src/featurization.py -d data/prepared \
        -o data/features \
        python src/featurization.py data/prepared data/features
```

Upon running this code, your `dvc.yaml` file will be updated:

<div style="text-align:center"><img src="images/DVC_Stage_2.png" width=400/></div>

Additionally, notice that there are new pickle files in `data/features`.

The last step will correspond to the training step:
```
dvc run -n train \
        -p train.seed,train.n_est,train.min_split \
        -d src/train.py -d data/features \
        -o model.pkl \
        python src/train.py data/features model.pkl
```

After running all these commands, stage and commit the generated files using Git:
```
git add dvc.yaml dvc.lock .gitignore data/.gitignore
git commit -m "Create pipeline stages"
```

If you run `dvc dag`, the new DAG should appear similar to that shown below:

<div style="text-align:center"><img src="images/DVC_Dag_2.png" width=200/></div>

As you were creating the stages, all the steps were executed. To avoid running them, add `--no-exec` during the creation process.

Since the stages and their dependencies are now recorded in `dvc.yaml`, you do not have to specify them again. Running `dvc repro` reproduces the pipeline, in full or in part, by executing its stages in order. If you run it now, DVC will detect no changes; hence, we will modify some parameters.

In this pipeline, it is possible to modify the train/test split size, the data itself or the number of maximum features. We will change the train/test split size. In `params.yaml`, set a different value for split and execute `dvc repro`.
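
One way to make that edit from the command line is sketched below. It works on a stand-in `params.yaml` in a scratch directory (`params_demo` is a hypothetical name, and the values shown are placeholders that may not match the file shipped in `code.zip`). The `sed` call rewrites the indented `split:` line and keeps a backup of the original file.

```shell
# Stand-in params.yaml with placeholder values (assumed, not taken from code.zip).
mkdir -p params_demo
cat > params_demo/params.yaml <<'EOF'
prepare:
  split: 0.20
  seed: 20170428
EOF

# Replace the value of the indented "split:" key; the original is kept as params.yaml.bak.
sed -i.bak 's/^\(  split:\).*/\1 0.30/' params_demo/params.yaml

cat params_demo/params.yaml
```

In the real project, you would apply the same edit to `params.yaml` (or simply change the value in a text editor) and then run `dvc repro`.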

You should see all the steps running sequentially because the first stage (`prepare`) depends on that parameter, and each later stage depends on the output of the one before it. However, if you change `n_est` in `params.yaml` and rerun `dvc repro`, the following occurs:

<div style="text-align:center"><img src="images/DVC_Repro.png" width=400/></div>

DVC detects that the dependencies of some steps have not changed. Hence, it skips them, thereby saving computation time.
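
The skipping behaviour can be pictured as a hash comparison: after a stage runs, the hash of its dependencies is recorded, and the stage is rerun only if the current hash differs from the recorded one. The sketch below illustrates this idea with a single dependency file. It is not DVC's actual implementation; the directory `skip_demo`, the file names, and the function `run_stage_if_changed` are hypothetical, and `md5sum` is assumed to be available.

```shell
# Scratch directory with one stand-in dependency; start without a recorded hash.
mkdir -p skip_demo
rm -f skip_demo/train.lock
echo 'n_est: 50' > skip_demo/train_params.txt

# Run the (stand-in) stage only if the dependency hash differs from the recorded one.
run_stage_if_changed() {
    current=$(md5sum skip_demo/train_params.txt | cut -d ' ' -f 1)
    recorded=$(cat skip_demo/train.lock 2>/dev/null || true)
    if [ "$current" = "$recorded" ]; then
        echo "skipped"
    else
        echo "ran"
        printf '%s' "$current" > skip_demo/train.lock
    fi
}

run_stage_if_changed >  skip_demo/log.txt   # first call: nothing recorded yet
run_stage_if_changed >> skip_demo/log.txt   # dependency unchanged
echo 'n_est: 100' > skip_demo/train_params.txt
run_stage_if_changed >> skip_demo/log.txt   # dependency changed
cat skip_demo/log.txt
```

In DVC, the recorded hashes live in `dvc.lock`, which is why that file should be committed alongside `dvc.yaml`.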

## Tracking Metrics

Creating DAGs is a good approach for automating a series of steps that will run sequentially. In a later lesson, we will learn how to add steps that track the performance of each experiment.

## Conclusion
At this point, you should have a good understanding of 

- DVC as a data-versioning tool.
- how to install and initialise DVC.
- how to use DVC for model tracking.
- how to retrieve data.
- DVC pipelines.
- tracking metrics.