## Introduction to DVC

Managing datasets with DVC requires running a set of commands in a specific order:
1.	The first step is to set up a Git repository with DVC:
```bash
git init
dvc init
git commit -m 'initialize repo'
```
2.	Now, we need to configure the remote storage for DVC:
```bash
dvc remote add -d myremote /tmp/dvc-storage
git commit .dvc/config -m 'Added local remote storage'
```
3.	Let’s create a sample data directory and fill it with some sample data:
```bash
mkdir data
cp sample_google_scholar.csv data/
```
4.	At this stage, we are ready to start tracking the dataset. We just need to add our file to DVC. This operation creates an additional file, sample_google_scholar.csv.dvc. In addition, sample_google_scholar.csv is added to data/.gitignore automatically so that Git no longer tracks the original file:
```bash
dvc add data/sample_google_scholar.csv
```
5.	Next, commit sample_google_scholar.csv.dvc and the .gitignore file to Git, then upload the data itself to remote storage with dvc push. We will tag our first dataset as v1:
```bash
git add data/.gitignore data/sample_google_scholar.csv.dvc
git commit -m 'data tracking'
git tag -a 'v1' -m 'test_data'
dvc push
```
6.	After the dvc push command, our data is available on remote storage, so we can remove the local copy. To restore sample_google_scholar.csv, simply run dvc pull:
```bash
dvc pull data/sample_google_scholar.csv.dvc
```
7.	When sample_google_scholar.csv is modified, we need to add and push it again to update the version on remote storage. For example, we can delete some rows from the CSV file or make other modifications. We will tag the modified dataset as v2:

```bash
dvc add data/sample_google_scholar.csv
git add data/sample_google_scholar.csv.dvc
git commit -m 'data modification description'
git tag -a 'v2' -m 'modified test_data'
dvc push
```
After executing the commands in this section, you will have two versions of the same dataset tracked by Git and DVC: v1 and v2.
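To see why the two tags point at different data, it helps to know that the .dvc file tracked by Git is a small pointer that stores a content hash of the dataset (MD5 by default). Any change to the CSV changes the hash, and therefore the committed pointer. The following is a minimal Python sketch of that idea using only the standard library. The names `file_md5` and `write_temp` and the sample contents are hypothetical, and DVC's own hashing may differ in details, so treat this as an illustration rather than DVC's implementation.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_temp(content):
    """Write bytes to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "wb") as f:
        f.write(content)
    return path


# Hypothetical stand-ins for the v1 and v2 contents of the dataset
v1_path = write_temp(b"author_name,email\nA,a.edu\nB,b.edu\n")
v2_path = write_temp(b"author_name,email\nB,b.edu\n")

v1_hash = file_md5(v1_path)
v2_hash = file_md5(v2_path)
print(v1_hash != v2_hash)  # different contents give different hashes

# Clean up the temporary demo files
os.remove(v1_path)
os.remove(v2_path)
```

Because only the pointer file is committed, Git history stays small while each tag still identifies exactly one version of the data.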


## MLflow with DVC 

```python
import mlflow
import dvc.api
import pandas as pd

data_path = 'data/sample_google_scholar.csv'
repo = '/Users/tpalczew/BookDL_demo/'
version = 'v1'

# Resolve the storage location of the file as of the given Git revision
data_url = dvc.api.get_url(path=data_path, repo=repo, rev=version)

data = pd.read_csv(data_url)
data.head()
```

| Unnamed: 0 | author_name | email | affiliation | coauthors_names | research_interest |
|---|---|---|---|---|---|
| 0 | Lawrence Holder | wsu.edu | Washington State University | Diane J Cook##William Eberle | artificial_intelligence##machine_learning##dat... |
| 1 | Dr. Sirisha Velampalli | crraoaimscs.res.in | | William Eberle##Lenin Mookiah | graph_mining##big_data_analytics##machine_lear... |
| 2 | William Eberle | tntech.edu | Tennessee Technological University | | data_mining##anomaly_detection |
| 3 | Diane J Cook | eecs.wsu.edu | Washington State University | Lawrence Holder##Parisa Rashidi##Sajal K. Das#... | artificial_intelligence##machine_learning##sma... |
| 4 | Sumi Helal IEEE Fellow AAAS Fellow IET Fellow ... | cise.ufl.edu | University of Florida | Raja Bose##Darrell Woelk##Diane J Cook##Yousse... | digital_health##smart_homes##internet_of_thing... |


We deleted the first three rows from the CSV file and, following the instructions above, added a v5 tag for the new version of the dataset.

```python
data_path = 'data/sample_google_scholar.csv'
repo = '/Users/tpalczew/BookDL_demo/'
version = 'v5'

data_url = dvc.api.get_url(path=data_path, repo=repo, rev=version)

data = pd.read_csv(data_url)
data.head()
```

| Unnamed: 0 | author_name | email | affiliation | coauthors_names | research_interest |
|---|---|---|---|---|---|
| 0 | Diane J Cook | eecs.wsu.edu | Washington State University | Lawrence Holder##Parisa Rashidi##Sajal K. Das#... | artificial_intelligence##machine_learning##sma... |
| 1 | Sumi Helal IEEE Fellow AAAS Fellow IET Fellow ... | cise.ufl.edu | University of Florida | Raja Bose##Darrell Woelk##Diane J Cook##Yousse... | digital_health##smart_homes##internet_of_thing... |
| 2 | Hani Hagras | essex.ac.uk | University of Essex | Christian Wagner | explainable_artificial_intelligence##ambient_i... |
| 3 | Anupam Joshi | umbc.edu | UMBC | Tim Finin##Yelena Yesha##Lalana Kagal##Dipanja... | data_management##mobile_computing##security##s... |
| 4 | Dipanjan Chakraborty | in.ibm.com | | Anupam Joshi##Koustuv Dasgupta##Karl Aberer##T... | context_aware_services##mobile_and_pervasive_c... |


We see that, thanks to DVC, we can easily store and fetch different versions of files.
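As a rough way to compare versions without loading them into pandas, the sketch below counts the data rows of the tracked CSV for several Git tags. It assumes `dvc.api.open` is used to stream each revision. The helper names `count_rows` and `rows_per_version` are hypothetical, and the DVC import is deferred so that the counting helper can be tried on its own.

```python
import csv
import io


def count_rows(lines):
    """Count data rows in CSV text, excluding the header line."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header if present
    return sum(1 for _ in reader)


def rows_per_version(path, repo, revs):
    """Map each Git revision (tag) to the row count of the tracked CSV."""
    import dvc.api  # imported here so count_rows works without DVC installed

    counts = {}
    for rev in revs:
        # Stream the file as it existed at the given revision
        with dvc.api.open(path, repo=repo, rev=rev) as f:
            counts[rev] = count_rows(f)
    return counts


# Stand-alone check of the counting helper on in-memory data
sample = io.StringIO("author_name,email\nA,a.edu\nB,b.edu\n")
sample_count = count_rows(sample)
print(sample_count)  # 2
```

Assuming both tags exist in the repository, calling `rows_per_version(data_path, repo, ['v1', 'v5'])` should return a dictionary in which the v5 count is three lower than the v1 count, matching the three deleted rows.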

Finally, we can log information about our dataset to MLflow:

```python
# Log important information using MLflow
mlflow.start_run()
mlflow.log_param("data_url", data_url)
```

and so on:
```python
mlflow.log_artifact(...)
```
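To make the logging step more concrete, here is a minimal sketch that gathers the dataset details into one dictionary and logs them in a single call. The helper names `dataset_params` and `log_dataset`, the parameter keys, and the example values are all hypothetical. In the notebook, the values would come from `data_path`, `version`, `data_url`, and `data.shape`. The MLflow import is deferred so the dictionary helper can be tried without MLflow installed.

```python
def dataset_params(data_path, version, data_url, n_rows, n_cols):
    """Collect dataset metadata as a flat dict of MLflow parameters."""
    return {
        "data_path": data_path,
        "data_version": version,
        "data_url": data_url,
        "input_rows": n_rows,
        "input_cols": n_cols,
    }


def log_dataset(params):
    """Log the parameters in a run that is closed automatically."""
    import mlflow  # imported here so dataset_params works without MLflow

    # The context manager ends the run even if logging raises an error
    with mlflow.start_run():
        mlflow.log_params(params)


# Hypothetical values for illustration only
params = dataset_params(
    "data/sample_google_scholar.csv", "v1", "/tmp/dvc-storage/<hash>", 100, 6
)
print(params["data_version"])  # v1
```

Logging the tag alongside the resolved URL means a run can later be traced back to the exact dataset version it used.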