# Data Versioning Control (DVC)

## Introduction

The main aim of this exercise is to familiarize students with the awsome `dvc` tool for data and model versioning in machine learning/data mining projects.

The easiest way is to install the library inside a virtual Python environment or using Conda, although direct installation from a repository is possible. All details regarding the installation of the library can be found at [project's website](https://dvc.org/doc/install/linux).

In [1]:
%%bash

pip install dvc

Collecting dvc
  Downloading dvc-2.56.0-py3-none-any.whl (422 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m422.1/422.1 kB[0m [31m958.4 kB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting colorama>=0.3.9
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting configobj>=5.0.6
  Downloading configobj-5.0.8-py2.py3-none-any.whl (36 kB)
Collecting distro>=1.3
  Downloading distro-1.8.0-py3-none-any.whl (20 kB)
Collecting dpath<3,>=2.1.0
  Downloading dpath-2.1.5-py3-none-any.whl (17 kB)
Collecting dvc-data<0.48,>=0.47.1
  Downloading dvc_data-0.47.5-py3-none-any.whl (60 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.1/60.1 kB[0m [31m205.6 kB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting dvc-http>=2.29.0
  Downloading dvc_http-2.30.2-py3-none-any.whl (12 kB)
Collecting dvc-render<1,>=0.3.1
  Downloading dvc_render-0.4.0-py3-none-any.whl (18 kB)
Collecting dvc-studio-client<1,>=0.8.0
  Downloading dvc_st


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m


The first step is to create a directory and to initialize `git` inside it.

In [9]:
%%bash

mkdir dvc-tutorial

cd dvc-tutorial

git init

mkdir: dvc-tutorial: File exists


Initialized empty Git repository in /Users/dominikludwiczak/Library/CloudStorage/OneDrive-put.poznan.pl/semestr 4/Data Mining/Clustering/dvc-tutorial/.git/


In [10]:
%cd dvc-tutorial

/Users/dominikludwiczak/Library/CloudStorage/OneDrive-put.poznan.pl/semestr 4/Data Mining/Clustering/dvc-tutorial


In [11]:
%%bash

dvc init

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


In [12]:
%%bash

git status

On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   .dvc/.gitignore
	new file:   .dvc/config
	new file:   .dvcignore



In [13]:
%%bash

git add .dvc/plots/*
git add .dvc/config
git add .dvc/.gitignore
git add .dvcignore

git commit -m "Initialize DVC for the project"

fatal: pathspec '.dvc/plots/*' did not match any files


[main (root-commit) 75281c7] Initialize DVC for the project
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore


## Data versioning

The main goal of `dvc` is to allow for large data files versioning. Using `git` for this purpose is [quite problematic](https://docs.github.com/en/github/managing-large-files/working-with-large-files). In this laboratory we will use `dvc` to work with different versions of the same data file.

Before starting the laboratory you should download and locally store `adult.data` and `adult.names` files from [UCI ML Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/)

In [15]:
%%bash

mkdir data
cp ../adult* data

mkdir: data: File exists


In [16]:
%%bash

dvc add data/adult.data
dvc add data/adult.names

[?25l⠋ Checking graph



To track the changes with git, run:

	git add data/.gitignore data/adult.data.dvc

To enable auto staging, run:

	dvc config core.autostage true


[?25l⠋ Checking graph



To track the changes with git, run:

	git add data/adult.names.dvc data/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


[?25h

Let's take a look at the files which were automatically created as the result of adding data files to the repo.

In [60]:
%%bash

cat data/adult.data.dvc

outs:
- md5: 5d7c39d7b8804f071cdd1f2a7c460872
  size: 3974305
  path: adult.data


In [61]:
%%bash

cat data/adult.names.dvc

outs:
- md5: 1a7cdb3ff7a1b709968b1c7a11def63e
  size: 5229
  path: adult.names


In order to allow for change tracking in data files we need to add `*.dvc` and   `data/.gitignore` files to the Git repository.

In [19]:
%%bash

git add data/.gitignore data/adult.data.dvc data/adult.names.dvc
git commit -m "Added ADULT dataset"

[main b5645f5] Added ADULT dataset
 3 files changed, 10 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/adult.data.dvc
 create mode 100644 data/adult.names.dvc


Our next step is to create a remote data repository. DVC works with many external data sources, including Amazon S3, Google Cloud Storage, remote servers accessible via `ssh`, HDFS systems, and many more. We will use a local directory to simulate an external repo.

In [20]:
%%bash

mkdir -p ~/dvcrepo
dvc remote add -d repozytorium ~/dvcrepo
git commit .dvc/config -m "Added local directory simulating remote data repository"

Setting 'repozytorium' as a default remote.
[main d7e7ef4] Added local directory simulating remote data repository
 1 file changed, 4 insertions(+)


In [21]:
%%bash

dvc push

2 files pushed


In [22]:
%%bash

ls -al ~/dvcrepo/

total 0
drwxr-xr-x   4 dominikludwiczak  staff   128 May  8 08:37 [34m.[m[m
drwxr-x---+ 52 dominikludwiczak  staff  1664 May  8 08:37 [34m..[m[m
drwxr-xr-x   3 dominikludwiczak  staff    96 May  8 08:37 [34m1a[m[m
drwxr-xr-x   3 dominikludwiczak  staff    96 May  8 08:37 [34m5d[m[m


In [23]:
%%bash 

ls -al ~/dvcrepo/1a/

total 16
drwxr-xr-x  3 dominikludwiczak  staff    96 May  8 08:37 [34m.[m[m
drwxr-xr-x  4 dominikludwiczak  staff   128 May  8 08:37 [34m..[m[m
-r--r--r--@ 1 dominikludwiczak  staff  5229 May  8 08:36 7cdb3ff7a1b709968b1c7a11def63e


In [24]:
%%bash

cat ~/dvcrepo/1a/7cdb3ff7a1b709968b1c7a11def63e

| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
|        Data Mining and Visualization
|        Silicon Graphics.
|        e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database.  A set of
|   reasonably clean records was extracted using the following conditions:
|   ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
|
| Prediction task is to determine whether a person makes over

Remote repo can be used to download original versions of data files when fixing the unnecessary changes, re-creating an experimental branch, etc.

In [62]:
%%bash

rm -rf .dvc/cache/
rm data/adult.data
rm data/adult.names

ls -al data/

total 24
drwxr-xr-x  5 dominikludwiczak  staff  160 May  8 08:56 [34m.[m[m
drwxr-xr-x  6 dominikludwiczak  staff  192 May  8 08:35 [34m..[m[m
-rw-r--r--  1 dominikludwiczak  staff   25 May  8 08:36 .gitignore
-rw-r--r--  1 dominikludwiczak  staff   81 May  8 08:52 adult.data.dvc
-rw-r--r--  1 dominikludwiczak  staff   79 May  8 08:36 adult.names.dvc


In [63]:
%%bash

dvc pull

ls -al data/

A       data/adult.data
A       data/adult.names
2 files added and 2 files fetched
total 7808
drwxr-xr-x  7 dominikludwiczak  staff      224 May  8 08:56 [34m.[m[m
drwxr-xr-x  6 dominikludwiczak  staff      192 May  8 08:35 [34m..[m[m
-rw-r--r--  1 dominikludwiczak  staff       25 May  8 08:36 .gitignore
-rw-r--r--@ 1 dominikludwiczak  staff  3974305 May  8 08:36 adult.data
-rw-r--r--  1 dominikludwiczak  staff       81 May  8 08:52 adult.data.dvc
-rw-r--r--@ 1 dominikludwiczak  staff     5229 May  8 08:36 adult.names
-rw-r--r--  1 dominikludwiczak  staff       79 May  8 08:36 adult.names.dvc


In the next step we will change the data files by removing all information about federal employees. Let's check how many such records do we have, and then let's remove them.

In [64]:
%%bash

cat data/adult.data | wc -l
grep 'Federal-gov' data/adult.data | wc -l

   32562
     960


In [73]:
%%bash

sed -i '' '/Federal-gov/d' data/adult.data # For mac os
# sed -i '/Federal-gov/d' data/adult.data # For regular system
cat data/adult.data | wc -l

   31602


In [74]:
%%bash 

dvc add data/adult.data
git commit data/adult.data.dvc -m "Removed federal workers from the dataset"

dvc push

[?25l⠋ Checking graph



To track the changes with git, run:

	git add data/adult.data.dvc

To enable auto staging, run:

	dvc config core.autostage true
[main d653634] Removed federal workers from the dataset
 1 file changed, 2 insertions(+), 2 deletions(-)
1 file pushed


[?25h

If we want to rollback this change, we need to revert to the correct version of the `adult.data.dvc` file and running `dvc checkout` command to synchronize repos.

In [75]:
%%bash 

git log

commit d65363450b72a213d02857df4db342e4a6560086
Author: DominikLudwiczak <dominik.ludwiczak.priv@gmail.com>
Date:   Mon May 8 09:01:03 2023 +0200

    Removed federal workers from the dataset

commit d7e7ef415743ddb22c5216af7fff9d45d3303336
Author: DominikLudwiczak <dominik.ludwiczak.priv@gmail.com>
Date:   Mon May 8 08:37:12 2023 +0200

    Added local directory simulating remote data repository

commit b5645f5adbd19ebdf891176c2c01ca591e77f84e
Author: DominikLudwiczak <dominik.ludwiczak.priv@gmail.com>
Date:   Mon May 8 08:36:40 2023 +0200

    Added ADULT dataset

commit 75281c7310ce22561c2f2106fb48bd73a3d033b3
Author: DominikLudwiczak <dominik.ludwiczak.priv@gmail.com>
Date:   Mon May 8 08:35:40 2023 +0200

    Initialize DVC for the project


In [76]:
%%bash 

git checkout d7e7ef415743ddb22c5216af7fff9d45d3303336 data/adult.data.dvc
dvc checkout

Updated 1 path from 0a4743c


M       data/adult.data


In [77]:
%%bash

grep 'Federal-gov' data/adult.data

35, Federal-gov, 76845, 9th, 5, Married-civ-spouse, Farming-fishing, Husband, Black, Male, 0, 0, 40, United-States, <=50K
30, Federal-gov, 59951, Some-college, 10, Married-civ-spouse, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K
57, Federal-gov, 337895, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 40, United-States, >50K
50, Federal-gov, 251585, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 55, United-States, >50K
43, Federal-gov, 410867, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, >50K
32, Federal-gov, 249409, HS-grad, 9, Never-married, Other-service, Own-child, Black, Male, 0, 0, 40, United-States, <=50K
38, Federal-gov, 125933, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, Iran, >50K
39, Federal-gov, 235485, Assoc-acdm, 12, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 42, United-States

In [79]:
%%bash 

git commit data/adult.data.dvc -m "Reverting the deletion of federal employees"

[main d55a5ce] Reverting the deletion of federal employees
 1 file changed, 2 insertions(+), 2 deletions(-)


## Access to remote data repositories

Having configured a `git` repo using `dvc` we can easily use `dvc` to quickly download data and models, share the data, etc. The results of the previous chapter were stored in the [https://github.com/megaduks/dvc-tutorial](https://github.com/megaduks/dvc-tutorial) repo and now we will see how we can use remote repo to work with the data. 

In [80]:
%%bash 

dvc list https://github.com/megaduks/dvc-tutorial data

adult.data
adult.names


All datasets can be downloaded using a single command, e.g. to initialize a new project.

In [81]:
%%bash

mkdir new_project
cd new_project
dvc get https://github.com/megaduks/dvc-tutorial data

In [83]:
%%bash

ls -al new_project/data/

total 7784
drwxr-xr-x  4 dominikludwiczak  staff      128 May  8 09:04 [34m.[m[m
drwxr-xr-x  3 dominikludwiczak  staff       96 May  8 09:04 [34m..[m[m
-rw-r--r--  1 dominikludwiczak  staff  3974305 May  8 09:04 adult.data
-rw-r--r--  1 dominikludwiczak  staff     5229 May  8 09:04 adult.names


Unfortunately, using the above command we have lost the information on the origin of the data and we can't re-connect the locally downloaded data with the remote repository. The `dvc get` command resembles `wget` in this regard. If we want to keep the connection between remote and local data, we must use `dvc import`.

In [84]:
%%bash

mkdir -p newer_project/data
dvc import https://github.com/megaduks/dvc-tutorial/ data/adult.data \
    -o newer_project/data/adult.data

Importing 'data/adult.data (https://github.com/megaduks/dvc-tutorial/)' -> 'newer_project/data/adult.data'

To track the changes with git, run:

	git add newer_project/data/adult.data.dvc newer_project/data/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


In [85]:
%%bash

cat newer_project/data/adult.data.dvc

md5: eb47ad7d444e1f1bd3c397fb4af43476
frozen: true
deps:
- path: data/adult.data
  repo:
    url: https://github.com/megaduks/dvc-tutorial/
    rev_lock: a0d82b86e0936e12cfe4e9986ad78a4781f70d86
outs:
- md5: 5d7c39d7b8804f071cdd1f2a7c460872
  size: 3974305
  path: adult.data


As we can see, metadata of the `adult.data` file contain information on the remote repository from which the data originates. Precise hashes identifying a particular version of the data file are stored as well. In addition, we can easily track changes of the origin data in the remote repo.

In [86]:
%%bash

dvc update newer_project/data/adult.data.dvc

'newer_project/data/adult.data.dvc' didn't change, skipping


DVC offers also a programmatical API to access data in remote repos.

In [87]:
import dvc.api

with dvc.api.open('data/adult.data', repo='https://github.com/megaduks/dvc-tutorial') as f:
    for _ in range(10):
        print(f.readline())

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K

50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K

38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K

53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K

28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K

37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K

49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K

52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >5

## Data flows

The most interesting functionality offered by `dvc` is the ability to manage reproducible data workflows. We will use the following flow to illustrate this concept:

- we will pre-process data by removing selected records
- we will add a new feature
- we will train a simple model
- we will evaluate the quality of the model

The code in the following examples is very simplified, but it's purpose is to illustrate the concept of reproducible data flows. First, we need to install some additional dependencies.

In [88]:
%%bash

pip install pandas sklearn pyaml scikit-learn scipy

Collecting pyaml
  Downloading pyaml-23.5.8-py3-none-any.whl (17 kB)
Installing collected packages: pyaml
Successfully installed pyaml-23.5.8



[notice] A new release of pip is available: 23.0.1 -> 23.1.2
[notice] To update, run: python3.11 -m pip install --upgrade pip


We will create the first step of the data flow. In this step we read in a text file and transform it to a serialized binary version (a pickle). 

Create a `params.yaml` file and put the following inside:

In [100]:
%%bash
ls

data
new_project
newer_project


```
prepare:
  split: 0.75
  seed: 42
```

Next, create a `prepare.py` file with the following code.

In [106]:
import pandas as pd
import sklearn
import yaml
import random
import sys

from pathlib import Path
from sklearn.model_selection import train_test_split

params = yaml.safe_load(open('params.yaml'))['prepare']

split = params['split']
random.seed(params['seed'])

input_file = Path(sys.argv[1])
train_output = Path('data') / 'prepared' / 'train.csv'
test_output = Path('data') / 'prepared' / 'test.csv'

Path('data/prepared').mkdir(parents=True, exist_ok=True)

df = pd.read_csv(input_file, sep=',')
train_df, test_df = train_test_split(df, train_size=split)

train_df.to_csv(train_output, header=None)
test_df.to_csv(test_output, header=None)

FileNotFoundError: [Errno 2] No such file or directory: '--ip=127.0.0.1'

Now we create the first data flow in which we:
- create a named step (`-n prepare`)
- pass parameters (`-p prepare.seed,prepare.split`)
- pass dependencies (`-d prepare.py -d data/adult.data`)
- indicate the output (`-o data/prepared/`)
- run the script and pass parameter values

In [111]:
%%bash

dvc run -n prepare \
    -p prepare.seed,prepare.split \
    -d prepare.py -d data/adult.data \
    -o data/prepared \
    python3 prepare.py data/adult.data

Running stage 'prepare':
> python3 prepare.py data/adult.data
Creating 'dvc.yaml'
Adding stage 'prepare' in 'dvc.yaml'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock dvc.yaml data/.gitignore

To enable auto staging, run:

	dvc config core.autostage true


As the result, we observe output files and a special `dvc.yaml` file with human-readable description of the data flow configuration.

In [112]:
%%bash

cat dvc.yaml

stages:
  prepare:
    cmd: python3 prepare.py data/adult.data
    deps:
    - data/adult.data
    - prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared


In [113]:
%%bash 

ls -al data/prepared/

total 7816
drwxr-xr-x  4 dominikludwiczak  staff      128 May  8 09:25 .
drwxr-xr-x  8 dominikludwiczak  staff      256 May  8 09:25 ..
-rw-r--r--  1 dominikludwiczak  staff  1000336 May  8 09:25 test.csv
-rw-r--r--  1 dominikludwiczak  staff  2995290 May  8 09:25 train.csv


The second step is to add to the data flow data transformation. We will re-code all categorical attributes using [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder) and we will compute feature interactions using [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures). This last class uses the `degree` parameter. Update the parameter file to account for the second step.

```
prepare:
  split: 0.75
  seed: 42
featurize:
  degree: 2
```

Create the `featurize.py` file.

In [None]:
import pandas as pd
import numpy as np
import yaml
import sys
import pickle

from pathlib import Path
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures

params = yaml.safe_load(open('params.yaml'))['featurize']
degree = params['degree']

input_dir = sys.argv[1]
output_dir = sys.argv[2]

Path(output_dir).mkdir(exist_ok=True)

train_file = Path(input_dir) / 'train.csv'
test_file = Path(input_dir) / 'test.csv'

col_names = [
        'age',
        'workclass',
        'weight',
        'education',
        'edu-num',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'capital-gain',
        'capital-loss',
        'hours-per-week',
        'native-country',
        'class'
]

train_df = pd.read_csv(train_file, sep=',', names=col_names)
test_df = pd.read_csv(test_file, sep=',', names=col_names)

train_df = train_df.apply(LabelEncoder().fit_transform)
test_df = test_df.apply(LabelEncoder().fit_transform)

poly = PolynomialFeatures(degree=degree, interaction_only=True)

train_y = train_df['class']
test_y = test_df['class']

train_df = train_df.drop('class', axis=1)
test_df = test_df.drop('class', axis=1)

train_df = np.column_stack((poly.fit_transform(train_df), train_y))
test_df = np.column_stack((poly.fit_transform(test_df), test_y))

train_output = Path(output_dir) / 'train.p'
test_output = Path(output_dir) / 'test.p'

with open(train_output, 'wb') as f:
    pickle.dump(train_df, f)

with open(test_output, 'wb') as f:
    pickle.dump(test_df, f)


The data flow can be executed by running the following command.

In [114]:
%%bash

dvc run -n featurize \
    -p featurize.degree \
    -d featurize.py -d data/prepared/ \
    -o data/features \
    python3 featurize.py data/prepared/ data/features/

Running stage 'featurize':
> python3 featurize.py data/prepared/ data/features/
Adding stage 'featurize' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/.gitignore dvc.lock dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


In order not to loose the results of our work we should record data flow steps in the `git` repo.

In [117]:
%%bash

git add data/.gitignore dvc.lock dvc.yaml
git commit -m 'Added preparation and featurization steps to data pipeline'

[main 908e377] Added preparation and featurization steps to data pipeline
 3 files changed, 60 insertions(+)
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml


The third step is to run model training. We will use a simple script with Random Forest, and we will use two parameters: the number of trees in the forest and the maximum depth of each tree. Change the parameter file in the following way:

```
prepare:
  split: 0.75
  seed: 42
featurize:
  degree: 2
train:
  max_depth: 2
  n_estimators: 5
```

Create the `train.py` file:

In [None]:
import sys
import yaml
import pickle

from pathlib import Path
from sklearn.ensemble import RandomForestClassifier

params = yaml.safe_load(open('params.yaml'))['train']
max_depth = params['max_depth']
n_estimators = params['n_estimators']

input_dir = sys.argv[1]
output_dir = sys.argv[2]

Path(output_dir).mkdir(exist_ok=True)

train_file = Path(input_dir) / 'train.p'
model_file = Path(output_dir) / 'model.p'

with open(train_file, 'rb') as f:
    train_df = pickle.load(f)

X = train_df[:, :-1]
y = train_df[:, -1]

clf = RandomForestClassifier(
    n_estimators=n_estimators,
    max_depth=max_depth
)
clf.fit(X, y)

with open(model_file, 'wb') as f:
    pickle.dump(clf, f)


As you can see, the script expects two parameters to be passed via the command line (the input directory with the data and the output directory to store the results of the script). To add the training step to the data flow execute the following command:

In [142]:
%%bash

dvc run -n train --force\
    -p train.max_depth,train.n_estimators \
    -d train.py -d data/features/ \
    -o data/models/ \
    python3 train.py data/features/ data/models/

Stage 'train' is cached - skipping run, checking out outputs
Modifying stage 'train' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


As usual we record the changes in the data flow in `git`.

In [143]:
%%bash

git add data/.gitignore dvc.lock dvc.yaml
git commit -m 'Added training step to data pipeline'

[main cb9fde3] Added training step to data pipeline
 1 file changed, 3 insertions(+), 3 deletions(-)


Why have we created the `dvc.yaml` file? At the first glance it might seem overly complicated. But this is where `dvc` truly shines, the presence of the full definition of the data flow allows for full reproducibilty using a single command.

In [146]:
%%bash

dvc repro

'data/adult.data.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Data and pipelines are up to date.


Let's change a single parameter in the `train` section (e.g., change the number of trees in the RandomForest) and re-run the experiment. Which steps have been executed? Change another parameter in the `prepare` section (e.g. the way train/test split is performed) and re-run the experiment once again. Has something changed?

If you want to visualize the data flow, use the `dvc dag` command.

## Experiments

The last element of the `dvc` framework that we will examine is the way experiments are executed. Before we start experimenting, we need to create a `evaluate.py` file with the code to evaluate the results of training.

In [None]:
import sys
import os
import pickle
import json

from sklearn.metrics import precision_recall_curve
import sklearn.metrics as metrics
from pathlib import Path

model_file = Path(sys.argv[1]) / 'model.p'
test_file = Path(sys.argv[2]) / 'test.p'

scores_file = sys.argv[3]
plots_file = sys.argv[4]

with open(model_file, 'rb') as f:
    model = pickle.load(f)

with open(test_file, 'rb') as f:
    test_df = pickle.load(f)

X = test_df[:,:-1]
y = test_df[:,-1]

predictions_by_class = model.predict_proba(X)
y_pred = predictions_by_class[:, 1]

precision, recall, thresholds = precision_recall_curve(y, y_pred)
auc = metrics.auc(recall, precision)

with open(scores_file, 'w') as f:
    json.dump({'auc': auc}, f)

with open(plots_file, 'w') as f:
    json.dump({'prc': [{
            'precision': p,
            'recall': r,
            'threshold': t
        } for p, r, t in zip(precision, recall, thresholds)
    ]}, f)

Tym razem dodanie kroku ewaluacji do potoku będzie bardziej skomplikowane, ponieważ musimy też uwzględnić specjalny plik do przechowywania wartości metryk oraz plik przechowywania danych na potrzeby wykresów. 

This time adding a step to the data flow is more complicated, because we have to include a special file to store the metrics associated with experiment runs, and an additional file to store the visualizations.

In [148]:
%%bash

dvc run -n evaluate \
    -d evaluate.py -d data/models/ -d data/features/ \
    -M scores.json \
    --plots-no-cache prc.json \
    python3 evaluate.py data/models/ data/features/ scores.json prc.json

Running stage 'evaluate':
> python3 evaluate.py data/models/ data/features/ scores.json prc.json
Adding stage 'evaluate' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock

To enable auto staging, run:

	dvc config core.autostage true


Let's see at the final data flow configuration file.

In [149]:
%%bash

cat dvc.yaml

stages:
  prepare:
    cmd: python3 prepare.py data/adult.data
    deps:
    - data/adult.data
    - prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
  featurize:
    cmd: python3 featurize.py data/prepared/ data/features/
    deps:
    - data/prepared/
    - featurize.py
    params:
    - featurize.degree
    outs:
    - data/features
  train:
    cmd: python3 train.py data/features/ data/models/
    deps:
    - data/features/
    - train.py
    params:
    - train.max_depth
    - train.n_estimators
    outs:
    - data/models/
  evaluate:
    cmd: python3 evaluate.py data/models/ data/features/ scores.json prc.json
    deps:
    - data/features/
    - data/models/
    - evaluate.py
    metrics:
    - scores.json:
        cache: false
    plots:
    - prc.json:
        cache: false


Don't forget to record all the changes in `git`.

In [150]:
%%bash

git add dvc.lock dvc.yaml
git commit -m 'Added evaluation step to data pipeline'

[main f796dc4] Added evaluation step to data pipeline
 2 files changed, 33 insertions(+)


As the result of the data flow a new file `scores.json` has been added. This file contains the AUROC measure for the experiment run.

In [151]:
%%bash

cat scores.json

{"auc": 0.7025319326748718}

The `prc.json` file contains the information about the training (*precision-recall curve*). Let's add both files to the repository.

In [152]:
%%bash

git add scores.json prc.json
git commit -m 'Added evaluation metrics'

[main b174b22] Added evaluation metrics
 2 files changed, 2 insertions(+)
 create mode 100644 prc.json
 create mode 100644 scores.json


Run the experiment with changed parameters and let's see if these changes affect the metric. Change the `degree` parameter to 3 and change the `n_estimators` parameter to 25. Re-run the experiment.

In [153]:
%%bash 

dvc repro

'data/adult.data.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Running stage 'train':
> python3 train.py data/features/ data/models/
Updating lock file 'dvc.lock'

Running stage 'evaluate':
> python3 evaluate.py data/models/ data/features/ scores.json prc.json
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.


In [154]:
%%bash

dvc params diff

DVC failed to load some parameters for following revisions: 'HEAD'.


Path         Param               HEAD    workspace
params.yaml  featurize.degree    -       2
params.yaml  prepare.seed        -       42
params.yaml  prepare.split       -       0.75
params.yaml  train.max_depth     -       3
params.yaml  train.n_estimators  -       25


In [155]:
%%bash 

dvc metrics diff

Path         Metric    HEAD     workspace    Change
scores.json  auc       0.70253  0.70322      0.00069


In [156]:
%%bash

dvc plots diff -x recall -y precision

file:///Users/dominikludwiczak/Library/CloudStorage/OneDrive-put.poznan.pl/semestr%204/Data%20Mining/Clustering/dvc-tutorial/dvc_plots/index.html
