# Preparation


See README.md for installation and setup instructions.

**References**


https://dvc.org/doc/tutorial/define-ml-pipeline - used as example

## Initialize DVC

In [1]:
!dvc --version #1.1.2

1.1.2

In [2]:
# Initialize the DVC project
!dvc init -f


You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|              https://dvc.org/doc/user-guide/analytics               |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: https://dvc.org/doc
- Get help and share ideas: https://dvc.org/chat
- Star us on GitHub: https://github.com/iterative/dvc

In [3]:
%%bash
# First commit to Git

git add .
git commit -m "Initialize DVC"

[master bf5eab4] Initialize DVC
 8 files changed, 3025 insertions(+), 715 deletions(-)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvc/plots/confusion.json
 create mode 100644 .dvc/plots/default.json
 create mode 100644 .dvc/plots/scatter.json
 create mode 100644 .dvc/plots/smooth.json
 create mode 100644 tutorial_old.ipynb


### Files and Directories 

In [4]:
# List files and directories in `.dvc`, including hidden ones (the `-a` flag)
!ls -a .dvc

.          ..         .gitignore config     plots      tmp


In [5]:
# Print the contents of `.dvc/.gitignore`
!cat .dvc/.gitignore

/config.local
/tmp
/cache


# Control versions of data

In [6]:
# Download the data (the wget utility will most likely not work on Windows)
!wget -P data/ https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv

# Show the disk space used by the files in `data`
!du -sh data/*

--2020-07-01 15:07:58--  https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.244.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.244.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3716 (3,6K) [text/plain]
Saving to: 'data/iris.csv'


2020-07-01 15:07:58 (8,77 MB/s) - 'data/iris.csv' saved [3716/3716]

4,0K	data/iris.csv


In [7]:
# Take a look at the data

import pandas as pd

df = pd.read_csv('data/iris.csv')
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## Add file under DVC control

In [8]:
%%bash
# Put the data under DVC control
dvc add data/iris.csv
# Show the size of each file in `data` (`-s` prints one total per argument, `-h` uses human-readable units)
du -sh data/*


To track the changes with git, run:

	git add data/.gitignore data/iris.csv.dvc
4,0K	data/iris.csv
4,0K	data/iris.csv.dvc


In [9]:
!git status -s data/

?? data/.gitignore
?? data/iris.csv.dvc


In [10]:
%%bash

git add .
git commit -m "Add a source dataset"

[master d313c36] Add a source dataset
 2 files changed, 4 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/iris.csv.dvc


## What is a DVC-file?

Data file internals


> If you take a look at the DVC-file, you will see that only outputs are defined in outs.
> In this file, only one output is defined. The output contains the data file path in the repository and md5 cache.
> This md5 cache determines a location of the actual content file in DVC cache directory .dvc/cache
>> Output from DVC-files defines the relationship between the data file path in a repository and the path in a cache directory. See also DVC File Format



(c) dvc.org https://dvc.org/doc/tutorial/define-ml-pipeline

In [11]:
# Print the contents of `data/iris.csv.dvc`
!cat data/iris.csv.dvc

outs:
- md5: 57fce90c81521889c736445f058c4838
  path: iris.csv


In [12]:
# Show the size of the files in `.dvc/cache` (`-s` prints one total per argument, `-h` uses human-readable units)
!du -sh .dvc/cache/*/*

4,0K	.dvc/cache/57/fce90c81521889c736445f058c4838
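
The two outputs above suggest how the `md5` value maps to a cache location: the first two hex characters become a directory name and the rest becomes the file name. A minimal sketch of that mapping follows. The helper names `cache_path_for` and `file_md5` are hypothetical, and the plain byte-level md5 computed here is an assumption: DVC may normalize line endings of text files before hashing, so the two values are not guaranteed to match for every file.

```python
import hashlib
from pathlib import Path


def cache_path_for(md5_hex, cache_dir=".dvc/cache"):
    """Map an md5 hex digest to <cache_dir>/<first 2 chars>/<remaining chars>."""
    return (Path(cache_dir) / md5_hex[:2] / md5_hex[2:]).as_posix()


def file_md5(path, chunk_size=1 << 20):
    """Plain md5 of the file bytes (assumption: no line-ending normalization)."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# md5 value taken from data/iris.csv.dvc above
print(cache_path_for("57fce90c81521889c736445f058c4838"))
# prints .dvc/cache/57/fce90c81521889c736445f058c4838
```

The printed path is the same one that `du -sh .dvc/cache/*/*` lists above.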


# Create ML pipeline

Stages 
- extract features 
- split dataset 
- train 
- evaluate 


Starting with version 1.0, DVC stores pipeline stage definitions in the `dvc.yaml` file.

## Add feature extraction stage

In [13]:
# Create a stage (pipeline) that computes the features
!dvc run -n stage_feature_extraction \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    python src/featurization.py

Running stage 'stage_feature_extraction' with command:                          
	python src/featurization.py
Creating 'dvc.yaml'                                                             
Adding stage 'stage_feature_extraction' in 'dvc.yaml'
Generating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml data/.gitignore dvc.lock

In [14]:
# View the featurization script
# !cat src/featurization.py

In [15]:
# List the files in the working directory
!ls

LICENSE            dvc.lock           src
README.md          dvc.yaml           tutorial.ipynb
data               requirements.txt   tutorial_old.ipynb


In [16]:
# Contents of the stage file
!cat dvc.yaml

stages:
  stage_feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv


In [17]:
import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [18]:
!git status -s

 M data/.gitignore
?? dvc.lock
?? dvc.yaml


In [19]:
%%bash
git add .
git commit -m "Add stage_features_extraction"

[master 12b885c] Add stage_features_extraction
 3 files changed, 19 insertions(+)
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml


## Add split train/test stage

In [20]:
# Create a stage that splits the dataset into train/test with the `test_size` parameter
!dvc run -n stage_split_dataset \
    -d src/split_dataset.py \
    -d data/iris_featurized.csv \
    -o data/train.csv \
    -o data/test.csv \
    python src/split_dataset.py --test_size 0.4

Running stage 'stage_split_dataset' with command:                               
	python src/split_dataset.py --test_size 0.4
Adding stage 'stage_split_dataset' in 'dvc.yaml'                                
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock data/.gitignore

In [21]:
!cat dvc.yaml

stages:
  stage_feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  stage_split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
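
The script `src/split_dataset.py` is not shown in this notebook, so the following is only a sketch of what `--test_size 0.4` implies, assuming the value is the fraction of rows held out for testing. The `split_rows` helper and the fixed seed are hypothetical. The Iris dataset has 150 rows, so a 0.4 split leaves 60 test rows, which agrees with the 60 samples counted in the confusion matrices reported later in this notebook.

```python
import random


def split_rows(rows, test_size, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(150))  # stand-in for the 150 Iris rows
train, test = split_rows(rows, test_size=0.4)
print(len(train), len(test))  # prints 90 60
```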


## Add train stage

In [22]:
# Create a stage that trains the model
!dvc run -n stage_train \
    -d src/train.py \
    -d data/train.csv \
    -o data/model.joblib \
    python src/train.py

Running stage 'stage_train' with command:                                       
	python src/train.py
Adding stage 'stage_train' in 'dvc.yaml'                                        
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/.gitignore dvc.yaml dvc.lock

In [23]:
!cat dvc.yaml

stages:
  stage_feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  stage_split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
  stage_train:
    cmd: python src/train.py
    deps:
    - data/train.csv
    - src/train.py
    outs:
    - data/model.joblib


## Add evaluate stage

In [30]:
# Create a stage that evaluates the model.
# Note: the run below fails because `src/evaluate.py` did not produce `data/eval.json`;
# the metrics shown later in this notebook come from `data/eval.txt`, so the `-m` path
# probably needs to match the file that the script actually writes.
!dvc run -f -n stage_evaluate \
    -d src/train.py \
    -d src/evaluate.py \
    -d data/test.csv \
    -d data/model.joblib \
    -m data/eval.json \
    python src/evaluate.py

Running stage 'stage_evaluate' with command:
	python src/evaluate.py
ERROR: output 'data/eval.json' does not exist

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

In [24]:
!cat dvc.yaml

stages:
  stage_feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  stage_split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
  stage_train:
    cmd: python src/train.py
    deps:
    - data/train.csv
    - src/train.py
    outs:
    - data/model.joblib


In [27]:
# !dvc dag 

# Metrics tracking

In [28]:
!dvc metrics show

ERROR: failed to show metrics - no metric files in this repository. Use `-m/-M` options for `dvc run` to mark stage outputs as  metrics.

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

## Commit dvc pipelines

In [36]:
!git status -s

 M data/.gitignore
 M tutorial.ipynb
?? stage_evaluate.dvc
?? stage_split_dataset.dvc
?? stage_train.dvc


In [37]:
%%bash
git add .
git commit -m "Add pipelines"

[dvc-tutorial c8fad83] Add pipelines
 5 files changed, 458 insertions(+), 36 deletions(-)
 create mode 100644 stage_evaluate.dvc
 create mode 100644 stage_split_dataset.dvc
 create mode 100644 stage_train.dvc


# Reproducibility

## How does it work?

> The most exciting part of DVC is reproducibility.
>> Reproducibility is the time you are getting benefits out of DVC instead of spending time defining the ML pipelines.

> DVC tracks all the dependencies, which helps you iterate on ML models faster without thinking what was affected by your last change.
>> In order to track all the dependencies, DVC finds and reads ALL the DVC-files in a repository and builds a dependency graph (DAG) based on these files.

> This is one of the differences between DVC reproducibility and traditional Makefile-like build automation tools (Make, Maven, Ant, Rakefile etc). It was designed in such a way to localize specification of DAG nodes.
If you run repro on any created DVC-file from our repository, nothing happens because nothing was changed in the defined pipeline.

(c) dvc.org https://dvc.org/doc/tutorial/reproducibility
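
The dependency graph described in the quote can be illustrated with the three stages from the `dvc.yaml` shown earlier (`stage_evaluate` is left out because it does not appear in that file). This is a sketch of the idea, not DVC's implementation: a stage depends on another stage when one of its `deps` is listed in the other's `outs`. The helper names are hypothetical, and `graphlib` requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# Stage definitions copied from the dvc.yaml shown earlier in this notebook
STAGES = {
    "stage_feature_extraction": {
        "deps": ["data/iris.csv", "src/featurization.py"],
        "outs": ["data/iris_featurized.csv"],
    },
    "stage_split_dataset": {
        "deps": ["data/iris_featurized.csv", "src/split_dataset.py"],
        "outs": ["data/test.csv", "data/train.csv"],
    },
    "stage_train": {
        "deps": ["data/train.csv", "src/train.py"],
        "outs": ["data/model.joblib"],
    },
}


def build_stage_graph(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


def affected_stages(stages, changed_path):
    """List, in execution order, the stages that must re-run after a file changes."""
    graph = build_stage_graph(stages)
    order = list(TopologicalSorter(graph).static_order())
    affected = {name for name, spec in stages.items() if changed_path in spec["deps"]}
    for name in order:  # upstream stages come first, so the effect propagates down
        if graph[name] & affected:
            affected.add(name)
    return [name for name in order if name in affected]


order = list(TopologicalSorter(build_stage_graph(STAGES)).static_order())
print(order)
# prints ['stage_feature_extraction', 'stage_split_dataset', 'stage_train']
print(affected_stages(STAGES, "src/train.py"))
# prints ['stage_train']
```

Under this model, a change to `src/featurization.py` affects every stage, while a change to `src/train.py` affects only `stage_train`.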

In [38]:
# Nothing to reproduce
!dvc repro stage_evaluate.dvc

+------------------------------------------+
|                                          |
|     Update available 0.80.0 -> 1.0.1     |
|     Run `pip install dvc --upgrade`      |
|                                          |
+------------------------------------------+

Data and pipelines are up to date.

## Add features



### Create new experiment branch

Before editing `src/featurization.py`, create and check out a new branch __ratio_features__.

In [39]:
# create new branch

!git checkout -b ratio_features
!git branch

Switched to a new branch 'ratio_features'
  dvc-tutorial
  master
* ratio_features


### Update featurization.py

In __src/featurization.py__, uncomment the lines:

    features['sepal_length_to_sepal_width'] = features['sepal_length'] / features['sepal_width']
    features['petal_length_to_petal_width'] = features['petal_length'] / features['petal_width']

In [43]:
!cat src/featurization.py

import pandas as pd


def get_features(dataset):

    features = dataset.copy()

    # uncomment for step 5.2  Add features
    features['sepal_length_to_sepal_width'] = features['sepal_length'] / features['sepal_width']
    features['petal_length_to_petal_width'] = features['petal_length'] / features['petal_width']

    return features


if __name__ == '__main__':

    dataset = pd.read_csv('data/iris.csv')

    features  = get_features(dataset)
    features.to_csv('data/iris_featurized.csv', index=False)


## Reproduce pipeline 

In [42]:
!dvc repro stage_evaluate.dvc

Running command:
	python src/featurization.py
Running command:
	python src/split_dataset.py --test_size 0.4
Running command:
	python src/train.py
Running command:
	python src/evaluate.py

To track the changes with git, run:

	git add stage_evaluate.dvc stage_split_dataset.dvc stage_feature_extraction.dvc stage_train.dvc

In [44]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_to_sepal_width,petal_length_to_petal_width
0,5.1,3.5,1.4,0.2,setosa,1.457143,7.0
1,4.9,3.0,1.4,0.2,setosa,1.633333,7.0
2,4.7,3.2,1.3,0.2,setosa,1.46875,6.5
3,4.6,3.1,1.5,0.2,setosa,1.483871,7.5
4,5.0,3.6,1.4,0.2,setosa,1.388889,7.0


## Compare metrics for all runs (experiments)

In [45]:
# metrics for the current pipeline run

!dvc metrics show

	data/eval.txt: {"f1_score": 0.8084886128364389, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 9, 0], [0, 10, 18]]}}

In [46]:
# show metrics for all committed pipeline runs (all branches)

!dvc metrics show -a

working tree:
	data/eval.txt: {"f1_score": 0.8084886128364389, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 9, 0], [0, 10, 18]]}}
dvc-tutorial, ratio_features:
	data/eval.txt: {"f1_score": 0.7861833464670345, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 8, 0], [0, 11, 18]]}}

In [51]:
!dvc metrics diff

## Commit new results

In [None]:
!git status -s

In [None]:
!git add .
!git commit -m "New features experiment"

# Checkout (start a new experiment)

- in case the new features do not improve the results
- or we want to improve the model by changing hyperparameters (with the OLD dataset)

## Checkout code and data files 

In [None]:
%%bash
# Switch back to the original branch
git checkout dvc-tutorial
dvc checkout

In [None]:
!git branch

In [None]:
!dvc metrics show

In [None]:
# Nothing to reproduce since code was checked out by `git checkout`
# and data files were checked out by `dvc checkout`
!dvc repro stage_evaluate.dvc
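
Why there is nothing to reproduce can be sketched as a checksum comparison. The sketch assumes that DVC decides what to re-run by comparing each dependency's current checksum with the one recorded at the last run (in `dvc.lock` for DVC 1.x). The helper names `md5_of` and `changed_deps` are hypothetical, and a temporary file stands in for a real dependency.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """md5 hex digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def changed_deps(recorded):
    """Return the paths whose current md5 differs from the recorded one.

    `recorded` maps a path to the md5 stored at the last run;
    a missing file counts as changed.
    """
    changed = []
    for path, old_md5 in recorded.items():
        if not Path(path).is_file() or md5_of(path) != old_md5:
            changed.append(path)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "featurization.py"
    dep.write_text("features = dataset.copy()\n")
    recorded = {str(dep): md5_of(dep)}  # snapshot, playing the role of a lock file

    unchanged = changed_deps(recorded)  # nothing edited yet
    dep.write_text("features = dataset.copy()  # edited\n")
    modified = changed_deps(recorded)  # the edit changes the checksum

print(unchanged, len(modified))  # prints [] 1
```

After `git checkout` and `dvc checkout`, code and data both match the recorded state, so a check of this kind finds no changed dependency.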

In [None]:
!dvc metrics show

In [None]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

## Tune the model

In [None]:
# create new branch for experiment

!git checkout -b tuning
!git branch

### Change parameters of classifier (LogisticRegression)

In __src/train.py__, in the `LogisticRegression` constructor:

* change the `C` parameter to 0.1

You should end up with:

```python
clf = LogisticRegression(C=0.1, solver='newton-cg', multi_class='multinomial', max_iter=100)
```

### Reproduce pipelines

In [None]:
# re-run pipeline 

!dvc repro stage_evaluate.dvc 

In [None]:
!cat data/eval.txt

In [None]:
!dvc metrics show -a
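
To compare runs numerically, the JSON payloads printed by `dvc metrics show -a` can be parsed directly. The sketch below copies the two payloads from the "Compare metrics for all runs (experiments)" section verbatim. The variable names and the `summarize` helper are illustrative and not part of DVC. Accuracy is computed as the diagonal of the confusion matrix divided by its total, which does not depend on whether rows hold true or predicted labels.

```python
import json

# Committed run (branches dvc-tutorial and ratio_features), copied from the output above
BASELINE = (
    '{"f1_score": 0.7861833464670345, "confusion_matrix": {"classes": '
    '["setosa", "versicolor", "virginica"], '
    '"matrix": [[23, 0, 0], [0, 8, 0], [0, 11, 18]]}}'
)
# Working tree after adding the ratio features, copied from the output above
RATIO_FEATURES = (
    '{"f1_score": 0.8084886128364389, "confusion_matrix": {"classes": '
    '["setosa", "versicolor", "virginica"], '
    '"matrix": [[23, 0, 0], [0, 9, 0], [0, 10, 18]]}}'
)


def summarize(raw):
    """Extract f1, sample count, and accuracy from one metrics payload."""
    metrics = json.loads(raw)
    matrix = metrics["confusion_matrix"]["matrix"]
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return {
        "f1_score": metrics["f1_score"],
        "samples": total,
        "accuracy": correct / total,
    }


baseline = summarize(BASELINE)
candidate = summarize(RATIO_FEATURES)
delta = candidate["f1_score"] - baseline["f1_score"]
print(f"samples: {baseline['samples']} vs {candidate['samples']}")  # prints samples: 60 vs 60
print(f"f1 delta: {delta:+.4f}")  # prints f1 delta: +0.0223
```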

In [None]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

### Commit

In [None]:
%%bash

git add .
git commit -m "Tune model. C=0.1"

### Merge the model into dvc-tutorial

In [None]:
%%bash

git checkout dvc-tutorial
git merge tuning

# Share data

## Set up remote storage (e.g. cloud)

In [None]:
# Create new remote

!dvc remote add -d local /tmp/dvc

In [None]:
# as you can see, .dvc/config is changed

!git status -s

In [None]:
# check config file 

!cat .dvc/config

In [None]:
%%bash

git add .
git commit -m "Add remote storage"

## Push data to remote

In [None]:
# Push data to remote

!dvc push

## Pull data from remote

In [None]:
!dvc pull