# Preparation


See the README.md file for install and setup instructions.

## Initialize DVC

**References**


https://dvc.org/doc/tutorial/define-ml-pipeline (used as the example for this notebook)

In [154]:
!dvc init -f

Adding '.dvc/state' to '.dvc/.gitignore'.
Adding '.dvc/lock' to '.dvc/.gitignore'.
Adding '.dvc/config.local' to '.dvc/.gitignore'.
Adding '.dvc/updater' to '.dvc/.gitignore'.
Adding '.dvc/updater.lock' to '.dvc/.gitignore'.
Adding '.dvc/state-journal' to '.dvc/.gitignore'.
Adding '.dvc/state-wal' to '.dvc/.gitignore'.
Adding '.dvc/cache' to '.dvc/.gitignore'.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|              https://dvc.org/doc/user-guide/analytics               |
|                                                                     |
+--------

In [155]:
%%bash

git add .
git commit -m "Initialize DVC"

[dvc-tutorial 3705786] initialize DVC
 3 files changed, 93 insertions(+), 85 deletions(-)


### Files and Directories 

In [156]:
!ls -a .dvc

.            .gitignore   config       updater.lock
..           cache        updater


In [157]:
!cat .dvc/.gitignore

/state
/lock
/config.local
/updater
/updater.lock
/state-journal
/state-wal
/cache

# Version control for data

In [172]:
# Get data 

!wget -P data/ https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
!du -sh data/*

--2019-06-08 19:54:54--  https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
Resolving raw.githubusercontent.com... 151.101.12.133
Connecting to raw.githubusercontent.com|151.101.12.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3716 (3.6K) [text/plain]
Saving to: ‘data/iris.csv.5’


2019-06-08 19:54:55 (31.6 MB/s) - ‘data/iris.csv.5’ saved [3716/3716]

4.0K	data/eval.txt
4.0K	data/iris.csv
4.0K	data/iris.csv.1
4.0K	data/iris.csv.2
4.0K	data/iris.csv.3
4.0K	data/iris.csv.4
4.0K	data/iris.csv.5
4.0K	data/iris.csv.dvc
 12K	data/iris_featurized.csv
4.0K	data/model.joblib
4.0K	data/test.csv
8.0K	data/train.csv


In [167]:
# Look at the data

import pandas as pd

df = pd.read_csv('data/iris.csv')
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## Add file under DVC control

In [174]:
%%bash

dvc add data/iris.csv
git status -s data/
du -sh data/*

Stage is cached, skipping.
4.0K	data/eval.txt
4.0K	data/iris.csv
4.0K	data/iris.csv.1
4.0K	data/iris.csv.2
4.0K	data/iris.csv.3
4.0K	data/iris.csv.4
4.0K	data/iris.csv.5
4.0K	data/iris.csv.dvc
 12K	data/iris_featurized.csv
4.0K	data/model.joblib
4.0K	data/test.csv
8.0K	data/train.csv


In [176]:
!git status -s data/

In [177]:
%%bash

git add .
git commit -m "Add a source dataset"

[dvc-tutorial 8843fd2] Add a source dataset
 1 file changed, 71 insertions(+), 61 deletions(-)


## What is a DVC-file?

Data file internals

> If you take a look at the DVC-file, you will see that only outputs are defined in `outs`.
> In this file, only one output is defined. The output contains the data file path in the repository and md5 cache.
> This md5 cache determines a location of the actual content file in DVC cache directory `.dvc/cache`.
>
> Output from DVC-files defines the relationship between the data file path in a repository and the path in a cache directory. See also DVC File Format.



(c) dvc.org https://dvc.org/doc/tutorial/define-ml-pipeline

In [16]:
!cat data/iris.csv.dvc

md5: 1cff89878034249db68ba6046d5b49a9
wdir: ..
outs:
- md5: 57fce90c81521889c736445f058c4838
  path: data/iris.csv
  cache: true
  metric: false
  persist: false


In [178]:
!du -sh .dvc/cache/*/*

4.0K	.dvc/cache/57/fce90c81521889c736445f058c4838
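
The cache entry listed above lines up with the `md5` recorded in `data/iris.csv.dvc`: the first two hex characters name a subdirectory and the remaining thirty name the file. Below is a minimal sketch of that mapping, assuming only the layout visible in this notebook's output. The helper name `cache_path` is hypothetical and is not part of the DVC API.

```python
from pathlib import PurePosixPath


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> str:
    """Map an md5 from a DVC-file to the cache location seen in this notebook."""
    if len(md5) != 32:
        raise ValueError("expected a 32-character md5 hex digest")
    # First two hex characters name the subdirectory, the rest name the file.
    return str(PurePosixPath(cache_dir) / md5[:2] / md5[2:])


# md5 values copied from data/iris.csv.dvc and stage_feature_extraction.dvc.
iris_cache = cache_path("57fce90c81521889c736445f058c4838")
features_cache = cache_path("cd9e208c0232da2fb80b4c927da35dbb")
print(iris_cache)
print(features_cache)
```

The second path should match the `Saving 'data/iris_featurized.csv' to ...` message printed when the feature extraction stage runs.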


# Create ML pipeline

Stages 
- extract features 
- split dataset 
- train 
- evaluate 


## Add feature extraction stage

In [180]:
!dvc run -f stage_feature_extraction.dvc \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    python src/featurization.py

Running command:
	python src/featurization.py
Saving 'data/iris_featurized.csv' to '.dvc/cache/cd/9e208c0232da2fb80b4c927da35dbb'.
Saving information to 'stage_feature_extraction.dvc'.

To track the changes with git run:

	git add stage_feature_extraction.dvc

In [181]:
!ls

README.md                    stage_feature_extraction.dvc
data                         tutorial.ipynb
requirements.txt             venv
src


In [182]:
!cat stage_feature_extraction.dvc

md5: eec5e74d81a441ff02716cadd3779961
cmd: python src/featurization.py
wdir: .
deps:
- md5: 5bce3d2f01813491283efeb24789f97a
  path: src/featurization.py
- md5: 57fce90c81521889c736445f058c4838
  path: data/iris.csv
outs:
- md5: cd9e208c0232da2fb80b4c927da35dbb
  path: data/iris_featurized.csv
  cache: true
  metric: false
  persist: false


In [189]:
import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_to_sepal_width,petal_length_to_petal_width
0,5.1,3.5,1.4,0.2,setosa,1.457143,7.0
1,4.9,3.0,1.4,0.2,setosa,1.633333,7.0
2,4.7,3.2,1.3,0.2,setosa,1.46875,6.5
3,4.6,3.1,1.5,0.2,setosa,1.483871,7.5
4,5.0,3.6,1.4,0.2,setosa,1.388889,7.0


In [191]:
!git status -s

D  pipeline_evaluate.dvc
D  pipeline_featurization.dvc
D  pipeline_split_dataset.dvc
D  pipeline_train.dvc
 M tutorial.ipynb
?? stage_evaluate.dvc
?? stage_feature_extraction.dvc
?? stage_split_dataset.dvc
?? stage_train.dvc


In [192]:
%%bash
git add .
git commit -m "Add stage_features_extraction"

[dvc-tutorial 140e58d] Add stage_features_extraction
 5 files changed, 247 insertions(+), 80 deletions(-)
 rename pipeline_evaluate.dvc => stage_evaluate.dvc (100%)
 rename pipeline_featurization.dvc => stage_feature_extraction.dvc (79%)
 rename pipeline_split_dataset.dvc => stage_split_dataset.dvc (100%)
 rename pipeline_train.dvc => stage_train.dvc (100%)


## Add split train/test stage

In [183]:
!dvc run -f stage_split_dataset.dvc \
    -d src/split_dataset.py \
    -d data/iris_featurized.csv \
    -o data/train.csv \
    -o data/test.csv \
    python src/split_dataset.py 0.4

Running command:
	python src/split_dataset.py 0.4
Saving 'data/train.csv' to '.dvc/cache/87/43ef62798f623fbaae4401f4aab654'.
Saving 'data/test.csv' to '.dvc/cache/3d/40f0c85187dda2cd9bf58b3e916630'.
Saving information to 'stage_split_dataset.dvc'.

To track the changes with git run:

	git add stage_split_dataset.dvc

In [184]:
!cat stage_split_dataset.dvc

md5: 2c0cd9e4926980b60a70eb58bc123727
cmd: python src/split_dataset.py 0.4
wdir: .
deps:
- md5: e111aa0fa66588bf06c5f716d11bcff5
  path: src/split_dataset.py
- md5: cd9e208c0232da2fb80b4c927da35dbb
  path: data/iris_featurized.csv
outs:
- md5: 8743ef62798f623fbaae4401f4aab654
  path: data/train.csv
  cache: true
  metric: false
  persist: false
- md5: 3d40f0c85187dda2cd9bf58b3e916630
  path: data/test.csv
  cache: true
  metric: false
  persist: false


## Add train stage

In [185]:
!dvc run -f stage_train.dvc \
    -d src/train.py \
    -d data/train.csv \
    -o data/model.joblib \
    python src/train.py

Running command:
	python src/train.py
Saving 'data/model.joblib' to '.dvc/cache/b2/7070fdbd6a055a610f270c3f732a71'.
Saving information to 'stage_train.dvc'.

To track the changes with git run:

	git add stage_train.dvc

In [186]:
!cat stage_train.dvc

md5: 9c04ce24755b5e4c50b8050a312df8c1
cmd: python src/train.py
wdir: .
deps:
- md5: 57acac82e8be65927cf80a6ed0f089bc
  path: src/train.py
- md5: 8743ef62798f623fbaae4401f4aab654
  path: data/train.csv
outs:
- md5: b27070fdbd6a055a610f270c3f732a71
  path: data/model.joblib
  cache: true
  metric: false
  persist: false


## Add evaluate stage

In [187]:
!dvc run -f stage_evaluate.dvc \
    -d src/train.py \
    -d src/evaluate.py \
    -d data/test.csv \
    -d data/model.joblib \
    -m data/eval.txt \
    python src/evaluate.py

Running command:
	python src/evaluate.py
Saving 'data/eval.txt' to '.dvc/cache/a1/e2ca7bd1d5b4730c857fffc8941395'.
Saving information to 'stage_evaluate.dvc'.

To track the changes with git run:

	git add stage_evaluate.dvc

In [188]:
!cat stage_evaluate.dvc

md5: 1372a8796d77fd4c8a1d577a50f910c6
cmd: python src/evaluate.py
wdir: .
deps:
- md5: 57acac82e8be65927cf80a6ed0f089bc
  path: src/train.py
- md5: 9b394d26e9427759256195b47917028b
  path: src/evaluate.py
- md5: 3d40f0c85187dda2cd9bf58b3e916630
  path: data/test.csv
- md5: b27070fdbd6a055a610f270c3f732a71
  path: data/model.joblib
outs:
- md5: a1e2ca7bd1d5b4730c857fffc8941395
  path: data/eval.txt
  cache: true
  metric: true
  persist: false


# Metrics tracking

In [193]:
!dvc metrics show

	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}

## Commit DVC pipelines

In [170]:
!git status -s

In [169]:
%%bash
git add .
git commit -m "Add pipelines"

[dvc-tutorial 1bf2912] Add pipelines
 1 file changed, 139 insertions(+), 77 deletions(-)


# Reproducibility

## How does it work?

> The most exciting part of DVC is reproducibility.
>
> Reproducibility is the time you are getting benefits out of DVC instead of spending time defining the ML pipelines.
>
> DVC tracks all the dependencies, which helps you iterate on ML models faster without thinking what was affected by your last change.
>
> In order to track all the dependencies, DVC finds and reads ALL the DVC-files in a repository and builds a dependency graph (DAG) based on these files.
>
> This is one of the differences between DVC reproducibility and traditional Makefile-like build automation tools (Make, Maven, Ant, Rakefile etc). It was designed in such a way to localize specification of DAG nodes.
>
> If you run repro on any created DVC-file from our repository, nothing happens because nothing was changed in the defined pipeline.

(c) dvc.org https://dvc.org/doc/tutorial/reproducibility
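
To make the dependency graph idea concrete, here is a rough sketch that rebuilds the DAG for this notebook's pipeline from the `deps` and `outs` of the DVC-files shown earlier. It is an illustration only, not how DVC is implemented: the `STAGES` dictionary is typed in by hand, and the helpers `build_graph` and `affected_by` are hypothetical names. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies and outputs copied by hand from the DVC-files in this notebook.
STAGES = {
    "data/iris.csv.dvc": {"deps": [], "outs": ["data/iris.csv"]},
    "stage_feature_extraction.dvc": {
        "deps": ["src/featurization.py", "data/iris.csv"],
        "outs": ["data/iris_featurized.csv"],
    },
    "stage_split_dataset.dvc": {
        "deps": ["src/split_dataset.py", "data/iris_featurized.csv"],
        "outs": ["data/train.csv", "data/test.csv"],
    },
    "stage_train.dvc": {
        "deps": ["src/train.py", "data/train.csv"],
        "outs": ["data/model.joblib"],
    },
    "stage_evaluate.dvc": {
        "deps": ["src/train.py", "src/evaluate.py",
                 "data/test.csv", "data/model.joblib"],
        "outs": ["data/eval.txt"],
    },
}


def build_graph(stages):
    """Return {stage: set of upstream stages} by matching deps to outs."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }


def affected_by(path, stages):
    """Stages that depend, directly or transitively, on the given file."""
    graph = build_graph(stages)
    hit = {name for name, s in stages.items() if path in s["deps"]}
    grew = True
    while grew:  # keep propagating downstream until nothing new is added
        grew = False
        for name, upstream in graph.items():
            if name not in hit and upstream & hit:
                hit.add(name)
                grew = True
    return hit


# Upstream stages come first, which is the order a reproduction would follow.
order = list(TopologicalSorter(build_graph(STAGES)).static_order())
print(order)
print(sorted(affected_by("src/train.py", STAGES)))
```

The computed order matches the order in which `dvc repro stage_evaluate.dvc` reports the stages in this notebook, and a change to `src/train.py` touches only the train and evaluate stages.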

In [194]:
# Nothing to reproduce
!dvc repro stage_evaluate.dvc

Stage 'data/iris.csv.dvc' didn't change.
Stage 'stage_feature_extraction.dvc' didn't change.
Stage 'stage_split_dataset.dvc' didn't change.
Stage 'stage_train.dvc' didn't change.
Stage 'stage_evaluate.dvc' didn't change.
Pipeline is up to date. Nothing to reproduce.

## Add features

In the file __featurization.py__, uncomment the lines:

    features['sepal_length_to_sepal_width'] = features['sepal_length'] / features['sepal_width']
    features['petal_length_to_petal_width'] = features['petal_length'] / features['petal_width']
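
As a quick check of what these two lines compute, the sketch below applies the same ratios to the first three rows of `data/iris.csv`. It is a plain-Python stand-in under stated assumptions: the tutorial's script works on pandas DataFrame columns, and the helper `add_ratio_features` is a hypothetical name used only here.

```python
# First three rows of data/iris.csv, copied from the df.head() output earlier.
rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
    {"sepal_length": 4.9, "sepal_width": 3.0, "petal_length": 1.4, "petal_width": 0.2},
    {"sepal_length": 4.7, "sepal_width": 3.2, "petal_length": 1.3, "petal_width": 0.2},
]


def add_ratio_features(row):
    """Return a copy of the row with the two length-to-width ratios added."""
    out = dict(row)
    out["sepal_length_to_sepal_width"] = row["sepal_length"] / row["sepal_width"]
    out["petal_length_to_petal_width"] = row["petal_length"] / row["petal_width"]
    return out


featurized = [add_ratio_features(r) for r in rows]
for r in featurized:
    # Round for display only; the stored values keep full float precision.
    print(round(r["sepal_length_to_sepal_width"], 6),
          round(r["petal_length_to_petal_width"], 6))
```

Rounded to six decimals, these values agree with the two extra columns in the `data/iris_featurized.csv` preview.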

## Reproduce pipeline 

In [196]:
!dvc repro stage_evaluate.dvc

Stage 'data/iris.csv.dvc' didn't change.
Stage 'stage_feature_extraction.dvc' didn't change.
Stage 'stage_split_dataset.dvc' didn't change.
Stage 'stage_train.dvc' didn't change.
Stage 'stage_evaluate.dvc' didn't change.
Pipeline is up to date. Nothing to reproduce.

In [100]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_to_sepal_width,petal_length_to_petal_width
0,5.1,3.5,1.4,0.2,setosa,1.457143,7.0
1,4.9,3.0,1.4,0.2,setosa,1.633333,7.0
2,4.7,3.2,1.3,0.2,setosa,1.46875,6.5
3,4.6,3.1,1.5,0.2,setosa,1.483871,7.5
4,5.0,3.6,1.4,0.2,setosa,1.388889,7.0


## Compare metrics for all runs (experiments)

In [201]:
# Metrics for this pipeline

!dvc metrics show

	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}

In [202]:
# Show metrics for all committed pipelines

!dvc metrics show -a

Working Tree:
	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}
dvc-tutorial:
	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}
train-squares-of-sizes:
	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}
tuning:
	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}

## Commit new results

In [198]:
!git status -s

 M tutorial.ipynb


In [203]:
!git add .
!git commit -m "New features experiment"

[dvc-tutorial 33c2768] New features experiment
 1 file changed, 48 insertions(+), 24 deletions(-)


# Checkout (start over new experiment)

- in case the new features do not improve the results
- or we want to improve the model by changing the hyperparameters (with the OLD dataset)

## Checkout code and data files 

In [206]:
%%bash

git checkout dvc-tutorial
dvc checkout


M	tutorial.ipynb
[##############################] 100% Checkout finished!


Already on 'dvc-tutorial'
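
The two commands divide the work: `git checkout` switches the code and the small DVC-files, and `dvc checkout` then restores the data files those DVC-files point to from the cache. The toy simulation below sketches that second step in a temporary directory. It is an assumption-laden illustration, not DVC's implementation: the helpers `save_to_cache` and `checkout` and the file `data/features.csv` are hypothetical, and the sketch always copies, whereas real DVC can link files instead of copying them.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def save_to_cache(cache_dir: Path, data: bytes) -> str:
    """Store content under <cache>/<md5[:2]>/<md5[2:]> and return its md5."""
    digest = md5_of(data)
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return digest


def checkout(outs, cache_dir: Path, workspace: Path) -> None:
    """Restore every output listed in a DVC-file from the cache (by copying)."""
    for out in outs:
        source = cache_dir / out["md5"][:2] / out["md5"][2:]
        dest = workspace / out["path"]
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, dest)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / ".dvc" / "cache"
    # Two versions of the same data file end up as two separate cache entries.
    old_md5 = save_to_cache(cache, b"sepal_length,sepal_width\n5.1,3.5\n")
    new_md5 = save_to_cache(cache, b"sepal_length,sepal_width,ratio\n5.1,3.5,1.457143\n")

    # `git checkout` would switch the DVC-file; here the `outs` are picked by hand.
    checkout([{"md5": old_md5, "path": "data/features.csv"}], cache, root)
    restored_old = (root / "data" / "features.csv").read_bytes()
    checkout([{"md5": new_md5, "path": "data/features.csv"}], cache, root)
    restored_new = (root / "data" / "features.csv").read_bytes()

print(md5_of(restored_old) == old_md5, md5_of(restored_new) == new_md5)
```

Because each version of a file lives under its own checksum, switching branches never needs to recompute anything that is already cached.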


In [208]:
# Nothing to reproduce since code was checked out by `git checkout`
# and data files were checked out by `dvc checkout`
!dvc repro stage_evaluate.dvc

Stage 'data/iris.csv.dvc' didn't change.
Stage 'stage_feature_extraction.dvc' didn't change.
Stage 'stage_split_dataset.dvc' didn't change.
Stage 'stage_train.dvc' didn't change.
Stage 'stage_evaluate.dvc' didn't change.
Pipeline is up to date. Nothing to reproduce.

In [209]:
!dvc metrics show

	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}

In [211]:
# Check features used in this pipeline

import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_to_sepal_width,petal_length_to_petal_width
0,5.1,3.5,1.4,0.2,setosa,1.457143,7.0
1,4.9,3.0,1.4,0.2,setosa,1.633333,7.0
2,4.7,3.2,1.3,0.2,setosa,1.46875,6.5
3,4.6,3.1,1.5,0.2,setosa,1.483871,7.5
4,5.0,3.6,1.4,0.2,setosa,1.388889,7.0


## Tune the model

In [212]:
# create new branch for experiment

!git checkout -b tuning

fatal: A branch named 'tuning' already exists.


### Change parameters of classifier (LogisticRegression)

In the file __train.py__, in the constructor of `LogisticRegression`:

* change the `C` parameter to 0.1

In the end you should have the following classifier constructor:

```python
clf = LogisticRegression(C=0.1, solver='newton-cg', multi_class='multinomial', max_iter=100)
```

### Reproduce pipelines

In [213]:
# re-run pipeline 

!dvc repro stage_evaluate.dvc

Stage 'data/iris.csv.dvc' didn't change.
Stage 'stage_feature_extraction.dvc' didn't change.
Stage 'stage_split_dataset.dvc' didn't change.
Stage 'stage_train.dvc' didn't change.
Stage 'stage_evaluate.dvc' didn't change.
Pipeline is up to date. Nothing to reproduce.

In [214]:
!cat data/eval.txt

{"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}

In [215]:
!dvc metrics show -a

Working Tree:
	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}
dvc-tutorial:
	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}
train-squares-of-sizes:
	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}
tuning:
	data/eval.txt: {"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}

### Commit

In [134]:
%%bash

git add .
git commit -m "Tune model. C=0.1"

[tuning 149ded7] Tune model. C=0.1
 4 files changed, 134 insertions(+), 118 deletions(-)


### Merge the model to dvc-tutorial

In [136]:
%%bash

git checkout dvc-tutorial
git merge tuning

Already up-to-date.


fatal: A branch named 'train-squares-of-sizes' already exists.


### Resolve conflicts 

Replace the conflicting __checksums__ with an empty string '' in __stage_evaluate.dvc__ and __stage_train.dvc__

In [139]:
!dvc checkout

[##############################] 100% Checkout finished!

Then reproduce the pipeline:

In [217]:
!dvc repro stage_evaluate.dvc

Stage 'data/iris.csv.dvc' didn't change.
Stage 'stage_feature_extraction.dvc' didn't change.
Stage 'stage_split_dataset.dvc' didn't change.
Stage 'stage_train.dvc' didn't change.
Stage 'stage_evaluate.dvc' didn't change.
Pipeline is up to date. Nothing to reproduce.

### View target metric

In [218]:
!cat data/eval.txt

{"f1_score": 0.981981981981982, "confusion_matrix": {"classes": ["setosa", "versicolor", "virginica"], "matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}
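
The `f1_score` above can be cross-checked against the confusion matrix stored next to it. The sketch below assumes rows are true labels and columns are predictions (scikit-learn's convention; `src/evaluate.py` is not shown here, so this is a guess) and that the score is a macro average, which is the averaging that happens to reproduce the reported value. The helper `per_class_f1` is a hypothetical name.

```python
import json

# Contents of data/eval.txt as printed above.
eval_txt = (
    '{"f1_score": 0.981981981981982, "confusion_matrix": '
    '{"classes": ["setosa", "versicolor", "virginica"], '
    '"matrix": [[23, 0, 0], [0, 18, 0], [0, 1, 18]]}}'
)

metrics = json.loads(eval_txt)
classes = metrics["confusion_matrix"]["classes"]
matrix = metrics["confusion_matrix"]["matrix"]


def per_class_f1(matrix):
    """F1 per class, assuming rows are true labels and columns are predictions."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # true class k, predicted as another
        fp = sum(row[k] for row in matrix) - tp  # predicted k, actually another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


f1_by_class = dict(zip(classes, per_class_f1(matrix)))
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)  # unweighted mean over classes
n_test = sum(map(sum, matrix))                           # total evaluated samples

print(f1_by_class)
print(macro_f1, n_test)
```

The single off-diagonal entry (one virginica sample predicted as versicolor) lowers the F1 of both of those classes to 36/37, and the matrix sums to 60 test samples, which would correspond to the `0.4` argument passed to `src/split_dataset.py` if the dataset has the usual 150 rows.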

### Commit

In [219]:
%%bash

git add .
git commit -m 'Merge add_features into tuning'

[dvc-tutorial f4b21d0] Merge add_features into tuning
 1 file changed, 243 insertions(+), 79 deletions(-)


### Merge all into dvc-tutorial

In [221]:
%%bash

git checkout dvc-tutorial
dvc checkout
git merge train-squares-of-sizes

M	.dvc/config
M	tutorial.ipynb
[##############################] 100% Checkout finished!
Already up-to-date.


Already on 'dvc-tutorial'


In [222]:
%%bash

dvc checkout
# Nothing to reproduce:
dvc repro stage_evaluate.dvc

[##############################] 100% Checkout finished!
Stage 'data/iris.csv.dvc' didn't change.
Stage 'stage_feature_extraction.dvc' didn't change.
Stage 'stage_split_dataset.dvc' didn't change.
Stage 'stage_train.dvc' didn't change.
Stage 'stage_evaluate.dvc' didn't change.
Pipeline is up to date. Nothing to reproduce.


# Share data

## Set up remote storage (e.g. cloud)

In [236]:
# Create new remote

!dvc remote add -d local /tmp/dvc

ERROR: Remote with name local already exists. Use -f (--force) to overwrite remote with new value

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!

In [237]:
# as you can see, .dvc/config is changed

!git status -s

In [238]:
# check config file 

!cat .dvc/config

['remote "local"']
url = /tmp/dvc
[core]
remote = local
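
The config shows the two pieces that `dvc remote add -d` records: a `remote "local"` section holding the URL, and `core.remote` naming that remote as the default. As a rough illustration, the standard-library `configparser` happens to be able to read this particular file. DVC uses its own config handling, so treat the helper `default_remote_url` below as a hypothetical sketch rather than DVC's API.

```python
import configparser

# Contents of .dvc/config as printed above.
CONFIG_TEXT = """\
['remote "local"']
url = /tmp/dvc
[core]
remote = local
"""


def default_remote_url(config_text: str) -> str:
    """Look up the URL of the remote named by core.remote."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # The section header keeps its quotes, exactly as written in the file.
    section = f"'remote \"{name}\"'"
    return parser[section]["url"]


remote_url = default_remote_url(CONFIG_TEXT)
print(remote_url)
```

This is the location that the `dvc push` and `dvc pull` output in this notebook refers to.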


In [239]:
%%bash

git add .
git commit -m "Add remote storage"

On branch dvc-tutorial
nothing to commit, working tree clean


CalledProcessError: Command 'b'\ngit add .\ngit commit -m "Add remote storage"\n'' returned non-zero exit status 1.

## Push data to remote

In [240]:
# Push data to remote

!dvc push

Preparing to upload data to '/tmp/dvc'
Preparing to collect status from /tmp/dvc
[##############################] 100% Collecting information
[##############################] 100% Analysing status.
Everything is up to date.

In [241]:
%%bash

git add .
git commit -m "Add remote storage"

On branch dvc-tutorial
nothing to commit, working tree clean


CalledProcessError: Command 'b'\ngit add .\ngit commit -m "Add remote storage"\n'' returned non-zero exit status 1.

## Pull data from remote

In [242]:
!dvc pull

Preparing to download data from '/tmp/dvc'
Preparing to collect status from /tmp/dvc
[##############################] 100% Collecting information
[##############################] 100% Analysing status.
[##############################] 100% Checkout finished!

Everything is up to date.