# Install and init DVC

Prerequisites:
- DVC and the packages from requirements.txt are installed (if not, see README.md for instructions)
- The project repository is a Git repo

## Install with pip

In [1]:
!pip install dvc==1.8.4



## Checkout branch `experiments`

In [1]:
!git checkout -b experiments

fatal: A branch named 'experiments' already exists.


## Initialize DVC

References: 
- https://dvc.org/doc/get-started/initialize 

In [2]:
!dvc init

ERROR: failed to initiate DVC - '.dvc' exists. Use `-f` to force.

## Commit changes

In [4]:
%%bash

git add .
git commit -m "Initialize DVC"

[experiments 147dac4] Initialize DVC
 8 files changed, 172 insertions(+), 87 deletions(-)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvc/plots/confusion.json
 create mode 100644 .dvc/plots/default.json
 create mode 100644 .dvc/plots/scatter.json
 create mode 100644 .dvc/plots/smooth.json
 create mode 100644 .dvcignore


## Review Files and Directories created by DVC

In [5]:
!ls -a .dvc 

.  ..  config  .gitignore  plots  tmp


In [6]:
!cat .dvc/.gitignore

/config.local
/tmp
/cache


# Quick Tour of DVC features

## Data Versioning

In [1]:
# Get data 

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
list(data.target_names)
data.frame.to_csv('data/iris.csv', index=False)

ModuleNotFoundError: No module named 'pandas'

In [8]:
# Look at the data

data.frame.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [9]:
%%bash

du -sh data/*

4,0K	data/iris.csv


## Add file under DVC control

In [10]:
%%bash

dvc add data/iris.csv


To track the changes with git, run:

	git add data/iris.csv.dvc data/.gitignore


In [11]:
!du -sh data/*

4,0K	data/iris.csv
4,0K	data/iris.csv.dvc


In [12]:
!git status -s data/

?? data/.gitignore
?? data/iris.csv.dvc


In [13]:
%%bash

git add .
git commit -m "Add a source dataset"

[experiments 40d1734] Add a source dataset
 2 files changed, 4 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/iris.csv.dvc


### What is a DVC-file?

Data file internals

> If you take a look at the DVC-file, you will see that only outputs are defined in `outs`.
> In this file, only one output is defined. The output contains the data file path in the repository and md5 cache.
> This md5 cache determines a location of the actual content file in DVC cache directory `.dvc/cache`.
>
> > Output from DVC-files defines the relationship between the data file path in a repository and the path in a cache directory. See also DVC File Format.

(c) dvc.org https://dvc.org/doc/tutorial/define-ml-pipeline

In [14]:
!cat data/iris.csv.dvc

outs:
- md5: 4d301abed5efe50eccda350cde38e0eb
  path: iris.csv
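
The link between the `md5` value in the DVC-file and the cache can be sketched in a few lines of Python. This assumes the cache layout used by DVC 1.x releases (such as the 1.8.4 installed above), where the first two hex characters of the hash name a subdirectory and the remaining characters name the file; newer releases may lay the cache out differently. The helper names `file_md5` and `cache_path` are made up for this illustration and are not part of DVC.

```python
import hashlib
from pathlib import PurePosixPath


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5, cache_dir=".dvc/cache"):
    """Map an md5 hash to its assumed location in a DVC 1.x cache.

    Assumption: <cache_dir>/<first 2 hex chars>/<remaining chars>.
    """
    return str(PurePosixPath(cache_dir) / md5[:2] / md5[2:])


# The hash recorded in data/iris.csv.dvc above
print(cache_path("4d301abed5efe50eccda350cde38e0eb"))
# .dvc/cache/4d/301abed5efe50eccda350cde38e0eb
```

Note that DVC may normalize line endings of text files before hashing, so `file_md5('data/iris.csv')` is only expected to reproduce the recorded value for files without CRLF line endings.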


## Create and Reproduce ML pipelines 

Stages 
- extract features 
- split dataset 
- train 
- evaluate 


### Add a pipeline stage with 'dvc run'

In [15]:
!dvc run -n feature_extraction \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    python src/featurization.py

Running stage 'feature_extraction' with command:
	python src/featurization.py
Creating 'dvc.yaml'
Adding stage 'feature_extraction' in 'dvc.yaml'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml data/.gitignore dvc.lock

In [16]:
!ls 

data			       dvc.lock  dvc.yaml   requirements.txt
dvc-1-get-started.ipynb.ipynb  dvc-venv  README.md  src


In [17]:
!cat dvc.yaml

stages:
  feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv


In [18]:
import pandas as pd

features = pd.read_csv('data/iris_featurized.csv')
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [19]:
!git status -s

 M data/.gitignore
?? dvc.lock
?? dvc.yaml


In [20]:
%%bash
git add .
git commit -m "Add stage features_extraction"

[experiments 5f733fe] Add stage features_extraction
 3 files changed, 19 insertions(+)
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml


### Add split train/test stage (via dvc.yaml)

In [21]:
!dvc run -n split_dataset \
    -d src/split_dataset.py \
    -d data/iris_featurized.csv \
    -o data/train.csv \
    -o data/test.csv \
    python src/split_dataset.py --test_size 0.4

Running stage 'split_dataset' with command:
	python src/split_dataset.py --test_size 0.4
Adding stage 'split_dataset' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml data/.gitignore dvc.lock

In [22]:
!cat dvc.yaml

stages:
  feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
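
The contents of `src/split_dataset.py` are not shown in this notebook. As a rough idea of what a `--test_size 0.4` split does, here is a minimal standard-library sketch; the function name `split_rows`, the fixed seed and the rounding rule are all assumptions made for illustration, not the script's actual logic.

```python
import random


def split_rows(rows, test_size, seed=42):
    """Shuffle rows reproducibly and split them into (train, test).

    test_size is the fraction of rows that goes to the test set.
    The seed and the use of round() are illustrative assumptions.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(10), test_size=0.4)
print(len(train), len(test))
# 6 4
```

A fixed seed matters in a DVC pipeline: if the split changed on every run, the outputs `data/train.csv` and `data/test.csv` would get new hashes each time and every downstream stage would be rerun.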


### Add train stage

In [23]:
!dvc run -n train \
    -d src/train.py \
    -d data/train.csv \
    -o data/model.joblib \
    python src/train.py

Running stage 'train' with command:
	python src/train.py
Adding stage 'train' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock dvc.yaml data/.gitignore

In [24]:
!cat dvc.yaml

stages:
  feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
  train:
    cmd: python src/train.py
    deps:
    - data/train.csv
    - src/train.py
    outs:
    - data/model.joblib


### Add evaluate stage

In [25]:
!dvc run -n evaluate \
    -d src/train.py \
    -d src/evaluate.py \
    -d data/test.csv \
    -d data/model.joblib \
    -m data/eval.txt \
    python src/evaluate.py

Running stage 'evaluate' with command:
	python src/evaluate.py
Adding stage 'evaluate' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock data/.gitignore

In [26]:
!cat dvc.yaml

stages:
  feature_extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
  train:
    cmd: python src/train.py
    deps:
    - data/train.csv
    - src/train.py
    outs:
    - data/model.joblib
  evaluate:
    cmd: python src/evaluate.py
    deps:
    - data/model.joblib
    - data/test.csv
    - src/evaluate.py
    - src/train.py
    metrics:
    - data/eval.txt
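
The four stages form a pipeline because the outputs of one stage appear among the dependencies of another. The sketch below copies the `deps` and outputs from the `dvc.yaml` above into a plain dictionary (with the `metrics` entry treated as an output) and derives an execution order with the standard-library `graphlib` module (Python 3.9+). It only illustrates the idea; it is not how DVC itself is implemented, and `stage_order` is a name invented here.

```python
from graphlib import TopologicalSorter

# deps/outs copied from the dvc.yaml above; metrics are listed under "outs"
stages = {
    "feature_extraction": {
        "deps": ["data/iris.csv", "src/featurization.py"],
        "outs": ["data/iris_featurized.csv"],
    },
    "split_dataset": {
        "deps": ["data/iris_featurized.csv", "src/split_dataset.py"],
        "outs": ["data/test.csv", "data/train.csv"],
    },
    "train": {
        "deps": ["data/train.csv", "src/train.py"],
        "outs": ["data/model.joblib"],
    },
    "evaluate": {
        "deps": ["data/model.joblib", "data/test.csv",
                 "src/evaluate.py", "src/train.py"],
        "outs": ["data/eval.txt"],
    },
}


def stage_order(stages):
    """Order stages so that each one runs after the stages producing its deps."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For every stage, collect the stages whose outputs it depends on
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(stages))
# ['feature_extraction', 'split_dataset', 'train', 'evaluate']
```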


### Reproduce pipeline
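
`dvc repro` reruns a stage only when something it depends on has changed since the hashes in `dvc.lock` were recorded; otherwise it reports the stage as unchanged and skips it. A minimal sketch of such a check, assuming plain MD5 file hashes and using invented helper names (`snapshot`, `changed_deps`), could look like this. It is a simplification for illustration, not DVC's actual code.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Hex MD5 digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def snapshot(paths):
    """Record the current hash of every dependency (a stand-in for dvc.lock)."""
    return {str(p): md5_of(p) for p in paths}


def changed_deps(recorded):
    """Return the recorded paths that are missing or whose hash differs now."""
    changed = []
    for path, old_hash in recorded.items():
        p = Path(path)
        if not p.is_file() or md5_of(p) != old_hash:
            changed.append(path)
    return changed
```

With these helpers, a stage would be skipped when `changed_deps(recorded)` returns an empty list and rerun otherwise, which is the behaviour visible in the `dvc repro` output of the next cell.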

In [27]:
!dvc repro split_dataset

'data/iris.csv.dvc' didn't change, skipping
Stage 'feature_extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Data and pipelines are up to date.

## Collaborate on ML Experiments 

### Specify remote storage (local ~ /tmp/dvc)


In [28]:
!dvc remote add -d local /tmp/dvc

Setting 'local' as a default remote.

### Push features to remote storage

In [29]:
!dvc push

2 files pushed

### Create tag `experiment-1`

In [30]:
!git tag -a experiment-1 -m "experiment-1"

### Check out your teammate's experiment state

In [31]:
%%bash 

git checkout experiment-1
dvc checkout

M	.dvc/config
M	data/.gitignore
M	dvc-1-get-started.ipynb.ipynb
M	dvc.lock
M	dvc.yaml


Note: checking out 'experiment-1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 5f733fe... Add stage features_extraction


### Check Metrics
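
The `f1_score` shown by `dvc metrics show` is read from the metrics file `data/eval.txt` that the evaluate stage writes. The evaluation script is not part of this notebook, so the following is only a sketch of how such a value might be produced: it assumes a macro-averaged F1 score and a JSON metrics file, and the function names are invented for illustration.

```python
import json


def f1_for_label(y_true, y_pred, label):
    """F1 score of a single class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging mode)."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


def metrics_json(y_true, y_pred):
    """Serialize the score in the key/value form shown by `dvc metrics show`."""
    return json.dumps({"f1_score": macro_f1(y_true, y_pred)})
```

Writing the string returned by `metrics_json` to the path declared with `-m` would give DVC a metrics file it can display and compare across commits.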

In [32]:
!dvc metrics show

	data/eval.txt:
		f1_score: 0.7861833464670345

### Reproduce experiment

In [33]:
# Nothing to reproduce
!dvc repro

'data/iris.csv.dvc' didn't change, skipping
Stage 'feature_extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

In [34]:
!dvc repro -f

Verifying data sources in stage: 'data/iris.csv.dvc'

Running stage 'feature_extraction' with command:
	python src/featurization.py

Running stage 'split_dataset' with command:
	python src/split_dataset.py --test_size 0.4

Running stage 'train' with command:
	python src/train.py

Running stage 'evaluate' with command:
	python src/evaluate.py

To track the changes with git, run:

	git add data/iris.csv.dvc

In [35]:
# Check Metrics

!dvc metrics show

	data/eval.txt:
		f1_score: 0.7861833464670345