In [3]:
import dvc
dvc.__version__

'1.8.4'

In [4]:
!git checkout -b experiments

Switched to a new branch 'experiments'


#### Initialize DVC
- Reference: https://dvc.org/doc/get-started/initialize

In [5]:
!dvc init


You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


#### Commit changes

In [6]:
%%bash

git add .
git commit -m "initialize dvc"

[experiments 4aa6ecf] initialize dvc
 9 files changed, 1743 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvc/plots/confusion.json
 create mode 100644 .dvc/plots/default.json
 create mode 100644 .dvc/plots/scatter.json
 create mode 100644 .dvc/plots/smooth.json
 create mode 100644 .dvcignore
 create mode 100644 dvc-1-from-scratch.ipynb
 create mode 100644 dvc-1-get-started-reserve.ipynb


#### Review files and directories created by DVC

In [7]:
! ls -a .dvc

.          ..         .gitignore config     plots      tmp


In [8]:
!cat .dvc/.gitignore

/config.local
/tmp
/cache


## Quick tour of DVC features
#### Data versioning

In [9]:
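# get data
import os

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)  # as_frame=True returns pandas objects
list(data.target_names)  # class names (not displayed: only a cell's last expression is shown)
os.makedirs("data", exist_ok=True)  # to_csv fails if the directory is missing
data.frame.to_csv("data/iris.csv", index=False)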

In [10]:
# look at the data
data.frame.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [11]:
! du -sh data/*

4.0K	data/iris.csv


#### Add file under DVC control

In [13]:
!dvc add data/iris.csv

100% Add|██████████████████████████████████████████████|1/1 [00:03,  3.07s/file]

To track the changes with git, run:

	git add data/iris.csv.dvc


In [14]:
!du -sh data/*

4.0K	data/iris.csv
4.0K	data/iris.csv.dvc


In [15]:
!git status -s data/

In [16]:
%%bash

git add .
git commit -m "add a source dataset"

[experiments 6f8d7f4] add a source dataset
 1 file changed, 287 insertions(+), 9 deletions(-)


#### What is a DVC file?

Data file internals

If you take a look at the DVC-file, you will see that only outputs are defined in outs. In this file, only one output is defined. The output contains the data file path in the repository and md5 cache. This md5 cache determines a location of the actual content file in DVC cache directory .dvc/cache

Output from DVC-files defines the relationship between the data file path in a repository and the path in a cache directory. See also DVC File Format

(c) dvc.org https://dvc.org/doc/tutorial/define-ml-pipeline

In [19]:
%%bash

ls data
cat data/iris.csv.dvc

iris.csv
iris.csv.dvc
outs:
- md5: 4d301abed5efe50eccda350cde38e0eb
  path: iris.csv
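
The mapping from the `md5` value to a location in the cache can be sketched in Python. This is an illustration, not DVC's own code. It assumes the DVC 1.x cache layout, in which the first two hex characters of the hash name a subdirectory and the remaining characters name the file. The helper name `cache_path` is hypothetical.

```python
import posixpath


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> str:
    """Build the assumed cache path: <cache_dir>/<first 2 chars>/<rest>."""
    return posixpath.join(cache_dir, md5[:2], md5[2:])


# Hash taken from data/iris.csv.dvc above
print(cache_path("4d301abed5efe50eccda350cde38e0eb"))
```

Under that assumption, the content of `data/iris.csv` would be stored at `.dvc/cache/4d/301abed5efe50eccda350cde38e0eb`.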


#### Create and reproduce ML pipelines
Stages
- extract features
- split dataset
- train
- evaluate

#### Add a pipeline stage with dvc run

In [20]:
%%bash

dvc run -n feature-extraction \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    python src/featurization.py

Running stage 'feature-extraction' with command:
	python src/featurization.py
Creating 'dvc.yaml'
Adding stage 'feature-extraction' in 'dvc.yaml'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock


In [21]:
!ls

README.md                       dvc-venv
data                            dvc.lock
dvc-1-from-scratch.ipynb        dvc.yaml
dvc-1-get-started-reserve.ipynb requirements.txt
dvc-1-get-started.ipynb         src


In [22]:
!cat dvc.yaml

stages:
  feature-extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv


In [24]:
import pandas as pd
features = pd.read_csv("data/iris_featurized.csv")
features.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
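
`src/featurization.py` is not shown in this notebook. Judging only by the output above, it at least renames the columns (for example, `sepal length (cm)` becomes `sepal_length`). Below is a minimal sketch of such a renaming step, using only the standard library. The function name `clean_column` is hypothetical, and the real script may do more.

```python
import re


def clean_column(name: str) -> str:
    """Drop a trailing unit such as ' (cm)' and replace spaces with underscores."""
    without_unit = re.sub(r"\s*\([^)]*\)\s*$", "", name)
    return without_unit.strip().replace(" ", "_")


# Column names as they appear in data/iris.csv
raw_columns = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
    "target",
]
print([clean_column(c) for c in raw_columns])
```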


In [26]:
!git status -s

 M dvc-1-from-scratch.ipynb
?? dvc.lock
?? dvc.yaml


In [28]:
%%bash
git add .
git commit -m "add stage features_extraction"

[experiments ccb613c] add stage features_extraction
 3 files changed, 346 insertions(+), 22 deletions(-)
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml


#### Add split train/test stage with dvc run

In [38]:
%%bash

dvc run --force -n split_dataset \
    -d src/split_dataset.py \
    -d data/iris_featurized.csv \
    -o data/train.csv \
    -o data/test.csv \
    python src/split_dataset.py --test_size 0.4

Running stage 'split_dataset' with command:
	python src/split_dataset.py --test_size 0.4
Modifying stage 'split_dataset' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock dvc.yaml


In [39]:
!cat dvc.yaml

stages:
  feature-extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
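
`src/split_dataset.py` is not shown here. The sketch below illustrates what a `--test_size 0.4` argument typically means: 40% of the rows go to the test set and the rest to the training set. It uses only the standard library. `split_rows` and the fixed `seed` are assumptions for illustration, not the script's actual implementation.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(rows: Sequence, test_size: float, seed: int = 42) -> Tuple[List, List]:
    """Shuffle rows reproducibly and split them into (train, test)."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same split on every run
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(150), test_size=0.4)  # iris has 150 rows
print(len(train), len(test))
```

A fixed seed matters in a DVC pipeline: rerunning the stage on unchanged inputs then produces identical outputs, so downstream stages are not invalidated by chance.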


#### Add train stage

In [43]:
%%bash

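# -n  name of the stage to add
# -d  dependency (input file or script)
# -o  output produced by the stage
# -m  metrics file (used by the evaluate stage below)
# the trailing command (python ...) is what the stage runs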

dvc run --force -n train \
    -d src/train.py \
    -d data/train.csv \
    -o data/model.joblib \
    python src/train.py

Running stage 'train' with command:
	python src/train.py
Adding stage 'train' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock


In [44]:
!cat dvc.yaml

stages:
  feature-extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
  train:
    cmd: python src/train.py
    deps:
    - data/train.csv
    - src/train.py
    outs:
    - data/model.joblib


#### Add evaluate stage

In [46]:
%%bash

dvc run --force -n evaluate \
    -d src/train.py \
    -d src/evaluate.py \
    -d data/test.csv \
    -d data/model.joblib \
    -m data/eval.txt \
    python src/evaluate.py

Running stage 'evaluate' with command:
	python src/evaluate.py
Adding stage 'evaluate' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock dvc.yaml


In [47]:
!cat dvc.yaml

stages:
  feature-extraction:
    cmd: python src/featurization.py
    deps:
    - data/iris.csv
    - src/featurization.py
    outs:
    - data/iris_featurized.csv
  split_dataset:
    cmd: python src/split_dataset.py --test_size 0.4
    deps:
    - data/iris_featurized.csv
    - src/split_dataset.py
    outs:
    - data/test.csv
    - data/train.csv
  train:
    cmd: python src/train.py
    deps:
    - data/train.csv
    - src/train.py
    outs:
    - data/model.joblib
  evaluate:
    cmd: python src/evaluate.py
    deps:
    - data/model.joblib
    - data/test.csv
    - src/evaluate.py
    - src/train.py
    metrics:
    - data/eval.txt
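
DVC infers the order of stages from `deps` and `outs`: a stage that consumes a file must run after the stage that produces it. The sketch below reproduces that idea for the pipeline above with `graphlib` from the standard library (Python 3.9+). The stage dictionary is copied by hand from `dvc.yaml`. The helper `execution_order` is hypothetical and is not how DVC is implemented internally.

```python
from graphlib import TopologicalSorter

# deps and outs copied from dvc.yaml above (the metrics file counts as an output here)
STAGES = {
    "feature-extraction": {
        "deps": ["data/iris.csv", "src/featurization.py"],
        "outs": ["data/iris_featurized.csv"],
    },
    "split_dataset": {
        "deps": ["data/iris_featurized.csv", "src/split_dataset.py"],
        "outs": ["data/test.csv", "data/train.csv"],
    },
    "train": {
        "deps": ["data/train.csv", "src/train.py"],
        "outs": ["data/model.joblib"],
    },
    "evaluate": {
        "deps": ["data/model.joblib", "data/test.csv", "src/evaluate.py", "src/train.py"],
        "outs": ["data/eval.txt"],
    },
}


def execution_order(stages: dict) -> list:
    """Order stages so that every producer runs before its consumers."""
    # Map each output file to the stage that produces it
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # For each stage, collect the stages that produce its dependencies
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))
```

Because each stage here depends on the previous one, only one valid order exists: `feature-extraction`, `split_dataset`, `train`, `evaluate`.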


#### Reproduce pipeline

In [48]:
!dvc repro split_dataset

+------------------------------------------+
|                                          |
|     Update available 1.8.4 -> 2.11.0     |
|     Run `pip install dvc --upgrade`      |
|                                          |
+------------------------------------------+

'data/iris.csv.dvc' didn't change, skipping                           
Stage 'feature-extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Data and pipelines are up to date.
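
The `didn't change, skipping` messages come from comparing the current state of each dependency with the hashes recorded in `dvc.lock`. A simplified sketch of that comparison follows. The function `changed_deps` and the in-memory `files` dictionary are assumptions for illustration. DVC's real check takes more into account, such as the stage command and its outputs.

```python
import hashlib
from typing import Dict, List


def md5_hex(content: bytes) -> str:
    """Return the hex MD5 digest of raw bytes."""
    return hashlib.md5(content).hexdigest()


def changed_deps(locked: Dict[str, str], files: Dict[str, bytes]) -> List[str]:
    """Return dependency paths whose current hash differs from the locked hash."""
    return sorted(
        path
        for path, locked_md5 in locked.items()
        if path not in files or md5_hex(files[path]) != locked_md5
    )


# Toy contents standing in for real files on disk
files = {"data/iris.csv": b"a,b\n1,2\n", "src/featurization.py": b"print('v1')\n"}
locked = {path: md5_hex(content) for path, content in files.items()}

print(changed_deps(locked, files))  # nothing changed, so the stage can be skipped
files["src/featurization.py"] = b"print('v2')\n"
print(changed_deps(locked, files))  # the edited script is reported
```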


## Collaborate on ML Experiments
#### Specify remote storage (local ~ /tmp/dvc)

In [49]:
!dvc remote add -d local /tmp/dvc

Setting 'local' as a default remote.


#### Push features to remote storage

In [50]:
!dvc push

6 files pushed                                                                  
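
With a local directory as the remote, a push amounts to copying the cache objects that the remote does not have yet. Below is a rough sketch under that assumption. The helper `push_missing` is hypothetical, and temporary directories stand in for `.dvc/cache` and `/tmp/dvc`.

```python
import shutil
import tempfile
from pathlib import Path


def push_missing(cache_dir: Path, remote_dir: Path) -> int:
    """Copy every cache file absent from the remote; return how many were copied."""
    pushed = 0
    for src in sorted(p for p in cache_dir.rglob("*") if p.is_file()):
        dst = remote_dir / src.relative_to(cache_dir)
        if not dst.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            pushed += 1
    return pushed


with tempfile.TemporaryDirectory() as tmp:
    cache, remote = Path(tmp, "cache"), Path(tmp, "remote")
    (cache / "4d").mkdir(parents=True)
    (cache / "4d" / "301abed5efe50eccda350cde38e0eb").write_bytes(b"toy content")
    first = push_missing(cache, remote)   # the object is new to the remote
    second = push_missing(cache, remote)  # nothing left to copy
    print(first, second)
```

The second call copies nothing, which mirrors how a repeated `dvc push` has no new files to transfer.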


#### Create tag experiment-1

In [51]:
!git tag -a experiment-1 -m "experiment-1"

#### Check out your teammate's experiment state

In [52]:
%%bash

git checkout experiment-1
dvc checkout

M	.dvc/config
M	dvc-1-from-scratch.ipynb
M	dvc.lock
M	dvc.yaml


Note: switching to 'experiment-1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at ccb613c add stage features_extraction


#### Check Metrics

In [53]:
!dvc metrics show

	data/eval.txt:                                                       
		f1_score: 0.7861833464670345
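
`src/evaluate.py` is not shown in this notebook. Since `dvc metrics show` prints `f1_score` as a key inside `data/eval.txt`, the file presumably holds a small JSON or YAML mapping. The sketch below computes a macro-averaged F1 score in plain Python and serializes it as JSON. The averaging mode, the function names, and the toy labels are all assumptions. The real script may use `sklearn.metrics.f1_score` with a different `average` setting.

```python
import json
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def metrics_json(y_true: Sequence[int], y_pred: Sequence[int]) -> str:
    """Serialize the score as a key/value mapping, as a metrics file would hold it."""
    return json.dumps({"f1_score": macro_f1(y_true, y_pred)})


# Toy labels, not the iris predictions
print(metrics_json([0, 0, 1, 1], [0, 1, 1, 1]))
```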


#### Reproduce experiment

In [54]:
# nothing to reproduce
!dvc repro

'data/iris.csv.dvc' didn't change, skipping                           
Stage 'feature-extraction' didn't change, skipping
Stage 'split_dataset' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.


In [55]:
!dvc repro -f

Verifying data sources in stage: 'data/iris.csv.dvc'                  

Running stage 'feature-extraction' with command:
	python src/featurization.py
                                                                      
Running stage 'split_dataset' with command:
	python src/split_dataset.py --test_size 0.4
                                                                      
Running stage 'train' with command:
	python src/train.py
                                                                      
Running stage 'evaluate' with command:
	python src/evaluate.py
                                                                      
To track the changes with git, run:

	git add data/iris.csv.dvc


In [56]:
# recheck metrics
!dvc metrics show

	data/eval.txt:                                                       
		f1_score: 0.7861833464670345
