### Intro to DVC Kaggle Notebook
- https://www.kaggle.com/code/kurianbenoy/introduction-to-data-version-control-dvc/notebook

In [1]:
!ls

dvc-3-kaggle-example.ipynb results.zip


In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
!dvc --version
!dvc -h

2.11.0
usage: dvc [-q | -v] [-h] [-V] [--cd <path>] COMMAND ...

Data Version Control

optional arguments:
  -q, --quiet        Be quiet.
  -v, --verbose      Be verbose.
  -h, --help         Show this help message and exit.
  -V, --version      Show program's version.
  --cd <path>        Change to directory before executing.

Available Commands:
  COMMAND            Use `dvc COMMAND --help` for command-specific help.
    init             Initialize DVC in the current directory.
    get              Download file or directory tracked by DVC or by Git.
    get-url          Download or copy files from URL.
    destroy          Remove DVC files, local DVC config and data cache.
    add              Track data files or directories with DVC.
    remove           Remove stages from dvc.yaml and/or stop tracking files or directories.
    move             Rename or move a DVC controlled data file or a directory.
    unprotect        Unprotect tracked files or directories (when hardlinks o

In [4]:
# create the project folder (a cd inside a ! command does not persist, so the next cell uses os.chdir)
!mkdir get-started
!ls

dvc-3-kaggle-example.ipynb results.zip
get-started


In [5]:
from pathlib import Path
import os

a = Path.cwd() / "get-started"
os.chdir(a)

In [6]:
# initialize git in our folder
!git init

Initialized empty Git repository in /Users/evgeniimunin/Documents/02_Medium/02_dvc/dvc-3-kaggle-s3/get-started/.git/


In [7]:
# run dvc initialization in a repo directory to create DVC meta files and directories
!dvc init

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

In [8]:
# config git for user account
#! git config --global user.name "kuranbenoy" #Replace with your github username
#! git config --global user.email "kurian.bkk@gmail.com" #Replace with your email id

# commit initialized git files
!git commit -m "initialize DVC"

[master (root-commit) 065b29a] initialize DVC
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore


### Configuring DVC remotes

A DVC remote is used to share your ML models and datasets with others. DVC currently supports the following remote types (see https://dvc.org/doc/get-started/configure):

- local - Local directory
- s3 - Amazon Simple Storage Service
- gs - Google Cloud Storage
- azure - Azure Blob Storage
- ssh - Secure Shell
- hdfs - The Hadoop Distributed File System
- http - HTTP and HTTPS protocols

Note that this example uses a local directory as the remote storage. In practice, a cloud storage service is usually the recommended choice for a DVC remote.

In [9]:
! dvc remote add -d -f myremote /tmp/dvc-storage

Setting 'myremote' as a default remote.

In [10]:
! ls /tmp

Sublime Text.1412e317863d9ac0332e69a0eea79cd4.8a0daf9de0aa4c55cb04cc4fc066df3f.sock
com.apple.launchd.acCrU2uGFk
dvc
powerlog


In [11]:
!git commit .dvc/config -m "initialize DVC local remote"

[master 18f643d] initialize DVC local remote
 1 file changed, 4 insertions(+)


### Download files

In [13]:
# download the data
# (the recorded output comes from a re-run: data/ and data/data.xml already existed,
# so mkdir and dvc get both report "File exists")
! ls data/
! mkdir data/
! dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

data.xml
mkdir: data/: File exists
ERROR: unexpected error - [Errno 17] File exists: 'data/data.xml'

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

In [14]:
# add file (directory) to DVC
!dvc add data/data.xml

100% Adding...|████████████████████████████████████████|1/1 [00:00, 15.64file/s]

To track the changes with git, run:

    git add data/data.xml.dvc data/.gitignore

To enable auto staging, run:

    dvc config core.autostage true

In [15]:
# add DVC files to git and update gitignore
!git add data/.gitignore data/data.xml.dvc
!git commit -m "add source data to DVC"

[master 6085f7d] add source data to DVC
 2 files changed, 5 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/data.xml.dvc


In [16]:
# push them from your repository to default remote storage
!dvc push

1 file pushed

### Retrieve data
Now that the data has been pushed, we can do the opposite and pull it back, just as with git push and git pull. An easy way to test this is to remove the local copy of the data first.
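To make the push/pull round trip concrete, here is a minimal Python sketch of a content-addressed store. It is not DVC's implementation: the helper names (`file_md5`, `store_path`, `push`, `pull`) are hypothetical, the "remote" is just a temporary directory, and the two-character prefix layout is only assumed to resemble what DVC 2.x uses in its cache.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def store_path(root: Path, digest: str) -> Path:
    """Map a digest to <root>/<first 2 hex chars>/<remaining chars> (assumed layout)."""
    return root / digest[:2] / digest[2:]


def push(workspace_file: Path, remote: Path) -> str:
    """Copy the file into the remote under its hash and return the hash."""
    digest = file_md5(workspace_file)
    target = store_path(remote, digest)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(workspace_file, target)
    return digest


def pull(digest: str, remote: Path, workspace_file: Path) -> None:
    """Restore the workspace file from the remote using the recorded hash."""
    workspace_file.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(store_path(remote, digest), workspace_file)


with tempfile.TemporaryDirectory() as tmp:
    remote = Path(tmp) / "remote"          # stands in for /tmp/dvc-storage
    data = Path(tmp) / "data" / "data.xml"
    data.parent.mkdir(parents=True)
    data.write_text("<posts/>")            # toy content, not the real dataset

    recorded = push(data, remote)          # roughly what `dvc push` does
    data.unlink()                          # like `rm -f data/data.xml`
    pull(recorded, remote, data)           # roughly what `dvc pull` does
    restored_ok = file_md5(data) == recorded
    print(restored_ok)
```

The point of the design is that only the small hash needs to live in git (in the .dvc file), while the bytes themselves can be restored from any store that holds them under that hash.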

In [17]:
! rm -f data/data.xml

In [18]:
# the data is restored to the workspace
! dvc pull

A       data/data.xml
1 file added

In [19]:
# to retrieve just a single dataset or file
! dvc pull data/data.xml.dvc

Everything is up to date.

#### Connecting with code
To make a Machine Learning project fully reproducible, it is important to connect the code with the datasets, which are themselves made reproducible with commands like dvc add/push/pull.

In [20]:
# run these commands to get sample code
!wget https://code.dvc.org/get-started/code.zip
!unzip code.zip
!rm -f code.zip

--2022-07-02 12:01:47--  https://code.dvc.org/get-started/code.zip
Resolving code.dvc.org (code.dvc.org)... 104.21.81.205, 172.67.164.76
Connecting to code.dvc.org (code.dvc.org)|104.21.81.205|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://s3-us-east-2.amazonaws.com/dvc-public/code/get-started/code.zip [following]
--2022-07-02 12:01:48--  https://s3-us-east-2.amazonaws.com/dvc-public/code/get-started/code.zip
Resolving s3-us-east-2.amazonaws.com (s3-us-east-2.amazonaws.com)... 52.219.100.66
Connecting to s3-us-east-2.amazonaws.com (s3-us-east-2.amazonaws.com)|52.219.100.66|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5939 (5,8K) [application/zip]
Saving to: ‘code.zip’


2022-07-02 12:01:48 (2,03 MB/s) - ‘code.zip’ saved [5939/5939]

FINISHED --2022-07-02 1

Having installed the src/prepare.py script in your repo, the following command transforms it into a reproducible stage for the ML pipeline we're building (described in detail in the documentation).

Stages are created with dvc run [options] [command]. The options used here are:

- -n for name: the name of the stage
- -d for dependency: specify an input file
- -o for output: specify an output file ignored by git and tracked by dvc
- -M for metric: specify an output metric file tracked by git
- -f for force: in the DVC 2.x release used here, overwrite an existing stage without asking (older 0.x releases used -f to name the stage's DVC-file)
- command: a bash command, mostly a python script invocation
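As a rough illustration of how these options combine, the following Python sketch assembles the argument list for a stage. `build_dvc_run` is a hypothetical helper written for this notebook, not part of DVC; it only builds the list and does not call DVC.

```python
from typing import List, Sequence


def build_dvc_run(
    name: str,
    deps: Sequence[str],
    outs: Sequence[str],
    command: Sequence[str],
    metrics: Sequence[str] = (),
) -> List[str]:
    """Assemble a `dvc run` argument list from stage name, inputs and outputs."""
    args = ["dvc", "run", "-n", name]
    for dep in deps:
        args += ["-d", dep]        # one -d per input file or directory
    for out in outs:
        args += ["-o", out]        # one -o per DVC-tracked output
    for metric in metrics:
        args += ["-M", metric]     # one -M per git-tracked metric file
    return args + list(command)    # the command to execute always comes last


args = build_dvc_run(
    name="prepare_data",
    deps=["src/prepare.py", "data/data.xml"],
    outs=["data/prepared"],
    command=["python", "src/prepare.py", "data/data.xml"],
)
print(" ".join(args))
```

Passing the resulting list to `subprocess.run(args, check=True)` would execute it, assuming DVC is installed and the current directory is the repository.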

In [23]:
# create a pipeline stage that writes train.tsv and test.tsv to data/prepared
# (this first attempt fails: with DVC 2.x, -f takes no argument, so prepare.dvc
# is treated as the command to run; the data dependency path is also wrong)
!dvc run \
    -n prepare_data \
    -f prepare.dvc \
    -d src/prepare.py \
    -d src/data.xml \
    -o data/prepared \
    python src/prepare.py data/data.xml

Running stage 'prepare_data':
> prepare.dvc -d src/prepare.py -d src/data.xml -o data/prepared python src/prepare.py data/data.xml
zsh:1: command not found: prepare.dvc
ERROR: failed to run: prepare.dvc -d src/prepare.py -d src/data.xml -o data/prepared python src/prepare.py data/data.xml, exited with 127

In [26]:
# corrected command: -f prepare.dvc is dropped, because DVC 2.x stores the stage in dvc.yaml
! dvc run -n prepare_data \
          -d src/prepare.py -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml

In [None]:
# DVC 2.x records stages in dvc.yaml and dvc.lock rather than in prepare.dvc
!git add data/.gitignore dvc.yaml dvc.lock
!git commit -m "add data preparation stage"

In [None]:
!dvc push

### Pipeline
Using dvc run multiple times, and specifying outputs of a command (stage) as dependencies in another one, we can describe a sequence of commands that gets to a desired result. This is what we call a data pipeline or computational graph.
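The idea that outputs of one stage become dependencies of another can be sketched in a few lines of Python. The stage table below is a hypothetical in-memory stand-in for the stages defined in this notebook, and `execution_order` is an illustrative helper, not how DVC itself resolves the graph.

```python
from typing import Dict, List

# Hypothetical stage table: each stage lists what it reads and what it writes.
stages: Dict[str, Dict[str, List[str]]] = {
    "train": {"deps": ["src/train.py", "data/features"], "outs": ["model.pkl"]},
    "prepare": {"deps": ["src/prepare.py", "data/data.xml"], "outs": ["data/prepared"]},
    "featurize": {"deps": ["src/featurization.py", "data/prepared"], "outs": ["data/features"]},
}


def execution_order(stages: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Order stages so that every stage runs after the stages producing its inputs."""
    # Which stage produces each output path?
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, the set of stages it must wait for.
    upstream = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    order: List[str] = []
    remaining = dict(upstream)
    while remaining:
        # A stage is ready once all of its upstream stages are already ordered.
        ready = sorted(name for name, ups in remaining.items() if ups <= set(order))
        if not ready:
            raise ValueError("cycle detected between stages")
        order.extend(ready)
        for name in ready:
            del remaining[name]
    return order


order = execution_order(stages)
print(order)  # ['prepare', 'featurize', 'train']
```

Files such as src/train.py or data/data.xml have no producing stage, so they act as the source nodes of the graph.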


In [None]:
# create the second stage (after the prepare stage from the previous section) to perform feature extraction
# (DVC 2.x needs a stage name via -n; -f <file> from older releases is dropped)
! dvc run -n featurize \
          -d src/featurization.py -d data/prepared/ \
          -o data/features \
          python src/featurization.py data/prepared data/features

In [None]:
# third stage: train the model (named with -n, as DVC 2.x requires)
!dvc run -n train \
    -d src/train.py -d data/features \
    -o model.pkl \
    python src/train.py data/features model.pkl

In [None]:
%%bash
# DVC 2.x keeps all stages in dvc.yaml and dvc.lock
git add data/.gitignore .gitignore dvc.yaml dvc.lock
git commit -m "add featurization and train steps to the pipeline"
dvc push

### Pipelines visualisation

In [None]:
# dvc pipeline show was replaced by dvc dag in DVC 1.0 and later
!dvc dag

### Metrics
The last stage we would like to add to our pipeline is evaluation. Data science is a metric-driven, R&D-like process, and dvc metrics together with DVC metric files provide a framework to capture and compare the performance of experiments.

evaluate.py calculates the AUC value using the test data set. It reads features from the features/test.pkl file and produces a DVC metric file, auc.metric. This is a special DVC output file type; in this case it is just a plain text file with a single number inside.
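evaluate.py itself is not shown in this notebook, so the following is only a hypothetical stand-in for its core idea: compute an AUC and write the single number to a plain text metric file. The `roc_auc` helper is a pure-Python pairwise implementation, and the labels and scores are toy inputs chosen for illustration, not results from this project.

```python
import tempfile
from pathlib import Path
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy inputs for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)

with tempfile.TemporaryDirectory() as tmp:
    metric_file = Path(tmp) / "auc.metric"
    metric_file.write_text(f"{auc:.4f}\n")   # a plain text file with a single number
    metric_text = metric_file.read_text()

print(metric_text.strip())  # 0.7500
```

The pairwise loop is quadratic, which is fine for a sketch; a real script would more likely call a library routine such as scikit-learn's `roc_auc_score`.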

In [25]:
! dvc run -n evaluate \
          -d src/evaluate.py -d model.pkl -d data/features \
          -M auc.metric \
          python src/evaluate.py model.pkl \
                 data/features auc.metric

ERROR: unexpected error - [Errno 1] Operation not permitted

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
Traceback (most recent call last):
  File "/Users/evgeniimunin/opt/anaconda3/bin/dvc", line 8, in <module>
    sys.exit(main())
  File "/Users/evgeniimunin/opt/anaconda3/lib/python3.8/site-packages/dvc/cli/__init__.py", line 207, in main
    if analytics.is_enabled():
  File "/Users/evgeniimunin/opt/anaconda3/lib/python3.8/site-packages/dvc/analytics.py", line 50, in is_enabled
    Config(validate=False).get("core", {}).get("analytics", "true")
  File "/Users/evgeniimunin/opt/anaconda3/lib/python3.8/site-packages/dvc/config.py", line 99, in __init__
    self.dvc_dir = Repo.find_dvc_dir()
  File "/Users/evgeniimunin/opt/anaconda3/lib/python3.8/site-packages/dvc/repo/__init__.py", line 354, in find_dvc_dir
    root_dir = cls.find_root(root)
  File "/Users/evgeniimunin/opt/anaconda3/lib/python3.8/si

In [None]:
%%bash
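git add dvc.yaml dvc.lock auc.metric
git commit -m "add evaluation step to the pipeline"
# tag this commit so that later cells can check out baseline-experiment
git tag -a "baseline-experiment" -m "baseline"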

### Experiments
The data science process is inherently iterative and R&D-like: a data scientist may try many different approaches and hyper-parameter values, and "fail" many times before the required level of a metric is achieved.

Here we modify the feature extraction to use bigrams, by increasing the number of features and the n-gram range in src/featurization.py.
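To see what widening the n-gram range does to the feature set, here is a small pure-Python sketch. `word_ngrams` is a hypothetical helper that only mimics the general idea behind the `ngram_range` parameter; it skips the tokenization and stop-word handling that CountVectorizer performs.

```python
from typing import List, Sequence, Tuple


def word_ngrams(tokens: Sequence[str], ngram_range: Tuple[int, int] = (1, 2)) -> List[str]:
    """Return all word n-grams for n from ngram_range[0] to ngram_range[1], inclusive."""
    low, high = ngram_range
    grams: List[str] = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = "data version control".split()
unigrams = word_ngrams(tokens, (1, 1))
with_bigrams = word_ngrams(tokens, (1, 2))

print(unigrams)      # ['data', 'version', 'control']
print(with_bigrams)  # ['data', 'version', 'control', 'data version', 'version control']
```

Because bigrams add many more candidate features, it makes sense to raise the feature cap at the same time as the n-gram range.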

In [None]:
%%writefile src/featurization.py
import os
import sys
import errno
import pandas as pd
import numpy as np
import scipy.sparse as sparse

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

try:
    import cPickle as pickle
except ImportError:
    import pickle

np.set_printoptions(suppress=True)

if len(sys.argv) != 3 and len(sys.argv) != 5:
    sys.stderr.write('Arguments error. Usage:\n')
    sys.stderr.write('\tpython featurization.py data-dir-path features-dir-path\n')
    sys.exit(1)

train_input = os.path.join(sys.argv[1], 'train.tsv')
test_input = os.path.join(sys.argv[1], 'test.tsv')
train_output = os.path.join(sys.argv[2], 'train.pkl')
test_output = os.path.join(sys.argv[2], 'test.pkl')

try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except NameError:
    pass


def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc:  # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise


def get_df(data):
    df = pd.read_csv(
        data,
        encoding='utf-8',
        header=None,
        delimiter='\t',
        names=['id', 'label', 'text']
    )
    sys.stderr.write('The input data frame {} size is {}\n'.format(data, df.shape))
    return df


def save_matrix(df, matrix, output):
    id_matrix = sparse.csr_matrix(df.id.astype(np.int64)).T
    label_matrix = sparse.csr_matrix(df.label.astype(np.int64)).T

    result = sparse.hstack([id_matrix, label_matrix, matrix], format='csr')

    msg = 'The output matrix {} size is {} and data type is {}\n'
    sys.stderr.write(msg.format(output, result.shape, result.dtype))

    with open(output, 'wb') as fd:
        pickle.dump(result, fd, pickle.HIGHEST_PROTOCOL)


mkdir_p(sys.argv[2])

# Generate train feature matrix
df_train = get_df(train_input)
train_words = np.array(df_train.text.str.lower().values.astype('U'))

bag_of_words = CountVectorizer(stop_words='english',
                               max_features=5000,
                               ngram_range=(1, 2))
bag_of_words.fit(train_words)
train_words_binary_matrix = bag_of_words.transform(train_words)
tfidf = TfidfTransformer(smooth_idf=False)
tfidf.fit(train_words_binary_matrix)
train_words_tfidf_matrix = tfidf.transform(train_words_binary_matrix)

save_matrix(df_train, train_words_tfidf_matrix, train_output)

# Generate test feature matrix
df_test = get_df(test_input)
test_words = np.array(df_test.text.str.lower().values.astype('U'))
test_words_binary_matrix = bag_of_words.transform(test_words)
test_words_tfidf_matrix = tfidf.transform(test_words_binary_matrix)

save_matrix(df_test, test_words_tfidf_matrix, test_output)

### Reproduce
We have now described our first pipeline. In essence, we defined a number of stages, each describing a single step we need to run towards the final result. Each stage depends on some data (either source data files or intermediate results produced by another stage) and on code files. dvc repro re-runs only the stages whose dependencies have changed.
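The decision of whether a stage needs to run again can be sketched as a hash comparison. This is a simplified illustration under the assumption that staleness is judged by file content; `digest`, `snapshot` and `is_stale` are hypothetical helpers, and the file content is a toy placeholder.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Sequence


def digest(path: Path) -> str:
    """Hex MD5 of a file's content."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def snapshot(deps: Sequence[Path]) -> Dict[str, str]:
    """Record the current hash of every dependency (the role a lock file plays)."""
    return {str(p): digest(p) for p in deps}


def is_stale(deps: Sequence[Path], lock: Dict[str, str]) -> bool:
    """A stage is stale when any dependency's hash differs from the recorded one."""
    return snapshot(deps) != lock


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "featurization.py"
    script.write_text("ngram_range = (1, 1)\n")   # toy content
    lock = snapshot([script])                     # state after the last run

    stale_before = is_stale([script], lock)       # nothing changed yet
    script.write_text("ngram_range = (1, 2)\n")   # edit the code dependency
    stale_after = is_stale([script], lock)        # now the stage must re-run

print(stale_before, stale_after)  # False True
```

A stale stage also produces new outputs, which in turn makes the stages that depend on those outputs stale, so a change propagates down the pipeline.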

In [26]:
# reproduce the pipeline up to the train stage (DVC 2.x takes the stage name, not train.dvc)
! dvc repro train

ERROR: unexpected error - [Errno 1] Operation not permitted

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
Traceback (most recent call last):
  File "/Users/evgeniimunin/opt/anaconda3/bin/dvc", line 8, in <module>
    sys.exit(main())
  File "/Users/evgeniimunin/opt/anaconda3/lib/python3.8/site-packages/dvc/cli/__init__.py", line 207, in main
    if analytics.is_enabled():
  File "/Users/evgeniimunin/opt/anaconda3/lib/python3.8/site-packages/dvc/analytics.py", line 50, in is_enabled
    Config(validate=False).get("core", {}).get("analytics", "true")
  File "/Users/evgeniimunin/opt/anaconda3/lib/python3.8/site-packages/dvc/config.py", line 99, in __init__
    self.dvc_dir = Repo.find_dvc_dir()
  File "/Users/evgeniimunin/opt/anaconda3/lib/python3.8/site-packages/dvc/repo/__init__.py", line 354, in find_dvc_dir
    root_dir = cls.find_root(root)
  File "/Users/evgeniimunin/opt/anaconda3/lib/python3.8/si

In [None]:
!git commit -a -m "bigram model"

In [None]:
! git checkout baseline-experiment
! dvc checkout

### Compare experiments
DVC makes it easy to iterate on your project using Git commits with tags or Git branches. It provides a way to try different ideas, keep track of them, switch back and forth. To find the best performing experiment or track the progress, a special metric output type is supported in DVC (described in one of the previous steps).

In [None]:
%%bash
git checkout master
dvc checkout
dvc repro evaluate

In [None]:
%%bash
git commit -a -m "evaluate bigram model"
git tag -a "bigram-experiment" -m "bigrams"

In [None]:
!dvc metrics show -T

### Get older data files
To get back an older version of a data file, use the dvc checkout command. We already briefly touched on switching between different data versions in the Experiments step of this get started guide.

In [None]:
# with DVC 2.x the recorded hashes live in dvc.lock, so restore that file and then the model
! git checkout baseline-experiment dvc.lock
! dvc checkout model.pkl

In [None]:
! git checkout baseline-experiment
! dvc checkout