In [29]:
! dvc init


You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|              https://dvc.org/doc/user-guide/analytics               |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: https://dvc.org/doc
- Get help and share ideas: https://dvc.org/chat
- Star us on GitHub: https://github.com/iterative/dvc

## Load data

In [30]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [31]:
! dvc get https://github.com/iterative/dataset-registry \
          get-started/data.xml -o data/data.xml
! dvc add data/data.xml


## Create a pipeline with several stages, for example:
*  prepare data
*  turn data into features
*  train models from features
*  evaluate models
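
The order of these stages follows from what each one reads and writes. The sketch below is an illustration only, not DVC's own code: it copies the `-d` dependencies and the outputs from the `dvc run` cells in this notebook into a plain dictionary (the `stages` dict and the `stage_order` helper are hypothetical names) and derives a valid execution order with the standard-library `graphlib` module, which needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies (-d) and outputs (-o, -M, --plots-no-cache) copied from the
# `dvc run` cells in this notebook; the scripts under src/ are left out.
stages = {
    "prepare": {"deps": ["data/data.xml"], "outs": ["data/prepared"]},
    "featurize": {"deps": ["data/prepared"], "outs": ["data/features"]},
    "train": {"deps": ["data/features"], "outs": ["model.pkl"]},
    "evaluate": {
        "deps": ["model.pkl", "data/features"],
        "outs": ["scores.json", "prc.json"],
    },
}


def stage_order(stages):
    """Return stage names in an order that respects output -> dependency links."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage must run after every stage that produces one of its inputs.
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = stage_order(stages)
print(order)
```

Because each stage consumes the previous stage's output, only one ordering is possible here. A pipeline with independent branches would admit several valid orders.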

In [32]:
! dvc run -f -n prepare \
             -p prepare.seed,prepare.split \
             -d src/prepare.py -d data/data.xml \
             -o data/prepared \
             python src/prepare.py data/data.xml

If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Running stage 'prepare' with command:
	python src/prepare.py data/data.xml
Creating 'dvc.yaml'
Adding stage 'prepare' in 'dvc.yaml'
Generating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml data/.gitignore dvc.lock

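The `-p prepare.seed,prepare.split` option asks DVC to track those two parameters for the stage. The notebook does not show `src/prepare.py` or the parameter values, so the following is only a guess at the kind of logic such a script might contain: a seeded random train/test split. The function name `split_rows`, the seed `42`, and the split `0.2` are all assumptions made for illustration. The featurize output later in this notebook reports 20017 training rows and 4983 test rows, which is at least consistent with a test fraction near 0.2.

```python
import random


def split_rows(rows, seed, split):
    """Assign each row to train or test; `split` is the test fraction."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (test if rng.random() < split else train).append(row)
    return train, test


# Hypothetical stand-in for the rows of data/data.xml.
rows = [f"row-{i}" for i in range(1000)]

# Assumed parameter values; the real ones are not shown in the notebook.
train, test = split_rows(rows, seed=42, split=0.2)
again_train, again_test = split_rows(rows, seed=42, split=0.2)

print(len(train), len(test))
```

Running the split twice with the same seed gives identical partitions, which is what makes a stage like this reproducible.
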
In [33]:
! dvc run -f -n featurize \
          -p featurize.max_features,featurize.ngrams \
          -d src/featurization.py -d data/prepared \
          -o data/features \
          python src/featurization.py data/prepared data/features

If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Running stage 'featurize' with command:
	python src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (20017, 3)
The output matrix data/features/train.pkl size is (20017, 502) and data type is float64
The input data frame data/prepared/test.tsv size is (4983, 3)
The output matrix data/features/test.pkl size is (4983, 502) and data type is float64
Adding stage 'featurize' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock data/.gitignore

In [34]:
! dvc run -f -n train \
          -p train.seed,train.n_estimators \
          -d src/train.py -d data/features \
          -o model.pkl \
          python src/train.py data/features model.pkl

If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Running stage 'train' with command:
	python src/train.py data/features model.pkl
Input matrix size (20017, 502)
X matrix size (20017, 500)
Y matrix size (20017,)
If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Adding stage 'train' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock dvc.yaml

In [35]:
! dvc run -f -n evaluate \
          -d src/evaluate.py -d model.pkl -d data/features \
          -M scores.json \
          --plots-no-cache prc.json \
          python src/evaluate.py model.pkl \
                 data/features scores.json prc.json

If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Running stage 'evaluate' with command:
	python src/evaluate.py model.pkl data/features scores.json prc.json
If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Adding stage 'evaluate' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock dvc.yaml

In [36]:
! dvc repro

If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Stage 'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

In [37]:
! dvc remote add -df myremote gs://dvc_intro
! git add .
! git commit -m 'run'
! git push origin data_pipelines

Setting 'myremote' as a default remote.
[data_pipelines d86de9b] run
 12 files changed, 646 insertions(+), 410 deletions(-)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvc/plots/confusion.json
 create mode 100644 .dvc/plots/default.json
 create mode 100644 .dvc/plots/scatter.json
 create mode 100644 .dvc/plots/smooth.json
 create mode 100644 data/.gitignore
 create mode 100644 data/data.xml.dvc
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml
 create mode 100644 prc.json
 rewrite test.ipynb (78%)
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 8 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (8/8), 4.42 KiB | 2.21 MiB/s, done.
Total 8 (delta 4), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (4/4), completed with 2 local objects.
To github.com:NichitaDiaconu/dvc_intro.git
   39a465d..d86de9b  data_pipelines -> data_pipelines


In [38]:
! cat scores.json

{"auc": 0.5417487597055675}
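
Since the metrics file is plain JSON, it can also be read from Python. The snippet below is a small sketch that parses the exact string printed above and reports how far the AUC sits above 0.5, the value expected from random ranking. The variable names are illustrative, and the notebook itself draws no conclusion about model quality.

```python
import json

# Contents of scores.json exactly as printed by `cat scores.json` above.
raw = '{"auc": 0.5417487597055675}'
scores = json.loads(raw)

# An AUC of 0.5 corresponds to random ranking; compute the margin above it.
margin = scores["auc"] - 0.5
print(f"auc={scores['auc']:.4f}, margin over chance={margin:.4f}")
```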

In [39]:
! dvc push


In [40]:
! dvc dag

    +-------------------+
    | data/data.xml.dvc |
    +-------------------+
              *
              *
              *
         +---------+
         | prepare |
         +---------+
              *
              *
              *
        +-----------+
        | featurize |
        +-----------+
         **        **
       **            *
      *               **
+-------+               *
| train |             **
+-------+            *
         **        **
           **    **

In [42]:
! rm -rf data/features
! rm -rf data/prepared
! rm -rf data/data.xml
! rm -rf model.pkl
! rm -rf .dvc/cache
! rm -rf .dvc/tmp

### We can have as many pipeline stages as we want
### Stages run in sequence, and each can be executed on a separate machine
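
The `dvc repro` cell earlier skipped every stage because nothing had changed. DVC records MD5 hashes of dependencies and outputs in `dvc.lock`, and comparing current hashes against recorded ones is the basic idea behind that decision. The sketch below illustrates the idea only. It is not DVC's actual implementation, and the helper names `md5_of` and `changed_deps`, along with the temporary file standing in for a dependency, are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """Hash a file's bytes in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_deps(recorded):
    """Return the paths whose current hash differs from the recorded one."""
    return [p for p, h in recorded.items() if md5_of(p) != h]


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "data.xml"  # hypothetical stand-in for a dependency
    dep.write_text("<row Id='1'/>")
    recorded = {str(dep): md5_of(dep)}  # plays the role of a lock file entry

    unchanged = changed_deps(recorded)  # nothing edited yet
    dep.write_text("<row Id='2'/>")     # simulate an edit to the dependency
    changed = changed_deps(recorded)

print(unchanged, len(changed))
```

A stage whose dependencies all match their recorded hashes can be skipped. One with a mismatch has to be rerun, together with the stages downstream of it.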