In [56]:
! dvc init


You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|              https://dvc.org/doc/user-guide/analytics               |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: https://dvc.org/doc
- Get help and share ideas: https://dvc.org/chat
- Star us on GitHub: https://github.com/iterative/dvc

## Load data

In [57]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [58]:
! dvc get https://github.com/iterative/dataset-registry \
          get-started/data.xml -o data/data.xml
! dvc add data/data.xml

0% Downloading|                                    |0/1 [00:00<?,     ?file/s]
  3%|▎         |get-started/data.xml      1.00M/36.1M [00:01<00:50,     728kB/s]

## Create a pipeline

The pipeline has four steps:

* prepare data
* turn data into features
* train models from features
* evaluate models
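
The four steps above form a dependency chain: each stage consumes what an earlier one produces. As a rough illustration (not DVC's own code), the sketch below encodes the dependencies declared with `-d` and `-o` in the `dvc run` cells of this notebook and derives a valid execution order with Python's standard `graphlib` module. The dictionary layout and the `execution_order` helper are assumptions made for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each stage maps to the set of nodes it depends on. The edges mirror the
# -d / -o flags in the `dvc run` cells; the dict itself is illustrative.
stage_deps = {
    "prepare": {"data/data.xml.dvc"},
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}


def execution_order(deps):
    """Return an order in which every stage runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(stage_deps)
print(order)
# ['data/data.xml.dvc', 'prepare', 'featurize', 'train', 'evaluate']
```

`dvc dag` displays the same graph, and `dvc repro` processes the stages in dependency order.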

In [59]:
! dvc run -f -n prepare \
             -p prepare.seed,prepare.split \
             -d src/prepare.py -d data/data.xml \
             -o data/prepared \
             python src/prepare.py data/data.xml

Running stage 'prepare' with command:
	python src/prepare.py data/data.xml
Creating 'dvc.yaml'
Adding stage 'prepare' in 'dvc.yaml'
Generating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml data/.gitignore dvc.lock
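
The `-p prepare.seed,prepare.split` flag declares two parameters as dependencies of the stage; by default DVC reads them from `params.yaml`. The sketch below shows how such dotted names map onto a nested structure. The parameter values and the `lookup_param` helper are hypothetical, since the notebook does not show the contents of `params.yaml`.

```python
def lookup_param(params, dotted_name):
    """Resolve a dotted name such as 'prepare.seed' in a nested dict."""
    value = params
    for key in dotted_name.split("."):
        value = value[key]
    return value


# Hypothetical contents of params.yaml, already parsed into a dict.
# The real values used in this notebook are not shown.
params = {"prepare": {"seed": 42, "split": 0.2}}

# The same comma-separated list that is passed to `-p` above.
names = "prepare.seed,prepare.split".split(",")
tracked = {name: lookup_param(params, name) for name in names}
print(tracked)
# {'prepare.seed': 42, 'prepare.split': 0.2}
```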

In [60]:
! dvc run -f -n featurize \
          -p featurize.max_features,featurize.ngrams \
          -d src/featurization.py -d data/prepared \
          -o data/features \
          python src/featurization.py data/prepared data/features

Running stage 'featurize' with command:
	python src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (20017, 3)
The output matrix data/features/train.pkl size is (20017, 502) and data type is float64
The input data frame data/prepared/test.tsv size is (4983, 3)
The output matrix data/features/test.pkl size is (4983, 502) and data type is float64
Adding stage 'featurize' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/.gitignore dvc.lock dvc.yaml

In [61]:
! dvc run -f -n train \
          -p train.seed,train.n_estimators \
          -d src/train.py -d data/features \
          -o model.pkl \
          python src/train.py data/features model.pkl

Running stage 'train' with command:
	python src/train.py data/features model.pkl
Input matrix size (20017, 502)
X matrix size (20017, 500)
Y matrix size (20017,)
Adding stage 'train' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock

In [62]:
! dvc run -f -n evaluate \
          -d src/evaluate.py -d model.pkl -d data/features \
          -M scores.json \
          --plots-no-cache prc.json \
          python src/evaluate.py model.pkl \
                 data/features scores.json prc.json

Running stage 'evaluate' with command:
	python src/evaluate.py model.pkl data/features scores.json prc.json
Adding stage 'evaluate' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock

In [63]:
! dvc repro

Stage 'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.
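
`dvc repro` skips a stage when nothing it depends on has changed since the hashes in `dvc.lock` were recorded. The following is a simplified sketch of that idea, not DVC's actual implementation: it hashes a dependency file, compares the result with a recorded value, and reports whether a rerun would be needed. The file name, the `lock` dictionary, and both helper functions are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Hash a file's contents in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_changed(dep_paths, recorded):
    """True if any dependency's current hash differs from the recorded one."""
    return any(file_md5(p) != recorded.get(str(p)) for p in dep_paths)


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "prepare.py"      # hypothetical dependency file
    dep.write_text("print('v1')\n")
    lock = {str(dep): file_md5(dep)}    # stand-in for the hashes in dvc.lock

    unchanged = stage_changed([dep], lock)   # nothing edited yet
    dep.write_text("print('v2')\n")          # simulate editing the script
    changed = stage_changed([dep], lock)

print(unchanged, changed)
# False True
```

In this notebook every stage had just been run by `dvc run`, so all recorded hashes still matched and each stage was skipped.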

In [65]:
! dvc remote add -df myremote gs://dvc_intro
! git add .
! git commit -m 'run'
! git push origin data_pipelines

On branch data_pipelines
nothing to commit, working tree clean
Enumerating objects: 13, done.
Counting objects: 100% (13/13), done.
Delta compression using up to 8 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (9/9), 9.42 KiB | 1.88 MiB/s, done.
Total 9 (delta 5), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (5/5), completed with 3 local objects.
To github.com:NichitaDiaconu/dvc_intro.git
   70bd6eb..9d3c7c6  data_pipelines -> data_pipelines


In [66]:
! cat scores.json

{"auc": 0.5417487597055675}
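
Because `scores.json` is plain JSON, the metric can also be read from Python rather than with `cat`. A minimal sketch, assuming the file has the single `auc` key shown above; the `load_metrics` helper is hypothetical, and the example writes its own copy of the file to a temporary directory so that it runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_metrics(path):
    """Parse a JSON metrics file into a dict."""
    return json.loads(Path(path).read_text())


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the scores.json written by the evaluate stage.
    metrics_file = Path(tmp) / "scores.json"
    metrics_file.write_text('{"auc": 0.5417487597055675}')
    metrics = load_metrics(metrics_file)

auc = metrics["auc"]
print(f"AUC = {auc:.3f}")
# AUC = 0.542
```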

In [67]:
! dvc push

0% Uploading|                                      |0/1 [00:00<?,     ?file/s]
100%|██████████|data/data.xml             36.1M/36.1M [00:05<00:00,    6.66MB/s]
1 file pushed

In [68]:
! dvc dag

    +-------------------+
    | data/data.xml.dvc |  
    +-------------------+  
              *            
              *            
              *            
         +---------+       
         | prepare |       
         +---------+       
              *            
              *            
              *            
        +-----------+      
        | featurize |      
        +-----------+      
         **        **      
       **            *     
      *               **   
+-------+               *  
| train |             **   
+-------+            *     
         **        **      
           **    **

In [69]:
! rm -rf data/features
! rm -rf data/prepared
! rm -rf data/data.xml
! rm -rf model.pkl
! rm -rf .dvc/tmp

### We can have as many pipeline steps as we want
### Steps run sequentially, and each can be computed on a separate machine