In [1]:
! dvc init


You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|              https://dvc.org/doc/user-guide/analytics               |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: https://dvc.org/doc
- Get help and share ideas: https://dvc.org/chat
- Star us on GitHub: https://github.com/iterative/dvc

## Load data

In [2]:
%load_ext dotenv
%dotenv

In [3]:
! dvc get https://github.com/iterative/dataset-registry \
          get-started/data.xml -o data/data.xml
! dvc add data/data.xml


## Create a pipeline

The pipeline has four stages:

* prepare data
* turn data into features
* train models from features
* evaluate models
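The four stages run in an order implied by which stage produces the files another stage depends on. The sketch below illustrates that idea with the standard-library `graphlib` module; the dependency and output paths are copied from the `dvc run` cells in this notebook, while the `STAGES` dict and the `stage_order` helper are illustrative names, not DVC's internal representation or API.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies and outputs copied from the `dvc run` commands in this notebook.
# Metrics and plots files are listed as outputs here for simplicity.
STAGES = {
    "prepare": {
        "deps": ["src/prepare.py", "data/data.xml"],
        "outs": ["data/prepared"],
    },
    "featurize": {
        "deps": ["src/featurization.py", "data/prepared"],
        "outs": ["data/features"],
    },
    "train": {
        "deps": ["src/train.py", "data/features"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "deps": ["src/evaluate.py", "model.pkl", "data/features"],
        "outs": ["scores.json", "prc.json"],
    },
}


def stage_order(stages):
    """Return stage names so that every producer precedes its consumers."""
    # Map each output path to the stage that writes it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on every stage that produces one of its dependencies.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(STAGES))
```

Because `evaluate` needs `model.pkl` from `train`, and `train` needs `data/features` from `featurize`, only one ordering is possible for this pipeline.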

In [5]:
! dvc run -f -n prepare \
          -p prepare.seed,prepare.split \
          -d src/prepare.py -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml

If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Running stage 'prepare' with command:
	python src/prepare.py data/data.xml
Creating 'dvc.yaml'
Adding stage 'prepare' in 'dvc.yaml'
Generating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/.gitignore dvc.lock dvc.yaml

In [4]:
! dvc run -f -n featurize \
          -p featurize.max_features,featurize.ngrams \
          -d src/featurization.py -d data/prepared \
          -o data/features \
          python src/featurization.py data/prepared data/features

Restored stage 'featurize' from run-cache
Skipping run, checking out outputs
Modifying stage 'featurize' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

In [5]:
! dvc run -f -n train \
          -p train.seed,train.n_estimators \
          -d src/train.py -d data/features \
          -o model.pkl \
          python src/train.py data/features model.pkl

Restored stage 'train' from run-cache
Skipping run, checking out outputs
Modifying stage 'train' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

In [6]:
! dvc run -f -n evaluate \
          -d src/evaluate.py -d model.pkl -d data/features \
          -M scores.json \
          --plots-no-cache prc.json \
          python src/evaluate.py model.pkl \
                 data/features scores.json prc.json

Stage is cached, skipping

In [7]:
! dvc repro

Stage 'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
Data and pipelines are up to date.

In [8]:
! dvc remote add -df myremote gs://dvc_intro
! git add .
! git commit -m 'run'
! git push origin data_pipelines

Setting 'myremote' as a default remote.
[data_pipelines 2020cf3] run
 1 file changed, 3 insertions(+), 3 deletions(-)
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 8 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 501 bytes | 501.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To github.com:NichitaDiaconu/dvc_intro.git
   85be72d..2020cf3  data_pipelines -> data_pipelines


In [9]:
! cat scores.json

{"auc": 0.5417487597055675}
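To put the score in context, the sketch below parses the metrics text shown above and reports how far the AUC sits above 0.5, the value expected from random guessing. The `auc_margin` helper and the `CHANCE_AUC` constant are illustrative names, not part of DVC or of this project's scripts.

```python
import json

# Metrics text copied from the `cat scores.json` output above.
raw = '{"auc": 0.5417487597055675}'
CHANCE_AUC = 0.5  # AUC of a classifier that guesses at random


def auc_margin(metrics_text):
    """Return how far the reported AUC sits above the random-guess baseline."""
    return json.loads(metrics_text)["auc"] - CHANCE_AUC


print(f"AUC margin over chance: {auc_margin(raw):.4f}")
```

A margin of roughly 0.04 suggests the model is only slightly better than chance. DVC also provides `dvc metrics show` for viewing tracked metrics files directly.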

In [10]:
! dvc push

Everything is up to date.

In [11]:
! dvc dag

    +-------------------+
    | data/data.xml.dvc |
    +-------------------+
              *
              *
              *
         +---------+
         | prepare |
         +---------+
              *
              *
              *
        +-----------+
        | featurize |
        +-----------+
         **        **
       **            *
      *               **
+-------+               *
| train |             **
+-------+            *
         **        **
           **    **

In [15]:
! rm -rf data/features
! rm -rf data/prepared
! rm -rf data/data.xml
! rm -rf model.pkl
! rm -rf .dvc/cache
! rm -rf .dvc/tmp
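The same cleanup can be done from Python, which makes it easier to see what was actually removed. This is a minimal sketch: the paths are copied from the `rm -rf` cell above, and `clean_workspace` is a hypothetical helper, not a DVC function.

```python
import shutil
from pathlib import Path

# Paths copied from the `rm -rf` cell above.
TARGETS = [
    "data/features",
    "data/prepared",
    "data/data.xml",
    "model.pkl",
    ".dvc/cache",
    ".dvc/tmp",
]


def clean_workspace(root=".", targets=TARGETS):
    """Delete each target under root and return the ones that existed."""
    removed = []
    for rel in targets:
        path = Path(root) / rel
        if path.is_dir() and not path.is_symlink():
            shutil.rmtree(path)  # real directory: remove it recursively
        elif path.exists() or path.is_symlink():
            path.unlink()  # regular file or symlink
        else:
            continue  # nothing to remove, same as `rm -rf` on a missing path
        removed.append(rel)
    return removed
```

Calling `clean_workspace()` from the repository root returns the list of paths it deleted. Since the outputs were already pushed to the remote, running `dvc pull` afterwards should restore them, assuming the remote is still reachable.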