## Load data

In [1]:
%load_ext dotenv
%dotenv

In [2]:
! dvc get https://github.com/iterative/dataset-registry \
          get-started/data.xml -o data/data.xml


## Create a pipeline

The pipeline consists of four stages:

* prepare the data
* turn the data into features
* train a model from the features
* evaluate the model

In [3]:
! dvc run -f -n prepare \
          -p prepare.seed,prepare.split \
          -d src/prepare.py -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml

[39m[1mLoading .env environment variables...[39m[22m
If DVC froze, see `hardlink_lock` in <[36mhttps://man.dvc.org/config#core[39m>Running stage 'prepare' with command:
	python src/prepare.py data/data.xml
Modifying stage 'prepare' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml
[0m

In [5]:
! dvc run -f -n featurize \
          -p featurize.max_features,featurize.ngrams \
          -d src/featurization.py -d data/prepared \
          -o data/features \
          python src/featurization.py data/prepared data/features

If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Running stage 'featurize' with command:
	python src/featurization.py data/prepared data/features
The input data frame data/prepared/train.tsv size is (20017, 3)
The output matrix data/features/train.pkl size is (20017, 502) and data type is float64
The input data frame data/prepared/test.tsv size is (4983, 3)
The output matrix data/features/test.pkl size is (4983, 502) and data type is float64
Modifying stage 'featurize' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

In [6]:
! dvc run -f -n train \
          -p train.seed,train.n_estimators \
          -d src/train.py -d data/features \
          -o model.pkl \
          python src/train.py data/features model.pkl

If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Running stage 'train' with command:
	python src/train.py data/features model.pkl
Input matrix size (20017, 502)
X matrix size (20017, 500)
Y matrix size (20017,)
If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Modifying stage 'train' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock

In [7]:
! dvc run -f -n evaluate \
          -d src/evaluate.py -d model.pkl -d data/features \
          -M scores.json \
          --plots-no-cache prc.json \
          python src/evaluate.py model.pkl \
                 data/features scores.json prc.json

If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Running stage 'evaluate' with command:
	python src/evaluate.py model.pkl data/features scores.json prc.json
If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Modifying stage 'evaluate' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock dvc.yaml
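
The four `dvc run` commands link the stages through their `-d` dependencies and their outputs: each stage consumes something an earlier stage produces. The sketch below reconstructs that ordering in plain Python. It only illustrates the idea and is not how DVC builds its graph internally. The `STAGES` dictionary and the `stage_order` helper are names made up for this example; the paths are copied from the commands above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies (-d) and outputs (-o, -M, --plots-no-cache) copied from the
# four `dvc run` commands in this notebook. STAGES is a hypothetical name.
STAGES = {
    "prepare": {
        "deps": ["src/prepare.py", "data/data.xml"],
        "outs": ["data/prepared"],
    },
    "featurize": {
        "deps": ["src/featurization.py", "data/prepared"],
        "outs": ["data/features"],
    },
    "train": {
        "deps": ["src/train.py", "data/features"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "deps": ["src/evaluate.py", "model.pkl", "data/features"],
        "outs": ["scores.json", "prc.json"],
    },
}


def stage_order(stages):
    """Return stage names so that every stage comes after the stages it depends on."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on another stage when one of its deps is that stage's output.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(STAGES))
# ['prepare', 'featurize', 'train', 'evaluate']
```

Because the stages form a single chain, there is only one valid order, which is why the result is deterministic here.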

In [8]:
! dvc repro

If DVC froze, see `hardlink_lock` in <https://man.dvc.org/config#core>
Verifying data sources in stage: 'data/data.xml.dvc'

Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping

To track the changes with git, run:

	git add data/data.xml.dvc
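
`dvc repro` skipped every stage because nothing had changed since the last run. A rough way to picture that check is to compare the current content hash of each dependency with a previously recorded one. The sketch below is a simplified illustration under that assumption, not DVC's implementation: the helper names (`content_hash`, `stage_changed`), the `recorded` dictionary standing in for `dvc.lock`, and the use of SHA-256 are all choices made for this example, and a real check would also have to cover parameters, the command and outputs.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path):
    """Hash a file's bytes. SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stage_changed(deps, recorded):
    """Return True if any dependency's current hash differs from the recorded one."""
    return any(recorded.get(str(dep)) != content_hash(dep) for dep in deps)


with tempfile.TemporaryDirectory() as tmp:
    # A throwaway file stands in for a real dependency such as src/prepare.py.
    dep = Path(tmp) / "prepare.py"
    dep.write_text("print('v1')\n")

    # Hypothetical stand-in for the hashes a lock file would record.
    recorded = {str(dep): content_hash(dep)}

    unchanged = stage_changed([dep], recorded)  # nothing edited yet
    dep.write_text("print('v2')\n")             # edit the dependency
    changed = stage_changed([dep], recorded)

print(unchanged, changed)
# False True
```

A stage whose check returns `False` can be skipped; one that returns `True` has to be rerun, along with the stages downstream of it.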

In [15]:
! dvc remote add -df myremote gs://dvc_intro
! git add .
! git commit -m 'run'
! git push origin data_pipelines

Setting 'myremote' as a default remote.
On branch data_pipelines
nothing to commit, working tree clean
Everything up-to-date


In [18]:
! cat scores.json

{"auc": 0.5417487597055675}
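
The metrics file is plain JSON, so it can also be read from Python instead of `cat`. The sketch below parses the exact line printed above. It embeds the string so that it runs on its own; in the notebook one would presumably open `scores.json` directly. What the `auc` value measures depends on how `src/evaluate.py` computes it, which this notebook does not show.

```python
import json

# The content of scores.json as printed by the cell above.
raw = '{"auc": 0.5417487597055675}'

scores = json.loads(raw)
auc = scores["auc"]

print(f"auc = {auc:.4f}")
# auc = 0.5417
```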

In [16]:
! dvc push


In [2]:
! dvc dag

