# DVC Pipelines

Orchestrating data science workflows and tracking computation artefacts and their lineage, using DVC.

## Initialise the Project

In [1]:
!dvc init --subdir

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

## Set Up a Remote Artefact Location

In [2]:
!dvc remote add -d s3 s3://dvc-example-artefacts/pipelines

Setting 's3' as a default remote.

## Define the Pipeline

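The pipeline is defined in `dvc.yaml`, which is reproduced below. Each stage declares its command (`cmd`), dependencies (`deps`), parameters (`params`), outputs (`outs`) and metrics, and this is all that DVC requires to track the various artefacts and metrics.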

In [3]:
!cat dvc.yaml

stages:
  get_data:
    cmd: python stages/get_data.py
    deps:
      - stages/get_data.py
    outs:
      - artefacts/dataset.csv
  train_model:
    cmd: python stages/train_model.py
    deps:
      - artefacts/dataset.csv
      - stages/train_model.py
    params:
      - train.random_state
    outs:
      - artefacts/model.joblib
    metrics:
      - metrics/metrics.json:
          cache: false
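
DVC orders stages by matching one stage's `outs` against another stage's `deps`. The snippet below is a minimal sketch of that idea in plain Python, not DVC's own implementation. The `stages` dictionary is a trimmed-down, hand-written copy of the definitions above, and `infer_edges` is a hypothetical helper introduced only for illustration.

```python
# Trimmed-down copy of the stages in dvc.yaml: only the artefact paths
# needed to infer ordering are kept (script dependencies are omitted).
stages = {
    "get_data": {"deps": [], "outs": ["artefacts/dataset.csv"]},
    "train_model": {
        "deps": ["artefacts/dataset.csv"],
        "outs": ["artefacts/model.joblib"],
    },
}


def infer_edges(stages):
    """Return (upstream, downstream) pairs where one stage consumes another's output."""
    # Map every declared output path to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    edges = []
    for name, spec in stages.items():
        for dep in spec["deps"]:
            upstream = producer.get(dep)
            if upstream is not None and upstream != name:
                edges.append((upstream, name))
    return edges


edges = infer_edges(stages)
print(edges)  # [('get_data', 'train_model')]
```

The single edge arises because `train_model` lists `artefacts/dataset.csv`, an output of `get_data`, among its dependencies.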
  

The implied DAG can be displayed as follows:

In [4]:
!dvc dag

  +----------+   
  | get_data |   
  +----------+   
        *        
        *        
        *        
+-------------+  
| train_model |  
+-------------+  

### Run the Pipeline

The pipeline can be run with one command:

In [6]:
!dvc repro

Stage 'get_data' didn't change, skipping
Running stage 'train_model':
> python stages/train_model.py
Updating lock file 'dvc.lock'                                                   

To track the changes with git, run:

    git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
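
The `get_data` stage was skipped because nothing it depends on had changed since the hashes in `dvc.lock` were recorded. The following is a much simplified sketch of hash-based change detection, assuming a plain dictionary as a stand-in for the lock file. The helpers `file_md5` and `stage_changed` are hypothetical and are not part of DVC's API.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def stage_changed(deps, recorded):
    """True if any dependency's current hash differs from the recorded one."""
    return any(file_md5(p) != recorded.get(str(p)) for p in deps)


# A temporary directory stands in for the project in this sketch.
with tempfile.TemporaryDirectory() as root:
    script = Path(root) / "get_data.py"
    script.write_text("print('v1')\n")

    # Stand-in for the lock file: dependency path -> hash at the last run.
    lock = {str(script): file_md5(script)}
    unchanged = stage_changed([script], lock)

    # Editing the dependency changes its hash, so the stage would be re-run.
    script.write_text("print('v2')\n")
    changed = stage_changed([script], lock)

print(unchanged, changed)  # False True
```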

### Version Control the Artefacts and Metrics

In [2]:
!git add dvc.lock
!git commit -m "Pipeline run #1"
!dvc push


fatal: pathspec 'artetacts/' did not match any files
[dvc 82ec75f] Pipeline run #1
 1 file changed, 3 insertions(+)
 create mode 100644 dvc-pipelines/metrics/metrics.json
2 files pushed

## Displaying Metrics

All metrics can be retrieved with one command.

In [8]:
!dvc metrics show

Path                  MAE
metrics/metrics.json  0.07772
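
The table above is built from the JSON file declared under `metrics` in `dvc.yaml`. The source of `stages/train_model.py` is not shown here, so the snippet below is only a sketch of how a stage might write such a file. It assumes a flat JSON object with a single `MAE` key, uses the rounded value from the table as a placeholder, and introduces a hypothetical helper named `write_metrics`.

```python
import json
import tempfile
from pathlib import Path


def write_metrics(path, mae):
    """Write a flat JSON metrics file containing a single MAE entry."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"MAE": mae}))
    return path


# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as root:
    metrics_file = write_metrics(Path(root) / "metrics" / "metrics.json", 0.07772)
    loaded = json.loads(metrics_file.read_text())

print(loaded)  # {'MAE': 0.07772}
```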