# Setting up a DVC + MLflow demo for tracking AI pipeline experiments

Define working directory

In [1]:
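%%capture
import os
# Move to the project root, i.e. the part of the current path before 'notebooks'.
wd_path = os.getcwd().split('notebooks')[0]
# normpath drops the trailing separator and leaves the path unchanged
# when 'notebooks' is not part of the current directory.
os.chdir(os.path.normpath(wd_path))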

In [18]:
!git init -q

Clone the GitHub repository into a temporary folder and take .dvc/config and data.dvc from it; both are needed to download the required files.

In [2]:
!git clone https://github.com/HarryKalantzopoulos/dvc_data_version.git .temp

Cloning into '.temp'...


In [3]:
%%bash
mv .temp/.dvc ./.dvc
mv .temp/data.dvc ./data.dvc
rm -rf .temp

# Download data with DVC

First inspect the tracking file with 'cat data.dvc', which works in both bash and PowerShell.

In [4]:
%%bash
cat data.dvc 

outs:
- md5: ba30de71e034b2e63036d2d2f122e82a.dir
  size: 86051558
  nfiles: 20
  path: data
  desc: A demo to data versioning
  type: .mha,.nii.gz
  meta:
    Images: T2
    mask: Whole_gland
    purpose: Prostate_Segmentation
    Author: Harry
    Provenance: PI-CAI challenge
    Image_source_url: https://zenodo.org/record/6624726
    Segmentation_source_url: https://github.com/DIAGNijmegen/picai_labels


In [5]:
!dvc pull

A       data\
1 file added and 20 files fetched


Remove the cache if you are running low on disk space.

In [6]:
%%bash
rm -rf .dvc/cache

# Use DVC to create a pipeline

In this section we build the pipeline for image segmentation. In general, DVC reruns only the stages affected by a change, which it detects by comparing md5 hashes. **dvc repro** will run a stage if any of the following, declared with **dvc stage add**, has changed:

    1. -p: a parameter the stage depends on (see params.yaml)

    2. -d: a dependency

    3. -o: an output

With a capital O (-O), DVC tracks the output but does not cache it. Use a lowercase o (-o) to cache the output, which lets dvc checkout restore the results that belong to an earlier git commit.
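
The md5-based change detection described above can be illustrated with a small sketch. This is a simplified model and not DVC's implementation: `file_md5`, `has_changed` and the file name are hypothetical, and DVC hashes directories (the `.dir` entries) in its own way.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the md5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, recorded_md5: str) -> bool:
    """True when the file's current md5 differs from the recorded one."""
    return file_md5(path) != recorded_md5


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "dependency.txt"  # hypothetical dependency file
    dep.write_text("image_size: 256\n")
    recorded = file_md5(dep)  # the hash a lock file would remember
    unchanged = has_changed(dep, recorded)  # same content, same hash
    dep.write_text("image_size: 512\n")
    changed = has_changed(dep, recorded)  # new content, new hash
```

In this model a stage would be rerun only when `has_changed` returns True for one of its dependencies, parameters or outputs.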

The first stage reads the metadata stored in data.dvc.

In [7]:
%%bash
dvc stage add -n read_dataset_info \
    -d code/read_DS_info.py -d data \
    -O  .temp/read_dvc.txt \
    python code/read_DS_info.py

Creating 'dvc.yaml'
Adding stage 'read_dataset_info' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


Next is *pipeline_create.py*, which creates a new experiment in MLflow. To stop MLflow tracking, set mlflow.activate to **False** in params.yaml.

To initiate a new experiment, set a different name; otherwise, rerunning the experiment after some changes will delete the previous runs. (The exception is forcing this stage to run again, which creates another experiment with the same name.)

In [8]:
%%bash
dvc stage add -n pipeline_create \
    -p params.yaml:mlflow.activate,mlflow.name \
    -d code/pipeline_create.py -d .temp/read_dvc.txt \
    -O  .temp/pipeline.txt \
    python code/pipeline_create.py

Adding stage 'pipeline_create' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


As shown above, DVC recommends adding and committing the files; you can make your commit at the end.

The step above creates **dvc.yaml**, which holds the information about the pipeline. As its output, the stage writes a txt file into a hidden folder (.temp); later stages depend on that file, which keeps the stages in sequence.

In [9]:
%%bash
cat dvc.yaml

stages:
  read_dataset_info:
    cmd: python code/read_DS_info.py
    deps:
    - code/read_DS_info.py
    - data
    outs:
    - .temp/read_dvc.txt:
        cache: false
  pipeline_create:
    cmd: python code/pipeline_create.py
    deps:
    - .temp/read_dvc.txt
    - code/pipeline_create.py
    params:
    - mlflow.activate
    - mlflow.name
    outs:
    - .temp/pipeline.txt:
        cache: false
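
The dotted names under `params:` point into the nested structure of params.yaml. The sketch below shows how such keys can be resolved, using a plain dict as a stand-in for the parsed file. The values and the helpers `resolve` and `tracked_values` are hypothetical and only illustrate the idea; this is not how DVC itself is written.

```python
from functools import reduce
from typing import Any

# Hypothetical stand-in for a parsed params.yaml; the values are made up.
params = {
    "mlflow": {"activate": True, "name": "demo_experiment"},
    "Prepare": {"kfold": 5},
}


def resolve(tree: dict, dotted_key: str) -> Any:
    """Walk a nested dict following a dotted key such as 'mlflow.name'."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), tree)


def tracked_values(tree: dict, keys: list[str]) -> dict:
    """Collect only the values a stage declared with -p."""
    return {key: resolve(tree, key) for key in keys}


stage_keys = ["mlflow.activate", "mlflow.name"]
stage_params = tracked_values(params, stage_keys)

# Editing a key the stage did not declare leaves its tracked values untouched.
params["Prepare"]["kfold"] = 10
after_edit = tracked_values(params, stage_keys)
```

Because `stage_params` and `after_edit` are equal, a stage that declares only the two mlflow keys would not be affected by the edit to Prepare.kfold.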


The next stage is preprocessing, which resamples and crops the images and converts them to 8-bit.

In [10]:
%%bash
dvc stage add -n Preprocess \
    -p params.yaml:Preprocess.image_size,Preprocess.resample,Preprocess.maskcrop,Preprocess.8bit,mlflow.activate,mlflow.name \
    -d code/preprocess.py -d data -d .temp/pipeline.txt \
    -O preprocess/dataset.csv -O preprocess/images -O preprocess/masks \
    python code/preprocess.py

Adding stage 'Preprocess' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


After the previous step has executed, code/return_md5.py runs to collect the md5 hashes computed by DVC and the Python packages in use. This information is uploaded to MLflow.

In [11]:
%%bash
dvc stage add -n md5_Preprocess \
    -p params.yaml:Preprocess.image_size,Preprocess.resample,Preprocess.maskcrop,Preprocess.8bit,mlflow.activate,mlflow.name \
    -d code/preprocess.py -d preprocess/images -d preprocess/masks \
    -O .temp/Preprocess.txt \
    python  code/return_md5.py "Preprocess"

Adding stage 'md5_Preprocess' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true
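
Collecting the Python packages in use could look roughly like the sketch below. It is not the content of code/return_md5.py, which is not shown here; `installed_packages` is a hypothetical helper built on the standard library's `importlib.metadata`.

```python
from importlib import metadata


def installed_packages() -> dict[str, str]:
    """Map each installed distribution name to its version string."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that carry no name
            packages[name] = dist.version
    return packages


packages = installed_packages()
# Render in requirements.txt style, e.g. for logging as a text artifact.
requirements = "\n".join(
    f"{name}=={version}" for name, version in sorted(packages.items())
)
```

The resulting text could then be attached to a run together with the hashes, so that each result can be traced back to the environment that produced it.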


K-fold split

In [12]:
%%bash
dvc stage add -n Prepare \
    -p params.yaml:Prepare.kfold,mlflow.activate,mlflow.name \
    -d code/prepare.py -d preprocess/dataset.csv -d .temp/Preprocess.txt \
    -O prepared/kfold.json \
    python code/prepare.py

Adding stage 'Prepare' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true
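
For readers unfamiliar with k-fold splitting, here is a minimal standard-library sketch. It assumes a list of case identifiers and a simple JSON layout; both are hypothetical, and the actual code/prepare.py and the structure of prepared/kfold.json may differ.

```python
import json
import random


def kfold_split(items: list[str], k: int, seed: int = 0) -> list[dict]:
    """Shuffle items once, then use each of k slices as a validation fold."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # near-equal slices
    return [
        {
            "fold": i,
            "train": [x for j, fold in enumerate(folds) if j != i for x in fold],
            "val": folds[i],
        }
        for i in range(k)
    ]


case_ids = [f"case_{n:02d}" for n in range(10)]  # hypothetical identifiers
splits = kfold_split(case_ids, k=5)
kfold_json = json.dumps(splits, indent=2)
```

Each case appears in exactly one validation fold, and a fixed seed keeps the split reproducible, so the output hash only changes when the inputs or k change.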


In [13]:
%%bash
dvc stage add -n md5_Prepare \
    -p params.yaml:Prepare.kfold,mlflow.activate,mlflow.name \
    -d code/prepare.py \
    -d prepared/kfold.json \
    -O .temp/Prepare.txt \
    python code/return_md5.py "Prepare"

Adding stage 'md5_Prepare' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


Training and evaluation stage (this is a mock-up, with only 1 epoch)

In [14]:
%%bash
dvc stage add -fn Train \
    -p  params.yaml:Preprocess.image_size,mlflow.activate,mlflow.name \
    -p  model.filters,model.architecture,model.loss,model.optimiser,model.metric \
    -p  model.Number_inputs,model.Number_labels,model.layer_activation \
    -p  model.activation,model.dilation  \
    -p  Train.zscore,Train.batch_size,Train.epoch \
    -d code/train.py -d prepared/kfold.json -d .temp/Prepare.txt \
    -O model \
    python code/train.py

Adding stage 'Train' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


In [15]:
%%bash
dvc stage add -n md5_Train \
    -p  params.yaml:Preprocess.image_size,mlflow.activate,mlflow.name \
    -p  model.filters,model.architecture,model.loss,model.optimiser,model.metric \
    -p  model.Number_inputs,model.Number_labels,model.layer_activation \
    -p  model.activation,model.dilation  \
    -p  Train.zscore,Train.batch_size,Train.epoch \
    -d code/train.py \
    -d model \
    python code/return_md5.py "Train"

Adding stage 'md5_Train' in 'dvc.yaml'

To track the changes with git, run:

	git add dvc.yaml

To enable auto staging, run:

	dvc config core.autostage true


Finally, the DAG of the pipeline:

In [16]:
!dvc dag

                  +----------+         
                  | data.dvc |         
                  +----------+         
                 **           **       
               **               **     
             **                   **   
+-------------------+               ** 
| read_dataset_info |                * 
+-------------------+                * 
          *                          * 
          *                          * 
          *                          * 
 +-----------------+                ** 
 | pipeline_create |              **   
 +-----------------+            **     
                 **           **       
                   **       **         
                     **   **           
                 +------------+        
                 | Preprocess |        
                 +------------+        
                 **          ***       
               **               *      
             **                  ***   
  +----------------+                *  
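
The stage order can also be derived from the -d and -O flags used in the dvc stage add commands above. The sketch below writes those dependencies down as a plain dict and sorts them with the standard library's `graphlib` (Python 3.9+). The data.dvc node is left out, and the dict is a hand-written reading of the commands, not something exported by DVC.

```python
from graphlib import TopologicalSorter

# stage -> stages whose outputs it lists as dependencies
stage_deps = {
    "read_dataset_info": set(),
    "pipeline_create": {"read_dataset_info"},   # .temp/read_dvc.txt
    "Preprocess": {"pipeline_create"},          # .temp/pipeline.txt
    "md5_Preprocess": {"Preprocess"},           # preprocess/images, masks
    "Prepare": {"Preprocess", "md5_Preprocess"},
    "md5_Prepare": {"Prepare"},                 # prepared/kfold.json
    "Train": {"Prepare", "md5_Prepare"},
    "md5_Train": {"Train"},                     # model
}

# A valid execution order: every stage comes after its dependencies.
order = list(TopologicalSorter(stage_deps).static_order())
```

With these dependencies only one order is possible, because every stage depends, directly or indirectly, on the one declared before it.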


The pipeline is ready, so we can proceed with the git commits. To execute the pipeline, use dvc repro. One inconvenience of this DVC + MLflow approach is the timing of the git commit.

MLflow runs inside each stage and records the last commit that exists at that moment. The changes made by DVC, however, are committed only after dvc repro has finished.
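
To make the timing concrete: the hash available inside a stage is whatever HEAD points to while the stage runs. A minimal sketch, assuming git is on the PATH; `current_commit` is a hypothetical helper and is not taken from the repository's code.

```python
import subprocess
from typing import Optional


def current_commit() -> Optional[str]:
    """Return the hash of HEAD, or None when it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git missing, not a repository, or no commit made yet
        return None
    return result.stdout.strip()


commit = current_commit()
```

Called during dvc repro, this returns the commit made before the run, so the files that DVC updates during the run end up in a later commit.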

You can now run **dvc repro**. If you want to register the best model (in this case, simply the one with the best IoU score in the evaluation phase), use **code/register_model.py**.

In [None]:
# %%bash
# dvc repro
# python3 code/register_model.py

Or you can use the Dockerfile

In [None]:
# %%bash
# docker build -t demo_dvc_mlflow .
# docker run --network=host demo_dvc_mlflow