## Object Detection Pipeline on UCS using Darknet & YOLO

This notebook implements object detection as a Kubeflow pipeline on Cisco UCS. It uses Darknet, an open-source neural network framework, and YOLO (You Only Look Once), a real-time object detection system.

## Clone Cisco Kubeflow starter pack repository

In [1]:
BRANCH_NAME="dev" #Provide git branch "master" or "dev"
! git clone -b $BRANCH_NAME https://github.com/CiscoAI/cisco-kubeflow-starter-pack.git

Cloning into 'cisco-kubeflow-starter-pack'...
remote: Enumerating objects: 267, done.
remote: Counting objects: 100% (267/267), done.
remote: Compressing objects: 100% (130/130), done.
remote: Total 7239 (delta 131), reused 191 (delta 84), pack-reused 6972
Receiving objects: 100% (7239/7239), 46.41 MiB | 32.16 MiB/s, done.
Resolving deltas: 100% (2995/2995), done.


## Install required packages

In [2]:
!pip install kfp==1.0.1 pillow==7.2.0 mlflow==1.13.1 --user

Collecting kfp==1.0.1
  Downloading kfp-1.0.1.tar.gz (116 kB)
Collecting pillow==7.2.0
  Downloading Pillow-7.2.0-cp36-cp36m-manylinux1_x86_64.whl (2.2 MB)
Collecting mlflow==1.13.1
  Downloading mlflow-1.13.1-py3-none-any.whl (14.1 MB)
Collecting requests_toolbelt>=0.8.0
  Downloading requests_toolbelt-0.9.1-py2.py3-none-any.whl (54 kB)
Collecting kfp-server-api<2.0.0,>=0.2.5
  Downloading kfp-server-api-1.3.0.tar.gz (54 kB)
Collecting tabulate
  Downloading tabulate-0.8.7-py3-none-any.whl (24 kB)
Collecting click
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)

## Restart kernel

In [None]:
from IPython.display import display_html
display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)

## Import libraries

In [10]:
import os
import json
import time
import yaml
import calendar
import requests
import logging
import numpy as np
from PIL import Image, ImageDraw

#Kubeflow
import kfp
from kfp.aws import use_aws_secret
import kfp.compiler as compiler

#Kubernetes
from kubernetes import client

#Tensorflow
import tensorflow as tf
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

#MLFlow
import mlflow

## Load pipeline components

Declare the path to each pipeline component's YAML definition file, then load each component into a variable for use in the pipeline.

In [2]:
path='cisco-kubeflow-starter-pack/apps/computer-vision/object-detection/onprem/pipeline/components/v2/'
component_root_dwn= path+'download/'
component_root_katib= path+'katib/'
component_root_train= path+'train/'
component_root_validate= path+'validate/'
component_root_cleanup=path+'cleanup/'
component_root_convert_ncnn=path+'conversion_ncnn/'

download_op = kfp.components.load_component_from_file(os.path.join(component_root_dwn, 'component.yaml'))
hptuning_op = kfp.components.load_component_from_file(os.path.join(component_root_katib, 'component.yaml'))
train_op = kfp.components.load_component_from_file(os.path.join(component_root_train, 'component.yaml'))
validate_op = kfp.components.load_component_from_file(os.path.join(component_root_validate, 'component.yaml'))
convert_ncnn_op=kfp.components.load_component_from_file(os.path.join(component_root_convert_ncnn, 'component.yaml'))
cleanup_op=kfp.components.load_component_from_file(os.path.join(component_root_cleanup, 'component.yaml'))
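
The six near-identical path assignments above could also be generated from a list of directory names. The following is a minimal sketch using only the standard library: it builds the `component.yaml` paths and reports any that are missing before a component is loaded. The helper name `component_yaml_paths` is hypothetical, and the loading itself would still go through `kfp.components.load_component_from_file` as in the cell above.

```python
import os

# Root directory of the pipeline components, as used in the cell above.
COMPONENTS_ROOT = (
    "cisco-kubeflow-starter-pack/apps/computer-vision/object-detection/"
    "onprem/pipeline/components/v2/"
)

# Component sub-directories referenced in this notebook.
COMPONENT_DIRS = ["download", "katib", "train", "validate", "conversion_ncnn", "cleanup"]


def component_yaml_paths(root, names):
    """Hypothetical helper: map each component directory to its component.yaml path."""
    return {name: os.path.join(root, name, "component.yaml") for name in names}


paths = component_yaml_paths(COMPONENTS_ROOT, COMPONENT_DIRS)

# Report missing files early, before any component is loaded.
missing = [p for p in paths.values() if not os.path.isfile(p)]
for p in missing:
    print("Missing component definition:", p)
```

If the repository was not cloned into the current working directory, every path is reported as missing, which is usually easier to diagnose than a failure inside the component loader.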

## Define volume claim & volume mount for storage during pipeline execution

A persistent volume claim and a volume mount are created to store the dataset, model files, and other artifacts, and to share them between the pipeline components during execution.

In [3]:
nfs_pvc = client.V1PersistentVolumeClaimVolumeSource(claim_name='nfs')
nfs_volume = client.V1Volume(name='nfs', persistent_volume_claim=nfs_pvc)
nfs_volume_mount = client.V1VolumeMount(mount_path='/mnt/', name='nfs')

## Specify Katib hyperparameters configuration

Specify your custom hyperparameter configuration as a dict, as shown below. 
If your configuration is in YAML format, use [this site](https://codebeautify.org/yaml-to-json-xml-csv) to convert it to JSON/dict format.

Use the [YAML sample](cisco-kubeflow-starter-pack/apps/computer-vision/object-detection/onprem/pipeline/sample_tuning_spec.yaml) of a partial tuning spec for reference.

To convert in the other direction (from JSON to YAML), use [this site](https://codebeautify.org/json-to-yaml). This is useful when manually modifying the ```trialTemplate``` section in the Katib component's source code.


#### Note: 

- Provide only the parts of the Katib Experiment spec that cover the objective, algorithm, trial counts, and hyperparameters, as shown below.
- Even if you specify maxTrialCount in your configuration dict, its value is overridden by the value of the 'trials' variable.
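
The override described in the second note can be pictured with a small sketch. This only mirrors the behaviour stated above; it is not the Katib component's actual source, and the helper name `with_trial_count` and the sample value of 10 are hypothetical.

```python
import copy


def with_trial_count(spec, trials):
    """Hypothetical helper: return a copy of the spec with maxTrialCount forced to `trials`."""
    result = copy.deepcopy(spec)
    result["maxTrialCount"] = trials
    return result


partial_spec = {
    "objective": {"type": "minimize", "goal": 0.4, "objectiveMetricName": "loss"},
    "algorithm": {"algorithmName": "random"},
    "maxTrialCount": 10,  # illustrative value; replaced by the pipeline's `trials`
    "parallelTrialCount": 5,
    "maxFailedTrialCount": 3,
}

effective_spec = with_trial_count(partial_spec, trials=2)
print(effective_spec["maxTrialCount"])  # 2
```

In practice this means the trial count is best controlled through the `trials` pipeline parameter alone.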

In [6]:
tuning_spec={
   "objective":{
      "type":"minimize",
      "goal":0.4,
      "objectiveMetricName":"loss"
   },
   "algorithm":{
      "algorithmName":"random"
   },
   "parallelTrialCount":5,
   "maxFailedTrialCount":3,
   "parameters":[
      {
         "name":"--momentum",
         "parameterType":"double",
         "feasibleSpace":{
            "min":"0.88",
            "max":"0.92"
         }
      },
      {
         "name":"--decay",
         "parameterType":"double",
         "feasibleSpace":{
            "min":"0.00049",
            "max":"0.00052"
         }
      }
   ]
}
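
To give a feel for what the `random` algorithm does with the `feasibleSpace` entries, the sketch below draws one value per hyperparameter uniformly between `min` and `max`. It illustrates uniform random search in general and is not Katib's suggestion service; the helper `sample_trial` is hypothetical. Note that the bounds are strings in the spec and need converting to floats.

```python
import random

# The "parameters" section of the tuning spec above.
parameters = [
    {"name": "--momentum", "parameterType": "double",
     "feasibleSpace": {"min": "0.88", "max": "0.92"}},
    {"name": "--decay", "parameterType": "double",
     "feasibleSpace": {"min": "0.00049", "max": "0.00052"}},
]


def sample_trial(params, rng):
    """Draw one value per parameter, uniformly within its feasible space."""
    assignment = {}
    for param in params:
        low = float(param["feasibleSpace"]["min"])
        high = float(param["feasibleSpace"]["max"])
        assignment[param["name"]] = rng.uniform(low, high)
    return assignment


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for trial in range(2):
    print(trial, sample_trial(parameters, rng))
```

Each printed assignment corresponds to the arguments one trial would receive, for example a `--momentum` value somewhere between 0.88 and 0.92.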

## Convert configuration to JSON serialized string format

In [8]:
str_tuning_spec = json.dumps(tuning_spec,separators=(',',':'))
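
The `separators=(',',':')` argument removes the spaces that `json.dumps` inserts by default, which yields a compact single-token string that is easier to pass as a pipeline argument. A minimal sketch of the difference, using a shortened spec for illustration:

```python
import json

# Shortened spec, for illustration only.
spec = {"objective": {"type": "minimize", "goal": 0.4}, "parallelTrialCount": 5}

default_form = json.dumps(spec)
compact_form = json.dumps(spec, separators=(",", ":"))

print(default_form)  # {"objective": {"type": "minimize", "goal": 0.4}, "parallelTrialCount": 5}
print(compact_form)  # {"objective":{"type":"minimize","goal":0.4},"parallelTrialCount":5}

# The compact string decodes back to the original dict.
restored = json.loads(compact_form)
```

Both forms carry the same data; only the whitespace differs.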

## Define pipeline function

In [9]:
gpus=2 # Number of GPUs to run training

def object_detection_pipeline(
    s3_path="s3://object-det-test/001",        # AWS S3 bucket URL. Ex: s3://<bucket-name>/ 
    user_namespace='anonymous',                # User Namespace 
    timestamp="",                              # Current timestamp
    cfg_data="voc.data",                       # Config file containing file name specifications of train, test and validate datasets
    cfg_file="yolov3-voc.cfg",                 # Config file containing hyperparameters declarations Ex: yolov3.cfg / yolov4.cfg
    weights="yolov3-voc_50000.weights",        # Weights already pre-trained up to 50000 iterations are used. Therefore,
                                               # training runs from 50000 iterations up to the max_batches limit (say 50200)
                                               # specified in cfg_file.
    trials=2,                                  # Total number of trials under Katib experiment
    gpus_per_trial=1,                          # Maximum GPUs to be used for each trial
    classes_file="voc.names",                  # File containing the names of object classes (such as person, bus, car,etc)
    trained_weights="yolov3-voc_final.weights",# Trained output weights to proceed with validation
    max_batches=50050                          # Max batches/No of iterations for training specified in config file
):
    # Download component
    dwn_task = download_op(s3_path=s3_path,
                           timestamp=timestamp,
                           cfg_data=cfg_data,
                           user_namespace=user_namespace
                          ).apply(use_aws_secret(secret_name='aws-secret', aws_access_key_id_name='AWS_ACCESS_KEY_ID', aws_secret_access_key_name='AWS_SECRET_ACCESS_KEY'))
    dwn_task.add_volume(nfs_volume)
    dwn_task.add_volume_mount(nfs_volume_mount) 
    
    # HP tuning (Katib) component
    hptuning_task = hptuning_op(s3_path=s3_path,
                                cfg_data=cfg_data,             # Config file containing file name specifications of train, test and validate datasets
                                cfg_file=cfg_file,             # Config file containing hyperparameters declarations Ex: yolov3.cfg / yolov4.cfg
                                weights=weights,               # Pre-trained weights for VOC dataset
                                trials=trials,                 # total number of trials under Katib experiment
                                timestamp=timestamp,           # Current timestamp to create unique experiment 
                                                               # Ex: object-detection-1599547688-random-588d7877f5-zvlx5
                                gpus_per_trial=gpus_per_trial, # Maximum GPUs to be used for each trial 
                                max_batches=max_batches,       # Provide Max Batches
                                experiment_spec=str_tuning_spec # JSON serialized tuning spec
                                ).apply(use_aws_secret(secret_name='aws-secret', aws_access_key_id_name='AWS_ACCESS_KEY_ID', aws_secret_access_key_name='AWS_SECRET_ACCESS_KEY'))
    hptuning_task.add_volume(nfs_volume)
    hptuning_task.add_volume_mount(nfs_volume_mount)
    hptuning_task.after(dwn_task)
    
    # Train component
    train_task = train_op(s3_path=s3_path,
                          cfg_data=cfg_data,          # Config file containing file name specifications of train, test and validate datasets
                          cfg_file=cfg_file,          # Config file containing hyperparameters declarations Ex: yolov3.cfg / yolov4.cfg
                          weights=weights,            # Pre-trained weights for VOC dataset
                          gpus=gpus,             
                          timestamp=timestamp,
                          user_namespace=user_namespace
                         ).apply(use_aws_secret(secret_name='aws-secret', aws_access_key_id_name='AWS_ACCESS_KEY_ID', aws_secret_access_key_name='AWS_SECRET_ACCESS_KEY'))
    train_task.add_volume(nfs_volume)
    train_task.add_volume_mount(nfs_volume_mount).set_gpu_limit(gpus)  #Maximum GPUs to be used for training
    train_task.after(hptuning_task)
    
    # Validation component
    validate_task = validate_op(s3_path=s3_path,
                                cfg_data=cfg_data,          # Config file containing file name specifications of train, test and validate datasets
                                cfg_file=cfg_file,          # Config file containing hyperparameters declarations Ex: yolov3.cfg / yolov4.cfg
                                timestamp=timestamp,
                                trained_weights=trained_weights
                                ).apply(use_aws_secret(secret_name='aws-secret', aws_access_key_id_name='AWS_ACCESS_KEY_ID', aws_secret_access_key_name='AWS_SECRET_ACCESS_KEY'))
    validate_task.add_volume(nfs_volume)
    validate_task.add_volume_mount(nfs_volume_mount)
    validate_task.after(train_task)
    
    
    # Darknet to ncnn conversion component
    conversion_ncnn_task = convert_ncnn_op(push_to_s3="true",  # Flag (true/false): upload the trained weights and converted
                                           s3_path=s3_path,    # ncnn model to the S3 bucket for future inference in another
                                                               # environment, or proceed directly to cleanup on UCS
                                           cfg_data=cfg_data,         
                                           cfg_file=cfg_file,
                                           weight_file=trained_weights,
                                           timestamp=timestamp
                                             ).apply(use_aws_secret(secret_name='aws-secret', aws_access_key_id_name='AWS_ACCESS_KEY_ID', aws_secret_access_key_name='AWS_SECRET_ACCESS_KEY'))
    conversion_ncnn_task.add_volume(nfs_volume)
    conversion_ncnn_task.add_volume_mount(nfs_volume_mount)
    conversion_ncnn_task.after(validate_task)
    
    # Clean up component
    cleanup_task = cleanup_op(timestamp=timestamp,
                             user_namespace=user_namespace).apply(use_aws_secret(secret_name='aws-secret', aws_access_key_id_name='AWS_ACCESS_KEY_ID', aws_secret_access_key_name='AWS_SECRET_ACCESS_KEY'))
    cleanup_task.add_volume(nfs_volume)
    cleanup_task.add_volume_mount(nfs_volume_mount)
    cleanup_task.after(conversion_ncnn_task)  

## Compile pipeline function

Compile the pipeline function to create a tarball for the pipeline.

In [9]:
# Compile pipeline
try:
    compiler.Compiler().compile(object_detection_pipeline, 'object-detection.tar.gz')
except RuntimeError as err:
    logging.debug(err)
    logging.info("Argo workflow failed validation check but it can still be used to run experiments.")



## Create pipeline experiment

In [10]:
kp_client = kfp.Client()
EXPERIMENT_NAME = 'Object Detection'
experiment = kp_client.create_experiment(name=EXPERIMENT_NAME)

## Create timestamp

In [11]:
timestamp = str(calendar.timegm(time.gmtime()))
timestamp

'1613456866'
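
The timestamp is the number of whole seconds since the Unix epoch in UTC, and it is used to give each run a unique name. As a small sketch, the value shown above can be decoded back into a readable date with the standard library:

```python
import calendar
import time
from datetime import datetime, timezone

# Same expression as the cell above: whole seconds since the Unix epoch, in UTC.
now = calendar.timegm(time.gmtime())

# Decode the timestamp shown in the output above into a readable UTC time.
decoded = datetime.fromtimestamp(1613456866, tz=timezone.utc)
print(decoded.isoformat())  # 2021-02-16T06:27:46+00:00
```

Because the resolution is one second, two runs started within the same second would share a timestamp.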

## Initialize pipeline parameters & run pipeline

In [12]:
#Pipeline parameters
s3_path="s3://object-det-test/002"
user_namespace='anonymous'
cfg_data="voc.data"              
cfg_file="yolov3-voc.cfg"        
weights="yolov3-voc_50000.weights"
classes_file="voc.names"    
trained_weights="yolov3-voc_final.weights" 
trials=1
gpus_per_trial=2


run_name = 'object-detection-'+timestamp

# Execute pipeline
run = kp_client.run_pipeline(experiment.id, run_name,'object-detection.tar.gz', 
                          params={"s3_path": s3_path,
                                  "user_namespace": user_namespace,
                                  "cfg_data": cfg_data,
                                  "cfg_file": cfg_file,
                                  "weights": weights,
                                  "classes_file": classes_file,
                                  "trained_weights": trained_weights,
                                  "timestamp": timestamp,
                                  "trials": trials,
                                  "gpus_per_trial": gpus_per_trial})

## Retrieve current pipeline run ID

In [13]:
run_id = str(run.id)
run_id

'dab062ef-4d62-4ba8-898a-b42d9c0eb3a0'

## Pipeline parameters/metrics logging using MLFlow

MLFlow's tracking component is used to log input parameters as well as target metrics from the executed pipeline run. 

Proceed with logging only after the pipeline run has completed.

### Set MLFlow tracking server URI

The tracking server URI is set so that runs are logged to a remote location. 

For an external cluster, try http://INGRESS_IP:INGRESS_NODEPORT/mlflow-dashboard/

In [14]:
mlflow.set_tracking_uri("http://mlflow-service.kubeflow.svc.cluster.local:80")
tracking_uri = mlflow.get_tracking_uri()
print("Current tracking uri: {}".format(tracking_uri))

Current tracking uri: http://mlflow-service.kubeflow.svc.cluster.local:80


### Retrieve metrics from pipeline run

In [15]:
metrics = kp_client.get_run(run.id).run.metrics
for metric in metrics:
    metric_element = metric.to_dict()
    if metric_element['name'] == "f1-score":
        f1_score = metric_element['number_value']
    elif metric_element['name'] == 'map-score':
        map_score = metric_element['number_value']
    elif metric_element['name'] == 'precision-score':
        precision_score = metric_element['number_value']
    else:
        # Any metric name not matched above is treated as the recall score
        recall_score = metric_element['number_value']
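
Since the final `else` branch assigns any unmatched metric to `recall_score`, an unexpected extra metric could overwrite the recall value. A possible alternative, sketched below with plain dicts standing in for `metric.to_dict()`, collects only the expected names into a dictionary. The names `f1-score`, `map-score` and `precision-score` come from the cell above; `recall-score` is an assumption, the sample values are placeholders rather than real results, and `collect_scores` is a hypothetical helper.

```python
# Stand-ins for the dicts returned by metric.to_dict(); values are placeholders.
sample_metrics = [
    {"name": "f1-score", "number_value": 0.5},
    {"name": "map-score", "number_value": 0.25},
    {"name": "precision-score", "number_value": 0.75},
    {"name": "recall-score", "number_value": 0.125},  # metric name is an assumption
    {"name": "some-other-metric", "number_value": 1.0},
]

# Metric names expected from the run.
WANTED = ("f1-score", "map-score", "precision-score", "recall-score")


def collect_scores(metric_dicts, wanted=WANTED):
    """Return {metric name: value} for the wanted names, skipping everything else."""
    return {
        m["name"]: m["number_value"]
        for m in metric_dicts
        if m["name"] in wanted
    }


scores = collect_scores(sample_metrics)
print(scores)
```

A missing metric then shows up as an absent key, which can be checked before logging, instead of as an undefined variable later on.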

### Log input parameters and metrics of pipeline run to MLFlow

Logged parameters & metrics can be viewed on MLFlow UI at http://INGRESS_IP:INGRESS_NODEPORT/mlflow-dashboard/

In [16]:
#Create MLFlow experiment named after the pipeline run (run_name)
experiment_id = mlflow.create_experiment(run_name)
experiment = mlflow.get_experiment(experiment_id)

#Log params & metrics
with mlflow.start_run(experiment_id= experiment.experiment_id, run_name= experiment.name):
    mlflow.log_param('Number of training GPUs', gpus)
    mlflow.log_param('Timestamp', timestamp)
    mlflow.log_param('S3 path', s3_path)
    mlflow.log_param('User namespace', user_namespace)
    mlflow.log_param('Data config file', cfg_data)
    mlflow.log_param('Config file', cfg_file)
    mlflow.log_param('Weights', weights)
    mlflow.log_param('Trained weights', trained_weights)
    mlflow.log_param('Number of Katib trials', trials)
    mlflow.log_param('Number of GPUs per trial', gpus_per_trial)
    #mlflow.log_params(params)

    mlflow.log_metric('Mean Average Precision',map_score)
    mlflow.log_metric('Precision score',precision_score)
    mlflow.log_metric('Recall score',recall_score)
    mlflow.log_metric('F1 score',f1_score)
    #mlflow.log_metrics(metrics)

## Delete MLFlow experiment

In [17]:
mlflow.delete_experiment(experiment_id)
experiment = mlflow.get_experiment(experiment_id)
print("Name: {}".format(experiment.name))
print("Artifact Location: {}".format(experiment.artifact_location))
print("Lifecycle_stage: {}".format(experiment.lifecycle_stage))

Name: object-detection-1613456866
Artifact Location: file:///tmp/2
Lifecycle_stage: deleted


## Delete pipeline run

In [18]:
kp_client.runs.delete_run(run_id)
print("Pipeline run with run ID '%s' successfully deleted"%run_id)

Pipeline run with run ID 'dab062ef-4d62-4ba8-898a-b42d9c0eb3a0' successfully deleted
