# Demonstrate Local Or Remote Functions And Full Pipelines


## Create a project to host our functions, jobs and artifacts

Projects package multiple functions, workflows, and artifacts. Project code and definitions are usually stored in a Git repository.

The following code creates a new project in a local directory and initializes Git tracking for it.

In [1]:
import os
import mlrun

# set project name and dir
project_name = 'sk-project-dask'
project_dir = './project'

# specify artifacts target location
artifact_path = mlrun.set_environment(api_path = mlrun.mlconf.dbpath or 'http://mlrun-api:8080',
                                      artifact_path = os.path.abspath('./'),
                                      project = project_name,)

# set project
sk_dask_proj = mlrun.new_project(project_name, project_dir, init_git=True)



## Load and run functions

Load the function object from a `.py` or `.yaml` file, or from the function hub (marketplace).

In [2]:
# load function from local file
dsf = sk_dask_proj.set_function('/User/demos/scikit-learn-pipeline-dask/project/sklearn-classifier-dask.py', 
                                name='dask_classifier', 
                                kind="job",
                                image="mlrun/ml-models")

In [3]:
# set up function specs for dask
dsf.spec.remote = True
dsf.spec.replicas = 6
dsf.spec.service_type = 'NodePort'
dsf.with_limits(mem="4G", cpu=6)
dsf.spec.nthreads = 6

In [4]:
# set up function from local file
dsjob = sk_dask_proj.set_function("/User/demos/scikit-learn-pipeline-dask/project/daskjob.py", 
                                  name='dsjob', 
                                  kind="job", 
                                  image="mlrun/ml-models")

In [5]:
describejob = sk_dask_proj.set_function("/User/demos/scikit-learn-pipeline-dask/project/describe.py", 
                                          name='describe', 
                                          kind="job", 
                                          image="mlrun/ml-models")

## Create a Fully Automated ML Pipeline

#### Add more functions to our project to be used in our pipeline (from the functions hub/marketplace)

Describe the data, then train and evaluate the model with Dask.

#### Define and save a pipeline 

The following workflow definition is written to a file. It describes a Kubeflow execution graph (DAG)<br>
and how functions and data are connected to form an end-to-end pipeline.

* Ingest data
* Describe data
* Train, test and evaluate with dask
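
The steps above form a small dependency graph. As a rough illustration only, the sketch below models that graph with Python's standard `graphlib` module instead of Kubeflow. The step names follow the workflow in this notebook, and the dependency edges are an assumption: both the describe and train steps are taken to depend only on the step that initializes Dask.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps whose outputs it consumes.
# Assumption: both downstream steps depend only on the Dask init step.
dependencies = {
    "dsjob-hndlr": set(),
    "describe-describe": {"dsjob-hndlr"},
    "train-skrf": {"dsjob-hndlr"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect "waves" of steps that can start once their inputs exist.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

print(waves)
```

Under these assumed edges, the init step forms the first wave and the other two steps become ready together, so an execution engine would be free to run them in parallel.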

Check the code below to see how function objects are initialized and used (by name) inside the workflow.<br>
The `workflow.py` file has two parts: one initializes the function objects, and the other defines the pipeline DSL (connecting the function inputs and outputs).
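
To make this two-part structure concrete without requiring a cluster, here is a minimal sketch built on a stand-in class. `FakeFunction`, `fake_mount`, and the dictionary-based steps are hypothetical and are not part of MLRun or Kubeflow. Only the pattern mirrors the workflow file: a registry keyed by name, an init hook that applies a modifier to every function, and a pipeline body that looks functions up by name.

```python
class FakeFunction:
    """Hypothetical stand-in for a project function object."""

    def __init__(self, name):
        self.name = name
        self.modifiers = []

    def apply(self, modifier):
        # Record the modifier and return self so calls can be chained.
        self.modifiers.append(modifier)
        return self

    def as_step(self, handler, inputs=None):
        # Return a plain description of the step instead of a real pipeline op.
        return {"step": f"{self.name}-{handler}", "inputs": inputs or {}}


def fake_mount():
    # Hypothetical placeholder for a storage-mount modifier.
    return "mount"


# Part 1: a registry of functions keyed by name, plus the init hook.
funcs = {name: FakeFunction(name) for name in ("dsjob", "describe")}


def init_functions(functions):
    for f in functions.values():
        f.apply(fake_mount())


# Part 2: the pipeline body wires steps together by function name.
def pipeline():
    init = funcs["dsjob"].as_step(handler="hndlr")
    return funcs["describe"].as_step(
        handler="describe", inputs={"dask_address": init["step"]}
    )


init_functions(funcs)
result = pipeline()
print(result)
```

Keeping the registry separate from the pipeline body means resource settings can change in the init hook without touching the wiring between steps.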

> Note: The pipeline can include CI steps such as building container images and deploying models, as illustrated in the following example.


In [6]:
%%writefile project/workflow.py
from kfp import dsl
from mlrun import mount_v3io

funcs    = {}
LABELS   = "label"
DATA_URL = "/User/iris.csv"
#DATA_URL = "/User/yellow_tripdata_2019-01_subset.csv"


# init_functions configures function resources and local settings
def init_functions(functions: dict, project=None, secrets=None):
    for f in functions.values():
        # mount the v3io data fabric on every function
        f.apply(mount_v3io())
     
    
@dsl.pipeline(
    name="Demo training pipeline",
    description="Shows how to use mlrun."
)
def kfpipeline():
    
    # init_dask
    dask_init = funcs['dsjob'].as_step(
        handler="hndlr",
        params={"client_url" : "db://default/mydask"},
        outputs=['client'])
    
    # describe data
    describe = funcs['describe'].as_step(
        handler="describe",
        inputs={"dataset"   : DATA_URL,
                "dask_address"  : dask_init.outputs['client']})
    
    # get data, train, test and evaluate 
    train = funcs['dask_classifier'].as_step(
        name="train-skrf",
        handler="train_model",
        params={"label_column"    : LABELS,
                "test_size"       : 0.10,
                "model_pkg_class" : "sklearn.ensemble.RandomForestClassifier"},
        inputs={"dataset"   : DATA_URL,
                "dask_address"  : dask_init.outputs['client']},
        outputs=['model', 'test_set'])

Overwriting project/workflow.py


In [7]:
# register the workflow file as "main", embed the workflow code into the project YAML
sk_dask_proj.set_workflow('main', 'workflow.py', embed=True)

Save the project definitions to a file (project.yaml). It is recommended to commit all changes to a Git repo.

In [8]:
sk_dask_proj.save()

<a id='run-pipeline'></a>
## Run a pipeline workflow
Use the `run` method to execute a workflow. You can provide alternative arguments and specify the default target for workflow artifacts.<br>
The workflow ID is returned and can be used to track progress, or you can use the hyperlinks.

> Note: The same command can be issued through CLI commands:<br>
    `mlrun project my-proj/ -r main -p "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"`

The `dirty` flag allows us to run a project with uncommitted changes (when the notebook is in the same Git directory, it will always be dirty).<br>
The `watch` flag waits for the pipeline to complete and prints the results.
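
The `{{workflow.uid}}` placeholder in the artifact path is what keeps the artifacts of each run separate. The sketch below imitates that substitution with plain string replacement. `resolve_artifact_path` is a hypothetical helper and the run IDs are made up, since the real value is filled in by the pipeline engine at run time.

```python
import os


def resolve_artifact_path(template, workflow_uid):
    """Hypothetical helper: fill the {{workflow.uid}} placeholder in a path template."""
    return template.replace("{{workflow.uid}}", workflow_uid)


template = os.path.join("pipe", "{{workflow.uid}}")

# Two made-up run IDs resolve to two distinct artifact directories.
first = resolve_artifact_path(template, "run-0001")
second = resolve_artifact_path(template, "run-0002")
print(first)
print(second)
```

Because the path is resolved per run, repeated executions of the same workflow do not overwrite each other's outputs.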

In [9]:
artifact_path = os.path.abspath('./pipe/{{workflow.uid}}')
run_id = sk_dask_proj.run(
    'main',
    arguments={}, 
    artifact_path=artifact_path, 
    dirty=False, watch=True)

> 2020-11-20 13:35:40,202 [info] using in-cluster config.


> 2020-11-20 13:35:40,963 [info] Pipeline run id=9503dab6-545b-4c75-bb7b-1336d98d5cbc, check UI or DB for progress
> 2020-11-20 13:35:40,963 [info] waiting for pipeline run completion


| uid | start | state | name | results | artifacts |
|---|---|---|---|---|---|
| ...1977000c | Nov 20 13:36:01 | completed | describe-describe | | describe |
| ...b1bd3d9a | Nov 20 13:36:02 | completed | train-skrf | micro=0.9935941828254847<br>macro=0.9910470085470084<br>precision-2=1.0<br>precision-0=0.8461538461538461<br>precision-1=0.9166666666666666<br>recall-2=1.0<br>recall-0=0.9166666666666666<br>recall-1=0.8461538461538461<br>f1-2=1.0<br>f1-0=0.8799999999999999<br>f1-1=0.8799999999999999 | ROCAUC<br>ClassificationReport<br>ConfusionMatrix<br>FeatureImportances<br>model<br>standard_scaler<br>label_encoder<br>test_set |
| ...01474b85 | Nov 20 13:35:47 | completed | dsjob-hndlr | | client |


In [11]:
!mlrun clean -f



**[back to top](#top)**