# Summary:

A few do's and don'ts on creating Lightweight (containerless*) Kubeflow Pipeline components using `func_to_container_op`.  Highlights include:

**General:**
* For simple or very iterative tasks, lightweight components may reduce your development burden
* Inputs and outputs for Lightweight components are strings by default.  Use type hints to enable int or float.  More complex data must be serialized and passed as a string (ex: JSON).  Large data should probably be passed as a reference to a file (eg: minio path, common data store link, etc.)
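
As a rough illustration of the last point, the sketch below passes a small structure inline as JSON and a larger payload by reference.  The function names (`produce_rows`, `count_rows`) and the local temporary directory are assumptions made for illustration only.  In a real pipeline the path would need to point at storage every component can reach (for example a minio bucket), since separate containers do not share a local filesystem.

```python
import json
import os
import tempfile


def produce_rows(n_rows: int, output_dir: str) -> str:
    """Write rows to a file and return only the file path (a short string)."""
    import json
    import os

    rows = [{"id": i, "value": i * i} for i in range(n_rows)]
    path = os.path.join(output_dir, "rows.json")
    with open(path, "w") as f:
        json.dump(rows, f)
    return path


def count_rows(rows_path: str) -> int:
    """Load the data from the path that was passed in, not from the argument itself."""
    import json

    with open(rows_path) as f:
        return len(json.load(f))


# Small, structured data: serialize to a JSON string and pass it inline
settings_as_json = json.dumps({"threshold": 0.5, "columns": ["a", "b"]})
settings = json.loads(settings_as_json)

# Larger data: pass a reference (a local path standing in for shared storage)
with tempfile.TemporaryDirectory() as shared_dir:
    rows_path = produce_rows(1000, shared_dir)
    n_rows = count_rows(rows_path)

print(settings["columns"], n_rows)
```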

**Helper functions:**
* If possible, define all helper functions needed in a pipeline component inside the function itself rather than outside.
* If sharing helpers between functions, set `use_code_pickling=True` to automatically pass the helpers with the container functions
* **Note:** Use code pickling with care, especially if you cannot ensure that the Python version here in the notebook is the same as the one in your pipeline container

**Dependencies:**
* When you need dependencies that are already installed on the base image you're using for the KFP component, simply `import` them from inside the component function
* When you need dependencies that are not present on the base image, you can pip install them (yourself through code or using the packages_to_install argument).  This can install packages from anywhere pip can (pypi, github, etc).

**Note:** If you have your own local functions supporting your container, the easiest way to use them is to push them to github and then pip install from github in your container (see the last example)

# Examples of Authoring and Iterating Kubeflow Pipelines using Lightweight Components

Machine learning applications are often a chain of encapsulated (or encapsulatable) steps.  [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/) is a platform for authoring, orchestrating, monitoring, and reusing these pipelines. 

**TODO: Add links to general kubeflow tutorials.  Explain what is expected (basic idea of pipelines, etc).  Add gotchas like passing numbers without typehints (think thats required), explain namedtuple for multiple output, etc?**

Pipelines are defined by a directed acyclic graph (DAG) of data/execution flow between components.  At their core, each component is a self-contained set of user code, packaged as a docker image.  They each perform one step in the pipeline, optionally consuming the products of upstream tasks.  They can be as complex as building a Tensorflow model or as simple as selecting a column from a data file.

In addition to authoring Pipeline Components from self-contained containers, support exists for several approaches to lightweight components.  Lightweight components use Kubeflow Pipelines helpers to avoid the author having to make a new container for each component (or for each iteration of a component).  Although this approach has limits, it may enable easier authoring of simple components or make it easier to iteratively write complex components.  This demo shows a few examples of these workflows, along with their limitations.

# User Settings

(modify these for your own use)


In [1]:
# Name of the experiment that all pipeline runs will be nested in on
# https://kubeflow.covid.cloud.statcan.ca/_/pipeline/#/experiments
experiment_name = "demo-kfp-lightweight-components"

# General Settings

(likely can leave these alone)


In [2]:
# Define a base image to be used in generating component ops from python functions.
# kfp uses this image to run a python session for the component
# func_to_container_op
BASE_IMAGE = "scribby182/demo-kfp-pipeline-authoring:latest"

import kfp
from kfp import dsl
from kfp.components import func_to_container_op

# Lightweight Components from Self-Contained Python Code

One approach to lightweight components is through the Kubeflow Pipeline function ```func_to_container_op()```.  This function accepts a pipeline component defined as a python function and assembles it into a full component by layering it on top of a base image.  Behind the scenes, this is achieved by Kubeflow Pipelines effectively rewriting your function as a script to be run from within the base image you provide (defined as an argument to ```func_to_container_op```).  
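
To make "rewriting your function as a script" a little more concrete, here is a minimal local sketch of the idea.  It is not the code KFP actually generates.  `run_as_script` is a hypothetical stand-in that only illustrates two things: arguments reach a component as strings, and type hints are what allow them to be converted back into numbers.

```python
import inspect
import typing


def add_untyped(a, b):
    return a + b


def add_typed(a: float, b: float) -> float:
    return a + b


def run_as_script(func, string_args):
    """Hypothetical wrapper: convert string arguments using the function's type hints."""
    hints = typing.get_type_hints(func)
    names = list(inspect.signature(func).parameters)
    # Fall back to str when an argument has no type hint
    converted = [hints.get(name, str)(raw) for name, raw in zip(names, string_args)]
    # Outputs leave the component as strings as well
    return str(func(*converted))


untyped_result = run_as_script(add_untyped, ["1", "2"])  # strings get concatenated
typed_result = run_as_script(add_typed, ["1", "2"])      # floats get added

print(untyped_result, typed_result)
```

Without hints the two inputs are joined as text, which is the kind of surprise that type hints on the component function avoid.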

For simple, self contained python code, this is an effective way to define Pipeline Components quickly.  For example, we can define a pipeline that accepts two strings and concatenates them, then concatenates that product with a third string, like so:

In [3]:
def concat_string(a, b) -> str:
    return f"({a} | {b})"

which works like:

In [4]:
print(concat_string("String 1", "String 2"))

(String 1 | String 2)


To use this in a Kubeflow Pipeline, we wrap it using the func_to_container_op decorator

Note: We could have done this all in one go when defining concat_string.  That is done later

In [5]:
concat_string_component = func_to_container_op(concat_string,
                                               base_image=BASE_IMAGE
                                               )

And we define our Pipeline as a function that uses our component(s), decorated by the dsl.pipeline decorator

In [6]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one keeps it nice and simple",
)
def pipeline(str1, str2, str3):
    # Note that we use the concat_string_component, not the
    # original concat_string() function
    concat_result_1 = concat_string_component(str1, str2)

    # By using concat_result_1's output, we define the dependency of
    # concat_result_2 on concat_result_1
    concat_result_2 = concat_string_component(concat_result_1.output, str3)

We can submit our pipeline from code with arguments like this:

In [7]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'str1': 'String 1', 'str2': 'String 2', 'str3': 'String 3'},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=8d4b1b53-6c01-4504-b745-a1222aa32615)

Which produces the pipeline and output:

![Simple pipeline](images/demo_kfp_lightweight_components_pipeline1_complete.png)

This is great for very simple actions.  For something less trivial, however, we often have dependencies on helper functions or packages.

# Lightweight Components that need Dependencies or Helpers

## What not to do with dependencies and helpers

Care is required whenever a pipeline component defined using `func_to_container_op` requires anything outside the code written directly in the wrapped function.  Common gotchas include:

* defining helper functions outside the pipeline component
* using packages that are not imported from within the pipeline component, or are not available in the base image

For example, while it runs fine locally, this will fail in a pipeline:

In [8]:
def my_sum_helper(*numbers):
    total = 0
    for x in numbers:
        total += x
    return total


# Note: Arguments for components created with func_to_container_op expect string
# To enable float or int types, use type hinting.  For more complex inputs,
# serialize with JSON or store data to a location and pass the path
def my_sum(a: float, b: float, c: float) -> float:
    """
    A function that sums its numeric arguments
    """
    return my_sum_helper(a, b, c)

In [9]:
print(my_sum(1, 2, 3))

6


In [10]:
my_sum_component = func_to_container_op(my_sum,
                                        base_image=BASE_IMAGE
                                        )

In [11]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one keeps it nice and simple",
)
def pipeline(a, b, c):
    sum_result = my_sum_component(a, b, c)

In [12]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'a': 1.0, 'b': 2.0, 'c': 3.0},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=63624dd0-368c-4959-bc2e-ab1e529358f7)

Result: 

![Simple pipeline](images/demo_kfp_lightweight_components_pipeline_with_helper_my_sum_failed.png)

And so will this:

In [13]:
import json


def sum_via_json(numbers_as_json):
    """
    A summation function that sums a list of numbers defined as a JSON string

    Output is returned as a JSON formatted string (which is really just 
    str(number), but still...)
    """
    numbers = json.loads(numbers_as_json)
    summed = sum(numbers)
    return json.dumps(summed)

Testing locally works great:

In [14]:
numbers_as_json = json.dumps([1, 2, 3])
result_as_json = sum_via_json(numbers_as_json)
result = json.loads(result_as_json)
print(f"result = {result}")

result = 6


But running through a pipeline does not...

In [15]:
sum_via_json_component = func_to_container_op(sum_via_json,
                                              base_image=BASE_IMAGE
                                              )

In [16]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one keeps it nice and simple",
)
def pipeline(numbers_as_json):
    sum_result = sum_via_json_component(numbers_as_json)

In [17]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'numbers_as_json': json.dumps([1, 2, 3])},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=80400dc8-32e7-4ac8-bce5-5d27e84cba09)

Result: 

![Simple pipeline](images/demo_kfp_lightweight_components_pipeline_with_dependency_failed.png)
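
Both failures have the same cause: only the text of the wrapped function travels to the container, so names defined elsewhere in the notebook do not exist there.  The sketch below imitates that locally by executing each function's source in an empty namespace.  `run_in_fresh_namespace` is a hypothetical helper for this demonstration, not part of KFP, and the function sources are abbreviated copies of the two examples above.

```python
import textwrap

# Only this text is "shipped"; my_sum_helper and the notebook-level
# `import json` are left behind.
my_sum_source = textwrap.dedent("""
    def my_sum(a: float, b: float, c: float) -> float:
        return my_sum_helper(a, b, c)
""")

sum_via_json_source = textwrap.dedent("""
    def sum_via_json(numbers_as_json):
        numbers = json.loads(numbers_as_json)
        return json.dumps(sum(numbers))
""")


def run_in_fresh_namespace(source, func_name, *args):
    """Execute the function source with no access to the notebook's globals."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


errors = []
for source, name, args in [
    (my_sum_source, "my_sum", (1.0, 2.0, 3.0)),
    (sum_via_json_source, "sum_via_json", ("[1, 2, 3]",)),
]:
    try:
        run_in_fresh_namespace(source, name, *args)
    except NameError as err:
        errors.append(str(err))

for message in errors:
    print(message)
```

Each call raises a `NameError`, one for the missing helper and one for the missing `json` import.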

## How to handle dependencies and helpers

Helpers can be defined inside the component function itself:

In [18]:
# Note: Arguments for components created with func_to_container_op expect string
# To enable float or int types, use type hinting.  For more complex inputs,
# serialize with JSON or store data to a location and pass the path
def my_sum_internal_helper(a: float, b: float, c: float) -> float:
    """
    A function that sums its numeric arguments
    """
    def my_sum_helper(*numbers):
        total = 0
        for x in numbers:
            total += x
        return total

    return my_sum_helper(a, b, c)


my_sum_internal_helper_component = func_to_container_op(my_sum_internal_helper,
                                                        base_image=BASE_IMAGE
                                                        )

Or, they can be defined outside the component function and pickled with the code

In [19]:
def my_sum_helper(*numbers):
    total = 0
    for x in numbers:
        total += x
    return total


def my_sum_external_helper(a: float, b: float, c: float) -> float:
    """
    A function that sums its numeric arguments
    """
    return my_sum_helper(a, b, c)


# NOTE the extra argument here
my_sum_external_helper_component = func_to_container_op(my_sum_external_helper,
                                                        base_image=BASE_IMAGE,
                                                        use_code_pickling=True,
                                                        )

Code pickling will wrap up simple helper functions with the code.  Docs suggest it can do more, but initial testing couldn't get that to work (anyone who gets packages or complex imports working through use_code_pickling should let us know!)

Code pickling can have some downsides (mainly related to python version differences between where it is pickled and where it is executed).  If you don't need it, you should probably leave it off.
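
One possible way to catch a version mismatch early is a small check like the one below.  This is only a sketch under assumptions: `python_version_matches` is a hypothetical function, and the idea is that it would be wrapped as its own component (without code pickling) so that it runs inside the base image and compares that interpreter with the notebook's.

```python
import sys


def python_version_matches(expected_major_minor: str) -> str:
    """Return the running major.minor version, or raise if it differs from the expected one."""
    import sys

    actual = f"{sys.version_info.major}.{sys.version_info.minor}"
    if actual != expected_major_minor:
        raise RuntimeError(
            f"Python {actual} here, but the code was pickled under {expected_major_minor}"
        )
    return actual


# Version of the interpreter where pickling would happen (the notebook)
notebook_version = f"{sys.version_info.major}.{sys.version_info.minor}"

# Run locally this trivially passes; inside a container it would
# fail fast if the base image uses a different Python version.
print(python_version_matches(notebook_version))
```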

In [20]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one keeps it nice and simple",
)
def pipeline(a, b, c):
    sum_result_internal = my_sum_internal_helper_component(a, b, c)
    sum_result_external = my_sum_external_helper_component(a, b, c)
    final_sum = my_sum_internal_helper_component(sum_result_internal.output,
                                                 sum_result_external.output,
                                                 0)

In [21]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'a': 1.0, 'b': 2.0, 'c': 3.0},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=53ef2fcc-f4f9-4c9f-9095-a062be7d679e)

Working pipeline! 

![Working pipeline](images/demo_kfp_lightweight_components_pipeline_with_helper_my_sum_successful.png)

For dependencies, several options exist depending on whether the package you want is already installed on the base image.

If the package you want is installed on the image, you can simply import it from within your pipeline component.  Revisiting the JSON example from above, this will work (because the `json` module is part of the Python standard library and is therefore available on BASE_IMAGE):

In [67]:
def sum_via_json(numbers_as_json: str) -> str:
    """
    A summation function that sums a list of numbers defined as a JSON string

    Output is returned as a JSON formatted string (which is really just 
    str(number), but still...)
    """
    # Import necessary libraries inside the function, as the only code executed
    # by the pipeline component is what you write here (plus some wrapper
    # material KFP creates for you)
    import json
    numbers = json.loads(numbers_as_json)
    summed = sum(numbers)
    return json.dumps(summed)

Testing locally works great:

In [56]:
numbers_as_json = json.dumps([1, 2, 3])
result_as_json = sum_via_json(numbers_as_json)
result = json.loads(result_as_json)
print(f"result = {result}")

result = 6


In [57]:
sum_via_json_component = func_to_container_op(sum_via_json,
                                              base_image=BASE_IMAGE
                                              )

If the package we need is not available on the base image, we can either install it [ourselves](https://github.com/kubeflow/pipelines/blob/master/samples/core/lightweight_component/lightweight_component.ipynb) from inside the component function or use the `packages_to_install` argument of `func_to_container_op`
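
The do-it-yourself option could look roughly like the sketch below.  The function name and the `module_name` / `pip_spec` parameters are assumptions added for illustration; they make it possible to try the function locally with a standard-library module so that nothing is actually installed.  The cells that follow use `packages_to_install` instead, which is usually less code.

```python
def annotate_with_self_installed_package(
    text: str,
    module_name: str = "delorean",
    pip_spec: str = "delorean==1.0.0",
) -> str:
    # Everything, including imports, lives inside the function
    import importlib
    import importlib.util
    import subprocess
    import sys

    # Install only if the module cannot already be found in this environment
    if importlib.util.find_spec(module_name) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_spec])
        importlib.invalidate_caches()

    module = importlib.import_module(module_name)
    return f"{text} | using {module.__name__}"


# Local check with a standard-library module, so pip is never invoked
local_result = annotate_with_self_installed_package("some string", "json", "unused")
print(local_result)
```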

In [58]:
# pip install here so we can test locally, but this has no effect on the
# pipeline component at pipeline runtime
!pip install delorean==1.0.0



In [59]:
def annotate_with_pip_installed_package(json_string):
    import delorean
    return json_string + " | " + str(delorean.Delorean(timezone="US/Eastern"))

In [60]:
annotate_with_pip_installed_package("some string, could be json")

"some string, could be json | Delorean(datetime=datetime.datetime(2020, 5, 13, 15, 10, 50, 167890), timezone='US/Eastern')"

In [64]:
# Add string list of package names exactly how you'd pass
# them to pip, with or without version
packages_to_install = ['delorean==1.0.0']
annotate_with_pip_installed_package_component = func_to_container_op(
    annotate_with_pip_installed_package,
    base_image=BASE_IMAGE,
    packages_to_install=packages_to_install,
)

In [68]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one keeps it nice and simple",
)
def pipeline(numbers_as_json):
    sum_result = sum_via_json_component(numbers_as_json)
    annotated_result = annotate_with_pip_installed_package_component(sum_result.output)

In [69]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'numbers_as_json': json.dumps([1, 2, 3])},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=d826ed10-02f3-410f-a576-086020af2156)

And we see everything works!

![everything works!](images/demo_kfp_lightweight_components_pipeline_with_dependencies_successful.png)

## Iterative development using your own code

For dependencies using your own code, one way to quickly iterate is by pushing your code to git and then installing it in the pipeline component every iteration.  For example, say we were locally iterating on this [project](https://github.com/ca-scribner/lrl).  We could do our development, push the code, and then use the following:

In [74]:
# pip install here is local only.  This enables local testing but doesn't help
# the pipeline
!pip install git+https://github.com/ca-scribner/lrl

Collecting git+https://github.com/ca-scribner/lrl
  Cloning https://github.com/ca-scribner/lrl to /tmp/pip-req-build-qk28fcrg
  Running command git clone -q https://github.com/ca-scribner/lrl /tmp/pip-req-build-qk28fcrg
Collecting gym>=0.12.1
  Downloading gym-0.17.2.tar.gz (1.6 MB)
Collecting pyglet<=1.5.0,>=1.4.0
  Downloading pyglet-1.5.0-py2.py3-none-any.whl (1.0 MB)
Collecting cloudpickle<1.4.0,>=1.2.0
  Downloading cloudpickle-1.3.0-py2.py3-none-any.whl (26 kB)
Building wheels for collected packages: lrl, gym
  Building wheel for lrl (setup.py) ... done
  Created wheel for lrl: filename=lrl-1.0.0-py3-none-any.whl size=50250 sha256=e3e828a145de3227234d6098e49cbe0ddccb8c7bac2db0dcbd9e1e576004f735
  Stored in directory: /tmp/pip-ephem-wheel-cache-9lq4noyu/wheels/b6/f4/fd/5adb39371c78e248241d2d9d8e5fc4d2f7eebdb6af2ce64a3f
  

In [91]:
def pipeline_step(something_just_because: str) -> str:
    import lrl
    rt = lrl.environments.get_racetrack(track='20x10_U',
                             x_vel_limits=(-2, 2),
                             y_vel_limits=(-2, 2),
                             x_accel_limits=(-2, 2),
                             y_accel_limits=(-2, 2),
                             max_total_accel=2,
                             )

    # Return a list of strings that looks like:
    # ['GGGGGGGGGGGGGGGGGGGG',
    #  'GGGGGGGGGGGGGGGGGGGG',
    #  'GGG     OOOO      GG',
    #  'GGG     GGGG      GG',
    #  'GGOOOOGGGGGGGGOOOOGG',
    #  'GGOOOGGGGGGGGGGOOOGG',
    #  'GG    GGGGGGGG    GG',
    #  'GG      SGGGGG    GG',
    #  'GGGGGGGGGGGGGGFFFFGG',
    #  'GGGGGGGGGGGGGGFFFFGG']
    # (converted to a string, as lightweight component needs)
    return str(rt.track)

In [92]:
pipeline_step('a')

"['GGGGGGGGGGGGGGGGGGGG', 'GGGGGGGGGGGGGGGGGGGG', 'GGG     OOOO      GG', 'GGG     GGGG      GG', 'GGOOOOGGGGGGGGOOOOGG', 'GGOOOGGGGGGGGGGOOOGG', 'GG    GGGGGGGG    GG', 'GG      SGGGGG    GG', 'GGGGGGGGGGGGGGFFFFGG', 'GGGGGGGGGGGGGGFFFFGG']"

In [93]:
# Add string list of package names exactly how you'd pass
# them to pip, with or without version
packages_to_install = ['git+https://github.com/ca-scribner/lrl']
pipeline_component = func_to_container_op(pipeline_step,
                                          base_image=BASE_IMAGE,
                                          packages_to_install=packages_to_install,
                                          )

In [94]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one keeps it nice and simple",
)
def pipeline(some_arg):
    result = pipeline_component(some_arg)

In [95]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'some_arg': "I am not very important"},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=50f92901-65de-42e6-b609-b88f593867fc)

(Pipeline runs successfully, but not shown)
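
One thing to keep in mind when iterating this way: a plain `git+https://...` requirement installs whatever is currently at the head of the repository's default branch.  pip also accepts a branch, tag, or commit after an `@`, which ties a run to a specific version of your code.  The helper below is a small sketch of building such a requirement string; `git_requirement` and the branch name `my-feature-branch` are placeholders, not part of the project above.

```python
def git_requirement(repo_url: str, ref: str = "") -> str:
    """Build a pip requirement string for a git repository, optionally pinned to a ref."""
    requirement = f"git+{repo_url}"
    if ref:
        # ref can be a branch name, a tag, or a commit hash
        requirement += f"@{ref}"
    return requirement


# "my-feature-branch" is a placeholder for whatever ref you pushed
packages_to_install = [
    git_requirement("https://github.com/ca-scribner/lrl", "my-feature-branch")
]
print(packages_to_install)
```

The resulting list can be passed to `func_to_container_op` through `packages_to_install` in the same way as above.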