**Difficulty: Beginner**

# Summary:

## A few do's and don'ts on creating Lightweight (containerless*) Kubeflow Pipeline components using `func_to_container_op`.  
\* The pipeline components still run from containers, but you do not need to explicitly create and push these containers somewhere.  Instead, Kubeflow uses a base container you specify and runs your code from within that container.  See more detail below.

**Highlights include:**

**General:**
* For simple or very iterative tasks, lightweight components may reduce your development burden
* Inputs and outputs for Lightweight components are strings by default.  Use type hints to enable int or float.  More complex data must be serialized and passed as a string (e.g. JSON).  Large data should probably be passed as a reference to a file stored elsewhere (e.g. MinIO path, common data store link, etc.)
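
As a rough illustration of the serialization point above, the sketch below passes structured data between two plain Python functions as JSON strings.  The function names `produce_stats` and `consume_stats` are hypothetical and nothing here touches Kubeflow; it only shows the string-in, string-out shape a lightweight component would need.

```python
import json  # used by the local driver code at the bottom


def produce_stats(numbers_as_json: str) -> str:
    # Import inside the function so the function is self-contained
    import json
    numbers = json.loads(numbers_as_json)
    stats = {"count": len(numbers), "total": sum(numbers)}
    # Complex data leaves the step as a JSON string
    return json.dumps(stats)


def consume_stats(stats_as_json: str) -> str:
    import json
    stats = json.loads(stats_as_json)
    mean = stats["total"] / stats["count"]
    return json.dumps({"mean": mean})


# Chain the two steps locally, decoding only at the very end
payload = json.dumps([1, 2, 3])
result = json.loads(consume_stats(produce_stats(payload)))
print(result)
```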

**When code needs helper functions:**
* If possible, define all helper functions needed in a pipeline component inside the function itself rather than outside.
* If sharing helpers between functions, set `use_code_pickling=True` to automatically pass the helpers with the container functions
* **Note:** Apply `use_code_pickling=True` with care, especially if you cannot ensure the Python version here in the notebook will be the same as the version in your pipeline container.  If you see odd behaviour, it could be this.

**When code needs Python dependencies:**
* When you need dependencies that are already installed on the base image you're running off of, simply `import` them from inside the component function rather than above the code.  This will ensure they're imported at runtime in the pipeline
* When you need dependencies that are not installed on the base image, you can pip install them (via `func_to_container_op(packages_to_install=[...])` or by making system calls to pip yourself like [here](https://github.com/kubeflow/pipelines/blob/master/samples/core/lightweight_component/lightweight_component.ipynb)).  This can install packages from anywhere pip can (pypi, github, etc).

**Note:** If you're developing code that is well encapsulated across multiple files (e.g. my main.py imports from ./utilities.py and ./other_code.py), the easiest way to develop is to push the code to GitHub and then pip install from your GitHub repo in the container (see the GitHub example below).

# Examples of Authoring and Iterating Kubeflow Pipelines using Lightweight Components

Machine learning applications are often a chain of encapsulated (or encapsulatable) steps.  [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/) is a platform for authoring, orchestrating, monitoring, and reusing these pipelines. 

Pipelines are defined by a directed acyclic graph (DAG) of data/execution flow between components.  At its core, each component is a self-contained set of user code, packaged as a Docker container.  Each performs one step in the pipeline, optionally consuming the products of upstream tasks and/or providing results for downstream tasks.  Components can be as complex as building a TensorFlow model or as simple as selecting a column from a data file.

The typical workflow for building Pipeline Components is:
* code up the logic for your component and test it locally.  This could be in Python, R, ..., but in the end it needs to work like an executable you could run from a command line
* package the code as a docker image and store that container somewhere Kubeflow can find
* define a pipeline step that runs that docker container (probably with some command line arguments passed from the pipeline)

In addition to this typical workflow of authoring self-contained docker images, there are also several ways to make pipeline steps without explicitly creating a container.  These **[lightweight components](https://www.kubeflow.org/docs/pipelines/sdk/lightweight-python-components/)** automate some of the setup for you to make things easier.  The typical workflow for building lightweight components is:
* code up the logic for your component in Python from within the notebook you're defining your pipeline (examples shown later)
* define a pipeline step that runs your **Python function** from within a generic, base docker container you specify (say the generic [tensorflow container](https://hub.docker.com/r/tensorflow/tensorflow/tags))

Although this approach has limits, it may enable easier authoring and iterating on pipeline components because you don't need to rebuild your docker image each time.  This demo shows a few examples of these workflows, along with their limitations.

# User Settings

(modify these for your own use)


In [1]:
# Name of the experiment that all pipeline runs will be nested in on
# https://kubeflow.covid.cloud.statcan.ca/_/pipeline/#/experiments
experiment_name = "demo-kfp-lightweight-components"

# General Settings

(likely can leave these alone)


In [30]:
# Restart kernel after this step
!pip install delorean==1.0.0



***Note:*** Restart Kernel after above cell

In [2]:
# Define a base image to be used in generating component ops from python functions.
# kfp runs the Python code of each func_to_container_op component inside this image.
# This uses a generic python image, but you can use whichever one provides the best
# starting point for your code (generally an image that has every dependency
# you need preinstalled, but not much else).  For example, if you're building
# tensorflow models, you should probably start with tensorflow preinstalled, such as
# a tensorflow base image.
import sys

python_version_as_string = f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}"
BASE_IMAGE = f"python:{python_version_as_string}-buster"
print(f"This example notebook was executed using python {python_version_as_string}")
print(f"Using base_image_python: {BASE_IMAGE}")

import kfp
from kfp import dsl
from kfp.components import func_to_container_op

This example notebook was executed using python 3.8.8
Using base_image_python: python:3.8.8-buster


# Lightweight Components from Self-Contained Python Code

The Kubeflow Pipeline function `func_to_container_op` lets you turn a generic Python function into a full pipeline component.  To do this, behind the scenes Kubeflow Pipelines effectively rewrites your function as a script to be run from within the base image you provide (the base image is defined as an argument to `func_to_container_op` as seen below).  

For simple, self-contained Python code, this is an effective way to define Pipeline Components quickly.  For example, we can define a pipeline step that accepts two strings and concatenates them.

First, we define a base Python function that contains all the logic of our single pipeline step

In [3]:
def concat_string(a, b) -> str:
    return f"({a} | {b})"

Testing locally to make sure it works:

In [4]:
# Test locally
print(concat_string("String 1", "String 2"))

(String 1 | String 2)


Now we convert this Python function into a Kubeflow component factory (a function that can be used to define instances of this particular type of pipeline component):

In [5]:
concat_string_component = func_to_container_op(concat_string,
                                               base_image=BASE_IMAGE
                                               )

And we define our Pipeline as a function that uses our component(s), decorated by the dsl.pipeline decorator.  In this case, we have two concat_string_components, one that concatenates str1+str2, and another that concatenates (str1+str2) with str3

In [6]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one keeps it nice and simple",
)
def pipeline(str1, str2, str3):
    # Note that we use the concat_string_component, not the
    # original concat_string() function
    concat_result_1 = concat_string_component(str1, str2)

    # By using concat_result_1's output, we define the dependency of
    # concat_result_2 on concat_result_1
    concat_result_2 = concat_string_component(concat_result_1.output, str3)

We can submit our pipeline from code with arguments like this:

In [7]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'str1': 'String 1', 'str2': 'String 2', 'str3': 'String 3'},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=f72a522a-cd79-4dd0-b0e6-01bf2d355b8d)

Which produces the pipeline and output:

![Simple pipeline](images/demo_kfp_lightweight_components_pipeline1_complete.png)

Note that `kfp.Client().create_run_from_pipeline_func` takes the `pipeline()` function we defined, translates it into a pipeline definition (`pipeline.yaml.zip`), and passes that definition to Kubeflow Pipelines to actually run.

This workflow is great for very simple actions.  For something less trivial, however, we often have dependencies on helper functions or packages...

# Lightweight Components that need Dependencies or Helpers

## What not to do with dependencies and helpers

Care is required whenever a pipeline component defined using `func_to_container_op` requires anything outside the code written directly in the wrapped function.  Common gotchas include:

* defining helper functions outside the pipeline component
* using packages in the pipeline component without importing them, or importing packages that are not installed in the base image at all

For example, while it runs fine locally, this function that depends on a helper will fail in a pipeline:

In [8]:
def my_sum_helper(*numbers):
    total = 0
    for x in numbers:
        total += x
    return total


# Note: Components created with func_to_container_op expect string arguments.
# To enable float or int types, use type hinting.  For more complex inputs,
# serialize with JSON or store data to a location and pass the path
def my_sum(a: float, b: float, c: float) -> float:
    """
    A function that sums its numeric arguments
    """
    return my_sum_helper(a, b, c)

Looks good here...

In [9]:
print(my_sum(1, 2, 3))

6


In [10]:
my_sum_component = func_to_container_op(my_sum,
                                        base_image=BASE_IMAGE
                                        )

But not here...

In [11]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one is a bit more complicated and fails",
)
def pipeline(a, b, c):
    sum_result = my_sum_component(a, b, c)

In [12]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'a': 1.0, 'b': 2.0, 'c': 3.0},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=a7614e29-7560-40be-8d0b-868736913240)

Result: 

![Simple pipeline](images/demo_kfp_lightweight_components_pipeline_with_helper_my_sum_failed.png)

Similarly, this function that depends on the json package will also work locally but fail in the pipeline, because the `import json` statement sits outside the function:

In [13]:
import json


def sum_via_json(numbers_as_json):
    """
    A summation function that sums a list of numbers defined as a JSON string

    Output is returned as a JSON formatted string (which is really just 
    str(number), but still...)
    """
    numbers = json.loads(numbers_as_json)
    summed = sum(numbers)
    return json.dumps(summed)

Testing locally works great:

In [14]:
numbers_as_json = json.dumps([1, 2, 3])
result_as_json = sum_via_json(numbers_as_json)
result = json.loads(result_as_json)
print(f"result = {result}")

result = 6


But running through a pipeline does not...

In [15]:
sum_via_json_component = func_to_container_op(sum_via_json,
                                              base_image=BASE_IMAGE
                                              )

In [16]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one is a bit more complicated and fails",
)
def pipeline(numbers_as_json):
    sum_result = sum_via_json_component(numbers_as_json)

In [17]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'numbers_as_json': json.dumps([1, 2, 3])},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=89a47261-c161-4aec-8f67-a6f1a5bdfdd1)

Result: 

![Simple pipeline](images/demo_kfp_lightweight_components_pipeline_with_dependency_failed.png)
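
One way to reason about both failures: only the text of the wrapped function travels to the container, so anything defined elsewhere in the notebook is missing at run time.  The sketch below imitates that with plain Python by defining a function from its source text in an empty namespace.  This is a simplification for illustration, not how KFP is actually implemented, and `run_in_fresh_namespace` is a hypothetical helper.

```python
def run_in_fresh_namespace(source, func_name, *args):
    """Define a function from source text alone, then call it."""
    namespace = {}  # nothing from the notebook session is visible here
    exec(source, namespace)
    return namespace[func_name](*args)


# Only the function itself travels; the notebook-level `import json` does not
broken_source = '''
def sum_via_json(numbers_as_json):
    numbers = json.loads(numbers_as_json)
    return json.dumps(sum(numbers))
'''

# Same logic, but the import lives inside the function
fixed_source = '''
def sum_via_json(numbers_as_json):
    import json
    numbers = json.loads(numbers_as_json)
    return json.dumps(sum(numbers))
'''

try:
    run_in_fresh_namespace(broken_source, "sum_via_json", "[1, 2, 3]")
    broken_outcome = "ok"
except NameError as err:
    broken_outcome = f"NameError: {err}"

fixed_outcome = run_in_fresh_namespace(fixed_source, "sum_via_json", "[1, 2, 3]")

print(broken_outcome)
print(fixed_outcome)
```

The same reasoning applies to a helper function defined outside the component: its name is simply not defined where the component code runs.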

## How to handle dependencies and helpers

Helpers can be defined inside the component function itself:

In [18]:
# Note: Components created with func_to_container_op expect string arguments.
# To enable float or int types, use type hinting.  For more complex inputs,
# serialize with JSON or store data to a location and pass the path
def my_sum_internal_helper(a: float, b: float, c: float) -> float:
    """
    A function that sums its numeric arguments
    """
    def my_sum_helper(*numbers):
        total = 0
        for x in numbers:
            total += x
        return total

    return my_sum_helper(a, b, c)


my_sum_internal_helper_component = func_to_container_op(my_sum_internal_helper,
                                                        base_image=BASE_IMAGE
                                                        )

Or, they can be defined outside the component function and pickled with the code:

In [19]:
def my_sum_helper(*numbers):
    total = 0
    for x in numbers:
        total += x
    return total


def my_sum_external_helper(a: float, b: float, c: float) -> float:
    """
    A function that sums its numeric arguments
    """
    return my_sum_helper(a, b, c)


# NOTE the extra argument here
my_sum_external_helper_component = func_to_container_op(my_sum_external_helper,
                                                        base_image=BASE_IMAGE,
                                                        packages_to_install = ['cloudpickle'],
                                                        use_code_pickling=True,  # <----
                                                        )

Code pickling will wrap up simple helper functions with the code.  The docs suggest it can do more, but initial testing couldn't get that to work (anyone who gets packages or complex imports working through `use_code_pickling` should let us know!)

Code pickling can have some downsides (mainly related to Python version differences between where the code is pickled and where it is executed).  If you don't need it, you should probably leave it off.
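
A quick sanity check before turning on code pickling is to compare the notebook's Python version with the one in the base image.  The sketch below is one hypothetical way to do that; `image_python_version` and `versions_match` are made-up helpers, and they assume the image tag embeds the Python version the way the official `python` images do (for example `python:3.8.8-buster`).  For other images the tag may describe something else entirely, so treat the result as a hint only.

```python
import re
import sys


def image_python_version(image_tag):
    """Extract (major, minor) from a tag such as 'python:3.8.8-buster'.

    Returns None when the tag does not start with a dotted version.
    """
    match = re.search(r":(\d+)\.(\d+)", image_tag)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def versions_match(image_tag, local=sys.version_info):
    """True when the tag's major.minor equals the local interpreter's."""
    return image_python_version(image_tag) == tuple(local[:2])


print(image_python_version("python:3.8.8-buster"))
print(versions_match("python:3.8.8-buster"))
```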

In [20]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one is a bit more complicated, but works!",
)
def pipeline(a, b, c):
    sum_result_internal = my_sum_internal_helper_component(a, b, c)
    sum_result_external = my_sum_external_helper_component(a, b, c)
    final_sum = my_sum_internal_helper_component(sum_result_internal.output,
                                                 sum_result_external.output,
                                                 0)

In [21]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'a': 1.0, 'b': 2.0, 'c': 3.0},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=6bb87e50-bb5a-47f4-9b5b-d651cd72b233)

Working pipeline! 

![Working pipeline](images/demo_kfp_lightweight_components_pipeline_with_helper_my_sum_successful.png)

For dependencies, several options exist depending on whether the package you want is already installed on the base image.

If the package you want is installed on the image, you can simply `import` it from within your pipeline component.  Revisiting the JSON example from above, this will work (because the `json` module is part of the Python standard library and is therefore available on BASE_IMAGE):

In [22]:
def sum_via_json(numbers_as_json: str) -> str:
    """
    A summation function that sums a list of numbers defined as a JSON string

    Output is returned as a JSON formatted string (which is really just 
    str(number), but still...)
    """
    # Import necessary libraries inside the function, as only the code written
    # in the pipeline component's function will be executed by the pipeline
    # (plus some wrapper material KFP creates for you)
    import json
    numbers = json.loads(numbers_as_json)
    summed = sum(numbers)
    return json.dumps(summed)

Testing locally works great:

In [23]:
numbers_as_json = json.dumps([1, 2, 3])
result_as_json = sum_via_json(numbers_as_json)
result = json.loads(result_as_json)
print(f"result = {result}")

result = 6


In [24]:
sum_via_json_component = func_to_container_op(sum_via_json,
                                              base_image=BASE_IMAGE
                                              )

If the package we need is not available on the base image, we can either install it [ourselves](https://github.com/kubeflow/pipelines/blob/master/samples/core/lightweight_component/lightweight_component.ipynb) or use the `packages_to_install` argument of `func_to_container_op`:

In [25]:
def annotate_with_pip_installed_package(json_string):
    import delorean
    return json_string + " | " + str(delorean.Delorean(timezone="US/Eastern"))

In [26]:
annotate_with_pip_installed_package("some string, could be json")

"some string, could be json | Delorean(datetime=datetime.datetime(2021, 6, 16, 14, 9, 2, 737495), timezone='US/Eastern')"

And turning this into a component factory

In [27]:
# Add string list of package names exactly how you'd pass
# them to pip, with or without version
packages_to_install = ['delorean==1.0.0']
annotate_with_pip_installed_package_component = func_to_container_op(annotate_with_pip_installed_package,
                                              base_image=BASE_IMAGE,
                                              packages_to_install=packages_to_install,  # <---
                                              )

In [28]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one keeps it nice and simple",
)
def pipeline(numbers_as_json):
    sum_result = sum_via_json_component(numbers_as_json)
    annotated_result = annotate_with_pip_installed_package_component(sum_result.output)

In [29]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'numbers_as_json': json.dumps([1, 2, 3])},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=209dcf8b-7a7c-4377-a080-5695229ff0cc)

And we see everything works!

![everything works!](images/demo_kfp_lightweight_components_pipeline_with_dependencies_successful.png)

## Iterative development using your own code

For dependencies using your own code, one way to quickly iterate is by pushing your code to git and then installing it in the pipeline component every iteration.  For example, say we were locally iterating on this [project](https://github.com/ca-scribner/lrl).  We could do our development, push the code, and then use the following:

In [30]:
# pip install here is local only.  This enables local testing but doesn't help
# the pipeline
# !pip install git+https://github.com/ca-scribner/lrl

# We leave it commented here to show how testing the pipeline step locally 
# will fail, but running in the pipeline with kfp installing your packages
# for you will succeed

In [34]:
def pipeline_step(something_just_because: str) -> str:
    import lrl
    rt = lrl.environments.get_racetrack(track='20x10_U',
                             x_vel_limits=(-2, 2),
                             y_vel_limits=(-2, 2),
                             x_accel_limits=(-2, 2),
                             y_accel_limits=(-2, 2),
                             max_total_accel=2,
                             )

    # Return a list of strings that looks like:
    # ['GGGGGGGGGGGGGGGGGGGG',
    #  'GGGGGGGGGGGGGGGGGGGG',
    #  'GGG     OOOO      GG',
    #  'GGG     GGGG      GG',
    #  'GGOOOOGGGGGGGGOOOOGG',
    #  'GGOOOGGGGGGGGGGOOOGG',
    #  'GG    GGGGGGGG    GG',
    #  'GG      SGGGGG    GG',
    #  'GGGGGGGGGGGGGGFFFFGG',
    #  'GGGGGGGGGGGGGGFFFFGG']
    # (converted to a string, as lightweight component needs)
    return str(rt.track)

In [35]:
pipeline_step('a')

"['GGGGGGGGGGGGGGGGGGGG', 'GGGGGGGGGGGGGGGGGGGG', 'GGG     OOOO      GG', 'GGG     GGGG      GG', 'GGOOOOGGGGGGGGOOOOGG', 'GGOOOGGGGGGGGGGOOOGG', 'GG    GGGGGGGG    GG', 'GG      SGGGGG    GG', 'GGGGGGGGGGGGGGFFFFGG', 'GGGGGGGGGGGGGGFFFFGG']"

In [36]:
# Add string list of package names exactly how you'd pass
# them to pip, with or without version
packages_to_install = ['git+https://github.com/ca-scribner/lrl']
pipeline_component = func_to_container_op(pipeline_step,
                                              base_image=BASE_IMAGE,
                                              packages_to_install=packages_to_install,
                                              )

In [37]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="This one keeps it nice and simple",
)
def pipeline(some_arg):
    result = pipeline_component(some_arg)

In [38]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'some_arg': "I am not very important"},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=cc62fcff-7d2c-4c39-b543-4d0d5b95244d)

(Pipeline runs successfully, but not shown)
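
When iterating this way it can help to be explicit about which branch, tag, or commit gets installed, since pip accepts a `git+<url>@<ref>` form.  The sketch below builds such requirement strings; `git_requirement` is a hypothetical helper and the branch name `my-feature-branch` is a placeholder, not something that exists in the repository above.

```python
def git_requirement(repo_url, ref=None):
    """Build a pip requirement string for a git repository.

    ref may be a branch, tag, or commit hash; when omitted, pip installs
    from the repository's default branch.
    """
    requirement = f"git+{repo_url}"
    if ref:
        requirement += f"@{ref}"
    return requirement


# Placeholder ref, for illustration only
packages_to_install = [
    git_requirement("https://github.com/ca-scribner/lrl", ref="my-feature-branch")
]
print(packages_to_install)
```

The resulting list can be passed as `packages_to_install` in the same way as the unpinned version shown earlier.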

# Lightweight components with numeric inputs and multiple outputs

In [39]:
# This import happens outside the function because it is used to 
# DEFINE the function, not as something inside the function
from typing import NamedTuple

def funky_divide(dividend: float, divisor: float) -> NamedTuple("funky_divide_output", [('quotient', float), ('remainder', float), ('as_string', str)]):
    """
    Returns a named tuple of (quotient, remainder, as_string), where
    quotient = dividend // divisor and remainder = dividend % divisor
    
    Note that we use type hints to tell func_to_container_op that:
    - our inputs are floats, not strings
    - our output has several parts, each with their own type
    """
    quotient = dividend // divisor
    remainder = dividend % divisor
    as_string = f"{quotient} and {remainder}/{divisor}"
    
    from collections import namedtuple
    # Define the namedtuple's structure
    output = namedtuple("funky_divide_output", ["quotient", "remainder", "as_string"])
    my_output = output(quotient, remainder, as_string)
    return my_output
    
funky_divide_op_factory = func_to_container_op(funky_divide, 
                                               base_image=BASE_IMAGE
                                              )

In [40]:
funky_divide(5, 3)

funky_divide_output(quotient=1, remainder=2, as_string='1 and 2/3')
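
To build intuition for why the type hints matter, recall that component arguments arrive as strings unless told otherwise.  The sketch below imitates the idea in plain Python by converting string arguments according to a function's annotations.  It is a conceptual stand-in, not KFP's actual conversion logic, and `call_with_string_args` and `divide_as_text` are hypothetical names.

```python
from typing import get_type_hints


def divide_as_text(dividend: float, divisor: float) -> str:
    return f"{dividend // divisor} and {dividend % divisor}/{divisor}"


def call_with_string_args(func, **string_args):
    """Convert each string argument using func's type hints, then call func."""
    hints = get_type_hints(func)
    converted = {
        # Arguments without a hint stay as strings
        name: hints.get(name, str)(value)
        for name, value in string_args.items()
    }
    return func(**converted)


# Without the float hints, "10" // "4" would raise a TypeError
result = call_with_string_args(divide_as_text, dividend="10", divisor="4")
print(result)
```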

In [41]:
@dsl.pipeline(
    name="My demo-kfp-lightweight-components pipeline",
    description="Now we've got lots of fun complications!",
)
def pipeline(x, y):
    funky_divide_1 = funky_divide_op_factory(x, y)
    
    # Now divide the quotient from _1 by the remainder from _1
    funky_divide_2 = funky_divide_op_factory(funky_divide_1.outputs['quotient'], funky_divide_1.outputs['remainder'])


In [42]:
kfp.Client().create_run_from_pipeline_func(
    pipeline,
    arguments={'x': 10., 'y': 4.},
    experiment_name=experiment_name
)

RunPipelineResult(run_id=498c95e5-28dd-4c41-a31f-00502d1d85eb)