# Kubeflow Pipelines

## Introduction
> Kubeflow Pipelines is a platform in `kubeflow` employed to __define the end-to-end lifecycle of an `ML` project__ and facilitate the re-use of components.

It can be operated through
- `SDK` (`python` in our case): defines the pipeline.
- `UI`: easy visualisation of the created pipelines.
- `k8s` based engine: transpiles `kubeflow` format to `k8s` specific one (i.e. `.yaml` files).

## Important Concepts

Here are some important concepts associated with Kubeflow `pipelines`:
- `pipeline`
- `component`
- `graph`
- `experiment`
- `run` and `recurring run`
- `run trigger`
- `step`
- `output artifact`

In this notebook, we will learn how to define each of them using the Python `SDK`.

## Installation

As the first step, start `minikube` cluster using `minikube start` (__you should still use it for local development__).

> `kubeflow-pipelines` can be installed as a standalone platform via `k8s` `kustomize`.

This can be done directly from `GitHub` via the three commands below:

In [None]:
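# Each `!` command runs in its own subshell, so `!export` would not persist;
# `%env` sets the variable for all subsequent `!` commands.
%env PIPELINE_VERSION=1.6.0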
!kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
!kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
!kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION"

Note that you might need to wait a little while for the appropriate `POD`s to initialise.

To view them, check the available `POD`s and their statuses in `kubeflow` namespace:

In [None]:
!kubectl get pods --namespace kubeflow

Once all of the above are `Running` (__note that restarts may occur; however, they are inconsequential__), run the following command to `forward` port `80` of `kubeflow`'s UI `POD` to `localhost:8080`:

In [None]:
!kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80

Now, open `localhost:8080` in your web browser to view the UI.

The output should be similar to that shown in the image below.

![](./images/kubeflow-ui.png)

To delete `kubeflow-pipelines`, use the following command:

In [None]:
# export PIPELINE_VERSION=1.6.0 # Optional, as we exported envvar previously

# kubectl delete -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION"
# kubectl delete -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"

## `SDK`

> `kubeflow`'s `SDK` is provided as a `PyPI` package and named `kfp`.

It communicates with the cluster directly, through the `kubernetes` Python `SDK`, and indirectly, by transpiling the graph to `.yaml` files.

### Installation

As is conventional, use `pip` for the installation (__you should conduct the process within `AiCore`'s `conda` environment__):

In [None]:
!pip install kfp --upgrade

### `SDK` Packages

Here, we present a high-level overview of the provided functionalities after installation:
- __[`kfp.compiler`](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.compiler.html) - class and methods for compiling `dsl` to `.yaml`__:
    - [`kfp.compiler.Compiler`](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.compiler.html#module-kfp.compiler) - compiles `pipeline` functions to `yaml` workflows.
    
Consider the example schematic code below (__we will explore all parts later__):

In [None]:
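import kfp
from kfp.compiler import Compiler


@kfp.dsl.pipeline(
  name='name',
  description='description'
)
def my_pipeline(a: int = 1, b: str = "default value"):
  ...

# Compile the pipeline function into a workflow definition
Compiler().compile(my_pipeline, 'path/to/workflow.yaml')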


- [`kfp.components`](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html) - classes and methods for interacting with pipeline components.
- [`kfp.dsl`](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.dsl.html) - contains the domain-specific language for defining and interacting with pipelines and components:
    - includes `Pipeline` definition (as shown above).
    - __the most utilised `SDK` core package.__
- [`kfp.client`](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.client.html) - client libraries for [Kubeflow Pipelines API](https://www.kubeflow.org/docs/components/pipelines/reference/api/kubeflow-pipeline-api-spec/); it allows us to create experiments, run pipelines and upload pipelines.
- [`kfp.cli.diagnose_me`](https://github.com/kubeflow/pipelines/tree/master/sdk/python/kfp/cli/diagnose_me) - approaches for debugging an environment interactively; it returns various metadata that are useful for debugging the setup.

### `kfp` in the CLI

After installation, `kfp` is also available as a command-line tool.
- `kfp pipeline <COMMAND>` - for managing `pipeline`s; the commands include
    - `get`: retrieves detailed information about a Kubeflow pipeline from the Kubeflow Pipelines cluster.
    - `list`: lists the pipelines that have been uploaded to the Kubeflow Pipelines cluster.
    - `upload`: uploads a pipeline to the Kubeflow Pipelines cluster.
- `kfp run <COMMAND>` - for managing `kubeflow`'s runs.
    - `get`: displays the details of a pipeline run.
    - `list`: lists the recent pipeline runs.
    - `submit`: submits a pipeline run.

Consider the below example:

In [2]:
!kfp pipeline --help

Usage: kfp pipeline [OPTIONS] COMMAND [ARGS]...

  Manage pipeline resources

Options:
  --help  Show this message and exit.

Commands:
  delete          Delete an uploaded KFP pipeline
  get             Get detailed information about an uploaded KFP pipeline
  list            List uploaded KFP pipelines
  list-versions   List versions of an uploaded KFP pipeline
  upload          Upload a KFP pipeline
  upload-version  Upload a version of the KFP pipeline


## Component

> __This refers to a self-contained piece of code that executes one step in the `pipeline`.__

![](./images/kubeflow-graph.png)

In the above image, `Xgboost train` is an example of a component.

> It is analogous to a __large function__ performing __one semantically meaningful operation.__

Additionally, similar to a function, it has a name, parameters/arguments, return values and a body (code).

> __Each component must be packaged as a `Docker` image as they are standalone execution units.__

### Component code

The `python` component code comprises two parts:
- `client` - talks to endpoints to submit jobs, e.g. submitting `spark`'s `job`.
- `runtime` - actual code, e.g. creating `pyspark.sql.DataFrame` from `SparkSession`.

The accepted convention is to keep the component's code in a `package` named after it. For example,
- `/component`: `client` modules.
- `/component/component.py`: `client` code.

### Component definition

- `metadata`: name, description, etcetera.
- `interface`: input/output specification (name, type, description, default value, etcetera).
- `implementation`: actual code; it __also defines how to obtain `outputs` from it.__
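
To make the three parts concrete, here is a minimal sketch that writes a simplified specification for an `Add` component as a Python dictionary and groups its keys accordingly. The `command` entry, the `--output` flag and the `split_spec` helper are illustrative assumptions rather than the exact output of `kfp`; real specifications are stored as `.yaml`.

```python
# A simplified component specification, written as a Python dictionary.
# The command and the `--output` flag are hypothetical placeholders.
component_spec = {
    "name": "Add",
    "description": "Calculates sum of two arguments",
    "inputs": [{"name": "a", "type": "Float"}, {"name": "b", "type": "Float"}],
    "outputs": [{"name": "Output", "type": "Float"}],
    "implementation": {
        "container": {
            "image": "python:3.7",
            "command": ["python3", "add.py"],
            "args": [
                "--a", {"inputValue": "a"},
                "--b", {"inputValue": "b"},
                "--output", {"outputPath": "Output"},
            ],
        }
    },
}


def split_spec(spec):
    """Groups the top-level keys into metadata, interface and implementation."""
    return {
        "metadata": {key: spec[key] for key in ("name", "description") if key in spec},
        "interface": {key: spec[key] for key in ("inputs", "outputs") if key in spec},
        "implementation": spec["implementation"],
    }


parts = split_spec(component_spec)
print(sorted(parts))
```

Note how the `implementation` part also states where the output is written (`outputPath`), which is how the outputs are obtained from the container.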

> __Since components are defined using `python`, they must be transpiled to `k8s` readable `.yaml` definitions, as done above.__

> __Note: make sure to check `Challenges.Mandatory.Components` for alternatives.__

## Python Function-Based Components

> __Python function-based components allow us to define any component solely in `python`.__

This approach alleviates the additional steps required for the `component` definition, namely:
- defining the `Docker` image file.
- defining the `.yaml` file with component definitions.

### Merits

- The files above will be __automatically generated from the `python` code__.
- Improved readability as `components` are defined like functions.
- Relatively high development speed, since only `python` knowledge is required.

### Demerits

- Due to automation, it is difficult to customise (__However, this can be achieved after the transpilation by `kfp`__).
- For more complicated use cases (e.g. `Docker` image with custom dependencies), a `Dockerfile` must be created as a base for the automation, which eliminates one of the merits.

> __Note: Check `Challenges.Mandatory.Components` for a direct approach to creating `components`. This knowledge is mandatory and will be verified.__

Before proceeding, we create a [`kfp.Client`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.client.html) instance that will be used henceforth:

In [3]:
import kfp

# We are using default values
# Host is automatically inferred from within `jupyter notebook`s hence not specified
# In our case it would be localhost
client = kfp.Client()

## Standalone Functions

Before we proceed, here are a few things to `note` regarding functions, __which are considerably different from the conventional functions in `Python`__:

- > __The function must not rely on any code declared outside it.__

See the below example of a prohibited declaration:

In [None]:
x = 12

def foo():
    return x
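
To see why such a declaration fails, the sketch below executes the function source in an empty namespace. This is only a rough stand-in (our assumption, not a description of the internals of `kfp`) for a container in which nothing but the function's own code is available. The `run_isolated` helper and the parametrised version of `foo` are illustrative.

```python
# Two versions of the function, kept as text so that each can be executed
# in a fresh namespace that knows nothing about the notebook.
BROKEN = """
def foo():
    return x
"""

FIXED = """
def foo(x: int = 12) -> int:
    return x
"""


def run_isolated(source: str):
    """Defines `foo` from `source` in an empty namespace and calls it."""
    namespace = {}
    exec(source, namespace)
    return namespace["foo"]()


try:
    run_isolated(BROKEN)
    broken_error = None
except NameError as error:
    # `x` was never defined inside the isolated namespace.
    broken_error = error

fixed_result = run_isolated(FIXED)
print(type(broken_error).__name__, fixed_result)
```

Turning the outside value into a parameter (or defining it inside the body) keeps the function standalone.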

- > __`import` statements must be placed within the function.__

Although there is no apparent 'best practice', we recommend `import`ing every necessary dependency at the top of your function and dividing them with code blocks. See the example:

In [4]:
# We can't do that anymore.
# import numpy as np


def bar():
    ###########################################################################
    #
    #                               IMPORTS
    #
    ###########################################################################

    import numpy as np
    import this  # Well, now I'm not so sure

    ###########################################################################
    #
    #                                 SRC
    #
    ###########################################################################

    return np.array([1])

- > __Helper functions are defined within the function__
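
As a minimal sketch of all three rules together, the function below keeps its import and its helper inside its own body. The function name and the calculation are illustrative choices, not taken from the `kfp` documentation.

```python
def mean_absolute_deviation(values: list) -> float:
    """Standalone function: everything it needs lives inside its body."""
    # Imports are placed inside the function.
    import statistics

    # The helper is defined inside the function, so it travels with it.
    def deviations(items, centre):
        return [abs(item - centre) for item in items]

    centre = statistics.mean(values)
    return statistics.mean(deviations(values, centre))


result = mean_absolute_deviation([1.0, 2.0, 3.0, 6.0])
print(result)
```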

### Demerits

As you can probably tell, the drawbacks of this approach are significant, particularly for relatively large and highly complicated functions, where transformation into separate microservices is impractical:

- Long code that is only semi-readable, even when good practices are incorporated.

### Alternatives

In the following cases, an alternative may be better:
- For __long-term `services` carrying out laborious tasks.__
- For optimisation while preserving code readability (disk `I/O` is discouraged because of its high cost).

*Note:* 

- __There is an option to `cache` results; however,__ __this does not resolve the readability problem.__
- __The alternative, however, requires creating a standalone program and defining the `.yaml` and `Dockerfile` yourself.__

## Simple Function-Based `Component`

The necessary steps are outlined below.

1. __Create a `standalone` function.__ 

*Things to note:*

- Use Python type hints (the `typing` feature) so that the `type`s can be inferred.
- __Their importance will become apparent when multiple values are returned.__

In [5]:
def add(a: float, b: float) -> float:
  '''Calculates sum of two arguments'''
  return a + b

2. __Create [`kfp.dsl.ContainerOp`](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.dsl.html#kfp.dsl.ContainerOp) using `kfp.components.create_component_from_func`.__

*Things to note:*
- A factory function will be created, which can be handled in different ways.
- __This `op` should be used within `Pipeline` (described in more detail later).__
- A `.yaml` component definition is automatically created.

In [7]:
add_op = kfp.components.create_component_from_func(
    add, output_component_file="add_component.yaml"
)

3. __Create `Pipeline` that runs our `op`(s).__

Please note the `comments` in the code below:

In [8]:
@kfp.dsl.pipeline(
  name='Addition pipeline',
  description='An example pipeline that performs addition calculations.'
)
def add_pipeline(
  a='1',
  b='7',
):
  # Passes a pipeline parameter and a constant value to the `add_op` factory
  # function.
  first_add_task = add_op(a, 4)
  # Passes an output reference from `first_add_task` and a pipeline parameter
  # to the `add_op` factory function. For operations with a single return
  # value, the output reference can be accessed as `task.output` or
  # `task.outputs['output_name']`.
  second_add_task = add_op(first_add_task.output, b)

4. __Create and `run` the pipeline from function.__

> __Please run `kubeflow` port forwarding as described previously to view the `run`.__

In [10]:
# Specify argument values for your pipeline run.
arguments = {'a': '7', 'b': '8'}

# Create a pipeline run using the client you initialised in a prior step.
client.create_run_from_pipeline_func(add_pipeline, arguments=arguments)

RunPipelineResult(run_id=85870662-5aa5-4c56-a73e-3a343237565c)

Your run should be visible within the `UI`. __We encourage you to explore the relevant info in the different `UI` sections.__

![](./images/example-add-run.png)

## Using Packages in Functions

> To use custom `package`s, one has to install them within a `Docker` environment.

Ordered from most recommended to least recommended, here are some approaches for achieving this:
1. Contained within `Docker` image: __choose an appropriate base image for your `microservice`__ to prevent issues; __use the `base_image` argument of `kfp.components.create_component_from_func`.__
2. Install packages within `Docker` image: __useful when `Docker` image has most of the `packages` and a few extras are required__; __use `packages_to_install` argument.__
3. Install packages using `subprocess` (from your `function` code): this approach is highly __discouraged__; if you must use it, do so __only for local packages.__

Consider the example:

In [None]:
kfp.components.create_component_from_func(
    my_op,  # placeholder for any standalone function, e.g. `add`
    # output_component_file is optional
    output_component_file="add_component.yaml",
    base_image="tensorflow/tensorflow:1.11.0-py3",
    packages_to_install=("torchdata==0.2.0", "torchlayers==0.1.1"),
)

### Additional info about `create_component_from_func`

- Default image: `python:3.7`
- __For relatively large dependencies, create the `Docker` base image from scratch and deploy it.__ This is because
    - the component starts significantly faster (no need to download packages at run time).
    - it is less error-prone (less likely to be OS-dependent).

## Data

> __`Inputs`, `outputs` and data passing in `kubeflow`.__

In general,
- `inputs` are `CLI` arguments for `Docker` containers within `PODs`.
- `outputs` are returned as files.
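
The following sketch illustrates this contract without a cluster: inputs arrive as command-line arguments and the output is written to a file. The `run_component` function and the `--output-path` flag are hypothetical simplifications of what a generated wrapper does, not the code that `kfp` actually emits.

```python
import argparse
import os
import tempfile


def add(a: float, b: float) -> float:
    """Calculates sum of two arguments"""
    return a + b


def run_component(argv):
    """Parses CLI-style inputs, runs `add` and writes the output to a file."""
    parser = argparse.ArgumentParser(prog="Add")
    parser.add_argument("--a", type=float, required=True)
    parser.add_argument("--b", type=float, required=True)
    parser.add_argument("--output-path", type=str, required=True)
    args = parser.parse_args(argv)

    result = add(args.a, args.b)

    # The output is serialised to text and returned as a file.
    os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
    with open(args.output_path, "w") as handle:
        handle.write(str(result))
    return args.output_path


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "outputs", "Output")
    path = run_component(["--a", "1", "--b", "2", "--output-path", target])
    with open(path) as handle:
        written = handle.read()

print(written)
```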

### Passing parameters
Parameters are passed in different ways.
- Basic types (e.g. `float`, `int` or short `str`) __are passed by value.__
- Parameters passed by file include large data, such as
    - `csv` files
    - images
    - datasets
- __Considerably large `parameters` are stored in specified `PersistentVolumes`.__

### Inferring `types`

> `kubeflow`'s components created from functions can infer `dtype`s __via Python's typing feature.__

Consider the generated `add`'s component `.yaml`: 

In [11]:
!cat ./add_component.yaml

name: Add
description: Calculates sum of two arguments
inputs:
- {name: a, type: Float}
- {name: b, type: Float}
outputs:
- {name: Output, type: Float}
implementation:
  container:
    image: python:3.7
    command:
    - sh
    - -ec
    - |
      program_path=$(mktemp)
      printf "%s" "$0" > "$program_path"
      python3 -u "$program_path" "$@"
    - |
      def add(a, b):
        '''Calculates sum of two arguments'''
        return a + b

      def _serialize_float(float_value: float) -> str:
          if isinstance(float_value, str):
              return float_value
          if not isinstance(float_value, (float, int)):
              raise TypeError('Value "{}" has type "{}" instead of float.'.format(str(float_value), str(type(float_value))))
          return str(float_value)

      import argparse
      _parser = argparse.ArgumentParser(prog='Add', description='Calculates sum of two arguments')
      _parser.add_argument("--a", dest="a", type=floa

Observe the volume of code automatically generated by `kubeflow`, which includes
- argument parsing via `argparse` module.
- outputting values __and serialising them to the desired type.__
- saving data within the `POD`'s storage (`PersistentVolume`).
- the whole `.yaml` structure.

> The above code is not meant to be readable as it is automatically generated.

In this case, the `type`s were inferred from the __function signature.__

> __If the function signature carries no type annotations, it is assumed that `str` types are being passed.__
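
As a rough illustration of signature-based inference, the sketch below reads the annotations with the standard `inspect` module and maps them to type names. The `TYPE_NAMES` mapping and the `describe_interface` helper are assumptions made for this example; they mimic, but are not, the logic used by `kfp`.

```python
import inspect

# Assumed mapping from Python annotations to component type names.
TYPE_NAMES = {float: "Float", int: "Integer", str: "String", bool: "Boolean"}


def describe_interface(func):
    """Builds an inputs/outputs description from the function signature."""
    signature = inspect.signature(func)
    inputs = [
        # Missing or unknown annotations fall back to "String".
        {"name": name, "type": TYPE_NAMES.get(parameter.annotation, "String")}
        for name, parameter in signature.parameters.items()
    ]
    outputs = []
    if signature.return_annotation is not inspect.Signature.empty:
        outputs.append(
            {"name": "Output", "type": TYPE_NAMES.get(signature.return_annotation, "String")}
        )
    return {"inputs": inputs, "outputs": outputs}


def add(a: float, b: float) -> float:
    return a + b


def untyped_add(a, b):
    return a + b


typed_interface = describe_interface(add)
untyped_interface = describe_interface(untyped_add)
print(typed_interface)
print(untyped_interface)
```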

Notably, this function returns one value.

> To return multiple values, __use `NamedTuple` from the `typing` module to annotate the function's return type appropriately.__

Consider the more complicated example:

In [None]:
from typing import NamedTuple

def multiple_return_values_example(a: float, b: float) -> NamedTuple(
  'ExampleOutputs',
  [
    ('sum', float),
    ('product', float),
    ('mlpipeline_ui_metadata', 'UI_metadata'),
    ('mlpipeline_metrics', 'Metrics')
  ]):
  """Example function that demonstrates how to return multiple values."""
  sum_value = a + b
  product_value = a * b

  # Export a sample tensorboard
  metadata = {
    'outputs' : [{
      'type': 'tensorboard',
      'source': 'gs://ml-pipeline-dataset/tensorboard-train',
    }]
  }

  # Export two metrics
  metrics = {
    'metrics': [
      {
        'name': 'sum',
        'numberValue':  float(sum_value),
      },{
        'name': 'product',
        'numberValue':  float(product_value),
      }
    ]
  }

  from collections import namedtuple
  example_output = namedtuple(
      'ExampleOutputs',
      ['sum', 'product', 'mlpipeline_ui_metadata', 'mlpipeline_metrics'])
  return example_output(sum_value, product_value, metadata, metrics)


The above example returns `metadata` for `UI` and `metrics`.

> The special names `mlpipeline_ui_metadata` and `mlpipeline_metrics` are used for these two values __so that they can be read by the `UI` and `metrics`.__

Additionally, we must return a `namedtuple`, which is __the only way to return multiple values.__

> Please note that these values __will be saved to the `disk` anyway.__
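
A stripped-down sketch of the same pattern, with only two plain outputs, may make the mechanics easier to follow. The function and output names are illustrative; the point is that the output names can be read both from the return annotation and from the returned value.

```python
from collections import namedtuple
from typing import NamedTuple


def sum_and_product(a: float, b: float) -> NamedTuple(
    "ArithmeticOutputs", [("sum", float), ("product", float)]
):
    """Returns two named values instead of one."""
    outputs = namedtuple("ArithmeticOutputs", ["sum", "product"])
    return outputs(a + b, a * b)


# Output names declared in the signature.
declared_names = sum_and_product.__annotations__["return"]._fields

# Output names and values of an actual call.
result = sum_and_product(2.0, 3.0)
named_outputs = result._asdict()
print(declared_names, dict(named_outputs))
```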

### Caching

> Note that `kubeflow` provides caching out of the box.

#### Working mechanism

- If `component` was run previously __with the same arguments,__ this component __will not run.__
- Instead, outputs from the `PersistentVolume` of choice will be forwarded to the next `component` within the `pipeline`.
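
The idea can be sketched in a few lines: a key is derived from the component and its arguments, and the component only runs when that key has not been seen. The in-memory dictionary below stands in for the persistent storage, and `cache_key` and `run_cached` are hypothetical helpers, not part of `kfp`.

```python
import hashlib
import json

_cache = {}      # stand-in for outputs kept in persistent storage
executions = []  # records which calls actually executed


def cache_key(component_name, arguments):
    """Derives a stable key from the component name and its arguments."""
    payload = json.dumps(
        {"component": component_name, "arguments": arguments}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(component_name, func, **arguments):
    """Runs `func` only if this component/argument pair was not run before."""
    key = cache_key(component_name, arguments)
    if key not in _cache:
        executions.append(component_name)
        _cache[key] = func(**arguments)
    return _cache[key]


def add(a, b):
    return a + b


first = run_cached("add", add, a=1.0, b=4.0)   # executes
second = run_cached("add", add, a=1.0, b=4.0)  # same arguments: served from cache
third = run_cached("add", add, a=2.0, b=4.0)   # new arguments: executes again
print(first, second, third, len(executions))
```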

One can disable this feature or __force recalculation after some time__. For more information, see [here](https://www.kubeflow.org/docs/components/pipelines/caching/) and [here](https://www.kubeflow.org/docs/components/pipelines/caching-v2/) (`V2` SDK).

## Passing Parameters by File

> __In most cases, `parameters` will be the files (e.g. dataset) on which we wish to operate.__

This raises the question of what happens if the `data` are saved within the function and `None` is __returned implicitly.__

In this case, the function's return annotation offers no __appropriate way__ of inferring what is actually produced.
A workaround exists, but it is not compliant with static type checkers such as `mypy`.

> As a solution, we can use special `kfp.components` types __to mark the output returned by the function.__

See the simple example below:

In [None]:
import kfp.components as comp


def split_text_lines(
    source_path: comp.InputPath(str),
    odd_lines_path: comp.OutputPath(str),
    even_lines_path: comp.OutputPath(str),
):
    """Splits a text file into two files, with even lines going to one file
    and odd lines to the other."""

    with open(source_path, "r") as reader:
        with open(odd_lines_path, "w") as odd_writer:
            with open(even_lines_path, "w") as even_writer:
                while True:
                    line = reader.readline()
                    if line == "":
                        break
                    odd_writer.write(line)
                    line = reader.readline()
                    if line == "":
                        break
                    even_writer.write(line)
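
Because the function only reads from and writes to the paths it is given, its logic can be checked locally before it is turned into a component. The sketch below is a plain-path variant (ordinary `str` annotations instead of the `kfp.components` markers, so that it runs without `kfp`), exercised on temporary files.

```python
import os
import tempfile


def split_text_lines(source_path: str, odd_lines_path: str, even_lines_path: str):
    """Plain-path variant: odd lines go to one file, even lines to the other."""
    with open(source_path, "r") as reader, \
            open(odd_lines_path, "w") as odd_writer, \
            open(even_lines_path, "w") as even_writer:
        for index, line in enumerate(reader):
            # index 0 is the first line of the file, i.e. an odd line.
            writer = odd_writer if index % 2 == 0 else even_writer
            writer.write(line)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "source.txt")
    odd = os.path.join(workdir, "odd.txt")
    even = os.path.join(workdir, "even.txt")
    with open(source, "w") as handle:
        handle.write("one\ntwo\nthree\nfour\n")

    split_text_lines(source, odd, even)

    with open(odd) as handle:
        odd_lines = handle.read().splitlines()
    with open(even) as handle:
        even_lines = handle.read().splitlines()

print(odd_lines, even_lines)
```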

### Additional resources
- You can find other `Input` definitions (e.g. binary) [here](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.InputBinaryFile).
- You can find other `Output` definitions (e.g. binary) [here](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.OutputBinaryFile).

## Component Best Practices

> Below is a compressed list of the best practices associated with using components (a full list can be found [here](https://www.kubeflow.org/docs/components/pipelines/sdk/best-practices/)).

- __Local files__ should be used with components, __unless not possible__ (Cloud ML Engine and BigQuery require Cloud Storage staging paths).
- Use __pure components__, i.e. those without side effects; nothing should be modified without 'informing `kubeflow`'.
- __Mix and match programming languages__: use other languages if `python` is not beneficial.
    - Create a `Docker` image containing the `golang` app which requires some `input`s and outputs.
    - Define `.yaml` directly.
    - __Use inter-language formats for data exchange__ (e.g. `JSON`, `CSV`, `ProtoBuf`, etcetera).
    - Minor pre-processing for different languages can be performed with `shell` scripts (__keep it minor__).
- __One output == one file.__
- __Do not pollute with temporary data__: temporary data are also __preserved by `PersistentVolume`.__
- __Stay local whenever possible__: this approach is easier and less stressful, e.g. when unit testing a specific component.
- __Test in isolation__: use a single container; if not possible, run `minikube` or the like (easy to debug) before going with `kubeadm` and full-blown clusters.

## AiCore Recommendations

- __Do not use function-based components for highly complicated workflows__: defining the components directly allows you to use packages and `python` best practices more freely.
- __Be careful of size__: keep the services within 'reasonable size'.
    - use `microservice` to transform data instead of the following:
        - `microservice` to load data.
        - `microservice` to perform one operation on data.
        - `microservice` to rotate an image.
    - Due to the high cost of __I/O,__ its use should be minimised where possible.

## Pipeline

> __This refers to the description of an ML `workflow`, including all of the `component`s in the `workflow` and how they combine in the form of a `graph`.__

Generally, it consists of code packaged in `Docker` image with some `inputs` and `outputs` (as we will later see).

By observing parts of the pipeline, we note the following:
- Some of the components can easily run in parallel (in a different `POD` scheduled on a `Node`).
- They are directly dependent on the previous steps (similar to `Airflow`).
- Data are shared via `artifacts` __as `POD`s do not share data directly.__
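
The observations above can be illustrated with a small dependency graph. The sketch below uses the standard `graphlib` module (Python 3.9+) to group tasks into waves that could run in parallel; the task names are illustrative and the scheduling shown is a simplification, not the actual engine.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it needs.
dependencies = {
    "web_downloader": set(),
    "merge_csv": {"web_downloader"},
    "add": set(),
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    # Every task returned here has all of its dependencies satisfied,
    # so the tasks of one wave could run in parallel.
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

print(waves)
```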

The above `graph` is defined via `SDK` (`python`) and the `dsl` (or rather pseudo-dsl) specifically created for this task.

> Once all the concepts have been described, we will see how to create the whole structure in `python`.

### Defining a `pipeline`

Here, we learn how to define a `Pipeline` using `kfp`. However, prior to that, we define the following `component`s:
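- The first one extracts the `.tar.gz` archive and merges the `CSV` files it contains into a single `CSV`.
- The second, which we do __not define ourselves__ (it is loaded from an existing specification), downloads resources from a `url`.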
- The last is the `add` component, which showcases the functionality of the `conditional` flow in `dsl`.

Before proceeding, we transform `kfp.components.create_component_from_func` into a __configurable decorator.__

> Note that `kfp.components.create_component_from_func` can be used as a decorator, __although only with default arguments.__

In [None]:
import functools

# Wrapper
@functools.wraps(kfp.components.create_component_from_func)
def our_component_from_func(*args, **kwargs):
    def wrapper(function):
        return kfp.components.create_component_from_func(function, *args, **kwargs)

    return wrapper
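
The decorator-factory pattern itself can be tried without `kfp`. In the sketch below, `create_component` is a hypothetical stand-in that merely records its configuration on the returned callable; it is not the real `create_component_from_func`.

```python
import functools


def create_component(func, base_image="python:3.7", packages_to_install=None):
    """Stand-in: records the configuration instead of building a component."""
    @functools.wraps(func)
    def factory(*args, **kwargs):
        return func(*args, **kwargs)

    factory.base_image = base_image
    factory.packages_to_install = list(packages_to_install or [])
    return factory


def configurable_component(**options):
    """Decorator factory: captures the options, then decorates the function."""
    def wrapper(func):
        return create_component(func, **options)

    return wrapper


@configurable_component(base_image="python:3.9", packages_to_install=["pandas==1.1.4"])
def double(x: float) -> float:
    return 2 * x


print(double(2.0), double.base_image, double.packages_to_install)
```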

In [None]:
import kfp.components as comp


@our_component_from_func(
    output_component_file="component.yaml",  # This is optional. It saves the component spec for future use.
    base_image="python:3.7",
    packages_to_install=["pandas==1.1.4"],
)
def merge_csv(file_path: comp.InputPath("Tarball"), output_csv: comp.OutputPath("CSV")):
    import glob
    import tarfile

    import pandas as pd

    tarfile.open(name=file_path, mode="r|gz").extractall("data")
    df = pd.concat(
        [pd.read_csv(csv_file, header=None) for csv_file in glob.glob("data/*.csv")]
    )
    df.to_csv(output_csv, index=False, header=False)

To __reuse components__, simply provide the `URI` resource that contains the `.yaml` specification:

In [None]:
web_downloader_op = kfp.components.load_component_from_url(
    "https://raw.githubusercontent.com/kubeflow/pipelines/master/components/web/Download/component.yaml"
)

With everything in place, we can define the `pipeline` via the following steps:

1. Define the `pipeline` function.
2. Decorate the `pipeline` with `dsl.pipeline` and provide the necessary information therein.
3. Pass it to `client.create_run_from_pipeline_func` following the steps shown previously __or__ compile it and run from the `UI`.

Consider the below example with the first and second steps applied:

In [None]:
@kfp.dsl.pipeline(name="Example pipeline", description="Shows basics of pipelines")
# Define a pipeline and create a task from a component:
# We do not have to specify the types
def my_pipeline(url, run_add: bool):
  web_downloader_task = web_downloader_op(url=url)
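  # `merge_csv` is the component defined above; its `file_path` input is exposed as `file`
  merge_csv_task = merge_csv(file=web_downloader_task.outputs['data'])
  # Only `if` is available (no `else`); `Condition` expects a comparison expression
  with kfp.dsl.Condition(run_add == True):
      first_add_task = add_op(1, 4)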
  # The outputs of the merge_csv_task can be referenced using the
  # merge_csv_task.outputs dictionary: merge_csv_task.outputs['output_csv']

Now, we apply the first option of running it non-interactively:

In [None]:
client.create_run_from_pipeline_func(
    my_pipeline,
    arguments={
        "url": "https://storage.googleapis.com/ml-pipeline-playground/iris-csv-files.tar.gz",
        "run_add": False,
    },
)

Next, we see the option of running it interactively: compile the pipeline to a `.yaml` file, which can then be uploaded and run via the `UI` within the `pipelines` tab:

In [None]:
kfp.compiler.Compiler().compile(
    pipeline_func=my_pipeline,
    package_path='pipeline.yaml',
)

## Conclusion
At this point, you should have a good understanding of 

- the important concepts in Kubeflow pipelines and its installation process.
- components and standalone functions.
- how to use packages in functions.
- how to pass parameters by file.
- the best practices for using components.
- pipelines and how to define them.