**Difficulty: Beginner**

# Summary

This example builds a very simple pipeline from dockerized components.  It is presented with context [here](https://statcan.github.io/daaas/en/3-Pipelines/Kubeflow-Pipelines/).

# Set up our Components

Our pipeline uses a single component, *average*, defined as a docker image.  This docker image contains a single script that accepts numbers and returns their average.  The API is roughly equivalent to:

```
average.py number1 number2 ... numberN
```

which generates a file `out.txt` inside the running container with the result.  So we would expect:

```
average_docker_image 1 2 3
cat out.txt
```
to print `2`, the average of 1, 2, and 3.  To make this script portable, we package it as a docker image and put that image in our container registry.  The full definition of the container is found in `/containers/average`.
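
As a rough illustration of the behaviour described above, the sketch below averages a list of string arguments and writes the result to a file.  It is not the script shipped in the image: the function name `write_average`, the temporary output location, and the exact output formatting (`2.0` rather than `2`) are all assumptions made for this example.

```python
import tempfile
from pathlib import Path


def write_average(args, out_path):
    """Average numeric strings and write the result to out_path (hypothetical helper)."""
    if not args:
        raise ValueError("Must specify at least one number to take the average of")
    # Command line arguments always arrive as strings, so convert them first
    numbers = [float(arg) for arg in args]
    result = sum(numbers) / len(numbers)
    Path(out_path).write_text(str(result))
    return result


# Demo: write to a temporary file instead of ./out.txt so nothing is overwritten
demo_out = Path(tempfile.mkdtemp()) / "out.txt"
write_average(["1", "2", "3"], demo_out)
print(demo_out.read_text())  # 2.0
```

The key point is the contract rather than the implementation: strings in through the command line, a single result out through a file.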

To use this image in a Kubeflow pipeline, we build a ContainerOp, which tells Kubeflow how to interact with our docker image (image location, how to pass arguments to it, what to return from the container, etc.).  To actually use these ContainerOps in our pipeline, we build python factory functions like `average_op` below that create instances of ContainerOp for us, wired how we need.

In [None]:
from kfp import dsl

In [None]:
def average_op(*numbers):
    """
    Factory for average ContainerOps
    
    Accepts an arbitrary number of input numbers, returning a ContainerOp that passes those
    numbers to the underlying docker image for averaging
    
    Returns output collected from ./out.txt from inside the container

    """
    # Input validation
    if len(numbers) < 1:
        raise ValueError("Must specify at least one number to take the average of")
        
    return dsl.ContainerOp(
        name="average",  # What will show up on the pipeline viewer
        image="k8scc01covidacr.azurecr.io/kfp-components/average:v1",  # The image that KFP runs to do the work
        arguments=numbers,  # Passes each number as a separate (string) command line argument
        # Script inside container writes the result (as a string) to out.txt, which 
        # KFP reads for us and brings back here as a string
        file_outputs={'data': './out.txt'},  
    )

When called, this function returns an instance of ContainerOp that is configured to:
* pass our `numbers` argument to the docker image (and thus the `average.py` script) as space separated command line arguments (`number0 number1 number2 ...`)
* run the container
* collect results by reading the file `./out.txt` (from inside the container) **into a string variable\***, making the output available to downstream components in the pipeline.  

In a more complex pipeline, we'd typically have multiple different functions that create different ContainerOps, but one is good for us here.

**\*NOTE: Data passed here (arguments in and results returned) are all converted to strings.**

* Arguments (our `*numbers`) are passed as strings to the docker container via command line arguments.
* Results are collected from files (defined in `file_outputs` above) by Kubeflow Pipelines and passed back to the rest of the pipeline as strings (one string per file).

This is fine for small, simple data (e.g. our numbers here).  For more complex objects, you must stringify them (convert to JSON, etc.).  For large results, it is likely better to put your results into a data store (e.g. a minio bucket) rather than a simple output file, and then return a string path to the data rather than the data itself.

In our case here, this string detail doesn't affect us because our data is simple.  But this could be a big deal if we wanted to return a binary numpy file.
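
To make the stringification point concrete, here is a minimal sketch of round-tripping a small structured result through JSON.  The helper names `to_output_string` and `from_input_string` and the `summary` dictionary are hypothetical and are not part of this example's components.

```python
import json


def to_output_string(obj):
    """Serialize a JSON-compatible object to the string a component would write out."""
    return json.dumps(obj)


def from_input_string(text):
    """Rebuild the object from the string a downstream component receives."""
    return json.loads(text)


summary = {"mean": 6.0, "count": 3, "values": [5, 5, 8]}  # hypothetical result
as_text = to_output_string(summary)
restored = from_input_string(as_text)

print(type(as_text).__name__)  # str
print(restored == summary)     # True

# Plain numbers need the same care: the string must be converted back before doing math
print(float("6.0") + 1)  # 7.0
```

This only works for JSON-compatible data.  Binary formats would need a different encoding or, as noted above, a path to a data store.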

# Set up our Pipeline

A *pipeline* is a workflow of *components*.  The *pipeline* orchestrates our components (sets the order they run in, ensures that componentA passes data to componentB, etc.) to accomplish our work.  In this example, we define a pipeline that:

1. takes an *average* of one group of numbers
2. takes an *average* of a second group of numbers
3. takes the *average* of the results (1) and (2)

Pipelines are defined as python functions decorated by the @dsl.pipeline decorator:

In [None]:
@dsl.pipeline(
    name="my pipeline's name"
)
def my_pipeline(a, b, c, d, e):
    """
    Averaging pipeline which accepts five numbers and does some averaging operations on them
    """
    # Compute averages for two groups
    avg_1 = average_op(a, b, c)
    avg_2 = average_op(d, e)
    
    # Use the results from _1 and _2 to compute an overall average
    average_result_overall = average_op(avg_1.output, avg_2.output)

In the above pipeline we create two averages:
* `avg_1`: takes the average of parameters a, b, and c
* `avg_2`: takes the average of parameters d and e

Our pipeline will run `avg_1` and `avg_2`, then pass their outputs to the third average operation.  That data exchange happens by using the `.output` attributes:

```
average_op(avg_1.output, avg_2.output)
```

This sort of chaining also helps Kubeflow Pipelines with the control flow.  By saying the third average needs outputs from `avg_1` and `avg_2`, we ensure that Kubeflow Pipelines won't run the last average until the others are complete.
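
The idea that data dependencies determine run order can be illustrated with Python's standard `graphlib` module (Python 3.9+).  This is only an analogy for the ordering described above, not how Kubeflow Pipelines schedules work internally, and the step name `avg_overall` is made up for this sketch.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps whose outputs it consumes
dependencies = {
    "avg_1": set(),
    "avg_2": set(),
    "avg_overall": {"avg_1", "avg_2"},  # hypothetical name for the third average
}

# static_order() yields every step only after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())

print(order[-1])          # avg_overall
print(sorted(order[:2]))  # ['avg_1', 'avg_2']
```

`avg_1` and `avg_2` have no dependency on each other, so either may come first; only the final step has a fixed position.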

To translate our python pipeline function into a definition Kubeflow Pipelines can use, we export it to a zipped YAML file.  This YAML is a reusable definition of our pipeline that describes all the logic we set above (what to run first, how to run *average*, etc.) but without any runtime particulars (such as the values of `a, b, ...`).  Unzip the YAML and take a look for yourself!

In [None]:
from kfp import compiler
pipeline_yaml = 'pipeline.yaml.zip'
compiler.Compiler().compile(
    my_pipeline,
    pipeline_yaml
)
print(f"Exported pipeline definition to {pipeline_yaml}")
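
If you would rather inspect the archive from Python than unzip it by hand, something like the sketch below should work, assuming the exported file is a standard zip archive.  The helper `read_zip_members` is hypothetical, and the demo builds a tiny stand-in archive (with made-up contents) so that it runs without Kubeflow; point the helper at `pipeline_yaml` to read the real export.

```python
import tempfile
import zipfile
from pathlib import Path


def read_zip_members(zip_path):
    """Return {member name: decoded text} for every file in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return {
            name: archive.read(name).decode("utf-8")
            for name in archive.namelist()
        }


# Stand-in archive for demonstration only; its contents are not a real pipeline
demo_zip = Path(tempfile.mkdtemp()) / "demo.yaml.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("pipeline.yaml", "kind: Workflow\n")

members = read_zip_members(demo_zip)
for name, text in members.items():
    print(name)
    print(text)
```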

# Run our Pipeline

With our above YAML file, we can now submit our pipeline. To do this, we:

* define an experiment (a group of pipeline runs that this run will belong to)
* submit an instance of our pipeline to Kubeflow Pipelines (populated by the parameters we want to investigate)

## Define an experiment

In [None]:
experiment_name = "averaging-pipeline"

import kfp
client = kfp.Client()
exp = client.create_experiment(name=experiment_name)

![Experiment details](figures/average_with_docker_components__experiment.png)

## Run an instance of the pipeline

When running the pipeline, we specify the values we want to use for **this** run of the pipeline (we can then reuse the pipeline with new parameters later!)

In [None]:
pl_params = {
    'a': 5,
    'b': 5,
    'c': 8,
    'd': 10,
    'e': 18,
}
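
As a sanity check on what the run should produce, the arithmetic for these parameters can be worked through in plain Python.  The local `average` function below is a stand-in for the containerized component, not the component itself.

```python
def average(*numbers):
    """Local stand-in for the average component."""
    return sum(numbers) / len(numbers)


params = {"a": 5, "b": 5, "c": 8, "d": 10, "e": 18}

avg_1 = average(params["a"], params["b"], params["c"])  # (5 + 5 + 8) / 3
avg_2 = average(params["d"], params["e"])               # (10 + 18) / 2
overall = average(avg_1, avg_2)

print(avg_1, avg_2, overall)  # 6.0 14.0 10.0
```

Note that the final value is an average of averages (10.0), which differs from the average of all five numbers taken together (9.2).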

In [None]:
import time

run = client.run_pipeline(
    exp.id,  # Run inside the above experiment
    experiment_name + '-' + time.strftime("%Y%m%d-%H%M%S"),  # Give our job a name with a timestamp so it's unique
    pipeline_yaml,  # Pass the .yaml.zip we created above.  This defines the pipeline
    params=pl_params  # Pass our parameters we want to run the pipeline with
)

Now click the links above to see your pipeline in action.

![Run details](figures/average_with_docker_components__run.png)