# Summary:

**Tutorial Difficulty: Beginner**

This notebook documents:
* How to build a simple pipeline from operations defined with docker containers
* A map-reduce workflow pattern, where we:
    * (map) break our work into many small pieces that can be done in parallel, and then
    * (reduce) aggregate the product of that work back to some final result

To bring it all together, we apply these techniques to compute an estimate of pi. 

# Pipeline to estimate pi, in the most ridiculously parallel way possible

This pipeline estimates pi by repeating the process of:

* Picking a random location inside a 2x2 square centered on the origin
* Checking whether or not that point also resides inside a unit circle centered on the origin
* Assigning a value to this point:
    * value = 4 if the point is inside the circle (red)
    * value = 0 if the point is outside the circle (blue)

By doing this repeatedly and taking the average value over all repetitions, we obtain an estimate of pi.
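
This works because the square has area 4 and the unit circle has area pi, so a uniformly random point in the square lands inside the circle with probability pi/4. Writing $V$ for the value assigned to one point, its expected value is:

$$
\mathbb{E}[V] = 4 \cdot P(x^2 + y^2 \le 1) + 0 \cdot P(x^2 + y^2 > 1) = 4 \cdot \frac{\pi}{4} = \pi
$$

The average over many independent points therefore converges to pi.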

![Parallel Monte Carlo](images/Pi.png)

We implement this procedure using the map-reduce pattern by:
* **Map:** Generating N **sample** operations which pick the point and assign it a value of 0/4.  Note that each **sample** operation is given a different random seed to ensure it picks a different point in the square
* **Reduce:** Combining all **sample** results in an **average** step which then returns the estimate of pi
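
Before moving to containers, it may help to see the same map-reduce logic as plain Python running in a single process. This is only an illustrative sketch: the function names `sample`, `average`, and `estimate_pi` are made up for this example, and the real containers may pick points and treat the circle boundary differently.

```python
import random


def sample(seed):
    """Map step: pick one point in the 2x2 square and score it 4 or 0."""
    rng = random.Random(seed)  # a different seed gives a different point
    x = rng.uniform(-1.0, 1.0)
    y = rng.uniform(-1.0, 1.0)
    # Inside (or on) the unit circle scores 4, outside scores 0
    return 4 if x * x + y * y <= 1.0 else 0


def average(values):
    """Reduce step: combine all sample results into a single estimate."""
    if not values:
        raise ValueError("values must contain at least one number")
    return sum(values) / len(values)


def estimate_pi(n_samples):
    """Run n_samples map steps (here in series), then one reduce step."""
    results = [sample(seed) for seed in range(n_samples)]
    return average(results)
```

Calling `estimate_pi(15)` mirrors the shape of the pipeline built below, except that the pipeline runs each `sample` in its own container, in parallel.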

The pipeline, as visualized in kubeflow pipelines, looks like this:

![The pipeline](images/kf-pipeline.png)

The top row of **sample** operations all feed into the single **average** step on the second row.

# Set up our Project

Define user-level project variables

In [1]:
#################################
### Configure your variables ####
#################################

# Number of parallel sample steps we run
SAMPLES = 15

# Name of our experiment in kubeflow
# Experiment name can contain alphanumeric characters, hyphens, or underscores
EXPERIMENT_NAME = "compute-pi"

# Names we assign to our components (they're used in the definitions below)
# This is what will show up in the Kubeflow Pipelines UI
# These have the same naming restrictions as experiment names
SAMPLE_PIPELINE_OP_NAME = "sample"
AVERAGE_PIPELINE_OP_NAME = "average"

In [2]:
###########################
### DON'T EDIT:         ###
### Validate our inputs ###
###########################
from utilities import validate_kfp_name

assert validate_kfp_name(EXPERIMENT_NAME)
assert validate_kfp_name(SAMPLE_PIPELINE_OP_NAME)
assert validate_kfp_name(AVERAGE_PIPELINE_OP_NAME)
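
The `utilities` module is not reproduced in this notebook, so the real checks may differ. As a rough sketch of the rule stated above (names may contain alphanumeric characters, hyphens, or underscores), a hypothetical validator called `is_valid_name` could look like this:

```python
import re

# Hypothetical pattern: one or more letters, digits, hyphens, or underscores
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_name(name):
    """Return True if name uses only the characters allowed above."""
    return isinstance(name, str) and _NAME_PATTERN.fullmatch(name) is not None
```

Under this assumed rule, `"compute-pi"` passes, while a name containing a space does not.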

In [3]:
########################################
### DON'T EDIT:                      ###
### Path to the containers used here ###
########################################
SAMPLE_IMAGE_PATH = "k8scc01covidacr.azurecr.io/kfp-components/map-reduce/sample:v1"
AVERAGE_IMAGE_PATH = "k8scc01covidacr.azurecr.io/kfp-components/map-reduce/average:v1"

# Define the pipeline

This is where we define all operations in our pipeline, as well as how they chain together.  Pipelines are defined by separate, typically single purpose, operations (or steps).  Each pipeline operation could be used once, multiple times, etc., and might depend on results from upstream steps.

## Define the pipeline operations

Our pipeline here has two steps, both of which are defined in docker containers (paths to those containers were specified above and are used below).  Each step is created by a factory function that returns a ContainerOp.  These ContainerOps are then used to define the actual pipeline in the next section.

For this example, the containers are already built and pushed to ```k8scc01covidacr.azurecr.io``` (see ```SAMPLE_IMAGE_PATH``` and ```AVERAGE_IMAGE_PATH``` above).  Each container has a small shell script to do the work for that operation (check out the scripts at ```./sample/sample.sh``` and ```./average/average.sh``` to see how they work).  This notebook defines two kubeflow pipeline operations (```sample_op``` and ```average_op```) that specify how kubeflow interacts with those containers (how to call them, what args to provide, what to do with their outputs, ...).  

Side note: Technically ```sample_op``` and ```average_op``` are factories that return ContainerOp instances.  Kubeflow Pipelines uses those ContainerOp instances to construct its definition of your pipeline, but if none of that makes sense yet, that's OK.

In [4]:
from kfp import dsl

def sample_op(seed):
    """
    Factory for "sample" pipeline operation
    
    Operations created by this factory invoke the SAMPLE step by invoking
    a docker container that includes ./sample/sample.py.  sample.py accepts
    a random seed as argument:
    
        sample.py SEED
    
    The result of this operation will be:
        output_result.txt: A file with either 0 (outside the unit circle) or 4
                           (inside the unit circle)
        output_coordinate.txt: The (x, y) coordinate sampled here
        input_seed.txt: A record of the seed used

    These results are passed back in the .outputs in the ContainerOp result
    
    Args:
        seed (number): Number used as a seed

    Returns:
        ContainerOp
    """
    # Return the ContainerOp that defines our interaction with the container
    op = dsl.ContainerOp(
        name=SAMPLE_PIPELINE_OP_NAME,
        image=SAMPLE_IMAGE_PATH,
        arguments=[seed],
        # Specify where kubeflow will get output from
        file_outputs={"result": "./output_result.txt",
                      "coordinate": "./output_coordinate.txt",
                      "seed": "./input_seed.txt"
                     },
    )
    
    return op

In [5]:
def average_op(numbers: list):
    """
    Factory for "average" pipeline operation
        
    Operations created by this factory invoke the AVERAGE step by invoking
    a docker container that includes ./average/average.py.  average.py accepts
    one or more numbers as command line arguments and computes their average

    The result of this script is:
        out.txt: A file containing the average of the inputs

    This result is passed back as .output in the ContainerOp result
    
    Args:
        numbers (list): List of numeric results from one or more sample steps

    Returns:
        ContainerOp
    """
    if len(numbers) == 0:
        raise ValueError("numbers must be at least of length 1")
    
    # Return the ContainerOp that defines our interaction with the container
    op = dsl.ContainerOp(
        name=AVERAGE_PIPELINE_OP_NAME,
        image=AVERAGE_IMAGE_PATH,
        arguments=numbers,
        # Specify where to get output from
        file_outputs={'result': "./out.txt"},
    )

    return op

## Define the pipeline

We define the pipeline here as a python function wrapped in the @dsl.pipeline decorator.  This function, in our case `compute_pi()`, defines the logic for how all the steps within the pipeline chain together.  `compute_pi` tells kubeflow pipelines to run N **sample** operations in parallel, and run a single **average** operation that consumes output (`sample_op.outputs['result']`) from all the **sample** operations.  

This dependency of **average** on **sample**s is what lets kfp know the order in which to run things.  

In [6]:
######################################
### You can change below this      ###
### Create the pipeline            ###
######################################
@dsl.pipeline(
    name="Estimate Pi",
    description='Estimate Pi using a Map-Reduce pattern'
)
def compute_pi():
    """Compute Pi"""

    # Create the seeds for our random samples
    seeds = range(SAMPLES)
    
    # Create a "sample" operation for each seed passed to the pipeline
    sample_ops = [sample_op(seed) for seed in seeds]

    # Define the average operation which consumes the result from each of the sample_ops
    # Note that the results for each sample_op, read in from their respective 
    # output_result.txt files, are available from the sample_op instances through
    # the .outputs attribute
    _average_op = average_op([s.outputs['result'] for s in sample_ops])

It is important to understand here that while ```compute_pi``` describes the pipeline in Python code, the pipeline's computation is not done when we run the above block.  Calling ```sample_op``` does not perform a **sample** operation; it creates a ContainerOp that tells Kubeflow Pipelines to run a **sample** operation when the pipeline runs.  And when we do something like:
```
    _average_op = average_op([s.outputs['result'] for s in sample_ops])
```

```s.outputs['result']``` is not the actual output of a **sample** operation; it is a placeholder that tells Kubeflow Pipelines "when you get to this part in the pipeline, insert the output that you've previously computed for this **sample** operation here".  This way you can pipe data from one pipeline step to the next without having to actually compute it now.
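
The placeholder idea can be illustrated with a toy example in plain Python. This is not how Kubeflow Pipelines is implemented; `Placeholder`, `Step`, and `run` are hypothetical names invented for this sketch, which only shows the general "describe now, compute later" pattern.

```python
class Placeholder:
    """Stands in for a value that only exists once a step has run."""

    def __init__(self, step_name, output_name):
        self.step_name = step_name
        self.output_name = output_name


class Step:
    """Describes a unit of work without running it."""

    def __init__(self, name, func, inputs):
        self.name = name
        self.func = func
        self.inputs = inputs
        # Available immediately, even though nothing has been computed yet
        self.outputs = {"result": Placeholder(name, "result")}


def run(steps):
    """Execute steps in order, swapping placeholders for real values."""
    computed = {}
    for step in steps:
        args = [
            computed[(arg.step_name, arg.output_name)]
            if isinstance(arg, Placeholder) else arg
            for arg in step.inputs
        ]
        computed[(step.name, "result")] = step.func(*args)
    return computed


# Describe the work: four fake "sample" steps feeding one "average" step
samples = [
    Step(f"sample-{i}", lambda seed: 4 if seed % 2 == 0 else 0, [i])
    for i in range(4)
]
avg = Step(
    "average",
    lambda *values: sum(values) / len(values),
    [s.outputs["result"] for s in samples],
)

# Only now is anything actually computed
results = run(samples + [avg])
```

Building `samples` and `avg` computes nothing; the values appear only when `run` walks the steps, which is roughly the role Kubeflow Pipelines plays when it executes the compiled pipeline.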

Finally, we translate our `compute_pi` function into a zipped YAML definition of the pipeline.  This zip file is how we tell Kubeflow Pipelines exactly what to run for your pipeline.  Download and take a look inside to get a better understanding!

In [7]:
###############################################
### DON'T EDIT:                             ###
### Create the pipeline description for kfp ###
###############################################
from kfp import compiler
experiment_yaml_zip = EXPERIMENT_NAME + '.zip'
compiler.Compiler().compile(
    compute_pi,
    experiment_yaml_zip
)
print(f"Exported pipeline definition to {experiment_yaml_zip}")

Exported pipeline definition to compute-pi.zip
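
If you would rather look inside the zip file without downloading it, Python's standard `zipfile` module can do that from the notebook. The helper names below (`list_archive_members`, `read_archive_member`) are made up for this sketch, and it makes no assumption about what the files inside the archive are called.

```python
import zipfile


def list_archive_members(zip_path):
    """Return the names of the files stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


def read_archive_member(zip_path, member_name):
    """Return the text content of one file stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member_name).decode("utf-8")
```

For example, once the cell above has run, `list_archive_members(experiment_yaml_zip)` shows what the compiler wrote, and `read_archive_member` prints any one of those files.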


**NOTE on reusability and flexibility:** We defined our samples by setting a global variable (`SAMPLES`) and using it in `compute_pi` (the function that defines our pipeline) to make our `seeds`.  This makes our example simple, but has a strong downside: the number and values of our `seeds` are fixed.  If we run the pipeline twice it will give the exact same answer, and if we want to, say, compute Pi with `SAMPLES * 2` seeds, we need to rerun our notebook and create a new YAML.  A more flexible way to do this would be to create the seeds at runtime in the pipeline (have a simple component that takes `SAMPLES` and returns the seeds), but that is beyond the scope of this example.

# Ready to roll! Let's run this pipeline!

We can now create a Kubeflow Pipelines experiment

In [8]:
###################################
### DON'T EDIT:                 ###
### Create the Experiment       ###
###################################
import kfp
client = kfp.Client()
exp = client.create_experiment(name=EXPERIMENT_NAME)

And submit our pipeline to run

In [9]:
###############################################
### DON'T EDIT:                             ###
### Run the pipeline                        ###
###############################################
import time
run = client.run_pipeline(
    exp.id,
    EXPERIMENT_NAME + '-' + time.strftime("%Y%m%d-%H%M%S"),
    EXPERIMENT_NAME + '.zip',
)

# Collect results

To see the pipeline running, click the link above.  To access the returned data from the Average step, click on that step in the pipeline and look in the output artifact as shown below.

![pipeline with results](images/kf-pipeline_with_result.png)

Note that this method of returning a result (accessible in our browser) is likely not that useful for most problems.  See the other demos for saving results to minio or other locations.

# Why use a Map-Reduce Pattern?

The workflow here splits all the sampling up into many small pieces, does them all in isolation, then collects the results back up.  Why did we structure it that way, instead of, say, running a single component `sample-n` that loops to generate n samples in series?  We do that to take advantage of our flexible compute resources.

Consider a case where we have a very expensive *sample* step and each sample takes 10 minutes to compute (instead of <1s here).  If we make a `sample-n` operation, that process looping through just 12 random points would take 2 hours to run!  But because of the nature of our horizontal map process, we can split those N *sample* steps up and run them in parallel.  This way we split N *sample* operations across N compute nodes in parallel, and can generate all N samples in the same time it takes to generate one.  
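
The arithmetic behind that comparison can be sketched as follows. This is a simplified model, assuming every sample takes the same time and ignoring scheduling and container start-up overhead; `wall_clock_minutes` is a hypothetical helper, not part of any library.

```python
import math


def wall_clock_minutes(n_samples, minutes_per_sample, n_workers):
    """Estimate total elapsed time when samples are spread over workers."""
    if n_samples < 0 or minutes_per_sample < 0 or n_workers < 1:
        raise ValueError("need n_samples >= 0, minutes >= 0, n_workers >= 1")
    # Each worker handles one sample at a time, so the work happens in rounds
    rounds = math.ceil(n_samples / n_workers)
    return rounds * minutes_per_sample


serial_minutes = wall_clock_minutes(12, 10, n_workers=1)     # one looping step
parallel_minutes = wall_clock_minutes(12, 10, n_workers=12)  # one node per sample
```

With these numbers the serial case comes to 120 minutes (the 2 hours mentioned above), while the fully parallel case takes about as long as a single sample.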

This sort of pattern lets us horizontally scale our resources and is perfectly suited if you are doing the same sort of thing repetitively on isolated data.  You take advantage of the flexible compute environment to burst up to high usage for a short amount of time, then scale back down automatically.  You could apply the same strategy in other settings too, such as for computing new features from your data. 