**Difficulty: Intermediate**

# Summary:

This example demonstrates:
* Building a pipeline with lightweight components (components defined here in Python code)
* Saving results to minio
* Running parallel processes, where parallelism is defined at runtime

In doing this, we build a **shareable** pipeline: one that you can share with others so they can rerun it on a new problem without needing this notebook.

This example builds on concepts from a few others - see those notebooks for more detail: 
* The problem solved here is from [Compute Pi](../mapreduce-pipeline/Compute-Pi.ipynb) 
* We use lightweight components, which have some important [quirks](../kfp-basics/demo_kfp_lightweight_components.ipynb)

**Note:** Although we demonstrate how to make lightweight components that interact directly with minio, this reduces code reusability and makes things harder to test.  A more reusable/testable version of this is given in [Compute Pi with Reusable Components](Compute-Pi-with-reusable-components-and-minio.ipynb).

In [1]:
from typing import List

import kfp
from kfp import dsl, compiler
from kfp.components import func_to_container_op

# TODO: Move utilities to a central repo
from utilities import get_minio_credentials, copy_to_minio
from utilities import minio_find_files_matching_pattern

# Problem Description

Our task is to compute an estimate of Pi by:
1. picking some random points
1. evaluating whether the points are inside a unit circle
1. aggregating (2) to estimate pi
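
The reasoning behind the estimate: a point drawn uniformly from the square [-1, 1] x [-1, 1] lands inside the unit circle with probability pi/4 (circle area pi over square area 4), so scoring each hit as 4 and each miss as 0 gives samples whose mean estimates pi. A minimal local sketch of the three steps, with no Kubeflow or minio involved (the names `sample_point` and `estimate_pi` are illustrative and are not part of the pipeline):

```python
import random


def sample_point(seed: int) -> int:
    """Score one seeded random point: 4 if inside the unit circle, else 0."""
    rng = random.Random(seed)
    x = rng.random() * 2 - 1  # x in [-1, 1)
    y = rng.random() * 2 - 1  # y in [-1, 1)
    return 4 if x ** 2 + y ** 2 <= 1 else 0


def estimate_pi(seeds) -> float:
    """Average the per-point scores to estimate pi."""
    scores = [sample_point(seed) for seed in seeds]
    return sum(scores) / len(scores)


# Ten seeds give a very rough estimate; more seeds tighten it
print(estimate_pi(range(10, 20)))
```

Because each point is seeded, rerunning with the same seeds reproduces the same estimate.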

Our solution to this task here focuses on:
* making a fully reusable pipeline:
    * The pipeline should be shareable.  You should be able to share it with others by giving them the pipeline.yaml file **without** sharing this notebook
    * All user inputs are adjustable at runtime (no editing the YAML, changing hard-coded settings in the python code, etc.)
* persisting data in minio
* using existing, reusable components where possible
    * Ex: rather than teach our sample function to store results in minio, we use an existing component to store results
    * This helps improve testability and reduces work when building new pipelines

# Pipeline pseudocode

To solve our problem, we need to: 
* Generate N random seeds
    * For each random seed, do a sample step
    * For each sample step, store the result to a location in minio
* Collect all sample results
* Compute pi (by averaging the results)
* Save the final result to minio

In pseudocode our pipeline looks like:

```python
def compute_pi(n_samples: int,
               output_location: str,
               minio_credentials, 
              ):
    seeds = create_seeds(n_samples)

    for seed in seeds:
        result = sample(seed, minio_credentials, output_location)
    
    all_sample_results = collect_all_results(minio_credentials,
                                             output_location
                                            )
    
    final_result = average(all_sample_results)
```

where we've pulled anything the user might want to set at runtime (the number of samples, the location in minio for results to be placed, and their minio credentials) out as pipeline arguments.

Now let's fill in all the function calls with components.

# Define Pipeline Operations as Functions

## create_seeds

In [2]:
def create_seeds_func(n_samples: int) -> list:
    """
    Creates n_samples seeds and returns as a list

    Note: When used as an operation in a KF pipeline, the list is serialized
    to a string.  Can deserialize with strip and split or json package
    This sort of comma separated list will work natively with KF Pipelines'
    parallel for (we can feed this directly into a parallel for loop and it
    breaks into elements for us)

    """
    constant = 10  # just so I know something is happening
    return [constant + i for i in range(n_samples)]

By defining this function in Python first, we can test it here to make sure it works as expected (rigorous testing is omitted here, but recommended for your own tasks).

In [3]:
# Very rigorous testing!
print(create_seeds_func(10))

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
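
The docstring's note about serialization can be checked locally. This is a minimal sketch, assuming the list is serialized as JSON when it is passed between components (the docstring suggests this, but the exact format is decided by KFP, not by this code):

```python
import json


def create_seeds_func(n_samples: int) -> list:
    constant = 10
    return [constant + i for i in range(n_samples)]


# What a downstream component would receive, assuming JSON serialization
serialized = json.dumps(create_seeds_func(3))
print(serialized)  # [10, 11, 12]

# Recover the original list on the receiving side
seeds = json.loads(serialized)
print(seeds[0] + 1)  # 11
```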


And we can then convert our tested function to a task constructor using `func_to_container_op`

In [4]:
# Define the base image our code will run from.
# This is reused in a few components
base_image_python = "python:3.7.6-buster"

In [5]:
create_seeds_op = func_to_container_op(create_seeds_func,
                                       base_image=base_image_python,
                                       )

This task constructor `create_seeds_op` is what actually creates instances of these components in our pipeline.  

## sample

Similar to above, we define a sample function and its corresponding task constructor.  For this, we need several helper functions for minio (kept in `utilities.py`).  These helpers are passed to our pipeline by `func_to_container_op`.

In [6]:
def sample_func(seed: int, minio_url: str, minio_bucket: str,
                minio_access_key: str, minio_secret_key: str,
                minio_output_path: str) -> str:
    """
    Define the "sample" pipeline operation

    Args:
        seed (int): Seed for the sample operation
        minio_url (str): minio endpoint for storage, without "http://", eg:
                         minimal-tenant1-minio.minio:9000
        minio_bucket (str): minio bucket to use within the endpoint, eg:
                            firstname-lastname
        minio_access_key (str): minio access key (from
                                /vault/secrets/minio-minimal-tenant1 on
                                notebook server)
        minio_secret_key (str): minio secret key (from
                                /vault/secrets/minio-minimal-tenant1 on
                                notebook server)
        minio_output_path (str): Path in minio to put output data.  Will place
                                 x.out, y.out, result.out, and seed.out in
                                 ./seed_{seed}/

    Returns:
        (str): Minio path where data is saved (common convention in kfp to
               return this, even if it was specified as an input like
               minio_output_path)
    """
    import json
    from minio import Minio
    import random
    random.seed(seed)

    print("Pick random point")
    # x,y ~ Uniform([-1,1])
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    print(f"Sample selected: ({x}, {y})")

    if (x ** 2 + y ** 2) <= 1:
        print("Random point is inside the unit circle")
        result = 4
    else:
        print("Random point is outside the unit circle")
        result = 0

    to_output = {
        'x': x,
        'y': y,
        'result': result,
        'seed': seed,
    }

    # Store all results to bucket
    # Store each of x, y, result, and seed to a separate file with name
    #   {bucket}/output_path/seed_{seed}/x.out
    #   {bucket}/output_path/seed_{seed}/y.out
    #   ...
    # where each file has just the value of the output.
    #
    # Could also have stored them all together in a single json file
    for varname, value in to_output.items():
        # TODO: Make this really a temp file...
        tempfile = f"{varname}.out"
        with open(tempfile, 'w') as fout:
            fout.write(str(value))

        destination = f"{minio_output_path.rstrip('/')}/seed_{seed}/{tempfile}"

        # Put file in minio
        copy_to_minio(minio_url=minio_url,
                      bucket=minio_bucket,
                      access_key=minio_access_key,
                      secret_key=minio_secret_key,
                      sourcefile=tempfile,
                      destination=destination
                      )

    # Return path containing outputs (common pipeline convention)
    return minio_output_path
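
The object layout, and the `TODO` about temp files, can be explored without touching minio. The sketch below is a variation for illustration, not the component's actual code: `stage_outputs` is a hypothetical helper that writes each value into a real temporary directory and returns the (local file, minio destination) pairs that would then be handed to `copy_to_minio`.

```python
import os
import tempfile


def stage_outputs(to_output: dict, minio_output_path: str, seed: int,
                  workdir: str) -> list:
    """Write each value to workdir and pair it with its minio destination."""
    pairs = []
    for varname, value in to_output.items():
        filename = f"{varname}.out"
        local_path = os.path.join(workdir, filename)
        with open(local_path, "w") as fout:
            fout.write(str(value))
        # rstrip('/') avoids a double slash when the caller passes "path/"
        destination = f"{minio_output_path.rstrip('/')}/seed_{seed}/{filename}"
        pairs.append((local_path, destination))
    return pairs


# The temporary directory and its files are removed when the block exits
with tempfile.TemporaryDirectory() as workdir:
    pairs = stage_outputs({"x": 0.25, "result": 4}, "test_functions/", 5, workdir)
    destinations = [dest for _, dest in pairs]
    with open(pairs[0][0]) as fin:
        x_contents = fin.read()

print(destinations)
```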

In [7]:
# (insert your testing here)

# # Example:
# # NOTE: These tests actually write to minio!
# minio_settings = get_minio_credentials("minimal")
# minio_settings['bucket'] = 'andrew-scribner'
# sample = sample_func(5,
#                      minio_url=minio_settings['url'],
#                      minio_bucket=minio_settings['bucket'],
#                      minio_access_key=minio_settings['access_key'],
#                      minio_secret_key=minio_settings['secret_key'],
#                      minio_output_path='test_functions'
#                      )
# # Check the bucket/output_path to see if things wrote correctly

We set `modules_to_capture=['utilities']` and `use_code_pickling=True` because this will pass our helpers to our pipeline.  

In [8]:
sample_op = func_to_container_op(sample_func,
                                 base_image=base_image_python,
                                 use_code_pickling=True,  # Required because of helper functions
                                 modules_to_capture=['utilities'],  # Required because of helper functions
                                 packages_to_install=['minio'],
                                 )

## collect_results

To collect results from our sample operations, we glob from minio and output result data as a JSON list.

Again, we rely on a helper from `utilities.py`, which would be better housed in a shared repo.

In [9]:
def collect_results_as_list(search_location: str, search_pattern: str,
                            minio_url: str, minio_bucket: str,
                            minio_access_key: str, minio_secret_key: str,
                            ) -> List[float]:
    """
    Concatenates all files in minio that match a pattern
    """
    from minio import Minio
    import json

    obj_names = minio_find_files_matching_pattern(
        minio_url=minio_url,
        bucket=minio_bucket,
        access_key=minio_access_key,
        secret_key=minio_secret_key,
        pattern=search_pattern,
        prefix=search_location)

    s3 = Minio(endpoint=minio_url,
               access_key=minio_access_key,
               secret_key=minio_secret_key,
               secure=False,
               region="us-west-1",
               )

    # TODO: Use actual temp files
    to_return = [None] * len(obj_names)
    for i, obj_name in enumerate(obj_names):
        tempfile = f"./unique_temp_{i}"
        s3.fget_object(minio_bucket,
                       object_name=obj_name,
                       file_path=tempfile
                       )
        with open(tempfile, 'r') as fin:
            to_return[i] = float(fin.read())

    print(f"Returning {to_return}")
    return to_return

In [10]:
# (insert your testing here)

# # Example:
# # This only works if you make a directory with some "./something/result.out"
# # files in it
# pattern = r".*/result.out$"
# collect_results_as_list(search_location='map-reduce-output/seeds/',
#                         search_pattern=pattern,
#                         minio_url=minio_settings['url'],
#                         minio_bucket=minio_settings['bucket'],
#                         minio_access_key=minio_settings['access_key'],
#                         minio_secret_key=minio_settings['secret_key'],
#                         )
# # (you should see all the result.out files in the bucket/location you're pointed to)

In [11]:
collect_results_op = func_to_container_op(collect_results_as_list,
                                          base_image=base_image_python,
                                          use_code_pickling=True,  # Required because of helper functions
                                          modules_to_capture=['utilities'],  # Required because of helper functions
                                          packages_to_install=["minio"],
                                          )

## average

Average takes a JSON list of numbers and returns their mean as a float

In [12]:
def average_func(numbers) -> float:
    """
    Computes the average value of a JSON list of numbers, returned as a float
    """
    import json
    print(numbers)
    print(type(numbers))
    numbers = json.loads(numbers)
    return sum(numbers) / len(numbers)
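
Note that `numbers` has no type annotation, and the function parses it with `json.loads`, so it expects to receive the list as a JSON string. A quick local check of that contract (a sketch; `average_json` is an illustrative copy of the function above without the debug prints):

```python
import json


def average_json(numbers: str) -> float:
    """Parse a JSON list of numbers and return their mean."""
    values = json.loads(numbers)
    return sum(values) / len(values)


# Three points inside the circle (4) and one outside (0)
print(average_json("[4, 0, 4, 4]"))  # 3.0
```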

In [13]:
average_op = func_to_container_op(average_func,
                                  base_image=base_image_python,
                                  )

# Define and Compile Pipeline

With our component constructors defined, we build our full pipeline.  Remember that while we use a Python function to define our pipeline here, anything that depends on a KFP-specific entity (an input argument, a component result, etc.) is computed at runtime in Kubernetes.  This means we can't do things like:
```
for seed in seeds:
    sample_op = sample_op(seed)
```
because Python would try to interpret seeds, which is a *placeholder* object for a future value, as an iterable.
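
To make the placeholder idea concrete, here is a toy stand-in. `FakeTaskOutput` is hypothetical and is not KFP's real class; it only mimics the relevant behavior, namely that a value known only at runtime has no elements to loop over while the pipeline function is being executed.

```python
class FakeTaskOutput:
    """Toy stand-in for a value that only exists once the pipeline runs."""

    def __init__(self, task_name: str):
        self.task_name = task_name

    def __str__(self) -> str:
        # Before the run, all we can emit is a reference to be filled in later
        return f"{{{{tasks.{self.task_name}.outputs.output}}}}"


seeds = FakeTaskOutput("create-seeds")
print(str(seeds))  # {{tasks.create-seeds.outputs.output}}

try:
    for seed in seeds:  # no __iter__: Python cannot loop over the placeholder
        pass
except TypeError as err:
    print(f"Cannot iterate over a placeholder: {err}")
```

This is why the pipeline uses `kfp.dsl.ParallelFor`, which defers the loop to runtime.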

In [14]:
@dsl.pipeline(
    name="Estimate Pi w/Minio",
    description="Extension of the Map-Reduce example using dynamic number of samples and Minio for storage"
)
def compute_pi(n_samples: int, output_location: str, minio_bucket: str,
               minio_url: str, minio_access_key: str, minio_secret_key: str):
    seeds = create_seeds_op(n_samples)

    # We add the KFP RUN_ID here in the output location so that we don't
    # accidentally overwrite another run.  There's lots of ways to manage
    # data, this is just one possibility.
    # Ensure you avoid double "/"s in the path - minio does not like this
    this_run_output_location = f"{str(output_location).rstrip('/')}" \
                               f"/{kfp.dsl.RUN_ID_PLACEHOLDER}"

    sample_output_location = f"{this_run_output_location}/seeds"

    sample_ops = []
    with kfp.dsl.ParallelFor(seeds.output) as seed:
        sample_op_ = sample_op(seed, minio_url, minio_bucket, minio_access_key,
                               minio_secret_key, sample_output_location)
        # Make a list of sample_ops so we can do result collection after they finish
        sample_ops.append(sample_op_)

        # NOTE: A current limitation of the ParallelFor loop in KFP is that it
        # does not give us an easy way to collect the results afterwards.  To
        # get around this problem, we store results in a known place in minio
        # and later glob the result files back out

    # Find result files that exist in the seed output location
    # Note that a file in the bucket root does not have a preceding slash, so
    # to handle the (unlikely) event we've put all results in the bucket root,
    # check for either ^result.out (eg, entire string is just the result.out)
    # or /result.out.  This is to avoid matching something like
    # '/path/i_am_not_a_result.out'
    search_pattern = r'.*(^|/)result.out'

    # Collect all result.out files in the sample_output_location and read them
    # into a list
    collect_results_op_ = collect_results_op(
        search_location=sample_output_location,
        search_pattern=search_pattern,
        minio_url=minio_url,
        minio_bucket=minio_bucket,
        minio_access_key=minio_access_key,
        minio_secret_key=minio_secret_key,
    )

    # collect_results requires all sample_ops to be done before running (all
    # results must be generated first).  Enforce this by setting
    # collect_results_op_ to run .after() all sample_op tasks
    for s in sample_ops:
        collect_results_op_.after(s)

    average_op(collect_results_op_.output)
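
The search pattern can be sanity-checked locally. The sketch below assumes the helper applies the pattern with `re.match`; this is not confirmed here, since `minio_find_files_matching_pattern` lives in `utilities.py`. It also shows a hypothetical stricter variant, `strict_pattern`, that escapes the dot and anchors the end of the name.

```python
import re

# Pattern used in the pipeline above
search_pattern = r'.*(^|/)result.out'
# Hypothetical stricter variant: literal dot, and nothing after "result.out"
strict_pattern = r'(^|.*/)result\.out$'

names = [
    "result.out",
    "out/seeds/seed_10/result.out",
    "path/i_am_not_a_result.out",
    "out/seeds/seed_10/result.out.bak",
]

for name in names:
    loose = bool(re.match(search_pattern, name))
    strict = bool(re.match(strict_pattern, name))
    print(f"{name}: pipeline pattern={loose}, stricter variant={strict}")
```

Under this assumption both patterns reject `i_am_not_a_result.out`, but only the stricter variant rejects a name with a trailing suffix such as `result.out.bak`.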

Compile our pipeline into a reusable pipeline definition (a zipped YAML file).

In [15]:
experiment_name = "compute-pi-with-lightweight"
experiment_yaml_zip = experiment_name + '.zip'
compiler.Compiler().compile(
    compute_pi,
    experiment_yaml_zip
)
print(f"Exported pipeline definition to {experiment_yaml_zip}")

Exported pipeline definition to compute-pi-with-lightweight.zip


# Run

Use our pipeline definition above to run the task.  Note that anything below here can be done **without** the above code: all we need is the compiled pipeline file from the last step.  We can even do this from the Kubeflow Pipelines UI or from a terminal.

## User settings
Update the next block to match your own setup.  `bucket` will be your namespace (likely your firstname-lastname), and `output_location` is where inside the bucket you want to put your results.

In [16]:
# Python Minio SDK expects bucket and output_location to be separate
bucket = "andrew-scribner"
output_location = "map-reduce-output-lw"
n_samples = 10
minio_tenant = "minimal"  # probably can leave this as is

## Other settings
(leave this as is)

In [17]:
# Get minio credentials using a helper
minio_settings = get_minio_credentials(minio_tenant)
minio_url = minio_settings["url"]
minio_access_key = minio_settings["access_key"]
minio_secret_key = minio_settings["secret_key"]

Trying to access minio credentials from:
/vault/secrets/minio-minimal-tenant1


In [18]:
client = kfp.Client()
result = client.create_run_from_pipeline_func(
    compute_pi,
    arguments={"n_samples": n_samples,
               "output_location": output_location,
               "minio_bucket": bucket,
               "minio_url": minio_url,
               "minio_access_key": minio_access_key,
               "minio_secret_key": minio_secret_key,
               },
    )

(Optional)

Wait for the run to complete, then print that it is done

In [19]:
wait_result = result.wait_for_run_completion(timeout=300)

In [20]:
print(f"Run {wait_result.run.id}\n\tstarted at \t{wait_result.run.created_at}\n\tfinished at \t{wait_result.run.finished_at}\n\twith status {wait_result.run.status}")

Run bf7c90c1-4b21-4cc0-8fce-ac8816c4260c
	started at 	2020-06-26 17:54:20+00:00
	finished at 	2020-06-26 17:55:58+00:00
	with status Succeeded
