# KFP lightweight function components v2







A Kubeflow Pipelines component is a self-contained set of code that performs one step in your ML workflow. A pipeline component is composed of:

* The component code, which implements the logic needed to perform a step in your ML workflow.
* A component specification, which defines the following:
  * The component’s metadata: its name and description.
  * The component’s interface: the component’s inputs and outputs.
  * The component’s implementation: the Docker container image to run, how to pass inputs to your component code, and how to get the component’s outputs.
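
To make the three parts of a specification concrete, here is a rough sketch written as a plain Python dictionary. The overall layout follows the general shape of a KFP component specification, but the component name, image, command, and the `describe` helper are placeholders chosen for illustration; this is not output generated by the SDK.

```python
# Illustrative only: a hand-written stand-in for a component specification.
# The name, image, and command below are hypothetical placeholders.
component_spec = {
    # Metadata: name and description.
    'name': 'Preprocess',
    'description': 'Dummy preprocessing step.',
    # Interface: inputs and outputs.
    'inputs': [{'name': 'message', 'type': 'String'}],
    'outputs': [{'name': 'output_dataset', 'type': 'Dataset'}],
    # Implementation: container image, plus how inputs and outputs are passed.
    'implementation': {
        'container': {
            'image': 'python:3.9',
            'command': ['python3', 'preprocess.py'],
            'args': [
                '--message', {'inputValue': 'message'},
                '--output-dataset', {'outputPath': 'output_dataset'},
            ],
        },
    },
}


def describe(spec):
    """Summarize the interface of a spec as 'Name(inputs) -> outputs'."""
    ins = ', '.join(i['name'] for i in spec['inputs'])
    outs = ', '.join(o['name'] for o in spec['outputs'])
    return '{}({}) -> {}'.format(spec['name'], ins, outs)


print(describe(component_spec))
```

With function-based components you do not write this structure by hand; the SDK derives it from the function signature.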

“Lightweight” Python function-based components make it easier to iterate quickly by letting you build your component code as a Python function and generating the component specification for you. This notebook shows how to create Python function-based components for use in Managed Pipelines.

> **Note**: *Currently, these examples only work on Managed Pipelines. Support for OSS KFP will be added soon through the v2-compatible execution mode.*

## Understanding how data is passed between components

When Kubeflow Pipelines runs your component, a container image is started in a Kubernetes Pod and your component’s inputs are passed in as command-line arguments. When your component has finished, the component’s outputs are returned as files.
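
The calling convention described above can be imitated in plain Python. The sketch below is not KFP code: `run_step` and the flag names `--message` and `--output-path` are hypothetical, and it only shows the idea of reading inputs from command-line arguments and handing the result back as a file.

```python
import argparse
import os
import tempfile


def run_step(argv):
    """Read inputs from command-line style arguments and write the output to a file."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--message', type=str, required=True)
    parser.add_argument('--output-path', type=str, required=True)
    args = parser.parse_args(argv)

    # The output location may not exist yet, so create its parent directory.
    os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
    with open(args.output_path, 'w') as f:
        f.write(args.message.upper())
    return args.output_path


# Simulate what a launcher would do: pass arguments, then read the output file.
out_dir = tempfile.mkdtemp()
out_file = run_step([
    '--message', 'hello',
    '--output-path', os.path.join(out_dir, 'outputs', 'result'),
])
with open(out_file) as f:
    result = f.read()
print(result)
```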

Python function-based components use the Kubeflow Pipelines SDK to handle the complexity of passing inputs into your component and passing your function’s outputs back to your pipeline.

There are two categories of inputs/outputs supported in Python function-based components: *artifacts* and *parameters*.

* Parameters are passed to your component by value and typically contain `int`, `float`, `bool`, or small `str` values.
* Artifacts are passed to your component as a *reference* to a file. In addition to the artifact’s data, you can also read and write the artifact’s metadata. This lets you record arbitrary key-value pairs for an artifact such as the accuracy of a trained model, and use metadata in downstream components – for example, you could use metadata to decide if a model is accurate enough to deploy for predictions.
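
As a rough mental model of the difference, the sketch below uses a hypothetical `ArtifactRef` class (not part of the KFP SDK) to stand in for an artifact: a file reference plus a metadata dictionary. The `fake_train` and `should_deploy` functions and the 0.8 threshold are likewise assumptions made for illustration.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class ArtifactRef:
    """Hypothetical stand-in for an artifact: a file reference plus metadata."""
    path: str
    metadata: dict = field(default_factory=dict)


def fake_train(num_steps: int, model: ArtifactRef) -> None:
    # num_steps is a parameter: it arrives by value.
    # model is an artifact: the function writes to the file it refers to.
    with open(model.path, 'w') as f:
        f.write('model trained for {} steps'.format(num_steps))
    # Record a key-value pair alongside the artifact's data.
    model.metadata['accuracy'] = 0.9


def should_deploy(model: ArtifactRef, threshold: float = 0.8) -> bool:
    # A downstream step can base a decision on the recorded metadata.
    return model.metadata.get('accuracy', 0.0) >= threshold


model = ArtifactRef(path=os.path.join(tempfile.mkdtemp(), 'model.txt'))
fake_train(num_steps=3, model=model)
print(should_deploy(model))
```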

## Setup

Before you run this notebook, ensure that your Google Cloud user account and project have been granted access to the Managed Pipelines Experimental. To request access, fill out this [form](http://go/cloud-mlpipelines-signup) and let your account representative know that you have requested access.

This notebook is intended to be run on one of the following:
* [AI Platform Notebooks](https://cloud.google.com/ai-platform-notebooks). See the "AI Platform Notebooks" section in the Experimental [User Guide](https://docs.google.com/document/d/1JXtowHwppgyghnj1N1CT73hwD1caKtWkLcm2_0qGBoI/edit?usp=sharing) for more detail on creating a notebook server instance.
* [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb)


**To run this notebook on AI Platform Notebooks**, click on the **File** menu, then select "Download .ipynb".  Then, upload that notebook from your local machine to AI Platform Notebooks. (In the AI Platform Notebooks left panel, look for an icon of an arrow pointing up, to upload).

We'll first install some libraries and set up some variables.


Set `gcloud` to use your project.  **Edit the following cell before running it**.

In [None]:
PROJECT_ID = 'your-project-id'  # <---CHANGE THIS

In [None]:
!gcloud config set project {PROJECT_ID}

If you're running this notebook on Colab, authenticate with your user account:

In [None]:
import sys
if 'google.colab' in sys.modules:
  from google.colab import auth
  auth.authenticate_user()

-----------------

**If you're on AI Platform Notebooks**, authenticate with Google Cloud before running the next section, by running
```sh
gcloud auth login
```
**in the Terminal window** (which you can open via **File** > **New** in the menu). You only need to do this once per notebook instance.

### Install the KFP SDK and AI Platform Pipelines client library

For Managed Pipelines Experimental, you'll need to download special versions of the KFP SDK and the AI Platform client library.

In [None]:
!gsutil cp gs://cloud-aiplatform-pipelines/releases/latest/kfp-1.5.0rc5.tar.gz .
!gsutil cp gs://cloud-aiplatform-pipelines/releases/latest/aiplatform_pipelines_client-0.1.0.caip20210415-py3-none-any.whl .


Then, install the libraries and restart the kernel as necessary.

In [None]:
if 'google.colab' in sys.modules:
  USER_FLAG = ''
else:
  USER_FLAG = '--user'

In [None]:
!python3 -m pip install {USER_FLAG} kfp-1.5.0rc5.tar.gz --upgrade
!python3 -m pip install {USER_FLAG} aiplatform_pipelines_client-0.1.0.caip20210415-py3-none-any.whl  --upgrade

In [None]:
if 'google.colab' not in sys.modules:
  # Automatically restart kernel after installs
  import IPython
  app = IPython.Application.instance()
  app.kernel.do_shutdown(True)

The KFP version should be >= 1.5.



In [None]:
# Check the KFP version
!python3 -c "import kfp; print('KFP version: {}'.format(kfp.__version__))"
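
If you prefer to check the minimum version in code rather than by eye, one possible sketch follows. The helpers `version_tuple` and `meets_minimum` are hypothetical, and the comparison looks only at the major and minor numbers, so a release candidate such as `1.5.0rc5` counts as 1.5.

```python
import re


def version_tuple(version):
    """Return the leading (major, minor) numbers of a version string."""
    match = re.match(r'(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError('Unrecognized version string: {!r}'.format(version))
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=(1, 5)):
    """Compare major/minor numerically, so that 1.10 is newer than 1.5."""
    return version_tuple(version) >= minimum


print(meets_minimum('1.5.0rc5'))
```

In the notebook you would pass `kfp.__version__` instead of the literal string.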

### Set some variables and do some imports

**Before you run the next cell**, **edit it** to set variables for your project.  See the "Before you begin" section of the User Guide for information on creating your API key.  For `BUCKET_NAME`, enter the name of a Cloud Storage (GCS) bucket in your project.  Don't include the `gs://` prefix.

In [None]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

# Required Parameters
USER = 'YOUR_USER_NAME' # <---CHANGE THIS
BUCKET_NAME = 'YOUR_BUCKET_NAME'  # <---CHANGE THIS
PIPELINE_ROOT = 'gs://{}/pipeline_root/{}'.format(BUCKET_NAME, USER)

PROJECT_ID = 'YOUR_PROJECT_ID'  # <---CHANGE THIS
REGION = 'us-central1'
API_KEY = 'YOUR_API_KEY'  # <---CHANGE THIS

print('PIPELINE_ROOT: {}'.format(PIPELINE_ROOT))
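
Because `BUCKET_NAME` must not include the `gs://` prefix, a small guard can catch the most common mistake. This is an optional sketch; `make_pipeline_root` is a hypothetical helper, not part of the SDK, and it builds the same URI layout as the cell above.

```python
def make_pipeline_root(bucket_name, user):
    """Build a pipeline root URI, tolerating an accidental gs:// prefix."""
    if bucket_name.startswith('gs://'):
        bucket_name = bucket_name[len('gs://'):]
    bucket_name = bucket_name.strip('/')
    if not bucket_name:
        raise ValueError('bucket_name must not be empty')
    return 'gs://{}/pipeline_root/{}'.format(bucket_name, user)


print(make_pipeline_root('gs://my-bucket/', 'me'))
```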

Import what's needed for building lightweight components.

In [None]:
from typing import NamedTuple
from kfp.v2 import dsl
from kfp.v2.dsl import (
    component,
    InputPath,
    OutputPath,
    InputArtifact,
    OutputArtifact,
    Artifact,
    Dataset,
    Model,
    ClassificationMetrics,
    Metrics,
)

## Define some components

We'll first define some dummy function-based components that consume parameters and produce (typed) Artifacts and parameters. Functions can produce Artifacts in three ways:
* accept an output local path using `OutputPath` 
* accept an `OutputArtifact` which will give the function a handle to the output artifact's metadata
* return an `Artifact` (or `Dataset`, `Model`, `Metrics`, etc) in a `NamedTuple` 

We'll show examples of these below.

The first component definition, a dummy `preprocess`, shows a component that outputs two `Dataset` Artifacts, as well as an output parameter.  (For this example, the datasets don't reflect real data).

The parameter output here is written through the `OutputPath` type, an approach typically used for "larger" data.
For "small" data, such as a short string, it may be more convenient to return the value in a `NamedTuple`, as shown in the second component.


In [None]:
@component
def preprocess(
    # An input parameter of type string.
    message: str,
    # Use OutputArtifact to get a metadata-rich handle to the output artifact
    # of type `Dataset`.
    output_dataset_one: OutputArtifact(Dataset),
    # A locally accessible filepath for another output artifact of type
    # `Dataset`.
    output_dataset_two_path: OutputPath('Dataset'),
    # A locally accessible filepath for an output parameter of type string.
    output_parameter_path: OutputPath(str)):
  '''Dummy preprocessing step.

  Writes the passed-in message to both output Datasets and to the output parameter.
  '''
  output_dataset_one.get().metadata['hello'] = 'there'
  # Use OutputArtifact.path to access a local file path for writing.
  # One can also use OutputArtifact.uri to access the actual URI file path.
  with open(output_dataset_one.path, 'w') as f:
    f.write(message)

  # OutputPath is used to just pass the local file path of the output artifact
  # to the function.
  with open(output_dataset_two_path, 'w') as f:
    f.write(message)

  with open(output_parameter_path, 'w') as f:
    f.write(message)


The second component definition, a dummy `train`, takes as inputs both an `InputPath` of type `Dataset` and an `InputArtifact` of type `Dataset`, along with other parameter inputs.

Note that this component also writes some metrics metadata to the `model` output Artifact.  This information will be displayed in the Console UI when the pipeline runs.


In [None]:
@component(
    base_image='python:3.9', # Use a different base image.
)
def train(
    # An input parameter of type string.
    message: str,
    # Use InputPath to get a locally accessible path for the input artifact
    # of type `Dataset`.
    dataset_one_path: InputPath('Dataset'),
    # Use InputArtifact to get a metadata-rich handle to the input artifact
    # of type `Dataset`.
    dataset_two: InputArtifact(Dataset),
    # Output artifact of type Model.
    model: OutputArtifact(Model),
    # An input parameter of type int with a default value.
    num_steps: int = 3,
    # Use NamedTuple to return either artifacts or parameters.
    # When returning artifacts like this, return the contents of
    # the artifact. The assumption here is that this return value
    # fits in memory.
    ) -> NamedTuple('Outputs', [
        ('output_message', str),  # Return parameter.
        ('generic_artifact', Artifact),  # Return generic Artifact.
    ]):        
  '''Dummy Training step.

  Combines the contents of dataset_one and dataset_two into the
  output Model.
  Constructs a new output_message consisting of message repeated num_steps times.
  '''

  # Directly access the passed in GCS URI as a local file (uses GCSFuse).
  with open(dataset_one_path, 'r') as input_file:
    dataset_one_contents = input_file.read()

  # dataset_two is an Artifact handle. Use dataset_two.path to get a
  # local file path (uses GCSFuse).
  # Alternately, use dataset_two.uri to access the GCS URI directly.
  with open(dataset_two.path, 'r') as input_file:
    dataset_two_contents = input_file.read()

  with open(model.path, 'w') as f:
    f.write('My Model')

  # Use model.get() to get a Model artifact, which has a .metadata dictionary
  # to store arbitrary metadata for the output artifact. This metadata will be
  # recorded in Managed Metadata and can be queried later. It will also show up
  # in the UI (might be currently broken).
  model.get().metadata['accuracy'] = 0.9
  model.get().metadata['framework'] = 'Tensorflow'
  model.get().metadata['time_to_train_in_seconds'] = 257

  artifact_contents = "{}\n{}".format(dataset_one_contents, dataset_two_contents)
  output_message = ' '.join([message for _ in range(num_steps)])
  return (output_message, artifact_contents)
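
The return logic of `train` can be tried out without the SDK. The sketch below reproduces only the `NamedTuple` part in plain Python; `build_outputs` is a hypothetical helper, and the `generic_artifact` field is typed as `str` here because outside a pipeline it is just the artifact's contents.

```python
from typing import NamedTuple

Outputs = NamedTuple('Outputs', [
    ('output_message', str),
    ('generic_artifact', str),  # Typed as Artifact in the component itself.
])


def build_outputs(message, dataset_one_contents, dataset_two_contents,
                  num_steps=3) -> Outputs:
    """Mirror the last three lines of the train component."""
    artifact_contents = '{}\n{}'.format(dataset_one_contents, dataset_two_contents)
    output_message = ' '.join([message] * num_steps)
    return Outputs(output_message, artifact_contents)


outputs = build_outputs('hi', 'one', 'two', num_steps=2)
print(outputs.output_message)
print(outputs.generic_artifact)
```

Fields can be read by name, which is how the declared output names line up with the returned values.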

  

## Define a pipeline that uses your components

Next, we'll define a pipeline that uses the two components we just built.

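Note that the dummy "train" step takes as inputs the three outputs of the "preprocess" step. In the "train" inputs we refer to `output_parameter`, which gives us the output string directly. The input and output names used in the pipeline omit the `_path` suffix of the corresponding function parameters.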


In [None]:
@dsl.pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root='gs://ml-pipeline-artifacts/v2-artifacts',
    # A name for the pipeline. Used to determine the pipeline Context.
    name='metadata-pipeline-v2')
def pipeline(message: str):
  preprocess_task = preprocess(message=message)
  train_task = train(
    dataset_one=preprocess_task.outputs['output_dataset_one'],
    dataset_two=preprocess_task.outputs['output_dataset_two'],
    message=preprocess_task.outputs['output_parameter'],
    num_steps=5)
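
The names used in the pipeline differ from some of the function parameter names: parameters annotated with `InputPath` or `OutputPath` lose their `_path` suffix, so `output_dataset_two_path` is referenced as `output_dataset_two`. A minimal sketch of that naming rule follows; the helper `io_name` is hypothetical and simplified for illustration (it strips the suffix from any name, whereas only path-annotated parameters are affected), and it is not the SDK's own code.

```python
def io_name(parameter_name):
    """Hypothetical helper: derive an input/output name from a parameter name."""
    suffix = '_path'
    if parameter_name.endswith(suffix):
        return parameter_name[:-len(suffix)]
    return parameter_name


for name in ['output_dataset_one', 'output_dataset_two_path',
             'output_parameter_path', 'dataset_one_path']:
    print('{} -> {}'.format(name, io_name(name)))
```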
    

## Compile and run the pipeline

Now we're ready to compile:

In [None]:
from kfp.v2 import compiler
from aiplatform.pipelines import client  

compiler.Compiler().compile(pipeline_func=pipeline,
                            package_path='metadata_pipeline.json')

...and then run the pipeline:

In [None]:
api_client = client.Client(
  project_id=PROJECT_ID,
  region=REGION,  # Defined earlier as 'us-central1'.
  api_key=API_KEY)

response = api_client.create_run_from_job_spec(
  job_spec_path='metadata_pipeline.json',
  pipeline_root=PIPELINE_ROOT,  # Override if needed.
  parameter_values={'message': "Hello, World"})

You can click on the generated link above to go to the pipeline run in the Cloud Console.

If you click on the Model artifact generated by the second step, you can see that the metrics info written by the second component is displayed in the sidebar.

<a href="https://storage.googleapis.com/amy-jo/images/mp/md_metrics.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/mp/md_metrics.png" /></a>


-----------------------------
Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.