# Managed Pipelines Experimental: Metrics visualization with the KFP SDK

This notebook shows how to visualize ROC curves and confusion matrices on [Managed Pipelines Experimental](https://docs.google.com/document/d/1JXtowHwppgyghnj1N1CT73hwD1caKtWkLcm2_0qGBoI/edit?ts=5f90dcea#heading=h.p4rp2vtz67w2), using the [Kubeflow Pipelines (KFP) SDK](https://www.kubeflow.org/docs/pipelines/).

In the process, it shows how to construct *function-based components* (pipeline components defined from Python function definitions), how to specify a pipeline using those components, and how to launch a pipeline run from the notebook.


## Setup

Before you run this notebook, ensure that your Google Cloud user account and project have been granted access to Managed Pipelines Experimental. To request access, fill out this [form](http://go/cloud-mlpipelines-signup) and let your account representative know you have requested access.

This notebook is intended to be run on either of the following:
* [AI Platform Notebooks](https://cloud.google.com/ai-platform-notebooks). See the "AI Platform Notebooks" section in the Experimental [User Guide](https://docs.google.com/document/d/1JXtowHwppgyghnj1N1CT73hwD1caKtWkLcm2_0qGBoI/edit?usp=sharing) for more detail on creating a notebook server instance.
* [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb)


**To run this notebook on AI Platform Notebooks**, click on the **File** menu, then select "Download .ipynb".  Then, upload that notebook from your local machine to AI Platform Notebooks. (In the AI Platform Notebooks left panel, look for an icon of an arrow pointing up, to upload).

We'll first install some libraries and set up some variables.


Set `gcloud` to use your project.  **Edit the following cell before running it**.

In [None]:
PROJECT_ID = 'your-project-id'  # <---CHANGE THIS

In [None]:
!gcloud config set project {PROJECT_ID}

If you're running this notebook on Colab, authenticate with your user account:

In [None]:
import sys
if 'google.colab' in sys.modules:
  from google.colab import auth
  auth.authenticate_user()

-----------------

**If you're on AI Platform Notebooks**, authenticate with Google Cloud before running the next section, by running
```sh
gcloud auth login
```
**in the Terminal window** (which you can open via **File** > **New** in the menu). You only need to do this once per notebook instance.

### Install the KFP SDK and AI Platform Pipelines client library

For Managed Pipelines Experimental, you'll need to download special versions of the KFP SDK and the AI Platform client library.

In [None]:
!gsutil cp gs://cloud-aiplatform-pipelines/releases/latest/kfp-1.5.0rc5.tar.gz .
!gsutil cp gs://cloud-aiplatform-pipelines/releases/latest/aiplatform_pipelines_client-0.1.0.caip20210415-py3-none-any.whl .


Then, install the libraries and restart the kernel as necessary.

In [None]:
if 'google.colab' in sys.modules:
  USER_FLAG = ''
else:
  USER_FLAG = '--user'

In [None]:
!python3 -m pip install {USER_FLAG} kfp-1.5.0rc5.tar.gz --upgrade
!python3 -m pip install {USER_FLAG} aiplatform_pipelines_client-0.1.0.caip20210415-py3-none-any.whl  --upgrade

In [None]:
if 'google.colab' not in sys.modules:
  # Automatically restart kernel after installs
  import IPython
  app = IPython.Application.instance()
  app.kernel.do_shutdown(True)

The KFP version should be >= 1.5.



In [None]:
# Check the KFP version
!python3 -c "import kfp; print('KFP version: {}'.format(kfp.__version__))"
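
The same check can be made programmatically. The following is a minimal sketch, assuming that only the leading `major.minor` part of the version string matters; the helper name `meets_minimum` is hypothetical and is not part of the KFP SDK.

```python
import re


def meets_minimum(version: str, minimum=(1, 5)) -> bool:
    """Return True if the leading major.minor of `version` is >= `minimum`.

    Hypothetical helper: it ignores patch numbers and pre-release
    suffixes such as 'rc5'.
    """
    match = re.match(r'(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError('Unrecognized version string: {!r}'.format(version))
    major_minor = (int(match.group(1)), int(match.group(2)))
    return major_minor >= tuple(minimum)


# In the notebook you could pass kfp.__version__ instead of a literal.
print(meets_minimum('1.5.0rc5'))
print(meets_minimum('1.4.1'))
```

Comparing integer tuples rather than raw strings avoids ranking `'1.10'` below `'1.5'`.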

### Set some variables and do some imports

**Before you run the next cell**, **edit it** to set variables for your project.  See the "Before you begin" section of the User Guide for information on creating your API key.  For `BUCKET_NAME`, enter the name of a Cloud Storage (GCS) bucket in your project.  Don't include the `gs://` prefix.

In [None]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

# Required Parameters
USER = 'YOUR_USER_NAME' # <---CHANGE THIS
BUCKET_NAME = 'YOUR_BUCKET_NAME'  # <---CHANGE THIS
PIPELINE_ROOT = 'gs://{}/pipeline_root/{}'.format(BUCKET_NAME, USER)

PROJECT_ID = 'YOUR_PROJECT_ID'  # <---CHANGE THIS
REGION = 'us-central1'
API_KEY = 'YOUR_API_KEY'  # <---CHANGE THIS

print('PIPELINE_ROOT: {}'.format(PIPELINE_ROOT))
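
Because `BUCKET_NAME` must not include the `gs://` prefix, a small guard can catch a pasted full URI before it produces a malformed `PIPELINE_ROOT`. This is only a sketch: the function name `build_pipeline_root` is hypothetical, and the path layout simply mirrors the cell above.

```python
def build_pipeline_root(bucket_name: str, user: str) -> str:
    """Build 'gs://<bucket>/pipeline_root/<user>' from a bare bucket name.

    Hypothetical helper: tolerates an accidental 'gs://' prefix or a
    trailing slash, and rejects values that are not a single bucket name.
    """
    bucket = bucket_name.strip()
    if bucket.startswith('gs://'):
        bucket = bucket[len('gs://'):]
    bucket = bucket.strip('/')
    if not bucket or '/' in bucket:
        raise ValueError('Expected a bare bucket name, got {!r}'.format(bucket_name))
    return 'gs://{}/pipeline_root/{}'.format(bucket, user)


print(build_pipeline_root('my-bucket', 'alice'))
print(build_pipeline_root('gs://my-bucket/', 'alice'))
```

Both calls print the same path, `gs://my-bucket/pipeline_root/alice`.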

Import what's needed for building lightweight Python-function-based components.

In [None]:
from typing import NamedTuple
from kfp.v2 import dsl
from kfp.v2.dsl import (
    component,
    InputPath,
    OutputPath,
    InputArtifact,
    OutputArtifact,
    Artifact,
    Dataset,
    Model,
    ClassificationMetrics,
    Metrics,
)

## Define Pipeline components

We'll define some function-based components that use scikit-learn to train some classifiers and produce evaluations that can be visualized.

We're building Python-function-based components.   
Note the use of the `@component()` decorator in the definitions below. We can optionally set a list of packages for the component to install; the base image to use (the default is a Python 3.7 image); and the name of a component YAML file to generate, so that the component definition can be shared and reused.

The first component shows how to visualize an *ROC curve*. 
Note that the function definition includes a parameter called `metrics`, of type `OutputArtifact(ClassificationMetrics)`. This component will output a `ClassificationMetrics` artifact, and we will be able to visualize the metrics in the Pipelines UI in the Cloud Console.

To do this, we're using the artifact's `log_roc_curve()` method. This method takes as input arrays with the false positive rates, true positive rates, and thresholds, as [generated by the `sklearn.metrics.roc_curve` function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html).
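
To make the three arrays concrete, here is a dependency-free sketch that computes one ROC point per distinct score. It is a simplified illustration, not scikit-learn's implementation: `sklearn.metrics.roc_curve` may return a different set of points (for example, it can add a starting point and drop some intermediate thresholds). The function name `roc_points` is hypothetical.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr, thresholds) as lists, thresholds in descending order.

    At each threshold, an example is predicted positive when its
    score is >= the threshold.
    """
    positives = sum(1 for label in y_true if label)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError('Need at least one positive and one negative example.')

    fpr, tpr, thresholds = [], [], []
    for threshold in sorted(set(y_score), reverse=True):
        predicted = [score >= threshold for score in y_score]
        true_pos = sum(1 for label, pred in zip(y_true, predicted) if pred and label)
        false_pos = sum(1 for label, pred in zip(y_true, predicted) if pred and not label)
        tpr.append(true_pos / positives)    # true positive rate
        fpr.append(false_pos / negatives)   # false positive rate
        thresholds.append(threshold)
    return fpr, tpr, thresholds


fpr, tpr, thresholds = roc_points(
    y_true=[False, False, True, True],
    y_score=[0.1, 0.4, 0.35, 0.8])
print(fpr)         # [0.0, 0.5, 0.5, 1.0]
print(tpr)         # [0.5, 0.5, 1.0, 1.0]
print(thresholds)  # [0.8, 0.4, 0.35, 0.1]
```

Each index across the three lists describes one point on the curve: the false and true positive rates obtained when scores at or above that threshold are classified as positive.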

When you evaluate the cell below, a task factory function called `wine_classification` will be created, which we will use to construct our pipeline definition.  In addition, a component `.yaml` file will be created, which can be shared and loaded via file or URL to create the same task function.


In [None]:
@component(
    packages_to_install=['scikit-learn'],  # 'sklearn' is a deprecated alias on PyPI
    base_image='python:3.9',
    output_component_file='wine_classif_component.yaml'
)
def wine_classification(metrics: OutputArtifact(ClassificationMetrics)):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_curve
    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split, cross_val_predict

    X, y = load_wine(return_X_y=True)
    # Binary classification problem for label 1.
    y = y == 1

    # Hold out a test split; only the training split is used below.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    rfc = RandomForestClassifier(n_estimators=10, random_state=42)
    rfc.fit(X_train, y_train)
    # Out-of-fold class probabilities on the training split (3-fold CV).
    y_scores = cross_val_predict(rfc, X_train, y_train, cv=3, method='predict_proba')
    # Column 1 of y_scores holds the probability of the positive class (True).
    fpr, tpr, thresholds = roc_curve(y_true=y_train, y_score=y_scores[:,1], pos_label=True)
    # Log the curve to the ClassificationMetrics artifact for the Pipelines UI.
    metrics.get().log_roc_curve(fpr, tpr, thresholds)
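
As a sketch of how the generated `wine_classif_component.yaml` file might be reused elsewhere, the KFP SDK's `kfp.components` module provides `load_component_from_file` and `load_component_from_url`. The wrapper names `is_url` and `load_shared_component` below are hypothetical, and the sketch assumes the special SDK build used in this notebook loads the generated file the same way as the standard SDK.

```python
def is_url(location: str) -> bool:
    """Return True if `location` looks like an HTTP(S) URL rather than a file path."""
    return location.startswith(('http://', 'https://'))


def load_shared_component(location: str):
    """Load a component definition from a local YAML file or a URL.

    Hypothetical wrapper around the KFP SDK loaders. The import is
    deferred so that defining this function does not require kfp.
    """
    from kfp import components
    if is_url(location):
        return components.load_component_from_url(location)
    return components.load_component_from_file(location)


# Example usage (requires the KFP SDK and the generated file):
# wine_classification_op = load_shared_component('wine_classif_component.yaml')
```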

The second component shows how to visualize a *confusion matrix*.

As with the previous component, we're creating a `metrics` output artifact.  We're then using the artifact's `log_confusion_matrix` method to visualize the confusion matrix results, as generated by the [sklearn.metrics.confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) function.
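
To illustrate the shape of the data being logged, here is a dependency-free sketch of a confusion matrix as a nested list, following the scikit-learn convention that rows correspond to true classes and columns to predicted classes. The function name `confusion_matrix_rows` and the sample labels are made up for illustration.

```python
def confusion_matrix_rows(y_true, y_pred, num_classes):
    """Return a num_classes x num_classes nested list of counts.

    matrix[i][j] counts examples whose true class is i and whose
    predicted class is j. Labels are assumed to be integers in
    range(num_classes).
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Made-up labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(confusion_matrix_rows(y_true, y_pred, num_classes=3))
# [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

The diagonal holds the correctly classified counts; off-diagonal entries show which classes are confused with which.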

In [None]:
@component(
    packages_to_install=['scikit-learn'],  # 'sklearn' is a deprecated alias on PyPI
    base_image='python:3.9'
)
def iris_sgdclassifier(test_samples_fraction: float, metrics: OutputArtifact(ClassificationMetrics)):
    from sklearn import datasets, model_selection
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import confusion_matrix

    iris_dataset = datasets.load_iris()
    train_x, test_x, train_y, test_y = model_selection.train_test_split(
        iris_dataset['data'], iris_dataset['target'], test_size=test_samples_fraction)

    classifier = SGDClassifier()
    classifier.fit(train_x, train_y)
    # Out-of-fold predictions on the training split (3-fold CV).
    predictions = model_selection.cross_val_predict(classifier, train_x, train_y, cv=3)
    # Rows are true classes and columns are predicted classes, in label order.
    metrics.get().log_confusion_matrix(
        ['Setosa', 'Versicolour', 'Virginica'],
        confusion_matrix(train_y, predictions).tolist()  # Convert the NumPy array to a nested list.
    )

## Define a pipeline that uses the new components

Next, we'll define a simple pipeline that uses the two components that we defined.

In [None]:
@dsl.pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline. 
    name='metrics-pipeline-v2')
def pipeline():
  wine_classification_op = wine_classification()
  iris_sgdclassifier_op = iris_sgdclassifier(test_samples_fraction=0.3)


## Compile and run the pipeline

We'll compile the pipeline:

In [None]:
from kfp.v2 import compiler
from aiplatform.pipelines import client

compiler.Compiler().compile(pipeline_func=pipeline,
                            package_path='metrics_pipeline.json')
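
Before submitting, it can be useful to confirm that the compiler wrote a readable JSON file. The sketch below only lists the top-level keys and makes no assumption about what the job spec contains; the function name `summarize_job_spec` is hypothetical.

```python
import json


def summarize_job_spec(path):
    """Return the sorted top-level keys of a compiled JSON job spec.

    Hypothetical helper: it checks only that the file parses as a JSON
    object, not that it is a valid pipeline definition.
    """
    with open(path) as spec_file:
        spec = json.load(spec_file)
    if not isinstance(spec, dict):
        raise ValueError('Expected a JSON object at the top level of {}'.format(path))
    return sorted(spec)


# Example usage after compiling:
# print(summarize_job_spec('metrics_pipeline.json'))
```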

... and run it:

In [None]:
api_client = client.Client(
  project_id=PROJECT_ID,
  region=REGION,  # Defined in the variables cell above ('us-central1').
  api_key=API_KEY)

response = api_client.create_run_from_job_spec(
  job_spec_path='metrics_pipeline.json',
  pipeline_root=PIPELINE_ROOT,  # Override if needed.
  parameter_values={})

You can click the generated link above to view the pipeline run in the Cloud Console. When the pipeline steps finish executing, you can view the generated metrics visualizations by clicking on the metrics artifacts.

The ROC curve should look as follows:

<a href="https://storage.googleapis.com/amy-jo/images/mp/roc_curve.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/mp/roc_curve.png" width="90%"/></a>

... and the confusion matrix should look like this:

<a href="https://storage.googleapis.com/amy-jo/images/mp/confusion_matrix.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/mp/confusion_matrix.png" width="90%"/></a>



-----------------------------
Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.