
In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex Pipelines: Cloud Natural Language model training pipeline
<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/pipelines/google_cloud_pipeline_components_cloud_natural_language_pipeline.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/pipelines/google_cloud_pipeline_components_cloud_natural_language_pipeline.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/ai/platform/notebooks/deploy-notebook?download_url=https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/natural_language/cloud_natural_language_pipeline.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview
This notebook shows how to use [Google Cloud Pipeline Components SDK](https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction) and additional components in this directory to run a machine learning pipeline in [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) to train a TensorFlow text classification model.

In this pipeline, the model training Docker image utilizes [TFHub](https://tfhub.dev/) models to perform state-of-the-art text classification training. The image is pre-built and ready to use, so no additional Docker setup is required.

### Objective

In this tutorial, you learn how to construct an end-to-end training pipeline in Vertex AI Pipelines that ingests a dataset, trains a text classification model on it, and outputs evaluation metrics.

This tutorial uses the following Google Cloud ML services and resources:

- Vertex AI Pipelines
- Vertex AI Datasets

The steps performed include:

- Define Kubeflow pipeline components
- Set up Kubeflow pipeline
- Run pipeline on Vertex AI

## Dataset

This notebook requires two datasets exported from Vertex AI [managed datasets](https://cloud.google.com/vertex-ai/docs/training/using-managed-datasets): one with train and validation data splits, and one with test data used for evaluation. Ensure that no data is shared between the two datasets; in particular, no evaluation data should appear in the train or validation splits. To export a Vertex AI dataset, follow these public docs:
* [Preparing data](https://cloud.google.com/vertex-ai/docs/text-data/classification/prepare-data)
* [Creating a Vertex AI dataset](https://cloud.google.com/vertex-ai/docs/text-data/classification/create-dataset) from the above data
* [Exporting dataset and its annotations](https://cloud.google.com/vertex-ai/docs/datasets/export-metadata-annotations); ensure the resulting export is located in a Google Cloud Storage (GCS) bucket you own. You may need to manually separate the test split data into its own file.
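
The separation requirement above can be checked before launching the pipeline. The following is a minimal sketch, not part of the original notebook. It assumes that each exported JSONL line carries a `textContent` field and, optionally, a `dataItemResourceLabels` entry keyed by `aiplatform.googleapis.com/ml_use`; verify these field names against your own export. The helper names `split_by_ml_use` and `shared_texts` are hypothetical.

```python
import json
from collections import defaultdict

# Assumed label key that marks the split of each data item; check your export.
ML_USE_KEY = "aiplatform.googleapis.com/ml_use"


def split_by_ml_use(jsonl_lines):
    """Group parsed JSONL records by their ml_use label ("unassigned" if absent)."""
    splits = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        labels = record.get("dataItemResourceLabels", {})
        splits[labels.get(ML_USE_KEY, "unassigned")].append(record)
    return dict(splits)


def shared_texts(records_a, records_b, text_key="textContent"):
    """Return the set of texts that appear in both record lists."""
    texts_a = {r[text_key] for r in records_a if text_key in r}
    texts_b = {r[text_key] for r in records_b if text_key in r}
    return texts_a & texts_b


# Tiny in-memory example; real data would be read from the exported files.
sample = [
    '{"textContent": "great movie", "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}',
    '{"textContent": "dull plot", "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}',
    '{"textContent": "great movie", "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}',
]
splits = split_by_ml_use(sample)
leaked = shared_texts(splits.get("training", []), splits.get("test", []))
print("Texts present in both training and test:", leaked)
```

A non-empty result means some evaluation items also appear in the training data and should be removed from one of the two files.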

## Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Setup

If you are using Colab or Google Vertex AI Workbench Notebooks, your environment already meets all the requirements to run this notebook. You can skip this step.

***NOTE***: This notebook has been tested in the following environment:

* Python version = 3.8

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

- The Google Cloud SDK
- Python 3
- virtualenv
- Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

1. [Install and initialize the SDK](https://cloud.google.com/sdk/docs/).

2. [Install Python 3](https://cloud.google.com/python/setup#installing_python).

3. [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv) and create a virtual environment that uses Python 3. Activate the virtual environment.

4. With the virtual environment active, run `pip3 install jupyter` in a terminal shell to install Jupyter.

5. Run `jupyter notebook` on the command line in a terminal shell to launch Jupyter.

6. Open this notebook in the Jupyter Notebook Dashboard.


### Install additional packages

Run the following commands to set up the packages for this notebook. The last code cell in this section restarts your kernel so that the newly installed packages load properly. When initializing this notebook from scratch, run the cells up to and including the restart, then continue from the cell after it.

In [None]:
# Install using pip3
!pip3 install -U tensorflow google-cloud-pipeline-components google-cloud-aiplatform kfp==1.8.16 "shapely<2" -q

In [None]:
# Version check
# This has been tested with KFP 1.8.16
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"
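
The same version check can be done from Python without spawning a subprocess. This is a minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper names `installed_version` and `version_matches` are hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def version_matches(package_name, expected):
    """True only if the package is installed at exactly the expected version."""
    return installed_version(package_name) == expected


# The notebook states that it was tested with KFP 1.8.16.
found = installed_version("kfp")
if not version_matches("kfp", "1.8.16"):
    print(f"kfp version found: {found!r}; this notebook was tested with 1.8.16")
```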

In [None]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI and Cloud Storage APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,storage.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the
   **Role** drop-down list. Type "Vertex AI" into the filter box, and select
   **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box,
   and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your
   local environment.

6. Enter the path to your service account key as the
   `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if IS_COLAB:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''
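
For local runs, an empty or mistyped key path only surfaces later as an authentication error. The following minimal sketch, which is not part of the original notebook, checks the variable up front; the helper name `credentials_status` and its three return values are hypothetical.

```python
import os


def credentials_status(environ=None):
    """Classify GOOGLE_APPLICATION_CREDENTIALS as 'unset', 'missing_file', or 'ok'."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    if not path:
        return "unset"  # variable absent or set to an empty string
    if not os.path.isfile(path):
        return "missing_file"  # variable set, but no file at that path
    return "ok"


print("Service account key status:", credentials_status())
```

On Colab or Vertex AI Workbench the status is typically `unset`, which is expected because those environments authenticate differently.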

### Set project ID

Set your project ID here. If you don't know it, the following snippet attempts to determine it from your gcloud config. Continue only if the printed project ID is the one you intend to use.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
# Fall back to gcloud when the placeholder above has not been replaced.
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
print("Project ID:", PROJECT_ID)

In [None]:
!gcloud config set project $PROJECT_ID

### Set up project information

Enter information about your project and datasets here.

In [None]:
REGION = "us"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}
TRAINING_DATA_LOCATION = "gs://your-training-data-location"  # @param {type:"string"}
TASK_TYPE = "CLASSIFICATION"  # @param ["CLASSIFICATION", "MULTILABEL_CLASSIFICATION"]

In [None]:
# Since we are training a custom model, we need to specify the list of possible
# classes/labels.
# e.g., ["FirstClass", "SecondClass"]
# An additional class "[UNK]" will be added to the list indicating that none of
# the specified labels are a match.
CLASS_NAMES = [""]

# This is a list of GCS URIs; e.g., ["gs://your-bucket-name-here/your-input-file.jsonl"].
TEST_DATA_URIS = ["gs://your-bucket-name-here/your-input-file.jsonl"]
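
Because the placeholder values above are syntactically valid, a forgotten edit only fails deep inside the pipeline run. The following minimal sketch, not part of the original notebook, flags the most common mistakes early; `validate_eval_config` is a hypothetical helper and its checks are illustrative, not exhaustive.

```python
def validate_eval_config(class_names, test_data_uris):
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    cleaned = [name.strip() for name in class_names]
    if not cleaned or any(not name for name in cleaned):
        problems.append("CLASS_NAMES needs at least one label and no empty labels")
    if len(set(cleaned)) != len(cleaned):
        problems.append("CLASS_NAMES contains duplicate labels")
    if "[UNK]" in cleaned:
        # The notebook appends "[UNK]" itself when building the class list.
        problems.append("[UNK] is appended by the pipeline; do not list it")
    if not test_data_uris:
        problems.append("TEST_DATA_URIS is empty")
    for uri in test_data_uris:
        if not uri.startswith("gs://"):
            problems.append(f"not a Cloud Storage URI: {uri}")
    return problems


# The notebook's placeholder values trigger the empty-label check.
for problem in validate_eval_config(
    [""], ["gs://your-bucket-name-here/your-input-file.jsonl"]
):
    print("Config problem:", problem)
```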

#### UUID

To avoid name collisions with other resources in your project, you can create a UUID with the code below and append it onto the name of the bucket(s) created in this notebook.

In [None]:
import random
import string


# Generate a random lowercase alphanumeric ID of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()
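
The helper above draws random characters and does not produce a standard UUID. If a suffix derived from a real UUID is preferred, a minimal alternative sketch using the standard library's `uuid` module could look like this; `short_unique_suffix` is a hypothetical name and not part of the original notebook.

```python
import uuid


def short_unique_suffix(length=8):
    """Return the first `length` hex characters of a random UUID4."""
    if not 1 <= length <= 32:
        raise ValueError("length must be between 1 and 32")
    return uuid.uuid4().hex[:length]


suffix = short_unique_suffix()
print("Suffix:", suffix)
```

Either approach yields only lowercase letters and digits, which keeps the suffix safe to use in a bucket name.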

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize the Vertex AI SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
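
Besides being globally unique, a bucket name must follow the Cloud Storage naming rules. The following minimal sketch checks a simplified subset of those rules (names without dots) before any bucket is created; `is_plausible_bucket_name` is a hypothetical helper, and passing the check does not guarantee that the name is available.

```python
import re

# Simplified subset of the Cloud Storage naming rules, for names without dots:
# 3-63 characters; lowercase letters, digits, dashes, and underscores only;
# must start and end with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def is_plausible_bucket_name(name):
    """Return True if `name` passes the simplified naming checks above."""
    if not _BUCKET_NAME_RE.fullmatch(name):
        return False
    # Names may not begin with "goog" or contain "google".
    return not name.startswith("goog") and "google" not in name


for candidate in ["my-project-aip-ab12cd34", "[your-bucket-name]", "example.com:proj-aip-1"]:
    print(candidate, "->", is_plausible_bucket_name(candidate))
```

The third candidate illustrates why a project ID should not be reused blindly in a bucket name: a domain-scoped project ID contains characters that this simplified check rejects.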

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "-aip-" + UUID
    BUCKET_URI = "gs://" + BUCKET_NAME

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
!gsutil mb -l $REGION $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
!gsutil ls -al $BUCKET_URI

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_URI)

## Create training pipeline

### Import libraries

In [None]:
from google_cloud_pipeline_components.aiplatform import ModelBatchPredictOp
from google_cloud_pipeline_components.experimental import natural_language
from google_cloud_pipeline_components.experimental.evaluation import (
    GetVertexModelOp, ModelEvaluationClassificationOp,
    TargetFieldDataRemoverOp)
from kfp import components
from kfp.v2 import compiler, dsl

### Define constants

In [None]:
# Worker pool specs
TRAINING_MACHINE_TYPE = "n1-highmem-8"
ACCELERATOR_TYPE = "NVIDIA_TESLA_T4"
ACCELERATOR_COUNT = 1
EVAL_MACHINE_TYPE = "n1-highmem-8"

## Define components

This pipeline is composed of the following components:

- **train-tfhub-model** - Trains a new TensorFlow model using TFHub layers from a pre-built Docker image
- **upload-tensorflow-model-to-google-cloud-vertex-ai** - Uploads the resulting model to Vertex AI Model Registry
- **get-vertex-model** - Gets the model that was just uploaded as an artifact in the pipeline
- **convert-dataset-export-for-batch-predict** - Preprocessing component that takes the test dataset exported from Vertex AI datasets and converts it to a simpler format that the batch predict component can read
- **target-field-data-remover** - Removes the target field (i.e., label) in the test dataset for the downstream batch predict component
- **model-batch-predict** - Performs a batch prediction job
- **model-evaluation-classification** - Calculates the evaluation metrics from the above batch predict job and exports the metrics artifact
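
To see how these components depend on one another, the sketch below records each component's upstream tasks, as wired in the pipeline definition later in this notebook, and derives one valid execution order. It is an illustration only: Vertex AI Pipelines does its own scheduling, and `execution_order` is a hypothetical helper.

```python
# Upstream dependencies of each component, read off the pipeline definition
# in this notebook (task names as listed above).
DEPENDENCIES = {
    "train-tfhub-model": [],
    "upload-tensorflow-model-to-google-cloud-vertex-ai": ["train-tfhub-model"],
    "get-vertex-model": ["upload-tensorflow-model-to-google-cloud-vertex-ai"],
    "convert-dataset-export-for-batch-predict": [],
    "target-field-data-remover": ["convert-dataset-export-for-batch-predict"],
    "model-batch-predict": ["get-vertex-model", "target-field-data-remover"],
    "model-evaluation-classification": [
        "model-batch-predict",
        "convert-dataset-export-for-batch-predict",
    ],
}


def execution_order(dependencies):
    """Return one valid topological order of the tasks (Kahn's algorithm)."""
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    order = []
    while remaining:
        # Tasks whose upstream tasks have all been scheduled can run now.
        ready = sorted(task for task, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle detected")
        for task in ready:
            order.append(task)
            del remaining[task]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


for step, task in enumerate(execution_order(DEPENDENCIES), start=1):
    print(step, task)
```

The training branch and the dataset-conversion branch have no dependency on each other, so they can proceed in parallel until batch prediction joins them.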


In [None]:
# Load upload TF model component
upload_tensorflow_model_to_vertex_op = components.load_component_from_url(
    "https://raw.githubusercontent.com/Ark-kun/pipeline_components/c6a8b67d1ada2cc17665c99ff6b410df588bee28/components/google-cloud/Vertex_AI/Models/Upload_Tensorflow_model/workaround_for_buggy_KFPv2_compiler/component.yaml"
)

### Define the pipeline

The pipeline performs the following steps:
- Trains new text classification model
- Uploads model to Vertex AI Model Registry
- Performs preprocessing steps on test dataset export: formats data for batch prediction, removes target field
- Performs batch prediction on preprocessed test data
- Evaluates performance of model based on batch prediction output
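
Conceptually, removing the target field means stripping the label from every test record, so that batch prediction sees only the input. The sketch below illustrates that idea on in-memory JSONL lines. It is not the implementation of the `TargetFieldDataRemoverOp` component, and the `text` field name in the example record is an assumption; `labels` matches the `target_field_name` used in this notebook.

```python
import json


def remove_target_field(jsonl_lines, target_field="labels"):
    """Return JSONL lines with `target_field` dropped from every record."""
    cleaned = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        record.pop(target_field, None)  # no error if the field is absent
        cleaned.append(json.dumps(record))
    return cleaned


# Hypothetical record layout, for illustration only.
example = ['{"text": "great movie", "labels": ["positive"]}']
print(remove_target_field(example))
```

The unmodified records are still needed downstream: the evaluation step compares predictions against the labels that this step removes.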

In [None]:
@dsl.pipeline(name="text-classification-model")
def pipeline():
    train_task = natural_language.TrainTextClassificationOp()(
        project=PROJECT_ID,
        location=LOCATION,
        machine_type=TRAINING_MACHINE_TYPE,
        accelerator_type=ACCELERATOR_TYPE,
        accelerator_count=ACCELERATOR_COUNT,
        input_data_path=TRAINING_DATA_LOCATION,
        input_format="jsonl",
        natural_language_task_type=TASK_TYPE,
    )

    upload_task = upload_tensorflow_model_to_vertex_op(
        model=train_task.outputs["model_output"]
    )

    get_model_task = GetVertexModelOp(
        model_resource_name=upload_task.outputs["model_name"]
    )

    classification_type = (
        "multilabel" if TASK_TYPE == "MULTILABEL_CLASSIFICATION" else "multiclass"
    )

    convert_dataset_task = natural_language.ConvertDatasetExportForBatchPredictOp(
        file_paths=TEST_DATA_URIS, classification_type=classification_type
    )

    target_field_remover_task = TargetFieldDataRemoverOp(
        project=PROJECT_ID,
        location=LOCATION,
        root_dir=BUCKET_URI,
        gcs_source_uris=convert_dataset_task.outputs["output_files"],
        target_field_name="labels",
        instances_format="jsonl",
    )

    # Note: ModelBatchPredictOp doesn't support accelerators currently.
    batch_predict_task = ModelBatchPredictOp(
        project=PROJECT_ID,
        location=LOCATION,
        model=get_model_task.outputs["model"],
        job_display_name="nl-batch-predict-evaluation",
        gcs_source_uris=target_field_remover_task.outputs["gcs_output_directory"],
        instances_format="jsonl",
        predictions_format="jsonl",
        gcs_destination_output_uri_prefix=BUCKET_URI,
        machine_type=EVAL_MACHINE_TYPE,
    )

    # Note: Because we're running a custom training pipeline, the model source
    # is detected as Custom and thus it doesn't use AutoML NL's default settings
    # and fails if class_labels is excluded.
    ModelEvaluationClassificationOp(
        project=PROJECT_ID,
        location=LOCATION,
        root_dir=BUCKET_URI,
        class_labels=CLASS_NAMES + ["[UNK]"],
        predictions_gcs_source=batch_predict_task.outputs["gcs_output_directory"],
        predictions_format="jsonl",
        prediction_label_column="prediction.displayNames",
        prediction_score_column="prediction.confidences",
        ground_truth_gcs_source=convert_dataset_task.outputs["output_files"],
        ground_truth_format="jsonl",
        target_field_name="labels",
        classification_type=classification_type,
    )

### Compile the pipeline

In [None]:
compiler.Compiler().compile(pipeline, "nl_pipeline.json")

Running the above cell generates the file `nl_pipeline.json` in the current working directory, either locally or in Colab's file system.
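
To confirm that the compiled file contains the expected tasks, it can be inspected as ordinary JSON. This is a minimal sketch under an assumption about the file layout: tasks are expected under `root.dag.tasks`, optionally wrapped in a top-level `pipelineSpec` key, with upstream tasks listed in `dependentTasks`. Check these keys against your own compiled file. Both helper names are hypothetical.

```python
import json


def list_pipeline_tasks(spec):
    """Map each task name to its sorted upstream task names."""
    # Assumption: the spec may be wrapped in a top-level "pipelineSpec" key.
    pipeline_spec = spec.get("pipelineSpec", spec)
    tasks = pipeline_spec.get("root", {}).get("dag", {}).get("tasks", {})
    return {
        name: sorted(task.get("dependentTasks", [])) for name, task in tasks.items()
    }


def load_pipeline_tasks(path):
    """Read a compiled pipeline file and list its tasks."""
    with open(path) as f:
        return list_pipeline_tasks(json.load(f))


# Synthetic spec for illustration; a real call would be
# load_pipeline_tasks("nl_pipeline.json").
example_spec = {
    "pipelineSpec": {
        "root": {
            "dag": {
                "tasks": {
                    "task-a": {},
                    "task-b": {"dependentTasks": ["task-a"]},
                }
            }
        }
    }
}
print(list_pipeline_tasks(example_spec))
```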

### Run the pipeline

This sends a create pipeline job request to Vertex AI Pipelines. Note that this task runs synchronously and may take a while to complete.

You may view the progress of the job at any time by clicking on the generated links (after "View Pipeline Job" in the console output of the cell below). Once the pipeline finishes, you may examine the artifacts produced from this pipeline.

In [None]:
job = aiplatform.PipelineJob(
    display_name="nl_pipeline",
    template_path="nl_pipeline.json",
    location=LOCATION,
    enable_caching=True,
    parameter_values={},
)

job.run()

Once the pipeline successfully finishes, go to the pipeline and examine the resulting metrics artifacts for the results. Otherwise, refer to the failing step(s) in the pipeline to determine the cause of any errors.

## View model evaluation results

To check the results of evaluation after pipeline execution, find the "model-evaluation-classification" subdirectory in the Cloud Storage bucket created by this pipeline. You may also run the following to directly output the contents of the metrics file:

In [None]:
import tensorflow as tf

EVAL_TASK_NAME = "model-evaluation-classification"
PROJECT_NUMBER = job.gca_resource.name.split("/")[1]
for task_detail in job.gca_resource.job_detail.task_details:
    TASK_ID = task_detail.task_id
    EVAL_METRICS = (
        BUCKET_URI
        + "/"
        + PROJECT_NUMBER
        + "/"
        + job.name
        + "/"
        + EVAL_TASK_NAME
        + "_"
        + str(TASK_ID)
        + "/executor_output.json"
    )
    if tf.io.gfile.exists(EVAL_METRICS):
        ! gsutil cat $EVAL_METRICS
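
The path assembled in the cell above follows the pattern `<bucket>/<project number>/<job name>/<task name>_<task id>/executor_output.json`. The sketch below factors that string handling into two small functions that can be checked without a running job; both helper names and the example values are hypothetical.

```python
def project_number_from_resource_name(resource_name):
    """Extract the project segment from 'projects/<number>/locations/...'."""
    parts = resource_name.split("/")
    if len(parts) < 2 or parts[0] != "projects":
        raise ValueError(f"unexpected resource name: {resource_name}")
    return parts[1]


def executor_output_uri(bucket_uri, project_number, job_name, task_name, task_id):
    """Build the URI of a task's executor_output.json under the pipeline root."""
    return "/".join(
        [
            bucket_uri.rstrip("/"),
            project_number,
            job_name,
            f"{task_name}_{task_id}",
            "executor_output.json",
        ]
    )


# Made-up values, for illustration only.
number = project_number_from_resource_name(
    "projects/123456/locations/us-central1/pipelineJobs/nl-pipeline-1"
)
print(
    executor_output_uri(
        "gs://my-bucket", number, "nl-pipeline-1", "model-evaluation-classification", 42
    )
)
```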

## Cleaning up

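To clean up the storage used by this pipeline, run the command below. It deletes the Cloud Storage bucket and everything in it; resources created in Vertex AI, such as the uploaded model, are not removed by this command.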

In [None]:
# Delete GCS bucket.
!gsutil -m rm -r {BUCKET_URI}

## Next steps

For an alternative approach, see the ["ready-to-go" text classification pipeline](https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/community/pipelines/google_cloud_pipeline_components_ready_to_go_text_classification_pipeline.ipynb). That pipeline exposes the model logic for further customization if needed, and adds a pipeline step that deploys the model to enable online predictions.