In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI TensorBoard integration with Vertex AI Pipelines

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/tensorboard/tensorboard_vertex_ai_pipelines_integration.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/tensorboard/tensorboard_vertex_ai_pipelines_integration.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/tensorboard/tensorboard_vertex_ai_pipelines_integration.ipynb">
        <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

## Overview

### What is Vertex AI TensorBoard

[Open source TensorBoard](https://www.tensorflow.org/tensorboard/get_started)
(TB) is a Google open source project for machine learning experiment
visualization. Vertex AI TensorBoard is an enterprise-ready managed
version of TensorBoard.

Vertex AI TensorBoard provides various detailed visualizations, including:

*   Tracking and visualizing metrics, such as loss and accuracy over time.
*   Visualizing model computational graphs (ops and layers).
*   Viewing histograms of weights, biases, or other tensors as they change over time.
*   Projecting embeddings to a lower dimensional space.
*   Displaying image, text, and audio samples.

In addition to the powerful visualizations from
TensorBoard, Vertex AI TensorBoard provides the following benefits:

*  A persistent, shareable link to your experiment's dashboard.

*  A searchable list of all experiments in a project.

*  Tight integrations with Vertex AI services for model training.

*  Enterprise-grade security, privacy, and compliance.

With Vertex AI TensorBoard, you can track, visualize, and compare
ML experiments and share them with your team.

Learn more about [Vertex AI TensorBoard](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview) and [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction).

### Objective

In this tutorial, you learn how to create a training pipeline using the KFP SDK, execute the pipeline in Vertex AI Pipelines, and monitor your training process on Vertex AI TensorBoard in near real time.

This tutorial uses the following Google Cloud ML services and resources:

- Vertex AI Training
- Vertex AI TensorBoard
- Vertex AI Pipelines

The steps performed include:

* Set up a service account and Cloud Storage buckets.
* Construct a KFP pipeline with your custom training code.
* Compile and execute the KFP pipeline in Vertex AI Pipelines with TensorBoard enabled for near real-time monitoring.

### Dataset

This tutorial uses the [flower dataset](https://www.tensorflow.org/datasets/catalog/tf_flowers) provided by TensorFlow. No other datasets are required.


### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

### Set up your local development environment

**If you are using Colab or Vertex AI Workbench**, your environment already meets all the requirements to run this notebook. You can skip this step.

**_NOTE_**: This notebook has been tested in the following environments:

* Python version = 3.9

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

- The Google Cloud SDK
- Git
- Python 3
- virtualenv
- Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

1. [Install and initialize the SDK](https://cloud.google.com/sdk/docs/).

2. [Install Python 3](https://cloud.google.com/python/setup#installing_python).

3. [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv) and create a virtual environment that uses Python 3.  Activate the virtual environment.

4. To install Jupyter, run `pip3 install jupyter` on the command-line in a terminal shell.

5. To launch Jupyter, run `jupyter notebook` on the command-line in a terminal shell.

6. Open this notebook in the Jupyter Notebook Dashboard.

## Installation

Install the following packages required to execute this notebook.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade google-cloud-aiplatform {USER_FLAG} -q
! pip3 install --upgrade google-cloud-storage {USER_FLAG} -q
! pip3 install --upgrade kfp google-cloud-pipeline-components {USER_FLAG} -q

### Restart the kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.


In [None]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

Check the versions of the packages you installed.  The KFP SDK version should be >=1.6.
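
If you prefer to check the minimum version in code rather than by eye, a comparison along the following lines is one option. This is a minimal sketch: the helper names `version_tuple` and `meets_minimum` are hypothetical, the version strings are made up for illustration, and only the leading numeric parts of each version component are compared.

```python
import re


def version_tuple(version: str) -> tuple:
    """Return the leading numeric components of a version string as ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component that has no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.6") -> bool:
    """True when `installed` is at least `minimum` (numeric parts only)."""
    return version_tuple(installed) >= version_tuple(minimum)


# Hypothetical version strings, for illustration only.
print(meets_minimum("1.8.14"))  # True
print(meets_minimum("1.4.0"))  # False
```

In the notebook you could pass `kfp.__version__` as the `installed` argument.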


In [None]:
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"

## Before you begin

### GPU runtime

This tutorial does not require a GPU runtime.


### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)

3. [Enable the Vertex AI APIs and Cloud Storage.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,storage-component.googleapis.com)

4. If you are running this notebook locally, you will need to install the [Cloud SDK](https://cloud.google.com/sdk).


5. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

Note: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables that are prefixed with `$` or wrapped in `{}`.

### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.


In [None]:
PROJECT_ID = ""

import os

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
! gcloud config set project $PROJECT_ID

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users, create a short random identifier for your session and append it to the names of the resources you create in this tutorial.

In [None]:
import random
import string


# Generate a random identifier of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

### Set your region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

### Log in to your Google Cloud account

In [None]:
# The Google Cloud Notebook product has specific requirements
import os
import sys

IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

IS_COLAB = "google.colab" in sys.modules
# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if IS_COLAB:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create Cloud Storage bucket
A Cloud Storage bucket is used to store a) your training code distribution (details below), and b) the outputs (including TensorBoard logs) that your training code generates. The bucket must be regional, not multi-region or dual-region, and the following resources must be in the same region:

* Cloud Storage bucket
* Vertex AI training job
* Vertex AI TensorBoard instance
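
Before creating the bucket, it can help to confirm that the URI is well formed, since a leftover placeholder such as `gs://[your-bucket-name]` only fails later at creation time. The sketch below is illustrative: `default_bucket_uri` and `is_plausible_bucket_uri` are hypothetical helpers, the project ID and suffix are made-up values, and the regular expression covers only a simplified subset of the Cloud Storage naming rules.

```python
import re

# Simplified subset of the Cloud Storage naming rules: 3-63 characters,
# lowercase letters, digits, '-', '_' and '.', starting and ending with a
# letter or digit. The full rules have more cases (for example, dotted names).
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def default_bucket_uri(project_id: str, suffix: str) -> str:
    """Build a gs:// URI from a project ID and a per-session suffix."""
    return f"gs://{project_id}-aip-{suffix}"


def is_plausible_bucket_uri(bucket_uri: str) -> bool:
    """Check the gs:// scheme and the simplified name pattern above."""
    if not bucket_uri.startswith("gs://"):
        return False
    name = bucket_uri[len("gs://"):]
    return _BUCKET_NAME_RE.match(name) is not None


# Hypothetical project ID and suffix, for illustration only.
uri = default_bucket_uri("my-project", "ab12cd34")
print(uri, is_plausible_bucket_uri(uri))
print(is_plausible_bucket_uri("gs://[your-bucket-name]"))
```

This check does not verify the bucket's location type or region; those still have to match the requirements listed above.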

In [None]:
BUCKET_URI = "gs://[your-bucket-name]"  # @param {type:"string"}

if BUCKET_URI == "" or BUCKET_URI is None or BUCKET_URI == "gs://[your-bucket-name]":
    BUCKET_URI = "gs://" + PROJECT_ID + "-aip-" + UUID

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket. The created bucket is deleted in the cleaning up section at the end.

In [None]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al {BUCKET_URI}

## Set up service account and permissions

A service account is used to create the custom training job. If you do not want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID. You can create a service account by following the [documentation instructions](https://cloud.google.com/iam/docs/creating-managing-service-accounts#creating).
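
The default Compute Engine service account address is derived from the project number, in the form `<project_number>-compute@developer.gserviceaccount.com`. The following sketch shows one way to build it from the text output of `gcloud projects describe`. The helper names and the sample output lines are hypothetical; unlike the cell below, which reads the last output line, this version looks up the `projectNumber` key by name.

```python
def parse_project_number(describe_lines):
    """Extract the project number from `gcloud projects describe` output lines."""
    for line in describe_lines:
        key, _, value = line.partition(":")
        if key.strip() == "projectNumber":
            # The value is usually quoted, for example '123456789012'.
            return value.strip().strip("'\"")
    raise ValueError("projectNumber not found in gcloud output")


def default_compute_service_account(project_number: str) -> str:
    """Return the default Compute Engine service account email address."""
    return f"{project_number}-compute@developer.gserviceaccount.com"


# Hypothetical command output, for illustration only.
sample_output = [
    "createTime: '2022-01-01T00:00:00.000Z'",
    "lifecycleState: ACTIVE",
    "name: my-project",
    "projectId: my-project",
    "projectNumber: '123456789012'",
]
number = parse_project_number(sample_output)
print(default_compute_service_account(number))
```

Looking up the key by name is a design choice that keeps the parsing independent of the order in which `gcloud` prints the fields.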

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = ! gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

In [None]:
# Grant Cloud Storage permission.
! gcloud projects add-iam-policy-binding $PROJECT_ID \
   --member=serviceAccount:$SERVICE_ACCOUNT \
   --role=roles/storage.admin

In [None]:
# Grant Vertex AI permission.
! gcloud projects add-iam-policy-binding $PROJECT_ID \
   --member=serviceAccount:$SERVICE_ACCOUNT \
   --role=roles/aiplatform.user

### Import aiplatform

In [None]:
import google.cloud.aiplatform as aiplatform

### Initialize Vertex AI SDK for Python
Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

#### Vertex AI Pipelines constants

Set up the following constants for Vertex AI Pipelines:


In [None]:
PIPELINE_ROOT = "{}/tensorboard-pipeline-integration/pipeline_root/".format(BUCKET_URI)
BASE_OUTPUT_DIR = "{}/pipeline-output/tensorboard-pipeline-integration-{}".format(
    BUCKET_URI, UUID
)

Additional imports.


In [None]:
from google_cloud_pipeline_components.v1.custom_job.utils import \
    create_custom_training_job_op_from_component
from kfp.v2 import dsl
from kfp.v2.dsl import component

## Create Vertex AI TensorBoard


Create a TensorBoard instance to be used by the pipeline.

In [None]:
TENSORBOARD_NAME = "[your-tensorboard-name]"  # @param {type:"string"}

if (
    TENSORBOARD_NAME == ""
    or TENSORBOARD_NAME is None
    or TENSORBOARD_NAME == "[your-tensorboard-name]"
):
    TENSORBOARD_NAME = PROJECT_ID + "-tb-" + UUID

tensorboard = aiplatform.Tensorboard.create(
    display_name=TENSORBOARD_NAME, project=PROJECT_ID, location=REGION
)
TENSORBOARD_RESOURCE_NAME = tensorboard.gca_resource.name
print("TensorBoard resource name:", TENSORBOARD_RESOURCE_NAME)

## Define Python function-based pipeline trainer component
In this tutorial, you define a function-based component to train the model. The training code is wrapped as a KFP component and run in Vertex AI Pipelines.

Your training code must write its TensorBoard logs to Cloud Storage. The Vertex AI Training service provides the log location through the predefined environment variable `AIP_TENSORBOARD_LOG_DIR`.

In most cases, you pass `os.environ['AIP_TENSORBOARD_LOG_DIR']` as the log directory to which open source TensorBoard logs are written.

For example, in TensorFlow 2.x, you can use the following code to create a `tensorboard_callback`:
```
tensorboard_callback = tf.keras.callbacks.TensorBoard(
  log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
  histogram_freq=1)
```
Then add the callback to `model.fit(...)`:
```
# ... earlier model definition ...
model.compile(...)

tensorboard_callback = tf.keras.callbacks.TensorBoard(
  log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
  histogram_freq=1)
  
model.fit(dataset, epochs=10, callbacks=[tensorboard_callback])
```
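
The `AIP_TENSORBOARD_LOG_DIR` variable is only set when the code runs as a Vertex AI training job with TensorBoard enabled, so `os.environ[...]` raises a `KeyError` elsewhere. If you also want to run the same training code locally, one option is to fall back to a local directory. This is a minimal sketch: the `resolve_log_dir` helper and the `./tb_logs` fallback path are assumptions for illustration, not part of the Vertex AI SDK.

```python
import os


def resolve_log_dir(
    env_var: str = "AIP_TENSORBOARD_LOG_DIR", fallback: str = "./tb_logs"
) -> str:
    """Return the log directory from the environment, or a local fallback."""
    log_dir = os.environ.get(env_var)
    if log_dir:
        return log_dir
    # Assumed fallback for runs outside Vertex AI (hypothetical path).
    return fallback


print(resolve_log_dir())
```

You could then pass `resolve_log_dir()` as the `log_dir` argument of the TensorBoard callback.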

In [None]:
@component(
    base_image="tensorflow/tensorflow:latest",
    packages_to_install=["tensorflow_datasets"],
)
def trainer(tb_log_dir_env_var: str = "AIP_TENSORBOARD_LOG_DIR"):
    """Training component."""
    import logging
    import os

    import tensorflow as tf
    import tensorflow_datasets as tfds

    IMG_WIDTH = 128

    def normalize_img(image):
        """Normalizes image.

        * Resizes image to IMG_WIDTH x IMG_WIDTH pixels
        * Casts values from `uint8` to `float32`
        * Scales values from [0, 255] to [0, 1]

        Returns:
          A tensor with shape (IMG_WIDTH, IMG_WIDTH, 3). (3 color channels)
        """
        image = tf.image.resize_with_pad(image, IMG_WIDTH, IMG_WIDTH)
        return image / 255.0

    def normalize_img_and_label(image, label):
        """Normalizes image and label.

        * Performs normalize_img on image
        * Passes through label unchanged

        Returns:
          Tuple (image, label) where
          * image is a tensor with shape (IMG_WIDTH, IMG_WIDTH, 3). (3 color
            channels)
          * label is an unchanged integer [0, 4] representing flower type
        """
        return normalize_img(image), label

    if "AIP_MODEL_DIR" not in os.environ:
        raise KeyError(
            "The `AIP_MODEL_DIR` environment variable has not been "
            + "set. See https://cloud.google.com/ai-platform-unified/docs/tutorials/image-recognition-custom/training"
        )
    output_directory = os.environ["AIP_MODEL_DIR"]

    logging.info("Loading and preprocessing data ...")
    dataset = tfds.load(
        "tf_flowers:3.*.*",
        split="train",
        try_gcs=True,
        shuffle_files=True,
        as_supervised=True,
    )
    dataset = dataset.map(
        normalize_img_and_label, num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
    dataset = dataset.cache()
    dataset = dataset.shuffle(1000)
    dataset = dataset.batch(128)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

    logging.info("Creating and training model ...")
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Conv2D(
                16,
                3,
                padding="same",
                activation="relu",
                input_shape=(IMG_WIDTH, IMG_WIDTH, 3),
            ),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dense(5),  # 5 classes
        ]
    )
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    # Create a TensorBoard callback that writes to the Cloud Storage path provided by AIP_TENSORBOARD_LOG_DIR
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=os.environ[tb_log_dir_env_var], histogram_freq=1
    )

    # Train the model with tensorboard_callback
    model.fit(dataset, epochs=14, callbacks=[tensorboard_callback])

    logging.info(f"Exporting SavedModel to: {output_directory}")
    # Add softmax layer for interpretability
    probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])
    probability_model.save(output_directory)

### Define a pipeline that uses your component

Next, define a pipeline that uses the component that was built in the previous section.

The `create_custom_training_job_op_from_component` function converts a given component into a custom training job (`CustomTrainingJobOp`) in Vertex AI.
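
The `tensorboard` argument takes the full TensorBoard resource name, which has the form `projects/<project>/locations/<region>/tensorboards/<id>`. Because the TensorBoard instance and the training job must be in the same region, a quick check of the resource name before building the pipeline can catch a mismatch early. The sketch below is illustrative: `parse_tensorboard_name` and `same_region` are hypothetical helpers, and the resource name is a made-up example.

```python
import re

_TENSORBOARD_NAME_RE = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/tensorboards/(?P<tensorboard>[^/]+)$"
)


def parse_tensorboard_name(resource_name: str) -> dict:
    """Split a TensorBoard resource name into project, location, and ID."""
    match = _TENSORBOARD_NAME_RE.match(resource_name)
    if match is None:
        raise ValueError(f"Not a TensorBoard resource name: {resource_name!r}")
    return match.groupdict()


def same_region(resource_name: str, region: str) -> bool:
    """True when the TensorBoard instance lives in the given region."""
    return parse_tensorboard_name(resource_name)["location"] == region


# Hypothetical resource name, for illustration only.
name = "projects/123456789012/locations/us-central1/tensorboards/987654321"
print(parse_tensorboard_name(name))
print(same_region(name, "us-central1"))
```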

In [None]:
@dsl.pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline. Used to determine the pipeline Context.
    name="tb-pipeline-integration",
)
def pipeline():
    custom_job_op = create_custom_training_job_op_from_component(
        trainer,
        tensorboard=TENSORBOARD_RESOURCE_NAME,
        base_output_directory=BASE_OUTPUT_DIR,
        service_account=SERVICE_ACCOUNT,
    )
    custom_job_op(project=PROJECT_ID, location=REGION)

## Compile the pipeline

Next, compile the pipeline.

In [None]:
from kfp.v2 import compiler  # noqa: F811

compiler.Compiler().compile(
    pipeline_func=pipeline, package_path="tensorboard-pipeline-integration.json"
)

## Run the pipeline

Next, run the pipeline.

In [None]:
DISPLAY_NAME = "tb-pipeline-integration_" + UUID

job = aiplatform.PipelineJob(
    display_name=DISPLAY_NAME,
    template_path="tensorboard-pipeline-integration.json",
    pipeline_root=PIPELINE_ROOT,
)

job.run()

! rm tensorboard-pipeline-integration.json

## Check training logs in TensorBoard

Now you can check the training logs in Vertex AI TensorBoard. In Vertex AI Pipelines, click the trainer component and then click `VIEW JOB` to open the custom job page. On the custom job page, click `OPEN TENSORBOARD` to view the training logs.


## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, **if you created the individual resources in the notebook**, you can delete them as follows:

In [None]:
# Delete GCS bucket.
! gsutil -m rm -r {BUCKET_URI}

# Delete TensorBoard instance.
! gcloud ai tensorboards delete {TENSORBOARD_RESOURCE_NAME}

# Delete the pipeline job.
job.delete()