# Vertex AI TensorBoard integration with Vertex AI Pipelines

## Overview

### What is Vertex AI TensorBoard

Vertex AI TensorBoard is an enterprise-ready, managed version of
[open source TensorBoard](https://www.tensorflow.org/tensorboard/get_started)
(TB), a Google open source project for visualizing machine learning
experiments.

Vertex AI TensorBoard provides various detailed visualizations, including:

*   Tracking and visualizing metrics, such as loss and accuracy over time.
*   Visualizing model computational graphs (ops and layers).
*   Viewing histograms of weights, biases, or other tensors as they change over time.
*   Projecting embeddings to a lower dimensional space.
*   Displaying image, text, and audio samples.

In addition to powerful visualizations from
TensorBoard, Vertex AI TensorBoard provides the following benefits:

*  A persistent, shareable link to your experiment's dashboard.

*  A searchable list of all experiments in a project.

*  Tight integrations with Vertex AI services for model training.

*  Enterprise-grade security, privacy, and compliance.

With Vertex AI TensorBoard, you can track, visualize, and compare
ML experiments and share them with your team.

Learn more about [Vertex AI TensorBoard](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview) and [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction).

### Objective

This demo shows how to create a training pipeline using the KFP SDK, execute the pipeline in Vertex AI Pipelines, and monitor the training process on Vertex AI TensorBoard in near real time.

It uses the following Google Cloud ML services and resources:

- Vertex AI Training
- Vertex AI TensorBoard
- Vertex AI Pipelines

The steps performed include:

* Set up a service account and a Cloud Storage bucket.
* Construct a KFP pipeline with your custom training code.
* Compile and execute the KFP pipeline in Vertex AI Pipelines with TensorBoard enabled for near real-time monitoring.

### Dataset

This demo uses the [flower dataset](https://www.tensorflow.org/datasets/catalog/tf_flowers) provided by TensorFlow. No other datasets are required.


## Installation

Install the following packages required to execute this notebook.

In [None]:
'''
! pip3 install --upgrade --quiet google-cloud-aiplatform \
                                 google-cloud-storage \
                                 "kfp<2" \
                                 "google-cloud-pipeline-components==1.0.20"
'''

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

In [1]:
PROJECT_ID = "ibnd-argls-cstmr-demos"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

Updated property [core/project].


#### Set the region

**Optional**: Update the `REGION` variable to specify the region that you want to use. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [2]:
REGION = "us-central1"  # @param {type: "string"}

#### UUID

If you're in a live tutorial session, you may be using a shared test account or project.
To avoid name collisions between users on the resources created, generate a unique identifier (referred to here as a UUID)
for each session, and append it to the names of the resources you create in this tutorial.

In [3]:
import random
import string


# Generate a random identifier of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()
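
The helper above returns a random alphanumeric suffix rather than a standards-based UUID, which is sufficient for avoiding name collisions in a shared project. As a minimal sketch of how the suffix might be appended to resource names, the following uses a hypothetical `with_suffix` helper (not part of the original notebook) and also shows the standard library's `uuid` module as an alternative source of randomness:

```python
import random
import string
import uuid


def generate_suffix(length: int = 8) -> str:
    """Return a random lowercase alphanumeric string (same idea as generate_uuid)."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


def with_suffix(base_name: str, suffix: str) -> str:
    """Append a per-session suffix to a resource name (hypothetical helper)."""
    return f"{base_name}-{suffix}"


session_suffix = generate_suffix()
example_name = with_suffix("tb-pipeline-integration", session_suffix)

# Alternative: take the first 8 hex characters of a random RFC 4122 UUID.
real_uuid_suffix = uuid.uuid4().hex[:8]
```

Either approach produces a short suffix; the random-choice version draws from a larger alphabet, while `uuid.uuid4()` follows a standard format.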

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts, for example, datasets.

In [4]:
BUCKET_URI = "gs://ibnd-argls-ml-demos-storage/08_tensorboard_pipelines"  # @param {type:"string"}

**If your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [5]:
# ! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}
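
Note that `BUCKET_URI` in this notebook includes a folder path after the bucket name, whereas bucket creation operates on the bucket alone. A minimal sketch for separating the two parts is shown below; `split_gcs_uri` is a hypothetical helper and the example URI is a placeholder, neither of which appears in the original notebook:

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/prefix' into (bucket, prefix). Hypothetical helper."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Expected a URI starting with {scheme!r}, got {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the object prefix.
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in {uri!r}")
    return bucket, prefix.strip("/")


# Placeholder URI for illustration only.
bucket_name, prefix = split_gcs_uri("gs://example-bucket/08_tensorboard_pipelines")
```

With the bucket name isolated, a creation command can target `gs://<bucket_name>` while the rest of the notebook keeps using the full URI as a working directory.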

## Setup service account and permissions

A service account is used to create the custom training job. If you don't want to use your project's Compute Engine default service account, set `SERVICE_ACCOUNT` to another service account ID. You can create a service account by following the [documentation instructions](https://cloud.google.com/iam/docs/creating-managing-service-accounts#creating).

In [6]:
SERVICE_ACCOUNT = "998979163436-compute@developer.gserviceaccount.com"  # @param {type:"string"}

In [7]:
import sys

'''
IS_COLAB = "google.colab" in sys.modules
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = ! gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)
'''


In [8]:
# Grant Cloud Storage permission.
'''
! gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT \
    --role=roles/storage.admin \
    --quiet
'''

ERROR: (gcloud.projects.add-iam-policy-binding) User [998979163436-compute@developer.gserviceaccount.com] does not have permission to access projects instance [ibnd-argls-cstmr-demos:setIamPolicy] (or it may not exist): Policy update access denied.


In [9]:
# Grant AI Platform permission.
'''
! gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT \
    --role=roles/aiplatform.user \
    --quiet
'''


### Import aiplatform

In [10]:
import google.cloud.aiplatform as aiplatform

### Initialize Vertex AI SDK for Python
Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [11]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

#### Vertex AI Pipelines constants

Set up the following constants for Vertex AI Pipelines:


In [12]:
PIPELINE_ROOT = "{}/tensorboard-pipeline-integration/pipeline_root/".format(BUCKET_URI)
BASE_OUTPUT_DIR = "{}/pipeline-output/tensorboard-pipeline-integration-{}".format(
    BUCKET_URI, UUID
)

Additional imports.


In [13]:
from google_cloud_pipeline_components.v1.custom_job.utils import \
    create_custom_training_job_op_from_component
from kfp.v2 import dsl
from kfp.v2.dsl import component

## Create a Vertex AI TensorBoard instance

Create a TensorBoard instance to be used by the pipeline.

In [14]:
TENSORBOARD_NAME = "08 - Vertex Pipelines w/ Tensorboard"  # @param {type:"string"}

if (
    TENSORBOARD_NAME == ""
    or TENSORBOARD_NAME is None
    or TENSORBOARD_NAME == "[your-tensorboard-name]"
):
    TENSORBOARD_NAME = PROJECT_ID + "-tb-" + UUID

tensorboard = aiplatform.Tensorboard.create(
    display_name=TENSORBOARD_NAME, project=PROJECT_ID, location=REGION
)
TENSORBOARD_RESOURCE_NAME = tensorboard.gca_resource.name
print("TensorBoard resource name:", TENSORBOARD_RESOURCE_NAME)

Creating Tensorboard
Create Tensorboard backing LRO: projects/998979163436/locations/us-central1/tensorboards/9192730846811914240/operations/3942188921508593664
Tensorboard created. Resource name: projects/998979163436/locations/us-central1/tensorboards/9192730846811914240
To use this Tensorboard in another session:
tb = aiplatform.Tensorboard('projects/998979163436/locations/us-central1/tensorboards/9192730846811914240')
TensorBoard resource name: projects/998979163436/locations/us-central1/tensorboards/9192730846811914240
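
The resource name printed above follows the pattern `projects/<project>/locations/<location>/tensorboards/<id>`. As a minimal sketch, the hypothetical helper below (not part of the Vertex AI SDK) splits such a name into its parts, which can be handy when building commands or checking that a name refers to the expected region:

```python
import re
from typing import Dict

# Pattern inferred from the resource name printed by the cell above.
_TENSORBOARD_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/tensorboards/(?P<tensorboard_id>[^/]+)$"
)


def parse_tensorboard_name(resource_name: str) -> Dict[str, str]:
    """Split a TensorBoard resource name into its components (hypothetical helper)."""
    match = _TENSORBOARD_NAME.match(resource_name)
    if match is None:
        raise ValueError(f"Unexpected TensorBoard resource name: {resource_name!r}")
    return match.groupdict()


parts = parse_tensorboard_name(
    "projects/998979163436/locations/us-central1/tensorboards/9192730846811914240"
)
```

Note that the `project` component here is the project number, not the project ID string set in `PROJECT_ID`.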


## Define Python function-based pipeline trainer component
In this demo, you define a function-based component to train the model.
The training code is wrapped as a KFP component that runs in Vertex AI Pipelines.

Your training code must be configured to write TensorBoard logs to the Cloud Storage bucket,
the location of which the Vertex AI Training service automatically makes available using
a predefined environment variable `AIP_TENSORBOARD_LOG_DIR`.

This can usually be done by providing `os.environ['AIP_TENSORBOARD_LOG_DIR']`
as the log directory to which open source TensorBoard logs are written.

For example, in TensorFlow 2.x, you can use the following code to create a `tensorboard_callback`:
```
tensorboard_callback = tf.keras.callbacks.TensorBoard(
  log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
  histogram_freq=1)
```
Then add the callback to `model.fit(...)`:
```
# previous things
model.compile(...)

tensorboard_callback = tf.keras.callbacks.TensorBoard(
  log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
  histogram_freq=1)
  
model.fit(dataset, epochs=10, callbacks=[tensorboard_callback])
```
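
When the same training code is run outside Vertex AI Training (for example, on a local machine), `AIP_TENSORBOARD_LOG_DIR` is typically not set and `os.environ[...]` raises a `KeyError`. The following is a minimal sketch of one way to handle that, assuming a local temporary directory is an acceptable fallback; `resolve_tb_log_dir` and the fallback folder name are hypothetical and not part of the original notebook:

```python
import os
import tempfile
from typing import Mapping, Optional


def resolve_tb_log_dir(
    env: Optional[Mapping[str, str]] = None,
    env_var: str = "AIP_TENSORBOARD_LOG_DIR",
) -> str:
    """Return the TensorBoard log directory, falling back to a local temp folder.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    log_dir = env.get(env_var)
    if log_dir:
        return log_dir
    # Assumed fallback for local runs; Vertex AI Training sets the variable itself.
    return os.path.join(tempfile.gettempdir(), "tensorboard-logs")


on_vertex = resolve_tb_log_dir({"AIP_TENSORBOARD_LOG_DIR": "gs://example-bucket/logs/"})
local_run = resolve_tb_log_dir({})
```

The returned value can then be passed as `log_dir` to the TensorBoard callback in place of the direct `os.environ` lookup.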

In [15]:
@component(
    base_image="tensorflow/tensorflow:latest",
    packages_to_install=["tensorflow_datasets"],
)
def trainer(tb_log_dir_env_var: str = "AIP_TENSORBOARD_LOG_DIR"):
    """Training component."""
    import logging
    import os

    import tensorflow as tf
    import tensorflow_datasets as tfds

    IMG_WIDTH = 128

    def normalize_img(image):
        """Normalizes image.

        * Resizes image to IMG_WIDTH x IMG_WIDTH pixels
        * Casts values from `uint8` to `float32`
        * Scales values from [0, 255] to [0, 1]

        Returns:
          A tensor with shape (IMG_WIDTH, IMG_WIDTH, 3). (3 color channels)
        """
        image = tf.image.resize_with_pad(image, IMG_WIDTH, IMG_WIDTH)
        return image / 255.0

    def normalize_img_and_label(image, label):
        """Normalizes image and label.

        * Performs normalize_img on image
        * Passes through label unchanged

        Returns:
          Tuple (image, label) where
          * image is a tensor with shape (IMG_WIDTH, IMG_WIDTH, 3). (3 color
            channels)
          * label is an unchanged integer [0, 4] representing flower type
        """
        return normalize_img(image), label

    if "AIP_MODEL_DIR" not in os.environ:
        raise KeyError(
            "The `AIP_MODEL_DIR` environment variable has not been "
            + "set. See https://cloud.google.com/ai-platform-unified/docs/tutorials/image-recognition-custom/training"
        )
    output_directory = os.environ["AIP_MODEL_DIR"]

    logging.info("Loading and preprocessing data ...")
    dataset = tfds.load(
        "tf_flowers:3.*.*",
        split="train",
        try_gcs=True,
        shuffle_files=True,
        as_supervised=True,
    )
    dataset = dataset.map(
        normalize_img_and_label, num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
    dataset = dataset.cache()
    dataset = dataset.shuffle(1000)
    dataset = dataset.batch(128)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

    logging.info("Creating and training model ...")
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Conv2D(
                16,
                3,
                padding="same",
                activation="relu",
                input_shape=(IMG_WIDTH, IMG_WIDTH, 3),
            ),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dense(5),  # 5 classes
        ]
    )
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    # Create a TensorBoard callback that writes to the Cloud Storage path provided by AIP_TENSORBOARD_LOG_DIR
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=os.environ[tb_log_dir_env_var], histogram_freq=1
    )

    # Train the model with tensorboard_callback
    model.fit(dataset, epochs=14, callbacks=[tensorboard_callback])

    logging.info(f"Exporting SavedModel to: {output_directory}")
    # Add a softmax layer so the exported model outputs probabilities instead of logits
    probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])
    probability_model.save(output_directory)
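
The trainer compiles the model with `from_logits=True`, so the final `Dense(5)` layer emits raw scores, and the softmax wrapper added before export turns those scores into class probabilities. As a minimal, framework-free sketch of what that softmax step computes (the logit values below are made up for illustration):

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the five flower classes.
probs = softmax([2.0, 1.0, 0.1, -1.0, 0.0])
predicted_class = probs.index(max(probs))
```

The ordering of the classes is unchanged by softmax; only the scale of the outputs differs, which makes them easier to read as confidence values.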

### Define a pipeline that uses your component

Next, define a pipeline that uses the component that was built in the previous section.

The `create_custom_training_job_op_from_component` function converts a given component into a custom training job (`CustomTrainingJobOp`) in Vertex AI.

In [16]:
@dsl.pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline. Used to determine the pipeline Context.
    name="tb-pipeline-integration",
)
def pipeline():
    custom_job_op = create_custom_training_job_op_from_component(
        trainer,
        tensorboard=TENSORBOARD_RESOURCE_NAME,
        base_output_directory=BASE_OUTPUT_DIR,
        service_account=SERVICE_ACCOUNT,
    )
    custom_job_op(project=PROJECT_ID, location=REGION)
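
The `base_output_directory` passed above determines where the training service points the `AIP_*` environment variables that the trainer reads. The sketch below assumes the documented Vertex AI convention of `model/`, `checkpoints/`, and `logs/` subfolders under the base output directory; `expected_training_dirs` is a hypothetical helper and the example URI is a placeholder:

```python
from typing import Dict


def expected_training_dirs(base_output_dir: str) -> Dict[str, str]:
    """Map AIP_* variable names to the paths expected under the base output directory.

    Assumes the model/, checkpoints/, logs/ layout; verify against your job's
    actual environment before relying on it.
    """
    base = base_output_dir.rstrip("/")
    return {
        "AIP_MODEL_DIR": f"{base}/model/",
        "AIP_CHECKPOINT_DIR": f"{base}/checkpoints/",
        "AIP_TENSORBOARD_LOG_DIR": f"{base}/logs/",
    }


# Placeholder URI for illustration only.
dirs = expected_training_dirs("gs://example-bucket/pipeline-output/run-abc123")
```

Knowing the expected layout makes it easier to locate the exported model and the TensorBoard event files in Cloud Storage after the run.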

## Compile the pipeline

Next, compile the pipeline.

In [17]:
from kfp.v2 import compiler  # noqa: F811

compiler.Compiler().compile(
    pipeline_func=pipeline, package_path="tensorboard-pipeline-integration.json"
)
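
Before submitting, it can be useful to confirm that the compiled file describes the pipeline you expect. The following is a rough sketch that assumes the compiled JSON stores the pipeline name under `pipelineInfo.name`, either at the top level or nested under a `pipelineSpec` key; the exact layout depends on the KFP version, so treat both the helper `pipeline_name_from_spec` and the sample dictionary as illustrative assumptions:

```python
from typing import Any, Dict, Optional


def pipeline_name_from_spec(compiled: Dict[str, Any]) -> Optional[str]:
    """Return the pipeline name from a compiled spec dict, or None if not found.

    Handles two assumed layouts: the spec at the top level, or wrapped
    under a "pipelineSpec" key.
    """
    spec = compiled.get("pipelineSpec", compiled)
    return spec.get("pipelineInfo", {}).get("name")


# Hand-written sample mimicking the assumed wrapped layout.
sample = {"pipelineSpec": {"pipelineInfo": {"name": "tb-pipeline-integration"}}}
name = pipeline_name_from_spec(sample)
```

In practice, the dictionary would come from loading the compiled file with `json.load`, for example `json.load(open("tensorboard-pipeline-integration.json"))`.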



## Run the pipeline

Next, run the pipeline.

In [18]:
DISPLAY_NAME = "tb-pipeline-integration_" + UUID

job = aiplatform.PipelineJob(
    display_name=DISPLAY_NAME,
    template_path="tensorboard-pipeline-integration.json",
    pipeline_root=PIPELINE_ROOT,
)

job.run()

! rm tensorboard-pipeline-integration.json

Creating PipelineJob
PipelineJob created. Resource name: projects/998979163436/locations/us-central1/pipelineJobs/tb-pipeline-integration-20240213050904
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/998979163436/locations/us-central1/pipelineJobs/tb-pipeline-integration-20240213050904')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/tb-pipeline-integration-20240213050904?project=998979163436
PipelineJob projects/998979163436/locations/us-central1/pipelineJobs/tb-pipeline-integration-20240213050904 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/998979163436/locations/us-central1/pipelineJobs/tb-pipeline-integration-20240213050904 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/998979163436/locations/us-central1/pipelineJobs/tb-pipeline-integration-20240213050904 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/99

## Check training logs

The Vertex AI TensorBoard web app provides a visualization of logs associated with a Vertex AI TensorBoard experiment. This web application offers several tools and dashboards to visualize and compare data across experiment runs. 

To learn more, see [View Vertex AI TensorBoard data](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-view).


## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, **if you created the individual resources in the notebook**, you can delete these resources as follows:

In [None]:
# Delete GCS bucket.
# ! gsutil -m rm -r {BUCKET_URI}

# Delete TensorBoard instance.
# ! gcloud ai tensorboards delete {TENSORBOARD_RESOURCE_NAME}

# Delete the pipeline job (`job` is the PipelineJob created above).
# job.delete()