In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Profile model training performance using Profiler

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/custom/custom_training_tensorboard_profiler.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/custom/custom_training_tensorboard_profiler.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/custom/custom_training_tensorboard_profiler.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

Vertex AI TensorBoard Profiler lets you monitor and optimize your model training performance by helping you understand the resource consumption of training operations. This tutorial demonstrates how to enable Vertex AI TensorBoard Profiler so you can debug model training performance for your custom training jobs.

Learn more about [Vertex AI TensorBoard Profiler](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-profiler).

### Objective

In this tutorial, you learn how to enable Vertex AI TensorBoard Profiler for custom training jobs.

This tutorial uses the following Google Cloud AI services:

- `Vertex AI Training`
- `Vertex AI TensorBoard`

The steps performed include:

- Set up a service account and a Cloud Storage bucket
- Create a TensorBoard instance
- Create and run a custom training job
- View the TensorBoard Profiler dashboard


### Dataset

The dataset used for this tutorial is the [MNIST dataset](https://www.tensorflow.org/datasets/catalog/mnist) from [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/overview).


### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook. 

In [None]:
! pip3 install --user --upgrade google-cloud-aiplatform --quiet

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). 

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type:"string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Set up a service account and permissions**

A service account is used to create custom training jobs. If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID. To create a service account, follow the [instructions](https://cloud.google.com/iam/docs/creating-managing-service-accounts#creating).

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
# Grant Cloud Storage permission.
! gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member="serviceAccount:$SERVICE_ACCOUNT" \
        --role="roles/storage.admin" \
        --quiet

In [None]:
# Grant Vertex AI permission.
! gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member="serviceAccount:$SERVICE_ACCOUNT" \
        --role="roles/aiplatform.user" \
        --quiet

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.
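
Cloud Storage bucket names are globally unique, so the placeholder in the next cell needs a name that nobody else has taken. The following is a minimal sketch of one way to generate such a name; `unique_name` is a hypothetical helper, not part of the Vertex AI SDK, and the prefix is only an example.

```python
import uuid


def unique_name(prefix: str) -> str:
    """Return the prefix plus a short random suffix (hypothetical helper)."""
    # uuid4().hex is 32 lowercase hex characters; 8 of them keep the name short.
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


# Example prefix only; replace it with a name that suits your project.
bucket_uri = "gs://" + unique_name("your-bucket-name")
print(bucket_uri)
```

The same helper could supply the TensorBoard display name and the job name used later, which only need to be distinguishable within your project.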

In [None]:
BUCKET_URI = "gs://your-bucket-name-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

### Import libraries

In [None]:
from google.cloud import aiplatform

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Enable Artifact Registry API

First, you must enable the Artifact Registry API service for your project.

Learn more about [enabling the service](https://cloud.google.com/artifact-registry/docs/enable-service).

In [None]:
! gcloud services enable artifactregistry.googleapis.com --quiet

### Create a TensorBoard instance

A Vertex AI TensorBoard instance is a regionalized resource that stores your Vertex AI TensorBoard experiments. You must create one before you can visualize experiments. You can create multiple instances in a project. To list your existing TensorBoard instances, run `gcloud ai tensorboards list`.
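
Because the instance is regional, its resource name carries the location. The sketch below assumes the name follows the pattern `projects/PROJECT/locations/REGION/tensorboards/ID`; the helper name and the sample values are hypothetical, so compare them with the name that the creation cell prints.

```python
import re

# Assumed shape of a TensorBoard resource name.
_TENSORBOARD_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/tensorboards/(?P<tensorboard_id>[^/]+)$"
)


def parse_tensorboard_name(resource_name: str) -> dict:
    """Split a resource name into project, location, and instance ID."""
    match = _TENSORBOARD_NAME.match(resource_name)
    if match is None:
        raise ValueError(f"Unexpected TensorBoard name: {resource_name!r}")
    return match.groupdict()


# Sample values are placeholders, not real resources.
parts = parse_tensorboard_name(
    "projects/123456789/locations/us-central1/tensorboards/987654321"
)
print(parts["location"])
```

A check like this can confirm that the instance lives in the same region as the training job before you submit it.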

#### Set your TensorBoard instance display name


In [None]:
TENSORBOARD_NAME = "your-tensorboard-unique"  # @param {type:"string"}

#### Create a TensorBoard instance

If you don't have a TensorBoard instance, create one by running the following cell:

In [None]:
tensorboard = aiplatform.Tensorboard.create(
    display_name=TENSORBOARD_NAME, project=PROJECT_ID, location=REGION
)

TENSORBOARD_INSTANCE_NAME = tensorboard.resource_name

print("TensorBoard instance name:", TENSORBOARD_INSTANCE_NAME)

## Train a model

To train a model using your custom training code, choose one of the following options:

- **Prebuilt container**: Load your custom training code as a Python package to a prebuilt container image from Google Cloud.

- **Custom container**: Create your own container image that contains your custom training code.

In this tutorial, you train a custom model using a custom container.

### Create a private Docker repository

Your first step is to create your own Docker repository in Google Artifact Registry.

In [None]:
DOCKER_REPOSITORY = f"{PROJECT_ID}-repo-unique"

! gcloud artifacts repositories create {DOCKER_REPOSITORY} \
    --repository-format=docker \
    --location={REGION} \
    --description="Repository for TensorBoard Custom Training Job" \
    --quiet

! gcloud artifacts repositories list

### Configure authentication to your private Docker repository

Before you push or pull container images, configure Docker to use the `gcloud` command-line tool to authenticate requests to `Artifact Registry` for your region.
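
The registry host that Docker authenticates against and the image URI used later in this tutorial share one naming scheme. The following sketch shows how the two relate; the function names and sample values are hypothetical, and the optional tag is an assumption (the tutorial itself leaves the tag off).

```python
from typing import Optional


def registry_host(region: str) -> str:
    """Return the regional Artifact Registry host for Docker images."""
    return f"{region}-docker.pkg.dev"


def image_uri(
    region: str,
    project_id: str,
    repository: str,
    image: str,
    tag: Optional[str] = None,
) -> str:
    """Compose HOST/PROJECT/REPOSITORY/IMAGE, with an optional :TAG suffix."""
    uri = f"{registry_host(region)}/{project_id}/{repository}/{image}"
    return f"{uri}:{tag}" if tag else uri


# Placeholder values for illustration only.
print(registry_host("us-central1"))
print(image_uri("us-central1", "my-project", "my-repo", "tensorboard-custom-container"))
```

Keeping the host in one place makes it harder for the authenticated registry and the pushed image to end up in different regions.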

In [None]:
import sys

IS_COLAB = "google.colab" in sys.modules

if not IS_COLAB:
    ! gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet

### Create a custom container image and push to your private Docker repository

First, you create a training script file and a Dockerfile.

Create a directory for all of your training code.

In [None]:
PYTHON_PACKAGE_APPLICATION_DIR = "trainer"

!mkdir -p $PYTHON_PACKAGE_APPLICATION_DIR

#### Prepare the training script

Your training code must write TensorBoard logs to a Cloud Storage location. Vertex AI Training automatically makes that location available through a predefined environment variable, `AIP_TENSORBOARD_LOG_DIR`.

This can usually be done by providing `os.environ['AIP_TENSORBOARD_LOG_DIR']` as the log directory to the open source TensorBoard log writing APIs. 

For example, in TensorFlow 2.x, you can use the following code to create a `tensorboard_callback`:

    tensorboard_callback = tf.keras.callbacks.TensorBoard(
      log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
      histogram_freq=1)

`AIP_TENSORBOARD_LOG_DIR` points to a location inside the `BASE_OUTPUT_DIR` that you provide when creating the custom training job.
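
Outside Vertex AI Training the variable isn't set, so reading it directly raises `KeyError`. The following minimal sketch adds a local fallback, much like the training script below does; `resolve_log_dir` and the sample path are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_log_dir(
    environ: Optional[Mapping[str, str]] = None, default: str = "logs"
) -> str:
    """Return the TensorBoard log directory, or a local default if unset."""
    environ = os.environ if environ is None else environ
    # An empty value is treated the same as a missing one.
    return environ.get("AIP_TENSORBOARD_LOG_DIR") or default


# Simulated environments; the Cloud Storage path is a placeholder.
print(resolve_log_dir({}))
print(resolve_log_dir({"AIP_TENSORBOARD_LOG_DIR": "gs://my-bucket/my-job/logs/"}))
```

Passing the environment as an argument keeps the function easy to exercise locally without changing real environment variables.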

To enable Vertex AI TensorBoard Profiler for your training job, add the following to your training script:

Add the `cloud_profiler` import to your top-level imports:

    from google.cloud.aiplatform.training_utils import cloud_profiler


Initialize the cloud_profiler plugin by adding:


    cloud_profiler.init()

In [None]:
%%writefile trainer/task.py

"""Train an MNIST model and use cloud_profiler for profiling."""

import argparse
import os
import sys
import traceback

import tensorflow as tf
from google.cloud.aiplatform.training_utils import cloud_profiler

def _create_model():
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10),
        ]
    )
    return model


def main(args):
    print('Loading and preprocessing data ...')
    mnist = tf.keras.datasets.mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    print('Creating and training model ...')

    model = _create_model()
    model.compile(
      optimizer="adam",
      # The final Dense layer outputs raw logits, so the loss must expect logits.
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=["accuracy"],
    )

    # Initialize the profiler.
    print('Initializing the profiler ...')

    try:
        cloud_profiler.init()
    except Exception:
        # Keep training even if the profiler cannot start; log the failure.
        ex_type, ex_value, ex_traceback = sys.exc_info()
        print("*** Unexpected:", ex_type.__name__, ex_value)
        traceback.print_tb(ex_traceback, limit=10, file=sys.stdout)
    else:
        print('The profiler initialized.')

    log_dir = "logs"
    if 'AIP_TENSORBOARD_LOG_DIR' in os.environ:
      log_dir = os.environ['AIP_TENSORBOARD_LOG_DIR']

    print('Setting up the TensorBoard callback ...')
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=log_dir,
        histogram_freq=1)

    print('Training model ...')
    model.fit(
        x_train,
        y_train,
        epochs=args.epochs,
        verbose=0,
        callbacks=[tensorboard_callback],
    )
    print('Training completed.')

    print('Saving model ...')

    model_dir = "model"
    if 'AIP_MODEL_DIR' in os.environ:
      model_dir = os.environ['AIP_MODEL_DIR']
    tf.saved_model.save(model, model_dir)

    print('Model saved at ' + model_dir)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--epochs", type=int, default=100, help="Number of epochs to run model."
    )
    
    args = parser.parse_args()
    main(args)

#### Prepare the Dockerfile


In [None]:
%%writefile Dockerfile
# Specifies base image and tag
FROM us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-9:latest
WORKDIR /root

# Installs the Vertex AI SDK with the cloud_profiler extra.
RUN pip3 install google-cloud-aiplatform[cloud_profiler]

# Copies the trainer code to the docker image.
RUN mkdir /root/trainer
COPY trainer/task.py /root/trainer/task.py

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

#### Build a custom container image and push to your private Docker repository

In [None]:
IMAGE_NAME = "tensorboard-custom-container"
IMAGE_URI = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/{IMAGE_NAME}"

! gcloud builds submit --project {PROJECT_ID} --region={REGION} --tag {IMAGE_URI} --timeout=60m --quiet

### Create and run the custom training job

Configure a [custom job](https://cloud.google.com/vertex-ai/docs/training/create-custom-job) with the custom container image.

In [None]:
JOB_NAME = "tensorboard-job-unique"

job = aiplatform.CustomContainerTrainingJob(
    display_name=JOB_NAME, container_uri=IMAGE_URI
)

#### Run the custom training job

Next, start the training job by calling the `run` method with the following parameters:

- `args`: The command-line arguments to pass to the training script.
   - `--epochs`: The number of epochs for training.
- `replica_count`: The number of compute instances for training (`replica_count=1` is single-node training).
- `machine_type`: The machine type for the compute instances.
- `base_output_dir`: The Cloud Storage location for the job's output, including the TensorBoard logs.
- `tensorboard`: The resource name of the TensorBoard instance.
- `service_account`: The service account.
- `sync`: Whether to block until completion of the job. The following cell doesn't set it, so the default (`True`) applies.
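
The `args` list reaches the container as ordinary command-line arguments. The following sketch builds the list from a dictionary and parses it the way the training script's `argparse` setup does; `build_training_args` and `parse_training_args` are hypothetical helper names.

```python
import argparse
from typing import Dict, List, Optional


def build_training_args(params: Dict[str, object]) -> List[str]:
    """Turn {"epochs": 2} into ["--epochs=2"]."""
    return [f"--{name}={value}" for name, value in params.items()]


def parse_training_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the flags that the training script in this tutorial accepts."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--epochs", type=int, default=100, help="Number of epochs to run model."
    )
    return parser.parse_args(argv)


training_args = build_training_args({"epochs": 2})
parsed = parse_training_args(training_args)
print(training_args, parsed.epochs)
```

Running the parser locally against the same list is a cheap way to catch a misspelled flag before you submit a billable job.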

In [None]:
base_output_dir = "{}/{}".format(BUCKET_URI, JOB_NAME)
MACHINE_TYPE = "n1-standard-4"
EPOCHS = 2
training_args = [
    "--epochs=" + str(EPOCHS),
]

job.run(
    args=training_args,
    replica_count=1,
    machine_type=MACHINE_TYPE,
    base_output_dir=base_output_dir,
    tensorboard=TENSORBOARD_INSTANCE_NAME,
    service_account=SERVICE_ACCOUNT,
)

## View the TensorBoard Profiler dashboard

When the custom job state switches to `Running`, you can access the Vertex AI TensorBoard Profiler dashboard through the Custom jobs page or the Experiments page on the Google Cloud console. 

The Google Cloud guide to [Profile model training performance using Profiler](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-profiler) provides detailed instructions for accessing the Vertex AI TensorBoard Profiler dashboard and capturing a profiling session. 


## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Docker repository
- Training job
- TensorBoard instance
- Cloud Storage bucket


In [None]:
delete_tensorboard = True
delete_bucket = False

# Delete docker repository.
! gcloud artifacts repositories delete $DOCKER_REPOSITORY --project {PROJECT_ID} --location {REGION} --quiet

# Delete the custom training job.
job.delete()

if delete_tensorboard:
    tensorboard.delete()

if delete_bucket and "BUCKET_URI" in globals():
    ! gsutil -m rm -r $BUCKET_URI