In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI TensorBoard Custom Training with Custom Container

<table align="left">
  <td>
    <a href="https://console.cloud.google.com/ai-platform/notebooks/deploy-notebook?name=Model%20Monitoring&download_url=https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmaster%2Fnotebooks%2Fcommunity%2Ftensorboard%2Fvertex_tensorboard_custom_training_with_custom_container.ipynb">
       <img src="https://www.gstatic.com/cloud/images/navigation/vertex-ai.svg" alt="Google Cloud Notebooks">Open in Cloud Notebook
    </a>
  </td> 
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/community/tensorboard/vertex_tensorboard_custom_training_with_custom_container.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Open in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/community/tensorboard/vertex_tensorboard_custom_training_with_custom_container.ipynb">
        <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

## Overview

### What is Vertex AI TensorBoard

[Open source TensorBoard](https://www.tensorflow.org/tensorboard/get_started)
(TB) is a Google open source project for machine learning experiment
visualization. Vertex AI TensorBoard is an enterprise-ready managed
version of TensorBoard.

Vertex AI TensorBoard provides various detailed visualizations,
including:

*   Tracking and visualizing metrics such as loss and accuracy over time
*   Visualizing model computational graphs (ops and layers)
*   Viewing histograms of weights, biases, or other tensors as they change over time
*   Projecting embeddings to a lower dimensional space
*   Displaying image, text, and audio samples

In addition to the powerful visualizations from
TensorBoard, Vertex AI TensorBoard provides:

*  A persistent, shareable link to your experiment's dashboard

*  A searchable list of all experiments in a project

*  Tight integrations with Vertex AI services for model training

*  Enterprise-grade security, privacy, and compliance

With Vertex AI TensorBoard, you can track, visualize, and compare
ML experiments and share them with your team.


### Dataset

This tutorial uses the [flower dataset](https://www.tensorflow.org/datasets/catalog/tf_flowers) provided by TensorFlow. No other datasets are required.

### Objective

In this tutorial, you learn how to create a custom training job using custom containers, and monitor your training process on Vertex AI TensorBoard in near real time.

The steps performed include:

* Create and configure a Docker repository.
* Create a custom container image with your customized training code.
* Set up a service account and a Cloud Storage bucket.
* Create and launch your custom training job with your custom container.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* Google Artifact Registry

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and [Google Artifact Registry pricing](https://cloud.google.com/artifact-registry/pricing).

Use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

### Set up your local development environment

**If you are using Colab or Vertex AI Workbench**, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

- The Google Cloud SDK
- Git
- Python 3
- virtualenv
- Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development environment](https://cloud.google.com/python/setup) and the [Jupyter installation guide](https://jupyter.org/install) provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:

1. [Install and initialize the SDK](https://cloud.google.com/sdk/docs/).

2. [Install Python 3](https://cloud.google.com/python/setup#installing_python).

3. [Install virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv) and create a virtual environment that uses Python 3.  Activate the virtual environment.

4. To install Jupyter, run `pip3 install jupyter` in a terminal shell.

5. To launch Jupyter, run `jupyter notebook` in a terminal shell.

6. Open this notebook in the Jupyter Notebook Dashboard.

### Install additional packages

Install the following packages required to execute this notebook.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install google-cloud-aiplatform {USER_FLAG} -q

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)

3. [Enable the Vertex AI, Cloud Storage, Cloud Build, and Artifact Registry APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,storage-component.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com)

4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk).


5. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

Note: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables that are prefixed with `$` or wrapped in `{}`.

### Set your project ID

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

### Set your region

In [None]:
REGION = "us-central1"  # @param {type:"string"}

### Log in to your Google Cloud account

In [None]:
# The Google Cloud Notebook product has specific requirements
import os
import sys

IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS

### Import aiplatform

In [None]:
import google.cloud.aiplatform as aiplatform

## Create Docker Repository

Create a Docker repository named `DOCKER_REPOSITORY` in your `REGION`.
This Docker repository is deleted in the cleaning up section at the end.

In [None]:
DOCKER_REPOSITORY = "tb-custom-docker-repo"  # @param {type:"string"}

In [None]:
! gcloud artifacts repositories create $DOCKER_REPOSITORY --project={PROJECT_ID} \
--repository-format=docker \
--location={REGION} --description="Repository for Tensorboard Custom Training Job"
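
Outside a notebook, the same repository could be created from plain Python. The following is a minimal sketch, assuming the `gcloud` CLI is installed and authenticated. The helper name `create_repo_command` and the project ID are hypothetical; the flags are the ones used in the cell above.

```python
import shlex


def create_repo_command(repository, project_id, region, description):
    """Builds the argv list for `gcloud artifacts repositories create`."""
    return [
        "gcloud", "artifacts", "repositories", "create", repository,
        f"--project={project_id}",
        "--repository-format=docker",
        f"--location={region}",
        f"--description={description}",
    ]


cmd = create_repo_command(
    "tb-custom-docker-repo",
    "my-project",  # placeholder project ID (assumption)
    "us-central1",
    "Repository for Tensorboard Custom Training Job",
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

To actually run the command, pass the list to `subprocess.run(cmd, check=True)`. Building the arguments as a list avoids quoting problems with the description text.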

Verify your Docker repository is created successfully.

In [None]:
! gcloud artifacts repositories list --project={PROJECT_ID}

## Create a Custom Container Image and Push to Artifact Registry


In [None]:
# Create a folder for the image.
! mkdir tb-custom-container
%cd tb-custom-container

Write your training code in a `task.py` file. You can use the following code as an example.

In [None]:
%%writefile task.py

import logging
import os

import tensorflow as tf
import tensorflow_datasets as tfds

IMG_WIDTH = 128

def normalize_img(image):
    """Normalizes image.

    * Resizes image to IMG_WIDTH x IMG_WIDTH pixels
    * Casts values from `uint8` to `float32`
    * Scales values from [0, 255] to [0, 1]

    Returns:
      A tensor with shape (IMG_WIDTH, IMG_WIDTH, 3). (3 color channels)
    """
    image = tf.image.resize_with_pad(image, IMG_WIDTH, IMG_WIDTH)
    return image / 255.


def normalize_img_and_label(image, label):
    """Normalizes image and label.

    * Performs normalize_img on image
    * Passes through label unchanged

    Returns:
      Tuple (image, label) where
      * image is a tensor with shape (IMG_WIDTH, IMG_WIDTH, 3). (3 color
        channels)
      * label is an unchanged integer [0, 4] representing flower type
    """
    return normalize_img(image), label

logging.info('Loading and preprocessing data ...')
dataset = tfds.load('tf_flowers:3.*.*',
                    split='train',
                    try_gcs=True,
                    shuffle_files=True,
                    as_supervised=True)
dataset = dataset.map(normalize_img_and_label,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.cache()
dataset = dataset.shuffle(1000)
dataset = dataset.batch(128)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

logging.info('Creating and training model ...')

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16,
                           3,
                           padding='same',
                           activation='relu',
                           input_shape=(IMG_WIDTH, IMG_WIDTH, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(5)  # 5 classes
])

logging.info('Compiling model ...')
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

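# Vertex AI sets AIP_TENSORBOARD_LOG_DIR when the job specifies a TensorBoard
# instance. Fall back to a local directory when running elsewhere.
log_dir = "logs"
if 'AIP_TENSORBOARD_LOG_DIR' in os.environ:
    log_dir = os.environ['AIP_TENSORBOARD_LOG_DIR']

tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1)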

logging.info('Training model ...')
model.fit(dataset, epochs=13, callbacks=[tensorboard_callback])

logging.info('Model training done')
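
The log directory selection in `task.py` can be checked on its own, without TensorFlow. The following is a minimal sketch that mirrors the script: use `AIP_TENSORBOARD_LOG_DIR` when it is set, and a local `logs` directory otherwise. The helper name `resolve_log_dir` and the example bucket path are hypothetical.

```python
import os


def resolve_log_dir(env=None, default="logs"):
    """Returns the TensorBoard log directory for the current environment."""
    # Accept an explicit mapping so the logic can be tested without
    # modifying the real process environment.
    env = os.environ if env is None else env
    return env.get("AIP_TENSORBOARD_LOG_DIR", default)


# Local run: the variable is not set, so the default is used.
print(resolve_log_dir({}))
# Simulated training job: the variable points at a Cloud Storage path.
print(resolve_log_dir({"AIP_TENSORBOARD_LOG_DIR": "gs://example-bucket/job/logs"}))
```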

Create your own `Dockerfile` to specify all instructions needed to build your container. You can use the following `Dockerfile` as an example.

In [None]:
%%writefile Dockerfile

# Specifies base image and tag
FROM us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-3:latest
WORKDIR /root


# Installs additional packages as you need.

# Copies the trainer code to the docker image.
COPY task.py /root/task.py

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "task.py"]

Build your container image using `gcloud builds` from your training code and `Dockerfile`. Note that this step may take a few minutes.

In [None]:
IMAGE_NAME = "tensorboard-custom-container"
IMAGE_TAG = "v2"
# The Artifact Registry host must match the region of the Docker repository.
IMAGE_URI = "{}-docker.pkg.dev/{}/{}/{}:{}".format(
    REGION, PROJECT_ID, DOCKER_REPOSITORY, IMAGE_NAME, IMAGE_TAG
)

! gcloud builds submit --project {PROJECT_ID} --region={REGION} --tag {IMAGE_URI}
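
Artifact Registry Docker image URIs follow the pattern `LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG`, so the host part depends on the region of the repository. The following small sketch builds the URI from its parts. The helper name `build_image_uri` and the project ID are hypothetical.

```python
def build_image_uri(region, project_id, repository, image_name, tag):
    """Builds an Artifact Registry Docker image URI from its components."""
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image_name}:{tag}"


image_uri = build_image_uri(
    region="us-central1",
    project_id="my-project",  # placeholder project ID (assumption)
    repository="tb-custom-docker-repo",
    image_name="tensorboard-custom-container",
    tag="v2",
)
print(image_uri)
```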

## Set Up Service Account and Permissions

Create the service account and grant permissions for Vertex AI and Cloud Storage.

In [None]:
USER_SA_NAME = "your-service-account-name"  # @param {type:"string"}
SA_EMAIL = "{}@{}.iam.gserviceaccount.com".format(USER_SA_NAME, PROJECT_ID)
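
The service account email and the IAM member string used in the following cells are both derived from the account name. The sketch below builds them and checks the name first. The validation rule (6 to 30 characters; lowercase letters, digits, and hyphens; starting with a letter) is an assumption based on the service account ID constraints in the IAM documentation, so verify it before relying on it. The helper names and example values are hypothetical.

```python
import re

# Assumed naming rule: 6-30 chars, lowercase letters, digits, and hyphens,
# starting with a letter and ending with a letter or digit.
_SA_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def service_account_email(name, project_id):
    """Returns the email address of a user-managed service account."""
    if not _SA_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid service account name: {name!r}")
    return f"{name}@{project_id}.iam.gserviceaccount.com"


def iam_member(email):
    """Returns the member string expected by `add-iam-policy-binding`."""
    return f"serviceAccount:{email}"


email = service_account_email("tb-training-sa", "my-project")
print(email)
print(iam_member(email))
```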

In [None]:
# Create service account.
! gcloud --project={PROJECT_ID} iam service-accounts create {USER_SA_NAME}

In [None]:
# Grant Cloud Storage permission.
! gcloud projects add-iam-policy-binding {PROJECT_ID} \
   --member=serviceAccount:{SA_EMAIL} \
   --role=roles/storage.admin

In [None]:
# Grant Vertex AI permission.
! gcloud projects add-iam-policy-binding {PROJECT_ID} \
   --member=serviceAccount:{SA_EMAIL} \
   --role=roles/aiplatform.user

## Create Cloud Storage Bucket

A Cloud Storage bucket stores the output of your training code (including TensorBoard logs). The bucket must be regional (that is, not multi-region or dual-region), and the following resources must be in the same region:

* the Cloud Storage bucket
* the Vertex AI training job
* the Vertex AI TensorBoard instance

The bucket is deleted in the cleaning up section at the end.
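
Because the bucket name is built from the project ID and the region, it can exceed the Cloud Storage limit of 63 characters for names without dots. The following sketch checks the generated name before the bucket is created. The helper name `output_bucket_name` and the project IDs are hypothetical, and the check covers only the length and the allowed character set, not every naming rule.

```python
import re

# 3-63 characters; lowercase letters, digits, hyphens, and underscores;
# must start and end with a letter or digit. Names with dots are not handled.
_BUCKET_PATTERN = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def output_bucket_name(project_id, region):
    """Builds the output bucket name used in this tutorial and validates it."""
    name = "{}-tensorboard-custom-container-output-{}".format(project_id, region)
    if not _BUCKET_PATTERN.match(name):
        raise ValueError(f"Invalid bucket name ({len(name)} characters): {name}")
    return name


print(output_bucket_name("my-project", "us-central1"))
```

If the check fails for your project ID, choose a shorter fixed prefix for `GCS_BUCKET_OUTPUT`.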

In [None]:
GCS_BUCKET_OUTPUT = "{}-tensorboard-custom-container-output-{}".format(
    PROJECT_ID, REGION
)
! gsutil mb -p {PROJECT_ID} -l {REGION} gs://$GCS_BUCKET_OUTPUT

## Create a Custom Training Job with Your Container

Set up the regional Vertex AI API endpoint.

In [None]:
ENDPOINT = "{}-aiplatform.googleapis.com".format(REGION)

If there is no existing TensorBoard instance for this project and region, create one.

In [None]:
TENSORBOARD_DISPLAY_NAME = "your-tensorboard-display-name"  # @param {type:"string"}

tensorboard = aiplatform.Tensorboard.create(
    display_name=TENSORBOARD_DISPLAY_NAME, project=PROJECT_ID, location=REGION
)
tensorboard_resource_name = tensorboard.gca_resource.name
print("TensorBoard resource name:", tensorboard_resource_name)

If you already have a TensorBoard instance for this `PROJECT_ID` and `REGION`, you can get your `Tensorboard_ID` either from the Google Cloud Console (Vertex AI > Experiments > Tensorboard Instance) or with the code below:

In [None]:
TENSORBOARDS = aiplatform.Tensorboard.list(project=PROJECT_ID, location=REGION)
print(TENSORBOARDS)

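Prepare `TENSORBOARD_INSTANCE_NAME`. The cell below uses the first instance in the list; pick a different index if the project has several instances.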

In [None]:
TENSORBOARD_INSTANCE_NAME = aiplatform.Tensorboard.list(
    project=PROJECT_ID, location=REGION
)[0].resource_name
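
A TensorBoard resource name has the form `projects/PROJECT/locations/REGION/tensorboards/TENSORBOARD_ID`. If you only have the ID from the console, or need the ID from a full name, a small helper can convert between the two. This is a sketch; the function names and the example values are hypothetical.

```python
import re

_TENSORBOARD_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/tensorboards/(?P<tensorboard_id>[^/]+)$"
)


def parse_tensorboard_name(resource_name):
    """Splits a TensorBoard resource name into its components."""
    match = _TENSORBOARD_NAME.match(resource_name)
    if match is None:
        raise ValueError(f"Not a TensorBoard resource name: {resource_name!r}")
    return match.groupdict()


def tensorboard_name(project, location, tensorboard_id):
    """Builds a TensorBoard resource name from its components."""
    return f"projects/{project}/locations/{location}/tensorboards/{tensorboard_id}"


# Example values only; real IDs come from the console or the list call.
name = tensorboard_name("123456789", "us-central1", "987654321")
print(parse_tensorboard_name(name))
```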

Run the following example request to create a custom training job that uses the container you just built and pushed to Artifact Registry and streams the training results to TensorBoard.

In [None]:
from datetime import datetime

INVOCATION_TIMESTAMP = datetime.now().strftime("%Y%m%d-%H%M%S")
JOB_NAME = "tensorboard-example-job-{}".format(INVOCATION_TIMESTAMP)
BASE_OUTPUT_DIR = "gs://{}/{}".format(GCS_BUCKET_OUTPUT, JOB_NAME)

# Vertex AI services require regional API endpoints.
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
# Initialize client that will be used to create and send requests.
# This client only needs to be created once, and can be reused for multiple requests.
client = aiplatform.gapic.JobServiceClient(client_options=client_options)
custom_job = {
    "display_name": JOB_NAME,
    "job_spec": {
        "worker_pool_specs": [
            {
                "machine_spec": {
                    "machine_type": "n1-standard-8",
                },
                "replica_count": 1,
                "container_spec": {
                    "image_uri": IMAGE_URI,
                },
            }
        ],
        "service_account": SA_EMAIL,
        "tensorboard": TENSORBOARD_INSTANCE_NAME,
        "base_output_directory": {"output_uri_prefix": BASE_OUTPUT_DIR},
    },
}
parent = f"projects/{PROJECT_ID}/locations/{REGION}"
response = client.create_custom_job(parent=parent, custom_job=custom_job)
print("response:", response)
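
The request body above can be factored into a helper so that the same structure is reused across jobs. The sketch below only builds the dictionary and does not call the API. The function name `build_custom_job` and the example values are hypothetical; the field names are the ones used in the cell above.

```python
from datetime import datetime


def build_custom_job(display_name, image_uri, service_account, tensorboard,
                     base_output_dir, machine_type="n1-standard-8",
                     replica_count=1):
    """Builds a custom job request body with a single worker pool."""
    return {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {"machine_type": machine_type},
                    "replica_count": replica_count,
                    "container_spec": {"image_uri": image_uri},
                }
            ],
            "service_account": service_account,
            "tensorboard": tensorboard,
            "base_output_directory": {"output_uri_prefix": base_output_dir},
        },
    }


# A fixed timestamp keeps the example reproducible; use datetime.now() in practice.
timestamp = datetime(2022, 1, 31, 12, 0, 0).strftime("%Y%m%d-%H%M%S")
job_name = f"tensorboard-example-job-{timestamp}"
job = build_custom_job(
    display_name=job_name,
    image_uri="us-central1-docker.pkg.dev/my-project/tb-custom-docker-repo/tensorboard-custom-container:v2",
    service_account="tb-training-sa@my-project.iam.gserviceaccount.com",
    tensorboard="projects/123456789/locations/us-central1/tensorboards/987654321",
    base_output_dir=f"gs://example-bucket/{job_name}",
)
print(job["display_name"])
```

The resulting dictionary can be passed as the `custom_job` argument of `client.create_custom_job`, as in the cell above.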

In the Google Cloud Console, you can monitor your training job at Vertex AI > Training > Custom Jobs. For each custom training job, the `OPEN TENSORBOARD` button opens a TensorBoard that updates in near real time.

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
# Delete GCS bucket.
! gsutil -m rm -r gs://{GCS_BUCKET_OUTPUT}

# Delete docker repository.
! gcloud artifacts repositories delete $DOCKER_REPOSITORY --project {PROJECT_ID} --location {REGION}