In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/vertex_endpoints/optimized_tensorflow_runtime/bert_optimized_online_prediction.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/vertex_endpoints/optimized_tensorflow_runtime/bert_optimized_online_prediction.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/community/vertex_endpoints/optimized_tensorflow_runtime/bert_optimized_online_prediction.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

# Fine-tuning a BERT base classification model and deploying it to Vertex AI Predictions using the optimized TensorFlow runtime

## Overview

In this sample you learn how to fine-tune a BERT base classification model for sentiment analysis.

Then you deploy the trained model to the Vertex AI Prediction service using both an open source based TensorFlow 2.7 container and the optimized TensorFlow runtime container, benchmark the deployed models, and compare their predictions.

For additional information about Vertex AI Prediction optimized TensorFlow runtime containers, see https://cloud.google.com/vertex-ai/docs/predictions/optimized-tensorflow-runtime.

### Dataset

This notebook trains a sentiment analysis model to classify movie reviews as positive or negative based on the text of the review.

You use the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), which contains 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/).


### Objective

In this notebook, you learn how to deploy a fine-tuned BERT classification model to Vertex AI Prediction using the optimized TensorFlow runtime. Next, you compare its performance to that of the same model served from an open source based TensorFlow container.

The steps you perform include:
* Download and preprocess the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/)
* Download the BERT base model from TensorFlow Hub
* Fine-tune the BERT classification model
* Deploy a model to Vertex AI Prediction using a TensorFlow 2.7 container
* Deploy a model to Vertex AI Prediction using an optimized TensorFlow runtime container
* Benchmark the models and validate their predictions

You can fine-tune the BERT model and upload it to Vertex AI Prediction using Colab. To get reliable benchmark results, however, run this notebook on a Jupyter VM in the same region as your model.

### Costs

This tutorial uses the following billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Set up your local development environment

**If you are using Colab or Vertex AI Workbench Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3
* GPU drivers and CUDA 11.2 installed

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

1. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3. Activate the virtual environment.

1. To install Jupyter, run `pip3 install jupyter` on the
command-line in a terminal shell.

1. To launch Jupyter, run `jupyter notebook` on the command-line in a terminal shell.

1. Open this notebook in the Jupyter Notebook Dashboard.

### Install additional packages

Install additional package dependencies not installed in your notebook environment, such as tensorflow, tensorflow-text, tensorflow serving APIs, and Vertex AI SDK. Use the latest major GA version of each package.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Vertex AI Workbench Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

In [None]:
! pip3 install {USER_FLAG} --upgrade tensorflow==2.7.0 -q
! pip3 install {USER_FLAG} --upgrade tensorflow-text==2.7.0 -q
! pip3 install {USER_FLAG} --upgrade tensorflow-serving-api==2.7.0 -q
! pip3 install {USER_FLAG} --upgrade tf-models-official==2.7.0 -q
! pip3 install {USER_FLAG} --upgrade google-cloud-aiplatform -q
! pip3 install {USER_FLAG} --upgrade google-cloud-storage -q

### Restart the kernel

After you install the additional packages, you must restart the notebook kernel so it can find the packages.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Select a GPU runtime

**Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select "Runtime > Change runtime type > GPU"**.

Note that to fine-tune the model on a GPU, your VM needs GPU drivers and CUDA 11.2 installed. You can use a [TensorFlow Enterprise 2.7](https://cloud.google.com/tensorflow-enterprise/docs/use-with-notebooks) user-managed notebook instance or Colab with a GPU runtime for this.

### Set up your Google Cloud project

**The following steps are required for all notebook environments.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute and storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

1. If you run this notebook locally, you must install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the following cell. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you can try to get your project ID using `gcloud`.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, create a timestamp for each instance session, then append it to the name of resources you create in this tutorial.
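
As a minimal sketch of that idea, a helper along these lines appends a timestamp to a base name. The `unique_name` function and the `bert-endpoint` base name are hypothetical and are not used elsewhere in this notebook.

```python
from datetime import datetime


def unique_name(base, now=None):
    """Append a second-resolution timestamp to a resource name."""
    now = now or datetime.now()
    return f"{base}-{now.strftime('%Y%m%d%H%M%S')}"


# A fixed datetime is passed here only to make the output reproducible.
print(unique_name("bert-endpoint", datetime(2022, 1, 31, 9, 5, 0)))
# → bert-endpoint-20220131090500
```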

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI"
   into the filter box, and select
   **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the following cell, then run the cell.

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# The Vertex AI Workbench Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Vertex AI Workbench Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket

**The following steps are required for all notebook environments.**

For Vertex AI Prediction to serve your model, you must first upload the model to a Cloud Storage bucket.

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. We suggest that you [choose a region where Vertex AI services are
available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions).

In [None]:
BUCKET_URI = "gs://[your-bucket-name]"  # @param {type:"string"}
REGION = "[your-region]"  # @param {type:"string"}

In [None]:
if BUCKET_URI == "" or BUCKET_URI is None or BUCKET_URI == "gs://[your-bucket-name]":
    BUCKET_URI = "gs://" + PROJECT_ID + "-aip-" + TIMESTAMP

if REGION == "[your-region]":
    REGION = "us-central1"

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

The final step for your Cloud Storage bucket is to validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

### Import libraries and define constants

In [None]:
import json
import os
import shutil
import subprocess
import sys

import numpy as np
import requests as r
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401
from official.nlp import optimization  # to create AdamW optimizer

r.packages.urllib3.disable_warnings()

logging = tf.get_logger()
logging.propagate = False
logging.setLevel("INFO")

In [None]:
LOCAL_DIRECTORY = "~/bert_classification"  # @param {type:"string"}
LOCAL_DIRECTORY_FULL = os.path.expanduser(LOCAL_DIRECTORY)

## Download the dataset

Download the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), which contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/).

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file(
    "aclImdb_v1.tar.gz", url, untar=True, cache_dir=".", cache_subdir=""
)

dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")

train_dir = os.path.join(dataset_dir, "train")

# remove unused folders to make it easier to load the data
remove_dir = os.path.join(train_dir, "unsup")
shutil.rmtree(remove_dir)

## Preprocess the dataset

The IMDB dataset has already been divided into train and test, but it lacks a validation set. To create a validation set, in the following cell use an 80:20 split of the training data by using the `validation_split` argument.
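
To make the split concrete, the arithmetic can be sketched as follows. This assumes that 25,000 labeled reviews remain in the training folder once the `unsup` folder is removed, and it uses the same batch size of 32 as the following cell.

```python
import math

# Assumption: 25,000 labeled training reviews remain after removing "unsup".
num_train_files = 25_000
validation_split = 0.2
batch_size = 32

# The validation subset takes 20% of the files; the rest are used for training.
num_val = int(validation_split * num_train_files)
num_train = num_train_files - num_val

# The last batch may be smaller than batch_size, so round up.
train_batches = math.ceil(num_train / batch_size)
val_batches = math.ceil(num_val / batch_size)

print(num_train, num_val)          # → 20000 5000
print(train_batches, val_batches)  # → 625 157
```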

In [None]:
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=seed,
)

class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=seed,
)

val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

Take a look at a few reviews.

In [None]:
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(f"Review: {text_batch.numpy()[i]}")
        label = label_batch.numpy()[i]
        print(f"Label : {label} ({class_names[label]})")

## Define BERT base classification model

As the base for your model, you use the uncased BERT-Base model from TensorFlow Hub:
https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3

In [None]:
bert_model_name = "bert_en_uncased_L-12_H-768_A-12"
tfhub_handle_encoder = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3"

To feed text input into this model, the data is preprocessed using the corresponding BERT preprocessor from TensorFlow Hub.

In [None]:
tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"

In [None]:
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)

Define a classification model by feeding the output of the BERT encoder into dropout and dense layers.
Learn more about [Classify text with BERT](https://www.tensorflow.org/text/tutorials/classify_text_with_bert).

In [None]:
def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name="preprocessing")
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name="BERT_encoder")
    outputs = encoder(encoder_inputs)
    net = outputs["pooled_output"]
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation=tf.sigmoid, name="classifier")(net)
    return tf.keras.Model(text_input, net)

In [None]:
classifier_model = build_classifier_model()

In [None]:
epochs = 3
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1 * num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(
    init_lr=init_lr,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    optimizer_type="adamw",
)

In [None]:
loss = tf.keras.losses.BinaryCrossentropy()
metrics = tf.metrics.BinaryAccuracy()

In [None]:
classifier_model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

Fine-tuning the model takes some time.
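
The optimizer created earlier takes `num_train_steps` and `num_warmup_steps`, which shape the learning rate over the course of training. As a rough sketch, assuming linear warmup followed by linear decay to zero (the exact schedule implemented by `optimization.create_optimizer` may differ), the learning rate at a given step could be modeled like this. The `learning_rate` function and the figure of 625 batches per epoch are illustrative assumptions.

```python
def learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    """Model an assumed schedule: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        # Warmup: ramp up from 0 toward init_lr.
        return init_lr * step / num_warmup_steps
    # Decay: measured against the full training run, reaching 0 at the end.
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / num_train_steps


# Same epochs and warmup fraction as the notebook, assuming 625 batches per epoch.
num_train_steps = 625 * 3
num_warmup_steps = int(0.1 * num_train_steps)

for step in (0, num_warmup_steps // 2, num_warmup_steps, num_train_steps):
    print(step, learning_rate(step, 3e-5, num_train_steps, num_warmup_steps))
```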

In [None]:
print(f"Training model with {tfhub_handle_encoder}")
history = classifier_model.fit(x=train_ds, validation_data=val_ds, epochs=epochs)

Evaluate the model. Its validation loss is expected to be ~0.43.

In [None]:
loss, accuracy = classifier_model.evaluate(test_ds)

print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

Export the model for inference.

In [None]:
!mkdir -p $LOCAL_DIRECTORY_FULL

In [None]:
classifier_model.save(LOCAL_DIRECTORY_FULL, include_optimizer=False)

Check the model signature to see which fields a prediction request must contain.

Note that you might see a stacktrace message about a missing 'CaseFoldUTF8' op. This is a known issue with `saved_model_cli` that you can ignore.

In [None]:
!saved_model_cli show --dir $LOCAL_DIRECTORY_FULL --all

## Generate prediction requests

Now you can generate requests to send to your model for inference. Requests are generated in the JSON Lines format, one request per line.
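
As a minimal illustration of the JSON Lines layout, the following sketch writes two requests to an in-memory buffer and reads them back. The helper names and the sample reviews are made up for this example; the notebook's own export function reads from the test dataset instead.

```python
import io
import json


def write_requests_jsonl(stream, batches):
    """Write one JSON object per line; each object is one prediction request."""
    for batch in batches:
        stream.write(json.dumps({"text": batch}))
        stream.write("\n")


def read_requests_jsonl(stream):
    """Parse every non-empty line as an independent JSON document."""
    return [json.loads(line) for line in stream if line.strip()]


# Illustrative reviews, not taken from the dataset.
batches = [["Great film"], ["Too long", "Loved the ending"]]
buffer = io.StringIO()
write_requests_jsonl(buffer, batches)

buffer.seek(0)
requests_read = read_requests_jsonl(buffer)
print(len(requests_read), requests_read[1]["text"])
# → 2 ['Too long', 'Loved the ending']
```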

In [None]:
!mkdir -p $LOCAL_DIRECTORY_FULL/requests

In [None]:
def encode(text):
    rows = []
    for row in text.numpy().tolist():
        rows.append(row.decode("utf-8"))

    return {"text": rows}


def export_requests_jsonl(file_name, rows=2, batch_size=32):
    with tf.io.gfile.GFile(file_name, mode="w") as f:
        for text in test_ds.unbatch().batch(batch_size).take(rows):
            d = encode(text[0])
            f.write(json.dumps(d))
            f.write("\n")

In [None]:
export_requests_jsonl(
    os.path.join(LOCAL_DIRECTORY_FULL, "requests", "requests_1_1.jsonl"),
    rows=1,
    batch_size=1,
)
export_requests_jsonl(
    os.path.join(LOCAL_DIRECTORY_FULL, "requests", "requests_1_32.jsonl"),
    rows=1,
    batch_size=32,
)
export_requests_jsonl(
    os.path.join(LOCAL_DIRECTORY_FULL, "requests", "requests_10_32.jsonl"),
    rows=10,
    batch_size=32,
)
export_requests_jsonl(
    os.path.join(LOCAL_DIRECTORY_FULL, "requests", "requests_100_32.jsonl"),
    rows=100,
    batch_size=32,
)
export_requests_jsonl(
    os.path.join(LOCAL_DIRECTORY_FULL, "requests", "requests_1000_32.jsonl"),
    rows=1000,
    batch_size=32,
)

In [None]:
!cat $LOCAL_DIRECTORY_FULL/requests/requests_1_1.jsonl

## (Optional) Generate warmup requests

The TensorFlow runtime has components that are lazily initialized. Lazy initialization might result in high latency for the first requests that are sent to a model after it's loaded. This latency can be several orders of magnitude higher than that of a single inference request.

For more information about SavedModel warmup, see https://www.tensorflow.org/tfx/serving/saved_model_warmup.

For Vertex AI Prediction using the optimized TensorFlow runtime, when the model is precompiled the first request for each new batch size has higher latency. Precompilation is enabled when the `allow_precompilation` flag is set to true.

To mitigate high latency, provide a warmup request for the runtime to load when it starts.
The warmup file should include the various batch sizes you expect your model to receive in production.

Note that providing a warmup request with multiple batch sizes increases the time for each node to start.

If you expect the model to receive multiple batch sizes, you can use automatic server-side request batching with a set of `allowed_batch_sizes`. For more information, see https://www.tensorflow.org/tfx/serving/serving_config#batching_configuration.

To enable auto-batching for a model running on Vertex AI Prediction, put your batching configuration into the [config/batching_parameters_config](https://cloud.google.com/vertex-ai/docs/training/exporting-model-artifacts#enable_server-side_request_batching_for_tensorflow) file in the same GCS directory as saved_model.pb.
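
As a hedged sketch, the contents of such a batching configuration can be rendered in protobuf text format as follows. The field names follow the TensorFlow Serving batching configuration; the helper `render_batching_config` and all of the values are illustrative assumptions, not tuned recommendations.

```python
def render_batching_config(allowed_batch_sizes, batch_timeout_micros, num_batch_threads):
    """Render a TensorFlow Serving batching config in protobuf text format."""
    sizes = sorted(allowed_batch_sizes)
    lines = [
        # The largest allowed batch size is reused as max_batch_size, because
        # the final allowed_batch_sizes entry must equal max_batch_size.
        f"max_batch_size {{ value: {sizes[-1]} }}",
        f"batch_timeout_micros {{ value: {batch_timeout_micros} }}",
        f"num_batch_threads {{ value: {num_batch_threads} }}",
    ]
    # allowed_batch_sizes is a repeated field, listed in increasing order.
    lines += [f"allowed_batch_sizes: {size}" for size in sizes]
    return "\n".join(lines) + "\n"


# Illustrative values only.
config = render_batching_config([1, 8, 32], batch_timeout_micros=1000, num_batch_threads=4)
print(config)
```

You would save the rendered text as `config/batching_parameters_config` before copying the model directory to Cloud Storage.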

In [None]:
!mkdir -p $LOCAL_DIRECTORY_FULL/assets.extra

In [None]:
from tensorflow_serving.apis import predict_pb2, prediction_log_pb2


def build_grpc_request(
    row_dict, model_name="default", signature_name="serving_default"
):
    """Generate gRPC inference request with payload."""

    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name
    for key, value in row_dict.items():
        proto = tf.make_tensor_proto(value)
        request.inputs[key].CopyFrom(proto)
    return request


def export_warmup_file(
    request_files, export_path, model_name="default", signature_name="serving_default"
):
    with tf.io.TFRecordWriter(export_path) as writer:
        for request_file_path in request_files:
            with open(request_file_path) as f:
                row_dict = json.loads(f.readline())
                request = build_grpc_request(row_dict, model_name, signature_name)
            log = prediction_log_pb2.PredictionLog(
                predict_log=prediction_log_pb2.PredictLog(request=request)
            )
            writer.write(log.SerializeToString())


export_warmup_file(
    [
        os.path.join(LOCAL_DIRECTORY_FULL, "requests", "requests_1_1.jsonl"),
        os.path.join(LOCAL_DIRECTORY_FULL, "requests", "requests_1_32.jsonl"),
    ],
    os.path.join(LOCAL_DIRECTORY_FULL, "assets.extra", "tf_serving_warmup_requests"),
)

## Deploy model to Vertex AI Prediction

To deploy a model to Vertex AI Prediction service, you must put it in a GCS bucket.

In [None]:
!gsutil rm -r $BUCKET_URI/*

In [None]:
!gsutil cp -r $LOCAL_DIRECTORY_FULL/* $BUCKET_URI

Import the Vertex AI Python client library into your notebook environment.

In [None]:
from google.cloud.aiplatform import gapic as aip

Define the node type to use for deployments. To learn about Vertex AI Prediction options, see [configure compute resources](https://cloud.google.com/vertex-ai/docs/predictions/configure-compute). 

In [None]:
DEPLOY_COMPUTE = "n1-standard-16"
DEPLOY_GPU = aip.AcceleratorType.NVIDIA_TESLA_T4

The Vertex AI Python client library works as a client/server model.

You use the following clients in this sample:
- Model Service for managing models.
- Endpoint Service for deployment.
- Prediction Service for serving.

In [None]:
API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"

client_options = {"api_endpoint": API_ENDPOINT}
model_service_client = aip.ModelServiceClient(client_options=client_options)
endpoint_service_client = aip.EndpointServiceClient(client_options=client_options)
prediction_service_client = aip.PredictionServiceClient(client_options=client_options)

### Upload models to Vertex AI Prediction

See [model_service.upload_model](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.services.model_service.ModelServiceClient#google_cloud_aiplatform_v1_services_model_service_ModelServiceClient_upload_model) documentation for details.


The `artifact_uri` argument must point to the GCS path where the `saved_model.pb` file for your model is located.

The `image_uri` argument specifies which Docker image to use. Here you upload the same model using the TF2.7 GPU and Vertex AI Prediction optimized TensorFlow runtime images.

In [None]:
tf27_gpu_model_dict = {
    "display_name": "BERT Base TF2.7 GPU model",
    "artifact_uri": BUCKET_URI,
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-7:latest",
    },
}
tf27_gpu_model = (
    model_service_client.upload_model(parent=PARENT, model=tf27_gpu_model_dict)
    .result(timeout=180)
    .model
)
tf27_gpu_model

To deploy the model using the Vertex AI Prediction optimized TensorFlow runtime, use the `us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest` container.

Two optimization options are applied to the model.
- *allow_precompilation* - turns on model precompilation for better performance. Note that model precompilation happens when the first request with a new batch size arrives, and the response for that request is sent after precompilation is complete. To mitigate this, specify a warmup file (see the warmup section earlier in this notebook). Model precompilation works for different kinds of models, and in most cases has a positive effect on performance. However, we recommend that you try it out for your model before you enable it in production.
- *allow_precision_affecting_optimizations* - enables precision-affecting optimizations. In some cases this makes the model run significantly faster at the cost of a small loss in model prediction quality. You should assess the precision impact on your model when using this optimization.

For the list of available optimized TensorFlow runtime containers and options, see https://cloud.google.com/vertex-ai/docs/predictions/optimized-tensorflow-runtime.
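
One simple way to assess the precision impact is to send the same inputs to the model deployed with and without precision-affecting optimizations and compare the scores. The following sketch assumes you have already collected the two score lists; the function name, the 0.5 decision threshold, and the numbers are made up for illustration and are not measured results.

```python
def compare_predictions(reference, candidate, threshold=0.5):
    """Return the largest score gap and the number of label disagreements."""
    if len(reference) != len(candidate):
        raise ValueError("prediction lists must have the same length")
    pairs = list(zip(reference, candidate))
    max_abs_diff = max(abs(a - b) for a, b in pairs)
    # A label flips when the two scores fall on different sides of the threshold.
    label_flips = sum((a >= threshold) != (b >= threshold) for a, b in pairs)
    return max_abs_diff, label_flips


# Made-up scores for illustration only.
lossless_scores = [0.98, 0.03, 0.51, 0.40]
lossy_scores = [0.97, 0.04, 0.49, 0.41]

max_abs_diff, label_flips = compare_predictions(lossless_scores, lossy_scores)
print(f"max abs diff: {max_abs_diff:.3f}, label flips: {label_flips}")
# → max abs diff: 0.020, label flips: 1
```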

In [None]:
tf_opt_gpu_model_dict = {
    "display_name": "BERT Base optimized TensorFlow runtime GPU model",
    "artifact_uri": BUCKET_URI,
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest",
        "args": [
            "--allow_precompilation=true",
            "--allow_precision_affecting_optimizations=false",
        ],
    },
}

tf_opt_gpu_model = (
    model_service_client.upload_model(parent=PARENT, model=tf_opt_gpu_model_dict)
    .result(timeout=180)
    .model
)
tf_opt_gpu_model

In [None]:
tf_opt_lossy_gpu_model_dict = {
    "display_name": "BERT Base optimized TensorFlow runtime GPU model with lossy optimizations",
    "artifact_uri": BUCKET_URI,
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest",
        "args": [
            "--allow_precompilation=true",
            "--allow_precision_affecting_optimizations=true",
        ],
    },
}

tf_opt_lossy_gpu_model = (
    model_service_client.upload_model(parent=PARENT, model=tf_opt_lossy_gpu_model_dict)
    .result(timeout=180)
    .model
)
tf_opt_lossy_gpu_model

List all models.

In [None]:
model_service_client.list_models(parent=PARENT)

### Create endpoints

Learn more about [endpoint_service.create_endpoint](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.services.endpoint_service.EndpointServiceClient#google_cloud_aiplatform_v1_services_endpoint_service_EndpointServiceClient_create_endpoint).

In [None]:
tf27_gpu_endpoint_dict = {
    "display_name": "BERT Base TF2.7 GPU endpoint",
}
tf27_gpu_endpoint = (
    endpoint_service_client.create_endpoint(
        parent=PARENT, endpoint=tf27_gpu_endpoint_dict
    )
    .result(timeout=300)
    .name
)
tf27_gpu_endpoint

In [None]:
tf_opt_gpu_endpoint_dict = {
    "display_name": "BERT Base optimized TensorFlow runtime GPU endpoint",
}
tf_opt_gpu_endpoint = (
    endpoint_service_client.create_endpoint(
        parent=PARENT, endpoint=tf_opt_gpu_endpoint_dict
    )
    .result(timeout=300)
    .name
)
tf_opt_gpu_endpoint

In [None]:
tf_opt_lossy_gpu_endpoint_dict = {
    "display_name": "BERT Base optimized TensorFlow runtime GPU with lossy optimizations endpoint",
}
tf_opt_lossy_gpu_endpoint = (
    endpoint_service_client.create_endpoint(
        parent=PARENT, endpoint=tf_opt_lossy_gpu_endpoint_dict
    )
    .result(timeout=300)
    .name
)
tf_opt_lossy_gpu_endpoint

### Deploy models to endpoints

Learn more about [endpoint_service.deploy_model](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1.services.endpoint_service.EndpointServiceClient#google_cloud_aiplatform_v1_services_endpoint_service_EndpointServiceClient_deploy_model).

In [None]:
tf27_gpu_deployed_model_dict = {
    "model": tf27_gpu_model,
    "display_name": "BERT Base TF2.7 GPU deployed model",
    "dedicated_resources": {
        "min_replica_count": 1,
        "max_replica_count": 1,
        "machine_spec": {
            "machine_type": DEPLOY_COMPUTE,
            "accelerator_type": DEPLOY_GPU,
            "accelerator_count": 1,
        },
    },
}

tf27_gpu_deployed_model = endpoint_service_client.deploy_model(
    endpoint=tf27_gpu_endpoint,
    deployed_model=tf27_gpu_deployed_model_dict,
    traffic_split={"0": 100},
).result()
tf27_gpu_deployed_model

In [None]:
tf_opt_gpu_deployed_model_dict = {
    "model": tf_opt_gpu_model,
    "display_name": "BERT Base optimized TensorFlow runtime GPU model",
    "dedicated_resources": {
        "min_replica_count": 1,
        "max_replica_count": 1,
        "machine_spec": {
            "machine_type": DEPLOY_COMPUTE,
            "accelerator_type": DEPLOY_GPU,
            "accelerator_count": 1,
        },
    },
}

tf_opt_gpu_deployed_model = endpoint_service_client.deploy_model(
    endpoint=tf_opt_gpu_endpoint,
    deployed_model=tf_opt_gpu_deployed_model_dict,
    traffic_split={"0": 100},
).result()
tf_opt_gpu_deployed_model

In [None]:
tf_opt_lossy_gpu_deployed_model_dict = {
    "model": tf_opt_lossy_gpu_model,
    "display_name": "BERT Base optimized TensorFlow runtime GPU model with lossy optimizations",
    "dedicated_resources": {
        "min_replica_count": 1,
        "max_replica_count": 1,
        "machine_spec": {
            "machine_type": DEPLOY_COMPUTE,
            "accelerator_type": DEPLOY_GPU,
            "accelerator_count": 1,
        },
    },
}

tf_opt_lossy_gpu_deployed_model = endpoint_service_client.deploy_model(
    endpoint=tf_opt_lossy_gpu_endpoint,
    deployed_model=tf_opt_lossy_gpu_deployed_model_dict,
    traffic_split={"0": 100},
).result()
tf_opt_lossy_gpu_deployed_model

### Send prediction requests

Now you can use the `prediction_service_client.predict` API to send prediction requests to your models.

In [None]:
prediction_service_client.predict(
    endpoint=tf27_gpu_endpoint,
    instances=["This was the best movie ever", "Movie was boring"],
)

Alternatively, you can send REST POST requests without using the SDK. To learn more, see https://cloud.google.com/vertex-ai/docs/predictions/online-predictions-custom-models#online_predict_custom_trained-drest.
This method is slightly faster than going through the client library.
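
The JSON body of a prediction response can be mapped back to class names. The sketch below assumes that the response contains a `predictions` list with one single-element list per instance (the sigmoid score), and that the class order is `neg`, `pos` as produced by the dataset loader. The sample response is hypothetical.

```python
import json


def labels_from_response(response_text, class_names=("neg", "pos"), threshold=0.5):
    """Map each sigmoid score in a predict response to a class name."""
    predictions = json.loads(response_text)["predictions"]
    labels = []
    for prediction in predictions:
        # Assumption: each prediction is a one-element list holding the score.
        score = prediction[0]
        labels.append((class_names[int(score >= threshold)], score))
    return labels


# Hypothetical response body; real scores depend on the fine-tuned model.
sample_response = '{"predictions": [[0.99], [0.02]]}'
print(labels_from_response(sample_response))
# → [('pos', 0.99), ('neg', 0.02)]
```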

In [None]:
def get_headers():
    gcloud_access_token = (
        subprocess.check_output("gcloud auth print-access-token".split(" "))
        .decode()
        .rstrip("\n")
    )
    return {"authorization": "Bearer " + gcloud_access_token}


def send_post_request(uri, request_dict):
    return r.post(
        uri, data=json.dumps(request_dict), headers=get_headers(), verify=False
    )


uri = f"https://{REGION}-aiplatform.googleapis.com/v1/{tf27_gpu_endpoint}:predict"
print(uri)

request = {
    "instances": [
        {"text": "This was the best movie ever"},
        {"text": "Movie was boring"},
    ]
}
response = send_post_request(uri, request)

print(response.text)

## (Optional) Benchmark deployed models

You can run the benchmarks from a Colab environment; however, to get reliable results, use a VM in the same region as your model.

Import helper functions for benchmarking models.

In [None]:
!curl https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/community/vertex_endpoints/optimized_tensorflow_runtime/benchmark.py -o benchmark.py

In [None]:
from benchmark import benchmark

This code sends a specified number of requests asynchronously and uniformly at a given QPS, then records the observed latency. Next, the latency results are aggregated and percentiles are calculated.
The `actual_qps` that the model can handle is calculated as the number of requests sent divided by the time it takes the model to process them.
By providing different implementations of the `send_request` and `build_request` functions, the same code can be used for benchmarking models running locally or on Vertex AI Prediction using the gRPC and REST protocols.
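
The following is a minimal sketch of the approach described above, not the code in `benchmark.py`. The names `run_load_test`, `percentile`, and `fake_send` are hypothetical, the thread pool size is an arbitrary choice, and the sleeping `fake_send` function stands in for a real prediction request.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def run_load_test(send_request, requests_list, target_qps, duration_s, max_workers=32):
    """Send requests at a uniform rate and summarize the observed latency."""
    interval = 1.0 / target_qps
    total = int(target_qps * duration_s)

    def timed_call(request):
        # Latency is measured from the moment a worker starts the call,
        # so time spent waiting in the thread pool queue is not included.
        start = time.perf_counter()
        send_request(request)
        return (time.perf_counter() - start) * 1000.0  # milliseconds

    futures = []
    test_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for i in range(total):
            # Wait until the scheduled send time of request i.
            delay = test_start + i * interval - time.perf_counter()
            if delay > 0:
                time.sleep(delay)
            request = requests_list[i % len(requests_list)]
            futures.append(pool.submit(timed_call, request))
        latencies = [future.result() for future in futures]
    elapsed = time.perf_counter() - test_start

    return {
        "target_qps": target_qps,
        # Requests completed per second of wall-clock time.
        "actual_qps": total / elapsed,
        "p50": percentile(latencies, 50),
        "p99": percentile(latencies, 99),
    }


def fake_send(request):
    """Hypothetical stand-in for a prediction call that takes about 10 ms."""
    time.sleep(0.01)


stats = run_load_test(fake_send, ["a", "b"], target_qps=20, duration_s=0.5)
print(stats)
```

Passing a different `send_request` callable is what would let the same loop drive a local model or a remote endpoint.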

The main goal of this benchmark is to measure model latency under different loads and the maximum throughput the model can handle. To find the maximum throughput, gradually increase the QPS until `actual_qps` stops increasing and latency increases dramatically.
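
One way to read the saturation point from a set of measurements is sketched below. The helper `find_saturation_qps` is hypothetical and is not part of `benchmark.py`. It assumes you have the list of requested QPS values and the matching `actual_qps` values, and the 5% tolerance is an arbitrary choice.

```python
def find_saturation_qps(target_qps, actual_qps, tolerance=0.05):
    """Return the highest requested QPS the model kept up with, or None.

    A load level counts as sustained when the measured QPS is within
    `tolerance` of the requested QPS. The scan stops at the first level
    that falls short.
    """
    sustained = None
    for target, actual in zip(target_qps, actual_qps):
        if actual >= target * (1 - tolerance):
            sustained = target
        else:
            break
    return sustained


# Made-up measurements: throughput flattens out at about 3 QPS.
requested = [1, 2, 3, 4, 5]
measured = [1.0, 2.0, 2.9, 3.1, 3.1]
print(find_saturation_qps(requested, measured))
```

A latency check could be added in the same loop, for example by stopping when the p99 latency exceeds a budget you choose.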

In a production deployment, the workload is not uniform, so the maximum model throughput is likely to be lower.
This benchmark does not try to simulate a production workload. It is meant to compare latency and throughput for the same model running in different environments.

In [None]:
def build_rest_request(row_dict, model_name):
    payload = json.dumps({"instances": row_dict})
    return payload

In [None]:
headers = get_headers()


def send_rest_request(request):
    res = r.post(
        f"https://{REGION}-aiplatform.googleapis.com/v1/{tf27_gpu_endpoint}:predict",
        data=request,
        headers=headers,
        verify=False,
    )
    assert res.status_code == 200
    return res


tf27_gpu_results = benchmark(
    send_rest_request,
    build_rest_request,
    f"{LOCAL_DIRECTORY_FULL}/requests/requests_10_32.jsonl",
    [1, 2, 3, 4, 5],
    5,
)

tf27_gpu_results

In [None]:
headers = get_headers()


def send_rest_request(request):
    res = r.post(
        f"https://{REGION}-aiplatform.googleapis.com/v1/{tf_opt_gpu_endpoint}:predict",
        data=request,
        headers=headers,
        verify=False,
    )
    assert res.status_code == 200
    return res


tf_opt_gpu_results = benchmark(
    send_rest_request,
    build_rest_request,
    f"{LOCAL_DIRECTORY_FULL}/requests/requests_10_32.jsonl",
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    5,
)

tf_opt_gpu_results

In [None]:
headers = get_headers()


def send_rest_request(request):
    res = r.post(
        f"https://{REGION}-aiplatform.googleapis.com/v1/{tf_opt_lossy_gpu_endpoint}:predict",
        data=request,
        headers=headers,
        verify=False,
    )
    assert res.status_code == 200
    return res


tf_opt_lossy_gpu_results = benchmark(
    send_rest_request,
    build_rest_request,
    f"{LOCAL_DIRECTORY_FULL}/requests/requests_10_32.jsonl",
    [1, 5, 10, 15, 20, 21, 22, 23, 24, 25],
    5,
)

tf_opt_lossy_gpu_results

Combine and visualize results.

In [None]:
import matplotlib
import matplotlib.pyplot as plt


def build_graph(x_key, y_key, results_dict, axis):
    matplotlib.rcParams["figure.figsize"] = [10.0, 7.0]

    fig, ax = plt.subplots(facecolor=(1, 1, 1))
    ax.set_xlabel("QPS")
    ax.set_ylabel("Latency(ms)")
    for title, results in results_dict.items():
        x = np.array(results[x_key])
        y = np.array(results[y_key])
        ax.plot(x, y, label=title)
    ax.legend()
    ax.axis(axis)
    ax.set_title(f"BERT base model {y_key} latency, batch size 32")
    return fig

In [None]:
fig = build_graph(
    "actual_qps",
    "p50",
    {
        "TF2.7 GPU": tf27_gpu_results,
        "TF opt GPU": tf_opt_gpu_results,
        "TF opt GPU lossy": tf_opt_lossy_gpu_results,
    },
    (0, 14, 0, 1000),
)
fig.savefig("bert_p50_latency_32.png", bbox_inches="tight")

In [None]:
fig = build_graph(
    "actual_qps",
    "p99",
    {
        "TF2.7 GPU": tf27_gpu_results,
        "TF opt GPU": tf_opt_gpu_results,
        "TF opt GPU lossy": tf_opt_lossy_gpu_results,
    },
    (0, 14, 0, 1000),
)
fig.savefig("bert_p99_latency_32.png", bbox_inches="tight")

You can see that the Vertex AI Prediction optimized TensorFlow runtime has significantly higher throughput and lower latency compared to TensorFlow 2.7.

## (Optional) Compare performance of deployed models using MLPerf Inference loadgen

MLPerf Inference is a benchmark suite for measuring how fast systems can run models in a variety of deployment scenarios. MLPerf is now an industry-standard way of measuring model performance. You can follow the instructions at https://github.com/tensorflow/tpu/tree/master/models/experimental/inference/load_test to run the MLPerf Inference benchmark for deployed models.

## (Optional) Compare prediction results

In this sample, the Vertex AI Prediction optimized TensorFlow runtime is used with the `allow_precision_affecting_optimizations` flag set to `true` to gain additional speedup. Now check how those optimizations affect prediction results.

Compare the predictions for 32,000 instances (1,000 requests with a batch size of 32) from a model running on the optimized TensorFlow runtime with lossy optimizations and from the same model running on TF2.7.

In [None]:
def get_predictions(endpoint, requests_file_path):
    responses = []

    with tf.io.gfile.GFile(requests_file_path, "r") as f:
        for line in f:
            row_dict = json.loads(line)
            response = prediction_service_client.predict(
                endpoint=endpoint,
                instances=row_dict["text"],
            )
            for prediction in response.predictions:
                responses.append(prediction[0])

    return np.array(responses)

In [None]:
tf27_gpu_predictions = get_predictions(
    tf27_gpu_endpoint, f"{LOCAL_DIRECTORY_FULL}/requests/requests_1000_32.jsonl"
)

In [None]:
tf_opt_lossy_gpu_predictions = get_predictions(
    tf_opt_lossy_gpu_endpoint, f"{LOCAL_DIRECTORY_FULL}/requests/requests_1000_32.jsonl"
)

In [None]:
np.average(tf_opt_lossy_gpu_predictions - tf27_gpu_predictions) * 100

In [None]:
np.max(np.abs(tf_opt_lossy_gpu_predictions - tf27_gpu_predictions)) * 100

You can see that the predictions differ by less than 0.01% on average. In the worst case, the difference is less than 1%.
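
If you also want to know whether any predicted label changes, a comparison along the following lines may help. This is a sketch under assumptions: `compare_predictions` is a hypothetical helper, the scores are assumed to be probabilities, the 0.5 decision threshold is an assumption about how scores map to labels, and the sample values are made up.

```python
from statistics import fmean


def compare_predictions(baseline, candidate, threshold=0.5):
    """Summarize how two equally long sequences of scores differ."""
    if len(baseline) != len(candidate):
        raise ValueError("both sequences must have the same length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    # Count instances whose label would change at the assumed threshold.
    flips = sum(
        (b >= threshold) != (c >= threshold) for b, c in zip(baseline, candidate)
    )
    return {
        "mean_diff_pct": fmean(diffs) * 100,
        "max_abs_diff_pct": max(abs(d) for d in diffs) * 100,
        "label_flips": flips,
    }


# Made-up scores for illustration only.
baseline_scores = [0.91, 0.12, 0.49, 0.75]
candidate_scores = [0.90, 0.13, 0.51, 0.75]
summary = compare_predictions(baseline_scores, candidate_scores)
print(summary)
```

To apply it to the arrays computed above, you could pass `tf27_gpu_predictions.tolist()` and `tf_opt_lossy_gpu_predictions.tolist()`.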

## Cleanup

After you are done, you can safely delete the endpoints you created and the models you deployed.

In [None]:
def cleanup(endpoint, model_name, deployed_model_id):
    response = endpoint_service_client.undeploy_model(
        endpoint=endpoint, deployed_model_id=deployed_model_id
    )
    print("running undeploy_model operation:", response.operation.name)
    print(response.result())

    response = endpoint_service_client.delete_endpoint(name=endpoint)
    print("running delete_endpoint operation:", response.operation.name)
    print(response.result())

    response = model_service_client.delete_model(name=model_name)
    print("running delete_model operation:", response.operation.name)
    print(response.result())

In [None]:
cleanup(tf27_gpu_endpoint, tf27_gpu_model, tf27_gpu_deployed_model.deployed_model.id)
cleanup(
    tf_opt_gpu_endpoint, tf_opt_gpu_model, tf_opt_gpu_deployed_model.deployed_model.id
)
cleanup(
    tf_opt_lossy_gpu_endpoint,
    tf_opt_lossy_gpu_model,
    tf_opt_lossy_gpu_deployed_model.deployed_model.id,
)

You can now also remove the model from the Cloud Storage bucket.

In [None]:
# Set this to True only if you'd like to delete your bucket
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    !gsutil rm -r $BUCKET_URI