In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Deploying T5x base on Vertex AI Predictions using the optimized TensorFlow runtime

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/vertex_endpoints/optimized_tensorflow_runtime/t5x_base_optimized_online_prediction.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/vertex_endpoints/optimized_tensorflow_runtime/t5x_base_optimized_online_prediction.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/community/vertex_endpoints/optimized_tensorflow_runtime/t5x_base_optimized_online_prediction.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

In this sample, you learn how to deploy a T5x base model to Vertex AI Prediction using the optimized TensorFlow runtime containers.

You then evaluate model performance under the different optimizations available in the optimized TensorFlow runtime containers, using the MLPerf Inference benchmark tool for Vertex AI Prediction.

For additional information about Vertex AI Prediction optimized TensorFlow runtime containers, see https://cloud.google.com/vertex-ai/docs/predictions/optimized-tensorflow-runtime.

### Objective

In this notebook, you learn how to deploy a fine-tuned T5x base model to Vertex AI Prediction service using the optimized TensorFlow runtime. For the best performance you can use NVIDIA A100 GPUs.

The steps you perform include:
* Learn how to fine-tune a T5x base model on Vertex AI
* Deploy a T5x base model to Vertex AI Prediction using an optimized TensorFlow runtime container with different optimization options
* Benchmark the deployed models and validate their predictions

You can deploy the fine-tuned model to Vertex AI Prediction from Colab. However, to get reliable benchmark results, run this walkthrough on a Jupyter VM in the same region as your model.

### Model

In this notebook you use a T5x base model. 

T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g., for translation: translate English to German: …, for summarization: summarize: ….
The model can be fine-tuned further for specific tasks that it was not originally trained on.
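
The prefix convention described above can be sketched in a few lines of Python. The helper `with_task_prefix` and the `TASK_PREFIXES` mapping are hypothetical names introduced only for illustration; the two prefix strings are the ones quoted in the paragraph above.

```python
# Hypothetical mapping from a task name to the text prefix T5 expects.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}


def with_task_prefix(task, text):
    """Prepend the prefix that tells T5 which task to perform on the text."""
    return TASK_PREFIXES[task] + text


print(with_task_prefix("translate_en_de", "this is good"))
```

The model itself sees only a single string, so choosing a task is purely a matter of how the input text is built.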

T5X is the new and improved implementation of T5 in JAX and Flax.
After the model is fine-tuned, it can be exported in the TensorFlow SavedModel format and served on Vertex AI Prediction using the optimized TensorFlow runtime.

### Costs

This tutorial uses the following billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* Cloud TPU (if you choose to fine-tune model on your own)

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Set up your local development environment

**If you are using Colab or Vertex AI Workbench Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

1. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3. Activate the virtual environment.

1. To install Jupyter, run `pip3 install jupyter` on the
command-line in a terminal shell.

1. To launch Jupyter, run `jupyter notebook` on the command-line in a terminal shell.

1. Open this notebook in the Jupyter Notebook Dashboard.

### Install additional packages

Install the packages required for executing this notebook.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Vertex AI Workbench Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

!pip3 install {USER_FLAG} --upgrade google-cloud-aiplatform google-cloud-storage tensorflow-serving-api -q

If you also plan to run the MLPerf Inference benchmark, you need to download and install additional dependencies (see https://github.com/tensorflow/tpu/tree/master/models/experimental/inference/load_test#run-the-benchmark-locally for details).

In [None]:
!pip3 install {USER_FLAG} transformers tf-models-official -q

In [None]:
!git clone --recurse-submodules -b r1.0 https://github.com/mlcommons/inference.git

In [None]:
!cd inference/loadgen && CFLAGS="-std=c++14 -O3" python3 setup.py bdist_wheel && pip3 install {USER_FLAG} --force-reinstall dist/mlperf_loadgen-*

In [None]:
!git clone https://github.com/tensorflow/tpu.git

### Restart the kernel

After you install the additional packages, you must restart the notebook kernel so it can find the packages.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required for all notebook environments.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute and storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

1. If you run this notebook locally, you must install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the following cell. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you can try to get your project ID using `gcloud`.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

#### Set your region

Select the region where you are going to deploy your model. Note that if you plan to deploy the model on NVIDIA A100 GPUs, they are only available in select regions: https://cloud.google.com/vertex-ai/docs/general/locations#region_considerations


In [None]:
REGION = "[your-region]"  # @param {type:"string"}

if REGION == "[your-region]":
    REGION = "us-central1"

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI"
into the filter box, and select
   **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the following cell, and then run the cell.

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# The Vertex AI Workbench Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Vertex AI Workbench Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

# Fine-tune T5x base model

In this sample, you use a T5x base model fine-tuned for English-to-German translation.

T5x is a JAX-based model that can be trained and fine-tuned on Google Cloud TPUs, and then exported as a TensorFlow SavedModel.

To fine-tune the model on a Cloud TPU VM, follow the steps at https://github.com/google-research/t5x. Alternatively, you can fine-tune the model using the Vertex AI Training service; see https://github.com/GoogleCloudPlatform/t5x-on-vertex-ai for the steps.

To export the fine-tuned model, refer to the [Exporting as TensorFlow Saved Model](https://github.com/google-research/t5x#exporting-as-tensorflow-saved-model) section.

For the purpose of this guide, you can use the already fine-tuned models available under gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/base/.

## Model Arithmetic and Weights Type

Note that there are two variants of the model: one exported with `float32` weights, and another with `bfloat16` weights.

`bfloat16` is a native format for Google Cloud TPUs, and the T5x model also uses it by default.
The NVIDIA A100 GPU supports `bfloat16` arithmetic, and the optimized TensorFlow runtime lets you take advantage of this.
If you plan to deploy the model on a GPU that doesn't support `bfloat16`, such as the NVIDIA T4 or NVIDIA V100, you need to use the model with `float32` weights. In that case, the optimized TensorFlow runtime can still run the model at lower precision when you specify the `--allow_compression` option.
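
To make the difference between the two weight types concrete, here is a minimal sketch that emulates `bfloat16` in plain Python. It relies on the fact that `bfloat16` keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits of a `float32`. The helper `to_bfloat16` is a hypothetical name, and it truncates rather than rounds, which is a simplification of what real hardware typically does.

```python
import struct


def to_bfloat16(value):
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Simplification: this truncates (rounds toward zero); hardware usually
    rounds to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Precision drops to roughly 2-3 significant decimal digits.
print(to_bfloat16(3.14159))  # 3.140625

# The exponent range of float32 is preserved, so large values stay finite.
print(to_bfloat16(1e38))
```

Because the exponent range is unchanged, `bfloat16` weights cover the same magnitudes as `float32` weights, at the cost of mantissa precision. IEEE `float16`, by contrast, has more mantissa bits but cannot represent values above 65504.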

In [None]:
T5X_BASE_FLOAT32_MODEL_URI = "gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/base/saved_model.float32/1"
T5X_BASE_BFLOAT16_MODEL_URI = "gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/base/saved_model.bfloat16/1"

You can inspect the model signature using the `saved_model_cli` tool, which is part of TensorFlow. You can ignore the error related to `SentencepieceOp`.

In [None]:
!saved_model_cli show --dir=$T5X_BASE_FLOAT32_MODEL_URI --all

# Deploy model to Vertex AI Endpoint

You deploy the model using the [Vertex AI SDK](https://cloud.google.com/python/docs/reference/aiplatform/latest). Import it into your notebook environment.

In [None]:
from google.cloud import aiplatform

Define the node configuration to use for deployments. To learn about Vertex AI Prediction options, see [configure compute resources](https://cloud.google.com/vertex-ai/docs/predictions/configure-compute).

You are going to deploy the models using an optimized TensorFlow runtime container. For the full list of available containers, see the official documentation: https://cloud.google.com/vertex-ai/docs/predictions/optimized-tensorflow-runtime#available_container_images.

In [None]:
OPTIMIZED_TF_RUNTIME_IMAGE_URI = (
    "us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.nightly:latest"
)

You are going to deploy the T5x model on an NVIDIA T4 GPU.

In [None]:
DEPLOY_COMPUTE_T4 = "n1-standard-8"
DEPLOY_GPU_T4 = "NVIDIA_TESLA_T4"

Deploy the T5x base model with float32 weights and no optimizations.

In [None]:
t5x_base_float32 = aiplatform.Model.upload(
    display_name="t5x_base_float32",
    artifact_uri=T5X_BASE_FLOAT32_MODEL_URI,
    serving_container_image_uri=OPTIMIZED_TF_RUNTIME_IMAGE_URI,
    serving_container_args=[],
    location=REGION,
)

t5x_base_float32_t4_endpoint = t5x_base_float32.deploy(
    deployed_model_display_name="t5x_base_float32_deployed",
    traffic_split={"0": 100},
    machine_type=DEPLOY_COMPUTE_T4,
    accelerator_type=DEPLOY_GPU_T4,
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
)

Deploy the T5x base model with float32 weights and precompilation.

In [None]:
t5x_base_float32_precompiled = aiplatform.Model.upload(
    display_name="t5x_base_float32_precompiled",
    artifact_uri=T5X_BASE_FLOAT32_MODEL_URI,
    serving_container_image_uri=OPTIMIZED_TF_RUNTIME_IMAGE_URI,
    serving_container_args=["--allow_precompilation"],
    location=REGION,
)

t5x_base_float32_precompiled_t4_endpoint = t5x_base_float32_precompiled.deploy(
    deployed_model_display_name="t5x_base_float32_precompiled_deployed",
    traffic_split={"0": 100},
    machine_type=DEPLOY_COMPUTE_T4,
    accelerator_type=DEPLOY_GPU_T4,
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
)

Deploy the T5x base model with float32 weights, precompilation, and compression. Model compression makes the compute-intensive parts of the model run at lower float16 precision so that they can use NVIDIA GPU Tensor Cores.

In [None]:
t5x_base_float32_precompiled_mixedprecision = aiplatform.Model.upload(
    display_name="t5x_base_float32_precompiled_mixedprecision",
    artifact_uri=T5X_BASE_FLOAT32_MODEL_URI,
    serving_container_image_uri=OPTIMIZED_TF_RUNTIME_IMAGE_URI,
    serving_container_args=["--allow_precompilation", "--allow_compression"],
    location=REGION,
)

t5x_base_float32_precompiled_mixedprecision_t4_endpoint = t5x_base_float32_precompiled_mixedprecision.deploy(
    deployed_model_display_name="t5x_base_float32_precompiled_mixedprecision_deployed",
    traffic_split={"0": 100},
    machine_type=DEPLOY_COMPUTE_T4,
    accelerator_type=DEPLOY_GPU_T4,
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
)

For the best performance, you can deploy the T5x base model with bfloat16 weights on an NVIDIA A100, which supports bfloat16 arithmetic. Note that to use bfloat16 arithmetic effectively, the model has to be deployed with precompilation. Since the model already runs at half precision, model compression is not needed.
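
The guidance in this section can be summarized as a small lookup. This is only a sketch of the reasoning in the text, not part of the Vertex AI SDK: `GPU_SUPPORTS_BFLOAT16` and `choose_deployment` are hypothetical names, and the table covers only the three GPUs mentioned in this notebook.

```python
# bfloat16 support for the GPUs discussed in this notebook.
GPU_SUPPORTS_BFLOAT16 = {
    "NVIDIA_TESLA_T4": False,
    "NVIDIA_TESLA_V100": False,
    "NVIDIA_TESLA_A100": True,
}


def choose_deployment(accelerator_type):
    """Return (weights type, serving container args) for a given GPU type."""
    if GPU_SUPPORTS_BFLOAT16[accelerator_type]:
        # Already half precision: precompilation only, no compression needed.
        return "bfloat16", ["--allow_precompilation"]
    # No bfloat16 arithmetic: use float32 weights and let the runtime compress.
    return "float32", ["--allow_precompilation", "--allow_compression"]


print(choose_deployment("NVIDIA_TESLA_T4"))
print(choose_deployment("NVIDIA_TESLA_A100"))
```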

In [None]:
DEPLOY_COMPUTE_A100 = "a2-highgpu-1g"
DEPLOY_GPU_A100 = "NVIDIA_TESLA_A100"

In [None]:
t5x_base_float32_a100_endpoint = t5x_base_float32.deploy(
    deployed_model_display_name="t5x_base_float32_deployed",
    traffic_split={"0": 100},
    machine_type=DEPLOY_COMPUTE_A100,
    accelerator_type=DEPLOY_GPU_A100,
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
)

In [None]:
t5x_base_bfloat16_precompiled = aiplatform.Model.upload(
    display_name="t5x_base_bfloat16_precompiled",
    artifact_uri=T5X_BASE_BFLOAT16_MODEL_URI,
    serving_container_image_uri=OPTIMIZED_TF_RUNTIME_IMAGE_URI,
    serving_container_args=["--allow_precompilation"],
    location=REGION,
)

t5x_base_bfloat16_precompiled_a100_endpoint = t5x_base_bfloat16_precompiled.deploy(
    deployed_model_display_name="t5x_base_bfloat16_precompiled_deployed",
    traffic_split={"0": 100},
    machine_type=DEPLOY_COMPUTE_A100,
    accelerator_type=DEPLOY_GPU_A100,
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=1,
)

## Sending prediction requests

You can send requests directly to each endpoint. T5x models expect each instance to be a dictionary with a "text_batch" key (see the output of the `saved_model_cli` call above).

In [None]:
instances = [{"text_batch": "translate English to German: this is good"}]

In [None]:
t5x_base_float32_t4_endpoint.predict(instances=instances)

In [None]:
t5x_base_float32_precompiled_t4_endpoint.predict(instances=instances)

In [None]:
t5x_base_float32_precompiled_mixedprecision_t4_endpoint.predict(instances=instances)

In [None]:
t5x_base_float32_a100_endpoint.predict(instances=instances)

In [None]:
t5x_base_bfloat16_precompiled_a100_endpoint.predict(instances=instances)

Alternatively, you can send POST REST requests without using the SDK. To learn more, see https://cloud.google.com/vertex-ai/docs/predictions/online-predictions-custom-models#online_predict_custom_trained-drest.
Sending REST requests directly is slightly faster than going through the SDK.
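
As a rough sketch of what such a REST request looks like, the helper below builds the `:predict` URL and JSON body for an endpoint without sending anything. `build_predict_request` is a hypothetical name, and the project, region, and endpoint values in the usage line are placeholders.

```python
import json


def build_predict_request(project_id, region, endpoint_id, texts):
    """Build the URL and JSON body for a Vertex AI online prediction call."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{region}/endpoints/{endpoint_id}:predict"
    )
    # Each instance uses the "text_batch" key that the T5x model expects.
    body = json.dumps({"instances": [{"text_batch": text} for text in texts]})
    return url, body


url, body = build_predict_request(
    "my-project", "us-central1", "1234567890",  # placeholder values
    ["translate English to German: this is good"],
)
print(url)
print(body)
```

To actually send the request, POST the body to the URL with a `Content-Type: application/json` header and an `Authorization: Bearer` header carrying an access token, for example one printed by `gcloud auth print-access-token`.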

## Compare predictions

To make sure that all models return the same results, send the same requests to all endpoints and compare the predictions.

In [None]:
# Add your samples here
samples = [
    "hello world",
    "this is a prediction from T5x model",
    "this is good",
    "my name is T5x",
]

endpoints = {
    "t5x_base_float32_t4": t5x_base_float32_t4_endpoint,
    "t5x_base_float32_precompiled_t4": t5x_base_float32_precompiled_t4_endpoint,
    "t5x_base_float32_precompiled_mixedprecision_t4": t5x_base_float32_precompiled_mixedprecision_t4_endpoint,
    "t5x_base_float32_a100": t5x_base_float32_a100_endpoint,
    "t5x_base_bfloat16_precompiled_a100": t5x_base_bfloat16_precompiled_a100_endpoint,
}

prefix = "translate English to German: "

for sample in samples:
    print(f"Prediction for: {prefix}{sample}")
    for model_name, endpoint in endpoints.items():
        response = endpoint.predict(instances=[{"text_batch": f"{prefix}{sample}"}])
        prediction = response.predictions[0]["output_0"][0]
        print(f"Model: {model_name} Prediction: {prediction}")
    print("-----------")
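
The loop above prints the predictions so that you can compare them by eye. If you prefer an automated check, one option is to group model names by the prediction they returned. The helper `group_by_prediction` is a hypothetical name, and the strings in the example are made-up placeholders, not real model output.

```python
from collections import defaultdict


def group_by_prediction(predictions_by_model):
    """Map each distinct prediction to the list of models that produced it."""
    groups = defaultdict(list)
    for model_name, prediction in predictions_by_model.items():
        groups[prediction].append(model_name)
    return dict(groups)


# Placeholder predictions for illustration only.
example = {
    "model_a": "placeholder output 1",
    "model_b": "placeholder output 1",
    "model_c": "placeholder output 2",
}
groups = group_by_prediction(example)
print("All models agree" if len(groups) == 1 else f"Models disagree: {groups}")
```

A single group means every endpoint returned the same text for that sample.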

## (Optional) Compare performance of deployed models

You can run the benchmarks from a Colab environment, but to get reliable results you should use a VM in the same region as your model.

Import the helper functions for benchmarking models.

In [None]:
!curl https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/community/vertex_endpoints/optimized_tensorflow_runtime/benchmark.py -o benchmark.py

In [None]:
from benchmark import benchmark

This code sends a specified number of requests asynchronously and uniformly at a given QPS, then records the observed latency. Next, the latency results are aggregated and percentiles are calculated. The `actual_qps` that the model can handle is calculated as the number of requests sent divided by the time it takes the model to process them. By providing different implementations of the `send_request` and `build_request` functions, the same code can be used for benchmarking models running locally or on Vertex AI Prediction using the gRPC and REST protocols.

The main goal of this benchmark is to measure model latency under different loads, and the maximum throughput the model can handle. To find the maximum throughput, gradually increase QPS until `actual_qps` stops increasing and latency increases dramatically.

In a production deployment, the workload is not uniform, so the maximum model throughput is likely to be lower. The goal is not to simulate a production workload; this benchmark is meant to compare latency and throughput for the same model running in different environments.
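
The aggregation step described above can be sketched as follows. This is not the code in `benchmark.py`, whose exact percentile method may differ; `percentile` and `summarize` are hypothetical helpers that use the nearest-rank method and summarize a single run at one target QPS.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a list sorted in ascending order."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms, elapsed_seconds):
    """Aggregate per-request latencies from one run into summary metrics."""
    ordered = sorted(latencies_ms)
    return {
        # Throughput actually achieved: completed requests per second.
        "actual_qps": len(ordered) / elapsed_seconds,
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
    }


# 100 synthetic latencies (1 ms .. 100 ms) completed over 50 seconds.
print(summarize(list(range(1, 101)), elapsed_seconds=50))
```

When the target QPS exceeds what the model can sustain, requests queue up, the elapsed time grows, and `actual_qps` falls below the target while the percentiles climb.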

In [None]:
def build_rest_request(row_dict, model_name):
    return row_dict


def validate_response(response):
    assert response
    assert len(response.predictions) == 1
    assert "output_0" in response.predictions[0]
    assert response.predictions[0]["output_0"]

In [None]:
def send_rest_request(request):
    response = t5x_base_float32_t4_endpoint.predict(instances=[request])
    validate_response(response)


t5x_base_float32_t4_results = benchmark(
    send_rest_request,
    build_rest_request,
    "gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/requests/requests_100.jsonl",
    [0.5, 0.75, 1, 1.25, 1.5, 1.75, 2.0],
    10,
)

t5x_base_float32_t4_results

In [None]:
def send_rest_request(request):
    response = t5x_base_float32_precompiled_t4_endpoint.predict(instances=[request])
    validate_response(response)


t5x_base_float32_precompiled_t4_results = benchmark(
    send_rest_request,
    build_rest_request,
    "gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/requests/requests_100.jsonl",
    [0.5, 1, 2, 3, 4, 5],
    10,
)

t5x_base_float32_precompiled_t4_results

In [None]:
def send_rest_request(request):
    response = t5x_base_float32_precompiled_mixedprecision_t4_endpoint.predict(
        instances=[request]
    )
    validate_response(response)


t5x_base_float32_precompiled_mixedprecision_t4_results = benchmark(
    send_rest_request,
    build_rest_request,
    "gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/requests/requests_100.jsonl",
    [0.5, 1, 2, 3, 4, 5],
    10,
)

t5x_base_float32_precompiled_mixedprecision_t4_results

In [None]:
def send_rest_request(request):
    response = t5x_base_float32_a100_endpoint.predict(instances=[request])
    validate_response(response)


t5x_base_float32_a100_results = benchmark(
    send_rest_request,
    build_rest_request,
    "gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/requests/requests_100.jsonl",
    [0.5, 1, 1.25, 1.5, 1.75, 2.0, 2.25],
    10,
)

t5x_base_float32_a100_results

In [None]:
def send_rest_request(request):
    response = t5x_base_bfloat16_precompiled_a100_endpoint.predict(instances=[request])
    validate_response(response)


t5x_base_bfloat16_precompiled_a100_results = benchmark(
    send_rest_request,
    build_rest_request,
    "gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/requests/requests_100.jsonl",
    [0.5, 5, 7.5, 10, 12.5, 15],
    10,
)

t5x_base_bfloat16_precompiled_a100_results

Combine and visualize results.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np


def build_graph(x_key, y_key, results_dict, axis, title="T5x base model latency"):
    matplotlib.rcParams["figure.figsize"] = [10.0, 7.0]

    fig, ax = plt.subplots(facecolor=(1, 1, 1))
    ax.set_xlabel("QPS")
    ax.set_ylabel("Latency(ms)")
    for label, results in results_dict.items():
        x = np.array(results[x_key])
        y = np.array(results[y_key])
        ax.plot(x, y, label=label, marker="s")
    ax.grid()
    ax.legend()
    ax.axis(axis)
    ax.set_title(title)
    return fig

In [None]:
fig = build_graph(
    "actual_qps",
    "p50",
    {
        "T5x base float32 on T4": t5x_base_float32_t4_results,
        "T5x base float32 on T4 with precompilation": t5x_base_float32_precompiled_t4_results,
        "T5x base float32 on T4 with precompilation and compression": t5x_base_float32_precompiled_mixedprecision_t4_results,
        "T5x base float32 on A100": t5x_base_float32_a100_results,
        "T5x base bfloat16 on A100 with precompilation": t5x_base_bfloat16_precompiled_a100_results,
    },
    (0, 10, 0, 2500),
    title="T5x base model p50 latency, batch size 1",
)
fig.savefig("t5x_base_p50_latency.png", bbox_inches="tight")

In [None]:
fig = build_graph(
    "actual_qps",
    "p99",
    {
        "T5x base float32 on T4": t5x_base_float32_t4_results,
        "T5x base float32 on T4 with precompilation": t5x_base_float32_precompiled_t4_results,
        "T5x base float32 on T4 with precompilation and compression": t5x_base_float32_precompiled_mixedprecision_t4_results,
        "T5x base float32 on A100": t5x_base_float32_a100_results,
        "T5x base bfloat16 on A100 with precompilation": t5x_base_bfloat16_precompiled_a100_results,
    },
    (0, 5, 0, 5500),
    title="T5x base model p99 latency, batch size 1",
)
fig.savefig("t5x_base_p99_latency.png", bbox_inches="tight")

As you can see, the Vertex AI Prediction optimized TensorFlow runtime optimizations offer significantly higher throughput and lower latency for the T5x base model.

## (Optional) Compare performance of deployed models using MLPerf Inference loadgen

MLPerf Inference is a benchmark suite for measuring how fast systems can run models in a variety of deployment scenarios. MLPerf is now an industry-standard way of measuring model performance. You can follow the instructions at https://github.com/tensorflow/tpu/tree/master/models/experimental/inference/load_test to run the MLPerf Inference benchmark for deployed models.

Unlike the naive benchmark used above, MLPerf loadgen sends requests following a [Poisson distribution](https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#3-scenarios).
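
To illustrate the difference between the two load patterns, the sketch below generates request send times for both. It is not loadgen's implementation: `uniform_arrivals` and `poisson_arrivals` are hypothetical helpers, and the Poisson process is modeled in the standard way by drawing exponentially distributed gaps between requests.

```python
import random


def uniform_arrivals(qps, count):
    """Send times (seconds) for requests spaced evenly at the given QPS."""
    return [i / qps for i in range(count)]


def poisson_arrivals(qps, count, seed=0):
    """Send times (seconds) for a Poisson process with the given mean QPS."""
    rng = random.Random(seed)
    arrivals, now = [], 0.0
    for _ in range(count):
        # Gaps between Poisson arrivals are exponential with mean 1 / qps.
        now += rng.expovariate(qps)
        arrivals.append(now)
    return arrivals


print(uniform_arrivals(2, 4))
print(poisson_arrivals(2, 4))
```

With Poisson arrivals, some requests land close together and briefly overload the server, which is why tail latency under this pattern can differ from the uniform benchmark at the same average QPS.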

In [None]:
project_id = t5x_base_float32_t4_endpoint.resource_name.split("/")[1]
project_id

endpoint_id = t5x_base_float32_t4_endpoint.resource_name.split("/")[-1]
endpoint_id

In [None]:
%cd tpu/models/experimental/inference

In [None]:
!python3 -m load_test.examples.loadgen_vertex_main \
  --project_id={project_id} \
  --region={REGION} \
  --endpoint_id={t5x_base_float32_t4_endpoint.resource_name.split("/")[-1]} \
  --dataset=generic_jsonl \
  --data_file=gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/requests/requests_100.jsonl \
  --api_type=rest \
  --min_query_count=10 \
  --min_duration_ms=10000 \
  --qps=0.5 --qps=1.0 --qps=1.25 --qps=1.5 --qps=1.75 --qps=2.0 \
  --csv_report_filename="t5x_base_float32_t4_results.csv"

In [None]:
!python3 -m load_test.examples.loadgen_vertex_main \
  --project_id={project_id} \
  --region={REGION} \
  --endpoint_id={t5x_base_float32_precompiled_t4_endpoint.resource_name.split("/")[-1]} \
  --dataset=generic_jsonl \
  --data_file=gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/requests/requests_100.jsonl \
  --api_type=rest \
  --min_query_count=10 \
  --min_duration_ms=10000 \
  --qps=0.5 --qps=1 --qps=2 --qps=3 --qps=4 --qps=5 \
  --csv_report_filename="t5x_base_float32_precompiled_t4_results.csv"

In [None]:
!python3 -m load_test.examples.loadgen_vertex_main \
  --project_id={project_id} \
  --region={REGION} \
  --endpoint_id={t5x_base_float32_precompiled_mixedprecision_t4_endpoint.resource_name.split("/")[-1]} \
  --dataset=generic_jsonl \
  --data_file=gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/requests/requests_100.jsonl \
  --api_type=rest \
  --min_query_count=10 \
  --min_duration_ms=10000 \
  --qps=0.5 --qps=1 --qps=2 --qps=3 --qps=4 --qps=5 --qps=6 \
  --csv_report_filename="t5x_base_float32_precompiled_mixedprecision_t4_results.csv"

In [None]:
!python3 -m load_test.examples.loadgen_vertex_main \
  --project_id={project_id} \
  --region={REGION} \
  --endpoint_id={t5x_base_float32_a100_endpoint.resource_name.split("/")[-1]} \
  --dataset=generic_jsonl \
  --data_file=gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/requests/requests_100.jsonl \
  --api_type=rest \
  --min_query_count=10 \
  --min_duration_ms=10000 \
  --qps=0.5 --qps=1.0 --qps=1.25 --qps=1.5 --qps=1.75 --qps=2.0 --qps=2.25 \
  --csv_report_filename="t5x_base_float32_a100_results.csv"

In [None]:
!python3 -m load_test.examples.loadgen_vertex_main \
  --project_id={project_id} \
  --region={REGION} \
  --endpoint_id={t5x_base_bfloat16_precompiled_a100_endpoint.resource_name.split("/")[-1]} \
  --dataset=generic_jsonl \
  --data_file=gs://cloud-samples-data/vertex-ai/model-deployment/models/t5x/requests/requests_100.jsonl \
  --api_type=rest \
  --min_query_count=10 \
  --min_duration_ms=10000 \
  --qps=0.5 --qps=5 --qps=7.5 --qps=10 --qps=12.5 --qps=15 \
  --csv_report_filename="t5x_base_bfloat16_precompiled_a100_results.csv"

In [None]:
import csv


def parse_report_csv(file_name):
    with open(file_name, newline="") as f:
        reader = csv.reader(f)

        d = {}
        index_to_key = {}
        for row in reader:
            if not d:
                for index in range(len(row)):
                    key = row[index]
                    index_to_key[index] = key
                    d[key] = []
            else:
                for index in range(len(row)):
                    if index_to_key[index] != "scenario":
                        d[index_to_key[index]].append(float(row[index]))
                    else:
                        d[index_to_key[index]].append(row[index])
        return d

In [None]:
fig = build_graph(
    "actual_qps",
    "p50",
    {
        "T5x base float32 on T4": parse_report_csv("t5x_base_float32_t4_results.csv"),
        "T5x base float32 on T4 with precompilation": parse_report_csv(
            "t5x_base_float32_precompiled_t4_results.csv"
        ),
        "T5x base float32 on T4 with precompilation and compression": parse_report_csv(
            "t5x_base_float32_precompiled_mixedprecision_t4_results.csv"
        ),
        "T5x base float32 on A100": parse_report_csv(
            "t5x_base_float32_a100_results.csv"
        ),
        "T5x base bfloat16 on A100 with precompilation": parse_report_csv(
            "t5x_base_bfloat16_precompiled_a100_results.csv"
        ),
    },
    (0, 10, 0, 2500),
    title="T5x base model p50 latency measured by MLPerf loadgen, batch size 1",
)
fig.savefig("t5x_base_p50_mlperf_latency.png", bbox_inches="tight")

In [None]:
fig = build_graph(
    "actual_qps",
    "p99",
    {
        "T5x base float32 on T4": parse_report_csv("t5x_base_float32_t4_results.csv"),
        "T5x base float32 on T4 with precompilation": parse_report_csv(
            "t5x_base_float32_precompiled_t4_results.csv"
        ),
        "T5x base float32 on T4 with precompilation and compression": parse_report_csv(
            "t5x_base_float32_precompiled_mixedprecision_t4_results.csv"
        ),
        "T5x base float32 on A100": parse_report_csv(
            "t5x_base_float32_a100_results.csv"
        ),
        "T5x base bfloat16 on A100 with precompilation": parse_report_csv(
            "t5x_base_bfloat16_precompiled_a100_results.csv"
        ),
    },
    (0, 10, 0, 3500),
    title="T5x base model p99 latency measured by MLPerf loadgen, batch size 1",
)
fig.savefig("t5x_base_p99_mlperf_latency.png", bbox_inches="tight")

These results are mostly consistent with results obtained using naive benchmarking code.

## Cleanup

After you are done, it's safe to remove the endpoints you created and the models you deployed.

In [None]:
# Undeploy models
t5x_base_float32_t4_endpoint.undeploy_all()
t5x_base_float32_precompiled_t4_endpoint.undeploy_all()
t5x_base_float32_precompiled_mixedprecision_t4_endpoint.undeploy_all()
t5x_base_float32_a100_endpoint.undeploy_all()
t5x_base_bfloat16_precompiled_a100_endpoint.undeploy_all()

In [None]:
# Delete models
t5x_base_float32.delete()
t5x_base_float32_precompiled.delete()
t5x_base_float32_precompiled_mixedprecision.delete()
t5x_base_bfloat16_precompiled.delete()

In [None]:
# Delete endpoints
t5x_base_float32_t4_endpoint.delete()
t5x_base_float32_precompiled_t4_endpoint.delete()
t5x_base_float32_precompiled_mixedprecision_t4_endpoint.delete()
t5x_base_float32_a100_endpoint.delete()
t5x_base_bfloat16_precompiled_a100_endpoint.delete()
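
If one of the cleanup calls above fails, for example because a deployment never completed, the remaining resources stay allocated and continue to incur costs. A hedged alternative is a loop that keeps going and reports failures at the end. The helper `cleanup` is a hypothetical name; it only calls the `undeploy_all()` and `delete()` methods already used in the cells above.

```python
def cleanup(endpoints, models):
    """Undeploy and delete endpoints, then delete models; collect failures."""
    failures = []
    for endpoint in endpoints:
        try:
            endpoint.undeploy_all()
            endpoint.delete()
        except Exception as error:  # keep going so other resources are freed
            failures.append((endpoint, error))
    for model in models:
        try:
            model.delete()
        except Exception as error:
            failures.append((model, error))
    return failures
```

You would call it with the endpoint and model objects created earlier, for example `cleanup(list(endpoints.values()), [t5x_base_float32, t5x_base_bfloat16_precompiled])`, and then inspect the returned list for anything that still needs manual removal.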