In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# E2E ML on GCP: MLOps stage 5 : deployment: get started with configuring autoscaling for Vertex AI Endpoint deployment

<table align="left">
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage5/get_started_with_autoscaling.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
        <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage5/get_started_with_autoscaling.ipynb">
        <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
        </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/community/ml_ops/stage5/get_started_with_autoscaling.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

## Overview


This tutorial demonstrates how to use Vertex AI for end-to-end (E2E) MLOps on Google Cloud in production. It covers stage 5, deployment: get started with autoscaling for deployment.

### Objective

In this tutorial, you learn how to fine-tune the auto-scaling configuration when deploying a `Model` resource to an `Endpoint` resource.

This tutorial uses the following Google Cloud ML services:

- `Vertex AI Prediction`

The steps performed include:

- Download a pretrained image classification model from TensorFlow Hub.
- Upload the pretrained model as a `Model` resource.
- Create an `Endpoint` resource.
- Deploy `Model` resource for no-scaling (single node).
- Deploy `Model` resource for manual scaling.
- Deploy `Model` resource for auto-scaling.
- Fine-tune scaling thresholds for CPU utilization.
- Fine-tune scaling thresholds for GPU utilization.
- Deploy mix of CPU and GPU model instances with auto-scaling to an `Endpoint` resource.

### Dataset

This tutorial uses a pretrained image classification model from TensorFlow Hub, which is trained on the ImageNet dataset.

Learn more about the [ResNet V2 pretrained model](https://tfhub.dev/google/imagenet/resnet_v2_101/classification/5).

### Costs
This tutorial uses billable components of Google Cloud:

- Vertex AI
- Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Installations

Install the packages required for executing this notebook.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

# Install the packages

! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG -q
! pip3 install --upgrade google-cloud-storage $USER_FLAG -q
! pip3 install tensorflow-hub $USER_FLAG -q

### Restart the kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Vertex AI, Compute Engine and Cloud Storage APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component,storage_component).

1. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append the timestamp onto the name of resources you create in this tutorial.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
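
As a minimal sketch of how the timestamp keeps names distinct, the helper below appends a second-resolution timestamp to a resource name prefix. The function name `make_resource_name` is hypothetical; it is not part of the tutorial or of the Vertex AI SDK.

```python
from datetime import datetime


def make_resource_name(prefix: str, now: datetime) -> str:
    """Append a second-resolution timestamp to a resource name prefix."""
    return prefix + now.strftime("%Y%m%d%H%M%S")


# Two sessions started one second apart get different resource names.
print(make_resource_name("example_", datetime(2022, 1, 2, 3, 4, 5)))
print(make_resource_name("example_", datetime(2022, 1, 2, 3, 4, 6)))
```

Two users who start a session within the same second would still collide, so this scheme reduces collisions rather than ruling them out.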

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

In the Cloud Console, go to the [Create service account key](https://console.cloud.google.com/apis/credentials/serviceaccountkey) page.

1. **Click Create service account**.

2. In the **Service account name** field, enter a name, and click **Create**.

3. In the **Grant this service account access to project** section, click the Role drop-down list. Type "Vertex AI" into the filter box, and select **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

4. Click Create. A JSON file that contains your key downloads to your local environment.

5. Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize the Vertex SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
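
Before creating the bucket, it can help to sanity-check the name locally. The sketch below checks only a simplified subset of the Cloud Storage naming rules (length of 3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; a letter or digit at both ends). The helper name `looks_like_valid_bucket_name` is hypothetical, and passing this check does not guarantee that the name is globally unique or accepted by the service.

```python
import re

# Simplified pattern: lowercase letters, digits, dashes, underscores, and dots,
# starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]*[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes a simplified subset of the naming rules."""
    if not 3 <= len(name) <= 63:
        return False
    return _BUCKET_NAME_RE.fullmatch(name) is not None


# A generated name passes; the unedited placeholder does not.
print(looks_like_valid_bucket_name("my-project-aip-20220102030405"))
print(looks_like_valid_bucket_name("[your-bucket-name]"))
```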

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_URI == "" or BUCKET_URI is None or BUCKET_URI == "gs://[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "aip-" + TIMESTAMP
    BUCKET_URI = "gs://" + BUCKET_NAME

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

### Set up variables

Next, set up some variables used throughout the tutorial.

### Import libraries and define constants

In [None]:
import google.cloud.aiplatform as aiplatform
import tensorflow as tf
import tensorflow_hub as hub

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_URI)

#### Set hardware accelerators

You can set hardware accelerators for training and prediction.

Set the variables `DEPLOY_GPU` and `DEPLOY_NGPU` to select a GPU type (used with a container image that supports GPUs) and the number of GPUs allocated to each virtual machine (VM) instance. For example, to use a GPU container image with 4 Nvidia Tesla K80 GPUs allocated to each VM, you would specify:

    (aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80, 4)


Otherwise specify `(None, None)` to use a container image to run on a CPU.

Learn more about [hardware accelerator support for your region](https://cloud.google.com/vertex-ai/docs/general/locations#accelerators).

Learn more about [GPU compatibility by Machine Type](https://cloud.google.com/vertex-ai/docs/training/configure-compute#gpu-compatibility-table).

In [None]:
if os.getenv("IS_TESTING_DEPLOY_GPU"):
    DEPLOY_GPU, DEPLOY_NGPU = (
        aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_DEPLOY_GPU")),
    )
else:
    DEPLOY_GPU, DEPLOY_NGPU = (aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80, 1)

#### Set pre-built containers

Set the pre-built Docker container image for prediction.


For the latest list, see [Pre-built containers for prediction](https://cloud.google.com/ai-platform-unified/docs/predictions/pre-built-containers).

In [None]:
if os.getenv("IS_TESTING_TF"):
    TF = os.getenv("IS_TESTING_TF")
else:
    TF = "2.5".replace(".", "-")

GPU_VERSION = "tf2-gpu.{}".format(TF)
CPU_VERSION = "tf2-cpu.{}".format(TF)

DEPLOY_IMAGE_GPU = "{}-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(
    REGION.split("-")[0], GPU_VERSION
)

DEPLOY_IMAGE_CPU = "{}-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(
    REGION.split("-")[0], CPU_VERSION
)

print("Deployment:", DEPLOY_IMAGE_GPU, DEPLOY_IMAGE_CPU, DEPLOY_GPU, DEPLOY_NGPU)
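
The string formatting in the cell above can be wrapped in a small function, which makes the naming scheme easier to see. This is a sketch that mirrors the tutorial's own format string; the function name `prediction_image_uri` is hypothetical, and the sketch does not check that an image for a given TensorFlow version actually exists.

```python
def prediction_image_uri(region: str, tf_version: str, use_gpu: bool) -> str:
    """Build a pre-built TF2 prediction image URI as the tutorial cell does."""
    multi_region = region.split("-")[0]  # "us-central1" -> "us"
    device = "gpu" if use_gpu else "cpu"
    version = tf_version.replace(".", "-")  # "2.5" -> "2-5"
    return (
        f"{multi_region}-docker.pkg.dev/vertex-ai/prediction/"
        f"tf2-{device}.{version}:latest"
    )


print(prediction_image_uri("us-central1", "2.5", use_gpu=False))
print(prediction_image_uri("europe-west4", "2.5", use_gpu=True))
```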

#### Set machine type

Next, set the machine type to use for prediction.

- Set the variable `DEPLOY_COMPUTE` to configure the compute resources for the VMs you will use for prediction.
 - `machine type`
     - `n1-standard`: 3.75 GB of memory per vCPU
     - `n1-highmem`: 6.5 GB of memory per vCPU
     - `n1-highcpu`: 0.9 GB of memory per vCPU
 - `vCPUs`: number of \[2, 4, 8, 16, 32, 64, 96\]

*Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs*.

In [None]:
if os.getenv("IS_TESTING_DEPLOY_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_DEPLOY_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
DEPLOY_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Deploy machine type", DEPLOY_COMPUTE)
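
To see what a given choice means in practice, the per-vCPU memory figures listed above can be turned into a rough estimate of total memory per VM. This is a sketch only: the lookup table copies the three figures from the list above, and the helper name `estimated_memory_gb` is hypothetical.

```python
# Memory per vCPU in GB, copied from the machine type list above.
MEMORY_PER_VCPU_GB = {
    "n1-standard": 3.75,
    "n1-highmem": 6.5,
    "n1-highcpu": 0.9,
}


def estimated_memory_gb(machine: str) -> float:
    """Estimate total memory for a machine name such as "n1-standard-4"."""
    family, vcpus = machine.rsplit("-", 1)
    return MEMORY_PER_VCPU_GB[family] * int(vcpus)


# The default in this tutorial, n1-standard-4, works out to 15 GB.
print(estimated_memory_gb("n1-standard-4"))
```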

## Get pretrained model from TensorFlow Hub

For demonstration purposes, this tutorial uses a pretrained model from TensorFlow Hub (TFHub), which is then uploaded to a `Vertex AI Model` resource. Once you have a `Vertex AI Model` resource, the model can be deployed to a `Vertex AI Endpoint` resource.

### Download the pretrained model

First, you download the pretrained model from TensorFlow Hub. The model gets downloaded as a TF.Keras layer. To finalize the model, in this example, you create a `Sequential()` model with the downloaded TFHub model as a layer, and specify the input shape to the model.

In [None]:
tfhub_model = tf.keras.Sequential(
    [hub.KerasLayer("https://tfhub.dev/google/imagenet/resnet_v2_101/classification/5")]
)

tfhub_model.build([None, 224, 224, 3])

tfhub_model.summary()

### Save the model artifacts

At this point, the model is in memory. Next, you save the model artifacts to a Cloud Storage location.

In [None]:
MODEL_DIR = BUCKET_URI + "/model"
tfhub_model.save(MODEL_DIR)

### Upload the TensorFlow Hub model to a `Vertex AI Model` resource

Finally, you upload the model artifacts from the TFHub model into a `Vertex AI Model` resource.

*Note:* When you upload the model artifacts to a `Vertex AI Model` resource, you specify the corresponding deployment container image. In this example, you are using a CPU only deployment container.

In [None]:
model = aiplatform.Model.upload(
    display_name="example_" + TIMESTAMP,
    artifact_uri=MODEL_DIR,
    serving_container_image_uri=DEPLOY_IMAGE_CPU,
)

print(model)

## Creating an `Endpoint` resource

You create an `Endpoint` resource using the `Endpoint.create()` method. At a minimum, you specify the display name for the endpoint. Optionally, you can specify the project and location (region); otherwise the settings are inherited from the values you set when you initialized the Vertex AI SDK with the `init()` method.

In this example, the following parameters are specified:

- `display_name`: A human readable name for the `Endpoint` resource.
- `project`: Your project ID.
- `location`: Your region.
- `labels`: (optional) User defined metadata for the `Endpoint` in the form of key/value pairs.

This method returns an `Endpoint` object.

Learn more about [Vertex AI Endpoints](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api).

In [None]:
endpoint = aiplatform.Endpoint.create(
    display_name="example_" + TIMESTAMP,
    project=PROJECT_ID,
    location=REGION,
    labels={"your_key": "your_value"},
)

print(endpoint)

## Deploying `Model` resources to an `Endpoint` resource

You can deploy one or more `Vertex AI Model` resource instances to the same endpoint. Each `Vertex AI Model` resource that is deployed will have its own deployment container for the serving binary.

*Note:* For this example, you specified the deployment container for the TFHub model in the previous step of uploading the model artifacts to a `Vertex AI Model` resource.

### Scaling

A `Vertex AI Endpoint` resource supports three types of scaling:

- No Scaling: The serving binary is deployed to a single VM instance.
- Manual Scaling: The serving binary is deployed to a fixed number of multiple VM instances.
- Auto Scaling: The number of VM instances that the serving binary is deployed to varies depending on load.
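
In the examples that follow, the three modes differ only in the minimum and maximum replica counts passed at deployment. The sketch below classifies a pair of counts accordingly. The function name `scaling_mode` is hypothetical, and the defaults of 1 are an assumption chosen to match the no-scaling case, where neither count is specified.

```python
def scaling_mode(min_replica_count: int = 1, max_replica_count: int = 1) -> str:
    """Classify a deployment by its minimum and maximum replica counts."""
    if min_replica_count < 1:
        raise ValueError("min_replica_count must be at least 1")
    if max_replica_count < min_replica_count:
        raise ValueError("max_replica_count must be >= min_replica_count")
    if max_replica_count > min_replica_count:
        return "auto scaling"
    # The counts are equal: one node is no scaling, several nodes is manual.
    return "no scaling" if min_replica_count == 1 else "manual scaling"


print(scaling_mode())      # single node
print(scaling_mode(2, 2))  # fixed number of nodes
print(scaling_mode(1, 2))  # node count varies with load
```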

### No Scaling

In the next example, you deploy the `Vertex AI Model` resource to a `Vertex AI Endpoint` resource, without any scaling -- i.e., a single VM (node) instance. In other words, when the model is deployed, a single VM instance is provisioned and stays provisioned until the model is undeployed.

In this example, you deploy the model with the minimal amount of specified parameters, as follows:

- `model`: The `Model` resource to deploy.
- `machine_type`: The machine type for each VM instance.
- `deployed_model_display_name`: The human readable name for the deployed model instance.

For no-scaling, the single VM instance is provisioned during the deployment of the model. Due to the requirements to provision the resource, this may take up to a few minutes.

In [None]:
response = endpoint.deploy(
    model=model,
    deployed_model_display_name="example_" + TIMESTAMP,
    machine_type=DEPLOY_COMPUTE,
)

#### Display scaling configuration

Once your model is deployed, you can query the `Endpoint` resource to retrieve the scaling configuration for your deployed model with the property `endpoint.gca_resource.deployed_models`.

Since an `Endpoint` resource may have multiple deployed models, the `deployed_models` property returns a list, with one entry per deployed model. In this example, there is a single deployed model and you retrieve the scaling configuration as the first entry in the list: `deployed_models[0]`. You then display the property `dedicated_resources`, which will return the machine type and min/max number of nodes to scale. For no-scaling, the min/max nodes will be set to one.

*Note:* The deployed model identifier refers to the deployed instance of the model and not the model resource identifier.

In [None]:
print(endpoint.gca_resource.deployed_models[0].dedicated_resources)

deployed_model_id = endpoint.gca_resource.deployed_models[0].id
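
When an endpoint has several deployed models, a compact summary can be easier to read than the raw printout. The sketch below assumes the field names of the `DedicatedResources` message (`machine_spec.machine_type`, `min_replica_count`, `max_replica_count`, and `autoscaling_metric_specs` with `metric_name` and `target`). The helper `summarize_scaling` is hypothetical, and the example runs on a stand-in object with made-up values rather than a live endpoint.

```python
from types import SimpleNamespace


def summarize_scaling(deployed_models):
    """Return one dict per deployed model with its scaling settings."""
    summaries = []
    for deployed in deployed_models:
        resources = deployed.dedicated_resources
        summaries.append(
            {
                "id": deployed.id,
                "machine_type": resources.machine_spec.machine_type,
                "min_replica_count": resources.min_replica_count,
                "max_replica_count": resources.max_replica_count,
                "targets": {
                    spec.metric_name: spec.target
                    for spec in resources.autoscaling_metric_specs
                },
            }
        )
    return summaries


# Stand-in object with made-up values, used here instead of a live endpoint.
fake_deployed_model = SimpleNamespace(
    id="1234567890",
    dedicated_resources=SimpleNamespace(
        machine_spec=SimpleNamespace(machine_type="n1-standard-4"),
        min_replica_count=1,
        max_replica_count=1,
        autoscaling_metric_specs=[],
    ),
)
print(summarize_scaling([fake_deployed_model]))
```

In the notebook, you would pass `endpoint.gca_resource.deployed_models` instead of the stand-in list.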

#### Undeploy the model

When you are done doing predictions, you undeploy the model from the `Endpoint` resource. This deprovisions all compute resources and ends billing for the deployed model.

In [None]:
endpoint.undeploy(deployed_model_id)

### Manual scaling

In the next example, you deploy the `Vertex AI Model` resource to a `Vertex AI Endpoint` resource for manual scaling -- a fixed number (greater than 1) of VM instances. In other words, when the model is deployed, the fixed number of VM instances is provisioned and stays provisioned until the model is undeployed.

In this example, you deploy the model with the minimal amount of specified parameters, as follows:

- `model`: The `Model` resource to deploy.
- `machine_type`: The machine type for each VM instance.
- `deployed_model_display_name`: The human readable name for the deployed model instance.
- `min_replica_count`: The minimum number of VM instances (nodes) to provision.
- `max_replica_count`: The maximum number of VM instances (nodes) to provision.

For manual-scaling, the fixed number of VM instances are provisioned during the deployment of the model. 

*Note:* For manual scaling, the minimum and maximum number of nodes are set to the same value.

In [None]:
MIN_NODES = MAX_NODES = 2

response = endpoint.deploy(
    model=model,
    deployed_model_display_name="example_" + TIMESTAMP,
    machine_type=DEPLOY_COMPUTE,
    min_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
)

#### Display scaling configuration

In this example, there is a single deployed model and you retrieve the scaling configuration as the first entry in the list: `deployed_models[0]`. You then display the property `dedicated_resources`, which will return the machine type and min/max number of nodes to scale. For manual scaling, the min/max nodes will be set to the same value, greater than one.

In [None]:
print(endpoint.gca_resource.deployed_models[0].dedicated_resources)

deployed_model_id = endpoint.gca_resource.deployed_models[0].id

#### Undeploy the model

When you are done doing predictions, you undeploy the model from the `Endpoint` resource. This deprovisions all compute resources and ends billing for the deployed model.

In [None]:
endpoint.undeploy(deployed_model_id)

### Auto scaling

In the next example, you deploy the `Vertex AI Model` resource to a `Vertex AI Endpoint` resource for auto scaling -- a variable number of VM instances. In other words, when the model is deployed, the minimum number of VM instances is provisioned. As the load varies, the number of provisioned instances may dynamically increase up to the maximum number of VM instances, and decrease back down to the minimum. The number of provisioned VM instances will never be less than the minimum or more than the maximum.

In this example, you deploy the model with the minimal amount of specified parameters, as follows:

- `model`: The `Model` resource to deploy.
- `machine_type`: The machine type for each VM instance.
- `deployed_model_display_name`: The human readable name for the deployed model instance.
- `min_replica_count`: The minimum number of VM instances (nodes) to provision.
- `max_replica_count`: The maximum number of VM instances (nodes) to provision.

For auto-scaling, the minimum number of VM instances are provisioned during the deployment of the model. 

*Note:* For auto scaling, the minimum number of nodes must be set to a value greater than zero. In other words, there will always be at least one VM instance provisioned.

In [None]:
MIN_NODES = 1
MAX_NODES = 2

response = endpoint.deploy(
    model=model,
    deployed_model_display_name="example_" + TIMESTAMP,
    machine_type=DEPLOY_COMPUTE,
    min_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
)

#### Display scaling configuration

In this example, there is a single deployed model and you retrieve the scaling configuration as the first entry in the list: `deployed_models[0]`. You then display the property `dedicated_resources`, which will return the machine type and min/max number of nodes to scale. For auto scaling, the max nodes will be set to a value greater than the min.

In [None]:
print(endpoint.gca_resource.deployed_models[0].dedicated_resources)

deployed_model_id = endpoint.gca_resource.deployed_models[0].id

#### Undeploy the model

When you are done doing predictions, you undeploy the model from the `Endpoint` resource. This deprovisions all compute resources and ends billing for the deployed model.

In [None]:
endpoint.undeploy(deployed_model_id)

### Setting scaling thresholds

An `Endpoint` resource supports auto-scaling based on two metrics: CPU utilization and GPU duty cycle. Both metrics are measured by taking the average utilization of each deployed model. Once the utilization metric stays above or below its target for a certain amount of time, the number of VM instances (nodes) adjusts up or down accordingly.
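
To build intuition for how a target value relates to the number of nodes, the sketch below applies a simple proportional rule: scale the current node count by the ratio of observed utilization to target utilization, round up, and clamp to the configured minimum and maximum. This is an illustrative assumption, not the service's actual algorithm; in particular, it ignores how long the metric must stay away from the target. The function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    utilization: float,
    target: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Proportional estimate of the node count, clamped to [min, max]."""
    raw = math.ceil(current_replicas * utilization / target)
    return max(min_replicas, min(max_replicas, raw))


# Two nodes at 90% utilization against a 60% target suggest three nodes.
print(desired_replicas(2, utilization=90, target=60, min_replicas=1, max_replicas=4))
# Two nodes at 20% utilization against a 50% target suggest one node.
print(desired_replicas(2, utilization=20, target=50, min_replicas=1, max_replicas=4))
```

Under this rule, lowering the target makes the deployment add nodes sooner, which trades higher cost for more headroom.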


#### CPU thresholds

In the previous examples, the VM instances were deployed with CPUs only -- i.e., no hardware accelerators. By default (in auto-scaling), the target CPU utilization is set to 60%. When deploying the model, specify the parameter `autoscaling_target_cpu_utilization` to set a non-default value.

In [None]:
MIN_NODES = 1
MAX_NODES = 4

response = endpoint.deploy(
    model=model,
    deployed_model_display_name="example_" + TIMESTAMP,
    machine_type=DEPLOY_COMPUTE,
    min_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
    autoscaling_target_cpu_utilization=50,
)

#### Display scaling configuration

In this example, there is a single deployed model and you retrieve the scaling configuration as the first entry in the list: `deployed_models[0]`. You then display the property `dedicated_resources`, which will return the machine type and min/max number of nodes to scale, and the target value for the CPU utilization: `autoscaling_metric_specs`.

In [None]:
print(endpoint.gca_resource.deployed_models[0].dedicated_resources)

deployed_model_id = endpoint.gca_resource.deployed_models[0].id

#### Undeploy the model

When you are done doing predictions, you undeploy the model from the `Endpoint` resource. This deprovisions all compute resources and ends billing for the deployed model.

In [None]:
endpoint.undeploy(deployed_model_id)

### Upload TensorFlow Hub model for GPU deployment image

Next, you upload a second instance of your TensorFlow Hub model as a `Model` resource -- but where the corresponding serving container supports GPUs.

In [None]:
model_gpu = aiplatform.Model.upload(
    display_name="example_" + TIMESTAMP,
    artifact_uri=MODEL_DIR,
    serving_container_image_uri=DEPLOY_IMAGE_GPU,
)

print(model_gpu)

#### GPU thresholds

In this example, the deployment VM instances are configured to use hardware accelerators -- i.e., GPUs, by specifying the following parameters:

- `accelerator_type`: The type of hardware (e.g., GPU) accelerator.
- `accelerator_count`: The number of hardware accelerators per provisioned VM instance.

The type and number of GPUs supported is specific to machine type and region.

Learn more about [GPU types and number per machine type](https://cloud.google.com/vertex-ai/docs/predictions/configure-compute).

Learn more about [GPU types available per region](https://cloud.google.com/vertex-ai/docs/general/locations#accelerators).

By default (in auto-scaling), the target GPU duty cycle is set to 60%. When deploying the model, specify the parameter `autoscaling_target_accelerator_duty_cycle` to set a non-default value.

When serving, if either the CPU utilization or the GPU duty cycle rises above or falls below its target for a certain amount of time, then auto-scaling is triggered.

In [None]:
MIN_NODES = 1
MAX_NODES = 2

response = endpoint.deploy(
    model=model_gpu,
    deployed_model_display_name="example_" + TIMESTAMP,
    machine_type=DEPLOY_COMPUTE,
    accelerator_type=DEPLOY_GPU.name,
    accelerator_count=DEPLOY_NGPU,
    min_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
    autoscaling_target_accelerator_duty_cycle=50,
)

#### Display scaling configuration

In this example, there is a single deployed model and you retrieve the scaling configuration as the first entry in the list: `deployed_models[0]`. You then display the property `dedicated_resources`, which will return the machine type and min/max number of nodes to scale, and the target value for the GPU duty cycle: `autoscaling_metric_specs`.

In [None]:
print(endpoint.gca_resource.deployed_models[0].dedicated_resources)

deployed_model_id = endpoint.gca_resource.deployed_models[0].id

### Deploy multiple models to `Endpoint` resource

Next, you deploy two models to the same `Endpoint` resource and split the prediction request traffic between them. One model will use GPUs and receive 80% of the traffic; the other will use CPUs only and receive 20% of the traffic.

You already have the GPU version of the model deployed to the `Endpoint` resource. In this example, you add a second model instance -- the CPU version -- to the same `Endpoint` resource, and specify the traffic split between the models. In this example, the `traffic_split` parameter is specified as follows:

- `"0": 20`: The model being deployed (default ID is 0) will receive 20% of the traffic.
- `deployed_model_id: 80`: The existing deployed model (specified by its deployed model ID) will receive 80% of the traffic.
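
A quick local check can catch a split that does not add up before the deployment call is made. The sketch below assumes that the percentages must be integers between 0 and 100 that sum to 100; the helper name `validate_traffic_split` and the deployed model ID in the example are hypothetical.

```python
def validate_traffic_split(traffic_split: dict) -> None:
    """Raise ValueError unless the percentages are valid and sum to 100."""
    for model_id, percent in traffic_split.items():
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            raise ValueError(f"Invalid percentage for {model_id!r}: {percent!r}")
    total = sum(traffic_split.values())
    if total != 100:
        raise ValueError(f"Traffic percentages sum to {total}, expected 100")


# "0" stands for the model being deployed; the other key is a made-up
# deployed model ID used only for illustration.
validate_traffic_split({"0": 20, "1234567890": 80})
print("traffic split is valid")
```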

In [None]:
response = endpoint.deploy(
    model=model,
    deployed_model_display_name="example_" + TIMESTAMP,
    machine_type=DEPLOY_COMPUTE,
    min_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
    autoscaling_target_cpu_utilization=50,
    traffic_split={"0": 20, deployed_model_id: 80},
)

#### Display scaling configuration

In this example, there are two deployed models, the CPU and GPU versions.

In [None]:
print(endpoint.gca_resource.deployed_models)

#### Undeploy the models

When you are done doing predictions, you undeploy all the models from the `Endpoint` resource. This deprovisions all compute resources and ends billing for the deployed models.

In [None]:
endpoint.undeploy_all()

#### Delete the model instances

The `delete()` method deletes each `Model` resource.

In [None]:
model.delete()
model_gpu.delete()

#### Delete the endpoint

The `delete()` method deletes the `Endpoint` resource.

In [None]:
endpoint.delete()

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.


In [None]:
# Set this to true only if you'd like to delete your bucket
delete_bucket = True

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI