In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI Model Garden + Agent Engine - Build, Deploy and Test Agents using a Self-deployed Endpoint

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_pytorch_llama3_1_agent_engine.ipynb">
      <img alt="Google Cloud Colab Enterprise logo" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" width="32px"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_pytorch_llama3_1_agent_engine.ipynb">
      <img alt="GitHub logo" src="https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png" width="32px"><br> View on GitHub
    </a>
  </td>
</tr></tbody></table>

## Overview

This notebook demonstrates downloading, deploying, and serving prebuilt Llama 3.1 models with [vLLM](https://github.com/vllm-project/vllm) (standard and optimized), and integrating the deployed endpoint with Agent Engine.

[Agent Engine](https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/overview) (LangChain on Vertex AI) is a managed service in Vertex AI that helps you build and deploy model-based agents. It gives you the flexibility to choose how much reasoning you want to delegate to the LLM and how much you want to handle with custom code.

A previous [notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_openai_api_llama3_1.ipynb) demonstrates how to use Llama 3.1 models as Model-as-a-service (MaaS) to build `chatbot` and `translator` agents.

This notebook demonstrates how to build, deploy, and test these agents using [Agent Engine](https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/overview) with a self-deployed model in Vertex AI.


### Objective

- Select one of the following three ways to deploy Llama 3.1 to an endpoint:
    - Deploy Llama 3.1 8B Instruct with the Fast Deployment feature
    - Deploy Llama 3.1 8B, 70B and 405B with standard vLLM on GPU, optionally with dynamic LoRA adapters
    - Deploy Llama 3.1 8B and 70B with optimized vLLM on GPU
    
- Integrate with Agent Engine: Use the Vertex AI SDK to build three simple agents with the deployed endpoint:
    - A Chatbot Agent
    - A Translator Agent
    - An Agent that uses [an Exchange Rate Tool](https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/develop#define-function)
- Test your agent locally.
- Deploy and test your agent on the Agent Engine.


### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Before you begin

In [None]:
# @title Request for quota

# @markdown By default, the quota for TPU deployment (`Custom model serving TPU v5e cores per region`) is 4, which is sufficient for serving the Llama 3.1 8B model. The Llama 3.1 70B model requires 16 TPU v5e cores. TPU quota is only available in `us-west1`. To request a higher TPU quota, follow the instructions at ["Request a higher quota"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).

# @markdown By default, the quota for H100 deployment (`Custom model serving per region`) is 0. To request H100 quota, follow the instructions at ["Request a higher quota"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).

In [None]:
# @title Setup Google Cloud project

# @markdown 1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

# @markdown 2. **[Optional]** [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) for storing experiment outputs and set `BUCKET_URI` to it. The bucket must be located in the same region as the one where the notebook was launched. Note that a multi-region bucket (e.g., "us") is not considered a match for a single region covered by the multi-region range (e.g., "us-central1"). If `BUCKET_URI` is not set, a unique GCS bucket is created instead.

BUCKET_URI = "gs://"  # @param {type:"string"}

# @markdown 3. **[Optional]** Set region. If not set, the region will be set automatically according to Colab Enterprise environment.

REGION = ""  # @param {type:"string"}

# @markdown 4. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have the associated quota in the selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus). To request quota, follow the instructions at ["Request a higher quota"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).

# @markdown > | Machine Type | Accelerator Type | Recommended Regions |
# @markdown | ----------- | ----------- | ----------- |
# @markdown | a2-ultragpu-1g | 1 NVIDIA_A100_80GB | us-central1, us-east4, europe-west4, asia-southeast1 |
# @markdown | a3-highgpu-2g | 2 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |
# @markdown | a3-highgpu-4g | 4 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |
# @markdown | a3-highgpu-8g | 8 NVIDIA_H100_80GB | us-central1, europe-west4, us-west1, asia-southeast1 |

# Import the necessary packages

# Upgrade Vertex AI SDK.
! pip3 install --upgrade --quiet \
    "google-cloud-aiplatform>=1.64.0" \
    cloudpickle==3.0.0 \
    pydantic==2.10.6 \
    requests \
    langchain-openai
! git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git

import datetime
import importlib
import os
import requests
import uuid
from typing import Tuple

from google.cloud import aiplatform

common_util = importlib.import_module(
    "vertex-ai-samples.community-content.vertex_model_garden.model_oss.notebook_util.common_util"
)

models, endpoints = {}, {}

# Get the default cloud project id.
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]

# Get the default region for launching jobs.
if not REGION:
    if not os.environ.get("GOOGLE_CLOUD_REGION"):
        raise ValueError(
            "REGION must be set. See"
            " https://cloud.google.com/vertex-ai/docs/general/locations for"
            " available cloud locations."
        )
    REGION = os.environ["GOOGLE_CLOUD_REGION"]

# Enable the Vertex AI API and Compute Engine API, if not already.
print("Enabling Vertex AI API and Compute Engine API.")
! gcloud services enable aiplatform.googleapis.com compute.googleapis.com

# Cloud Storage bucket for storing the experiment artifacts.
# A unique GCS bucket will be created for the purpose of this notebook. If you
# prefer using your own GCS bucket, change the value yourself below.
now = datetime.datetime.now().strftime("%Y%m%d%H%M%S")

if BUCKET_URI is None or BUCKET_URI.strip() == "" or BUCKET_URI == "gs://":
    BUCKET_URI = f"gs://{PROJECT_ID}-tmp-{now}-{str(uuid.uuid4())[:4]}"
    BUCKET_NAME = "/".join(BUCKET_URI.split("/")[:3])
    ! gsutil mb -l {REGION} {BUCKET_URI}
else:
    assert BUCKET_URI.startswith("gs://"), "BUCKET_URI must start with `gs://`."
    # Keep only the bucket root, e.g. gs://my-bucket from gs://my-bucket/path.
    BUCKET_NAME = "/".join(BUCKET_URI.split("/")[:3])
    shell_output = ! gsutil ls -Lb {BUCKET_NAME} | grep "Location constraint:" | sed "s/Location constraint://"
    bucket_region = shell_output[0].strip().lower()
    if bucket_region != REGION:
        raise ValueError(
            "Bucket region %s is different from notebook region %s"
            % (bucket_region, REGION)
        )
print(f"Using this GCS Bucket: {BUCKET_URI}")

STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")
MODEL_BUCKET = os.path.join(BUCKET_URI, "llama3_1")


# Initialize Vertex AI API.
print("Initializing Vertex AI API.")
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

# Gets the default SERVICE_ACCOUNT.
shell_output = ! gcloud projects describe $PROJECT_ID
project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"
print("Using this default Service Account:", SERVICE_ACCOUNT)


# Provision permissions to the SERVICE_ACCOUNT with the GCS bucket
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.admin $BUCKET_NAME

! gcloud config set project $PROJECT_ID
! gcloud projects add-iam-policy-binding --no-user-output-enabled {PROJECT_ID} --member=serviceAccount:{SERVICE_ACCOUNT} --role="roles/storage.admin"
! gcloud projects add-iam-policy-binding --no-user-output-enabled {PROJECT_ID} --member=serviceAccount:{SERVICE_ACCOUNT} --role="roles/aiplatform.user"

# @markdown # Access Llama 3.1 models on Vertex AI for serving
# @markdown The original models from Meta are converted into the Hugging Face format for serving in Vertex AI.
# @markdown Accept the model agreement to access the models:
# @markdown 1. Open the [Llama 3.1 model card](https://console.cloud.google.com/vertex-ai/publishers/meta/model-garden/llama3_1) from [Vertex AI Model Garden](https://cloud.google.com/model-garden).
# @markdown 2. Review and accept the agreement in the pop-up window on the model card page. If you have previously accepted the model agreement, there will not be a pop-up window on the model card page and this step is not needed.
# @markdown 3. After accepting the agreement of Llama 3.1, a `gs://` URI containing Llama 3.1 pretrained and instruction-tuned models will be shared.
# @markdown 4. Paste the URI in the `VERTEX_AI_MODEL_GARDEN_LLAMA_3_1` field below.


VERTEX_AI_MODEL_GARDEN_LLAMA_3_1 = ""  # @param {type:"string", isTemplate:true}
assert (
    VERTEX_AI_MODEL_GARDEN_LLAMA_3_1
), "Click the agreement of Llama 3.1 in Vertex AI Model Garden, and get the GCS path of Llama 3.1 model artifacts."

### Select one of the following three ways to deploy the model

In [None]:
# @title Deploy prebuilt Llama 3.1 8B Instruct with standard vLLM and Fast Deployment

# @markdown This section demonstrates how to use the Fast Deployment feature.

# @markdown The Fast Deployment feature prioritizes speed for model exploration, making it ideal for initial testing and experimentation. For sensitive data or production workloads, use the Standard environment for enhanced security and stability.

# @markdown Note that only a subset of the models support the Fast Deployment feature.

FAST_DEPLOYMENT_REGION = "us-central1"  # @param ["us-central1"] {isTemplate:true}

API_ENDPOINT = f"{FAST_DEPLOYMENT_REGION}-aiplatform.googleapis.com"


def fast_deploy(
    publisher: str, publisher_model_id: str, version_id: str
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    url = f"https://{API_ENDPOINT}/v1/publishers/{publisher}/models/{publisher_model_id}@{version_id}"
    access_token = ! gcloud auth print-access-token
    access_token = access_token[0]

    response = requests.get(
        url,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
    response.raise_for_status()
    response = response.json()
    if (
        len(
            response.get("supportedActions", {})
            .get("multiDeployVertex", {})
            .get("multiDeployVertex", {})
        )
        == 0
    ):
        raise ValueError(
            f"No supportedActions.multiDeployVertex found in {FAST_DEPLOYMENT_REGION}. You can skip"
            " this section or try a different region."
        )
    deploy_configs = response["supportedActions"]["multiDeployVertex"][
        "multiDeployVertex"
    ]
    fast_deploy_config = [
        config
        for config in deploy_configs
        if config["deployMetadata"]
        .get("labels", {})
        .get("show-faster-deployment-option")
        == "true"
    ]
    if fast_deploy_config:
        fast_deploy_config = fast_deploy_config[0]
    else:
        raise ValueError(
            f"No Fast Deployment config found in {FAST_DEPLOYMENT_REGION}. You can skip this"
            " section or try a different region."
        )

    container_spec = fast_deploy_config["containerSpec"]
    machine_spec = fast_deploy_config["dedicatedResources"]["machineSpec"]
    machine_type = machine_spec["machineType"]
    accelerator_type = machine_spec["acceleratorType"]
    accelerator_count = machine_spec["acceleratorCount"]
    env = {item["name"]: item["value"] for item in container_spec.get("env", [])}
    if "DEPLOY_SOURCE" in env:
        del env["DEPLOY_SOURCE"]
    port = container_spec.get("ports", [{}])[0].get("containerPort")

    model = aiplatform.Model.upload(
        display_name=fast_deploy_config.get("modelDisplayName"),
        serving_container_image_uri=container_spec.get("imageUri"),
        serving_container_args=container_spec.get("args"),
        serving_container_environment_variables=env,
        serving_container_ports=[port],
        serving_container_predict_route=container_spec.get("predictRoute"),
        serving_container_health_route=container_spec.get("healthRoute"),
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=7200,
        location=FAST_DEPLOYMENT_REGION,
        model_garden_source_model_name=(
            f"publishers/{publisher}/models/{publisher_model_id}"
        ),
    )
    endpoint = aiplatform.Endpoint.create(
        display_name=model.name + "-endpoint",
        location=FAST_DEPLOYMENT_REGION,
        dedicated_endpoint_enabled=True,
    )
    print(
        f"Deploying {model.name} on {machine_type} with"
        f" {accelerator_count} {accelerator_type} GPU(s)."
    )
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        disable_container_logging=True,
        fast_tryout_enabled=True,
        system_labels={
            "DEPLOY_SOURCE": "notebook",
            "NOTEBOOK_NAME": "model_garden_pytorch_llama3_1_agent_engine.ipynb",
        },
    )
    print("endpoint_name:", endpoint.name)
    return model, endpoint


# @markdown The Llama 3.1 8B Instruct model is deployed to a dedicated endpoint on an `a2-ultragpu-1g` machine with Fast Deployment.
# @markdown **Fast Deployment is currently supported only in the `us-central1` region.**

use_dedicated_endpoint = True  # Fast Deployment only supports dedicated endpoints.
models["vllm_fast"], endpoints["vllm_fast"] = fast_deploy(
    "meta", "llama3_1", "llama-3.1-8b-instruct"
)
ENDPOINT_RESOURCE_NAME = endpoints["vllm_fast"].resource_name
# The endpoint is created in FAST_DEPLOYMENT_REGION, which may differ from REGION.
BASE_URL = (
    f"https://{FAST_DEPLOYMENT_REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"
)
try:
    if use_dedicated_endpoint:
        DEDICATED_ENDPOINT_DNS = endpoints[
            "vllm_fast"
        ].gca_resource.dedicated_endpoint_dns
        BASE_URL = f"https://{DEDICATED_ENDPOINT_DNS}/v1beta1/{ENDPOINT_RESOURCE_NAME}"
except NameError:
    pass
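
The selection step inside `fast_deploy` can be exercised offline. The sketch below reproduces only the filtering logic, reading the same keys as the cell above (`supportedActions`, `multiDeployVertex`, `deployMetadata`, `labels`). The function name `select_fast_deploy_config` and the sample response are made up for illustration and are not real API output.

```python
from typing import Any, Dict, List


def select_fast_deploy_config(publisher_model: Dict[str, Any]) -> Dict[str, Any]:
    """Returns the first deploy config labeled for Fast Deployment."""
    deploy_configs: List[Dict[str, Any]] = (
        publisher_model.get("supportedActions", {})
        .get("multiDeployVertex", {})
        .get("multiDeployVertex", [])
    )
    for config in deploy_configs:
        labels = config.get("deployMetadata", {}).get("labels", {})
        if labels.get("show-faster-deployment-option") == "true":
            return config
    raise ValueError("No Fast Deployment config found.")


# Made-up response fragment containing only the fields read above.
sample_response = {
    "supportedActions": {
        "multiDeployVertex": {
            "multiDeployVertex": [
                {"modelDisplayName": "standard", "deployMetadata": {}},
                {
                    "modelDisplayName": "fast",
                    "deployMetadata": {
                        "labels": {"show-faster-deployment-option": "true"}
                    },
                },
            ]
        }
    }
}
selected = select_fast_deploy_config(sample_response)
print(selected["modelDisplayName"])
```

Using `.get` with defaults at every level means a response without any deploy configs ends in a single `ValueError` rather than a `KeyError` partway through.
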

In [None]:
# @title Deploy prebuilt Llama 3.1 8B, 70B and 405B Instruct with standard vLLM

# @markdown This section uploads a prebuilt Llama 3.1 model to Model Registry and deploys it to a Vertex AI Endpoint. It takes 15 minutes to 1 hour to finish, depending on the size of the model.

# @markdown NVIDIA_L4 GPUs are used for demonstration. The serving efficiency of L4 GPUs is inferior to that of H100 GPUs, but L4 GPUs are nevertheless good serving solutions if you do not have H100 quota.

# @markdown H100 capacity is currently limited. We recommend using the deployment button on the model card. You can still try to deploy an H100 endpoint from this notebook, but the resources may not be available.

# @markdown Set the model to deploy.

base_model_name = "Meta-Llama-3.1-8B-Instruct"  # @param ["Meta-Llama-3.1-8B-Instruct", "Meta-Llama-3.1-70B-Instruct", "Meta-Llama-3.1-405B-Instruct-FP8"] {isTemplate:true}
model_id = os.path.join(VERTEX_AI_MODEL_GARDEN_LLAMA_3_1, base_model_name)
ENABLE_DYNAMIC_LORA = True  # @param {type:"boolean", isTemplate:true}
hf_model_id = "meta-llama/" + base_model_name

# The pre-built serving docker images.
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20241210_0916_RC00"

# @markdown Set `use_dedicated_endpoint` to False if you don't want to use a [dedicated endpoint](https://cloud.google.com/vertex-ai/docs/general/deployment#create-dedicated-endpoint).
use_dedicated_endpoint = True  # @param {type:"boolean"}
# @markdown Find Vertex AI prediction supported accelerators and regions at https://cloud.google.com/vertex-ai/docs/predictions/configure-compute.
if "8b" in base_model_name.lower():
    accelerator_type = "NVIDIA_L4"
    machine_type = "g2-standard-12"
    accelerator_count = 1
    max_loras = 5
elif "70b" in base_model_name.lower():
    accelerator_type = "NVIDIA_H100_80GB"
    machine_type = "a3-highgpu-4g"
    accelerator_count = 4
    max_loras = 1
elif "405b" in base_model_name.lower():
    accelerator_type = "NVIDIA_H100_80GB"
    machine_type = "a3-highgpu-8g"
    accelerator_count = 8
    max_loras = 1
else:
    # accelerator_type is not defined in this branch, so report only the model name.
    raise ValueError(
        f"Recommended GPU setting not found for: {base_model_name}."
    )

common_util.check_quota(
    project_id=PROJECT_ID,
    region=REGION,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    is_for_training=False,
)

gpu_memory_utilization = 0.95
max_model_len = 8192  # Maximum context length.


# Enable automatic prefix caching using GPU HBM
enable_prefix_cache = True
# Setting this value >0 will use the idle host memory for a second-tier prefix kv
# cache beneath the HBM cache. It only has effect if enable_prefix_cache=True.
# The range of this value: [0, 1)
# Setting host_prefix_kv_cache_utilization_target to 0 will disable the host memory prefix kv cache.
host_prefix_kv_cache_utilization_target = 0.7


def deploy_model_vllm(
    model_name: str,
    model_id: str,
    publisher: str,
    publisher_model_id: str,
    service_account: str,
    base_model_id: str = None,
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    gpu_memory_utilization: float = 0.9,
    max_model_len: int = 4096,
    dtype: str = "auto",
    enable_trust_remote_code: bool = False,
    enforce_eager: bool = False,
    enable_lora: bool = False,
    enable_chunked_prefill: bool = False,
    enable_prefix_cache: bool = False,
    host_prefix_kv_cache_utilization_target: float = 0.0,
    max_loras: int = 1,
    max_cpu_loras: int = 8,
    use_dedicated_endpoint: bool = False,
    max_num_seqs: int = 256,
    model_type: str = None,
    enable_llama_tool_parser: bool = False,
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    """Deploys trained models with vLLM into Vertex AI."""
    endpoint = aiplatform.Endpoint.create(
        display_name=f"{model_name}-endpoint",
        dedicated_endpoint_enabled=use_dedicated_endpoint,
    )

    if not base_model_id:
        base_model_id = model_id

    # See https://docs.vllm.ai/en/latest/models/engine_args.html for a list of possible arguments with descriptions.
    vllm_args = [
        "python",
        "-m",
        "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--gpu-memory-utilization={gpu_memory_utilization}",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        f"--max-loras={max_loras}",
        f"--max-cpu-loras={max_cpu_loras}",
        f"--max-num-seqs={max_num_seqs}",
        "--disable-log-stats",
    ]

    if enable_trust_remote_code:
        vllm_args.append("--trust-remote-code")

    if enforce_eager:
        vllm_args.append("--enforce-eager")

    if enable_lora:
        vllm_args.append("--enable-lora")

    if enable_chunked_prefill:
        vllm_args.append("--enable-chunked-prefill")

    if enable_prefix_cache:
        vllm_args.append("--enable-prefix-caching")

    if 0 < host_prefix_kv_cache_utilization_target < 1:
        vllm_args.append(
            f"--host-prefix-kv-cache-utilization-target={host_prefix_kv_cache_utilization_target}"
        )

    if model_type:
        vllm_args.append(f"--model-type={model_type}")

    if enable_llama_tool_parser:
        vllm_args.append("--enable-auto-tool-choice")
        vllm_args.append("--tool-call-parser=vertex-llama-3")

    env_vars = {
        "MODEL_ID": base_model_id,
        "DEPLOY_SOURCE": "notebook",
    }

    # HF_TOKEN is not a compulsory field and may not be defined.
    try:
        if HF_TOKEN:
            env_vars["HF_TOKEN"] = HF_TOKEN
    except NameError:
        pass

    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=7200,
        model_garden_source_model_name=(
            f"publishers/{publisher}/models/{publisher_model_id}"
        ),
    )
    print(
        f"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s)."
    )
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
        system_labels={
            "NOTEBOOK_NAME": "model_garden_pytorch_llama3_1_agent_engine.ipynb",
            "NOTEBOOK_ENVIRONMENT": common_util.get_deploy_source(),
        },
    )
    print("endpoint_name:", endpoint.name)

    return model, endpoint


models["vllm_gpu"], endpoints["vllm_gpu"] = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(prefix="llama3_1-serve"),
    model_id=model_id,
    publisher="meta",
    publisher_model_id="llama3_1",
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    gpu_memory_utilization=gpu_memory_utilization,
    max_model_len=max_model_len,
    max_loras=max_loras,
    enforce_eager=True,
    enable_lora=ENABLE_DYNAMIC_LORA,
    enable_chunked_prefill=not ENABLE_DYNAMIC_LORA,
    enable_prefix_cache=enable_prefix_cache,
    host_prefix_kv_cache_utilization_target=host_prefix_kv_cache_utilization_target,
    use_dedicated_endpoint=use_dedicated_endpoint,
)

if use_dedicated_endpoint:
    DEDICATED_ENDPOINT_DNS = endpoints["vllm_gpu"].gca_resource.dedicated_endpoint_dns
ENDPOINT_RESOURCE_NAME = "projects/{}/locations/{}/endpoints/{}".format(
    PROJECT_ID, REGION, endpoints["vllm_gpu"].name
)

BASE_URL = (
    f"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"
)
try:
    if use_dedicated_endpoint:
        BASE_URL = f"https://{DEDICATED_ENDPOINT_DNS}/v1beta1/{ENDPOINT_RESOURCE_NAME}"
except NameError:
    pass

# @markdown Click "Show Code" to see more details.
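
The `if`/`elif` chain that picks hardware for each model size can also be written as a lookup table, which keeps the recommended settings in one place. This is a sketch only: the values are copied from the standard vLLM cell above, while the names `MachineSpec` and `recommended_machine_spec` are hypothetical.

```python
from typing import NamedTuple


class MachineSpec(NamedTuple):
    machine_type: str
    accelerator_type: str
    accelerator_count: int
    max_loras: int


# Values copied from the standard vLLM cell above.
_SPECS = {
    "8b": MachineSpec("g2-standard-12", "NVIDIA_L4", 1, 5),
    "70b": MachineSpec("a3-highgpu-4g", "NVIDIA_H100_80GB", 4, 1),
    "405b": MachineSpec("a3-highgpu-8g", "NVIDIA_H100_80GB", 8, 1),
}


def recommended_machine_spec(base_model_name: str) -> MachineSpec:
    """Maps a model name such as Meta-Llama-3.1-8B-Instruct to a machine spec."""
    name = base_model_name.lower()
    for size, spec in _SPECS.items():
        # Match the size as a whole name segment, e.g. "-8b-".
        if f"-{size}-" in name:
            return spec
    raise ValueError(f"Recommended GPU setting not found for: {base_model_name}.")


spec = recommended_machine_spec("Meta-Llama-3.1-70B-Instruct")
print(spec.machine_type, spec.accelerator_type, spec.accelerator_count)
```
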

In [None]:
# @title Deploy prebuilt Llama 3.1 8B instruct and 70B instruct with optimized vLLM

# @markdown This section uploads a prebuilt Llama 3.1 model to Model Registry and deploys it to a Vertex AI Endpoint. It takes 15 minutes to 1 hour to finish, depending on the size of the model.

# @markdown NVIDIA_L4 GPUs are used for demonstration. The serving efficiency of L4 GPUs is inferior to that of H100 GPUs, but L4 GPUs are nevertheless good serving solutions if you do not have H100 quota.

# @markdown H100 capacity is currently limited. We recommend using the deployment button on the model card. You can still try to deploy an H100 endpoint from this notebook, but the resources may not be available.

# @markdown Set the model to deploy.

base_model_name = "Meta-Llama-3.1-8B-Instruct"  # @param ["Meta-Llama-3.1-8B-Instruct", "Meta-Llama-3.1-70B-Instruct"] {isTemplate:true}
model_id = os.path.join(VERTEX_AI_MODEL_GARDEN_LLAMA_3_1, base_model_name)
hf_model_id = "meta-llama/" + base_model_name

# The pre-built serving docker images.
OPTIMIZED_VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai-restricted/vertex-vision-model-garden-dockers/pytorch-vllm-optimized-serve:20241029_0835_RC00"

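# @markdown Set `use_dedicated_endpoint` to False if you don't want to use a [dedicated endpoint](https://cloud.google.com/vertex-ai/docs/general/deployment#create-dedicated-endpoint).
use_dedicated_endpoint = True  # @param {type:"boolean"}
# @markdown Find Vertex AI prediction supported accelerators and regions at https://cloud.google.com/vertex-ai/docs/predictions/configure-compute.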
if "8b" in base_model_name.lower():
    accelerator_type = "NVIDIA_L4"
    machine_type = "g2-standard-12"
    accelerator_count = 1
elif "70b" in base_model_name.lower():
    accelerator_type = "NVIDIA_H100_80GB"
    machine_type = "a3-highgpu-8g"
    accelerator_count = 8
else:
    # accelerator_type is not defined in this branch, so report only the model name.
    raise ValueError(
        f"Recommended GPU setting not found for: {base_model_name}."
    )

common_util.check_quota(
    project_id=PROJECT_ID,
    region=REGION,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    is_for_training=False,
)

max_model_len = 8192  # Maximum context length.


def deploy_model_optimized_vllm(
    model_name: str,
    model_id: str,
    publisher: str,
    publisher_model_id: str,
    service_account: str,
    base_model_id: str = None,
    machine_type: str = "g2-standard-12",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    max_model_len: int = 4096,
    enable_trust_remote_code: bool = False,
    use_dedicated_endpoint: bool = False,
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    """Deploys models with optimized vLLM into Vertex AI."""
    endpoint = aiplatform.Endpoint.create(
        display_name=f"{model_name}-endpoint",
        dedicated_endpoint_enabled=use_dedicated_endpoint,
    )

    if not base_model_id:
        base_model_id = model_id

    vllm_args = [
        "python",
        "-m",
        "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--max-model-len={max_model_len}",
        "--disable-log-stats",
    ]

    if enable_trust_remote_code:
        vllm_args.append("--trust-remote-code")

    env_vars = {
        "MODEL_ID": base_model_id,
        "DEPLOY_SOURCE": "notebook",
    }

    # HF_TOKEN is not a compulsory field and may not be defined.
    try:
        if HF_TOKEN:
            env_vars["HF_TOKEN"] = HF_TOKEN
    except NameError:
        pass

    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=OPTIMIZED_VLLM_DOCKER_URI,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=7200,
        model_garden_source_model_name=(
            f"publishers/{publisher}/models/{publisher_model_id}"
        ),
    )
    print(
        f"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s)."
    )
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
        system_labels={
            "NOTEBOOK_NAME": "model_garden_pytorch_llama3_1_agent_engine.ipynb",
        },
    )
    print("endpoint_name:", endpoint.name)

    return model, endpoint


(
    models["optimized_vllm_gpu"],
    endpoints["optimized_vllm_gpu"],
) = deploy_model_optimized_vllm(
    model_name=common_util.get_job_name_with_datetime(prefix="llama3_1-serve"),
    model_id=model_id,
    publisher="meta",
    publisher_model_id="llama3_1",
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    max_model_len=max_model_len,
    use_dedicated_endpoint=use_dedicated_endpoint,
)

# Use the endpoint created by this cell, not the one from the standard vLLM cell.
if use_dedicated_endpoint:
    DEDICATED_ENDPOINT_DNS = endpoints[
        "optimized_vllm_gpu"
    ].gca_resource.dedicated_endpoint_dns
ENDPOINT_RESOURCE_NAME = "projects/{}/locations/{}/endpoints/{}".format(
    PROJECT_ID, REGION, endpoints["optimized_vllm_gpu"].name
)

BASE_URL = (
    f"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"
)
try:
    if use_dedicated_endpoint:
        BASE_URL = f"https://{DEDICATED_ENDPOINT_DNS}/v1beta1/{ENDPOINT_RESOURCE_NAME}"
except NameError:
    pass
# @markdown Click "Show Code" to see more details.
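
Each deployment cell ends by assembling `BASE_URL` from either the regional API host or the dedicated endpoint DNS. The sketch below isolates that rule in one function so both cases are easy to compare. The function name `build_base_url` and all sample values, including the DNS name, are placeholders rather than real resources.

```python
from typing import Optional


def build_base_url(
    project_id: str,
    region: str,
    endpoint_id: str,
    dedicated_endpoint_dns: Optional[str] = None,
    api_version: str = "v1beta1",
) -> str:
    """Builds the base URL used by the OpenAI-compatible client."""
    resource_name = (
        f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}"
    )
    # A dedicated endpoint has its own DNS name; otherwise use the regional host.
    host = dedicated_endpoint_dns or f"{region}-aiplatform.googleapis.com"
    return f"https://{host}/{api_version}/{resource_name}"


# Placeholder values for illustration only.
shared_url = build_base_url("my-project", "us-central1", "1234567890")
dedicated_url = build_base_url(
    "my-project",
    "us-central1",
    "1234567890",
    dedicated_endpoint_dns="dedicated.example.invalid",
)
print(shared_url)
print(dedicated_url)
```
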

### Authenticate your notebook environment (Colab only)

In [None]:
import sys

if "google.colab" in sys.modules:

    from google.colab import auth

    auth.authenticate_user()

In [None]:
# @title Set cloud storage

# @markdown To get started using Vertex AI, you must [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com) and create a Cloud Storage bucket for Agent Engine.

BUCKET_NAME = ""  # @param {type:"string", placeholder: "[your-bucket-name]"}
STAGING_BUCKET = f"gs://{BUCKET_NAME}"

In [None]:
# @title Initialize Vertex AI SDK for Python

import vertexai

vertexai.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

In [None]:
# @title Import libraries

# Import libraries to use in this tutorial.

import google.auth
import google.auth.transport.requests  # Needed for Request() in model_builder.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from vertexai import agent_engines
from vertexai.preview import reasoning_engines

### Chat with `Agent Engine`

In [None]:
# @title Use Llama 3.1 with `Agent Engine` under different configurations

# @markdown To use the self-deployed API endpoint with Agent Engine, you need to request an access token and configure the LangChain `ChatOpenAI` client to point to the API endpoint.

# @markdown The previous [notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_openai_api_llama3_1.ipynb) demonstrated how to `Ask Llama 3.1 using different model configuration`.

# @markdown This notebook shows how to use `Agent Engine` to send requests to the Llama 3.1 model with different model configurations.


def model_builder(
    *,
    model_name: str,
    model_kwargs=None,
    project: str,  # Specified via vertexai.init
    location: str,  # Specified via vertexai.init
    **kwargs,
):

    # Note: the credential lives for 1 hour by default.
    # After expiration, it must be refreshed.
    creds, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    auth_req = google.auth.transport.requests.Request()
    creds.refresh(auth_req)

    if model_kwargs is None:
        model_kwargs = {}

    return ChatOpenAI(
        model="",
        base_url=BASE_URL,
        api_key=creds.token,
        **model_kwargs,
    )


# @markdown Use the following parameters to generate different answers:
# @markdown *   `temperature` to control the randomness of the response
# @markdown *   `top_p` to restrict sampling to the smallest set of tokens whose cumulative probability reaches `top_p` (nucleus sampling)

temperature = 1.0  # @param {type:"number"}
top_p = 1.0  # @param {type:"number"}

agent = reasoning_engines.LangchainAgent(
    model="",  # Required.
    model_builder=model_builder,  # Required.
    model_kwargs={
        "temperature": temperature,  # Optional.
        "top_p": top_p,  # Optional.
        "extra_body": {},
    },
)

# @markdown Now we can test the model and agent behavior to ensure that it's working as expected before we deploy it:

response = agent.query(input="Hello, Llama 3.1!")
print(response)
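
The comment in `model_builder` notes that the credential expires after about an hour. The following is a minimal sketch of one way to handle that: cache the token and fetch a new one shortly before expiry. `TokenCache`, `fetch_token`, and the 5-minute margin are hypothetical names and values chosen for illustration. In this notebook, the fetcher would wrap the `creds.refresh(...)` call, and how the new token reaches `ChatOpenAI` depends on how the model object is rebuilt.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Caches a bearer token and refetches it shortly before it expires.

    Hypothetical helper for illustration; not part of the Vertex AI SDK.
    """

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],
        margin_seconds: float = 300.0,
        clock: Callable[[], float] = time.time,
    ):
        # fetch_token returns (token, expiry as a Unix timestamp).
        self._fetch_token = fetch_token
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expiry = 0.0

    def get(self) -> str:
        # Refetch when there is no token yet or it is within the safety margin.
        if self._token is None or self._clock() >= self._expiry - self._margin:
            self._token, self._expiry = self._fetch_token()
        return self._token


# Stand-in fetcher with a fake clock; a real one would refresh credentials.
calls = []
now = [1000.0]


def fake_fetch() -> Tuple[str, float]:
    calls.append(now[0])
    return f"token-{len(calls)}", now[0] + 3600.0


cache = TokenCache(fake_fetch, clock=lambda: now[0])
first = cache.get()  # No token yet, so this fetches.
second = cache.get()  # Served from the cache.
now[0] += 3400.0  # Move to within 5 minutes of expiry.
third = cache.get()  # Fetches again.
print(first, second, third)
```

Injecting the clock keeps the expiry logic testable without waiting for a real token to expire.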

In [None]:
# @title Deploy your agent on Vertex AI

# @markdown Now that you've specified the model and reasoning for your agent and tested it, you're ready to deploy it as a remote service on Vertex AI.

remote_agent = agent_engines.create(
    agent,
    requirements=[
        "google-cloud-aiplatform[langchain,agent_engines]",
        "cloudpickle==3.0.0",
        "pydantic==2.10.6",
        "requests",
        "langchain-openai",
    ],
)

response = remote_agent.query(input="Hello, Llama 3.1!")
print(response)

In [None]:
# @title Reusing your deployed agent from other applications or SDKs

# @markdown The remotely deployed `Agent Engine` is now available for import and use. You can access it within your current notebook session, a different notebook, or a Python script.

AGENT_ENGINE_RESOURCE_NAME = remote_agent.resource_name
print(AGENT_ENGINE_RESOURCE_NAME)

# Afterwards, you can use the code below:

# from vertexai import agent_engines

# remote_agent = agent_engines.get(AGENT_ENGINE_RESOURCE_NAME)
# response = remote_agent.query(input=query)

# @markdown Alternatively, you can query your agent from other programming languages using any of the [available client libraries in Vertex AI](https://cloud.google.com/vertex-ai/docs/start/client-libraries), including C#, Java, Node.js, Python, Go, or REST API.
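
When the agent is reused from another script, the printed resource name is the only handle that script needs. The sketch below splits it into its parts, for example to pass the matching project and location to `vertexai.init`. It assumes the name follows the `projects/{project}/locations/{location}/reasoningEngines/{id}` pattern; `parse_agent_engine_name` is a hypothetical helper and the example values are made up.

```python
import re
from typing import Dict

# Assumed resource name layout for a deployed agent.
_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/reasoningEngines/(?P<engine_id>[^/]+)$"
)


def parse_agent_engine_name(resource_name: str) -> Dict[str, str]:
    """Splits a resource name into project, location, and engine ID."""
    match = _PATTERN.match(resource_name)
    if match is None:
        raise ValueError(f"Unexpected resource name: {resource_name!r}")
    return match.groupdict()


parts = parse_agent_engine_name(
    "projects/123456/locations/us-central1/reasoningEngines/987654"
)
print(parts["project"], parts["location"], parts["engine_id"])
```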

### Simple Translator Agent

In [None]:
# @title Use Agent Engine to build a simple translator agent

# @markdown In a previous [notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_openai_api_llama3_1.ipynb), we demonstrated how to use `LangChain Expression Language` (LCEL) to build a simple chain that translates some `text_to_translate` to the specified `target_language`.

# @markdown In this colab, we show how to use `Agent Engine` to build and deploy the same chain as an agent.


def lcel_builder(*, model, **kwargs):

    template = """Translate the following {text} to {target_language}:"""
    prompt = PromptTemplate(
        input_variables=["text", "target_language"], template=template
    )

    return prompt | model | StrOutputParser()


agent = reasoning_engines.LangchainAgent(
    model="",
    model_builder=model_builder,
    runnable_builder=lcel_builder,
)

text_to_translate = ""  # @param {type:"string", placeholder:"Hello Llama 3.1!"}
target_language = ""  # @param {type:"string", placeholder:"Italian"}

response = agent.query(
    input={"text": text_to_translate, "target_language": target_language}
)
print(response)
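
To see what the model receives, the template can be rendered with plain string formatting. For simple `{name}` placeholders like these, this should match what `PromptTemplate` produces with its default format. The sketch uses only the standard library; `render_prompt`, the sample inputs, and the alternative template are hypothetical and shown only for comparison.

```python
# Same template string as in lcel_builder above.
template = """Translate the following {text} to {target_language}:"""

# Hypothetical variant that keeps the instruction and the source text apart.
alternative_template = "Translate the following text to {target_language}:\n\n{text}"


def render_prompt(prompt_template: str, text: str, target_language: str) -> str:
    """Fills the two placeholders the translator chain expects."""
    return prompt_template.format(text=text, target_language=target_language)


rendered = render_prompt(template, "Hello Llama 3.1!", "Italian")
rendered_alt = render_prompt(alternative_template, "Hello Llama 3.1!", "Italian")
print(rendered)
print(rendered_alt)
```

With the original template the source text lands in the middle of the instruction sentence. Whether a variant that separates the two gives better translations is something to check against the deployed model.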

In [None]:
# @title Deploy your agent on Vertex AI

# @markdown Now that you've specified the model and reasoning for your agent and tested it, you're ready to deploy it as a remote service on Vertex AI.

remote_agent = agent_engines.create(
    agent,
    requirements=[
        "google-cloud-aiplatform[langchain,agent_engines]",
        "cloudpickle==3.0.0",
        "pydantic==2.10.6",
        "requests",
        "langchain-openai",
    ],
)

response = remote_agent.query(
    input={"text": text_to_translate, "target_language": target_language}
)
print(response)

In [None]:
# @title Reusing your deployed agent from other applications or SDKs

# @markdown The remotely deployed `Agent Engine` is now available for import and use. You can access it within your current notebook session, a different notebook, or a Python script.

AGENT_ENGINE_RESOURCE_NAME = remote_agent.resource_name
print(AGENT_ENGINE_RESOURCE_NAME)

# Afterwards, you can use the code below:

# from vertexai import agent_engines

# remote_agent = agent_engines.get(AGENT_ENGINE_RESOURCE_NAME)
# response = remote_agent.query(input=query)

# @markdown Alternatively, you can query your agent from other programming languages using any of the [available client libraries in Vertex AI](https://cloud.google.com/vertex-ai/docs/start/client-libraries), including C#, Java, Node.js, Python, Go, or REST API.

### Exchange Rate Tool

[Function calling](https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/function-calling) lets developers create a description of a function in their code, then pass that description to a language model in a request. The response from the model includes the name of a function that matches the description and the arguments to call it with.

In this example, we will use an Exchange Rate tool in the Agent Engine.
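
As a rough illustration of the idea, the sketch below derives a description from a Python function's signature and docstring, then dispatches a call from a name and arguments. It is a simplified stand-in, not what LangChain or Vertex AI does internally. `describe_function`, the placeholder function body, and the `model_call` dict are hypothetical.

```python
import inspect
from typing import Any, Callable, Dict


def describe_function(func: Callable[..., Any]) -> Dict[str, Any]:
    """Builds a simple description from a function's signature and docstring."""
    parameters: Dict[str, Dict[str, Any]] = {}
    for name, param in inspect.signature(func).parameters.items():
        entry: Dict[str, Any] = {}
        if param.annotation is not inspect.Parameter.empty:
            entry["type"] = getattr(param.annotation, "__name__", str(param.annotation))
        if param.default is not inspect.Parameter.empty:
            entry["default"] = param.default
        parameters[name] = entry
    doc = inspect.getdoc(func) or ""
    return {
        "name": func.__name__,
        "description": doc.splitlines()[0] if doc else "",
        "parameters": parameters,
    }


def get_exchange_rate(
    currency_from: str = "USD",
    currency_to: str = "EUR",
    currency_date: str = "latest",
):
    """Retrieves the exchange rate between two currencies on a specified date."""
    # Placeholder body: echoes the arguments instead of calling an HTTP API.
    return {"from": currency_from, "to": currency_to, "date": currency_date}


description = describe_function(get_exchange_rate)
print(description)

# A function-calling response names a function and its arguments.
# This dict is a made-up example of that shape, not real model output.
model_call = {"name": "get_exchange_rate", "args": {"currency_to": "SEK"}}
registry = {get_exchange_rate.__name__: get_exchange_rate}
result = registry[model_call["name"]](**model_call["args"])
print(result)
```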

In [None]:
# @title Agent that uses an Exchange Rate Tool

# @markdown Tools and functions enable the generative model to interact with external systems, databases, document stores, and other APIs so that the model can get the most up-to-date information or take action with those systems.

# @markdown In this example, you'll define a function called `get_exchange_rate` that uses the `requests` library to retrieve real-time currency exchange information from an API:


def get_exchange_rate(
    currency_from: str = "USD",
    currency_to: str = "EUR",
    currency_date: str = "latest",
):
    """Retrieves the exchange rate between two currencies on a specified date.
    Args:
        currency_from: The source currency code.
        currency_to: The target currency code.
        currency_date: The date to retrieve the exchange rate.
    Returns:
        Exchange rate between two currencies on a specified date.
    """
    response = requests.get(
        f"https://api.frankfurter.app/{currency_date}",
        params={"from": currency_from, "to": currency_to},
    )
    return response.json()


get_exchange_rate(currency_from="USD", currency_to="SEK")


agent = reasoning_engines.LangchainAgent(
    model="",  # Required.
    model_builder=model_builder,  # Required.
    tools=[get_exchange_rate],  # Optional.
    agent_executor_kwargs={
        "return_intermediate_steps": True,
        "stream_runnable": False,
    },  # Optional.
)

# @markdown Test the agent with a sample query to ensure that it's working as expected:
response = agent.query(
    input="What's the exchange rate from US dollars to Swedish currency on 2024-07-26?"
)
print(response)
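
Because the model chooses the tool arguments, `currency_date` and the currency codes reach the URL unchecked. The sketch below shows one possible way to validate them before a request is made. `build_exchange_rate_request` is a hypothetical helper, and the validation rules (three-letter codes, ISO dates or `latest`) are assumptions rather than documented requirements of the API.

```python
from datetime import date
from typing import Dict, Tuple

API_ROOT = "https://api.frankfurter.app"  # Same host as the tool above.


def build_exchange_rate_request(
    currency_from: str = "USD",
    currency_to: str = "EUR",
    currency_date: str = "latest",
) -> Tuple[str, Dict[str, str]]:
    """Validates the arguments and returns the URL and query parameters."""
    if currency_date != "latest":
        # Raises ValueError for anything that is not an ISO date,
        # then normalizes the value to YYYY-MM-DD.
        currency_date = date.fromisoformat(currency_date).isoformat()
    for code in (currency_from, currency_to):
        if not (len(code) == 3 and code.isalpha()):
            raise ValueError(f"Expected a three-letter currency code, got {code!r}")
    params = {"from": currency_from.upper(), "to": currency_to.upper()}
    return f"{API_ROOT}/{currency_date}", params


url, params = build_exchange_rate_request("usd", "SEK", "2024-07-26")
print(url, params)
```

The returned pair can then be passed to `requests.get(url, params=params, timeout=10)`; the timeout keeps a slow API from stalling the agent.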

In [None]:
# @title Deploy your agent on Vertex AI

# @markdown Now that you've specified a model and tools for your agent and tested it, you're ready to deploy it as a remote service on Vertex AI.

remote_agent = agent_engines.create(
    agent,
    requirements=[
        "google-cloud-aiplatform[langchain,agent_engines]",
        "cloudpickle==3.0.0",
        "pydantic==2.10.6",
        "requests",
        "langchain-openai",
    ],
)

response = remote_agent.query(
    input="What's the exchange rate from US dollars to Swedish currency on 2024-07-26?"
)
print(response)

In [None]:
# @title Reusing your deployed agent from other applications or SDKs

# @markdown The remotely deployed `Agent Engine` is now available for import and use. You can access it within your current notebook session, a different notebook, or a Python script.

AGENT_ENGINE_RESOURCE_NAME = remote_agent.resource_name
print(AGENT_ENGINE_RESOURCE_NAME)

# Afterwards, you can use the code below:

# from vertexai import agent_engines

# remote_agent = agent_engines.get(AGENT_ENGINE_RESOURCE_NAME)
# response = remote_agent.query(input=query)

# @markdown Alternatively, you can query your agent from other programming languages using any of the [available client libraries in Vertex AI](https://cloud.google.com/vertex-ai/docs/start/client-libraries), including C#, Java, Node.js, Python, Go, or REST API.

## Clean up resources

In [None]:
# @title Delete the models, endpoints, buckets and agent engines

# @markdown  Delete the experiment models and endpoints to free up resources
# @markdown  and avoid ongoing charges.

# Undeploy model and delete endpoint.
for endpoint in endpoints.values():
    endpoint.delete(force=True)

# Delete models.
for model in models.values():
    model.delete()

delete_bucket = False  # @param {type:"boolean"}
if delete_bucket:
    ! gsutil -m rm -r $STAGING_BUCKET

delete_agent_engine = False  # @param {type:"boolean"}

if delete_agent_engine:
    remote_agent.delete()