In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI Model Garden - Serve Llama 3.1 with vLLM
## A Comprehensive Deployment Tutorial

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_vllm_text_only_tutorial.ipynb">
      <img alt="Google Cloud Colab Enterprise logo" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" width="32px"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_vllm_text_only_tutorial.ipynb">
      <img alt="GitHub logo" src="https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png" width="32px"><br> View on GitHub
    </a>
  </td>
</tr></tbody></table>

## Overview

Large Language Models (LLMs) like Llama 3.1 have emerged as powerful tools, capable of generating creative text, translating languages, and answering questions in an informative way. However, harnessing their full potential often hinges on efficient and scalable deployment strategies. This is where **vLLM** steps in, fundamentally transforming the landscape of LLM serving.

Imagine an orchestra: each instrument (or query) requires attention, and a traditional server struggles to coordinate them all simultaneously, leading to delays and a cacophony of frustrated users. vLLM, on the other hand, acts as a masterful conductor, efficiently orchestrating requests to maximize throughput and minimize latency.

**vLLM** is more than an optimization; it's a paradigm shift. This open-source library revolutionizes LLM serving by:

*   **Democratizing Access:** Enabling efficient and cost-effective deployment of powerful models like Llama 3.1, even on limited hardware. Think of it as making a Formula 1 racing engine accessible to your everyday sedan, boosting its performance far beyond expectations.

*   **Boosting Throughput:** Maximizing the number of requests handled per second, ensuring responsive and seamless user experiences. vLLM transforms your LLM from a single lane road to a multi-lane superhighway, capable of handling significantly increased traffic.

*   **Enabling Dynamic Customization:** Supporting dynamic LoRA (Low-Rank Adaptation), allowing you to change the model behavior on-the-fly without the need for retraining the entire model. This is akin to having interchangeable lenses for a camera, each optimized for different shooting conditions, enabling you to adapt the model to specific tasks with unparalleled precision.

### Google Vertex AI vLLM Customizations: Enhancing Performance and Integration

The vLLM implementation within Google Vertex AI Model Garden is not merely a direct integration of the open-source library. Vertex AI maintains a customized and optimized version of vLLM, specifically tailored to enhance performance, reliability, and seamless integration within the Google Cloud ecosystem. These customizations provide tangible benefits for users deploying LLMs on Vertex AI.

Key Vertex AI vLLM customizations include:

*   **Performance Optimizations:**
    *   **Parallel Downloading from Google Cloud Storage (GCS):** Significantly accelerates model loading and deployment times by enabling parallel data retrieval from GCS, reducing latency and improving startup speed.
*   **Feature Enhancements:**
    *   **Dynamic LoRA with Enhanced Caching and GCS Support:** Extends dynamic LoRA capabilities with local disk caching mechanisms and robust error handling, alongside support for loading LoRA weights directly from GCS paths and signed URLs, simplifying management and deployment of customized models.
    *   **Llama 3.1/3.2 Function Calling Parsing:** Implements specialized parsing for Llama 3.1/3.2 function calling, making the parsing of function calls more robust.
    *   **Host memory prefix caching:** Adds prefix caching in host memory; the open-source vLLM only supports prefix caching in GPU memory.
    *   **Speculative decoding:** This is an existing vLLM feature; Vertex AI ran experiments to identify high-performing model setups for it.
*   **Vertex AI Ecosystem Integration:**
    *   **Vertex Prediction Input/Output Format Support:** Ensures seamless compatibility with Vertex AI prediction input and output formats, simplifying data handling and integration with other Vertex AI services.
    *   **Vertex Environment Variable Awareness:** Respects and leverages Vertex AI environment variables (AIP_*) for configuration and resource management, streamlining deployment and ensuring consistent behavior within the Vertex AI environment.
    *   **Enhanced Error Handling and Robustness:** Implements comprehensive error handling, input/output validation, and server termination mechanisms to ensure stability, reliability, and seamless operation within the managed Vertex AI environment.
    *   **Nginx Server for Scalability:** Integrates an Nginx server on top of the vLLM server, facilitating the deployment of multiple replicas and enhancing scalability and high availability of the serving infrastructure.

These Vertex AI-specific customizations, while often transparent to the end-user, are crucial for delivering a production-ready, highly optimized, and seamlessly integrated vLLM serving experience within the Google Cloud ecosystem. By leveraging these enhancements, users can maximize the performance and efficiency of their Llama 3.1 deployments on Vertex AI Model Garden.
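
As a rough illustration of the environment variable awareness described above, the sketch below shows how a serving container might read the `AIP_*` variables that Vertex AI sets for custom containers. The `VertexServingConfig` type, the `read_vertex_serving_config` helper, and the fallback values are assumptions made for this example; they are not taken from the Vertex AI vLLM container.

```python
import os
from typing import Mapping, NamedTuple


class VertexServingConfig(NamedTuple):
    """Serving settings that Vertex AI passes to a custom container."""

    http_port: int
    predict_route: str
    health_route: str


def read_vertex_serving_config(
    env: Mapping[str, str] = os.environ,
) -> VertexServingConfig:
    """Reads the AIP_* variables, using illustrative fallbacks when unset."""
    return VertexServingConfig(
        # The fallback values below are assumptions for local testing only.
        http_port=int(env.get("AIP_HTTP_PORT", "8080")),
        predict_route=env.get("AIP_PREDICT_ROUTE", "/predict"),
        health_route=env.get("AIP_HEALTH_ROUTE", "/health"),
    )


# Simulate the environment that Vertex AI would inject into the container.
config = read_vertex_serving_config(
    {"AIP_HTTP_PORT": "7080", "AIP_PREDICT_ROUTE": "/generate"}
)
print(config)
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test outside of a Vertex AI deployment.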

## Objectives

This tutorial will guide you through the deployment process on Google Vertex AI Model Garden, showcasing how to harness the power of vLLM to unlock the full potential of Llama 3.1. We'll delve into:

1.  **vLLM's Core Innovations:** A high-level overview of the key technologies that drive vLLM's exceptional performance, including Paged Attention, Continuous Batching, and Optimized CUDA Kernels. Think of these as the secret ingredients that give vLLM its edge.

2.  **Vertex AI Model Garden: Your LLM Launchpad:** Discover how Google Vertex AI Model Garden provides a curated collection of pre-built Llama 3.1 models optimized for vLLM, simplifying deployment and eliminating complex configuration steps. This is your fast track to LLM success.

3.  **Streamlined Model Deployment:** Step-by-step instructions on deploying Llama 3.1 models with both standard and optimized vLLM configurations, demonstrating how to select the best approach for your specific needs. We'll cover all the essentials, from selecting the appropriate hardware to configuring the deployment environment.

4.  **Performance Optimization Techniques:** Learn how to leverage dynamic LoRA adapters to fine-tune model behavior for specific tasks and optimize resource utilization to minimize costs. This is where you'll learn how to squeeze every last drop of performance out of your LLM deployment.

This tutorial caters to users of all levels, from beginners eager to explore the world of LLMs to experienced practitioners seeking to optimize their existing deployments. Let's embark on this journey together and unlock the transformative power of Llama 3.1 with vLLM!

### Deployment Options
This notebook proceeds with the following deployment options.

1.  **Deploy prebuilt Llama 3.1 8B Instruct with standard vLLM and Fast Deployment:**  
This option prioritizes rapid deployment and exploration on an `a2-ultragpu-1g` machine with Fast Deployment in `us-central1`. Fast Deployment is used for initial validation.

2.  **Deploy prebuilt Llama 3.1 8B, 70B and 405B with standard vLLM:**  
This approach leverages the NVIDIA_L4 and NVIDIA_H100_80GB for general usage and demonstration of the implementation.

3.  **Deploy prebuilt Llama 3.1 8B and 70B with optimized vLLM:**  
This option deploys the Llama 3.1 models with the optimized vLLM serving container. The optimized container can provide better serving performance in some cases, so we generally recommend it when performance and throughput are the priority.

### Costs

This tutorial utilizes billable components of Google Cloud, including:

* Vertex AI
* Cloud Storage

Refer to [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing) to estimate costs based on your projected usage. The [Pricing Calculator](https://cloud.google.com/products/calculator/) is available to generate a more detailed cost breakdown.

## Before you begin

Before you begin executing this notebook, please review the following prerequisites. We guide you through the process of executing these steps and preparing your environment to serve Llama 3.1:

1. A Google Cloud Platform (GCP) project with billing enabled.
2. Sufficient GPU quota (request an increase if needed).
3. The Google Cloud SDK (gcloud) installed and configured.
4. The Vertex AI API and Compute Engine API are enabled.
5. A Cloud Storage bucket is created for storing experiment outputs.

## Request for quota

Before deploying Llama 3.1 for serving, it's critical to ensure you have sufficient GPU quota in your Google Cloud project. Quota represents the maximum amount of resources your project can utilize and is essential for successful deployment.

**Machine and Accelerator Types:**

*   **L4 GPUs:** For serving, a minimum of **1 L4 GPU** is required in the `us-central1` region. To verify your current L4 GPU quota in `us-central1`, click [here](https://console.cloud.google.com/iam-admin/quotas?location=us-central1&metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_l4_gpus). If your quota is insufficient, you will need to request an increase using [these instructions](https://cloud.google.com/docs/quotas/view-manage#viewing_your_quota_console).
*   **Alternative GPUs (A100 80GB & H100 80GB):** You can also run Llama 3.1 predictions on **A100 80GB** or **H100 80GB** GPUs. However, this requires that you have sufficient quota in the chosen region. It is critical to verify your quota for these GPUs using the following links:
    *   **Nvidia A100 80GB Quota:** [Click here](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus)
    *   **Nvidia H100 80GB Quota:** [Click here](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus)

**Important Considerations for Quota:**

*   **Quota Regions:** Quotas are region-specific. The links provided are for general quota viewing, but you must ensure your selected region has available resources. For example, the links above direct you to see A100 and H100 quota in *any region*. You may be planning on serving in `us-central1` but your quota may not allow for it.
*   **Quota Types:** The provided links direct you to *serving* quota. If your goal is *finetuning*, you will need to follow the finetuning-specific instructions. Serving and finetuning quotas are different. This notebook focuses on serving.
*   **Requesting Quota Increases:** If your existing quota is insufficient for your deployment, you'll need to request a quota increase via the Google Cloud console. The link provided earlier provides instructions on how to navigate the request process. This typically involves submitting a request with your requirements and may require review.
*   **Preemptible vs. Non-preemptible GPUs:** There are often different quotas for preemptible (cost-effective but interruptible) and non-preemptible GPUs. If you are using dynamic workload scheduler for finetuning and using H100s, make sure you have the correct type of quota. The links for the H100s are all for preemptible GPUs.

By carefully checking and managing your GPU quota, you can ensure a smooth and successful deployment of Llama 3.1.
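
To make the region-specific nature of quota easier to handle, here is a minimal sketch that rebuilds the console links above for a chosen accelerator and, optionally, a region. The metric names and the `location` / `metric` query parameters are copied from the links in this section; the `serving_quota_url` helper and the dictionary keys are hypothetical names introduced for this example.

```python
from typing import Optional
from urllib.parse import urlencode

QUOTA_CONSOLE_URL = "https://console.cloud.google.com/iam-admin/quotas"

# Serving quota metrics, taken from the links listed in this section.
SERVING_QUOTA_METRICS = {
    "NVIDIA_L4": "aiplatform.googleapis.com/custom_model_serving_nvidia_l4_gpus",
    "NVIDIA_A100_80GB": "aiplatform.googleapis.com/custom_model_serving_nvidia_a100_80gb_gpus",
    "NVIDIA_H100_80GB": "aiplatform.googleapis.com/custom_model_serving_nvidia_h100_gpus",
}


def serving_quota_url(accelerator_type: str, region: Optional[str] = None) -> str:
    """Builds a quota console link, filtered by region when one is given."""
    params = {}
    if region:
        params["location"] = region
    params["metric"] = SERVING_QUOTA_METRICS[accelerator_type]
    # urlencode percent-encodes the "/" in the metric name as %2F.
    return f"{QUOTA_CONSOLE_URL}?{urlencode(params)}"


print(serving_quota_url("NVIDIA_L4", "us-central1"))
print(serving_quota_url("NVIDIA_H100_80GB"))
```

Adding the region to the A100 and H100 links narrows the quota page to the region you actually plan to serve in.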

In [None]:
# @title Setup Google Cloud project

# @markdown 1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

# @markdown 2. **[Optional]** [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) for storing experiment outputs. Set the `BUCKET_URI` for the experiment environment. The specified Cloud Storage bucket (`BUCKET_URI`) should be located in the same region as where the notebook was launched. Note that a multi-region bucket (e.g. "us") is not considered a match for a single region covered by the multi-region range (e.g. "us-central1"). If not set, a unique GCS bucket will be created instead.

BUCKET_URI = "gs://"  # @param {type:"string"}

# @markdown 3. **[Optional]** Set region. If not set, the region will be set automatically according to Colab Enterprise environment. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below.

REGION = ""  # @param {type:"string"}

# @markdown > | Machine Type | Accelerator Type | Recommended Regions |
# @markdown | ----------- | ----------- | ----------- |
# @markdown | a2-ultragpu-1g | 1 NVIDIA_A100_80GB | us-central1, us-east4, europe-west4, asia-southeast1 |
# @markdown | a3-highgpu-2g | 2 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |
# @markdown | a3-highgpu-4g | 4 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |
# @markdown | a3-highgpu-8g | 8 NVIDIA_H100_80GB | us-central1, us-east5, europe-west4, us-west1, asia-southeast1 |

# Import the necessary packages

# Clone the vertex-ai-samples repository.
! git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git

import datetime
import importlib
import os
import uuid
from typing import Tuple

import requests
from google.cloud import aiplatform

common_util = importlib.import_module(
    "vertex-ai-samples.community-content.vertex_model_garden.model_oss.notebook_util.common_util"
)

models, endpoints = {}, {}

# Get the default cloud project id.
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]

# Get the default region for launching jobs.
if not REGION:
    REGION = os.environ["GOOGLE_CLOUD_REGION"]

# Enable the Vertex AI API and Compute Engine API, if not already.
print("Enabling Vertex AI API and Compute Engine API.")
! gcloud services enable aiplatform.googleapis.com compute.googleapis.com

# Cloud Storage bucket for storing the experiment artifacts.
# A unique GCS bucket will be created for the purpose of this notebook. If you
# prefer using your own GCS bucket, change the value yourself below.
now = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET_NAME = "/".join(BUCKET_URI.split("/")[:3])

if BUCKET_URI is None or BUCKET_URI.strip() == "" or BUCKET_URI == "gs://":
    BUCKET_URI = f"gs://{PROJECT_ID}-tmp-{now}-{str(uuid.uuid4())[:4]}"
    BUCKET_NAME = "/".join(BUCKET_URI.split("/")[:3])
    ! gsutil mb -l {REGION} {BUCKET_URI}
else:
    assert BUCKET_URI.startswith("gs://"), "BUCKET_URI must start with `gs://`."
    shell_output = ! gsutil ls -Lb {BUCKET_NAME} | grep "Location constraint:" | sed "s/Location constraint://"
    bucket_region = shell_output[0].strip().lower()
    if bucket_region != REGION:
        raise ValueError(
            "Bucket region %s is different from notebook region %s"
            % (bucket_region, REGION)
        )
print(f"Using this GCS Bucket: {BUCKET_URI}")

STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")
MODEL_BUCKET = os.path.join(BUCKET_URI, "llama3_1")


# Initialize Vertex AI API.
print("Initializing Vertex AI API.")
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

# Gets the default SERVICE_ACCOUNT.
shell_output = ! gcloud projects describe $PROJECT_ID
project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"
print("Using this default Service Account:", SERVICE_ACCOUNT)


# Provision permissions to the SERVICE_ACCOUNT with the GCS bucket
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.admin $BUCKET_NAME

! gcloud config set project $PROJECT_ID
! gcloud projects add-iam-policy-binding --no-user-output-enabled {PROJECT_ID} --member=serviceAccount:{SERVICE_ACCOUNT} --role="roles/storage.admin"
! gcloud projects add-iam-policy-binding --no-user-output-enabled {PROJECT_ID} --member=serviceAccount:{SERVICE_ACCOUNT} --role="roles/aiplatform.user"
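
The bucket handling in the setup cell can be hard to follow because it mixes shell commands with Python. The sketch below isolates the pure-Python parts: deciding whether a new bucket is needed, extracting the bucket name from a URI, and comparing regions. The helper names are hypothetical and the bucket URI is a made-up example; the logic mirrors the checks in the cell above rather than replacing them.

```python
from typing import Optional


def needs_new_bucket(bucket_uri: Optional[str]) -> bool:
    """Returns True when no usable bucket URI was provided."""
    return bucket_uri is None or bucket_uri.strip() in ("", "gs://")


def bucket_name_from_uri(bucket_uri: str) -> str:
    """Keeps only the `gs://<bucket>` part and drops any object prefix."""
    if not bucket_uri.startswith("gs://"):
        raise ValueError("BUCKET_URI must start with `gs://`.")
    return "/".join(bucket_uri.split("/")[:3])


def bucket_region_matches(bucket_region: str, notebook_region: str) -> bool:
    """Compares regions exactly, so a multi-region such as "US" never matches."""
    return bucket_region.strip().lower() == notebook_region.strip().lower()


# Hypothetical bucket URI, used only to exercise the helpers.
print(bucket_name_from_uri("gs://example-bucket/llama3_1/outputs"))
print(needs_new_bucket("gs://"))
print(bucket_region_matches(" US-CENTRAL1 ", "us-central1"))
print(bucket_region_matches("US", "us-central1"))
```

The last call illustrates the note in the setup form: a multi-region bucket is not treated as a match for a single region inside it.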

### Using Hugging Face with Meta Llama 3.1 and vLLM for Efficient Model Serving

Meta’s Llama 3.1 collection provides a range of multilingual large language models (LLMs) designed for high-quality text generation across various use cases. These models are pretrained and instruction-tuned, excelling in tasks like multilingual dialogue, summarization, and agentic retrieval. For efficient deployment of these models, the open-source vLLM library offers a streamlined serving environment with optimizations for latency, memory efficiency, and scalability.

Llama 3.1 is a **gated** model: you must agree to share your contact information before you can access it.

<img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/meta_llama_agreement.png">

### HuggingFace User Access Tokens

This tutorial requires a read access token from the Hugging Face Hub to access the necessary resources. Please follow these steps to set up your authentication:

#### Generate a Read Access Token:
- Navigate to your [Hugging Face account settings](https://huggingface.co/settings/tokens).
- Create a new token, assign it the Read role, and save the token securely.

<img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/access_token_settings_link.png">

#### Utilize the Token:
Use the generated token to authenticate and access public or private repositories as needed for the tutorial.
A read-only token gives you the access this tutorial requires without unnecessary permissions. For more information on setting up access tokens, visit this [page](https://huggingface.co/docs/hub/en/security-tokens).

Once the access token has been created, treat it as a secret: avoid sharing or exposing it publicly or online. When you set your token as an environment variable during deployment, it remains private to your project. Vertex AI ensures its security by preventing other users from accessing your models and endpoints.

<img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/manage_access_token.png">

For more information on protecting your access token, please refer to the [Hugging Face Access Tokens - Best Practices](https://huggingface.co/docs/hub/en/security-tokens#best-practices).
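
One way to follow this advice in a notebook is to keep the token out of the source and out of printed output. The sketch below builds the environment variable dictionary that would be handed to the serving container and prints only a masked form of the token. The variable name `HF_TOKEN`, the helper names, and the placeholder token are assumptions for this example; check the serving container's documentation for the variable name it actually expects.

```python
def build_serving_env(hf_token: str) -> dict:
    """Returns container environment variables carrying the access token."""
    if not hf_token or not hf_token.strip():
        raise ValueError("A Hugging Face read access token is required.")
    # "HF_TOKEN" is an assumed variable name, not confirmed by this tutorial.
    return {"HF_TOKEN": hf_token.strip()}


def mask_token(token: str, visible: int = 4) -> str:
    """Hides all but the first few characters so the token is safe to log."""
    if len(token) <= visible:
        return "*" * len(token)
    return token[:visible] + "*" * (len(token) - visible)


# Placeholder value; in practice read the token from a secret store or prompt.
placeholder_token = "hf_abcdefgh"
serving_env = build_serving_env(placeholder_token)
print("Token set:", mask_token(serving_env["HF_TOKEN"]))
```

In an interactive session, `getpass.getpass()` from the standard library is a common way to enter the real token without echoing it.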

## Deploy prebuilt Llama 3.1 8B Instruct with standard vLLM and Fast Deployment

Before diving into the code, let's understand the arguments provided to the `fast_deploy` function and the subsequent `aiplatform.Model.upload` and `model.deploy` calls within it. These arguments configure how the model is uploaded to Vertex AI and deployed for serving.

**Arguments Explained:**

**`fast_deploy` function arguments:**

*   `publisher` *(str)*: The name of the model publisher. This indicates the organization or entity that provides the model (e.g., "meta").
*   `publisher_model_id` *(str)*: The identifier of the model within the publisher's catalog (e.g., "llama3_1").
*   `version_id` *(str)*: The specific version of the model to be deployed (e.g., "llama-3.1-8b-instruct"). This allows for tracking and selecting particular model iterations.

**Arguments for `aiplatform.Model.upload` (inside `fast_deploy`):**

*   `display_name` *(str)*: The name that will be shown in the Vertex AI Model Registry. This helps identify the uploaded model in the Vertex AI console. It's automatically set based on the model from the model garden.
*   `serving_container_image_uri` *(str)*: The location of the container image that will be used for serving the model. Typically this would be a pre-built image from a container registry that includes the necessary libraries and tools to run the model.
*   `serving_container_args` *(list)*: Command-line arguments passed to the container when it starts, if needed by the containerized application.
*   `serving_container_environment_variables` *(dict)*: Environment variables that are set within the serving container. This is commonly used to configure model parameters, resource limits, and other operational settings. The serving container also reads the `AIP_` environment variables as specified in the [Vertex AI Documentation](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#environment-variables).
*   `serving_container_ports` *(list of int)*: The ports on which the container listens for requests from Vertex AI. By default, this is 8080.
*   `serving_container_predict_route` *(str)*: The route used for the prediction endpoint in the container.
*   `serving_container_health_route` *(str)*: The route used for the health check endpoint in the container.
*   `serving_container_shared_memory_size_mb` *(int)*: Sets the shared memory size in megabytes for the container.
*   `serving_container_deployment_timeout` *(int)*: Time in seconds before the deployment request will timeout.
*   `model_garden_source_model_name` *(str)*: The full path to the source model that is being deployed. This information is useful for tracking lineage of the model.

**Arguments for `model.deploy` (inside `fast_deploy`):**

*   `endpoint` *(aiplatform.Endpoint)*: The Vertex AI endpoint where the model will be deployed.
*   `machine_type` *(str)*: The type of virtual machine to use for prediction (e.g., `n1-standard-8`, `a2-highgpu-1g`).
*   `accelerator_type` *(str)*: The type of GPU accelerator to use for prediction (e.g., `NVIDIA_TESLA_A100`, `NVIDIA_L4`).
*   `accelerator_count` *(int)*: The number of GPU accelerators to use for prediction.
*   `deploy_request_timeout` *(int)*: The maximum time in seconds to wait for the deployment request to complete.
*   `disable_container_logging` *(bool)*: A flag to disable container logging.
*   `fast_tryout_enabled` *(bool)*: A flag to enable a faster tryout deployment of the endpoint.
*   `system_labels` *(dict)*: Labels to attach to the deployed model. The code adds two specific labels here: "DEPLOY_SOURCE", set to "notebook", and "NOTEBOOK_NAME", set to the name of the current notebook.

**Key Concepts and Considerations:**

*   **Containerization:** The model is packaged into a container image, ensuring consistency across different environments. The `serving_container_image_uri` specifies the location of this container image.
*   **Environment Variables:** These allow you to customize the serving behavior of your model without modifying the container image itself. This is very important for configuring specific model settings, and passing credentials.
*   **Resource Allocation:** The `machine_type`, `accelerator_type`, and `accelerator_count` arguments control the compute resources dedicated to your deployed model.
*   **Endpoint:** The endpoint is the access point for making predictions using your deployed model.
*   **Model Garden:** This model is being deployed from Google's Model Garden, which includes pre-trained models that can be deployed in Vertex AI.
*   **Dedicated Endpoint:** The `dedicated_endpoint_enabled=True` argument in `aiplatform.Endpoint.create()` creates a dedicated endpoint, which is required for Fast Deployment.

By carefully configuring these arguments, you can tailor your model deployment to meet your specific performance, cost, and operational requirements.
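
To show how these arguments relate to one another, here is a minimal sketch that splits a deployment configuration into the keyword arguments for `aiplatform.Model.upload` and `model.deploy`. The field names (`containerSpec`, `dedicatedResources`, `machineSpec`, and so on) follow the ones read by `fast_deploy` in this notebook; the `split_deploy_config` helper and every value in `sample_config` are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Tuple


def split_deploy_config(config: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Maps one deployment config to upload kwargs and deploy kwargs."""
    container_spec = config["containerSpec"]
    machine_spec = config["dedicatedResources"]["machineSpec"]

    # Flatten the list of {"name": ..., "value": ...} pairs into a dict.
    env = {item["name"]: item["value"] for item in container_spec.get("env", [])}
    env.pop("DEPLOY_SOURCE", None)  # Set through system_labels instead.

    upload_kwargs = {
        "display_name": config.get("modelDisplayName"),
        "serving_container_image_uri": container_spec.get("imageUri"),
        "serving_container_args": container_spec.get("args"),
        "serving_container_environment_variables": env,
        "serving_container_ports": [
            container_spec.get("ports", [{}])[0].get("containerPort")
        ],
        "serving_container_predict_route": container_spec.get("predictRoute"),
        "serving_container_health_route": container_spec.get("healthRoute"),
    }
    deploy_kwargs = {
        "machine_type": machine_spec["machineType"],
        "accelerator_type": machine_spec["acceleratorType"],
        "accelerator_count": machine_spec["acceleratorCount"],
    }
    return upload_kwargs, deploy_kwargs


# Hypothetical config; real values come from the Model Garden API response.
sample_config = {
    "modelDisplayName": "example-llama",
    "containerSpec": {
        "imageUri": "example-registry/example-vllm-image:latest",
        "args": ["--example-flag"],
        "env": [
            {"name": "MODEL_ID", "value": "example/model"},
            {"name": "DEPLOY_SOURCE", "value": "UI"},
        ],
        "ports": [{"containerPort": 8080}],
        "predictRoute": "/generate",
        "healthRoute": "/ping",
    },
    "dedicatedResources": {
        "machineSpec": {
            "machineType": "a2-ultragpu-1g",
            "acceleratorType": "NVIDIA_A100_80GB",
            "acceleratorCount": 1,
        }
    },
}

upload_kwargs, deploy_kwargs = split_deploy_config(sample_config)
print(upload_kwargs)
print(deploy_kwargs)
```

Keeping the two dictionaries separate makes it clear which settings describe the container and which describe the compute resources.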

In [None]:
# @title Fast Deployment

# @markdown This section demonstrates how to use the Fast Deployment feature.

# @markdown The Fast Deployment feature prioritizes speed for model exploration, making it ideal for initial testing and experimentation. It is enabled by setting `fast_tryout_enabled` to `True`. For sensitive data or production workloads, use the Standard environment for enhanced security and stability.

# @markdown Fast Deployment is useful for quick experiments, not for production workloads, and it is only available for the most popular models and machine types.

API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"


def fast_deploy(
    publisher: str, publisher_model_id: str, version_id: str
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    url = f"https://{API_ENDPOINT}/v1/publishers/{publisher}/models/{publisher_model_id}@{version_id}"
    access_token = ! gcloud auth print-access-token
    access_token = access_token[0]

    response = requests.get(
        url,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
    response.raise_for_status()
    response = response.json()
    if (
        len(
            response.get("supportedActions", {})
            .get("multiDeployVertex", {})
            .get("multiDeployVertex", {})
        )
        == 0
    ):
        raise ValueError(
            f"No supportedActions.multiDeployVertex found in {REGION}. You can skip this section or try a different region."
        )
    deploy_configs = response["supportedActions"]["multiDeployVertex"][
        "multiDeployVertex"
    ]
    fast_deploy_config = [
        config
        for config in deploy_configs
        if config["deployMetadata"]
        .get("labels", {})
        .get("show-faster-deployment-option")
        == "true"
    ]
    if fast_deploy_config:
        fast_deploy_config = fast_deploy_config[0]
    else:
        raise ValueError(
            f"No Fast Deployment config found in {REGION}. You can skip this section or try a different region."
        )

    container_spec = fast_deploy_config["containerSpec"]
    machine_spec = fast_deploy_config["dedicatedResources"]["machineSpec"]
    machine_type = machine_spec["machineType"]
    accelerator_type = machine_spec["acceleratorType"]
    accelerator_count = machine_spec["acceleratorCount"]
    env = {item["name"]: item["value"] for item in container_spec.get("env", [])}
    if "DEPLOY_SOURCE" in env:
        del env["DEPLOY_SOURCE"]
    port = container_spec.get("ports", [{}])[0].get("containerPort")

    model = aiplatform.Model.upload(
        display_name=fast_deploy_config.get("modelDisplayName"),
        serving_container_image_uri=container_spec.get("imageUri"),
        serving_container_args=container_spec.get("args"),
        serving_container_environment_variables=env,
        serving_container_ports=[port],
        serving_container_predict_route=container_spec.get("predictRoute"),
        serving_container_health_route=container_spec.get("healthRoute"),
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=7200,
        model_garden_source_model_name=f"publishers/{publisher}/models/{publisher_model_id}",
    )
    endpoint = aiplatform.Endpoint.create(
        display_name=model.name + "-endpoint",
        dedicated_endpoint_enabled=True,
    )
    print(
        f"Deploying {model.name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s)."
    )
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        disable_container_logging=True,
        fast_tryout_enabled=True,
        system_labels={
            "DEPLOY_SOURCE": "notebook",
            "NOTEBOOK_NAME": "model_garden_pytorch_llama3_1_deployment.ipynb",
        },
    )
    print("endpoint_name:", endpoint.name)
    return model, endpoint


# @markdown The Llama 3.1 8B Instruct model will be deployed to a dedicated endpoint on an `a2-ultragpu-1g` machine with Fast Deployment.
# @markdown **Currently, the Fast Deployment is only supported in the `us-central1` region.**

use_dedicated_endpoint = True  # Fast Deployment only supports dedicated endpoints.
models["vllm_fast"], endpoints["vllm_fast"] = fast_deploy(
    "meta", "llama3_1", "llama-3.1-8b-instruct"
)
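
The lookup at the start of `fast_deploy` can be exercised without network access. The sketch below rebuilds the publisher model URL and selects the configuration labeled for Fast Deployment from a response dictionary. The URL pattern and the `show-faster-deployment-option` label come from the function above; the helper names and the contents of `sample_response` are hypothetical.

```python
from typing import Any, Dict


def publisher_model_url(
    region: str, publisher: str, publisher_model_id: str, version_id: str
) -> str:
    """Builds the Model Garden publisher model URL used by fast_deploy."""
    api_endpoint = f"{region}-aiplatform.googleapis.com"
    return (
        f"https://{api_endpoint}/v1/publishers/{publisher}"
        f"/models/{publisher_model_id}@{version_id}"
    )


def pick_fast_deploy_config(response: Dict[str, Any]) -> Dict[str, Any]:
    """Returns the first deploy config labeled for Fast Deployment."""
    deploy_configs = (
        response.get("supportedActions", {})
        .get("multiDeployVertex", {})
        .get("multiDeployVertex", [])
    )
    for config in deploy_configs:
        labels = config.get("deployMetadata", {}).get("labels", {})
        if labels.get("show-faster-deployment-option") == "true":
            return config
    raise ValueError("No Fast Deployment config found.")


# Hypothetical response with one standard and one Fast Deployment config.
sample_response = {
    "supportedActions": {
        "multiDeployVertex": {
            "multiDeployVertex": [
                {"modelDisplayName": "standard", "deployMetadata": {}},
                {
                    "modelDisplayName": "fast",
                    "deployMetadata": {
                        "labels": {"show-faster-deployment-option": "true"}
                    },
                },
            ]
        }
    }
}

print(publisher_model_url("us-central1", "meta", "llama3_1", "llama-3.1-8b-instruct"))
print(pick_fast_deploy_config(sample_response)["modelDisplayName"])
```

Using `.get` with empty defaults at each level means a response without any deploy options raises the explicit `ValueError` rather than a `KeyError`.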

In [None]:
# @title Raw predict

# @markdown Once deployment succeeds, you can send requests to the endpoint with text prompts. Sampling parameters supported by vLLM can be found [here](https://docs.vllm.ai/en/latest/dev/sampling_params.html).

# @markdown <img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/vertexai_llama3_1_endpoint.png">

# @markdown Example:

# @markdown ```
# @markdown Human: What is a car?
# @markdown Assistant:  A car, or a motor car, is a road-connected human-transportation system used to move people or goods from one place to another. The term also encompasses a wide range of vehicles, including motorboats, trains, and aircrafts. Cars typically have four wheels, a cabin for passengers, and an engine or motor. They have been around since the early 19th century and are now one of the most popular forms of transportation, used for daily commuting, shopping, and other purposes.
# @markdown ```
# @markdown Additionally, you can moderate the generated text with Vertex AI. See [Moderate text documentation](https://cloud.google.com/natural-language/docs/moderating-text) for more details.

# Loads an existing endpoint instance using the endpoint name:
# - Using `endpoint_name = endpoint.name` allows us to get the
#   endpoint name of the endpoint `endpoint` created in the cell
#   above.
# - Alternatively, you can set `endpoint_name = "1234567890123456789"` to load
#   an existing endpoint with the ID 1234567890123456789.
# You may uncomment the code below to load an existing endpoint.

# endpoint_name = ""  # @param {type:"string"}
# aip_endpoint_name = (
#     f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}"
# )
# endpoint = aiplatform.Endpoint(aip_endpoint_name)

prompt = "What is a car?"  # @param {type: "string"}
# @markdown If you encounter an issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens by lowering `max_tokens`.
max_tokens = 50  # @param {type:"integer"}
temperature = 1.0  # @param {type:"number"}
top_p = 1.0  # @param {type:"number"}
top_k = 1  # @param {type:"integer"}
# @markdown Set `raw_response` to `True` to obtain the raw model output. Set `raw_response` to `False` to apply additional formatting in the structure of `"Prompt:\n{prompt.strip()}\nOutput:\n{output}"`.
raw_response = False  # @param {type:"boolean"}

# Overrides parameters for inferences.
instances = [
    {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response,
    },
]
response = endpoints["vllm_fast"].predict(
    instances=instances, use_dedicated_endpoint=use_dedicated_endpoint
)

for prediction in response.predictions:
    print(prediction)

# @markdown Click "Show Code" to see more details.
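
The request sent by the cell above is a list of instances, each carrying the prompt and its sampling parameters. The sketch below builds such an instance and re-creates locally the `"Prompt:\n{prompt.strip()}\nOutput:\n{output}"` structure described for `raw_response=False`. The `build_instance` and `format_output` helpers and the sample output text are hypothetical; the actual formatting is applied by the serving container, not by this code.

```python
import json
from typing import Any, Dict


def build_instance(
    prompt: str,
    max_tokens: int = 50,
    temperature: float = 1.0,
    top_p: float = 1.0,
    top_k: int = 1,
    raw_response: bool = False,
) -> Dict[str, Any]:
    """Builds one prediction instance with the fields used in this notebook."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1.")
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response,
    }


def format_output(prompt: str, output: str) -> str:
    """Local re-creation of the formatted (non-raw) response structure."""
    return f"Prompt:\n{prompt.strip()}\nOutput:\n{output}"


# Lowering max_tokens is the suggested remedy for 503 timeout errors.
request_body = {"instances": [build_instance("What is a car?", max_tokens=20)]}
print(json.dumps(request_body, indent=2))
print(format_output(" What is a car? ", "A car is a wheeled motor vehicle."))
```

The `request_body["instances"]` list is what would be passed as the `instances` argument of `endpoint.predict`.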

In [None]:
# @title Chat completion

DEDICATED_ENDPOINT_DNS = endpoints["vllm_fast"].gca_resource.dedicated_endpoint_dns
ENDPOINT_RESOURCE_NAME = "projects/{}/locations/{}/endpoints/{}".format(
    PROJECT_ID, REGION, endpoints["vllm_fast"].name
)

# @title Chat Completions Inference

# @markdown Once deployment succeeds, you can send requests to the endpoint using the OpenAI SDK.

# @markdown First you will need to install the SDK and some auth-related dependencies.

! pip install -qU openai google-auth requests

# @markdown Next fill out some request parameters:

user_message = "How is your day going?"  # @param {type: "string"}
# @markdown If you encounter an issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens, for example by setting `max_tokens` to 20.
max_tokens = 50  # @param {type: "integer"}
temperature = 1.0  # @param {type: "number"}

# @markdown Now we can send a request.

import google.auth
import google.auth.transport.requests  # Required for Request() below.
import openai

creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

BASE_URL = (
    f"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"
)
try:
    if use_dedicated_endpoint:
        BASE_URL = f"https://{DEDICATED_ENDPOINT_DNS}/v1beta1/{ENDPOINT_RESOURCE_NAME}"
except NameError:
    pass

client = openai.OpenAI(base_url=BASE_URL, api_key=creds.token)

model_response = client.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": user_message}],
    temperature=temperature,
    max_tokens=max_tokens,
)
print(model_response)

# @markdown Click "Show Code" to see more details.

## Deploy prebuilt Llama 3.1 8B, 70B and 405B with standard vLLM

In [None]:
# @title Deploy

# @markdown This section uploads a prebuilt Llama 3.1 model to Model Registry and deploys it to a Vertex AI Endpoint. Deployment takes 15 minutes to 1 hour, depending on the size of the model.

# @markdown NVIDIA_L4 GPUs are used to demonstrate the 8B models. L4 GPUs serve less efficiently than H100 GPUs, but they are still a good serving option if you do not have H100 quota.

# @markdown Recommended Serving Configurations: This example recommends using A100 80G or H100 GPUs for optimal serving efficiency and performance. These GPUs are now readily available and are the preferred options for deploying these models.

# @markdown Set the model to deploy.

base_model_name = "Meta-Llama-3.1-8B"  # @param ["Meta-Llama-3.1-8B", "Meta-Llama-3.1-8B-Instruct", "Meta-Llama-3.1-70B", "Meta-Llama-3.1-70B-Instruct", "Meta-Llama-3.1-405B-FP8", "Meta-Llama-3.1-405B-Instruct-FP8"] {isTemplate:true}
model_id = os.path.join(VERTEX_AI_MODEL_GARDEN_LLAMA_3_1, base_model_name)
ENABLE_DYNAMIC_LORA = True  # @param {type:"boolean", isTemplate:true}
hf_model_id = "meta-llama/" + base_model_name

# The pre-built serving docker images.
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20241210_0916_RC00"

# @markdown Set use_dedicated_endpoint to False if you don't want to use [dedicated endpoint](https://cloud.google.com/vertex-ai/docs/general/deployment#create-dedicated-endpoint).
use_dedicated_endpoint = True  # @param {type:"boolean"}

# @markdown Find Vertex AI prediction supported accelerators and regions at https://cloud.google.com/vertex-ai/docs/predictions/configure-compute.
if "8b" in base_model_name.lower():
    accelerator_type = "NVIDIA_L4"
    machine_type = "g2-standard-12"
    accelerator_count = 1
    max_loras = 5
elif "70b" in base_model_name.lower():
    accelerator_type = "NVIDIA_H100_80GB"
    machine_type = "a3-highgpu-4g"
    accelerator_count = 4
    max_loras = 1
elif "405b" in base_model_name.lower():
    accelerator_type = "NVIDIA_H100_80GB"
    machine_type = "a3-highgpu-8g"
    accelerator_count = 8
    max_loras = 1
else:
    # `accelerator_type` is not assigned on this branch, so report only the model name.
    raise ValueError(
        f"Recommended GPU setting not found for: {base_model_name}."
    )

common_util.check_quota(
    project_id=PROJECT_ID,
    region=REGION,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    is_for_training=False,
)

gpu_memory_utilization = 0.95
max_model_len = 8192  # Maximum context length.


# Enable automatic prefix caching using GPU HBM
enable_prefix_cache = True
# Setting this value >0 will use the idle host memory for a second-tier prefix kv
# cache beneath the HBM cache. It only has effect if enable_prefix_cache=True.
# The range of this value: [0, 1)
# Setting host_prefix_kv_cache_utilization_target to 0 will disable the host memory prefix kv cache.
host_prefix_kv_cache_utilization_target = 0.7


def deploy_model_vllm(
    model_name: str,
    model_id: str,
    service_account: str,
    base_model_id: str = None,
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    gpu_memory_utilization: float = 0.9,
    max_model_len: int = 4096,
    dtype: str = "auto",
    enable_trust_remote_code: bool = False,
    enforce_eager: bool = False,
    enable_lora: bool = False,
    enable_chunked_prefill: bool = False,
    enable_prefix_cache: bool = False,
    host_prefix_kv_cache_utilization_target: float = 0.0,
    max_loras: int = 1,
    max_cpu_loras: int = 8,
    use_dedicated_endpoint: bool = False,
    max_num_seqs: int = 256,
    model_type: str = None,
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    """Deploys trained models with vLLM into Vertex AI."""
    endpoint = aiplatform.Endpoint.create(
        display_name=f"{model_name}-endpoint",
        dedicated_endpoint_enabled=use_dedicated_endpoint,
    )

    if not base_model_id:
        base_model_id = model_id

    # See https://docs.vllm.ai/en/latest/models/engine_args.html for a list of possible arguments with descriptions.
    vllm_args = [
        "python",
        "-m",
        "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--gpu-memory-utilization={gpu_memory_utilization}",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        f"--max-loras={max_loras}",
        f"--max-cpu-loras={max_cpu_loras}",
        f"--max-num-seqs={max_num_seqs}",
        "--disable-log-stats",
    ]

    if enable_trust_remote_code:
        vllm_args.append("--trust-remote-code")

    if enforce_eager:
        vllm_args.append("--enforce-eager")

    if enable_lora:
        vllm_args.append("--enable-lora")

    if enable_chunked_prefill:
        vllm_args.append("--enable-chunked-prefill")

    if enable_prefix_cache:
        vllm_args.append("--enable-prefix-caching")

    if 0 < host_prefix_kv_cache_utilization_target < 1:
        vllm_args.append(
            f"--host-prefix-kv-cache-utilization-target={host_prefix_kv_cache_utilization_target}"
        )

    if model_type:
        vllm_args.append(f"--model-type={model_type}")

    env_vars = {
        "MODEL_ID": base_model_id,
        "DEPLOY_SOURCE": "notebook",
    }

    # HF_TOKEN is not a compulsory field and may not be defined.
    try:
        if HF_TOKEN:
            env_vars["HF_TOKEN"] = HF_TOKEN
    except NameError:
        pass

    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=7200,
    )
    print(
        f"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s)."
    )
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
        system_labels={
            "NOTEBOOK_NAME": "model_garden_pytorch_llama3_1_deployment.ipynb",
        },
    )
    print("endpoint_name:", endpoint.name)

    return model, endpoint


models["vllm_gpu"], endpoints["vllm_gpu"] = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(prefix="llama3_1-serve"),
    model_id=model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    gpu_memory_utilization=gpu_memory_utilization,
    max_model_len=max_model_len,
    max_loras=max_loras,
    enforce_eager=True,
    enable_lora=ENABLE_DYNAMIC_LORA,
    enable_chunked_prefill=not ENABLE_DYNAMIC_LORA,
    enable_prefix_cache=enable_prefix_cache,
    host_prefix_kv_cache_utilization_target=host_prefix_kv_cache_utilization_target,
    use_dedicated_endpoint=use_dedicated_endpoint,
)
# @markdown Click "Show Code" to see more details.
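
The accelerator choices in the deploy cell above can be sanity-checked with some rough arithmetic. The sketch below is not part of the notebook: `min_gpus_for_weights` is a hypothetical helper, the parameter counts are rounded, and it assumes 2 bytes per parameter for the 16-bit checkpoints and 1 byte for the FP8 ones. It only gives a lower bound for holding the weights, rounded up to a power of two as an assumed tensor-parallel size.

```python
import math


def min_gpus_for_weights(
    params_billion: float,
    bytes_per_param: float,
    gpu_memory_gb: float,
    gpu_memory_utilization: float = 0.95,
) -> int:
    """Lower bound on the GPU count needed just to hold the model weights.

    The result is rounded up to a power of two (an assumed tensor-parallel
    size). KV cache and activations need extra memory on top of this.
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    usable_gb = gpu_memory_gb * gpu_memory_utilization
    needed = max(1, math.ceil(weights_gb / usable_gb))
    return 1 << (needed - 1).bit_length()  # Next power of two >= needed.


# Rounded, illustrative inputs: (label, params in billions, bytes/param, GPU GB).
examples = [
    ("8B, 16-bit weights, L4 24GB", 8, 2, 24),
    ("70B, 16-bit weights, H100 80GB", 70, 2, 80),
    ("405B, FP8 weights, H100 80GB", 405, 1, 80),
]
for label, params, nbytes, gpu_gb in examples:
    print(label, "->", min_gpus_for_weights(params, nbytes, gpu_gb), "GPU(s)")
```

The notebook picks 4 H100 GPUs for the 70B models, which is above this weights-only bound and leaves headroom for the KV cache at `max_model_len = 8192`.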

In [None]:
# @title Raw predict

# @markdown Once deployment succeeds, you can send requests to the endpoint with text prompts. Sampling parameters supported by vLLM can be found [here](https://docs.vllm.ai/en/latest/dev/sampling_params.html).

# @markdown Example:

# @markdown ```
# @markdown Human: What is a car?
# @markdown Assistant:  A car, or a motor car, is a road-connected human-transportation system used to move people or goods from one place to another.
# @markdown ```

# @markdown Optionally, you can apply LoRA weights to prediction. Set `lora_id` to be either a GCS URI or a HuggingFace repo containing the LoRA weight.

# @markdown Additionally, you can moderate the generated text with Vertex AI. See [Moderate text documentation](https://cloud.google.com/natural-language/docs/moderating-text) for more details.

prompt = "What is a car?"  # @param {type: "string"}
# @markdown If you encounter an issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens, by lowering `max_tokens`.
max_tokens = 50  # @param {type:"integer"}
temperature = 1.0  # @param {type:"number"}
top_p = 1.0  # @param {type:"number"}
top_k = 1  # @param {type:"integer"}
# @markdown Set `raw_response` to `True` to obtain the raw model output. Set `raw_response` to `False` to apply additional formatting in the structure of `"Prompt:\n{prompt.strip()}\nOutput:\n{output}"`.
raw_response = False  # @param {type:"boolean"}
lora_id = ""  # @param {type:"string", isTemplate: true}

# Overrides parameters for inferences.
instance = {
    "prompt": prompt,
    "max_tokens": max_tokens,
    "temperature": temperature,
    "top_p": top_p,
    "top_k": top_k,
    "raw_response": raw_response,
}
if lora_id:
    instance["dynamic-lora"] = lora_id
instances = [instance]
response = endpoints["vllm_gpu"].predict(
    instances=instances, use_dedicated_endpoint=use_dedicated_endpoint
)

for prediction in response.predictions:
    print(prediction)

# @markdown Click "Show Code" to see more details.
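
If you send several requests, it may help to wrap the payload construction from the cell above in a small function. This is a minimal sketch, not part of the notebook: `build_instance` is a hypothetical helper, and the `gs://` path in the example is a placeholder rather than a real LoRA location.

```python
from typing import Any, Dict, Optional


def build_instance(
    prompt: str,
    max_tokens: int = 50,
    temperature: float = 1.0,
    top_p: float = 1.0,
    top_k: int = 1,
    raw_response: bool = False,
    lora_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Builds one prediction instance with the same keys as the cell above."""
    instance: Dict[str, Any] = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response,
    }
    # Only attach the LoRA key when an adapter is requested.
    if lora_id:
        instance["dynamic-lora"] = lora_id
    return instance


# Placeholder adapter path, for illustration only.
example = build_instance("What is a car?", lora_id="gs://your-bucket/your-lora")
print(example)
```

The resulting dictionary can be passed as `instances=[example]` to the `predict` call shown above.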

In [None]:
# @title Chat completion

if use_dedicated_endpoint:
    DEDICATED_ENDPOINT_DNS = endpoints["vllm_gpu"].gca_resource.dedicated_endpoint_dns
ENDPOINT_RESOURCE_NAME = "projects/{}/locations/{}/endpoints/{}".format(
    PROJECT_ID, REGION, endpoints["vllm_gpu"].name
)

# @title Chat Completions Inference

# @markdown Once deployment succeeds, you can send requests to the endpoint using the OpenAI SDK.

# @markdown First you will need to install the SDK and some auth-related dependencies.

! pip install -qU openai google-auth requests

# @markdown Next fill out some request parameters:

user_message = "How is your day going?"  # @param {type: "string"}
# @markdown If you encounter an issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens, for example by setting `max_tokens` to 20.
max_tokens = 50  # @param {type: "integer"}
temperature = 1.0  # @param {type: "number"}

# @markdown Now we can send a request.

import google.auth
import google.auth.transport.requests  # Required for Request() below.
import openai

creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

BASE_URL = (
    f"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"
)
try:
    if use_dedicated_endpoint:
        BASE_URL = f"https://{DEDICATED_ENDPOINT_DNS}/v1beta1/{ENDPOINT_RESOURCE_NAME}"
except NameError:
    pass

client = openai.OpenAI(base_url=BASE_URL, api_key=creds.token)

model_response = client.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": user_message}],
    temperature=temperature,
    max_tokens=max_tokens,
)
print(model_response)

# @markdown Click "Show Code" to see more details.
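
The base URL logic in the chat completion cells follows one pattern: use the dedicated endpoint DNS when there is one, otherwise the regional Vertex AI host. The sketch below restates that pattern as a function. `build_base_url` is a hypothetical helper, and the project, region, endpoint ID, and DNS values in the example are made-up placeholders.

```python
from typing import Optional


def build_base_url(
    project_id: str,
    region: str,
    endpoint_id: str,
    dedicated_endpoint_dns: Optional[str] = None,
) -> str:
    """Returns the OpenAI-compatible base URL for a Vertex AI endpoint."""
    resource = f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}"
    # Fall back to the shared regional host when no dedicated DNS is given.
    host = dedicated_endpoint_dns or f"{region}-aiplatform.googleapis.com"
    return f"https://{host}/v1beta1/{resource}"


# Placeholder values, for illustration only.
print(build_base_url("my-project", "us-central1", "1234567890123456789"))
print(
    build_base_url(
        "my-project",
        "us-central1",
        "1234567890123456789",
        dedicated_endpoint_dns="example.prediction.vertexai.goog",
    )
)
```

Once a request succeeds, the generated text is available as `model_response.choices[0].message.content`, which is usually easier to read than printing the whole response object.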

## Deploy prebuilt Llama 3.1 8B and 70B with optimized vLLM

The optimized vLLM serving container can provide better serving performance in some cases.

In [None]:
# @title Deploy

# @markdown This section uploads a prebuilt Llama 3.1 model to Model Registry and deploys it to a Vertex AI Endpoint. Deployment takes 15 minutes to 1 hour, depending on the size of the model.

# @markdown NVIDIA_L4 GPUs are used to demonstrate the 8B models. L4 GPUs serve less efficiently than H100 GPUs, but they are still a good serving option if you do not have H100 quota.

# @markdown Recommended Serving Configurations: This example recommends using A100 80G or H100 GPUs for optimal serving efficiency and performance. These GPUs are now readily available and are the preferred options for deploying these models.

# @markdown Set the model to deploy.

base_model_name = "Meta-Llama-3.1-8B"  # @param ["Meta-Llama-3.1-8B", "Meta-Llama-3.1-8B-Instruct", "Meta-Llama-3.1-70B", "Meta-Llama-3.1-70B-Instruct"] {isTemplate:true}
model_id = os.path.join(VERTEX_AI_MODEL_GARDEN_LLAMA_3_1, base_model_name)
hf_model_id = "meta-llama/" + base_model_name

# The pre-built serving docker images.
OPTIMIZED_VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai-restricted/vertex-vision-model-garden-dockers/pytorch-vllm-optimized-serve:20241029_0835_RC00"

# @markdown Find Vertex AI prediction supported accelerators and regions at https://cloud.google.com/vertex-ai/docs/predictions/configure-compute.
if "8b" in base_model_name.lower():
    accelerator_type = "NVIDIA_L4"
    machine_type = "g2-standard-12"
    accelerator_count = 1
elif "70b" in base_model_name.lower():
    accelerator_type = "NVIDIA_H100_80GB"
    machine_type = "a3-highgpu-8g"
    accelerator_count = 8
else:
    # `accelerator_type` is not assigned on this branch, so report only the model name.
    raise ValueError(
        f"Recommended GPU setting not found for: {base_model_name}."
    )

common_util.check_quota(
    project_id=PROJECT_ID,
    region=REGION,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    is_for_training=False,
)

max_model_len = 8192  # Maximum context length.


def deploy_model_optimized_vllm(
    model_name: str,
    model_id: str,
    service_account: str,
    base_model_id: str = None,
    machine_type: str = "g2-standard-12",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    max_model_len: int = 4096,
    enable_trust_remote_code: bool = False,
    use_dedicated_endpoint: bool = False,
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    """Deploys models with optimized vLLM into Vertex AI."""
    endpoint = aiplatform.Endpoint.create(
        display_name=f"{model_name}-endpoint",
        dedicated_endpoint_enabled=use_dedicated_endpoint,
    )

    if not base_model_id:
        base_model_id = model_id

    vllm_args = [
        "python",
        "-m",
        "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--max-model-len={max_model_len}",
        "--disable-log-stats",
    ]

    if enable_trust_remote_code:
        vllm_args.append("--trust-remote-code")

    env_vars = {
        "MODEL_ID": base_model_id,
        "DEPLOY_SOURCE": "notebook",
    }

    # HF_TOKEN is not a compulsory field and may not be defined.
    try:
        if HF_TOKEN:
            env_vars["HF_TOKEN"] = HF_TOKEN
    except NameError:
        pass

    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=OPTIMIZED_VLLM_DOCKER_URI,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=7200,
    )
    print(
        f"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s)."
    )
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
        system_labels={
            "NOTEBOOK_NAME": "model_garden_pytorch_llama3_1_deployment.ipynb",
        },
    )
    print("endpoint_name:", endpoint.name)

    return model, endpoint


(
    models["optimized_vllm_gpu"],
    endpoints["optimized_vllm_gpu"],
) = deploy_model_optimized_vllm(
    model_name=common_util.get_job_name_with_datetime(prefix="llama3_1-serve"),
    model_id=model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    max_model_len=max_model_len,
    use_dedicated_endpoint=use_dedicated_endpoint,
)
# @markdown Click "Show Code" to see more details.

In [None]:
# @title Raw predict

# @markdown Once deployment succeeds, you can send requests to the endpoint with text prompts. Sampling parameters supported by vLLM can be found [here](https://docs.vllm.ai/en/latest/dev/sampling_params.html).

# @markdown Example:

# @markdown ```
# @markdown Human: What is a car?
# @markdown Assistant:  A car, or a motor car, is a road-connected human-transportation system used to move people or goods from one place to another. The term also encompasses a wide range of vehicles, including motorboats, trains, and aircrafts. Cars typically have four wheels, a cabin for passengers, and an engine or motor. They have been around since the early 19th century and are now one of the most popular forms of transportation, used for daily commuting, shopping, and other purposes.
# @markdown ```
# @markdown Additionally, you can moderate the generated text with Vertex AI. See [Moderate text documentation](https://cloud.google.com/natural-language/docs/moderating-text) for more details.

# Loads an existing endpoint instance using the endpoint name:
# - Using `endpoint_name = endpoint.name` allows us to get the
#   endpoint name of the endpoint `endpoint` created in the cell
#   above.
# - Alternatively, you can set `endpoint_name = "1234567890123456789"` to load
#   an existing endpoint with the ID 1234567890123456789.
# You may uncomment the code below to load an existing endpoint.

# endpoint_name = ""  # @param {type:"string"}
# aip_endpoint_name = (
#     f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}"
# )
# endpoint = aiplatform.Endpoint(aip_endpoint_name)

prompt = "What is a car?"  # @param {type: "string"}
max_tokens = 50  # @param {type:"integer"}
temperature = 1.0  # @param {type:"number"}
top_p = 1.0  # @param {type:"number"}
top_k = 1  # @param {type:"integer"}
raw_response = True  # @param {type:"boolean"}

# Overrides parameters for inferences.
instances = [
    {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response,
    },
]
response = endpoints["optimized_vllm_gpu"].predict(
    instances=instances, use_dedicated_endpoint=use_dedicated_endpoint
)

for prediction in response.predictions:
    print(prediction)

# @markdown Click "Show Code" to see more details.

In [None]:
# @title Chat completion

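# Refresh the DNS for this endpoint; otherwise a value left over from an
# earlier cell would point at a different endpoint.
if use_dedicated_endpoint:
    DEDICATED_ENDPOINT_DNS = endpoints[
        "optimized_vllm_gpu"
    ].gca_resource.dedicated_endpoint_dns
ENDPOINT_RESOURCE_NAME = "projects/{}/locations/{}/endpoints/{}".format(
    PROJECT_ID, REGION, endpoints["optimized_vllm_gpu"].name
)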

# @title Chat Completions Inference

# @markdown Once deployment succeeds, you can send requests to the endpoint using the OpenAI SDK.

# @markdown First you will need to install the SDK and some auth-related dependencies.

! pip install -qU openai google-auth requests

# @markdown Next fill out some request parameters:

user_message = "How is your day going?"  # @param {type: "string"}
# @markdown If you encounter an issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens, for example by setting `max_tokens` to 20.
max_tokens = 50  # @param {type: "integer"}
temperature = 1.0  # @param {type: "number"}

# @markdown Now we can send a request.

import google.auth
import google.auth.transport.requests  # Required for Request() below.
import openai

creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

BASE_URL = (
    f"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"
)
try:
    if use_dedicated_endpoint:
        BASE_URL = f"https://{DEDICATED_ENDPOINT_DNS}/v1beta1/{ENDPOINT_RESOURCE_NAME}"
except NameError:
    pass

client = openai.OpenAI(base_url=BASE_URL, api_key=creds.token)

model_response = client.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": user_message}],
    temperature=temperature,
    max_tokens=max_tokens,
)
print(model_response)

# @markdown Click "Show Code" to see more details.

## Clean up resources

In [None]:
# @title Delete the models and endpoints

# @markdown  Delete the experiment models and endpoints to release the resources
# @markdown  and avoid ongoing charges.

# Undeploy model and delete endpoint.
for endpoint in endpoints.values():
    endpoint.delete(force=True)

# Delete models.
for model in models.values():
    model.delete()

delete_bucket = False  # @param {type:"boolean"}
if delete_bucket:
    ! gsutil -m rm -r $BUCKET_NAME
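
If one deletion in the loops above raises an error, the remaining resources stay alive and keep incurring charges. The sketch below shows one way to keep going and report what failed. `delete_all` is a hypothetical helper, and the two stand-in classes exist only so the example runs without a Google Cloud project.

```python
from typing import Any, Dict, List


def delete_all(resources: Dict[str, Any], **delete_kwargs: Any) -> List[str]:
    """Calls .delete() on every resource and returns the keys that failed."""
    failed = []
    for key, resource in resources.items():
        try:
            resource.delete(**delete_kwargs)
        except Exception as exc:  # Continue so one failure does not block the rest.
            print(f"Could not delete {key}: {exc}")
            failed.append(key)
    return failed


# Stand-ins for aiplatform.Endpoint / aiplatform.Model objects (illustration only).
class FakeResource:
    def delete(self, **kwargs: Any) -> None:
        pass


class BrokenResource:
    def delete(self, **kwargs: Any) -> None:
        raise RuntimeError("already deleted")


demo = {"vllm_gpu": FakeResource(), "stale": BrokenResource()}
print("Failed:", delete_all(demo, force=True))
```

In the notebook this would be called as `delete_all(endpoints, force=True)` followed by `delete_all(models)`, in the same order as the cell above.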

## Debugging Common Bugs and Issues

### HuggingFace Token Needed

<img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/hf_token_needed.png">

The screenshot displays a log entry in Google Cloud's Log Explorer showing an error message related to accessing the Meta LLaMA-3.2-11B-Vision model hosted on Hugging Face. The error indicates that access to the model is restricted, requiring authentication to proceed. The message specifically states, "Cannot access gated repo for URL," highlighting that the model is gated and requires proper authentication credentials to be accessed. This log entry can help troubleshoot authentication issues when working with restricted resources in external repositories.

To resolve this issue, verify the permissions of your HuggingFace access token. Copy the latest token and deploy a new endpoint.

### Chat Template Needed

<img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/chat_template_needed.png">

This screenshot shows a log entry in Google Cloud's Log Explorer, where a ValueError occurs due to a missing chat template in the transformers library version 4.44. The error message indicates that the default chat template is no longer allowed, and a custom chat template must be provided if the tokenizer does not define one. This error highlights a recent change in the library requiring explicit definition of a chat template, useful for debugging issues when deploying chat-based applications.

To resolve this, make sure you deploy an Instruct model, as shown in the sections above. In the case of Llava models, for example, you can provide a chat template.
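
To make the role of a chat template concrete, the sketch below renders a list of chat messages into a single prompt string in plain Python. It follows the Llama 3 Instruct prompt layout as an illustration. Real templates are Jinja strings stored with the tokenizer, and `render_llama3_prompt` is a hypothetical function, not something the serving container uses.

```python
from typing import Dict, List


def render_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Flattens chat messages into one prompt, Llama 3 Instruct style."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header followed by the content and an end marker.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Open an assistant turn so the model generates the reply.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


print(render_llama3_prompt([{"role": "user", "content": "How is your day going?"}]))
```

Without a template like this, the server cannot turn `messages` into a prompt, which is the situation the error above reports.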

### Model Max Seq Len

<img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/model_max_seq_len.png">

ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (2256). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

To resolve this problem, set `max_model_len` to 2048, which is less than 2256. Alternatively, use more or larger GPUs. If you add GPUs, set `tensor-parallel-size` to match the new GPU count.
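
The KV cache limit in this error comes from how much GPU memory is left after the weights are loaded. The sketch below estimates the cache cost per token and checks whether a given `max_model_len` fits. Both functions are hypothetical helpers, and the layer, head, and dimension values are the published Llama 3.1 8B configuration with 16-bit cache entries, used only as an example. The model in the screenshot may differ.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_capacity(free_bytes: int, bytes_per_token: int) -> int:
    """Number of tokens that fit in the memory left for the KV cache."""
    return free_bytes // bytes_per_token


# Assumed example: Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
free_bytes = 282 * 1024 * 1024  # Illustrative amount of memory left for the cache.
capacity = kv_cache_token_capacity(free_bytes, per_token)

for max_model_len in (4096, 2048):
    verdict = "fits" if max_model_len <= capacity else "does not fit"
    print(f"max_model_len={max_model_len} {verdict} in a {capacity}-token cache")
```

Raising `gpu_memory_utilization` or adding GPUs increases the free memory term, while lowering `max_model_len` reduces what a single sequence needs.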