In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI Model Garden - Serve Multimodal Llama 3.2 with vLLM
## A Comprehensive Deployment Tutorial

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_vllm_multimodal_tutorial.ipynb">
      <img alt="Google Cloud Colab Enterprise logo" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" width="32px"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_vllm_multimodal_tutorial.ipynb">
      <img alt="GitHub logo" src="https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png" width="32px"><br> View on GitHub
    </a>
  </td>
</tr></tbody></table>

## Overview

Large Language Models (LLMs) like Llama 3.2 have revolutionized AI, showcasing impressive abilities in text generation, language translation, and knowledge retrieval. Now, with the advent of multimodal models, LLMs can process and understand not only text but also visual inputs, creating an entirely new dimension of interaction. However, effectively deploying and serving these models often requires specialized infrastructure and optimized serving strategies. This is where **vLLM** steps in, transforming how we interact with these complex models.

Imagine a traditional server trying to handle both text and images at the same time: it is like a single-lane road carrying both cars and trucks, where everything slows down. vLLM, by contrast, acts as a multi-lane highway, processing diverse inputs efficiently to maximize throughput and minimize latency.

**vLLM** is an open-source library for high-throughput LLM serving. It addresses the challenges of serving multimodal LLMs by:

*   **Democratizing Access to Multimodal LLMs:** Enabling efficient and cost-effective deployment of powerful multimodal models like Llama 3.2, even on limited hardware. This is akin to giving you a specialized vehicle that can navigate both asphalt roads and challenging terrains seamlessly.

*   **Boosting Multimodal Throughput:** Maximizing the number of requests handled per second, whether text-only or mixed media, which helps keep interactions responsive. vLLM's optimized architecture is designed to keep pace with demanding multimodal workloads.

*   **Enabling Dynamic Customization:** Supporting dynamic LoRA (Low-Rank Adaptation), which lets you adapt model behavior on the fly by loading lightweight adapters instead of retraining the entire model. This is like having interchangeable attachments for a versatile tool: you pick the attachment for the current job without swapping the tool itself.

### Google Vertex AI vLLM Customizations for Multimodal Applications: Enhancing Performance and Integration

The vLLM implementation within Google Vertex AI Model Garden is not merely a direct integration of the open-source library. Vertex AI maintains a customized and optimized version of vLLM, specifically tailored to enhance performance, reliability, and seamless integration within the Google Cloud ecosystem, including comprehensive support for multimodal capabilities. These customizations provide tangible benefits for users deploying LLMs on Vertex AI.

Key Vertex AI vLLM customizations include:

*   **Performance Optimizations:**
    *   **Parallel Downloading from Google Cloud Storage (GCS):** Significantly accelerates model loading and deployment times by enabling parallel data retrieval from GCS, reducing latency and improving startup speed.
    *   **Asynchronous Request Processing:**  Enhances serving efficiency by supporting concurrent handling of multiple prediction requests, optimizing GPU utilization and maximizing throughput, even with mixed modalities.
    *   **Optimized CUDA Kernels:** Incorporates hand-tuned CUDA kernels for further performance gains, ensuring maximal efficiency on NVIDIA GPUs, even when dealing with more complex multimodal inputs.
*   **Feature Enhancements:**
    *   **Dynamic LoRA with Enhanced Caching and GCS Support:** Extends dynamic LoRA capabilities with local disk caching mechanisms and robust error handling, alongside support for loading LoRA weights directly from GCS paths and signed URLs, simplifying management and deployment of customized models.
    *   **OpenAI Chat Completions API Compatibility:**  Provides seamless integration with the widely adopted OpenAI Chat Completions API, enabling straightforward migration and utilization of existing OpenAI-compatible applications.
    *   **Llama 3.1/3.2 Function Calling Parsing:** Implements specialized parsing for Llama 3.1/3.2 function calling, facilitating advanced interactions and tool utilization within multimodal applications.
    *   **Llama 3.2 Validation for Non-Leading Images:** Ensures the correct processing of Llama 3.2 image inputs.

*   **Vertex AI Ecosystem Integration:**
    *   **Vertex Prediction Input/Output Format Support:** Ensures seamless compatibility with Vertex AI prediction input and output formats, simplifying data handling and integration with other Vertex AI services, and supporting multimodal inputs.
    *   **Vertex Environment Variable Awareness:** Respects and leverages Vertex AI environment variables (AIP_*) for configuration and resource management, streamlining deployment and ensuring consistent behavior within the Vertex AI environment for multimodal models.
    *   **Enhanced Error Handling and Robustness:** Implements comprehensive error handling, input/output validation, and server termination mechanisms to ensure stability, reliability, and seamless operation within the managed Vertex AI environment, even when dealing with complex inputs.
    *   **Nginx Server for Scalability:** Integrates an Nginx server on top of the vLLM server, facilitating the deployment of multiple replicas and enhancing scalability and high availability of the serving infrastructure, especially when handling large image and text processing.
    *   **Multimodal Input Support:** Specifically designed to handle multiple types of inputs including images at the raw prediction route `/generate`, simplifying multimodal data integration.

These Vertex AI-specific customizations are vital for creating a production-ready, highly optimized, and seamlessly integrated vLLM serving experience within the Google Cloud ecosystem. By leveraging these enhancements, users can maximize the performance and efficiency of their Llama 3.2 multimodal deployments on Vertex AI Model Garden.
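
As a rough illustration of the environment-variable awareness described above, the sketch below reads the `AIP_HTTP_PORT`, `AIP_HEALTH_ROUTE`, and `AIP_PREDICT_ROUTE` variables that Vertex AI sets for custom serving containers. The names `ServingConfig` and `read_serving_config` are hypothetical, and the fallback defaults (port 8080, `/ping`, `/generate`) are assumptions chosen to match the values used later in this notebook. This is not the code that runs inside the Vertex AI vLLM container.

```python
import os
from typing import Mapping, NamedTuple


class ServingConfig(NamedTuple):
    """Hypothetical container for the settings a serving process needs."""

    port: int
    health_route: str
    predict_route: str


def read_serving_config(env: Mapping[str, str] = os.environ) -> ServingConfig:
    """Reads serving settings from AIP_* variables, with local fallbacks."""
    return ServingConfig(
        # Fallback values are assumptions that mirror this notebook's settings.
        port=int(env.get("AIP_HTTP_PORT", "8080")),
        health_route=env.get("AIP_HEALTH_ROUTE", "/ping"),
        predict_route=env.get("AIP_PREDICT_ROUTE", "/generate"),
    )


# Outside Vertex AI the variables are usually unset, so pass a mapping to try it.
print(read_serving_config({"AIP_HTTP_PORT": "9000"}))
```

Passing the environment as a mapping keeps the function easy to exercise locally, where the `AIP_*` variables are normally absent.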

### Objectives

This tutorial aims to provide a practical guide to:

- Deploying Llama 3.2 multimodal models with optimized vLLM configurations on GPU, leveraging enhanced serving performance for improved throughput and latency.

This tutorial will guide you through the deployment process on Google Vertex AI Model Garden, showcasing how to harness the power of vLLM to unlock the full potential of Llama 3.2. We'll delve into:

1.  **vLLM's Core Innovations:** A high-level overview of the key technologies that drive vLLM's exceptional performance, including Paged Attention, Continuous Batching, and Optimized CUDA Kernels. Think of these as the secret ingredients that give vLLM its edge.

2.  **Vertex AI Model Garden: Your LLM Launchpad:** Discover how Google Vertex AI Model Garden provides a curated collection of pre-built Llama 3.2 models optimized for vLLM, simplifying deployment and eliminating complex configuration steps. This is your fast track to LLM success.

3.  **Streamlined Model Deployment:** Step-by-step instructions on deploying Llama 3.2 models with both standard and optimized vLLM configurations, demonstrating how to select the best approach for your specific needs. We'll cover all the essentials, from selecting the appropriate hardware to configuring the deployment environment.

4.  **Performance Optimization Techniques:** Learn how to leverage dynamic LoRA adapters to fine-tune model behavior for specific tasks and optimize resource utilization to minimize costs. This is where you'll learn how to squeeze every last drop of performance out of your LLM deployment.
  
5.  **Multimodal Support:** Explore the multimodal support in vLLM within the Vertex AI infrastructure.

### Deployment Options
This notebook proceeds with the following deployment option.

 1.  **Deploy prebuilt Llama 3.2 Multimodal models with optimized vLLM:**  
This provides an optimized implementation of the Llama 3.2 models and demonstrates how to serve them with high performance and throughput.


### Costs

This tutorial utilizes billable components of Google Cloud, including:

* Vertex AI
* Cloud Storage

Refer to [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing) to estimate costs based on your projected usage. The [Pricing Calculator](https://cloud.google.com/products/calculator/) is available to generate a more detailed cost breakdown.

## Before you begin

Before executing this notebook, ensure that the following prerequisites are satisfied:

1.  GPU quota has been requested for the accelerators you plan to use.
2.  A Google Cloud Platform (GCP) project with billing enabled.
3.  The Google Cloud SDK (gcloud) installed and configured.
4.  The Vertex AI API and Compute Engine API are enabled.
5.  A Cloud Storage bucket is created for storing experiment outputs.

In [None]:
# @title Request for quota

# @markdown By default, the quota for H100 deployment `Custom model serving per region` is 0. You need to request H100 quota by following the instructions at ["Request a higher quota"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).

In [None]:
# @title Setup Google Cloud project

# @markdown 1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

# @markdown 2. **[Optional]** [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) for storing experiment outputs. Set the BUCKET_URI for the experiment environment. The specified Cloud Storage bucket (`BUCKET_URI`) should be located in the same region as where the notebook was launched. Note that a multi-region bucket (e.g. "us") is not considered a match for a single region covered by the multi-region range (e.g. "us-central1"). If not set, a unique GCS bucket will be created instead.

BUCKET_URI = "gs://"  # @param {type:"string"}

# @markdown 3. **[Optional]** Set region. If not set, the region will be set automatically according to Colab Enterprise environment.

REGION = ""  # @param {type:"string"}

# @markdown 4. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus).

# @markdown > | Machine Type | Accelerator Type | Recommended Regions |
# @markdown | ----------- | ----------- | ----------- |
# @markdown | a2-ultragpu-1g | 1 NVIDIA_A100_80GB | us-central1, us-east4, europe-west4, asia-southeast1 |
# @markdown | a3-highgpu-2g | 2 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |
# @markdown | a3-highgpu-4g | 4 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |
# @markdown | a3-highgpu-8g | 8 NVIDIA_H100_80GB | us-central1, us-east5, europe-west4, us-west1, asia-southeast1 |

# Import the necessary packages

# Upgrade Vertex AI SDK.
! pip3 install --upgrade --quiet 'google-cloud-aiplatform>=1.64.0'
! git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git

import datetime
import importlib
import os
import uuid
from typing import Tuple

from google.cloud import aiplatform

common_util = importlib.import_module(
    "vertex-ai-samples.community-content.vertex_model_garden.model_oss.notebook_util.common_util"
)

models, endpoints = {}, {}

# Get the default cloud project id.
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]

# Get the default region for launching jobs.
if not REGION:
    REGION = os.environ["GOOGLE_CLOUD_REGION"]

# Enable the Vertex AI API and Compute Engine API, if not already.
print("Enabling Vertex AI API and Compute Engine API.")
! gcloud services enable aiplatform.googleapis.com compute.googleapis.com

# Cloud Storage bucket for storing the experiment artifacts.
# A unique GCS bucket will be created for the purpose of this notebook. If you
# prefer using your own GCS bucket, change the value yourself below.
now = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET_NAME = "/".join(BUCKET_URI.split("/")[:3])

if BUCKET_URI is None or BUCKET_URI.strip() == "" or BUCKET_URI == "gs://":
    BUCKET_URI = f"gs://{PROJECT_ID}-tmp-{now}-{str(uuid.uuid4())[:4]}"
    BUCKET_NAME = "/".join(BUCKET_URI.split("/")[:3])
    ! gsutil mb -l {REGION} {BUCKET_URI}
else:
    assert BUCKET_URI.startswith("gs://"), "BUCKET_URI must start with `gs://`."
    shell_output = ! gsutil ls -Lb {BUCKET_NAME} | grep "Location constraint:" | sed "s/Location constraint://"
    bucket_region = shell_output[0].strip().lower()
    if bucket_region != REGION:
        raise ValueError(
            "Bucket region %s is different from notebook region %s"
            % (bucket_region, REGION)
        )
print(f"Using this GCS Bucket: {BUCKET_URI}")

STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")
MODEL_BUCKET = os.path.join(BUCKET_URI, "llama3_2")


# Initialize Vertex AI API.
print("Initializing Vertex AI API.")
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

# Gets the default SERVICE_ACCOUNT.
shell_output = ! gcloud projects describe $PROJECT_ID
project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"
print("Using this default Service Account:", SERVICE_ACCOUNT)


# Provision permissions to the SERVICE_ACCOUNT with the GCS bucket
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.admin $BUCKET_NAME

! gcloud config set project $PROJECT_ID
! gcloud projects add-iam-policy-binding --no-user-output-enabled {PROJECT_ID} --member=serviceAccount:{SERVICE_ACCOUNT} --role="roles/storage.admin"
! gcloud projects add-iam-policy-binding --no-user-output-enabled {PROJECT_ID} --member=serviceAccount:{SERVICE_ACCOUNT} --role="roles/aiplatform.user"
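
The setup cell derives the bucket name from `BUCKET_URI` and rejects a bucket whose location differs from the notebook region. The sketch below isolates those two checks so they can be tried without calling `gsutil`. The function names `bucket_name_from_uri` and `bucket_matches_region` are hypothetical helpers introduced for illustration, and the comparison assumes that the bucket location string is the one reported by `gsutil ls -Lb`.

```python
def bucket_name_from_uri(bucket_uri: str) -> str:
    """Returns the `gs://<bucket>` part of a URI, dropping any object prefix."""
    if not bucket_uri.startswith("gs://"):
        raise ValueError("BUCKET_URI must start with `gs://`.")
    # "gs://my-bucket/a/b".split("/") -> ["gs:", "", "my-bucket", "a", "b"]
    return "/".join(bucket_uri.split("/")[:3])


def bucket_matches_region(bucket_location: str, region: str) -> bool:
    """Exact, case-insensitive match between bucket location and region.

    A multi-region location such as "US" does not match "us-central1",
    which mirrors the rule stated in the setup instructions.
    """
    return bucket_location.strip().lower() == region.strip().lower()


print(bucket_name_from_uri("gs://my-bucket/some/prefix"))
print(bucket_matches_region("US-CENTRAL1", "us-central1"))
print(bucket_matches_region("US", "us-central1"))
```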


In [None]:
# @title Using Hugging Face with Meta Llama 3.1, 3.2 and vLLM for Efficient Model Serving

# @markdown Meta’s Llama 3.1 collection provides a range of multilingual large language models (LLMs) designed for high-quality text generation across various use cases. These models are pretrained and instruction-tuned, excelling in tasks like multilingual dialogue, summarization, and agentic retrieval. For efficient deployment of these models, the vLLM library offers an open-source, streamlined serving environment with optimizations for latency, memory efficiency, and scalability.

# @markdown The Llama 3.1 and 3.2 models are **gated**. You need to agree to share your contact information to access them.

# @markdown <img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/meta_llama_agreement.png" width="800">

In [None]:
# @title HuggingFace User Access Tokens

# @markdown This tutorial requires a read access token from the Hugging Face Hub to access the necessary resources. Please follow these steps to set up your authentication:

# @markdown #### Generate a Read Access Token:
# @markdown  - Navigate to your [Hugging Face account settings](https://huggingface.co/settings/tokens).
# @markdown  - Create a new token, assign it the Read role, and save the token securely.

# @markdown <img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/access_token_settings_link.png" width="800">

# @markdown  #### Utilize the Token:
# @markdown  Use the generated token to authenticate and access public or private repositories as needed for the tutorial.
# @markdown  This setup ensures you have the appropriate level of access without unnecessary permissions. These practices enhance security and prevent accidental token exposure. For more information on setting up access tokens, visit this [page](https://huggingface.co/docs/hub/en/security-tokens).

# @markdown Make sure to enable Read access. Once the access token has been created, handle it with care: avoid sharing or exposing it publicly or online. When you set your token as an environment variable during deployment, it remains private to your project. Vertex AI keeps it secure by preventing other users from accessing your models and endpoints.

# @markdown <img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/manage_access_token.png" width="800">

# @markdown For more information on protecting your access token, please refer to the [Hugging Face Access Tokens - Best Practices](https://huggingface.co/docs/hub/en/security-tokens#best-practices).
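
The deployment function later in this notebook forwards a variable named `HF_TOKEN` to the serving container when that variable is defined. One possible way to define it without pasting the token into the notebook is to read it from the environment, as sketched below. The helper name `read_hf_token` is hypothetical, and the `hf_` prefix check is an assumption based on the format of current Hugging Face user access tokens; relax it if your token looks different.

```python
import os
from typing import Mapping


def read_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Returns the Hugging Face token stored in the HF_TOKEN environment variable."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise ValueError("HF_TOKEN is not set. Create a Read token on Hugging Face first.")
    # Assumption: current user access tokens start with "hf_".
    if not token.startswith("hf_"):
        raise ValueError("HF_TOKEN does not look like a Hugging Face user access token.")
    return token


# Usage (run before the Deploy cell, with HF_TOKEN exported in the environment):
# HF_TOKEN = read_hf_token()
```

Reading the token from the environment keeps it out of the saved notebook, which is in line with the token-handling advice above.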

## Deploy prebuilt Llama 3.2 11B-vision and 90B-vision with vLLM

In [None]:
# @title Deploy

# @markdown This section uploads a prebuilt Llama 3.2 model to Model Registry and deploys it to a Vertex AI Endpoint. It takes 15 minutes to 1 hour to finish, depending on the size of the model.

# @markdown Currently, vLLM supports two input formats for multimodal models: text only, and a single leading image followed by text. More input formats will be supported later.

# @markdown Set the model to deploy.

base_model_name = "Llama-3.2-11B-Vision"  # @param ["Llama-3.2-1B", "Llama-3.2-1B-Instruct", "Llama-3.2-3B", "Llama-3.2-3B-Instruct", "Llama-3.2-11B-Vision", "Llama-3.2-11B-Vision-Instruct", "Llama-3.2-90B-Vision", "Llama-3.2-90B-Vision-Instruct"] {isTemplate:true}
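# NOTE: `VERTEX_AI_MODEL_GARDEN_LLAMA_3_2` must be defined before running this cell.
# It should hold the Cloud Storage path of the Model Garden Llama 3.2 weights.
model_id = os.path.join(VERTEX_AI_MODEL_GARDEN_LLAMA_3_2, base_model_name)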
hf_model_id = "meta-llama/" + base_model_name

# The pre-built serving docker images.
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20241001_0916_RC00"

# @markdown Find Vertex AI prediction supported accelerators and regions at https://cloud.google.com/vertex-ai/docs/predictions/configure-compute.
if "3.2-1B" in base_model_name or "3.2-3B" in base_model_name:
    accelerator_type = "NVIDIA_L4"
    machine_type = "g2-standard-8"
    accelerator_count = 1
elif "3.2-11B" in base_model_name:
    accelerator_type = "NVIDIA_TESLA_A100"
    machine_type = "a2-highgpu-1g"
    accelerator_count = 1
elif "3.2-90B" in base_model_name:
    accelerator_type = "NVIDIA_H100_80GB"
    machine_type = "a3-highgpu-8g"
    accelerator_count = 8
else:
    raise ValueError(f"Recommended GPU setting not found for: {base_model_name}.")

common_util.check_quota(
    project_id=PROJECT_ID,
    region=REGION,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    is_for_training=False,
)

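gpu_memory_utilization = 0.9
max_model_len = 4096
max_num_seqs = 12
# Set to True to deploy to a dedicated endpoint instead of the shared public one.
use_dedicated_endpoint = False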


def deploy_model_vllm(
    model_name: str,
    model_id: str,
    service_account: str,
    base_model_id: str = None,
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    gpu_memory_utilization: float = 0.9,
    max_model_len: int = 4096,
    dtype: str = "auto",
    enable_trust_remote_code: bool = False,
    enforce_eager: bool = False,
    enable_lora: bool = False,
    enable_chunked_prefill: bool = False,
    enable_prefix_cache: bool = False,
    host_prefix_kv_cache_utilization_target: float = 0.0,
    max_loras: int = 1,
    max_cpu_loras: int = 8,
    use_dedicated_endpoint: bool = False,
    max_num_seqs: int = 256,
    model_type: str = None,
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    """Deploys trained models with vLLM into Vertex AI."""
    endpoint = aiplatform.Endpoint.create(
        display_name=f"{model_name}-endpoint",
        dedicated_endpoint_enabled=use_dedicated_endpoint,
    )

    if not base_model_id:
        base_model_id = model_id

    # See https://docs.vllm.ai/en/latest/models/engine_args.html for a list of possible arguments with descriptions.
    vllm_args = [
        "python",
        "-m",
        "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={model_id}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        f"--gpu-memory-utilization={gpu_memory_utilization}",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        f"--max-loras={max_loras}",
        f"--max-cpu-loras={max_cpu_loras}",
        f"--max-num-seqs={max_num_seqs}",
        "--disable-log-stats",
    ]

    if enable_trust_remote_code:
        vllm_args.append("--trust-remote-code")

    if enforce_eager:
        vllm_args.append("--enforce-eager")

    if enable_lora:
        vllm_args.append("--enable-lora")

    if enable_chunked_prefill:
        vllm_args.append("--enable-chunked-prefill")

    if enable_prefix_cache:
        vllm_args.append("--enable-prefix-caching")

    if 0 < host_prefix_kv_cache_utilization_target < 1:
        vllm_args.append(
            f"--host-prefix-kv-cache-utilization-target={host_prefix_kv_cache_utilization_target}"
        )

    if model_type:
        vllm_args.append(f"--model-type={model_type}")

    env_vars = {
        "MODEL_ID": base_model_id,
        "DEPLOY_SOURCE": "notebook",
    }

    # HF_TOKEN is not a compulsory field and may not be defined.
    try:
        if HF_TOKEN:
            env_vars["HF_TOKEN"] = HF_TOKEN
    except NameError:
        pass

    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_args=vllm_args,
        serving_container_ports=[8080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=7200,
    )
    print(
        f"Deploying {model_name} on {machine_type} with {accelerator_count} {accelerator_type} GPU(s)."
    )
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
        system_labels={
            "NOTEBOOK_NAME": "model_garden_pytorch_llama3_2_deployment.ipynb",
        },
    )
    print("endpoint_name:", endpoint.name)

    return model, endpoint


models["vllm_gpu"], endpoints["vllm_gpu"] = deploy_model_vllm(
    model_name=common_util.get_job_name_with_datetime(prefix="llama3_2-serve-vllm"),
    model_id=model_id,
    base_model_id=hf_model_id,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    gpu_memory_utilization=gpu_memory_utilization,
    max_model_len=max_model_len,
    enforce_eager=True,
    use_dedicated_endpoint=use_dedicated_endpoint,
    max_num_seqs=max_num_seqs,
    model_type="llama3.1",
)
# @markdown Click "Show Code" to see more details.
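
The hardware choices in the Deploy cell can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter), takes the nominal parameter counts from the model names, and uses the usual per-GPU memory sizes (24 GB for L4, 40 GB for the A100 on `a2-highgpu-1g`, 80 GB for H100). The helpers `weights_gb` and `weights_fit` are hypothetical, and the estimate covers weights only: the KV cache and activations need part of the remaining memory, so treat this as a lower bound rather than a sizing guarantee.

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB: 1e9 params * bytes per param / 1e9."""
    return params_billion * bytes_per_param


def weights_fit(
    params_billion: float,
    gpu_memory_gb: float,
    gpu_count: int,
    gpu_memory_utilization: float = 0.9,
) -> bool:
    """Checks whether the weights alone fit in the memory budget given to vLLM."""
    budget_gb = gpu_memory_gb * gpu_count * gpu_memory_utilization
    return weights_gb(params_billion) < budget_gb


# (label, nominal parameters in billions, GB per GPU, GPU count)
for label, params, gpu_gb, count in [
    ("Llama-3.2-3B on 1 x L4", 3, 24, 1),
    ("Llama-3.2-11B-Vision on 1 x A100 40GB", 11, 40, 1),
    ("Llama-3.2-90B-Vision on 8 x H100 80GB", 90, 80, 8),
]:
    fits = weights_fit(params, gpu_gb, count)
    print(f"{label}: ~{weights_gb(params):.0f} GB of weights, fits={fits}")
```

Under these assumptions the 90B model needs roughly 180 GB for weights alone, which is why it is spread across several H100 GPUs with `--tensor-parallel-size`.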

In [None]:
# @title Chat completion for vision models

ENDPOINT_RESOURCE_NAME = "projects/{}/locations/{}/endpoints/{}".format(
    PROJECT_ID, REGION, endpoints["vllm_gpu"].name
)


# @markdown Once deployment succeeds, you can send requests to the endpoint using the OpenAI SDK.

# @markdown First you will need to install the SDK and some auth-related dependencies.

! pip install -qU openai google-auth requests

# @markdown Next fill out some request parameters:

user_image = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/The_Blue_Marble_%28remastered%29.jpg/580px-The_Blue_Marble_%28remastered%29.jpg"  # @param {type: "string"}
user_message = "What is in the image?"  # @param {type: "string"}
# @markdown If you encounter an error such as `ServiceUnavailable: 503 Took too long to respond when processing`, reduce the maximum number of output tokens, for example by setting `max_tokens` to 20.
max_tokens = 50  # @param {type: "integer"}
temperature = 1.0  # @param {type: "number"}

# @markdown Now we can send a request.

import google.auth
import google.auth.transport.requests  # Needed for the Request() call below.
import openai

creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

BASE_URL = (
    f"https://{REGION}-aiplatform.googleapis.com/v1beta1/{ENDPOINT_RESOURCE_NAME}"
)
try:
    if use_dedicated_endpoint:
        BASE_URL = f"https://{DEDICATED_ENDPOINT_DNS}/v1beta1/{ENDPOINT_RESOURCE_NAME}"
except NameError:
    pass

client = openai.OpenAI(base_url=BASE_URL, api_key=creds.token)

model_response = client.chat.completions.create(
    model="",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": user_image}},
                {"type": "text", "text": user_message},
            ],
        }
    ],
    temperature=temperature,
    max_tokens=max_tokens,
)
print(model_response)

# @markdown Click "Show Code" to see more details.
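
The request above places the image part before the text part, which matches the "single leading image + text" format that the Deploy section lists as supported. The sketch below wraps that layout in a small helper and adds a way to turn local image bytes into a `data:` URL. Both function names are hypothetical, and whether the deployed endpoint accepts base64 `data:` URLs in place of public image URLs is an assumption that you should verify against your deployment.

```python
import base64
from typing import Any, Dict, List


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encodes raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_messages(image_url: str, text: str) -> List[Dict[str, Any]]:
    """Builds a single user turn with one leading image followed by text."""
    if not image_url:
        raise ValueError("image_url must not be empty.")
    return [
        {
            "role": "user",
            "content": [
                # The image comes first: only a single leading image is supported.
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": text},
            ],
        }
    ]


messages = build_vision_messages("https://example.com/image.jpg", "What is in the image?")
print(messages[0]["content"][0]["type"])
```

You can pass the result as `messages=build_vision_messages(user_image, user_message)` to `client.chat.completions.create`, and read the generated text from `model_response.choices[0].message.content` instead of printing the whole response object.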

## Clean up resources

In [None]:
# @title Delete the models and endpoints

# @markdown  Delete the experiment models and endpoints to release the resources
# @markdown  and avoid unnecessary ongoing charges.

# Undeploy model and delete endpoint.
for endpoint in endpoints.values():
    endpoint.delete(force=True)

# Delete models.
for model in models.values():
    model.delete()

delete_bucket = False  # @param {type:"boolean"}
if delete_bucket:
    ! gsutil -m rm -r $BUCKET_NAME
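
If one deletion fails in the cell above, the remaining resources are left running. A minimal sketch of a more forgiving cleanup loop is shown below. The helper `delete_all` is hypothetical and only assumes that each resource exposes a `delete()` method, as the `endpoints` and `models` objects in this notebook do.

```python
from typing import Any, Iterable, List, Tuple


def delete_all(resources: Iterable[Any], **delete_kwargs: Any) -> List[Tuple[Any, Exception]]:
    """Calls delete() on every resource and returns the ones that failed."""
    failures = []
    for resource in resources:
        try:
            resource.delete(**delete_kwargs)
        except Exception as error:
            # Keep going so that the remaining resources are still cleaned up.
            failures.append((resource, error))
    return failures


# Usage with the dictionaries defined in this notebook:
# failed_endpoints = delete_all(endpoints.values(), force=True)
# failed_models = delete_all(models.values())
```

Inspect the returned list afterwards, and delete any leftover resources manually in the Google Cloud console.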

## Debugging Common Bugs and Issues

In [None]:
# @title HuggingFace Token Needed

# @markdown <img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/hf_token_needed.png" width="800">

# @markdown  The screenshot shows a log entry in Google Cloud's Logs Explorer with an error raised while accessing the Meta Llama-3.2-11B-Vision model hosted on Hugging Face. The message "Cannot access gated repo for URL" indicates that the model is gated and that the request did not carry valid authentication credentials. This log entry can help you troubleshoot authentication issues when working with restricted resources in external repositories.

# @markdown  To resolve this issue, verify the permissions of your Hugging Face access token. Copy the latest token and deploy a new endpoint.

In [None]:
# @title Chat Template Needed

# @markdown <img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/chat_template_needed.png" width="800">

# @markdown This screenshot shows a log entry in Google Cloud's Logs Explorer, where a ValueError occurs because of a missing chat template in version 4.44 of the transformers library. The error message indicates that the default chat template is no longer allowed, and that a custom chat template must be provided if the tokenizer does not define one. This reflects a recent change in the library that requires an explicit chat template, which is useful to know when debugging chat-based deployments.

# @markdown To avoid this error, deploy an Instruct model variant (for example, `Llama-3.2-11B-Vision-Instruct`), whose tokenizer defines a chat template. For models that do not ship one, such as some Llava models, you can provide a chat template yourself.

In [None]:
# @title Model Max Seq Len

# @markdown <img src="https://cloud.google.com/static/vertex-ai/generative-ai/docs/open-models/vllm/images/model_max_seq_len.png" width="800">

# @markdown ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (2256). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

# @markdown To resolve this problem, set `max_model_len` to 2048, which is less than 2256.
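
The fix follows from a simple rule: `max_model_len` must not exceed the number of tokens that fit in the KV cache reported in the error. The sketch below picks the largest power of two that satisfies that rule. The helper name is hypothetical, and rounding down to a power of two is only a convention chosen for tidy values; any length at or below the reported capacity would satisfy the constraint.

```python
def largest_fitting_max_model_len(requested_len: int, kv_cache_tokens: int) -> int:
    """Returns the largest power of two that is <= min(requested_len, kv_cache_tokens)."""
    limit = min(requested_len, kv_cache_tokens)
    if limit < 1:
        raise ValueError("Both arguments must be at least 1.")
    # bit_length() - 1 is the exponent of the highest power of two not above limit.
    return 1 << (limit.bit_length() - 1)


# Values taken from the error message above.
print(largest_fitting_max_model_len(4096, 2256))  # 2048
```

Alternatively, as the error message suggests, increasing `gpu_memory_utilization` enlarges the KV cache and may allow the original `max_model_len` to fit.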