In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Deploying Multiple LoRA Adapters on Vertex AI with vLLM

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/get_started_with_vllm_lora_serving_on_vertex_ai.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fopen-models%2Fserving%2Fget_started_with_vllm_lora_serving_on_vertex_ai.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/open-models/serving/get_started_with_vllm_lora_serving_on_vertex_ai.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/get_started_with_vllm_lora_serving_on_vertex_ai.ipynb">
      <img width="32px" src="https://storage.googleapis.com/github-repo/generative-ai/logos/GitHub_Invertocat_Dark.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| Author(s) |
| --- |
| [Ivan Nardini](https://github.com/inardini) |

## Overview

This tutorial shows how to deploy multiple LoRA (Low-Rank Adaptation) adapters on Google Cloud's Vertex AI using vLLM. By the end, you'll be able to serve a single base model with multiple specialized adapters, so one deployment can handle different tasks (such as SQL generation and code generation) on the same infrastructure.

### What you'll cover

You'll deploy a Vertex AI endpoint that serves:
- **Base Model**: `google/gemma-2-2b-it` (Gemma 2 2B Instruct)
- **SQL Adapter**: `google-cloud-partnership/gemma-2-2b-it-lora-sql` (specialized for SQL query generation)
- **Magicoder Adapter**: `google-cloud-partnership/gemma-2-2b-it-lora-magicoder` (specialized for code generation)

All three models will be available simultaneously, and clients can switch between them on a per-request basis with minimal overhead.


## Get started


### Install required packages

In [None]:
%pip install --upgrade --quiet google-cloud-aiplatform huggingface_hub[hf_transfer]

### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Also make sure that the following IAM roles are granted to your account:

   - `roles/aiplatform.user` (Vertex AI User)
   - `roles/artifactregistry.admin` (Artifact Registry Admin)
   - `roles/cloudbuild.builds.editor` (Cloud Build Editor)
   - `roles/storage.admin` (Storage Admin)

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
# Use the environment variable if the user doesn't provide Project ID.
import os
import vertexai

# fmt: off
PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
# fmt: on
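if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    # Fall back to the environment; avoid str(None), which would yield "None"
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "")
if not PROJECT_ID:
    raise ValueError("Set PROJECT_ID or the GOOGLE_CLOUD_PROJECT environment variable.")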

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

# Create GCS bucket
BUCKET_NAME = f"{PROJECT_ID}-vllm-peft-serving"
BUCKET_URI = f"gs://{BUCKET_NAME}"

! gcloud storage buckets create {BUCKET_URI} --location={LOCATION} --project={PROJECT_ID}

# Set fast download from HF
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Initialize Vertex AI SDK
vertexai.init(project=PROJECT_ID, location=LOCATION)

### Import libraries

Import required libraries.

In [None]:
from pathlib import Path as p
from huggingface_hub import interpreter_login
from huggingface_hub import get_token, snapshot_download
from google.cloud import aiplatform
import json

## Create Artifact Registry Repository

Artifact Registry is Google Cloud's service for storing and managing container images, packages, and other artifacts. Think of it as a private Docker Hub for your organization.

You will store your custom Docker container image in Artifact Registry.


In [None]:
DOCKER_REPOSITORY = "vllm-lora-repo"

# Create the repository
!gcloud artifacts repositories create {DOCKER_REPOSITORY} \
    --repository-format=docker \
    --project={PROJECT_ID} \
    --location={LOCATION} \
    --description="Repository for vLLM containers with LoRA support"

Verify the repository.

In [None]:
!gcloud artifacts repositories list --location={LOCATION} --project={PROJECT_ID}

## Download Models and Adapters


### Authenticate your Hugging Face account

Many models on Hugging Face, including Gemma, are "gated": you need to request access and accept the terms of use before downloading. The Hugging Face token authenticates your downloads. It is needed only during the download phase, not when serving the model.

To generate a new user access token with read-only access:

1. Create a [Hugging Face account](https://huggingface.co/) if you don't have one
2. Go to **Settings → Access Tokens**
3. Click **New Token**
4. Set name (e.g., "vertex-ai-deployment") and role (Read)
5. Click **Generate**
6. Copy the token

In [None]:
interpreter_login()

### Create Build directory and download models

Prepare the local directories that will hold the base model and the LoRA adapters.

In [None]:
base_model_id = "google/gemma-2-2b-it"
sql_adapter_id = "google-cloud-partnership/gemma-2-2b-it-lora-sql"
magicoder_adapter_id = "google-cloud-partnership/gemma-2-2b-it-lora-magicoder"

models_dir = "./models"
adapters_dir = "./adapters"

p(models_dir).mkdir(exist_ok=True, parents=True)
p(adapters_dir).mkdir(exist_ok=True, parents=True)

Download base model and adapters.

In [None]:
# Download base model
base_model_path = snapshot_download(
    repo_id=base_model_id,
    token=get_token(),
    local_dir=f"{models_dir}/gemma-2-2b-it",
    local_dir_use_symlinks=False
)

# Download SQL LoRA adapter
sql_adapter_path = snapshot_download(
    repo_id=sql_adapter_id,
    token=get_token(),
    local_dir=f"{adapters_dir}/sql-lora",
    local_dir_use_symlinks=False
)

# Download Magicoder LoRA adapter
magicoder_adapter_path = snapshot_download(
    repo_id=magicoder_adapter_id,
    token=get_token(),
    local_dir=f"{adapters_dir}/magicoder-lora",
    local_dir_use_symlinks=False
)

#### Upload Models to Google Cloud Storage

Upload models to GCS for fast downloads during container startup.

> **Note**: Instead of baking models into the Docker image (which would create a ~13GB image), we store models in GCS and download them when the container starts. This keeps the Docker image lightweight, makes model updates easier (no rebuild needed), and leverages GCS's fast regional download speeds.


In [None]:
! gcloud config set storage/parallel_composite_upload_enabled True

# Upload base model
!gcloud storage cp -r {models_dir}/gemma-2-2b-it {BUCKET_URI}/models/

# Upload adapters
!gcloud storage cp -r {adapters_dir}/sql-lora {BUCKET_URI}/adapters/
!gcloud storage cp -r {adapters_dir}/magicoder-lora {BUCKET_URI}/adapters/

## Build Custom vLLM Container

Our custom Docker container will:

1. Start from the official vLLM GPU image
2. **Add a script to download the base model and LoRA adapters** from GCS at container startup
3. Configure vLLM to use the local models (no downloads at runtime)
4. Set up Vertex AI compatible health checks


### Create the build files

The build uses three key files:

- **`Dockerfile`**: Defines the container image
- **`entrypoint.sh`**: Downloads models from GCS at startup
- **`cloudbuild.yaml`**: Orchestrates the Docker build process

#### Create Dockerfile

This Dockerfile:

- Starts from the official vLLM GPU image
- Installs Google Cloud SDK (for downloading from GCS)
- Adds entrypoint script

The Dockerfile uses a multi-layer approach. The base vLLM image already contains Python, CUDA drivers, and vLLM itself. We add only the `gcloud` CLI tool to enable GCS downloads. The `ENTRYPOINT` directive ensures our custom script runs before vLLM starts, downloading models first.

In [None]:
build_dir = "build"
p(build_dir).mkdir(exist_ok=True, parents=True)

In [None]:
dockerfile = """
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

ENV DEBIAN_FRONTEND=noninteractive

# Install gcloud SDK for downloading models from GCS
RUN apt-get update && \\
    apt-get install -y apt-utils git apt-transport-https gnupg ca-certificates curl && \\
    echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \\
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg && \\
    apt-get update -y && apt-get install google-cloud-cli -y && \\
    rm -rf /var/lib/apt/lists/*

WORKDIR /workspace/vllm

# Copy entrypoint script
COPY ./entrypoint.sh /workspace/vllm/vertexai/entrypoint.sh
RUN chmod +x /workspace/vllm/vertexai/entrypoint.sh

ENTRYPOINT ["/workspace/vllm/vertexai/entrypoint.sh"]
"""

# Write Dockerfile
with open(f"{build_dir}/Dockerfile", "w") as f:
    f.write(dockerfile)


#### Create entrypoint.sh

This script downloads models/adapters from GCS and starts vLLM.

In particular, it intercepts vLLM arguments, detects GCS paths (gs://...), downloads those resources to local directories, rewrites the paths to point locally, then launches vLLM with the updated arguments. This happens transparently - vLLM never knows it's loading from GCS. The `set -euo pipefail` ensures the script fails fast if any download fails, preventing vLLM from starting with missing models.

In [None]:
entrypoint = """#!/bin/bash

set -euo pipefail

readonly LOCAL_MODEL_DIR="/workspace/models"
readonly LOCAL_ADAPTER_DIR="/workspace/adapters"

gcloud config set storage/parallel_composite_upload_enabled True

download_from_gcs() {
    gcs_uri=$1
    local_dir=$2

    echo "Downloading from $gcs_uri to $local_dir..."
    parent_dir=$(dirname "$local_dir")
    mkdir -p "$parent_dir"

    # Download contents to parent, which creates the final directory
    if gcloud storage cp -r "$gcs_uri" "$parent_dir/"; then
        echo "Downloaded successfully to ${local_dir}"
    else
        echo "Failed to download from: $gcs_uri" >&2
        exit 1
    fi
}

updated_args=()
for arg in "$@"; do
    # Check if argument starts with --model= and points to gs://
    if [[ $arg == --model=gs://* ]]; then
        model_path="${arg#--model=}"
        base_model_name=$(basename "$model_path")
        local_path="${LOCAL_MODEL_DIR}/${base_model_name}"

        download_from_gcs "$model_path" "$local_path"
        updated_args+=("--model=${local_path}")

    # Check if argument is a LoRA module path with gs://
    elif [[ $arg == *=gs://* && $arg != --model=* ]]; then
        # Format: name=gs://path
        adapter_name="${arg%%=*}"
        gcs_path="${arg#*=}"
        adapter_basename=$(basename "$gcs_path")
        local_adapter_path="${LOCAL_ADAPTER_DIR}/${adapter_basename}"

        download_from_gcs "$gcs_path" "$local_adapter_path"
        updated_args+=("${adapter_name}=${local_adapter_path}")
    else
        updated_args+=("$arg")
    fi
done

echo "Starting vLLM with arguments: ${updated_args[@]}"
exec "${updated_args[@]}"
"""

with open(f"{build_dir}/entrypoint.sh", "w") as f:
    f.write(entrypoint)
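
The argument rewriting that `entrypoint.sh` performs can also be sketched in Python, which makes the logic easier to check before building the image. This is only an illustration of the loop above, not part of the container: `rewrite_gcs_args` is a hypothetical helper, the bucket name `my-bucket` is a placeholder, and no download is performed (the function only reports what would be copied).

```python
LOCAL_MODEL_DIR = "/workspace/models"
LOCAL_ADAPTER_DIR = "/workspace/adapters"


def rewrite_gcs_args(args):
    """Mirror the entrypoint.sh loop: swap gs:// paths for local paths.

    Returns (updated_args, downloads), where downloads is a list of
    (gcs_uri, local_path) pairs that the script would copy from GCS.
    """
    updated, downloads = [], []
    for arg in args:
        if arg.startswith("--model=gs://"):
            gcs_uri = arg[len("--model="):]
            local = f"{LOCAL_MODEL_DIR}/{gcs_uri.rstrip('/').rsplit('/', 1)[-1]}"
            downloads.append((gcs_uri, local))
            updated.append(f"--model={local}")
        elif "=gs://" in arg and not arg.startswith("--model="):
            # LoRA module in the form name=gs://path
            name, gcs_uri = arg.split("=", 1)
            local = f"{LOCAL_ADAPTER_DIR}/{gcs_uri.rstrip('/').rsplit('/', 1)[-1]}"
            downloads.append((gcs_uri, local))
            updated.append(f"{name}={local}")
        else:
            updated.append(arg)
    return updated, downloads


# Placeholder arguments; "my-bucket" is hypothetical
updated, downloads = rewrite_gcs_args([
    "python3", "-m", "vllm.entrypoints.openai.api_server",
    "--model=gs://my-bucket/models/gemma-2-2b-it",
    "--lora-modules",
    "sql-lora=gs://my-bucket/adapters/sql-lora",
])
print(updated)
print(downloads)
```

Arguments that do not contain a `gs://` path pass through unchanged, which is why the same image can also be started with local or Hugging Face model identifiers.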

#### Create cloudbuild.yaml

This file tells Cloud Build how to build your container:
- Uses the official vLLM GPU image as a base
- Adds custom entrypoint for Vertex AI compatibility
- Pushes the image to Artifact Registry

In [None]:
cloudbuild = """
steps:
  - name: 'gcr.io/cloud-builders/docker'
    automapSubstitutions: true
    script: |
      #!/usr/bin/env bash
      set -euo pipefail

      base_image=${_BASE_IMAGE}
      image_name="vllm-lora-gcs"

      echo "Building container image..."
      docker build -t $LOCATION-docker.pkg.dev/$PROJECT_ID/${_REPOSITORY}/$image_name --build-arg BASE_IMAGE=$base_image .

      echo "Pushing image to Artifact Registry..."
      docker push $LOCATION-docker.pkg.dev/$PROJECT_ID/${_REPOSITORY}/$image_name

substitutions:
  _BASE_IMAGE: vllm/vllm-openai:v0.11.0
  _REPOSITORY: {DOCKER_REPOSITORY}

timeout: 1800s
""".replace(
    "{DOCKER_REPOSITORY}", DOCKER_REPOSITORY
)

with open(f"{build_dir}/cloudbuild.yaml", "w") as f:
    f.write(cloudbuild)

### Build the Container

Now let's build the container using Cloud Build. This process takes less than 10 minutes.


In [None]:
# Build the container using Cloud Build
!cd {build_dir} && \
gcloud builds submit \
    --config=cloudbuild.yaml \
    --project={PROJECT_ID} \
    --region={LOCATION} \
    --timeout="1h" \
    --machine-type=e2-highcpu-8 \
    --substitutions=_REPOSITORY={DOCKER_REPOSITORY},_BASE_IMAGE=vllm/vllm-openai:v0.11.0

Verify that the image exists in Artifact Registry.


In [None]:
! gcloud artifacts docker images list {LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY} --include-tags

## Configure and Upload Model to Vertex AI

When you upload a model to Vertex AI, you're creating a **Model Resource** that contains:
- Container image location
- Server startup arguments
- Environment variables
- Health check configuration
- Resource requirements

#### Define model configuration and environment variables

The table below describes the main vLLM arguments used to deploy the model.

| Argument | Purpose | Impact |
|----------|---------|--------|
| `--model=gs://...` | GCS path to base model | Downloaded at startup by entrypoint.sh |
| `--enable-lora` | Enables LoRA adapter support | **Required** for serving adapters |
| `--lora-modules name=gs://...` | Pre-load LoRA adapters from GCS | Downloaded and loaded at startup |
| `--max-loras=4` | Max adapters in memory simultaneously | Higher = more adapters, more memory usage |
| `--max-lora-rank=64` | Max rank of LoRA matrices | Must match or exceed your adapter ranks |
| `--max-model-len=2048` | Max sequence length (tokens) | Lower = more memory for batch processing |
| `--gpu-memory-utilization=0.9` | GPU memory to use | 0.9 leaves 10% buffer for safety |
| `--enable-prefix-caching` | Cache common prefixes | Improves latency for similar requests |
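
Because `--max-lora-rank` must match or exceed the rank of every adapter, it can help to read the rank from each downloaded adapter before deploying. The sketch below assumes the adapters were saved with PEFT, which records the rank under the `r` key of `adapter_config.json`; the helper names are hypothetical.

```python
import json
from pathlib import Path


def read_adapter_rank(adapter_dir):
    """Return the LoRA rank `r` recorded in a PEFT-style adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with open(config_path) as f:
        return int(json.load(f)["r"])


def check_max_lora_rank(adapter_dirs, max_lora_rank):
    """Return {adapter_dir: rank} for adapters whose rank exceeds the limit."""
    ranks = {str(d): read_adapter_rank(d) for d in adapter_dirs}
    return {d: r for d, r in ranks.items() if r > max_lora_rank}


# Example (paths match the download step; run after the adapters exist locally):
# too_large = check_max_lora_rank(
#     ["./adapters/sql-lora", "./adapters/magicoder-lora"], max_lora_rank=64
# )
# print(too_large or "All adapters fit within --max-lora-rank")
```

An empty result means every adapter fits within the configured `--max-lora-rank`.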


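> **Note**: The L4 GPU has 24GB VRAM. A rough estimate for our settings: ~5GB for the base model, ~8GB for KV cache, ~50MB per LoRA adapter, ~2GB CUDA overhead = ~15GB, which fits comfortably. Treat these figures as approximate: vLLM preallocates up to the `--gpu-memory-utilization` fraction of VRAM (0.9 × 24GB ≈ 21.6GB) and assigns whatever remains after weights and overhead to the KV cache, so the KV cache can grow beyond ~8GB. If you increase `--max-model-len` or `--max-loras`, you may need to reduce `--gpu-memory-utilization` to avoid OOM errors.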
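
The arithmetic in the note can be reproduced with a few lines of Python, which is handy when you change GPU type or adapter count. The component sizes below are the rough estimates from the note, not measurements, and `gpu_memory_budget` is a hypothetical helper.

```python
def gpu_memory_budget(total_gb, utilization, components_gb):
    """Compare estimated memory use with the GPU total and the utilization cap."""
    cap_gb = total_gb * utilization
    used_gb = sum(components_gb.values())
    return {
        "cap_gb": round(cap_gb, 2),
        "estimated_use_gb": round(used_gb, 2),
        "headroom_under_cap_gb": round(cap_gb - used_gb, 2),
        "free_of_total_gb": round(total_gb - used_gb, 2),
    }


# Estimates taken from the note above (two adapters at ~50MB each)
estimate = gpu_memory_budget(
    total_gb=24,
    utilization=0.9,
    components_gb={
        "base_model": 5.0,
        "kv_cache": 8.0,
        "lora_adapters": 2 * 0.05,
        "cuda_overhead": 2.0,
    },
)
print(estimate)
```

If `headroom_under_cap_gb` approaches zero after you raise `--max-model-len` or add adapters, that is a signal to revisit the settings before deploying.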

In [None]:
# Model configuration
MODEL_NAME = "gemma-2-2b-multi-lora"
MACHINE_TYPE = "g2-standard-8"
ACCELERATOR_TYPE = "NVIDIA_L4"
ACCELERATOR_COUNT = 1
DOCKER_URI = f"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/vllm-lora-gcs"

# vLLM server arguments - using GCS paths (downloaded by entrypoint.sh at startup)
vllm_args = [
    "python3",
    "-m",
    "vllm.entrypoints.openai.api_server",
    "--host=0.0.0.0",
    "--port=8080",
    f"--model={BUCKET_URI}/models/gemma-2-2b-it",  # GCS path
    "--max-model-len=2048",
    "--gpu-memory-utilization=0.9",
    "--enable-lora",  # CRITICAL: Enable LoRA support
    "--max-loras=4",  # Allow up to 4 LoRA adapters in memory
    "--max-lora-rank=64",
    "--enable-prefix-caching",
    f"--tensor-parallel-size={ACCELERATOR_COUNT}",
    # Load LoRA adapters at startup from GCS
    "--lora-modules",
    f"sql-lora={BUCKET_URI}/adapters/sql-lora",
    f"magicoder-lora={BUCKET_URI}/adapters/magicoder-lora",
]

# Environment variables for the container
env_vars = {
    "LD_LIBRARY_PATH": "$LD_LIBRARY_PATH:/usr/local/nvidia/lib64",  # NVIDIA libraries
}

#### Upload Model to Model Registry

We are now ready to register the model in Vertex AI Model Registry, a managed model repository to version your models.


In [None]:
vertexai_model = aiplatform.Model.upload(
    display_name=MODEL_NAME,
    serving_container_image_uri=DOCKER_URI,
    serving_container_args=vllm_args,
    serving_container_ports=[8080],
    serving_container_predict_route="/v1/completions",
    serving_container_health_route="/health",
    serving_container_environment_variables=env_vars,
    serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB shared memory
    serving_container_deployment_timeout=1800,  # 30 minutes timeout
)

## Create Vertex AI Endpoint

An **Endpoint** in Vertex AI is a stable URL for making predictions. It can host one or more deployed models, handles load balancing and traffic splitting, and provides monitoring and logging.

Let's create a new endpoint to deploy our models.


In [None]:
vertexai_endpoint = aiplatform.Endpoint.create(
    display_name=f"{MODEL_NAME}-endpoint"
)

## Deploy Model to Endpoint

Finally, let's deploy our models. Vertex AI provisions a VM with the specified GPU, pulls your Docker image from Artifact Registry, starts the container, and monitors health checks. Your entrypoint.sh script downloads ~5.1 GB of model files from GCS (fast because it's in the same region), then vLLM loads them into GPU memory and starts the OpenAI-compatible API server on port 8080. The deployment completes when the health check endpoint (`/health`) returns 200 OK.

Deployment takes about 15-25 minutes.


In [None]:
vertexai_model.deploy(
    endpoint=vertexai_endpoint,
    deployed_model_display_name=MODEL_NAME,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    min_replica_count=1,  # Minimum number of instances
    max_replica_count=4,  # Maximum for autoscaling
    autoscaling_target_accelerator_duty_cycle=60,  # Scale up at 60% GPU utilization
    traffic_percentage=100,  # Route 100% of traffic to this model
    deploy_request_timeout=1800,  # 30 minute timeout
)

## Testing Your Deployment

Now for the fun part - testing your multi-LoRA deployment.

### Test 1: Base Model (No Adapter)

Let's first test the base Gemma model without any adapter.

Without a LoRA adapter, you're using the pure Gemma-2-2b-it model. It will provide general conversational responses. This serves as a baseline - when you specify a LoRA adapter in subsequent tests, you'll see how the model's behavior changes for specialized tasks.

In [None]:
# Test base model without adapter
prompt = "What is machine learning?"

request_body = json.dumps({
    "prompt": prompt,
    "max_tokens": 100,
    "temperature": 0.7,
})

response = vertexai_endpoint.raw_predict(
    body=request_body,
    headers={"Content-Type": "application/json"}
)

if response.status_code == 200:
    result = json.loads(response.text)
    generated_text = result["choices"][0]["text"]
    print(f"Response:\n{generated_text}\n")
else:
    print(f"Error: {response.status_code}")
    print(response.text)

### Test 2: SQL Adapter

Now let's test the SQL adapter for database query generation.

Notice we're adding `"model": "sql-lora"` to the request body. This tells vLLM to apply the SQL LoRA adapter on top of the base model. The adapter was trained specifically for text-to-SQL tasks, so it should generate syntactically correct SQL that accurately answers the question. We also set `temperature=0.0` for deterministic output - SQL queries shouldn't be creative.

In [None]:
# Test SQL LoRA adapter
prompt = """Write a SQL query to answer the question based on the table schema.

context: CREATE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    department VARCHAR(50),
    salary DECIMAL(10,2),
    hire_date DATE
)

question: What is the average salary of employees in the Engineering department?

SQL query:"""

request_body = json.dumps({
    "model": "sql-lora",  # Specify the adapter
    "prompt": prompt,
    "max_tokens": 150,
    "temperature": 0.0,  # Use 0 for deterministic SQL generation
    # "stop": [";", "\n\n"]  # Stop at query end
})

response = vertexai_endpoint.raw_predict(
    body=request_body,
    headers={"Content-Type": "application/json"}
)

if response.status_code == 200:
    result = json.loads(response.text)
    generated_sql = result["choices"][0]["text"]
    print(f"Generated SQL:\n{generated_sql}\n")
else:
    print(f"Error: {response.status_code}")
    print(response.text)

### Test 3: Magicoder Adapter

Test the code generation adapter.

In [None]:
# Test Magicoder LoRA adapter for code generation
prompt = """Write a Python function to count to 10"""

request_body = json.dumps({
    "model": "magicoder-lora",  # Specify the adapter
    "prompt": prompt,
    "max_tokens": 200,
    "temperature": 0.2,
    # "stop": ["\n\n", "# Example", "# Test"]
})

response = vertexai_endpoint.raw_predict(
    body=request_body,
    headers={"Content-Type": "application/json"}
)

if response.status_code == 200:
    result = json.loads(response.text)
    generated_code = result["choices"][0]["text"]
    print(f"Generated Code:\n{generated_code}\n")
else:
    print(f"Error: {response.status_code}")
    print(response.text)
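
The three tests differ only in the `model` field and the sampling parameters, so the request body can be built by one small function. The sketch below is a convenience wrapper, not part of any SDK; `build_completion_body` is a hypothetical name, and it only constructs the JSON string that you would pass to `raw_predict`.

```python
import json


def build_completion_body(prompt, adapter=None, max_tokens=100,
                          temperature=0.7, stop=None):
    """Build the JSON body for a /v1/completions request.

    adapter: LoRA name registered via --lora-modules, or None for the base model.
    """
    body = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    if adapter:
        body["model"] = adapter  # Selects the LoRA adapter for this request
    if stop:
        body["stop"] = stop
    return json.dumps(body)


base_body = build_completion_body("What is machine learning?")
sql_body = build_completion_body(
    "SQL query:", adapter="sql-lora", max_tokens=150, temperature=0.0
)
print(sql_body)
```

You would then send a body with `vertexai_endpoint.raw_predict(body=sql_body, headers={"Content-Type": "application/json"})`, exactly as in the tests above.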

## Advanced Usage Patterns

### Pattern 1: Batch Processing with Different Adapters

Process multiple requests with different adapters in parallel using the Vertex AI SDK.

In [None]:
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

def batch_predict_multi_adapter(endpoint: aiplatform.Endpoint, requests: list):
    """
    Send multiple requests to different adapters in parallel.

    Args:
        endpoint: The Vertex AI endpoint object
        requests: List of (model_name, prompt) tuples

    Returns:
        List of (model_name, response_text) tuples
    """
    def single_predict(model_name, prompt):
        request_body = {
            "prompt": prompt,
            "max_tokens": 300,
            "temperature": 0.7,
        }

        if model_name:
            request_body["model"] = model_name

        response = endpoint.raw_predict(
            body=json.dumps(request_body),
            headers={"Content-Type": "application/json"}
        )

        if response.status_code == 200:
            result = json.loads(response.text)
            return (model_name or "base", result["choices"][0]["text"])
        else:
            return (model_name or "base", f"Error: {response.status_code}")

    # Execute requests in parallel
    results = []
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(single_predict, model_name, prompt)
                   for model_name, prompt in requests]

        for future in as_completed(futures):
            results.append(future.result())

    return results

# Example usage
requests = [
    (None, "What is Python?"),
    ("sql-lora", "Generate SQL: SELECT all users"),
    ("magicoder-lora", "Write a function to reverse a string"),
    ("sql-lora", "Generate SQL: JOIN users and orders"),
]

results = batch_predict_multi_adapter(vertexai_endpoint, requests)

for model_name, response in results:
    print(f"\n{model_name}:")
    print(f"  {response}...")
    print("="*50)
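
Note that `as_completed` yields futures in completion order, so the results above may not line up with the order of the input list. If you need results in input order, one option is `executor.map`, which preserves it. The sketch below uses a stand-in `fake_predict` function (hypothetical, no network call) so that it runs without an endpoint; in practice you would pass a function that calls `raw_predict`.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_in_order(predict_fn, requests, max_workers=4):
    """Run predict_fn(model_name, prompt) in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the order of the inputs
        outputs = list(executor.map(lambda r: predict_fn(*r), requests))
    return [(name or "base", out) for (name, _), out in zip(requests, outputs)]


def fake_predict(model_name, prompt):
    """Stand-in for a real endpoint call, used only for illustration."""
    return f"{model_name or 'base'}: {prompt}"


sample_requests = [
    (None, "What is Python?"),
    ("sql-lora", "Generate SQL: SELECT all users"),
    ("magicoder-lora", "Write a function to reverse a string"),
]
ordered = predict_in_order(fake_predict, sample_requests)
for name, text in ordered:
    print(name, "->", text)
```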

### Pattern 2: Longer Generation with Streaming

Although Vertex AI supports the `stream_raw_predict` method, streaming responses require additional server-side configuration in the vLLM container. For production streaming needs, consider using the vLLM server's native OpenAI-compatible streaming endpoint directly, with appropriate authentication.
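
Whichever route you use to reach a streaming endpoint, an OpenAI-compatible server sends completions as server-sent events when the request sets `"stream": true`: each event is a `data: {...}` line, and the stream ends with `data: [DONE]`. The sketch below parses such lines into text pieces. It assumes the client hands you complete lines (real transports may split chunks differently), and `iter_stream_text` is a hypothetical helper.

```python
import json


def iter_stream_text(lines):
    """Yield text pieces from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # Skip blank keep-alive lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # End-of-stream sentinel
        chunk = json.loads(payload)
        yield chunk["choices"][0]["text"]


# Hand-written sample events, for illustration only
sample_events = [
    b'data: {"choices": [{"text": "SELECT"}]}',
    b"",
    b'data: {"choices": [{"text": " 1"}]}',
    b"data: [DONE]",
]
streamed_text = "".join(iter_stream_text(sample_events))
print(streamed_text)
```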

### Pattern 3: Dynamic Adapter Loading

Dynamic adapter loading at runtime is an advanced feature that requires:
1. Setting `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` in environment variables
2. Redeploying the model with this configuration
3. Direct access to vLLM's API endpoints (not available through Vertex AI raw_predict)

For this tutorial's setup, **adapters are pre-loaded at startup** via the `--lora-modules` flag. To add new adapters, you would:

```python
# Steps to add a new adapter (requires redeployment):

# 1. Upload new adapter to GCS
# !gcloud storage cp -r ./adapters/new-adapter gs://{BUCKET_NAME}/adapters/new-adapter

# 2. Update vLLM args to include the new adapter
new_vllm_args = [
    "python3", "-m", "vllm.entrypoints.openai.api_server",
    "--host=0.0.0.0", "--port=8080",
    f"--model={BUCKET_URI}/models/gemma-2-2b-it",
    "--max-model-len=2048",
    "--gpu-memory-utilization=0.9",
    "--enable-lora",
    "--max-loras=4",
    "--max-lora-rank=64",
    "--enable-prefix-caching",
    "--tensor-parallel-size=1",
    "--lora-modules",
    f"sql-lora={BUCKET_URI}/adapters/sql-lora",
    f"magicoder-lora={BUCKET_URI}/adapters/magicoder-lora",
    f"new-adapter={BUCKET_URI}/adapters/new-adapter",  # New adapter
]

# 3. Upload new model version with updated args
# 4. Redeploy to endpoint
```
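
Since every adapter in this tutorial lives under `{BUCKET_URI}/adapters/<name>`, the `--lora-modules` portion of the arguments can be generated from a list of adapter names instead of being edited by hand. This is a sketch under that layout assumption; `build_lora_module_args` is a hypothetical helper and `gs://my-bucket` is a placeholder.

```python
def build_lora_module_args(bucket_uri, adapter_names):
    """Return the --lora-modules flag followed by one name=gs://path per adapter."""
    args = ["--lora-modules"]
    args += [f"{name}={bucket_uri}/adapters/{name}" for name in adapter_names]
    return args


lora_args = build_lora_module_args(
    "gs://my-bucket",  # Placeholder; use BUCKET_URI in the notebook
    ["sql-lora", "magicoder-lora", "new-adapter"],
)
print(lora_args)
```

Appending `lora_args` to the rest of the server arguments gives the same list as `new_vllm_args` above, and adding another adapter becomes a one-line change before you upload and redeploy.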


## Cleaning up

To avoid extra charges, don't forget to delete your resources.


In [None]:
delete_endpoint = True
delete_model = True
delete_docker_repo = True
delete_bucket = True


if delete_endpoint and "vertexai_endpoint" in globals():
    vertexai_endpoint.undeploy_all()
    vertexai_endpoint.delete()
    print("Endpoint deleted.")


if delete_model and "vertexai_model" in globals():
    vertexai_model.delete()
    print("Model deleted.")

if delete_docker_repo:
    !gcloud artifacts repositories delete {DOCKER_REPOSITORY} \
        --location={LOCATION} \
        --quiet

if delete_bucket:
    # A recursive rm on the bucket URI removes its objects and the bucket itself
    !gcloud storage rm --recursive {BUCKET_URI}