In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Finetune Gemma using KerasNLP and deploy to Vertex AI

<table align="left">
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_gemma_kerasnlp_to_vertexai.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_gemma_kerasnlp_to_vertexai.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>


> This notebook was tested in the following environment:
>
> - Python 3.10
> - Colab Enterprise with a `g2-standard-8` runtime:
>   - 32 GB of system RAM
>   - 24 GB of GPU RAM (NVIDIA L4)


## Overview


Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.

This notebook demonstrates loading, finetuning, converting, and deploying Gemma to Vertex AI.


### Objective

- Load Gemma using KerasNLP
- Finetune Gemma using KerasNLP
- Convert Gemma to Hugging Face Transformers
- Deploy Gemma to Vertex AI


### Costs

This tutorial uses billable components of Google Cloud:

- Vertex AI
- Cloud Storage

Learn about [Vertex AI](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage](https://cloud.google.com/storage/pricing) pricing,
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.


## Installation


Install the following packages required to execute this notebook:


In [None]:
# Keras & KerasNLP
# Install Keras 3 last, see https://keras.io/getting_started
%pip install --upgrade --quiet keras-nlp
%pip install --upgrade --quiet keras

# Hugging Face Transformers
%pip install --upgrade --quiet accelerate sentencepiece transformers

# Vertex AI SDK
%pip install --upgrade --quiet google-cloud-aiplatform

## Before you begin


### Kaggle credentials


Gemma models are hosted by Kaggle. To use Gemma, request access on Kaggle:

- Sign in or register at [kaggle.com](https://www.kaggle.com)
- Open the [Gemma model card](https://www.kaggle.com/models/google/gemma) and select _"Request Access"_
- Complete the consent form and accept the terms and conditions

Then, to use the Kaggle API, create an API token:

- Open the [Kaggle settings](https://www.kaggle.com/settings)
- Select _"Create New Token"_
- A `kaggle.json` file is downloaded. It contains your Kaggle credentials

Run the following cell and enter your Kaggle credentials.


In [None]:
import kagglehub

kagglehub.login()

> Note: If `kagglehub.login()` doesn't work for you, set the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables instead.
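
A minimal sketch of the environment-variable approach follows. It reads the downloaded `kaggle.json` file and exports its `username` and `key` fields. The helper name `load_kaggle_credentials` and the file location are assumptions made for illustration; adjust the path to wherever you saved the token.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(token_path: str) -> None:
    """Export the credentials stored in a kaggle.json token file.

    Hypothetical helper (not part of kagglehub). The token file is a JSON
    object with "username" and "key" fields.
    """
    credentials = json.loads(Path(token_path).read_text(encoding="utf-8"))
    os.environ["KAGGLE_USERNAME"] = credentials["username"]
    os.environ["KAGGLE_KEY"] = credentials["key"]


# Example (assumed location of the downloaded token):
# load_kaggle_credentials("kaggle.json")
```

Setting the variables this way only affects the current Python process, so run it before loading the model.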


### Google Cloud setup


1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).


### Google Cloud authentication


If you run this notebook from Colab Enterprise, the Cloud SDK, code, and other libraries already run using your Google Cloud account.

Check your active account:


In [None]:
!gcloud config get core/account

If your account is not defined, you need to authenticate:


In [None]:
# Authenticate the Cloud SDK with your credentials
# !gcloud auth login

# Authenticate code and libraries with your credentials
# !gcloud auth application-default login

### Google Cloud project


If you run this notebook in Colab Enterprise, the default project is automatically defined:


In [None]:
res = !gcloud config get core/project
PROJECT_ID = res[0]

print(f"{PROJECT_ID=}")

Otherwise, list your projects and define the default project manually:


In [None]:
# List your projects
# !gcloud projects list

# Define the default project
# PROJECT_ID = ""  # @param {type:"string"}
# !gcloud config set core/project $PROJECT_ID

### Vertex AI region


Define your default Vertex AI region. See available [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).


In [None]:
REGION = "us-central1"  # @param {type: "string"}

!gcloud config set ai/region $REGION

> Note: This notebook deploys a Gemma model to a single region. In production, you can deploy to multiple regions, to serve your worldwide users with optimal latencies.


### Cloud Storage bucket


Create a storage bucket (or use an existing one) to store artifacts such as model weights or datasets.


In [None]:
# Define a bucket related to your project
BUCKET_URI = f"gs://gemma-{PROJECT_ID}-unique"
# Or use an existing one
# BUCKET_URI = "gs://"  # @param {type:"string"}

res = !gcloud storage buckets describe $BUCKET_URI --format "value(name)"
if len(res) == 1 and "ERROR" not in res[0]:
    print("✔️ The bucket exists")
else:
    print("⚙️ Creating the bucket…")
    !gcloud storage buckets create $BUCKET_URI --project $PROJECT_ID --location $REGION

### Service account


When deploying Gemma to a Vertex AI endpoint, the model service will require a service account with "Storage Object Admin" and "Vertex AI User" roles.

Create the service account (or use an existing one):


In [None]:
# Create the service account for the Vertex AI endpoint
SERVICE_ACCOUNT_NAME = "gemma-vertexai"
SERVICE_ACCOUNT_DISPLAY_NAME = "Gemma Vertex AI endpoint"
SERVICE_ACCOUNT = f"{SERVICE_ACCOUNT_NAME}@{PROJECT_ID}.iam.gserviceaccount.com"
# Or use an existing one
# SERVICE_ACCOUNT = ""  # @param {type:"string"}
assert SERVICE_ACCOUNT.endswith(f"@{PROJECT_ID}.iam.gserviceaccount.com")

res = !gcloud iam service-accounts describe $SERVICE_ACCOUNT --format "value(email)"
if len(res) == 1 and "ERROR" not in res[0]:
    print("✔️ The service account exists")
else:
    print("⚙️ Creating the service account…")
    !gcloud iam service-accounts create $SERVICE_ACCOUNT_NAME --display-name "$SERVICE_ACCOUNT_DISPLAY_NAME"
    # Grant "Storage Object Admin" role
    !gcloud projects add-iam-policy-binding $PROJECT_ID --member "serviceAccount:$SERVICE_ACCOUNT" --role "roles/storage.objectAdmin"
    # Grant "Vertex AI User" role
    !gcloud projects add-iam-policy-binding $PROJECT_ID --member "serviceAccount:$SERVICE_ACCOUNT" --role "roles/aiplatform.user"

### Dependencies


In [None]:
import datetime
import json
import locale

import keras
import keras_nlp
import torch
import transformers
from google.cloud import aiplatform
from numba import cuda

### Model constants

Gemma models are available in several sizes and variants. This notebook uses the `gemma_2b_en` version, which has lower resource requirements. To learn more about Gemma, see the [Gemma Model Garden card](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma).

Define the model and related constants:


In [None]:
MODEL_NAME = "gemma_2b_en"
# MODEL_NAME = "gemma_instruct_2b_en"
# MODEL_NAME = "gemma_7b_en"
# MODEL_NAME = "gemma_instruct_7b_en"

# Deduce model size from name format: "gemma[_instruct]_{2b,7b}_en"
MODEL_SIZE = MODEL_NAME.split("_")[-2]
assert MODEL_SIZE in ("2b", "7b")

# Dataset
DATASET_NAME = "databricks-dolly-15k"
DATASET_PATH = f"{DATASET_NAME}.jsonl"
DATASET_URL = f"https://huggingface.co/datasets/databricks/{DATASET_NAME}/resolve/main/{DATASET_PATH}"

# Finetuned model
FINETUNED_MODEL_DIR = f"./{MODEL_NAME}_{DATASET_NAME}"
FINETUNED_WEIGHTS_PATH = f"{FINETUNED_MODEL_DIR}/model.weights.h5"
FINETUNED_VOCAB_PATH = f"{FINETUNED_MODEL_DIR}/vocabulary.spm"

# Converted model
HUGGINGFACE_MODEL_DIR = f"./{MODEL_NAME}_huggingface"

# Deployed model
DEPLOYED_MODEL_URI = f"{BUCKET_URI}/{MODEL_NAME}"

### Dataset

To finetune Gemma, this notebook uses the [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) instruction-following dataset.

Download the dataset:


In [None]:
!wget -nv -nc -O $DATASET_PATH $DATASET_URL

## Load Gemma

In this step, you will configure Keras precision settings and load Gemma with KerasNLP.


### Keras precision settings

When training on NVIDIA GPUs, mixed precision (`keras.mixed_precision.set_global_policy("mixed_bfloat16")`) can be used to speed up training with minimal effect on training quality. In most cases, it is recommended to turn on mixed precision as it saves both memory and time. However, be aware that at small batch sizes, it can inflate memory usage by 1.5x (weights will be loaded twice, at half precision and full precision).

For inference, half precision (`keras.config.set_floatx("bfloat16")`) works and saves memory. Mixed precision does not apply to inference.
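
To make the memory trade-off concrete, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes roughly 2.5 billion parameters for `gemma_2b_en`, 2 bytes per `bfloat16` value, and 4 bytes per `float32` value. Activations, optimizer state, and framework overhead are ignored, so real usage will be higher.

```python
GIB = 1024**3


def weights_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights only, in GiB."""
    return num_params * bytes_per_param / GIB


# Approximate parameter count (an assumption, rounded)
NUM_PARAMS = 2_500_000_000

half_gib = weights_memory_gib(NUM_PARAMS, 2)  # bfloat16 copy
full_gib = weights_memory_gib(NUM_PARAMS, 4)  # float32 copy
mixed_gib = half_gib + full_gib  # both copies kept in memory

print(f"Half precision : {half_gib:.1f} GiB")
print(f"Full precision : {full_gib:.1f} GiB")
print(f"Mixed precision: {mixed_gib:.1f} GiB ({mixed_gib / full_gib:.1f}x full precision)")
```

The last line shows where the 1.5x figure comes from: keeping a 2-byte and a 4-byte copy of each weight costs 6 bytes instead of 4.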

Configure your precision settings:


In [None]:
# Run inferences at half precision
keras.config.set_floatx("bfloat16")

# Train at mixed precision (enable for large batch sizes)
# keras.mixed_precision.set_global_policy("mixed_bfloat16")

### Model summary

Load the Gemma model using the `GemmaCausalLM.from_preset()` method:


In [None]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(MODEL_NAME)

Downloading from https://www.kaggle.com/api/v1/models/keras/gemma/keras/gemma_2b_en/1/download/config.json...
100%|██████████| 555/555 [00:00<00:00, 634kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gemma/keras/gemma_2b_en/1/download/model.weights.h5...
100%|██████████| 4.67G/4.67G [02:28<00:00, 33.7MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gemma/keras/gemma_2b_en/1/download/tokenizer.json...
100%|██████████| 401/401 [00:00<00:00, 554kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gemma/keras/gemma_2b_en/1/download/assets/tokenizer/vocabulary.spm...
100%|██████████| 4.04M/4.04M [00:00<00:00, 5.27MB/s]


Display the model summary:


In [None]:
gemma_lm.summary()

### Test examples

Define test examples and functions that will be used to test models before and after finetuning:


In [None]:
TEST_EXAMPLES = [
    "What are good activities for a toddler?",
    "What can we hope to see after rain and sun?",
    "What's the most famous painting by Monet?",
    "Who engineered the Statue of Liberty?",
    'Who were "The Lumières"?',
]

# Prompt template for the training data and the finetuning tests
PROMPT_TEMPLATE = "Instruction:\n{instruction}\n\nResponse:\n{response}"

TEST_PROMPTS = [
    PROMPT_TEMPLATE.format(instruction=example, response="")
    for example in TEST_EXAMPLES
]

### Samplers

You can control how tokens are generated for `GemmaCausalLM` by calling the `compile()` method with the `sampler` parameter.

For example:

- `greedy`: picks the next token with the largest probability
- `top_k`: randomly picks the next token from the tokens of top K probability

To get deterministic outputs in this notebook, make sure you're using the `greedy` sampler:


In [None]:
gemma_lm.compile(sampler="greedy")

To learn more about available samplers, see [Samplers](https://keras.io/api/keras_nlp/samplers).
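
The difference between the two samplers can be sketched in plain Python, independently of KerasNLP. The toy distribution below is made up, and the sketch assumes that top-k sampling draws among the K most probable tokens in proportion to their probabilities; it illustrates the idea and is not the library's implementation.

```python
import random


def greedy_pick(probabilities: list[float]) -> int:
    """Return the index of the most probable token."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def top_k_pick(probabilities: list[float], k: int, rng: random.Random) -> int:
    """Sample a token index among the k most probable tokens."""
    ranked = sorted(
        range(len(probabilities)), key=probabilities.__getitem__, reverse=True
    )
    candidates = ranked[:k]
    weights = [probabilities[i] for i in candidates]
    # random.choices treats the weights as relative, so no renormalization is needed
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy next-token distribution over a 5-token vocabulary (made-up values)
toy_probabilities = [0.05, 0.40, 0.10, 0.30, 0.15]

print(greedy_pick(toy_probabilities))  # always index 1

rng = random.Random(0)
print([top_k_pick(toy_probabilities, k=2, rng=rng) for _ in range(5)])  # only 1 or 3
```

Greedy selection returns the same token for the same input every time, which is why it gives reproducible outputs.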


### Inference before finetuning

Check how the model responds to the test examples:


In [None]:
for test_example in TEST_EXAMPLES:
    response = gemma_lm.generate(test_example, max_length=48)
    output = response[len(test_example) :]
    print(f"{test_example}\n{output!r}\n")

What are good activities for a toddler?
'\n\nWhat are the best activities for a toddler?\n\nWhat are the best activities for a toddler?\n\nWhat are the best activities for a toddler?\n\nWhat are the best activities for a toddler'

What can we hope to see after rain and sun?
'\n\nThe answer is: a lot.\n\nThe rain and sun are the two most important elements in the world of photography.\n\nThe rain is the most important element because it creates'

What's the most famous painting by Monet?
"\n\nWhat's the most famous painting by Van Gogh?\n\nWhat's the most famous painting by Picasso?\n\nWhat's the most famous painting by Dali?\n\nWhat'"

Who engineered the Statue of Liberty?
'\n\nA. George Washington\nB. Napoleon Bonaparte\nC. Robert Fulton\nD. Gustave Eiffel\n\nIn the following sentence, underline the correct modifier from the pair given in parentheses. Example 1'

Who were "The Lumières"?
' What did they invent?\n\nIn the following sentence, underline the correct modifier from the pair

A pretrained model can generate text that deviates from the output you are expecting. Here are some examples:

- The output doesn't follow your output requirements.
- The output is too generic or not consistent enough.
- The output is factually incorrect or outdated.
- The output must be aligned with your specific safety policies.

More specific inputs (prompt engineering) can fix some of these issues, at the expense of more complex and longer prompts. If the expected output is not part of the model training data, LLMs generate plausible text anyway and produce what is sometimes called hallucinations.

You can finetune the model to improve its performance while keeping prompts simple.


## Finetune Gemma

Finetune your Gemma model to improve its performance in the specific task of answering questions more consistently and more factually.


### Training data

Generate the training examples using the dataset:


In [None]:
def generate_training_data(training_ratio: int = 100) -> list[str]:
    assert 0 < training_ratio <= 100
    data = []
    with open(DATASET_PATH, encoding="utf-8") as file:
        for line in file:
            features = json.loads(line)
            # Skip examples with context, for simplicity
            if features["context"]:
                continue
            data.append(PROMPT_TEMPLATE.format(**features))
    total_data_count = len(data)
    training_data_count = total_data_count * training_ratio // 100
    print(f"Training examples: {training_data_count}/{total_data_count}")

    return data[:training_data_count]


# Limit to 10% for test purposes
training_data = generate_training_data(training_ratio=10)

Training examples: 1054/10544


### Low-Rank Adaptation (LoRA)

[Low Rank Adaptation](https://arxiv.org/abs/2106.09685) (LoRA) is a finetuning technique which greatly reduces the number of trainable parameters for downstream tasks by freezing the full weights of the model and inserting a smaller number of new trainable weights into the model. This technique makes training much faster and more memory-efficient.

Enable LoRA for the model and set the LoRA rank to 4:


In [None]:
gemma_lm.backbone.enable_lora(rank=4)

Check that the number of trainable parameters is significantly reduced:


In [None]:
gemma_lm.summary()

The number of trainable parameters decreased from 2.5B to 1.4M (roughly 1,800x fewer), making it possible to finetune the model with reasonable GPU memory requirements.
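
The reduction follows from simple arithmetic. In the standard LoRA formulation, a frozen weight matrix of shape `d_in x d_out` gets two small trainable matrices of shapes `d_in x rank` and `rank x d_out`. The sketch below uses illustrative dimensions, not Gemma's actual layer shapes, and KerasNLP's exact LoRA layout may differ, so it does not reproduce the 1.4M figure.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Number of weights in a full (frozen) projection matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Number of trainable weights added by LoRA: A (d_in x rank) + B (rank x d_out)."""
    return rank * (d_in + d_out)


# Illustrative dimensions (assumptions, not read from the model)
D_IN, D_OUT, RANK = 2048, 2048, 4

frozen = dense_params(D_IN, D_OUT)
trainable = lora_params(D_IN, D_OUT, RANK)
print(f"Frozen: {frozen:,} / Trainable: {trainable:,} ({frozen // trainable}x fewer)")

# Ratio between the rounded counts reported by the model summaries
print(f"Overall ratio: {2.5e9 / 1.4e6:,.0f}x")
```

Because the added weights grow linearly with the rank, doubling the rank doubles the number of trainable parameters.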


### Finetuning

Finetune the model with the training data. This step can take a couple of minutes:


In [None]:
def finetune_gemma(model: keras_nlp.models.GemmaCausalLM, data: list[str]):
    # Reduce the input sequence length to limit memory usage
    model.preprocessor.sequence_length = 128

    # Use AdamW (a common optimizer for transformer models)
    optimizer = keras.optimizers.AdamW(
        learning_rate=5e-5,
        weight_decay=0.01,
    )

    # Exclude layernorm and bias terms from decay
    optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

    model.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=optimizer,
        weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
        sampler="greedy",
    )
    model.fit(data, epochs=1, batch_size=1)


finetune_gemma(gemma_lm, training_data)

1054/1054 ━━━━━━━━━━━━━━━━━━━━ 134s 77ms/step - loss: 19.3561 - sparse_categorical_accuracy: 0.5872


### Inference after finetuning

Test the finetuned model:


In [None]:
for prompt in TEST_PROMPTS:
    output = gemma_lm.generate(prompt, max_length=30)
    print(f"{output}\n{'- '*40}")

Instruction:
What are good activities for a toddler?

Response:
The best activities for a toddler are those that are fun and engaging.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
What can we hope to see after rain and sun?

Response:
After rain and sun, we can see the rainbow.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
What's the most famous painting by Monet?

Response:
The most famous painting by Monet is "Impression, Sunrise".
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
Who engineered the Statue of Liberty?

Response:
The Statue of Liberty was designed by a French sculptor, Frederic Auguste Bartholdi
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
Who were "The Lumières"?

Response:
The Lumières were the inventors of the first motion picture camera. They were
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

You should observe that outputs are now structured, more consistent, and more factual.


## Convert Gemma to Hugging Face Transformers

In the next step, the model will be deployed to Vertex AI, served by a [vLLM](https://docs.vllm.ai) container image. vLLM is an optimized LLM serving library which supports Hugging Face [Transformers](https://huggingface.co/docs/transformers). To be loaded by the vLLM service, the finetuned model needs to be converted to the Hugging Face architecture. KerasNLP provides a conversion script for this.


### Checkpoint

Save the finetuned model assets:


In [None]:
# Make sure the directory exists
%mkdir -p $FINETUNED_MODEL_DIR

gemma_lm.save_weights(FINETUNED_WEIGHTS_PATH)

gemma_lm.preprocessor.tokenizer.save_assets(FINETUNED_MODEL_DIR)

List the checkpoint files:


In [None]:
!du -shc $FINETUNED_MODEL_DIR/*

4.7G	./gemma_2b_en_databricks-dolly-15k/model.weights.h5
4.1M	./gemma_2b_en_databricks-dolly-15k/vocabulary.spm
4.7G	total


Release the resources to make sure the GPU is available for the next steps:


In [None]:
del gemma_lm

device = cuda.get_current_device()
cuda.select_device(device.id)
cuda.close()

### Model conversion

Run the KerasNLP conversion script:


In [None]:
# Download the conversion script from KerasNLP tools
!wget -nv -nc https://raw.githubusercontent.com/keras-team/keras-nlp/master/tools/gemma/export_gemma_to_hf.py

# Run the conversion script
# Note: it uses the PyTorch backend of Keras (hence the KERAS_BACKEND env variable)
!KERAS_BACKEND=torch python export_gemma_to_hf.py \
    --weights_file $FINETUNED_WEIGHTS_PATH \
    --size $MODEL_SIZE \
    --vocab_path $FINETUNED_VOCAB_PATH \
    --output_dir $HUGGINGFACE_MODEL_DIR

### Inference with Transformers

Before deploying the converted model, test it using the `transformers` library.

Load the model and the tokenizer:


In [None]:
model = transformers.GemmaForCausalLM.from_pretrained(
    HUGGINGFACE_MODEL_DIR,
    local_files_only=True,
    device_map="auto",  # Library "accelerate" to auto-select GPU
)
tokenizer = transformers.GemmaTokenizer.from_pretrained(
    HUGGINGFACE_MODEL_DIR,
    local_files_only=True,
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Test the model:


In [None]:
def test_transformers_model(
    model: transformers.GemmaForCausalLM,
    tokenizer: transformers.GemmaTokenizer,
) -> None:
    for prompt in TEST_PROMPTS:
        inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_length=30)

        output = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"{output}\n{'- '*40}")


test_transformers_model(model, tokenizer)

Instruction:
What are good activities for a toddler?

Response:
Toddlers are very active and curious. They love to explore and learn
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
What can we hope to see after rain and sun?

Response:
After rain and sun, we can see the rainbow.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
What's the most famous painting by Monet?

Response:
The most famous painting by Monet is "Impression, Sunrise".
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
Who engineered the Statue of Liberty?

Response:
The Statue of Liberty was designed by a French sculptor, Frederic Auguste Bartholdi
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
Who were "The Lumières"?

Response:
The Lumières were the inventors of the first motion picture camera. They were
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

Release the resources:


In [None]:
# Release resources
del model, tokenizer

# Free GPU RAM
torch.cuda.empty_cache()

# Restore the default encoding (current issue with the transformers library)
locale.getpreferredencoding = lambda: "UTF-8"

You're ready to deploy your finetuned model to Vertex AI!


## Deploy Gemma to Vertex AI


### Vertex AI initialization

Initialize Vertex AI:


In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Model upload

Upload the model to the Cloud Storage bucket:


In [None]:
!gcloud storage rsync --recursive --verbosity error $HUGGINGFACE_MODEL_DIR $DEPLOYED_MODEL_URI

Check the bucket content:


In [None]:
!gcloud storage du $DEPLOYED_MODEL_URI --readable-sizes

### Helper functions

Define helper functions to deploy the model with a vLLM container:


In [None]:
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01"


def get_job_name_with_datetime(prefix: str) -> str:
    suffix = datetime.datetime.now().strftime("_%Y%m%d_%H%M%S")
    return f"{prefix}{suffix}"


def deploy_model_vllm(
    model_name: str,
    model_uri: str,
    service_account: str,
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    max_model_len: int = 8192,
    dtype: str = "bfloat16",
) -> tuple[aiplatform.Model, aiplatform.Endpoint]:
    # Upload the model to "Model Registry"
    job_name = get_job_name_with_datetime(model_name)
    vllm_args = [
        "--host=0.0.0.0",
        "--port=7080",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        "--gpu-memory-utilization=0.95",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        "--disable-log-stats",
    ]
    model = aiplatform.Model.upload(
        display_name=job_name,
        artifact_uri=model_uri,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
        serving_container_args=vllm_args,
        serving_container_ports=[7080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
    )

    # Deploy the model to an endpoint to serve "Online predictions"
    endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
    )

    return model, endpoint

### Model deployment

Deploy the model. This step can take 10+ minutes.


In [None]:
MODEL_NAME_VLLM = f"{MODEL_NAME}-vllm"

# Start with a G2 Series cost-effective configuration
match MODEL_SIZE:
    case "2b":
        machine_type = "g2-standard-8"
        accelerator_type = "NVIDIA_L4"
        accelerator_count = 1
    case "7b":
        machine_type = "g2-standard-12"
        accelerator_type = "NVIDIA_L4"
        accelerator_count = 1
    case _:
        raise ValueError(f"Unsupported model size: {MODEL_SIZE}")

# See supported machine/GPU configurations in chosen region:
# https://cloud.google.com/vertex-ai/docs/predictions/configure-compute

# For even more performance, consider V100 and A100 GPUs
# > Nvidia Tesla V100
# machine_type = "n1-standard-8"
# accelerator_type = "NVIDIA_TESLA_V100"
# > Nvidia Tesla A100
# machine_type = "a2-highgpu-1g"
# accelerator_type = "NVIDIA_TESLA_A100"

# Larger `max_model_len` values will require more GPU memory
max_model_len = 2048

model, endpoint = deploy_model_vllm(
    MODEL_NAME_VLLM,
    DEPLOYED_MODEL_URI,
    SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    max_model_len=max_model_len,
)

### Online inference

The model is deployed! Test the endpoint:


In [None]:
def test_vertexai_endpoint(endpoint: aiplatform.Endpoint):
    for question, prompt in zip(TEST_EXAMPLES, TEST_PROMPTS):
        instance = {
            "prompt": prompt,
            "max_tokens": 10,
            "temperature": 0.0,
            "top_p": 1.0,
            "top_k": 1,
            "raw_response": True,
        }
        response = endpoint.predict(instances=[instance])
        output = response.predictions[0]
        print(f"{question}\n{output}\n{'- '*40}")


test_vertexai_endpoint(endpoint)

What are good activities for a toddler?
The best activities for a toddler are those that are
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
What can we hope to see after rain and sun?
After rain and sun, we can see the rainbow
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
What's the most famous painting by Monet?
The most famous painting by Monet is "Impression,
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Who engineered the Statue of Liberty?
The Statue of Liberty was designed by a French sculptor
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Who were "The Lumières"?
The Lumières were the inventors of the first motion
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 


> See [vLLM `SamplingParams`](https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py) for more details about the sampling parameters supported by vLLM.
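
If you want to experiment with other sampling settings, a small helper can keep the request payloads consistent. The sketch below is a hypothetical helper, not part of the Vertex AI SDK or vLLM. Its field names mirror the request used above, and values other than the greedy defaults have not been verified against this container image.

```python
def build_instance(
    prompt: str,
    max_tokens: int = 10,
    temperature: float = 0.0,
    top_p: float = 1.0,
    top_k: int = 1,
) -> dict:
    """Build one prediction instance; the defaults reproduce greedy decoding."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": True,
    }


# Greedy request, equivalent to the instances sent above
greedy_instance = build_instance("Instruction:\nWho painted Water Lilies?\n\nResponse:\n")

# Sampled request (assumed values, for experimentation)
sampled_instance = build_instance("Tell me a story.", max_tokens=64, temperature=0.8, top_k=40)
```

Each dictionary can be passed to `endpoint.predict(instances=[...])` in the same way as in the test function.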


## Clean up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) used for the tutorial.

Otherwise, you can delete the individual resources created in this tutorial:


In [None]:
delete_model = False
delete_objects = False
delete_bucket = False

if delete_model:
    endpoint.delete(force=True)
    model.delete()
if delete_objects:
    !gcloud storage rm --recursive $BUCKET_URI/**
if delete_bucket:
    !gcloud storage buckets delete $BUCKET_URI

## What's next

- Explore the [Vertex AI Model Garden](https://console.cloud.google.com/vertex-ai/model-garden)
- See also how to [Serve Gemma open models using GPUs on GKE with vLLM](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-vllm)
- Learn more about [KerasNLP](https://keras.io/keras_nlp)
- Learn more about [vLLM](https://github.com/vllm-project/vllm)
