In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI - Gemma distributed tuning with LoRA on TPUv5e, serving on L4 GPU

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/tpuv5e_gemma_peft_finetuning_and_serving.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab (requires the high-memory runtime available with Colab Pro)
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/tpuv5e_gemma_peft_finetuning_and_serving.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/notebooks/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/training/tpuv5e_gemma_peft_finetuning_and_serving.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
Open in Vertex AI Workbench
    </a> (An e2-standard-8 CPU w/ 250GB disk is recommended)
  </td>
</table>

## Overview

This notebook is based on the [LoRA tuning example on ai.google.dev](https://ai.google.dev/gemma/docs/distributed_tuning). It follows an existing [Model Garden example written for fine-tuning on GPUs](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_gemma_finetuning_on_vertex.ipynb), modified to use TPUv5e chips for training. It demonstrates fine-tuning and deploying Gemma models with a [Vertex AI Custom Training Job](https://cloud.google.com/vertex-ai/docs/training/create-custom-job), which allows a higher level of customization and control over the fine-tuning job. All of the examples in this notebook use parameter-efficient fine-tuning ([PEFT](https://github.com/huggingface/peft)) methods to reduce training and storage costs.

This notebook deploys the fine-tuned model on an L4 GPU with a [vLLM](https://github.com/vllm-project/vllm) Docker container.


### Objective

- Fine-tune and deploy Gemma models with a Vertex AI Custom Training Job.
- Send prediction requests to your fine-tuned Gemma model.


### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI (Training, TPUv5e, L4 GPU)
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

### Dataset

In this example, you use the IMDB reviews dataset from TensorFlow Datasets to fine-tune the model. For details of the dataset, see https://www.tensorflow.org/datasets/catalog/imdb_reviews

## Installation

Install the packages required to run this notebook.

The following cell installs the latest version of the Vertex AI SDK for Python (`google-cloud-aiplatform`), which supports TPUv5e.

In [None]:
import os

# (optional) update gcloud if needed
if os.getenv("IS_TESTING"):
    ! gcloud components update --quiet

! pip3 install --upgrade --quiet google-cloud-aiplatform

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

5. [Select or create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) for storing experiment outputs.

6. [Create a service account](https://cloud.google.com/iam/docs/service-accounts-create#iam-service-accounts-create-console) with the `Vertex AI User` and `Storage Object Admin` roles for deploying the fine-tuned model to a Vertex AI endpoint.

### Kaggle credentials
Gemma models are hosted by Kaggle. To use Gemma, request access on Kaggle:

* Sign in or register at [kaggle.com](https://www.kaggle.com)
* Open the [Gemma model card](https://www.kaggle.com/models/google/gemma) and select "Request Access"
* Complete the consent form and accept the terms and conditions

Then, to use the Kaggle API, create an API token:

* Open [Kaggle settings](https://www.kaggle.com/settings)
* Select "Create New Token"
* A `kaggle.json` file is downloaded. It contains your Kaggle credentials. Note the username and key; you will enter them later in this notebook

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

TPUv5e is available in the regions [listed here](https://cloud.google.com/tpu/pricing).

In [None]:
REGION = "us-west1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Import libraries

In [None]:
import os
from datetime import datetime, timedelta

from google.cloud import aiplatform

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}
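
Cloud Storage rejects bucket names that contain characters such as brackets, which is easy to hit if the `[your-project-id]` placeholder is left in place. The following is a minimal sketch that checks a URI against the basic naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit) before you try to create the bucket. The helper name `is_valid_bucket_uri` is hypothetical, and the check deliberately ignores further restrictions such as reserved prefixes.

```python
import re

# Basic Cloud Storage bucket naming rules. Additional restrictions
# (reserved prefixes, special handling of dotted names) are not covered.
_BUCKET_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def is_valid_bucket_uri(bucket_uri: str) -> bool:
    """Returns True if the URI looks like gs://<valid-bucket-name>."""
    if not bucket_uri.startswith("gs://"):
        return False
    # Keep only the bucket name, dropping any object path after it.
    bucket_name = bucket_uri[len("gs://"):].split("/", 1)[0]
    return bool(_BUCKET_NAME_PATTERN.match(bucket_name))


print(is_valid_bucket_uri("gs://my-bucket-my-project-unique"))  # True
print(is_valid_bucket_uri("gs://your-bucket-name-[your-project-id]-unique"))  # False
```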

#### Set folder paths for staging, environment, and model artifacts

In [None]:
STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")
MODEL_BUCKET = os.path.join(BUCKET_URI, "model")

# The service account used to deploy the fine-tuned model. Its email looks like:
# '<service-account-name>@<project-id>.iam.gserviceaccount.com'
# To create a service account with the `Vertex AI User` and
# `Storage Object Admin` roles, see
# https://cloud.google.com/iam/docs/service-accounts-create#iam-service-accounts-create-console
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

### Select the Gemma base model

In [None]:
# The Gemma base model.
base_model = "google/gemma-2b"  # @param ["google/gemma-2b", "google/gemma-2b-it", "google/gemma-7b", "google/gemma-7b-it"]

### Create the Artifact Registry repository and set the custom Docker image URI

In [None]:
REPOSITORY = "tpuv5e-training-repository-unique"

In [None]:
image_name_train = "gemma-lora-tuning-tpuv5e"
hostname = f"{REGION}-docker.pkg.dev"
tag = "latest"

In [None]:
# Register gcloud as a Docker credential helper
!gcloud auth configure-docker $REGION-docker.pkg.dev --quiet

In [None]:
# One time or use an existing repository
!gcloud artifacts repositories create $REPOSITORY --repository-format=docker \
--location=$REGION --description="Vertex TPUv5e training repository"

In [None]:
# Define container image name
KERAS_TRAIN_DOCKER_URI = (
    f"{hostname}/{PROJECT_ID}/{REPOSITORY}/{image_name_train}:{tag}"
)

# Set the docker image uri for the vLLM serving container
VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01"

# Set the docker image uri for the model conversion container that converts the fine-tuned model to HF format
KERAS_MODEL_CONVERSION_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/jax-keras-model-conversion:20240220_0936_RC01"

### Define common functions

In [None]:
def get_job_name_with_datetime(prefix: str) -> str:
    """Gets the job name with date time when triggering training or deployment
    jobs in Vertex AI.
    """
    return prefix + datetime.now().strftime("_%Y%m%d_%H%M%S")


def deploy_model_vllm(
    model_name: str,
    model_uri: str,
    service_account: str,
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    max_model_len: int = 8192,
    dtype: str = "bfloat16",
) -> tuple[aiplatform.Model, aiplatform.Endpoint]:
    # Upload the model to "Model Registry"
    job_name = get_job_name_with_datetime(model_name)
    vllm_args = [
        "--host=0.0.0.0",
        "--port=7080",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        "--gpu-memory-utilization=0.95",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        "--disable-log-stats",
    ]
    model = aiplatform.Model.upload(
        display_name=job_name,
        artifact_uri=model_uri,
        serving_container_image_uri=VLLM_DOCKER_URI,
        serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
        serving_container_args=vllm_args,
        serving_container_ports=[7080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
    )

    # Deploy the model to an endpoint to serve "Online predictions"
    endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        deploy_request_timeout=1800,
        service_account=service_account,
        sync=False,
    )

    return model, endpoint
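
To inspect the server flags before uploading a model, the argument list can be built by a small pure function. This sketch mirrors the flags used in `deploy_model_vllm`; `build_vllm_args` is a hypothetical helper and not part of the Vertex AI SDK or vLLM.

```python
def build_vllm_args(
    accelerator_count: int = 1,
    max_model_len: int = 8192,
    dtype: str = "bfloat16",
    port: int = 7080,
) -> list[str]:
    """Builds the vLLM server flags; tensor parallelism follows the GPU count."""
    if accelerator_count < 1:
        raise ValueError("accelerator_count must be at least 1")
    return [
        "--host=0.0.0.0",
        f"--port={port}",
        f"--tensor-parallel-size={accelerator_count}",
        "--swap-space=16",
        "--gpu-memory-utilization=0.95",
        f"--max-model-len={max_model_len}",
        f"--dtype={dtype}",
        "--disable-log-stats",
    ]


# Print the flags for a single-GPU deployment with a shorter context window.
for flag in build_vllm_args(accelerator_count=1, max_model_len=2048):
    print(flag)
```

Keeping the flags in one function makes it easy to check that `--tensor-parallel-size` and the deployed `accelerator_count` stay in sync.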

### Build the Docker container files

#### Create the trainer directory

In [None]:
import os

if not os.path.exists("trainer"):
    os.makedirs("trainer")

#### Kaggle credentials are required for KerasNLP training and Hex-LLM deployment with TPUs.
Set `KAGGLE_USERNAME` and `KAGGLE_KEY`; they are passed as environment variables to the Vertex AI training job.

Generate the Kaggle username and key by following [these instructions](https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials).

You also need to review and accept the model license, as mentioned earlier.

In [None]:
KAGGLE_USERNAME = "your-kaggle-username"  # @param {type:"string", isTemplate:true}
KAGGLE_KEY = "your-kaggle-key"  # @param {type:"string", isTemplate:true}
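
Instead of pasting the values by hand, you could read them from the downloaded `kaggle.json` file, which stores them under the `username` and `key` fields. The following is a minimal sketch; `load_kaggle_credentials` is a hypothetical helper, and the demonstration writes a throwaway file with placeholder values rather than reading a real token.

```python
import json
import os
import tempfile
from pathlib import Path


def load_kaggle_credentials(path: str) -> tuple[str, str]:
    """Reads the username and key from a downloaded kaggle.json file."""
    data = json.loads(Path(path).expanduser().read_text())
    return data["username"], data["key"]


# Demonstration with a temporary file; the values are placeholders.
with tempfile.TemporaryDirectory() as tmp_dir:
    token_path = os.path.join(tmp_dir, "kaggle.json")
    with open(token_path, "w") as f:
        json.dump({"username": "example-user", "key": "example-key"}, f)
    username, key = load_kaggle_credentials(token_path)
    print(username)
```

In this notebook you would assign the returned pair to `KAGGLE_USERNAME` and `KAGGLE_KEY`, pointing `path` at wherever you saved the file.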

#### Create the Dockerfile for the custom container. This will install JAX[TPU], Keras, and TensorFlow datasets

In [None]:
%%writefile trainer/Dockerfile
# This Dockerfile fine tunes the Gemma model using LoRA with JAX

FROM python:3.10

ENV DEBIAN_FRONTEND=noninteractive

# Install basic libs
RUN apt-get update && apt-get -y upgrade && apt-get install -y --no-install-recommends \
        cmake \
        curl \
        wget \
        sudo \
        gnupg \
        libsm6 \
        libxext6 \
        libxrender-dev \
        lsb-release \
        ca-certificates \
        build-essential \
        git \
        libgl1

# Copy Apache license.
RUN wget https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/LICENSE

# Install required libs
RUN pip install --upgrade pip
RUN pip install jax[tpu]==0.4.25 -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
RUN pip install tensorflow==2.15.0.post1
RUN pip install tensorflow-datasets==4.9.4
RUN pip install -q -U keras-nlp==0.8.2
RUN pip install keras==3.0.5

# Copy other licenses.
# Use raw URLs so that the license text, not the GitHub HTML page, is downloaded.
RUN wget -O MIT_LICENSE https://raw.githubusercontent.com/pytest-dev/pytest/main/LICENSE
RUN wget -O BSD_LICENSE https://raw.githubusercontent.com/pytorch/xla/master/LICENSE
RUN wget -O BSD-3_LICENSE https://raw.githubusercontent.com/pytorch/pytorch/main/LICENSE

ENV KERAS_BACKEND=jax
ENV XLA_PYTHON_CLIENT_MEM_FRACTION=0.9
ENV TPU_LIBRARY_PATH=/lib/libtpu.so

# Copy the installed libtpu.so to the TPU_LIBRARY_PATH location set above
RUN find /usr/local/lib -name 'libtpu.so' -exec cp {} /lib \;

WORKDIR /
COPY train.py train.py
ENV PYTHONPATH=./

ENTRYPOINT ["python", "train.py"]

#### Add the __init__.py file

In [None]:
!touch trainer/__init__.py

#### Add the train.py file
This code is adapted from the LoRA distributed fine-tuning example at https://ai.google.dev/gemma/docs/distributed_tuning

The IMDB reviews dataset from TensorFlow Datasets is used to fine-tune the Gemma model. Additional logic handles the TPU topology setting required by TPUv5e.
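
The topology handling amounts to splitting a string such as `2x4` into the two dimensions of the device mesh. The sketch below adds validation that the training script does not perform; `parse_tpu_topology` and `check_mesh_fits` are hypothetical helpers, and the requirement that the mesh size equals the number of available devices is an assumption about how the device mesh is built.

```python
import math


def parse_tpu_topology(topology: str) -> tuple[int, int]:
    """Parses a topology string such as "2x4" into two positive integers."""
    try:
        dims = tuple(int(part) for part in topology.lower().split("x"))
    except ValueError as err:
        raise ValueError(f"Invalid TPU topology: {topology!r}") from err
    if len(dims) != 2 or any(d < 1 for d in dims):
        raise ValueError(f"Invalid TPU topology: {topology!r}")
    return dims


def check_mesh_fits(dims: tuple[int, int], device_count: int) -> None:
    """Raises if the mesh shape does not use exactly the available devices."""
    if math.prod(dims) != device_count:
        raise ValueError(
            f"Mesh {dims} needs {math.prod(dims)} devices, found {device_count}"
        )


print(parse_tpu_topology("1x1"))  # (1, 1)
print(parse_tpu_topology("2x4"))  # (2, 4)
check_mesh_fits(parse_tpu_topology("2x4"), device_count=8)
```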


In [None]:
%%writefile trainer/train.py
import os
import argparse
import shutil
import locale

# Model saving variables
_ENCODING_FOR_MODEL_SAVING = "UTF-8"
_VOCABULARY_FILENAME = "vocabulary.spm"
_TOKENIZER_FILENAME = "tokenizer.model"

import keras
import keras_nlp
import tensorflow
import tensorflow_datasets as tfds
print (keras.__version__)
print (tensorflow.__version__)

parser = argparse.ArgumentParser()
parser.add_argument(
    "--tpu_topology",
    help="Topology to use for the TPUv5e (1x1, 1x4, 2x2, etc.)",
    type=str,
    required=True,
)
parser.add_argument(
    "--model_name",
    help="Kaggle model name (gemma_2b_en, gemma_7b_en)",
    type=str,
    required=True,
)
parser.add_argument(
    "--output_folder",
    type=str,
    required=True,
    help="Path to the output folder.",
)
parser.add_argument(
    "--checkpoint_filename",
    type=str,
    default="fine_tuned.weights.h5",
    help="Checkpoint filename.",
)
args = parser.parse_args()

def main():
    x = args.tpu_topology.split("x")
    tpu_topology_x = int(x[0])
    tpu_topology_y = int(x[1])
    print (f'TPU topology is ({tpu_topology_x}, {tpu_topology_y})')
    print (f'Model name is {args.model_name}')

    device_mesh = keras.distribution.DeviceMesh(
        (tpu_topology_x, tpu_topology_y),
        ["batch", "model"],
        devices=keras.distribution.list_devices())

    model_dim = "model"

    layout_map = keras.distribution.LayoutMap(device_mesh)
    # Weights that match 'token_embedding/embeddings' are sharded along the
    # "model" axis of the device mesh
    layout_map["token_embedding/embeddings"] = (None, model_dim)
    # Regex to match against the query, key and value matrices in the decoder
    # attention layers
    layout_map["decoder_block.*attention.*(query|key|value).*kernel"] = (
        None, model_dim, None)
    layout_map["decoder_block.*attention_output.*kernel"] = (
        None, None, model_dim)
    layout_map["decoder_block.*ffw_gating.*kernel"] = (model_dim, None)
    layout_map["decoder_block.*ffw_linear.*kernel"] = (None, model_dim)
    model_parallel = keras.distribution.ModelParallel(
        device_mesh, layout_map, batch_dim_name="batch")
    keras.distribution.set_distribution(model_parallel)
    gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(args.model_name)
    print (f'Running inference on the base {args.model_name} model')
    lm_output = gemma_lm.generate("Prompt: Return 3 things I ask for in this format. \
        Response: 1) item 1 2) item 2 3) item 3. \
        Prompt: List the 3 best comedy movies in the 90s Response: ", max_length=100)
    print (lm_output)

    # Start training
    imdb_train = tfds.load(
        "imdb_reviews",
        split="train",
        as_supervised=True,
        batch_size=2,
    )
    # Drop labels.
    imdb_train = imdb_train.map(lambda x, y: x)

    # Print one example to confirm that the dataset loaded correctly.
    print(imdb_train.unbatch().take(1).get_single_element().numpy())

    gemma_lm.backbone.enable_lora(rank=4)

    # Fine-tune on the IMDb movie reviews dataset.

    # Limit the input sequence length to 128 to control memory usage.
    gemma_lm.preprocessor.sequence_length = 128
    # Use AdamW (a common optimizer for transformer models).
    optimizer = keras.optimizers.AdamW(learning_rate=5e-5,weight_decay=0.01,)

    # Exclude layernorm and bias terms from decay.
    optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

    gemma_lm.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=optimizer,
        weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    gemma_lm.summary()
    gemma_lm.fit(imdb_train, epochs=1)

    print (f'Running inference on the fine-tuned {args.model_name} model')
    lm_output = gemma_lm.generate("Prompt: Return 3 things I ask for in this format. \
        Response: 1) item 1 2) item 2 3) item 3. \
        Prompt: List the 3 best comedy movies in the 90s Response: ", max_length=100)
    print (lm_output) 

    # Save checkpoint and tokenizer.
    print("Saving checkpoint and tokenizer.")
    if not os.path.exists(args.output_folder):
        os.makedirs(args.output_folder)
    locale.getpreferredencoding = lambda: _ENCODING_FOR_MODEL_SAVING
    gemma_lm.save_weights(
        os.path.join(args.output_folder, args.checkpoint_filename)
    )
    gemma_lm.preprocessor.tokenizer.save_assets(args.output_folder)

    # Copy and rename the tokenizer file.
    print("Copying tokenizer file.")
    shutil.copy(
        os.path.join(args.output_folder, _VOCABULARY_FILENAME),
        os.path.join(args.output_folder, _TOKENIZER_FILENAME),
    )
    print ('Exiting job')

if __name__ == "__main__":
    main()
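
The layout map in `train.py` uses regular expressions as keys, so one rule can cover the same weight in every decoder block. The sketch below illustrates that matching with plain `re.search`; it is not the Keras implementation, the variable paths are illustrative rather than read from a real model, and returning the first match (or `None` for unmatched weights) is an assumption made for this example.

```python
import re

# Same patterns as in train.py, with "model" as the sharded mesh axis.
LAYOUT_RULES = {
    "token_embedding/embeddings": (None, "model"),
    "decoder_block.*attention.*(query|key|value).*kernel": (None, "model", None),
    "decoder_block.*attention_output.*kernel": (None, None, "model"),
    "decoder_block.*ffw_gating.*kernel": ("model", None),
    "decoder_block.*ffw_linear.*kernel": (None, "model"),
}


def find_layout(variable_path: str):
    """Returns the layout of the first matching rule, or None if none match."""
    for pattern, layout in LAYOUT_RULES.items():
        if re.search(pattern, variable_path):
            return layout
    return None


# Illustrative variable paths (hypothetical names for this sketch).
for path in [
    "token_embedding/embeddings",
    "decoder_block_0/attention/query/kernel",
    "decoder_block_3/attention/attention_output/kernel",
    "decoder_block_0/ffw_gating/kernel",
    "decoder_block_0/pre_attention_norm/scale",
]:
    print(f"{path} -> {find_layout(path)}")
```

In this sketch, a weight that matches no rule gets `None`, which stands in for "not sharded along the model axis".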

## Fine-tune with Vertex AI Custom Training Jobs

This section demonstrates how to fine-tune and deploy Gemma models with PEFT LoRA on Vertex AI Custom Training Jobs. LoRA (Low-Rank Adaptation) is a PEFT (parameter-efficient fine-tuning) approach in which the pretrained model weights are frozen and only low-rank decomposition matrices, which represent the change in the model weights, are trained during fine-tuning. Read more about LoRA in the following publication: [Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W., 2021. Lora: Low-rank adaptation of large language models. *arXiv preprint arXiv:2106.09685*](https://arxiv.org/abs/2106.09685).
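
To see why LoRA reduces training cost, compare the trainable parameters for one weight matrix. Full fine-tuning updates all `d_in * d_out` entries, while a rank-`r` decomposition trains two matrices with `r * (d_in + d_out)` entries in total. The sketch below works through the arithmetic; the 2048 x 2048 shape is an illustrative assumption, not a dimension taken from Gemma.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Returns (full, lora) trainable parameter counts for one weight matrix."""
    full = d_in * d_out            # every entry of the weight matrix
    lora = rank * (d_in + d_out)   # two low-rank factors: d_in x r and r x d_out
    return full, lora


full, lora = lora_param_counts(d_in=2048, d_out=2048, rank=4)
print(f"Full fine-tuning: {full:,} parameters")
print(f"LoRA (rank 4):    {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```

With these illustrative numbers, the rank-4 adapter trains 16,384 parameters instead of 4,194,304, which is about 0.39% of the full matrix.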

#### Enable docker to run as a regular user

In [None]:
!sudo usermod -a -G docker ${USER}

#### Change to the trainer directory to build the docker container

In [None]:
%cd trainer

#### Build the custom docker container and push to artifact registry

In [None]:
!docker build -t $KERAS_TRAIN_DOCKER_URI -f Dockerfile .

In [None]:
!docker push $KERAS_TRAIN_DOCKER_URI

#### Change back to the parent directory

In [None]:
%cd ..

#### Set GCS folder locations and job configuration settings

In [None]:
# Create a GCS folder to store the merged model with the base model and the
# fine-tuned LoRA adapter.
merged_model_dir = get_job_name_with_datetime("gemma-lora-model-tpuv5")
merged_model_output_dir = os.path.join(MODEL_BUCKET, merged_model_dir)
merged_model_output_dir_gcsfuse = merged_model_output_dir.replace("gs://", "/gcs/")

# Set the checkpoint output filename
checkpoint_filename = "fine_tuned.weights.h5"

DISPLAY_NAME_PREFIX = "gemma-lora-train"  # @param {type:"string"}
tpuv5e_gemma_peft_job = {
    "display_name": get_job_name_with_datetime(DISPLAY_NAME_PREFIX),
    "job_spec": {
        "worker_pool_specs": [
            {
                "machine_spec": {
                    "machine_type": "ct5lp-hightpu-1t",
                    "tpu_topology": "1x1",
                },
                "replica_count": 1,
                "container_spec": {
                    "image_uri": KERAS_TRAIN_DOCKER_URI,
                    "args": [
                        "--tpu_topology=1x1",
                        "--model_name=gemma_2b_en",
                        f"--output_folder={merged_model_output_dir_gcsfuse}",
                        f"--checkpoint_filename={checkpoint_filename}",
                    ],
                    "env": [
                        {"name": "KAGGLE_USERNAME", "value": KAGGLE_USERNAME},
                        {"name": "KAGGLE_KEY", "value": KAGGLE_KEY},
                    ],
                },
            },
        ],
    },
}

tpuv5e_gemma_peft_job

#### Create job client and run job

In [None]:
job_client = aiplatform.gapic.JobServiceClient(
    client_options=dict(api_endpoint=f"{REGION}-aiplatform.googleapis.com")
)

In [None]:
create_tpuv5e_gemma_peft_job_response = job_client.create_custom_job(
    parent="projects/{project}/locations/{location}".format(
        project=PROJECT_ID, location=REGION
    ),
    custom_job=tpuv5e_gemma_peft_job,
)
print(create_tpuv5e_gemma_peft_job_response)

#### Check on job progress
This may take 20-60 minutes or more, depending on the model size. Rerun this cell to check progress.

In [None]:
get_tpuv5e_gemma_peft_job_response = job_client.get_custom_job(
    name=create_tpuv5e_gemma_peft_job_response.name
)
get_tpuv5e_gemma_peft_job_response
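
Rather than rerunning the cell by hand, you could poll until the job reaches a terminal state. The following is a minimal sketch that takes any callable returning the state name, so it runs here with a fake sequence of states. `wait_for_job` is a hypothetical helper, and the set of terminal states listed is an assumption that may not be exhaustive.

```python
import time
from typing import Callable

# Assumed terminal states for this sketch; other states may also be final.
TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
}


def wait_for_job(
    get_state: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 7200.0,
) -> str:
    """Polls get_state() until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        print(f"Job state: {state}")
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still in state {state} after timeout")
        time.sleep(poll_seconds)


# Demonstration with a fake sequence of states instead of a real job.
fake_states = iter(
    ["JOB_STATE_PENDING", "JOB_STATE_RUNNING", "JOB_STATE_SUCCEEDED"]
)
final_state = wait_for_job(lambda: next(fake_states), poll_seconds=0)
```

With the client from this notebook, `get_state` could be `lambda: job_client.get_custom_job(name=create_tpuv5e_gemma_peft_job_response.name).state.name`.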

#### Click on the console log url output from this cell to see your logs

In [None]:
job_id = create_tpuv5e_gemma_peft_job_response.name[
    create_tpuv5e_gemma_peft_job_response.name.rfind("/") + 1 :
]
startdate = datetime.today() - timedelta(days=1)
startdate = startdate.strftime("%Y-%m-%d")
print(
    f"https://console.cloud.google.com/logs/query;query=resource.labels.job_id=%22{job_id}%22%20timestamp%3E={startdate}"
)

### Convert the fine-tuned Keras checkpoint to HF format

#### Download the conversion script from KerasNLP tools
The GitHub repo is https://github.com/keras-team/keras-nlp

In [None]:
!wget -nv -nc https://raw.githubusercontent.com/keras-team/keras-nlp/master/tools/gemma/export_gemma_to_hf.py

#### Download the fine-tuned checkpoint files locally

In [None]:
!gcloud storage cp -r $merged_model_output_dir .

#### Install libraries for model conversion

In [None]:
!pip install torch==2.1
!pip install --upgrade keras-nlp
# Quote the requirement so that the shell does not treat ">" as a redirect.
!pip install --upgrade "keras>=3"
!pip install --upgrade accelerate sentencepiece transformers

#### Run the model conversion script

In [None]:
os.environ["KERAS_BACKEND"] = "torch"
os.environ["KAGGLE_USERNAME"] = KAGGLE_USERNAME
os.environ["KAGGLE_KEY"] = KAGGLE_KEY
# Must match the size of the fine-tuned base model ("2b" or "7b").
MODEL_SIZE = "2b"
!KERAS_BACKEND=torch python export_gemma_to_hf.py \
  --weights_file ./$merged_model_dir/fine_tuned.weights.h5 \
  --size $MODEL_SIZE \
  --vocab_path ./$merged_model_dir/vocabulary.spm \
  --output_dir ./$merged_model_dir/fine_tuned_gg_hf

#### Copy converted HF files to GCS

In [None]:
HUGGINGFACE_MODEL_DIR = os.path.join("./", merged_model_dir, "fine_tuned_gg_hf")
HUGGINGFACE_MODEL_DIR_GCS = os.path.join(merged_model_output_dir, "fine_tuned_gg_hf")
HUGGINGFACE_MODEL_DIR

In [None]:
!gcloud storage cp $HUGGINGFACE_MODEL_DIR/* $HUGGINGFACE_MODEL_DIR_GCS

### Deploy fine tuned models
This section uploads the model to Model Registry and deploys it to an endpoint using [vLLM](https://github.com/vllm-project/vllm).

The model deployment step takes 15 minutes to 1 hour to complete, depending on the model size.

In [None]:
MODEL_NAME_VLLM = get_job_name_with_datetime(prefix="gemma-vllm-serve")

# Start with a G2 Series cost-effective configuration
if MODEL_SIZE == "2b":
    machine_type = "g2-standard-8"
    accelerator_type = "NVIDIA_L4"
    accelerator_count = 1
elif MODEL_SIZE == "7b":
    machine_type = "g2-standard-12"
    accelerator_type = "NVIDIA_L4"
    accelerator_count = 1
else:
    raise ValueError(f"Unsupported MODEL_SIZE {MODEL_SIZE!r}; expected '2b' or '7b'.")

# See supported machine/GPU configurations in chosen region:
# https://cloud.google.com/vertex-ai/docs/predictions/configure-compute

# For even more performance, consider V100 and A100 GPUs
# > Nvidia Tesla V100
# machine_type = "n1-standard-8"
# accelerator_type = "NVIDIA_TESLA_V100"
# > Nvidia Tesla A100
# machine_type = "a2-highgpu-1g"
# accelerator_type = "NVIDIA_TESLA_A100"

# Larger `max_model_len` values will require more GPU memory
max_model_len = 2048

model, endpoint = deploy_model_vllm(
    MODEL_NAME_VLLM,
    HUGGINGFACE_MODEL_DIR_GCS,
    SERVICE_ACCOUNT,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    max_model_len=max_model_len,
)

#### Click on the console log url output from this cell to see your logs

In [None]:
startdate = datetime.today() - timedelta(days=1)
startdate = startdate.strftime("%Y-%m-%d")
log_link = "https://console.cloud.google.com/logs/query;query=resource.type=%22aiplatform.googleapis.com%2FEndpoint%22"
log_link += f"%20resource.labels.endpoint_id=%22{endpoint.name}%22"
log_link += f"%20resource.labels.location={REGION}"
log_link += f"%20timestamp%3E={startdate}"
print(log_link)

NOTE: The overall deployment can take 30-40 minutes or more. The deployment step itself takes about 15-20 minutes. After it succeeds, the serving container downloads the fine-tuned model from the GCS bucket used in training, which takes roughly another 15-20 minutes depending on the model size. Wait for this download to finish before you run the next step; otherwise you might see a `ServiceUnavailable: 503 502:Bad Gateway` error when you send requests to the endpoint.
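
If you prefer not to wait a fixed time, one option is to retry the first request with a growing delay until the endpoint is ready. The following is a minimal sketch of that idea; `call_with_retry` is a hypothetical helper, and the demonstration uses a fake function that fails twice instead of a real endpoint.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: tuple = (Exception,),
    attempts: int = 5,
    initial_delay: float = 30.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Calls fn(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as err:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            print(f"Attempt {attempt} failed ({err!r}); retrying in {delay}s")
            sleep(delay)
            delay *= backoff
    raise RuntimeError("unreachable")


# Demonstration: a fake request that fails twice before succeeding.
calls = {"count": 0}


def flaky_request() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("endpoint not ready")
    return "ok"


result = call_with_retry(
    flaky_request,
    retry_on=(ConnectionError,),
    initial_delay=0.0,
    sleep=lambda seconds: None,  # Skip real waiting in the demonstration.
)
print(result)
```

For the endpoint in this notebook, `fn` could wrap `endpoint.predict(...)`, and `retry_on` could be `(google.api_core.exceptions.ServiceUnavailable,)`, assuming that is the exception raised while the model is still loading.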

### Send a prediction request

Once deployment succeeds, you can send requests to the endpoint with text prompts. This section reuses the example prompt from earlier in the notebook.

Example:

```
Prompt: Return 3 things I ask for in this format and do not repeat my prompt. Response: 1) item 1 2) item 2 3) item 3. List the 3 best comedy movies in the 90s Response:
Response:  1) The Cable Guy 2) Scooby-Doo 3) Beethoven Requirements
```

In [None]:
PROMPT = "Prompt: Return 3 things I ask for in this format and do not repeat my prompt. \
Response: 1) item 1 2) item 2 3) item 3. \
Prompt: List the 3 best comedy movies in the 90s Response: "

# Each instance is one request: the prompt and its sampling parameters
# belong in the same dictionary.
instances = [
    {
        "prompt": PROMPT,
        "max_tokens": 500,
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": 1,
    },
]

response = endpoint.predict(instances=instances)

for prediction in response.predictions:
    print(prediction)
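
To send several prompts in one call, build one dictionary per prompt so that each instance carries its own sampling parameters. The sketch below uses the same field names as the request in this section; `build_instance` is a hypothetical helper, and the checks it applies are assumptions about sensible values rather than rules enforced by the server.

```python
def build_instance(
    prompt: str,
    max_tokens: int = 500,
    temperature: float = 1.0,
    top_p: float = 1.0,
    top_k: int = 1,
) -> dict:
    """Builds one prediction instance holding a prompt and its parameters."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return {
        "prompt": prompt,
        "max_tokens": int(max_tokens),
        "temperature": float(temperature),
        "top_p": float(top_p),
        "top_k": int(top_k),
    }


# Two example prompts, each becoming a separate instance.
prompts = [
    "Prompt: List the 3 best comedy movies in the 90s Response: ",
    "Prompt: List the 3 best drama movies in the 90s Response: ",
]
batch = [build_instance(p, max_tokens=100) for p in prompts]
print(len(batch), sorted(batch[0].keys()))
```

The resulting list could then be passed as `instances` to `endpoint.predict`.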

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
# Delete the train job.
job_client.delete_custom_job(name=create_tpuv5e_gemma_peft_job_response.name)

# Undeploy model and delete endpoint.
endpoint.delete(force=True)

# Delete models.
model.delete()

import os

# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI